From hnguyen at linux.vnet.ibm.com Tue Sep 4 16:44:35 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Wed, 5 Sep 2007 01:44:35 +0200
Subject: [ofa-general] [PATCH] libibverbs: increment comp_events_completed only if channel is set
Message-ID: <200709050144.36141.hnguyen@linux.vnet.ibm.com>

Hello Roland!
I created this patch against your libibverbs git, stable branch.
Regards
Nam

Increment the counter comp_events_completed only if a channel is set.
This prevents the while loop below in ibv_cmd_destroy_cq() from hanging
if a consumer calls ibv_ack_cq_events() without an assigned channel.

int ibv_cmd_destroy_cq(struct ibv_cq *cq)
{
...
	pthread_mutex_lock(&cq->mutex);
	while (cq->comp_events_completed != resp.comp_events_reported ||
	       cq->async_events_completed != resp.async_events_reported)

Signed-off-by: Hoang-Nam Nguyen
---
 src/verbs.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/src/verbs.c b/src/verbs.c
index f5cf4d3..3460844 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -344,6 +344,8 @@ default_symver(__ibv_get_cq_event, ibv_get_cq_event);
 
 void __ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents)
 {
+	if (!cq->channel)
+		return;
 	pthread_mutex_lock(&cq->mutex);
 	cq->comp_events_completed += nevents;
 	pthread_cond_signal(&cq->cond);
-- 
1.5.2

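[Editorial sketch, not part of the original posting.] The counter mismatch that the libibverbs patch from Hoang-Nam Nguyen earlier in this archive guards against can be modelled in a few lines of C. Everything here is a hypothetical stand-in: struct cq_model, ack_cq_events(), ack_cq_events_unguarded() and destroy_would_wait() are illustration names, not libibverbs API, and the locking and condition variable are left out. The sketch also assumes, as the patch description implies, that the kernel reports zero completion events for a CQ that has no channel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct ibv_cq fields that matter here. */
struct cq_model {
    void *channel;                      /* NULL when no completion channel is attached */
    unsigned int comp_events_completed; /* incremented by the ack path */
};

/* Ack path with the guard from the patch: ignore acks on a CQ without a channel. */
static void ack_cq_events(struct cq_model *cq, unsigned int nevents)
{
    if (!cq->channel)
        return;
    cq->comp_events_completed += nevents;
}

/* Ack path without the guard, modelling the behaviour before the patch. */
static void ack_cq_events_unguarded(struct cq_model *cq, unsigned int nevents)
{
    cq->comp_events_completed += nevents;
}

/*
 * Models the loop condition in ibv_cmd_destroy_cq(): destroy keeps waiting
 * while the local counter differs from the count reported by the kernel.
 */
static int destroy_would_wait(const struct cq_model *cq, unsigned int reported)
{
    return cq->comp_events_completed != reported;
}
```

With the unguarded ack, a stray ack on a channel-less CQ leaves the local counter at 1 while the reported count stays at 0, so the wait condition never clears. With the guard, the counter is untouched and destroy proceeds.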
From vlad at lists.openfabrics.org Sat Sep 1 02:47:53 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 1 Sep 2007 02:47:53 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20070901-0200 daily build status
Message-ID: <20070901094753.611FFE60821@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.22
Passed on ia64 with linux-2.6.22
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on powerpc with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.19_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.18_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.17
Log:
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.17_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.16
Log:
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070901-0200_linux-2.6.16_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From sashak at voltaire.com Sat Sep 1 03:26:01 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 1 Sep 2007 13:26:01 +0300
Subject: [ofa-general] Re: [PATCH] osm: QoS parser - fixing yacc command
In-Reply-To: <20070831155529.2bf8d902.weiny2@llnl.gov>
References: <46D6E4C3.80201@dev.mellanox.co.il> <20070831121013.GF11549@sashak.voltaire.com> <46D88E4F.4020603@dev.mellanox.co.il> <20070831155529.2bf8d902.weiny2@llnl.gov>
Message-ID: <20070901102601.GM11549@sashak.voltaire.com>

Hi Ira,

On 15:55 Fri 31 Aug , Ira Weiny wrote:
> We just ran into a problem with this patch applied.
>
> It seems that the output file is not y.tab.h but osm_qos_parser_y.h so should
> the move be:

	mv -f osm_qos_parser_y.h $(srcdir)/../include/opensm/osm_qos_parser_y.h

I applied this fix for now.

> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
> 15:51:42 > ls *.h
> ls: *.h: No such file or directory
>
> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
> 15:52:02 > bison -d -o ./osm_qos_parser_y.c -p__qos_parser_ ./osm_qos_parser.y
>
> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
> 15:52:21 > ls *.h
> osm_qos_parser_y.h
>
> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
> 15:53:17 > bison --version
> bison (GNU Bison) 1.875c

I have bison-2.3 and see similar results. With yacc-1.9.1 this line doesn't
work at all. The only "compatible" rules I found are:

	yacc -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y

, or

	bison -y -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y

, and then I get osm_qos_parser.tab.h and osm_qos_parser.tab.c files in
the current directory.

Yevgeny! Could this be useful?

Sasha

From or.gerlitz at gmail.com Sat Sep 1 13:19:04 2007
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Sat, 1 Sep 2007 23:19:04 +0300
Subject: [ofa-general] [PATCH V4 0/10] net/bonding: ADD IPoIB support for the bonding driver
In-Reply-To: <46C9B474.5020202@voltaire.com>
References: <46C9B474.5020202@voltaire.com>
Message-ID: <15ddcffd0709011319n3d458a9cm3c97344807a72c43@mail.gmail.com>

On 8/20/07, Moni Shoua wrote:
> This patch series is the fourth version (see below link to V3) of the
> suggested changes to the bonding driver so it would be able to support
> non ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode.
>
> The motivation is to enable the bonding driver on its HA mode to work with
> the IP over Infiniband (IPoIB) driver. With these patches I was able to enslave
> IPoIB netdevices and run TCP, UDP, IP (UDP) Multicast and ICMP traffic with
> fail-over and fail-back working fine. The working environment was the net-2.6 git.
>
> This series also includes patches to the IPoIB driver that fix some
> neighboring related issues.
>
> Major changes from the previous version:
>
> 1) Addressing the issue of safety when unloading the IPoIB module before
>    the bonding module
> 2) style changes
>
>
> Links to earlier discussion:
>
> 1. A discussion in netdev about bonding support for IPoIB.
> http://lists.openwall.net/netdev/2006/11/30/46
>
> 2. A discussion in openfabrics regarding changes in the IPoIB that
>    enable using it as a slave for bonding.
> http://lists.openfabrics.org/pipermail/general/2007-July/038914.html

Roland, Jay, Dave,

These patches have been around for about a year now, and V4 takes care of
the open issues raised during the review process. We are aiming for 2.6.24
and the merge window is getting close; can you give your take on the series?

thanks,

Or.

From swelch at systemfabricworks.com Sat Sep 1 15:03:10 2007
From: swelch at systemfabricworks.com (Steve Welch)
Date: Sat, 1 Sep 2007 17:03:10 -0500
Subject: [ofa-general] [PATCH] infiniband/hw/mthca: Add optional router mode initialization
In-Reply-To: <20070831170932.GE4472@obsidianresearch.com>
References: <46D82B52.mailO2X1DVFVD@systemfabricworks.com> <20070831170932.GE4472@obsidianresearch.com>
Message-ID: <001201c7ece3$e2b9b0a0$a865a8c0@catcher>

Jason,

To date this code has been used to initialize Infinihost III Lx and Ex
flavors running in memfree and Tavor compatibility modes, with current
firmware and some older firmware revisions. Testing has been driven mostly
by what was available to me and has not been exhaustive.

Steve

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe at obsidianresearch.com]
> Sent: Friday, August 31, 2007 12:10 PM
> To: swelch at systemfabricworks.com
> Cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] [PATCH] infiniband/hw/mthca: Add optional
> router mode initialization
>
> On Fri, Aug 31, 2007 at 09:53:06AM -0500, swelch at systemfabricworks.com wrote:
> > This patch allows for the kernel mthca driver to optionally initialize the
> > mthca devices in router mode. Router mode is enabled at module load with
> > the setting of the module parm "router_mode=1". This setting acts on the
>
> What MT firmware versions/devices is this compatible with?
>
> Thanks,
> Jason

From kliteyn at dev.mellanox.co.il Sat Sep 1 15:34:35 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 02 Sep 2007 01:34:35 +0300
Subject: [ofa-general] Re: [PATCH] osm: QoS parser - fixing yacc command
In-Reply-To: <20070901102601.GM11549@sashak.voltaire.com>
References: <46D6E4C3.80201@dev.mellanox.co.il> <20070831121013.GF11549@sashak.voltaire.com> <46D88E4F.4020603@dev.mellanox.co.il> <20070831155529.2bf8d902.weiny2@llnl.gov> <20070901102601.GM11549@sashak.voltaire.com>
Message-ID: <46D9E8FB.50507@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> Hi Ira,
>
> On 15:55 Fri 31 Aug , Ira Weiny wrote:
>> We just ran into a problem with this patch applied.
>>
>> It seems that the output file is not y.tab.h but osm_qos_parser_y.h so should
>> the move be:
>
> 	mv -f osm_qos_parser_y.h $(srcdir)/../include/opensm/osm_qos_parser_y.h
>
> I applied this fix for now.
>
>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
>> 15:51:42 > ls *.h
>> ls: *.h: No such file or directory
>>
>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
>> 15:52:02 > bison -d -o ./osm_qos_parser_y.c -p__qos_parser_ ./osm_qos_parser.y
>>
>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
>> 15:52:21 > ls *.h
>> osm_qos_parser_y.h
>>
>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
>> 15:53:17 > bison --version
>> bison (GNU Bison) 1.875c
>
> I have bison-2.3 and see similar results. With yacc-1.9.1 this line doesn't
> work at all. The only "compatible" rules I found are:
>
> 	yacc -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y
>
> , or
>
> 	bison -y -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y
>
> , and then I get osm_qos_parser.tab.h and osm_qos_parser.tab.c files in
> the current directory.
>
> Yevgeny! Could this be useful?

How about dropping all these yacc/bison/lex/flex/version-dependent commands
and going back to something like what I submitted in the original patch:

osm_qos_parser_y.c: $(srcdir)/osm_qos_parser.y $(srcdir)/../include/opensm/osm_qos_policy.h
	$(YACC) -d $(srcdir)/osm_qos_parser.y
	mv -f y.tab.c $(srcdir)/osm_qos_parser_y.c
	mv -f y.tab.h $(srcdir)/../include/opensm/osm_qos_parser_y.h

osm_qos_parser_l.c: $(srcdir)/osm_qos_parser.l $(srcdir)/../include/opensm/osm_qos_policy.h
	$(LEX) $(srcdir)/osm_qos_parser.l
	mv -f lex.yy.c $(srcdir)/osm_qos_parser_l.c

And if we're really worried about prefixes, we can add them too:

	$(YACC) -d -p__qos_parser_ $(srcdir)/osm_qos_parser.y
and
	$(LEX) -P__qos_parser_ $(srcdir)/osm_qos_parser.l

-- Yevgeny

> Sasha
>

From sashak at voltaire.com Sat Sep 1 16:24:24 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Sep 2007 02:24:24 +0300
Subject: [ofa-general] ib_umad method mask problems on big-endian 64-bit archs
In-Reply-To:
References: <20070822190519.GD1397@sashak.voltaire.com> <20070830143547.GM7140@sashak.voltaire.com>
Message-ID: <20070901232424.GA16108@sashak.voltaire.com>

On 14:20 Fri 31 Aug , Roland Dreier wrote:
> > Do we have any other user_umad users (OFA or other known ones, where
> > a switch could be painful)? If not, I would prefer this way instead of
> > keeping two OpenSMs for ppc64.
>
> I agree. After thinking it over, the least painful thing to do is to
> update the "documentation" (ib_user_mad.h) to match what the kernel does
> now for 64-bit big endian systems, and fix the case of 32-bit big
> endian userspace on a 64-bit kernel.
>
> I'll post a patch for 2.6.24.

Thanks. I will fix the libibumad header and related code then.

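[Editorial sketch, not part of the original thread.] The method mask problem discussed above comes down to two views of the same 128 bits. The sketch assumes, which the thread does not spell out, that the header describes the mask as four 32-bit words while the kernel treats the same memory as a bitmap of unsigned long. Under that assumption the two views agree on little-endian and on 32-bit systems, but on a 64-bit big-endian system each pair of 32-bit words is effectively swapped. The names bit_pos, pos_as_u32_array() and pos_as_be64_longs() are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Position of one method bit: which 32-bit word in memory, and which bit in it. */
struct bit_pos {
    unsigned int word; /* index of the 32-bit word, in memory order */
    unsigned int bit;  /* bit number inside that word (value 1u << bit) */
};

/* View 1: the mask is an array of 32-bit words. */
static struct bit_pos pos_as_u32_array(unsigned int method)
{
    struct bit_pos p = { method / 32, method % 32 };
    return p;
}

/*
 * View 2: the same memory is a bitmap of 64-bit unsigned longs on a
 * big-endian machine. The high half of each long is stored first, so bits
 * 0..31 of a long live in the second 32-bit word of the pair.
 */
static struct bit_pos pos_as_be64_longs(unsigned int method)
{
    unsigned int long_idx = method / 64;
    unsigned int bit_in_long = method % 64;
    struct bit_pos p;

    p.word = 2 * long_idx + (bit_in_long < 32 ? 1 : 0);
    p.bit = bit_in_long % 32;
    return p;
}
```

For method 1, the first view lands in word 0 and the second in word 1; for method 33 it is the other way around. A userspace program that fills the mask one way is therefore read the other way by such a kernel.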
Sasha

From sashak at voltaire.com Sat Sep 1 16:38:27 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Sep 2007 02:38:27 +0300
Subject: [ofa-general] Re: [PATCH] osm: QoS parser - fixing yacc command
In-Reply-To: <46D88E4F.4020603@dev.mellanox.co.il>
References: <46D6E4C3.80201@dev.mellanox.co.il> <20070831121013.GF11549@sashak.voltaire.com> <46D88E4F.4020603@dev.mellanox.co.il>
Message-ID: <20070901233827.GB16108@sashak.voltaire.com>

On 00:55 Sat 01 Sep , Yevgeny Kliteynik wrote:
> Sasha Khapyorsky wrote:
> > On 18:39 Thu 30 Aug , Yevgeny Kliteynik wrote:
> >> Fixing bison command to more general yacc syntax
> >>
> >> Signed-off-by: Yevgeny Kliteynik
> > Applied. Thanks.
> >> ---
> >>  opensm/opensm/Makefile.am |    3 ++-
> >>  1 files changed, 2 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am
> >> index abfa913..4ab7227 100644
> >> --- a/opensm/opensm/Makefile.am
> >> +++ b/opensm/opensm/Makefile.am
> >> @@ -60,7 +60,8 @@ opensm_SOURCES = main.c osm_console.c osm_db_files.c \
> >>  	osm_qos_parser_y.c osm_qos_parser_l.c osm_qos_policy.c
> >>
> >>  osm_qos_parser_y.c: $(srcdir)/osm_qos_parser.y
> >> $(srcdir)/../include/opensm/osm_qos_policy.h
> >> -	$(YACC) -y -d $(srcdir)/osm_qos_parser.y -o $(srcdir)/osm_qos_parser_y.c
> >> --defines=$(srcdir)/../include/opensm/osm_qos_parser_y.h
> >> --name-prefix=__qos_parser_
> >> +	$(YACC) -d -o $(srcdir)/osm_qos_parser_y.c -p__qos_parser_
> >> $(srcdir)/osm_qos_parser.y
> >> +	mv -f y.tab.h $(srcdir)/../include/opensm/osm_qos_parser_y.h
> > BTW if the osm_qos_parser_y.h file is a generated one and used only by
> > generated *.c files, wouldn't it be simpler just to leave it in the
> > current directory?
>
> Perhaps, but then it would be the only header file in opensm
> source that's not located under the include/ directory.

This generated file is used internally by the generated parser only; I think
it is fine to keep them together.

> Is it the only header that isn't included by other headers in osm?

Not by headers, by source files.

Sasha

From sashak at voltaire.com Sat Sep 1 16:40:49 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Sep 2007 02:40:49 +0300
Subject: [ofa-general] Re: [PATCH] osm: QoS parser - fixing yacc command
In-Reply-To: <46D9E8FB.50507@dev.mellanox.co.il>
References: <46D6E4C3.80201@dev.mellanox.co.il> <20070831121013.GF11549@sashak.voltaire.com> <46D88E4F.4020603@dev.mellanox.co.il> <20070831155529.2bf8d902.weiny2@llnl.gov> <20070901102601.GM11549@sashak.voltaire.com> <46D9E8FB.50507@dev.mellanox.co.il>
Message-ID: <20070901234049.GC16108@sashak.voltaire.com>

On 01:34 Sun 02 Sep , Yevgeny Kliteynik wrote:
> Sasha Khapyorsky wrote:
> > Hi Ira,
> > On 15:55 Fri 31 Aug , Ira Weiny wrote:
> >> We just ran into a problem with this patch applied.
> >>
> >> It seems that the output file is not y.tab.h but osm_qos_parser_y.h so
> >> should
> >> the move be: mv -f osm_qos_parser_y.h
> >> $(srcdir)/../include/opensm/osm_qos_parser_y.h
> > I applied this fix for now.
> >> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
> >> 15:51:42 > ls *.h
> >> ls: *.h: No such file or directory
> >>
> >> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
> >> 15:52:02 > bison -d -o ./osm_qos_parser_y.c -p__qos_parser_
> >> ./osm_qos_parser.y
> >>
> >> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
> >> 15:52:21 > ls *.h
> >> osm_qos_parser_y.h
> >>
> >> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm
> >> 15:53:17 > bison --version
> >> bison (GNU Bison) 1.875c
> > I have bison-2.3 and see similar results. With yacc-1.9.1 this line doesn't
> > work at all. The only "compatible" rules I found are:
> > yacc -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y
> > , or
> > bison -y -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y
> > , and then I get osm_qos_parser.tab.h and osm_qos_parser.tab.c files in
> > the current directory.
> > Yevgeny! Could this be useful?
>
> How about dropping all these yacc/bison/lex/flex/version-dependent commands

Is this

	$(YACC) -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y

yacc/bison/lex/flex/version dependent?

Sasha

> and going back to something like what I submitted in the original patch:
>
> osm_qos_parser_y.c: $(srcdir)/osm_qos_parser.y
> $(srcdir)/../include/opensm/osm_qos_policy.h
> 	$(YACC) -d $(srcdir)/osm_qos_parser.y
> 	mv -f y.tab.c $(srcdir)/osm_qos_parser_y.c
> 	mv -f y.tab.h $(srcdir)/../include/opensm/osm_qos_parser_y.h
>
> osm_qos_parser_l.c: $(srcdir)/osm_qos_parser.l
> $(srcdir)/../include/opensm/osm_qos_policy.h
> 	$(LEX) $(srcdir)/osm_qos_parser.l
> 	mv -f lex.yy.c $(srcdir)/osm_qos_parser_l.c
>
> And if we're really worried about prefixes, we can add them too:
>
> 	$(YACC) -d -p__qos_parser_ $(srcdir)/osm_qos_parser.y
> and
> 	$(LEX) -P__qos_parser_ $(srcdir)/osm_qos_parser.l
>
>
> -- Yevgeny
>
> > Sasha
>

From sashak at voltaire.com Sat Sep 1 16:53:30 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Sep 2007 02:53:30 +0300
Subject: [ofa-general] [PATCHv2] opensm: set hop limit when creating ipoib multicast groups
In-Reply-To: <20070831161359.GA9728@obsidianresearch.com>
References: <20070830154812.GB5617@obsidianresearch.com> <20070830164453.GA5680@obsidianresearch.com> <20070830230834.GA5756@obsidianresearch.com> <20070831135552.GL11549@sashak.voltaire.com> <20070831161359.GA9728@obsidianresearch.com>
Message-ID: <20070901235330.GE16108@sashak.voltaire.com>

On 10:13 Fri 31 Aug , Rolf Manderscheid wrote:
> Hi Sasha,
>
> This patch sets the hop limit for the multicast groups for ipoib according
> to the scope so that ipoib works over
> multiple IB subnets.
>
> Signed-off-by: Rolf Manderscheid

Applied. Thanks.

Sasha

From sashak at voltaire.com Sat Sep 1 16:54:04 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Sep 2007 02:54:04 +0300
Subject: [ofa-general] Re: [PATCH] opensm/opensm/Makefile.am : add osm_event_plugin.h to the installed
In-Reply-To: <20070831150923.6b0660b3.weiny2@llnl.gov>
References: <20070831150923.6b0660b3.weiny2@llnl.gov>
Message-ID: <20070901235404.GF16108@sashak.voltaire.com>

On 15:09 Fri 31 Aug , Ira Weiny wrote:
> From 12f3c744a609916ce5c52885db63558773223062 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny
> Date: Thu, 30 Aug 2007 14:32:29 -0700
> Subject: [PATCH] opensm/opensm/Makefile.am : add osm_event_plugin.h to the installed headers to
>
> be able to compile external plugins
>
> Signed-off-by: Ira K. Weiny

Applied. Thanks.

Sasha

From vlad at lists.openfabrics.org Sun Sep 2 02:47:47 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 2 Sep 2007 02:47:47 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20070902-0200 daily build status
Message-ID: <20070902094747.CAF4CE6082C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.22
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on powerpc with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.19_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.18_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.16
Log:
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.16_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.17
Log:
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070902-0200_linux-2.6.17_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17'
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From monisonlists at gmail.com Sun Sep 2 04:32:53 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Sun, 02 Sep 2007 14:32:53 +0300 Subject: [ofa-general] Re: [PATCH V4 10/10] net/bonding: Destroy bonding master when last slave is gone In-Reply-To: <16908.1188417014@death> References: <46C9B474.5020202@voltaire.com> <46C9BA2D.7060204@voltaire.com> <3403.1188343986@death> <46D57D5D.3060706@gmail.com> <16908.1188417014@death> Message-ID: <46DA9F65.2020201@gmail.com> Jay Vosburgh wrote: > Moni Shoua wrote: > >> Jay Vosburgh wrote: >>> Moni Shoua wrote: >>> >>>> When bonding enslaves non Ethernet devices it takes pointers to functions >>>> in the module that owns the slaves. In this case it becomes unsafe >>>> to keep the bonding master registered after last slave was unenslaved >>>> because we don't know if the pointers are still valid. Destroying the bond when slave_cnt is zero >>>> ensures that these functions be used anymore. >>> Would it not be simpler to run the bonding master through >>> ether_setup() again when the final slave is released (to reset all of >>> the pointers to their "ethernet" values)? I'm presuming here the >>> pointers of questionable validity are the ones set in the >>> bond_setup_by_slave() copied from the slave_dev->hard_header, et al. >>> >>> Having the bonding master disappear (but only sometimes) after >>> the last slave is removed is a semantic change I'd rather not introduce >>> if it's not necessary. >> Thanks for the comments. >> >> Having the master disappear is one way I could think of to solve the problem of leaving >> the bonding module with pointers to illegal addresses. >> The other way is to increase the usage count, with try_module_get(), of the module which owns of the slave. >> To do that I have to restore the field owner in structure net_device (it was removed in 2.6). 
>
> What I was asking above is really whether or not it's feasible
> to simply reset the affected pointers back to the "ethernet" values from
> ether_setup(). I would think this should return the bonding master back
> to the original state it started in before any slaves were added.
> Unless I'm missing something; I'm willing to believe there's some
> IB-specific tidbit I'm unaware of that makes this more complicated than
> it seems.

It's possible to reset the bonding master by calling ether_setup, but with
one exception: its neighbors. When enslaving IPoIB devices, the bonding
master neighbors point to a destructor function in the ib_ipoib module.
When ib_ipoib goes down, the neighbors of the bonding master still exist,
and when their turn comes to die they will try to access this function and
the kernel will crash. This is why I want to destroy the bonding master
before ib_ipoib is unloaded (to kill its neighbors).

For any other issue (i.e. a taken pointer), ether_setup would solve the
problem.

>
> This presumes that I'm correct in thinking that the pointers
> you're talking about (as being unsafe after removal of last slave) are
> the ones copied in your new function bond_setup_by_slave().
>
> I don't think it's desirable to acquire a reference to the slave
> driver module.
>
> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com
>

From ogerlitz at voltaire.com Sun Sep 2 05:04:12 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 2 Sep 2007 15:04:12 +0300 (IDT)
Subject: [ofa-general] [PATCH RFC] IB/ipoib: enable IGMP for userpsace multicast IB apps
Message-ID:

The kernel IB stack allows (through the RDMA CM) user space multicast
applications to interoperate with IP based apps optionally running on a
different IP subnet. To support this inter-op for the case where the
receiving party resides on the IB side, IGMP (reports/queries) must be
handled; otherwise the local IP router would not forward this multicast
traffic.

This patch does a lookup on the database used for multicast reference
counting and enhances IPoIB to ignore a multicast group that is already
handled by user space, all under a per-device policy flag. That is, when
the policy flag allows it, IPoIB will not join/attach its QP to a multicast
group that has an entry in the database.

The default value is "disallowed"; the flag can be read and changed
through /sys/class/net/$dev/umcast.

Signed-off-by: Or Gerlitz

Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-09 02:32:17.000000000 +0300
+++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-09-02 13:46:29.000000000 +0300
@@ -783,6 +783,7 @@ void ipoib_mcast_restart_task(struct wor
 	struct ipoib_mcast *mcast, *tmcast;
 	LIST_HEAD(remove_list);
 	unsigned long flags;
+	struct ib_sa_mcmember_rec rec;
 
 	ipoib_dbg_mcast(priv, "restarting multicast task\n");
 
@@ -816,6 +817,15 @@ void ipoib_mcast_restart_task(struct wor
 		if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
 			struct ipoib_mcast *nmcast;
 
+			/* ignore group which is directly joined by user space */
+			if (test_bit(IPOIB_FLAG_ADMIN_ALLOW_UMCAST, &priv->flags) &&
+			    !ib_sa_get_mcmember_rec(priv->ca, priv->port, &mgid, &rec))
+			{
+				ipoib_dbg_mcast(priv, "ignoring multicast entry for mgid "
+						IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid));
+				continue;
+			}
+
 			/* Not found or send-only group, let's add a new entry */
 			ipoib_dbg_mcast(priv, "adding multicast entry for mgid "
 					IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid));
Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-09 02:32:17.000000000 +0300
+++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-02 13:46:29.000000000 +0300
@@ -86,6 +86,7 @@ enum {
 	IPOIB_MCAST_STARTED = 8,
 	IPOIB_FLAG_NETIF_STOPPED = 9,
 	IPOIB_FLAG_ADMIN_CM = 10,
+	IPOIB_FLAG_ADMIN_ALLOW_UMCAST = 11,
 
 	IPOIB_MAX_BACKOFF_SECONDS = 16,
 
@@ -364,6 +365,7 @@ static inline void ipoib_put_ah(struct i
 
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
+int ipoib_add_umcast_attr(struct net_device *dev);
 
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		struct ipoib_ah *address, u32 qpn);
Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-09 02:32:17.000000000 +0300
+++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-02 14:01:33.000000000 +0300
@@ -1017,6 +1017,44 @@ static ssize_t show_pkey(struct device *
 }
 static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL);
 
+static ssize_t show_umcast(struct device *dev,
+			   struct device_attribute *attr, char *buf)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev));
+
+	if (test_bit(IPOIB_FLAG_ADMIN_ALLOW_UMCAST, &priv->flags))
+		return sprintf(buf, "allowed\n");
+	else
+		return sprintf(buf, "disallowed\n");
+}
+
+static ssize_t set_umcast(struct device *dev,
+			  struct device_attribute *attr,
+			  const char *buf, size_t count)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev));
+
+	if (!strcmp(buf, "allowed\n")) {
+		set_bit(IPOIB_FLAG_ADMIN_ALLOW_UMCAST, &priv->flags);
+		ipoib_warn(priv, "ignoring multicast groups joined directly "
+				"by user space\n");
+		return count;
+	}
+
+	if (!strcmp(buf, "disallowed\n")) {
+		clear_bit(IPOIB_FLAG_ADMIN_ALLOW_UMCAST, &priv->flags);
+		return count;
+	}
+
+	return -EINVAL;
+}
+static DEVICE_ATTR(umcast, S_IWUSR | S_IRUGO, show_umcast, set_umcast);
+
+int ipoib_add_umcast_attr(struct net_device *dev)
+{
+	return device_create_file(&dev->dev, &dev_attr_umcast);
+}
+
 static ssize_t create_child(struct device *dev,
 			    struct device_attribute *attr,
 			    const char *buf, size_t count)
@@ -1134,6 +1172,8 @@ static struct net_device *ipoib_add_port
 		goto sysfs_failed;
 	if (ipoib_add_pkey_attr(priv->dev))
 		goto sysfs_failed;
+	if (ipoib_add_umcast_attr(priv->dev))
+		goto sysfs_failed;
 	if (device_create_file(&priv->dev->dev, &dev_attr_create_child))
 		goto sysfs_failed;
 	if (device_create_file(&priv->dev->dev, &dev_attr_delete_child))
Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
===================================================================
--- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_vlan.c	2007-07-09 02:32:17.000000000 +0300
+++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_vlan.c	2007-09-02 13:46:29.000000000 +0300
@@ -119,6 +119,8 @@ int ipoib_vlan_add(struct net_device *pd
 		goto sysfs_failed;
 	if (ipoib_add_pkey_attr(priv->dev))
 		goto sysfs_failed;
+	if (ipoib_add_umcast_attr(priv->dev))
+		goto sysfs_failed;
 
 	if (device_create_file(&priv->dev->dev, &dev_attr_parent))
 		goto sysfs_failed;

From eitan at mellanox.co.il Sun Sep 2 07:02:25 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 2 Sep 2007 17:02:25 +0300
Subject: [ofa-general] [opensm] bugs in build system
Message-ID: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com>

Hi Sasha,

For some reason OpenSM (and the required management libs) do not build
correctly when I use manual autogen.sh, configure --prefix=/tmp/ez/usr ;
make; make install mode. It seems the build system is probably broken as it
relies on fixed paths?

Here is what I do, errors are included in this list:

OK   1. git clone ....

--------------- LIBIBCOMMON ------------------
OK   2. cd management/libibcommon; autogen.sh; ./configure --prefix=/tmp/ez/usr ; make ; make install

--------------- LIBIBUMAD ------------------
OK   3. cd management/libibumad; autogen.sh;
FAIL 4. ./configure --prefix=/tmp/ez/usr
     checking for sys_read_string in -libcommon... no
     configure: error: sys_read_string() not found. libibumad requires libibcommon.
     To overcome this I manually added the --disable-libcheck
     ./configure --prefix=/tmp/ez/usr --disable-libcheck
     I do not understand why after installing the common lib I still get this error?
     Shouldn't the search path include the /lib ???
FAIL 5. make
     Make fails as it does not find the infiniband/common.h
     To overcome this I manually added -I/include ....
     make CFLAGS="-I/tmp/ez/usr/include"
OK   6. make install

--------------- OPENSM ------------------
OK   7. cd management/opensm; autogen.sh;
FAIL 8. configure --prefix=/tmp/ez/usr
     checking for umad_init in -libumad... no
     configure: error: umad_init() not found. libosmvendor of type openib requires libibumad.
     configure: error: /bin/sh './configure' failed for libvendor
     To overcome this I manually added the --disable-libcheck
     ./configure --prefix=/tmp/ez/usr --disable-libcheck
     This problem is the same as the above: lib path for linking should use the /lib.
FAIL 9. make
     Here again the include path is missing the /include:
     ./../include/vendor/osm_vendor_ibumad.h:44:31: infiniband/common.h: No such file or directory
     ./../include/vendor/osm_vendor_ibumad.h:45:29: infiniband/umad.h: No such file or directory
     To overcome this I manually added -I/include ....
     make CFLAGS="-I/tmp/ez/usr/include"
     But this is not enough as the linker fails:
     /usr/bin/ld: cannot find -libumad
     To overcome this I had to add -L/lib ....
     make CFLAGS="-I/tmp/ez/usr/include" LDFLAGS="-L/tmp/ez/usr/lib -libumad -libcommon"
OK   10. make install

I hope the above issues could be fixed such that the installation would be
simpler. Also I propose removing the un-needed extra levels of autotools
inside OpenSM code, as there is no need/reason to have it be declared as 5
different projects, resulting in a "configure" time longer than the compile
time.

Thanks

Eitan

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL From mst at dev.mellanox.co.il Sun Sep 2 12:15:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 2 Sep 2007 22:15:55 +0300 Subject: [ofa-general] Re: [PATCH] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070830130852.GF2532@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> Message-ID: <20070902191555.GA8966@mellanox.co.il> > Quoting Michael S. Tsirkin : > Subject: [PATCH] IB/ipoib: S/G and HW checksum support > > Add module option hw_csum: when set, IPoIB will report S/G > support, and rely on hardware end-to-end transport checksum (ICRC) > instead of software-level protocol checksums. > > Since this will not inter-operate with older IPoIB modules, > this option is off by default. > > Signed-off-by: Michael S. Tsirkin Roland, any opinion on this one? Can this be queued for 2.6.24? -- MST From tziporet at dev.mellanox.co.il Sun Sep 2 23:52:56 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 03 Sep 2007 09:52:56 +0300 Subject: [ofa-general] OFED Aug 27 meeting summary on OFED 1.3 development status In-Reply-To: <46D440D2.3050904@noaa.gov> References: <6C2C79E72C305246B504CBA17B5500C901563C50@mtlexch01.mtl.com> <46D440D2.3050904@noaa.gov> Message-ID: <46DBAF48.2090407@mellanox.co.il> Nathan Dauchy wrote: > for distros, but does this mean that the RPM build and install will have > to be done in one step?
That would make life MUCH more difficult for > sites like ours that have a diskless cluster, where the nodes running > OFED and where the compile is done do not have write access to the > image. We must build one place and install (with chroot or 'rpm > --root') in another. Separating build and install worked very well for > us in the past. > > > This will not be possible any more :-(. You will have to run an install on a machine that you can also install to. Tziporet From atabachnik.of at gmail.com Mon Sep 3 00:02:02 2007 From: atabachnik.of at gmail.com (Alex Tabachnik) Date: Mon, 3 Sep 2007 10:02:02 +0300 Subject: [ofa-general] symlink to the latest OFED 1.3 package Message-ID: <001d01c7edf8$5622c4e0$090519ac@voltaire.com> Vlad, Can you please add making a symlink to the latest OFED package on the OFA server during the automatic build, like is being done for OFED 1.2. Thank you Alex. -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at mellanox.co.il Mon Sep 3 01:02:31 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 3 Sep 2007 11:02:31 +0300 Subject: [ofa-general] Error building OFED 1.2 sources In-Reply-To: <3307cdf90708302154s68f103d3s75d2231a654d3dc@mail.gmail.com> References: <3307cdf90708302154s68f103d3s75d2231a654d3dc@mail.gmail.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C902313B00@mtlexch01.mtl.com> > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general- > bounces at lists.openfabrics.org] On Behalf Of Rajouri Jammu > Sent: Friday, August 31, 2007 7:54 AM > To: openib-general at openib.org > Subject: [ofa-general] Error building OFED 1.2 sources > > Hi, > > I would like to make some changes to the OFED-1.2 kernel source. In > order to make sure I'm doing the right thing I tried compiling the > original sources (without any changes) > the make is bailing out with an error (see below for the steps). > > I'm building on stock Centos5 kernel. 
> > Note that build.sh script worked but I would like to make changes to > the sources and make sure it compiles before following the > ofed_path.sh procedure. > > Any help would be great. > > > This is what I did: > > tar -zxvf OFED-1.2.tgz > rpm -ivh OFED-1.2/SRPM/ofa_kernel-1.2-0.src.rpm > > This created the source tar ball in > /usr/src/redhat/SOURCES/ > > tar -zxvf /usr/src/redhat/SOURCES/ofa_kernel-1.2.tgz > cd /usr/src/redhat/SOURCES/ofa_kernel-1.2 > > ./configure > make > You should run configure with parameters. E.g. ./configure --with-cxgb3-mod --with-ipath_inf-mod --with-ipoib-mod --with-iser-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_m ad-mod --with-user_access-mod --with-addr_trans-mod --with-rds-mod make make install Regards, Vladimir From krkumar2 at in.ibm.com Mon Sep 3 02:21:21 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 3 Sep 2007 14:51:21 +0530 Subject: [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB In-Reply-To: <20070828.215150.98552996.davem@davemloft.net> Message-ID: Hi Dave, David Miller wrote on 08/29/2007 10:21:50 AM: > From: Krishna Kumar2 > Date: Wed, 29 Aug 2007 08:53:30 +0530 > > > I am scp'ng from 192.168.1.1 to 192.168.1.2 and captured at the send > > side. > > Bad choice of test, this is cpu limited since the scp > has to encrypt and MAC hash all the data it sends. > > Use something like straight ftp or "bw_tcp" from lmbench. I used bw_tcp from lmbench-3. I transfered 500MB and captured the tcpdump, and analysis at various points gave pipeline sizes: 26064, 27792, 22888, 23168, 23448, 20272, 23168, 4344, 10136, 164792, 35920, 26344, 24336, 24336, 23168, 25784, 23168, There was one huge 164K, otherwise most were in smaller ranges like 20-30K. 
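
As a rough cross-check of the "mostly 20-30K" reading, the pipeline sizes listed above can be summarised with a few lines of shell. This is only a sketch: the sizes variable and the summarize helper are names made up for this illustration, and the 20000-30000 byte window is an assumed interpretation of "20-30K".

```shell
#!/bin/sh
# Pipeline sizes (bytes), copied from the list in the message above.
sizes="26064 27792 22888 23168 23448 20272 23168 4344 10136 164792 35920 26344 24336 24336 23168 25784 23168"

# summarize (hypothetical helper): print the sample count, the median,
# the maximum, and how many samples fall inside [20000, 30000].
summarize() {
    printf '%s\n' "$1" | tr ' ' '\n' | sort -n | awk '
        { v[NR] = $1; if ($1 >= 20000 && $1 <= 30000) in_range++ }
        END {
            # Input is sorted ascending, so v[NR] is the largest sample.
            median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "count=%d median=%d max=%d in_20k_30k=%d\n", NR, median, v[NR], in_range
        }'
}

summary=$(summarize "$sizes")
echo "$summary"
```

With the seventeen samples above this reports a median of 23448 bytes, a maximum of 164792, and 13 samples inside the window, which agrees with the description of one large outlier among otherwise similar sizes.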
I ran the following test script:

SERVER=192.168.1.2
BYTES=100m
BUFFERSIZES="4096 16384 65536 131072 262144"
PROCS="1 8"
ITERATIONS=5

for m in $BUFFERSIZES
do
	for procs in $PROCS
	do
		echo TEST: Size:$m Procs:$procs
		bw_tcp -N $ITERATIONS -m $m -M $BYTES -P $procs $SERVER
	done
done

Result is:

Test without batching:
#    Size     Procs   BW (MB/s)
1    4096     1       117.39
2    16384    1       117.49
3    65536    1       117.55
4    131072   1       117.55
5    262144   1       117.58
6    4096     8       117.18
7    16384    8       117.47
8    65536    8       117.54
9    131072   8       117.59
10   262144   8       117.55

Test with batching:
#    Size     Procs   BW (MB/s)
1    4096     1       117.39
2    16384    1       117.48
3    65536    1       117.55
4    131072   1       117.58
5    262144   1       117.58
6    4096     8       117.19
7    16384    8       117.46
8    65536    8       117.53
9    131072   8       117.55
10   262144   8       117.60

So it doesn't seem to harm e1000. Can someone give a link to the E1000E
driver? I couldn't find it after downloading Jeff's netdev-2.6 tree.

Thanks,

- KK

From vlad at lists.openfabrics.org Mon Sep 3 02:47:55 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 3 Sep 2007 02:47:55 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20070903-0200 daily build status
Message-ID: <20070903094755.F36E8E60805@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with
linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' 
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of 
'->' /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070903-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From kliteyn at dev.mellanox.co.il Mon Sep 3 05:12:00 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 03 Sep 2007 15:12:00 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS parser - fixing yacc command In-Reply-To: <20070901234049.GC16108@sashak.voltaire.com> References: <46D6E4C3.80201@dev.mellanox.co.il> <20070831121013.GF11549@sashak.voltaire.com> <46D88E4F.4020603@dev.mellanox.co.il> <20070831155529.2bf8d902.weiny2@llnl.gov> <20070901102601.GM11549@sashak.voltaire.com> <46D9E8FB.50507@dev.mellanox.co.il> <20070901234049.GC16108@sashak.voltaire.com> Message-ID: <46DBFA10.8020101@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 01:34 Sun 02 Sep , Yevgeny Kliteynik wrote: >> Sasha Khapyorsky wrote: >>> Hi Ira, >>> On 15:55 Fri 31 Aug , Ira Weiny wrote: >>>> We just ran into a problem with this patch applied. >>>> >>>> It seems that the output file is not y.tab.h but osm_qos_parser_y.h so >>>> should >>>> the move be: mv -f osm_qos_parser_y.h >>>> $(srcdir)/../include/opensm/osm_qos_parser_y.h >>> I applied this fix for now. >>>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm >>>> 15:51:42 > ls *.h >>>> ls: *.h: No such file or directory >>>> >>>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm >>>> 15:52:02 > bison -d -o ./osm_qos_parser_y.c -p__qos_parser_ >>>> ./osm_qos_parser.y >>>> >>>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm >>>> 15:52:21 > ls *.h >>>> osm_qos_parser_y.h >>>> >>>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm >>>> 15:53:17 > bison --version >>>> bison (GNU Bison) 1.875c >>> I have bison-2.3 and similar results. With yacc-1.9.1 this line doesn't >>> work at all. 
The only "compatible" rules I found are: >>> yacc -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y >>> , or >>> bison -y -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y >>> , and then I get osm_qos_parser.tab.h and osm_qos_parser.tab.c files in >>> current directory. >>> Yevgeny! Is this could be useful? >> How about dropping all these yacc/bison/lex/flex/version dependent commands > > Is this > > $(YACC) -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y > > yacc/bison/lex/flex/version dependent? I don't know whether the '-b' flag is yacc/bison version dependent, but what do you gain by using it if the generated file should be moved to another location anyway? -- Yevgeny > Sasha > >> and going back to something like what I've submitted in the original patch: >> >> osm_qos_parser_y.c: $(srcdir)/osm_qos_parser.y >> $(srcdir)/../include/opensm/osm_qos_policy.h >> $(YACC) -d $(srcdir)/osm_qos_parser.y >> mv -f y.tab.c $(srcdir)/osm_qos_parser_y.c >> mv -f y.tab.h $(srcdir)/../include/opensm/osm_qos_parser_y.h >> >> osm_qos_parser_l.c: $(srcdir)/osm_qos_parser.l >> $(srcdir)/../include/opensm/osm_qos_policy.h >> $(LEX) $(srcdir)/osm_qos_parser.l >> mv -f lex.yy.c $(srcdir)/osm_qos_parser_l.c >> >> And if we're really worried about prefixes, we can add it too: >> >> $(YACC) -d -p__qos_parser_ $(srcdir)/osm_qos_parser.y >> and >> $(LEX) -P__qos_parser_ $(srcdir)/osm_qos_parser.l >> >> >> -- Yevgeny >> >>> Sasha > From kliteyn at dev.mellanox.co.il Mon Sep 3 05:15:55 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 03 Sep 2007 15:15:55 +0300 Subject: [ofa-general] [PATCH] osm: QoS: selecting PathRecord according to QoS policy Message-ID: <46DBFAFB.4090000@dev.mellanox.co.il> Selecting path according to QoS policy level that the PathRecord query matches. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_sa_path_record.c | 383 ++++++++++++++++++++++++++++++------ 1 files changed, 320 insertions(+), 63 deletions(-) diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c index 1b781f0..8fc5eac 100644 --- a/opensm/opensm/osm_sa_path_record.c +++ b/opensm/opensm/osm_sa_path_record.c @@ -67,6 +67,7 @@ #include #include #include +#include #ifdef ROUTER_EXP #include #include @@ -236,8 +237,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, { const osm_node_t *p_node; const osm_physp_t *p_physp; + const osm_physp_t *p_src_physp; const osm_physp_t *p_dest_physp; - const osm_prtn_t *p_prtn; + const osm_prtn_t *p_prtn = NULL; const ib_port_info_t *p_pi; ib_api_status_t status = IB_SUCCESS; ib_net16_t pkey; @@ -248,14 +250,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, uint8_t required_rate; uint8_t required_pkt_life; uint8_t sl; + uint8_t in_port_num; ib_net16_t dest_lid; + uint8_t i; + uint8_t vl; + ib_slvl_table_t *p_slvl_tbl = NULL; + boolean_t valid_sls[IB_MAX_NUM_VLS]; + boolean_t sl2vl_valid_path; + uint8_t first_valid_sl; + osm_qos_level_t *p_qos_level = NULL; OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); + memset(valid_sls, TRUE, sizeof(valid_sls)); dest_lid = cl_hton16(dest_lid_ho); p_dest_physp = p_dest_port->p_physp; p_physp = p_src_port->p_physp; + p_src_physp = p_physp; p_pi = &p_physp->port_info; mtu = ib_port_info_get_mtu_cap(p_pi); @@ -288,13 +300,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, p_node = osm_physp_get_node_ptr(p_physp); if (p_node->sw) { + /* source node is a switch */ + in_port_num = osm_physp_get_port_num(p_physp); + /* * If the dest_lid_ho is equal to the lid of the switch pointed by * p_sw then p_physp will be the physical port of the switch port zero. 
+ * Make sure that p_physp points to the out port of the + * switch that routes to the destination lid (dest_lid_ho) */ - p_physp = - osm_switch_get_route_by_lid(p_node->sw, - cl_ntoh16(dest_lid_ho)); + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); if (p_physp == 0) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F02: " @@ -304,17 +319,36 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, status = IB_NOT_FOUND; goto Exit; } + if (!p_rcv->p_subn->opt.no_qos) + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); + } + + if (!p_rcv->p_subn->opt.no_qos) { + if (p_node->sw) + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); + else + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); + + /* update valid SLs that still exist on this route */ + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sls[i]) { + vl = ib_slvl_table_get(p_slvl_tbl, i); + if (vl == IB_DROP_VL) + valid_sls[i] = FALSE; + } + } } /* - * Same as above + * now get pointer to the destination port (same as above) */ p_node = osm_physp_get_node_ptr(p_dest_physp); if (p_node->sw) { - p_dest_physp = - osm_switch_get_route_by_lid(p_node->sw, - cl_ntoh16(dest_lid_ho)); + /* + * if destination is switch, we want p_dest_physp to point to port 0 + */ + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); if (p_dest_physp == 0) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, @@ -328,6 +362,10 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, } + /* + * Now go through the path step by step + */ + while (p_physp != p_dest_physp) { p_physp = osm_physp_get_remote(p_physp); @@ -341,6 +379,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, goto Exit; } + in_port_num = osm_physp_get_port_num(p_physp); + /* This is point to point case (no switch in between) */ @@ -409,6 +449,20 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, CL_ASSERT(p_physp); CL_ASSERT(osm_physp_is_valid(p_physp)); + p_node = 
osm_physp_get_node_ptr(p_physp); + if (!p_node->sw) { + /* + * There is some sort of problem in the subnet object! + * If this isn't a switch, we should have reached + * the destination by now! + */ + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F05: " + "Internal error, bad path\n"); + status = IB_ERROR; + goto Exit; + } + p_pi = &p_physp->port_info; if (mtu > ib_port_info_get_mtu_cap(p_pi)) { @@ -435,6 +489,21 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, osm_physp_get_port_num(p_physp)); } + if (!p_rcv->p_subn->opt.no_qos) { + /* + * Check SL2VL table of the switch and update valid SLs + */ + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sls[i]) { + vl = ib_slvl_table_get(p_slvl_tbl, i); + if (vl == IB_DROP_VL) + valid_sls[i] = FALSE; + } + } + } + + /* go to the next step in the path */ } /* @@ -467,9 +536,118 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, "__osm_pr_rcv_get_path_parms: " "Path min MTU = %u, min rate = %u\n", mtu, rate); + if (!p_rcv->p_subn->opt.no_qos) { + /* check whether there is some SL that won't lead to VL15 eventually */ + sl2vl_valid_path = FALSE; + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sls[i]) { + sl2vl_valid_path = TRUE; + first_valid_sl = i; + break; + } + } + + if (!sl2vl_valid_path) { + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "All the SLs lead to VL15 on this path\n"); + } + status = IB_NOT_FOUND; + goto Exit; + } + } + + if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { + /* Get QoS Level object according to the path request */ + osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, + p_rcv, p_pr, + p_src_physp, p_dest_physp, + comp_mask, &p_qos_level); + + if (p_qos_level + && osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + 
"__osm_pr_rcv_get_path_parms: " + "PathRecord request matches QoS Level '%s' (%s)\n", + p_qos_level->name, + (p_qos_level->use) ? p_qos_level-> + use : "no description"); + } + } + + /* Adjust path parameters according to QoS settings */ + + if (p_qos_level) { + /* adjust MTU limit according to QoS constraints */ + if (p_qos_level->mtu_limit_set + && (mtu > p_qos_level->mtu_limit)) { + mtu = p_qos_level->mtu_limit; + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "QoS constaraints: new smallest MTU = %u\n", + mtu); + } + } + + /* adjust Rate limit according to QoS constraints */ + if (p_qos_level->rate_limit_set + && (rate > p_qos_level->rate_limit)) { + rate = p_qos_level->rate_limit; + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "QoS constaraints: new smallest Rate = %u\n", + rate); + } + } + + /* adjust Packet Lifetime according to QoS constraints */ + if (p_qos_level->pkt_life_set + && (pkt_life > p_qos_level->pkt_life)) { + pkt_life = p_qos_level->pkt_life; + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "QoS constaraints: new smallest Packet Lifetime = %u\n", + pkt_life); + } + } + + /* adjust SL according to QoS constraints */ + if (p_qos_level->sl_set) { + if (!valid_sls[p_qos_level->sl]) { + status = IB_NOT_FOUND; + goto Exit; + } + sl = p_qos_level->sl; + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "QoS constaraints: new SL = %u\n", + sl); + } + } + } + + /* + * Set packet lifetime. + * According to spec definition IBA 1.2 Table 205 + * PacketLifeTime description, for loopback paths, + * packetLifeTime shall be zero. 
+ */ + if (p_src_port == p_dest_port) + pkt_life = 0; + else + if ( !(p_qos_level && p_qos_level->pkt_life_set) ) + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; + + /* - Determine if these values meet the user criteria - and adjust appropriately + * Done adjusting parameters according to QoS constraints. + * Determine if these values meet the user criteria and + * adjust appropriately. */ /* we silently ignore cases where only the MTU selector is defined */ @@ -511,6 +689,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, break; } } + if (status != IB_SUCCESS) + goto Exit; /* we silently ignore cases where only the Rate selector is defined */ if ((comp_mask & IB_PR_COMPMASK_RATESELEC) && @@ -551,14 +731,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, break; } } - - /* Verify the pkt_life_time */ - /* According to spec definition IBA 1.2 Table 205 PacketLifeTime description, - for loopback paths, packetLifeTime shall be zero. */ - if (p_src_port == p_dest_port) - pkt_life = 0; /* loopback */ - else - pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; + if (status != IB_SUCCESS) + goto Exit; /* we silently ignore cases where only the PktLife selector is defined */ if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) && @@ -603,38 +777,68 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, if (status != IB_SUCCESS) goto Exit; - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); - else if (comp_mask & IB_PR_COMPMASK_PKEY) { - pkey = p_pr->pkey; - if (!osm_physp_share_this_pkey(p_physp, p_dest_physp, pkey)) { - osm_log(p_rcv->p_log, OSM_LOG_ERROR, - "__osm_pr_rcv_get_path_parms: ERR 1F1A: " - "Ports do not share specified PKey 0x%04x\n", - cl_ntoh16(pkey)); - status = IB_NOT_FOUND; - goto Exit; - } - } else { - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); - if (!pkey) { - osm_log(p_rcv->p_log, OSM_LOG_ERROR, - "__osm_pr_rcv_get_path_parms: 
ERR 1F1B: " - "Ports do not have any shared PKeys\n"); - status = IB_NOT_FOUND; - goto Exit; + /* + * set Pkey for this path record request + */ + + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) + pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp); + else { + if (comp_mask & IB_PR_COMPMASK_PKEY) { + /* + * PR request has a specific pkey: + * Check that source and destination share this pkey. + * If QoS level has pkeys, check that this pkey exists + * in the QoS level pkeys. + * PR returned pkey is the requested pkey. + */ + pkey = p_pr->pkey; + if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1A: " + "Ports do not share specified PKey 0x%04x\n", + cl_ntoh16(pkey)); + status = IB_NOT_FOUND; + goto Exit; + } + if (p_qos_level && p_qos_level->pkey_range_len && + !osm_qos_level_has_pkey(p_qos_level, pkey)) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1D: " + "Ports do not share PKeys defined by QoS level\n"); + status = IB_NOT_FOUND; + goto Exit; + } + } else { + /* PR request doesn't have a specific pkey */ + + if (p_qos_level && p_qos_level->pkey_range_len) { + /* If QoS level has pkeys, get shared pkey from QoS level pkeys */ + pkey = osm_qos_level_get_shared_pkey(p_qos_level, + p_src_physp, + p_dest_physp); + if (!pkey) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1E: " + "Ports do not share PKeys defined by QoS level\n"); + status = IB_NOT_FOUND; + goto Exit; + } + } else { + pkey = osm_physp_find_common_pkey(p_src_physp, + p_dest_physp); + if (!pkey) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1B: " + "Ports do not have any shared PKeys\n"); + status = IB_NOT_FOUND; + goto Exit; + } + } } } - if (p_rcv->p_subn->opt.routing_engine_name && - strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) - /* slid 
and dest_lid are stored in network in lash */ - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, - p_dest_port); - else - sl = OSM_DEFAULT_SL; - if (pkey) { p_prtn = (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, @@ -642,34 +846,87 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, 0x8000)); if (p_prtn == (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) + p_prtn = NULL; + } + + /* + * Set PathRecord SL. + * + * ToDo: What about QoS and LASH routing? How can they coexist? + * And what happens when there's a pkey, hence there is a + * partition with a certain SL, and this SL doesn't match + * the one that's defined by LASH? + */ + + if (comp_mask & IB_PR_COMPMASK_SL) { + /* + * Specific SL was requested + */ + sl = ib_path_rec_sl(p_pr); + if (p_qos_level && p_qos_level->sl_set && (p_qos_level->sl != sl)) { + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "QoS constaraints: required PR SL (%u) doesn't match QoS SL (%u)\n", + sl, p_qos_level->sl); + } + status = IB_NOT_FOUND; + goto Exit; + } + } else if (p_qos_level && p_qos_level->sl_set) { + /* + * No specific SL was requested, + * but there is an SL in QoS level + */ + sl = p_qos_level->sl; + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + if (pkey && p_prtn && p_prtn->sl != p_qos_level->sl) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "QoS level SL (%u) overrides partition SL (%u)\n", + p_qos_level->sl, p_prtn->sl); + } + } + } else if (pkey) { + /* + * No specific SL in request or in QoS level - use partition SL + */ + if (!p_prtn) { /* this may be possible when pkey tables are created somehow in previous runs or things are going wrong here */ osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F1C: " "No partition found for PKey 0x%04x - using default SL %d\n", cl_ntoh16(pkey), sl); - else { - if (p_rcv->p_subn->opt.routing_engine_name 
&& - strcmp(p_rcv->p_subn->opt.routing_engine_name, - "lash") == 0) - /* slid and dest_lid are stored in network in lash */ - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, - p_src_port, p_dest_port); - else - sl = p_prtn->sl; - } - - /* reset pkey when raw traffic */ - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) - pkey = 0; + } else + sl = p_prtn->sl; + } else if (p_rcv->p_subn->opt.routing_engine_name && + strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) { + /* slid and dest_lid are stored in network in lash */ + sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, + p_src_port, p_dest_port); + } else if (!p_rcv->p_subn->opt.no_qos) { + sl = first_valid_sl; } + else + sl = OSM_DEFAULT_SL; - if ((comp_mask & IB_PR_COMPMASK_SL) && ib_path_rec_sl(p_pr) != sl) { + if (!p_rcv->p_subn->opt.no_qos && !valid_sls[sl]) { + /* selected SL will eventually lead to VL15 */ + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "Selected SL (%u) leads to VL15\n", p_prtn->sl); + } status = IB_NOT_FOUND; goto Exit; } + /* reset pkey when raw traffic */ + if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && + cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) + pkey = 0; + p_parms->mtu = mtu; p_parms->rate = rate; p_parms->pkt_life = pkt_life; -- 1.5.1.4 From sashak at voltaire.com Mon Sep 3 05:29:57 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 3 Sep 2007 15:29:57 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS parser - fixing yacc command In-Reply-To: <46DBFA10.8020101@dev.mellanox.co.il> References: <46D6E4C3.80201@dev.mellanox.co.il> <20070831121013.GF11549@sashak.voltaire.com> <46D88E4F.4020603@dev.mellanox.co.il> <20070831155529.2bf8d902.weiny2@llnl.gov> <20070901102601.GM11549@sashak.voltaire.com> <46D9E8FB.50507@dev.mellanox.co.il> <20070901234049.GC16108@sashak.voltaire.com> <46DBFA10.8020101@dev.mellanox.co.il> Message-ID: 
<20070903122957.GA29384@sashak.voltaire.com> On 15:12 Mon 03 Sep , Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: > > On 01:34 Sun 02 Sep , Yevgeny Kliteynik wrote: > >> Sasha Khapyorsky wrote: > >>> Hi Ira, > >>> On 15:55 Fri 31 Aug , Ira Weiny wrote: > >>>> We just ran into a problem with this patch applied. > >>>> > >>>> It seems that the output file is not y.tab.h but osm_qos_parser_y.h so > >>>> should > >>>> the move be: mv -f osm_qos_parser_y.h > >>>> $(srcdir)/../include/opensm/osm_qos_parser_y.h > >>> I applied this fix for now. > >>>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm > >>>> 15:51:42 > ls *.h > >>>> ls: *.h: No such file or directory > >>>> > >>>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm > >>>> 15:52:02 > bison -d -o ./osm_qos_parser_y.c -p__qos_parser_ > >>>> ./osm_qos_parser.y > >>>> > >>>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm > >>>> 15:52:21 > ls *.h > >>>> osm_qos_parser_y.h > >>>> > >>>> weiny2 at woprjr0:~/OpenIB/git-trees/management/opensm/opensm > >>>> 15:53:17 > bison --version > >>>> bison (GNU Bison) 1.875c > >>> I have bison-2.3 and similar results. With yacc-1.9.1 this line doesn't > >>> work at all. The only "compatible" rules I found are: > >>> yacc -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y > >>> , or > >>> bison -y -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y > >>> , and then I get osm_qos_parser.tab.h and osm_qos_parser.tab.c files in > >>> current directory. > >>> Yevgeny! Is this could be useful? > >> How about dropping all these yacc/bison/lex/flex/version dependent > >> commands > > Is this > > $(YACC) -d -b osm_qos_parser -p __qos_parser_ ./osm_qos_parser.y > > yacc/bison/lex/flex/version dependent? > > I don't know whether the '-b' flag is yacc/bison version dependent, > but what do you gain by using it if the generated file should be moved > to another location anyway? Then it should not be moved at all. 
Sasha From snagai at jp.ibm.com Mon Sep 3 07:31:27 2007 From: snagai at jp.ibm.com (snagai at jp.ibm.com) Date: Mon, 3 Sep 2007 10:31:27 -0400 Subject: [ofa-general] kdapl build error on ppc64 Message-ID: <10257140.1188829887714.JavaMail.root@wombat.diezmil.com> I have trouble with building kdapl module on ppc64 machine. I saw a lot of error messages during building it using any version of dapl source code downloaded from sourceforge (http://sourceforge.net/projects/dapl). I tried to build the module on both RHEL5 and Fedora Core6 using kernel 2.6.20, gcc 4.1.1. Did anyone succeed to build kdapl module on ppc64 machine ? If anyone has an idea about this issue, please advise me. -------------------------------------------------------------------------- In file included from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:76, from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:44: /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:44:24: error: vapi_types.h: No such file or directory /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:45:18: error: vapi.h: No such file or directory /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:46:19: error: evapi.h: No such file or directory In file included from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:76, from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:44: /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:59: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_hca_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:60: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_cq_handle_t’
/home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:61: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_kcq_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:62: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_qp_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:63: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_pd_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:64: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_mr_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:65: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_mw_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:66: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_qp_state_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:67: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_hca_name_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:68: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_error_record_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:69: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_work_completion_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:70: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_notification_type_t’
/home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:71: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_async_handler_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:72: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_comp_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:73: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_data_segment_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:74: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_async_event_type’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:86: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_bool_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:100: error: expected specifier-qualifier-list before ‘ib_cq_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:125: error: expected specifier-qualifier-list before ‘ib_mr_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:130: error: expected ‘)’ before ‘*’ token /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:133: error: expected ‘)’ before ‘*’ token /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:136: error: expected ‘)’ before ‘*’ token /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:140: error: expected specifier-qualifier-list before ‘ib_bool_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h: In function
‘dapl_ib_status_convert’: /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:195: error: ‘VAPI_OK’ undeclared (first use in this function) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:195: error: (Each undeclared identifier is reported only once /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:195: error: for each function it appears in.) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:199: error: ‘VAPI_EAGAIN’ undeclared (first use in this function) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:200: error: ‘VAPI_EBUSY’ undeclared (first use in this function) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:204: error: ‘VAPI_ENOMEM’ undeclared (first use in this function) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:208: error: ‘VAPI_EINVAL_CQ_HNDL’ undeclared (first use in this function) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:209: error: ‘VAPI_EINVAL_HCA_HNDL’ undeclared (first use in this function) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:213: error: ‘VAPI_CQ_EMPTY’ undeclared (first use in this function) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:217: error: ‘VAPI_ETIMEOUT’ undeclared (first use in this function) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:221: error: ‘VAPI_EINTR’ undeclared (first use in this function) /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h: At top level:
/home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:238: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘dapls_modify_qp_state_to_init’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:243: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘dapls_modify_qp_state_to_error’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_util.h:247: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘dapls_modify_qp_state_to_reset’ In file included from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:77, from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:44: /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm.h:61:29: error: ts_ib_sa_client.h: No such file or directory In file included from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm.h:62, from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:77, from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:44: /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:36:59: error: ts_ib_useraccess_cm.h: No such file or directory In file included from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm.h:62, from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:77, from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:44: /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:39: error: expected ‘)’ before ‘comm_id’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:41: error: expected ‘)’
before ‘comm_id’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:46: error: expected ‘)’ before ‘comm_id’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:48: error: expected ‘)’ before ‘comm_id’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:51: error: expected ‘)’ before ‘listen_handle’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:54: error: expected ‘)’ before ‘service_id’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:62: error: expected ‘)’ before ‘hca_handle’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:76: error: expected ‘)’ before ‘comm_id’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:80: error: expected ‘)’ before ‘hca_handle’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm_util.h:89: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_cm_vapi_service_assign’ In file included from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:77, from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:44: /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm.h:87: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_cm_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm.h:88: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ib_cm_srvc_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm.h:102: error: expected ‘)’ before ‘consumer_qp’
/home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../openib_gen_one/dapl_openib_cm.h:108: error: expected ‘)’ before ‘consumer_qp’ In file included from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:44: /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:192:5: warning: "NDEBUG" is not defined In file included from /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:44: /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:283: error: expected specifier-qualifier-list before ‘ib_hca_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:359: error: expected specifier-qualifier-list before ‘ib_cq_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:443: error: expected specifier-qualifier-list before ‘ib_qp_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:496: error: expected specifier-qualifier-list before ‘ib_pd_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:505: error: expected specifier-qualifier-list before ‘ib_mr_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:521: error: expected specifier-qualifier-list before ‘ib_mw_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:548: error: expected specifier-qualifier-list before ‘ib_cm_srvc_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/../include/dapl.h:565: error: expected specifier-qualifier-list before ‘ib_cm_handle_t’ /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:54: error: expected ‘)’ before string constant /home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:58: error: expected ‘)’ before string constant
/home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.c:61: error: expected ‘)’ before string constant make[4]: *** [/home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl/dapl_module.o] Error 1 make[3]: *** [_module_/home/user/archives/ppc64/dapl-RHEL5/dapl_gamma3.2/dapl/kdapl] Error 2 make[2]: *** [modules] Error 2 make[1]: *** [modules] Error 2 make: *** [all] Error 2 -------------------------------------------------------------------------- -- This message was sent on behalf of snagai at jp.ibm.com at openSubscriber.com http://www.opensubscriber.com/messages/openib-general at openib.org/topic.html From vlad at dev.mellanox.co.il Mon Sep 3 08:11:14 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 03 Sep 2007 18:11:14 +0300 Subject: [ofa-general] symlink to the latest OFED 1.3 package In-Reply-To: <001d01c7edf8$5622c4e0$090519ac@voltaire.com> References: <001d01c7edf8$5622c4e0$090519ac@voltaire.com> Message-ID: <46DC2412.9080001@dev.mellanox.co.il> Alex Tabachnik wrote: > Vlad, > > Can you please add making a symlink to the latest OFED package on the > OFA server during the automatic build, like is being done for OFED 1.2. > > Thank you > > Alex. > Done. Regards, Vladimir From vlad at dev.mellanox.co.il Mon Sep 3 08:38:03 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 03 Sep 2007 18:38:03 +0300 Subject: [ofa-general] [ANNOUNCE] ofed_1_3/linux-2.6.git updated to 2.6.23-rc5 Message-ID: <46DC2A5B.1000101@dev.mellanox.co.il> FYI, git://git.openfabrics.org/ofed_1_3/linux-2.6.git I've merged in 2.6.23-rc5.
Regards, Vladimir From sashak at voltaire.com Mon Sep 3 10:20:11 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 3 Sep 2007 20:20:11 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS: selecting PathRecord according to QoS policy In-Reply-To: <46DBFAFB.4090000@dev.mellanox.co.il> References: <46DBFAFB.4090000@dev.mellanox.co.il> Message-ID: <20070903172010.GB29384@sashak.voltaire.com> Hi Yevgeny, The initial comments below. Basically I think some code cleanup is needed, and please decrease number of osm_log(...OSM_LOG_DEBUG...). Sasha On 15:15 Mon 03 Sep , Yevgeny Kliteynik wrote: > Selecting path according to QoS policy level that > the PathRecord query matches. > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/osm_sa_path_record.c | 383 ++++++++++++++++++++++++++++++------ > 1 files changed, 320 insertions(+), 63 deletions(-) > > diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c > index 1b781f0..8fc5eac 100644 > --- a/opensm/opensm/osm_sa_path_record.c > +++ b/opensm/opensm/osm_sa_path_record.c > @@ -67,6 +67,7 @@ > #include > #include > #include > +#include > #ifdef ROUTER_EXP > #include > #include > @@ -236,8 +237,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > { > const osm_node_t *p_node; > const osm_physp_t *p_physp; > + const osm_physp_t *p_src_physp; > const osm_physp_t *p_dest_physp; > - const osm_prtn_t *p_prtn; > + const osm_prtn_t *p_prtn = NULL; > const ib_port_info_t *p_pi; > ib_api_status_t status = IB_SUCCESS; > ib_net16_t pkey; > @@ -248,14 +250,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > uint8_t required_rate; > uint8_t required_pkt_life; > uint8_t sl; > + uint8_t in_port_num; > ib_net16_t dest_lid; > + uint8_t i; > + uint8_t vl; > + ib_slvl_table_t *p_slvl_tbl = NULL; > + boolean_t valid_sls[IB_MAX_NUM_VLS]; Use here uint16_t sl_mask instead of array - flow will be simpler. 
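To make the sl_mask suggestion concrete, here is a minimal sketch, not OpenSM code. It assumes 16 SLs (one bit each) and that VL15 is the drop value; `NUM_SLS`, `DROP_VL`, `slvl_table_get()`, `update_sl_mask()` and `first_valid_sl()` are hypothetical stand-ins for illustration, loosely modelled on IB_DROP_VL and ib_slvl_table_get() from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLS 16	/* assumption: one mask bit per SL */
#define DROP_VL 15	/* assumption: stand-in for IB_DROP_VL */

/* Hypothetical stand-in for ib_slvl_table_get(): one VL entry per SL. */
static uint8_t slvl_table_get(const uint8_t *tbl, uint8_t sl)
{
	return tbl[sl];
}

/* Clear the bit of every still-valid SL that this hop maps to the drop VL. */
static uint16_t update_sl_mask(uint16_t sl_mask, const uint8_t *tbl)
{
	uint8_t i;

	assert(tbl);
	for (i = 0; i < NUM_SLS; i++)
		if ((sl_mask & (1u << i)) && slvl_table_get(tbl, i) == DROP_VL)
			sl_mask &= (uint16_t)~(1u << i);
	return sl_mask;
}

/* Lowest SL whose bit is still set, or -1 if every SL was dropped. */
static int first_valid_sl(uint16_t sl_mask)
{
	int i;

	for (i = 0; i < NUM_SLS; i++)
		if (sl_mask & (1u << i))
			return i;
	return -1;
}
```

With a mask, the "is any SL still valid" check becomes `sl_mask != 0` and the per-SL check becomes a single bit test, so the separate sl2vl_valid_path flag and the memset() would probably not be needed.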
> + boolean_t sl2vl_valid_path; > + uint8_t first_valid_sl; > + osm_qos_level_t *p_qos_level = NULL; > > OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); > > + memset(valid_sls, TRUE, sizeof(valid_sls)); > dest_lid = cl_hton16(dest_lid_ho); > > p_dest_physp = p_dest_port->p_physp; > p_physp = p_src_port->p_physp; > + p_src_physp = p_physp; > p_pi = &p_physp->port_info; > > mtu = ib_port_info_get_mtu_cap(p_pi); > @@ -288,13 +300,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > p_node = osm_physp_get_node_ptr(p_physp); > > if (p_node->sw) { > + /* source node is a switch */ > + in_port_num = osm_physp_get_port_num(p_physp); > + > /* > * If the dest_lid_ho is equal to the lid of the switch pointed by > * p_sw then p_physp will be the physical port of the switch port zero. > + * Make sure that p_physp points to the out port of the > + * switch that routes to the destination lid (dest_lid_ho) > */ > - p_physp = > - osm_switch_get_route_by_lid(p_node->sw, > - cl_ntoh16(dest_lid_ho)); > + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); > if (p_physp == 0) { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_get_path_parms: ERR 1F02: " > @@ -304,17 +319,36 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > status = IB_NOT_FOUND; > goto Exit; > } > + if (!p_rcv->p_subn->opt.no_qos) > + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); Here > + } > + > + if (!p_rcv->p_subn->opt.no_qos) { > + if (p_node->sw) > + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > + else > + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); and here - is it double initialization? 
> + > + /* update valid SLs that still exist on this route */ > + for (i = 0; i < IB_MAX_NUM_VLS; i++) { > + if (valid_sls[i]) { > + vl = ib_slvl_table_get(p_slvl_tbl, i); > + if (vl == IB_DROP_VL) > + valid_sls[i] = FALSE; > + } > + } > } > > /* > - * Same as above > + * now get pointer to the destination port (same as above) What was wrong with comment? Is not 'p_dest_physp = ' clear? > */ > p_node = osm_physp_get_node_ptr(p_dest_physp); > > if (p_node->sw) { > - p_dest_physp = > - osm_switch_get_route_by_lid(p_node->sw, > - cl_ntoh16(dest_lid_ho)); > + /* > + * if destination is switch, we want p_dest_physp to point to port 0 > + */ > + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); > > if (p_dest_physp == 0) { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > @@ -328,6 +362,10 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > > } > > + /* > + * Now go through the path step by step > + */ > + > while (p_physp != p_dest_physp) { > p_physp = osm_physp_get_remote(p_physp); > > @@ -341,6 +379,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > goto Exit; > } > > + in_port_num = osm_physp_get_port_num(p_physp); > + > /* > This is point to point case (no switch in between) > */ > @@ -409,6 +449,20 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > CL_ASSERT(p_physp); > CL_ASSERT(osm_physp_is_valid(p_physp)); > > + p_node = osm_physp_get_node_ptr(p_physp); > + if (!p_node->sw) { > + /* > + * There is some sort of problem in the subnet object! > + * If this isn't a switch, we should have reached > + * the destination by now! 
> + */ > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F05: " > + "Internal error, bad path\n"); > + status = IB_ERROR; > + goto Exit; > + } > + > p_pi = &p_physp->port_info; > > if (mtu > ib_port_info_get_mtu_cap(p_pi)) { > @@ -435,6 +489,21 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > osm_physp_get_port_num(p_physp)); > } > > + if (!p_rcv->p_subn->opt.no_qos) { > + /* > + * Check SL2VL table of the switch and update valid SLs > + */ > + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > + for (i = 0; i < IB_MAX_NUM_VLS; i++) { > + if (valid_sls[i]) { > + vl = ib_slvl_table_get(p_slvl_tbl, i); > + if (vl == IB_DROP_VL) > + valid_sls[i] = FALSE; > + } > + } > + } > + > + /* go to the next step in the path */ Please drop this useless comment. > } > > /* > @@ -467,9 +536,118 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > "__osm_pr_rcv_get_path_parms: " > "Path min MTU = %u, min rate = %u\n", mtu, rate); > > + if (!p_rcv->p_subn->opt.no_qos) { > + /* check whether there is some SL that won't lead to VL15 eventually */ > + sl2vl_valid_path = FALSE; > + for (i = 0; i < IB_MAX_NUM_VLS; i++) { > + if (valid_sls[i]) { > + sl2vl_valid_path = TRUE; > + first_valid_sl = i; > + break; > + } > + } > + > + if (!sl2vl_valid_path) { > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "All the SLs lead to VL15 on this path\n"); > + } > + status = IB_NOT_FOUND; > + goto Exit; > + } > + } > + > + if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { > + /* Get QoS Level object according to the path request */ > + osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, > + p_rcv, p_pr, > + p_src_physp, p_dest_physp, > + comp_mask, &p_qos_level); > + > + if (p_qos_level > + && osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + 
"PathRecord request matches QoS Level '%s' (%s)\n", > + p_qos_level->name, > + (p_qos_level->use) ? p_qos_level-> > + use : "no description"); > + } > + } > + > + /* Adjust path parameters according to QoS settings */ > + > + if (p_qos_level) { > + /* adjust MTU limit according to QoS constraints */ > + if (p_qos_level->mtu_limit_set > + && (mtu > p_qos_level->mtu_limit)) { > + mtu = p_qos_level->mtu_limit; > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "QoS constaraints: new smallest MTU = %u\n", > + mtu); > + } > + } > + > + /* adjust Rate limit according to QoS constraints */ > + if (p_qos_level->rate_limit_set > + && (rate > p_qos_level->rate_limit)) { > + rate = p_qos_level->rate_limit; > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "QoS constaraints: new smallest Rate = %u\n", > + rate); > + } > + } > + > + /* adjust Packet Lifetime according to QoS constraints */ > + if (p_qos_level->pkt_life_set > + && (pkt_life > p_qos_level->pkt_life)) { > + pkt_life = p_qos_level->pkt_life; > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "QoS constaraints: new smallest Packet Lifetime = %u\n", > + pkt_life); > + } > + } > + > + /* adjust SL according to QoS constraints */ > + if (p_qos_level->sl_set) { > + if (!valid_sls[p_qos_level->sl]) { > + status = IB_NOT_FOUND; > + goto Exit; > + } > + sl = p_qos_level->sl; > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "QoS constaraints: new SL = %u\n", > + sl); > + } > + } Please drop all osm_log(..OSM_LOG_DEBUG..) in this block - not each single line should be logged. 
If you think that those parameters may be useful for debugging put final values in single osm_log() somewhere at end of PR generator. > + } > + > + /* > + * Set packet lifetime. > + * According to spec definition IBA 1.2 Table 205 > + * PacketLifeTime description, for loopback paths, > + * packetLifeTime shall be zero. > + */ > + if (p_src_port == p_dest_port) > + pkt_life = 0; > + else > + if ( !(p_qos_level && p_qos_level->pkt_life_set) ) > + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > + > + > /* > - Determine if these values meet the user criteria > - and adjust appropriately > + * Done adjusting parameters according to QoS constraints. > + * Determine if these values meet the user criteria and > + * adjust appropriately. > */ > > /* we silently ignore cases where only the MTU selector is defined */ > @@ -511,6 +689,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > break; > } > } > + if (status != IB_SUCCESS) > + goto Exit; > > /* we silently ignore cases where only the Rate selector is defined */ > if ((comp_mask & IB_PR_COMPMASK_RATESELEC) && > @@ -551,14 +731,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > break; > } > } > - > - /* Verify the pkt_life_time */ > - /* According to spec definition IBA 1.2 Table 205 PacketLifeTime description, > - for loopback paths, packetLifeTime shall be zero. 
*/ > - if (p_src_port == p_dest_port) > - pkt_life = 0; /* loopback */ > - else > - pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > + if (status != IB_SUCCESS) > + goto Exit; > > /* we silently ignore cases where only the PktLife selector is defined */ > if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) && > @@ -603,38 +777,68 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > if (status != IB_SUCCESS) > goto Exit; > > - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) > - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); > - else if (comp_mask & IB_PR_COMPMASK_PKEY) { > - pkey = p_pr->pkey; > - if (!osm_physp_share_this_pkey(p_physp, p_dest_physp, pkey)) { > - osm_log(p_rcv->p_log, OSM_LOG_ERROR, > - "__osm_pr_rcv_get_path_parms: ERR 1F1A: " > - "Ports do not share specified PKey 0x%04x\n", > - cl_ntoh16(pkey)); > - status = IB_NOT_FOUND; > - goto Exit; > - } > - } else { > - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); > - if (!pkey) { > - osm_log(p_rcv->p_log, OSM_LOG_ERROR, > - "__osm_pr_rcv_get_path_parms: ERR 1F1B: " > - "Ports do not have any shared PKeys\n"); > - status = IB_NOT_FOUND; > - goto Exit; > + /* > + * set Pkey for this path record request > + */ > + > + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && > + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) No extra () was needed - this generates confused diff lines. > + pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp); > + else { > + if (comp_mask & IB_PR_COMPMASK_PKEY) { > + /* > + * PR request has a specific pkey: > + * Check that source and destination share this pkey. > + * If QoS level has pkeys, check that this pkey exists > + * in the QoS level pkeys. > + * PR returned pkey is the requested pkey. 
> + */ > + pkey = p_pr->pkey; > + if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F1A: " > + "Ports do not share specified PKey 0x%04x\n", > + cl_ntoh16(pkey)); > + status = IB_NOT_FOUND; > + goto Exit; > + } > + if (p_qos_level && p_qos_level->pkey_range_len && > + !osm_qos_level_has_pkey(p_qos_level, pkey)) { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F1D: " > + "Ports do not share PKeys defined by QoS level\n"); > + status = IB_NOT_FOUND; > + goto Exit; > + } > + } else { > + /* PR request doesn't have a specific pkey */ > + > + if (p_qos_level && p_qos_level->pkey_range_len) { > + /* If QoS level has pkeys, get shared pkey from QoS level pkeys */ > + pkey = osm_qos_level_get_shared_pkey(p_qos_level, > + p_src_physp, > + p_dest_physp); > + if (!pkey) { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F1E: " > + "Ports do not share PKeys defined by QoS level\n"); > + status = IB_NOT_FOUND; > + goto Exit; > + } > + } else { > + pkey = osm_physp_find_common_pkey(p_src_physp, > + p_dest_physp); > + if (!pkey) { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F1B: " > + "Ports do not have any shared PKeys\n"); > + status = IB_NOT_FOUND; > + goto Exit; > + } > + } > } > } Please arrange the code above as: if () ... else if () ... else ..., and please try not to exceed 80 chars per line.
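As a rough illustration of the flat arrangement asked for above, the four pkey cases could be laid out as one chain. This is only a sketch under stated assumptions: `struct pkey_req`, its fields and `select_pkey()` are hypothetical and replace the real OpenSM lookups with precomputed values, so that the control flow can be read (and compiled) on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of the inputs; not an OpenSM structure. */
struct pkey_req {
	int raw_traffic;	/* raw-traffic bit set in the request */
	int has_pkey;		/* request names a specific pkey */
	uint16_t pkey;		/* the requested pkey, if has_pkey */
	int ports_share_pkey;	/* src and dest both hold the requested pkey */
	int qos_has_pkeys;	/* matched QoS level restricts pkeys */
	int qos_allows_pkey;	/* requested pkey is in the QoS level */
	uint16_t qos_shared;	/* shared pkey from QoS level, 0 if none */
	uint16_t common;	/* any common pkey of the ports, 0 if none */
};

/* Returns 0 and sets *pkey on success, -1 for the "not found" cases. */
static int select_pkey(const struct pkey_req *r, uint16_t *pkey)
{
	assert(r && pkey);

	if (r->raw_traffic) {
		*pkey = r->common;
	} else if (r->has_pkey) {
		*pkey = r->pkey;
		if (!r->ports_share_pkey)
			return -1;	/* corresponds to ERR 1F1A */
		if (r->qos_has_pkeys && !r->qos_allows_pkey)
			return -1;	/* corresponds to ERR 1F1D */
	} else if (r->qos_has_pkeys) {
		*pkey = r->qos_shared;
		if (!*pkey)
			return -1;	/* corresponds to ERR 1F1E */
	} else {
		*pkey = r->common;
		if (!*pkey)
			return -1;	/* corresponds to ERR 1F1B */
	}
	return 0;
}
```

Each branch sits one indentation level shallower than in the nested version, which also makes it easier to stay within 80 columns.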
> > - if (p_rcv->p_subn->opt.routing_engine_name && > - strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) > - /* slid and dest_lid are stored in network in lash */ > - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, > - p_dest_port); > - else > - sl = OSM_DEFAULT_SL; > - > if (pkey) { > p_prtn = > (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, > @@ -642,34 +846,87 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > 0x8000)); > if (p_prtn == > (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) > + p_prtn = NULL; > + } > + > + /* > + * Set PathRecord SL. > + * > + * ToDo: What about QoS and LASH routing? How can they coexist? > + * And what happens when there's a pkey, hence there is a > + * partition with a certain SL, and this SL doesn't match > + * the one that's defined by LASH? > + */ > + > + if (comp_mask & IB_PR_COMPMASK_SL) { > + /* > + * Specific SL was requested > + */ > + sl = ib_path_rec_sl(p_pr); > + if (p_qos_level && p_qos_level->sl_set && (p_qos_level->sl != sl)) { > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "QoS constaraints: required PR SL (%u) doesn't match QoS SL (%u)\n", > + sl, p_qos_level->sl); > + } > + status = IB_NOT_FOUND; > + goto Exit; > + } > + } else if (p_qos_level && p_qos_level->sl_set) { > + /* > + * No specific SL was requested, > + * but there is an SL in QoS level > + */ > + sl = p_qos_level->sl; > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + if (pkey && p_prtn && p_prtn->sl != p_qos_level->sl) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "QoS level SL (%u) overrides partition SL (%u)\n", > + p_qos_level->sl, p_prtn->sl); > + } > + } > + } else if (pkey) { > + /* > + * No specific SL in request or in QoS level - use partition SL > + */ > + if (!p_prtn) { > /* this may be possible when pkey tables are created somehow in > previous runs 
or things are going wrong here */ > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_get_path_parms: ERR 1F1C: " > "No partition found for PKey 0x%04x - using default SL %d\n", > cl_ntoh16(pkey), sl); > - else { > - if (p_rcv->p_subn->opt.routing_engine_name && > - strcmp(p_rcv->p_subn->opt.routing_engine_name, > - "lash") == 0) > - /* slid and dest_lid are stored in network in lash */ > - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, > - p_src_port, p_dest_port); > - else > - sl = p_prtn->sl; > - } > - > - /* reset pkey when raw traffic */ > - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) > - pkey = 0; > + } else > + sl = p_prtn->sl; > + } else if (p_rcv->p_subn->opt.routing_engine_name && > + strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) { > + /* slid and dest_lid are stored in network in lash */ > + sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, > + p_src_port, p_dest_port); > + } else if (!p_rcv->p_subn->opt.no_qos) { > + sl = first_valid_sl; > } > + else > + sl = OSM_DEFAULT_SL; > > - if ((comp_mask & IB_PR_COMPMASK_SL) && ib_path_rec_sl(p_pr) != sl) { > + if (!p_rcv->p_subn->opt.no_qos && !valid_sls[sl]) { > + /* selected SL will eventually lead to VL15 */ > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "Selected SL (%u) leads to VL15\n", p_prtn->sl); > + } > status = IB_NOT_FOUND; > goto Exit; > } > > + /* reset pkey when raw traffic */ > + if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > + cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) > + pkey = 0; > + > p_parms->mtu = mtu; > p_parms->rate = rate; > p_parms->pkt_life = pkt_life; > -- > 1.5.1.4 > From tziporet at dev.mellanox.co.il Mon Sep 3 12:27:55 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 03 Sep 2007 22:27:55 +0300 Subject: [ofa-general] kdapl build error on ppc64 In-Reply-To: <10257140.1188829887714.JavaMail.root@wombat.diezmil.com> 
References: <10257140.1188829887714.JavaMail.root@wombat.diezmil.com> Message-ID: <46DC603B.7010901@mellanox.co.il> snagai at jp.ibm.com wrote: > I have trouble with building kdapl module on ppc64 machine. I saw a lot of error messages during building it using any version of dapl source code downloaded from sourceforge (http://sourceforge.net/projects/dapl). > I tried to build the module on both RHEL5 and Fedora Core6 using kernel 2.6.20, gcc 4.1.1. > > Did anyone succeed to build kdapl module on ppc64 machine ? If anyone has an idea about this issue, please advise me. > Note that kdapl is not being supported in OFA any more. You can use the kernel verbs + CMA to any RDMA/RC transport you need. Tziporet From jackm at dev.mellanox.co.il Tue Sep 4 00:37:13 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 4 Sep 2007 10:37:13 +0300 Subject: [ofa-general] [PATCH 1 of 2] libmlx4: Handle new FW requirement for send request prefetching, for WQE sg lists Message-ID: <200709041037.13996.jackm@dev.mellanox.co.il> This is an addendum to Roland's commit 561da8d10e419ffb333fe6faf05004d9a3670e7a (June 13). This addendum adds prefetch headroom marking processing for s/g segments. We write s/g segments in reverse order into the WQE, in order to guarantee that the first dword of all cachelines containing s/g segments is written last (overwriting the headroom invalidation pattern). The entire cacheline will thus contain valid data when the invalidation pattern is overwritten. 
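The ordering described above can be sketched in isolation as follows. This is a simplified illustration and not the libmlx4 code: `struct data_seg` and `struct sge` are hypothetical stand-ins, byte-order conversion is left out, the stamp value is an assumption, and a C11 release fence stands in for wmb(), which is not necessarily equivalent for memory read by a device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define STAMP 0xffffffffu	/* assumption: invalidation pattern in first dword */

/* Simplified 16-byte data segment; byte_count is its first dword. */
struct data_seg {
	uint32_t byte_count;
	uint32_t lkey;
	uint64_t addr;
};

/* Hypothetical stand-in for a scatter/gather entry. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Mark every segment as invalid, as the headroom stamping would. */
static void stamp_segs(struct data_seg *seg, int n)
{
	int i;

	for (i = 0; i < n; ++i)
		seg[i].byte_count = STAMP;
}

/*
 * Fill segments last to first. Within each segment the stamped dword
 * (byte_count) is written only after lkey and addr, with a fence in
 * between, so a reader that sees a valid byte_count also sees the rest.
 */
static void write_data_segs(struct data_seg *seg, const struct sge *sg,
			    int num_sge)
{
	int i;

	assert(seg && sg && num_sge >= 0);
	for (i = num_sge - 1; i >= 0; --i) {
		seg[i].lkey = sg[i].lkey;
		seg[i].addr = sg[i].addr;
		atomic_thread_fence(memory_order_release); /* stand-in for wmb() */
		seg[i].byte_count = sg[i].length;
	}
}
```

Because the loop runs from the last entry down to the first, the segment that begins a cacheline is the last one written within that cacheline.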
Signed-off-by: Jack Morgenstein Index: libmlx4/src/qp.c =================================================================== --- libmlx4.orig/src/qp.c 2007-09-04 10:03:38.264742000 +0300 +++ libmlx4/src/qp.c 2007-09-04 10:04:35.536784000 +0300 @@ -312,10 +312,19 @@ int mlx4_post_send(struct ibv_qp *ibqp, } else { struct mlx4_wqe_data_seg *seg = wqe; - for (i = 0; i < wr->num_sge; ++i) { - seg[i].byte_count = htonl(wr->sg_list[i].length); + /* + * Write the s/g entries in reverse order, so that the + * first dword of all cachelines is written last. + */ + for (i = wr->num_sge - 1; i >= 0; --i) { seg[i].lkey = htonl(wr->sg_list[i].lkey); seg[i].addr = htonll(wr->sg_list[i].addr); + /* + * This entry may start a new cacheline. + * See barrier comment above. + */ + wmb(); + seg[i].byte_count = htonl(wr->sg_list[i].length); } size += wr->num_sge * (sizeof *seg / 16); From dotanb at dev.mellanox.co.il Tue Sep 4 00:34:22 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 4 Sep 2007 10:34:22 +0300 Subject: [ofa-general] [PATCH] libibumad 1/2: add valgrind support to auto-tools configuration file Message-ID: <200709041034.22516.dotanb@dev.mellanox.co.il> Added valgrind support to the auto-tools configuration file. 
Signed-off-by: Dotan Barak --- Index: connectx_user/src/userspace/management/libibumad/configure.in =================================================================== --- connectx_user.orig/src/userspace/management/libibumad/configure.in 2007-09-02 08:01:42.000000000 +0300 +++ connectx_user/src/userspace/management/libibumad/configure.in 2007-09-04 10:24:39.000000000 +0300 @@ -20,6 +20,19 @@ AC_ARG_ENABLE(libcheck, [ --disable-lib fi ]) +AC_ARG_WITH([valgrind], + AC_HELP_STRING([--with-valgrind], + [Enable Valgrind annotations (small runtime overhead, default NO)])) +if test x$with_valgrind = x || test x$with_valgrind = xno; then + want_valgrind=no + AC_DEFINE([NVALGRIND], 1, [Define to 1 to disable Valgrind annotations.]) +else + want_valgrind=yes + if test -d $with_valgrind; then + CPPFLAGS="$CPPFLAGS -I$with_valgrind/include" + fi +fi + dnl Checks for programs AC_PROG_CXX AC_PROG_CC @@ -55,6 +68,13 @@ AC_CHECK_FUNCS([memset]) dnl Checks for typedefs, structures, and compiler characteristics. AC_C_INLINE +AC_CHECK_HEADER(valgrind/memcheck.h, + [AC_DEFINE(HAVE_VALGRIND_MEMCHECK_H, 1, + [Define to 1 if you have the header file.])], + [if test $want_valgrind = yes; then + AC_MSG_ERROR([Valgrind memcheck support requested, but not found.]) + fi]) + AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then ac_cv_version_script=yes From dotanb at dev.mellanox.co.il Tue Sep 4 00:35:42 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 4 Sep 2007 10:35:42 +0300 Subject: [ofa-general] [PATCH] libibumad 2/2: add valgrind support to the umad code Message-ID: <200709041035.42404.dotanb@dev.mellanox.co.il> Added valgrind support to the umad code and marked buffers that were filled by the kernel level with the macro VALGRIND_MAKE_MEM_DEFINED. 
Signed-off-by: Dotan Barak --- Index: connectx_user/src/userspace/management/libibumad/src/umad.c =================================================================== --- connectx_user.orig/src/userspace/management/libibumad/src/umad.c 2007-09-02 08:01:42.000000000 +0300 +++ connectx_user/src/userspace/management/libibumad/src/umad.c 2007-09-04 10:25:56.000000000 +0300 @@ -51,6 +51,20 @@ #define IB_OPENIB_OUI (0x001405) +#ifdef HAVE_VALGRIND_MEMCHECK_H + +# include + +# ifndef VALGRIND_MAKE_MEM_DEFINED +# warning "Valgrind support requested, but VALGRIND_MAKE_MEM_DEFINED not available" +# endif + +#endif /* HAVE_VALGRIND_MEMCHECK_H */ + +#ifndef VALGRIND_MAKE_MEM_DEFINED +# define VALGRIND_MAKE_MEM_DEFINED(addr,len) +#endif + typedef struct ib_user_mad_reg_req { uint32_t id; uint32_t method_mask[4]; @@ -926,6 +940,8 @@ umad_register(int portid, int mgmt_class memcpy(&req.oui, (char *)&oui + 1, sizeof req.oui); + VALGRIND_MAKE_MEM_DEFINED(&req, sizeof req); + if (!ioctl(port->dev_fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { DEBUG("portid %d registered to use agent %d qp %d", portid, req.id, qp); From jackm at dev.mellanox.co.il Tue Sep 4 00:47:31 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 4 Sep 2007 10:47:31 +0300 Subject: [ofa-general] [PATCH 2 of 2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists Message-ID: <200709041047.32062.jackm@dev.mellanox.co.il> This is an addendum to Roland's commit 0e6e74162164d908edf7889ac66dca09e7505745 (June 18). This addendum adds prefetch headroom marking processing for s/g segments. We write s/g segments in reverse order into the WQE, in order to guarantee that the first dword of all cachelines containing s/g segments is written last (overwriting the headroom invalidation pattern). The entire cacheline will thus contain valid data when the invalidation pattern is overwritten. 
Signed-off-by: Jack Morgenstein Index: ofed_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-07-30 16:35:01.000000000 +0300 +++ ofed_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-07-30 17:05:47.000000000 +0300 @@ -1215,9 +1215,18 @@ static void set_datagram_seg(struct mlx4 static void set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ib_sge *sg) { - dseg->byte_count = cpu_to_be32(sg->length); dseg->lkey = cpu_to_be32(sg->lkey); dseg->addr = cpu_to_be64(sg->addr); + + /* Need a barrier before writing the byte_count field + * to make sure that all the data is visible before the + * byte_count field is set. Otherwise, if the segment + * begins a new cacheline, the HCA prefetcher could + * grab the 64-byte chunk and get a valid (!= * 0xffffffff) + * byte count but stale data, and end up sending the wrong + * data. */ + wmb(); + dseg->byte_count = cpu_to_be32(sg->length); } int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, @@ -1226,6 +1235,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp struct mlx4_ib_qp *qp = to_mqp(ibqp); void *wqe; struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_data_seg *seg; unsigned long flags; int nreq; int err = 0; @@ -1325,19 +1335,22 @@ int mlx4_ib_post_send(struct ib_qp *ibqp break; } - for (i = 0; i < wr->num_sge; ++i) { - set_data_seg(wqe, wr->sg_list + i); - - wqe += sizeof (struct mlx4_wqe_data_seg); + seg = (struct mlx4_wqe_data_seg *) wqe; + /* Add one more inline data segment for ICRC for MLX sends. + * Write this inline and all s/g segments in reverse order, + * so as to overwrite cacheline stamp last within each + * cacheline. 
*/ + if (qp->ibqp.qp_type == IB_QPT_SMI || qp->ibqp.qp_type == IB_QPT_GSI) { + void *t = wqe + (wr->num_sge) * sizeof(struct mlx4_wqe_data_seg); + ((u32 *) t)[1] = 0; + wmb(); + ((struct mlx4_wqe_inline_seg *) t)->byte_count = + cpu_to_be32((1 << 31) | 4); size += sizeof (struct mlx4_wqe_data_seg) / 16; } - /* Add one more inline data segment for ICRC for MLX sends */ - if (qp->ibqp.qp_type == IB_QPT_SMI || qp->ibqp.qp_type == IB_QPT_GSI) { - ((struct mlx4_wqe_inline_seg *) wqe)->byte_count = - cpu_to_be32((1 << 31) | 4); - ((u32 *) wqe)[1] = 0; - wqe += sizeof (struct mlx4_wqe_data_seg); + for (i = wr->num_sge - 1; i >= 0; --i) { + set_data_seg(seg + i, wr->sg_list + i); size += sizeof (struct mlx4_wqe_data_seg) / 16; } From mst at dev.mellanox.co.il Tue Sep 4 02:11:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Sep 2007 12:11:33 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070830130852.GF2532@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> Message-ID: <20070904091133.GA23437@mellanox.co.il> Add module option hw_csum: when set, IPoIB will report S/G support, and rely on hardware end-to-end transport checksum (ICRC) instead of software-level protocol checksums. Since this will not inter-operate with older IPoIB modules, this option is off by default. Signed-off-by: Michael S. Tsirkin --- Updates since v1: fixed thinko in setting header flags. When applied on top of previously posted mlx4 patches, and with hw_csum enabled, this patch speeds up single-stream netperf bandwidth on connectx DDR from 1000 to 1250 MBytes/sec.
I know some people find this approach controversial, but from my perspective, this is not worse than e.g. SDP which does not have SW checksums pretty much by design. Hopefully the option being off by default is enough to pacify the critics :). diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 285c143..f597afe 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -104,9 +104,11 @@ enum { /* structs */ +#define IPOIB_HEADER_F_HWCSUM 0x1 + struct ipoib_header { __be16 proto; - u16 reserved; + __be16 flags; }; struct ipoib_pseudoheader { @@ -122,9 +124,52 @@ struct ipoib_rx_buf { struct ipoib_tx_buf { struct sk_buff *skb; - u64 mapping; + u64 mapping[MAX_SKB_FRAGS + 1]; }; +static inline int ipoib_dma_map_tx(struct ib_device *ca, struct ipoib_tx_buf *tx_req) +{ + struct sk_buff *skb = tx_req->skb; + u64 *mapping = tx_req->mapping; + int i, frags; + + mapping[0] = ib_dma_map_single(ca, skb->data, skb_headlen(skb), DMA_TO_DEVICE); + if (unlikely(ib_dma_mapping_error(ca, mapping[0]))) + return -EIO; + + frags = skb_shinfo(skb)->nr_frags; + for (i = 0; i < frags; ++i) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + mapping[i + 1] = ib_dma_map_page(ca, frag->page, frag->page_offset, + frag->size, DMA_TO_DEVICE); + if (unlikely(ib_dma_mapping_error(ca, mapping[i + 1]))) + goto partial_error; + } + return 0; + +partial_error: + ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + + for (; i > 0; --i) + ib_dma_unmap_page(ca, mapping[i], PAGE_SIZE, DMA_TO_DEVICE); + return -EIO; +} + +static inline void ipoib_dma_unmap_tx(struct ib_device *ca, struct ipoib_tx_buf *tx_req) +{ + struct sk_buff *skb = tx_req->skb; + u64 *mapping = tx_req->mapping; + int i, frags; + + ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + + frags = skb_shinfo(skb)->nr_frags; + for (i = 0; i < frags; ++i) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + 
ib_dma_unmap_page(ca, mapping[i + 1], frag->size, DMA_TO_DEVICE); + } +} + struct ib_cm_id; struct ipoib_cm_data { @@ -269,7 +314,7 @@ struct ipoib_dev_priv { struct ipoib_tx_buf *tx_ring; unsigned tx_head; unsigned tx_tail; - struct ib_sge tx_sge; + struct ib_sge tx_sge[MAX_SKB_FRAGS + 1]; struct ib_send_wr tx_wr; struct ib_wc ibwc[IPOIB_NUM_WC]; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 08b4676..a308e92 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -407,6 +407,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; int frags; + struct ipoib_header *header; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); @@ -469,7 +470,10 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); - skb->protocol = ((struct ipoib_header *) skb->data)->proto; + header = (struct ipoib_header *)skb->data; + skb->protocol = header->proto; + if (header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb->ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); @@ -491,14 +495,21 @@ repost: static inline int post_send(struct ipoib_dev_priv *priv, struct ipoib_cm_tx *tx, unsigned int wr_id, - u64 addr, int len) + u64 *mapping, int headlen, + skb_frag_t *frags, + int nr_frags) { struct ib_send_wr *bad_wr; + int i; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; - - priv->tx_wr.wr_id = wr_id; + priv->tx_sge[0].addr = mapping[0]; + priv->tx_sge[0].length = headlen; + for (i = 0; i < nr_frags; ++i) { + priv->tx_sge[i + 1].addr = mapping[i + 1]; + priv->tx_sge[i + 1].length = frags[i].size; + } + priv->tx_wr.num_sge = nr_frags + 1; + priv->tx_wr.wr_id = wr_id; return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr); } @@ -507,7 +518,6 @@ void ipoib_cm_send(struct 
net_device *dev, struct sk_buff *skb, struct ipoib_cm_ { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_tx_buf *tx_req; - u64 addr; if (unlikely(skb->len > tx->mtu)) { ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", @@ -530,20 +540,19 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ */ tx_req = &tx->tx_ring[tx->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; - addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE); - if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { + if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); return; } - tx_req->mapping = addr; - if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), - addr, skb->len))) { + tx_req->mapping, skb_headlen(skb), + skb_shinfo(skb)->frags, + skb_shinfo(skb)->nr_frags))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; - ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(skb); } else { dev->trans_start = jiffies; @@ -577,7 +586,7 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx tx_req = &tx->tx_ring[wr_id]; - ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); /* FIXME: is this right? Shouldn't we only increment on success? */ ++priv->stats.tx_packets; @@ -814,7 +823,7 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ib_cq attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; - attr.cap.max_send_sge = 1; + attr.cap.max_send_sge = dev->features & NETIF_F_SG ? 
MAX_SKB_FRAGS + 1 : 0; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -981,8 +990,7 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) if (p->tx_ring) { while ((int) p->tx_tail - (int) p->tx_head < 0) { tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; - ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, - DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(tx_req->skb); ++p->tx_tail; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 1094488..59b1735 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -170,6 +170,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV; struct sk_buff *skb; + struct ipoib_header *header; u64 addr; ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n", @@ -220,7 +221,10 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put(skb, wc->byte_len); skb_pull(skb, IB_GRH_BYTES); - skb->protocol = ((struct ipoib_header *) skb->data)->proto; + header = (struct ipoib_header *)skb->data; + skb->protocol = header->proto; + if (header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb->ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); @@ -257,8 +261,7 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) tx_req = &priv->tx_ring[wr_id]; - ib_dma_unmap_single(priv->ca, tx_req->mapping, - tx_req->skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); ++priv->stats.tx_packets; priv->stats.tx_bytes += tx_req->skb->len; @@ -343,16 +346,23 @@ void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) static inline int post_send(struct ipoib_dev_priv *priv, unsigned int wr_id, struct ib_ah *address, u32 qpn, - u64 addr, int len) + 
u64 *mapping, int headlen, + skb_frag_t *frags, + int nr_frags) { struct ib_send_wr *bad_wr; + int i; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; - - priv->tx_wr.wr_id = wr_id; - priv->tx_wr.wr.ud.remote_qpn = qpn; - priv->tx_wr.wr.ud.ah = address; + priv->tx_sge[0].addr = mapping[0]; + priv->tx_sge[0].length = headlen; + for (i = 0; i < nr_frags; ++i) { + priv->tx_sge[i + 1].addr = mapping[i + 1]; + priv->tx_sge[i + 1].length = frags[i].size; + } + priv->tx_wr.num_sge = nr_frags + 1; + priv->tx_wr.wr_id = wr_id; + priv->tx_wr.wr.ud.remote_qpn = qpn; + priv->tx_wr.wr.ud.ah = address; return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr); } @@ -362,7 +372,6 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_tx_buf *tx_req; - u64 addr; if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", @@ -385,20 +394,19 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, */ tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; - addr = ib_dma_map_single(priv->ca, skb->data, skb->len, - DMA_TO_DEVICE); - if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { + if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); return; } - tx_req->mapping = addr; if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), - address->ah, qpn, addr, skb->len))) { + address->ah, qpn, + tx_req->mapping, skb_headlen(skb), + skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; - ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(skb); } else { dev->trans_start = jiffies; @@ -604,10 +612,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush) while ((int) priv->tx_tail - (int) priv->tx_head < 0) { tx_req 
= &priv->tx_ring[priv->tx_tail & (ipoib_sendq_size - 1)]; - ib_dma_unmap_single(priv->ca, - tx_req->mapping, - tx_req->skb->len, - DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(tx_req->skb); ++priv->tx_tail; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 894b1dc..42efcbf 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -55,11 +55,14 @@ MODULE_LICENSE("Dual BSD/GPL"); int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE; int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE; +static int ipoib_hw_csum __read_mostly = 0; module_param_named(send_queue_size, ipoib_sendq_size, int, 0444); MODULE_PARM_DESC(send_queue_size, "Number of descriptors in send queue"); module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444); MODULE_PARM_DESC(recv_queue_size, "Number of descriptors in receive queue"); +module_param_named(hw_csum, ipoib_hw_csum, int, 0444); +MODULE_PARM_DESC(hw_csum, "Rely on hardware end-to-end checksum (ICRC) if > 0"); #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -782,7 +785,10 @@ static int ipoib_hard_header(struct sk_buff *skb, header = (struct ipoib_header *) skb_push(skb, sizeof *header); header->proto = htons(type); - header->reserved = 0; + if (skb->ip_summed == CHECKSUM_COMPLETE) + header->flags = 0; + else + header->flags = cpu_to_be16(IPOIB_HEADER_F_HWCSUM); /* * If we don't have a neighbour structure, stuff the @@ -964,6 +970,8 @@ static void ipoib_setup(struct net_device *dev) dev->type = ARPHRD_INFINIBAND; dev->tx_queue_len = ipoib_sendq_size * 2; dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + if (ipoib_hw_csum) + dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM; /* MTU will be reset when mcast join happens */ dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 
563aeac..1699269 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -149,14 +149,14 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) .cap = { .max_send_wr = ipoib_sendq_size, .max_recv_wr = ipoib_recvq_size, - .max_send_sge = 1, + .max_send_sge = dev->features & NETIF_F_SG ? MAX_SKB_FRAGS + 1 : 0, .max_recv_sge = 1 }, .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_UD }; - int ret, size; + int i, ret, size; priv->pd = ib_alloc_pd(priv->ca); if (IS_ERR(priv->pd)) { @@ -197,11 +197,11 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; - priv->tx_sge.lkey = priv->mr->lkey; + for (i = 0; i < MAX_SKB_FRAGS + 1; ++i) + priv->tx_sge[i].lkey = priv->mr->lkey; priv->tx_wr.opcode = IB_WR_SEND; - priv->tx_wr.sg_list = &priv->tx_sge; - priv->tx_wr.num_sge = 1; + priv->tx_wr.sg_list = priv->tx_sge; priv->tx_wr.send_flags = IB_SEND_SIGNALED; return 0; -- MST From rdreier at cisco.com Tue Sep 4 02:17:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 04 Sep 2007 02:17:19 -0700 Subject: [ofa-general] Re: [PATCH v2 for-2.6.24] IB/mthca: enable MSI-X by default In-Reply-To: <20070807131034.GB24064@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 7 Aug 2007 16:10:34 +0300") References: <20070807131034.GB24064@mellanox.co.il> Message-ID: I applied this and the mlx4 version. I tried to fix it up so it didn't print screwed up messages like NOP failed, aborting trying again without MSI-X trying again without MSI-X Please take a look at my for-2.6.24 branch if you get a chance to make sure I didn't screw up when I did that. - R. 
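A side note on the IPoIB S/G and HW checksum patch earlier in this digest: the receive path tests the new header flag with header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM), i.e. it converts the constant to wire order rather than byte-swapping the field. The following is only a userspace sketch of that test, not the driver code. The struct name, the helper names and the sample packets are hypothetical; only the flag value 0x1 and the two 16-bit big-endian fields come from the patch, and htons() stands in for cpu_to_be16().

```c
#include <arpa/inet.h>   /* htons(), userspace stand-in for cpu_to_be16() */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag value as defined in the patch. */
#define IPOIB_HEADER_F_HWCSUM 0x1

/* Hypothetical userspace mirror of struct ipoib_header (both fields big-endian). */
struct ipoib_header_sketch {
	uint16_t proto;
	uint16_t flags;   /* "reserved", always 0, before the patch */
};

/* Hypothetical helper: returns 1 if the sender set HWCSUM, 0 otherwise. */
static int sender_set_hwcsum(const unsigned char *wire)
{
	struct ipoib_header_sketch h;

	memcpy(&h, wire, sizeof h);   /* copy out to avoid unaligned access */
	/* Convert the constant once; the field stays in wire byte order. */
	return (h.flags & htons(IPOIB_HEADER_F_HWCSUM)) != 0;
}

/* Hypothetical self-check covering an unpatched and a patched sender. */
static void hwcsum_sketch_selftest(void)
{
	/* proto 0x0800, reserved = 0: what an unpatched sender puts on the wire. */
	const unsigned char old_peer[4] = { 0x08, 0x00, 0x00, 0x00 };
	/* Same proto, flags = 0x0001 in network byte order. */
	const unsigned char new_peer[4] = { 0x08, 0x00, 0x00, 0x01 };

	assert(sender_set_hwcsum(old_peer) == 0);
	assert(sender_set_hwcsum(new_peer) == 1);
}
```

An unpatched sender always writes zero into this field, so a patched receiver treats its packets as before. The reverse direction is presumably where the interoperability concern comes from: an unpatched receiver ignores the flag and still expects a valid software checksum.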
From vlad at lists.openfabrics.org Tue Sep 4 02:51:31 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 4 Sep 2007 02:51:31 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070904-0200 daily build status Message-ID: <20070904095131.322E4E60863@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with 
linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070904-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From mst at dev.mellanox.co.il Tue Sep 4 04:32:24 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Sep 2007 14:32:24 +0300 Subject: [ofa-general] [PATCH] IB/mlx4: whitespace tweak Message-ID: <20070904113224.GA27306@mellanox.co.il> Signed-off-by: Michael S. 
Tsirkin --- diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index fd88d99..e7a7199 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1081,7 +1081,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, qp->sq.head = 0; qp->sq.tail = 0; if (!ibqp->srq) - *qp->db.db = 0; + *qp->db.db = 0; } out: -- MST From ogerlitz at voltaire.com Tue Sep 4 04:50:07 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 04 Sep 2007 14:50:07 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904091133.GA23437@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> Message-ID: <46DD466F.8020607@voltaire.com> Michael S. Tsirkin wrote: > Add module option hw_csum: when set, IPoIB will report S/G > support, and rely on hardware end-to-end transport checksum (ICRC) > instead of software-level protocol checksums. > Since this will not inter-operate with older IPoIB modules, > this option is off by default. Hi Michael, Looking at slide 18 of Dror's Sonoma presentation (*), which states: > Checksum Offload > TCP/UDP/IP Checksum Offloading - Query device for checksum offload support > QP Creation - Mark QP for IPoIB checksum support > TX - ibv_send_flags indicate checksum offload request > RX - ibv_wc_flags indicate checksum status (good, bad, unverified) I don't see that there is such a dependency, nor can I understand the design that creates it, unless you rely on the IB CRC and do not compute the actual TCP/UDP/IP csum. Can you clarify the exact limitation that prevents inter-operation? Why does the receiving side care whether the checksum at the sender was computed by the SW or the HW (and vice versa)? > I know some people find this approach controversial, > but from my perspective, this is not worse than e.g. > SDP which does not have SW checksums pretty much by design. Is this because of the non-interoperability? 
SDP is not interoperable by design, whereas the IP/TCP stack over IPoIB MUST support interoperability. A TCP checksum offload that is not interoperable is not very useful, I think. Or. (*) see http://openfabrics.org/archives/spring2007sonoma/Tuesday%20May%201/gdror%20Next%20Generation%20Hardware%20Assists%20And%20Scalability2.pdf From kliteyn at dev.mellanox.co.il Tue Sep 4 05:49:36 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 04 Sep 2007 15:49:36 +0300 Subject: [ofa-general] [PATCH] osm: QoS - fixing ServiceID and PKey bug in match rules Message-ID: <46DD5460.9050405@dev.mellanox.co.il> Hi Sasha. Small patch that fixes ServiceID and PKey bug in QoS policy match rules. Signed-off-by: Yevgeny Kliteynik --- opensm/include/iba/ib_types.h | 3 ++- opensm/opensm/osm_qos_policy.c | 7 ++++--- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 4ab4145..0a096f9 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -2379,7 +2379,8 @@ typedef struct _ib_path_rec { *********/ /* Path Record Component Masks */ -#define IB_PR_COMPMASK_SERVICEID (CL_HTON64(((uint64_t)1)<<1)) +#define IB_PR_COMPMASK_SERVICEID_MSB (CL_HTON64(((uint64_t)1)<<0)) +#define IB_PR_COMPMASK_SERVICEID_LSB (CL_HTON64(((uint64_t)1)<<1)) #define IB_PR_COMPMASK_DGID (CL_HTON64(((uint64_t)1)<<2)) #define IB_PR_COMPMASK_SGID (CL_HTON64(((uint64_t)1)<<3)) #define IB_PR_COMPMASK_DLID (CL_HTON64(((uint64_t)1)<<4)) diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index a5a8856..059a861 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -640,7 +640,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( to have a matching Service ID to match the rule */ if (p_qos_match_rule->service_id_range_len) { - if (!(comp_mask & IB_PR_COMPMASK_SERVICEID)) { + if (!(comp_mask & IB_PR_COMPMASK_SERVICEID_MSB) || + !(comp_mask & 
IB_PR_COMPMASK_SERVICEID_LSB)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -648,7 +649,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( if (!__is_num_in_range_arr (p_qos_match_rule->service_id_range_arr, p_qos_match_rule->service_id_range_len, - p_pr->service_id)) { + cl_ntoh64(p_pr->service_id))) { list_iterator = cl_list_next(list_iterator); continue; } @@ -667,7 +668,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( if (!__is_num_in_range_arr (p_qos_match_rule->pkey_range_arr, p_qos_match_rule->pkey_range_len, - ib_path_rec_qos_class(p_pr))) { + cl_ntoh16(p_pr->pkey))) { list_iterator = cl_list_next(list_iterator); continue; } -- 1.5.1.4 From narravul at cse.ohio-state.edu Tue Sep 4 07:29:53 2007 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Tue, 4 Sep 2007 10:29:53 -0400 (EDT) Subject: [ofa-general] Re: [mvapich-discuss] [PATCH] ofed-1.2.5/mvapich2 bug fixes In-Reply-To: <1188576958.14461.137.camel@sale659> Message-ID: Hi Jim, Thank you for sending us these patches for mvapich2-0.9.8. We will make these fixes available for mvapich2-0.9.8 soon. Recently we have released mvapich2-1.0 beta2. These fixes are already present in the mvapich2-1.0 codebase. You are welcome to try this version. It can be downloaded from our web-page: http://mvapich.cse.ohio-state.edu/download/mvapich2/ Regards, --Sundeep. On Fri, 31 Aug 2007, Jim Schutt wrote: > Hi, > > I've been working with mvapich2 from the OFED-1.2.5 release (I used > http://www.openfabrics.org/downloads/OFED/ofed-1.2.5/OFED-1.2.5.tgz). > > I've found a couple bugs in that version of mvapich2, which seem to > also be present in the upstream SVN at > https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/branches/0.9.8 > as of revision 1480. > > The first is an install tool bug: DESTDIR gets prepended twice in > some cases, once when calling FixInstallFile and once inside it. 
> > The second is a memory-scribbling bug: in the call-chain > rdma_cm_get_hostnames()->PMI_KVS_Get()->PMIU_getval() > PMIU_getval() overwrites byte PMI_vallen_max - 1 (in my test > case, PMI_vallen_max had value 2048) for a character array that > has length 16. > > I'm not sure these are the right fixes, but with these patches > applied, mvapich2 from ofed-1.2.5 installs correctly and runs tests > it wouldn't run without them. > > -- Jim > > -- > Jim Schutt > Sandia National Laboratories, Albuquerque, New Mexico USA > > diff -urN mvapich2-0.9.8.orig/src/mpe2/sbin/mpeinstall.in mvapich2-0.9.8/src/mpe2/sbin/mpeinstall.in > --- mvapich2-0.9.8.orig/src/mpe2/sbin/mpeinstall.in 2006-04-09 11:57:00.000000000 -0600 > +++ mvapich2-0.9.8/src/mpe2/sbin/mpeinstall.in 2007-08-31 10:02:10.000000000 -0600 > @@ -442,10 +442,10 @@ > echo "Copying MPE utility programs to $DESTDIR$bindir" > CopyDirRecurP $binbuild_dir $bindir $XMODE > if [ -s $binbuild_dir/mpecc -a -x $binbuild_dir/mpecc ] ; then > - FixInstallFile $binbuild_dir/mpecc $DESTDIR$bindir/mpecc $XMODE > + FixInstallFile $binbuild_dir/mpecc $bindir/mpecc $XMODE > fi > if [ -s $binbuild_dir/mpefc -a -x $binbuild_dir/mpefc ] ; then > - FixInstallFile $binbuild_dir/mpefc $DESTDIR$bindir/mpefc $XMODE > + FixInstallFile $binbuild_dir/mpefc $bindir/mpefc $XMODE > fi > fi > fi > @@ -457,7 +457,7 @@ > CopyDirRecurP $etcbuild_dir $sysconfdir $MODE > cd $etcbuild_dir && \ > for file in *.conf ; do \ > - FixInstallFile $file $DESTDIR$sysconfdir/$file ; \ > + FixInstallFile $file $sysconfdir/$file ; \ > done > fi > fi > diff -urN mvapich2-0.9.8.orig/src/pmi/simple/simple_pmi.c mvapich2-0.9.8/src/pmi/simple/simple_pmi.c > --- mvapich2-0.9.8.orig/src/pmi/simple/simple_pmi.c 2006-04-09 11:57:00.000000000 -0600 > +++ mvapich2-0.9.8/src/pmi/simple/simple_pmi.c 2007-08-31 10:02:26.000000000 -0600 > @@ -566,7 +566,7 @@ > PMIU_getval( "rc", buf, PMIU_MAXLINE ); > rc = atoi( buf ); > if ( rc == 0 ) { > - PMIU_getval( "value", value, 
PMI_vallen_max ); > + PMIU_getval( "value", value, length ); > return( 0 ); > } > else { > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss at cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >

From ramachandra.kuchimanchi at qlogic.com Tue Sep 4 07:40:19 2007 From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra) Date: Tue, 4 Sep 2007 09:40:19 -0500 Subject: [ofa-general] Low NFS RDMA performance with Connect X Message-ID:

Hi,

I took the NFS RDMA code from the Mellanox NFS RDMA SDK, compiled it with OFED-1.2.5 and tried it out with Connect X HCAs and also MT25208. I found that the Iozone read and write performance numbers are very low on Connect X.

For a 128 MB file and a 128 KB record size:

NFS RDMA SDK on MT25208:                         Read: 861 MB/s   Write: 185 MB/s
OFED-1.2.5 with NFS RDMA modules on MT25208:     Read: 849 MB/s   Write: 184 MB/s
OFED-1.2.5 with NFS RDMA modules on Connect X:   Read: 451 MB/s   Write:  79 MB/s

Has anyone tried this out, or does anyone know of a reason why the numbers are so low on Connect X?

Test setup:
Server and single client running RHEL 5
MT25208 tests were with dual processor 64-bit AMD machines
Connect X tests were with dual processor dual core 64-bit AMD machines
Connect X HCA FW ver: 2.1
NFS mount was in async mode and iozone tests were run with the -c option.

More Iozone results for a record size of 64 KB (values below are in KB/sec):

Read test

File Size (MB)   SDK on MT25208   OFED-1.2.5 on MT25208   OFED-1.2.5 on ConnectX
       64            1684819             1701916                  459279
      128             882580              870180                  462486
      256             922081              921932                  468063
      512             871136              909221                  452969
     1024             900314              910171                  442215
     2048             908117              849710                  676776

Write test

File Size (MB)   SDK on MT25208   OFED-1.2.5 on MT25208   OFED-1.2.5 on ConnectX
       64             184154              182483                   78424
      128             190126              189284                   81869
      256             194921              173124                   85813
      512             199666              192110                   87628
     1024             208924              199240                  126415
     2048             180128              195278                  123020

Regards,
Ram

From john.leidel at gmail.com Tue Sep 4 07:46:15 2007 From: john.leidel at gmail.com (John Leidel) Date: Tue, 4 Sep 2007 11:46:15 -0300 Subject: [ofa-general] Low NFS RDMA performance with Connect X In-Reply-To: References: Message-ID: <27f776af0709040746u4038cc8ck7e9160c07b756936@mail.gmail.com>

In doing some testing with ConnectX, I noticed a similar issue in MPI performance. The fix was simply to upgrade to the latest and greatest firmware.

On 9/4/07, Kuchimanchi, Ramachandra wrote:
>
> Hi,
>
> I took the NFS RDMA code from the Mellanox NFS RDMA SDK, compiled it with
> OFED-1.2.5 and tried it out with Connect X HCAs and also MT25208. I found
> that the Iozone read and write performance numbers are very low on Connect
> X.
>
> For a 128 MB file and a 128 KB record size
>
> NFS RDMA SDK on MT25208:                         Read: 861 MB/s   Write: 185 MB/s
> OFED-1.2.5 with NFS RDMA modules on MT25208:     Read: 849 MB/s   Write: 184 MB/s
> OFED-1.2.5 with NFS RDMA modules on Connect X:   Read: 451 MB/s   Write:  79 MB/s
>
> Has any one tried this out or know of a reason why the numbers are so low
> on Connect X ?
>
> Test-setup:
> Server and single client running RHEL 5
> MT25208 tests were with dual processor 64-bit AMD machines
> Connect X tests were with dual processor dual core 64-bit AMD machines
> Connect X HCA FW ver: 2.1
> NFS mount was in async mode and iozone tests were run with -c option.
>
> More Iozone results for a record size of 64 KB (values below in KB/sec):
>
> Read test
>
> File Size (MB)   SDK on MT25208   OFED-1.2.5 on MT25208   OFED-1.2.5 on ConnectX
>        64            1684819             1701916                  459279
>       128             882580              870180                  462486
>       256             922081              921932                  468063
>       512             871136              909221                  452969
>      1024             900314              910171                  442215
>      2048             908117              849710                  676776
>
> Write test
>
> File Size (MB)   SDK on MT25208   OFED-1.2.5 on MT25208   OFED-1.2.5 on ConnectX
>        64             184154              182483                   78424
>       128             190126              189284                   81869
>       256             194921              173124                   85813
>       512             199666              192110                   87628
>      1024             208924              199240                  126415
>      2048             180128              195278                  123020
>
> Regards,
> Ram

From sashak at voltaire.com Tue Sep 4 08:01:04 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Sep 2007 18:01:04 +0300 Subject: [ofa-general] Re: [PATCH] libibumad 1/2: add valgrind support to auto-tools configuration file In-Reply-To: <200709041034.22516.dotanb@dev.mellanox.co.il> References: <200709041034.22516.dotanb@dev.mellanox.co.il> Message-ID: <20070904150104.GA23670@sashak.voltaire.com>

On 10:34 Tue 04 Sep , Dotan Barak wrote:
> Added valgrind support to the auto-tools configuration file.
>
> Signed-off-by: Dotan Barak

Applied. Thanks.

From sashak at voltaire.com Tue Sep 4 08:01:21 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Sep 2007 18:01:21 +0300 Subject: [ofa-general] Re: [PATCH] libibumad 2/2: add valgrind support to the umad code In-Reply-To: <200709041035.42404.dotanb@dev.mellanox.co.il> References: <200709041035.42404.dotanb@dev.mellanox.co.il> Message-ID: <20070904150121.GB23670@sashak.voltaire.com> On 10:35 Tue 04 Sep , Dotan Barak wrote: > Added valgrind support to the umad code and marked buffers that were filled > by the kernel level with the macro VALGRIND_MAKE_MEM_DEFINED. > > Signed-off-by: Dotan Barak Applied. Thanks. Sasha From sashak at voltaire.com Tue Sep 4 08:01:45 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Sep 2007 18:01:45 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS - fixing ServiceID and PKey bug in match rules In-Reply-To: <46DD5460.9050405@dev.mellanox.co.il> References: <46DD5460.9050405@dev.mellanox.co.il> Message-ID: <20070904150145.GC23670@sashak.voltaire.com> On 15:49 Tue 04 Sep , Yevgeny Kliteynik wrote: > Hi Sasha. > > Small patch that fixes ServiceID and PKey bug in QoS policy match rules. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From sashak at voltaire.com Tue Sep 4 08:06:13 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Sep 2007 18:06:13 +0300 Subject: [ofa-general] [PATCH] ibmgtsim/osmStress.sim.tcl: fix madPathRec_sl_set name Message-ID: <20070904150613.GD23670@sashak.voltaire.com> Rename non-existing madPathRec_sl_set to madPathRec_qos_class_sl_set. 
Signed-off-by: Sasha Khapyorsky --- ibmgtsim/tests/osmStress.sim.tcl | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/ibmgtsim/tests/osmStress.sim.tcl b/ibmgtsim/tests/osmStress.sim.tcl index 6face15..f1d49aa 100755 --- a/ibmgtsim/tests/osmStress.sim.tcl +++ b/ibmgtsim/tests/osmStress.sim.tcl @@ -814,7 +814,7 @@ proc sendPathRecordRequest {fabric port1 port2 port3} { madPathRec_sgid_set $pam \ "0xfe80000000000000:[string range [IBPort_guid_get $port2] 2 end]" madPathRec_num_path_set $pam 1 - madPathRec_sl_set $pam 0x8 + madPathRec_qos_class_sl_set $pam 0x8 madPathRec_mtu_set $pam 4 madPathRec_rate_set $pam 2 -- 1.5.3.rc2.38.g11308 From eitan at mellanox.co.il Tue Sep 4 07:58:13 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 4 Sep 2007 17:58:13 +0300 Subject: [ofa-general] RE: [PATCH] ibmgtsim/osmStress.sim.tcl: fix madPathRec_sl_set name In-Reply-To: <20070904150613.GD23670@sashak.voltaire.com> References: <20070904150613.GD23670@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C90231427D@mtlexch01.mtl.com> Thanks Applied. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Tuesday, September 04, 2007 6:06 PM > To: Eitan Zahavi > Cc: OpenIB; Yevgeny Kliteynik > Subject: [PATCH] ibmgtsim/osmStress.sim.tcl: fix > madPathRec_sl_set name > > > Rename non-existing madPathRec_sl_set to madPathRec_qos_class_sl_set. 
> > Signed-off-by: Sasha Khapyorsky > --- > ibmgtsim/tests/osmStress.sim.tcl | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/ibmgtsim/tests/osmStress.sim.tcl > b/ibmgtsim/tests/osmStress.sim.tcl > index 6face15..f1d49aa 100755 > --- a/ibmgtsim/tests/osmStress.sim.tcl > +++ b/ibmgtsim/tests/osmStress.sim.tcl > @@ -814,7 +814,7 @@ proc sendPathRecordRequest {fabric port1 > port2 port3} { > madPathRec_sgid_set $pam \ > "0xfe80000000000000:[string range [IBPort_guid_get > $port2] 2 end]" > madPathRec_num_path_set $pam 1 > - madPathRec_sl_set $pam 0x8 > + madPathRec_qos_class_sl_set $pam 0x8 > madPathRec_mtu_set $pam 4 > madPathRec_rate_set $pam 2 > > -- > 1.5.3.rc2.38.g11308 > > From swise at opengridcomputing.com Tue Sep 4 08:13:51 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 04 Sep 2007 10:13:51 -0500 Subject: [ofa-general] Re: [PATCH RFC] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts with the host stack. In-Reply-To: References: <1187905185.5547.13.camel@stevo-desktop> Message-ID: <46DD762F.80904@opengridcomputing.com> Roland Dreier wrote: > > The sysadmin creates "for iwarp use only" alias interfaces of the form > > "devname:iw*" where devname is the native interface name (eg eth0) for the > > iwarp netdev device. The alias label can be anything starting with "iw". > > The "iw" immediately after the ':' is the key used by the iwarp driver. > > What's wrong with my suggestion of having the iwarp driver create an > "iwX" interface to go with the normal "ethX" interface? It seems > simpler to me, and there's a somewhat similar precedent with how > mac80211 devices create both wlan0 and wmaster0 interfaces. > > - R. It seemed much more painful for me to implement. :-) I'll look into this, but I think for this to be done, the changes must be in the cxgb3 driver, not the rdma driver, because the guts of the netdev struct are all private to cxgb3. 
Remember that this interface needs to still do non TCP traffic (like ARP and UDP)... Maybe you have something in mind here that I'm not thinking about? Steve. From kliteyn at dev.mellanox.co.il Tue Sep 4 08:17:07 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 04 Sep 2007 18:17:07 +0300 Subject: [ofa-general] [PATCH] osm: QoS - adding new QoS fields to MultiPathRecord Message-ID: <46DD76F3.6020007@dev.mellanox.co.il> Hi Sasha, Adding QoS class and Service ID to MultiPathRecord Signed-off-by: Yevgeny Kliteynik --- opensm/include/iba/ib_types.h | 181 +++++++++++++++++++++++++++++-- opensm/libvendor/osm_vendor_ibumad_sa.c | 3 +- opensm/opensm/osm_helper.c | 13 ++- 3 files changed, 180 insertions(+), 17 deletions(-) diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 0a096f9..13e2f38 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -1658,6 +1658,17 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) */ #define IB_PATH_REC_SL_MASK 0x000F +/****d* IBA Base: Constants/IB_MULTIPATH_REC_SL_MASK +* NAME +* IB_MILTIPATH_REC_SL_MASK +* +* DESCRIPTION +* Mask for the sl field for MultiPath record +* +* SOURCE +*/ +#define IB_MULTIPATH_REC_SL_MASK 0x000F + /****d* IBA Base: Constants/IB_PATH_REC_QOS_CLASS_MASK * NAME * IB_PATH_REC_QOS_CLASS_MASK @@ -1669,6 +1680,17 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) */ #define IB_PATH_REC_QOS_CLASS_MASK 0xFFF0 +/****d* IBA Base: Constants/IB_MULTIPATH_REC_QOS_CLASS_MASK +* NAME +* IB_MULTIPATH_REC_QOS_CLASS_MASK +* +* DESCRIPTION +* Mask for the QoS class field for MultiPath record +* +* SOURCE +*/ +#define IB_MULTIPATH_REC_QOS_CLASS_MASK 0xFFF0 + /****d* IBA Base: Constants/IB_PATH_REC_SELECTOR_MASK * NAME * IB_PATH_REC_SELECTOR_MASK @@ -2589,7 +2611,7 @@ typedef struct _ib_path_rec { #define IB_MPR_COMPMASK_REVERSIBLE (CL_HTON64(((uint64_t)1)<<5)) #define IB_MPR_COMPMASK_NUMBPATH 
(CL_HTON64(((uint64_t)1)<<6)) #define IB_MPR_COMPMASK_PKEY (CL_HTON64(((uint64_t)1)<<7)) -#define IB_MPR_COMPMASK_RESV1 (CL_HTON64(((uint64_t)1)<<8)) +#define IB_MPR_COMPMASK_QOS_CLASS (CL_HTON64(((uint64_t)1)<<8)) #define IB_MPR_COMPMASK_SL (CL_HTON64(((uint64_t)1)<<9)) #define IB_MPR_COMPMASK_MTUSELEC (CL_HTON64(((uint64_t)1)<<10)) #define IB_MPR_COMPMASK_MTU (CL_HTON64(((uint64_t)1)<<11)) @@ -2597,12 +2619,12 @@ typedef struct _ib_path_rec { #define IB_MPR_COMPMASK_RATE (CL_HTON64(((uint64_t)1)<<13)) #define IB_MPR_COMPMASK_PKTLIFETIMESELEC (CL_HTON64(((uint64_t)1)<<14)) #define IB_MPR_COMPMASK_PKTLIFETIME (CL_HTON64(((uint64_t)1)<<15)) -#define IB_MPR_COMPMASK_RESV2 (CL_HTON64(((uint64_t)1)<<16)) +#define IB_MPR_COMPMASK_SERVICEID_MSB (CL_HTON64(((uint64_t)1)<<16)) #define IB_MPR_COMPMASK_INDEPSELEC (CL_HTON64(((uint64_t)1)<<17)) #define IB_MPR_COMPMASK_RESV3 (CL_HTON64(((uint64_t)1)<<18)) #define IB_MPR_COMPMASK_SGIDCOUNT (CL_HTON64(((uint64_t)1)<<19)) #define IB_MPR_COMPMASK_DGIDCOUNT (CL_HTON64(((uint64_t)1)<<20)) -#define IB_MPR_COMPMASK_RESV4 (CL_HTON64(((uint64_t)1)<<21)) +#define IB_MPR_COMPMASK_SERVICEID_LSB (CL_HTON64(((uint64_t)1)<<21)) /* SMInfo Record Component Masks */ #define IB_SMIR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) @@ -5861,16 +5883,15 @@ typedef struct _ib_multipath_rec_t { uint8_t tclass; uint8_t num_path; ib_net16_t pkey; - uint8_t resv0; - uint8_t sl; + ib_net16_t qos_class_sl; uint8_t mtu; uint8_t rate; uint8_t pkt_life; - uint8_t resv1; + uint8_t service_id_8msb; uint8_t independence; /* formerly resv2 */ uint8_t sgid_count; uint8_t dgid_count; - uint8_t resv3[7]; + uint8_t service_id_56lsb[7]; ib_gid_t gids[IB_MULTIPATH_MAX_GIDS]; } PACK_SUFFIX ib_multipath_rec_t; #include @@ -5890,8 +5911,8 @@ typedef struct _ib_multipath_rec_t { * pkey * Partition key (P_Key) to use on this path. * -* sl -* Service level to use on this path. +* qos_class_sl +* QoS class and service level to use on this path. 
* * mtu * MTU and MTU selector fields to use on this path @@ -5901,6 +5922,12 @@ typedef struct _ib_multipath_rec_t { * pkt_life * Packet lifetime * +* service_id_8msb +* 8 most significant bits of Service ID +* +* service_id_56lsb +* 56 least significant bits of Service ID +* * preference * Indicates the relative merit of this path versus other path * records returned from the SA. Lower numbers are better. @@ -5937,6 +5964,41 @@ ib_multipath_rec_num_path(IN const ib_multipath_rec_t * const p_rec) * ib_multipath_rec_t *********/ +/****f* IBA Base: Types/ib_multipath_rec_set_sl +* NAME +* ib_multipath_rec_set_sl +* +* DESCRIPTION +* Set path service level. +* +* SYNOPSIS +*/ +static inline void OSM_API +ib_multipath_rec_set_sl( + IN ib_multipath_rec_t* const p_rec, + IN const uint8_t sl ) +{ + p_rec->qos_class_sl = + (p_rec->qos_class_sl & CL_HTON16(IB_MULTIPATH_REC_QOS_CLASS_MASK)) | + cl_hton16(sl & IB_MULTIPATH_REC_SL_MASK); +} +/* +* PARAMETERS +* p_rec +* [in] Pointer to the MultiPath record object. +* +* sl +* [in] Service level to set. +* +* RETURN VALUES +* None +* +* NOTES +* +* SEE ALSO +* ib_multipath_rec_t +*********/ + /****f* IBA Base: Types/ib_multipath_rec_sl * NAME * ib_multipath_rec_sl @@ -5949,7 +6011,7 @@ ib_multipath_rec_num_path(IN const ib_multipath_rec_t * const p_rec) static inline uint8_t OSM_API ib_multipath_rec_sl(IN const ib_multipath_rec_t * const p_rec) { - return ((uint8_t) ((cl_ntoh16(p_rec->sl)) & 0xF)); + return ((uint8_t) ((cl_ntoh16(p_rec->qos_class_sl)) & IB_MULTIPATH_REC_SL_MASK)); } /* @@ -5966,6 +6028,70 @@ ib_multipath_rec_sl(IN const ib_multipath_rec_t * const p_rec) * ib_multipath_rec_t *********/ +/****f* IBA Base: Types/ib_multipath_rec_set_qos_class +* NAME +* ib_multipath_rec_set_qos_class +* +* DESCRIPTION +* Set path QoS class. 
+* +* SYNOPSIS +*/ +static inline void OSM_API +ib_multipath_rec_set_qos_class( + IN ib_multipath_rec_t* const p_rec, + IN const uint16_t qos_class ) +{ + p_rec->qos_class_sl = + (p_rec->qos_class_sl & CL_HTON16(IB_MULTIPATH_REC_SL_MASK)) | + cl_hton16(qos_class << 4); +} +/* +* PARAMETERS +* p_rec +* [in] Pointer to the MultiPath record object. +* +* qos_class +* [in] QoS class to set. +* +* RETURN VALUES +* None +* +* NOTES +* +* SEE ALSO +* ib_multipath_rec_t +*********/ + +/****f* IBA Base: Types/ib_multipath_rec_qos_class +* NAME +* ib_multipath_rec_qos_class +* +* DESCRIPTION +* Get QoS class. +* +* SYNOPSIS +*/ +static inline uint16_t OSM_API +ib_multipath_rec_qos_class( + IN const ib_multipath_rec_t* const p_rec ) +{ + return (cl_ntoh16( p_rec->qos_class_sl ) >> 4); +} +/* +* PARAMETERS +* p_rec +* [in] Pointer to the MultiPath record object. +* +* RETURN VALUES +* QoS class of the MultiPath record. +* +* NOTES +* +* SEE ALSO +* ib_multipath_rec_t +*********/ + /****f* IBA Base: Types/ib_multipath_rec_mtu * NAME * ib_multipath_rec_mtu @@ -6164,6 +6290,41 @@ ib_multipath_rec_pkt_life_sel(IN const ib_multipath_rec_t * const p_rec) * ib_multipath_rec_t *********/ +/****f* IBA Base: Types/ib_multipath_rec_service_id +* NAME +* ib_multipath_rec_service_id +* +* DESCRIPTION +* Get multipath service id. +* +* SYNOPSIS +*/ +static inline uint64_t OSM_API +ib_multipath_rec_service_id(IN const ib_multipath_rec_t * const p_rec) +{ + union { + ib_net64_t sid; + uint8_t sid_arr[8]; + } sid_union; + sid_union.sid_arr[0] = p_rec->service_id_8msb; + memcpy(&sid_union.sid_arr[1], p_rec->service_id_56lsb, 7); + return sid_union.sid; +} + +/* +* PARAMETERS +* p_rec +* [in] Pointer to the multipath record object. 
+* +* RETURN VALUES +* Service ID +* +* NOTES +* +* SEE ALSO +* ib_multipath_rec_t +*********/ + #define IB_NUM_PKEY_ELEMENTS_IN_BLOCK 32 /****s* IBA Base: Types/ib_pkey_table_t * NAME diff --git a/opensm/libvendor/osm_vendor_ibumad_sa.c b/opensm/libvendor/osm_vendor_ibumad_sa.c index 42a6d3a..a878c71 100644 --- a/opensm/libvendor/osm_vendor_ibumad_sa.c +++ b/opensm/libvendor/osm_vendor_ibumad_sa.c @@ -840,7 +840,8 @@ osmv_query_sa(IN osm_bind_handle_t h_bind, else multipath_rec.num_path &= ~0x80; multipath_rec.pkey = p_mpr_req->pkey; - multipath_rec.sl = p_mpr_req->sl; + ib_multipath_rec_set_sl(&multipath_rec, p_mpr_req->sl); + ib_multipath_rec_set_qos_class(&multipath_rec, 0); multipath_rec.independence = p_mpr_req->independence; multipath_rec.sgid_count = p_mpr_req->sgid_count; multipath_rec.dgid_count = p_mpr_req->dgid_count; diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 5dd3955..cf8cfab 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -1131,29 +1131,30 @@ osm_dump_multipath_record(IN osm_log_t * const p_log, "\t\t\t\ttclass..................0x%X\n" "\t\t\t\tnum_path_revers.........0x%X\n" "\t\t\t\tpkey....................0x%X\n" - "\t\t\t\tresv0...................0x%X\n" + "\t\t\t\tqos_class...............0x%X\n" "\t\t\t\tsl......................0x%X\n" "\t\t\t\tmtu.....................0x%X\n" "\t\t\t\trate....................0x%X\n" "\t\t\t\tpkt_life................0x%X\n" - "\t\t\t\tresv1...................0x%X\n" "\t\t\t\tindependence............0x%X\n" "\t\t\t\tsgid_count..............0x%X\n" "\t\t\t\tdgid_count..............0x%X\n" + "\t\t\t\tservice_id..............0x%016" PRIx64 "\n" "%s\n" "", cl_ntoh32(p_mpr->hop_flow_raw), p_mpr->tclass, p_mpr->num_path, cl_ntoh16(p_mpr->pkey), - p_mpr->resv0, - cl_ntoh16(p_mpr->sl), + ib_multipath_rec_qos_class(p_mpr), + ib_multipath_rec_sl(p_mpr), p_mpr->mtu, p_mpr->rate, p_mpr->pkt_life, - p_mpr->resv1, p_mpr->independence, - p_mpr->sgid_count, 
p_mpr->dgid_count, buf_line); + p_mpr->sgid_count, p_mpr->dgid_count, + ib_multipath_rec_service_id(p_mpr), + buf_line); } } -- 1.5.1.4 From sashak at voltaire.com Tue Sep 4 08:44:52 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Sep 2007 18:44:52 +0300 Subject: [ofa-general] Re: [opensm] bugs in build system In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com> Message-ID: <20070904154452.GE23670@sashak.voltaire.com> Hi Eitan, On 17:02 Sun 02 Sep , Eitan Zahavi wrote: > Hi Sasha, > > For some reason OpenSM (and the required management libs) do not build > correctly when > I use manual autogen.sh, configure --prefix=/tmp/ez/usr ; make; make > install mode. > > It seems the build system is probably broken as it relies on fixed > paths? > > Here is what I do, errors are included in this list: > OK 1. git clone .... > --------------- LIBIBCOMMON ------------------ > OK 2. cd management/libibcommon; autogen.sh; ./configure > --prefix=/tmp/ez/usr ; make ; make install > --------------- LIBIBUMAD ------------------ > OK 3. cd management/libibumad; autogen.sh; > FAIL 4. ./configure --prefix=/tmp/ez/usr > checking for sys_read_string in -libcommon... no > configure: error: sys_read_string() not found. libibumad requires > libibcommon. > > To overcome this I manually added the --disable-libcheck > ./configure --prefix=/tmp/ez/usr --disable-libcheck > I do not understand why after installing the common lib I still get this > error? > Isn't the search path should include the /lib ??? > > FAIL 5. make > Make fails as it does not find the infiniband/common.h > > To overcome this I manually added -I/include .... > make CFLAGS="-I/tmp/ez/usr/include" > > OK 6. make install > --------------- OPENSM ------------------ > OK 7. cd management/opensm; autogen.sh; > FAIL 8. configure --prefix=/tmp/ez/usr > checking for umad_init in -libumad... 
no > configure: error: umad_init() not found. libosmvendor of type openib > requires libibumad. > configure: error: /bin/sh './configure' failed for libvendor > > To overcome this I manually added the --disable-libcheck > ./configure --prefix=/tmp/ez/usr --disable-libcheck > This problem is the same as the above: lib path for linking should use the > /lib. > > FAIL 9. make > Here again the include path is missing the /include: > > ./../include/vendor/osm_vendor_ibumad.h:44:31: infiniband/common.h: No > such file or directory > ./../include/vendor/osm_vendor_ibumad.h:45:29: infiniband/umad.h: No > such file or directory > > To overcome this I manually added -I/include .... > make CFLAGS="-I/tmp/ez/usr/include" > > But this is not enough as the linker fails: > /usr/bin/ld: cannot find -libumad > > To overcome this I had to add -L/lib .... > make CFLAGS="-I/tmp/ez/usr/include" LDFLAGS="-L/tmp/ez/usr/lib -libumad > -libcommon" > > OK 10. make install > > I hope the above issues could be fixed such that the installation would > be simpler. Thanks for reporting. Hope I will find time to look at this. > Also I propose removing the un-needed extra levels of autotools inside > OpenSM code as there is no need/reason to have it be declared as 5 > different projects resulting in "configure" time longer than the > compile time. I agree. Sasha From harake at cscs.ch Tue Sep 4 08:51:39 2007 From: harake at cscs.ch (H.N.HARAKE) Date: Tue, 4 Sep 2007 17:51:39 +0200 Subject: [ofa-general] Build rpms kernel 2.6.5-7.283 Message-ID: I tried to build OFED rpms so I can install it on my linux sles 9 kernel, but I am facing a problem; I hope you can help me. The log file is below. Thanks and Best Regards H. N. Harake + STATUS=0 + '[' 0 -ne 0 ']' + cd ofa_user-1.2.5 ++ /usr/bin/id -u + '[' 0 = 0 ']' + /bin/chown -Rhf root . ++ /usr/bin/id -u + '[' 0 = 0 ']' + /bin/chgrp -Rhf root . + /bin/chmod -Rf a+rX,g-w,o-w .
+ exit 0 Executing(%install): /bin/sh -e /var/tmp/rpm-tmp.15799 + umask 022 + cd /var/tmp/OFEDRPM/BUILD + cd ofa_user-1.2.5 + cd /var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5 + install -d /var/tmp/OFED/etc/init.d + install -d /var/tmp/OFED//etc + install -d /var/tmp/OFED//usr/src + cp -a /var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5 /var/tmp/OFED//usr/src + ./configure --prefix=/usr --libdir=/usr/lib64 --with-libcxgb3 --with-libibcm --with-libibverbs --with-libipathverbs --with-libmlx4 --with-libmthca --with-librdmacm --with-mstflint --with-perftest --sysconfdir=/etc --mandir=/usr/share/man Quilt does not exist... Going to use patch. Created configure.mk.user: prefix=/usr PREFIX="--prefix /usr" libdir=/usr/lib64 # Current working directory CWD=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5 # User level WITH_IBVERBS=yes WITH_MTHCA=yes WITH_MLX4=yes WITH_IPATHVERBS=yes WITH_EHCA=no WITH_CXGB3=yes WITH_CM=yes WITH_SDP=no WITH_DAPL=no WITH_RDMACM=yes WITH_MANAGEMENT_LIBS=no WITH_OSM=no WITH_DIAGS=no WITH_PERFTEST=yes WITH_SRPTOOLS=no WITH_IPOIBTOOLS=no WITH_QLVNICTOOLS=no WITH_TVFLASH=no WITH_MSTFLINT=yes WITH_SDPNETSTAT=no mkdir -p /var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5/patches touch /var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5/patches/quiltrc /var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5/user_patches/fixes/libmlx4_05_fix_max_cap.patch ./configure: line 153: patch: command not found Failed to apply patch: /var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5/user_patches/fixes/libmlx4_05_fix_max_cap.patch error: Bad exit status from /var/tmp/rpm-tmp.15799 (%install) RPM build errors: user vlad does not exist - using root user vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.15799 (%install) ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-libcxgb3 --with-libibcm --with-libibverbs --with-libipathverbs --with-libmlx4 --with-libmthca --with-librdmacm --with-mstflint
--with-perftest --sysconfdir=/etc --mandir=/usr/share/man' --define 'configure_options32 --with-libcxgb3 --with-libibcm --with-libibverbs --with-libipathverbs --with-libmlx4 --with-libmthca --with-librdmacm --sysconfdir=/etc --mandir=/usr/share/man' --define 'build_32bit 1' --define '_mandir /usr/share/man' /root/OFED-1.2.5/SRPMS/ofa_user-1.2.5-0.src.rpm" From sashak at voltaire.com Tue Sep 4 09:32:04 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Sep 2007 19:32:04 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS - adding new QoS fields to MultiPathRecord In-Reply-To: <46DD76F3.6020007@dev.mellanox.co.il> References: <46DD76F3.6020007@dev.mellanox.co.il> Message-ID: <20070904163204.GG23670@sashak.voltaire.com> On 18:17 Tue 04 Sep , Yevgeny Kliteynik wrote: > Hi Sasha, > > Adding QoS class and Service ID to MultiPathRecord > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. The nit is below. > --- > opensm/include/iba/ib_types.h | 181 +++++++++++++++++++++++++++++-- > opensm/libvendor/osm_vendor_ibumad_sa.c | 3 +- > opensm/opensm/osm_helper.c | 13 ++- > 3 files changed, 180 insertions(+), 17 deletions(-) > > diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h > index 0a096f9..13e2f38 100644 > --- a/opensm/include/iba/ib_types.h > +++ b/opensm/include/iba/ib_types.h > @@ -1658,6 +1658,17 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > */ > #define IB_PATH_REC_SL_MASK 0x000F > > +/****d* IBA Base: Constants/IB_MULTIPATH_REC_SL_MASK > +* NAME > +* IB_MILTIPATH_REC_SL_MASK > +* > +* DESCRIPTION > +* Mask for the sl field for MultiPath record > +* > +* SOURCE > +*/ > +#define IB_MULTIPATH_REC_SL_MASK 0x000F > + > /****d* IBA Base: Constants/IB_PATH_REC_QOS_CLASS_MASK > * NAME > * IB_PATH_REC_QOS_CLASS_MASK > @@ -1669,6 +1680,17 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > */ > #define IB_PATH_REC_QOS_CLASS_MASK 0xFFF0 > > +/****d* IBA Base:
Constants/IB_MULTIPATH_REC_QOS_CLASS_MASK > +* NAME > +* IB_MULTIPATH_REC_QOS_CLASS_MASK > +* > +* DESCRIPTION > +* Mask for the QoS class field for MultiPath record > +* > +* SOURCE > +*/ > +#define IB_MULTIPATH_REC_QOS_CLASS_MASK 0xFFF0 > + > /****d* IBA Base: Constants/IB_PATH_REC_SELECTOR_MASK > * NAME > * IB_PATH_REC_SELECTOR_MASK > @@ -2589,7 +2611,7 @@ typedef struct _ib_path_rec { > #define IB_MPR_COMPMASK_REVERSIBLE (CL_HTON64(((uint64_t)1)<<5)) > #define IB_MPR_COMPMASK_NUMBPATH (CL_HTON64(((uint64_t)1)<<6)) > #define IB_MPR_COMPMASK_PKEY (CL_HTON64(((uint64_t)1)<<7)) > -#define IB_MPR_COMPMASK_RESV1 (CL_HTON64(((uint64_t)1)<<8)) > +#define IB_MPR_COMPMASK_QOS_CLASS (CL_HTON64(((uint64_t)1)<<8)) > #define IB_MPR_COMPMASK_SL (CL_HTON64(((uint64_t)1)<<9)) > #define IB_MPR_COMPMASK_MTUSELEC (CL_HTON64(((uint64_t)1)<<10)) > #define IB_MPR_COMPMASK_MTU (CL_HTON64(((uint64_t)1)<<11)) > @@ -2597,12 +2619,12 @@ typedef struct _ib_path_rec { > #define IB_MPR_COMPMASK_RATE (CL_HTON64(((uint64_t)1)<<13)) > #define IB_MPR_COMPMASK_PKTLIFETIMESELEC (CL_HTON64(((uint64_t)1)<<14)) > #define IB_MPR_COMPMASK_PKTLIFETIME (CL_HTON64(((uint64_t)1)<<15)) > -#define IB_MPR_COMPMASK_RESV2 (CL_HTON64(((uint64_t)1)<<16)) > +#define IB_MPR_COMPMASK_SERVICEID_MSB (CL_HTON64(((uint64_t)1)<<16)) > #define IB_MPR_COMPMASK_INDEPSELEC (CL_HTON64(((uint64_t)1)<<17)) > #define IB_MPR_COMPMASK_RESV3 (CL_HTON64(((uint64_t)1)<<18)) > #define IB_MPR_COMPMASK_SGIDCOUNT (CL_HTON64(((uint64_t)1)<<19)) > #define IB_MPR_COMPMASK_DGIDCOUNT (CL_HTON64(((uint64_t)1)<<20)) > -#define IB_MPR_COMPMASK_RESV4 (CL_HTON64(((uint64_t)1)<<21)) > +#define IB_MPR_COMPMASK_SERVICEID_LSB (CL_HTON64(((uint64_t)1)<<21)) > > /* SMInfo Record Component Masks */ > #define IB_SMIR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) > @@ -5861,16 +5883,15 @@ typedef struct _ib_multipath_rec_t { > uint8_t tclass; > uint8_t num_path; > ib_net16_t pkey; > - uint8_t resv0; > - uint8_t sl; > + ib_net16_t qos_class_sl; > uint8_t mtu; 
> uint8_t rate; > uint8_t pkt_life; > - uint8_t resv1; > + uint8_t service_id_8msb; > uint8_t independence; /* formerly resv2 */ > uint8_t sgid_count; > uint8_t dgid_count; > - uint8_t resv3[7]; > + uint8_t service_id_56lsb[7]; > ib_gid_t gids[IB_MULTIPATH_MAX_GIDS]; > } PACK_SUFFIX ib_multipath_rec_t; > #include > @@ -5890,8 +5911,8 @@ typedef struct _ib_multipath_rec_t { > * pkey > * Partition key (P_Key) to use on this path. > * > -* sl > -* Service level to use on this path. > +* qos_class_sl > +* QoS class and service level to use on this path. > * > * mtu > * MTU and MTU selector fields to use on this path > @@ -5901,6 +5922,12 @@ typedef struct _ib_multipath_rec_t { > * pkt_life > * Packet lifetime > * > +* service_id_8msb > +* 8 most significant bits of Service ID > +* > +* service_id_56lsb > +* 56 least significant bits of Service ID > +* > * preference > * Indicates the relative merit of this path versus other path > * records returned from the SA. Lower numbers are better. > @@ -5937,6 +5964,41 @@ ib_multipath_rec_num_path(IN const ib_multipath_rec_t * const p_rec) > * ib_multipath_rec_t > *********/ > > +/****f* IBA Base: Types/ib_multipath_rec_set_sl > +* NAME > +* ib_multipath_rec_set_sl > +* > +* DESCRIPTION > +* Set path service level. > +* > +* SYNOPSIS > +*/ > +static inline void OSM_API > +ib_multipath_rec_set_sl( > + IN ib_multipath_rec_t* const p_rec, > + IN const uint8_t sl ) > +{ > + p_rec->qos_class_sl = > + (p_rec->qos_class_sl & CL_HTON16(IB_MULTIPATH_REC_QOS_CLASS_MASK)) | > + cl_hton16(sl & IB_MULTIPATH_REC_SL_MASK); > +} > +/* > +* PARAMETERS > +* p_rec > +* [in] Pointer to the MultiPath record object. > +* > +* sl > +* [in] Service level to set. 
> +* > +* RETURN VALUES > +* None > +* > +* NOTES > +* > +* SEE ALSO > +* ib_multipath_rec_t > +*********/ > + > /****f* IBA Base: Types/ib_multipath_rec_sl > * NAME > * ib_multipath_rec_sl > @@ -5949,7 +6011,7 @@ ib_multipath_rec_num_path(IN const ib_multipath_rec_t * const p_rec) > static inline uint8_t OSM_API > ib_multipath_rec_sl(IN const ib_multipath_rec_t * const p_rec) > { > - return ((uint8_t) ((cl_ntoh16(p_rec->sl)) & 0xF)); > + return ((uint8_t) ((cl_ntoh16(p_rec->qos_class_sl)) & IB_MULTIPATH_REC_SL_MASK)); > } > > /* > @@ -5966,6 +6028,70 @@ ib_multipath_rec_sl(IN const ib_multipath_rec_t * const p_rec) > * ib_multipath_rec_t > *********/ > > +/****f* IBA Base: Types/ib_multipath_rec_set_qos_class > +* NAME > +* ib_multipath_rec_set_qos_class > +* > +* DESCRIPTION > +* Set path QoS class. > +* > +* SYNOPSIS > +*/ > +static inline void OSM_API > +ib_multipath_rec_set_qos_class( > + IN ib_multipath_rec_t* const p_rec, > + IN const uint16_t qos_class ) > +{ > + p_rec->qos_class_sl = > + (p_rec->qos_class_sl & CL_HTON16(IB_MULTIPATH_REC_SL_MASK)) | > + cl_hton16(qos_class << 4); > +} > +/* > +* PARAMETERS > +* p_rec > +* [in] Pointer to the MultiPath record object. > +* > +* qos_class > +* [in] QoS class to set. > +* > +* RETURN VALUES > +* None > +* > +* NOTES > +* > +* SEE ALSO > +* ib_multipath_rec_t > +*********/ > + > +/****f* IBA Base: Types/ib_multipath_rec_qos_class > +* NAME > +* ib_multipath_rec_qos_class > +* > +* DESCRIPTION > +* Get QoS class. > +* > +* SYNOPSIS > +*/ > +static inline uint16_t OSM_API > +ib_multipath_rec_qos_class( > + IN const ib_multipath_rec_t* const p_rec ) > +{ > + return (cl_ntoh16( p_rec->qos_class_sl ) >> 4); > +} > +/* > +* PARAMETERS > +* p_rec > +* [in] Pointer to the MultiPath record object. > +* > +* RETURN VALUES > +* QoS class of the MultiPath record. 
> +* > +* NOTES > +* > +* SEE ALSO > +* ib_multipath_rec_t > +*********/ > + > /****f* IBA Base: Types/ib_multipath_rec_mtu > * NAME > * ib_multipath_rec_mtu > @@ -6164,6 +6290,41 @@ ib_multipath_rec_pkt_life_sel(IN const ib_multipath_rec_t * const p_rec) > * ib_multipath_rec_t > *********/ > > +/****f* IBA Base: Types/ib_multipath_rec_service_id > +* NAME > +* ib_multipath_rec_service_id > +* > +* DESCRIPTION > +* Get multipath service id. > +* > +* SYNOPSIS > +*/ > +static inline uint64_t OSM_API > +ib_multipath_rec_service_id(IN const ib_multipath_rec_t * const p_rec) > +{ > + union { > + ib_net64_t sid; > + uint8_t sid_arr[8]; > + } sid_union; > + sid_union.sid_arr[0] = p_rec->service_id_8msb; > + memcpy(&sid_union.sid_arr[1], p_rec->service_id_56lsb, 7); > + return sid_union.sid; > +} > + > +/* > +* PARAMETERS > +* p_rec > +* [in] Pointer to the multipath record object. > +* > +* RETURN VALUES > +* Service ID > +* > +* NOTES > +* > +* SEE ALSO > +* ib_multipath_rec_t > +*********/ > + > #define IB_NUM_PKEY_ELEMENTS_IN_BLOCK 32 > /****s* IBA Base: Types/ib_pkey_table_t > * NAME > diff --git a/opensm/libvendor/osm_vendor_ibumad_sa.c b/opensm/libvendor/osm_vendor_ibumad_sa.c > index 42a6d3a..a878c71 100644 > --- a/opensm/libvendor/osm_vendor_ibumad_sa.c > +++ b/opensm/libvendor/osm_vendor_ibumad_sa.c > @@ -840,7 +840,8 @@ osmv_query_sa(IN osm_bind_handle_t h_bind, > else > multipath_rec.num_path &= ~0x80; > multipath_rec.pkey = p_mpr_req->pkey; > - multipath_rec.sl = p_mpr_req->sl; > + ib_multipath_rec_set_sl(&multipath_rec, p_mpr_req->sl); > + ib_multipath_rec_set_qos_class(&multipath_rec, 0); > multipath_rec.independence = p_mpr_req->independence; > multipath_rec.sgid_count = p_mpr_req->sgid_count; > multipath_rec.dgid_count = p_mpr_req->dgid_count; > diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c > index 5dd3955..cf8cfab 100644 > --- a/opensm/opensm/osm_helper.c > +++ b/opensm/opensm/osm_helper.c > @@ -1131,29 +1131,30 @@ 
osm_dump_multipath_record(IN osm_log_t * const p_log, > "\t\t\t\ttclass..................0x%X\n" > "\t\t\t\tnum_path_revers.........0x%X\n" > "\t\t\t\tpkey....................0x%X\n" > - "\t\t\t\tresv0...................0x%X\n" > + "\t\t\t\tqos_class...............0x%X\n" > "\t\t\t\tsl......................0x%X\n" > "\t\t\t\tmtu.....................0x%X\n" > "\t\t\t\trate....................0x%X\n" > "\t\t\t\tpkt_life................0x%X\n" > - "\t\t\t\tresv1...................0x%X\n" > "\t\t\t\tindependence............0x%X\n" > "\t\t\t\tsgid_count..............0x%X\n" > "\t\t\t\tdgid_count..............0x%X\n" > + "\t\t\t\tservice_id..............0x%016" PRIx64 "\n" > "%s\n" > "", > cl_ntoh32(p_mpr->hop_flow_raw), > p_mpr->tclass, > p_mpr->num_path, > cl_ntoh16(p_mpr->pkey), > - p_mpr->resv0, > - cl_ntoh16(p_mpr->sl), > + ib_multipath_rec_qos_class(p_mpr), > + ib_multipath_rec_sl(p_mpr), > p_mpr->mtu, > p_mpr->rate, > p_mpr->pkt_life, > - p_mpr->resv1, > p_mpr->independence, > - p_mpr->sgid_count, p_mpr->dgid_count, buf_line); > + p_mpr->sgid_count, p_mpr->dgid_count, > + ib_multipath_rec_service_id(p_mpr), It returns service_id in network byte order. Should cl_ntoh64() be here? Sasha > + buf_line); > } > } > > -- > 1.5.1.4 > > From mst at dev.mellanox.co.il Tue Sep 4 09:40:18 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Sep 2007 19:40:18 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <46DD466F.8020607@voltaire.com> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <46DD466F.8020607@voltaire.com> Message-ID: <20070904164018.GB28350@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support > > Michael S. Tsirkin wrote: > >Add module option hw_csum: when set, IPoIB will report S/G > >support, and rely on hardware end-to-end transport checksum (ICRC) > >instead of software-level protocol checksums.
> > >Since this will not inter-operate with older IPoIB modules, > >this option is off by default. > > Hi Michael, > > Looking at slide 18 of Dror's Sonoma presentation (*) which states - > > >Checksum Offload > >TCP/UDP/IP Checksum Offloading - Query device for checksum offload support > >QP Creation - Mark QP for IPoIB checksum support > >TX - ibv_send_flags indicate checksum offload request > >RX - ibv_wc_flags indicate checksum status (good, bad, unverified) All this is only supported by ConnectX and only for datagram (not IPoIB CM). > I don't see that there is such a dependency, nor can I understand the > design that creates the dependency, unless you rely on the IB CRC and > do not compute the actual TCP/UDP/IP csum. Your question is moot. I'll just quote the commit message here: rely on hardware end-to-end transport checksum (ICRC) instead of software-level protocol checksums While this does not inter-operate with the standard IPoIB RFC, all TCP/IP suite protocols work as usual, so I think that this is at least as useful as SDP is. -- MST From jgunthorpe at obsidianresearch.com Tue Sep 4 09:52:51 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 4 Sep 2007 10:52:51 -0600 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904091133.GA23437@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> Message-ID: <20070904165251.GA16535@obsidianresearch.com> On Tue, Sep 04, 2007 at 12:11:33PM +0300, Michael S. Tsirkin wrote: > I know some people find this approach controversial, > but from my perspective, this is not worse than e.g. > SDP which does not have SW checksums pretty much by design. This would be a lot better in my mind if the option was negotiated as part of the CM setup process. Otherwise this becomes a network-wide, all-or-nothing kind of feature.. What if the RXing Linux IB side is acting as a forwarder to ethernet?
It will forward corrupt packets if this option is set, right? Jason From mst at dev.mellanox.co.il Tue Sep 4 10:04:19 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Sep 2007 20:04:19 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904165251.GA16535@obsidianresearch.com> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> Message-ID: <20070904170419.GD28350@mellanox.co.il> > Quoting Jason Gunthorpe : > Subject: Re: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support > > On Tue, Sep 04, 2007 at 12:11:33PM +0300, Michael S. Tsirkin wrote: > > > I know some people find this approach controversial, > > but from my perspective, this is not worse than e.g. > > SDP which does not have SW checksums pretty much by design. > > This would be a lot better in my mind if the option was negotiated as > part of the CM setup process. Unfortunately, the HW_CSUM device->features flag is a per-netdevice one. We could do an extra pass over the packet, but this would mean a performance hit for such paths. Your suggestion also does not address multicast addresses. > Otherwise this becomes a network wide > all or nothing kind of feature.. Yes. It would be relatively easy to make it possible to disable this feature from sysfs; then you could partition the network and use the feature for some partitions only. > What if the RXing Linux IB side is acting as a forwarder to ethernet? > It will forward corrupt packets if this option is set, right? No. The checksum will be calculated by the gateway before being sent on the ethernet interface.
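For readers following the gateway argument: the software checksum a forwarding host has to produce is the standard 16-bit one's-complement Internet checksum described in RFC 1071. What follows is only an illustrative user-space sketch in C, not code from this patch or from the kernel; the function name inet_checksum is an assumption made for the example, and the TCP/UDP pseudo-header that a real stack also sums is left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the IPoIB patch or the kernel):
 * 16-bit one's-complement Internet checksum over a byte buffer,
 * following the method described in RFC 1071.
 * The bytes are read as big-endian 16-bit words; the returned value
 * is the number whose big-endian encoding goes into the header field.
 */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;

    /* Add up all complete 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }

    /* An odd trailing byte is padded with a zero low byte. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    /* The checksum is the one's complement of the folded sum. */
    return (uint16_t)~sum;
}
```

With the example bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) the folded sum is 0xddf2 and the function returns 0x220d; running it again over the same bytes followed by 22 0d returns 0, which is how a receiver verifies a packet.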
-- MST From jlentini at netapp.com Tue Sep 4 10:04:08 2007 From: jlentini at netapp.com (James Lentini) Date: Tue, 4 Sep 2007 13:04:08 -0400 (EDT) Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904165251.GA16535@obsidianresearch.com> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> Message-ID: On Tue, 4 Sep 2007, Jason Gunthorpe wrote: > On Tue, Sep 04, 2007 at 12:11:33PM +0300, Michael S. Tsirkin wrote: > > > I know some people find this approach controversial, > > but from my perspective, this is not worse than e.g. > > SDP which does not have SW checksums pretty much by design. > > This would be alot better in my mind of the option was negotiated as > part of the CM setup process. Otherwise this becomes a network wide > all or nothing kind of feature.. > > What if the RXing Linux IB side is acting as a forwarder to ethernet? > It will forward corrupt packets if this option is set, right? So this break all gateway devices? How would packets be routed with this change? From mst at dev.mellanox.co.il Tue Sep 4 10:20:04 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Sep 2007 20:20:04 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> Message-ID: <20070904172004.GF28350@mellanox.co.il> > Quoting James Lentini : > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > On Tue, 4 Sep 2007, Jason Gunthorpe wrote: > > > On Tue, Sep 04, 2007 at 12:11:33PM +0300, Michael S. Tsirkin wrote: > > > > > I know some people find this approach controversial, > > > but from my perspective, this is not worse than e.g. > > > SDP which does not have SW checksums pretty much by design. 
> > > > This would be alot better in my mind of the option was negotiated as > > part of the CM setup process. Otherwise this becomes a network wide > > all or nothing kind of feature.. > > > > What if the RXing Linux IB side is acting as a forwarder to ethernet? > > It will forward corrupt packets if this option is set, right? > > So this break all gateway devices? It won't. The gateway will calculate the checksums. > How would packets be routed with this change? As usual. -- MST From jgunthorpe at obsidianresearch.com Tue Sep 4 10:27:25 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 4 Sep 2007 11:27:25 -0600 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904170419.GD28350@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> Message-ID: <20070904172725.GH4472@obsidianresearch.com> On Tue, Sep 04, 2007 at 08:04:19PM +0300, Michael S. Tsirkin wrote: > > This would be alot better in my mind of the option was negotiated as > > part of the CM setup process. > > Unfortunately, HW_CSUM device->features flag is a per-netdevice one. > We could do an extra pass over the packet, but this would mean a > performance hit for such paths. Your suggestion also does > not address multicast addresses. Why is there a big difference in performance if the stack does the csum update or if the netdevice does the csum update? Aren't you already summing all UD packets (inclusing multicast) in the driver before sending? Though, to be honest, I don't see this in your patch, so maybe not.. > > What if the RXing Linux IB side is acting as a forwarder to ethernet? > > It will forward corrupt packets if this option is set, right? > > No. The checksum will be calculated by the gateway before being sent on the > ethernet interface. 
I thought linux only recomputed the checksum on the forwarding path if the skb was marked as needing checksum. Since you set the skb as already summed, there should be cases where invalid packets will be forwarded.. I'd be surprised if a real, hardware, IPoIB to XX device recomputed the checksum unconditionally. The typical approach is to do an incremental update of the checksum if you are changing the packet headers. This preserves the end-to-end-ness and also does not require buffering the entire packet before updating it. Jason From mst at dev.mellanox.co.il Tue Sep 4 10:48:43 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Sep 2007 20:48:43 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904172725.GH4472@obsidianresearch.com> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> Message-ID: <20070904174843.GG28350@mellanox.co.il> > Quoting Jason Gunthorpe : > Subject: Re: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support > > On Tue, Sep 04, 2007 at 08:04:19PM +0300, Michael S. Tsirkin wrote: > > > > This would be alot better in my mind of the option was negotiated as > > > part of the CM setup process. > > > > Unfortunately, HW_CSUM device->features flag is a per-netdevice one. > > We could do an extra pass over the packet, but this would mean a > > performance hit for such paths. Your suggestion also does > > not address multicast addresses. > > Why is there a big difference in performance if the stack does the > csum update or if the netdevice does the csum update? > > Aren't you already summing all UD packets (inclusing multicast) in the > driver before sending? Though, to be honest, I don't see this in your > patch, so maybe not.. no > > > What if the RXing Linux IB side is acting as a forwarder to ethernet? 
> > > It will forward corrupt packets if this option is set, right? > > > > No. The checksum will be calculated by the gateway before being sent on the > > ethernet interface. > > I thought linux only recomputed the checksum on the forwarding path if > the skb was marked as needing checksum. Since you set the skb as > already summed, there should be cases where invalid packets will be > forwarded.. I don't set CHECKSUM_UNECESSARY, so linux will have to recompute the checksum. > I'd be surprised if a real, hardware, IPoIB to XX device recomputed > the checksum unconditionally. The typical approach is to do an > incremental update of the checksum if you are changing the packet > headers. This preserves the end-to-end-ness and also does not require > buffering the entire packet before updating it. Since skb is not marked with CHECKSUM_COMPLETE, linux will recompute the checksum IMO. -- MST From jlentini at netapp.com Tue Sep 4 11:13:54 2007 From: jlentini at netapp.com (James Lentini) Date: Tue, 4 Sep 2007 14:13:54 -0400 (EDT) Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904172004.GF28350@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904172004.GF28350@mellanox.co.il> Message-ID: On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > Quoting James Lentini : > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > > On Tue, 4 Sep 2007, Jason Gunthorpe wrote: > > > > > On Tue, Sep 04, 2007 at 12:11:33PM +0300, Michael S. Tsirkin wrote: > > > > > > > I know some people find this approach controversial, > > > > but from my perspective, this is not worse than e.g. > > > > SDP which does not have SW checksums pretty much by design. > > > > > > This would be alot better in my mind of the option was negotiated as > > > part of the CM setup process. 
Otherwise this becomes a network wide > > > all or nothing kind of feature.. > > > > > > What if the RXing Linux IB side is acting as a forwarder to ethernet? > > > It will forward corrupt packets if this option is set, right? > > > > So this break all gateway devices? > > It won't. The gateway will calculate the checksums. > > > How would packets be routed with this change? > > As usual. A Linux system setup as a router with an IPoIB interface and an Ethernet interface will work if this feature is turned on? From mst at dev.mellanox.co.il Tue Sep 4 11:26:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 4 Sep 2007 21:26:55 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904172004.GF28350@mellanox.co.il> Message-ID: <20070904182655.GI28350@mellanox.co.il> > Quoting James Lentini : > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > > > Quoting James Lentini : > > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > > > > > > On Tue, 4 Sep 2007, Jason Gunthorpe wrote: > > > > > > > On Tue, Sep 04, 2007 at 12:11:33PM +0300, Michael S. Tsirkin wrote: > > > > > > > > > I know some people find this approach controversial, > > > > > but from my perspective, this is not worse than e.g. > > > > > SDP which does not have SW checksums pretty much by design. > > > > > > > > This would be alot better in my mind of the option was negotiated as > > > > part of the CM setup process. Otherwise this becomes a network wide > > > > all or nothing kind of feature.. > > > > > > > > What if the RXing Linux IB side is acting as a forwarder to ethernet? > > > > It will forward corrupt packets if this option is set, right? > > > > > > So this break all gateway devices? > > > > It won't. 
The gateway will calculate the checksums.
> >
> > > How would packets be routed with this change?
> >
> > As usual.
>
> A Linux system setup as a router with an IPoIB interface and an
> Ethernet interface will work if this feature is turned on?

I have yet to test this setup, but yes, it should.

--
MST

From jlentini at netapp.com Tue Sep 4 12:12:08 2007
From: jlentini at netapp.com (James Lentini)
Date: Tue, 4 Sep 2007 15:12:08 -0400 (EDT)
Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To: <20070904091133.GA23437@mellanox.co.il>
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il>
Message-ID:

On Tue, 4 Sep 2007, Michael S. Tsirkin wrote:

> Add module option hw_csum: when set, IPoIB will report S/G
> support, and rely on hardware end-to-end transport checksum (ICRC)
> instead of software-level protocol checksums.

The purpose of this option would be clearer if the parameter name were
"omit_csum". Calling this "HW checksum" support is misleading because
the term is already used to describe network adapters that calculate
TCP/IP checksums in hardware. I realize that you are using the HW
checksum infrastructure to implement this, but it is really not the
same thing.

> Since this will not inter-operate with older IPoIB modules, this
> option is off by default.
>
> Signed-off-by: Michael S. Tsirkin

Does the S/G support need to be tied to the checksum changes?

Will the proposed IPoIB wire format changes be standardized in the
IETF?

Can you describe what will happen when an IETF compliant IPoIB node
and a "csum omitted" IPoIB node attempt to communicate? How would the
interoperability errors be indicated to the user?
From jgunthorpe at obsidianresearch.com Tue Sep 4 12:35:47 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Tue, 4 Sep 2007 13:35:47 -0600
Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To: <20070904174843.GG28350@mellanox.co.il>
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il>
Message-ID: <20070904193547.GI4472@obsidianresearch.com>

On Tue, Sep 04, 2007 at 08:48:43PM +0300, Michael S. Tsirkin wrote:
> > Aren't you already summing all UD packets (including multicast) in the
> > driver before sending? Though, to be honest, I don't see this in your
> > patch, so maybe not..
>
> no

Yuk. Sending invalid UD packets is horrible. So it truly is all or
nothing. Every gateway, embedded IP device, etc must support this or
you cannot use it..

> > > > What if the RXing Linux IB side is acting as a forwarder to ethernet?
> > > > It will forward corrupt packets if this option is set, right?
> > >
> > > No. The checksum will be calculated by the gateway before being sent on the
> > > ethernet interface.
> >
> > I thought linux only recomputed the checksum on the forwarding path if
> > the skb was marked as needing checksum. Since you set the skb as
> > already summed, there should be cases where invalid packets will be
> > forwarded..
>
> I don't set CHECKSUM_UNNECESSARY, so linux will have to recompute the checksum.

Eh? You set IPOIB_HEADER_F_HWCSUM on the TX path if the csum is
invalid and then test that on the RX path to set CHECKSUM_UNNECESSARY.
So, all badly checksummed packets have CHECKSUM_UNNECESSARY set.

> > I'd be surprised if a real, hardware, IPoIB to XX device recomputed
> > the checksum unconditionally. The typical approach is to do an
> > incremental update of the checksum if you are changing the packet
> > headers. This preserves the end-to-end-ness and also does not require
> > buffering the entire packet before updating it.

I was talking about real, existing HW gateways here, not Linux. I
don't know of any reason a gateway would even touch the payload and
require an L4 checksum update, let alone doing it non-incrementally..

> Since skb is not marked with CHECKSUM_COMPLETE, linux will
> recompute the checksum IMO.

MM, no, I don't think so. This checksum stuff is all about the L4
TCP/UDP checksum. If on RX the checksum is invalid the packet is
dumped and it never gets into the forwarding code (ip_forward
routine). Up until very recently ip_forward just unconditionally set
ip_summed to CHECKSUM_NONE to reflect this. Of course CHECKSUM_NONE
disables all checksum updates on the driver TX path.

FWIW, CHECKSUM_COMPLETE is listed as an RX path option; no in-tree
driver tests it on the TX path. You should probably be using
ip_summed != CHECKSUM_PARTIAL as a test in hard_header.

With the new changes to ip_forward, maybe you could get away with
setting CHECKSUM_PARTIAL in your RX path to get the TX of the final
output device to regenerate the L4 checksum?

Even so, sending out malformed UD packets strikes me as a
compatibility killer.. This would be much better as an RC-only
feature negotiated at CM setup.

Jason

From mst at dev.mellanox.co.il Tue Sep 4 12:49:40 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 4 Sep 2007 22:49:40 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To:
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il>
Message-ID: <20070904194940.GK28350@mellanox.co.il>

> Quoting James Lentini :
> Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support
>
> On Tue, 4 Sep 2007, Michael S. Tsirkin wrote:
>
> > Add module option hw_csum: when set, IPoIB will report S/G
> > support, and rely on hardware end-to-end transport checksum (ICRC)
> > instead of software-level protocol checksums.
>
> The purpose of this option would be clearer if the parameter name were
> "omit_csum". Calling this "HW checksum" support is misleading because
> the term is already used to describe network adapters that calculate
> TCP/IP checksums in hardware. I realize that you are using the HW
> checksum infrastructure to implement this, but it is really not the
> same thing.

Another reason is that I declare HW_CSUM in the netdev
feature list. Yea, someone might get confused,
but "omit checksum" is misleading, too, and is likely to
scare users away from the feature: the need for end-to-end checksum
is a widely recognised requirement.

So I don't have a better name. Hopefully modinfo documents the
option well enough.

> > Since this will not inter-operate with older IPoIB modules, this
> > option is off by default.
> >
> > Signed-off-by: Michael S. Tsirkin
>
> Does the S/G support need to be tied to the checksum changes?
>
> Will the proposed IPoIB wire format changes be standardized in the
> IETF?

I don't know.

> Can you describe what will happen when an IETF compliant IPoIB node
> and a "csum omitted" IPoIB node attempt to communicate? How would the
> interoperability errors be indicated to the user?

Old ipoib sends all packets with the hw csum bit off; new ipoib handles them fine.
New ipoib might send a packet with the hw csum bit clear
(e.g. a packet that comes from an external gateway); old ipoib handles
this one fine.

OTOH typical packets sent from new ipoib have the hw csum bit set.
Old ipoib ignores this bit and passes the packets up the stack.
You'll get checksum failure errors which result in transport errors.
These can be observed with standard linux tools.
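
The interoperability cases described above can be summarised as a small decision function. This is only a sketch under stated assumptions: the enum values and the function `ipoib_rx_outcome()` are hypothetical names invented for illustration, not identifiers from the patch or the kernel, and it assumes that a packet sent with the hw csum bit clear carries a valid software checksum.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch; not kernel identifiers. */
enum rx_outcome {
    RX_VERIFIED_OK,   /* receiver verified a valid software checksum */
    RX_TRUSTED_ICRC,  /* receiver skipped verification, relying on the ICRC */
    RX_CSUM_FAILURE   /* receiver verified, but no valid checksum was present */
};

/*
 * hw_csum_bit:  the sender set the "hw csum" header bit, i.e. it left the
 *               TCP/UDP checksum uncomputed.
 * rx_knows_bit: the receiver runs the patched IPoIB and honours the bit.
 */
static enum rx_outcome ipoib_rx_outcome(bool hw_csum_bit, bool rx_knows_bit)
{
    if (!hw_csum_bit)
        return RX_VERIFIED_OK;   /* assumed: packet carries a real checksum */
    if (rx_knows_bit)
        return RX_TRUSTED_ICRC;  /* new node talking to new node */
    return RX_CSUM_FAILURE;      /* old node ignores the bit and verifies */
}
```

Only the last case, a new sender with the bit set and an old receiver, ends in the checksum failures mentioned above.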
-- MST From kliteyn at dev.mellanox.co.il Tue Sep 4 12:54:05 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 04 Sep 2007 22:54:05 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS - adding new QoS fields to MultiPathRecord In-Reply-To: <20070904163204.GG23670@sashak.voltaire.com> References: <46DD76F3.6020007@dev.mellanox.co.il> <20070904163204.GG23670@sashak.voltaire.com> Message-ID: <46DDB7DD.4080909@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 18:17 Tue 04 Sep , Yevgeny Kliteynik wrote: >> Hi Sasha, >> >> Adding QoS class and Service ID to MultiPathRecord >> >> Signed-off-by: Yevgeny Kliteynik > > Applied. Thanks. > > The nit is below. > >> --- >> opensm/include/iba/ib_types.h | 181 +++++++++++++++++++++++++++++-- >> opensm/libvendor/osm_vendor_ibumad_sa.c | 3 +- >> opensm/opensm/osm_helper.c | 13 ++- >> 3 files changed, 180 insertions(+), 17 deletions(-) >> >> diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h >> index 0a096f9..13e2f38 100644 >> --- a/opensm/include/iba/ib_types.h >> +++ b/opensm/include/iba/ib_types.h >> @@ -1658,6 +1658,17 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) >> */ >> #define IB_PATH_REC_SL_MASK 0x000F >> >> +/****d* IBA Base: Constants/IB_MULTIPATH_REC_SL_MASK >> +* NAME >> +* IB_MILTIPATH_REC_SL_MASK >> +* >> +* DESCRIPTION >> +* Mask for the sl field for MultiPath record >> +* >> +* SOURCE >> +*/ >> +#define IB_MULTIPATH_REC_SL_MASK 0x000F >> + >> /****d* IBA Base: Constants/IB_PATH_REC_QOS_CLASS_MASK >> * NAME >> * IB_PATH_REC_QOS_CLASS_MASK >> @@ -1669,6 +1680,17 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) >> */ >> #define IB_PATH_REC_QOS_CLASS_MASK 0xFFF0 >> >> +/****d* IBA Base: Constants/IB_MULTIPATH_REC_QOS_CLASS_MASK >> +* NAME >> +* IB_MULTIPATH_REC_QOS_CLASS_MASK >> +* >> +* DESCRIPTION >> +* Mask for the QoS class field for MultiPath record >> +* >> +* SOURCE >> +*/ >> +#define 
IB_MULTIPATH_REC_QOS_CLASS_MASK 0xFFF0 >> + >> /****d* IBA Base: Constants/IB_PATH_REC_SELECTOR_MASK >> * NAME >> * IB_PATH_REC_SELECTOR_MASK >> @@ -2589,7 +2611,7 @@ typedef struct _ib_path_rec { >> #define IB_MPR_COMPMASK_REVERSIBLE (CL_HTON64(((uint64_t)1)<<5)) >> #define IB_MPR_COMPMASK_NUMBPATH (CL_HTON64(((uint64_t)1)<<6)) >> #define IB_MPR_COMPMASK_PKEY (CL_HTON64(((uint64_t)1)<<7)) >> -#define IB_MPR_COMPMASK_RESV1 (CL_HTON64(((uint64_t)1)<<8)) >> +#define IB_MPR_COMPMASK_QOS_CLASS (CL_HTON64(((uint64_t)1)<<8)) >> #define IB_MPR_COMPMASK_SL (CL_HTON64(((uint64_t)1)<<9)) >> #define IB_MPR_COMPMASK_MTUSELEC (CL_HTON64(((uint64_t)1)<<10)) >> #define IB_MPR_COMPMASK_MTU (CL_HTON64(((uint64_t)1)<<11)) >> @@ -2597,12 +2619,12 @@ typedef struct _ib_path_rec { >> #define IB_MPR_COMPMASK_RATE (CL_HTON64(((uint64_t)1)<<13)) >> #define IB_MPR_COMPMASK_PKTLIFETIMESELEC (CL_HTON64(((uint64_t)1)<<14)) >> #define IB_MPR_COMPMASK_PKTLIFETIME (CL_HTON64(((uint64_t)1)<<15)) >> -#define IB_MPR_COMPMASK_RESV2 (CL_HTON64(((uint64_t)1)<<16)) >> +#define IB_MPR_COMPMASK_SERVICEID_MSB (CL_HTON64(((uint64_t)1)<<16)) >> #define IB_MPR_COMPMASK_INDEPSELEC (CL_HTON64(((uint64_t)1)<<17)) >> #define IB_MPR_COMPMASK_RESV3 (CL_HTON64(((uint64_t)1)<<18)) >> #define IB_MPR_COMPMASK_SGIDCOUNT (CL_HTON64(((uint64_t)1)<<19)) >> #define IB_MPR_COMPMASK_DGIDCOUNT (CL_HTON64(((uint64_t)1)<<20)) >> -#define IB_MPR_COMPMASK_RESV4 (CL_HTON64(((uint64_t)1)<<21)) >> +#define IB_MPR_COMPMASK_SERVICEID_LSB (CL_HTON64(((uint64_t)1)<<21)) >> >> /* SMInfo Record Component Masks */ >> #define IB_SMIR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) >> @@ -5861,16 +5883,15 @@ typedef struct _ib_multipath_rec_t { >> uint8_t tclass; >> uint8_t num_path; >> ib_net16_t pkey; >> - uint8_t resv0; >> - uint8_t sl; >> + ib_net16_t qos_class_sl; >> uint8_t mtu; >> uint8_t rate; >> uint8_t pkt_life; >> - uint8_t resv1; >> + uint8_t service_id_8msb; >> uint8_t independence; /* formerly resv2 */ >> uint8_t sgid_count; >> 
uint8_t dgid_count; >> - uint8_t resv3[7]; >> + uint8_t service_id_56lsb[7]; >> ib_gid_t gids[IB_MULTIPATH_MAX_GIDS]; >> } PACK_SUFFIX ib_multipath_rec_t; >> #include >> @@ -5890,8 +5911,8 @@ typedef struct _ib_multipath_rec_t { >> * pkey >> * Partition key (P_Key) to use on this path. >> * >> -* sl >> -* Service level to use on this path. >> +* qos_class_sl >> +* QoS class and service level to use on this path. >> * >> * mtu >> * MTU and MTU selector fields to use on this path >> @@ -5901,6 +5922,12 @@ typedef struct _ib_multipath_rec_t { >> * pkt_life >> * Packet lifetime >> * >> +* service_id_8msb >> +* 8 most significant bits of Service ID >> +* >> +* service_id_56lsb >> +* 56 least significant bits of Service ID >> +* >> * preference >> * Indicates the relative merit of this path versus other path >> * records returned from the SA. Lower numbers are better. >> @@ -5937,6 +5964,41 @@ ib_multipath_rec_num_path(IN const ib_multipath_rec_t * const p_rec) >> * ib_multipath_rec_t >> *********/ >> >> +/****f* IBA Base: Types/ib_multipath_rec_set_sl >> +* NAME >> +* ib_multipath_rec_set_sl >> +* >> +* DESCRIPTION >> +* Set path service level. >> +* >> +* SYNOPSIS >> +*/ >> +static inline void OSM_API >> +ib_multipath_rec_set_sl( >> + IN ib_multipath_rec_t* const p_rec, >> + IN const uint8_t sl ) >> +{ >> + p_rec->qos_class_sl = >> + (p_rec->qos_class_sl & CL_HTON16(IB_MULTIPATH_REC_QOS_CLASS_MASK)) | >> + cl_hton16(sl & IB_MULTIPATH_REC_SL_MASK); >> +} >> +/* >> +* PARAMETERS >> +* p_rec >> +* [in] Pointer to the MultiPath record object. >> +* >> +* sl >> +* [in] Service level to set. 
>> +* >> +* RETURN VALUES >> +* None >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_multipath_rec_t >> +*********/ >> + >> /****f* IBA Base: Types/ib_multipath_rec_sl >> * NAME >> * ib_multipath_rec_sl >> @@ -5949,7 +6011,7 @@ ib_multipath_rec_num_path(IN const ib_multipath_rec_t * const p_rec) >> static inline uint8_t OSM_API >> ib_multipath_rec_sl(IN const ib_multipath_rec_t * const p_rec) >> { >> - return ((uint8_t) ((cl_ntoh16(p_rec->sl)) & 0xF)); >> + return ((uint8_t) ((cl_ntoh16(p_rec->qos_class_sl)) & IB_MULTIPATH_REC_SL_MASK)); >> } >> >> /* >> @@ -5966,6 +6028,70 @@ ib_multipath_rec_sl(IN const ib_multipath_rec_t * const p_rec) >> * ib_multipath_rec_t >> *********/ >> >> +/****f* IBA Base: Types/ib_multipath_rec_set_qos_class >> +* NAME >> +* ib_multipath_rec_set_qos_class >> +* >> +* DESCRIPTION >> +* Set path QoS class. >> +* >> +* SYNOPSIS >> +*/ >> +static inline void OSM_API >> +ib_multipath_rec_set_qos_class( >> + IN ib_multipath_rec_t* const p_rec, >> + IN const uint16_t qos_class ) >> +{ >> + p_rec->qos_class_sl = >> + (p_rec->qos_class_sl & CL_HTON16(IB_MULTIPATH_REC_SL_MASK)) | >> + cl_hton16(qos_class << 4); >> +} >> +/* >> +* PARAMETERS >> +* p_rec >> +* [in] Pointer to the MultiPath record object. >> +* >> +* qos_class >> +* [in] QoS class to set. >> +* >> +* RETURN VALUES >> +* None >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_multipath_rec_t >> +*********/ >> + >> +/****f* IBA Base: Types/ib_multipath_rec_qos_class >> +* NAME >> +* ib_multipath_rec_qos_class >> +* >> +* DESCRIPTION >> +* Get QoS class. >> +* >> +* SYNOPSIS >> +*/ >> +static inline uint16_t OSM_API >> +ib_multipath_rec_qos_class( >> + IN const ib_multipath_rec_t* const p_rec ) >> +{ >> + return (cl_ntoh16( p_rec->qos_class_sl ) >> 4); >> +} >> +/* >> +* PARAMETERS >> +* p_rec >> +* [in] Pointer to the MultiPath record object. >> +* >> +* RETURN VALUES >> +* QoS class of the MultiPath record. 
>> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_multipath_rec_t >> +*********/ >> + >> /****f* IBA Base: Types/ib_multipath_rec_mtu >> * NAME >> * ib_multipath_rec_mtu >> @@ -6164,6 +6290,41 @@ ib_multipath_rec_pkt_life_sel(IN const ib_multipath_rec_t * const p_rec) >> * ib_multipath_rec_t >> *********/ >> >> +/****f* IBA Base: Types/ib_multipath_rec_service_id >> +* NAME >> +* ib_multipath_rec_service_id >> +* >> +* DESCRIPTION >> +* Get multipath service id. >> +* >> +* SYNOPSIS >> +*/ >> +static inline uint64_t OSM_API >> +ib_multipath_rec_service_id(IN const ib_multipath_rec_t * const p_rec) >> +{ >> + union { >> + ib_net64_t sid; >> + uint8_t sid_arr[8]; >> + } sid_union; >> + sid_union.sid_arr[0] = p_rec->service_id_8msb; >> + memcpy(&sid_union.sid_arr[1], p_rec->service_id_56lsb, 7); >> + return sid_union.sid; >> +} >> + >> +/* >> +* PARAMETERS >> +* p_rec >> +* [in] Pointer to the multipath record object. >> +* >> +* RETURN VALUES >> +* Service ID >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_multipath_rec_t >> +*********/ >> + >> #define IB_NUM_PKEY_ELEMENTS_IN_BLOCK 32 >> /****s* IBA Base: Types/ib_pkey_table_t >> * NAME >> diff --git a/opensm/libvendor/osm_vendor_ibumad_sa.c b/opensm/libvendor/osm_vendor_ibumad_sa.c >> index 42a6d3a..a878c71 100644 >> --- a/opensm/libvendor/osm_vendor_ibumad_sa.c >> +++ b/opensm/libvendor/osm_vendor_ibumad_sa.c >> @@ -840,7 +840,8 @@ osmv_query_sa(IN osm_bind_handle_t h_bind, >> else >> multipath_rec.num_path &= ~0x80; >> multipath_rec.pkey = p_mpr_req->pkey; >> - multipath_rec.sl = p_mpr_req->sl; >> + ib_multipath_rec_set_sl(&multipath_rec, p_mpr_req->sl); >> + ib_multipath_rec_set_qos_class(&multipath_rec, 0); >> multipath_rec.independence = p_mpr_req->independence; >> multipath_rec.sgid_count = p_mpr_req->sgid_count; >> multipath_rec.dgid_count = p_mpr_req->dgid_count; >> diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c >> index 5dd3955..cf8cfab 100644 >> --- a/opensm/opensm/osm_helper.c >> +++ 
b/opensm/opensm/osm_helper.c >> @@ -1131,29 +1131,30 @@ osm_dump_multipath_record(IN osm_log_t * const p_log, >> "\t\t\t\ttclass..................0x%X\n" >> "\t\t\t\tnum_path_revers.........0x%X\n" >> "\t\t\t\tpkey....................0x%X\n" >> - "\t\t\t\tresv0...................0x%X\n" >> + "\t\t\t\tqos_class...............0x%X\n" >> "\t\t\t\tsl......................0x%X\n" >> "\t\t\t\tmtu.....................0x%X\n" >> "\t\t\t\trate....................0x%X\n" >> "\t\t\t\tpkt_life................0x%X\n" >> - "\t\t\t\tresv1...................0x%X\n" >> "\t\t\t\tindependence............0x%X\n" >> "\t\t\t\tsgid_count..............0x%X\n" >> "\t\t\t\tdgid_count..............0x%X\n" >> + "\t\t\t\tservice_id..............0x%016" PRIx64 "\n" >> "%s\n" >> "", >> cl_ntoh32(p_mpr->hop_flow_raw), >> p_mpr->tclass, >> p_mpr->num_path, >> cl_ntoh16(p_mpr->pkey), >> - p_mpr->resv0, >> - cl_ntoh16(p_mpr->sl), >> + ib_multipath_rec_qos_class(p_mpr), >> + ib_multipath_rec_sl(p_mpr), >> p_mpr->mtu, >> p_mpr->rate, >> p_mpr->pkt_life, >> - p_mpr->resv1, >> p_mpr->independence, >> - p_mpr->sgid_count, p_mpr->dgid_count, buf_line); >> + p_mpr->sgid_count, p_mpr->dgid_count, >> + ib_multipath_rec_service_id(p_mpr), > > It returns serveice_id in network byte order. Should cl_ntoh64() be > here? Right. Actually, this error is here because ib_multipath_rec_service_id() was originally returning Service ID in host order, and then I changed it to network order. No particular reason - couldn't decide which one to choose. What do you think? -- Yevgeny > Sasha > >> + buf_line); >> } >> } >> >> -- >> 1.5.1.4 >> >> > From mst at dev.mellanox.co.il Tue Sep 4 12:56:51 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Tue, 4 Sep 2007 22:56:51 +0300
Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To: <20070904193547.GI4472@obsidianresearch.com>
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com>
Message-ID: <20070904195651.GL28350@mellanox.co.il>

> Quoting Jason Gunthorpe :
> Subject: Re: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support
>
> On Tue, Sep 04, 2007 at 08:48:43PM +0300, Michael S. Tsirkin wrote:
> > > Aren't you already summing all UD packets (including multicast) in the
> > > driver before sending? Though, to be honest, I don't see this in your
> > > patch, so maybe not..
> >
> > no
>
> Yuk. Sending invalid UD packets is horrible.

I don't understand where you get malformed UD packets.

> > So it truly is all or
> > nothing. Every gateway, embedded IP device, etc must support this or
> > you cannot use it..
> >
> > > > > What if the RXing Linux IB side is acting as a forwarder to ethernet?
> > > > > It will forward corrupt packets if this option is set, right?
> > > >
> > > > No. The checksum will be calculated by the gateway before being sent on the
> > > > ethernet interface.
> > >
> > > I thought linux only recomputed the checksum on the forwarding path if
> > > the skb was marked as needing checksum. Since you set the skb as
> > > already summed, there should be cases where invalid packets will be
> > > forwarded..
> >
> > I don't set CHECKSUM_UNNECESSARY, so linux will have to recompute the checksum.
>
> Eh? You set IPOIB_HEADER_F_HWCSUM on the TX path if the csum is
> invalid

If I get CHECKSUM_NONE, I really expect there's no checksum.
So ...

> and then test that on the RX path to set CHECKSUM_UNNECESSARY.
> So, all badly checksummed packets have CHECKSUM_UNNECESSARY set.

Hmm. What badly checksummed packets?

> > > I'd be surprised if a real, hardware, IPoIB to XX device recomputed
> > > the checksum unconditionally. The typical approach is to do an
> > > incremental update of the checksum if you are changing the packet
> > > headers. This preserves the end-to-end-ness and also does not require
> > > buffering the entire packet before updating it.
>
> I was talking about real, existing HW gateways here, not Linux. I
> don't know of any reason a gateway would even touch the payload and
> require an L4 checksum update, let alone doing it non-incrementally..

The IPoIB side that gets a packet with the hw csum bit set is required
to calculate the checksum. I don't think there's a ton of
non-linux HW gateways from ipoib to ethernet, but yes, this
spec change does require all ipoib nodes to participate.

--
MST

From jlentini at netapp.com Tue Sep 4 13:02:44 2007
From: jlentini at netapp.com (James Lentini)
Date: Tue, 4 Sep 2007 16:02:44 -0400 (EDT)
Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To: <20070904194940.GK28350@mellanox.co.il>
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904194940.GK28350@mellanox.co.il>
Message-ID:

On Tue, 4 Sep 2007, Michael S. Tsirkin wrote:

> > Quoting James Lentini :
> > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support
> >
> > On Tue, 4 Sep 2007, Michael S. Tsirkin wrote:
> >
> > > Add module option hw_csum: when set, IPoIB will report S/G
> > > support, and rely on hardware end-to-end transport checksum (ICRC)
> > > instead of software-level protocol checksums.
> >
> > The purpose of this option would be clearer if the parameter name were
> > "omit_csum". Calling this "HW checksum" support is misleading because
> > the term is already used to describe network adapters that calculate
> > TCP/IP checksums in hardware.
I realize that you are using the HW > > checksum infrastructure to implement this, but it is really not the > > same thing. > > Another reason is that I declare HW_CSUM in the netdev > feature list. Yea, someone might get confused, > but "omit checksum" is misleading, too, and is likely to > scare users away from the feature: the need for end-to-end checksum > is a widely recognised requirement. I agree. Since this isn't an end-to-end checksum, I recommend that be made clear to the user. > So I don't have a better name. Hopefully modinfo documents the > option well enough. > > > > Since this will not inter-operate with older IPoIB modules, this > > > option is off by default. > > > > > > Signed-off-by: Michael S. Tsirkin > > > > Does the S/G support need to be tied to the checksum changes? Can you separate the S/G support and checksum changes into different patches? From sashak at voltaire.com Tue Sep 4 13:18:05 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Sep 2007 23:18:05 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS - adding new QoS fields to MultiPathRecord In-Reply-To: <46DDB7DD.4080909@dev.mellanox.co.il> References: <46DD76F3.6020007@dev.mellanox.co.il> <20070904163204.GG23670@sashak.voltaire.com> <46DDB7DD.4080909@dev.mellanox.co.il> Message-ID: <20070904201805.GH23670@sashak.voltaire.com> On 22:54 Tue 04 Sep , Yevgeny Kliteynik wrote: > >> diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c > >> index 5dd3955..cf8cfab 100644 > >> --- a/opensm/opensm/osm_helper.c > >> +++ b/opensm/opensm/osm_helper.c > >> @@ -1131,29 +1131,30 @@ osm_dump_multipath_record(IN osm_log_t * const > >> p_log, > >> "\t\t\t\ttclass..................0x%X\n" > >> "\t\t\t\tnum_path_revers.........0x%X\n" > >> "\t\t\t\tpkey....................0x%X\n" > >> - "\t\t\t\tresv0...................0x%X\n" > >> + "\t\t\t\tqos_class...............0x%X\n" > >> "\t\t\t\tsl......................0x%X\n" > >> "\t\t\t\tmtu.....................0x%X\n" > >> 
"\t\t\t\trate....................0x%X\n" > >> "\t\t\t\tpkt_life................0x%X\n" > >> - "\t\t\t\tresv1...................0x%X\n" > >> "\t\t\t\tindependence............0x%X\n" > >> "\t\t\t\tsgid_count..............0x%X\n" > >> "\t\t\t\tdgid_count..............0x%X\n" > >> + "\t\t\t\tservice_id..............0x%016" PRIx64 "\n" > >> "%s\n" > >> "", > >> cl_ntoh32(p_mpr->hop_flow_raw), > >> p_mpr->tclass, > >> p_mpr->num_path, > >> cl_ntoh16(p_mpr->pkey), > >> - p_mpr->resv0, > >> - cl_ntoh16(p_mpr->sl), > >> + ib_multipath_rec_qos_class(p_mpr), > >> + ib_multipath_rec_sl(p_mpr), > >> p_mpr->mtu, > >> p_mpr->rate, > >> p_mpr->pkt_life, > >> - p_mpr->resv1, > >> p_mpr->independence, > >> - p_mpr->sgid_count, p_mpr->dgid_count, buf_line); > >> + p_mpr->sgid_count, p_mpr->dgid_count, > >> + ib_multipath_rec_service_id(p_mpr), > > It returns serveice_id in network byte order. Should cl_ntoh64() be > > here? > > Right. > Actually, this error is here because ib_multipath_rec_service_id() was > originally returning Service ID in host order, and then I changed it to > network order. So are you changing return type to ib_net64_t (instead of uint64_t) too? > No particular reason - couldn't decide which one to choose. > What do you think? Another ib_*() functions return values in MAD order (network). 
Sasha From kliteyn at dev.mellanox.co.il Tue Sep 4 13:18:18 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 04 Sep 2007 23:18:18 +0300 Subject: [ofa-general] [PATCH] osm: QoS - fixing access to ServiceID field of MultiPathRecord Message-ID: <46DDBD8A.7030008@dev.mellanox.co.il> Signed-off-by: Yevgeny Kliteynik --- opensm/include/iba/ib_types.h | 2 +- opensm/opensm/osm_helper.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 13e2f38..0969755 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -6299,7 +6299,7 @@ ib_multipath_rec_pkt_life_sel(IN const ib_multipath_rec_t * const p_rec) * * SYNOPSIS */ -static inline uint64_t OSM_API +static inline ib_net64_t OSM_API ib_multipath_rec_service_id(IN const ib_multipath_rec_t * const p_rec) { union { diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index cf8cfab..b765e19 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -1153,7 +1153,7 @@ osm_dump_multipath_record(IN osm_log_t * const p_log, p_mpr->pkt_life, p_mpr->independence, p_mpr->sgid_count, p_mpr->dgid_count, - ib_multipath_rec_service_id(p_mpr), + cl_ntoh64(ib_multipath_rec_service_id(p_mpr)), buf_line); } } -- 1.5.1.4 From sashak at voltaire.com Tue Sep 4 13:36:21 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Sep 2007 23:36:21 +0300 Subject: [ofa-general] Re: [opensm] bugs in build system In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com> Message-ID: <20070904203621.GI23670@sashak.voltaire.com> Hi again, Eitan, On 17:02 Sun 02 Sep , Eitan Zahavi wrote: > Hi Sasha, > > For some reason OpenSM (and the required management libs) do not build > correctly when > I use manual autogen.sh, configure --prefix=/tmp/ez/usr ; make; make > install mode. 
> > It seems the build system is probably broken as it relies on fixed > paths? It is not, but it relies to invalid paths like -I.../include/infiniband when in the code '#include ' is used. > OK 3. cd management/libibumad; autogen.sh; > FAIL 4. ./configure --prefix=/tmp/ez/usr > checking for sys_read_string in -libcommon... no > configure: error: sys_read_string() not found. libibumad requires > libibcommon. > > To overcome this I manually added the --disable-libcheck > ./configure --prefix=/tmp/ez/usr --disable-libcheck > I do not understand why after installing the common lib I still get this > error? > Isn't the search path should include the /lib ??? Seems it is AC_CHECK_LIB() feature (ugh - I hate autotools mess :)) I'm not really sure such checks should be there. libibcommon library is part of our project and not "external" library. > FAIL 5. make > Make fails as it does not find the infiniband/common.h Wrong include path in Makefile.am - it uses include/infiniband. > To overcome this I manually added -I/include .... > make CFLAGS="-I/tmp/ez/usr/include" > > OK 6. make install > --------------- OPENSM ------------------ > OK 7. cd management/opensm; autogen.sh; > FAIL 8. configure --prefix=/tmp/ez/usr > checking for umad_init in -libumad... no > configure: error: umad_init() not found. libosmvendor of type openib > requires libibumad. > configure: error: /bin/sh './configure' failed for libvendor > > To overcome this I manually added the --disable-libcheck > ./configure --prefix=/tmp/ez/usr --disable-libcheck > This problem is same as the above: lib path for linking should use the > /lib. > > FAIL 9. make > Here again the include path is missing the /include: > > ./../include/vendor/osm_vendor_ibumad.h:44:31: infiniband/common.h: No > such file or directory > ./../include/vendor/osm_vendor_ibumad.h:45:29: infiniband/umad.h: No > such file or directory Wrong OSMV_INCLUDES definition (it uses paths include/infiniband ). 
> To overcome this I manually added -I/include .... > make CFLAGS="-I/tmp/ez/usr/include" > > But this is not enough as the linker fail: > /usr/bin/ld: cannot find -libumad It seems to be buggy opensm_LDADD in Makefile.am > To overcome this I had to add -L/lib .... > make CFLAGS="-I/tmp/ez/usr/include" LDFLAGS="-L/tmp/ez/usr/lib -libumad > -libcommon" > > OK 10. make install > > I hope the above issues could be fixed such that the installation would > be simpler. Could you test the patch please (you still need to use '--disable-libcheck' with ./configure)? Thanks. Sasha diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am index 48868e7..7e82590 100644 --- a/libibumad/Makefile.am +++ b/libibumad/Makefile.am @@ -2,7 +2,7 @@ SUBDIRS = . INCLUDES = -I$(srcdir)/include/infiniband \ - -I$(srcdir)/../libibcommon/include/infiniband + -I$(srcdir)/../libibcommon/include man_MANS = man/umad_debug.3 man/umad_get_ca.3 \ man/umad_get_ca_portguids.3 man/umad_get_cas_names.3 \ diff --git a/opensm/config/osmvsel.m4 b/opensm/config/osmvsel.m4 index 47ad36f..97d5a9e 100644 --- a/opensm/config/osmvsel.m4 +++ b/opensm/config/osmvsel.m4 @@ -61,11 +61,11 @@ with_sim="/usr") dnl based on the with_osmv we can try the vendor flag if test $with_osmv = "openib"; then OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" - OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" - if test "x$with_umad_libs" = "x"; then - OSMV_LDADD="-libumad" - else - OSMV_LDADD="-L$with_umad_libs -libumad" + OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include -I\$(srcdir)/../../libibumad/include" + OSMV_LDADD="-L\$(libdir) -libumad -libcommon" + + if test "x$with_umad_libs" != "x"; then + OSMV_LDADD="-L$with_umad_libs $OSMV_LDADD" fi if test "x$with_umad_includes" != "x"; then From sashak at voltaire.com Tue Sep 4 13:37:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 4 Sep 2007 23:37:51 +0300 
Subject: [ofa-general] Re: [PATCH] osm: QoS - fixing access to ServiceID field of MultiPathRecord In-Reply-To: <46DDBD8A.7030008@dev.mellanox.co.il> References: <46DDBD8A.7030008@dev.mellanox.co.il> Message-ID: <20070904203751.GJ23670@sashak.voltaire.com> On 23:18 Tue 04 Sep , Yevgeny Kliteynik wrote: > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From jgunthorpe at obsidianresearch.com Tue Sep 4 13:29:07 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 4 Sep 2007 14:29:07 -0600 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904195651.GL28350@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> <20070904195651.GL28350@mellanox.co.il> Message-ID: <20070904202907.GJ4472@obsidianresearch.com> On Tue, Sep 04, 2007 at 10:56:51PM +0300, Michael S. Tsirkin wrote: > > Yuk. Sending invalid UD packets is horrible. > > I don't understand where you get malformed UD packets. Any packet you put on the network with an invalid L4 csum is what I would call malformed. Any conformant implementation that Rx's those packets will reject them. > > Eh? You set IPOIB_HEADER_F_HWCSUM on the TX path if the csum is > > invalid > > If I get CHECKSUM_NONE, I really expect there's no checksum. > So ... Am I reading this wrong? 
@@ -782,7 +785,10 @@ static int ipoib_hard_header(struct sk_buff *skb, header = (struct ipoib_header *) skb_push(skb, sizeof *header); header->proto = htons(type); - header->reserved = 0; + if (skb->ip_summed == CHECKSUM_COMPLETE) + header->flags = 0; + else + header->flags = cpu_to_be16(IPOIB_HEADER_F_HWCSUM); The two relevant cases here are CHECKSUM_NONE (the csum is not present, or present but valid) and CHECKSUM_PARTIAL (the csum is not valid, but is present). In both cases you put the packet onto the wire with IPOIB_HEADER_F_HWCSUM and in the CHECKSUM_PARTIAL case the packet has an invalid checksum on the wire. Then on the RX side you recover it, mark it as CHECKSUM_UNNECESSARY which tells the stack not to check the L4 checksum. But it is still there and still 'badly csumed' (i.e. invalid). On the RX side, after going through ip_forward, the TX'ing ethernet driver sees ip_summed != CHECKSUM_PARTIAL and does nothing, propagating the bad L4 checksum onto the ethernet side. Jason From sean.hefty at intel.com Tue Sep 4 14:34:26 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 4 Sep 2007 14:34:26 -0700 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] for 2.6.24: ib: QoS support Message-ID: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> The following patch series adds QoS support to the host stack based on the IB QoS annex. I believe that all feedback from v1 has been incorporated, such as adding the SID to the PR query. These patches target 2.6.24 and OFED 1.3. I have NOT tested these patches against a QoS-compliant SM. If someone has this setup and can test it, that would be great. Otherwise, I will be trying to set up OpenSM to do this, but it will take me some time. 
Signed-off-by: Sean Hefty From sean.hefty at intel.com Tue Sep 4 14:36:45 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 4 Sep 2007 14:36:45 -0700 Subject: [ofa-general] [RFC] [PATCH 1/5 v2] ib/ipoib: specify Traffic Class with PR queries for QoS support In-Reply-To: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> Message-ID: <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> To support QoS within and between subnets, modify IPoIB to request specific Traffic Class values with path record queries, using the value associated with the IPoIB broadcast group. Signed-off-by: Sean Hefty --- drivers/infiniband/ulp/ipoib/ipoib.h | 22 +++++++++++++++++++++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 ++++--- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 22 ---------------------- 3 files changed, 25 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 285c143..fc16bce 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -113,7 +113,27 @@ struct ipoib_pseudoheader { u8 hwaddr[INFINIBAND_ALEN]; }; -struct ipoib_mcast; +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ib_sa_multicast *mc; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; struct ipoib_rx_buf { struct sk_buff *skb; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 894b1dc..a4a8cbc 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -468,9 +468,10 @@ static struct ipoib_path 
*path_rec_create(struct net_device *dev, void *gid) INIT_LIST_HEAD(&path->neigh_list); memcpy(path->pathrec.dgid.raw, gid, sizeof (union ib_gid)); - path->pathrec.sgid = priv->local_gid; - path->pathrec.pkey = cpu_to_be16(priv->pkey); - path->pathrec.numb_path = 1; + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + path->pathrec.traffic_class = priv->broadcast->mcmember.traffic_class; return path; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index aae3670..94a5709 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -57,28 +57,6 @@ MODULE_PARM_DESC(mcast_debug_level, static DEFINE_MUTEX(mcast_mutex); -/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ -struct ipoib_mcast { - struct ib_sa_mcmember_rec mcmember; - struct ib_sa_multicast *mc; - struct ipoib_ah *ah; - - struct rb_node rb_node; - struct list_head list; - - unsigned long created; - unsigned long backoff; - - unsigned long flags; - unsigned char logcount; - - struct list_head neigh_list; - - struct sk_buff_head pkt_queue; - - struct net_device *dev; -}; - struct ipoib_mcast_iter { struct net_device *dev; union ib_gid mgid; From sean.hefty at intel.com Tue Sep 4 14:37:39 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 4 Sep 2007 14:37:39 -0700 Subject: [ofa-general] [RFC] [PATCH 2/5 v2] ib/sa: add new QoS fields to path record In-Reply-To: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> Message-ID: <000701c7ef3b$d16562e0$3c98070a@amr.corp.intel.com> The QoS annex defines new fields for path records. Add them to the ib_sa for consumers that want to use them. 
Signed-off-by: Sean Hefty --- drivers/infiniband/core/sa_query.c | 10 +++------- include/rdma/ib_sa.h | 11 +++++------ 2 files changed, 8 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index d271bd7..6f56bb5 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -123,14 +123,10 @@ static u32 tid; .field_name = "sa_path_rec:" #field static const struct ib_field path_rec_table[] = { - { RESERVED, + { PATH_REC_FIELD(service_id), .offset_words = 0, .offset_bits = 0, - .size_bits = 32 }, - { RESERVED, - .offset_words = 1, - .offset_bits = 0, - .size_bits = 32 }, + .size_bits = 64 }, { PATH_REC_FIELD(dgid), .offset_words = 2, .offset_bits = 0, @@ -179,7 +175,7 @@ static const struct ib_field path_rec_table[] = { .offset_words = 12, .offset_bits = 16, .size_bits = 16 }, - { RESERVED, + { PATH_REC_FIELD(qos_class), .offset_words = 13, .offset_bits = 0, .size_bits = 12 }, diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 5e26b2f..942692b 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -109,8 +109,8 @@ enum ib_sa_selector { * Reserved rows are indicated with comments to help maintainability. 
*/ -/* reserved: 0 */ -/* reserved: 1 */ +#define IB_SA_PATH_REC_SERVICE_ID (IB_SA_COMP_MASK( 0) |\ + IB_SA_COMP_MASK( 1)) #define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) #define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) #define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) @@ -123,7 +123,7 @@ enum ib_sa_selector { #define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) #define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) #define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) -/* reserved: 14 */ +#define IB_SA_PATH_REC_QOS_CLASS IB_SA_COMP_MASK(14) #define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) #define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) #define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) @@ -134,8 +134,7 @@ enum ib_sa_selector { #define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) struct ib_sa_path_rec { - /* reserved */ - /* reserved */ + __be64 service_id; union ib_gid dgid; union ib_gid sgid; __be16 dlid; @@ -148,7 +147,7 @@ struct ib_sa_path_rec { int reversible; u8 numb_path; __be16 pkey; - /* reserved */ + __be16 qos_class; u8 sl; u8 mtu_selector; u8 mtu; From sean.hefty at intel.com Tue Sep 4 14:39:46 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 4 Sep 2007 14:39:46 -0700 Subject: [ofa-general] [RFC] [PATCH 4/5 v2] rdma/ucm: export setting service type to user space In-Reply-To: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> Message-ID: <000901c7ef3c$1cef27a0$3c98070a@amr.corp.intel.com> Export the ability to set the type of service to user space. Model after setsockopt. 
Signed-off-by: Sean Hefty --- drivers/infiniband/core/ucma.c | 74 +++++++++++++++++++++++++++++++++++++++- include/rdma/rdma_user_cm.h | 18 ++++++++++ 2 files changed, 91 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 53b4c94..90d675a 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -792,6 +792,78 @@ out: return ret; } +static int ucma_set_option_id(struct ucma_context *ctx, int optname, + void *optval, size_t optlen) +{ + int ret = 0; + + switch (optname) { + case RDMA_OPTION_ID_TOS: + if (optlen != sizeof(u8)) { + ret = -EINVAL; + break; + } + rdma_set_service_type(ctx->cm_id, *((u8 *) optval)); + break; + default: + ret = -ENOSYS; + } + + return ret; +} + +static int ucma_set_option_level(struct ucma_context *ctx, int level, + int optname, void *optval, size_t optlen) +{ + int ret; + + switch (level) { + case RDMA_OPTION_ID: + ret = ucma_set_option_id(ctx, optname, optval, optlen); + break; + default: + ret = -ENOSYS; + } + + return ret; +} + +static ssize_t ucma_set_option(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_set_option cmd; + struct ucma_context *ctx; + void *optval; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + optval = kmalloc(cmd.optlen, GFP_KERNEL); + if (!optval) { + ret = -ENOMEM; + goto out1; + } + + if (copy_from_user(optval, (void __user *) (unsigned long) cmd.optval, + cmd.optlen)) { + ret = -EFAULT; + goto out2; + } + + ret = ucma_set_option_level(ctx, cmd.level, cmd.optname, optval, + cmd.optlen); +out2: + kfree(optval); +out1: + ucma_put_ctx(ctx); + return ret; +} + static ssize_t ucma_notify(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { @@ -936,7 +1008,7 @@ static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, 
[RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, [RDMA_USER_CM_CMD_GET_OPTION] = NULL, - [RDMA_USER_CM_CMD_SET_OPTION] = NULL, + [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option, [RDMA_USER_CM_CMD_NOTIFY] = ucma_notify, [RDMA_USER_CM_CMD_JOIN_MCAST] = ucma_join_multicast, [RDMA_USER_CM_CMD_LEAVE_MCAST] = ucma_leave_multicast, diff --git a/include/rdma/rdma_user_cm.h b/include/rdma/rdma_user_cm.h index f632b0c..9749c1b 100644 --- a/include/rdma/rdma_user_cm.h +++ b/include/rdma/rdma_user_cm.h @@ -212,4 +212,22 @@ struct rdma_ucm_event_resp { } param; }; +/* Option levels */ +enum { + RDMA_OPTION_ID = 0 +}; + +/* Option details */ +enum { + RDMA_OPTION_ID_TOS = 0 +}; + +struct rdma_ucm_set_option { + __u64 optval; + __u32 id; + __u32 level; + __u32 optname; + __u32 optlen; +}; + #endif /* RDMA_USER_CM_H */ From sean.hefty at intel.com Tue Sep 4 14:40:26 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 4 Sep 2007 14:40:26 -0700 Subject: [ofa-general] [RFC] [PATCH 5/5 v2] ib/srp: add QoS support through service ID In-Reply-To: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> Message-ID: <000a01c7ef3c$34a9d4d0$3c98070a@amr.corp.intel.com> Provide the target service ID when performing a path record query to support optional QoS capability. QoS requires support from the SA. 
Signed-off-by: Sean Hefty --- drivers/infiniband/ulp/srp/ib_srp.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index f6a0514..9ccc638 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -285,6 +285,7 @@ static int srp_lookup_path(struct srp_target_port *target) target->srp_host->dev->dev, target->srp_host->port, &target->path, + IB_SA_PATH_REC_SERVICE_ID | IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | @@ -1692,6 +1693,7 @@ static int srp_parse_options(const char *buf, struct srp_target_port *target) goto out; } target->service_id = cpu_to_be64(simple_strtoull(p, NULL, 16)); + target->path.service_id = target->service_id; kfree(p); break; From sean.hefty at intel.com Tue Sep 4 14:38:28 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 4 Sep 2007 14:38:28 -0700 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] rdma/cm: add ability to specify type of service In-Reply-To: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> Message-ID: <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> Provide support to specify a type of service for a communication identifier. A new function call is used when dealing with IPv4 addresses. For IPv6 addresses, the ToS is specified through the traffic class field in the sockaddr_in6 structure. 
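[Editor's sketch, not part of the original patch.] For the IPv6 case described above, the traffic class sits in bits 20..27 of the host-order sin6_flowinfo word, with the flow label in the low 20 bits. The user-space snippet below mirrors the patch's `(u8) (be32_to_cpu(sin6->sin6_flowinfo) >> 20)` expression; the helper names `sin6_traffic_class` and `make_flowinfo` are hypothetical and exist only for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/* Hypothetical helper: extract the 8-bit traffic class from a
 * sockaddr_in6. sin6_flowinfo is kept in network byte order, so it is
 * converted to host order first; the cast to uint8_t drops anything
 * above bit 27, as the (u8) cast in the patch does. */
static uint8_t sin6_traffic_class(const struct sockaddr_in6 *sin6)
{
    return (uint8_t)(ntohl(sin6->sin6_flowinfo) >> 20);
}

/* Hypothetical helper: build a network-order flowinfo word from a
 * traffic class and a 20-bit flow label. */
static uint32_t make_flowinfo(uint8_t tclass, uint32_t flow_label)
{
    return htonl(((uint32_t)tclass << 20) | (flow_label & 0xFFFFFu));
}
```

A caller would set `sin6.sin6_flowinfo = make_flowinfo(tclass, label)` on the source address before resolving the route; whether a given SA honours the resulting traffic class is outside the scope of this sketch.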
Signed-off-by: Sean Hefty --- drivers/infiniband/core/cma.c | 44 ++++++++++++++++++++++++++++++++--------- include/rdma/rdma_cm.h | 14 +++++++++++++ 2 files changed, 48 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 9ffb998..19c9172 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -138,6 +138,7 @@ struct rdma_id_private { u32 qkey; u32 qp_num; u8 srq; + u8 tos; }; struct cma_multicast { @@ -1474,6 +1475,15 @@ err: } EXPORT_SYMBOL(rdma_listen); +void rdma_set_service_type(struct rdma_cm_id *id, int tos) +{ + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->tos = (u8) tos; +} +EXPORT_SYMBOL(rdma_set_service_type); + static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, void *context) { @@ -1498,23 +1508,37 @@ static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms, struct cma_work *work) { - struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; + struct rdma_addr *addr = &id_priv->id.route.addr; struct ib_sa_path_rec path_rec; + ib_sa_comp_mask comp_mask; + struct sockaddr_in6 *sin6; memset(&path_rec, 0, sizeof path_rec); - ib_addr_get_sgid(addr, &path_rec.sgid); - ib_addr_get_dgid(addr, &path_rec.dgid); - path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); + ib_addr_get_sgid(&addr->dev_addr, &path_rec.sgid); + ib_addr_get_dgid(&addr->dev_addr, &path_rec.dgid); + path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(&addr->dev_addr)); path_rec.numb_path = 1; path_rec.reversible = 1; + path_rec.service_id = cma_get_service_id(id_priv->id.ps, &addr->dst_addr); + + comp_mask = IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_REVERSIBLE | IB_SA_PATH_REC_SERVICE_ID; + + if (addr->src_addr.sa_family == AF_INET) { + path_rec.qos_class = 
cpu_to_be16((u16) id_priv->tos); + comp_mask |= IB_SA_PATH_REC_QOS_CLASS; + } else { + sin6 = (struct sockaddr_in6 *) &addr->src_addr; + path_rec.traffic_class = (u8) (be32_to_cpu(sin6->sin6_flowinfo) >> 20); + comp_mask |= IB_SA_PATH_REC_TRAFFIC_CLASS; + } id_priv->query_id = ib_sa_path_rec_get(&sa_client, id_priv->id.device, - id_priv->id.port_num, &path_rec, - IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | - IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH | - IB_SA_PATH_REC_REVERSIBLE, - timeout_ms, GFP_KERNEL, - cma_query_handler, work, &id_priv->query); + id_priv->id.port_num, &path_rec, + comp_mask, timeout_ms, + GFP_KERNEL, cma_query_handler, + work, &id_priv->query); return (id_priv->query_id < 0) ? id_priv->query_id : 0; } diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 2d6a770..010f876 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -314,4 +314,18 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, */ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr); +/** + * rdma_set_service_type - Set the type of service associated with a + * connection identifier. + * @id: Communication identifier to associated with service type. + * @tos: Type of service. + * + * The type of service is interpretted as a differentiated service + * field (RFC 2474). The service type should be specified before + * performing route resolution, as existing communication on the + * connection identifier may be unaffected. The type of service + * requested may not be supported by the network to all destinations. 
+ */ +void rdma_set_service_type(struct rdma_cm_id *id, int tos); + #endif /* RDMA_CM_H */ From becker at nas.nasa.gov Tue Sep 4 16:14:34 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Tue, 4 Sep 2007 16:14:34 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> Message-ID: <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> Hi all. I have a first cut. If you view "http://www.openfabrics.org/listdir.php" in your browser, all the download directories are given as links, and I list the contents of WEB_README if it exists. Please let me know what you think. Thanks. -jeff On 8/8/07, Jeff Becker wrote: > Hi. I created most of the requested directory/owner pairs in > /var/www/openfabrics.org/downloads. I left out the various MPI > directories, figuring the appropriate web pages will be linked from > somewhere (possibly the downloads web page). I gave Stan Smith an > account. Stan, please contact me to get the account info. > > I'm still working out how to do the dynamic web page stuff, but at > least people can start populating their directories. > > Thanks. > > -jeff > > On 7/25/07, Arlin Davis wrote: > > > > > I would like to propose adding project directories under > > > http://www.openfabrics.org/downloads/ where appropriate and give > > > maintainers access. 
For example: > > > > > Jeff, please add the following directories with maintainer access as > > follow (or grant access at a maintainer group level): > > > > http://www.openfabrics.org/downloads/verbs (rdreier) > > http://www.openfabrics.org/downloads/rdmacm (shefty) > > http://www.openfabrics.org/downloads/dapl (ardavis) > > http://www.openfabrics.org/downloads/sdp (eitan) > > http://www.openfabrics.org/downloads/utils (eitan) > > http://www.openfabrics.org/downloads/management (sashak) > > http://www.openfabrics.org/downloads/OFED (vlad) > > http://www.openfabrics.org/downloads/archives (vlad) > > http://www.openfabrics.org/downloads/WinOF (ssmith) (Stan Smith will > > need an account) > > http://www.openfabrics.org/downloads/hw/mthca (rdreir) > > http://www.openfabrics.org/downloads/hw/mlx4 (rdreir) > > http://www.openfabrics.org/downloads/hw/ehca (raisch) > > http://www.openfabrics.org/downloads/hw/ipath (ralphc) > > http://www.openfabrics.org/downloads/hw/cxgb3 (ralphc) > > http://www.openfabrics.org/downloads/mpi/mvapich (pasha) > > http://www.openfabrics.org/downloads/mpi/mvapich2 (rowland) > > http://www.openfabrics.org/downloads/mpi/openmpi (jsquyres) > > > > Let us know when these directories are created and the maintainers, who > > want to expose their packages via the webpage, will create a README that > > details the contents of the directory along with WEB_README that > > provides a short description for the webpage. > > > > Will this format allow you to auto configure the download webpage > > sufficiently? The idea is to only add links/descriptions to those > > project sub-directories with WEB_README files present. > > > > Please advise if something on the list is wrong or we missed a project. 
> > > > Thanks, > > > > -arlin > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From kilian at stanford.edu Tue Sep 4 16:25:46 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Tue, 4 Sep 2007 16:25:46 -0700 Subject: [ofa-general] Build rpms kernel 2.6.5-7.283 In-Reply-To: References: Message-ID: <200709041625.46535.kilian@stanford.edu> On Tuesday 04 September 2007 08:51:39 am H.N.HARAKE wrote: > ./configure: line 153: patch: command not found You need to install the 'patch' utility. Cheers, -- Kilian From mshefty at ichips.intel.com Tue Sep 4 17:54:53 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 04 Sep 2007 17:54:53 -0700 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <46D78104.mailJY81GRONO@systemfabricworks.com> References: <46D78104.mailJY81GRONO@systemfabricworks.com> Message-ID: <46DDFE5D.9090203@ichips.intel.com> > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > if (port_priv) { > mad_priv->mad.mad.mad_hdr.tid = > ((struct ib_mad *)smp)->mad_hdr.tid; > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); I'm having a hard time understanding the impact of this change. If I'm reading the code correctly, mad_priv->mad should contain the response from the device process_mad() routine. This changes that response. Can you provide more details describing the effect this change has on the existing behavior? Also, I think we can eliminate setting the tid, since the memcpy will set that as well. 
> recv_mad_agent = find_mad_agent(port_priv, > &mad_priv->mad.mad); > } > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > index 1cfc298..d96fc8e 100644 > --- a/drivers/infiniband/core/smi.h > +++ b/drivers/infiniband/core/smi.h > @@ -71,4 +71,18 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > (smp->hop_ptr == smp->hop_cnt + 1)) ? > IB_SMI_HANDLE : IB_SMI_DISCARD); > } > + > +/* > + * Return 1 if the SMP response should be handled by the local management stack > + */ The comment is off here - return IB_SMI_HANDLE. (It's off for smi_check_local_smp() as well.) > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp *smp, > + struct ib_device *device) > +{ > + /* C14-13:3 -- We're at the end of the DR segment of path */ > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > + return ((device->process_mad && > + ib_get_smp_direction(smp) && > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > +} - Sean From swelch at systemfabricworks.com Tue Sep 4 19:40:29 2007 From: swelch at systemfabricworks.com (Steve Welch) Date: Tue, 4 Sep 2007 21:40:29 -0500 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <46DDFE5D.9090203@ichips.intel.com> References: <46D78104.mailJY81GRONO@systemfabricworks.com> <46DDFE5D.9090203@ichips.intel.com> Message-ID: <000b01c7ef66$1f929030$a865a8c0@catcher> Hi Sean, > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, September 04, 2007 7:55 PM > To: swelch at systemfabricworks.com > Cc: general at lists.openfabrics.org; sean.hefty at intel.com > Subject: Re: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR > SMP responses from userspace > > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct > ib_mad_agent_private *mad_agent_priv, > > if (port_priv) { > > mad_priv->mad.mad.mad_hdr.tid = > > ((struct ib_mad *)smp)->mad_hdr.tid; > > + 
memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad));
>
> I'm having a hard time understanding the impact of this change. If I'm
> reading the code correctly, mad_priv->mad should contain the response
> from the device process_mad() routine. This changes that response. Can
> you provide more details describing the effect this change has on the
> existing behavior?

The new code is executed when the device-specific process_mad function returns only IB_MAD_RESULT_SUCCESS in the status bitmask. Since IB_MAD_RESULT_REPLY is not also set, the device is indicating it did not create a response, and mad_priv->mad should be as it was before the process_mad call (i.e. not initialized with a response). Since IB_MAD_RESULT_CONSUMED was not set in the status bitmask, the original MAD still needs delivery and by definition goes to the local node.

Prior to this patch and the addition of the smi_check_local_resp_smp() test, the only DR SMP that could have made it to the device's process_mad call would have been a DR SMP Request targeted at the local SMA. It is possible that the device's SMA would not handle the DR SMP Request and would return only IB_MAD_RESULT_SUCCESS; however, in that case the find_mad_agent() call would still access the uninitialized mad_priv->mad.mad. For this reason I do not believe this code path was previously executed, and I believe there will be no effect on the existing behavior.

Running with these changes, the IB utilities built on top of DR SMPs continue to operate on the host, going both to the local SMA and out on the fabric to an SMA.

>
> Also, I think we can eliminate setting the tid, since the memcpy will
> set that as well.

Yes, I agree.
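The status-bitmask reasoning above can be sketched as a small classifier. This is only an illustration under stated assumptions: the flag names follow the ones used in this thread, but the numeric values, the `local_action` enum and `classify_process_mad()` are hypothetical stand-ins, not the kernel's definitions.

```c
/* Stand-in flags for illustration. The names follow the thread;
 * the numeric values are assumptions of this sketch. */
enum mad_result {
    MAD_RESULT_FAILURE  = 0,
    MAD_RESULT_SUCCESS  = 1 << 0,
    MAD_RESULT_REPLY    = 1 << 1,
    MAD_RESULT_CONSUMED = 1 << 2
};

/* Hypothetical outcomes of the local-delivery decision. */
enum local_action {
    LOCAL_FAIL,             /* process_mad reported an error */
    LOCAL_DELIVER_REPLY,    /* device built a response in the output MAD */
    LOCAL_DROP,             /* device consumed the MAD, nothing to deliver */
    LOCAL_DELIVER_ORIGINAL  /* SUCCESS only: original MAD goes to a local agent */
};

/* Map a process_mad-style result bitmask to a delivery decision. */
static enum local_action classify_process_mad(int ret)
{
    if (!(ret & MAD_RESULT_SUCCESS))
        return LOCAL_FAIL;
    if (ret & MAD_RESULT_REPLY)
        return LOCAL_DELIVER_REPLY;
    if (ret & MAD_RESULT_CONSUMED)
        return LOCAL_DROP;
    /* Only SUCCESS is set: the output buffer holds no response,
     * so the original MAD must be copied before agent lookup. */
    return LOCAL_DELIVER_ORIGINAL;
}
```

The last branch corresponds to the case the memcpy in the patch is meant to cover: only the success bit is set, so the original MAD rather than a device-built reply is what gets looked up and delivered.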
> > > recv_mad_agent = find_mad_agent(port_priv, > > &mad_priv->mad.mad); > > } > > diff --git a/drivers/infiniband/core/smi.h > b/drivers/infiniband/core/smi.h > > index 1cfc298..d96fc8e 100644 > > --- a/drivers/infiniband/core/smi.h > > +++ b/drivers/infiniband/core/smi.h > > @@ -71,4 +71,18 @@ static inline enum smi_action > smi_check_local_smp(struct ib_smp *smp, > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > } > > + > > +/* > > + * Return 1 if the SMP response should be handled by the local > management stack > > + */ > > The comment is off here - return IB_SMI_HANDLE. (It's off for > smi_check_local_smp() as well.) Yes, I agree. It appears I was a little over zealous in my header cut and paste of the existing DR SMP request local check function. > > > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp > *smp, > > + struct ib_device *device) > > +{ > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > + return ((device->process_mad && > > + ib_get_smp_direction(smp) && > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > +} > > - Sean Thanks, Steve From ssufficool at roadrunner.com Tue Sep 4 21:21:51 2007 From: ssufficool at roadrunner.com (ssufficool) Date: Tue, 04 Sep 2007 21:21:51 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> Message-ID: <1188966111.4865.8.camel@gentoo-linux.localdomain> I might recommend using HEADER and FOOTER instead of WEB_README to utilize the built in Apache directory listing support when the user selects the directory. 
http://httpd.apache.org/docs/2.0/mod/mod_autoindex.html On Tue, 2007-09-04 at 16:14 -0700, Jeff Becker wrote: > Hi all. I have a first cut. > > If you view "http://www.openfabrics.org/listdir.php" in your browser, > all the download directories are given as links, and I list the > contents of WEB_README if it exists. Please let me know what you > think. Thanks. > > -jeff > > On 8/8/07, Jeff Becker wrote: > > Hi. I created most of the requested directory/owner pairs in > > /var/www/openfabrics.org/downloads. I left out the various MPI > > directories, figuring the appropriate web pages will be linked from > > somewhere (possibly the downloads web page). I gave Stan Smith an > > account. Stan, please contact me to get the account info. > > > > I'm still working out how to do the dynamic web page stuff, but at > > least people can start populating their directories. > > > > Thanks. > > > > -jeff > > > > On 7/25/07, Arlin Davis wrote: > > > > > > > I would like to propose adding project directories under > > > > http://www.openfabrics.org/downloads/ where appropriate and give > > > > maintainers access. 
For example: > > > > > > > Jeff, please add the following directories with maintainer access as > > > follow (or grant access at a maintainer group level): > > > > > > http://www.openfabrics.org/downloads/verbs (rdreier) > > > http://www.openfabrics.org/downloads/rdmacm (shefty) > > > http://www.openfabrics.org/downloads/dapl (ardavis) > > > http://www.openfabrics.org/downloads/sdp (eitan) > > > http://www.openfabrics.org/downloads/utils (eitan) > > > http://www.openfabrics.org/downloads/management (sashak) > > > http://www.openfabrics.org/downloads/OFED (vlad) > > > http://www.openfabrics.org/downloads/archives (vlad) > > > http://www.openfabrics.org/downloads/WinOF (ssmith) (Stan Smith will > > > need an account) > > > http://www.openfabrics.org/downloads/hw/mthca (rdreir) > > > http://www.openfabrics.org/downloads/hw/mlx4 (rdreir) > > > http://www.openfabrics.org/downloads/hw/ehca (raisch) > > > http://www.openfabrics.org/downloads/hw/ipath (ralphc) > > > http://www.openfabrics.org/downloads/hw/cxgb3 (ralphc) > > > http://www.openfabrics.org/downloads/mpi/mvapich (pasha) > > > http://www.openfabrics.org/downloads/mpi/mvapich2 (rowland) > > > http://www.openfabrics.org/downloads/mpi/openmpi (jsquyres) > > > > > > Let us know when these directories are created and the maintainers, who > > > want to expose their packages via the webpage, will create a README that > > > details the contents of the directory along with WEB_README that > > > provides a short description for the webpage. > > > > > > Will this format allow you to auto configure the download webpage > > > sufficiently? The idea is to only add links/descriptions to those > > > project sub-directories with WEB_README files present. > > > > > > Please advise if something on the list is wrong or we missed a project. 
> > > > > > Thanks, > > > > > > -arlin > > > > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Tue Sep 4 22:10:40 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 5 Sep 2007 08:10:40 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904193547.GI4472@obsidianresearch.com> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> Message-ID: <20070905051040.GM28350@mellanox.co.il> > With the new changes to ip_forward, maybe you could get away with > setting CHECKSUM_PARTIAL in your RX path to get the TX of the final > output device to regenerate the L4 checksum? Good idea, the comment in linux/skbuff.h says * PARTIAL: identical to the case for output below. This may occur * on a packet received directly from another Linux OS, e.g., * a virtualised Linux kernel on the same host. The packet can * be treated in the same way as UNNECESSARY except that on * output (i.e., forwarding) the checksum must be filled in * by the OS or the hardware. > Even so, sending out malformed UD packets When you say UD, you really mean UDP, don't you? 
> strikes me as a
> compatibility killer..

Filling in PARTIAL will address that, right?

> This would be much better as a RC only
> negotiated at CM feature.

We can always go there, and it would be easy to enable this per-destination for datagram mode, too, by setting a bit in the HW address, thus enabling interoperability with IETF-compliant ipoib (at a slower rate, since we'll have to do an extra pass over the data to calculate the checksum in software).

Enabling this for multicast is where I'm stuck, since both IETF-compliant and hwcsum ipoib join the same group. One way to address all this would be to use a different signature for hwcsum ipoib multicast groups: IETF and hwcsum ipoib then just won't share the same broadcast/multicast groups, living in separate domains, as it were. Does this sound like a good idea? The annoying thing would be that this requires (fairly trivial) SM extensions.

Comments?

-- MST

From jgunthorpe at obsidianresearch.com Tue Sep 4 22:51:08 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Tue, 4 Sep 2007 23:51:08 -0600
Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To: <20070905051040.GM28350@mellanox.co.il>
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> <20070905051040.GM28350@mellanox.co.il>
Message-ID: <20070905055108.GB16535@obsidianresearch.com>

On Wed, Sep 05, 2007 at 08:10:40AM +0300, Michael S. Tsirkin wrote:
> > With the new changes to ip_forward, maybe you could get away with
> > setting CHECKSUM_PARTIAL in your RX path to get the TX of the final
> > output device to regenerate the L4 checksum?
>
> Good idea, the comment in linux/skbuff.h says

Ooh fancy, the comments are updated now :) Yes, this matches my expectation: CHECKSUM_PARTIAL should definitely be used instead of CHECKSUM_UNNECESSARY in the case of a 'known to be invalid on the wire' checksum.

> > Even so, sending out malformed UD packets
>
> When you say UD, you really mean UDP, don't you?

No, I do mean UD. I singled out UD here just because of the multicast problem, and there really seems to be no way to fix that..

To summarize for clarity, both TCP and UDP have a checksum at the top of the data payload, and IPv4 also has a header checksum. This discussion, and your optimization, is all about the L4 TCP/UDP checksum.. Linux does not offload computation of the header checksum.

If you TX a packet with ip_summed == CHECKSUM_PARTIAL without doing the hardware csum offload procedure then on-the-wire the L4 TCP/UDP checksum bytes are garbage. The stack no longer conforms to the various current RFCs -> the packet is malformed.

> > strikes me as a
> > compatibility killer..
>
> Filling in PARTIAL will address that, right?

No, I don't think so. PARTIAL will make Linux forward packets correctly between different network interfaces, but it does not address the on-the-ib-wire problem of old/new hosts interoperating. To do that you must call skb_checksum_help for CHECKSUM_PARTIAL packets in the tx path when a new host is talking to an old host.

> > This would be much better as a RC only
> > negotiated at CM feature.
>
> We can always go there, and it would be easy to
> enable this per-destination for datagram mode, too, by setting a
> bit in HW address, and thus enabling inter-operability with IETF
> compliant ipoib (at a slower rate, since we'll have to do an extra
> pass over data to calculate the checksum in software).

Right, combine this with enforcing correct checksum on all TX multicast and this looks much better.
I don't think adding a conditional call to skb_checksum_help in the ipoib tx path will not make performance any worse for those packets than it is today - dev_queue_xmit today calls skb_checksum_help on behalf of ipoib for every packet.

Also, my other thought was about the RX path, it should work more like

    if (header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM))
        ip_summed = CHECKSUM_PARTIAL // Sender says the csum is bad
    else
        if (enabled_hw_csum_support)
            ip_summed = CHECKSUM_UNNECESSARY // Sender says the csum should be good
        else
            ip_summed = CHECKSUM_NONE; // Force checking

(Of course, if the underlying hardware supports checksum offload then the hardware's calculation should just unconditionally be used on the rx path)

Tx is more like:

    header->flags = 0;
    if (ip_summed == CHECKSUM_PARTIAL)
        if (destination_is_compatible)
            header->flags = cpu_to_be16(IPOIB_HEADER_F_HWCSUM);
        else
            skb_checksum_help(skb);

(And again, if the HW supports offload, then don't bother with F_HWCSUM)

One thing I'm missing here is why care about UD csum performance? ConnectX fixes it, and using RC on older cards with something like this patch will give best possible speed. Why use an older card and UD without csumming and then care about speed? Why not make it RC only where it can be done safely and compatibly? RC already gives a good uptick over UD, so if you are concerned by speed, you are already using it - right?

Jason

From mst at dev.mellanox.co.il Tue Sep 4 23:19:13 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin)
Date: Wed, 5 Sep 2007 09:19:13 +0300
Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To: <20070905055108.GB16535@obsidianresearch.com>
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> <20070905051040.GM28350@mellanox.co.il> <20070905055108.GB16535@obsidianresearch.com>
Message-ID: <20070905061913.GN28350@mellanox.co.il>

> > We can always go there, and it would be easy to
> > enable this per-destination for datagram mode, too, by setting a
> > bit in HW address, and thus enabling inter-operability with IETF
> > compliant ipoib (at a slower rate, since we'll have to do an extra
> > pass over data to calculate the checksum in software).
>
> Right, combine this with enforcing correct checksum on all TX
> multicast and this looks much better.

OK, I'll think about it some more, but the option is definitely there.

> I don't think adding a conditional call to skb_checksum_help in the
> ipoib tx path will not make performance any worse

No, it's skb_checksum_help that will be expensive, unfortunately.

> for those packets
> than it is today - dev_queue_xmit today calls skb_checksum_help on
> behalf of ipoib for every packet.

I don't think it does, normally: the packets it gets now usually have CHECKSUM_COMPLETE.

> Also, my other thought was about the RX path, it should work more like
>
> if (header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM))
>     ip_summed = CHECKSUM_PARTIAL // Sender says the csum is bad
> else
>     if (enabled_hw_csum_support)
>         ip_summed = CHECKSUM_UNNECESSARY // Sender says the csum should be good

Hmm. Where does this last line come from? It looks wrong ...
>     else
>         ip_summed = CHECKSUM_NONE; // Force checking
>
> (Of course, if the underlying hardware supports checksum offload then
> the hardware's calculation should just unconditionally be used on the
> rx path)

Using hardware checksum offload is a separate issue. For now I'm focusing on working on top of verbs.

> Tx is more like:
>
> header->flags = 0;
> if (ip_summed == CHECKSUM_PARTIAL)
>     if (destination_is_compatible)
>         header->flags = cpu_to_be16(IPOIB_HEADER_F_HWCSUM);
>     else
>         skb_checksum_help(skb);
>
> (And again, if the HW supports offload, then don't bother with
> F_HWCSUM)

It's not that simple: F_HWCSUM is also a hint for the RX side, so it might be a win if the *remote* does not have RX checksum offloading. At least for now, I find it much easier to reason about the hwcsum feature if it behaves more or less identically for all hardware. Once it is upstream, and once Eli's connectx checksum offloading patches are, we'll try to think about doing smart tricks like using offloading with some packets while using F_HWCSUM for others.

> One thing I'm missing here is why care about UD csum performance?
> ConnectX fixes it, and using RC on older cards with something like
> this patch will give best possible speed. Why use an older card and UD
> without csumming and then care about speed? Why not make it RC only
> where it can be done safely and compatibly? RC already gives a good
> uptick over UD, so if you are concerned by speed, you are already
> using it - right?

Multicast is where we are forced to datagram mode. But yes, maybe I should ignore multicast speed for now, and say that it will get fixed by hardware offloading in the future.

Thanks for the comments.
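For reference, the RX/TX decision proposed in the quoted pseudocode can be written as two small pure functions. This is a minimal sketch under stated assumptions, not driver code: `csum_state`, `HDR_F_HWCSUM`, `rx_csum_state()`, `tx_decide()` and `struct tx_decision` are hypothetical stand-ins for the kernel's ip_summed values and the proposed IPOIB_HEADER_F_HWCSUM flag, and byte-order conversion is left out. The CSUM_UNNECESSARY branch is the one questioned earlier in this reply, so it is shown only as proposed.

```c
#include <stdbool.h>

/* Stand-ins for the kernel's ip_summed states (assumption of this sketch). */
enum csum_state { CSUM_NONE, CSUM_UNNECESSARY, CSUM_PARTIAL };

/* Hypothetical header flag bit standing in for IPOIB_HEADER_F_HWCSUM. */
#define HDR_F_HWCSUM 0x1u

/* RX side as proposed: the sender's flag says whether the L4 checksum
 * on the wire was left unfilled. */
static enum csum_state rx_csum_state(unsigned hdr_flags, bool hwcsum_enabled)
{
    if (hdr_flags & HDR_F_HWCSUM)
        return CSUM_PARTIAL;      /* sender says the csum is bad */
    if (hwcsum_enabled)
        return CSUM_UNNECESSARY;  /* sender says the csum should be good */
    return CSUM_NONE;             /* force checking */
}

/* What the TX path would do with one packet. */
struct tx_decision {
    unsigned hdr_flags;  /* flags to place in the encapsulation header */
    bool sw_checksum;    /* true: compute the L4 checksum in software */
};

/* TX side as proposed: skip the checksum only for compatible peers. */
static struct tx_decision tx_decide(enum csum_state state, bool peer_compatible)
{
    struct tx_decision d = { 0u, false };

    if (state == CSUM_PARTIAL) {
        if (peer_compatible)
            d.hdr_flags = HDR_F_HWCSUM;
        else
            d.sw_checksum = true;  /* the skb_checksum_help() case */
    }
    return d;
}
```

With an incompatible peer the sketch always falls back to a software checksum, which is the extra pass over the data discussed above.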
-- MST From eitan at mellanox.co.il Tue Sep 4 23:37:14 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 5 Sep 2007 09:37:14 +0300 Subject: [ofa-general] RE: [opensm] bugs in build system In-Reply-To: <20070904203621.GI23670@sashak.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com> <20070904203621.GI23670@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C902314495@mtlexch01.mtl.com> Hi Sasha, Patch tested. Works great. Thanks Eitan > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Tuesday, September 04, 2007 11:36 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [opensm] bugs in build system > > Hi again, Eitan, > > On 17:02 Sun 02 Sep , Eitan Zahavi wrote: > > Hi Sasha, > > > > For some reason OpenSM (and the required management libs) > do not build > > correctly when I use manual autogen.sh, configure > --prefix=/tmp/ez/usr > > ; make; make install mode. > > > > It seems the build system is probably broken as it relies on fixed > > paths? > > It is not, but it relies to invalid paths like > -I.../include/infiniband when in the code '#include > ' is used. > > > OK 3. cd management/libibumad; autogen.sh; FAIL 4. ./configure > > --prefix=/tmp/ez/usr checking for sys_read_string in > -libcommon... no > > configure: error: sys_read_string() not found. libibumad requires > > libibcommon. > > > > To overcome this I manually added the --disable-libcheck > ./configure > > --prefix=/tmp/ez/usr --disable-libcheck I do not understand > why after > > installing the common lib I still get this error? > > Isn't the search path should include the /lib ??? > > Seems it is AC_CHECK_LIB() feature (ugh - I hate autotools mess :)) > > I'm not really sure such checks should be there. libibcommon > library is part of our project and not "external" library. > > > FAIL 5. 
make > > Make fails as it does not find the infiniband/common.h > > Wrong include path in Makefile.am - it uses include/infiniband. > > > To overcome this I manually added -I/include .... > > make CFLAGS="-I/tmp/ez/usr/include" > > > > OK 6. make install > > --------------- OPENSM ------------------ OK 7. cd > management/opensm; > > autogen.sh; FAIL 8. configure --prefix=/tmp/ez/usr checking for > > umad_init in -libumad... no > > configure: error: umad_init() not found. libosmvendor of > type openib > > requires libibumad. > > configure: error: /bin/sh './configure' failed for libvendor > > > > To overcome this I manually added the --disable-libcheck > ./configure > > --prefix=/tmp/ez/usr --disable-libcheck This problem is same as the > > above: lib path for linking should use the /lib. > > > > FAIL 9. make > > Here again the include path is missing the /include: > > > > ./../include/vendor/osm_vendor_ibumad.h:44:31: > infiniband/common.h: No > > such file or directory > > ./../include/vendor/osm_vendor_ibumad.h:45:29: > infiniband/umad.h: No > > such file or directory > > Wrong OSMV_INCLUDES definition (it uses paths include/infiniband ). > > > To overcome this I manually added -I/include .... > > make CFLAGS="-I/tmp/ez/usr/include" > > > > But this is not enough as the linker fail: > > /usr/bin/ld: cannot find -libumad > > It seems to be buggy opensm_LDADD in Makefile.am > > > To overcome this I had to add -L/lib .... > > make CFLAGS="-I/tmp/ez/usr/include" LDFLAGS="-L/tmp/ez/usr/lib > > -libumad -libcommon" > > > > OK 10. make install > > > > I hope the above issues could be fixed such that the installation > > would be simpler. > > Could you test the patch please (you still need to use > '--disable-libcheck' with ./configure)? Thanks. > > Sasha > > > diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am > index 48868e7..7e82590 100644 > --- a/libibumad/Makefile.am > +++ b/libibumad/Makefile.am > @@ -2,7 +2,7 @@ > SUBDIRS = . 
> > INCLUDES = -I$(srcdir)/include/infiniband \ > - -I$(srcdir)/../libibcommon/include/infiniband > + -I$(srcdir)/../libibcommon/include > > man_MANS = man/umad_debug.3 man/umad_get_ca.3 \ > man/umad_get_ca_portguids.3 man/umad_get_cas_names.3 > \ diff --git a/opensm/config/osmvsel.m4 > b/opensm/config/osmvsel.m4 index 47ad36f..97d5a9e 100644 > --- a/opensm/config/osmvsel.m4 > +++ b/opensm/config/osmvsel.m4 > @@ -61,11 +61,11 @@ with_sim="/usr") > dnl based on the with_osmv we can try the vendor flag if > test $with_osmv = "openib"; then > OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" > - OSMV_INCLUDES="-I\$(srcdir)/../include > -I\$(srcdir)/../../libibcommon/include/infiniband > -I\$(srcdir)/../../libibumad/include/infiniband" > - if test "x$with_umad_libs" = "x"; then > - OSMV_LDADD="-libumad" > - else > - OSMV_LDADD="-L$with_umad_libs -libumad" > + OSMV_INCLUDES="-I\$(srcdir)/../include > -I\$(srcdir)/../../libibcommon/include > -I\$(srcdir)/../../libibumad/include" > + OSMV_LDADD="-L\$(libdir) -libumad -libcommon" > + > + if test "x$with_umad_libs" != "x"; then > + OSMV_LDADD="-L$with_umad_libs $OSMV_LDADD" > fi > > if test "x$with_umad_includes" != "x"; then > From kliteyn at dev.mellanox.co.il Wed Sep 5 00:53:53 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 05 Sep 2007 10:53:53 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS: selecting PathRecord according to QoS policy In-Reply-To: <20070903172010.GB29384@sashak.voltaire.com> References: <46DBFAFB.4090000@dev.mellanox.co.il> <20070903172010.GB29384@sashak.voltaire.com> Message-ID: <46DE6091.40901@dev.mellanox.co.il> Hi Sasha, I agree with most of your comments. See below: Sasha Khapyorsky wrote: > Hi Yevgeny, > > The initial comments below. > > Basically I think some code cleanup is needed, and please decrease > number of osm_log(...OSM_LOG_DEBUG...). 
>
> Sasha
>
> On 15:15 Mon 03 Sep , Yevgeny Kliteynik wrote:
>> Selecting path according to QoS policy level that
>> the PathRecord query matches.
>>
>> Signed-off-by: Yevgeny Kliteynik
>> ---
>> opensm/opensm/osm_sa_path_record.c | 383 ++++++++++++++++++++++++++++++------
>> 1 files changed, 320 insertions(+), 63 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c
>> index 1b781f0..8fc5eac 100644
>> --- a/opensm/opensm/osm_sa_path_record.c
>> +++ b/opensm/opensm/osm_sa_path_record.c
>> @@ -67,6 +67,7 @@
>> #include
>> #include
>> #include
>> +#include
>> #ifdef ROUTER_EXP
>> #include
>> #include
>> @@ -236,8 +237,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv,
>> {
>> const osm_node_t *p_node;
>> const osm_physp_t *p_physp;
>> + const osm_physp_t *p_src_physp;
>> const osm_physp_t *p_dest_physp;
>> - const osm_prtn_t *p_prtn;
>> + const osm_prtn_t *p_prtn = NULL;
>> const ib_port_info_t *p_pi;
>> ib_api_status_t status = IB_SUCCESS;
>> ib_net16_t pkey;
>> @@ -248,14 +250,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv,
>> uint8_t required_rate;
>> uint8_t required_pkt_life;
>> uint8_t sl;
>> + uint8_t in_port_num;
>> ib_net16_t dest_lid;
>> + uint8_t i;
>> + uint8_t vl;
>> + ib_slvl_table_t *p_slvl_tbl = NULL;
>> + boolean_t valid_sls[IB_MAX_NUM_VLS];
>
> Use here uint16_t sl_mask instead of array - flow will be simpler.

No, it won't. It will save three lines at the end when checking whether there is a valid SL that doesn't lead to VL15, but it will complicate the rest of the related code a bit, because I still need to read the port's SL2VL table values one by one and mark them in the array (or bitmap) one by one.
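To make the trade-off concrete, here is a minimal sketch of both bookkeeping variants side by side. It rests on assumptions: the SL2VL table is modelled as a plain unpacked array of 16 entries, and `NUM_SLS`, `DROP_VL` and all four helper functions are hypothetical names standing in for IB_MAX_NUM_VLS, IB_DROP_VL and the per-hop loop in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SLS 16  /* stand-in for IB_MAX_NUM_VLS in the patch */
#define DROP_VL 15  /* stand-in for IB_DROP_VL (VL15) */

/* Array variant: clear every SL that this hop maps to the drop VL. */
static void prune_sls_array(bool valid[NUM_SLS], const uint8_t sl2vl[NUM_SLS])
{
    for (int sl = 0; sl < NUM_SLS; sl++)
        if (valid[sl] && sl2vl[sl] == DROP_VL)
            valid[sl] = false;
}

/* Array variant: lowest SL still usable on the path, or -1 if none. */
static int first_valid_sl_array(const bool valid[NUM_SLS])
{
    for (int sl = 0; sl < NUM_SLS; sl++)
        if (valid[sl])
            return sl;
    return -1;
}

/* Bitmask variant: the table is still read entry by entry, only the
 * bookkeeping changes from an array store to a bit clear. */
static uint16_t prune_sls_mask(uint16_t mask, const uint8_t sl2vl[NUM_SLS])
{
    for (int sl = 0; sl < NUM_SLS; sl++)
        if (sl2vl[sl] == DROP_VL)
            mask &= (uint16_t)~(1u << sl);
    return mask;
}

/* Bitmask variant: lowest SL still usable on the path, or -1 if none. */
static int first_valid_sl_mask(uint16_t mask)
{
    for (int sl = 0; sl < NUM_SLS; sl++)
        if (mask & (1u << sl))
            return sl;
    return -1;
}
```

Both variants walk the table one entry at a time per hop. The mask mainly shortens the final "is any SL left" check, which becomes a test against zero.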
>> + boolean_t sl2vl_valid_path; >> + uint8_t first_valid_sl; >> + osm_qos_level_t *p_qos_level = NULL; >> >> OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); >> >> + memset(valid_sls, TRUE, sizeof(valid_sls)); >> dest_lid = cl_hton16(dest_lid_ho); >> >> p_dest_physp = p_dest_port->p_physp; >> p_physp = p_src_port->p_physp; >> + p_src_physp = p_physp; >> p_pi = &p_physp->port_info; >> >> mtu = ib_port_info_get_mtu_cap(p_pi); >> @@ -288,13 +300,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> p_node = osm_physp_get_node_ptr(p_physp); >> >> if (p_node->sw) { >> + /* source node is a switch */ >> + in_port_num = osm_physp_get_port_num(p_physp); >> + >> /* >> * If the dest_lid_ho is equal to the lid of the switch pointed by >> * p_sw then p_physp will be the physical port of the switch port zero. >> + * Make sure that p_physp points to the out port of the >> + * switch that routes to the destination lid (dest_lid_ho) >> */ >> - p_physp = >> - osm_switch_get_route_by_lid(p_node->sw, >> - cl_ntoh16(dest_lid_ho)); >> + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); >> if (p_physp == 0) { >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> "__osm_pr_rcv_get_path_parms: ERR 1F02: " >> @@ -304,17 +319,36 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> status = IB_NOT_FOUND; >> goto Exit; >> } >> + if (!p_rcv->p_subn->opt.no_qos) >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > > Here > >> + } >> + >> + if (!p_rcv->p_subn->opt.no_qos) { >> + if (p_node->sw) >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); >> + else >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); > > and here - is it double initialization? 
Fixed >> + >> + /* update valid SLs that still exist on this route */ >> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { >> + if (valid_sls[i]) { >> + vl = ib_slvl_table_get(p_slvl_tbl, i); >> + if (vl == IB_DROP_VL) >> + valid_sls[i] = FALSE; >> + } >> + } >> } >> >> /* >> - * Same as above >> + * now get pointer to the destination port (same as above) > > What was wrong with comment? Is not 'p_dest_physp = ' clear? Fixed >> */ >> p_node = osm_physp_get_node_ptr(p_dest_physp); >> >> if (p_node->sw) { >> - p_dest_physp = >> - osm_switch_get_route_by_lid(p_node->sw, >> - cl_ntoh16(dest_lid_ho)); >> + /* >> + * if destination is switch, we want p_dest_physp to point to port 0 >> + */ >> + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); >> >> if (p_dest_physp == 0) { >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> @@ -328,6 +362,10 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> >> } >> >> + /* >> + * Now go through the path step by step >> + */ >> + >> while (p_physp != p_dest_physp) { >> p_physp = osm_physp_get_remote(p_physp); >> >> @@ -341,6 +379,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> goto Exit; >> } >> >> + in_port_num = osm_physp_get_port_num(p_physp); >> + >> /* >> This is point to point case (no switch in between) >> */ >> @@ -409,6 +449,20 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> CL_ASSERT(p_physp); >> CL_ASSERT(osm_physp_is_valid(p_physp)); >> >> + p_node = osm_physp_get_node_ptr(p_physp); >> + if (!p_node->sw) { >> + /* >> + * There is some sort of problem in the subnet object! >> + * If this isn't a switch, we should have reached >> + * the destination by now! 
>> + */ >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F05: " >> + "Internal error, bad path\n"); >> + status = IB_ERROR; >> + goto Exit; >> + } >> + >> p_pi = &p_physp->port_info; >> >> if (mtu > ib_port_info_get_mtu_cap(p_pi)) { >> @@ -435,6 +489,21 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> osm_physp_get_port_num(p_physp)); >> } >> >> + if (!p_rcv->p_subn->opt.no_qos) { >> + /* >> + * Check SL2VL table of the switch and update valid SLs >> + */ >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); >> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { >> + if (valid_sls[i]) { >> + vl = ib_slvl_table_get(p_slvl_tbl, i); >> + if (vl == IB_DROP_VL) >> + valid_sls[i] = FALSE; >> + } >> + } >> + } >> + >> + /* go to the next step in the path */ > > Please drop this useless comment. Fixed >> } >> >> /* >> @@ -467,9 +536,118 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> "__osm_pr_rcv_get_path_parms: " >> "Path min MTU = %u, min rate = %u\n", mtu, rate); >> >> + if (!p_rcv->p_subn->opt.no_qos) { >> + /* check whether there is some SL that won't lead to VL15 eventually */ >> + sl2vl_valid_path = FALSE; >> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { >> + if (valid_sls[i]) { >> + sl2vl_valid_path = TRUE; >> + first_valid_sl = i; >> + break; >> + } >> + } >> + >> + if (!sl2vl_valid_path) { >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "All the SLs lead to VL15 on this path\n"); >> + } >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + } >> + >> + if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { >> + /* Get QoS Level object according to the path request */ >> + osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, >> + p_rcv, p_pr, >> + p_src_physp, p_dest_physp, >> + comp_mask, &p_qos_level); >> + >> + if (p_qos_level >> + && osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> 
+ osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "PathRecord request matches QoS Level '%s' (%s)\n", >> + p_qos_level->name, >> + (p_qos_level->use) ? p_qos_level-> >> + use : "no description"); >> + } >> + } >> + >> + /* Adjust path parameters according to QoS settings */ >> + >> + if (p_qos_level) { >> + /* adjust MTU limit according to QoS constraints */ >> + if (p_qos_level->mtu_limit_set >> + && (mtu > p_qos_level->mtu_limit)) { >> + mtu = p_qos_level->mtu_limit; >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "QoS constaraints: new smallest MTU = %u\n", >> + mtu); >> + } >> + } >> + >> + /* adjust Rate limit according to QoS constraints */ >> + if (p_qos_level->rate_limit_set >> + && (rate > p_qos_level->rate_limit)) { >> + rate = p_qos_level->rate_limit; >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "QoS constaraints: new smallest Rate = %u\n", >> + rate); >> + } >> + } >> + >> + /* adjust Packet Lifetime according to QoS constraints */ >> + if (p_qos_level->pkt_life_set >> + && (pkt_life > p_qos_level->pkt_life)) { >> + pkt_life = p_qos_level->pkt_life; >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "QoS constaraints: new smallest Packet Lifetime = %u\n", >> + pkt_life); >> + } >> + } >> + >> + /* adjust SL according to QoS constraints */ >> + if (p_qos_level->sl_set) { >> + if (!valid_sls[p_qos_level->sl]) { >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + sl = p_qos_level->sl; >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "QoS constaraints: new SL = %u\n", >> + sl); >> + } >> + } > > Please drop all osm_log(..OSM_LOG_DEBUG..) 
in this block - not each > single line should be logged. If you think that those parameters may be > useful for debugging put final values in single osm_log() somewhere at > end of PR generator. OK >> + } >> + >> + /* >> + * Set packet lifetime. >> + * According to spec definition IBA 1.2 Table 205 >> + * PacketLifeTime description, for loopback paths, >> + * packetLifeTime shall be zero. >> + */ >> + if (p_src_port == p_dest_port) >> + pkt_life = 0; >> + else >> + if ( !(p_qos_level && p_qos_level->pkt_life_set) ) >> + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; >> + >> + >> /* >> - Determine if these values meet the user criteria >> - and adjust appropriately >> + * Done adjusting parameters according to QoS constraints. >> + * Determine if these values meet the user criteria and >> + * adjust appropriately. >> */ >> >> /* we silently ignore cases where only the MTU selector is defined */ >> @@ -511,6 +689,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> break; >> } >> } >> + if (status != IB_SUCCESS) >> + goto Exit; >> >> /* we silently ignore cases where only the Rate selector is defined */ >> if ((comp_mask & IB_PR_COMPMASK_RATESELEC) && >> @@ -551,14 +731,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> break; >> } >> } >> - >> - /* Verify the pkt_life_time */ >> - /* According to spec definition IBA 1.2 Table 205 PacketLifeTime description, >> - for loopback paths, packetLifeTime shall be zero. 
*/ >> - if (p_src_port == p_dest_port) >> - pkt_life = 0; /* loopback */ >> - else >> - pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; >> + if (status != IB_SUCCESS) >> + goto Exit; >> >> /* we silently ignore cases where only the PktLife selector is defined */ >> if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) && >> @@ -603,38 +777,68 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> if (status != IB_SUCCESS) >> goto Exit; >> >> - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && >> - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) >> - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); >> - else if (comp_mask & IB_PR_COMPMASK_PKEY) { >> - pkey = p_pr->pkey; >> - if (!osm_physp_share_this_pkey(p_physp, p_dest_physp, pkey)) { >> - osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> - "__osm_pr_rcv_get_path_parms: ERR 1F1A: " >> - "Ports do not share specified PKey 0x%04x\n", >> - cl_ntoh16(pkey)); >> - status = IB_NOT_FOUND; >> - goto Exit; >> - } >> - } else { >> - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); >> - if (!pkey) { >> - osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> - "__osm_pr_rcv_get_path_parms: ERR 1F1B: " >> - "Ports do not have any shared PKeys\n"); >> - status = IB_NOT_FOUND; >> - goto Exit; >> + /* >> + * set Pkey for this path record request >> + */ >> + >> + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && >> + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) > > No extra () was needed - this generates confused diff lines. No sure what you mean here by "confused diff lines". I agree that the extra () are not *needed*, but isn't if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) is more readable than if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && cl_ntoh32(p_pr->hop_flow_raw) & 1 << 31) ? >> + pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp); >> + else { >> + if (comp_mask & IB_PR_COMPMASK_PKEY) { >> + /* >> + * PR request has a specific pkey: >> + * Check that source and destination share this pkey. 
>> + * If QoS level has pkeys, check that this pkey exists >> + * in the QoS level pkeys. >> + * PR returned pkey is the requested pkey. >> + */ >> + pkey = p_pr->pkey; >> + if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) { >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F1A: " >> + "Ports do not share specified PKey 0x%04x\n", >> + cl_ntoh16(pkey)); >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + if (p_qos_level && p_qos_level->pkey_range_len && >> + !osm_qos_level_has_pkey(p_qos_level, pkey)) { >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F1D: " >> + "Ports do not share PKeys defined by QoS level\n"); >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + } else { >> + /* PR request doesn't have a specific pkey */ >> + >> + if (p_qos_level && p_qos_level->pkey_range_len) { >> + /* If QoS level has pkeys, get shared pkey from QoS level pkeys */ >> + pkey = osm_qos_level_get_shared_pkey(p_qos_level, >> + p_src_physp, >> + p_dest_physp); >> + if (!pkey) { >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F1E: " >> + "Ports do not share PKeys defined by QoS level\n"); >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + } else { >> + pkey = osm_physp_find_common_pkey(p_src_physp, >> + p_dest_physp); >> + if (!pkey) { >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F1B: " >> + "Ports do not have any shared PKeys\n"); >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + } >> } >> } > > Please arrange the code above as: > > if () > ... > else if () > ... > else > ... OK -- Yevgeny > , and please try to not exeed 80 chars in the line. 
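The flat if / else if / else arrangement requested in the review above can be sketched as follows. This is only a minimal standalone model, not the OpenSM code: the `demo_*` types and helpers are hypothetical stand-ins for `osm_physp_share_this_pkey()`, `osm_qos_level_has_pkey()`, `osm_qos_level_get_shared_pkey()` and `osm_physp_find_common_pkey()`, and pkey tables are assumed to be plain arrays.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the OpenSM port and QoS-level objects. */
struct demo_ports {            /* pkeys shared by source and destination */
	const uint16_t *shared;
	size_t n_shared;
};

struct demo_qos_level {        /* pkeys allowed by the matched QoS level */
	const uint16_t *pkeys;
	size_t n_pkeys;
};

static int demo_contains(const uint16_t *tbl, size_t n, uint16_t pkey)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i] == pkey)
			return 1;
	return 0;
}

/* First shared pkey, optionally restricted to the QoS level; 0 if none. */
static uint16_t demo_common_pkey(const struct demo_ports *p,
				 const struct demo_qos_level *q)
{
	size_t i;

	for (i = 0; i < p->n_shared; i++)
		if (!q || demo_contains(q->pkeys, q->n_pkeys, p->shared[i]))
			return p->shared[i];
	return 0;
}

/* Returns 0 and stores the pkey on success, -1 for "not found". */
static int demo_select_pkey(int raw_traffic, int req_has_pkey,
			    uint16_t req_pkey,
			    const struct demo_ports *p,
			    const struct demo_qos_level *q,
			    uint16_t *p_pkey)
{
	int qos_pkeys = q && q->n_pkeys;
	uint16_t pkey;

	if (raw_traffic) {
		/* raw traffic: any common pkey, no error check (as in the patch) */
		pkey = demo_common_pkey(p, NULL);
	} else if (req_has_pkey) {
		/* requested pkey must be shared and allowed by the QoS level */
		pkey = req_pkey;
		if (!demo_contains(p->shared, p->n_shared, pkey))
			return -1;
		if (qos_pkeys && !demo_contains(q->pkeys, q->n_pkeys, pkey))
			return -1;
	} else if (qos_pkeys) {
		/* no pkey requested: take a shared pkey from the QoS level */
		pkey = demo_common_pkey(p, q);
		if (!pkey)
			return -1;
	} else {
		/* no pkey requested and no QoS pkeys: any common pkey */
		pkey = demo_common_pkey(p, NULL);
		if (!pkey)
			return -1;
	}

	*p_pkey = pkey;
	return 0;
}
```

Each branch keeps the checks of the nested version, so the four cases (raw traffic, explicit pkey, QoS-level pkeys, plain common pkey) read top to bottom at one indentation level, which should also help with the 80-column limit.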
> >> - if (p_rcv->p_subn->opt.routing_engine_name && >> - strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) >> - /* slid and dest_lid are stored in network in lash */ >> - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, >> - p_dest_port); >> - else >> - sl = OSM_DEFAULT_SL; >> - >> if (pkey) { >> p_prtn = >> (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, >> @@ -642,34 +846,87 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> 0x8000)); >> if (p_prtn == >> (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) >> + p_prtn = NULL; >> + } >> + >> + /* >> + * Set PathRecord SL. >> + * >> + * ToDo: What about QoS and LASH routing? How can they coexist? >> + * And what happens when there's a pkey, hence there is a >> + * partition with a certain SL, and this SL doesn't match >> + * the one that's defined by LASH? >> + */ >> + >> + if (comp_mask & IB_PR_COMPMASK_SL) { >> + /* >> + * Specific SL was requested >> + */ >> + sl = ib_path_rec_sl(p_pr); >> + if (p_qos_level && p_qos_level->sl_set && (p_qos_level->sl != sl)) { >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "QoS constaraints: required PR SL (%u) doesn't match QoS SL (%u)\n", >> + sl, p_qos_level->sl); >> + } >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + } else if (p_qos_level && p_qos_level->sl_set) { >> + /* >> + * No specific SL was requested, >> + * but there is an SL in QoS level >> + */ >> + sl = p_qos_level->sl; >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> + if (pkey && p_prtn && p_prtn->sl != p_qos_level->sl) { >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "QoS level SL (%u) overrides partition SL (%u)\n", >> + p_qos_level->sl, p_prtn->sl); >> + } >> + } >> + } else if (pkey) { >> + /* >> + * No specific SL in request or in QoS level - use partition SL >> + */ >> + if (!p_prtn) { >> /* this may be 
possible when pkey tables are created somehow in >> previous runs or things are going wrong here */ >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> "__osm_pr_rcv_get_path_parms: ERR 1F1C: " >> "No partition found for PKey 0x%04x - using default SL %d\n", >> cl_ntoh16(pkey), sl); >> - else { >> - if (p_rcv->p_subn->opt.routing_engine_name && >> - strcmp(p_rcv->p_subn->opt.routing_engine_name, >> - "lash") == 0) >> - /* slid and dest_lid are stored in network in lash */ >> - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, >> - p_src_port, p_dest_port); >> - else >> - sl = p_prtn->sl; >> - } >> - >> - /* reset pkey when raw traffic */ >> - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && >> - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) >> - pkey = 0; >> + } else >> + sl = p_prtn->sl; >> + } else if (p_rcv->p_subn->opt.routing_engine_name && >> + strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) { >> + /* slid and dest_lid are stored in network in lash */ >> + sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, >> + p_src_port, p_dest_port); >> + } else if (!p_rcv->p_subn->opt.no_qos) { >> + sl = first_valid_sl; >> } >> + else >> + sl = OSM_DEFAULT_SL; >> >> - if ((comp_mask & IB_PR_COMPMASK_SL) && ib_path_rec_sl(p_pr) != sl) { >> + if (!p_rcv->p_subn->opt.no_qos && !valid_sls[sl]) { >> + /* selected SL will eventually lead to VL15 */ >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "Selected SL (%u) leads to VL15\n", p_prtn->sl); >> + } >> status = IB_NOT_FOUND; >> goto Exit; >> } >> >> + /* reset pkey when raw traffic */ >> + if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && >> + cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) >> + pkey = 0; >> + >> p_parms->mtu = mtu; >> p_parms->rate = rate; >> p_parms->pkt_life = pkt_life; >> -- >> 1.5.1.4 >> > From mst at dev.mellanox.co.il Wed Sep 5 01:10:11 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 5 Sep 2007 11:10:11 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904194940.GK28350@mellanox.co.il> Message-ID: <20070905081011.GB25011@mellanox.co.il> > Quoting James Lentini : > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > > > Quoting James Lentini : > > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > > > > > > On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > > > > > > Add module option hw_csum: when set, IPoIB will report S/G > > > > support, and rely on hardware end-to-end transport checksum (ICRC) > > > > instead of software-level protocol checksums. > > > > > > The purpose of this option would be clearer if the parameter name were > > > "omit_csum". Calling this "HW checksum" support is misleading because > > > the term is already used to describe network adapters that calculate > > > TCP/IP checksums in hardware. I realize that you are using the HW > > > checksum infrastructure to implement this, but it is really not the > > > same thing. > > > > Another reason is that I declare HW_CSUM in the netdev > > feature list. Yea, someone might get confused, > > but "omit checksum" is misleading, too, and is likely to > > scare users away from the feature: the need for end-to-end checksum > > is a widely recognised requirement. > > I agree. Since this isn't an end-to-end checksum, IB spec says: There are two CRCs in each packet. The Invariant CRC (ICRC) covers all fields which should not change as the packet traverses the fabric. The Variant CRC (VCRC) covers all of the fields of the packet. The combination of the two CRCs allow switches and routers to modify appropriate fields and still maintain an end to end data integrity for the transport control and data portion of the packet. 
The coverage of the ICRC is different depending on whether the packet is routed to another subnet (i.e. contains a global route header). So yes, ICRC is an end-to-end checksum. This is made clear in the modinfo description of the parameter. > I recommend that be made clear to the user. I don't think there's any potential for confusion: ICRC is end to end, it is not a link level checksum. The crowd using infiniband seems happy enough to rely on ICRC for transport and data integrity checks: SDP, MPI, SRP, and other protocols do so. > > So I don't have a better name. Hopefully modinfo documents the > > option well enough. > > > > > > Since this will not inter-operate with older IPoIB modules, this > > > > option is off by default. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > Does the S/G support need to be tied to the checksum changes? > > Can you separate the S/G support and checksum changes into different > patches? Oh, just cut the relevant hunks from the patch, but I don't see why this is useful, since S/G support in linux does not work without hardware checksumming. -- MST From ogerlitz at voltaire.com Wed Sep 5 02:37:51 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 05 Sep 2007 12:37:51 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904164018.GB28350@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <46DD466F.8020607@voltaire.com> <20070904164018.GB28350@mellanox.co.il> Message-ID: <46DE78EF.1070701@voltaire.com> Michael S. 
Tsirkin wrote: >> Quoting Or Gerlitz : >> looking on slide 18 of Dror's Sonoma presentation (*) which states - >> Checksum Offload >> TCP/UDP/IP Checksum Offloading - Query device for checksum offload support >> QP Creation - Mark QP for IPoIB checksum support >> TX - ibv_send_flags indicate checksum offload request >> RX - ibv_wc_flags indicate checksum status (good, bad, unverified) > All this is only supported by connectx and only for datagram (not ipoib cm). I see. Reading the slide I thought that the chip does TCP/UDP/IP checksum offloading, now you say its only for connected mode and through the discussion I see the below comment of Jason saying that the chip "does" only L4 TCP/UDP checksum offloading since Linux always computes the L3 IP checksum: > To summarize for clarity, both TCP and UDP have a checksum at the top > of the data payload, and IPv4 also has a header checksum. This > discussion, and your optimization, is all about the L4 TCP/UDP > checksum.. Linux does not offload computation of the header > checksum. Am I correct? Unlike what you were writing over and over this thread, the reality is: A) gateways need --not-- compute the L4 TCP/UDP checksum B) there --are-- IB/Ethernet non Linux gateways around, specifically, all the three system companies involved here (Cisco, Qlogic & Voltaire) have HW based gateways, so as Jason wrote > If you TX a packet with ip_summed == CHECKSUM_PARTIAL > without doing the hardware csum offload procedure then on-the-wire the > L3 (OrG - should be L4) TCP/UDP checksum bytes are garbage. The stack > no longer conforms to the various current RFCs -> the packet is malformed. this patch allows malformed packets to be forwarded > Your question is moot. 
I'll just quote the commit message here: > rely on hardware end-to-end transport checksum (ICRC) > instead of software-level protocol checksums > > While this does not inter-operate with standard ipoib RFC, > all TCP/IP suite protocols work as usual, so I think that > this is at least as useful as SDP is. no its not useful when there is forwarding around. No SDP packet can be received by the gateway unless it supports SDP, and to further forward this packet the gw needs to create TCP/IP headers for it, where with ipoib the gw gets the packet, take off the ipoib header and put mac header (among other things it does such as L3 IP checksum validation and more, but not L4 csum computation). Or. From vlad at lists.openfabrics.org Wed Sep 5 02:51:45 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 5 Sep 2007 02:51:45 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070905-0200 daily build status Message-ID: <20070905095146.2F67AE6084A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.12 
Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type 
argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070905-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From 
ogerlitz at voltaire.com Wed Sep 5 02:57:45 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 05 Sep 2007 12:57:45 +0300 Subject: [ofa-general] [RFC] [PATCH 1/5 v2] ib/ipoib: specify Traffic Classwith PR queries for QoS support In-Reply-To: <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> Message-ID: <46DE7D99.7000508@voltaire.com> Sean Hefty wrote: > To support QoS within and between subnets, modify IPoIB to request > specific Traffic Class values with path record queries, using > the value associated with the IPoIB broadcast group. Sean, During the first post the issue of providing also the SL (and/or other params) from the broadcast group as part of the path query was raised, and I kind of failed to follow all the discussion that evolved... Can you clarify if the consensus was that based on the pkey and traffic class, the SA should return the --same-- SL (and/or other params) on this path query as of the broadcast group? 
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -468,9 +468,10 @@ static struct ipoib_path *path_rec_create(struct net_device *dev, void *gid) > INIT_LIST_HEAD(&path->neigh_list); > > memcpy(path->pathrec.dgid.raw, gid, sizeof (union ib_gid)); > - path->pathrec.sgid = priv->local_gid; > - path->pathrec.pkey = cpu_to_be16(priv->pkey); > - path->pathrec.numb_path = 1; > + path->pathrec.sgid = priv->local_gid; > + path->pathrec.pkey = cpu_to_be16(priv->pkey); > + path->pathrec.numb_path = 1; Did you just want to add a space/tab here? Also, some lines appear broken, at least as my email client shows this patch; maybe you had some problem? > + path->pathrec.traffic_class = priv->broadcast->mcmember.traffic_class; For this to take effect, don't you need to set the IB_SA_PATH_REC_TRAFFIC_CLASS bit in the component mask? Or. From ogerlitz at voltaire.com Wed Sep 5 02:59:42 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 05 Sep 2007 12:59:42 +0300 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] for 2.6.24: ib: QoS support In-Reply-To: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> Message-ID: <46DE7E0E.1020807@voltaire.com> Sean Hefty wrote: > The following patch series adds QoS support to the host stack based > on the IB QoS annex. I believe that all feedback from v1 has been > incorporated, such as adding the SID to the PR query. > > These patches target 2.6.24 and OFED 1.3. > > I have NOT tested these patches against a QoS compliant SM. If someone > has this setup and can test it, that would be great. Otherwise, I will > be trying to setup openSM to do this, but it will take me some time. Sean, The patches seem fine to me to be merged; see the question and the possible bug I pointed out in the ipoib patch. Or. 
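The possible bug mentioned for the ipoib patch (setting `traffic_class` without its component-mask bit) can be illustrated with a small model of how an SA treats query fields. This is a sketch under stated assumptions: the `DEMO_*` bit positions and `demo_*` names are hypothetical and do not match the real `IB_SA_PATH_REC_*` definitions; only the masking behaviour is the point.

```c
#include <stdint.h>

/* Hypothetical bit positions; the real IB_SA_PATH_REC_* values differ. */
#define DEMO_PR_PKEY          (1ULL << 0)
#define DEMO_PR_TRAFFIC_CLASS (1ULL << 1)

struct demo_path_rec {
	uint16_t pkey;
	uint8_t  traffic_class;
};

/*
 * Model of SA matching: a field in the query constrains the result only if
 * its bit is set in comp_mask; unmasked fields are ignored.
 */
static int demo_sa_match(const struct demo_path_rec *rec,
			 const struct demo_path_rec *query,
			 uint64_t comp_mask)
{
	if ((comp_mask & DEMO_PR_PKEY) && rec->pkey != query->pkey)
		return 0;
	if ((comp_mask & DEMO_PR_TRAFFIC_CLASS) &&
	    rec->traffic_class != query->traffic_class)
		return 0;
	return 1;
}
```

In this model, a query that fills in `traffic_class` but leaves its bit out of `comp_mask` still matches a record with a different traffic class, which is the effect the review question is asking about.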
From ogerlitz at voltaire.com Wed Sep 5 03:11:03 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 05 Sep 2007 13:11:03 +0300 Subject: [ofa-general] [PATCH RFC] IB/ipoib: enable IGMP for userpsacemulticast IB apps In-Reply-To: References: Message-ID: <46DE80B7.5090807@voltaire.com> Or Gerlitz wrote: > The kernel IB stack allows (through the RDMA CM) user space multicast > applications to interoperate with IP based apps optionally running at a different > IP subnet. > To support this inter-op for the case where the receiving party resides at > the IB side, there is a need to handle IGMP (reports/queries) else the local > IP router would not forward this multicast traffic. > This patch does a lookup on the database used for multicast reference counting > and enhances IPoIB to ignore mulicast group which is already handled by > user space, all this under a per device policy flag. > That is when the policy flag allows it, IPoIB will not join/attach its QP to a > multicast group which has an entry on the database. The default value is "disallowed", > where through /sys/class/net/$dev/umcast one can allow/disallow and read it. Roland, Any comment on the basic approach and the specific implementation? Tziporet, This patch is targeted to both upstream and OFED 1.3 (also technically it is against 2.6.23-rc5 so it fits both), however, I prefer to have it first accepted to upstream and then apply it to OFED. With the OFED 1.3 feature freeze being next Monday, what do you say, is there a chance for for the feature freeze to move further (eg as of no release yet of libmlx4 and the OFED policy to include only released libraries)? Or. 
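The join policy described in the quoted patch text above reduces to a two-input decision. A minimal sketch, assuming the per-device flag and the database lookup are already available as booleans; `demo_ipoib_should_join` is a hypothetical name, not a function from the patch:

```c
/*
 * Hypothetical helper: returns 1 if IPoIB should join/attach its QP to the
 * group, 0 if it should leave the group to the user-space application.
 */
static int demo_ipoib_should_join(int umcast_allowed, int group_in_user_db)
{
	/* skip the join only when the policy allows it AND user space
	 * already has an entry for this group in the database */
	if (umcast_allowed && group_in_user_db)
		return 0;
	return 1;
}
```

With the default policy ("disallowed"), the first argument is 0 and IPoIB joins every group as before.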
From dotanb at dev.mellanox.co.il Wed Sep 5 04:27:19 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 05 Sep 2007 14:27:19 +0300 Subject: [ofa-general] [PATCH] librdmacm 1/2: add valgrind support to auto-tools configuration file In-Reply-To: <000101c7e05f$aed63fa0$ff0da8c0@amr.corp.intel.com> References: <200708151352.42026.dotanb@dev.mellanox.co.il> <46C38C94.8060805@ichips.intel.com> <46C412CE.1040701@dev.mellanox.co.il> <000101c7e05f$aed63fa0$ff0da8c0@amr.corp.intel.com> Message-ID: <46DE9297.6060600@dev.mellanox.co.il> Hi Sean. What is the status of this patch? I would like to finish this issue before this code freeze..... thanks Dotan
From ramachandra.kuchimanchi at qlogic.com Wed Sep 5 04:59:19 2007 From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra) Date: Wed, 5 Sep 2007 06:59:19 -0500 Subject: [ofa-general] Low NFS RDMA performance with Connect X References: <27f776af0709040746u4038cc8ck7e9160c07b756936@mail.gmail.com> Message-ID: John Leidel wrote: > In doing some testing with ConnectX, I noticed a similar issue in MPI > performance. The fix was simply to upgrade to the latest and greatest > firmware. I tried with the latest ConnectX Firmware, version 2.2, and the Iozone numbers are almost the same as what I posted previously and very low compared to the MT25208 numbers. NFS RDMA folks, any ideas as to why this is happening with Connect X ? Regards, Ram On 9/4/07, Kuchimanchi, Ramachandra wrote:
>
> Hi,
>
> I took the NFS RDMA code from the Mellanox NFS RDMA SDK, compiled it with
> OFED-1.2.5 and tried it out with Connect X HCAs and also MT25208. I found
> that the Iozone read and write performance numbers are very low on Connect
> X.
>
> For a 128 MB file and a 128 KB record size:
>
> NFS RDMA SDK on MT25208                         Read: 861 MB/s   Write: 185 MB/s
> OFED-1.2.5 with NFS RDMA modules on MT25208     Read: 849 MB/s   Write: 184 MB/s
> OFED-1.2.5 with NFS RDMA modules on Connect X   Read: 451 MB/s   Write:  79 MB/s
>
> Has anyone tried this out, or does anyone know of a reason why the numbers
> are so low on Connect X ?
>
> Test-setup:
> Server and single client running RHEL 5
> MT25208 tests were with dual processor 64-bit AMD machines
> Connect X tests were with dual processor dual core 64-bit AMD machines
> Connect X HCA FW ver: 2.1
> NFS mount was in async mode and iozone tests were run with -c option.
>
> More Iozone results for a record size of 64 KB (values below in KB/sec):
>
> Read test
>
> File Size (MB)   SDK on MT25208   OFED-1.2.5 on MT25208   OFED-1.2.5 on ConnectX
>             64          1684819                 1701916                   459279
>            128           882580                  870180                   462486
>            256           922081                  921932                   468063
>            512           871136                  909221                   452969
>           1024           900314                  910171                   442215
>           2048           908117                  849710                   676776
>
> Write test
>
> File Size (MB)   SDK on MT25208   OFED-1.2.5 on MT25208   OFED-1.2.5 on ConnectX
>             64           184154                  182483                    78424
>            128           190126                  189284                    81869
>            256           194921                  173124                    85813
>            512           199666                  192110                    87628
>           1024           208924                  199240                   126415
>           2048           180128                  195278                   123020
>
> Regards,
> Ram
From kliteyn at dev.mellanox.co.il Wed Sep 5 05:22:47 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 05 Sep 2007 15:22:47 +0300 Subject: [ofa-general] [PATCH v2] osm: QoS: selecting PathRecord according to QoS policy Message-ID: <46DE9F97.10003@dev.mellanox.co.il> Selecting path according to QoS policy level that the PathRecord query matches. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_sa_path_record.c | 374 ++++++++++++++++++++++++++---------- 1 files changed, 276 insertions(+), 98 deletions(-) diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c index 1b781f0..15bd7e2 100644 --- a/opensm/opensm/osm_sa_path_record.c +++ b/opensm/opensm/osm_sa_path_record.c @@ -67,6 +67,7 @@ #include #include #include +#include #ifdef ROUTER_EXP #include #include @@ -236,8 +237,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, { const osm_node_t *p_node; const osm_physp_t *p_physp; + const osm_physp_t *p_src_physp; const osm_physp_t *p_dest_physp; - const osm_prtn_t *p_prtn; + const osm_prtn_t *p_prtn = NULL; const ib_port_info_t *p_pi; ib_api_status_t status = IB_SUCCESS; ib_net16_t pkey; @@ -248,14 +250,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, uint8_t required_rate; uint8_t required_pkt_life; uint8_t sl; + uint8_t in_port_num; ib_net16_t dest_lid; + uint8_t i; + uint8_t vl; + ib_slvl_table_t *p_slvl_tbl = NULL; + boolean_t valid_sls[IB_MAX_NUM_VLS]; + boolean_t sl2vl_valid_path; + uint8_t first_valid_sl; + osm_qos_level_t *p_qos_level = NULL; OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); + memset(valid_sls, TRUE, IB_MAX_NUM_VLS); dest_lid = cl_hton16(dest_lid_ho); p_dest_physp = p_dest_port->p_physp; p_physp = p_src_port->p_physp; + p_src_physp = p_physp; p_pi = &p_physp->port_info; mtu = ib_port_info_get_mtu_cap(p_pi); @@ -288,13 +300,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, p_node = osm_physp_get_node_ptr(p_physp); if (p_node->sw) { + /* source node is a switch */ + in_port_num = osm_physp_get_port_num(p_physp); + /* * If the dest_lid_ho is equal to the lid of the switch pointed by * p_sw then p_physp will be the physical port of the switch port zero. 
+ * Make sure that p_physp points to the out port of the + * switch that routes to the destination lid (dest_lid_ho) */ - p_physp = - osm_switch_get_route_by_lid(p_node->sw, - cl_ntoh16(dest_lid_ho)); + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); if (p_physp == 0) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F02: " @@ -306,15 +321,32 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, } } + if (!p_rcv->p_subn->opt.no_qos) { + if (p_node->sw) + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); + else + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); + + /* update valid SLs that still exist on this route */ + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sls[i]) { + vl = ib_slvl_table_get(p_slvl_tbl, i); + if (vl == IB_DROP_VL) + valid_sls[i] = FALSE; + } + } + } + /* * Same as above */ p_node = osm_physp_get_node_ptr(p_dest_physp); if (p_node->sw) { - p_dest_physp = - osm_switch_get_route_by_lid(p_node->sw, - cl_ntoh16(dest_lid_ho)); + /* + * if destination is switch, we want p_dest_physp to point to port 0 + */ + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); if (p_dest_physp == 0) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, @@ -328,6 +360,10 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, } + /* + * Now go through the path step by step + */ + while (p_physp != p_dest_physp) { p_physp = osm_physp_get_remote(p_physp); @@ -341,6 +377,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, goto Exit; } + in_port_num = osm_physp_get_port_num(p_physp); + /* This is point to point case (no switch in between) */ @@ -367,29 +405,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, */ p_pi = &p_physp->port_info; - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { + if (mtu > ib_port_info_get_mtu_cap(p_pi)) mtu = ib_port_info_get_mtu_cap(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - 
"__osm_pr_rcv_get_path_parms: " - "New smallest MTU = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", mtu, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } - if (rate > ib_port_info_compute_rate(p_pi)) { + if (rate > ib_port_info_compute_rate(p_pi)) rate = ib_port_info_compute_rate(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "New smallest rate = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", rate, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } /* Continue with the egress port on this switch. @@ -409,32 +429,41 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, CL_ASSERT(p_physp); CL_ASSERT(osm_physp_is_valid(p_physp)); + p_node = osm_physp_get_node_ptr(p_physp); + if (!p_node->sw) { + /* + * There is some sort of problem in the subnet object! + * If this isn't a switch, we should have reached + * the destination by now! 
+ */ + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F04: " + "Internal error, bad path\n"); + status = IB_ERROR; + goto Exit; + } + p_pi = &p_physp->port_info; - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { + if (mtu > ib_port_info_get_mtu_cap(p_pi)) mtu = ib_port_info_get_mtu_cap(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "New smallest MTU = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", mtu, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } - if (rate > ib_port_info_compute_rate(p_pi)) { + if (rate > ib_port_info_compute_rate(p_pi)) rate = ib_port_info_compute_rate(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "New smallest rate = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", rate, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } + if (!p_rcv->p_subn->opt.no_qos) { + /* + * Check SL2VL table of the switch and update valid SLs + */ + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sls[i]) { + vl = ib_slvl_table_get(p_slvl_tbl, i); + if (vl == IB_DROP_VL) + valid_sls[i] = FALSE; + } + } + } } /* @@ -442,30 +471,104 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, */ p_pi = &p_physp->port_info; - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { + if (mtu > ib_port_info_get_mtu_cap(p_pi)) mtu = ib_port_info_get_mtu_cap(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + + if (rate > ib_port_info_compute_rate(p_pi)) + rate = ib_port_info_compute_rate(p_pi); + + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "Path min MTU = %u, min rate = %u\n", + mtu, rate); + + if (!p_rcv->p_subn->opt.no_qos) { 
+ /* + * check whether there is some SL + * that won't lead to VL15 eventually + */ + sl2vl_valid_path = FALSE; + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sls[i]) { + sl2vl_valid_path = TRUE; + first_valid_sl = i; + break; + } + } + + if (!sl2vl_valid_path) { + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "All the SLs lead to VL15 on this path\n"); + } + status = IB_NOT_FOUND; + goto Exit; + } + } + + if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { + /* Get QoS Level object according to the path request */ + osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, + p_rcv, p_pr, + p_src_physp, p_dest_physp, + comp_mask, &p_qos_level); + + if (p_qos_level + && osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { osm_log(p_rcv->p_log, OSM_LOG_DEBUG, "__osm_pr_rcv_get_path_parms: " - "New smallest MTU = %u at destination port 0x%016" - PRIx64 "\n", mtu, - cl_ntoh64(osm_physp_get_port_guid(p_physp))); + "PathRecord request matches QoS Level '%s' (%s)\n", + p_qos_level->name, + (p_qos_level->use) ? 
p_qos_level-> + use : "no description"); + } } - if (rate > ib_port_info_compute_rate(p_pi)) { - rate = ib_port_info_compute_rate(p_pi); + /* Adjust path parameters according to QoS settings */ + + if (p_qos_level) { + if (p_qos_level->mtu_limit_set + && (mtu > p_qos_level->mtu_limit)) + mtu = p_qos_level->mtu_limit; + + if (p_qos_level->rate_limit_set + && (rate > p_qos_level->rate_limit)) + rate = p_qos_level->rate_limit; + + if (p_qos_level->pkt_life_set + && (pkt_life > p_qos_level->pkt_life)) + pkt_life = p_qos_level->pkt_life; + + if (p_qos_level->sl_set) { + if (!valid_sls[p_qos_level->sl]) { + status = IB_NOT_FOUND; + goto Exit; + } + sl = p_qos_level->sl; + } + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) osm_log(p_rcv->p_log, OSM_LOG_DEBUG, "__osm_pr_rcv_get_path_parms: " - "New smallest rate = %u at destination port 0x%016" - PRIx64 "\n", rate, - cl_ntoh64(osm_physp_get_port_guid(p_physp))); + "Path params with QoS constraints: " + "min MTU = %u, min rate = %u, " + "packet lifetime = %u, sl = %u\n", + mtu, rate, pkt_life, sl); } - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "Path min MTU = %u, min rate = %u\n", mtu, rate); + /* + * Set packet lifetime. + * According to spec definition IBA 1.2 Table 205 + * PacketLifeTime description, for loopback paths, + * packetLifeTime shall be zero. 
+ */ + if (p_src_port == p_dest_port) + pkt_life = 0; + else if ( !(p_qos_level && p_qos_level->pkt_life_set) ) + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; + /* Determine if these values meet the user criteria @@ -511,6 +614,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, break; } } + if (status != IB_SUCCESS) + goto Exit; /* we silently ignore cases where only the Rate selector is defined */ if ((comp_mask & IB_PR_COMPMASK_RATESELEC) && @@ -551,14 +656,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, break; } } - - /* Verify the pkt_life_time */ - /* According to spec definition IBA 1.2 Table 205 PacketLifeTime description, - for loopback paths, packetLifeTime shall be zero. */ - if (p_src_port == p_dest_port) - pkt_life = 0; /* loopback */ - else - pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; + if (status != IB_SUCCESS) + goto Exit; /* we silently ignore cases where only the PktLife selector is defined */ if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) && @@ -603,12 +702,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, if (status != IB_SUCCESS) goto Exit; - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); + /* + * set Pkey for this path record request + */ + + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) + pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp); + else if (comp_mask & IB_PR_COMPMASK_PKEY) { + /* + * PR request has a specific pkey: + * Check that source and destination share this pkey. + * If QoS level has pkeys, check that this pkey exists + * in the QoS level pkeys. + * PR returned pkey is the requested pkey. 
+ */ pkey = p_pr->pkey; - if (!osm_physp_share_this_pkey(p_physp, p_dest_physp, pkey)) { + if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F1A: " "Ports do not share specified PKey 0x%04x\n", @@ -616,8 +727,37 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, status = IB_NOT_FOUND; goto Exit; } + if (p_qos_level && p_qos_level->pkey_range_len && + !osm_qos_level_has_pkey(p_qos_level, pkey)) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1D: " + "Ports do not share PKeys defined by QoS level\n"); + status = IB_NOT_FOUND; + goto Exit; + } + + } else if (p_qos_level && p_qos_level->pkey_range_len) { + /* + * PR request doesn't have a specific pkey, but QoS level + * has pkeys - get shared pkey from QoS level pkeys + */ + pkey = osm_qos_level_get_shared_pkey(p_qos_level, + p_src_physp, + p_dest_physp); + if (!pkey) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1E: " + "Ports do not share PKeys defined by QoS level\n"); + status = IB_NOT_FOUND; + goto Exit; + } } else { - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); + /* + * Neither PR request nor QoS level have pkey. + * Just get any shared pkey. 
+ */ + pkey = osm_physp_find_common_pkey(p_src_physp, + p_dest_physp); if (!pkey) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F1B: " @@ -627,14 +767,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, } } - if (p_rcv->p_subn->opt.routing_engine_name && - strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) - /* slid and dest_lid are stored in network in lash */ - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, - p_dest_port); - else - sl = OSM_DEFAULT_SL; - if (pkey) { p_prtn = (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, @@ -642,34 +774,80 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, 0x8000)); if (p_prtn == (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) + p_prtn = NULL; + } + + /* + * Set PathRecord SL. + * + * ToDo: What about QoS and LASH routing? How can they coexist? + * And what happens when there's a pkey, hence there is a + * partition with a certain SL, and this SL doesn't match + * the one that's defined by LASH? 
+ */ + + if (comp_mask & IB_PR_COMPMASK_SL) { + /* + * Specific SL was requested + */ + sl = ib_path_rec_sl(p_pr); + if (p_qos_level && p_qos_level->sl_set && (p_qos_level->sl != sl)) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1F: " + "QoS constraints: required PR SL (%u) " + "doesn't match QoS SL (%u)\n", + sl, p_qos_level->sl); + status = IB_NOT_FOUND; + goto Exit; + } + } else if (p_qos_level && p_qos_level->sl_set) { + /* + * No specific SL was requested, + * but there is an SL in QoS level + */ + sl = p_qos_level->sl; + if (pkey && p_prtn && p_prtn->sl != p_qos_level->sl) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "QoS level SL (%u) overrides partition SL (%u)\n", + p_qos_level->sl, p_prtn->sl); + } else if (pkey) { + /* + * No specific SL in request or in QoS level - use partition SL + */ + if (!p_prtn) { /* this may be possible when pkey tables are created somehow in previous runs or things are going wrong here */ osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F1C: " "No partition found for PKey 0x%04x - using default SL %d\n", cl_ntoh16(pkey), sl); - else { - if (p_rcv->p_subn->opt.routing_engine_name && - strcmp(p_rcv->p_subn->opt.routing_engine_name, - "lash") == 0) - /* slid and dest_lid are stored in network in lash */ - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, - p_src_port, p_dest_port); - else - sl = p_prtn->sl; - } - - /* reset pkey when raw traffic */ - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) - pkey = 0; + } else + sl = p_prtn->sl; + } else if (p_rcv->p_subn->opt.routing_engine_name && + strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) { + /* slid and dest_lid are stored in network in lash */ + sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, + p_src_port, p_dest_port); + } else if (!p_rcv->p_subn->opt.no_qos) { + sl = first_valid_sl; } + else + sl = OSM_DEFAULT_SL; - if ((comp_mask & 
IB_PR_COMPMASK_SL) && ib_path_rec_sl(p_pr) != sl) { + if (!p_rcv->p_subn->opt.no_qos && !valid_sls[sl]) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F23: " + "Selected SL (%u) leads to VL15\n", sl); status = IB_NOT_FOUND; goto Exit; } + /* reset pkey when raw traffic */ + if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && + cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) + pkey = 0; + p_parms->mtu = mtu; p_parms->rate = rate; p_parms->pkt_life = pkt_life; -- 1.5.1.4 From hal.rosenstock at gmail.com Wed Sep 5 06:40:12 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Sep 2007 09:40:12 -0400 Subject: [ofa-general] Re: [opensm] bugs in build system In-Reply-To: <20070904203621.GI23670@sashak.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com> <20070904203621.GI23670@sashak.voltaire.com> Message-ID: On 9/4/07, Sasha Khapyorsky wrote: > Hi again, Eitan, > > On 17:02 Sun 02 Sep , Eitan Zahavi wrote: > > Hi Sasha, > > > > For some reason OpenSM (and the required management libs) do not build > > correctly when > > I use manual autogen.sh, configure --prefix=/tmp/ez/usr ; make; make > > install mode. > > > > It seems the build system is probably broken as it relies on fixed > > paths? > > It is not, but it relies on invalid paths like -I.../include/infiniband > when in the code '#include ' is used. > > > OK 3. cd management/libibumad; autogen.sh; > > FAIL 4. ./configure --prefix=/tmp/ez/usr > > checking for sys_read_string in -libcommon... no > > configure: error: sys_read_string() not found. libibumad requires > > libibcommon. > > > > To overcome this I manually added the --disable-libcheck > > ./configure --prefix=/tmp/ez/usr --disable-libcheck > > I do not understand why after installing the common lib I still get this > > error? > > Isn't the search path should include the /lib ??? 
> > Seems it is AC_CHECK_LIB() feature (ugh - I hate autotools mess :)) > > I'm not really sure such checks should be there. libibcommon library is > part of our project and not "external" library. Though it currently is a separate library though and part of separate package/rpm. > > FAIL 5. make > > Make fails as it does not find the infiniband/common.h > > Wrong include path in Makefile.am - it uses include/infiniband. > > > To overcome this I manually added -I/include .... > > make CFLAGS="-I/tmp/ez/usr/include" > > > > OK 6. make install > > --------------- OPENSM ------------------ > > OK 7. cd management/opensm; autogen.sh; > > FAIL 8. configure --prefix=/tmp/ez/usr > > checking for umad_init in -libumad... no > > configure: error: umad_init() not found. libosmvendor of type openib > > requires libibumad. > > configure: error: /bin/sh './configure' failed for libvendor > > > > To overcome this I manually added the --disable-libcheck > > ./configure --prefix=/tmp/ez/usr --disable-libcheck > > This problem is same as the above: lib path for linking should use the > > /lib. > > > > FAIL 9. make > > Here again the include path is missing the /include: > > > > ./../include/vendor/osm_vendor_ibumad.h:44:31: infiniband/common.h: No > > such file or directory > > ./../include/vendor/osm_vendor_ibumad.h:45:29: infiniband/umad.h: No > > such file or directory > > Wrong OSMV_INCLUDES definition (it uses paths include/infiniband ). > > > To overcome this I manually added -I/include .... > > make CFLAGS="-I/tmp/ez/usr/include" > > > > But this is not enough as the linker fail: > > /usr/bin/ld: cannot find -libumad > > It seems to be buggy opensm_LDADD in Makefile.am > > > To overcome this I had to add -L/lib .... > > make CFLAGS="-I/tmp/ez/usr/include" LDFLAGS="-L/tmp/ez/usr/lib -libumad > > -libcommon" > > > > OK 10. make install > > > > I hope the above issues could be fixed such that the installation would > > be simpler. 
> > Could you test the patch please (you still need to use > '--disable-libcheck' with ./configure)? Thanks. > > Sasha > > > diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am > index 48868e7..7e82590 100644 > --- a/libibumad/Makefile.am > +++ b/libibumad/Makefile.am > @@ -2,7 +2,7 @@ > SUBDIRS = . > > INCLUDES = -I$(srcdir)/include/infiniband \ > - -I$(srcdir)/../libibcommon/include/infiniband > + -I$(srcdir)/../libibcommon/include > > man_MANS = man/umad_debug.3 man/umad_get_ca.3 \ > man/umad_get_ca_portguids.3 man/umad_get_cas_names.3 \ > diff --git a/opensm/config/osmvsel.m4 b/opensm/config/osmvsel.m4 > index 47ad36f..97d5a9e 100644 > --- a/opensm/config/osmvsel.m4 > +++ b/opensm/config/osmvsel.m4 > @@ -61,11 +61,11 @@ with_sim="/usr") > dnl based on the with_osmv we can try the vendor flag > if test $with_osmv = "openib"; then > OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" > - OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" > - if test "x$with_umad_libs" = "x"; then > - OSMV_LDADD="-libumad" > - else > - OSMV_LDADD="-L$with_umad_libs -libumad" > + OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include -I\$(srcdir)/../../libibumad/include" > + OSMV_LDADD="-L\$(libdir) -libumad -libcommon" > + > + if test "x$with_umad_libs" != "x"; then > + OSMV_LDADD="-L$with_umad_libs $OSMV_LDADD" > fi > > if test "x$with_umad_includes" != "x"; then > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From jlentini at netapp.com Wed Sep 5 06:46:28 2007 From: jlentini at netapp.com (James Lentini) Date: Wed, 5 Sep 2007 09:46:28 -0400 (EDT) Subject: [ofa-general] Low NFS RDMA performance with Connect X In-Reply-To: References: 
<27f776af0709040746u4038cc8ck7e9160c07b756936@mail.gmail.com> Message-ID: On Wed, 5 Sep 2007, Kuchimanchi, Ramachandra wrote: > John Leidel wrote: > > > In doing some testing with ConnectX, I noticed a similar issue in MPI > > performance. The fix was simply to upgrade to the latetest and greatest > > firmware. > > I tried with the latest ConnectX Firmware, version 2.2, and the Iozone > numbers are almost similar to what I posted previously and very low as > compared to the MT25208 numbers. > > NFS RDMA folks, any ideas as to why this is happening with Connect X ? We are bringing up our Connect X systems now (we're waiting on a replacement memory dimm for our server). We'll be experimenting with the performance on Connect X over the next few weeks. Both the client and server code bases have been updated substantially since the Mellanox SDK was released. Results are likely to be different on the newer code. Finally, it is conceivable that there will need to be performance tweeks for the Connect X hardware. For Tavor hardware, ULPs use a 1KB MTU to achieve maximum performance (see the setup of the path_mtu QP attribute in net/sunrpc/xprtrdma/verbs.c). From tziporet at dev.mellanox.co.il Wed Sep 5 06:50:01 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 05 Sep 2007 16:50:01 +0300 Subject: [ofa-general] [PATCH RFC] IB/ipoib: enable IGMP for userpsacemulticast IB apps In-Reply-To: <46DE80B7.5090807@voltaire.com> References: <46DE80B7.5090807@voltaire.com> Message-ID: <46DEB409.2060109@mellanox.co.il> Or Gerlitz wrote: > Tziporet, > > This patch is targeted to both upstream and OFED 1.3 (also technically > it is against 2.6.23-rc5 so it fits both), however, I prefer to have > it first accepted to upstream and then apply it to OFED. The decision if to delay the feature freeze will be taken on next Monday OFED meeting according to all new features status. 
I suggest Voltaire raise this request in the meeting > > With the OFED 1.3 feature freeze being next Monday, what do you say, > is there a chance for the feature freeze to move further (eg as of > no release yet of libmlx4 and the OFED policy to include only released > libraries)? > > This is not related - we need the actual library release only at the code freeze (Nov) and not for the feature freeze. Tziporet From jlentini at netapp.com Wed Sep 5 07:11:52 2007 From: jlentini at netapp.com (James Lentini) Date: Wed, 5 Sep 2007 10:11:52 -0400 (EDT) Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070904182655.GI28350@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904172004.GF28350@mellanox.co.il> <20070904182655.GI28350@mellanox.co.il> Message-ID: On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > Quoting James Lentini : > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > > On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > > > > > Quoting James Lentini : > > > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > > > > > > > > > > On Tue, 4 Sep 2007, Jason Gunthorpe wrote: > > > > > > > > > On Tue, Sep 04, 2007 at 12:11:33PM +0300, Michael S. Tsirkin wrote: > > > > > > > > > > > I know some people find this approach controversial, > > > > > > but from my perspective, this is not worse than e.g. > > > > > > SDP which does not have SW checksums pretty much by design. > > > > > > > > > > This would be a lot better in my mind if the option was negotiated as > > > > > part of the CM setup process. Otherwise this becomes a network wide > > > > > all or nothing kind of feature.. > > > > > > > > > > What if the RXing Linux IB side is acting as a forwarder to ethernet? > > > > > It will forward corrupt packets if this option is set, right? > > > > > > > > So this breaks all gateway devices? > > > > > > It won't. The gateway will calculate the checksums. > > > > > > > How would packets be routed with this change? > > > > > > As usual. > > > > A Linux system setup as a router with an IPoIB interface and an > > Ethernet interface will work if this feature is turned on? > > I am yet to test this setup, but yes, it should. I had this scenario in mind:

A ------- B ------- C
   IPoIB      Eth

A and C are Linux hosts, B is a Linux host set up as a router. If the link between A and B has this checksum change turned on, then TCP connections between A and C will fail with TCP checksum errors. Technically an IPoIB network with these changes can route IP packets to other networks, but with the missing transport layer checksums the contents are unintelligible. From jlentini at netapp.com Wed Sep 5 07:23:20 2007 From: jlentini at netapp.com (James Lentini) Date: Wed, 5 Sep 2007 10:23:20 -0400 (EDT) Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070905081011.GB25011@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904194940.GK28350@mellanox.co.il> <20070905081011.GB25011@mellanox.co.il> Message-ID: On Wed, 5 Sep 2007, Michael S. Tsirkin wrote: > > Quoting James Lentini : > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > > On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > > > > > Quoting James Lentini : > > > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > > > > > > > > > > On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > > > > > > > > Add module option hw_csum: when set, IPoIB will report S/G > > > > > support, and rely on hardware end-to-end transport checksum (ICRC) > > > > > instead of software-level protocol checksums. > > > > > > > > The purpose of this option would be clearer if the parameter name were > > > > "omit_csum". 
Calling this "HW checksum" support is misleading because > > > > the term is already used to describe network adapters that calculate > > > > TCP/IP checksums in hardware. I realize that you are using the HW > > > > checksum infrastructure to implement this, but it is really not the > > > > same thing. > > > > > > Another reason is that I declare HW_CSUM in the netdev > > > feature list. Yea, someone might get confused, > > > but "omit checksum" is misleading, too, and is likely to > > > scare users away from the feature: the need for end-to-end checksum > > > is a widely recognised requirement. > > > > I agree. Since this isn't an end-to-end checksum, > > IB spec says: > So yes, ICRC is an end-to-end checksum. This is made clear in the > modinfo description of the parameter. The ICRC checksum is a fine checksum. Your defining end-to-end as one end of an IB network to another. End-to-end in Internet terms is from one host to another over many potential networks. The source of a TCP packet could be on a IB network and be communicating with a node across the globe on a token ring. The TCP checksum is from source to destination, end-to-end. If you don't perform the TCP checksum at the source, there is no end-to-end checksum. > > I recommend that be made clear to the user. > > I don't think there's any potential for confusion There is a potential for confusion. The threads on this topic show that. How about naming the module parameter "omit_inet_csums"? > > > So I don't have a better name. Hopefully modinfo documents the > > > option well enough. > > > > > > > > Since this will not inter-operate with older IPoIB modules, this > > > > > option is off by default. > > > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > > > Does the S/G support need to be tied to the checksum changes? > > > > Can you separate the S/G support and checksum changes into different > > patches? 
>
> Oh, just cut the relevant hunks from the patch, but I don't see why
> this is useful, since S/G support in linux does not work without
> hardware checksumming.

Ok. Given that, there's no reason to separate them.

From swise at opengridcomputing.com Wed Sep 5 07:24:04 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 05 Sep 2007 09:24:04 -0500
Subject: [ofa-general] [ANNOUNCE] libcxgb3-1.0.1 published
Message-ID: <46DEBC04.2040700@opengridcomputing.com>

All,

I've published libcxgb3-1.0.1 on the open fabrics downloads page. This
is the current libcxgb3 version in ofed-1.3 and in the latest
ofed-1.2.5 development build.

http://www.openfabrics.org/downloads/cxgb3/

Steve.

From sashak at voltaire.com Wed Sep 5 07:50:10 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 5 Sep 2007 17:50:10 +0300
Subject: [ofa-general] Re: [PATCH] osm: QoS: selecting PathRecord according to QoS policy
In-Reply-To: <46DE6091.40901@dev.mellanox.co.il>
References: <46DBFAFB.4090000@dev.mellanox.co.il> <20070903172010.GB29384@sashak.voltaire.com> <46DE6091.40901@dev.mellanox.co.il>
Message-ID: <20070905145010.GL23670@sashak.voltaire.com>

Hi Yevgeny,

On 10:53 Wed 05 Sep , Yevgeny Kliteynik wrote:
> >>  	ib_net16_t dest_lid;
> >> +	uint8_t i;
> >> +	uint8_t vl;
> >> +	ib_slvl_table_t *p_slvl_tbl = NULL;
> >> +	boolean_t valid_sls[IB_MAX_NUM_VLS];
> > Use here uint16_t sl_mask instead of array - flow will be simpler.
>
> No, it won't.
> It will save three lines in the end when checking whether there is
> a valid sl that doesn't lead to VL15,

It saves loop, not just three lines :)

> but it will complicate a bit
> rest of the related code, because I still need to read port's SL2VL
> table values one by one and mark them in the array (or bitmap) one
> by one.

Right, but since (!sl_mask) check is cheap you are able to stop PR
generation at the moment when no valid SLs exist.
Just look at the patch (against recent PR code): diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c index edfa15f..1c6532b 100644 --- a/opensm/opensm/osm_sa_path_record.c +++ b/opensm/opensm/osm_sa_path_record.c @@ -253,16 +253,12 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, uint8_t in_port_num; ib_net16_t dest_lid; uint8_t i; - uint8_t vl; ib_slvl_table_t *p_slvl_tbl = NULL; - boolean_t valid_sls[IB_MAX_NUM_VLS]; - boolean_t sl2vl_valid_path; - uint8_t first_valid_sl; + uint16_t sl_mask = 0xffff; osm_qos_level_t *p_qos_level = NULL; OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); - memset(valid_sls, TRUE, IB_MAX_NUM_VLS); dest_lid = cl_hton16(dest_lid_ho); p_dest_physp = p_dest_port->p_physp; @@ -328,12 +324,18 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); /* update valid SLs that still exist on this route */ - for (i = 0; i < IB_MAX_NUM_VLS; i++) { - if (valid_sls[i]) { - vl = ib_slvl_table_get(p_slvl_tbl, i); - if (vl == IB_DROP_VL) - valid_sls[i] = FALSE; - } + for (i = 0; i < IB_MAX_NUM_VLS; i++) + if (sl_mask & (1 << i) && + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) + sl_mask &= ~(1 << i); + + if (!sl_mask) { + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "All the SLs lead to VL15 on this path\n"); + status = IB_NOT_FOUND; + goto Exit; } } @@ -456,12 +458,18 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, * Check SL2VL table of the switch and update valid SLs */ p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); - for (i = 0; i < IB_MAX_NUM_VLS; i++) { - if (valid_sls[i]) { - vl = ib_slvl_table_get(p_slvl_tbl, i); - if (vl == IB_DROP_VL) - valid_sls[i] = FALSE; - } + for (i = 0; i < IB_MAX_NUM_VLS; i++) + if (sl_mask & (1 << i) && + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) + sl_mask &= ~(1 << i); + if (!sl_mask) { + if 
(osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "All the SLs lead to VL15 " + "on this path\n"); + status = IB_NOT_FOUND; + goto Exit; } } } @@ -483,31 +491,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, "Path min MTU = %u, min rate = %u\n", mtu, rate); - if (!p_rcv->p_subn->opt.no_qos) { - /* - * check whether there is some SL - * that won't lead to VL15 eventually - */ - sl2vl_valid_path = FALSE; - for (i = 0; i < IB_MAX_NUM_VLS; i++) { - if (valid_sls[i]) { - sl2vl_valid_path = TRUE; - first_valid_sl = i; - break; - } - } - - if (!sl2vl_valid_path) { - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "All the SLs lead to VL15 on this path\n"); - } - status = IB_NOT_FOUND; - goto Exit; - } - } - if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { /* Get QoS Level object according to the path request */ osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, @@ -542,11 +525,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, pkt_life = p_qos_level->pkt_life; if (p_qos_level->sl_set) { - if (!valid_sls[p_qos_level->sl]) { + sl = p_qos_level->sl; + if (!(sl_mask & ( 1 << sl))) { status = IB_NOT_FOUND; goto Exit; } - sl = p_qos_level->sl; } if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) @@ -830,12 +813,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, p_dest_port); } else if (!p_rcv->p_subn->opt.no_qos) { - sl = first_valid_sl; + for (i = 0; i < IB_MAX_NUM_VLS; i++) + if (sl_mask&(1 << i)) { + sl = i; + break; + } } else sl = OSM_DEFAULT_SL; - if (!p_rcv->p_subn->opt.no_qos && !valid_sls[sl]) { + if (!p_rcv->p_subn->opt.no_qos && !(sl_mask & (1 << sl))) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F23: " "Selected SL (%u) leads to VL15\n", p_prtn->sl); > >> + /* 
> >> +	 * set Pkey for this path record request
> >> +	 */
> >> +
> >> +	if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) &&
> >> +	    (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)))
> > No extra () was needed - this generates confused diff lines.
>
> Not sure what you mean here by "confused diff lines".

I mean those extra lines in the patch where the only differences are
formatting or cosmetic stuff like extra braces. If you have a reason to
make such changes just send it as a separate patch.

> I agree that the extra () are not *needed*, but isn't
>
>     if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) &&
>         (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)))
>
> is more readable than
>
>     if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC &&
>         cl_ntoh32(p_pr->hop_flow_raw) & 1 << 31)
>
> ?

No. It requires 2+ seconds to make sure that some braces are just
"extra" ones.

BTW the second is incorrect - should be (1 << 31), those '()' were
needed.

Sasha

From krause at cup.hp.com Wed Sep 5 07:42:45 2007
From: krause at cup.hp.com (Michael Krause)
Date: Wed, 05 Sep 2007 07:42:45 -0700
Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To:
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904194940.GK28350@mellanox.co.il>
Message-ID: <6.2.0.14.2.20070905073422.029e0398@esmail.cup.hp.com>

Just a bit of history...

When the idea or argument was first presented within the IETF that an
IB fabric is inherently reliable and therefore one could disable
checksum calculations to improve performance, it was soundly rejected
by the hum in the room. The primary arguments were:

- Checksums are required per the end-to-end argument.

- Any node on a subnet can act as a gateway to another subnet. Many of
  these gateways are implemented using today's OS network stacks and
  should not require modification to operate for IB since all other
  layer 2 interconnects require no such modifications.
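
For concreteness, the software checksum being argued about is the
16-bit one's-complement sum that TCP, UDP and IP use (RFC 1071). The
following is only a minimal sketch in C and is not taken from any patch
in this thread; the function name inet_checksum is made up for
illustration, and a real TCP checksum would also cover the
pseudo-header, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the IPoIB patch): RFC 1071 Internet
 * checksum over a byte buffer, returned in host byte order.
 */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint64_t sum = 0;

	/* Add the buffer up as big-endian 16-bit words. */
	while (len > 1) {
		sum += ((uint64_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}

	/* An odd trailing byte is treated as the high byte of a last word. */
	if (len)
		sum += (uint64_t)data[0] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	/* The checksum is the one's complement of the folded sum. */
	return (uint16_t)~sum;
}
```

A receiver runs the same sum over the data with the checksum field
included and expects zero. A sender that skips this step leaves any
gateway or receiver that still verifies it with nothing valid to check.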
While it might be possible to do something, the consensus was if performance is the objective, then implement the same checksum off-load techniques on IB HCA / TCA hardware. That was deemed far more practical and more likely compliant with customer expectations than trying to modify the network stacks as well as the sacrosanct end-to-end argument. I would be very leery of attempting to push this into the industry or customer base. Vendors already face challenges in selling IB outside of HPC workloads and adding more fuel to the fire will only increase that challenge. Please keep in mind that a RNIC based solution does not require such added complexity and our preference to date has been to keep these technologies as close to functional parity as possible. Mike At 01:02 PM 9/4/2007, James Lentini wrote: >On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > > > Quoting James Lentini : > > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > > > > > > On Tue, 4 Sep 2007, Michael S. Tsirkin wrote: > > > > > > > Add module option hw_csum: when set, IPoIB will report S/G > > > > support, and rely on hardware end-to-end transport checksum (ICRC) > > > > instead of software-level protocol checksums. > > > > > > The purpose of this option would be clearer if the parameter name were > > > "omit_csum". Calling this "HW checksum" support is misleading because > > > the term is already used to describe network adapters that calculate > > > TCP/IP checksums in hardware. I realize that you are using the HW > > > checksum infrastructure to implement this, but it is really not the > > > same thing. > > > > Another reason is that I declare HW_CSUM in the netdev > > feature list. Yea, someone might get confused, > > but "omit checksum" is misleading, too, and is likely to > > scare users away from the feature: the need for end-to-end checksum > > is a widely recognised requirement. > >I agree. 
Since this isn't an end-to-end checksum, I recommend that be
>made clear to the user.
>
> > So I don't have a better name. Hopefully modinfo documents the
> > option well enough.
> >
> > > > Since this will not inter-operate with older IPoIB modules, this
> > > > option is off by default.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin
> > >
> > > Does the S/G support need to be tied to the checksum changes?
>
>Can you separate the S/G support and checksum changes into different
>patches?

From Arkady.Kanevsky at netapp.com Wed Sep 5 07:49:33 2007
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Wed, 5 Sep 2007 10:49:33 -0400
Subject: [ofa-general] NetEffect driver status in OFA
Message-ID:

Glenn,
What is the status of NetEffect driver?
Thanks,

Arkady Kanevsky                 email: arkady at netapp.com
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300

From dotanb at dev.mellanox.co.il Wed Sep 5 07:55:35 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Wed, 05 Sep 2007 17:55:35 +0300
Subject: [ofa-general] question about ack of completion/async events in libibverbs
Message-ID: <46DEC367.2050203@dev.mellanox.co.il>

Hi Roland.

Here is the code from libibverbs that handles the destroy CQ:

	pthread_mutex_lock(&cq->mutex);
	while (cq->comp_events_completed  != resp.comp_events_reported ||
	       cq->async_events_completed != resp.async_events_reported)
		pthread_cond_wait(&cq->cond, &cq->mutex);
	pthread_mutex_unlock(&cq->mutex);

This code will cause a careless programmer to loop forever if he acked
the events (completion or async) too many times ...

Will you accept a patch that fixes this issue?

thanks
Dotan

From kliteyn at dev.mellanox.co.il Wed Sep 5 07:59:30 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 05 Sep 2007 17:59:30 +0300
Subject: [ofa-general] Re: [PATCH] osm: QoS: selecting PathRecord according to QoS policy
In-Reply-To: <20070905145010.GL23670@sashak.voltaire.com>
References: <46DBFAFB.4090000@dev.mellanox.co.il> <20070903172010.GB29384@sashak.voltaire.com> <46DE6091.40901@dev.mellanox.co.il> <20070905145010.GL23670@sashak.voltaire.com>
Message-ID: <46DEC452.1070107@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> Hi Yevgeny,
>
> On 10:53 Wed 05 Sep , Yevgeny Kliteynik wrote:
>>>>  	ib_net16_t dest_lid;
>>>> +	uint8_t i;
>>>> +	uint8_t vl;
>>>> +	ib_slvl_table_t *p_slvl_tbl = NULL;
>>>> +	boolean_t valid_sls[IB_MAX_NUM_VLS];
>>> Use here uint16_t sl_mask instead of array - flow will be simpler.
>> No, it won't.
>> It will save three lines in the end when checking whether there is
>> a valid sl that doesn't lead to VL15,
>
> It saves loop, not just three lines :)
>
>> but it will complicate a bit
>> rest of the related code, because I still need to read port's SL2VL
>> table values one by one and mark them in the array (or bitmap) one
>> by one.
>
> Right, but since (!sl_mask) check is cheap you are able to stop PR
> generation at the moment when no valid SLs exist.
Just look at the
> patch (against recent PR code):
>
> [...]
>

Thanks, I'll look into this and I'll repost the patch.

>>>> +	/*
>>>> +	 * set Pkey for this path record request
>>>> +	 */
>>>> +
>>>> +	if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) &&
>>>> +	    (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)))
>>> No extra () was needed - this generates confused diff lines.
>> Not sure what you mean here by "confused diff lines".
>
> I mean those extra lines in the patch where the only differences are
> formatting or cosmetic stuff like extra braces. If you have a reason to
> make such changes just send it as a separate patch.
>
>> I agree that the extra () are not *needed*, but isn't
>>
>>     if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) &&
>>         (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)))
>>
>> is more readable than
>>
>>     if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC &&
>>         cl_ntoh32(p_pr->hop_flow_raw) & 1 << 31)
>>
>> ?
>
> No. It requires 2+ seconds to make sure that some braces are just
> "extra" ones.
>
> BTW the second is incorrect - should be (1 << 31), those '()' were
> needed.

No, they weren't :)

http://www.cppreference.com/operator_precedence.html

-- Yevgeny

> Sasha
>

From rdreier at cisco.com Wed Sep 5 08:18:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Sep 2007 08:18:00 -0700
Subject: [ofa-general] Re: [PATCH RFC] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts with the host stack.
In-Reply-To: <46DD762F.80904@opengridcomputing.com> (Steve Wise's message of "Tue, 04 Sep 2007 10:13:51 -0500")
References: <1187905185.5547.13.camel@stevo-desktop> <46DD762F.80904@opengridcomputing.com>
Message-ID:

> > What's wrong with my suggestion of having the iwarp driver create an
> > "iwX" interface to go with the normal "ethX" interface?
It seems
> > simpler to me, and there's a somewhat similar precedent with how
> > mac80211 devices create both wlan0 and wmaster0 interfaces.
> > - R.
>
> It seemed much more painful for me to implement. :-)
>
> I'll look into this, but I think for this to be done, the changes must
> be in the cxgb3 driver, not the rdma driver, because the guts of the
> netdev struct are all private to cxgb3. Remember that this interface
> needs to still do non TCP traffic (like ARP and UDP)...
>
> Maybe you have something in mind here that I'm not thinking about?

No, I was just spouting off.

But the whole "create a magic alias" seems kind of unfriendly to the
user. Maybe as you said, the cxgb3 net driver could create the alias
for the iw_cxgb3 driver?

 - R.

From swise at opengridcomputing.com Wed Sep 5 08:40:29 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 05 Sep 2007 10:40:29 -0500
Subject: [ofa-general] Re: [PATCH RFC] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts with the host stack.
In-Reply-To:
References: <1187905185.5547.13.camel@stevo-desktop> <46DD762F.80904@opengridcomputing.com>
Message-ID: <46DECDED.90903@opengridcomputing.com>

Roland Dreier wrote:
> > > What's wrong with my suggestion of having the iwarp driver create an
> > > "iwX" interface to go with the normal "ethX" interface? It seems
> > > simpler to me, and there's a somewhat similar precedent with how
> > > mac80211 devices create both wlan0 and wmaster0 interfaces.
> > > - R.
> >
> > It seemed much more painful for me to implement. :-)
> >
> > I'll look into this, but I think for this to be done, the changes must
> > be in the cxgb3 driver, not the rdma driver, because the guts of the
> > netdev struct are all private to cxgb3. Remember that this interface
> > needs to still do non TCP traffic (like ARP and UDP)...
> >
> > Maybe you have something in mind here that I'm not thinking about?
>
> No, I was just spouting off.
>

At least someone is looking at my patch. ;-)

> But the whole "create a magic alias" seems kind of unfriendly to the
> user. Maybe as you said, the cxgb3 net driver could create the alias
> for the iw_cxgb3 driver?

I agree that it is not very user friendly.

My current patch just utilizes the IP address alias logic in the IP stack.
So when you do 'ifconfig ethxx:blah ipaddr up' it creates a struct
in_ifaddr which contains a ptr to the real struct net_device that
services this alias. However, from what I can tell, I cannot just
create one of these without binding an address. So the driver cannot
create the alias interface until it knows the ipaddr/netmask/etc.
I.e., if you say 'ifconfig ethxx:blah up' it fails... You must supply
an address to get one of these created.

To have the cxgb3 driver create something like 'iw0', I think it would
need to create a full net_device struct. This makes the change much
more complex. But perhaps it's the right thing to do...

Steve.

From mshefty at ichips.intel.com Wed Sep 5 09:20:20 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 05 Sep 2007 09:20:20 -0700
Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace
In-Reply-To: <000b01c7ef66$1f929030$a865a8c0@catcher>
References: <46D78104.mailJY81GRONO@systemfabricworks.com> <46DDFE5D.9090203@ichips.intel.com> <000b01c7ef66$1f929030$a865a8c0@catcher>
Message-ID: <46DED744.3050803@ichips.intel.com>

> Prior to this patch and the addition of the
> smi_check_local_resp_smp() test, the only DR SMP that could have made
> it to the device's process_mad call would have been a DR SMP Request
> that was targeted to the local SMA. It is possible that the device's
> SMA would not handle the DR SMP Request and that it would return only
> IB_MAD_RESULT_SUCCESS; however in that case the find_mad_agent() call
> would still access the un-initialized mad_priv->mad.mad. For
> this reason I do not believe this code path was previously executed,
> and I believe there will be no effect on the existing behavior.

Okay - the existing code is confusing me then. It looks buggy, and it
appears that your change ends up fixing the problem.

Hal, can you explain what the following code in
handle_outgoing_dr_smp() is doing?

	case IB_MAD_RESULT_SUCCESS:
		/* Treat like an incoming receive MAD */
		port_priv = ib_get_mad_port(mad_agent_priv->agent.device,
					    mad_agent_priv->agent.port_num);
		if (port_priv) {
			mad_priv->mad.mad.mad_hdr.tid =
				((struct ib_mad *)smp)->mad_hdr.tid;
			recv_mad_agent = find_mad_agent(port_priv,
							&mad_priv->mad.mad);
		}
		if (!port_priv || !recv_mad_agent) {
			kmem_cache_free(ib_mad_cache, mad_priv);
			kfree(local);
			ret = 0;
			goto out;
		}
		local->mad_priv = mad_priv;
		local->recv_mad_agent = recv_mad_agent;
		break;

- Sean

From mst at dev.mellanox.co.il Wed Sep 5 09:06:33 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 5 Sep 2007 19:06:33 +0300
Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To: <46DE78EF.1070701@voltaire.com>
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <46DD466F.8020607@voltaire.com> <20070904164018.GB28350@mellanox.co.il> <46DE78EF.1070701@voltaire.com>
Message-ID: <20070905160633.GC28361@mellanox.co.il>

> no its not useful

For you. So go ahead and don't use it.

-- 
MST

From mst at dev.mellanox.co.il Wed Sep 5 09:15:23 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 5 Sep 2007 19:15:23 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To:
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904172004.GF28350@mellanox.co.il> <20070904182655.GI28350@mellanox.co.il>
Message-ID: <20070905161523.GD28361@mellanox.co.il>

> Quoting James Lentini :
> Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support
>
>
>
> On Tue, 4 Sep 2007, Michael S. Tsirkin wrote:
>
> > > Quoting James Lentini :
> > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support
> > >
> > >
> > >
> > > On Tue, 4 Sep 2007, Michael S.
Tsirkin wrote: > > > > > > > > Quoting James Lentini : > > > > > Subject: Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > > > > > > > > > > > > > > On Tue, 4 Sep 2007, Jason Gunthorpe wrote: > > > > > > > > > > > On Tue, Sep 04, 2007 at 12:11:33PM +0300, Michael S. Tsirkin wrote: > > > > > > > > > > > > > I know some people find this approach controversial, > > > > > > > but from my perspective, this is not worse than e.g. > > > > > > > SDP which does not have SW checksums pretty much by design. > > > > > > > > > > > > This would be alot better in my mind of the option was negotiated as > > > > > > part of the CM setup process. Otherwise this becomes a network wide > > > > > > all or nothing kind of feature.. > > > > > > > > > > > > What if the RXing Linux IB side is acting as a forwarder to ethernet? > > > > > > It will forward corrupt packets if this option is set, right? > > > > > > > > > > So this break all gateway devices? > > > > > > > > It won't. The gateway will calculate the checksums. > > > > > > > > > How would packets be routed with this change? > > > > > > > > As usual. > > > > > > A Linux system setup as a router with an IPoIB interface and an > > > Ethernet interface will work if this feature is turned on? > > > > I am yet to test this setup, but yes, it should. > > I has this scenario in mind: > > A ------- B ------- C > IPoIB Eth > > A and C are Linux hosts, B is a Linux host setup as a router. > > If the link between A and B has this checksum change turned on, then > then TCP connections between A anc C will fail with TCP checksum > errors. A to C communication will work if B goes over A->C packets and fills in the transport checksums before sending the packet to C. > Technically an IPoIB network with these changes can route IP packets > to other networks, I know. Hopefully this will keep working with hw_csum bit set. > but with the missing transport layer checksums > the contents are unintelligible. 
This is not what I'm aiming for :). In this setup, the transport checksums could be calculated by B. I haven't tested this conf yet, hopefully, this can be made to work without changes to linux networking stack, by assigning CHECKSUM_PARTIAL to the skb. -- MST From sashak at voltaire.com Wed Sep 5 09:40:28 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Sep 2007 19:40:28 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS: selecting PathRecord according to QoS policy In-Reply-To: <46DEC452.1070107@dev.mellanox.co.il> References: <46DBFAFB.4090000@dev.mellanox.co.il> <20070903172010.GB29384@sashak.voltaire.com> <46DE6091.40901@dev.mellanox.co.il> <20070905145010.GL23670@sashak.voltaire.com> <46DEC452.1070107@dev.mellanox.co.il> Message-ID: <20070905164028.GM23670@sashak.voltaire.com> On 17:59 Wed 05 Sep , Yevgeny Kliteynik wrote: > >> > >> is more readable than > >> > >> if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > >> cl_ntoh32(p_pr->hop_flow_raw) & 1 << 31) > >> > >> ? > > No. It requires 2+ seconds to make sure that some braces are just > > "extra" ones. > > BTW the second is incorrect - should be (1 << 31), those '()' were > > needed. > > No, they weren't :) > > http://www.cppreference.com/operator_precedence.html Right, and even gcc -Wparentheses doesn't generate warnings anymore :). Sasha From sashak at voltaire.com Wed Sep 5 09:44:45 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 5 Sep 2007 19:44:45 +0300 Subject: [ofa-general] Re: [opensm] bugs in build system In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com> <20070904203621.GI23670@sashak.voltaire.com> Message-ID: <20070905164445.GN23670@sashak.voltaire.com> On 09:40 Wed 05 Sep , Hal Rosenstock wrote: > > > I do not understand why after installing the common lib I still get this > > > error? > > > Isn't the search path should include the /lib ??? 
> >
> > Seems it is AC_CHECK_LIB() feature (ugh - I hate autotools mess :))
> >
> > I'm not really sure such checks should be there. libibcommon library is
> > part of our project and not "external" library.
>
> Though it currently is a separate library though and part of separate
> package/rpm.

Yes, it makes sense when distribution is separate tarballs.

Sasha

From avi at qumranet.com Wed Sep 5 09:38:48 2007
From: avi at qumranet.com (Avi Kivity)
Date: Wed, 5 Sep 2007 19:38:48 +0300
Subject: [ofa-general] [PATCH][RFC]: pte notifiers -- support for external page tables
Message-ID: <11890103283456-git-send-email-avi@qumranet.com>

Some hardware and software systems maintain page tables outside the normal
Linux page tables, which reference userspace memory. This includes
Infiniband, other RDMA-capable devices, and kvm (with a pending patch).

Because these systems maintain external page tables (and external tlbs),
Linux cannot demand page this memory and it must be locked. For kvm at
least, this is a significant reduction in functionality.

This sample patch adds a new mechanism, pte notifiers, that allows drivers
to register an interest in changes to ptes. Whenever Linux changes a
pte, it will call a notifier to allow the driver to adjust the external
page table and flush its tlb.

Note that only one notifier is implemented, ->clear(), but others should be
similar.

pte notifiers are different from paravirt_ops: they extend the normal
page tables rather than replace them; and they provide high-level information
such as the vma and the virtual address for the driver to use.
Signed-off-by: Avi Kivity diff --git a/include/linux/mm.h b/include/linux/mm.h index 655094d..5d2bbee 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -14,6 +14,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -108,6 +109,9 @@ struct vm_area_struct { #ifndef CONFIG_MMU atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */ #endif +#ifdef CONFIG_PTE_NOTIFIERS + struct list_head pte_notifier_list; +#endif #ifdef CONFIG_NUMA struct mempolicy *vm_policy; /* NUMA policy for the VMA */ #endif diff --git a/include/linux/pte_notifier.h b/include/linux/pte_notifier.h new file mode 100644 index 0000000..d28832b --- /dev/null +++ b/include/linux/pte_notifier.h @@ -0,0 +1,52 @@ +#ifndef _LINUX_PTE_NOTIFIER_H +#define _LINUX_PTE_NOTIFIER_H + +#include + +struct vm_area_struct; + +#ifdef CONFIG_PTE_NOTIFIERS + +struct pte_notifier; + +struct pte_notifier_ops { + void (*close)(struct pte_notifier *pn, struct vm_area_struct *vma); + void (*clear)(struct pte_notifier *pn, struct vm_area_struct *vma, + unsigned long address); +}; + +struct pte_notifier { + struct list_head link; + const struct pte_notifier_ops *ops; +}; + + +void vma_init_pte_notifiers(struct vm_area_struct *vma); +void vma_close_pte_notifiers(struct vm_area_struct *vma); +void pte_notifier_register(struct pte_notifier *pn, + struct vm_area_struct *vma); +void pte_notifier_unregister(struct pte_notifier *pn); + +#define pte_notifier_call(vma, function, args...) 
\
+	do {								\
+		struct pte_notifier *__pn;				\
+									\
+		list_for_each_entry(__pn, &vma->pte_notifier_list, link) \
+			__pn->ops->function(__pn, vma, args);		\
+	} while (0)
+
+#else
+
+static inline void vma_init_pte_notifiers(struct vm_area_struct *vma) {}
+static inline void vma_close_pte_notifiers(struct vm_area_struct *vma) {}
+static inline void pte_notifier_register(struct pte_notifier *pn,
+					 struct vm_area_struct *vma) {}
+static inline void pte_notifier_unregister(struct pte_notifier *pn) {}
+
+#define pte_notifier_call(vma, function, args...) \
+	do { } while (0)
+
+#endif
+
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index e24d348..7b10151 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -176,3 +176,6 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config PTE_NOTIFIERS
+	bool
diff --git a/mm/Makefile b/mm/Makefile
index 245e33a..59f6a03 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -29,4 +29,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_PTE_NOTIFIERS) += pte_notifier.o

diff --git a/mm/mmap.c b/mm/mmap.c
index b653721..cc6c4fe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1134,6 +1134,7 @@ munmap_back:
 	vma->vm_page_prot = protection_map[vm_flags &
 				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
+	vma_init_pte_notifiers(vma);

 	if (file) {
 		error = -EINVAL;
diff --git a/mm/pte_notifier.c b/mm/pte_notifier.c
new file mode 100644
index 0000000..0b9076c
--- /dev/null
+++ b/mm/pte_notifier.c
@@ -0,0 +1,32 @@
+
+#include
+
+void vma_init_pte_notifiers(struct vm_area_struct *vma)
+{
+	INIT_LIST_HEAD(&vma->pte_notifier_list);
+}
+EXPORT_SYMBOL_GPL(vma_init_pte_notifiers);
+
+void vma_close_pte_notifiers(struct vm_area_struct *vma)
+{
+	struct pte_notifier *pn;
+	struct pte_notifier *n;
+
+	list_for_each_entry_safe(pn, n, &vma->pte_notifier_list, link) {
+		pn->ops->close(pn, vma);
+		list_del(&pn->link);
+	}
+}
+
+void
pte_notifier_register(struct pte_notifier *pn, struct vm_area_struct *vma) +{ + list_add(&pn->link, &vma->pte_notifier_list); +} +EXPORT_SYMBOL_GPL(pte_notifier_register); + +void pte_notifier_unregister(struct pte_notifier *pn) +{ + list_del(&pn->link); +} +EXPORT_SYMBOL_GPL(pte_notifier_unregister); + diff --git a/mm/rmap.c b/mm/rmap.c index 41ac397..3f61d38 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -682,6 +682,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /* Nuke the page table entry. */ + pte_notifier_call(vma, clear, address); flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); From hal.rosenstock at gmail.com Wed Sep 5 10:03:17 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Sep 2007 13:03:17 -0400 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <000b01c7ef66$1f929030$a865a8c0@catcher> References: <46D78104.mailJY81GRONO@systemfabricworks.com> <46DDFE5D.9090203@ichips.intel.com> <000b01c7ef66$1f929030$a865a8c0@catcher> Message-ID: On 9/4/07, Steve Welch wrote: > Hi Sean, > > > -----Original Message----- > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Tuesday, September 04, 2007 7:55 PM > > To: swelch at systemfabricworks.com > > Cc: general at lists.openfabrics.org; sean.hefty at intel.com > > Subject: Re: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR > > SMP responses from userspace > > > > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct > > ib_mad_agent_private *mad_agent_priv, > > > if (port_priv) { > > > mad_priv->mad.mad.mad_hdr.tid = > > > ((struct ib_mad *)smp)->mad_hdr.tid; > > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct > ib_mad)); > > > > I'm having a hard time understanding the impact of this change. If I'm > > reading the code correctly, mad_priv->mad should contain the response > > from the device process_mad() routine. 
This changes that response. Can > > you provide more details describing the effect this change has on the > > existing behavior? > > The new code is executed when the device specific process_mad function > returns only IB_MAD_RESULT_SUCCESS status in the status bitmask. Since > IB_MAD_RESULT_REPLY is not also set; the device is indicating it did > not create a response and mad_priv->mad should be as it was before the > process_mad call (i.e. not initialized with a response). Since the > IB_MAD_RESULT_CONSUMED status was not set in the status bitmask, the > original MAD is still needing delivery and by definition goes to > the local node. > > Prior to this patch and the addition of the of the > smi_check_local_resp_smp() test, the only DR SMP that could have made > it to the device's process_mad call would have been a DR SMP Request > that was targeted to the local SMA. I don't think that statement is 100% accurate as incoming DR SMInfo queries make it to the SM. > It is possible that the device's > SMA would not handle the DR SMP Request and that it would return only > IB_MAD_RESULT_SUCCESS; however in that case the find_mad_agent() call > would still access the un-initialized mad_priv->mad.mad. For > this reason I do not believe this code path was previously executed, > and I believe there will be no effect on the existing behavior. > > Running with these changes, the IB utilities built on top of DR SMP's > continue to operate on the host, going both to the local SMA and out > on the fabric to an SMA. > > Also, I think we can eliminate setting the tid, since the memcpy will > > set that as well. > Yes, I agree. 
> > > > > > recv_mad_agent = find_mad_agent(port_priv, > > > &mad_priv->mad.mad); > > > } > > > diff --git a/drivers/infiniband/core/smi.h > > b/drivers/infiniband/core/smi.h > > > index 1cfc298..d96fc8e 100644 > > > --- a/drivers/infiniband/core/smi.h > > > +++ b/drivers/infiniband/core/smi.h > > > @@ -71,4 +71,18 @@ static inline enum smi_action > > smi_check_local_smp(struct ib_smp *smp, > > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > > } > > > + > > > +/* > > > + * Return 1 if the SMP response should be handled by the local > > management stack > > > + */ > > > > The comment is off here - return IB_SMI_HANDLE. (It's off for > > smi_check_local_smp() as well.) > Yes, I agree. It appears I was a little over zealous in my header > cut and paste of the existing DR SMP request local check function. > > > > > > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp > > *smp, > > > + struct ib_device > *device) > > > +{ > > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > > + return ((device->process_mad && > > > + ib_get_smp_direction(smp) && > > > + !smp->hop_ptr) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); > > > +} > > > > - Sean > Thanks, > Steve > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Wed Sep 5 10:04:00 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 05 Sep 2007 10:04:00 -0700 Subject: [ofa-general] [RFC] [PATCH 1/5 v2] ib/ipoib: specify Traffic Classwith PR queries for QoS support In-Reply-To: <46DE7D99.7000508@voltaire.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> <46DE7D99.7000508@voltaire.com> Message-ID: <46DEE180.4040103@ichips.intel.com> > During the first post the issue of providing also the SL (and/or other > params) from the broadcast group as part of the path query was raised, > and I kind of failed to follow all the discussion that evolved... Can > you clarify if the consensus was that based on the pkey and traffic > class, the SA should return the --same-- SL (and/or other params) on > this path query as of the broadcast group? I used TClass in case the multicast group spanned subnets. Since TCLass is defined as the end-to-end service level, I think it is sufficient. I don't know that the SM *must* have an N:1 TClass->SL mapping, but that doesn't seem unreasonable. >> memcpy(path->pathrec.dgid.raw, gid, sizeof (union ib_gid)); >> - path->pathrec.sgid = priv->local_gid; >> - path->pathrec.pkey = cpu_to_be16(priv->pkey); >> - path->pathrec.numb_path = 1; >> + path->pathrec.sgid = priv->local_gid; >> + path->pathrec.pkey = cpu_to_be16(priv->pkey); >> + path->pathrec.numb_path = 1; > > Did you just wanted to add space/tab here? also some lines are broken at > least as my email see this patch, maybe you had some problem? 
I don't see the broken lines, but, yes, I was just adjusting the spacing
here to line up with setting traffic_class (below).

>> + path->pathrec.traffic_class =
>> priv->broadcast->mcmember.traffic_class;
>
> For this to take effect, don't you need to set the
> IB_SA_PATH_REC_TRAFFIC_CLASS bit in the component mask?

Err... yes. I've updated the patch. Thanks

- Sean

From jgunthorpe at obsidianresearch.com Wed Sep 5 10:05:45 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 5 Sep 2007 11:05:45 -0600
Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To: <20070905061913.GN28350@mellanox.co.il>
References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> <20070905051040.GM28350@mellanox.co.il> <20070905055108.GB16535@obsidianresearch.com> <20070905061913.GN28350@mellanox.co.il>
Message-ID: <20070905170545.GM4472@obsidianresearch.com>

On Wed, Sep 05, 2007 at 09:19:13AM +0300, Michael S. Tsirkin wrote:

> > for those packets
> > than it is today - dev_queue_xmit today calls skb_checksum_help on
> > behalf of ipoib for every packet.
>
> I don't think it does, normally: the packets it gets now usually
> have CHECKSUM_COMPLETE.

Are you sure? This part has changed a lot recently, but it used to be
that you never get CHECKSUM_COMPLETE on the TX side, only PARTIAL or
NONE. skb_checksum_help and ip_forward both convert
CHECKSUM_COMPLETE to CHECKSUM_NONE. No in-tree Ethernet driver looks
at CHECKSUM_COMPLETE on the TX path.

The code I am thinking of is the test in dev_queue_xmit:

	/* If packet is not checksummed and device does not support
	 * checksumming for this protocol, complete checksumming here.
	 */
	if (skb->ip_summed == CHECKSUM_PARTIAL) {
		skb_set_transport_header(skb, skb->csum_start -
					      skb_headroom(skb));

		if (!(dev->features & NETIF_F_GEN_CSUM) &&
		    (!(dev->features & NETIF_F_IP_CSUM) ||
		     skb->protocol != htons(ETH_P_IP)))
			if (skb_checksum_help(skb))
				goto out_kfree_skb;
	}

Since this is the only use of NETIF_F_GEN_CSUM, I assume that this is
the only place where L4 csum is computed for packets originating
within the host.

> > Also, my other thought was about the RX path, it should work more like
> >
> >  if (header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM))
> >      ip_summed = CHECKSUM_PARTIAL // Sender says the csum is bad
> >  else
> >      if (enabled_hw_csum_support)
> >          ip_summed = CHECKSUM_UNNECESSARY // Sender says the csum should be good
>
> Hmm. Where does this last line come from? It looks wrong ...
>
> >      else
> >          ip_summed = CHECKSUM_NONE; // Force checking

The idea I had is if you turn on hw_csum_support then the RX side
never csum checks. It either uses UNNECESSARY or PARTIAL, depending on
the case. If you turn that off, then the RX side csum checks every
packet it can. That addresses this:

> It's not that simple: F_HWCSUM is also a hint for RX side,
> so it might be a win if the *remote* does not have RX checksum
> offloading.

The F_HWCSUM flag is then really better named
IPOIB_HEADER_F_L4_CSUM_UNCOMPUTED

> But yes, maybe I should ignore multicast speed for now, and say
> that it will get fixed by hardware offloading in the future.

Judging by the other comments in this thread, it still seems to me
this would be best as RC only, notionally with the idea that RC is
only used between hosts and not between gateways and hosts
(administratively configured). That way the end-to-end nature of the
checksum is retained. Gateways that want to support RC can negotiate
this feature off.

You may also want to look at using the new TSO/GSO/LRO stuff in a RC
context. If you could send an entire GSO in one go and receive it as a
LRO that might be a big improvement too.
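
The RX-side decision described above can be sketched as a small
user-space function. This is only an illustration under stated
assumptions: the enum, the flag bit value and the function name are
hypothetical stand-ins, not the kernel's CHECKSUM_* constants or the
IPoIB header layout from the patch, and the flags are taken in host
byte order to keep the example self-contained.

```c
#include <assert.h>   /* assert() is used by the usage lines shown in the comment below */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for CHECKSUM_NONE / _UNNECESSARY / _PARTIAL. */
enum rx_csum_state {
	RX_CSUM_NONE,         /* stack must verify the checksum in software */
	RX_CSUM_UNNECESSARY,  /* trust the sender, skip verification */
	RX_CSUM_PARTIAL       /* sender left the L4 checksum uncomputed */
};

/* Hypothetical header flag bit: "L4 checksum was not computed by the sender". */
#define HDR_F_L4_CSUM_UNCOMPUTED 0x0001u

/*
 * Mirror the three-way decision from the quoted pseudo-code:
 *   flag set                -> PARTIAL
 *   flag clear, feature on  -> UNNECESSARY
 *   flag clear, feature off -> NONE (force checking)
 */
static enum rx_csum_state rx_csum_decide(uint16_t flags, bool hw_csum_enabled)
{
	if (flags & HDR_F_L4_CSUM_UNCOMPUTED)
		return RX_CSUM_PARTIAL;
	if (hw_csum_enabled)
		return RX_CSUM_UNNECESSARY;
	return RX_CSUM_NONE;
}

/*
 * Usage:
 *   assert(rx_csum_decide(HDR_F_L4_CSUM_UNCOMPUTED, true) == RX_CSUM_PARTIAL);
 *   assert(rx_csum_decide(0, false) == RX_CSUM_NONE);
 */
```

Keeping the decision in one pure function makes the "flag wins over
local configuration" ordering explicit, which is the point under
discussion in the thread.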
Regards, Jason From mshefty at ichips.intel.com Wed Sep 5 10:23:29 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 05 Sep 2007 10:23:29 -0700 Subject: [ofa-general] [PATCH] librdmacm 1/2: add valgrind support to auto-tools configuration file In-Reply-To: <46DE9297.6060600@dev.mellanox.co.il> References: <200708151352.42026.dotanb@dev.mellanox.co.il> <46C38C94.8060805@ichips.intel.com> <46C412CE.1040701@dev.mellanox.co.il> <000101c7e05f$aed63fa0$ff0da8c0@amr.corp.intel.com> <46DE9297.6060600@dev.mellanox.co.il> Message-ID: <46DEE611.10100@ichips.intel.com> > What is the status of this patch? > > I would like to finish this issue before this code freeze..... I was still trying to get caught back up from vacation, but will get to this today. - Sean From swise at opengridcomputing.com Wed Sep 5 10:27:09 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 05 Sep 2007 12:27:09 -0500 Subject: [ofa-general] Re: [PATCH RFC] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts with the host stack. In-Reply-To: <46DECDED.90903@opengridcomputing.com> References: <1187905185.5547.13.camel@stevo-desktop> <46DD762F.80904@opengridcomputing.com> <46DECDED.90903@opengridcomputing.com> Message-ID: <46DEE6ED.7070000@opengridcomputing.com> Steve Wise wrote: > > > Roland Dreier wrote: >> > > What's wrong with my suggestion of having the iwarp driver create an >> > > "iwX" interface to go with the normal "ethX" interface? It seems >> > > simpler to me, and there's a somewhat similar precedent with how >> > > mac80211 devices create both wlan0 and wmaster0 interfaces. >> > > - R. >> > > It seemed much more painful for me to implement. :-) >> > > I'll look into this, but I think for this to be done, the >> changes must >> > be in the cxgb3 driver, not the rdma driver, because the guts of the >> > netdev struct are all private to cxgb3. Remember that this interface >> > needs to still do non TCP traffic (like ARP and UDP)... 
>> > > Maybe you have something in mind here that I'm not thinking about? >> >> No, I was just spouting off. >> > > At least someone is looking at my patch. ;-) > >> But the whole "create a magic alias" seems kind of unfriendly to the >> user. Maybe as you said, the cxgb3 net driver could create the alias >> for the iw_cxgb3 driver? > > I agree that it is not very user friendly. > > My current patch just utilizes the IP address alias logic in the IP > stack. So when you do 'ifconfig ethxx:blah ipaddr up' it creates a > struct in_ifaddr which contains a ptr to the real struct net_device that > services this alias. However, from what I can tell, I cannot just > create one of these without binding an address. So the driver cannot > create the alias interface until it knows the ipaddr/netmask/etc. IE: > if you say 'ifconfig ethxx:blah up' it fails... You must supply an > address to get one of these created. > > To have the cxgb3 driver create something like 'iw0', I think it would > need to create a full net_device struct. This makes the change much > more complex. But perhaps its the right thing to do... > > Steve. > Also, I could defer registering the device with the rdma core until the alias interface is created by the user. Thus the T3 device wouldn't be available for use until the ethxx:iw interface is created. And I could log a WARN or INFO message if the iw_cxgb3 module is loaded and no ethxx:iw alias exists. This would help clue in the user... Steve. 
From swelch at systemfabricworks.com Wed Sep 5 11:05:59 2007 From: swelch at systemfabricworks.com (Steve Welch) Date: Wed, 5 Sep 2007 13:05:59 -0500 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: References: <46D78104.mailJY81GRONO@systemfabricworks.com> <46DDFE5D.9090203@ichips.intel.com> <000b01c7ef66$1f929030$a865a8c0@catcher> Message-ID: <000e01c7efe7$6a3c77a0$bc0da8c0@catcher> > -----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Wednesday, September 05, 2007 12:03 PM > To: Steve Welch > Cc: Sean Hefty; general at lists.openfabrics.org > Subject: Re: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR > SMP responses from userspace > > The new code is executed when the device specific process_mad function > > returns only IB_MAD_RESULT_SUCCESS status in the status bitmask. Since > > IB_MAD_RESULT_REPLY is not also set; the device is indicating it did > > not create a response and mad_priv->mad should be as it was before the > > process_mad call (i.e. not initialized with a response). Since the > > IB_MAD_RESULT_CONSUMED status was not set in the status bitmask, the > > original MAD is still needing delivery and by definition goes to > > the local node. > > > > Prior to this patch and the addition of the of the > > smi_check_local_resp_smp() test, the only DR SMP that could have made > > it to the device's process_mad call would have been a DR SMP Request > > that was targeted to the local SMA. > > I don't think that statement is 100% accurate as incoming DR SMInfo > queries make it to the SM. True, as would any DR SMP request the local device driver would not handle, as indicated in the next paragraph. Replace "SMA" with "SMI" above for clarification. "sminfo -C mthca0 -P 1 -D 0" could be used as a test case to see the ramifications of this change. On a 1.2 build it does not work, on a 1.3 patched with this change it does. 
See below: OFED 1.2 system - LID Routed [root at tiger:]> sminfo -C mthca0 -P 1 sminfo: sm lid 1 sm guid 0x2c901078ce001, activity count 938 priority 0 state 3 SMINFO_MASTER OFED 1.2 system - Directed Route [root at tiger:]> sminfo -C mthca0 -P 1 -D 0 sminfo: sm lid 0 sm guid 0xffff000000000000, activity count 0 priority 0 state 0 SMINFO_NOTACT [root at tiger:]> OFED 1.3 patched system - Directed Route nirvana: # sminfo -C mthca0 -P 1 -D 0 sminfo: sm lid 0 sm guid 0x2c901078ce001, activity count 769 priority 0 state 3 SMINFO_MASTER nirvana: # > > > It is possible that the device's > > SMA would not handle the DR SMP Request and that it would return only > > IB_MAD_RESULT_SUCCESS; however in that case the find_mad_agent() call > > would still access the un-initialized mad_priv->mad.mad. For > > this reason I do not believe this code path was previously executed, > > and I believe there will be no effect on the existing behavior. > > > > Running with these changes, the IB utilities built on top of DR SMP's > > continue to operate on the host, going both to the local SMA and out > > on the fabric to an SMA. > > > Also, I think we can eliminate setting the tid, since the memcpy will > > > set that as well. > > Yes, I agree. > > > > > > > > > recv_mad_agent = find_mad_agent(port_priv, > > > > &mad_priv- > >mad.mad); > > > > } > > > > diff --git a/drivers/infiniband/core/smi.h > > > b/drivers/infiniband/core/smi.h > > > > index 1cfc298..d96fc8e 100644 > > > > --- a/drivers/infiniband/core/smi.h > > > > +++ b/drivers/infiniband/core/smi.h > > > > @@ -71,4 +71,18 @@ static inline enum smi_action > > > smi_check_local_smp(struct ib_smp *smp, > > > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > > > } > > > > + > > > > +/* > > > > + * Return 1 if the SMP response should be handled by the local > > > management stack > > > > + */ > > > > > > The comment is off here - return IB_SMI_HANDLE. (It's off for > > > smi_check_local_smp() as well.) 
> > Yes, I agree. It appears I was a little over zealous in my header > > cut and paste of the existing DR SMP request local check function. > > > > > > > > > +static inline enum smi_action smi_check_local_resp_smp(struct > ib_smp > > > *smp, > > > > + struct ib_device > > *device) > > > > +{ > > > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > > > + return ((device->process_mad && > > > > + ib_get_smp_direction(smp) && > > > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > > > +} > > > > > > - Sean > > Thanks, > > Steve > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general > > From riel at redhat.com Wed Sep 5 12:05:24 2007 From: riel at redhat.com (Rik van Riel) Date: Wed, 05 Sep 2007 15:05:24 -0400 Subject: [ofa-general] Re: [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <11890103283456-git-send-email-avi@qumranet.com> References: <11890103283456-git-send-email-avi@qumranet.com> Message-ID: <46DEFDF4.5000900@redhat.com> Avi Kivity wrote: > This sample patch adds a new mechanism, pte notifiers, that allows drivers > to register an interest in a changes to ptes. Whenever Linux changes a > pte, it will call a notifier to allow the driver to adjust the external > page table and flush its tlb. > > Note that only one notifier is implemented, ->clear(), but others should be > similar. This approach makes a lot of sense. > diff --git a/mm/rmap.c b/mm/rmap.c > index 41ac397..3f61d38 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -682,6 +682,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > } > > /* Nuke the page table entry. 
*/ > + pte_notifier_call(vma, clear, address); > flush_cache_page(vma, address, page_to_pfn(page)); > pteval = ptep_clear_flush(vma, address, pte); If you want this to be useful to Infiniband, you should probably also hook up do_wp_page() in mm/memory.c, where a page table can be pointed to another page. Probably the code in mm/mremap.c will need to be hooked up too. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. From ardavis at ichips.intel.com Wed Sep 5 12:12:00 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 05 Sep 2007 12:12:00 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> Message-ID: <46DEFF80.30005@ichips.intel.com> Jeff Becker wrote: >Hi all. I have a first cut. > >If you view "http://www.openfabrics.org/listdir.php" in your browser, >all the download directories are given as links, and I list the >contents of WEB_README if it exists. Please let me know what you >think. Thanks. > > Looks fine to me. When do you plan on adding this to the download web page? 
Thanks, -arlin From avi at qumranet.com Wed Sep 5 12:14:27 2007 From: avi at qumranet.com (Avi Kivity) Date: Wed, 05 Sep 2007 22:14:27 +0300 Subject: [ofa-general] Re: [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <46DEFDF4.5000900@redhat.com> References: <11890103283456-git-send-email-avi@qumranet.com> <46DEFDF4.5000900@redhat.com> Message-ID: <46DF0013.4060804@qumranet.com> Rik van Riel wrote: >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 41ac397..3f61d38 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -682,6 +682,7 @@ static int try_to_unmap_one(struct page *page, >> struct vm_area_struct *vma, >> } >> >> /* Nuke the page table entry. */ >> + pte_notifier_call(vma, clear, address); >> flush_cache_page(vma, address, page_to_pfn(page)); >> pteval = ptep_clear_flush(vma, address, pte); > > If you want this to be useful to Infiniband, you should probably > also hook up do_wp_page() in mm/memory.c, where a page table can > be pointed to another page. > > Probably the code in mm/mremap.c will need to be hooked up too. > I imagine that many of the paravirt_ops mmu hooks will need to be exposed as pte notifiers. This can't be done as part of the paravirt_ops code due to the need to pass high level data structures, though. -- Any sufficiently difficult bug is indistinguishable from a feature. 
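
For readers following the pte notifier thread, the callback pattern
being proposed can be modelled in plain user-space C. The sketch below
is an assumption-laden illustration, not the patch itself: `region`,
`notifier`, `shadow_table` and all function names are hypothetical, a
singly linked list replaces the kernel's list_head, and no locking is
shown.

```c
#include <assert.h>
#include <stddef.h>

struct region;    /* hypothetical stand-in for a vma */
struct notifier;  /* hypothetical stand-in for a pte notifier */

struct notifier_ops {
	void (*close)(struct notifier *n, struct region *r);
	void (*clear)(struct notifier *n, struct region *r, unsigned long address);
};

struct notifier {
	struct notifier *next;
	const struct notifier_ops *ops;
};

struct region {
	struct notifier *notifiers;  /* head of the registered-notifier list */
};

static void region_init(struct region *r)
{
	r->notifiers = NULL;
}

static void notifier_register(struct notifier *n, struct region *r)
{
	assert(n->ops != NULL);  /* a notifier without callbacks is a caller bug */
	n->next = r->notifiers;
	r->notifiers = n;
}

static void notifier_unregister(struct notifier *n, struct region *r)
{
	struct notifier **pp;

	for (pp = &r->notifiers; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			n->next = NULL;
			return;
		}
	}
}

/* Tell every registered notifier that the mapping at 'address' is going away. */
static void region_notify_clear(struct region *r, unsigned long address)
{
	struct notifier *n;

	for (n = r->notifiers; n; n = n->next)
		if (n->ops->clear)
			n->ops->clear(n, r, address);
}

/* Teardown: fetch the next pointer before the callback, which may free the node. */
static void region_close_notifiers(struct region *r)
{
	struct notifier *n = r->notifiers, *next;

	r->notifiers = NULL;
	while (n) {
		next = n->next;
		n->next = NULL;
		if (n->ops->close)
			n->ops->close(n, r);
		n = next;
	}
}

/* Example "driver" state: an external table that mirrors the region. */
struct shadow_table {
	struct notifier notifier;  /* first member, so the pointer can be cast back */
	unsigned long last_cleared;
	int clear_calls;
	int closed;
};

static void shadow_clear(struct notifier *n, struct region *r, unsigned long address)
{
	struct shadow_table *st = (struct shadow_table *)n;

	(void)r;
	st->last_cleared = address;  /* a real driver would drop its external mapping here */
	st->clear_calls++;
}

static void shadow_close(struct notifier *n, struct region *r)
{
	struct shadow_table *st = (struct shadow_table *)n;

	(void)r;
	st->closed = 1;
}

static const struct notifier_ops shadow_ops = {
	.close = shadow_close,
	.clear = shadow_clear,
};
```

A caller would embed `struct notifier` in its own state (as
`shadow_table` does), register it against a region, and react in
`clear` before the underlying mapping changes.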
From riel at redhat.com Wed Sep 5 12:23:32 2007 From: riel at redhat.com (Rik van Riel) Date: Wed, 05 Sep 2007 15:23:32 -0400 Subject: [ofa-general] Re: [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <46DF0013.4060804@qumranet.com> References: <11890103283456-git-send-email-avi@qumranet.com> <46DEFDF4.5000900@redhat.com> <46DF0013.4060804@qumranet.com> Message-ID: <46DF0234.7090504@redhat.com> Avi Kivity wrote: > Rik van Riel wrote: > >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 41ac397..3f61d38 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -682,6 +682,7 @@ static int try_to_unmap_one(struct page *page, >>> struct vm_area_struct *vma, >>> } >>> >>> /* Nuke the page table entry. */ >>> + pte_notifier_call(vma, clear, address); >>> flush_cache_page(vma, address, page_to_pfn(page)); >>> pteval = ptep_clear_flush(vma, address, pte); >> >> If you want this to be useful to Infiniband, you should probably >> also hook up do_wp_page() in mm/memory.c, where a page table can >> be pointed to another page. >> >> Probably the code in mm/mremap.c will need to be hooked up too. >> > > I imagine that many of the paravirt_ops mmu hooks will need to be > exposed as pte notifiers. This can't be done as part of the > paravirt_ops code due to the need to pass high level data structures, > though. Wait, I thought that paravirt_ops was all on the side of the guest kernel, where these host kernel operations are invisible? -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. 
From andres.more at intel.com Wed Sep 5 12:31:11 2007
From: andres.more at intel.com (More, Andres)
Date: Wed, 5 Sep 2007 12:31:11 -0700
Subject: [ofa-general] Dependency Issue on OFED 1.0 against SLES10
Message-ID:

I've noted that the dependency on kernel-source-2.6.16.21-0.8.x86_64.rpm
is not included when installing the RPM packages included in
http://www.openfabrics.org/downloads/ofed-1.0-sles10-rpms_x86_64.tar.gz.

So, following the steps in will issue an error message but the package
installation will continue.

Another thing is that I need to explicitly modprobe rdma_ucm. Shouldn't
it be loaded automatically?

--
Andres

From avi at qumranet.com Wed Sep 5 12:32:44 2007
From: avi at qumranet.com (Avi Kivity)
Date: Wed, 5 Sep 2007 22:32:44 +0300
Subject: [ofa-general] [PATCH][RFC] pte notifiers -- support for external page tables
Message-ID: <11890207643068-git-send-email-avi@qumranet.com>

[resend due to bad alias expansion resulting in some recipients
being bogus]

Some hardware and software systems maintain page tables outside the normal
Linux page tables, which reference userspace memory. This includes
Infiniband, other RDMA-capable devices, and kvm (with a pending patch).

Because these systems maintain external page tables (and external tlbs),
Linux cannot demand page this memory and it must be locked. For kvm at
least, this is a significant reduction in functionality.

This sample patch adds a new mechanism, pte notifiers, that allows drivers
to register an interest in changes to ptes. Whenever Linux changes a
pte, it will call a notifier to allow the driver to adjust the external
page table and flush its tlb.

Note that only one notifier is implemented, ->clear(), but others should be
similar.
pte notifiers are different from paravirt_ops: they extend the normal page tables rather than replace them; and they provide high-level information such as the vma and the virtual address for the driver to use. Signed-off-by: Avi Kivity diff --git a/include/linux/mm.h b/include/linux/mm.h index 655094d..5d2bbee 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -14,6 +14,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -108,6 +109,9 @@ struct vm_area_struct { #ifndef CONFIG_MMU atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */ #endif +#ifdef CONFIG_PTE_NOTIFIERS + struct list_head pte_notifier_list; +#endif #ifdef CONFIG_NUMA struct mempolicy *vm_policy; /* NUMA policy for the VMA */ #endif diff --git a/include/linux/pte_notifier.h b/include/linux/pte_notifier.h new file mode 100644 index 0000000..d28832b --- /dev/null +++ b/include/linux/pte_notifier.h @@ -0,0 +1,52 @@ +#ifndef _LINUX_PTE_NOTIFIER_H +#define _LINUX_PTE_NOTIFIER_H + +#include + +struct vm_area_struct; + +#ifdef CONFIG_PTE_NOTIFIERS + +struct pte_notifier; + +struct pte_notifier_ops { + void (*close)(struct pte_notifier *pn, struct vm_area_struct *vma); + void (*clear)(struct pte_notifier *pn, struct vm_area_struct *vma, + unsigned long address); +}; + +struct pte_notifier { + struct list_head link; + const struct pte_notifier_ops *ops; +}; + + +void vma_init_pte_notifiers(struct vm_area_struct *vma); +void vma_close_pte_notifiers(struct vm_area_struct *vma); +void pte_notifier_register(struct pte_notifier *pn, + struct vm_area_struct *vma); +void pte_notifier_unregister(struct pte_notifier *pn); + +#define pte_notifier_call(vma, function, args...) 
\ + do { \ + struct pte_notifier *__pn; \ + \ + list_for_each_entry(__pn, &vma->pte_notifier_list, link) \ + __pn->ops->function(__pn, vma, args); \ + } while (0) + +#else + +static inline void vma_init_pte_notifiers(struct vm_area_struct *vma) {} +static inline void vma_close_pte_notifiers(struct vm_area_struct *vma) {} +static inline void pte_notifier_register(struct pte_notifier *pn, + struct vm_area_struct *vma) {} +static inline void pte_notifier_unregister(struct pte_notifier *pn) {} + +#define pte_notifier_call(vma, function, args...) \ + do { } while (0) + +#endif + + +#endif diff --git a/mm/Kconfig b/mm/Kconfig index e24d348..7b10151 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -176,3 +176,6 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config PTE_NOTIFIERS + bool diff --git a/mm/Makefile b/mm/Makefile index 245e33a..59f6a03 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -29,4 +29,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o +obj-$(CONFIG_PTE_NOTIFIERS) += pte_notifier.o diff --git a/mm/mmap.c b/mm/mmap.c index b653721..cc6c4fe 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1134,6 +1134,7 @@ munmap_back: vma->vm_page_prot = protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]; vma->vm_pgoff = pgoff; + vma_init_pte_notifiers(vma); if (file) { error = -EINVAL; diff --git a/mm/pte_notifier.c b/mm/pte_notifier.c new file mode 100644 index 0000000..0b9076c --- /dev/null +++ b/mm/pte_notifier.c @@ -0,0 +1,32 @@ + +#include + +void vma_init_pte_notifiers(struct vm_area_struct *vma) +{ + INIT_LIST_HEAD(&vma->pte_notifier_list); +} +EXPORT_SYMBOL_GPL(vma_init_pte_notifiers); + +void vma_close_pte_notifiers(struct vm_area_struct *vma) +{ + struct pte_notifier *pn; + struct pte_notifier *n; + + list_for_each_entry_safe(pn, n, &vma->pte_notifier_list, link) { + pn->ops->close(pn, vma); + list_del(&pn->link); + } +} + +void 
pte_notifier_register(struct pte_notifier *pn, struct vm_area_struct *vma) +{ + list_add(&pn->link, &vma->pte_notifier_list); +} +EXPORT_SYMBOL_GPL(pte_notifier_register); + +void pte_notifier_unregister(struct pte_notifier *pn) +{ + list_del(&pn->link); +} +EXPORT_SYMBOL_GPL(pte_notifier_unregister); + diff --git a/mm/rmap.c b/mm/rmap.c index 41ac397..3f61d38 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -682,6 +682,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /* Nuke the page table entry. */ + pte_notifier_call(vma, clear, address); flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); From avi at qumranet.com Wed Sep 5 12:32:47 2007 From: avi at qumranet.com (Avi Kivity) Date: Wed, 05 Sep 2007 22:32:47 +0300 Subject: [ofa-general] Re: [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <46DF0234.7090504@redhat.com> References: <11890103283456-git-send-email-avi@qumranet.com> <46DEFDF4.5000900@redhat.com> <46DF0013.4060804@qumranet.com> <46DF0234.7090504@redhat.com> Message-ID: <46DF045F.4020806@qumranet.com> Rik van Riel wrote: >> >> I imagine that many of the paravirt_ops mmu hooks will need to be >> exposed as pte notifiers. This can't be done as part of the >> paravirt_ops code due to the need to pass high level data structures, >> though. > > Wait, I thought that paravirt_ops was all on the side of the > guest kernel, where these host kernel operations are invisible? > It is, but the hooks are in much the same places. It could be argued that you'd embed pte notifiers in paravirt_ops for a host kernel, but that's not doable because pte notifiers use higher-level data structures (like vmas). -- Any sufficiently difficult bug is indistinguishable from a feature. 
From rusty at rustcorp.com.au Wed Sep 5 12:56:23 2007 From: rusty at rustcorp.com.au (Rusty Russell) Date: Thu, 06 Sep 2007 05:56:23 +1000 Subject: [ofa-general] Re: [kvm-devel] [PATCH][RFC] pte notifiers -- support for external page tables In-Reply-To: <11890207643068-git-send-email-avi@qumranet.com> References: <11890207643068-git-send-email-avi@qumranet.com> Message-ID: <1189022183.10802.184.camel@localhost.localdomain> On Wed, 2007-09-05 at 22:32 +0300, Avi Kivity wrote: > [resend due to bad alias expansion resulting in some recipients > being bogus] > > Some hardware and software systems maintain page tables outside the normal > Linux page tables, which reference userspace memory. This includes > Infiniband, other RDMA-capable devices, and kvm (with a pending patch). And lguest. I can't tell until I've actually implemented it, but I think it will seriously reduce the need for page pinning which is why only root can currently launch guests. My concern is locking: this is called with the page lock held, and I guess we have to bump the guest out if it's currently running. (Oh, and this means lguest needs to do a reverse mapping somehow, but I'll come up with something). Cheers, Rusty. From avi at qumranet.com Wed Sep 5 13:17:06 2007 From: avi at qumranet.com (Avi Kivity) Date: Wed, 05 Sep 2007 23:17:06 +0300 Subject: [ofa-general] Re: [kvm-devel] [PATCH][RFC] pte notifiers -- support for external page tables In-Reply-To: <1189022183.10802.184.camel@localhost.localdomain> References: <11890207643068-git-send-email-avi@qumranet.com> <1189022183.10802.184.camel@localhost.localdomain> Message-ID: <46DF0EC2.7090408@qumranet.com> Rusty Russell wrote: > On Wed, 2007-09-05 at 22:32 +0300, Avi Kivity wrote: > >> [resend due to bad alias expansion resulting in some recipients >> being bogus] >> >> Some hardware and software systems maintain page tables outside the normal >> Linux page tables, which reference userspace memory. 
This includes >> Infiniband, other RDMA-capable devices, and kvm (with a pending patch). >> > > And lguest. I can't tell until I've actually implemented it, but I > think it will seriously reduce the need for page pinning which is why > only root can currently launch guests. > > Ah yes, lguest. > My concern is locking: this is called with the page lock held, and I > guess we have to bump the guest out if it's currently running. > This will complicate kvm's locking too. We usually take kvm->lock to do mmu ops, but that is now a mutex. -- Any sufficiently difficult bug is indistinguishable from a feature. From or.gerlitz at gmail.com Wed Sep 5 13:35:06 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 5 Sep 2007 23:35:06 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070905170545.GM4472@obsidianresearch.com> References: <20070830130852.GF2532@mellanox.co.il> <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> <20070905051040.GM28350@mellanox.co.il> <20070905055108.GB16535@obsidianresearch.com> <20070905061913.GN28350@mellanox.co.il> <20070905170545.GM4472@obsidianresearch.com> Message-ID: <15ddcffd0709051335l7ba8a976v1535ba8a6e923206@mail.gmail.com> On 9/5/07, Jason Gunthorpe wrote: > Judging by the other comments in this thread, it still seems to me > this would be best as RC only, notionally with the idea that RC is > only used between hosts and not between gateways and hosts > (administratively configured). That way the end-to-end nature of the > checksum is retained. Gateways that want to support RC can negotiate > this feature off. Jason, I guess by "RC" you mean connected mode. 
The connected mode is now implemented over RC, but as was discussed on this list a few times, it should (and would) move to use UC, which is also much easier to implement in hw based gateways. Anyway, your idea to allow this feature to come into play only under a negotiation scheme sounds fine to me, however: > You may also want to look at using the new TSO/GSO/LRO stuff in a RC > context. If you could send an entire GSO in one go and receive it as a > LRO that might be a big improvement too. From Michael's and Eli's responses over the stateless offload related thread, I understood that these optimizations are supported only for UD QPs, which makes them irrelevant for the connected mode. Or. From steiner at sgi.com Wed Sep 5 13:40:13 2007 From: steiner at sgi.com (Jack Steiner) Date: Wed, 5 Sep 2007 15:40:13 -0500 Subject: [ofa-general] Re: [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <11890103283456-git-send-email-avi@qumranet.com> References: <11890103283456-git-send-email-avi@qumranet.com> Message-ID: <20070905204012.GA29272@sgi.com> On Wed, Sep 05, 2007 at 07:38:48PM +0300, Avi Kivity wrote: > Some hardware and software systems maintain page tables outside the normal > Linux page tables, which reference userspace memory. This includes > Infiniband, other RDMA-capable devices, and kvm (with a pending patch). > I like it. We have 2 special devices with external TLBs that can take advantage of this. One suggestion - at least for what we need. Can the notifier be registered against the mm_struct instead of (or in addition to) the vma? 
---jack From ardavis at ichips.intel.com Wed Sep 5 13:43:49 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 05 Sep 2007 13:43:49 -0700 Subject: [ofa-general] OFED 1.2.5 - GA release In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563B5D@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563B5D@mtlexch01.mtl.com> Message-ID: <46DF1505.1020409@ichips.intel.com> Tziporet Koren wrote: > I am happy to announce on OFED 1.2.5 GA release. > > Vlad, How can I build/install OFED 1.2.5 with ib_local_sa.ko? It seems to build but does not install and I need SA caching options. -arlin From avi at qumranet.com Wed Sep 5 13:40:59 2007 From: avi at qumranet.com (Avi Kivity) Date: Wed, 05 Sep 2007 23:40:59 +0300 Subject: [ofa-general] Re: [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <20070905204012.GA29272@sgi.com> References: <11890103283456-git-send-email-avi@qumranet.com> <20070905204012.GA29272@sgi.com> Message-ID: <46DF145B.70304@qumranet.com> Jack Steiner wrote: > On Wed, Sep 05, 2007 at 07:38:48PM +0300, Avi Kivity wrote: > >> Some hardware and software systems maintain page tables outside the normal >> Linux page tables, which reference userspace memory. This includes >> Infiniband, other RDMA-capable devices, and kvm (with a pending patch). >> >> > > I like it. > > We have 2 special devices with external TLBs that can > take advantage of this. > > One suggestion - at least for what we need. Can the notifier be > registered against the mm_struct instead of (or in addition to) the > vma? > Yes. It's a lot simpler since this way we don't have to support vma creation/splitting/merging/destruction. There's a tiny performance hit for kvm, but it isn't worth the bother. Will implement for v2 of this patch. -- Any sufficiently difficult bug is indistinguishable from a feature. 
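A rough userspace sketch of what registration against the mm (rather than against each vma) could look like follows. It is purely illustrative: no v2 patch appears in this thread, so every name here (struct mm_model, struct mm_notifier, struct range_driver, mm_model_register(), mm_model_clear(), mm_model_demo()) is hypothetical. The trade-off it tries to show is the one mentioned above: one registration covers the whole address space and is unaffected by vma splitting or merging, but the driver has to filter addresses itself.

```c
#include <assert.h>

struct mm_model;
struct mm_notifier;

struct mm_notifier_ops {
	void (*clear)(struct mm_notifier *mn, struct mm_model *mm,
		      unsigned long address);
};

struct mm_notifier {
	struct mm_notifier *next;	/* stand-in for a list_head link */
	const struct mm_notifier_ops *ops;
};

/* Hypothetical stand-in for an mm_struct carrying the notifier list. */
struct mm_model {
	struct mm_notifier *notifiers;
};

/* Hypothetical driver interested in one address range, [start, end). */
struct range_driver {
	struct mm_notifier mn;		/* first member, so mn casts back to the driver */
	unsigned long start;
	unsigned long end;
	int invalidations;
};

static void range_clear(struct mm_notifier *mn, struct mm_model *mm,
			unsigned long address)
{
	struct range_driver *drv = (struct range_driver *)mn;

	(void)mm;
	/* Every pte change in the mm is reported, so filter by address here. */
	if (address >= drv->start && address < drv->end)
		drv->invalidations++;
}

static const struct mm_notifier_ops range_ops = { .clear = range_clear };

static void mm_model_register(struct mm_notifier *mn, struct mm_model *mm)
{
	mn->next = mm->notifiers;
	mm->notifiers = mn;
}

/* Models a pte being cleared anywhere in the address space. */
static void mm_model_clear(struct mm_model *mm, unsigned long address)
{
	struct mm_notifier *mn;

	for (mn = mm->notifiers; mn; mn = mn->next)
		mn->ops->clear(mn, mm, address);
}

/* Clears three addresses and returns how many the driver acted on. */
int mm_model_demo(void)
{
	struct mm_model mm = { 0 };
	struct range_driver drv = {
		.mn = { .ops = &range_ops },
		.start = 0x2000,
		.end = 0x4000,
	};

	mm_model_register(&drv.mn, &mm);
	mm_model_clear(&mm, 0x1000);	/* below the range: ignored */
	mm_model_clear(&mm, 0x3000);	/* inside the range: counted */
	mm_model_clear(&mm, 0x4000);	/* end is exclusive: ignored */
	return drv.invalidations;
}
```

The extra callbacks for addresses the driver does not care about are one plausible source of the small performance cost mentioned in the reply, though the thread does not say so explicitly.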
From avi at qumranet.com Wed Sep 5 13:42:26 2007 From: avi at qumranet.com (Avi Kivity) Date: Wed, 05 Sep 2007 23:42:26 +0300 Subject: [ofa-general] Re: [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <20070905204012.GA29272@sgi.com> References: <11890103283456-git-send-email-avi@qumranet.com> <20070905204012.GA29272@sgi.com> Message-ID: <46DF14B2.9050402@qumranet.com> [resend due to broken cc list in my original post] Jack Steiner wrote: > On Wed, Sep 05, 2007 at 07:38:48PM +0300, Avi Kivity wrote: > >> Some hardware and software systems maintain page tables outside the normal >> Linux page tables, which reference userspace memory. This includes >> Infiniband, other RDMA-capable devices, and kvm (with a pending patch). >> >> > > I like it. > > We have 2 special devices with external TLBs that can > take advantage of this. > > One suggestion - at least for what we need. Can the notifier be > registered against the mm_struct instead of (or in addition to) the > vma? > Yes. It's a lot simpler since this way we don't have to support vma creation/splitting/merging/destruction. There's a tiny performance hit for kvm, but it isn't worth the bother. Will implement for v2 of this patch. -- Any sufficiently difficult bug is indistinguishable from a feature. From ggrundstrom at NetEffect.com Wed Sep 5 13:50:53 2007 From: ggrundstrom at NetEffect.com (Glenn Grundstrom) Date: Wed, 5 Sep 2007 15:50:53 -0500 Subject: [ofa-general] NetEffect driver status in OFA In-Reply-To: References: Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC075E9B1A@venom2> Arkady, I have submitted an OFED-1.3 patch of our kernel driver and userspace library code to the ewg reflector for comments. I am in the process of incorporating all the comment replies and will submit a patch v2 shortly. Meanwhile, I've modified build scripts and will be working with Vlad to be included in the OFED-1.3 daily builds. 
Once Vlad and I have the builds working, the NetEffect software will be a component in the OFED-1.3 package. This is consistent with our plan and with the status I've given on the ofa-ewg conf calls. Glenn. ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Kanevsky, Arkady Sent: Wednesday, September 05, 2007 9:50 AM To: Glenn Grundstrom Cc: general at lists.openfabrics.org Subject: [ofa-general] NetEffect driver status in OFA Glenn, What is the status of NetEffect driver? Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Wed Sep 5 13:53:34 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Sep 2007 16:53:34 -0400 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <46D78104.mailJY81GRONO@systemfabricworks.com> References: <46D78104.mailJY81GRONO@systemfabricworks.com> Message-ID: On 8/30/07, swelch at systemfabricworks.com wrote: > > > The local loopback of a DR SMP response is limited to those that originate at the driver specific SMA implementation as a result of an invocation of the drivers process_mad() function. This patch enables a DR SMP response originating elsewhere to be forwarded/looped back to the local management stack as well. In this case the driver specific process_mad() function does not consume or process the MAD so the original MAD is to be treated like an incoming receive and it must be manually copied to the buffer that is to be handed off the local agent. > > The stimulus for this change is to provide support for the forwarding of DR SMP responses to the local management stack via the user space MAD library. 
This will facilitate development of userspace applications utilizing the MTHCA router mode enable driver. > > Signed-off-by: Steve Welch > --- > drivers/infiniband/core/mad.c | 4 +++- > drivers/infiniband/core/smi.h | 14 ++++++++++++++ > 2 files changed, 17 insertions(+), 1 deletions(-) > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..9ec910b 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > } > > /* Check to post send on QP or process locally */ > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > + smi_check_local_resp_smp(smp, device) == IB_SMI_DISCARD) > goto out; > > local = kmalloc(sizeof *local, GFP_ATOMIC); > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > if (port_priv) { > mad_priv->mad.mad.mad_hdr.tid = > ((struct ib_mad *)smp)->mad_hdr.tid; > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); Is this copy only needed in the (new) returning direction case ? > recv_mad_agent = find_mad_agent(port_priv, > &mad_priv->mad.mad); > } > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > index 1cfc298..d96fc8e 100644 > --- a/drivers/infiniband/core/smi.h > +++ b/drivers/infiniband/core/smi.h > @@ -71,4 +71,18 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > (smp->hop_ptr == smp->hop_cnt + 1)) ? 
> IB_SMI_HANDLE : IB_SMI_DISCARD); > } > + > +/* > + * Return 1 if the SMP response should be handled by the local management stack > + */ > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp *smp, > + struct ib_device *device) > +{ > + /* C14-13:3 -- We're at the end of the DR segment of path */ > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > + return ((device->process_mad && > + ib_get_smp_direction(smp) && > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > +} > + I think this routine and the existing one could be better named: smi_check_local_outgoing/returning_smp. -- Hal > #endif /* __SMI_H_ */ > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Wed Sep 5 13:55:44 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Sep 2007 16:55:44 -0400 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <000b01c7ef66$1f929030$a865a8c0@catcher> References: <46D78104.mailJY81GRONO@systemfabricworks.com> <46DDFE5D.9090203@ichips.intel.com> <000b01c7ef66$1f929030$a865a8c0@catcher> Message-ID: On 9/4/07, Steve Welch wrote: > Hi Sean, > > > -----Original Message----- > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Tuesday, September 04, 2007 7:55 PM > > To: swelch at systemfabricworks.com > > Cc: general at lists.openfabrics.org; sean.hefty at intel.com > > Subject: Re: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR > > SMP responses from userspace > > > > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct > > ib_mad_agent_private *mad_agent_priv, > > > if (port_priv) { > > > mad_priv->mad.mad.mad_hdr.tid = > > > ((struct ib_mad *)smp)->mad_hdr.tid; > > > + memcpy(&mad_priv->mad.mad, smp, 
sizeof(struct > ib_mad)); > > > > I'm having a hard time understanding the impact of this change. If I'm > > reading the code correctly, mad_priv->mad should contain the response > > from the device process_mad() routine. This changes that response. Can > > you provide more details describing the effect this change has on the > > existing behavior? > > The new code is executed when the device specific process_mad function > returns only IB_MAD_RESULT_SUCCESS status in the status bitmask. Since > IB_MAD_RESULT_REPLY is not also set; the device is indicating it did > not create a response and mad_priv->mad should be as it was before the > process_mad call (i.e. not initialized with a response). Since the > IB_MAD_RESULT_CONSUMED status was not set in the status bitmask, the > original MAD is still needing delivery and by definition goes to > the local node. > > Prior to this patch and the addition of the of the > smi_check_local_resp_smp() test, the only DR SMP that could have made > it to the device's process_mad call would have been a DR SMP Request > that was targeted to the local SMA. It is possible that the device's > SMA would not handle the DR SMP Request and that it would return only > IB_MAD_RESULT_SUCCESS; however in that case the find_mad_agent() call > would still access the un-initialized mad_priv->mad.mad. For > this reason I do not believe this code path was previously executed, > and I believe there will be no effect on the existing behavior. > > Running with these changes, the IB utilities built on top of DR SMP's > continue to operate on the host, going both to the local SMA and out > on the fabric to an SMA. > > > > > Also, I think we can eliminate setting the tid, since the memcpy will > > set that as well. > Yes, I agree. 
> > > > > > recv_mad_agent = find_mad_agent(port_priv, > > > &mad_priv->mad.mad); > > > } > > > diff --git a/drivers/infiniband/core/smi.h > > b/drivers/infiniband/core/smi.h > > > index 1cfc298..d96fc8e 100644 > > > --- a/drivers/infiniband/core/smi.h > > > +++ b/drivers/infiniband/core/smi.h > > > @@ -71,4 +71,18 @@ static inline enum smi_action > > smi_check_local_smp(struct ib_smp *smp, > > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > > } > > > + > > > +/* > > > + * Return 1 if the SMP response should be handled by the local > > management stack > > > + */ > > > > The comment is off here - return IB_SMI_HANDLE. (It's off for > > smi_check_local_smp() as well.) > Yes, I agree. It appears I was a little over zealous in my header > cut and paste of the existing DR SMP request local check function. Perhaps the original comment should change too to be more accurate. -- Hal > > > > > > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp > > *smp, > > > + struct ib_device > *device) > > > +{ > > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > > + return ((device->process_mad && > > > + ib_get_smp_direction(smp) && > > > + !smp->hop_ptr) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); > > > +} > > > > - Sean > Thanks, > Steve > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Wed Sep 5 13:56:47 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Sep 2007 16:56:47 -0400 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <46DED744.3050803@ichips.intel.com> References: <46D78104.mailJY81GRONO@systemfabricworks.com> <46DDFE5D.9090203@ichips.intel.com> <000b01c7ef66$1f929030$a865a8c0@catcher> <46DED744.3050803@ichips.intel.com> Message-ID: On 9/5/07, Sean Hefty wrote: > > Prior to this patch and the addition of the of the > > smi_check_local_resp_smp() test, the only DR SMP that could have made > > it to the device's process_mad call would have been a DR SMP Request > > that was targeted to the local SMA. It is possible that the device's > > SMA would not handle the DR SMP Request and that it would return only > > IB_MAD_RESULT_SUCCESS; however in that case the find_mad_agent() call > > would still access the un-initialized mad_priv->mad.mad. For > > this reason I do not believe this code path was previously executed, > > and I believe there will be no effect on the existing behavior. > > Okay - the existing code is confusing me then. It looks buggy, and it > appears that your change ends up fixing the problem. > > Hal, can you explain what the following code in handle_outgoing_dr_smp() > is doing? If you are referring to why IB_MAD_RESULT_SUCCESS only (without REPLY or CONSUMED) is treated like an incoming MAD, at this point, unfortunately, I do not recall. As Sean pointed out, it does only handle the outgoing case and not the returning case which is what he is proposing to be added. 
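To make the outgoing versus returning distinction concrete, here is a small standalone C model of the two checks and of the combined test the patch puts in handle_outgoing_dr_smp(). It is a sketch under assumptions, not the kernel code: struct smp_model and smi_model_is_local() are hypothetical, the function names borrow the outgoing/returning naming suggested earlier in this thread, and the direction test in the outgoing check is assumed (only its hop_ptr == hop_cnt + 1 comparison is visible in the quoted diff). The returning check follows the quoted smi_check_local_resp_smp().

```c
#include <assert.h>
#include <stdbool.h>

enum smi_action { IB_SMI_DISCARD, IB_SMI_HANDLE };

/* Hypothetical reduced view of a directed-route SMP. */
struct smp_model {
	bool returning;		/* direction bit: false = outgoing, true = returning */
	unsigned char hop_ptr;
	unsigned char hop_cnt;
};

/* Outgoing request that has reached the end of its path (direction test assumed). */
static enum smi_action check_local_outgoing(const struct smp_model *smp,
					    bool has_process_mad)
{
	return (has_process_mad && !smp->returning &&
		smp->hop_ptr == smp->hop_cnt + 1) ?
		IB_SMI_HANDLE : IB_SMI_DISCARD;
}

/* Returning response whose hop pointer has counted back down to zero. */
static enum smi_action check_local_returning(const struct smp_model *smp,
					     bool has_process_mad)
{
	return (has_process_mad && smp->returning && !smp->hop_ptr) ?
		IB_SMI_HANDLE : IB_SMI_DISCARD;
}

/* Combined test: the SMP leaves via the QP only if both checks say discard. */
bool smi_model_is_local(const struct smp_model *smp, bool has_process_mad)
{
	if (check_local_outgoing(smp, has_process_mad) == IB_SMI_DISCARD &&
	    check_local_returning(smp, has_process_mad) == IB_SMI_DISCARD)
		return false;
	return true;
}
```

With this model, a request with hop_ptr 1 and hop_cnt 0 and a response with hop_ptr 0 are both treated as local, while a response still in transit (hop_ptr 2 of hop_cnt 3) is not.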
-- Hal > case IB_MAD_RESULT_SUCCESS: > /* Treat like an incoming receive MAD */ > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > > mad_agent_priv->agent.port_num); > if (port_priv) { > mad_priv->mad.mad.mad_hdr.tid = > ((struct ib_mad *)smp)->mad_hdr.tid; > recv_mad_agent = find_mad_agent(port_priv, > > &mad_priv->mad.mad); > } > if (!port_priv || !recv_mad_agent) { > kmem_cache_free(ib_mad_cache, mad_priv); > kfree(local); > ret = 0; > goto out; > } > local->mad_priv = mad_priv; > local->recv_mad_agent = recv_mad_agent; > break; > > - Sean > From worleys at gmail.com Wed Sep 5 14:11:46 2007 From: worleys at gmail.com (Chris Worley) Date: Wed, 5 Sep 2007 15:11:46 -0600 Subject: [ofa-general] Re: [openib-general] MVAPICH2 SRPM update and install files patch In-Reply-To: <45CE1C1C.70406@cse.ohio-state.edu> References: <45CE1C1C.70406@cse.ohio-state.edu> Message-ID: Some of those changes for icc don't make sense. Setting "CC" to "icc -i-dynamic" looks for an executable file name of the entire string... causing: Configuring MVAPICH2... Configuring MPICH2 version MVAPICH2-0.9.8 with --prefix=/var/tmp/OFED/usr/ofed/1.2.5/mpi/intel/mvapich2-0.9.8-15 --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd --enable-romio --enable-sharedlibs=gcc --without-mpe sourcing /var/tmp/OFEDRPM/BUILD/mvapich2-0.9.8/src/pm/mpd/setup_pm checking for gcc... icc -i-dynamic checking for C compiler default output file name... configure: error: C compiler cannot create executables See `config.log' for more details. Configuring MPICH2 version MVAPICH2-0.9.8 with --prefix=/var/tmp/OFED/usr/ofed/1.2.5/mpi/intel/mvapich2-0.9.8-15 --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd --enable-romio --enable-sharedlibs=gcc --without-mpe sourcing /var/tmp/OFEDRPM/BUILD/mvapich2-0.9.8/src/pm/mpd/setup_pm checking for gcc... icc -i-dynamic checking for C compiler default output file name... configure: error: C compiler cannot create executables Is there a good way to fix this? 
Thanks, Chris On 2/10/07, Shaun Rowland wrote: > I updated the latest MVAPICH2 SRPM: > > https://www.openfabrics.org/~rowland/ofed_1_2/ > > I am including a patch to the latest ofed_1_2_scripts git files. Since > these files are the same as those used in the OFED-1.2-20070208-1508.tgz > package, this patch can also be applied there. This patch is required to > use the new MVAPICH2 SRPM file and should not be used with the older > versions. > > I've done the following: > > - Updated some of the dependencies when mvapich2 is selected. > > - Added new mvapich2 configuration prompts if mvapich2 is selected. > This is all contained within the mvapich2_config shell function. These > values are stored in the configuration file, etc. and prefixed with > MVAPICH2_CONF_. > > There are two implementation choices for the MVAPICH2 build: OFA and > uDAPL. The OFA build should allow IB, IB + RDMA-CM, and iWARP to be > used. The mode is controlled by the following runtime environment variables: > > IB > -- > No additional environment variable required (default case). > > IB + RDMA-CM > ------------ > MV2_USE_RDMA_CM=1 > > iWARP > ----- > MV2_ENABLE_IWARP_MODE=1 > > -- > Shaun Rowland rowland at cse.ohio-state.edu > http://www.cse.ohio-state.edu/~rowland/ > > diff --git a/build.sh b/build.sh > index 5eafb0d..c5f996c 100755 > --- a/build.sh > +++ b/build.sh > @@ -448,18 +448,25 @@ mvapich() > > mvapich2() > { > - local iwarp=0 > - > - if [ "$MVAPICH2_IMPL" = "iwarp" ]; then > - iwarp=1 > - fi > - > - echo > + if [ $MVAPICH2_CONF_impl = "ofa" ]; then > + echo "Building the MVAPICH2 RPM in the OFA configuration. Please wait..." > + elif [ $MVAPICH2_CONF_impl = "udapl" ]; then > + echo "Building the MVAPICH2 RPM in the uDPAL configuration. Please wait..." 
> + if [ -d ${BUILD_ROOT}${STACK_PREFIX}/lib64 ]; then > + MVAPICH2_DAT_LIB=${STACK_PREFIX}/lib64 > + elif [ -d ${BUILD_ROOT}${STACK_PREFIX}/lib ]; then > + MVAPICH2_DAT_LIB=${STACK_PREFIX}/lib > + else > + echo "Could not find a proper uDAPL lib directory." > + return 1 > + fi > > - if [ $iwarp -eq 0 ]; then > - echo "Building the MVAPICH2 RPM with IB support. Please wait..." > - else > - echo "Building the MVAPICH2 RPM with iWARP support. Please wait..." > + if [ -d ${BUILD_ROOT}${STACK_PREFIX}/include ]; then > + MVAPICH2_DAT_INCLUDE=${STACK_PREFIX}/include > + else > + echo "Could not find a proper uDAPL include directory." > + return 1 > + fi > fi > > echo > @@ -484,7 +491,7 @@ mvapich2() > > # On i686 the PathScale compiler requires -g optimization > # for MVAPICH2 in the shared library configuration. > - if [ "$ARCH" = "i686" ]; then > + if [ "$ARCH" = "i686" ] && [ $MVAPICH2_CONF_shared_libs -eq 1 ]; then > MVAPICH2_COMP_ENV="$MVAPICH2_COMP_ENV OPT_FLAG=-g" > fi > ;; > @@ -492,25 +499,73 @@ mvapich2() > MVAPICH2_COMP_ENV="CC=pgcc CXX=pgCC F77=pgf77 F90=pgf90" > ;; > intel) > - # The -i-dynamic flag is required for MVAPICH2 in the shared > - # library configuration. > - MVAPICH2_COMP_ENV='CC="icc -i-dynamic" CXX="icpc -i-dynamic" F77="ifort -i-dynamic" F90="ifort -i-dynamic"' > + if [ $MVAPICH2_CONF_shared_libs -eq 1 ]; then > + # The -i-dynamic flag is required for MVAPICH2 in the shared > + # library configuration. 
> + MVAPICH2_COMP_ENV='CC="icc -i-dynamic" CXX="icpc -i-dynamic" F77="ifort -i-dynamic" F90="ifort -i-dynamic"' > + else > + MVAPICH2_COMP_ENV="CC=icc CXX=icpc F77=ifort F90=ifort" > + fi > ;; > esac > > - ex rpmbuild --rebuild \ > - --define \'_topdir ${RPM_DIR}\' \ > - --define \'_name ${MVAPICH2_NAME}_${mpi_comp}\' \ > - --define \'_prefix ${MVAPICH2_PREFIX}\' \ > - --define \'build_root ${BUILD_ROOT}\' \ > - --define \'open_ib_home ${STACK_PREFIX}\' \ > - --define \'ofed_build_root ${BUILD_ROOT}\' \ > - --define \'comp_env ${MVAPICH2_COMP_ENV}\' \ > - --define \'iwarp ${iwarp}\' \ > - --define \'romio 1\' \ > - --define \'shared_libs 1\' \ > - --define \'auto_req 1\' \ > - $MVAPICH2_SRC_RPM > + if [ $MVAPICH2_CONF_impl = "ofa" ] && [ $MVAPICH2_CONF_ckpt -eq 0 ]; then > + ex rpmbuild --rebuild \ > + --define \'_topdir ${RPM_DIR}\' \ > + --define \'_prefix ${MVAPICH2_PREFIX}\' \ > + --define \'_name ${MVAPICH2_NAME}_${mpi_comp}\' \ > + --define \'build_root ${BUILD_ROOT}\' \ > + --define \'impl ofa\' \ > + --define \'multithread ${MVAPICH2_CONF_multithread}\' \ > + --define \'romio ${MVAPICH2_CONF_romio}\' \ > + --define \'shared_libs ${MVAPICH2_CONF_shared_libs}\' \ > + --define \'rdma_cm 1\' \ > + --define \'ckpt 0\' \ > + --define \'open_ib_home ${STACK_PREFIX}\' \ > + --define \'comp_env ${MVAPICH2_COMP_ENV}\' \ > + --define \'auto_req 0\' \ > + --define \'ofa_build 1\' \ > + $MVAPICH2_SRC_RPM > + elif [ $MVAPICH2_CONF_impl = "ofa" ] && [ $MVAPICH2_CONF_ckpt -eq 1 ]; then > + ex rpmbuild --rebuild \ > + --define \'_topdir ${RPM_DIR}\' \ > + --define \'_prefix ${MVAPICH2_PREFIX}\' \ > + --define \'_name ${MVAPICH2_NAME}_${mpi_comp}\' \ > + --define \'build_root ${BUILD_ROOT}\' \ > + --define \'impl ofa\' \ > + --define \'multithread 0\' \ > + --define \'romio ${MVAPICH2_CONF_romio}\' \ > + --define \'shared_libs ${MVAPICH2_CONF_shared_libs}\' \ > + --define \'rdma_cm 0\' \ > + --define \'ckpt 1\' \ > + --define \'blcr_home ${MVAPICH2_CONF_blcr_home}\' \ > + 
--define \'open_ib_home ${STACK_PREFIX}\' \ > + --define \'comp_env ${MVAPICH2_COMP_ENV}\' \ > + --define \'auto_req 0\' \ > + --define \'ofa_build 1\' \ > + $MVAPICH2_SRC_RPM > + elif [ $MVAPICH2_CONF_impl = "udapl" ]; then > + ex rpmbuild --rebuild \ > + --define \'_topdir ${RPM_DIR}\' \ > + --define \'_prefix ${MVAPICH2_PREFIX}\' \ > + --define \'_name ${MVAPICH2_NAME}_${mpi_comp}\' \ > + --define \'build_root ${BUILD_ROOT}\' \ > + --define \'impl udapl\' \ > + --define \'multithread ${MVAPICH2_CONF_multithread}\' \ > + --define \'romio ${MVAPICH2_CONF_romio}\' \ > + --define \'shared_libs ${MVAPICH2_CONF_shared_libs}\' \ > + --define \'vcluster ${MVAPICH2_CONF_vcluster}\' \ > + --define \'io_bus ${MVAPICH2_CONF_io_bus}\' \ > + --define \'link_speed ${MVAPICH2_CONF_link_speed}\' \ > + --define \'dapl_provider ${MVAPICH2_CONF_dapl_provider}\' \ > + --define \'dat_lib ${MVAPICH2_DAT_LIB}\' \ > + --define \'dat_include ${MVAPICH2_DAT_INCLUDE}\' \ > + --define \'comp_env ${MVAPICH2_COMP_ENV}\' \ > + --define \'auto_req 0\' \ > + --define \'ofa_build 1\' \ > + $MVAPICH2_SRC_RPM > + fi > + > ex "$MV -f ${RPM_DIR}/RPMS/$build_arch/${MVAPICH2_RPM} $RPMS" > let BUILD_COUNTER++ > > diff --git a/build_env.sh b/build_env.sh > index 3128774..93891b3 100644 > --- a/build_env.sh > +++ b/build_env.sh > @@ -971,6 +971,226 @@ is_compiler() > > } > > +# Prompt for MVAPICH2 build options. > +mvapich2_config() { > + local choice="" > + local blcr > + > + if [ "$MVAPICH2_CONF_done" = 1 ]; then > + return > + fi > + > + cat <<EOF > + > +Please choose an implementation of MVAPICH2: > + > +1) OFA (IB and iWARP) > +2) uDAPL > + > +EOF > + while [ -z "$choice" ] || [[ $choice != [0-9] ]] || [ $choice -lt 1 ] || [ $choice -gt 2 ]; do > + read -p "Implementation [1]: " > + choice=${REPLY:-1} > + done > + > + if [ $choice -eq 1 ]; then > + MVAPICH2_CONF_impl=ofa > + elif [ $choice -eq 2 ]; then > + MVAPICH2_CONF_impl=udapl > + fi > + > + if ! 
( grep -w MVAPICH2_CONF_impl $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_impl=\"${MVAPICH2_CONF_impl}\"" >> $CONFIG > + fi > + > + while [ -z "$MVAPICH2_CONF_romio" ]; do > + read -p "Enable ROMIO support [Y/n]: " choice > + > + if [ -z "$choice" ] || [[ $choice == [yY] ]] || [[ $choice == [yY][eE][sS] ]]; then > + MVAPICH2_CONF_romio=1 > + elif [[ $choice == [nN] ]] || [[ $choice == [nN][oO] ]]; then > + MVAPICH2_CONF_romio=0 > + fi > + done > + > + if ! ( grep -w MVAPICH2_CONF_romio $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_romio=\"${MVAPICH2_CONF_romio}\"" >> $CONFIG > + fi > + > + while [ -z "$MVAPICH2_CONF_shared_libs" ]; do > + read -p "Enable shared library support [Y/n]: " choice > + > + if [ -z "$choice" ] || [[ $choice == [yY] ]] || [[ $choice == [yY][eE][sS] ]]; then > + MVAPICH2_CONF_shared_libs=1 > + elif [[ $choice == [nN] ]] || [[ $choice == [nN][oO] ]]; then > + MVAPICH2_CONF_shared_libs=0 > + fi > + done > + > + if ! ( grep -w MVAPICH2_CONF_shared_libs $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_shared_libs=\"${MVAPICH2_CONF_shared_libs}\"" >> $CONFIG > + fi > + > + cat <<EOF > +Multithread support should be enabled only if thread safety is required. > +There may be a slight performance penalty for single threaded only use. > +EOF > + > + while [ -z "$MVAPICH2_CONF_multithread" ]; do > + read -p "Enable multithread support [y/N]: " choice > + > + if [ -z "$choice" ] || [[ $choice == [nN] ]] || [[ $choice == [nN][oO] ]]; then > + MVAPICH2_CONF_multithread=0 > + elif [[ $choice == [yY] ]] || [[ $choice == [yY][eE][sS] ]]; then > + MVAPICH2_CONF_multithread=1 > + fi > + done > + > + if ! ( grep -w MVAPICH2_CONF_multithread $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_multithread=\"${MVAPICH2_CONF_multithread}\"" >> $CONFIG > + fi > + > + # OFA specific options.
> + if [ $MVAPICH2_CONF_impl = "ofa" ] && [ $MVAPICH2_CONF_multithread -eq 0 ]; then > + choice=0 > + > + while [ $choice = 0 ]; do > + read -p "Enable Checkpoint-Restart support [y/N]: " choice > + > + if [ -z "$choice" ] || [[ $choice == [nN] ]] || [[ $choice == [nN][oO] ]]; then > + MVAPICH2_CONF_ckpt=0 > + choice=1 > + elif [[ $choice == [yY] ]] || [[ $choice == [yY][eE][sS] ]]; then > + read -p "BLCR installation directory [or nothing if not installed]: " blcr > + > + if [ -d "$blcr" ]; then > + MVAPICH2_CONF_ckpt=1 > + MVAPICH2_CONF_blcr_home="$blcr" > + choice=1 > + else > + echo "BLCR installation directory not found." > + choice=0 > + fi > + else > + choice=0 > + fi > + done > + else > + MVAPICH2_CONF_ckpt=0 > + fi > + > + if [ $MVAPICH2_CONF_impl = "ofa" ]; then > + if ! ( grep -w MVAPICH2_CONF_ckpt $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_ckpt=\"${MVAPICH2_CONF_ckpt}\"" >> $CONFIG > + fi > + > + if [ $MVAPICH2_CONF_ckpt -eq 1 ]; then > + if ! ( grep -w MVAPICH2_CONF_blcr_home $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_blcr_home=\"${MVAPICH2_CONF_blcr_home}\"" >> $CONFIG > + fi > + fi > + fi > + > + # uDAPL specific options. > + if [ $MVAPICH2_CONF_impl = "udapl" ]; then > + cat <<EOF > + > +Cluster size: > + > +1) Small > +2) Medium > +3) Large > + > +EOF > + choice="" > + > + while [ -z "$choice" ] || [[ $choice != [0-9] ]] || [ $choice -lt 1 ] || [ $choice -gt 3 ]; do > + read -p "Cluster size [1]: " > + choice=${REPLY:-1} > + done > + > + if [ $choice -eq 1 ]; then > + MVAPICH2_CONF_vcluster=small > + elif [ $choice -eq 2 ]; then > + MVAPICH2_CONF_vcluster=medium > + elif [ $choice -eq 3 ]; then > + MVAPICH2_CONF_vcluster=large > + fi > + > + if !
( grep -w MVAPICH2_CONF_vcluster $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_vcluster=\"${MVAPICH2_CONF_vcluster}\"" >> $CONFIG > + fi > + > + cat <<EOF > + > +I/O Bus: > + > +1) PCI-Express > +2) PCI-X > + > +EOF > + choice="" > + > + while [ -z "$choice" ] || [[ $choice != [0-9] ]] || [ $choice -lt 1 ] || [ $choice -gt 2 ]; do > + read -p "I/O Bus [1]: " > + choice=${REPLY:-1} > + done > + > + if [ $choice -eq 1 ]; then > + MVAPICH2_CONF_io_bus=pci-ex > + elif [ $choice -eq 2 ]; then > + MVAPICH2_CONF_io_bus=pci-x > + fi > + > + if ! ( grep -w MVAPICH2_CONF_io_bus $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_io_bus=\"${MVAPICH2_CONF_io_bus}\"" >> $CONFIG > + fi > + > + if [ $MVAPICH2_CONF_io_bus = "pci-ex" ]; then > + cat <<EOF > + > +Link Speed: > + > +1) SDR > +2) DDR > + > +EOF > + choice="" > + > + while [ -z "$choice" ] || [[ $choice != [0-9] ]] || [ $choice -lt 1 ] || [ $choice -gt 2 ]; do > + read -p "Link Speed [1]: " > + choice=${REPLY:-1} > + done > + > + if [ $choice -eq 1 ]; then > + MVAPICH2_CONF_link_speed=sdr > + elif [ $choice -eq 2 ]; then > + MVAPICH2_CONF_link_speed=ddr > + fi > + else > + MVAPICH2_CONF_link_speed=sdr > + fi > + > + if ! ( grep -w MVAPICH2_CONF_link_speed $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_link_speed=\"${MVAPICH2_CONF_link_speed}\"" >> $CONFIG > + fi > + > + read -p "Default DAPL provider [ib0]: " > + MVAPICH2_CONF_dapl_provider=${REPLY:-ib0} > + > + if ! ( grep -w MVAPICH2_CONF_dapl_provider $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_dapl_provider=\"${MVAPICH2_CONF_dapl_provider}\"" >> $CONFIG > + fi > + fi > + > + MVAPICH2_CONF_done=1 > + > + if !
( grep -w MVAPICH2_CONF_done $CONFIG > $NULL 2>&1 ); then > + echo "MVAPICH2_CONF_done=\"${MVAPICH2_CONF_done}\"" >> $CONFIG > + fi > +} > + > + > # Set Compilation environment for MPI > set_mpi_env() > { > @@ -998,6 +1218,7 @@ set_mpi_env() > echo > fi > > + > printed_msg0=${printed_msg0:-0} > if [ $printed_msg0 -eq 0 ]; then > if [ $(echo -n ${COMPILERS_FOUND} | wc -w) -gt 1 ]; then > @@ -1014,24 +1235,8 @@ set_mpi_env() > read -p "Do you wish to create/install an ${mpipackage} RPM with ${mpi_compiler}? [Y/n]:" ans > if [[ "$ans" == "" || "$ans" == "y" || "$ans" == "Y" || "$ans" == "yes" ]]; then > MPI_COMPILER="$MPI_COMPILER ${mpi_compiler}" > - > - # MVAPICH2 can be built with iWARP support only if > - # librdmacm and librdmacm-devel are there. > - if [ "$mpipackage" = "mvapich2" ] && > - (echo -n ${SELECTED_PACKAGES} | grep -w "librdmacm" > $NULL) && > - (echo -n ${SELECTED_PACKAGES} | grep -w "librdmacm-devel" > $NULL); then > - read -p "Do you wish to build mvapich2 with iWARP support only (default is IB) [y/N]:" ans > - if [[ "$ans" == "y" || "$ans" == "Y" || "$ans" == "yes" ]]; then > - MVAPICH2_IMPL=iwarp > - else > - MVAPICH2_IMPL=ib > - fi > - else > - MVAPICH2_IMPL=ib > - fi > fi > done > - > else # Unattended mode > case ${mpipackage} in > mvapich) > @@ -1095,18 +1300,17 @@ set_mpi_env() > warn_echo "No compilers for ${mpipackage} were found" > return 1 > fi > - > MPI_COMPILER_mvapich2=${MPI_COMPILER} > if ! ( grep -w MPI_COMPILER_mvapich2 $CONFIG > $NULL 2>&1 ); then > echo "MPI_COMPILER_mvapich2=\"${MPI_COMPILER_mvapich2}\"" >> $CONFIG > fi > - > - if ! ( grep -w MVAPICH2_IMPL $CONFIG > $NULL 2>&1 ); then > - echo "MVAPICH2_IMPL=\"${MVAPICH2_IMPL}\"" >> $CONFIG > - fi > - > echo > echo "The following compiler(s) will be used to ${prog%*.*} the ${mpipackage} RPM(s): $MPI_COMPILER_mvapich2" > + # MVAPICH2 can be built with many options. 
The configuration > + # function below asks the user how to build, and it only will > + # do so if the configuration values have not already been > + # read from the $CONFIG file. > + mvapich2_config > ;; > openmpi) > if [ ! -n "${COMPILERS_FOUND}" ]; then > @@ -1843,10 +2047,18 @@ set_package_deps() > export mvapich2=n > else > EXTRA_PACKAGES=$(echo "$EXTRA_PACKAGES mvapich2" | tr -s ' ' '\n' | sort -rn | uniq) > - if [ "$MVAPICH2_IMPL" = "iwarp" ]; then > - OFA_PACKAGES=$(echo "$OFA_PACKAGES libibverbs libibverbs-devel libibumad libibumad-devel librdmacm librdmacm-devel" | tr -s ' ' '\n' | sort -n | uniq) > - else > - OFA_PACKAGES=$(echo "$OFA_PACKAGES libibverbs libibverbs-devel libibumad libibumad-devel" | tr -s ' ' '\n' | sort -n | uniq) > + if [ "$MVAPICH2_CONF_impl" = "ofa" ] && [ "$MVAPICH2_CONF_ckpt" = 0 ]; then > + # libibumad apparently needs libibcommon. > + OFA_PACKAGES=$(echo "$OFA_PACKAGES libibverbs libibverbs-devel libibumad libibumad-devel librdmacm librdmacm-devel libibcommon libibcommon-devel" | tr -s ' ' '\n' | sort -n | uniq) > + elif [ "$MVAPICH2_CONF_impl" = "ofa" ]; then > + # Checkpoint-Restart does not support > + # RDMA-CM, so it would not be required. > + # libibumad apparently needs libibcommon. > + OFA_PACKAGES=$(echo "$OFA_PACKAGES libibverbs libibverbs-devel libibumad libibumad-devel libibcommon libibcommon-devel" | tr -s ' ' '\n' | sort -n | uniq) > + elif [ "$MVAPICH2_CONF_impl" = "udapl" ]; then > + # dapl apparently needs libibverbs and > + # librdmacm. 
> + OFA_PACKAGES=$(echo "$OFA_PACKAGES dapl dapl-devel libibverbs librdmacm" | tr -s ' ' '\n' | sort -n | uniq) > fi > fi > ;; > diff --git a/install.sh b/install.sh > diff --git a/ofed-scripts.spec b/ofed-scripts.spec > diff --git a/propel.sh b/propel.sh > diff --git a/uninstall.sh b/uninstall.sh > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From Thomas.Talpey at netapp.com Wed Sep 5 14:21:58 2007 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 05 Sep 2007 17:21:58 -0400 Subject: [ofa-general] Low NFS RDMA performance with Connect X Message-ID: Can you post the full commandline of your NFS mount and iozone invocations? I'm also curious if there were any NFS or RPC related messages appearing in the dmesg log during the run. Finally, were any RPC- or NFS-related patches applied to the RHEL5 kernel outside of the NFS/RDMA ones? Tom. At 10:40 AM 9/4/2007, Kuchimanchi, Ramachandra wrote: >Test-setup: >Server and single client running RHEL 5 >MT25208 tests were with dual processor 64-bit AMD machines >Connect X tests were with dual processor dual core 64-bit AMD machines >Connect X HCA FW ver: 2.1 >NFS mount was in async mode and iozone tests were run with -c option. 
From swelch at systemfabricworks.com Wed Sep 5 14:30:18 2007 From: swelch at systemfabricworks.com (Steve Welch) Date: Wed, 5 Sep 2007 16:30:18 -0500 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: References: <46D78104.mailJY81GRONO@systemfabricworks.com> Message-ID: <001d01c7f003$f4df1fe0$bc0da8c0@catcher> > > /* Check to post send on QP or process locally */ > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > + smi_check_local_resp_smp(smp, device) == IB_SMI_DISCARD) > > goto out; > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct > ib_mad_agent_private *mad_agent_priv, > > if (port_priv) { > > mad_priv->mad.mad.mad_hdr.tid = > > ((struct ib_mad *)smp)->mad_hdr.tid; > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct > ib_mad)); > > Is this copy only needed in the (new) returning direction case ? No, it is needed whether the SMP is a request or response. > > > recv_mad_agent = find_mad_agent(port_priv, > > &mad_priv- > >mad.mad); > > } > > diff --git a/drivers/infiniband/core/smi.h > b/drivers/infiniband/core/smi.h > > index 1cfc298..d96fc8e 100644 > > --- a/drivers/infiniband/core/smi.h > > +++ b/drivers/infiniband/core/smi.h > > @@ -71,4 +71,18 @@ static inline enum smi_action > smi_check_local_smp(struct ib_smp *smp, > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > } > > + > > +/* > > + * Return 1 if the SMP response should be handled by the local > management stack > > + */ > > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp > *smp, > > + struct ib_device > *device) > > +{ > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > + return ((device->process_mad && > > + ib_get_smp_direction(smp) && > > + !smp->hop_ptr) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); > > +} > > + > > I think this routine and the existing one could be better named: > smi_check_local_outgoing/returning_smp. > Possibly, but the SMP does originate in both cases from a local mad send operation. In one case sending the request and in the other sending the response; in both cases they are locally handled. Steve > -- Hal > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general > > From hal.rosenstock at gmail.com Wed Sep 5 14:54:50 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Sep 2007 17:54:50 -0400 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <001d01c7f003$f4df1fe0$bc0da8c0@catcher> References: <46D78104.mailJY81GRONO@systemfabricworks.com> <001d01c7f003$f4df1fe0$bc0da8c0@catcher> Message-ID: On 9/5/07, Steve Welch wrote: > > > /* Check to post send on QP or process locally */ > > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > > + smi_check_local_resp_smp(smp, device) == IB_SMI_DISCARD) > > > goto out; > > > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct > > ib_mad_agent_private *mad_agent_priv, > > > if (port_priv) { > > > mad_priv->mad.mad.mad_hdr.tid = > > > ((struct ib_mad *)smp)->mad_hdr.tid; > > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct > > ib_mad)); > > > > Is this copy only needed in the (new) returning direction case ? > > No, it is needed whether the SMP is a request or response. 
> > > > > > recv_mad_agent = find_mad_agent(port_priv, > > > &mad_priv- > > >mad.mad); > > > } > > > diff --git a/drivers/infiniband/core/smi.h > > b/drivers/infiniband/core/smi.h > > > index 1cfc298..d96fc8e 100644 > > > --- a/drivers/infiniband/core/smi.h > > > +++ b/drivers/infiniband/core/smi.h > > > @@ -71,4 +71,18 @@ static inline enum smi_action > > smi_check_local_smp(struct ib_smp *smp, > > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > > } > > > + > > > +/* > > > + * Return 1 if the SMP response should be handled by the local > > management stack > > > + */ > > > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp > > *smp, > > > + struct ib_device > > *device) > > > +{ > > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > > + return ((device->process_mad && > > > + ib_get_smp_direction(smp) && > > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > > +} > > > + > > > > I think this routine and the existing one could be better named: > > smi_check_local_outgoing/returning_smp. > > > > Possibly, but the SMP does originate in both cases from a local mad send > operation. In one case sending the request and in the other sending the > response; in both cases they are locally handled. Aren't they more appropriately termed outgoing and returning rather than request/response ? Guess it ends up being the same since in practice Traps and TrapRepresses are only LID routed but there is nothing in the spec that precludes them from being direct routed. 
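For readers following the naming discussion, the two checks quoted above can be modeled in plain C. This is a minimal userspace sketch under stated assumptions, not the kernel code: `struct sketch_smp`, `struct sketch_device`, `check_local_outgoing`, `check_local_returning` and `handled_locally` are hypothetical names, the `returning` flag stands in for `ib_get_smp_direction()`, and the `!returning` test in the first check is assumed from the existing kernel helper rather than shown in the quoted hunk.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel types (assumptions, not the real definitions). */
enum smi_action { IB_SMI_DISCARD, IB_SMI_HANDLE };

struct sketch_smp {
    unsigned char hop_ptr;   /* current position in the directed route */
    unsigned char hop_cnt;   /* number of hops in the directed route */
    bool returning;          /* models ib_get_smp_direction(): true = returning */
};

struct sketch_device {
    bool has_process_mad;    /* models a non-NULL device->process_mad */
};

/* Outgoing case: hop_ptr == hop_cnt + 1 marks the end of the DR path. */
static enum smi_action check_local_outgoing(const struct sketch_smp *smp,
                                            const struct sketch_device *dev)
{
    return (dev->has_process_mad && !smp->returning &&
            smp->hop_ptr == smp->hop_cnt + 1) ? IB_SMI_HANDLE : IB_SMI_DISCARD;
}

/* Returning case from the patch: hop_ptr == 0 hands the SMP to the local SM. */
static enum smi_action check_local_returning(const struct sketch_smp *smp,
                                             const struct sketch_device *dev)
{
    return (dev->has_process_mad && smp->returning &&
            !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD;
}

/* Mirrors the patched test in handle_outgoing_dr_smp(): the SMP is processed
 * locally unless both checks say discard. */
static bool handled_locally(const struct sketch_smp *smp,
                            const struct sketch_device *dev)
{
    return check_local_outgoing(smp, dev) == IB_SMI_HANDLE ||
           check_local_returning(smp, dev) == IB_SMI_HANDLE;
}
```

In this model an outgoing SMP with `hop_cnt` 0 and `hop_ptr` 1, or a returning SMP with `hop_ptr` 0, is looped back locally; anything else would be posted to the QP. The outgoing/returning split is what the proposed names would make explicit.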
-- Hal > > Steve > > > -- Hal From worleys at gmail.com Wed Sep 5 15:04:07 2007 From: worleys at gmail.com (Chris Worley) Date: Wed, 5 Sep 2007 16:04:07 -0600 Subject: [ofa-general] Re: [openib-general] MVAPICH2 SRPM update and install files patch In-Reply-To: References: <45CE1C1C.70406@cse.ohio-state.edu> Message-ID: Never mind... spoke too soon... it was a shared lib issue. On 9/5/07, Chris Worley wrote: > Some of those changes for icc don't make sense. Setting "CC" to "icc > -i-dynamic" looks for an executable file name of the entire string... > causing: > > Configuring MVAPICH2... > Configuring MPICH2 version MVAPICH2-0.9.8 with > --prefix=/var/tmp/OFED/usr/ofed/1.2.5/mpi/intel/mvapich2-0.9.8-15 > --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd > --enable-romio --enable-sharedlibs=gcc --without-mpe > sourcing /var/tmp/OFEDRPM/BUILD/mvapich2-0.9.8/src/pm/mpd/setup_pm > checking for gcc... icc -i-dynamic > checking for C compiler default output file name... configure: error: > C compiler cannot create executables > See `config.log' for more details. > Configuring MPICH2 version MVAPICH2-0.9.8 with > --prefix=/var/tmp/OFED/usr/ofed/1.2.5/mpi/intel/mvapich2-0.9.8-15 > --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd > --enable-romio --enable-sharedlibs=gcc --without-mpe > sourcing /var/tmp/OFEDRPM/BUILD/mvapich2-0.9.8/src/pm/mpd/setup_pm > checking for gcc... icc -i-dynamic > checking for C compiler default output file name... configure: error: > C compiler cannot create executables > > Is there a good way to fix this?
> > Thanks, > > Chris > > > On 2/10/07, Shaun Rowland wrote: > > I updated the latest MVAPICH2 SRPM: > > > > https://www.openfabrics.org/~rowland/ofed_1_2/ > > > > I am including a patch to the latest ofed_1_2_scripts git files. Since > > these files are the same as those used in the OFED-1.2-20070208-1508.tgz > > package, this patch can also be applied there. This patch is required to > > use the new MVAPICH2 SRPM file and should not be used with the older > > versions. > > > > I've done the following: > > > > - Updated some of the dependencies when mvapich2 is selected. > > > > - Added new mvapich2 configuration prompts if mvapich2 is selected. > > This is all contained within the mvapich2_config shell function. These > > values are stored in the configuration file, etc. and prefixed with > > MVAPICH2_CONF_. > > > > There are two implementation choices for the MVAPICH2 build: OFA and > > uDAPL. The OFA build should allow IB, IB + RDMA-CM, and iWARP to be > > used. The mode is controlled by the following runtime environment variables: > > > > IB > > -- > > No additional environment variable required (default case). > > > > IB + RDMA-CM > > ------------ > > MV2_USE_RDMA_CM=1 > > > > iWARP > > ----- > > MV2_ENABLE_IWARP_MODE=1 > > > > -- > > Shaun Rowland rowland at cse.ohio-state.edu > > http://www.cse.ohio-state.edu/~rowland/ > > > > diff --git a/build.sh b/build.sh > > index 5eafb0d..c5f996c 100755 > > --- a/build.sh > > +++ b/build.sh > > @@ -448,18 +448,25 @@ mvapich() > > > > mvapich2() > > { > > - local iwarp=0 > > - > > - if [ "$MVAPICH2_IMPL" = "iwarp" ]; then > > - iwarp=1 > > - fi > > - > > - echo > > + if [ $MVAPICH2_CONF_impl = "ofa" ]; then > > + echo "Building the MVAPICH2 RPM in the OFA configuration. Please wait..." > > + elif [ $MVAPICH2_CONF_impl = "udapl" ]; then > > + echo "Building the MVAPICH2 RPM in the uDPAL configuration. Please wait..." 
> > + if [ -d ${BUILD_ROOT}${STACK_PREFIX}/lib64 ]; then > > + MVAPICH2_DAT_LIB=${STACK_PREFIX}/lib64 > > + elif [ -d ${BUILD_ROOT}${STACK_PREFIX}/lib ]; then > > + MVAPICH2_DAT_LIB=${STACK_PREFIX}/lib > > + else > > + echo "Could not find a proper uDAPL lib directory." > > + return 1 > > + fi > > > > - if [ $iwarp -eq 0 ]; then > > - echo "Building the MVAPICH2 RPM with IB support. Please wait..." > > - else > > - echo "Building the MVAPICH2 RPM with iWARP support. Please wait..." > > + if [ -d ${BUILD_ROOT}${STACK_PREFIX}/include ]; then > > + MVAPICH2_DAT_INCLUDE=${STACK_PREFIX}/include > > + else > > + echo "Could not find a proper uDAPL include directory." > > + return 1 > > + fi > > fi > > > > echo > > @@ -484,7 +491,7 @@ mvapich2() > > > > # On i686 the PathScale compiler requires -g optimization > > # for MVAPICH2 in the shared library configuration. > > - if [ "$ARCH" = "i686" ]; then > > + if [ "$ARCH" = "i686" ] && [ $MVAPICH2_CONF_shared_libs -eq 1 ]; then > > MVAPICH2_COMP_ENV="$MVAPICH2_COMP_ENV OPT_FLAG=-g" > > fi > > ;; > > @@ -492,25 +499,73 @@ mvapich2() > > MVAPICH2_COMP_ENV="CC=pgcc CXX=pgCC F77=pgf77 F90=pgf90" > > ;; > > intel) > > - # The -i-dynamic flag is required for MVAPICH2 in the shared > > - # library configuration. > > - MVAPICH2_COMP_ENV='CC="icc -i-dynamic" CXX="icpc -i-dynamic" F77="ifort -i-dynamic" F90="ifort -i-dynamic"' > > + if [ $MVAPICH2_CONF_shared_libs -eq 1 ]; then > > + # The -i-dynamic flag is required for MVAPICH2 in the shared > > + # library configuration. 
> > + MVAPICH2_COMP_ENV='CC="icc -i-dynamic" CXX="icpc -i-dynamic" F77="ifort -i-dynamic" F90="ifort -i-dynamic"' > > + else > > + MVAPICH2_COMP_ENV="CC=icc CXX=icpc F77=ifort F90=ifort" > > + fi > > ;; > > esac > > > > - ex rpmbuild --rebuild \ > > - --define \'_topdir ${RPM_DIR}\' \ > > - --define \'_name ${MVAPICH2_NAME}_${mpi_comp}\' \ > > - --define \'_prefix ${MVAPICH2_PREFIX}\' \ > > - --define \'build_root ${BUILD_ROOT}\' \ > > - --define \'open_ib_home ${STACK_PREFIX}\' \ > > - --define \'ofed_build_root ${BUILD_ROOT}\' \ > > - --define \'comp_env ${MVAPICH2_COMP_ENV}\' \ > > - --define \'iwarp ${iwarp}\' \ > > - --define \'romio 1\' \ > > - --define \'shared_libs 1\' \ > > - --define \'auto_req 1\' \ > > - $MVAPICH2_SRC_RPM > > + if [ $MVAPICH2_CONF_impl = "ofa" ] && [ $MVAPICH2_CONF_ckpt -eq 0 ]; then > > + ex rpmbuild --rebuild \ > > + --define \'_topdir ${RPM_DIR}\' \ > > + --define \'_prefix ${MVAPICH2_PREFIX}\' \ > > + --define \'_name ${MVAPICH2_NAME}_${mpi_comp}\' \ > > + --define \'build_root ${BUILD_ROOT}\' \ > > + --define \'impl ofa\' \ > > + --define \'multithread ${MVAPICH2_CONF_multithread}\' \ > > + --define \'romio ${MVAPICH2_CONF_romio}\' \ > > + --define \'shared_libs ${MVAPICH2_CONF_shared_libs}\' \ > > + --define \'rdma_cm 1\' \ > > + --define \'ckpt 0\' \ > > + --define \'open_ib_home ${STACK_PREFIX}\' \ > > + --define \'comp_env ${MVAPICH2_COMP_ENV}\' \ > > + --define \'auto_req 0\' \ > > + --define \'ofa_build 1\' \ > > + $MVAPICH2_SRC_RPM > > + elif [ $MVAPICH2_CONF_impl = "ofa" ] && [ $MVAPICH2_CONF_ckpt -eq 1 ]; then > > + ex rpmbuild --rebuild \ > > + --define \'_topdir ${RPM_DIR}\' \ > > + --define \'_prefix ${MVAPICH2_PREFIX}\' \ > > + --define \'_name ${MVAPICH2_NAME}_${mpi_comp}\' \ > > + --define \'build_root ${BUILD_ROOT}\' \ > > + --define \'impl ofa\' \ > > + --define \'multithread 0\' \ > > + --define \'romio ${MVAPICH2_CONF_romio}\' \ > > + --define \'shared_libs ${MVAPICH2_CONF_shared_libs}\' \ > > + --define 
\'rdma_cm 0\' \ > > + --define \'ckpt 1\' \ > > + --define \'blcr_home ${MVAPICH2_CONF_blcr_home}\' \ > > + --define \'open_ib_home ${STACK_PREFIX}\' \ > > + --define \'comp_env ${MVAPICH2_COMP_ENV}\' \ > > + --define \'auto_req 0\' \ > > + --define \'ofa_build 1\' \ > > + $MVAPICH2_SRC_RPM > > + elif [ $MVAPICH2_CONF_impl = "udapl" ]; then > > + ex rpmbuild --rebuild \ > > + --define \'_topdir ${RPM_DIR}\' \ > > + --define \'_prefix ${MVAPICH2_PREFIX}\' \ > > + --define \'_name ${MVAPICH2_NAME}_${mpi_comp}\' \ > > + --define \'build_root ${BUILD_ROOT}\' \ > > + --define \'impl udapl\' \ > > + --define \'multithread ${MVAPICH2_CONF_multithread}\' \ > > + --define \'romio ${MVAPICH2_CONF_romio}\' \ > > + --define \'shared_libs ${MVAPICH2_CONF_shared_libs}\' \ > > + --define \'vcluster ${MVAPICH2_CONF_vcluster}\' \ > > + --define \'io_bus ${MVAPICH2_CONF_io_bus}\' \ > > + --define \'link_speed ${MVAPICH2_CONF_link_speed}\' \ > > + --define \'dapl_provider ${MVAPICH2_CONF_dapl_provider}\' \ > > + --define \'dat_lib ${MVAPICH2_DAT_LIB}\' \ > > + --define \'dat_include ${MVAPICH2_DAT_INCLUDE}\' \ > > + --define \'comp_env ${MVAPICH2_COMP_ENV}\' \ > > + --define \'auto_req 0\' \ > > + --define \'ofa_build 1\' \ > > + $MVAPICH2_SRC_RPM > > + fi > > + > > ex "$MV -f ${RPM_DIR}/RPMS/$build_arch/${MVAPICH2_RPM} $RPMS" > > let BUILD_COUNTER++ > > > > diff --git a/build_env.sh b/build_env.sh > > index 3128774..93891b3 100644 > > --- a/build_env.sh > > +++ b/build_env.sh > > @@ -971,6 +971,226 @@ is_compiler() > > > > } > > > > +# Prompt for MVAPICH2 build options. 
> > +mvapich2_config() { > > + local choice="" > > + local blcr > > + > > + if [ "$MVAPICH2_CONF_done" = 1 ]; then > > + return > > + fi > > + > > + cat <<EOF > > + > > +Please choose an implementation of MVAPICH2: > > + > > +1) OFA (IB and iWARP) > > +2) uDAPL > > + > > +EOF > > + while [ -z "$choice" ] || [[ $choice != [0-9] ]] || [ $choice -lt 1 ] || [ $choice -gt 2 ]; do > > + read -p "Implementation [1]: " > > + choice=${REPLY:-1} > > + done > > + > > + if [ $choice -eq 1 ]; then > > + MVAPICH2_CONF_impl=ofa > > + elif [ $choice -eq 2 ]; then > > + MVAPICH2_CONF_impl=udapl > > + fi > > + > > + if ! ( grep -w MVAPICH2_CONF_impl $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_impl=\"${MVAPICH2_CONF_impl}\"" >> $CONFIG > > + fi > > + > > + while [ -z "$MVAPICH2_CONF_romio" ]; do > > + read -p "Enable ROMIO support [Y/n]: " choice > > + > > + if [ -z "$choice" ] || [[ $choice == [yY] ]] || [[ $choice == [yY][eE][sS] ]]; then > > + MVAPICH2_CONF_romio=1 > > + elif [[ $choice == [nN] ]] || [[ $choice == [nN][oO] ]]; then > > + MVAPICH2_CONF_romio=0 > > + fi > > + done > > + > > + if ! ( grep -w MVAPICH2_CONF_romio $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_romio=\"${MVAPICH2_CONF_romio}\"" >> $CONFIG > > + fi > > + > > + while [ -z "$MVAPICH2_CONF_shared_libs" ]; do > > + read -p "Enable shared library support [Y/n]: " choice > > + > > + if [ -z "$choice" ] || [[ $choice == [yY] ]] || [[ $choice == [yY][eE][sS] ]]; then > > + MVAPICH2_CONF_shared_libs=1 > > + elif [[ $choice == [nN] ]] || [[ $choice == [nN][oO] ]]; then > > + MVAPICH2_CONF_shared_libs=0 > > + fi > > + done > > + > > + if ! ( grep -w MVAPICH2_CONF_shared_libs $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_shared_libs=\"${MVAPICH2_CONF_shared_libs}\"" >> $CONFIG > > + fi > > + > > + cat <<EOF > > +Multithread support should be enabled only if thread safety is required. > > +There may be a slight performance penalty for single threaded only use.
> > +EOF > > + > > + while [ -z "$MVAPICH2_CONF_multithread" ]; do > > + read -p "Enable multithread support [y/N]: " choice > > + > > + if [ -z "$choice" ] || [[ $choice == [nN] ]] || [[ $choice == [nN][oO] ]]; then > > + MVAPICH2_CONF_multithread=0 > > + elif [[ $choice == [yY] ]] || [[ $choice == [yY][eE][sS] ]]; then > > + MVAPICH2_CONF_multithread=1 > > + fi > > + done > > + > > + if ! ( grep -w MVAPICH2_CONF_multithread $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_multithread=\"${MVAPICH2_CONF_multithread}\"" >> $CONFIG > > + fi > > + > > + # OFA specific options. > > + if [ $MVAPICH2_CONF_impl = "ofa" ] && [ $MVAPICH2_CONF_multithread -eq 0 ]; then > > + choice=0 > > + > > + while [ $choice = 0 ]; do > > + read -p "Enable Checkpoint-Restart support [y/N]: " choice > > + > > + if [ -z "$choice" ] || [[ $choice == [nN] ]] || [[ $choice == [nN][oO] ]]; then > > + MVAPICH2_CONF_ckpt=0 > > + choice=1 > > + elif [[ $choice == [yY] ]] || [[ $choice == [yY][eE][sS] ]]; then > > + read -p "BLCR installation directory [or nothing if not installed]: " blcr > > + > > + if [ -d "$blcr" ]; then > > + MVAPICH2_CONF_ckpt=1 > > + MVAPICH2_CONF_blcr_home="$blcr" > > + choice=1 > > + else > > + echo "BLCR installation directory not found." > > + choice=0 > > + fi > > + else > > + choice=0 > > + fi > > + done > > + else > > + MVAPICH2_CONF_ckpt=0 > > + fi > > + > > + if [ $MVAPICH2_CONF_impl = "ofa" ]; then > > + if ! ( grep -w MVAPICH2_CONF_ckpt $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_ckpt=\"${MVAPICH2_CONF_ckpt}\"" >> $CONFIG > > + fi > > + > > + if [ $MVAPICH2_CONF_ckpt -eq 1 ]; then > > + if ! ( grep -w MVAPICH2_CONF_blcr_home $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_blcr_home=\"${MVAPICH2_CONF_blcr_home}\"" >> $CONFIG > > + fi > > + fi > > + fi > > + > > + # uDAPL specific options. 
> > + if [ $MVAPICH2_CONF_impl = "udapl" ]; then > > + cat <<EOF > > + > > +Cluster size: > > + > > +1) Small > > +2) Medium > > +3) Large > > + > > +EOF > > + choice="" > > + > > + while [ -z "$choice" ] || [[ $choice != [0-9] ]] || [ $choice -lt 1 ] || [ $choice -gt 3 ]; do > > + read -p "Cluster size [1]: " > > + choice=${REPLY:-1} > > + done > > + > > + if [ $choice -eq 1 ]; then > > + MVAPICH2_CONF_vcluster=small > > + elif [ $choice -eq 2 ]; then > > + MVAPICH2_CONF_vcluster=medium > > + elif [ $choice -eq 3 ]; then > > + MVAPICH2_CONF_vcluster=large > > + fi > > + > > + if ! ( grep -w MVAPICH2_CONF_vcluster $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_vcluster=\"${MVAPICH2_CONF_vcluster}\"" >> $CONFIG > > + fi > > + > > + cat <<EOF > > + > > +I/O Bus: > > + > > +1) PCI-Express > > +2) PCI-X > > + > > +EOF > > + choice="" > > + > > + while [ -z "$choice" ] || [[ $choice != [0-9] ]] || [ $choice -lt 1 ] || [ $choice -gt 2 ]; do > > + read -p "I/O Bus [1]: " > > + choice=${REPLY:-1} > > + done > > + > > + if [ $choice -eq 1 ]; then > > + MVAPICH2_CONF_io_bus=pci-ex > > + elif [ $choice -eq 2 ]; then > > + MVAPICH2_CONF_io_bus=pci-x > > + fi > > + > > + if ! ( grep -w MVAPICH2_CONF_io_bus $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_io_bus=\"${MVAPICH2_CONF_io_bus}\"" >> $CONFIG > > + fi > > + > > + if [ $MVAPICH2_CONF_io_bus = "pci-ex" ]; then > > + cat <<EOF > > + > > +Link Speed: > > + > > +1) SDR > > +2) DDR > > + > > +EOF > > + choice="" > > + > > + while [ -z "$choice" ] || [[ $choice != [0-9] ]] || [ $choice -lt 1 ] || [ $choice -gt 2 ]; do > > + read -p "Link Speed [1]: " > > + choice=${REPLY:-1} > > + done > > + > > + if [ $choice -eq 1 ]; then > > + MVAPICH2_CONF_link_speed=sdr > > + elif [ $choice -eq 2 ]; then > > + MVAPICH2_CONF_link_speed=ddr > > + fi > > + else > > + MVAPICH2_CONF_link_speed=sdr > > + fi > > + > > + if !
( grep -w MVAPICH2_CONF_link_speed $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_link_speed=\"${MVAPICH2_CONF_link_speed}\"" >> $CONFIG > > + fi > > + > > + read -p "Default DAPL provider [ib0]: " > > + MVAPICH2_CONF_dapl_provider=${REPLY:-ib0} > > + > > + if ! ( grep -w MVAPICH2_CONF_dapl_provider $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_dapl_provider=\"${MVAPICH2_CONF_dapl_provider}\"" >> $CONFIG > > + fi > > + fi > > + > > + MVAPICH2_CONF_done=1 > > + > > + if ! ( grep -w MVAPICH2_CONF_done $CONFIG > $NULL 2>&1 ); then > > + echo "MVAPICH2_CONF_done=\"${MVAPICH2_CONF_done}\"" >> $CONFIG > > + fi > > +} > > + > > + > > # Set Compilation environment for MPI > > set_mpi_env() > > { > > @@ -998,6 +1218,7 @@ set_mpi_env() > > echo > > fi > > > > + > > printed_msg0=${printed_msg0:-0} > > if [ $printed_msg0 -eq 0 ]; then > > if [ $(echo -n ${COMPILERS_FOUND} | wc -w) -gt 1 ]; then > > @@ -1014,24 +1235,8 @@ set_mpi_env() > > read -p "Do you wish to create/install an ${mpipackage} RPM with ${mpi_compiler}? [Y/n]:" ans > > if [[ "$ans" == "" || "$ans" == "y" || "$ans" == "Y" || "$ans" == "yes" ]]; then > > MPI_COMPILER="$MPI_COMPILER ${mpi_compiler}" > > - > > - # MVAPICH2 can be built with iWARP support only if > > - # librdmacm and librdmacm-devel are there. 
> > - if [ "$mpipackage" = "mvapich2" ] && > > - (echo -n ${SELECTED_PACKAGES} | grep -w "librdmacm" > $NULL) && > > - (echo -n ${SELECTED_PACKAGES} | grep -w "librdmacm-devel" > $NULL); then > > - read -p "Do you wish to build mvapich2 with iWARP support only (default is IB) [y/N]:" ans > > - if [[ "$ans" == "y" || "$ans" == "Y" || "$ans" == "yes" ]]; then > > - MVAPICH2_IMPL=iwarp > > - else > > - MVAPICH2_IMPL=ib > > - fi > > - else > > - MVAPICH2_IMPL=ib > > - fi > > fi > > done > > - > > else # Unattended mode > > case ${mpipackage} in > > mvapich) > > @@ -1095,18 +1300,17 @@ set_mpi_env() > > warn_echo "No compilers for ${mpipackage} were found" > > return 1 > > fi > > - > > MPI_COMPILER_mvapich2=${MPI_COMPILER} > > if ! ( grep -w MPI_COMPILER_mvapich2 $CONFIG > $NULL 2>&1 ); then > > echo "MPI_COMPILER_mvapich2=\"${MPI_COMPILER_mvapich2}\"" >> $CONFIG > > fi > > - > > - if ! ( grep -w MVAPICH2_IMPL $CONFIG > $NULL 2>&1 ); then > > - echo "MVAPICH2_IMPL=\"${MVAPICH2_IMPL}\"" >> $CONFIG > > - fi > > - > > echo > > echo "The following compiler(s) will be used to ${prog%*.*} the ${mpipackage} RPM(s): $MPI_COMPILER_mvapich2" > > + # MVAPICH2 can be built with many options. The configuration > > + # function below asks the user how to build, and it only will > > + # do so if the configuration values have not already been > > + # read from the $CONFIG file. > > + mvapich2_config > > ;; > > openmpi) > > if [ ! 
-n "${COMPILERS_FOUND}" ]; then > > @@ -1843,10 +2047,18 @@ set_package_deps() > > export mvapich2=n > > else > > EXTRA_PACKAGES=$(echo "$EXTRA_PACKAGES mvapich2" | tr -s ' ' '\n' | sort -rn | uniq) > > - if [ "$MVAPICH2_IMPL" = "iwarp" ]; then > > - OFA_PACKAGES=$(echo "$OFA_PACKAGES libibverbs libibverbs-devel libibumad libibumad-devel librdmacm librdmacm-devel" | tr -s ' ' '\n' | sort -n | uniq) > > - else > > - OFA_PACKAGES=$(echo "$OFA_PACKAGES libibverbs libibverbs-devel libibumad libibumad-devel" | tr -s ' ' '\n' | sort -n | uniq) > > + if [ "$MVAPICH2_CONF_impl" = "ofa" ] && [ "$MVAPICH2_CONF_ckpt" = 0 ]; then > > + # libibumad apparently needs libibcommon. > > + OFA_PACKAGES=$(echo "$OFA_PACKAGES libibverbs libibverbs-devel libibumad libibumad-devel librdmacm librdmacm-devel libibcommon libibcommon-devel" | tr -s ' ' '\n' | sort -n | uniq) > > + elif [ "$MVAPICH2_CONF_impl" = "ofa" ]; then > > + # Checkpoint-Restart does not support > > + # RDMA-CM, so it would not be required. > > + # libibumad apparently needs libibcommon. > > + OFA_PACKAGES=$(echo "$OFA_PACKAGES libibverbs libibverbs-devel libibumad libibumad-devel libibcommon libibcommon-devel" | tr -s ' ' '\n' | sort -n | uniq) > > + elif [ "$MVAPICH2_CONF_impl" = "udapl" ]; then > > + # dapl apparently needs libibverbs and > > + # librdmacm. 
> > + OFA_PACKAGES=$(echo "$OFA_PACKAGES dapl dapl-devel libibverbs librdmacm" | tr -s ' ' '\n' | sort -n | uniq) > > fi > > fi > > ;; > > diff --git a/install.sh b/install.sh > > diff --git a/ofed-scripts.spec b/ofed-scripts.spec > > diff --git a/propel.sh b/propel.sh > > diff --git a/uninstall.sh b/uninstall.sh > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From hal.rosenstock at gmail.com Wed Sep 5 15:05:04 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Sep 2007 18:05:04 -0400 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] rdma/cm: add ability to specify type of service In-Reply-To: <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> Message-ID: Hi Sean, On the end stack side, has it been decided to ignore whether the SA indicates whether or not it supports QoS ? Wouldn't it be useful to have some warning message indicating this in the end node (that it might not be getting the service quality desired) ? -- Hal From sean.hefty at intel.com Wed Sep 5 15:09:13 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Sep 2007 15:09:13 -0700 Subject: [ofa-general] [RFC] [PATCH 1/5 v3] ib/ipoib: specify Traffic Class with PR queries for QoS support In-Reply-To: <46DE7D99.7000508@voltaire.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> <46DE7D99.7000508@voltaire.com> Message-ID: <000101c7f009$6472de50$3c98070a@amr.corp.intel.com> To support QoS within and between subnets, modify IPoIB to request specific Traffic Class values with path record queries, using the value associated with the IPoIB broadcast group. 
Signed-off-by: Sean Hefty --- Added missing traffic class to PR component mask. drivers/infiniband/ulp/ipoib/ipoib.h | 22 +++++++++++++++++++++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 +++++--- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 22 ---------------------- 3 files changed, 26 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 285c143..fc16bce 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -113,7 +113,27 @@ struct ipoib_pseudoheader { u8 hwaddr[INFINIBAND_ALEN]; }; -struct ipoib_mcast; +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ib_sa_multicast *mc; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; struct ipoib_rx_buf { struct sk_buff *skb; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 894b1dc..841e068 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -468,9 +468,10 @@ static struct ipoib_path *path_rec_create(struct net_device *dev, void *gid) INIT_LIST_HEAD(&path->neigh_list); memcpy(path->pathrec.dgid.raw, gid, sizeof (union ib_gid)); - path->pathrec.sgid = priv->local_gid; - path->pathrec.pkey = cpu_to_be16(priv->pkey); - path->pathrec.numb_path = 1; + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + path->pathrec.traffic_class = priv->broadcast->mcmember.traffic_class; return path; } @@ -491,6 +492,7 @@ static int path_rec_start(struct net_device *dev, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | 
IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_TRAFFIC_CLASS | IB_SA_PATH_REC_PKEY, 1000, GFP_ATOMIC, path_rec_completion, diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index aae3670..94a5709 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -57,28 +57,6 @@ MODULE_PARM_DESC(mcast_debug_level, static DEFINE_MUTEX(mcast_mutex); -/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ -struct ipoib_mcast { - struct ib_sa_mcmember_rec mcmember; - struct ib_sa_multicast *mc; - struct ipoib_ah *ah; - - struct rb_node rb_node; - struct list_head list; - - unsigned long created; - unsigned long backoff; - - unsigned long flags; - unsigned char logcount; - - struct list_head neigh_list; - - struct sk_buff_head pkt_queue; - - struct net_device *dev; -}; - struct ipoib_mcast_iter { struct net_device *dev; union ib_gid mgid; From rowland at cse.ohio-state.edu Wed Sep 5 15:09:36 2007 From: rowland at cse.ohio-state.edu (Shaun Rowland) Date: Wed, 05 Sep 2007 18:09:36 -0400 Subject: [ofa-general] Re: [openib-general] MVAPICH2 SRPM update and install files patch In-Reply-To: References: <45CE1C1C.70406@cse.ohio-state.edu> Message-ID: <46DF2920.1020707@cse.ohio-state.edu> Chris Worley wrote: > Some of those changes for icc don't make sense. Setting "CC" to "icc > -i-dynamic" looks for an executable file name of the entire string... > causing: Are you pulling that out of config.log or just assuming that's happening? We build and test with those Intel compiler settings all the time, and from what I see - what you describe should not be the problem. Without the config.log, I cannot tell exactly what might be the problem. You could try building statically and see if that works, as it should not use those flags. You could modify build.sh and remove those flags, just to see if that works. 
Those flags were added in that way because we had run into an issue with the Intel compiler and building with shared library support at one point. What version of the Intel compiler are you using? You could also try building our latest source release of MVAPICH2 0.9.8 with the same compiler settings - in order to get the config.log file easier if it also has a problem. -- Shaun Rowland rowland at cse.ohio-state.edu http://www.cse.ohio-state.edu/~rowland/ From swelch at systemfabricworks.com Wed Sep 5 15:13:30 2007 From: swelch at systemfabricworks.com (Steve Welch) Date: Wed, 5 Sep 2007 17:13:30 -0500 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: References: <46D78104.mailJY81GRONO@systemfabricworks.com> <001d01c7f003$f4df1fe0$bc0da8c0@catcher> Message-ID: <000001c7f009$fe883620$a865a8c0@catcher> > -----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Wednesday, September 05, 2007 4:55 PM > To: Steve Welch > Cc: general at lists.openfabrics.org; sean.hefty at intel.com > Subject: Re: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR > SMP responses from userspace > > On 9/5/07, Steve Welch wrote: > > > > /* Check to post send on QP or process locally */ > > > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > > > + smi_check_local_resp_smp(smp, device) == IB_SMI_DISCARD) > > > > goto out; > > > > > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct > > > ib_mad_agent_private *mad_agent_priv, > > > > if (port_priv) { > > > > mad_priv->mad.mad.mad_hdr.tid = > > > > ((struct ib_mad *)smp)->mad_hdr.tid; > > > > + memcpy(&mad_priv->mad.mad, smp, > sizeof(struct > > > ib_mad)); > > > > > > Is this copy only needed in the (new) returning direction case ? 
> > > > No, it is needed whether the SMP is a request or response. > > > > > > > > > recv_mad_agent = find_mad_agent(port_priv, > > > > &mad_priv- > > > >mad.mad); > > > > } > > > > diff --git a/drivers/infiniband/core/smi.h > > > b/drivers/infiniband/core/smi.h > > > > index 1cfc298..d96fc8e 100644 > > > > --- a/drivers/infiniband/core/smi.h > > > > +++ b/drivers/infiniband/core/smi.h > > > > @@ -71,4 +71,18 @@ static inline enum smi_action > > > smi_check_local_smp(struct ib_smp *smp, > > > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > > > } > > > > + > > > > +/* > > > > + * Return 1 if the SMP response should be handled by the local > > > management stack > > > > + */ > > > > +static inline enum smi_action smi_check_local_resp_smp(struct > ib_smp > > > *smp, > > > > + struct > ib_device > > > *device) > > > > +{ > > > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > > > + return ((device->process_mad && > > > > + ib_get_smp_direction(smp) && > > > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > > > +} > > > > + > > > > > > I think this routine and the existing one could be better named: > > > smi_check_local_outgoing/returning_smp. > > > > > > > Possibly, but the SMP does originate in both cases from a local mad send > > operation. In one case sending the request and in the other sending the > > response; in both cases they are locally handled. > > Aren't they more appropriately termed outgoing and returning rather > than request/response ? Guess it ends up being the same since in > practice Traps and TrapRepresses are only LID routed but there is > nothing in the spec that precludes them from being direct routed. > Yes, from the perspective of mapping the processing back to the IB spec. I'm certainly fine with whatever name is chosen. 
Steve > -- Hal > > > > Steve > > > > > -- Hal From sean.hefty at intel.com Wed Sep 5 15:23:11 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Sep 2007 15:23:11 -0700 Subject: [ofa-general] RE: [PATCH] librdmacm 1/2: add valgrind support to auto-tools configuration file In-Reply-To: <200708151352.42026.dotanb@dev.mellanox.co.il> References: <200708151352.42026.dotanb@dev.mellanox.co.il> Message-ID: <000201c7f00b$5826e900$3c98070a@amr.corp.intel.com> librdmacm: add valgrind support. Signed-off-by: Dotan Barak Signed-off-by: Sean Hefty --- Changes from the posted patches: * I combined both patches into a single patch. * I tried to keep the config file simple and went with the option of only including memcheck.h if valgrind support was requested. * The check for memcheck.h is not done if disable_libcheck is true. * VALGRIND_MAKE_MEM_DEFINED is only defined if memcheck.h is not included. I would rather fail the build if memcheck.h does not define this, than print a warning and define it ourselves. If there's a problem with any of these choices, please let me know.
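A minimal sketch of the fallback-macro choice described in the last bullet, assuming a build where INCLUDE_VALGRIND comes from configure. `struct fake_resp` and `read_resp_id()` are hypothetical stand-ins for the ucma response handling; only the macro pattern itself is taken from the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * INCLUDE_VALGRIND is the configure-time define from the patch.
 * Without it the annotation compiles away to nothing.
 */
#ifdef INCLUDE_VALGRIND
#  include <valgrind/memcheck.h>
#else
#  define VALGRIND_MAKE_MEM_DEFINED(addr, len)
#endif

/* Hypothetical response layout, for illustration only. */
struct fake_resp {
	unsigned int id;
};

/*
 * In librdmacm the response buffer is filled in by the kernel, which
 * memcheck cannot observe. Here memcpy() plays that role, so the
 * annotation is redundant and only shows where the call would sit.
 */
static unsigned int read_resp_id(const void *buf)
{
	struct fake_resp resp;

	assert(buf != NULL);
	memcpy(&resp, buf, sizeof resp);
	VALGRIND_MAKE_MEM_DEFINED(&resp, sizeof resp);
	return resp.id;
}
```

Because the empty macro is defined only when memcheck.h is left out, a valgrind-enabled build whose header lacks VALGRIND_MAKE_MEM_DEFINED fails at compile time instead of silently dropping the annotations.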
configure.in | 18 ++++++++++++++++++ src/cma.c | 20 ++++++++++++++++++++ 2 files changed, 38 insertions(+), 0 deletions(-) diff --git a/configure.in b/configure.in index 7ecaaf1..1b307b7 100644 --- a/configure.in +++ b/configure.in @@ -9,6 +9,18 @@ AM_INIT_AUTOMAKE(librdmacm, 1.0.2) AM_PROG_LIBTOOL +AC_ARG_WITH([valgrind], + AC_HELP_STRING([--with-valgrind], + [Enable valgrind annotations - default NO])) + +if test "$with_valgrind" != "" && test "$with_valgrind" != "no"; then + AC_DEFINE([INCLUDE_VALGRIND], 1, + [Define to 1 to enable valgrind annotations]) + if test -d $with_valgrind; then + CPPFLAGS="$CPPFLAGS -I$with_valgrind/include" + fi +fi + AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], [ if test "$enableval" = "no"; then disable_libcheck=yes @@ -33,6 +45,12 @@ AC_HEADER_STDC if test "$disable_libcheck" != "yes"; then AC_CHECK_HEADER(infiniband/verbs.h, [], AC_MSG_ERROR([<infiniband/verbs.h> not found. Is libibverbs installed?])) + +if test "$with_valgrind" != "" && test "$with_valgrind" != "no"; then +AC_CHECK_HEADER(valgrind/memcheck.h, [], + AC_MSG_ERROR([valgrind requested but not found.])) +fi + fi AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, diff --git a/src/cma.c b/src/cma.c index 32edc1f..db336da 100644 --- a/src/cma.c +++ b/src/cma.c @@ -55,6 +55,12 @@ #include #include +#ifdef INCLUDE_VALGRIND +# include <valgrind/memcheck.h> +#else +# define VALGRIND_MAKE_MEM_DEFINED(addr,len) +#endif + #define PFX "librdmacm: " #if __BYTE_ORDER == __LITTLE_ENDIAN @@ -383,6 +389,8 @@ int rdma_create_id(struct rdma_event_channel *channel, if (ret != size) goto err; + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + id_priv->handle = resp->id; *id = &id_priv->id; return 0; @@ -405,6 +413,8 @@ static int ucma_destroy_kern_id(int fd, uint32_t handle) if (ret != size) return (ret > 0) ?
-ENODATA : ret; + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + return resp->events_reported; } @@ -458,6 +468,8 @@ static int ucma_query_route(struct rdma_cm_id *id) if (ret != size) return (ret > 0) ? -ENODATA : ret; + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + if (resp->num_paths) { id->route.path_rec = malloc(sizeof *id->route.path_rec * resp->num_paths); @@ -583,6 +595,8 @@ static int rdma_init_qp_attr(struct rdma_cm_id *id, struct ibv_qp_attr *qp_attr, if (ret != size) return (ret > 0) ? -ENODATA : ret; + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + ibv_copy_qp_attr_from_kern(qp_attr, resp); *qp_attr_mask = resp->qp_attr_mask; return 0; @@ -1010,6 +1024,8 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, goto err2; } + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + mc->handle = resp->id; return 0; err2: @@ -1061,6 +1077,8 @@ int rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) goto free; } + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + pthread_mutex_lock(&id_priv->mut); while (mc->events_completed < resp->events_reported) pthread_cond_wait(&mc->cond, &id_priv->mut); @@ -1256,6 +1274,8 @@ retry: return (ret > 0) ? -ENODATA : ret; } + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + evt->event.event = resp->event; switch (resp->event) { case RDMA_CM_EVENT_ADDR_RESOLVED: From sean.hefty at intel.com Wed Sep 5 15:36:40 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Sep 2007 15:36:40 -0700 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] rdma/cm: add ability to specify type of service In-Reply-To: References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> Message-ID: <000301c7f00d$3a245850$3c98070a@amr.corp.intel.com> >On the end stack side, has it been decided to ignore whether the SA >indicates whether or not it supports QoS ? 
Wouldn't it be useful to >have some warning message indicating this in the end node (that it >might not be getting the service quality desired) ? This is just my opinion, but... This ends up adding a fair amount of complexity in order to display a simple warning message. The code is written to try for QoS, but use whatever is available if QoS support is not enabled. If we wanted to display a warning, then I think that the user should have control over whether QoS support is enabled on the host side, along with policy controls over what action to take in case of a failure. This support would need to be per ULP, and defined for each node on the fabric. Displaying a warning without the user explicitly asking for QoS support can give the impression that something is wrong when things are operating correctly. The other complexity is that additional queuing becomes necessary for QoS enabled PR queries. It's possible for ULPs to request paths before the local SA query code can determine whether or not the SA supports QoS. I don't feel that a warning message is necessarily worth the extra complexity, especially when things like SA failover and IB routers get tossed into the mix. - Sean From hal.rosenstock at gmail.com Wed Sep 5 16:01:54 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Sep 2007 19:01:54 -0400 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] rdma/cm: add ability to specify type of service In-Reply-To: <000301c7f00d$3a245850$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> <000301c7f00d$3a245850$3c98070a@amr.corp.intel.com> Message-ID: On 9/5/07, Sean Hefty wrote: > >On the end stack side, has it been decided to ignore whether the SA > >indicates whether or not it supports QoS ? Wouldn't it be useful to > >have some warning message indicating this in the end node (that it > >might not be getting the service quality desired) ? 
> > This is just my opinion, but... > > This ends up adding a fair amount of complexity in order to display a simple > warning message. The code is written to try for QoS, but use whatever is > available if QoS support is not enabled. If we wanted to display a warning, > then I think that the user should have control over whether QoS support is > enabled on the host side, along with policy controls over what action to take in > case of a failure. This support would need to be per ULP, and defined for each > node on the fabric. Displaying a warning without the user explicitly asking for > QoS support can give the impression that something is wrong when things are > operating correctly. > > The other complexity is that additional queuing becomes necessary for QoS > enabled PR queries. It's possible for ULPs to request paths before the local SA > query code can determine whether or not the SA supports QoS. It's not a requirement to wait for the QoS determination of the SA before issuing the SA PR requests. > I don't feel that > a warning message is necessarily worth the extra complexity, especially when > things like SA failover and IB routers get tossed into the mix. Failover is pretty straightforward. The same query is made when SM LID changes. As to IB routers, well, it seems a little early to envision how these two interact. A separate question: Will the QoS code handle the new error status code which suggests a different QoSClass or is it currently being handled like other errors ? Guess that could be a separate patch. 
-- Hal > - Sean > From sean.hefty at intel.com Wed Sep 5 16:14:33 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Sep 2007 16:14:33 -0700 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] rdma/cm: add ability to specify type of service In-Reply-To: References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> <000301c7f00d$3a245850$3c98070a@amr.corp.intel.com> Message-ID: <000501c7f012$84e842c0$3c98070a@amr.corp.intel.com> >It's not a requirement to wait for the QoS determination of the SA >before issuing the SA PR requests. It sounds like you're suggesting that the end nodes just automatically gather and print the capabilities of the SA. (Why limit it to QoS only then?) An administrator could run a separate tool to gather this information, rather than it being done on all nodes automatically. If the QoS determination is tied into the behavior of the PR queries, then I think you end up either queuing the requests or responses somewhere. >Failover is pretty straightforward. The same query is made when SM LID changes. Yes - but existing paths that are in use may now suddenly provide or no longer provide QoS. >Will the QoS code handle the new error status code which suggests a >different QoSClass or is it currently being handled like other errors >? Guess that could be a separate patch. This would be a separate patch, and is left up to each ULP at the moment. For ipoib, I used the TClass, so the new QoS fields are not an issue. For the rdma_cm, the QoS info is set by the user when using IPv4 addressing, so they would need to take appropriate action. 
- Sean From sashak at voltaire.com Wed Sep 5 16:26:43 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 02:26:43 +0300 Subject: [ofa-general] Re: [PATCH v2] osm: QoS: selecting PathRecord according to QoS policy In-Reply-To: <46DE9F97.10003@dev.mellanox.co.il> References: <46DE9F97.10003@dev.mellanox.co.il> Message-ID: <20070905232643.GC25330@sashak.voltaire.com> Hi Yevgeny, On 15:22 Wed 05 Sep , Yevgeny Kliteynik wrote: > Selecting path according to QoS policy level that > the PathRecord query matches. > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/osm_sa_path_record.c | 374 ++++++++++++++++++++++++++---------- > 1 files changed, 276 insertions(+), 98 deletions(-) > > diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c > index 1b781f0..15bd7e2 100644 > --- a/opensm/opensm/osm_sa_path_record.c > +++ b/opensm/opensm/osm_sa_path_record.c > @@ -67,6 +67,7 @@ > #include > #include > #include > +#include > #ifdef ROUTER_EXP > #include > #include > @@ -236,8 +237,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > { > const osm_node_t *p_node; > const osm_physp_t *p_physp; > + const osm_physp_t *p_src_physp; > const osm_physp_t *p_dest_physp; > - const osm_prtn_t *p_prtn; > + const osm_prtn_t *p_prtn = NULL; > const ib_port_info_t *p_pi; > ib_api_status_t status = IB_SUCCESS; > ib_net16_t pkey; > @@ -248,14 +250,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > uint8_t required_rate; > uint8_t required_pkt_life; > uint8_t sl; > + uint8_t in_port_num; > ib_net16_t dest_lid; > + uint8_t i; > + uint8_t vl; > + ib_slvl_table_t *p_slvl_tbl = NULL; > + boolean_t valid_sls[IB_MAX_NUM_VLS]; > + boolean_t sl2vl_valid_path; > + uint8_t first_valid_sl; > + osm_qos_level_t *p_qos_level = NULL; > > OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); > > + memset(valid_sls, TRUE, IB_MAX_NUM_VLS); > dest_lid = cl_hton16(dest_lid_ho); > > p_dest_physp = p_dest_port->p_physp; > 
p_physp = p_src_port->p_physp; > + p_src_physp = p_physp; > p_pi = &p_physp->port_info; > > mtu = ib_port_info_get_mtu_cap(p_pi); > @@ -288,13 +300,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > p_node = osm_physp_get_node_ptr(p_physp); > > if (p_node->sw) { > + /* source node is a switch */ > + in_port_num = osm_physp_get_port_num(p_physp); Hmm, could in_port_num be != 0 here? > + > /* > * If the dest_lid_ho is equal to the lid of the switch pointed by > * p_sw then p_physp will be the physical port of the switch port zero. I know it is not your code, but do you understand this part of the comment? > + * Make sure that p_physp points to the out port of the > + * switch that routes to the destination lid (dest_lid_ho) > */ > - p_physp = > - osm_switch_get_route_by_lid(p_node->sw, > - cl_ntoh16(dest_lid_ho)); > + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); > if (p_physp == 0) { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_get_path_parms: ERR 1F02: " > @@ -306,15 +321,32 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > } > } > > + if (!p_rcv->p_subn->opt.no_qos) { Would you prefer to change opt.no_qos to opt.qos? To me it looks like things would be clearer that way.
> + if (p_node->sw) > + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > + else > + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); > + > + /* update valid SLs that still exist on this route */ > + for (i = 0; i < IB_MAX_NUM_VLS; i++) { > + if (valid_sls[i]) { > + vl = ib_slvl_table_get(p_slvl_tbl, i); > + if (vl == IB_DROP_VL) > + valid_sls[i] = FALSE; > + } > + } > + } > + > /* > * Same as above > */ > p_node = osm_physp_get_node_ptr(p_dest_physp); > > if (p_node->sw) { > - p_dest_physp = > - osm_switch_get_route_by_lid(p_node->sw, > - cl_ntoh16(dest_lid_ho)); > + /* > + * if destination is switch, we want p_dest_physp to point to port 0 > + */ > + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); > > if (p_dest_physp == 0) { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > @@ -328,6 +360,10 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > > } > > + /* > + * Now go through the path step by step > + */ > + > while (p_physp != p_dest_physp) { > p_physp = osm_physp_get_remote(p_physp); > > @@ -341,6 +377,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > goto Exit; > } > > + in_port_num = osm_physp_get_port_num(p_physp); > + > /* > This is point to point case (no switch in between) > */ > @@ -367,29 +405,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > */ > p_pi = &p_physp->port_info; > > - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { > + if (mtu > ib_port_info_get_mtu_cap(p_pi)) > mtu = ib_port_info_get_mtu_cap(p_pi); > - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "__osm_pr_rcv_get_path_parms: " > - "New smallest MTU = %u at intervening port 0x%016" > - PRIx64 " port num 0x%X\n", mtu, > - cl_ntoh64(osm_physp_get_port_guid > - (p_physp)), > - osm_physp_get_port_num(p_physp)); > - } > > - if (rate > ib_port_info_compute_rate(p_pi)) { > + if (rate > ib_port_info_compute_rate(p_pi)) > rate = ib_port_info_compute_rate(p_pi); > - if 
(osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "__osm_pr_rcv_get_path_parms: " > - "New smallest rate = %u at intervening port 0x%016" > - PRIx64 " port num 0x%X\n", rate, > - cl_ntoh64(osm_physp_get_port_guid > - (p_physp)), > - osm_physp_get_port_num(p_physp)); > - } > > /* > Continue with the egress port on this switch. > @@ -409,32 +429,41 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > CL_ASSERT(p_physp); This is not needed; the run-time check is done right above. (I know it is not your code) > CL_ASSERT(osm_physp_is_valid(p_physp)); > > + p_node = osm_physp_get_node_ptr(p_physp); > + if (!p_node->sw) { Actually, this !p_node->sw check duplicates the one above, where !p_node->sw is verified for the egress port of this switch. Right? > + /* > + * There is some sort of problem in the subnet object! > + * If this isn't a switch, we should have reached > + * the destination by now! > + */ > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F04: " > + "Internal error, bad path\n"); > + status = IB_ERROR; > + goto Exit; > + } > + > p_pi = &p_physp->port_info; > > - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { > + if (mtu > ib_port_info_get_mtu_cap(p_pi)) > mtu = ib_port_info_get_mtu_cap(p_pi); > - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "__osm_pr_rcv_get_path_parms: " > - "New smallest MTU = %u at intervening port 0x%016" > - PRIx64 " port num 0x%X\n", mtu, > - cl_ntoh64(osm_physp_get_port_guid > - (p_physp)), > - osm_physp_get_port_num(p_physp)); > - } > > - if (rate > ib_port_info_compute_rate(p_pi)) { > + if (rate > ib_port_info_compute_rate(p_pi)) > rate = ib_port_info_compute_rate(p_pi); > - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "__osm_pr_rcv_get_path_parms: " > - "New smallest rate = %u at intervening port 0x%016" > - PRIx64 " port num 0x%X\n", rate, > -
cl_ntoh64(osm_physp_get_port_guid > - (p_physp)), > - osm_physp_get_port_num(p_physp)); > - } > > + if (!p_rcv->p_subn->opt.no_qos) { > + /* > + * Check SL2VL table of the switch and update valid SLs > + */ > + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > + for (i = 0; i < IB_MAX_NUM_VLS; i++) { > + if (valid_sls[i]) { > + vl = ib_slvl_table_get(p_slvl_tbl, i); > + if (vl == IB_DROP_VL) > + valid_sls[i] = FALSE; > + } > + } > + } > } > > /* > @@ -442,30 +471,104 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > */ > p_pi = &p_physp->port_info; > > - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { > + if (mtu > ib_port_info_get_mtu_cap(p_pi)) > mtu = ib_port_info_get_mtu_cap(p_pi); > - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > + > + if (rate > ib_port_info_compute_rate(p_pi)) > + rate = ib_port_info_compute_rate(p_pi); > + > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "Path min MTU = %u, min rate = %u\n", > + mtu, rate); > + > + if (!p_rcv->p_subn->opt.no_qos) { > + /* > + * check whether there is some SL > + * that won't lead to VL15 eventually > + */ > + sl2vl_valid_path = FALSE; > + for (i = 0; i < IB_MAX_NUM_VLS; i++) { > + if (valid_sls[i]) { > + sl2vl_valid_path = TRUE; > + first_valid_sl = i; > + break; > + } > + } > + > + if (!sl2vl_valid_path) { > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "All the SLs lead to VL15 on this path\n"); > + } > + status = IB_NOT_FOUND; > + goto Exit; > + } > + } > + > + if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { > + /* Get QoS Level object according to the path request */ > + osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, > + p_rcv, p_pr, > + p_src_physp, p_dest_physp, > + comp_mask, &p_qos_level); > + > + if (p_qos_level > + && osm_log_is_active(p_rcv->p_log, 
OSM_LOG_DEBUG)) { > osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_pr_rcv_get_path_parms: " > - "New smallest MTU = %u at destination port 0x%016" > - PRIx64 "\n", mtu, > - cl_ntoh64(osm_physp_get_port_guid(p_physp))); > + "PathRecord request matches QoS Level '%s' (%s)\n", > + p_qos_level->name, > + (p_qos_level->use) ? p_qos_level-> > + use : "no description"); > + } > } > > - if (rate > ib_port_info_compute_rate(p_pi)) { > - rate = ib_port_info_compute_rate(p_pi); > + /* Adjust path parameters according to QoS settings */ > + > + if (p_qos_level) { Why not make osm_qos_policy_get_qos_level_by_pr() return the pointer to the QoS level? Then you could simply merge both conditions (this one and the one above), something like: if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy && (p_qos_level = osm_qos_policy_get_qos_level_by_pr(..))) { > + if (p_qos_level->mtu_limit_set > + && (mtu > p_qos_level->mtu_limit)) > + mtu = p_qos_level->mtu_limit; > + > + if (p_qos_level->rate_limit_set > + && (rate > p_qos_level->rate_limit)) > + rate = p_qos_level->rate_limit; > + > + if (p_qos_level->pkt_life_set > + && (pkt_life > p_qos_level->pkt_life)) > + pkt_life = p_qos_level->pkt_life; > + > + if (p_qos_level->sl_set) { > + if (!valid_sls[p_qos_level->sl]) { > + status = IB_NOT_FOUND; > + goto Exit; > + } > + sl = p_qos_level->sl; > + } > + > if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_pr_rcv_get_path_parms: " > - "New smallest rate = %u at destination port 0x%016" > - PRIx64 "\n", rate, > - cl_ntoh64(osm_physp_get_port_guid(p_physp))); > + "Path params with QoS constraints: " > + "min MTU = %u, min rate = %u, " > + "packet lifetime = %u, sl = %u\n", > + mtu, rate, pkt_life, sl); > } > > - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "__osm_pr_rcv_get_path_parms: " > - "Path min MTU = %u, min rate = %u\n", mtu, rate); > + /* > + * Set packet lifetime.
> + * According to spec definition IBA 1.2 Table 205 > + * PacketLifeTime description, for loopback paths, > + * packetLifeTime shall be zero. > + */ > + if (p_src_port == p_dest_port) > + pkt_life = 0; > + else if ( !(p_qos_level && p_qos_level->pkt_life_set) ) > + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > + > > /* > Determine if these values meet the user criteria > @@ -511,6 +614,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > break; > } > } > + if (status != IB_SUCCESS) > + goto Exit; > > /* we silently ignore cases where only the Rate selector is defined */ > if ((comp_mask & IB_PR_COMPMASK_RATESELEC) && > @@ -551,14 +656,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > break; > } > } > - > - /* Verify the pkt_life_time */ > - /* According to spec definition IBA 1.2 Table 205 PacketLifeTime description, > - for loopback paths, packetLifeTime shall be zero. */ > - if (p_src_port == p_dest_port) > - pkt_life = 0; /* loopback */ > - else > - pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > + if (status != IB_SUCCESS) > + goto Exit; > > /* we silently ignore cases where only the PktLife selector is defined */ > if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) && > @@ -603,12 +702,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > if (status != IB_SUCCESS) > goto Exit; > > - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) > - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); > + /* > + * set Pkey for this path record request > + */ > + > + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && > + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) > + pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp); So is it was bug (not related to QoS) when p_physp instead of p_src_physp was used for pkey finding? > + > else if (comp_mask & IB_PR_COMPMASK_PKEY) { > + /* > + * PR request has a specific pkey: > + * Check that source and destination share this pkey. 
> + * If QoS level has pkeys, check that this pkey exists > + * in the QoS level pkeys. > + * PR returned pkey is the requested pkey. > + */ > pkey = p_pr->pkey; > - if (!osm_physp_share_this_pkey(p_physp, p_dest_physp, pkey)) { > + if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_get_path_parms: ERR 1F1A: " > "Ports do not share specified PKey 0x%04x\n", > @@ -616,8 +727,37 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > status = IB_NOT_FOUND; > goto Exit; > } > + if (p_qos_level && p_qos_level->pkey_range_len && > + !osm_qos_level_has_pkey(p_qos_level, pkey)) { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F1D: " > + "Ports do not share PKeys defined by QoS level\n"); > + status = IB_NOT_FOUND; > + goto Exit; > + } > + > + } else if (p_qos_level && p_qos_level->pkey_range_len) { > + /* > + * PR request doesn't have a specific pkey, but QoS level > + * has pkeys - get shared pkey from QoS level pkeys > + */ > + pkey = osm_qos_level_get_shared_pkey(p_qos_level, > + p_src_physp, > + p_dest_physp); > + if (!pkey) { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F1E: " > + "Ports do not share PKeys defined by QoS level\n"); > + status = IB_NOT_FOUND; > + goto Exit; > + } > } else { > - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); > + /* > + * Neither PR request nor QoS level have pkey. > + * Just get any shared pkey. 
> + */ > + pkey = osm_physp_find_common_pkey(p_src_physp, > + p_dest_physp); > if (!pkey) { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_get_path_parms: ERR 1F1B: " > @@ -627,14 +767,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > } > } > > - if (p_rcv->p_subn->opt.routing_engine_name && > - strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) > - /* slid and dest_lid are stored in network in lash */ > - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, > - p_dest_port); > - else > - sl = OSM_DEFAULT_SL; > - > if (pkey) { > p_prtn = > (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, > @@ -642,34 +774,80 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > 0x8000)); > if (p_prtn == > (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) > + p_prtn = NULL; > + } > + > + /* > + * Set PathRecord SL. > + * > + * ToDo: What about QoS and LASH routing? How can they coexist? > + * And what happens when there's a pkey, hence there is a > + * partition with a certain SL, and this SL doesn't match > + * the one that's defined by LASH? 
> + */ > + > + if (comp_mask & IB_PR_COMPMASK_SL) { > + /* > + * Specific SL was requested > + */ > + sl = ib_path_rec_sl(p_pr); > + if (p_qos_level && p_qos_level->sl_set && (p_qos_level->sl != sl)) { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F1F: " > + "QoS constaraints: required PR SL (%u) " > + "doesn't match QoS SL (%u)\n", > + sl, p_qos_level->sl); > + status = IB_NOT_FOUND; > + goto Exit; > + } > + } else if (p_qos_level && p_qos_level->sl_set) { > + /* > + * No specific SL was requested, > + * but there is an SL in QoS level > + */ > + sl = p_qos_level->sl; > + if (pkey && p_prtn && p_prtn->sl != p_qos_level->sl) > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "QoS level SL (%u) overrides partition SL (%u)\n", > + p_qos_level->sl, p_prtn->sl); > + } else if (pkey) { > + /* > + * No specific SL in request or in QoS level - use partition SL > + */ > + if (!p_prtn) { > /* this may be possible when pkey tables are created somehow in > previous runs or things are going wrong here */ > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_get_path_parms: ERR 1F1C: " > "No partition found for PKey 0x%04x - using default SL %d\n", > cl_ntoh16(pkey), sl); > - else { > - if (p_rcv->p_subn->opt.routing_engine_name && > - strcmp(p_rcv->p_subn->opt.routing_engine_name, > - "lash") == 0) > - /* slid and dest_lid are stored in network in lash */ > - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, > - p_src_port, p_dest_port); > - else > - sl = p_prtn->sl; > - } > - > - /* reset pkey when raw traffic */ > - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) > - pkey = 0; > + } else > + sl = p_prtn->sl; > + } else if (p_rcv->p_subn->opt.routing_engine_name && > + strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) { It seems that in original code LASH was "higher" priority in SL selection than partition configuration? If so, any reason why it is changed? 
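For reference, the SL selection order in the patched code quoted above can be summarized by the sketch below. It is only an illustration of the precedence being asked about (requested SL, then QoS level SL, then partition SL, then LASH, then first valid SL, then default), not OpenSM code: the struct, the function, and the `SKETCH_*` constants are hypothetical names, and the "pkey without partition" case is simplified to fall back to the default SL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_SLS    16  /* assumption: one validity flag per SL, 0..15 */
#define SKETCH_DEFAULT_SL 0   /* assumption: stand-in for OSM_DEFAULT_SL */

/* Hypothetical, flattened view of the inputs the patch consults. */
struct sketch_sl_inputs {
	bool    sl_requested;   /* comp_mask has the SL bit set */
	uint8_t requested_sl;
	bool    qos_sl_set;     /* a QoS level matched and defines an SL */
	uint8_t qos_sl;
	bool    has_pkey;       /* a pkey was chosen for the path */
	bool    has_partition;  /* that pkey maps to a known partition */
	uint8_t partition_sl;
	bool    lash_routing;   /* routing engine is "lash" */
	uint8_t lash_sl;
	bool    qos_enabled;    /* corresponds to !opt.no_qos */
	uint8_t first_valid_sl; /* first SL not mapped to VL15 on the path */
};

/* Returns 0 and stores the chosen SL, or -1 if no acceptable SL exists. */
int sketch_select_sl(const struct sketch_sl_inputs *in,
		     const bool valid_sls[SKETCH_NUM_SLS], uint8_t *sl_out)
{
	uint8_t sl;

	assert(in != NULL && valid_sls != NULL && sl_out != NULL);

	if (in->sl_requested) {
		/* An explicit request wins, but must agree with the QoS level. */
		sl = in->requested_sl;
		if (in->qos_sl_set && in->qos_sl != sl)
			return -1;
	} else if (in->qos_sl_set) {
		/* QoS level SL overrides the partition SL. */
		sl = in->qos_sl;
	} else if (in->has_pkey) {
		/* Partition SL is consulted before LASH in the patched order. */
		sl = in->has_partition ? in->partition_sl : SKETCH_DEFAULT_SL;
	} else if (in->lash_routing) {
		sl = in->lash_sl;
	} else if (in->qos_enabled) {
		sl = in->first_valid_sl;
	} else {
		sl = SKETCH_DEFAULT_SL;
	}

	/* Reject out-of-range SLs and, with QoS on, SLs that map to VL15. */
	if (sl >= SKETCH_NUM_SLS)
		return -1;
	if (in->qos_enabled && !valid_sls[sl])
		return -1;

	*sl_out = sl;
	return 0;
}
```

With a partition SL and LASH both available, this sketch returns the partition SL, which is the ordering change the question above is about.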
> + /* slid and dest_lid are stored in network in lash */ > + sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, > + p_src_port, p_dest_port); > + } else if (!p_rcv->p_subn->opt.no_qos) { > + sl = first_valid_sl; > } > + else > + sl = OSM_DEFAULT_SL; > > - if ((comp_mask & IB_PR_COMPMASK_SL) && ib_path_rec_sl(p_pr) != sl) { > + if (!p_rcv->p_subn->opt.no_qos && !valid_sls[sl]) { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_pr_rcv_get_path_parms: ERR 1F23: " > + "Selected SL (%u) leads to VL15\n", p_prtn->sl); > status = IB_NOT_FOUND; > goto Exit; > } > > + /* reset pkey when raw traffic */ > + if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > + cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) > + pkey = 0; > + > p_parms->mtu = mtu; > p_parms->rate = rate; > p_parms->pkt_life = pkt_life; > -- > 1.5.1.4 > We discussed already about using sl_mask instead of valid_sls array. The rest looks fine for me. Sasha From hal.rosenstock at gmail.com Wed Sep 5 16:40:34 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 5 Sep 2007 19:40:34 -0400 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] rdma/cm: add ability to specify type of service In-Reply-To: <000501c7f012$84e842c0$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> <000301c7f00d$3a245850$3c98070a@amr.corp.intel.com> <000501c7f012$84e842c0$3c98070a@amr.corp.intel.com> Message-ID: On 9/5/07, Sean Hefty wrote: > >It's not a requirement to wait for the QoS determination of the SA > >before issuing the SA PR requests. > > It sounds like you're suggesting that the end nodes just automatically gather > and print the capabilities of the SA. Already is a diag tool (saquery) which does this. > (Why limit it to QoS only then?) Only because it was the topic of discussion and the most "interesting" capability to query. 
> An administrator could run a separate tool to gather this information, rather than > it being done on all nodes automatically. If the QoS determination is tied into > the behavior of the PR queries, then I think you end up either queuing the > requests or responses somewhere. Sure if they are somehow tied together. > >Failover is pretty straightforward. The same query is made when SM LID changes. > > Yes - but existing paths that are in use may now suddenly provide or no longer > provide QoS. Indeed and maintaining QoS across SM failover is a hard problem which is left to the individual SM implementations (as are other similar failover issues). -- Hal > >Will the QoS code handle the new error status code which suggests a > >different QoSClass or is it currently being handled like other errors > >? Guess that could be a separate patch. > > This would be a separate patch, and is left up to each ULP at the moment. For > ipoib, I used the TClass, so the new QoS fields are not an issue. For the > rdma_cm, the QoS info is set by the user when using IPv4 addressing, so they > would need to take appropriate action. 
> > - Sean > From jgunthorpe at obsidianresearch.com Wed Sep 5 17:20:29 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 5 Sep 2007 18:20:29 -0600 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <15ddcffd0709051335l7ba8a976v1535ba8a6e923206@mail.gmail.com> References: <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> <20070905051040.GM28350@mellanox.co.il> <20070905055108.GB16535@obsidianresearch.com> <20070905061913.GN28350@mellanox.co.il> <20070905170545.GM4472@obsidianresearch.com> <15ddcffd0709051335l7ba8a976v1535ba8a6e923206@mail.gmail.com> Message-ID: <20070906002029.GR4472@obsidianresearch.com> On Wed, Sep 05, 2007 at 11:35:06PM +0300, Or Gerlitz wrote: > > Judging by the other comments in this thread, it still seems to me > > this would be best as RC only, notionally with the idea that RC is > > only used between hosts and not between gateways and hosts > > (administratively configured). That way the end-to-end nature of the > > checksum is retained. Gateways that want to support RC can negotiate > > this feature off. > I guess by "RC" you mean connected mode. The connected mode is now > implemented over RC but as was discussed over this list few times, it > should (and it would) move to use UC, which is also much easier to > implement in hw based gateways. Anyway, your idea to allow this > feature coming into play only under negotiation schem sounds fine to > me, however: Sure.. Though, I'm not sure what advantage UC/RC brings to a gateway app when you can't pass 64k MTU onto ethernet... > > You may also want to look at using the new TSO/GSO/LRO stuff in a RC > > context. If you could send an entire GSO in one go and receive it as a > > LRO that might be a big improvement too. 
> > From Michael's and Eli's responses over the stateless offload related > thread, I understood that these optimizations are supported only for > UD QPs, which makes them irrelevant for the connected mode. Right, but I'm not suggesting using the chips offload. Micheal has made it so you can use 'csum offload' (via disabling csum) on any nic. You can also do the same kind of thing for TSO/GSO. If you send jumbo TSO/GSO packets in a chunk the receiver can then do LRO. Win all around. Sort of like jumbo MTU but without actually changing the MTU. This is all basically the same set of techniques we see between a Linux guest and the linux host in a virtualization environment. Jason From sashak at voltaire.com Wed Sep 5 18:15:54 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 04:15:54 +0300 Subject: [ofa-general] [PATCH] opensm/autogen.sh: remove autogen.sh from opensm subdirectories Message-ID: <20070906011554.GF25330@sashak.voltaire.com> Remove autogen.sh scripts from opensm subdirectories Signed-off-by: Sasha Khapyorsky --- opensm/autogen.sh | 23 +++++++++++++++-------- opensm/complib/autogen.sh | 15 --------------- opensm/include/autogen.sh | 14 -------------- opensm/libvendor/autogen.sh | 14 -------------- opensm/opensm/autogen.sh | 14 -------------- opensm/osmeventplugin/autogen.sh | 15 --------------- opensm/osmtest/autogen.sh | 14 -------------- 7 files changed, 15 insertions(+), 94 deletions(-) delete mode 100755 opensm/complib/autogen.sh delete mode 100755 opensm/include/autogen.sh delete mode 100755 opensm/libvendor/autogen.sh delete mode 100755 opensm/opensm/autogen.sh delete mode 100755 opensm/osmeventplugin/autogen.sh delete mode 100755 opensm/osmtest/autogen.sh diff --git a/opensm/autogen.sh b/opensm/autogen.sh index e463c0e..e1ec064 100755 --- a/opensm/autogen.sh +++ b/opensm/autogen.sh @@ -63,12 +63,19 @@ fi # visit all sub directories with autogen.sh anyErr=0 for a in `ls */autogen.sh`; do - echo Visiting $a - $a 2>&1 | sed 
's/^/| /' | grep -v "arning: underquoted definition" - if test $? != 0; then - echo $a failed - anyErr=1 - fi + dir=`dirname $a` + test -d ${dir}/config || mkdir ${dir}/config + echo Visiting $a + ( cd `dirname $a` && \ + set -x && \ + aclocal -I config -I ../config && \ + libtoolize --force --copy && \ + autoheader && \ + automake --foreign --add-missing --copy && \ + autoconf ) \ + 2>&1 | sed 's/^/| /' | grep -v "arning: underquoted definition" + if test $? != 0; then + echo $a failed + anyErr=1 + fi done - -exit $anyErr diff --git a/opensm/complib/autogen.sh b/opensm/complib/autogen.sh deleted file mode 100755 index ec20fc5..0000000 --- a/opensm/complib/autogen.sh +++ /dev/null @@ -1,15 +0,0 @@ -#! /bin/sh - -# We change dir since the later utilities assume to work in the project dir -cd ${0%*/*} - -# create config dir if not exist -test -d config || mkdir config - -set -x -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize --force --copy) && \ -(autoheader) && \ -(automake --foreign --add-missing --copy) && \ -autoconf - diff --git a/opensm/include/autogen.sh b/opensm/include/autogen.sh deleted file mode 100755 index 03401b0..0000000 --- a/opensm/include/autogen.sh +++ /dev/null @@ -1,14 +0,0 @@ -#! /bin/sh - -# We change dir since the later utilities assume to work in the project dir -cd ${0%*/*} - -# create config dir if not exist -test -d config || mkdir config - -set -x -aclocal -I config -libtoolize --force --copy -autoheader -automake --foreign --add-missing --copy -autoconf diff --git a/opensm/libvendor/autogen.sh b/opensm/libvendor/autogen.sh deleted file mode 100755 index d30bf8f..0000000 --- a/opensm/libvendor/autogen.sh +++ /dev/null @@ -1,14 +0,0 @@ -#! 
/bin/sh - -# We change dir since the later utilities assume to work in the project dir -cd ${0%*/*} - -# create config dir if not exist -test -d config || mkdir config - -set -x -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize --force --copy) && \ -(autoheader) && \ -(automake --foreign --add-missing --copy) && \ -autoconf diff --git a/opensm/opensm/autogen.sh b/opensm/opensm/autogen.sh deleted file mode 100755 index d30bf8f..0000000 --- a/opensm/opensm/autogen.sh +++ /dev/null @@ -1,14 +0,0 @@ -#! /bin/sh - -# We change dir since the later utilities assume to work in the project dir -cd ${0%*/*} - -# create config dir if not exist -test -d config || mkdir config - -set -x -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize --force --copy) && \ -(autoheader) && \ -(automake --foreign --add-missing --copy) && \ -autoconf diff --git a/opensm/osmeventplugin/autogen.sh b/opensm/osmeventplugin/autogen.sh deleted file mode 100755 index ec20fc5..0000000 --- a/opensm/osmeventplugin/autogen.sh +++ /dev/null @@ -1,15 +0,0 @@ -#! /bin/sh - -# We change dir since the later utilities assume to work in the project dir -cd ${0%*/*} - -# create config dir if not exist -test -d config || mkdir config - -set -x -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize --force --copy) && \ -(autoheader) && \ -(automake --foreign --add-missing --copy) && \ -autoconf - diff --git a/opensm/osmtest/autogen.sh b/opensm/osmtest/autogen.sh deleted file mode 100755 index d30bf8f..0000000 --- a/opensm/osmtest/autogen.sh +++ /dev/null @@ -1,14 +0,0 @@ -#! 
/bin/sh - -# We change dir since the later utilities assume to work in the project dir -cd ${0%*/*} - -# create config dir if not exist -test -d config || mkdir config - -set -x -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize --force --copy) && \ -(autoheader) && \ -(automake --foreign --add-missing --copy) && \ -autoconf -- 1.5.3.1.1.g1e61 From mst at dev.mellanox.co.il Wed Sep 5 19:52:44 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Sep 2007 05:52:44 +0300 Subject: [ofa-general] Re: Low NFS RDMA performance with Connect X In-Reply-To: References: <27f776af0709040746u4038cc8ck7e9160c07b756936@mail.gmail.com> Message-ID: <20070906025244.GJ28361@mellanox.co.il> > Quoting James Lentini : > Subject: RE: Low NFS RDMA performance with Connect X > > > > On Wed, 5 Sep 2007, Kuchimanchi, Ramachandra wrote: > > > John Leidel wrote: > > > > > In doing some testing with ConnectX, I noticed a similar issue in MPI > > > performance. The fix was simply to upgrade to the latest and greatest > > > firmware. > > > > I tried with the latest ConnectX Firmware, version 2.2, and the Iozone > > numbers are almost similar to what I posted previously and very low as > > compared to the MT25208 numbers. > > > > NFS RDMA folks, any ideas as to why this is happening with Connect X ? > > We are bringing up our Connect X systems now (we're waiting on a > replacement memory dimm for our server). We'll be experimenting with > the performance on Connect X over the next few weeks. > > Both the client and server code bases have been updated substantially > since the Mellanox SDK was released. Results are likely to be > different on the newer code. > > Finally, it is conceivable that there will need to be performance > tweaks for the Connect X hardware. For Tavor hardware, ULPs use a 1KB > MTU to achieve maximum performance (see the setup of the path_mtu QP > attribute in net/sunrpc/xprtrdma/verbs.c). One thing worth a try is interrupt coalescing.
The simplest way to check is probably to apply the following patch and see if it helps. You can also try tweaking cq_max_count and cq_period module parameters. ---> From: Michael S. Tsirkin Subject: [PATCH] IB/mlx4: enable interrupt coalescing Enable interrupt coalescing for CQs in mlx4 Signed-off-by: Michael S. Tsirkin --- diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index 39253d0..b58dd75 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -42,6 +42,14 @@ #include "mlx4.h" #include "icm.h" +static int cq_max_count = 16; +static int cq_period = 10; + +module_param(cq_max_count, int, 0444); +MODULE_PARM_DESC(cq_max_count, "number of CQEs to generate event"); +module_param(cq_period, int, 0444); +MODULE_PARM_DESC(cq_period, "time in usec for CQ event generation"); + struct mlx4_cq_context { __be32 flags; u16 reserved1[3]; @@ -174,6 +182,8 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->mtt_base_addr_h = mtt_addr >> 32; cq_context->mtt_base_addr_l = cpu_to_be32(mtt_addr & 0xffffffff); cq_context->db_rec_addr = cpu_to_be64(db_rec); + cq_context->cq_max_count = cq_max_count; + cq_context->cq_period = cq_period; err = mlx4_SW2HW_CQ(dev, mailbox, cq->cqn); mlx4_free_cmd_mailbox(dev, mailbox); -- MST From Rajiv.Raja at Sun.COM Wed Sep 5 20:28:40 2007 From: Rajiv.Raja at Sun.COM (Rajiv Raja) Date: Wed, 05 Sep 2007 20:28:40 -0700 Subject: [ofa-general] [Fwd: ofed 1.2.5 installation issue (rpm build error)] Message-ID: <46DF73E8.1040004@Sun.COM> -------- Original Message -------- Subject: ofed 1.2.5 installation issue (rpm build error) Date: Wed, 05 Sep 2007 20:22:15 -0700 From: Rajiv Raja Reply-To: Rajiv.Raja at Sun.COM To: nsn-pv-magnum at Sun.COM Hi, I am trying to install OFED 1.2.5 on two machines (x86), one having Red Hat 5 and the other SUSE10 SP1. I went through the OFED README and installed all the prerequisite rpms as mentioned. 
Inspite of this, the build process fails with the following error: =============================== Building InfiniBand Software RPMs. Please wait... Building ofa_user RPMs. Please wait... Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libcxgb3 --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmlx4 --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-qlvnictools --with-sdpnetstat --with-srptools --with-mstflint --with-perftest --with-tvflash --sysconfdir=/etc --mandir=/usr/share/man' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libcxgb3 --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmlx4 --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-qlvnictools --with-sdpnetstat --with-srptools --sysconfdir=/etc --mandir=/usr/share/man' --define 'build_32bit 1' --define '_mandir /usr/share/man' /tmp/ib/OFED-1.2.5/SRPMS/ofa_user-1.2.5-0.src.rpm - ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libcxgb3 --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmlx4 --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-qlvnictools --with-sdpnetstat --with-srptools --with-mstflint --with-perftest --with-tvflash --sysconfdir=/etc --mandir=/usr/share/man' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libcxgb3 --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmlx4 --with-libmthca 
--with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-qlvnictools --with-sdpnetstat --with-srptools --sysconfdir=/etc --mandir=/usr/share/man' --define 'build_32bit 1' --define '_mandir /usr/share/man' /tmp/ib/OFED-1.2.5/SRPMS/ofa_user-1.2.5-0.src.rpm" See log file: /tmp/OFED.build.6297.log ================================== In the log file, I find the error to be related to c++ compiler not being able to generate executables. ==================== configure: creating cache /var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5/configure.cache checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for gawk... gawk checking whether make sets $(MAKE)... yes checking for g++... no checking for c++... no checking for gpp... no checking for aCC... no checking for CC... no checking for cxx... no checking for cc++... no checking for cl... no checking for FCC... no checking for KCC... no checking for RCC... no checking for xlC_r... no checking for xlC... no checking for C++ compiler default output file name... configure: error: C++ compiler cannot create executables See `config.log' for more details. Failed to execute cd libibcommon && ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5/configure.cache --disable-libcheck --disable-console-socket --prefix /usr --libdir /usr/lib64 --mandir=/usr/share/man --sysconfdir=/etc error: Bad exit status from /var/tmp/rpm-tmp.47389 (%install) ==================== I am attaching the log file along with a config file which has details about this error. Note that the same error is seen across Red Hat 5 as well as Suse 10. Anyone on how to fix this? Thanks, Rajiv -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED.build.6297.log Type: text/x-log Size: 293300 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: config.log Type: text/x-log Size: 7223 bytes Desc: not available URL: From mst at dev.mellanox.co.il Wed Sep 5 20:21:23 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Sep 2007 06:21:23 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> <20070904194940.GK28350@mellanox.co.il> <20070905081011.GB25011@mellanox.co.il> Message-ID: <20070906032123.GL28361@mellanox.co.il> > > So yes, ICRC is an end-to-end checksum. This is made clear in the > > modinfo description of the parameter. > > The ICRC checksum is a fine checksum. You're defining end-to-end as one > end of an IB network to another. End-to-end in Internet terms is from > one host to another over many potential networks. The source of a TCP > packet could be on an IB network and be communicating with a node > across the globe on a token ring. The TCP checksum is from source to > destination, end-to-end. If you don't perform the TCP checksum at the > source, there is no end-to-end checksum. Yep. Still, the HWCSUM bit is cleared by the IB to Eth gateway, at which point regular transport checksums should be inserted. The rest of the packet path will be covered by TCP/IP checksums. So you'll be fine unless your IB-Eth gateway corrupts the packet; you do not have to trust the rest of the gateways on the path. And BTW e.g. SDP<->TCP gateways out there have the same property. > > > I recommend that be made clear to the user. > > > > I don't think there's any potential for confusion > > There is a potential for confusion. The threads on this topic show > that. How about naming the module parameter "omit_inet_csums"? I agree hw_csum is not ideal. omit_inet_csums makes it look like it won't be routable outside the IB subnet. Assuming routing works, even if this means you trust the IB-Eth gateway not to corrupt the packet, I'm looking for a name that makes this clear.
-- MST From mst at dev.mellanox.co.il Wed Sep 5 20:38:20 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Sep 2007 06:38:20 +0300 Subject: [ofa-general] Re: [Fwd: ofed 1.2.5 installation issue (rpm build error)] In-Reply-To: <46DF73E8.1040004@Sun.COM> References: <46DF73E8.1040004@Sun.COM> Message-ID: <20070906033819.GM28361@mellanox.co.il> > > ==================== > configure: creating cache > /var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5/configure.cache > checking for a BSD-compatible install... /usr/bin/install -c > checking whether build environment is sane... yes > checking for gawk... gawk > checking whether make sets $(MAKE)... yes > checking for g++... no > checking for c++... no > checking for gpp... no > checking for aCC... no > checking for CC... no > checking for cxx... no > checking for cc++... no > checking for cl... no > checking for FCC... no > checking for KCC... no > checking for RCC... no > checking for xlC_r... no > checking for xlC... no > checking for C++ compiler default output file name... configure: error: > C++ compiler cannot create executables > See `config.log' for more details. > Failed to execute cd libibcommon && ./configure > --cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5/configure.cache > --disable-libcheck --disable-console-socket --prefix /usr --libdir > /usr/lib64 --mandir=/usr/share/man --sysconfdir=/etc > error: Bad exit status from /var/tmp/rpm-tmp.47389 (%install) > > > ==================== > > I am attaching the log file along with a config file which has details > about this error. > > Note that the same error is seen across Red Hat 5 as well as Suse 10. > > Anyone on how to fix this? This is typically due to missing 32 bit libraries. 
-- MST From shaohua.li at intel.com Wed Sep 5 21:28:19 2007 From: shaohua.li at intel.com (Shaohua Li) Date: Thu, 06 Sep 2007 12:28:19 +0800 Subject: [ofa-general] Re: [PATCH][RFC] pte notifiers -- support for external page tables In-Reply-To: <11890207643068-git-send-email-avi@qumranet.com> References: <11890207643068-git-send-email-avi@qumranet.com> Message-ID: <1189052899.6224.5.camel@sli10-conroe.sh.intel.com> On Wed, 2007-09-05 at 22:32 +0300, Avi Kivity wrote: > [resend due to bad alias expansion resulting in some recipients > being bogus] > > Some hardware and software systems maintain page tables outside the normal > Linux page tables, which reference userspace memory. This includes > Infiniband, other RDMA-capable devices, and kvm (with a pending patch). > > Because these systems maintain external page tables (and external tlbs), > Linux cannot demand page this memory and it must be locked. For kvm at > least, this is a significant reduction in functionality. > > This sample patch adds a new mechanism, pte notifiers, that allows drivers > to register an interest in a changes to ptes. Whenever Linux changes a > pte, it will call a notifier to allow the driver to adjust the external > page table and flush its tlb. > > Note that only one notifier is implemented, ->clear(), but others should be > similar. > > pte notifiers are different from paravirt_ops: they extend the normal > page tables rather than replace them; and they provide high-level > information > such as the vma and the virtual address for the driver to use. Looks great. So for kvm, all guest pages will be vma mapped? There are lock issues in kvm between kvm lock and page lock. Will shadow page table be still stored in page->private? If yes, the page->private must be cleaned before add_to_swap. 
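The notifier mechanism described in the quoted text (a driver registers interest, and a ->clear() callback runs when a pte is cleared so the driver can fix up its external page table) can be modeled in plain userspace C roughly as follows. This is a minimal sketch under stated assumptions, not the interface from the patch: every `sketch_*` name is hypothetical, there is no locking, and the "driver" here only counts flushes instead of touching a real shadow page table.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_pte_notifier;

/* Callbacks a driver supplies; only clear() is modeled, as in the description. */
struct sketch_pte_notifier_ops {
	void (*clear)(struct sketch_pte_notifier *n, unsigned long address);
};

/* One registration: ops, a link in the per-address-space list, driver data. */
struct sketch_pte_notifier {
	const struct sketch_pte_notifier_ops *ops;
	struct sketch_pte_notifier *next;
	void *priv;
};

/* Hypothetical stand-in for the address space that owns the page tables. */
struct sketch_mm {
	struct sketch_pte_notifier *notifiers;
};

/* Driver side: register interest in pte changes for this address space. */
void sketch_pte_notifier_register(struct sketch_mm *mm,
				  struct sketch_pte_notifier *n)
{
	assert(mm != NULL && n != NULL && n->ops != NULL);
	n->next = mm->notifiers;
	mm->notifiers = n;
}

/* Core side: called at the point where a pte for 'address' is cleared. */
void sketch_pte_cleared(struct sketch_mm *mm, unsigned long address)
{
	struct sketch_pte_notifier *n;

	for (n = mm->notifiers; n != NULL; n = n->next)
		if (n->ops->clear != NULL)
			n->ops->clear(n, address);
}

/* Example driver state: records what it would invalidate and flush. */
struct sketch_shadow {
	unsigned long last_cleared;
	unsigned int  flushes;
};

/* Example callback: drop the external mapping and "flush" the external tlb. */
void sketch_shadow_clear(struct sketch_pte_notifier *n, unsigned long address)
{
	struct sketch_shadow *s = n->priv;

	s->last_cleared = address;
	s->flushes++;
}

const struct sketch_pte_notifier_ops sketch_shadow_ops = {
	.clear = sketch_shadow_clear,
};
```

A real implementation would also have to settle the locking questions raised above, which this model deliberately leaves out.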
Thanks, Shaohua From glebn at voltaire.com Wed Sep 5 23:24:41 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Thu, 6 Sep 2007 09:24:41 +0300 Subject: [ofa-general] [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <11890103283456-git-send-email-avi@qumranet.com> References: <11890103283456-git-send-email-avi@qumranet.com> Message-ID: <20070906062441.GF3410@minantech.com> On Wed, Sep 05, 2007 at 07:38:48PM +0300, Avi Kivity wrote: > This sample patch adds a new mechanism, pte notifiers, that allows drivers > to register an interest in a changes to ptes. Whenever Linux changes a > pte, it will call a notifier to allow the driver to adjust the external > page table and flush its tlb. How is this different from http://lwn.net/Articles/133627/? AFAIR the patch was rejected because there was only one user for it and it was decided that it would be better to maintain it out of tree for a while. -- Gleb. From ogerlitz at voltaire.com Wed Sep 5 23:40:11 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 06 Sep 2007 09:40:11 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070906002029.GR4472@obsidianresearch.com> References: <20070904165251.GA16535@obsidianresearch.com> <20070904170419.GD28350@mellanox.co.il> <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> <20070905051040.GM28350@mellanox.co.il> <20070905055108.GB16535@obsidianresearch.com> <20070905061913.GN28350@mellanox.co.il> <20070905170545.GM4472@obsidianresearch.com> <15ddcffd0709051335l7ba8a976v1535ba8a6e923206@mail.gmail.com> <20070906002029.GR4472@obsidianresearch.com> Message-ID: <46DFA0CB.2070605@voltaire.com> Jason Gunthorpe wrote: > On Wed, Sep 05, 2007 at 11:35:06PM +0300, Or Gerlitz wrote: >> I guess by "RC" you mean connected mode. 
>> The connected mode is now implemented over RC but, as was discussed on
>> this list a few times, it should (and it would) move to use UC, which is
>> also much easier to implement in hw based gateways. Anyway, your idea to
>> allow this feature coming into play only under a negotiation scheme
>> sounds fine to me, however:

> Sure.. Though, I'm not sure what advantage UC/RC brings to a gateway app
> when you can't pass 64k MTU onto ethernet...

When the Ethernet side supports 9K Jumbo frames, if connected mode comes
into play then there should be a performance increase, so the gateway
negotiates the MTU to be 9K and so on.

> Right, but I'm not suggesting using the chips offload.
> Michael has made it so you can use 'csum offload' (via disabling csum)
> on any nic. You can also do the same kind of thing for TSO/GSO. If you
> send jumbo TSO/GSO packets in a chunk the receiver can then do
> LRO. Win all around. Sort of like jumbo MTU but without actually
> changing the MTU.
>
> This is all basically the same set of techniques we see between a
> Linux guest and the linux host in a virtualization environment.

Thanks for the clarification. I have to do some catching up on the details
of TSO/GSO and their relation to virtualization. However, to make things a
little clearer to me: do you agree that, as James pointed out in this
thread, in an A (IB) ---- B (Gateway, e.g. HW based) ---- C (Ethernet)
scheme, if A does not compute the TCP checksum of a packet, it's not the
role of the gateway to do so, and C would just drop it?!

Or.
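To illustrate why C would drop such a packet in the A/B/C scheme above, here is a minimal sketch of the 16-bit one's-complement Internet checksum (the RFC 1071 style of sum that TCP uses). It is only an illustration under stated assumptions: the function names `inet_checksum` and `checksum_ok` are made up for this example, the buffer layout is arbitrary, and the TCP pseudo-header that a real stack also covers is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: one's-complement sum over a byte buffer, read as
 * big-endian 16-bit words, folded to 16 bits and inverted. A real TCP
 * checksum would also cover a pseudo-header, which this sketch omits.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)                      /* odd trailing byte, padded with zero */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                 /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Hypothetical receiver-side check: with the checksum field included,
 * a valid buffer sums to zero after inversion.
 */
static int checksum_ok(const uint8_t *data, size_t len)
{
    return inet_checksum(data, len) == 0;
}
```

If the sender leaves the checksum field at zero and nobody on the path fills it in, `checksum_ok()` fails at the receiver for any buffer whose other words do not already sum to 0xffff, which is the "C would just drop it" case asked about above.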
From ogerlitz at voltaire.com Wed Sep 5 23:44:06 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 06 Sep 2007 09:44:06 +0300
Subject: [ofa-general] Re: [RFC] [PATCH 1/5 v3] ib/ipoib: specify Traffic Class with PR queries for QoS support
In-Reply-To: <000101c7f009$6472de50$3c98070a@amr.corp.intel.com>
References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> <46DE7D99.7000508@voltaire.com> <000101c7f009$6472de50$3c98070a@amr.corp.intel.com>
Message-ID: <46DFA1B6.90001@voltaire.com>

Sean Hefty wrote:
> To support QoS within and between subnets, modify IPoIB to request
> specific Traffic Class values with path record queries, using
> the value associated with the IPoIB broadcast group.
>
> Signed-off-by: Sean Hefty
> ---
> Added missing traffic class to PR component mask.

OK, thanks. Have you tried this patch set against some SM/SA to see that
it does not break things? And if it does, what breaks and how do we fix it?

Or.

From eitan at mellanox.co.il Thu Sep 6 00:32:29 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 6 Sep 2007 10:32:29 +0300
Subject: [ofa-general] RE: [PATCH] opensm/autogen.sh: remove autogen.sh from opensmsubdirectories
In-Reply-To: <20070906011554.GF25330@sashak.voltaire.com>
References: <20070906011554.GF25330@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C902374265@mtlexch01.mtl.com>

Hi Sasha,

Maybe I missed it, but this also had to do with removing configure.in and
adding the subdirectories to management/opensm/Makefile.am and
management/opensm/configure.in. Was there a separate patch for that?
Thanks Eitan > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Thursday, September 06, 2007 4:16 AM > To: OpenIB > Cc: Hal Rosenstock; Eitan Zahavi > Subject: [PATCH] opensm/autogen.sh: remove autogen.sh from > opensmsubdirectories > > > Remove autogen.sh scripts from opensm subdirectories > > Signed-off-by: Sasha Khapyorsky > --- > opensm/autogen.sh | 23 +++++++++++++++-------- > opensm/complib/autogen.sh | 15 --------------- > opensm/include/autogen.sh | 14 -------------- > opensm/libvendor/autogen.sh | 14 -------------- > opensm/opensm/autogen.sh | 14 -------------- > opensm/osmeventplugin/autogen.sh | 15 --------------- > opensm/osmtest/autogen.sh | 14 -------------- > 7 files changed, 15 insertions(+), 94 deletions(-) delete > mode 100755 opensm/complib/autogen.sh delete mode 100755 > opensm/include/autogen.sh delete mode 100755 > opensm/libvendor/autogen.sh delete mode 100755 > opensm/opensm/autogen.sh delete mode 100755 > opensm/osmeventplugin/autogen.sh delete mode 100755 > opensm/osmtest/autogen.sh > > diff --git a/opensm/autogen.sh b/opensm/autogen.sh index > e463c0e..e1ec064 100755 > --- a/opensm/autogen.sh > +++ b/opensm/autogen.sh > @@ -63,12 +63,19 @@ fi > # visit all sub directories with autogen.sh anyErr=0 for a > in `ls */autogen.sh`; do > - echo Visiting $a > - $a 2>&1 | sed 's/^/| /' | grep -v "arning: underquoted > definition" > - if test $? != 0; then > - echo $a failed > - anyErr=1 > - fi > + dir=`dirname $a` > + test -d ${dir}/config || mkdir ${dir}/config > + echo Visiting $a > + ( cd `dirname $a` && \ > + set -x && \ > + aclocal -I config -I ../config && \ > + libtoolize --force --copy && \ > + autoheader && \ > + automake --foreign --add-missing --copy && \ > + autoconf ) \ > + 2>&1 | sed 's/^/| /' | grep -v "arning: underquoted definition" > + if test $? 
!= 0; then > + echo $a failed > + anyErr=1 > + fi > done > - > -exit $anyErr > diff --git a/opensm/complib/autogen.sh > b/opensm/complib/autogen.sh deleted file mode 100755 index > ec20fc5..0000000 > --- a/opensm/complib/autogen.sh > +++ /dev/null > @@ -1,15 +0,0 @@ > -#! /bin/sh > - > -# We change dir since the later utilities assume to work in > the project dir -cd ${0%*/*} > - > -# create config dir if not exist > -test -d config || mkdir config > - > -set -x > -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize > --force --copy) && \ > -(autoheader) && \ > -(automake --foreign --add-missing --copy) && \ -autoconf > - > diff --git a/opensm/include/autogen.sh > b/opensm/include/autogen.sh deleted file mode 100755 index > 03401b0..0000000 > --- a/opensm/include/autogen.sh > +++ /dev/null > @@ -1,14 +0,0 @@ > -#! /bin/sh > - > -# We change dir since the later utilities assume to work in > the project dir -cd ${0%*/*} > - > -# create config dir if not exist > -test -d config || mkdir config > - > -set -x > -aclocal -I config > -libtoolize --force --copy > -autoheader > -automake --foreign --add-missing --copy -autoconf diff --git > a/opensm/libvendor/autogen.sh b/opensm/libvendor/autogen.sh > deleted file mode 100755 index d30bf8f..0000000 > --- a/opensm/libvendor/autogen.sh > +++ /dev/null > @@ -1,14 +0,0 @@ > -#! /bin/sh > - > -# We change dir since the later utilities assume to work in > the project dir -cd ${0%*/*} > - > -# create config dir if not exist > -test -d config || mkdir config > - > -set -x > -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize > --force --copy) && \ > -(autoheader) && \ > -(automake --foreign --add-missing --copy) && \ -autoconf > diff --git a/opensm/opensm/autogen.sh > b/opensm/opensm/autogen.sh deleted file mode 100755 index > d30bf8f..0000000 > --- a/opensm/opensm/autogen.sh > +++ /dev/null > @@ -1,14 +0,0 @@ > -#! 
/bin/sh > - > -# We change dir since the later utilities assume to work in > the project dir -cd ${0%*/*} > - > -# create config dir if not exist > -test -d config || mkdir config > - > -set -x > -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize > --force --copy) && \ > -(autoheader) && \ > -(automake --foreign --add-missing --copy) && \ -autoconf > diff --git a/opensm/osmeventplugin/autogen.sh > b/opensm/osmeventplugin/autogen.sh > deleted file mode 100755 > index ec20fc5..0000000 > --- a/opensm/osmeventplugin/autogen.sh > +++ /dev/null > @@ -1,15 +0,0 @@ > -#! /bin/sh > - > -# We change dir since the later utilities assume to work in > the project dir -cd ${0%*/*} > - > -# create config dir if not exist > -test -d config || mkdir config > - > -set -x > -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize > --force --copy) && \ > -(autoheader) && \ > -(automake --foreign --add-missing --copy) && \ -autoconf > - > diff --git a/opensm/osmtest/autogen.sh > b/opensm/osmtest/autogen.sh deleted file mode 100755 index > d30bf8f..0000000 > --- a/opensm/osmtest/autogen.sh > +++ /dev/null > @@ -1,14 +0,0 @@ > -#! 
/bin/sh > - > -# We change dir since the later utilities assume to work in > the project dir -cd ${0%*/*} > - > -# create config dir if not exist > -test -d config || mkdir config > - > -set -x > -(aclocal -I config -I ../config 2>&1 ) && \ -(libtoolize > --force --copy) && \ > -(autoheader) && \ > -(automake --foreign --add-missing --copy) && \ -autoconf > -- > 1.5.3.1.1.g1e61 > > From vlad at dev.mellanox.co.il Thu Sep 6 01:35:01 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 06 Sep 2007 11:35:01 +0300 Subject: [ofa-general] [Fwd: ofed 1.2.5 installation issue (rpm build error)] In-Reply-To: <46DF73E8.1040004@Sun.COM> References: <46DF73E8.1040004@Sun.COM> Message-ID: <46DFBBB5.5070304@dev.mellanox.co.il> Rajiv Raja wrote: > - > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define '_prefix /usr' --define 'build_root > /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools > --with-libcxgb3 --with-libibcm --with-libibcommon --with-libibmad > --with-libibumad --with-libibverbs --with-libipathverbs --with-libmlx4 > --with-libmthca --with-opensm --with-librdmacm --with-libsdp > --with-openib-diags --with-qlvnictools --with-sdpnetstat --with-srptools > --with-mstflint --with-perftest --with-tvflash --sysconfdir=/etc > --mandir=/usr/share/man' --define 'configure_options32 --with-dapl > --with-ipoibtools --with-libcxgb3 --with-libibcm --with-libibcommon > --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs > --with-libmlx4 --with-libmthca --with-opensm --with-librdmacm > --with-libsdp --with-openib-diags --with-qlvnictools --with-sdpnetstat > --with-srptools --sysconfdir=/etc --mandir=/usr/share/man' --define > 'build_32bit 1' --define '_mandir /usr/share/man' > /tmp/ib/OFED-1.2.5/SRPMS/ofa_user-1.2.5-0.src.rpm" > > See log file: /tmp/OFED.build.6297.log > ================================== > checking for C++ compiler default output file name... 
configure: error: > C++ compiler cannot create executables > See `config.log' for more details. > Failed to execute cd libibcommon && ./configure > --cache-file=/var/tmp/OFEDRPM/BUILD/ofa_user-1.2.5/configure.cache > --disable-libcheck --disable-console-socket --prefix /usr --libdir > /usr/lib64 --mandir=/usr/share/man --sysconfdir=/etc > error: Bad exit status from /var/tmp/rpm-tmp.47389 (%install) > > > ==================== > > I am attaching the log file along with a config file which has details > about this error. > > Note that the same error is seen across Red Hat 5 as well as Suse 10. > > Anyone on how to fix this? > > Thanks, > Rajiv > > Hi, From the attached config.log: ./configure: line 2067: g++: command not found Try to install g++. Regards, Vladimir From ramachandra.kuchimanchi at qlogic.com Thu Sep 6 01:36:03 2007 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 6 Sep 2007 14:06:03 +0530 Subject: [ofa-general] Low NFS RDMA performance with Connect X In-Reply-To: References: <27f776af0709040746u4038cc8ck7e9160c07b756936@mail.gmail.com> Message-ID: <71d336490709060136k45738d1cq557eb6a6783035f5@mail.gmail.com> On 9/5/07, James Lentini wrote: > Both the client and server code bases have been updated substantially > since the Mellanox SDK was released. Results are likely to be > different on the newer code. Is the latest code available somewhere ? 
Regards, Ram From avi at qumranet.com Thu Sep 6 01:35:24 2007 From: avi at qumranet.com (Avi Kivity) Date: Thu, 06 Sep 2007 11:35:24 +0300 Subject: [ofa-general] [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <20070906062441.GF3410@minantech.com> References: <11890103283456-git-send-email-avi@qumranet.com> <20070906062441.GF3410@minantech.com> Message-ID: <46DFBBCC.8060307@qumranet.com> Gleb Natapov wrote: > On Wed, Sep 05, 2007 at 07:38:48PM +0300, Avi Kivity wrote: > >> This sample patch adds a new mechanism, pte notifiers, that allows drivers >> to register an interest in a changes to ptes. Whenever Linux changes a >> pte, it will call a notifier to allow the driver to adjust the external >> page table and flush its tlb. >> > How is this different from http://lwn.net/Articles/133627/? AFAIR the > patch was rejected because there was only one user for it and it was > decided that it would be better to maintain it out of tree for a while. > Your patch is more complete. There are now at least three users: you, kvm, and newer Infiniband HCAs. Care to resurrect the patch? -- Any sufficiently difficult bug is indistinguishable from a feature. From avi at qumranet.com Thu Sep 6 01:38:20 2007 From: avi at qumranet.com (Avi Kivity) Date: Thu, 06 Sep 2007 11:38:20 +0300 Subject: [ofa-general] Re: [PATCH][RFC] pte notifiers -- support for external page tables In-Reply-To: <1189052899.6224.5.camel@sli10-conroe.sh.intel.com> References: <11890207643068-git-send-email-avi@qumranet.com> <1189052899.6224.5.camel@sli10-conroe.sh.intel.com> Message-ID: <46DFBC7C.2020709@qumranet.com> Shaohua Li wrote: > On Wed, 2007-09-05 at 22:32 +0300, Avi Kivity wrote: > >> [resend due to bad alias expansion resulting in some recipients >> being bogus] >> >> Some hardware and software systems maintain page tables outside the normal >> Linux page tables, which reference userspace memory. 
This includes >> Infiniband, other RDMA-capable devices, and kvm (with a pending patch). >> >> Because these systems maintain external page tables (and external tlbs), >> Linux cannot demand page this memory and it must be locked. For kvm at >> least, this is a significant reduction in functionality. >> >> This sample patch adds a new mechanism, pte notifiers, that allows drivers >> to register an interest in a changes to ptes. Whenever Linux changes a >> pte, it will call a notifier to allow the driver to adjust the external >> page table and flush its tlb. >> >> Note that only one notifier is implemented, ->clear(), but others should be >> similar. >> >> pte notifiers are different from paravirt_ops: they extend the normal >> page tables rather than replace them; and they provide high-level >> information >> such as the vma and the virtual address for the driver to use. >> > Looks great. So for kvm, all guest pages will be vma mapped? > There are lock issues in kvm between kvm lock and page lock. > Yes, locking will be a headache. > Will shadow page table be still stored in page->private? If yes, the > page->private must be cleaned before add_to_swap. > page->private can be in use by filesystems, so we will need to move rmap somewhere else. -- Any sufficiently difficult bug is indistinguishable from a feature. 
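The notifier idea quoted above (a driver registers interest, the core calls ->clear() when it clears a pte, the driver drops its external mapping and flushes its tlb) can be sketched in plain userspace C. This is not the API from the patch, which is not reproduced in this thread: every name here (`pte_notifier`, `mm_sketch`, `ext_mmu`, `pte_notifier_register`, `pte_notifier_call_clear`, `ext_map`, `ext_clear`) is hypothetical, and the callback takes only an address, whereas the description mentions that the vma is passed as well.

```c
#include <assert.h>
#include <stddef.h>

#define EXT_SLOTS 8 /* hypothetical capacity of the external page table */

/* Hypothetical notifier with a single callback, as in the description. */
struct pte_notifier {
    void (*clear)(struct pte_notifier *self, unsigned long address);
    struct pte_notifier *next;
};

/* Hypothetical stand-in for an mm: just a list of registered notifiers. */
struct mm_sketch {
    struct pte_notifier *notifiers;
};

/* Driver-side state: addresses mirrored in an external page table. */
struct ext_mmu {
    struct pte_notifier notifier; /* kept first so ext_clear() can cast */
    unsigned long mapped[EXT_SLOTS];
    size_t nr_mapped;
    unsigned int tlb_flushes;
};

static void pte_notifier_register(struct mm_sketch *mm, struct pte_notifier *n)
{
    assert(n->clear != NULL);
    n->next = mm->notifiers;
    mm->notifiers = n;
}

/* What the core would call whenever it clears the pte for 'address'. */
static void pte_notifier_call_clear(struct mm_sketch *mm, unsigned long address)
{
    struct pte_notifier *n;

    for (n = mm->notifiers; n != NULL; n = n->next)
        n->clear(n, address);
}

/* Driver callback: drop the external mapping and count a tlb flush. */
static void ext_clear(struct pte_notifier *self, unsigned long address)
{
    struct ext_mmu *mmu = (struct ext_mmu *)self; /* notifier is first member */
    size_t i;

    for (i = 0; i < mmu->nr_mapped; i++) {
        if (mmu->mapped[i] == address) {
            mmu->mapped[i] = mmu->mapped[--mmu->nr_mapped];
            mmu->tlb_flushes++;
            return;
        }
    }
}

/* Record an address in the external table; returns -1 when it is full. */
static int ext_map(struct ext_mmu *mmu, unsigned long address)
{
    if (mmu->nr_mapped == EXT_SLOTS)
        return -1;
    mmu->mapped[mmu->nr_mapped++] = address;
    return 0;
}
```

A driver would embed the notifier in its own state, register it once, and rely on the callback instead of keeping the memory locked. The locking between the core and the driver, which the thread expects to be the hard part, is deliberately not modelled here.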
From glebn at voltaire.com Thu Sep 6 01:41:51 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Thu, 6 Sep 2007 11:41:51 +0300 Subject: [ofa-general] [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <46DFBBCC.8060307@qumranet.com> References: <11890103283456-git-send-email-avi@qumranet.com> <20070906062441.GF3410@minantech.com> <46DFBBCC.8060307@qumranet.com> Message-ID: <20070906084151.GK3410@minantech.com> On Thu, Sep 06, 2007 at 11:35:24AM +0300, Avi Kivity wrote: > Gleb Natapov wrote: >> On Wed, Sep 05, 2007 at 07:38:48PM +0300, Avi Kivity wrote: >> >>> This sample patch adds a new mechanism, pte notifiers, that allows >>> drivers >>> to register an interest in a changes to ptes. Whenever Linux changes a >>> pte, it will call a notifier to allow the driver to adjust the external >>> page table and flush its tlb. >>> >> How is this different from http://lwn.net/Articles/133627/? AFAIR the >> patch was rejected because there was only one user for it and it was >> decided that it would be better to maintain it out of tree for a while. >> > > Your patch is more complete. > > There are now at least three users: you, kvm, and newer Infiniband HCAs. > Care to resurrect the patch? > This is not my patch :) This is patch written by David Addison from Quadrics. I CCed him on my previous email. I just saw that you are trying to do something similar. -- Gleb. From ariston at gmail.com Thu Sep 6 01:42:54 2007 From: ariston at gmail.com (Ramachandra K) Date: Thu, 6 Sep 2007 14:12:54 +0530 Subject: [ofa-general] Low NFS RDMA performance with Connect X In-Reply-To: References: Message-ID: <71d336490709060142o5e6278cp4fe7a8396fa60eba@mail.gmail.com> On 9/6/07, Talpey, Thomas wrote: > Can you post the full commandline of your NFS mount and iozone > invocations? 
iozone -Rab nfs_rdma_connectx_fw-2.2.xls -g 2G -c

mount.rnfs -o rdma=100.1.1.1 100.1.1.1:/home/rkuchimanchi/nfs_testdir tst

/etc/exports:
/home/rkuchimanchi/nfs_testdir 100.1.1.2(insecure,rw,async)

> I'm also curious if there were any NFS or RPC related
> messages appearing in the dmesg log during the run.

No, there were no special NFS or RPC related messages or errors during
the run, except for the normal messages at startup and about connection
establishment and parameters.

> Finally, were any RPC- or NFS-related patches applied to the RHEL5 kernel outside
> of the NFS/RDMA ones?

No. It's the normal RHEL 5 kernel with OFED-1.2.5 installed. I compiled
the NFS RDMA sources with the OFED-1.2.5 sources and loaded the NFS RDMA
related modules.

Regards,
Ram

From kliteyn at dev.mellanox.co.il Thu Sep 6 02:14:36 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 06 Sep 2007 12:14:36 +0300
Subject: [ofa-general] [PATCH 2/2] osm: QoS - support for MPR in qos policy
Message-ID: <46DFC4FC.70500@dev.mellanox.co.il>

Hi Sasha,

This patch adds an osm_qos_policy_get_qos_level_by_mpr() wrapper function
that does the same thing as osm_qos_policy_get_qos_level_by_pr(), only for
MultiPathRecord instead of PathRecord.
-- Yevgeny Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_qos_policy.h | 8 ++++++++ opensm/opensm/osm_qos_policy.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 40 insertions(+), 0 deletions(-) diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h index 11598be..0c220ee 100644 --- a/opensm/include/opensm/osm_qos_policy.h +++ b/opensm/include/opensm/osm_qos_policy.h @@ -51,6 +51,7 @@ #include #include #include +#include #define YYSTYPE char * #define OSM_QOS_POLICY_MAX_PORTS_ON_SWITCH 128 @@ -179,6 +180,13 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( IN const osm_physp_t * p_dest_physp, IN ib_net64_t comp_mask); +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( + IN const osm_qos_policy_t * p_qos_policy, + IN const ib_multipath_rec_t * p_mpr, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN ib_net64_t comp_mask); + /***************************************************/ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn); diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 74628a5..a778bcb 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -957,3 +957,35 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( /*************************************************** ***************************************************/ +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( + IN const osm_qos_policy_t * p_qos_policy, + IN const ib_multipath_rec_t * p_mpr, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN ib_net64_t comp_mask) +{ + uint8_t params_comp_mask = 0; + + if (!p_qos_policy) + return NULL; + + if (comp_mask & IB_MPR_COMPMASK_QOS_CLASS) + params_comp_mask |= QOS_PARAMS_COMPMASK_QOS_CLASS; + + if (comp_mask & IB_MPR_COMPMASK_SERVICEID_MSB && + comp_mask & IB_MPR_COMPMASK_SERVICEID_LSB) + params_comp_mask |= QOS_PARAMS_COMPMASK_SERVICEID; + + if 
(comp_mask & IB_MPR_COMPMASK_PKEY) + params_comp_mask |= QOS_PARAMS_COMPMASK_PKEY; + + return __qos_policy_get_qos_level_by_params( + p_qos_policy, p_src_physp, p_dest_physp, + cl_ntoh64(ib_multipath_rec_service_id(p_mpr)), + ib_multipath_rec_qos_class(p_mpr), + cl_ntoh16(p_mpr->pkey), params_comp_mask); +} + +/*************************************************** + ***************************************************/ + -- 1.5.1.4

From kliteyn at dev.mellanox.co.il Thu Sep 6 02:14:08 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 06 Sep 2007 12:14:08 +0300
Subject: [ofa-general] [PATCH 1/2] osm: QoS - support for MPR in qos policy
Message-ID: <46DFC4E0.9000600@dev.mellanox.co.il>

Hi Sasha,

This patch is a step toward supporting MultiPathRecord in qos policy:

1. Added subnet object to the qos policy struct to remove dependency
   on osm_pr_rcv_t (and later on osm_mpr_rcv_t).
2. osm_qos_policy_get_qos_level_by_pr() turned into a wrapper function
   that gets the path record, extracts the relevant parameters and
   converts the path record comp_mask to the local comp_mask.

NOTE: this patch does not depend on the "selecting PathRecord" patch.
-- Yevgeny Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_qos_policy.h | 16 ++-- opensm/opensm/osm_qos_parser.y | 2 +- opensm/opensm/osm_qos_policy.c | 125 +++++++++++++++++++++----------- 3 files changed, 90 insertions(+), 53 deletions(-) diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h index a7a9cd2..11598be 100644 --- a/opensm/include/opensm/osm_qos_policy.h +++ b/opensm/include/opensm/osm_qos_policy.h @@ -141,6 +141,7 @@ typedef struct _osm_qos_policy_t { cl_list_t qos_levels; /* list of osm_qos_level_t */ cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ osm_qos_level_t *p_default_qos_level; /* default QoS level */ + osm_subn_t *p_subn; /* osm subnet object */ } osm_qos_policy_t; /***************************************************/ @@ -167,17 +168,16 @@ ib_net16_t osm_qos_level_get_shared_pkey(IN const osm_qos_level_t * p_qos_level, osm_qos_match_rule_t * osm_qos_policy_match_rule_create(); void osm_qos_policy_match_rule_destroy(osm_qos_match_rule_t * p_match_rule); -osm_qos_policy_t * osm_qos_policy_create(); +osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn); void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy); int osm_qos_policy_validate(osm_qos_policy_t * p_qos_policy, osm_log_t * p_log); -void osm_qos_policy_get_qos_level_by_pr(IN const osm_qos_policy_t * p_qos_policy, - IN const osm_pr_rcv_t * p_rcv, - IN const ib_path_rec_t * p_pr, - IN const osm_physp_t * p_src_physp, - IN const osm_physp_t * p_dest_physp, - IN ib_net64_t comp_mask, - OUT osm_qos_level_t ** pp_qos_level); +osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( + IN const osm_qos_policy_t * p_qos_policy, + IN const ib_path_rec_t * p_pr, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN ib_net64_t comp_mask); /***************************************************/ diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index 876448b..a477084 
100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -1752,7 +1752,7 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn) column_num = 1; line_num = 1; - p_subn->p_qos_policy = osm_qos_policy_create(); + p_subn->p_qos_policy = osm_qos_policy_create(p_subn); __parser_tmp_struct_init(); p_qos_policy = p_subn->p_qos_policy; diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 059a861..74628a5 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -53,8 +53,14 @@ #include #include #include +#include #include +#define QOS_PARAMS_COMPMASK_SERVICEID (((uint8_t)1)<<0) +#define QOS_PARAMS_COMPMASK_QOS_CLASS (((uint8_t)1)<<1) +#define QOS_PARAMS_COMPMASK_PKEY (((uint8_t)1)<<2) + + /*************************************************** ***************************************************/ @@ -380,7 +386,7 @@ void osm_qos_policy_match_rule_destroy(osm_qos_match_rule_t * p) /*************************************************** ***************************************************/ -osm_qos_policy_t * osm_qos_policy_create() +osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) { osm_qos_policy_t * p_qos_policy = (osm_qos_policy_t *)malloc(sizeof(osm_qos_policy_t)); if (!p_qos_policy) @@ -403,6 +409,7 @@ osm_qos_policy_t * osm_qos_policy_create() cl_list_construct(&p_qos_policy->qos_match_rules); cl_list_init(&p_qos_policy->qos_match_rules, 10); + p_qos_policy->p_subn = p_subn; return p_qos_policy; } @@ -542,7 +549,7 @@ __qos_policy_is_port_in_group(osm_subn_t * p_subn, ***************************************************/ static boolean_t -__qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, +__qos_policy_is_port_in_group_list(const osm_qos_policy_t * p_qos_policy, const osm_physp_t * p_physp, cl_list_t * p_port_group_list) { @@ -555,7 +562,7 @@ __qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, (osm_qos_port_group_t *) cl_list_obj(list_iterator); if 
(p_port_group) { if (__qos_policy_is_port_in_group - (p_rcv->p_subn, p_physp, p_port_group)) + (p_qos_policy->p_subn, p_physp, p_port_group)) return TRUE; } list_iterator = cl_list_next(list_iterator); @@ -566,13 +573,14 @@ __qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, /*************************************************** ***************************************************/ -static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( +static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( const osm_qos_policy_t * p_qos_policy, - const osm_pr_rcv_t * p_rcv, - const ib_path_rec_t * p_pr, + uint64_t service_id, + uint16_t qos_class, + uint16_t pkey, const osm_physp_t * p_src_physp, const osm_physp_t * p_dest_physp, - ib_net64_t comp_mask) + uint8_t comp_mask) { osm_qos_match_rule_t *p_qos_match_rule = NULL; cl_list_iterator_t list_iterator; @@ -594,7 +602,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( /* If a match rule has Source groups, PR request source has to be in this list */ if (cl_list_count(&p_qos_match_rule->source_group_list)) { - if (!__qos_policy_is_port_in_group_list(p_rcv, + if (!__qos_policy_is_port_in_group_list(p_qos_policy, p_src_physp, &p_qos_match_rule-> source_group_list)) @@ -607,7 +615,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( /* If a match rule has Destination groups, PR request dest. 
has to be in this list */ if (cl_list_count(&p_qos_match_rule->destination_group_list)) { - if (!__qos_policy_is_port_in_group_list(p_rcv, + if (!__qos_policy_is_port_in_group_list(p_qos_policy, p_dest_physp, &p_qos_match_rule-> destination_group_list)) @@ -621,7 +629,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( to have a matching QoS class to match the rule */ if (p_qos_match_rule->qos_class_range_len) { - if (!(comp_mask & IB_PR_COMPMASK_QOS_CLASS)) { + if (!(comp_mask & QOS_PARAMS_COMPMASK_QOS_CLASS)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -629,7 +637,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( if (!__is_num_in_range_arr (p_qos_match_rule->qos_class_range_arr, p_qos_match_rule->qos_class_range_len, - ib_path_rec_qos_class(p_pr))) { + qos_class)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -640,8 +648,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( to have a matching Service ID to match the rule */ if (p_qos_match_rule->service_id_range_len) { - if (!(comp_mask & IB_PR_COMPMASK_SERVICEID_MSB) || - !(comp_mask & IB_PR_COMPMASK_SERVICEID_LSB)) { + if (!(comp_mask & QOS_PARAMS_COMPMASK_SERVICEID)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -649,7 +656,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( if (!__is_num_in_range_arr (p_qos_match_rule->service_id_range_arr, p_qos_match_rule->service_id_range_len, - cl_ntoh64(p_pr->service_id))) { + service_id)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -660,7 +667,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( to have a matching PKey to match the rule */ if (p_qos_match_rule->pkey_range_len) { - if (!(comp_mask & IB_PR_COMPMASK_PKEY)) { + if (!(comp_mask & QOS_PARAMS_COMPMASK_PKEY)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -668,7 +675,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( if 
(!__is_num_in_range_arr (p_qos_match_rule->pkey_range_arr, p_qos_match_rule->pkey_range_len, - cl_ntoh16(p_pr->pkey))) { + pkey)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -688,8 +695,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( /*************************************************** ***************************************************/ -static osm_qos_level_t *__qos_policy_get_qos_level_by_name(osm_qos_policy_t * p_qos_policy, - char *name) +static osm_qos_level_t *__qos_policy_get_qos_level_by_name( + const osm_qos_policy_t * p_qos_policy, + char *name) { osm_qos_level_t *p_qos_level = NULL; cl_list_iterator_t list_iterator; @@ -713,8 +721,9 @@ static osm_qos_level_t *__qos_policy_get_qos_level_by_name(osm_qos_policy_t * p_ /*************************************************** ***************************************************/ -static osm_qos_port_group_t *__qos_policy_get_port_group_by_name(osm_qos_policy_t * p_qos_policy, - const char *const name) +static osm_qos_port_group_t *__qos_policy_get_port_group_by_name( + const osm_qos_policy_t * p_qos_policy, + const char *const name) { osm_qos_port_group_t *p_port_group = NULL; cl_list_iterator_t list_iterator; @@ -869,54 +878,82 @@ int osm_qos_policy_validate(osm_qos_policy_t * p_qos_policy, /*************************************************** ***************************************************/ -void osm_qos_policy_get_qos_level_by_pr(IN const osm_qos_policy_t * p_qos_policy, - IN const osm_pr_rcv_t * p_rcv, - IN const ib_path_rec_t * p_pr, - IN const osm_physp_t * p_src_physp, - IN const osm_physp_t * p_dest_physp, - IN ib_net64_t comp_mask, - OUT osm_qos_level_t ** pp_qos_level) +static osm_qos_level_t * __qos_policy_get_qos_level_by_params( + IN const osm_qos_policy_t * p_qos_policy, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN uint64_t service_id, + IN uint16_t qos_class, + IN uint16_t pkey, + IN uint8_t comp_mask) { 
osm_qos_match_rule_t *p_qos_match_rule = NULL; osm_qos_level_t *p_qos_level = NULL; - OSM_LOG_ENTER(p_rcv->p_log, osm_qos_policy_get_qos_level_by_pr); - - *pp_qos_level = NULL; + OSM_LOG_ENTER(&p_qos_policy->p_subn->p_osm->log, + __qos_policy_get_qos_level_by_params); if (!p_qos_policy) goto Exit; - p_qos_match_rule = __qos_policy_get_match_rule_by_pr(p_qos_policy, - p_rcv, - p_pr, - p_src_physp, - p_dest_physp, - comp_mask); + p_qos_match_rule = __qos_policy_get_match_rule_by_params( + p_qos_policy, service_id, qos_class, pkey, + p_src_physp, p_dest_physp, comp_mask); if (p_qos_match_rule) p_qos_level = p_qos_match_rule->p_qos_level; else p_qos_level = p_qos_policy->p_default_qos_level; - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "osm_qos_policy_get_qos_level_by_pr: " + osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "__qos_policy_get_qos_level_by_params: " "PathRecord request:" "Src port 0x%016" PRIx64 ", " "Dst port 0x%016" PRIx64 "\n", cl_ntoh64(osm_physp_get_port_guid(p_src_physp)), cl_ntoh64(osm_physp_get_port_guid(p_dest_physp))); - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "osm_qos_policy_get_qos_level_by_pr: " + osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "__qos_policy_get_qos_level_by_params: " "Applying QoS Level %s (%s)\n", p_qos_level->name, (p_qos_level->use) ? 
p_qos_level->use : "no description"); - *pp_qos_level = p_qos_level; - Exit: - OSM_LOG_EXIT(p_rcv->p_log); -} /* osm_qos_policy_get_qos_level_by_pr() */ + OSM_LOG_EXIT(&p_qos_policy->p_subn->p_osm->log); + return p_qos_level; +} /* __qos_policy_get_qos_level_by_params() */ /*************************************************** ***************************************************/ + +osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( + IN const osm_qos_policy_t * p_qos_policy, + IN const ib_path_rec_t * p_pr, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN ib_net64_t comp_mask) +{ + uint8_t params_comp_mask = 0; + + if (!p_qos_policy) + return NULL; + + if (comp_mask & IB_PR_COMPMASK_QOS_CLASS) + params_comp_mask |= QOS_PARAMS_COMPMASK_QOS_CLASS; + + if (comp_mask & IB_PR_COMPMASK_SERVICEID_MSB && + comp_mask & IB_PR_COMPMASK_SERVICEID_LSB) + params_comp_mask |= QOS_PARAMS_COMPMASK_SERVICEID; + + if (comp_mask & IB_PR_COMPMASK_PKEY) + params_comp_mask |= QOS_PARAMS_COMPMASK_PKEY; + + return __qos_policy_get_qos_level_by_params( + p_qos_policy, p_src_physp, p_dest_physp, + cl_ntoh64(p_pr->service_id), ib_path_rec_qos_class(p_pr), + cl_ntoh16(p_pr->pkey), params_comp_mask); +} + +/*************************************************** + ***************************************************/ + -- 1.5.1.4 From vlad at lists.openfabrics.org Thu Sep 6 02:47:29 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 6 Sep 2007 02:47:29 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070906-0200 daily build status Message-ID: <20070906094729.81BF9E60880@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod 
Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.22 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with 
linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070906-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sashak at voltaire.com Thu Sep 6 03:20:05 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 13:20:05 +0300 Subject: [ofa-general] Re: [PATCH] opensm/autogen.sh: remove autogen.sh from opensmsubdirectories In-Reply-To: <6C2C79E72C305246B504CBA17B5500C902374265@mtlexch01.mtl.com> References: <20070906011554.GF25330@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C902374265@mtlexch01.mtl.com> Message-ID: <20070906102005.GG25330@sashak.voltaire.com> Hi Eitan, On 10:32 Thu 06 Sep , Eitan Zahavi wrote: > Hi Sasha, > > Maybe I missed it but this also had to do with removing of configure.in > and adding the subdirectories to the > management/opensm/Makefile.am > And > management/opensm/configure.in > > Was there a separate patch for that? I didn't do it yet. Sasha
From sashak at voltaire.com Thu Sep 6 03:47:58 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 13:47:58 +0300 Subject: [ofa-general] [PATCH] infiniband-diags: use shared ibdebug var Message-ID: <20070906104758.GH25330@sashak.voltaire.com> Use shared ibdebug variable over diag utils Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/ibdiag_common.c | 2 ++ infiniband-diags/src/ibnetdiscover.c | 1 - infiniband-diags/src/ibroute.c | 1 - infiniband-diags/src/ibtracert.c | 1 - 4 files changed, 2 insertions(+), 3 deletions(-) diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index e4e381a..bdbed45 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -49,6 +49,8 @@ #include "ibdiag_common.h" +int ibdebug; + FILE * open_switch_map(char *switch_map) { diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index ccd70cb..6163290 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -89,7 +89,6 @@ static int verbose; static FILE *f; char *argv0 = "ibnetdiscover"; -int ibdebug; static char *switch_map = NULL; static FILE *switch_map_fp = NULL; diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c index 51d64ed..77beb36 100644 --- a/infiniband-diags/src/ibroute.c +++ 
b/infiniband-diags/src/ibroute.c @@ -59,7 +59,6 @@ static int verbose; static int dump_all; char *argv0 = "ibroute"; -int ibdebug; /*******************************************/ diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c index 4e404b0..f085fd6 100644 --- a/infiniband-diags/src/ibtracert.c +++ b/infiniband-diags/src/ibtracert.c @@ -69,7 +69,6 @@ static int force; static FILE *f; char *argv0 = "ibtracert"; -int ibdebug; static char *switch_map = NULL; static FILE *switch_map_fp = NULL; -- 1.5.3.1.1.g1e61 From sashak at voltaire.com Thu Sep 6 04:23:07 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 14:23:07 +0300 Subject: [ofa-general] management/libibcommon Message-ID: <20070906112307.GI25330@sashak.voltaire.com> Hi All, Currently we have the libibcommon library under the OFA management project. It is used partly by libibumad and partly by libibmad and infiniband-diags. The parts used look pretty separate, so I'm thinking of removing libibcommon as a whole library and splitting its components between libibumad and libibmad - this will remove an extra dependency for libibumad. Does anybody else (besides management) use libibcommon? Any comments, objections? Sasha From jeremy at goop.org Thu Sep 6 04:28:47 2007 From: jeremy at goop.org (Jeremy Fitzhardinge) Date: Thu, 06 Sep 2007 12:28:47 +0100 Subject: [ofa-general] Re: [PATCH][RFC]: pte notifiers -- support for external page tables In-Reply-To: <46DF045F.4020806@qumranet.com> References: <11890103283456-git-send-email-avi@qumranet.com> <46DEFDF4.5000900@redhat.com> <46DF0013.4060804@qumranet.com> <46DF0234.7090504@redhat.com> <46DF045F.4020806@qumranet.com> Message-ID: <46DFE46F.5020001@goop.org> Avi Kivity wrote: > It is, but the hooks are in much the same places. It could be argued > that you'd embed pte notifiers in paravirt_ops for a host kernel, but > that's not doable because pte notifiers use higher-level data > structures (like vmas). 
Also, I wouldn't like to preclude the possibility of having a kernel that's both a guest and a host (ie, nested vmms). J From sashak at voltaire.com Thu Sep 6 04:42:48 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 14:42:48 +0300 Subject: [ofa-general] [PATCH] libibumad, libibmad: fix header search paths Message-ID: <20070906114248.GJ25330@sashak.voltaire.com> Fix header files search paths in Makefile.am Signed-off-by: Sasha Khapyorsky --- libibmad/Makefile.am | 5 +++-- libibumad/Makefile.am | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/libibmad/Makefile.am b/libibmad/Makefile.am index 676311e..f12e1f9 100644 --- a/libibmad/Makefile.am +++ b/libibmad/Makefile.am @@ -2,9 +2,10 @@ SUBDIRS = . INCLUDES = -I$(srcdir)/include/infiniband \ - -I$(srcdir)/../libibcommon/include/infiniband \ + -I$(srcdir)/../libibcommon/include \ -I$(srcdir)/../libibumad/include/infiniband \ - -I$(includedir)/infiniband + -I$(includedir)/infiniband \ + -I$(includedir) lib_LTLIBRARIES = libibmad.la diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am index 48868e7..25495c3 100644 --- a/libibumad/Makefile.am +++ b/libibumad/Makefile.am @@ -2,7 +2,8 @@ SUBDIRS = . INCLUDES = -I$(srcdir)/include/infiniband \ - -I$(srcdir)/../libibcommon/include/infiniband + -I$(srcdir)/../libibcommon/include \ + -I$(includedir) man_MANS = man/umad_debug.3 man/umad_get_ca.3 \ man/umad_get_ca_portguids.3 man/umad_get_cas_names.3 \ -- 1.5.3.1.1.g1e61 From chas at cmf.nrl.navy.mil Thu Sep 6 04:38:42 2007 From: chas at cmf.nrl.navy.mil (chas williams - CONTRACTOR) Date: Thu, 06 Sep 2007 07:38:42 -0400 Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070906032123.GL28361@mellanox.co.il> Message-ID: <200709061138.l86BcgYb005214@cmf.nrl.navy.mil> In message <20070906032123.GL28361 at mellanox.co.il>,"Michael S. 
Tsirkin" writes: >Assuming routing works, even if this means you trust the IB-Eth gateway not to >corrupt the packet, I'm looking for a name that makes this clear. ignore_inet_csum From andi at firstfloor.org Thu Sep 6 04:39:23 2007 From: andi at firstfloor.org (Andi Kleen) Date: 06 Sep 2007 13:39:23 +0200 Subject: [ofa-general] Re: [PATCH][RFC] pte notifiers -- support for external page tables In-Reply-To: <11890207643068-git-send-email-avi@qumranet.com> References: <11890207643068-git-send-email-avi@qumranet.com> Message-ID: Avi Kivity writes: > > pte notifiers are different from paravirt_ops: they extend the normal > page tables rather than replace them; and they provide high-level information > such as the vma and the virtual address for the driver to use. Sounds like a locking horror to me. To do anything with page tables you need locks. Both for the kernel page tables and for your new tables. What happens when people add all kinds of complicated operations in these notifiers? That will likely happen and then every time you change something in VM code they will break. This has the potential to increase the cost of maintaining VM code considerably, which would be a bad thing. This is quite different from paravirt ops because low level pvops can typically run lockless by just doing some kind of hypercall directly. But that won't work for maintaining your custom page tables. -Andi From dotanb at dev.mellanox.co.il Thu Sep 6 04:49:15 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 06 Sep 2007 14:49:15 +0300 Subject: [ofa-general] Re: [PATCH] librdmacm 1/2: add valgrind support to auto-tools configuration file In-Reply-To: <000201c7f00b$5826e900$3c98070a@amr.corp.intel.com> References: <200708151352.42026.dotanb@dev.mellanox.co.il> <000201c7f00b$5826e900$3c98070a@amr.corp.intel.com> Message-ID: <46DFE93B.60702@dev.mellanox.co.il> Sean Hefty wrote: > librdmacm: add valgrind support. 
> > Signed-off-by: Dotan Barak > Signed-off-by: Sean Hefty > --- > Changes from the posted patches: > > * I combined both patches into a single patch. > * I tried to keep the config file simple and went with the option of > only including memcheck.h if valgrind support was requested. > * The check for memcheck.h is not done if disable_libcheck is true. > * VALGRIND_MAKE_MEM_DEFINED is only defined if memcheck.h is not > included. I would rather fail the build if memcheck.h does not > define this, than print a warning and define it ourselves. > > If there's a problem with any of these choices, please let me know. > I have a comment only on your last choice: i don't know the feature history of valgrind but i believe that there were versions which had the file memcheck.h without the mentioned macro. I would like to leave the code that handles this issue like it was in the original patch (if it is fine with you). thanks Dotan From kliteyn at dev.mellanox.co.il Thu Sep 6 04:42:58 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 06 Sep 2007 14:42:58 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS: selecting PathRecord according to QoS policy In-Reply-To: <20070905145010.GL23670@sashak.voltaire.com> References: <46DBFAFB.4090000@dev.mellanox.co.il> <20070903172010.GB29384@sashak.voltaire.com> <46DE6091.40901@dev.mellanox.co.il> <20070905145010.GL23670@sashak.voltaire.com> Message-ID: <46DFE7C2.30602@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 10:53 Wed 05 Sep , Yevgeny Kliteynik wrote: >>>> ib_net16_t dest_lid; >>>> + uint8_t i; >>>> + uint8_t vl; >>>> + ib_slvl_table_t *p_slvl_tbl = NULL; >>>> + boolean_t valid_sls[IB_MAX_NUM_VLS]; >>> Use here uint16_t sl_mask instead of array - flow will be simpler. >> No, it won't. 
>> It will save three lines in the end when checking whether there is >> a valid sl that doesn't lead to VL15, > > It saves loop, not just three lines :) So now you see it yourself that it didn't save any loop :) You had to do this loop in the end anyway to get any valid SL. >> but it will compilcate a bit >> rest of the related code, because I still need to read port's SL2VL >> table values one by one and mark them in the array (or bitmap) one >> by one. > > Right, but since (!sl_mask) check is cheap you are able to stop PR > generation at the moment when no valid SLs exist. Agree > Just look at the patch (against recent PR code): > > > diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c > index edfa15f..1c6532b 100644 > --- a/opensm/opensm/osm_sa_path_record.c > +++ b/opensm/opensm/osm_sa_path_record.c > @@ -253,16 +253,12 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > uint8_t in_port_num; > ib_net16_t dest_lid; > uint8_t i; > - uint8_t vl; > ib_slvl_table_t *p_slvl_tbl = NULL; > - boolean_t valid_sls[IB_MAX_NUM_VLS]; > - boolean_t sl2vl_valid_path; > - uint8_t first_valid_sl; > + uint16_t sl_mask = 0xffff; > osm_qos_level_t *p_qos_level = NULL; > > OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); > > - memset(valid_sls, TRUE, IB_MAX_NUM_VLS); > dest_lid = cl_hton16(dest_lid_ho); > > p_dest_physp = p_dest_port->p_physp; > @@ -328,12 +324,18 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); > > /* update valid SLs that still exist on this route */ > - for (i = 0; i < IB_MAX_NUM_VLS; i++) { > - if (valid_sls[i]) { > - vl = ib_slvl_table_get(p_slvl_tbl, i); > - if (vl == IB_DROP_VL) > - valid_sls[i] = FALSE; > - } > + for (i = 0; i < IB_MAX_NUM_VLS; i++) > + if (sl_mask & (1 << i) && > + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) > + sl_mask &= ~(1 << i); > + > + if (!sl_mask) { > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > + 
osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "All the SLs lead to VL15 on this path\n"); > + status = IB_NOT_FOUND; > + goto Exit; > } > } > > @@ -456,12 +458,18 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > * Check SL2VL table of the switch and update valid SLs > */ > p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > - for (i = 0; i < IB_MAX_NUM_VLS; i++) { > - if (valid_sls[i]) { > - vl = ib_slvl_table_get(p_slvl_tbl, i); > - if (vl == IB_DROP_VL) > - valid_sls[i] = FALSE; > - } > + for (i = 0; i < IB_MAX_NUM_VLS; i++) > + if (sl_mask & (1 << i) && > + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) > + sl_mask &= ~(1 << i); > + if (!sl_mask) { > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: " > + "All the SLs lead to VL15 " > + "on this path\n"); > + status = IB_NOT_FOUND; > + goto Exit; > } > } > } > @@ -483,31 +491,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > "Path min MTU = %u, min rate = %u\n", > mtu, rate); > > - if (!p_rcv->p_subn->opt.no_qos) { > - /* > - * check whether there is some SL > - * that won't lead to VL15 eventually > - */ > - sl2vl_valid_path = FALSE; > - for (i = 0; i < IB_MAX_NUM_VLS; i++) { > - if (valid_sls[i]) { > - sl2vl_valid_path = TRUE; > - first_valid_sl = i; > - break; > - } > - } Here's the loop that you saved > - > - if (!sl2vl_valid_path) { > - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "__osm_pr_rcv_get_path_parms: " > - "All the SLs lead to VL15 on this path\n"); > - } > - status = IB_NOT_FOUND; > - goto Exit; > - } > - } > - > if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { > /* Get QoS Level object according to the path request */ > osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, > @@ -542,11 +525,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > pkt_life = 
p_qos_level->pkt_life; > > if (p_qos_level->sl_set) { > - if (!valid_sls[p_qos_level->sl]) { > + sl = p_qos_level->sl; > + if (!(sl_mask & ( 1 << sl))) { > status = IB_NOT_FOUND; > goto Exit; > } > - sl = p_qos_level->sl; > } > > if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > @@ -830,12 +813,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, > p_src_port, p_dest_port); > } else if (!p_rcv->p_subn->opt.no_qos) { > - sl = first_valid_sl; > + for (i = 0; i < IB_MAX_NUM_VLS; i++) > + if (sl_mask&(1 << i)) { > + sl = i; > + break; > + } And here's the loop that you've added. Anyway, I agree that this implementation is better - the additional check for sl_mask might save some runtime (and it's also more "elegant" code :)) I'll integrate it in the next version of this patch -- Yevgeny > } > else > sl = OSM_DEFAULT_SL; > > - if (!p_rcv->p_subn->opt.no_qos && !valid_sls[sl]) { > + if (!p_rcv->p_subn->opt.no_qos && !(sl_mask & (1 << sl))) { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_get_path_parms: ERR 1F23: " > "Selected SL (%u) leads to VL15\n", p_prtn->sl); > > >>>> + /* >>>> + * set Pkey for this path record request >>>> + */ >>>> + >>>> + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && >>>> + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) >>> No extra () was needed - this generates confused diff lines. >> Not sure what you mean here by "confused diff lines". > > I mean those extra lines in the patch where the only differences are > formatting or cosmetic stuff like extra braces. If you have a reason to > make such changes just send it as separate patch. > >> I agree that the extra () are not *needed*, but isn't >> >> if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && >> (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) >> >> is more readable than >> >> if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && >> cl_ntoh32(p_pr->hop_flow_raw) & 1 << 31) >> >> ? > > No. 
It requires 2+ seconds to make sure that some braces are just > "extra" ones. > > BTW the second is incorrect - should be (1 << 31), those '()' were > needed. > > Sasha > From sashak at voltaire.com Thu Sep 6 05:12:25 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 15:12:25 +0300 Subject: [ofa-general] Re: [PATCH 1/2] osm: QoS - support for MPR in qos policy In-Reply-To: <46DFC4E0.9000600@dev.mellanox.co.il> References: <46DFC4E0.9000600@dev.mellanox.co.il> Message-ID: <20070906121225.GK25330@sashak.voltaire.com> Hi Yevgeny, On 12:14 Thu 06 Sep , Yevgeny Kliteynik wrote: > Hi Sasha, > > This patch is a step toward supporting MultiPathRecord in qos policy: > > 1. Added subnet object to the qos policy struct to remove dependency on > osm_pr_rcv_t (and later on osm_mpr_rcv_t). > 2. osm_qos_policy_get_qos_level_by_pr() turned into a wrapper function > that gets a path record, extracts the relevant parameters and converts > the path record comp_mask to the local comp_mask. > > NOTE: this patch does not depend on the "selecting PathRecord" patch. 
> > -- Yevgeny > > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/include/opensm/osm_qos_policy.h | 16 ++-- > opensm/opensm/osm_qos_parser.y | 2 +- > opensm/opensm/osm_qos_policy.c | 125 +++++++++++++++++++++----------- > 3 files changed, 90 insertions(+), 53 deletions(-) > > diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h > index a7a9cd2..11598be 100644 > --- a/opensm/include/opensm/osm_qos_policy.h > +++ b/opensm/include/opensm/osm_qos_policy.h > @@ -141,6 +141,7 @@ typedef struct _osm_qos_policy_t { > cl_list_t qos_levels; /* list of osm_qos_level_t */ > cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ > osm_qos_level_t *p_default_qos_level; /* default QoS level */ > + osm_subn_t *p_subn; /* osm subnet object */ > } osm_qos_policy_t; > > /***************************************************/ > @@ -167,17 +168,16 @@ ib_net16_t osm_qos_level_get_shared_pkey(IN const osm_qos_level_t * p_qos_level, > osm_qos_match_rule_t * osm_qos_policy_match_rule_create(); > void osm_qos_policy_match_rule_destroy(osm_qos_match_rule_t * p_match_rule); > > -osm_qos_policy_t * osm_qos_policy_create(); > +osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn); > void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy); > int osm_qos_policy_validate(osm_qos_policy_t * p_qos_policy, osm_log_t * p_log); > > -void osm_qos_policy_get_qos_level_by_pr(IN const osm_qos_policy_t * p_qos_policy, > - IN const osm_pr_rcv_t * p_rcv, > - IN const ib_path_rec_t * p_pr, > - IN const osm_physp_t * p_src_physp, > - IN const osm_physp_t * p_dest_physp, > - IN ib_net64_t comp_mask, > - OUT osm_qos_level_t ** pp_qos_level); > +osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( > + IN const osm_qos_policy_t * p_qos_policy, > + IN const ib_path_rec_t * p_pr, > + IN const osm_physp_t * p_src_physp, > + IN const osm_physp_t * p_dest_physp, > + IN ib_net64_t comp_mask); > > /***************************************************/ > > diff 
--git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y > index 876448b..a477084 100644 > --- a/opensm/opensm/osm_qos_parser.y > +++ b/opensm/opensm/osm_qos_parser.y > @@ -1752,7 +1752,7 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn) > column_num = 1; > line_num = 1; > > - p_subn->p_qos_policy = osm_qos_policy_create(); > + p_subn->p_qos_policy = osm_qos_policy_create(p_subn); > > __parser_tmp_struct_init(); > p_qos_policy = p_subn->p_qos_policy; > diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c > index 059a861..74628a5 100644 > --- a/opensm/opensm/osm_qos_policy.c > +++ b/opensm/opensm/osm_qos_policy.c > @@ -53,8 +53,14 @@ > #include > #include > #include > +#include > #include > > +#define QOS_PARAMS_COMPMASK_SERVICEID (((uint8_t)1)<<0) > +#define QOS_PARAMS_COMPMASK_QOS_CLASS (((uint8_t)1)<<1) > +#define QOS_PARAMS_COMPMASK_PKEY (((uint8_t)1)<<2) > + > + > /*************************************************** > ***************************************************/ > > @@ -380,7 +386,7 @@ void osm_qos_policy_match_rule_destroy(osm_qos_match_rule_t * p) > /*************************************************** > ***************************************************/ > > -osm_qos_policy_t * osm_qos_policy_create() > +osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) > { > osm_qos_policy_t * p_qos_policy = (osm_qos_policy_t *)malloc(sizeof(osm_qos_policy_t)); > if (!p_qos_policy) > @@ -403,6 +409,7 @@ osm_qos_policy_t * osm_qos_policy_create() > cl_list_construct(&p_qos_policy->qos_match_rules); > cl_list_init(&p_qos_policy->qos_match_rules, 10); > > + p_qos_policy->p_subn = p_subn; > return p_qos_policy; > } > > @@ -542,7 +549,7 @@ __qos_policy_is_port_in_group(osm_subn_t * p_subn, > ***************************************************/ > > static boolean_t > -__qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, > +__qos_policy_is_port_in_group_list(const osm_qos_policy_t * p_qos_policy, > 
const osm_physp_t * p_physp, > cl_list_t * p_port_group_list) > { > @@ -555,7 +562,7 @@ __qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, > (osm_qos_port_group_t *) cl_list_obj(list_iterator); > if (p_port_group) { > if (__qos_policy_is_port_in_group > - (p_rcv->p_subn, p_physp, p_port_group)) > + (p_qos_policy->p_subn, p_physp, p_port_group)) > return TRUE; > } > list_iterator = cl_list_next(list_iterator); > @@ -566,13 +573,14 @@ __qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, > /*************************************************** > ***************************************************/ > > -static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > +static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > const osm_qos_policy_t * p_qos_policy, > - const osm_pr_rcv_t * p_rcv, > - const ib_path_rec_t * p_pr, > + uint64_t service_id, > + uint16_t qos_class, > + uint16_t pkey, > const osm_physp_t * p_src_physp, > const osm_physp_t * p_dest_physp, > - ib_net64_t comp_mask) > + uint8_t comp_mask) > { > osm_qos_match_rule_t *p_qos_match_rule = NULL; > cl_list_iterator_t list_iterator; > @@ -594,7 +602,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > /* If a match rule has Source groups, PR request source has to be in this list */ > > if (cl_list_count(&p_qos_match_rule->source_group_list)) { > - if (!__qos_policy_is_port_in_group_list(p_rcv, > + if (!__qos_policy_is_port_in_group_list(p_qos_policy, > p_src_physp, > &p_qos_match_rule-> > source_group_list)) > @@ -607,7 +615,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > /* If a match rule has Destination groups, PR request dest. 
has to be in this list */ > > if (cl_list_count(&p_qos_match_rule->destination_group_list)) { > - if (!__qos_policy_is_port_in_group_list(p_rcv, > + if (!__qos_policy_is_port_in_group_list(p_qos_policy, > p_dest_physp, > &p_qos_match_rule-> > destination_group_list)) > @@ -621,7 +629,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > to have a matching QoS class to match the rule */ > > if (p_qos_match_rule->qos_class_range_len) { > - if (!(comp_mask & IB_PR_COMPMASK_QOS_CLASS)) { > + if (!(comp_mask & QOS_PARAMS_COMPMASK_QOS_CLASS)) { > list_iterator = cl_list_next(list_iterator); > continue; > } > @@ -629,7 +637,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > if (!__is_num_in_range_arr > (p_qos_match_rule->qos_class_range_arr, > p_qos_match_rule->qos_class_range_len, > - ib_path_rec_qos_class(p_pr))) { > + qos_class)) { > list_iterator = cl_list_next(list_iterator); > continue; > } > @@ -640,8 +648,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > to have a matching Service ID to match the rule */ > > if (p_qos_match_rule->service_id_range_len) { > - if (!(comp_mask & IB_PR_COMPMASK_SERVICEID_MSB) || > - !(comp_mask & IB_PR_COMPMASK_SERVICEID_LSB)) { > + if (!(comp_mask & QOS_PARAMS_COMPMASK_SERVICEID)) { > list_iterator = cl_list_next(list_iterator); > continue; > } > @@ -649,7 +656,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > if (!__is_num_in_range_arr > (p_qos_match_rule->service_id_range_arr, > p_qos_match_rule->service_id_range_len, > - cl_ntoh64(p_pr->service_id))) { > + service_id)) { > list_iterator = cl_list_next(list_iterator); > continue; > } > @@ -660,7 +667,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > to have a matching PKey to match the rule */ > > if (p_qos_match_rule->pkey_range_len) { > - if (!(comp_mask & IB_PR_COMPMASK_PKEY)) { > + if (!(comp_mask & QOS_PARAMS_COMPMASK_PKEY)) { > list_iterator = cl_list_next(list_iterator); > 
continue; > } > @@ -668,7 +675,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > if (!__is_num_in_range_arr > (p_qos_match_rule->pkey_range_arr, > p_qos_match_rule->pkey_range_len, > - cl_ntoh16(p_pr->pkey))) { > + pkey)) { > list_iterator = cl_list_next(list_iterator); > continue; > } > @@ -688,8 +695,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( > /*************************************************** > ***************************************************/ > > -static osm_qos_level_t *__qos_policy_get_qos_level_by_name(osm_qos_policy_t * p_qos_policy, > - char *name) > +static osm_qos_level_t *__qos_policy_get_qos_level_by_name( > + const osm_qos_policy_t * p_qos_policy, > + char *name) > { > osm_qos_level_t *p_qos_level = NULL; > cl_list_iterator_t list_iterator; > @@ -713,8 +721,9 @@ static osm_qos_level_t *__qos_policy_get_qos_level_by_name(osm_qos_policy_t * p_ > /*************************************************** > ***************************************************/ > > -static osm_qos_port_group_t *__qos_policy_get_port_group_by_name(osm_qos_policy_t * p_qos_policy, > - const char *const name) > +static osm_qos_port_group_t *__qos_policy_get_port_group_by_name( > + const osm_qos_policy_t * p_qos_policy, > + const char *const name) > { > osm_qos_port_group_t *p_port_group = NULL; > cl_list_iterator_t list_iterator; > @@ -869,54 +878,82 @@ int osm_qos_policy_validate(osm_qos_policy_t * p_qos_policy, > /*************************************************** > ***************************************************/ > > -void osm_qos_policy_get_qos_level_by_pr(IN const osm_qos_policy_t * p_qos_policy, > - IN const osm_pr_rcv_t * p_rcv, > - IN const ib_path_rec_t * p_pr, > - IN const osm_physp_t * p_src_physp, > - IN const osm_physp_t * p_dest_physp, > - IN ib_net64_t comp_mask, > - OUT osm_qos_level_t ** pp_qos_level) > +static osm_qos_level_t * __qos_policy_get_qos_level_by_params( > + IN const osm_qos_policy_t * 
p_qos_policy, > + IN const osm_physp_t * p_src_physp, > + IN const osm_physp_t * p_dest_physp, > + IN uint64_t service_id, > + IN uint16_t qos_class, > + IN uint16_t pkey, > + IN uint8_t comp_mask) > { > osm_qos_match_rule_t *p_qos_match_rule = NULL; > osm_qos_level_t *p_qos_level = NULL; > > - OSM_LOG_ENTER(p_rcv->p_log, osm_qos_policy_get_qos_level_by_pr); > - > - *pp_qos_level = NULL; > + OSM_LOG_ENTER(&p_qos_policy->p_subn->p_osm->log, > + __qos_policy_get_qos_level_by_params); > > if (!p_qos_policy) > goto Exit; > > - p_qos_match_rule = __qos_policy_get_match_rule_by_pr(p_qos_policy, > - p_rcv, > - p_pr, > - p_src_physp, > - p_dest_physp, > - comp_mask); > + p_qos_match_rule = __qos_policy_get_match_rule_by_params( > + p_qos_policy, service_id, qos_class, pkey, > + p_src_physp, p_dest_physp, comp_mask); > > if (p_qos_match_rule) > p_qos_level = p_qos_match_rule->p_qos_level; > else > p_qos_level = p_qos_policy->p_default_qos_level; > > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "osm_qos_policy_get_qos_level_by_pr: " > + osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "__qos_policy_get_qos_level_by_params: " > "PathRecord request:" > "Src port 0x%016" PRIx64 ", " > "Dst port 0x%016" PRIx64 "\n", > cl_ntoh64(osm_physp_get_port_guid(p_src_physp)), > cl_ntoh64(osm_physp_get_port_guid(p_dest_physp))); > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "osm_qos_policy_get_qos_level_by_pr: " > + osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "__qos_policy_get_qos_level_by_params: " > "Applying QoS Level %s (%s)\n", > p_qos_level->name, > (p_qos_level->use) ? 
p_qos_level->use : "no description"); > > - *pp_qos_level = p_qos_level; > - > Exit: > - OSM_LOG_EXIT(p_rcv->p_log); > -} /* osm_qos_policy_get_qos_level_by_pr() */ > + OSM_LOG_EXIT(&p_qos_policy->p_subn->p_osm->log); > + return p_qos_level; > +} /* __qos_policy_get_qos_level_by_params() */ > > /*************************************************** > ***************************************************/ > + > +osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( > + IN const osm_qos_policy_t * p_qos_policy, > + IN const ib_path_rec_t * p_pr, > + IN const osm_physp_t * p_src_physp, > + IN const osm_physp_t * p_dest_physp, > + IN ib_net64_t comp_mask) > +{ > + uint8_t params_comp_mask = 0; > + > + if (!p_qos_policy) > + return NULL; > + > + if (comp_mask & IB_PR_COMPMASK_QOS_CLASS) > + params_comp_mask |= QOS_PARAMS_COMPMASK_QOS_CLASS; > + > + if (comp_mask & IB_PR_COMPMASK_SERVICEID_MSB && > + comp_mask & IB_PR_COMPMASK_SERVICEID_LSB) > + params_comp_mask |= QOS_PARAMS_COMPMASK_SERVICEID; > + > + if (comp_mask & IB_PR_COMPMASK_PKEY) > + params_comp_mask |= QOS_PARAMS_COMPMASK_PKEY; Why to not do params_comp_mask to be compatible with SA PR comp_mask (SA PR is much more popular than SA MPR)? So you will not need to convert in the case of SA PR, but only in the case of SA MPR (you already do, as I see in the next patch). Also please use different subject for different patchs - it becomes commit summary. 
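
The suggestion above can be sketched roughly as follows: keep the internal mask in the same layout as the SA PathRecord component mask, so a PR request passes through with no conversion and only a MultiPathRecord request is translated. This is a minimal illustration, not the OpenSM code; every EX_* name and bit position is an assumption made up for the example, and the real IB_PR_COMPMASK_* / IB_MPR_COMPMASK_* values are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define EX_PR_SERVICEID_MSB  (1ULL << 0)
#define EX_PR_SERVICEID_LSB  (1ULL << 1)
#define EX_PR_PKEY           (1ULL << 2)
#define EX_PR_QOS_CLASS      (1ULL << 3)
#define EX_PR_QOS_BITS       (EX_PR_SERVICEID_MSB | EX_PR_SERVICEID_LSB | \
			      EX_PR_PKEY | EX_PR_QOS_CLASS)

#define EX_MPR_PKEY          (1ULL << 0)
#define EX_MPR_QOS_CLASS     (1ULL << 1)
#define EX_MPR_SERVICEID_MSB (1ULL << 2)
#define EX_MPR_SERVICEID_LSB (1ULL << 3)

/* In this sketch the two record layouts differ, so MPR must be translated. */
static_assert(EX_PR_PKEY != EX_MPR_PKEY, "layouts are assumed to differ");

/* PR request: the mask is already in the internal (PR) layout, so the
 * only work is dropping bits the QoS lookup does not care about. */
static uint64_t ex_params_mask_from_pr(uint64_t pr_mask)
{
	return pr_mask & EX_PR_QOS_BITS;
}

/* MPR request: map each MPR bit onto the corresponding PR bit. */
static uint64_t ex_params_mask_from_mpr(uint64_t mpr_mask)
{
	uint64_t m = 0;

	if (mpr_mask & EX_MPR_QOS_CLASS)
		m |= EX_PR_QOS_CLASS;
	if (mpr_mask & EX_MPR_SERVICEID_MSB)
		m |= EX_PR_SERVICEID_MSB;
	if (mpr_mask & EX_MPR_SERVICEID_LSB)
		m |= EX_PR_SERVICEID_LSB;
	if (mpr_mask & EX_MPR_PKEY)
		m |= EX_PR_PKEY;
	return m;
}

/* Service ID counts as present only when both halves are set. */
static int ex_has_service_id(uint64_t params_mask)
{
	const uint64_t sid = EX_PR_SERVICEID_MSB | EX_PR_SERVICEID_LSB;

	return (params_mask & sid) == sid;
}
```

The trade-off in this variant is that the shared lookup takes a 64-bit mask rather than the uint8_t used in the patch under review.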
Sasha > + > + return __qos_policy_get_qos_level_by_params( > + p_qos_policy, p_src_physp, p_dest_physp, > + cl_ntoh64(p_pr->service_id), ib_path_rec_qos_class(p_pr), > + cl_ntoh16(p_pr->pkey), params_comp_mask); > +} > + > +/*************************************************** > + ***************************************************/ > + > -- > 1.5.1.4 > From sashak at voltaire.com Thu Sep 6 05:16:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 15:16:51 +0300 Subject: [ofa-general] Re: [PATCH 2/2] osm: QoS - support for MPR in qos policy In-Reply-To: <46DFC4FC.70500@dev.mellanox.co.il> References: <46DFC4FC.70500@dev.mellanox.co.il> Message-ID: <20070906121651.GL25330@sashak.voltaire.com> On 12:14 Thu 06 Sep , Yevgeny Kliteynik wrote: > Hi Sasha, > > This patch adds osm_qos_policy_get_qos_level_by_mpr() wrapper function that > basically does the same thing as the osm_qos_policy_get_qos_level_by_pr(), > only for MultiPathRecord instead of PathRecord.. > > -- Yevgeny > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/include/opensm/osm_qos_policy.h | 8 ++++++++ > opensm/opensm/osm_qos_policy.c | 32 ++++++++++++++++++++++++++++++++ > 2 files changed, 40 insertions(+), 0 deletions(-) > > diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h > index 11598be..0c220ee 100644 > --- a/opensm/include/opensm/osm_qos_policy.h > +++ b/opensm/include/opensm/osm_qos_policy.h > @@ -51,6 +51,7 @@ > #include > #include > #include > +#include > > #define YYSTYPE char * > #define OSM_QOS_POLICY_MAX_PORTS_ON_SWITCH 128 > @@ -179,6 +180,13 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( > IN const osm_physp_t * p_dest_physp, > IN ib_net64_t comp_mask); > > +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( > + IN const osm_qos_policy_t * p_qos_policy, > + IN const ib_multipath_rec_t * p_mpr, > + IN const osm_physp_t * p_src_physp, > + IN const osm_physp_t * p_dest_physp, > + IN ib_net64_t comp_mask); > + > 
/***************************************************/ > > int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn); > diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c > index 74628a5..a778bcb 100644 > --- a/opensm/opensm/osm_qos_policy.c > +++ b/opensm/opensm/osm_qos_policy.c > @@ -957,3 +957,35 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( > /*************************************************** > ***************************************************/ > > +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( > + IN const osm_qos_policy_t * p_qos_policy, > + IN const ib_multipath_rec_t * p_mpr, > + IN const osm_physp_t * p_src_physp, > + IN const osm_physp_t * p_dest_physp, > + IN ib_net64_t comp_mask) > +{ > + uint8_t params_comp_mask = 0; > + > + if (!p_qos_policy) > + return NULL; > + > + if (comp_mask & IB_MPR_COMPMASK_QOS_CLASS) > + params_comp_mask |= QOS_PARAMS_COMPMASK_QOS_CLASS; > + > + if (comp_mask & IB_MPR_COMPMASK_SERVICEID_MSB && > + comp_mask & IB_MPR_COMPMASK_SERVICEID_LSB) > + params_comp_mask |= QOS_PARAMS_COMPMASK_SERVICEID; > + > + if (comp_mask & IB_MPR_COMPMASK_PKEY) > + params_comp_mask |= QOS_PARAMS_COMPMASK_PKEY; > + > + return __qos_policy_get_qos_level_by_params( > + p_qos_policy, p_src_physp, p_dest_physp, > + cl_ntoh64(ib_multipath_rec_service_id(p_mpr)), > + ib_multipath_rec_qos_class(p_mpr), > + cl_ntoh16(p_mpr->pkey), params_comp_mask); > +} > + > +/*************************************************** > + ***************************************************/ > + > -- > 1.5.1.4 This patch does not apply. The reason is trailing newline in osm_qos_policy.c file (introduced in the previous patch). I'm using 'git-am --whitespace=strip', so this new line was stripped and the next patch (this one) does not apply. It is better to not put empty new lines at the end. 
Sasha From dotanb at dev.mellanox.co.il Thu Sep 6 05:05:03 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 6 Sep 2007 15:05:03 +0300 Subject: [ofa-general] [PATCH] libibumad: Fix several issues that were reported by valgrind Message-ID: <200709061505.03790.dotanb@dev.mellanox.co.il> Fix several issues that were reported by valgrind. (sorry, but i don't have any test suite to check all of the libibumad code for valgrind warnings in the first place ...) Signed-off-by: Dotan Barak Index: ofa_1_3_dev_user/src/userspace/management/libibumad/src/umad.c =================================================================== --- ofa_1_3_dev_user.orig/src/userspace/management/libibumad/src/umad.c 2007-09-05 09:31:53.000000000 +0300 +++ ofa_1_3_dev_user/src/userspace/management/libibumad/src/umad.c 2007-09-06 14:59:42.000000000 +0300 @@ -832,6 +832,9 @@ umad_recv(int portid, void *umad, int *l } n = read(port->dev_fd, umad, sizeof *mad + *length); + + VALGRIND_MAKE_MEM_DEFINED(umad, sizeof *mad + *length); + if ((n >= 0) && (n <= sizeof *mad + *length)) { DEBUG("mad received by agent %d length %d", mad->agent_id, n); if (n > sizeof *mad) @@ -910,6 +913,8 @@ umad_register_oui(int portid, int mgmt_c else memset(req.method_mask, 0, sizeof req.method_mask); + VALGRIND_MAKE_MEM_DEFINED(&req, sizeof req); + if (!ioctl(port->dev_fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { DEBUG("portid %d registered to use agent %d qp %d class 0x%x oui %p", portid, req.id, req.qpn, req.mgmt_class, oui); From sashak at voltaire.com Thu Sep 6 05:20:22 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 15:20:22 +0300 Subject: [ofa-general] Re: [PATCH] osm: QoS: selecting PathRecord according to QoS policy In-Reply-To: <46DFE7C2.30602@dev.mellanox.co.il> References: <46DBFAFB.4090000@dev.mellanox.co.il> <20070903172010.GB29384@sashak.voltaire.com> <46DE6091.40901@dev.mellanox.co.il> <20070905145010.GL23670@sashak.voltaire.com> 
<46DFE7C2.30602@dev.mellanox.co.il> Message-ID: <20070906122022.GM25330@sashak.voltaire.com> On 14:42 Thu 06 Sep , Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: > > Hi Yevgeny, > > On 10:53 Wed 05 Sep , Yevgeny Kliteynik wrote: > >>>> ib_net16_t dest_lid; > >>>> + uint8_t i; > >>>> + uint8_t vl; > >>>> + ib_slvl_table_t *p_slvl_tbl = NULL; > >>>> + boolean_t valid_sls[IB_MAX_NUM_VLS]; > >>> Use here uint16_t sl_mask instead of array - flow will be simpler. > >> No, it won't. > >> It will save three lines in the end when checking whether there is > >> a valid sl that doesn't lead to VL15, > > It saves loop, not just three lines :) > > So now you see it yourself that it didn't save any loop :) > You had to do this loop in the end anyway to get any valid SL. I did, "my loop" is conditional for one of many cases. Sasha > > >> but it will compilcate a bit > >> rest of the related code, because I still need to read port's SL2VL > >> table values one by one and mark them in the array (or bitmap) one > >> by one. > > Right, but since (!sl_mask) check is cheap you are able to stop PR > > generation at the moment when no valid SLs exist. 
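
The bitmask approach argued for above can be sketched in isolation: one uint16_t holds the set of SLs that are still usable, each hop clears the SLs its SL2VL table maps to the drop VL, and a plain zero test allows an early exit. This is only an illustration under stated assumptions; EX_NUM_SLS and EX_DROP_VL stand in for IB_MAX_NUM_VLS and IB_DROP_VL, and the per-hop tables are plain arrays rather than ib_slvl_table_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_NUM_SLS 16 /* illustrative stand-in for IB_MAX_NUM_VLS */
#define EX_DROP_VL 15 /* illustrative stand-in for IB_DROP_VL */

/* Clear from sl_mask every SL that this hop maps to the drop VL. */
static uint16_t ex_prune_sl_mask(uint16_t sl_mask,
				 const uint8_t sl2vl[EX_NUM_SLS])
{
	unsigned i;

	for (i = 0; i < EX_NUM_SLS; i++)
		if ((sl_mask & (1u << i)) && sl2vl[i] == EX_DROP_VL)
			sl_mask &= (uint16_t)~(1u << i);
	return sl_mask;
}

/* Walk all hops; stop as soon as no SL survives. */
static uint16_t ex_valid_sls_on_path(const uint8_t hops[][EX_NUM_SLS],
				     size_t n_hops)
{
	uint16_t sl_mask = 0xffff;
	size_t h;

	assert(hops != NULL || n_hops == 0);
	for (h = 0; h < n_hops; h++) {
		sl_mask = ex_prune_sl_mask(sl_mask, hops[h]);
		if (!sl_mask)
			return 0; /* cheap early exit: every SL is dropped */
	}
	return sl_mask;
}

/* Lowest SL still present in the mask, or -1 if the mask is empty. */
static int ex_first_valid_sl(uint16_t sl_mask)
{
	int i;

	for (i = 0; i < EX_NUM_SLS; i++)
		if (sl_mask & (1u << i))
			return i;
	return -1;
}
```

The search for a first valid SL is still a loop, but it only runs in the case where a default SL has to be picked.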
> > Agree > > > Just look at the patch (against recent PR code): > > diff --git a/opensm/opensm/osm_sa_path_record.c > > b/opensm/opensm/osm_sa_path_record.c > > index edfa15f..1c6532b 100644 > > --- a/opensm/opensm/osm_sa_path_record.c > > +++ b/opensm/opensm/osm_sa_path_record.c > > @@ -253,16 +253,12 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > > p_rcv, > > uint8_t in_port_num; > > ib_net16_t dest_lid; > > uint8_t i; > > - uint8_t vl; > > ib_slvl_table_t *p_slvl_tbl = NULL; > > - boolean_t valid_sls[IB_MAX_NUM_VLS]; > > - boolean_t sl2vl_valid_path; > > - uint8_t first_valid_sl; > > + uint16_t sl_mask = 0xffff; > > osm_qos_level_t *p_qos_level = NULL; > > OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); > > - memset(valid_sls, TRUE, IB_MAX_NUM_VLS); > > dest_lid = cl_hton16(dest_lid_ho); > > p_dest_physp = p_dest_port->p_physp; > > @@ -328,12 +324,18 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > > p_rcv, > > p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); > > /* update valid SLs that still exist on this route */ > > - for (i = 0; i < IB_MAX_NUM_VLS; i++) { > > - if (valid_sls[i]) { > > - vl = ib_slvl_table_get(p_slvl_tbl, i); > > - if (vl == IB_DROP_VL) > > - valid_sls[i] = FALSE; > > - } > > + for (i = 0; i < IB_MAX_NUM_VLS; i++) > > + if (sl_mask & (1 << i) && > > + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) > > + sl_mask &= ~(1 << i); > > + > > + if (!sl_mask) { > > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > > + "__osm_pr_rcv_get_path_parms: " > > + "All the SLs lead to VL15 on this path\n"); > > + status = IB_NOT_FOUND; > > + goto Exit; > > } > > } > > @@ -456,12 +458,18 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > > p_rcv, > > * Check SL2VL table of the switch and update valid SLs > > */ > > p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > > - for (i = 0; i < IB_MAX_NUM_VLS; i++) { > > - if (valid_sls[i]) { > > - vl = 
ib_slvl_table_get(p_slvl_tbl, i); > > - if (vl == IB_DROP_VL) > > - valid_sls[i] = FALSE; > > - } > > + for (i = 0; i < IB_MAX_NUM_VLS; i++) > > + if (sl_mask & (1 << i) && > > + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) > > + sl_mask &= ~(1 << i); > > + if (!sl_mask) { > > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > > + "__osm_pr_rcv_get_path_parms: " > > + "All the SLs lead to VL15 " > > + "on this path\n"); > > + status = IB_NOT_FOUND; > > + goto Exit; > > } > > } > > } > > @@ -483,31 +491,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > > p_rcv, > > "Path min MTU = %u, min rate = %u\n", > > mtu, rate); > > - if (!p_rcv->p_subn->opt.no_qos) { > > - /* > > - * check whether there is some SL > > - * that won't lead to VL15 eventually > > - */ > > - sl2vl_valid_path = FALSE; > > - for (i = 0; i < IB_MAX_NUM_VLS; i++) { > > - if (valid_sls[i]) { > > - sl2vl_valid_path = TRUE; > > - first_valid_sl = i; > > - break; > > - } > > - } > > Here's the loop that you saved > > > - > > - if (!sl2vl_valid_path) { > > - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > > - "__osm_pr_rcv_get_path_parms: " > > - "All the SLs lead to VL15 on this path\n"); > > - } > > - status = IB_NOT_FOUND; > > - goto Exit; > > - } > > - } > > - > > if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { > > /* Get QoS Level object according to the path request */ > > osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, > > @@ -542,11 +525,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > > p_rcv, > > pkt_life = p_qos_level->pkt_life; > > if (p_qos_level->sl_set) { > > - if (!valid_sls[p_qos_level->sl]) { > > + sl = p_qos_level->sl; > > + if (!(sl_mask & ( 1 << sl))) { > > status = IB_NOT_FOUND; > > goto Exit; > > } > > - sl = p_qos_level->sl; > > } > > if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > > @@ -830,12 +813,16 @@ 
__osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > > p_rcv, > > sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, > > p_src_port, p_dest_port); > > } else if (!p_rcv->p_subn->opt.no_qos) { > > - sl = first_valid_sl; > > + for (i = 0; i < IB_MAX_NUM_VLS; i++) > > + if (sl_mask&(1 << i)) { > > + sl = i; > > + break; > > + } > > And here's the loop that you've added. > > Anyway, I agree that this implementation is better - additional > check for sl_mask might save some runtime (and it's also more > "elegant" code :)) > > I'll integrate it in the next version of this patch > > -- Yevgeny > > > } > > else > > sl = OSM_DEFAULT_SL; > > - if (!p_rcv->p_subn->opt.no_qos && !valid_sls[sl]) { > > + if (!p_rcv->p_subn->opt.no_qos && !(sl_mask & (1 << sl))) { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > "__osm_pr_rcv_get_path_parms: ERR 1F23: " > > "Selected SL (%u) leads to VL15\n", p_prtn->sl); > >>>> + /* > >>>> + * set Pkey for this path record request > >>>> + */ > >>>> + > >>>> + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && > >>>> + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) > >>> No extra () was needed - this generates confused diff lines. > >> No sure what you mean here by "confused diff lines". > > I mean those extra lines in the patch where the only differences are > > formatting or cosmetic stuff like extra braces. If you have a reason to > > make such changes just send it as separate patch. > >> I agree that the extra () are not *needed*, but isn't > >> > >> if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && > >> (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) > >> > >> is more readable than > >> > >> if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > >> cl_ntoh32(p_pr->hop_flow_raw) & 1 << 31) > >> > >> ? > > No. It requires 2+ seconds to make sure that some braces are just > > "extra" ones. > > BTW the second is incorrect - should be (1 << 31), those '()' were > > needed. 
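
On the precedence point just raised: in standard C the shift operators bind tighter than bitwise '&', which in turn binds tighter than '&&', so both spellings quoted above should parse the same way and the extra parentheses appear to be a readability choice rather than a correctness requirement. The sketch below checks this with a hypothetical mask bit (EX_COMPMASK_RAWTRAFFIC is an assumption, not the real IB_PR_COMPMASK_RAWTRAFFIC) and uses 1u rather than 1, because shifting a signed int into bit 31 is not well defined.

```c
#include <assert.h>
#include <stdint.h>

#define EX_COMPMASK_RAWTRAFFIC (1u << 3) /* hypothetical bit position */

/* Fully parenthesised form. */
static int ex_raw_explicit(uint32_t comp_mask, uint32_t hop_flow_raw)
{
	return (comp_mask & EX_COMPMASK_RAWTRAFFIC) &&
	       (hop_flow_raw & (1u << 31));
}

/* No optional parentheses: '<<' is applied first, then '&', then '&&'. */
static int ex_raw_minimal(uint32_t comp_mask, uint32_t hop_flow_raw)
{
	return comp_mask & EX_COMPMASK_RAWTRAFFIC &&
	       hop_flow_raw & 1u << 31;
}

/* Compile-time check that both spellings of the bit test agree. */
static_assert((0xffffffffu & 1u << 31) == (0xffffffffu & (1u << 31)),
	      "shift binds tighter than bitwise AND");
```

The case where parentheses really are required is a comparison next to '&': x & MASK == 0 parses as x & (MASK == 0).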
> > Sasha > From rdreier at cisco.com Thu Sep 6 05:12:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 06 Sep 2007 05:12:35 -0700 Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <200709061138.l86BcgYb005214@cmf.nrl.navy.mil> (chas williams's message of "Thu, 06 Sep 2007 07:38:42 -0400") References: <200709061138.l86BcgYb005214@cmf.nrl.navy.mil> Message-ID: >Assuming routing works, even if this means you trust the IB-Eth gateway not to >corrupt the packet, I'm looking for name that makes this clear. I haven't had a chance to do much this week (still at the kernel summit). However, my view is that this patch is *very* dangerous and I don't like it much. But maybe if we name the option something like "enable_silent_data_corruption" that would be sufficient warning for users. - R. From sashak at voltaire.com Thu Sep 6 05:25:47 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 15:25:47 +0300 Subject: [ofa-general] Re: [PATCH] libibumad: Fix several issues that were reported by valgrind In-Reply-To: <200709061505.03790.dotanb@dev.mellanox.co.il> References: <200709061505.03790.dotanb@dev.mellanox.co.il> Message-ID: <20070906122547.GN25330@sashak.voltaire.com> Hi Dotan, On 15:05 Thu 06 Sep , Dotan Barak wrote: > Fix several issues that were reported by valgrind. Am I missing something? What is the fix here? Sasha > (sorry, but i don't have any test suite to check all of the libibumad code > for valgrind warnings in the first place ...) 
> > Signed-off-by: Dotan Barak > > Index: ofa_1_3_dev_user/src/userspace/management/libibumad/src/umad.c > =================================================================== > --- ofa_1_3_dev_user.orig/src/userspace/management/libibumad/src/umad.c 2007-09-05 09:31:53.000000000 +0300 > +++ ofa_1_3_dev_user/src/userspace/management/libibumad/src/umad.c 2007-09-06 14:59:42.000000000 +0300 > @@ -832,6 +832,9 @@ umad_recv(int portid, void *umad, int *l > } > > n = read(port->dev_fd, umad, sizeof *mad + *length); > + > + VALGRIND_MAKE_MEM_DEFINED(umad, sizeof *mad + *length); > + > if ((n >= 0) && (n <= sizeof *mad + *length)) { > DEBUG("mad received by agent %d length %d", mad->agent_id, n); > if (n > sizeof *mad) > @@ -910,6 +913,8 @@ umad_register_oui(int portid, int mgmt_c > else > memset(req.method_mask, 0, sizeof req.method_mask); > > + VALGRIND_MAKE_MEM_DEFINED(&req, sizeof req); > + > if (!ioctl(port->dev_fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { > DEBUG("portid %d registered to use agent %d qp %d class 0x%x oui %p", > portid, req.id, req.qpn, req.mgmt_class, oui); From dotanb at dev.mellanox.co.il Thu Sep 6 05:24:37 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 06 Sep 2007 15:24:37 +0300 Subject: [ofa-general] Re: [PATCH] libibumad: Fix several issues that were reported by valgrind In-Reply-To: <20070906122547.GN25330@sashak.voltaire.com> References: <200709061505.03790.dotanb@dev.mellanox.co.il> <20070906122547.GN25330@sashak.voltaire.com> Message-ID: <46DFF185.7070308@dev.mellanox.co.il> Sasha Khapyorsky wrote: > >> Fix several issues that were reported by valgrind. >> > > Am I missing something? What is the fix here? > > Sasha > This patch fixes the valgrind support in the libibumad If this patch won't be applied, valgrind will have warnings on those buffers, so if you will execute an application that calls to those functions you will get warnings because memory violations that exists in the libibumad. 
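
For readers unfamiliar with the macro in this patch: VALGRIND_MAKE_MEM_DEFINED is a Memcheck client request from valgrind/memcheck.h that tells Valgrind a byte range now holds initialised data; outside Valgrind it costs almost nothing. The sketch below is a hypothetical helper, not libibumad code. It assumes a plain read() on a file descriptor and, unlike the patch, marks only the bytes actually returned; the no-op fallback macro is also an assumption, added so the example builds without the Valgrind headers installed.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Use the real client request when the Valgrind headers are available,
 * otherwise fall back to a no-op so this sketch still compiles. */
#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#  endif
#endif
#ifndef VALGRIND_MAKE_MEM_DEFINED
#  define VALGRIND_MAKE_MEM_DEFINED(addr, len) (0)
#endif

/* Hypothetical helper: read into buf and tell Memcheck that the bytes
 * which were actually filled in are defined. */
static ssize_t ex_read_defined(int fd, void *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL);
	n = read(fd, buf, len);
	if (n > 0)
		(void)VALGRIND_MAKE_MEM_DEFINED(buf, (size_t)n);
	return n;
}
```

Marking only the returned bytes keeps Memcheck able to flag reads of the unfilled tail of the buffer; whether that is appropriate for the umad device is a judgment for the maintainers.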
Dotan From kliteyn at dev.mellanox.co.il Thu Sep 6 05:07:29 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 06 Sep 2007 15:07:29 +0300 Subject: [ofa-general] Re: [PATCH 2/2] osm: QoS - support for MPR in qos policy In-Reply-To: <20070906121651.GL25330@sashak.voltaire.com> References: <46DFC4FC.70500@dev.mellanox.co.il> <20070906121651.GL25330@sashak.voltaire.com> Message-ID: <46DFED81.9010308@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 12:14 Thu 06 Sep , Yevgeny Kliteynik wrote: >> Hi Sasha, >> >> This patch adds osm_qos_policy_get_qos_level_by_mpr() wrapper function that >> basically does the same thing as the osm_qos_policy_get_qos_level_by_pr(), >> only for MultiPathRecord instead of PathRecord.. >> >> -- Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/include/opensm/osm_qos_policy.h | 8 ++++++++ >> opensm/opensm/osm_qos_policy.c | 32 ++++++++++++++++++++++++++++++++ >> 2 files changed, 40 insertions(+), 0 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h >> index 11598be..0c220ee 100644 >> --- a/opensm/include/opensm/osm_qos_policy.h >> +++ b/opensm/include/opensm/osm_qos_policy.h >> @@ -51,6 +51,7 @@ >> #include >> #include >> #include >> +#include >> >> #define YYSTYPE char * >> #define OSM_QOS_POLICY_MAX_PORTS_ON_SWITCH 128 >> @@ -179,6 +180,13 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( >> IN const osm_physp_t * p_dest_physp, >> IN ib_net64_t comp_mask); >> >> +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( >> + IN const osm_qos_policy_t * p_qos_policy, >> + IN const ib_multipath_rec_t * p_mpr, >> + IN const osm_physp_t * p_src_physp, >> + IN const osm_physp_t * p_dest_physp, >> + IN ib_net64_t comp_mask); >> + >> /***************************************************/ >> >> int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn); >> diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c >> index 
74628a5..a778bcb 100644 >> --- a/opensm/opensm/osm_qos_policy.c >> +++ b/opensm/opensm/osm_qos_policy.c >> @@ -957,3 +957,35 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( >> /*************************************************** >> ***************************************************/ >> >> +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( >> + IN const osm_qos_policy_t * p_qos_policy, >> + IN const ib_multipath_rec_t * p_mpr, >> + IN const osm_physp_t * p_src_physp, >> + IN const osm_physp_t * p_dest_physp, >> + IN ib_net64_t comp_mask) >> +{ >> + uint8_t params_comp_mask = 0; >> + >> + if (!p_qos_policy) >> + return NULL; >> + >> + if (comp_mask & IB_MPR_COMPMASK_QOS_CLASS) >> + params_comp_mask |= QOS_PARAMS_COMPMASK_QOS_CLASS; >> + >> + if (comp_mask & IB_MPR_COMPMASK_SERVICEID_MSB && >> + comp_mask & IB_MPR_COMPMASK_SERVICEID_LSB) >> + params_comp_mask |= QOS_PARAMS_COMPMASK_SERVICEID; >> + >> + if (comp_mask & IB_MPR_COMPMASK_PKEY) >> + params_comp_mask |= QOS_PARAMS_COMPMASK_PKEY; >> + >> + return __qos_policy_get_qos_level_by_params( >> + p_qos_policy, p_src_physp, p_dest_physp, >> + cl_ntoh64(ib_multipath_rec_service_id(p_mpr)), >> + ib_multipath_rec_qos_class(p_mpr), >> + cl_ntoh16(p_mpr->pkey), params_comp_mask); >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> -- >> 1.5.1.4 > > This patch does not apply. The reason is trailing newline in > osm_qos_policy.c file (introduced in the previous patch). I'm using > 'git-am --whitespace=strip', so this new line was stripped and the next > patch (this one) does not apply. It is better to not put empty new lines > at the end. OK. Do you need a new patch? 
-- Yevgeny > Sasha > From vlad at mellanox.co.il Thu Sep 6 05:25:30 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 6 Sep 2007 15:25:30 +0300 Subject: [ofa-general] RE: management/libibcommon In-Reply-To: <20070906112307.GI25330@sashak.voltaire.com> References: <20070906112307.GI25330@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C902374493@mtlexch01.mtl.com> > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Thursday, September 06, 2007 2:23 PM > To: OpenIB > Cc: Hal Rosenstock; Ira Weiny; Eitan Zahavi; Sean Hefty; Dotan Barak; > Vladimir Sokolovsky > Subject: management/libibcommon > > Hi All, > > Currently we have libibcommon library under OFA management project. > Partially it is used by libibumad and partially by libibmad and > infiniband-diags. The used things look pretty separate so I'm thinking > to strip libibcommon as whole library and its components over libibumad > and libibmad - this will remove extra dependency for libibumad. > > Anybody else (except management) uses libibcommon? Any comments, > objections? > > Sasha Hi Sasha, AFAIK, mvapich, srptools and ibutils use libibcommon. In any case if you are going to remove libibcommon, then ofabuild and ofed_1_3_scripts should be updated as well. I will be in vacation from 10 Sep 2007 till 2 Oct 2007. Please update Tziporet Koren and Michael Tsirkin with your decisions. Regards, Vladimir From hal.rosenstock at gmail.com Thu Sep 6 05:26:56 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 6 Sep 2007 08:26:56 -0400 Subject: [ofa-general] Re: management/libibcommon In-Reply-To: <20070906112307.GI25330@sashak.voltaire.com> References: <20070906112307.GI25330@sashak.voltaire.com> Message-ID: On 9/6/07, Sasha Khapyorsky wrote: > Hi All, > > Currently we have libibcommon library under OFA management project. > Partially it is used by libibumad and partially by libibmad and > infiniband-diags. 
The used things look pretty separate so I'm thinking > to strip libibcommon as whole library and its components over libibumad > and libibmad - this will remove extra dependency for libibumad. > > Anybody else (except management) uses libibcommon? Any comments, > objections? Having one less library here is better especially since libibumad is used for both OpenSM, diags, and ibutils whereas libibmad is only used for diags in terms of the open sourced components. -- Hal > > Sasha > From sashak at voltaire.com Thu Sep 6 05:50:44 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 15:50:44 +0300 Subject: [ofa-general] Re: [PATCH 2/2] osm: QoS - support for MPR in qos policy In-Reply-To: <46DFED81.9010308@dev.mellanox.co.il> References: <46DFC4FC.70500@dev.mellanox.co.il> <20070906121651.GL25330@sashak.voltaire.com> <46DFED81.9010308@dev.mellanox.co.il> Message-ID: <20070906125044.GO25330@sashak.voltaire.com> On 15:07 Thu 06 Sep , Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: > > On 12:14 Thu 06 Sep , Yevgeny Kliteynik wrote: > >> Hi Sasha, > >> > >> This patch adds osm_qos_policy_get_qos_level_by_mpr() wrapper function > >> that > >> basically does the same thing as the osm_qos_policy_get_qos_level_by_pr(), > >> only for MultiPathRecord instead of PathRecord.. 
> >> > >> -- Yevgeny > >> > >> Signed-off-by: Yevgeny Kliteynik > >> --- > >> opensm/include/opensm/osm_qos_policy.h | 8 ++++++++ > >> opensm/opensm/osm_qos_policy.c | 32 > >> ++++++++++++++++++++++++++++++++ > >> 2 files changed, 40 insertions(+), 0 deletions(-) > >> > >> diff --git a/opensm/include/opensm/osm_qos_policy.h > >> b/opensm/include/opensm/osm_qos_policy.h > >> index 11598be..0c220ee 100644 > >> --- a/opensm/include/opensm/osm_qos_policy.h > >> +++ b/opensm/include/opensm/osm_qos_policy.h > >> @@ -51,6 +51,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> > >> #define YYSTYPE char * > >> #define OSM_QOS_POLICY_MAX_PORTS_ON_SWITCH 128 > >> @@ -179,6 +180,13 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( > >> IN const osm_physp_t * p_dest_physp, > >> IN ib_net64_t comp_mask); > >> > >> +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( > >> + IN const osm_qos_policy_t * p_qos_policy, > >> + IN const ib_multipath_rec_t * p_mpr, > >> + IN const osm_physp_t * p_src_physp, > >> + IN const osm_physp_t * p_dest_physp, > >> + IN ib_net64_t comp_mask); > >> + > >> /***************************************************/ > >> > >> int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn); > >> diff --git a/opensm/opensm/osm_qos_policy.c > >> b/opensm/opensm/osm_qos_policy.c > >> index 74628a5..a778bcb 100644 > >> --- a/opensm/opensm/osm_qos_policy.c > >> +++ b/opensm/opensm/osm_qos_policy.c > >> @@ -957,3 +957,35 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( > >> /*************************************************** > >> ***************************************************/ > >> > >> +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( > >> + IN const osm_qos_policy_t * p_qos_policy, > >> + IN const ib_multipath_rec_t * p_mpr, > >> + IN const osm_physp_t * p_src_physp, > >> + IN const osm_physp_t * p_dest_physp, > >> + IN ib_net64_t comp_mask) > >> +{ > >> + uint8_t params_comp_mask = 0; > >> + > >> + if (!p_qos_policy) 
> >> + return NULL; > >> + > >> + if (comp_mask & IB_MPR_COMPMASK_QOS_CLASS) > >> + params_comp_mask |= QOS_PARAMS_COMPMASK_QOS_CLASS; > >> + > >> + if (comp_mask & IB_MPR_COMPMASK_SERVICEID_MSB && > >> + comp_mask & IB_MPR_COMPMASK_SERVICEID_LSB) > >> + params_comp_mask |= QOS_PARAMS_COMPMASK_SERVICEID; > >> + > >> + if (comp_mask & IB_MPR_COMPMASK_PKEY) > >> + params_comp_mask |= QOS_PARAMS_COMPMASK_PKEY; > >> + > >> + return __qos_policy_get_qos_level_by_params( > >> + p_qos_policy, p_src_physp, p_dest_physp, > >> + cl_ntoh64(ib_multipath_rec_service_id(p_mpr)), > >> + ib_multipath_rec_qos_class(p_mpr), > >> + cl_ntoh16(p_mpr->pkey), params_comp_mask); > >> +} > >> + > >> +/*************************************************** > >> + ***************************************************/ > >> + > >> -- > >> 1.5.1.4 > > This patch does not apply. The reason is trailing newline in > > osm_qos_policy.c file (introduced in the previous patch). I'm using > > 'git-am --whitespace=strip', so this new line was stripped and the next > > patch (this one) does not apply. It is better to not put empty new lines > > at the end. > > OK. > Do you need a new patch? I already edited this one by hands, so I don't need (if there will no changes following by comments from first patch). Sasha From sashak at voltaire.com Thu Sep 6 06:00:00 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 16:00:00 +0300 Subject: [ofa-general] Re: [PATCH] libibumad: Fix several issues that were reported by valgrind In-Reply-To: <46DFF185.7070308@dev.mellanox.co.il> References: <200709061505.03790.dotanb@dev.mellanox.co.il> <20070906122547.GN25330@sashak.voltaire.com> <46DFF185.7070308@dev.mellanox.co.il> Message-ID: <20070906130000.GP25330@sashak.voltaire.com> On 15:24 Thu 06 Sep , Dotan Barak wrote: > Sasha Khapyorsky wrote: > > > >> Fix several issues that were reported by valgrind. > >> > > > > Am I missing something? What is the fix here? 
> > > > Sasha > > > This patch fixes the valgrind support in the libibumad I see it now. Thanks for the explanations. Sasha > > If this patch won't be applied, valgrind will have warnings on those > buffers, > so if you will execute an application that calls to those functions you will > get warnings because > memory violations that exists in the libibumad. > > Dotan From sashak at voltaire.com Thu Sep 6 06:00:23 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 16:00:23 +0300 Subject: [ofa-general] Re: [PATCH] libibumad: Fix several issues that were reported by valgrind In-Reply-To: <200709061505.03790.dotanb@dev.mellanox.co.il> References: <200709061505.03790.dotanb@dev.mellanox.co.il> Message-ID: <20070906130023.GQ25330@sashak.voltaire.com> On 15:05 Thu 06 Sep , Dotan Barak wrote: > Fix several issues that were reported by valgrind. > (sorry, but i don't have any test suite to check all of the libibumad code > for valgrind warnings in the first place ...) > > Signed-off-by: Dotan Barak Applied. Thanks. Sasha From avi at qumranet.com Thu Sep 6 06:18:23 2007 From: avi at qumranet.com (Avi Kivity) Date: Thu, 06 Sep 2007 16:18:23 +0300 Subject: [ofa-general] Re: [PATCH][RFC] pte notifiers -- support for external page tables In-Reply-To: References: <11890207643068-git-send-email-avi@qumranet.com> Message-ID: <46DFFE1F.7060900@qumranet.com> [ugh, what happened to the cc-list?] Andi Kleen wrote: > Avi Kivity writes: > >> pte notifiers are different from paravirt_ops: they extend the normal >> page tables rather than replace them; and they provide high-level information >> such as the vma and the virtual address for the driver to use. >> > > Sounds like a locking horror to me. To do anything with page tables > you need locks. Both for the kernel page tables and for your new tables. > > What happens when people add all > things of complicated operations in these notifiers? 
That will likely > happen and then everytime you change something in VM code they > will break. This has the potential to increase the cost of maintaining > VM code considerably, which would be a bad thing. > > This is quite different from paravirt ops because low level pvops > can typically run lockless by just doing some kind of hypercall directly. > But that won't work for maintaining your custom page tables. > This is a real problem. I don't have a solution yet. Obviously that needs to be addressed before something like this can go in; but as it's been done for the quadrics driver, presumably it is doable. -- Any sufficiently difficult bug is indistinguishable from a feature.
From kliteyn at dev.mellanox.co.il Thu Sep 6 06:31:00 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 06 Sep 2007 16:31:00 +0300 Subject: [ofa-general] Re: [PATCH v2] osm: QoS: selecting PathRecord according to QoS policy In-Reply-To: <20070905232643.GC25330@sashak.voltaire.com> References: <46DE9F97.10003@dev.mellanox.co.il> <20070905232643.GC25330@sashak.voltaire.com> Message-ID: <46E00114.5060601@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 15:22 Wed 05 Sep , Yevgeny Kliteynik wrote: >> Selecting path according to QoS policy level that >> the PathRecord query matches. >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/opensm/osm_sa_path_record.c | 374 ++++++++++++++++++++++++++---------- >> 1 files changed, 276 insertions(+), 98 deletions(-) >> >> diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c >> index 1b781f0..15bd7e2 100644 >> --- a/opensm/opensm/osm_sa_path_record.c >> +++ b/opensm/opensm/osm_sa_path_record.c >> @@ -67,6 +67,7 @@ >> #include >> #include >> #include >> +#include >> #ifdef ROUTER_EXP >> #include >> #include >> @@ -236,8 +237,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> { >> const osm_node_t *p_node; >> const osm_physp_t *p_physp; >> + const osm_physp_t *p_src_physp; >> const osm_physp_t *p_dest_physp; >> - const osm_prtn_t *p_prtn; >> + const osm_prtn_t *p_prtn = NULL; >> const ib_port_info_t *p_pi; >> ib_api_status_t status = IB_SUCCESS; >> ib_net16_t pkey; >> @@ -248,14 +250,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> uint8_t required_rate; >> uint8_t required_pkt_life; >> uint8_t sl; >> + uint8_t in_port_num; >> ib_net16_t dest_lid; >> + uint8_t i; >> + uint8_t vl; >> + ib_slvl_table_t *p_slvl_tbl = NULL; >> + boolean_t valid_sls[IB_MAX_NUM_VLS]; >> + boolean_t sl2vl_valid_path; >> + uint8_t first_valid_sl; >> + osm_qos_level_t *p_qos_level = NULL; >> >> OSM_LOG_ENTER(p_rcv->p_log,
__osm_pr_rcv_get_path_parms); >> >> + memset(valid_sls, TRUE, IB_MAX_NUM_VLS); >> dest_lid = cl_hton16(dest_lid_ho); >> >> p_dest_physp = p_dest_port->p_physp; >> p_physp = p_src_port->p_physp; >> + p_src_physp = p_physp; >> p_pi = &p_physp->port_info; >> >> mtu = ib_port_info_get_mtu_cap(p_pi); >> @@ -288,13 +300,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> p_node = osm_physp_get_node_ptr(p_physp); >> >> if (p_node->sw) { >> + /* source node is a switch */ >> + in_port_num = osm_physp_get_port_num(p_physp); > > Hmm, could in_port_num be != 0? Well... The physical port object is obtained from port object, which in turn, was obtained from the subnet port_guid_tbl through osm_get_port_by_guid(). Since there can be one port per guid in this table, I think we store there only ports 0 of the switches (correct me if I'm wrong). So looks like you're right - in this case in_port_num can be only 0. In any case, osm_physp_get_port_num() is just an inline function that returns p_physp->port_num. >> + >> /* >> * If the dest_lid_ho is equal to the lid of the switch pointed by >> * p_sw then p_physp will be the physical port of the switch port zero. > > I know it is not your code, but do you understand this part of the > comment? Nope :) The two lines I've added may very well replace these first two lines, so I think I can remove the old comment. >> + * Make sure that p_physp points to the out port of the >> + * switch that routes to the destination lid (dest_lid_ho) >> */ >> - p_physp = >> - osm_switch_get_route_by_lid(p_node->sw, >> - cl_ntoh16(dest_lid_ho)); >> + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); >> if (p_physp == 0) { >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> "__osm_pr_rcv_get_path_parms: ERR 1F02: " >> @@ -306,15 +321,32 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> } >> } >> >> + if (!p_rcv->p_subn->opt.no_qos) { > > Would you prefer to change opt.no_qos to opt.qos? 
For me it looks things > will be clear this way. I wanted to do it since I started working on QoS! >> + if (p_node->sw) >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); >> + else >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); >> + >> + /* update valid SLs that still exist on this route */ >> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { >> + if (valid_sls[i]) { >> + vl = ib_slvl_table_get(p_slvl_tbl, i); >> + if (vl == IB_DROP_VL) >> + valid_sls[i] = FALSE; >> + } >> + } >> + } >> + >> /* >> * Same as above >> */ >> p_node = osm_physp_get_node_ptr(p_dest_physp); >> >> if (p_node->sw) { >> - p_dest_physp = >> - osm_switch_get_route_by_lid(p_node->sw, >> - cl_ntoh16(dest_lid_ho)); >> + /* >> + * if destination is switch, we want p_dest_physp to point to port 0 >> + */ >> + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); >> >> if (p_dest_physp == 0) { >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> @@ -328,6 +360,10 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> >> } >> >> + /* >> + * Now go through the path step by step >> + */ >> + >> while (p_physp != p_dest_physp) { >> p_physp = osm_physp_get_remote(p_physp); >> >> @@ -341,6 +377,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> goto Exit; >> } >> >> + in_port_num = osm_physp_get_port_num(p_physp); >> + >> /* >> This is point to point case (no switch in between) >> */ >> @@ -367,29 +405,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> */ >> p_pi = &p_physp->port_info; >> >> - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { >> + if (mtu > ib_port_info_get_mtu_cap(p_pi)) >> mtu = ib_port_info_get_mtu_cap(p_pi); >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> - "__osm_pr_rcv_get_path_parms: " >> - "New smallest MTU = %u at intervening port 0x%016" >> - PRIx64 " port num 0x%X\n", mtu, >> - cl_ntoh64(osm_physp_get_port_guid >> - (p_physp)), >> - osm_physp_get_port_num(p_physp)); >> - } 
>> >> - if (rate > ib_port_info_compute_rate(p_pi)) { >> + if (rate > ib_port_info_compute_rate(p_pi)) >> rate = ib_port_info_compute_rate(p_pi); >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> - "__osm_pr_rcv_get_path_parms: " >> - "New smallest rate = %u at intervening port 0x%016" >> - PRIx64 " port num 0x%X\n", rate, >> - cl_ntoh64(osm_physp_get_port_guid >> - (p_physp)), >> - osm_physp_get_port_num(p_physp)); >> - } >> >> /* >> Continue with the egress port on this switch. >> @@ -409,32 +429,41 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> CL_ASSERT(p_physp); > > It is not needed, run-time check is done right above. (I know it is not > your code) Sure - removed. >> CL_ASSERT(osm_physp_is_valid(p_physp)); >> >> + p_node = osm_physp_get_node_ptr(p_physp); >> + if (!p_node->sw) { > > Actually this !p_node->sw check duplicates the one above, where > !p_node->sw is verified for ergess port of this switch. Right? > >> + /* >> + * There is some sort of problem in the subnet object! >> + * If this isn't a switch, we should have reached >> + * the destination by now! 
>> + */ >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F04: " >> + "Internal error, bad path\n"); >> + status = IB_ERROR; >> + goto Exit; >> + } >> + >> p_pi = &p_physp->port_info; >> >> - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { >> + if (mtu > ib_port_info_get_mtu_cap(p_pi)) >> mtu = ib_port_info_get_mtu_cap(p_pi); >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> - "__osm_pr_rcv_get_path_parms: " >> - "New smallest MTU = %u at intervening port 0x%016" >> - PRIx64 " port num 0x%X\n", mtu, >> - cl_ntoh64(osm_physp_get_port_guid >> - (p_physp)), >> - osm_physp_get_port_num(p_physp)); >> - } >> >> - if (rate > ib_port_info_compute_rate(p_pi)) { >> + if (rate > ib_port_info_compute_rate(p_pi)) >> rate = ib_port_info_compute_rate(p_pi); >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> - "__osm_pr_rcv_get_path_parms: " >> - "New smallest rate = %u at intervening port 0x%016" >> - PRIx64 " port num 0x%X\n", rate, >> - cl_ntoh64(osm_physp_get_port_guid >> - (p_physp)), >> - osm_physp_get_port_num(p_physp)); >> - } >> >> + if (!p_rcv->p_subn->opt.no_qos) { >> + /* >> + * Check SL2VL table of the switch and update valid SLs >> + */ >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); >> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { >> + if (valid_sls[i]) { >> + vl = ib_slvl_table_get(p_slvl_tbl, i); >> + if (vl == IB_DROP_VL) >> + valid_sls[i] = FALSE; >> + } >> + } >> + } >> } >> >> /* >> @@ -442,30 +471,104 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> */ >> p_pi = &p_physp->port_info; >> >> - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { >> + if (mtu > ib_port_info_get_mtu_cap(p_pi)) >> mtu = ib_port_info_get_mtu_cap(p_pi); >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >> + >> + if (rate > ib_port_info_compute_rate(p_pi)) >> + rate = ib_port_info_compute_rate(p_pi); >> + >> + if 
(osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "Path min MTU = %u, min rate = %u\n", >> + mtu, rate); >> + >> + if (!p_rcv->p_subn->opt.no_qos) { >> + /* >> + * check whether there is some SL >> + * that won't lead to VL15 eventually >> + */ >> + sl2vl_valid_path = FALSE; >> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { >> + if (valid_sls[i]) { >> + sl2vl_valid_path = TRUE; >> + first_valid_sl = i; >> + break; >> + } >> + } >> + >> + if (!sl2vl_valid_path) { >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "All the SLs lead to VL15 on this path\n"); >> + } >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + } >> + >> + if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { >> + /* Get QoS Level object according to the path request */ >> + osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, >> + p_rcv, p_pr, >> + p_src_physp, p_dest_physp, >> + comp_mask, &p_qos_level); >> + >> + if (p_qos_level >> + && osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >> osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> "__osm_pr_rcv_get_path_parms: " >> - "New smallest MTU = %u at destination port 0x%016" >> - PRIx64 "\n", mtu, >> - cl_ntoh64(osm_physp_get_port_guid(p_physp))); >> + "PathRecord request matches QoS Level '%s' (%s)\n", >> + p_qos_level->name, >> + (p_qos_level->use) ? p_qos_level-> >> + use : "no description"); >> + } >> } >> >> - if (rate > ib_port_info_compute_rate(p_pi)) { >> - rate = ib_port_info_compute_rate(p_pi); >> + /* Adjust path parameters according to QoS settings */ >> + >> + if (p_qos_level) { > > Why to not make osm_qos_policy_get_qos_level_by_pr() returning pointer > to p_qos_level? 
Then you could simply merge both conditions (this and > one above), something like: > > if (!p_rcv->p_subn->opt.no_qos && > p_rcv->p_subn->p_qos_policy && > (p_qos_level = osm_qos_policy_get_qos_level_by_pr(..)) { Done. >> + if (p_qos_level->mtu_limit_set >> + && (mtu > p_qos_level->mtu_limit)) >> + mtu = p_qos_level->mtu_limit; >> + >> + if (p_qos_level->rate_limit_set >> + && (rate > p_qos_level->rate_limit)) >> + rate = p_qos_level->rate_limit; >> + >> + if (p_qos_level->pkt_life_set >> + && (pkt_life > p_qos_level->pkt_life)) >> + pkt_life = p_qos_level->pkt_life; >> + >> + if (p_qos_level->sl_set) { >> + if (!valid_sls[p_qos_level->sl]) { >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + sl = p_qos_level->sl; >> + } >> + >> if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >> osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> "__osm_pr_rcv_get_path_parms: " >> - "New smallest rate = %u at destination port 0x%016" >> - PRIx64 "\n", rate, >> - cl_ntoh64(osm_physp_get_port_guid(p_physp))); >> + "Path params with QoS constaraints: " >> + "min MTU = %u, min rate = %u, " >> + "packet lifetime = %u, sl = %u\n", >> + mtu, rate, pkt_life, sl); >> } >> >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> - "__osm_pr_rcv_get_path_parms: " >> - "Path min MTU = %u, min rate = %u\n", mtu, rate); >> + /* >> + * Set packet lifetime. >> + * According to spec definition IBA 1.2 Table 205 >> + * PacketLifeTime description, for loopback paths, >> + * packetLifeTime shall be zero. 
>> + */ >> + if (p_src_port == p_dest_port) >> + pkt_life = 0; >> + else if ( !(p_qos_level && p_qos_level->pkt_life_set) ) >> + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; >> + >> >> /* >> Determine if these values meet the user criteria >> @@ -511,6 +614,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> break; >> } >> } >> + if (status != IB_SUCCESS) >> + goto Exit; >> >> /* we silently ignore cases where only the Rate selector is defined */ >> if ((comp_mask & IB_PR_COMPMASK_RATESELEC) && >> @@ -551,14 +656,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> break; >> } >> } >> - >> - /* Verify the pkt_life_time */ >> - /* According to spec definition IBA 1.2 Table 205 PacketLifeTime description, >> - for loopback paths, packetLifeTime shall be zero. */ >> - if (p_src_port == p_dest_port) >> - pkt_life = 0; /* loopback */ >> - else >> - pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; >> + if (status != IB_SUCCESS) >> + goto Exit; >> >> /* we silently ignore cases where only the PktLife selector is defined */ >> if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) && >> @@ -603,12 +702,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> if (status != IB_SUCCESS) >> goto Exit; >> >> - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && >> - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) >> - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); >> + /* >> + * set Pkey for this path record request >> + */ >> + >> + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && >> + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) >> + pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp); > > So is it was bug (not related to QoS) when p_physp instead of > p_src_physp was used for pkey finding? I think so. >> + >> else if (comp_mask & IB_PR_COMPMASK_PKEY) { >> + /* >> + * PR request has a specific pkey: >> + * Check that source and destination share this pkey. >> + * If QoS level has pkeys, check that this pkey exists >> + * in the QoS level pkeys. 
>> + * PR returned pkey is the requested pkey. >> + */ >> pkey = p_pr->pkey; >> - if (!osm_physp_share_this_pkey(p_physp, p_dest_physp, pkey)) { >> + if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) { >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> "__osm_pr_rcv_get_path_parms: ERR 1F1A: " >> "Ports do not share specified PKey 0x%04x\n", >> @@ -616,8 +727,37 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> status = IB_NOT_FOUND; >> goto Exit; >> } >> + if (p_qos_level && p_qos_level->pkey_range_len && >> + !osm_qos_level_has_pkey(p_qos_level, pkey)) { >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F1D: " >> + "Ports do not share PKeys defined by QoS level\n"); >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + >> + } else if (p_qos_level && p_qos_level->pkey_range_len) { >> + /* >> + * PR request doesn't have a specific pkey, but QoS level >> + * has pkeys - get shared pkey from QoS level pkeys >> + */ >> + pkey = osm_qos_level_get_shared_pkey(p_qos_level, >> + p_src_physp, >> + p_dest_physp); >> + if (!pkey) { >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F1E: " >> + "Ports do not share PKeys defined by QoS level\n"); >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> } else { >> - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); >> + /* >> + * Neither PR request nor QoS level have pkey. >> + * Just get any shared pkey. 
>> + */ >> + pkey = osm_physp_find_common_pkey(p_src_physp, >> + p_dest_physp); >> if (!pkey) { >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> "__osm_pr_rcv_get_path_parms: ERR 1F1B: " >> @@ -627,14 +767,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> } >> } >> >> - if (p_rcv->p_subn->opt.routing_engine_name && >> - strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) >> - /* slid and dest_lid are stored in network in lash */ >> - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, >> - p_dest_port); >> - else >> - sl = OSM_DEFAULT_SL; >> - >> if (pkey) { >> p_prtn = >> (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, >> @@ -642,34 +774,80 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, >> 0x8000)); >> if (p_prtn == >> (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) >> + p_prtn = NULL; >> + } >> + >> + /* >> + * Set PathRecord SL. >> + * >> + * ToDo: What about QoS and LASH routing? How can they coexist? >> + * And what happens when there's a pkey, hence there is a >> + * partition with a certain SL, and this SL doesn't match >> + * the one that's defined by LASH? 
>> + */ >> + >> + if (comp_mask & IB_PR_COMPMASK_SL) { >> + /* >> + * Specific SL was requested >> + */ >> + sl = ib_path_rec_sl(p_pr); >> + if (p_qos_level && p_qos_level->sl_set && (p_qos_level->sl != sl)) { >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F1F: " >> + "QoS constaraints: required PR SL (%u) " >> + "doesn't match QoS SL (%u)\n", >> + sl, p_qos_level->sl); >> + status = IB_NOT_FOUND; >> + goto Exit; >> + } >> + } else if (p_qos_level && p_qos_level->sl_set) { >> + /* >> + * No specific SL was requested, >> + * but there is an SL in QoS level >> + */ >> + sl = p_qos_level->sl; >> + if (pkey && p_prtn && p_prtn->sl != p_qos_level->sl) >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "__osm_pr_rcv_get_path_parms: " >> + "QoS level SL (%u) overrides partition SL (%u)\n", >> + p_qos_level->sl, p_prtn->sl); >> + } else if (pkey) { >> + /* >> + * No specific SL in request or in QoS level - use partition SL >> + */ >> + if (!p_prtn) { >> /* this may be possible when pkey tables are created somehow in >> previous runs or things are going wrong here */ >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> "__osm_pr_rcv_get_path_parms: ERR 1F1C: " >> "No partition found for PKey 0x%04x - using default SL %d\n", >> cl_ntoh16(pkey), sl); >> - else { >> - if (p_rcv->p_subn->opt.routing_engine_name && >> - strcmp(p_rcv->p_subn->opt.routing_engine_name, >> - "lash") == 0) >> - /* slid and dest_lid are stored in network in lash */ >> - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, >> - p_src_port, p_dest_port); >> - else >> - sl = p_prtn->sl; >> - } >> - >> - /* reset pkey when raw traffic */ >> - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && >> - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) >> - pkey = 0; >> + } else >> + sl = p_prtn->sl; >> + } else if (p_rcv->p_subn->opt.routing_engine_name && >> + strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) { > > It seems that in original code LASH was "higher" priority in SL > selection than 
partition configuration? If so, any reason why it is > changed? No particular reason - it just seemed right at the moment. I'll rework it so that the relative priorities of partition and lash routing will remain as they were before. In any case, is there any particular reason why lash SL should have higher priority than partition's SL? Regardless what the answer is, there'll be a conflict when a specific pkey was requested in PathRecord and this partition has SL different from what lash defines. >> + /* slid and dest_lid are stored in network in lash */ >> + sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, >> + p_src_port, p_dest_port); >> + } else if (!p_rcv->p_subn->opt.no_qos) { >> + sl = first_valid_sl; >> } >> + else >> + sl = OSM_DEFAULT_SL; >> >> - if ((comp_mask & IB_PR_COMPMASK_SL) && ib_path_rec_sl(p_pr) != sl) { >> + if (!p_rcv->p_subn->opt.no_qos && !valid_sls[sl]) { >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >> + "__osm_pr_rcv_get_path_parms: ERR 1F23: " >> + "Selected SL (%u) leads to VL15\n", p_prtn->sl); >> status = IB_NOT_FOUND; >> goto Exit; >> } >> >> + /* reset pkey when raw traffic */ >> + if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && >> + cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) >> + pkey = 0; >> + >> p_parms->mtu = mtu; >> p_parms->rate = rate; >> p_parms->pkt_life = pkt_life; >> -- >> 1.5.1.4 >> > > We discussed already about using sl_mask instead of valid_sls array. > The rest looks fine for me. I'll repost the patch later today. 
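The sl_mask idea mentioned above is not spelled out in this thread. What follows is only a minimal sketch of how a 16-bit mask might replace the valid_sls array when walking the path, under stated assumptions: slvl_table_t, prune_sl_mask(), first_valid_sl() and sl_is_valid() are hypothetical names invented for illustration and are not OpenSM API, and the table is modelled as a plain array of 16 VL values rather than the packed ib_slvl_table_t.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLS 16	/* InfiniBand defines 16 service levels */
#define DROP_VL 15	/* an SL mapped to VL15 is dropped for data traffic */

/* Hypothetical stand-in for one SL2VL table: sl2vl[sl] is the VL for that SL */
typedef struct {
	uint8_t sl2vl[NUM_SLS];
} slvl_table_t;

/* Clear from the mask every SL that this hop maps to VL15 */
static uint16_t prune_sl_mask(uint16_t sl_mask, const slvl_table_t *tbl)
{
	unsigned sl;

	for (sl = 0; sl < NUM_SLS; sl++)
		if (tbl->sl2vl[sl] == DROP_VL)
			sl_mask &= (uint16_t)~(1u << sl);
	return sl_mask;
}

/* Lowest SL that is still usable on the path, or -1 if none is left */
static int first_valid_sl(uint16_t sl_mask)
{
	int sl;

	for (sl = 0; sl < NUM_SLS; sl++)
		if (sl_mask & (1u << sl))
			return sl;
	return -1;
}

/* Non-zero if the given SL survived every hop examined so far */
static int sl_is_valid(uint16_t sl_mask, unsigned sl)
{
	assert(sl < NUM_SLS);
	return (sl_mask >> sl) & 1u;
}
```

In this sketch the mask would start as 0xFFFF, be passed through prune_sl_mask() once per hop, and a result of zero would correspond to the "All the SLs lead to VL15 on this path" case, so the separate sl2vl_valid_path flag might not be needed.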
-- Yevgeny > Sasha > From sashak at voltaire.com Thu Sep 6 07:23:57 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 17:23:57 +0300 Subject: [ofa-general] Re: [PATCH v2] osm: QoS: selecting PathRecord according to QoS policy In-Reply-To: <46E00114.5060601@dev.mellanox.co.il> References: <46DE9F97.10003@dev.mellanox.co.il> <20070905232643.GC25330@sashak.voltaire.com> <46E00114.5060601@dev.mellanox.co.il> Message-ID: <20070906142357.GR25330@sashak.voltaire.com> On 16:31 Thu 06 Sep , Yevgeny Kliteynik wrote: > Hi Sasha, > > Sasha Khapyorsky wrote: > > Hi Yevgeny, > > On 15:22 Wed 05 Sep , Yevgeny Kliteynik wrote: > >> Selecting path according to QoS policy level that > >> the PathRecord query matches. > >> > >> Signed-off-by: Yevgeny Kliteynik > >> --- > >> opensm/opensm/osm_sa_path_record.c | 374 > >> ++++++++++++++++++++++++++---------- > >> 1 files changed, 276 insertions(+), 98 deletions(-) > >> > >> diff --git a/opensm/opensm/osm_sa_path_record.c > >> b/opensm/opensm/osm_sa_path_record.c > >> index 1b781f0..15bd7e2 100644 > >> --- a/opensm/opensm/osm_sa_path_record.c > >> +++ b/opensm/opensm/osm_sa_path_record.c > >> @@ -67,6 +67,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> #ifdef ROUTER_EXP > >> #include > >> #include > >> @@ -236,8 +237,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> { > >> const osm_node_t *p_node; > >> const osm_physp_t *p_physp; > >> + const osm_physp_t *p_src_physp; > >> const osm_physp_t *p_dest_physp; > >> - const osm_prtn_t *p_prtn; > >> + const osm_prtn_t *p_prtn = NULL; > >> const ib_port_info_t *p_pi; > >> ib_api_status_t status = IB_SUCCESS; > >> ib_net16_t pkey; > >> @@ -248,14 +250,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> uint8_t required_rate; > >> uint8_t required_pkt_life; > >> uint8_t sl; > >> + uint8_t in_port_num; > >> ib_net16_t dest_lid; > >> + uint8_t i; > >> + uint8_t vl; > >> + ib_slvl_table_t *p_slvl_tbl = NULL; > 
>> + boolean_t valid_sls[IB_MAX_NUM_VLS]; > >> + boolean_t sl2vl_valid_path; > >> + uint8_t first_valid_sl; > >> + osm_qos_level_t *p_qos_level = NULL; > >> > >> OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); > >> > >> + memset(valid_sls, TRUE, IB_MAX_NUM_VLS); > >> dest_lid = cl_hton16(dest_lid_ho); > >> > >> p_dest_physp = p_dest_port->p_physp; > >> p_physp = p_src_port->p_physp; > >> + p_src_physp = p_physp; > >> p_pi = &p_physp->port_info; > >> > >> mtu = ib_port_info_get_mtu_cap(p_pi); > >> @@ -288,13 +300,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> p_node = osm_physp_get_node_ptr(p_physp); > >> > >> if (p_node->sw) { > >> + /* source node is a switch */ > >> + in_port_num = osm_physp_get_port_num(p_physp); > > Hmm, could in_port_num be != 0? > > Well... > The physical port object is obtained from port object, which in turn, > was obtained from the subnet port_guid_tbl through osm_get_port_by_guid(). > Since there can be one port per guid in this table, I think we store there > only ports 0 of the switches (correct me if I'm wrong). > So looks like you're right - in this case in_port_num can be only 0. > > In any case, osm_physp_get_port_num() is just an inline function that > returns p_physp->port_num. And look where this in_port_num is used later: > >> + if (p_node->sw) > >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > >> + else > >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); > >> + Since for switches in_port_num is always 0 just p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); will be sufficient for all node types. > > >> + > >> /* > >> * If the dest_lid_ho is equal to the lid of the switch pointed by > >> * p_sw then p_physp will be the physical port of the switch port zero. > > I know it is not your code, but do you understand this part of the > > comment? > > Nope :) > The two lines I've added may very well replace these first two lines, > so I think I can remove the old comment. Ok. 
> > >> + * Make sure that p_physp points to the out port of the > >> + * switch that routes to the destination lid (dest_lid_ho) > >> */ > >> - p_physp = > >> - osm_switch_get_route_by_lid(p_node->sw, > >> - cl_ntoh16(dest_lid_ho)); > >> + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); > >> if (p_physp == 0) { > >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, > >> "__osm_pr_rcv_get_path_parms: ERR 1F02: " > >> @@ -306,15 +321,32 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> } > >> } > >> > >> + if (!p_rcv->p_subn->opt.no_qos) { > > Would you prefer to change opt.no_qos to opt.qos? For me it looks things > > will be clear this way. > > I wanted to do it since I started working on QoS! Feel free :) > > >> + if (p_node->sw) > >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > >> + else > >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); > >> + > >> + /* update valid SLs that still exist on this route */ > >> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { > >> + if (valid_sls[i]) { > >> + vl = ib_slvl_table_get(p_slvl_tbl, i); > >> + if (vl == IB_DROP_VL) > >> + valid_sls[i] = FALSE; > >> + } > >> + } > >> + } > >> + > >> /* > >> * Same as above > >> */ > >> p_node = osm_physp_get_node_ptr(p_dest_physp); > >> > >> if (p_node->sw) { > >> - p_dest_physp = > >> - osm_switch_get_route_by_lid(p_node->sw, > >> - cl_ntoh16(dest_lid_ho)); > >> + /* > >> + * if destination is switch, we want p_dest_physp to point to port 0 > >> + */ > >> + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); > >> > >> if (p_dest_physp == 0) { > >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, > >> @@ -328,6 +360,10 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> > >> } > >> > >> + /* > >> + * Now go through the path step by step > >> + */ > >> + > >> while (p_physp != p_dest_physp) { > >> p_physp = osm_physp_get_remote(p_physp); > >> > >> @@ -341,6 +377,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > 
>> goto Exit; > >> } > >> > >> + in_port_num = osm_physp_get_port_num(p_physp); > >> + > >> /* > >> This is point to point case (no switch in between) > >> */ > >> @@ -367,29 +405,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> */ > >> p_pi = &p_physp->port_info; > >> > >> - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { > >> + if (mtu > ib_port_info_get_mtu_cap(p_pi)) > >> mtu = ib_port_info_get_mtu_cap(p_pi); > >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> - "__osm_pr_rcv_get_path_parms: " > >> - "New smallest MTU = %u at intervening port 0x%016" > >> - PRIx64 " port num 0x%X\n", mtu, > >> - cl_ntoh64(osm_physp_get_port_guid > >> - (p_physp)), > >> - osm_physp_get_port_num(p_physp)); > >> - } > >> > >> - if (rate > ib_port_info_compute_rate(p_pi)) { > >> + if (rate > ib_port_info_compute_rate(p_pi)) > >> rate = ib_port_info_compute_rate(p_pi); > >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> - "__osm_pr_rcv_get_path_parms: " > >> - "New smallest rate = %u at intervening port 0x%016" > >> - PRIx64 " port num 0x%X\n", rate, > >> - cl_ntoh64(osm_physp_get_port_guid > >> - (p_physp)), > >> - osm_physp_get_port_num(p_physp)); > >> - } > >> > >> /* > >> Continue with the egress port on this switch. > >> @@ -409,32 +429,41 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> CL_ASSERT(p_physp); > > It is not needed, run-time check is done right above. (I know it is not > > your code) > > Sure - removed. > > >> CL_ASSERT(osm_physp_is_valid(p_physp)); > >> > >> + p_node = osm_physp_get_node_ptr(p_physp); > >> + if (!p_node->sw) { > > Actually this !p_node->sw check duplicates the one above, where > > !p_node->sw is verified for ergess port of this switch. Right? > >> + /* > >> + * There is some sort of problem in the subnet object! 
> >> + * If this isn't a switch, we should have reached > >> + * the destination by now! > >> + */ > >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > >> + "__osm_pr_rcv_get_path_parms: ERR 1F04: " > >> + "Internal error, bad path\n"); > >> + status = IB_ERROR; > >> + goto Exit; > >> + } > >> + > >> p_pi = &p_physp->port_info; > >> > >> - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { > >> + if (mtu > ib_port_info_get_mtu_cap(p_pi)) > >> mtu = ib_port_info_get_mtu_cap(p_pi); > >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> - "__osm_pr_rcv_get_path_parms: " > >> - "New smallest MTU = %u at intervening port 0x%016" > >> - PRIx64 " port num 0x%X\n", mtu, > >> - cl_ntoh64(osm_physp_get_port_guid > >> - (p_physp)), > >> - osm_physp_get_port_num(p_physp)); > >> - } > >> > >> - if (rate > ib_port_info_compute_rate(p_pi)) { > >> + if (rate > ib_port_info_compute_rate(p_pi)) > >> rate = ib_port_info_compute_rate(p_pi); > >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> - "__osm_pr_rcv_get_path_parms: " > >> - "New smallest rate = %u at intervening port 0x%016" > >> - PRIx64 " port num 0x%X\n", rate, > >> - cl_ntoh64(osm_physp_get_port_guid > >> - (p_physp)), > >> - osm_physp_get_port_num(p_physp)); > >> - } > >> > >> + if (!p_rcv->p_subn->opt.no_qos) { > >> + /* > >> + * Check SL2VL table of the switch and update valid SLs > >> + */ > >> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); > >> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { > >> + if (valid_sls[i]) { > >> + vl = ib_slvl_table_get(p_slvl_tbl, i); > >> + if (vl == IB_DROP_VL) > >> + valid_sls[i] = FALSE; > >> + } > >> + } > >> + } > >> } > >> > >> /* > >> @@ -442,30 +471,104 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> */ > >> p_pi = &p_physp->port_info; > >> > >> - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { > >> + if (mtu > ib_port_info_get_mtu_cap(p_pi)) > >> mtu = 
ib_port_info_get_mtu_cap(p_pi); > >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > >> + > >> + if (rate > ib_port_info_compute_rate(p_pi)) > >> + rate = ib_port_info_compute_rate(p_pi); > >> + > >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> + "__osm_pr_rcv_get_path_parms: " > >> + "Path min MTU = %u, min rate = %u\n", > >> + mtu, rate); > >> + > >> + if (!p_rcv->p_subn->opt.no_qos) { > >> + /* > >> + * check whether there is some SL > >> + * that won't lead to VL15 eventually > >> + */ > >> + sl2vl_valid_path = FALSE; > >> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { > >> + if (valid_sls[i]) { > >> + sl2vl_valid_path = TRUE; > >> + first_valid_sl = i; > >> + break; > >> + } > >> + } > >> + > >> + if (!sl2vl_valid_path) { > >> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> + "__osm_pr_rcv_get_path_parms: " > >> + "All the SLs lead to VL15 on this path\n"); > >> + } > >> + status = IB_NOT_FOUND; > >> + goto Exit; > >> + } > >> + } > >> + > >> + if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { > >> + /* Get QoS Level object according to the path request */ > >> + osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, > >> + p_rcv, p_pr, > >> + p_src_physp, p_dest_physp, > >> + comp_mask, &p_qos_level); > >> + > >> + if (p_qos_level > >> + && osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { > >> osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> "__osm_pr_rcv_get_path_parms: " > >> - "New smallest MTU = %u at destination port 0x%016" > >> - PRIx64 "\n", mtu, > >> - cl_ntoh64(osm_physp_get_port_guid(p_physp))); > >> + "PathRecord request matches QoS Level '%s' (%s)\n", > >> + p_qos_level->name, > >> + (p_qos_level->use) ? 
p_qos_level-> > >> + use : "no description"); > >> + } > >> } > >> > >> - if (rate > ib_port_info_compute_rate(p_pi)) { > >> - rate = ib_port_info_compute_rate(p_pi); > >> + /* Adjust path parameters according to QoS settings */ > >> + > >> + if (p_qos_level) { > > Why to not make osm_qos_policy_get_qos_level_by_pr() returning pointer > > to p_qos_level? Then you could simply merge both conditions (this and > > one above), something like: > > if (!p_rcv->p_subn->opt.no_qos && > > p_rcv->p_subn->p_qos_policy && > > (p_qos_level = osm_qos_policy_get_qos_level_by_pr(..)) { > > Done. > > >> + if (p_qos_level->mtu_limit_set > >> + && (mtu > p_qos_level->mtu_limit)) > >> + mtu = p_qos_level->mtu_limit; > >> + > >> + if (p_qos_level->rate_limit_set > >> + && (rate > p_qos_level->rate_limit)) > >> + rate = p_qos_level->rate_limit; > >> + > >> + if (p_qos_level->pkt_life_set > >> + && (pkt_life > p_qos_level->pkt_life)) > >> + pkt_life = p_qos_level->pkt_life; > >> + > >> + if (p_qos_level->sl_set) { > >> + if (!valid_sls[p_qos_level->sl]) { > >> + status = IB_NOT_FOUND; > >> + goto Exit; > >> + } > >> + sl = p_qos_level->sl; > >> + } > >> + > >> if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > >> osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> "__osm_pr_rcv_get_path_parms: " > >> - "New smallest rate = %u at destination port 0x%016" > >> - PRIx64 "\n", rate, > >> - cl_ntoh64(osm_physp_get_port_guid(p_physp))); > >> + "Path params with QoS constaraints: " > >> + "min MTU = %u, min rate = %u, " > >> + "packet lifetime = %u, sl = %u\n", > >> + mtu, rate, pkt_life, sl); > >> } > >> > >> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > >> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> - "__osm_pr_rcv_get_path_parms: " > >> - "Path min MTU = %u, min rate = %u\n", mtu, rate); > >> + /* > >> + * Set packet lifetime. > >> + * According to spec definition IBA 1.2 Table 205 > >> + * PacketLifeTime description, for loopback paths, > >> + * packetLifeTime shall be zero. 
> >> + */ > >> + if (p_src_port == p_dest_port) > >> + pkt_life = 0; > >> + else if ( !(p_qos_level && p_qos_level->pkt_life_set) ) > >> + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > >> + > >> > >> /* > >> Determine if these values meet the user criteria > >> @@ -511,6 +614,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> break; > >> } > >> } > >> + if (status != IB_SUCCESS) > >> + goto Exit; > >> > >> /* we silently ignore cases where only the Rate selector is defined */ > >> if ((comp_mask & IB_PR_COMPMASK_RATESELEC) && > >> @@ -551,14 +656,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> break; > >> } > >> } > >> - > >> - /* Verify the pkt_life_time */ > >> - /* According to spec definition IBA 1.2 Table 205 PacketLifeTime > >> description, > >> - for loopback paths, packetLifeTime shall be zero. */ > >> - if (p_src_port == p_dest_port) > >> - pkt_life = 0; /* loopback */ > >> - else > >> - pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > >> + if (status != IB_SUCCESS) > >> + goto Exit; > >> > >> /* we silently ignore cases where only the PktLife selector is defined > >> */ > >> if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) && > >> @@ -603,12 +702,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> if (status != IB_SUCCESS) > >> goto Exit; > >> > >> - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > >> - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) > >> - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); > >> + /* > >> + * set Pkey for this path record request > >> + */ > >> + > >> + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && > >> + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) > >> + pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp); > > So is it was bug (not related to QoS) when p_physp instead of > > p_src_physp was used for pkey finding? > > I think so. Nice finding! 
> > >> + > >> else if (comp_mask & IB_PR_COMPMASK_PKEY) { > >> + /* > >> + * PR request has a specific pkey: > >> + * Check that source and destination share this pkey. > >> + * If QoS level has pkeys, check that this pkey exists > >> + * in the QoS level pkeys. > >> + * PR returned pkey is the requested pkey. > >> + */ > >> pkey = p_pr->pkey; > >> - if (!osm_physp_share_this_pkey(p_physp, p_dest_physp, pkey)) { > >> + if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) { > >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, > >> "__osm_pr_rcv_get_path_parms: ERR 1F1A: " > >> "Ports do not share specified PKey 0x%04x\n", > >> @@ -616,8 +727,37 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> status = IB_NOT_FOUND; > >> goto Exit; > >> } > >> + if (p_qos_level && p_qos_level->pkey_range_len && > >> + !osm_qos_level_has_pkey(p_qos_level, pkey)) { > >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > >> + "__osm_pr_rcv_get_path_parms: ERR 1F1D: " > >> + "Ports do not share PKeys defined by QoS level\n"); > >> + status = IB_NOT_FOUND; > >> + goto Exit; > >> + } > >> + > >> + } else if (p_qos_level && p_qos_level->pkey_range_len) { > >> + /* > >> + * PR request doesn't have a specific pkey, but QoS level > >> + * has pkeys - get shared pkey from QoS level pkeys > >> + */ > >> + pkey = osm_qos_level_get_shared_pkey(p_qos_level, > >> + p_src_physp, > >> + p_dest_physp); > >> + if (!pkey) { > >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > >> + "__osm_pr_rcv_get_path_parms: ERR 1F1E: " > >> + "Ports do not share PKeys defined by QoS level\n"); > >> + status = IB_NOT_FOUND; > >> + goto Exit; > >> + } > >> } else { > >> - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); > >> + /* > >> + * Neither PR request nor QoS level have pkey. > >> + * Just get any shared pkey. 
> >> + */ > >> + pkey = osm_physp_find_common_pkey(p_src_physp, > >> + p_dest_physp); > >> if (!pkey) { > >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, > >> "__osm_pr_rcv_get_path_parms: ERR 1F1B: " > >> @@ -627,14 +767,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> } > >> } > >> > >> - if (p_rcv->p_subn->opt.routing_engine_name && > >> - strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) > >> - /* slid and dest_lid are stored in network in lash */ > >> - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, > >> - p_dest_port); > >> - else > >> - sl = OSM_DEFAULT_SL; > >> - > >> if (pkey) { > >> p_prtn = > >> (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, > >> @@ -642,34 +774,80 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const > >> p_rcv, > >> 0x8000)); > >> if (p_prtn == > >> (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) > >> + p_prtn = NULL; > >> + } > >> + > >> + /* > >> + * Set PathRecord SL. > >> + * > >> + * ToDo: What about QoS and LASH routing? How can they coexist? > >> + * And what happens when there's a pkey, hence there is a > >> + * partition with a certain SL, and this SL doesn't match > >> + * the one that's defined by LASH? 
> >> + */ > >> + > >> + if (comp_mask & IB_PR_COMPMASK_SL) { > >> + /* > >> + * Specific SL was requested > >> + */ > >> + sl = ib_path_rec_sl(p_pr); > >> + if (p_qos_level && p_qos_level->sl_set && (p_qos_level->sl != sl)) { > >> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > >> + "__osm_pr_rcv_get_path_parms: ERR 1F1F: " > >> + "QoS constaraints: required PR SL (%u) " > >> + "doesn't match QoS SL (%u)\n", > >> + sl, p_qos_level->sl); > >> + status = IB_NOT_FOUND; > >> + goto Exit; > >> + } > >> + } else if (p_qos_level && p_qos_level->sl_set) { > >> + /* > >> + * No specific SL was requested, > >> + * but there is an SL in QoS level > >> + */ > >> + sl = p_qos_level->sl; > >> + if (pkey && p_prtn && p_prtn->sl != p_qos_level->sl) > >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > >> + "__osm_pr_rcv_get_path_parms: " > >> + "QoS level SL (%u) overrides partition SL (%u)\n", > >> + p_qos_level->sl, p_prtn->sl); > >> + } else if (pkey) { > >> + /* > >> + * No specific SL in request or in QoS level - use partition SL > >> + */ > >> + if (!p_prtn) { > >> /* this may be possible when pkey tables are created somehow in > >> previous runs or things are going wrong here */ > >> osm_log(p_rcv->p_log, OSM_LOG_ERROR, > >> "__osm_pr_rcv_get_path_parms: ERR 1F1C: " > >> "No partition found for PKey 0x%04x - using default SL %d\n", > >> cl_ntoh16(pkey), sl); > >> - else { > >> - if (p_rcv->p_subn->opt.routing_engine_name && > >> - strcmp(p_rcv->p_subn->opt.routing_engine_name, > >> - "lash") == 0) > >> - /* slid and dest_lid are stored in network in lash */ > >> - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, > >> - p_src_port, p_dest_port); > >> - else > >> - sl = p_prtn->sl; > >> - } > >> - > >> - /* reset pkey when raw traffic */ > >> - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && > >> - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) > >> - pkey = 0; > >> + } else > >> + sl = p_prtn->sl; > >> + } else if (p_rcv->p_subn->opt.routing_engine_name && > >> + 
strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) { > > It seems that in original code LASH was "higher" priority in SL > > selection than partition configuration? If so, any reason why it is > > changed? > > No particular reason - it just seemed right at the moment. > I'll rework it so that the relative priorities of partition > and lash routing will remain as they were before. > In any case, is there any particular reason why lash SL > should have higher priority than partition's SL? I think so: LASH can be turned on or off with just a command-line option, so to avoid conflicts with partitions we might need to rewrite the partitions config file each time we want to run LASH. I think the original "priorities" were fine. > Regardless what the answer is, there'll be a conflict when a > specific pkey was requested in PathRecord and this partition > has SL different from what lash defines. Yes, of course - LASH requires better integration, not just with partitions but with QoS too. Want to fix this as well? :) Sasha From jackm at dev.mellanox.co.il Thu Sep 6 07:54:43 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 6 Sep 2007 17:54:43 +0300 Subject: [ofa-general] [PATCH] libmlx4: adjust max_recv_wr capability to be non-zero for non-SRQ qp's Message-ID: <200709061754.43922.jackm@dev.mellanox.co.il> max_recv_wr must also be non-zero for QPs which are not associated with an SRQ. Signed-off-by: Jack Morgenstein --- Roland, Without this patch, if the user requested max_recv_wr = 0, this will be passed as-is to the kernel layer.
In the kernel, the create-qp will fail because of the (correct) check in file: drivers/infiniband/hw/mlx4/qp.c, procedure set_rq_size(): /* HW requires >= 1 RQ entry with >= 1 gather entry */ if (is_user && (!cap->max_recv_wr || !cap->max_recv_sge)) return -EINVAL; In the patch, I added the adjustment after: qp->rq.wqe_cnt = align_queue_size(attr->cap.max_recv_wr); since the align_queue_size macro yields the same result for 1 as it does for 0. Index: libmlx4/src/verbs.c =================================================================== --- libmlx4.orig/src/verbs.c 2007-09-06 16:29:36.000000000 +0300 +++ libmlx4/src/verbs.c 2007-09-06 16:34:55.032294000 +0300 @@ -367,8 +367,12 @@ struct ibv_qp *mlx4_create_qp(struct ibv if (attr->srq) attr->cap.max_recv_wr = qp->rq.wqe_cnt = 0; - else if (attr->cap.max_recv_sge < 1) - attr->cap.max_recv_sge = 1; + else { + if (attr->cap.max_recv_sge < 1) + attr->cap.max_recv_sge = 1; + if (attr->cap.max_recv_wr < 1) + attr->cap.max_recv_wr = 1; + } if (mlx4_alloc_qp_buf(pd, &attr->cap, attr->qp_type, qp)) goto err; From sashak at voltaire.com Thu Sep 6 08:24:28 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 18:24:28 +0300 Subject: [ofa-general] Re: management/libibcommon In-Reply-To: <6C2C79E72C305246B504CBA17B5500C902374493@mtlexch01.mtl.com> References: <20070906112307.GI25330@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C902374493@mtlexch01.mtl.com> Message-ID: <20070906152428.GT25330@sashak.voltaire.com> Hi Vlad, On 15:25 Thu 06 Sep , Vladimir Sokolovsky wrote: > > -----Original Message----- > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > Sent: Thursday, September 06, 2007 2:23 PM > > To: OpenIB > > Cc: Hal Rosenstock; Ira Weiny; Eitan Zahavi; Sean Hefty; Dotan Barak; > > Vladimir Sokolovsky > > Subject: management/libibcommon > > > > Hi All, > > > > Currently we have libibcommon library under OFA management project. 
> > Partially it is used by libibumad and partially by libibmad and > > infiniband-diags. The used things look pretty separate so I'm thinking > > to strip libibcommon as whole library and its components over > libibumad > > and libibmad - this will remove extra dependency for libibumad. > > > > Anybody else (except management) uses libibcommon? Any comments, > > objections? > > > > Sasha > > Hi Sasha, > AFAIK, mvapich, srptools and ibutils use libibcommon. ibutils uses -libcommon for linking only because it is required by libibumad. srptools uses libibcommon directly, but mostly things which I wanted to merge into libibumad, so should not be big problem there. Didn't check mvapich yet. > In any case if you are going to remove libibcommon, then ofabuild and > ofed_1_3_scripts should be updated as well. > I will be in vacation from 10 Sep 2007 till 2 Oct 2007. Please update > Tziporet Koren and Michael Tsirkin with your decisions. Anyway since libibcommon elimination touches couple of projects and requires changes in ofed scripts I think it is not a great idea to do it in last days before OFED-1.3 feature freeze. I will wait for OFED-1.3 split. Sasha From avi at qumranet.com Thu Sep 6 08:17:16 2007 From: avi at qumranet.com (Avi Kivity) Date: Thu, 06 Sep 2007 18:17:16 +0300 Subject: [ofa-general] Re: [PATCH][RFC] pte notifiers -- support for external page tables In-Reply-To: References: <11890207643068-git-send-email-avi@qumranet.com> Message-ID: <46E019FC.5000001@qumranet.com> Andi Kleen wrote: > Avi Kivity writes: > >> pte notifiers are different from paravirt_ops: they extend the normal >> page tables rather than replace them; and they provide high-level information >> such as the vma and the virtual address for the driver to use. >> > > Sounds like a locking horror to me. To do anything with page tables > you need locks. Both for the kernel page tables and for your new tables. 
> > What happens when people add all > things of complicated operations in these notifiers? That will likely > happen and then everytime you change something in VM code they > will break. This has the potential to increase the cost of maintaining > VM code considerably, which would be a bad thing. > > This is quite different from paravirt ops because low level pvops > can typically run lockless by just doing some kind of hypercall directly. > But that won't work for maintaining your custom page tables. > Okay, here's a possible fix: add ->lock() and ->unlock() callbacks, to be called when mmap_sem is taken either for read or write. Also add a ->release() for when the mm goes away to avoid the need to care about the entire data structure going away. The notifier list would need to be kept sorted to avoid deadlocks. -- Any sufficiently difficult bug is indistinguishable from a feature. From sashak at voltaire.com Thu Sep 6 08:32:55 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 18:32:55 +0300 Subject: [ofa-general] Re: management/libibcommon In-Reply-To: <20070906152428.GT25330@sashak.voltaire.com> References: <20070906112307.GI25330@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C902374493@mtlexch01.mtl.com> <20070906152428.GT25330@sashak.voltaire.com> Message-ID: <20070906153255.GV25330@sashak.voltaire.com> On 18:24 Thu 06 Sep , Sasha Khapyorsky wrote: > Hi Vlad, > > On 15:25 Thu 06 Sep , Vladimir Sokolovsky wrote: > > > -----Original Message----- > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > Sent: Thursday, September 06, 2007 2:23 PM > > > To: OpenIB > > > Cc: Hal Rosenstock; Ira Weiny; Eitan Zahavi; Sean Hefty; Dotan Barak; > > > Vladimir Sokolovsky > > > Subject: management/libibcommon > > > > > > Hi All, > > > > > > Currently we have libibcommon library under OFA management project. > > > Partially it is used by libibumad and partially by libibmad and > > > infiniband-diags. 
The used things look pretty separate so I'm thinking > > > to strip libibcommon as whole library and its components over > > libibumad > > > and libibmad - this will remove extra dependency for libibumad. > > > > > > Anybody else (except management) uses libibcommon? Any comments, > > > objections? > > > > > > Sasha > > > > Hi Sasha, > > AFAIK, mvapich, srptools and ibutils use libibcommon. > > ibutils uses -libcommon for linking only because it is required by > libibumad. > srptools uses libibcommon directly, but mostly things which I wanted > to merge into libibumad, so should not be big problem there. > Didn't check mvapich yet. Hmm, find mvapich-gen2 -type f | xargs egrep 'ibcommon|ibmad|ibumad' returns nothing. I'm not sure that mvapich can be affected. Do you know something about how/where libibcommon is used in mvapich? Sasha > > > In any case if you are going to remove libibcommon, then ofabuild and > > ofed_1_3_scripts should be updated as well. > > I will be in vacation from 10 Sep 2007 till 2 Oct 2007. Please update > > Tziporet Koren and Michael Tsirkin with your decisions. > > Anyway since libibcommon elimination touches couple of projects and > requires changes in ofed scripts I think it is not a great idea to do > it in last days before OFED-1.3 feature freeze. I will wait for > OFED-1.3 split. 
> > Sasha From hal.rosenstock at gmail.com Thu Sep 6 08:27:06 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 6 Sep 2007 11:27:06 -0400 Subject: [ofa-general] Re: management/libibcommon In-Reply-To: <20070906153255.GV25330@sashak.voltaire.com> References: <20070906112307.GI25330@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C902374493@mtlexch01.mtl.com> <20070906152428.GT25330@sashak.voltaire.com> <20070906153255.GV25330@sashak.voltaire.com> Message-ID: On 9/6/07, Sasha Khapyorsky wrote: > On 18:24 Thu 06 Sep , Sasha Khapyorsky wrote: > > Hi Vlad, > > > > On 15:25 Thu 06 Sep , Vladimir Sokolovsky wrote: > > > > -----Original Message----- > > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > > Sent: Thursday, September 06, 2007 2:23 PM > > > > To: OpenIB > > > > Cc: Hal Rosenstock; Ira Weiny; Eitan Zahavi; Sean Hefty; Dotan Barak; > > > > Vladimir Sokolovsky > > > > Subject: management/libibcommon > > > > > > > > Hi All, > > > > > > > > Currently we have libibcommon library under OFA management project. > > > > Partially it is used by libibumad and partially by libibmad and > > > > infiniband-diags. The used things look pretty separate so I'm thinking > > > > to strip libibcommon as whole library and its components over > > > libibumad > > > > and libibmad - this will remove extra dependency for libibumad. > > > > > > > > Anybody else (except management) uses libibcommon? Any comments, > > > > objections? > > > > > > > > Sasha > > > > > > Hi Sasha, > > > AFAIK, mvapich, srptools and ibutils use libibcommon. > > > > ibutils uses -libcommon for linking only because it is required by > > libibumad. > > srptools uses libibcommon directly, but mostly things which I wanted > > to merge into libibumad, so should not be big problem there. > > Didn't check mvapich yet. > > Hmm, > > find mvapich-gen2 -type f | xargs egrep 'ibcommon|ibmad|ibumad' > > returns nothing. I'm not sure that mvapich can be affected. 
Do you know > something about how/where libibcommon is used in mvapich? Not sure about mvapich2 but I think it is the same as mvapich in this regard. I think it uses libibumad via the SA client vendor API (similar to osmtest in that regard). -- Hal > > Sasha > > > > > > In any case if you are going to remove libibcommon, then ofabuild and > > > ofed_1_3_scripts should be updated as well. > > > I will be in vacation from 10 Sep 2007 till 2 Oct 2007. Please update > > > Tziporet Koren and Michael Tsirkin with your decisions. > > > > Anyway since libibcommon elimination touches couple of projects and > > requires changes in ofed scripts I think it is not a great idea to do > > it in last days before OFED-1.3 feature freeze. I will wait for > > OFED-1.3 split. > > > > Sasha > From vm1017799987-1298937141tl at ml.unmcs.com Thu Sep 6 08:30:50 2007 From: vm1017799987-1298937141tl at ml.unmcs.com (Intradot Labs) Date: Thu, 6 Sep 2007 17:30:50 +0200 (CEST) Subject: [ofa-general] Because your business never stops Message-ID: <20070906153050.9573212442A@auguste.alinto.net> BECAUSE YOUR BUSINESS NEVER STOPS Intradot protects your network and backs up your data with complete simplicity. 100% automated backups, 24 hours a day, of all the data on the network! The business of transport professionals depends more than ever on the good health of their information systems. Just-in-time flows, the urgent and confidential nature of the stored data, and growing data volumes mean that particular attention must be paid to backing up that data and protecting the network. - What would happen if you lost all or part of your computer data? - What would the consequences be for your business if data were lost? - And if data were stolen by a third party, how heavy would the legal implications be for your company?
A leader in computer data protection, Intradot Labs presents, with Boss and Distribackup, the first fully automated, end-to-end continuous backup solution. Deployed within your network, it guarantees the availability of the data on your computer network and its rapid restoration in the event of a disaster. # Watch the "30 seconds" presentation to discover Boss: http://www.intradot.com/produits/boss/boss-sx-generalites.html # Sign up for one of our weekly web conferences to discover the Intradot solutions: http://www.intradot.com/societe/evenements.html # Receive a brochure of our products in PDF format or by mail on request: http://www.intradot.com/divers/nous-contacter.html#form1 -- Copyright © 2007 Intradot. All rights reserved. Unsubscribe: http://tracking.unmcs.com/desinscription.php?ids=fr,1017799987,1298937141 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Thu Sep 6 08:50:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 6 Sep 2007 18:50:51 +0300 Subject: [ofa-general] Re: [opensm] bugs in build system In-Reply-To: <6C2C79E72C305246B504CBA17B5500C902314495@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com> <20070904203621.GI23670@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C902314495@mtlexch01.mtl.com> Message-ID: <20070906155051.GW25330@sashak.voltaire.com> On 09:37 Wed 05 Sep , Eitan Zahavi wrote: > > Patch tested. Works great. But actually it is not so correct. OpenSM (and other management components) uses header files from local tree, but is linked against installed libraries. Before this patch it was not able to find header files locally (due to broken paths) and used (by mistake) installed header files - at least it was consistent. I think there are two possible solutions for this: 1.
To use only installed header files and libraries 2. To use header files and libraries from local tree and if it doesn't exist fall back to installed ones. Personally I like (2) more - it is more complicated, but we will be able to build and run OpenSM without any libumad, libibcommon installations. Any comments? Sasha > > Thanks > > Eitan > > > -----Original Message----- > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > Sent: Tuesday, September 04, 2007 11:36 PM > > To: Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: Re: [opensm] bugs in build system > > > > Hi again, Eitan, > > > > On 17:02 Sun 02 Sep , Eitan Zahavi wrote: > > > Hi Sasha, > > > > > > For some reason OpenSM (and the required management libs) > > do not build > > > correctly when I use manual autogen.sh, configure > > --prefix=/tmp/ez/usr > > > ; make; make install mode. > > > > > > It seems the build system is probably broken as it relies on fixed > > > paths? > > > > It is not, but it relies to invalid paths like > > -I.../include/infiniband when in the code '#include > > ' is used. > > > > > OK 3. cd management/libibumad; autogen.sh; FAIL 4. ./configure > > > --prefix=/tmp/ez/usr checking for sys_read_string in > > -libcommon... no > > > configure: error: sys_read_string() not found. libibumad requires > > > libibcommon. > > > > > > To overcome this I manually added the --disable-libcheck > > ./configure > > > --prefix=/tmp/ez/usr --disable-libcheck I do not understand > > why after > > > installing the common lib I still get this error? > > > Isn't the search path should include the /lib ??? > > > > Seems it is AC_CHECK_LIB() feature (ugh - I hate autotools mess :)) > > > > I'm not really sure such checks should be there. libibcommon > > library is part of our project and not "external" library. > > > > > FAIL 5. make > > > Make fails as it does not find the infiniband/common.h > > > > Wrong include path in Makefile.am - it uses include/infiniband. 
> > > > > To overcome this I manually added -I/include .... > > > make CFLAGS="-I/tmp/ez/usr/include" > > > > > > OK 6. make install > > > --------------- OPENSM ------------------ OK 7. cd > > management/opensm; > > > autogen.sh; FAIL 8. configure --prefix=/tmp/ez/usr checking for > > > umad_init in -libumad... no > > > configure: error: umad_init() not found. libosmvendor of > > type openib > > > requires libibumad. > > > configure: error: /bin/sh './configure' failed for libvendor > > > > > > To overcome this I manually added the --disable-libcheck > > ./configure > > > --prefix=/tmp/ez/usr --disable-libcheck This problem is same as the > > > above: lib path for linking should use the /lib. > > > > > > FAIL 9. make > > > Here again the include path is missing the /include: > > > > > > ./../include/vendor/osm_vendor_ibumad.h:44:31: > > infiniband/common.h: No > > > such file or directory > > > ./../include/vendor/osm_vendor_ibumad.h:45:29: > > infiniband/umad.h: No > > > such file or directory > > > > Wrong OSMV_INCLUDES definition (it uses paths include/infiniband ). > > > > > To overcome this I manually added -I/include .... > > > make CFLAGS="-I/tmp/ez/usr/include" > > > > > > But this is not enough as the linker fail: > > > /usr/bin/ld: cannot find -libumad > > > > It seems to be buggy opensm_LDADD in Makefile.am > > > > > To overcome this I had to add -L/lib .... > > > make CFLAGS="-I/tmp/ez/usr/include" LDFLAGS="-L/tmp/ez/usr/lib > > > -libumad -libcommon" > > > > > > OK 10. make install > > > > > > I hope the above issues could be fixed such that the installation > > > would be simpler. > > > > Could you test the patch please (you still need to use > > '--disable-libcheck' with ./configure)? Thanks. > > > > Sasha > > > > > > diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am > > index 48868e7..7e82590 100644 > > --- a/libibumad/Makefile.am > > +++ b/libibumad/Makefile.am > > @@ -2,7 +2,7 @@ > > SUBDIRS = . 
> > > > INCLUDES = -I$(srcdir)/include/infiniband \ > > - -I$(srcdir)/../libibcommon/include/infiniband > > + -I$(srcdir)/../libibcommon/include > > > > man_MANS = man/umad_debug.3 man/umad_get_ca.3 \ > > man/umad_get_ca_portguids.3 man/umad_get_cas_names.3 > > \ diff --git a/opensm/config/osmvsel.m4 > > b/opensm/config/osmvsel.m4 index 47ad36f..97d5a9e 100644 > > --- a/opensm/config/osmvsel.m4 > > +++ b/opensm/config/osmvsel.m4 > > @@ -61,11 +61,11 @@ with_sim="/usr") > > dnl based on the with_osmv we can try the vendor flag if > > test $with_osmv = "openib"; then > > OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" > > - OSMV_INCLUDES="-I\$(srcdir)/../include > > -I\$(srcdir)/../../libibcommon/include/infiniband > > -I\$(srcdir)/../../libibumad/include/infiniband" > > - if test "x$with_umad_libs" = "x"; then > > - OSMV_LDADD="-libumad" > > - else > > - OSMV_LDADD="-L$with_umad_libs -libumad" > > + OSMV_INCLUDES="-I\$(srcdir)/../include > > -I\$(srcdir)/../../libibcommon/include > > -I\$(srcdir)/../../libibumad/include" > > + OSMV_LDADD="-L\$(libdir) -libumad -libcommon" > > + > > + if test "x$with_umad_libs" != "x"; then > > + OSMV_LDADD="-L$with_umad_libs $OSMV_LDADD" > > fi > > > > if test "x$with_umad_includes" != "x"; then > > From jlentini at netapp.com Thu Sep 6 08:42:00 2007 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Sep 2007 11:42:00 -0400 (EDT) Subject: [ofa-general] Low NFS RDMA performance with Connect X In-Reply-To: <71d336490709060136k45738d1cq557eb6a6783035f5@mail.gmail.com> References: <27f776af0709040746u4038cc8ck7e9160c07b756936@mail.gmail.com> <71d336490709060136k45738d1cq557eb6a6783035f5@mail.gmail.com> Message-ID: On Thu, 6 Sep 2007, Ramachandra K wrote: > On 9/5/07, James Lentini wrote: > > Both the client and server code bases have been updated substantially > > since the Mellanox SDK was released. Results are likely to be > > different on the newer code. > > Is the latest code available somewhere ? 
The latest server code is available from git://linux-nfs.org/~tomtucker/nfs-rdma-dev-2.6.git The latest client code was posted to nfs at lists.sourceforge.net by Tom Talpey on July 11: http://sourceforge.net/mailarchive/forum.php?forum_name=nfs&max_rows=25&style=ultimate&viewmonth=200707&viewday=11 A new release of the client is imminent. Watch this list for an announcement. From eitan at mellanox.co.il Thu Sep 6 08:41:36 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 6 Sep 2007 18:41:36 +0300 Subject: [ofa-general] RE: [opensm] bugs in build system In-Reply-To: <20070906155051.GW25330@sashak.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C9022ACFC4@mtlexch01.mtl.com> <20070904203621.GI23670@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C902314495@mtlexch01.mtl.com> <20070906155051.GW25330@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C9023745D9@mtlexch01.mtl.com> Hi Sasha, I agree option 2 is the way to go. EZ > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Thursday, September 06, 2007 6:51 PM > To: Eitan Zahavi; Hal Rosenstock > Cc: openib-general at openib.org > Subject: Re: [opensm] bugs in build system > > On 09:37 Wed 05 Sep , Eitan Zahavi wrote: > > > > Patch tested. Works great. > > But actually it is not so correct. OpenSM (and other management > components) uses header files from local tree, but is linked > against installed libraries. Before this patch it was not > able to find header files locally (due to broken paths) and > used (by mistake) installed header files - at least it was > consistent. > > I think there are two possible solutions for this: > > 1. To use only installed header files and libraries 2. To use > header files and libraries from local tree and if it doesn't > exist fall back to installed ones. 
> > Personally I like (2) more - it is more complicated, but we > will be able to build and run OpenSM without any libumad, > libibcommon installations. > > Any comments? > > Sasha > > > > > Thanks > > > > Eitan > > > > > -----Original Message----- > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > Sent: Tuesday, September 04, 2007 11:36 PM > > > To: Eitan Zahavi > > > Cc: openib-general at openib.org > > > Subject: Re: [opensm] bugs in build system > > > > > > Hi again, Eitan, > > > > > > On 17:02 Sun 02 Sep , Eitan Zahavi wrote: > > > > Hi Sasha, > > > > > > > > For some reason OpenSM (and the required management libs) > > > do not build > > > > correctly when I use manual autogen.sh, configure > > > --prefix=/tmp/ez/usr > > > > ; make; make install mode. > > > > > > > > It seems the build system is probably broken as it > relies on fixed > > > > paths? > > > > > > It is not, but it relies to invalid paths like > > > -I.../include/infiniband when in the code '#include > > > ' is used. > > > > > > > OK 3. cd management/libibumad; autogen.sh; FAIL 4. ./configure > > > > --prefix=/tmp/ez/usr checking for sys_read_string in > > > -libcommon... no > > > > configure: error: sys_read_string() not found. > libibumad requires > > > > libibcommon. > > > > > > > > To overcome this I manually added the --disable-libcheck > > > ./configure > > > > --prefix=/tmp/ez/usr --disable-libcheck I do not understand > > > why after > > > > installing the common lib I still get this error? > > > > Isn't the search path should include the /lib ??? > > > > > > Seems it is AC_CHECK_LIB() feature (ugh - I hate > autotools mess :)) > > > > > > I'm not really sure such checks should be there. > libibcommon library > > > is part of our project and not "external" library. > > > > > > > FAIL 5. make > > > > Make fails as it does not find the infiniband/common.h > > > > > > Wrong include path in Makefile.am - it uses include/infiniband. 
> > > > > > > To overcome this I manually added -I/include .... > > > > make CFLAGS="-I/tmp/ez/usr/include" > > > > > > > > OK 6. make install > > > > --------------- OPENSM ------------------ OK 7. cd > > > management/opensm; > > > > autogen.sh; FAIL 8. configure --prefix=/tmp/ez/usr checking for > > > > umad_init in -libumad... no > > > > configure: error: umad_init() not found. libosmvendor of > > > type openib > > > > requires libibumad. > > > > configure: error: /bin/sh './configure' failed for libvendor > > > > > > > > To overcome this I manually added the --disable-libcheck > > > ./configure > > > > --prefix=/tmp/ez/usr --disable-libcheck This problem is same as > > > > the > > > > above: lib path for linking should use the /lib. > > > > > > > > FAIL 9. make > > > > Here again the include path is missing the /include: > > > > > > > > ./../include/vendor/osm_vendor_ibumad.h:44:31: > > > infiniband/common.h: No > > > > such file or directory > > > > ./../include/vendor/osm_vendor_ibumad.h:45:29: > > > infiniband/umad.h: No > > > > such file or directory > > > > > > Wrong OSMV_INCLUDES definition (it uses paths > include/infiniband ). > > > > > > > To overcome this I manually added -I/include .... > > > > make CFLAGS="-I/tmp/ez/usr/include" > > > > > > > > But this is not enough as the linker fail: > > > > /usr/bin/ld: cannot find -libumad > > > > > > It seems to be buggy opensm_LDADD in Makefile.am > > > > > > > To overcome this I had to add -L/lib .... > > > > make CFLAGS="-I/tmp/ez/usr/include" LDFLAGS="-L/tmp/ez/usr/lib > > > > -libumad -libcommon" > > > > > > > > OK 10. make install > > > > > > > > I hope the above issues could be fixed such that the > installation > > > > would be simpler. > > > > > > Could you test the patch please (you still need to use > > > '--disable-libcheck' with ./configure)? Thanks. 
> > > > > > Sasha > > > > > > > > > diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am index > > > 48868e7..7e82590 100644 > > > --- a/libibumad/Makefile.am > > > +++ b/libibumad/Makefile.am > > > @@ -2,7 +2,7 @@ > > > SUBDIRS = . > > > > > > INCLUDES = -I$(srcdir)/include/infiniband \ > > > - -I$(srcdir)/../libibcommon/include/infiniband > > > + -I$(srcdir)/../libibcommon/include > > > > > > man_MANS = man/umad_debug.3 man/umad_get_ca.3 \ > > > man/umad_get_ca_portguids.3 man/umad_get_cas_names.3 \ diff > > > --git a/opensm/config/osmvsel.m4 > > > b/opensm/config/osmvsel.m4 index 47ad36f..97d5a9e 100644 > > > --- a/opensm/config/osmvsel.m4 > > > +++ b/opensm/config/osmvsel.m4 > > > @@ -61,11 +61,11 @@ with_sim="/usr") dnl based on the > with_osmv we > > > can try the vendor flag if test $with_osmv = "openib"; then > > > OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" > > > - OSMV_INCLUDES="-I\$(srcdir)/../include > > > -I\$(srcdir)/../../libibcommon/include/infiniband > > > -I\$(srcdir)/../../libibumad/include/infiniband" > > > - if test "x$with_umad_libs" = "x"; then > > > - OSMV_LDADD="-libumad" > > > - else > > > - OSMV_LDADD="-L$with_umad_libs -libumad" > > > + OSMV_INCLUDES="-I\$(srcdir)/../include > > > -I\$(srcdir)/../../libibcommon/include > > > -I\$(srcdir)/../../libibumad/include" > > > + OSMV_LDADD="-L\$(libdir) -libumad -libcommon" > > > + > > > + if test "x$with_umad_libs" != "x"; then > > > + OSMV_LDADD="-L$with_umad_libs $OSMV_LDADD" > > > fi > > > > > > if test "x$with_umad_includes" != "x"; then > > > > From jgunthorpe at obsidianresearch.com Thu Sep 6 08:54:18 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 6 Sep 2007 09:54:18 -0600 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <46DFA0CB.2070605@voltaire.com> References: <20070904172725.GH4472@obsidianresearch.com> <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> 
<20070905051040.GM28350@mellanox.co.il> <20070905055108.GB16535@obsidianresearch.com> <20070905061913.GN28350@mellanox.co.il> <20070905170545.GM4472@obsidianresearch.com> <15ddcffd0709051335l7ba8a976v1535ba8a6e923206@mail.gmail.com> <20070906002029.GR4472@obsidianresearch.com> <46DFA0CB.2070605@voltaire.com> Message-ID: <20070906155418.GT4472@obsidianresearch.com> On Thu, Sep 06, 2007 at 09:40:11AM +0300, Or Gerlitz wrote: > >Micheal has made it so you can use 'csum offload' (via disabling csum) > >on any nic. You can also do the same kind of thing for TSO/GSO. If you > >send jumbo TSO/GSO packets in a chunk the receiver can then do > >LRO. Win all around. Sort of like jumbo MTU but without actually > >changing the MTU. > > > >This is all basically the same set of techniques we see between a > >Linux guest and the linux host in a virtualization environment. > > Thanks for the clarification, I have to do some catchup here on the > details re TSO/GSO and their relation to virtualization, however, to > make things a little clearer to me, do you agree that as James pointed > over this thread in > > A (IB) ---- B (Gateway eg HW based) ---- C (Ethernet) > > scheme, in case A does not compute the TCP checksum of a packet, its > note the role of the gateway to do so, and C would just drop it?! I think the proper way to view Michael's patch, and indeed this whole idea, is that it just moves the work around, with the goal of eliminating the work for a class of communication (Linux host to Linux host). So yes, if a gateway uses this feature then it must regenerate the checksum before it forwards it. It is actually a pretty neat idea, I've never heard of another network doing this. I wouldn't call it hardware checksum, but more like a peer-to-peer VNIC scheme. Nobody would object if a vnic driver moved checksum and segmentation offload to the VNIC device over a RC QP, and I think the same rational for that applies here, except it is now peer to peer. 
(Michael, maybe that is a good name for this concept: p2p_vnic?) FWIW, general gateways do have a bit of a problem doing the csum insertion because there are a lot of cases and new protocols do crop up from time to time. It would be best if part of the information sent in this case was instructions on how to do the insertion, like a general Ethernet chip would use. Jason From mshefty at ichips.intel.com Thu Sep 6 09:13:43 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Sep 2007 09:13:43 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 1/5 v3] ib/ipoib: specify Traffic Class with PR queries for QoS support In-Reply-To: <46DFA1B6.90001@voltaire.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> <46DE7D99.7000508@voltaire.com> <000101c7f009$6472de50$3c98070a@amr.corp.intel.com> <46DFA1B6.90001@voltaire.com> Message-ID: <46E02737.7060804@ichips.intel.com> > OK, thanks. Have you tried this patch set against some SM/SA to see that > it does not break things and if it does, then what and how do we fix it? I have not tested the srp patch, but did test the others against _some_ version of opensm, and didn't notice any issues. My testing was fairly limited, and I'm sure that whatever version of opensm is running on my cluster doesn't contain QoS support. Watching the QoS patches go by, I'm not even sure of the current state of QoS support in opensm. - Sean From ramachandra.kuchimanchi at qlogic.com Thu Sep 6 09:36:19 2007 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 6 Sep 2007 22:06:19 +0530 Subject: [ofa-general] Re: Low NFS RDMA performance with Connect X In-Reply-To: <20070906025244.GJ28361@mellanox.co.il> References: <27f776af0709040746u4038cc8ck7e9160c07b756936@mail.gmail.com> <20070906025244.GJ28361@mellanox.co.il> Message-ID: <71d336490709060936q69dbcac1m1152fd331a82ebd9@mail.gmail.com> On 9/6/07, Michael S.
Tsirkin wrote: > > Quoting James Lentini : > > Both the client and server code bases have been updated substantially > > since the Mellanox SDK was released. Results are likely to be > > different on the newer code. > > > > Finally, it is conceivable that there will need to be performance > > tweaks for the Connect X hardware. For Tavor hardware, ULPs use a 1KB > > MTU to achieve maximum performance (see the setup of the path_mtu QP > > attribute in net/sunrpc/xprtrdma/verbs.c). > > One thing worth a try is interrupt coalescing. > The simplest way to check is probably to apply the following patch > and see if it helps. You can also try tweaking cq_max_count and > cq_period module parameters. I tried the interrupt coalescing patch, both with the default values for cq_max_count and cq_period and also with some tweaking. Though I noticed some small changes in the numbers (both positive and negative, for different file sizes), these are still nowhere near the MT25208 numbers. Regards, Ram From mshefty at ichips.intel.com Thu Sep 6 09:57:26 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Sep 2007 09:57:26 -0700 Subject: [ofa-general] Re: [PATCH] librdmacm 1/2: add valgrind support to auto-tools configuration file In-Reply-To: <46DFE93B.60702@dev.mellanox.co.il> References: <200708151352.42026.dotanb@dev.mellanox.co.il> <000201c7f00b$5826e900$3c98070a@amr.corp.intel.com> <46DFE93B.60702@dev.mellanox.co.il> Message-ID: <46E03176.3010209@ichips.intel.com> > I have a comment only on your last choice: I don't know the feature > history of valgrind but I believe that > there were versions which had the file memcheck.h without the mentioned > macro. > > I would like to leave the code that handles this issue like it was in > the original patch (if it is fine with you). I checked a couple of older valgrind releases, and you are correct. There are versions where it is undefined. I've reverted this change back to match your original patch. Thanks.
- Sean From mst at dev.mellanox.co.il Thu Sep 6 10:07:21 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 6 Sep 2007 20:07:21 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: References: <200709061138.l86BcgYb005214@cmf.nrl.navy.mil> Message-ID: <20070906170721.GB10559@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] Re: [PATCHv2] IB/ipoib: S/G and HW checksum support > > >Assuming routing works, even if this means you trust the IB-Eth gateway not to > >corrupt the packet, I'm looking for name that makes this clear. > > I haven't had a chance to do much this week (still at the kernel > summit). However, my view is that this patch is *very* dangerous and > I don't like it much. But maybe if we name the option something like > "enable_silent_data_corruption" that would be sufficient warning for users. In the absence of IP routers, is this worse than SDP or SRP somehow? If no, maybe we can just disable routing for this case, and call the option "disable_routing"? -- MST From mst at dev.mellanox.co.il Thu Sep 6 10:12:23 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 6 Sep 2007 20:12:23 +0300 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070906155418.GT4472@obsidianresearch.com> References: <20070904174843.GG28350@mellanox.co.il> <20070904193547.GI4472@obsidianresearch.com> <20070905051040.GM28350@mellanox.co.il> <20070905055108.GB16535@obsidianresearch.com> <20070905061913.GN28350@mellanox.co.il> <20070905170545.GM4472@obsidianresearch.com> <15ddcffd0709051335l7ba8a976v1535ba8a6e923206@mail.gmail.com> <20070906002029.GR4472@obsidianresearch.com> <46DFA0CB.2070605@voltaire.com> <20070906155418.GT4472@obsidianresearch.com> Message-ID: <20070906171223.GC10559@mellanox.co.il> > Quoting Jason Gunthorpe : > Subject: Re: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support > > On Thu, Sep 06, 2007 at 09:40:11AM +0300, Or Gerlitz wrote: > > > >Micheal has made it so you can use 'csum offload' (via disabling csum) > > >on any nic. You can also do the same kind of thing for TSO/GSO. If you > > >send jumbo TSO/GSO packets in a chunk the receiver can then do > > >LRO. Win all around. Sort of like jumbo MTU but without actually > > >changing the MTU. > > > > > >This is all basically the same set of techniques we see between a > > >Linux guest and the linux host in a virtualization environment. > > > > Thanks for the clarification, I have to do some catchup here on the > > details re TSO/GSO and their relation to virtualization, however, to > > make things a little clearer to me, do you agree that as James pointed > > over this thread in > > > > A (IB) ---- B (Gateway eg HW based) ---- C (Ethernet) > > > > scheme, in case A does not compute the TCP checksum of a packet, its > > note the role of the gateway to do so, and C would just drop it?! > > I think the proper way to view Michael's patch, and indeed this whole > idea, is that it just moves the work around, with the goal of > eliminating the work for a class of communication (Linux host to Linux > host). 
So yes, if a gateway uses this feature then it must regenerate > the checksum before it forwards it. > > It is actually a pretty neat idea, I've never heard of another network > doing this. I wouldn't call it hardware checksum, but more like a > peer-to-peer VNIC scheme. Nobody would object if a vnic driver moved > checksum and segmentation offload to the VNIC device over a RC QP, and > I think the same rational for that applies here, except it is now peer > to peer. (Michael maybe that is a good name for this concept: p2p_vnic?) Yea. Roland, does the argument sound convincing to you? > FWIW, general gateways do have a bit of a problem doing the csum > insertion because there are alot of cases and new protocols do crop up > from time to time. It would be best if part of the information sent in > this case was instructions on how to do the insertion like an general > ethernet chip would use. Not sure I know what do you mean. Could you give an example please? -- MST From jgunthorpe at obsidianresearch.com Thu Sep 6 11:15:13 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 6 Sep 2007 12:15:13 -0600 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070906171223.GC10559@mellanox.co.il> References: <20070904193547.GI4472@obsidianresearch.com> <20070905051040.GM28350@mellanox.co.il> <20070905055108.GB16535@obsidianresearch.com> <20070905061913.GN28350@mellanox.co.il> <20070905170545.GM4472@obsidianresearch.com> <15ddcffd0709051335l7ba8a976v1535ba8a6e923206@mail.gmail.com> <20070906002029.GR4472@obsidianresearch.com> <46DFA0CB.2070605@voltaire.com> <20070906155418.GT4472@obsidianresearch.com> <20070906171223.GC10559@mellanox.co.il> Message-ID: <20070906181513.GV4472@obsidianresearch.com> On Thu, Sep 06, 2007 at 08:12:23PM +0300, Michael S. 
Tsirkin wrote: > > FWIW, general gateways do have a bit of a problem doing the csum > > insertion because there are a lot of cases and new protocols do crop up > > from time to time. It would be best if part of the information sent in > > this case was instructions on how to do the insertion like a general > > ethernet chip would use. > > Not sure I know what you mean. Could you give an example please? Ok, this is basically what the comments in skbuff.h about NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM are about; it applies just as well to this case as to a NIC case. Look at how skb_copy_and_csum_dev/skb_checksum_help works for generic csumming. Basically, the trick is that the hw csum operates over a specified subset of the packet. The csum value to update must be within that subset and its location is also specified in the skb (csum_offset). The kernel computes the csum of the IP pseudo header (or really any other additional bits that get csummed) in advance and initializes the csum field in the packet with it. The HW then csums the range *which includes the csum field* and then replaces the csum field with this new value. This results in the hardware computing csum(pseudo_hdr) + csum(payload) without actually having any idea what the pseudo_hdr is. This is generic non-protocol-specific offload (NETIF_F_HW_CSUM). This is how CHECKSUM_PARTIAL works. CHECKSUM_COMPLETE is the analog on the RX side. The HW computes a csum across every byte of the packet and stores that out of band. Again, through the properties of the csum, you can subtract bytes you don't want summed (more or less the csum of the negative of the pseudo_hdr) from the csum and get a MAGIC constant back if the packet csum is valid. This is how protocol-agnostic receive csum offload is done.
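[A minimal user-space sketch of the arithmetic described above, added for illustration. It is not kernel code and makes no claim about what any driver or HCA does. All names (csum_add, csum_fold, stack_seed, hw_tx_finish, rx_valid, csum_demo) are hypothetical, the packet layout is a made-up UDP-like segment behind a 4-byte stand-in header, and the "hardware" steps are ordinary functions. It assumes the 16-bit ones' complement Internet checksum.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes, as big-endian 16-bit words, to a running ones' complement
 * sum. Only the final chunk of a packet may have an odd length. */
static uint32_t csum_add(uint32_t sum, const uint8_t *p, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;   /* zero-pad the odd byte */
    return sum;
}

/* Fold the carries back in, leaving a 16-bit ones' complement sum. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* TX, stack half: seed the csum field with the pseudo-header sum. */
static void stack_seed(uint8_t *seg, size_t csum_offset,
                       const uint8_t *pseudo, size_t pseudo_len)
{
    put16(seg + csum_offset, csum_fold(csum_add(0, pseudo, pseudo_len)));
}

/* TX, "hardware" half: sum the range (seed included) and store the
 * complement at csum_offset. It never sees the pseudo header. */
static void hw_tx_finish(uint8_t *seg, size_t len, size_t csum_offset)
{
    put16(seg + csum_offset, (uint16_t)~csum_fold(csum_add(0, seg, len)));
}

/* RX: the "hardware" sums every byte of the packet; software subtracts the
 * bytes it does not want (the leading header, hdr_len must be even), adds
 * the pseudo header, and expects the constant 0xffff for a valid packet. */
static int rx_valid(const uint8_t *pkt, size_t hdr_len, size_t len,
                    const uint8_t *pseudo, size_t pseudo_len)
{
    uint32_t hw_sum = csum_fold(csum_add(0, pkt, len));
    uint32_t hdr = csum_fold(csum_add(0, pkt, hdr_len));
    uint32_t ps = csum_fold(csum_add(0, pseudo, pseudo_len));

    /* Ones' complement subtraction is addition of the complement. */
    return csum_fold(hw_sum + (uint16_t)~hdr + ps) == 0xffff;
}

/* Returns 1 if a good packet verifies and a one-bit corruption is caught. */
static int csum_demo(void)
{
    /* src 10.0.0.1, dst 10.0.0.2, zero, protocol 17, length 11 */
    const uint8_t pseudo[12] = { 10, 0, 0, 1, 10, 0, 0, 2, 0, 17, 0, 11 };
    uint8_t pkt[4 + 11] = {
        0xde, 0xad, 0xbe, 0xef,             /* stand-in lower-layer header */
        0x30, 0x39, 0x00, 0x35, 0x00, 0x0b, /* ports and length */
        0x00, 0x00,                         /* csum field, segment offset 6 */
        'a', 'b', 'c'                       /* odd-length payload */
    };
    uint8_t *seg = pkt + 4;

    stack_seed(seg, 6, pseudo, sizeof pseudo);
    hw_tx_finish(seg, 11, 6);
    assert(rx_valid(pkt, 4, sizeof pkt, pseudo, sizeof pseudo));

    pkt[13] ^= 0x01;                        /* flip one payload bit */
    return !rx_valid(pkt, 4, sizeof pkt, pseudo, sizeof pseudo);
}
```

[Calling csum_demo() from a small main() exercises both directions. The only inputs the "hardware" functions take are a byte range and an offset, which is the protocol-agnostic property being discussed.]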
Adding this to the IPoIB 'VNIC' wire protocol would relieve anyone from actually having to figure out the pseudo_hdr and L4 protocol to deduce the proper algorithm for computing the csum (NETIF_F_IP_CSUM) Jason From dotanb at dev.mellanox.co.il Thu Sep 6 11:44:19 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 06 Sep 2007 21:44:19 +0300 Subject: [ofa-general] Re: [PATCH] librdmacm 1/2: add valgrind support to auto-tools configuration file In-Reply-To: <46E03176.3010209@ichips.intel.com> References: <200708151352.42026.dotanb@dev.mellanox.co.il> <000201c7f00b$5826e900$3c98070a@amr.corp.intel.com> <46DFE93B.60702@dev.mellanox.co.il> <46E03176.3010209@ichips.intel.com> Message-ID: <46E04A83.9050405@dev.mellanox.co.il> Sean Hefty wrote: >> I have a comment only on your last choice: i don't know the feature >> history of valgrind but i believe that >> there were versions which had the file memcheck.h without the >> mentioned macro. >> >> I would like to leave the code that handles this issue like it was >> in the original patch (if it is fine with you). > > I checked a couple of older valgrind releases, and you are correct. > There are versions where it is undefined. I've reverted this change > back to match your original patch. Thanks. Great. In the near future, i will send you a patch to the libibcm that will add valgrind support to this library as well. thanks again Dotan From todd.rimmer at qlogic.com Thu Sep 6 11:49:58 2007 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Thu, 6 Sep 2007 13:49:58 -0500 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <20070906171223.GC10559@mellanox.co.il> Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE061192DF634@EPEXCH2.qlogic.org> > From: Michael S. Tsirkin > Sent: Thursday, September 06, 2007 1:12 PM > To: Jason Gunthorpe > Cc: Eli Cohen; Michael S. 
Tsirkin; general at lists.openfabrics.org > Subject: Re: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support > > > > > > > A (IB) ---- B (Gateway eg HW based) ---- C (Ethernet) > > > > > > scheme, in case A does not compute the TCP checksum of a packet, its > > > note the role of the gateway to do so, and C would just drop it?! > > > > I think the proper way to view Michael's patch, and indeed this whole > > idea, is that it just moves the work around, with the goal of > > eliminating the work for a class of communication (Linux host to Linux > > host). So yes, if a gateway uses this feature then it must regenerate > > the checksum before it forwards it. > > > > It is actually a pretty neat idea, I've never heard of another network > > doing this. I wouldn't call it hardware checksum, but more like a > > peer-to-peer VNIC scheme. Nobody would object if a vnic driver moved > > checksum and segmentation offload to the VNIC device over a RC QP, and > > I think the same rational for that applies here, except it is now peer > > to peer. (Michael maybe that is a good name for this concept: p2p_vnic?) > > Yea. Roland, does the argument sound convincing to you? I have been observing this discussion and one of my serious concerns in this scenario is the lack of overlapping "data integrity checking domains". In short, enterprise quality networks need to have mechanisms which prevent an intermediate component from corrupting data which will be delivered as "valid" to the final destination. Standard TCP/IP networks address this problem by not having the TCP checksums recomputed by routers. Hence if the router corrupts the packet internally, the IP header may have a valid checksum, however the TCP checksum would be bad and the final destination would reject the packet. IB similarly protects data by having two CRCs (ICRC which is an end to end CRC, and VCRC which can change per switch/router hop). 
Hence a switch or router problem will result in packets with a bad ICRC which will be dropped. Michael's proposal is a nice optimization for the direct host to host case. However as soon as a gateway/router (B above) is added there is a serious gap in the integrity domains. A hardware problem (or software bug) in B could undetectably corrupt the packet, but it would be delivered to C with a valid checksum. Hence an undetected data corruption for the overall network path A<->C. Undetected data corruption is a very nasty word for the enterprise and designs must strive to remove opportunities for such problems. Hence I agree with Roland's comment that the name should imply the serious risk that this option can introduce and it should clearly not be the default behavior. Michael's idea of doing this in a manner so the unchecksum'ed packets are unroutable may also be reasonable. Todd Rimmer From jgunthorpe at obsidianresearch.com Thu Sep 6 13:21:45 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 6 Sep 2007 14:21:45 -0600 Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE061192DF634@EPEXCH2.qlogic.org> References: <20070906171223.GC10559@mellanox.co.il> <4FB1BCCAE6CAED44A1DC005B1DE061192DF634@EPEXCH2.qlogic.org> Message-ID: <20070906202145.GW4472@obsidianresearch.com> On Thu, Sep 06, 2007 at 01:49:58PM -0500, Todd Rimmer wrote: > Michael's proposal is a nice optimization for the direct host to host > case. Right, that is probably where it is best used; this is easy if the option is RC/UC only and is negotiated. Gateway devices could just never allow it to be negotiated on. Unless Michael gets TSO and LRO working too; then having gateways, which are more like offload-capable VNICs now, supporting the offload features would be a benefit. > However as soon as a gateway/router (B above) is added there is a > serious gap in the integrity domains.
A hardware problem (or software > bug) in B could undetectably corrupt the packet, but it would be > delivered to C with a valid checksum. Hence an undetected data > corruption for the overall network path A<->C. The counter to this is that IB already has lots of things like this. SDP, VNIC (well, with TSO or csum offload), iSER (gateway'd to FC, SATA, ethernet, etc), etc all lack true end to end integrity. All rely on the gateway device to have enough internal error controls. Basically all the IB gateway type apps except for IPoIB lack true end to end checking. If that was really a problem you'd have a hard sell with FC gateways and the like too :) That is why I like the name peer to peer vnic for this kind of feature. I didn't like it either at first, but the notion has grown on me :) Jason From ardavis at ichips.intel.com Thu Sep 6 16:08:48 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 06 Sep 2007 16:08:48 -0700 Subject: [ofa-general] OFED 1.2.5 - GA release In-Reply-To: <46DF1505.1020409@ichips.intel.com> References: <6C2C79E72C305246B504CBA17B5500C901563B5D@mtlexch01.mtl.com> <46DF1505.1020409@ichips.intel.com> Message-ID: <46E08880.7070807@ichips.intel.com> > > How can I build/install OFED 1.2.5 with ib_local_sa.ko? It seems to > build but does not install and I need SA caching options. > Can anyone tell me how to get ib_local_sa.ko installed with OFED 1.2.5? We cannot move to OFED 1.2.5 without SA caching options.
Thanks, -arlin From kliteyn at dev.mellanox.co.il Thu Sep 6 16:32:50 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 07 Sep 2007 02:32:50 +0300 Subject: [ofa-general] Re: [PATCH v2] osm: QoS: selecting PathRecord according to QoS policy In-Reply-To: <20070906142357.GR25330@sashak.voltaire.com> References: <46DE9F97.10003@dev.mellanox.co.il> <20070905232643.GC25330@sashak.voltaire.com> <46E00114.5060601@dev.mellanox.co.il> <20070906142357.GR25330@sashak.voltaire.com> Message-ID: <46E08E22.7090508@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 16:31 Thu 06 Sep , Yevgeny Kliteynik wrote: >> Hi Sasha, >> >> Sasha Khapyorsky wrote: >>> Hi Yevgeny, >>> On 15:22 Wed 05 Sep , Yevgeny Kliteynik wrote: >>>> Selecting path according to QoS policy level that >>>> the PathRecord query matches. >>>> >>>> Signed-off-by: Yevgeny Kliteynik >>>> --- >>>> opensm/opensm/osm_sa_path_record.c | 374 >>>> ++++++++++++++++++++++++++---------- >>>> 1 files changed, 276 insertions(+), 98 deletions(-) >>>> >>>> diff --git a/opensm/opensm/osm_sa_path_record.c >>>> b/opensm/opensm/osm_sa_path_record.c >>>> index 1b781f0..15bd7e2 100644 >>>> --- a/opensm/opensm/osm_sa_path_record.c >>>> +++ b/opensm/opensm/osm_sa_path_record.c >>>> @@ -67,6 +67,7 @@ >>>> #include >>>> #include >>>> #include >>>> +#include >>>> #ifdef ROUTER_EXP >>>> #include >>>> #include >>>> @@ -236,8 +237,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> { >>>> const osm_node_t *p_node; >>>> const osm_physp_t *p_physp; >>>> + const osm_physp_t *p_src_physp; >>>> const osm_physp_t *p_dest_physp; >>>> - const osm_prtn_t *p_prtn; >>>> + const osm_prtn_t *p_prtn = NULL; >>>> const ib_port_info_t *p_pi; >>>> ib_api_status_t status = IB_SUCCESS; >>>> ib_net16_t pkey; >>>> @@ -248,14 +250,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> uint8_t required_rate; >>>> uint8_t required_pkt_life; >>>> uint8_t sl; >>>> + uint8_t in_port_num; >>>> ib_net16_t 
dest_lid; >>>> + uint8_t i; >>>> + uint8_t vl; >>>> + ib_slvl_table_t *p_slvl_tbl = NULL; >>>> + boolean_t valid_sls[IB_MAX_NUM_VLS]; >>>> + boolean_t sl2vl_valid_path; >>>> + uint8_t first_valid_sl; >>>> + osm_qos_level_t *p_qos_level = NULL; >>>> >>>> OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); >>>> >>>> + memset(valid_sls, TRUE, IB_MAX_NUM_VLS); >>>> dest_lid = cl_hton16(dest_lid_ho); >>>> >>>> p_dest_physp = p_dest_port->p_physp; >>>> p_physp = p_src_port->p_physp; >>>> + p_src_physp = p_physp; >>>> p_pi = &p_physp->port_info; >>>> >>>> mtu = ib_port_info_get_mtu_cap(p_pi); >>>> @@ -288,13 +300,16 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> p_node = osm_physp_get_node_ptr(p_physp); >>>> >>>> if (p_node->sw) { >>>> + /* source node is a switch */ >>>> + in_port_num = osm_physp_get_port_num(p_physp); >>> Hmm, could in_port_num be != 0? >> Well... >> The physical port object is obtained from port object, which in turn, >> was obtained from the subnet port_guid_tbl through osm_get_port_by_guid(). >> Since there can be one port per guid in this table, I think we store there >> only ports 0 of the switches (correct me if I'm wrong). >> So looks like you're right - in this case in_port_num can be only 0. >> >> In any case, osm_physp_get_port_num() is just an inline function that >> returns p_physp->port_num. > > And look where this in_port_num is used later: > >>>> + if (p_node->sw) >>>> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); >>>> + else >>>> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); >>>> + > > Since for switches in_port_num is always 0 just > > p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); > > will be sufficient for all node types. OK >>>> + >>>> /* >>>> * If the dest_lid_ho is equal to the lid of the switch pointed by >>>> * p_sw then p_physp will be the physical port of the switch port zero. >>> I know it is not your code, but do you understand this part of the >>> comment? 
>> Nope :) >> The two lines I've added may very well replace these first two lines, >> so I think I can remove the old comment. > > Ok. > >>>> + * Make sure that p_physp points to the out port of the >>>> + * switch that routes to the destination lid (dest_lid_ho) >>>> */ >>>> - p_physp = >>>> - osm_switch_get_route_by_lid(p_node->sw, >>>> - cl_ntoh16(dest_lid_ho)); >>>> + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); >>>> if (p_physp == 0) { >>>> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >>>> "__osm_pr_rcv_get_path_parms: ERR 1F02: " >>>> @@ -306,15 +321,32 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> } >>>> } >>>> >>>> + if (!p_rcv->p_subn->opt.no_qos) { >>> Would you prefer to change opt.no_qos to opt.qos? For me it looks things >>> will be clear this way. >> I wanted to do it since I started working on QoS! > > Feel free :) > >>>> + if (p_node->sw) >>>> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); >>>> + else >>>> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); >>>> + >>>> + /* update valid SLs that still exist on this route */ >>>> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { >>>> + if (valid_sls[i]) { >>>> + vl = ib_slvl_table_get(p_slvl_tbl, i); >>>> + if (vl == IB_DROP_VL) >>>> + valid_sls[i] = FALSE; >>>> + } >>>> + } >>>> + } >>>> + >>>> /* >>>> * Same as above >>>> */ >>>> p_node = osm_physp_get_node_ptr(p_dest_physp); >>>> >>>> if (p_node->sw) { >>>> - p_dest_physp = >>>> - osm_switch_get_route_by_lid(p_node->sw, >>>> - cl_ntoh16(dest_lid_ho)); >>>> + /* >>>> + * if destination is switch, we want p_dest_physp to point to port 0 >>>> + */ >>>> + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); >>>> >>>> if (p_dest_physp == 0) { >>>> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >>>> @@ -328,6 +360,10 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> >>>> } >>>> >>>> + /* >>>> + * Now go through the path step by step >>>> + */ >>>> + >>>> while (p_physp != p_dest_physp) { 
>>>> p_physp = osm_physp_get_remote(p_physp); >>>> >>>> @@ -341,6 +377,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> goto Exit; >>>> } >>>> >>>> + in_port_num = osm_physp_get_port_num(p_physp); >>>> + >>>> /* >>>> This is point to point case (no switch in between) >>>> */ >>>> @@ -367,29 +405,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> */ >>>> p_pi = &p_physp->port_info; >>>> >>>> - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { >>>> + if (mtu > ib_port_info_get_mtu_cap(p_pi)) >>>> mtu = ib_port_info_get_mtu_cap(p_pi); >>>> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >>>> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> - "__osm_pr_rcv_get_path_parms: " >>>> - "New smallest MTU = %u at intervening port 0x%016" >>>> - PRIx64 " port num 0x%X\n", mtu, >>>> - cl_ntoh64(osm_physp_get_port_guid >>>> - (p_physp)), >>>> - osm_physp_get_port_num(p_physp)); >>>> - } >>>> >>>> - if (rate > ib_port_info_compute_rate(p_pi)) { >>>> + if (rate > ib_port_info_compute_rate(p_pi)) >>>> rate = ib_port_info_compute_rate(p_pi); >>>> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >>>> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> - "__osm_pr_rcv_get_path_parms: " >>>> - "New smallest rate = %u at intervening port 0x%016" >>>> - PRIx64 " port num 0x%X\n", rate, >>>> - cl_ntoh64(osm_physp_get_port_guid >>>> - (p_physp)), >>>> - osm_physp_get_port_num(p_physp)); >>>> - } >>>> >>>> /* >>>> Continue with the egress port on this switch. >>>> @@ -409,32 +429,41 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> CL_ASSERT(p_physp); >>> It is not needed, run-time check is done right above. (I know it is not >>> your code) >> Sure - removed. >> >>>> CL_ASSERT(osm_physp_is_valid(p_physp)); >>>> >>>> + p_node = osm_physp_get_node_ptr(p_physp); >>>> + if (!p_node->sw) { >>> Actually this !p_node->sw check duplicates the one above, where >>> !p_node->sw is verified for ergess port of this switch. Right? 
Well, it's not exactly the same check - one check is for egress port, the other is for ingress port, but we certainly can live with checking only one of the two ports. >>>> + /* >>>> + * There is some sort of problem in the subnet object! >>>> + * If this isn't a switch, we should have reached >>>> + * the destination by now! >>>> + */ >>>> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >>>> + "__osm_pr_rcv_get_path_parms: ERR 1F04: " >>>> + "Internal error, bad path\n"); >>>> + status = IB_ERROR; >>>> + goto Exit; >>>> + } >>>> + >>>> p_pi = &p_physp->port_info; >>>> >>>> - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { >>>> + if (mtu > ib_port_info_get_mtu_cap(p_pi)) >>>> mtu = ib_port_info_get_mtu_cap(p_pi); >>>> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >>>> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> - "__osm_pr_rcv_get_path_parms: " >>>> - "New smallest MTU = %u at intervening port 0x%016" >>>> - PRIx64 " port num 0x%X\n", mtu, >>>> - cl_ntoh64(osm_physp_get_port_guid >>>> - (p_physp)), >>>> - osm_physp_get_port_num(p_physp)); >>>> - } >>>> >>>> - if (rate > ib_port_info_compute_rate(p_pi)) { >>>> + if (rate > ib_port_info_compute_rate(p_pi)) >>>> rate = ib_port_info_compute_rate(p_pi); >>>> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >>>> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> - "__osm_pr_rcv_get_path_parms: " >>>> - "New smallest rate = %u at intervening port 0x%016" >>>> - PRIx64 " port num 0x%X\n", rate, >>>> - cl_ntoh64(osm_physp_get_port_guid >>>> - (p_physp)), >>>> - osm_physp_get_port_num(p_physp)); >>>> - } >>>> >>>> + if (!p_rcv->p_subn->opt.no_qos) { >>>> + /* >>>> + * Check SL2VL table of the switch and update valid SLs >>>> + */ >>>> + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); >>>> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { >>>> + if (valid_sls[i]) { >>>> + vl = ib_slvl_table_get(p_slvl_tbl, i); >>>> + if (vl == IB_DROP_VL) >>>> + valid_sls[i] = FALSE; >>>> + } >>>> + } >>>> + } >>>> } >>>> >>>> /* >>>> @@ -442,30 
+471,104 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> */ >>>> p_pi = &p_physp->port_info; >>>> >>>> - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { >>>> + if (mtu > ib_port_info_get_mtu_cap(p_pi)) >>>> mtu = ib_port_info_get_mtu_cap(p_pi); >>>> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >>>> + >>>> + if (rate > ib_port_info_compute_rate(p_pi)) >>>> + rate = ib_port_info_compute_rate(p_pi); >>>> + >>>> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >>>> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> + "__osm_pr_rcv_get_path_parms: " >>>> + "Path min MTU = %u, min rate = %u\n", >>>> + mtu, rate); >>>> + >>>> + if (!p_rcv->p_subn->opt.no_qos) { >>>> + /* >>>> + * check whether there is some SL >>>> + * that won't lead to VL15 eventually >>>> + */ >>>> + sl2vl_valid_path = FALSE; >>>> + for (i = 0; i < IB_MAX_NUM_VLS; i++) { >>>> + if (valid_sls[i]) { >>>> + sl2vl_valid_path = TRUE; >>>> + first_valid_sl = i; >>>> + break; >>>> + } >>>> + } >>>> + >>>> + if (!sl2vl_valid_path) { >>>> + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >>>> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> + "__osm_pr_rcv_get_path_parms: " >>>> + "All the SLs lead to VL15 on this path\n"); >>>> + } >>>> + status = IB_NOT_FOUND; >>>> + goto Exit; >>>> + } >>>> + } >>>> + >>>> + if (!p_rcv->p_subn->opt.no_qos && p_rcv->p_subn->p_qos_policy) { >>>> + /* Get QoS Level object according to the path request */ >>>> + osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, >>>> + p_rcv, p_pr, >>>> + p_src_physp, p_dest_physp, >>>> + comp_mask, &p_qos_level); >>>> + >>>> + if (p_qos_level >>>> + && osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { >>>> osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> "__osm_pr_rcv_get_path_parms: " >>>> - "New smallest MTU = %u at destination port 0x%016" >>>> - PRIx64 "\n", mtu, >>>> - cl_ntoh64(osm_physp_get_port_guid(p_physp))); >>>> + "PathRecord request matches QoS Level '%s' (%s)\n", >>>> + p_qos_level->name, >>>> + 
(p_qos_level->use) ? p_qos_level-> >>>> + use : "no description"); >>>> + } >>>> } >>>> >>>> - if (rate > ib_port_info_compute_rate(p_pi)) { >>>> - rate = ib_port_info_compute_rate(p_pi); >>>> + /* Adjust path parameters according to QoS settings */ >>>> + >>>> + if (p_qos_level) { >>> Why to not make osm_qos_policy_get_qos_level_by_pr() returning pointer >>> to p_qos_level? Then you could simply merge both conditions (this and >>> one above), something like: >>> if (!p_rcv->p_subn->opt.no_qos && >>> p_rcv->p_subn->p_qos_policy && >>> (p_qos_level = osm_qos_policy_get_qos_level_by_pr(..)) { >> Done. >> >>>> + if (p_qos_level->mtu_limit_set >>>> + && (mtu > p_qos_level->mtu_limit)) >>>> + mtu = p_qos_level->mtu_limit; >>>> + >>>> + if (p_qos_level->rate_limit_set >>>> + && (rate > p_qos_level->rate_limit)) >>>> + rate = p_qos_level->rate_limit; >>>> + >>>> + if (p_qos_level->pkt_life_set >>>> + && (pkt_life > p_qos_level->pkt_life)) >>>> + pkt_life = p_qos_level->pkt_life; >>>> + >>>> + if (p_qos_level->sl_set) { >>>> + if (!valid_sls[p_qos_level->sl]) { >>>> + status = IB_NOT_FOUND; >>>> + goto Exit; >>>> + } >>>> + sl = p_qos_level->sl; >>>> + } >>>> + >>>> if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >>>> osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> "__osm_pr_rcv_get_path_parms: " >>>> - "New smallest rate = %u at destination port 0x%016" >>>> - PRIx64 "\n", rate, >>>> - cl_ntoh64(osm_physp_get_port_guid(p_physp))); >>>> + "Path params with QoS constaraints: " >>>> + "min MTU = %u, min rate = %u, " >>>> + "packet lifetime = %u, sl = %u\n", >>>> + mtu, rate, pkt_life, sl); >>>> } >>>> >>>> - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) >>>> - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> - "__osm_pr_rcv_get_path_parms: " >>>> - "Path min MTU = %u, min rate = %u\n", mtu, rate); >>>> + /* >>>> + * Set packet lifetime. 
>>>> + * According to spec definition IBA 1.2 Table 205 >>>> + * PacketLifeTime description, for loopback paths, >>>> + * packetLifeTime shall be zero. >>>> + */ >>>> + if (p_src_port == p_dest_port) >>>> + pkt_life = 0; >>>> + else if ( !(p_qos_level && p_qos_level->pkt_life_set) ) >>>> + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; >>>> + >>>> >>>> /* >>>> Determine if these values meet the user criteria >>>> @@ -511,6 +614,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> break; >>>> } >>>> } >>>> + if (status != IB_SUCCESS) >>>> + goto Exit; >>>> >>>> /* we silently ignore cases where only the Rate selector is defined */ >>>> if ((comp_mask & IB_PR_COMPMASK_RATESELEC) && >>>> @@ -551,14 +656,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> break; >>>> } >>>> } >>>> - >>>> - /* Verify the pkt_life_time */ >>>> - /* According to spec definition IBA 1.2 Table 205 PacketLifeTime >>>> description, >>>> - for loopback paths, packetLifeTime shall be zero. 
*/
>>>> - if (p_src_port == p_dest_port)
>>>> - pkt_life = 0; /* loopback */
>>>> - else
>>>> - pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT;
>>>> + if (status != IB_SUCCESS)
>>>> + goto Exit;
>>>>
>>>> /* we silently ignore cases where only the PktLife selector is defined
>>>> */
>>>> if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) &&
>>>> @@ -603,12 +702,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const
>>>> p_rcv,
>>>> if (status != IB_SUCCESS)
>>>> goto Exit;
>>>>
>>>> - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC &&
>>>> - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))
>>>> - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp);
>>>> + /*
>>>> + * set Pkey for this path record request
>>>> + */
>>>> +
>>>> + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) &&
>>>> + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)))
>>>> + pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp);
>>> So was it a bug (not related to QoS) that p_physp instead of
>>> p_src_physp was used for pkey finding?
>> I think so.
>
> Nice finding!
>
>>>> +
>>>> else if (comp_mask & IB_PR_COMPMASK_PKEY) {
>>>> + /*
>>>> + * PR request has a specific pkey:
>>>> + * Check that source and destination share this pkey.
>>>> + * If QoS level has pkeys, check that this pkey exists
>>>> + * in the QoS level pkeys.
>>>> + * PR returned pkey is the requested pkey.
>>>> + */ >>>> pkey = p_pr->pkey; >>>> - if (!osm_physp_share_this_pkey(p_physp, p_dest_physp, pkey)) { >>>> + if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) { >>>> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >>>> "__osm_pr_rcv_get_path_parms: ERR 1F1A: " >>>> "Ports do not share specified PKey 0x%04x\n", >>>> @@ -616,8 +727,37 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> status = IB_NOT_FOUND; >>>> goto Exit; >>>> } >>>> + if (p_qos_level && p_qos_level->pkey_range_len && >>>> + !osm_qos_level_has_pkey(p_qos_level, pkey)) { >>>> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >>>> + "__osm_pr_rcv_get_path_parms: ERR 1F1D: " >>>> + "Ports do not share PKeys defined by QoS level\n"); >>>> + status = IB_NOT_FOUND; >>>> + goto Exit; >>>> + } >>>> + >>>> + } else if (p_qos_level && p_qos_level->pkey_range_len) { >>>> + /* >>>> + * PR request doesn't have a specific pkey, but QoS level >>>> + * has pkeys - get shared pkey from QoS level pkeys >>>> + */ >>>> + pkey = osm_qos_level_get_shared_pkey(p_qos_level, >>>> + p_src_physp, >>>> + p_dest_physp); >>>> + if (!pkey) { >>>> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >>>> + "__osm_pr_rcv_get_path_parms: ERR 1F1E: " >>>> + "Ports do not share PKeys defined by QoS level\n"); >>>> + status = IB_NOT_FOUND; >>>> + goto Exit; >>>> + } >>>> } else { >>>> - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); >>>> + /* >>>> + * Neither PR request nor QoS level have pkey. >>>> + * Just get any shared pkey. 
>>>> + */ >>>> + pkey = osm_physp_find_common_pkey(p_src_physp, >>>> + p_dest_physp); >>>> if (!pkey) { >>>> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >>>> "__osm_pr_rcv_get_path_parms: ERR 1F1B: " >>>> @@ -627,14 +767,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> } >>>> } >>>> >>>> - if (p_rcv->p_subn->opt.routing_engine_name && >>>> - strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) >>>> - /* slid and dest_lid are stored in network in lash */ >>>> - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, >>>> - p_dest_port); >>>> - else >>>> - sl = OSM_DEFAULT_SL; >>>> - >>>> if (pkey) { >>>> p_prtn = >>>> (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, >>>> @@ -642,34 +774,80 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const >>>> p_rcv, >>>> 0x8000)); >>>> if (p_prtn == >>>> (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) >>>> + p_prtn = NULL; >>>> + } >>>> + >>>> + /* >>>> + * Set PathRecord SL. >>>> + * >>>> + * ToDo: What about QoS and LASH routing? How can they coexist? >>>> + * And what happens when there's a pkey, hence there is a >>>> + * partition with a certain SL, and this SL doesn't match >>>> + * the one that's defined by LASH? 
>>>> + */ >>>> + >>>> + if (comp_mask & IB_PR_COMPMASK_SL) { >>>> + /* >>>> + * Specific SL was requested >>>> + */ >>>> + sl = ib_path_rec_sl(p_pr); >>>> + if (p_qos_level && p_qos_level->sl_set && (p_qos_level->sl != sl)) { >>>> + osm_log(p_rcv->p_log, OSM_LOG_ERROR, >>>> + "__osm_pr_rcv_get_path_parms: ERR 1F1F: " >>>> + "QoS constaraints: required PR SL (%u) " >>>> + "doesn't match QoS SL (%u)\n", >>>> + sl, p_qos_level->sl); >>>> + status = IB_NOT_FOUND; >>>> + goto Exit; >>>> + } >>>> + } else if (p_qos_level && p_qos_level->sl_set) { >>>> + /* >>>> + * No specific SL was requested, >>>> + * but there is an SL in QoS level >>>> + */ >>>> + sl = p_qos_level->sl; >>>> + if (pkey && p_prtn && p_prtn->sl != p_qos_level->sl) >>>> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >>>> + "__osm_pr_rcv_get_path_parms: " >>>> + "QoS level SL (%u) overrides partition SL (%u)\n", >>>> + p_qos_level->sl, p_prtn->sl); >>>> + } else if (pkey) { >>>> + /* >>>> + * No specific SL in request or in QoS level - use partition SL >>>> + */ >>>> + if (!p_prtn) { >>>> /* this may be possible when pkey tables are created somehow in >>>> previous runs or things are going wrong here */ >>>> osm_log(p_rcv->p_log, OSM_LOG_ERROR, >>>> "__osm_pr_rcv_get_path_parms: ERR 1F1C: " >>>> "No partition found for PKey 0x%04x - using default SL %d\n", >>>> cl_ntoh16(pkey), sl); >>>> - else { >>>> - if (p_rcv->p_subn->opt.routing_engine_name && >>>> - strcmp(p_rcv->p_subn->opt.routing_engine_name, >>>> - "lash") == 0) >>>> - /* slid and dest_lid are stored in network in lash */ >>>> - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, >>>> - p_src_port, p_dest_port); >>>> - else >>>> - sl = p_prtn->sl; >>>> - } >>>> - >>>> - /* reset pkey when raw traffic */ >>>> - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && >>>> - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) >>>> - pkey = 0; >>>> + } else >>>> + sl = p_prtn->sl; >>>> + } else if (p_rcv->p_subn->opt.routing_engine_name && >>>> + 
strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) {
>>> It seems that in the original code LASH had "higher" priority in SL
>>> selection than the partition configuration? If so, is there any reason
>>> why it was changed?
>> No particular reason - it just seemed right at the moment.
>> I'll rework it so that the relative priorities of partition
>> and lash routing will remain as they were before.
>> In any case, is there any particular reason why lash SL
>> should have higher priority than partition's SL?
>
> I think so. LASH can be turned on or off just by using a command-line
> option; to prevent conflicts with partitions, we might need to rewrite
> the partitions config file each time we want to run LASH.
> I think the original "priorities" were fine.
OK
-- Yevgeny
>> Regardless of what the answer is, there'll be a conflict when a
>> specific pkey was requested in PathRecord and this partition
>> has SL different from what lash defines.
>
> Yes, of course - LASH requires better integration, not just with
> partitions, with QoS too. Want to fix this as well? :)
>
> Sasha
>

From kliteyn at dev.mellanox.co.il Thu Sep 6 17:35:30 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Fri, 07 Sep 2007 03:35:30 +0300
Subject: [ofa-general] [PATCH 0/3] osm: QoS - PathRecord and partial MultiPathRecord support
Message-ID: <46E09CD2.2090906@dev.mellanox.co.il>

Hi Sasha,

The following is a series of three patches:

[PATCH 1/3] Some modifications in qos policy as a step toward supporting MultiPathRecord:
- Added subnet object to the qos policy struct to remove dependency on osm_pr_rcv_t (and later on osm_mpr_rcv_t).
- osm_qos_policy_get_qos_level_by_pr() turned into a wrapper function that gets a path record and extracts the relevant parameters.

[PATCH 2/3] Added MultiPathRecord support in qos policy: added osm_qos_policy_get_qos_level_by_mpr() wrapper function.

[PATCH 3/3] Selecting PathRecord according to QoS policy level.
These patches have *all* the changes that we've discussed recently, so please disregard all the unapplied QoS-related patches that you have.

-- Yevgeny

From kliteyn at dev.mellanox.co.il Thu Sep 6 17:36:15 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Fri, 07 Sep 2007 03:36:15 +0300
Subject: [ofa-general] [PATCH 1/3] osm: QoS - adding subnet to qos policy and adding wrapper that returns qos level
Message-ID: <46E09CFF.1000101@dev.mellanox.co.il>

Hi Sasha,

Some modifications in qos policy as a step toward supporting MultiPathRecord:
- Added subnet object to the qos policy struct to remove dependency on osm_pr_rcv_t (and later on osm_mpr_rcv_t).
- osm_qos_policy_get_qos_level_by_pr() turned into a wrapper function that gets a path record and extracts the relevant parameters.

-- Yevgeny

Signed-off-by: Yevgeny Kliteynik
---
 opensm/include/opensm/osm_qos_policy.h | 16 +++---
 opensm/opensm/osm_qos_parser.y | 2 +-
 opensm/opensm/osm_qos_policy.c | 95 +++++++++++++++++++-------------
 3 files changed, 65 insertions(+), 48 deletions(-)

diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h
index a7a9cd2..11598be 100644
--- a/opensm/include/opensm/osm_qos_policy.h
+++ b/opensm/include/opensm/osm_qos_policy.h
@@ -141,6 +141,7 @@ typedef struct _osm_qos_policy_t {
 cl_list_t qos_levels; /* list of osm_qos_level_t */
 cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */
 osm_qos_level_t *p_default_qos_level; /* default QoS level */
+ osm_subn_t *p_subn; /* osm subnet object */
 } osm_qos_policy_t;
 /***************************************************/
@@ -167,17 +168,16 @@ ib_net16_t osm_qos_level_get_shared_pkey(IN const osm_qos_level_t * p_qos_level,
 osm_qos_match_rule_t * osm_qos_policy_match_rule_create();
 void osm_qos_policy_match_rule_destroy(osm_qos_match_rule_t * p_match_rule);
-osm_qos_policy_t * osm_qos_policy_create();
+osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn);
 void
osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy); int osm_qos_policy_validate(osm_qos_policy_t * p_qos_policy, osm_log_t * p_log); -void osm_qos_policy_get_qos_level_by_pr(IN const osm_qos_policy_t * p_qos_policy, - IN const osm_pr_rcv_t * p_rcv, - IN const ib_path_rec_t * p_pr, - IN const osm_physp_t * p_src_physp, - IN const osm_physp_t * p_dest_physp, - IN ib_net64_t comp_mask, - OUT osm_qos_level_t ** pp_qos_level); +osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( + IN const osm_qos_policy_t * p_qos_policy, + IN const ib_path_rec_t * p_pr, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN ib_net64_t comp_mask); /***************************************************/ diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index 876448b..a477084 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -1752,7 +1752,7 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn) column_num = 1; line_num = 1; - p_subn->p_qos_policy = osm_qos_policy_create(); + p_subn->p_qos_policy = osm_qos_policy_create(p_subn); __parser_tmp_struct_init(); p_qos_policy = p_subn->p_qos_policy; diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 059a861..4ac0e35 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -53,6 +53,7 @@ #include #include #include +#include #include /*************************************************** @@ -380,7 +381,7 @@ void osm_qos_policy_match_rule_destroy(osm_qos_match_rule_t * p) /*************************************************** ***************************************************/ -osm_qos_policy_t * osm_qos_policy_create() +osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) { osm_qos_policy_t * p_qos_policy = (osm_qos_policy_t *)malloc(sizeof(osm_qos_policy_t)); if (!p_qos_policy) @@ -403,6 +404,7 @@ osm_qos_policy_t * osm_qos_policy_create() 
cl_list_construct(&p_qos_policy->qos_match_rules); cl_list_init(&p_qos_policy->qos_match_rules, 10); + p_qos_policy->p_subn = p_subn; return p_qos_policy; } @@ -542,7 +544,7 @@ __qos_policy_is_port_in_group(osm_subn_t * p_subn, ***************************************************/ static boolean_t -__qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, +__qos_policy_is_port_in_group_list(const osm_qos_policy_t * p_qos_policy, const osm_physp_t * p_physp, cl_list_t * p_port_group_list) { @@ -555,7 +557,7 @@ __qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, (osm_qos_port_group_t *) cl_list_obj(list_iterator); if (p_port_group) { if (__qos_policy_is_port_in_group - (p_rcv->p_subn, p_physp, p_port_group)) + (p_qos_policy->p_subn, p_physp, p_port_group)) return TRUE; } list_iterator = cl_list_next(list_iterator); @@ -566,10 +568,11 @@ __qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, /*************************************************** ***************************************************/ -static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( +static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( const osm_qos_policy_t * p_qos_policy, - const osm_pr_rcv_t * p_rcv, - const ib_path_rec_t * p_pr, + uint64_t service_id, + uint16_t qos_class, + uint16_t pkey, const osm_physp_t * p_src_physp, const osm_physp_t * p_dest_physp, ib_net64_t comp_mask) @@ -594,7 +597,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( /* If a match rule has Source groups, PR request source has to be in this list */ if (cl_list_count(&p_qos_match_rule->source_group_list)) { - if (!__qos_policy_is_port_in_group_list(p_rcv, + if (!__qos_policy_is_port_in_group_list(p_qos_policy, p_src_physp, &p_qos_match_rule-> source_group_list)) @@ -607,7 +610,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( /* If a match rule has Destination groups, PR request dest. 
has to be in this list */ if (cl_list_count(&p_qos_match_rule->destination_group_list)) { - if (!__qos_policy_is_port_in_group_list(p_rcv, + if (!__qos_policy_is_port_in_group_list(p_qos_policy, p_dest_physp, &p_qos_match_rule-> destination_group_list)) @@ -629,7 +632,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( if (!__is_num_in_range_arr (p_qos_match_rule->qos_class_range_arr, p_qos_match_rule->qos_class_range_len, - ib_path_rec_qos_class(p_pr))) { + qos_class)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -649,7 +652,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( if (!__is_num_in_range_arr (p_qos_match_rule->service_id_range_arr, p_qos_match_rule->service_id_range_len, - cl_ntoh64(p_pr->service_id))) { + service_id)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -668,7 +671,7 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( if (!__is_num_in_range_arr (p_qos_match_rule->pkey_range_arr, p_qos_match_rule->pkey_range_len, - cl_ntoh16(p_pr->pkey))) { + pkey)) { list_iterator = cl_list_next(list_iterator); continue; } @@ -688,8 +691,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( /*************************************************** ***************************************************/ -static osm_qos_level_t *__qos_policy_get_qos_level_by_name(osm_qos_policy_t * p_qos_policy, - char *name) +static osm_qos_level_t *__qos_policy_get_qos_level_by_name( + const osm_qos_policy_t * p_qos_policy, + char *name) { osm_qos_level_t *p_qos_level = NULL; cl_list_iterator_t list_iterator; @@ -713,8 +717,9 @@ static osm_qos_level_t *__qos_policy_get_qos_level_by_name(osm_qos_policy_t * p_ /*************************************************** ***************************************************/ -static osm_qos_port_group_t *__qos_policy_get_port_group_by_name(osm_qos_policy_t * p_qos_policy, - const char *const name) +static osm_qos_port_group_t 
*__qos_policy_get_port_group_by_name( + const osm_qos_policy_t * p_qos_policy, + const char *const name) { osm_qos_port_group_t *p_port_group = NULL; cl_list_iterator_t list_iterator; @@ -869,54 +874,66 @@ int osm_qos_policy_validate(osm_qos_policy_t * p_qos_policy, /*************************************************** ***************************************************/ -void osm_qos_policy_get_qos_level_by_pr(IN const osm_qos_policy_t * p_qos_policy, - IN const osm_pr_rcv_t * p_rcv, - IN const ib_path_rec_t * p_pr, - IN const osm_physp_t * p_src_physp, - IN const osm_physp_t * p_dest_physp, - IN ib_net64_t comp_mask, - OUT osm_qos_level_t ** pp_qos_level) +static osm_qos_level_t * __qos_policy_get_qos_level_by_params( + IN const osm_qos_policy_t * p_qos_policy, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN uint64_t service_id, + IN uint16_t qos_class, + IN uint16_t pkey, + IN ib_net64_t comp_mask) { osm_qos_match_rule_t *p_qos_match_rule = NULL; osm_qos_level_t *p_qos_level = NULL; - OSM_LOG_ENTER(p_rcv->p_log, osm_qos_policy_get_qos_level_by_pr); - - *pp_qos_level = NULL; + OSM_LOG_ENTER(&p_qos_policy->p_subn->p_osm->log, + __qos_policy_get_qos_level_by_params); if (!p_qos_policy) goto Exit; - p_qos_match_rule = __qos_policy_get_match_rule_by_pr(p_qos_policy, - p_rcv, - p_pr, - p_src_physp, - p_dest_physp, - comp_mask); + p_qos_match_rule = __qos_policy_get_match_rule_by_params( + p_qos_policy, service_id, qos_class, pkey, + p_src_physp, p_dest_physp, comp_mask); if (p_qos_match_rule) p_qos_level = p_qos_match_rule->p_qos_level; else p_qos_level = p_qos_policy->p_default_qos_level; - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "osm_qos_policy_get_qos_level_by_pr: " + osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "__qos_policy_get_qos_level_by_params: " "PathRecord request:" "Src port 0x%016" PRIx64 ", " "Dst port 0x%016" PRIx64 "\n", cl_ntoh64(osm_physp_get_port_guid(p_src_physp)), 
cl_ntoh64(osm_physp_get_port_guid(p_dest_physp))); - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "osm_qos_policy_get_qos_level_by_pr: " + osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "__qos_policy_get_qos_level_by_params: " "Applying QoS Level %s (%s)\n", p_qos_level->name, (p_qos_level->use) ? p_qos_level->use : "no description"); - *pp_qos_level = p_qos_level; - Exit: - OSM_LOG_EXIT(p_rcv->p_log); -} /* osm_qos_policy_get_qos_level_by_pr() */ + OSM_LOG_EXIT(&p_qos_policy->p_subn->p_osm->log); + return p_qos_level; +} /* __qos_policy_get_qos_level_by_params() */ + +/*************************************************** + ***************************************************/ + +osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( + IN const osm_qos_policy_t * p_qos_policy, + IN const ib_path_rec_t * p_pr, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN ib_net64_t comp_mask) +{ + return __qos_policy_get_qos_level_by_params( + p_qos_policy, p_src_physp, p_dest_physp, + cl_ntoh64(p_pr->service_id), ib_path_rec_qos_class(p_pr), + cl_ntoh16(p_pr->pkey), comp_mask); +} /*************************************************** ***************************************************/ -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu Sep 6 17:38:04 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 07 Sep 2007 03:38:04 +0300 Subject: [ofa-general] [PATCH 2/3] osm: QoS - support for MPR in qos policy Message-ID: <46E09D6C.9040706@dev.mellanox.co.il> Hi Sasha, This patch adds osm_qos_policy_get_qos_level_by_mpr() wrapper function that basically does the same thing as the osm_qos_policy_get_qos_level_by_pr(), by converting MultiPathRecord comp_mask into PathRecord comp_mask. 
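The mask translation described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: the bit positions and the names `MPR_MASK_*`, `PR_MASK_*`, and `mpr_to_pr_mask()` are hypothetical stand-ins, not the real `IB_MPR_COMPMASK_*` / `IB_PR_COMPMASK_*` values from the OpenSM headers. Only the technique (map each relevant bit to its counterpart and drop everything else) comes from the patch description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical MultiPathRecord-style component mask bits (assumed values). */
#define MPR_MASK_QOS_CLASS  (UINT64_C(1) << 3)
#define MPR_MASK_PKEY       (UINT64_C(1) << 5)
#define MPR_MASK_SERVICE_ID (UINT64_C(1) << 9)
#define MPR_MASK_OTHER      (UINT64_C(1) << 12) /* not relevant to QoS matching */

/* Hypothetical PathRecord-style component mask bits (assumed values). */
#define PR_MASK_QOS_CLASS   (UINT64_C(1) << 1)
#define PR_MASK_PKEY        (UINT64_C(1) << 2)
#define PR_MASK_SERVICE_ID  (UINT64_C(1) << 4)

/*
 * Translate one mask layout into the other. Each relevant source bit is
 * tested on its own and mapped to its counterpart; bits with no
 * counterpart are silently dropped.
 */
static uint64_t mpr_to_pr_mask(uint64_t mpr_mask)
{
	return ((mpr_mask & MPR_MASK_QOS_CLASS) ? PR_MASK_QOS_CLASS : 0) |
	       ((mpr_mask & MPR_MASK_PKEY) ? PR_MASK_PKEY : 0) |
	       ((mpr_mask & MPR_MASK_SERVICE_ID) ? PR_MASK_SERVICE_ID : 0);
}
```

With these assumed values, a mask carrying the pkey bit plus an unrelated bit translates to the pkey bit alone, so a shared matching routine only ever sees one mask layout.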
-- Yevgeny Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_qos_policy.h | 8 +++++++ opensm/opensm/osm_qos_policy.c | 36 ++++++++++++++++++++++++++++++++ 2 files changed, 44 insertions(+), 0 deletions(-) diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h index 11598be..0c220ee 100644 --- a/opensm/include/opensm/osm_qos_policy.h +++ b/opensm/include/opensm/osm_qos_policy.h @@ -51,6 +51,7 @@ #include #include #include +#include #define YYSTYPE char * #define OSM_QOS_POLICY_MAX_PORTS_ON_SWITCH 128 @@ -179,6 +180,13 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( IN const osm_physp_t * p_dest_physp, IN ib_net64_t comp_mask); +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( + IN const osm_qos_policy_t * p_qos_policy, + IN const ib_multipath_rec_t * p_mpr, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN ib_net64_t comp_mask); + /***************************************************/ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn); diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 4ac0e35..40ce35c 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -937,3 +937,39 @@ osm_qos_level_t * osm_qos_policy_get_qos_level_by_pr( /*************************************************** ***************************************************/ + +osm_qos_level_t * osm_qos_policy_get_qos_level_by_mpr( + IN const osm_qos_policy_t * p_qos_policy, + IN const ib_multipath_rec_t * p_mpr, + IN const osm_physp_t * p_src_physp, + IN const osm_physp_t * p_dest_physp, + IN ib_net64_t comp_mask) +{ + ib_net64_t pr_comp_mask = 0; + + if (!p_qos_policy) + return NULL; + + /* + * Converting MultiPathRecord compmask to the PathRecord + * compmask. Note that only relevant bits are set. + */ + pr_comp_mask = + ((comp_mask & IB_MPR_COMPMASK_QOS_CLASS) ? + IB_PR_COMPMASK_QOS_CLASS : 0) | + ((comp_mask & IB_MPR_COMPMASK_PKEY) ? 
+ IB_PR_COMPMASK_PKEY : 0) | + ((comp_mask & IB_MPR_COMPMASK_SERVICEID_MSB) ? + IB_PR_COMPMASK_SERVICEID_MSB : 0) | + ((comp_mask & IB_MPR_COMPMASK_SERVICEID_LSB) ? + IB_PR_COMPMASK_SERVICEID_LSB : 0); + + return __qos_policy_get_qos_level_by_params( + p_qos_policy, p_src_physp, p_dest_physp, + cl_ntoh64(ib_multipath_rec_service_id(p_mpr)), + ib_multipath_rec_qos_class(p_mpr), + cl_ntoh16(p_mpr->pkey), pr_comp_mask); +} + +/*************************************************** + ***************************************************/ -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu Sep 6 17:38:41 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 07 Sep 2007 03:38:41 +0300 Subject: [ofa-general] [PATCH 3/3] osm: QoS: selecting PathRecord according to QoS policy Message-ID: <46E09D91.8040802@dev.mellanox.co.il> Selecting path according to QoS policy level. -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_sa_path_record.c | 375 ++++++++++++++++++++++++++---------- 1 files changed, 272 insertions(+), 103 deletions(-) diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c index 1b781f0..5a24a7c 100644 --- a/opensm/opensm/osm_sa_path_record.c +++ b/opensm/opensm/osm_sa_path_record.c @@ -67,6 +67,7 @@ #include #include #include +#include #ifdef ROUTER_EXP #include #include @@ -236,8 +237,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, { const osm_node_t *p_node; const osm_physp_t *p_physp; + const osm_physp_t *p_src_physp; const osm_physp_t *p_dest_physp; - const osm_prtn_t *p_prtn; + const osm_prtn_t *p_prtn = NULL; const ib_port_info_t *p_pi; ib_api_status_t status = IB_SUCCESS; ib_net16_t pkey; @@ -248,7 +250,12 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, uint8_t required_rate; uint8_t required_pkt_life; uint8_t sl; + uint8_t in_port_num; ib_net16_t dest_lid; + uint8_t i; + ib_slvl_table_t *p_slvl_tbl = NULL; + osm_qos_level_t *p_qos_level = NULL; + uint16_t valid_sl_mask 
= 0xffff; OSM_LOG_ENTER(p_rcv->p_log, __osm_pr_rcv_get_path_parms); @@ -256,6 +263,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, p_dest_physp = p_dest_port->p_physp; p_physp = p_src_port->p_physp; + p_src_physp = p_physp; p_pi = &p_physp->port_info; mtu = ib_port_info_get_mtu_cap(p_pi); @@ -289,12 +297,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, if (p_node->sw) { /* - * If the dest_lid_ho is equal to the lid of the switch pointed by - * p_sw then p_physp will be the physical port of the switch port zero. + * Source node is a switch. + * Make sure that p_physp points to the out port of the + * switch that routes to the destination lid (dest_lid_ho) */ - p_physp = - osm_switch_get_route_by_lid(p_node->sw, - cl_ntoh16(dest_lid_ho)); + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); if (p_physp == 0) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F02: " @@ -306,15 +313,40 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, } } + if (!p_rcv->p_subn->opt.no_qos) { + + /* + * Whether this node is switch or CA, the IN port for + * the sl2vl table is 0, because this is a source node. 
+ */ + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); + + /* update valid SLs that still exist on this route */ + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sl_mask & (1 << i) && + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) + valid_sl_mask &= ~(1 << i); + } + if (!valid_sl_mask) { + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "All the SLs lead to VL15 on this path\n"); + status = IB_NOT_FOUND; + goto Exit; + } + } + /* * Same as above */ p_node = osm_physp_get_node_ptr(p_dest_physp); if (p_node->sw) { - p_dest_physp = - osm_switch_get_route_by_lid(p_node->sw, - cl_ntoh16(dest_lid_ho)); + /* + * if destination is switch, we want p_dest_physp to point to port 0 + */ + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); if (p_dest_physp == 0) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, @@ -328,7 +360,13 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, } + /* + * Now go through the path step by step + */ + while (p_physp != p_dest_physp) { + + p_node = osm_physp_get_node_ptr(p_physp); p_physp = osm_physp_get_remote(p_physp); if (p_physp == 0) { @@ -341,6 +379,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, goto Exit; } + in_port_num = osm_physp_get_port_num(p_physp); + /* This is point to point case (no switch in between) */ @@ -367,29 +407,11 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, */ p_pi = &p_physp->port_info; - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { + if (mtu > ib_port_info_get_mtu_cap(p_pi)) mtu = ib_port_info_get_mtu_cap(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "New smallest MTU = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", mtu, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } - if (rate > ib_port_info_compute_rate(p_pi)) { + if (rate > 
ib_port_info_compute_rate(p_pi)) rate = ib_port_info_compute_rate(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "New smallest rate = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", rate, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } /* Continue with the egress port on this switch. @@ -406,35 +428,36 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, goto Exit; } - CL_ASSERT(p_physp); CL_ASSERT(osm_physp_is_valid(p_physp)); p_pi = &p_physp->port_info; - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { + if (mtu > ib_port_info_get_mtu_cap(p_pi)) mtu = ib_port_info_get_mtu_cap(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "New smallest MTU = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", mtu, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } - if (rate > ib_port_info_compute_rate(p_pi)) { + if (rate > ib_port_info_compute_rate(p_pi)) rate = ib_port_info_compute_rate(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "New smallest rate = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", rate, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } + if (!p_rcv->p_subn->opt.no_qos) { + /* + * Check SL2VL table of the switch and update valid SLs + */ + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sl_mask & (1 << i) && + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) + valid_sl_mask &= ~(1 << i); + } + if (!valid_sl_mask) { + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "All the SLs lead to VL15 " + "on this 
path\n"); + status = IB_NOT_FOUND; + goto Exit; + } + } } /* @@ -442,30 +465,76 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, */ p_pi = &p_physp->port_info; - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { + if (mtu > ib_port_info_get_mtu_cap(p_pi)) mtu = ib_port_info_get_mtu_cap(p_pi); + + if (rate > ib_port_info_compute_rate(p_pi)) + rate = ib_port_info_compute_rate(p_pi); + + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "Path min MTU = %u, min rate = %u\n", + mtu, rate); + + /* + * Get QoS Level object according to the path request + * and adjust path parameters according to QoS settings + */ + if ( !p_rcv->p_subn->opt.no_qos && + p_rcv->p_subn->p_qos_policy && + (p_qos_level = osm_qos_policy_get_qos_level_by_pr( + p_rcv->p_subn->p_qos_policy, p_pr, + p_src_physp, p_dest_physp, comp_mask)) ) { + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) osm_log(p_rcv->p_log, OSM_LOG_DEBUG, "__osm_pr_rcv_get_path_parms: " - "New smallest MTU = %u at destination port 0x%016" - PRIx64 "\n", mtu, - cl_ntoh64(osm_physp_get_port_guid(p_physp))); - } + "PathRecord request matches QoS Level '%s' (%s)\n", + p_qos_level->name, + (p_qos_level->use) ? 
p_qos_level-> + use : "no description"); + + if (p_qos_level->mtu_limit_set + && (mtu > p_qos_level->mtu_limit)) + mtu = p_qos_level->mtu_limit; + + if (p_qos_level->rate_limit_set + && (rate > p_qos_level->rate_limit)) + rate = p_qos_level->rate_limit; + + if (p_qos_level->pkt_life_set + && (pkt_life > p_qos_level->pkt_life)) + pkt_life = p_qos_level->pkt_life; + + if (p_qos_level->sl_set) { + sl = p_qos_level->sl; + if (!(valid_sl_mask & (1 << sl))) { + status = IB_NOT_FOUND; + goto Exit; + } + } - if (rate > ib_port_info_compute_rate(p_pi)) { - rate = ib_port_info_compute_rate(p_pi); if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) osm_log(p_rcv->p_log, OSM_LOG_DEBUG, "__osm_pr_rcv_get_path_parms: " - "New smallest rate = %u at destination port 0x%016" - PRIx64 "\n", rate, - cl_ntoh64(osm_physp_get_port_guid(p_physp))); + "Path params with QoS constaraints: " + "min MTU = %u, min rate = %u, " + "packet lifetime = %u, sl = %u\n", + mtu, rate, pkt_life, sl); } - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "Path min MTU = %u, min rate = %u\n", mtu, rate); + /* + * Set packet lifetime. + * According to spec definition IBA 1.2 Table 205 + * PacketLifeTime description, for loopback paths, + * packetLifeTime shall be zero. 
+ */ + if (p_src_port == p_dest_port) + pkt_life = 0; + else if ( !(p_qos_level && p_qos_level->pkt_life_set) ) + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; + /* Determine if these values meet the user criteria @@ -511,6 +580,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, break; } } + if (status != IB_SUCCESS) + goto Exit; /* we silently ignore cases where only the Rate selector is defined */ if ((comp_mask & IB_PR_COMPMASK_RATESELEC) && @@ -551,14 +622,8 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, break; } } - - /* Verify the pkt_life_time */ - /* According to spec definition IBA 1.2 Table 205 PacketLifeTime description, - for loopback paths, packetLifeTime shall be zero. */ - if (p_src_port == p_dest_port) - pkt_life = 0; /* loopback */ - else - pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; + if (status != IB_SUCCESS) + goto Exit; /* we silently ignore cases where only the PktLife selector is defined */ if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) && @@ -603,12 +668,24 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, if (status != IB_SUCCESS) goto Exit; - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); + /* + * set Pkey for this path record request + */ + + if ((comp_mask & IB_PR_COMPMASK_RAWTRAFFIC) && + (cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))) + pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp); + else if (comp_mask & IB_PR_COMPMASK_PKEY) { + /* + * PR request has a specific pkey: + * Check that source and destination share this pkey. + * If QoS level has pkeys, check that this pkey exists + * in the QoS level pkeys. + * PR returned pkey is the requested pkey. 
+ */ pkey = p_pr->pkey; - if (!osm_physp_share_this_pkey(p_physp, p_dest_physp, pkey)) { + if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F1A: " "Ports do not share specified PKey 0x%04x\n", @@ -616,8 +693,37 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, status = IB_NOT_FOUND; goto Exit; } + if (p_qos_level && p_qos_level->pkey_range_len && + !osm_qos_level_has_pkey(p_qos_level, pkey)) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1D: " + "Ports do not share PKeys defined by QoS level\n"); + status = IB_NOT_FOUND; + goto Exit; + } + + } else if (p_qos_level && p_qos_level->pkey_range_len) { + /* + * PR request doesn't have a specific pkey, but QoS level + * has pkeys - get shared pkey from QoS level pkeys + */ + pkey = osm_qos_level_get_shared_pkey(p_qos_level, + p_src_physp, + p_dest_physp); + if (!pkey) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1E: " + "Ports do not share PKeys defined by QoS level\n"); + status = IB_NOT_FOUND; + goto Exit; + } } else { - pkey = osm_physp_find_common_pkey(p_physp, p_dest_physp); + /* + * Neither PR request nor QoS level have pkey. + * Just get any shared pkey. 
+ */ + pkey = osm_physp_find_common_pkey(p_src_physp, + p_dest_physp); if (!pkey) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F1B: " @@ -627,14 +733,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, } } - if (p_rcv->p_subn->opt.routing_engine_name && - strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) - /* slid and dest_lid are stored in network in lash */ - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, p_src_port, - p_dest_port); - else - sl = OSM_DEFAULT_SL; - if (pkey) { p_prtn = (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, @@ -642,34 +740,105 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, 0x8000)); if (p_prtn == (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) + p_prtn = NULL; + } + + /* + * Set PathRecord SL. + */ + + if (comp_mask & IB_PR_COMPMASK_SL) { + /* + * Specific SL was requested + */ + sl = ib_path_rec_sl(p_pr); + + if (p_qos_level && p_qos_level->sl_set && (p_qos_level->sl != sl)) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F1F: " + "QoS constaraints: required PathRecord SL (%u) " + "doesn't match QoS policy SL (%u)\n", + sl, p_qos_level->sl); + status = IB_NOT_FOUND; + goto Exit; + } + + if (p_rcv->p_subn->opt.routing_engine_name && + strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0 && + osm_get_lash_sl(p_rcv->p_subn->p_osm, + p_src_port, p_dest_port) != sl) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F23: " + "Required PathRecord SL (%u) doesn't " + "match LASH SL\n", + sl); + status = IB_NOT_FOUND; + goto Exit; + } + + } else if (p_rcv->p_subn->opt.routing_engine_name && + strcmp(p_rcv->p_subn->opt.routing_engine_name, "lash") == 0) { + /* + * No specific SL in PathRecord request. + * If it's LASH routing - use its SL. + * slid and dest_lid are stored in network in lash. 
+ */ + sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, + p_src_port, p_dest_port); + + } else if (p_qos_level && p_qos_level->sl_set) { + /* + * No specific SL was requested, and we're not in + * LASH routing, but there is an SL in QoS level. + */ + sl = p_qos_level->sl; + + if (pkey && p_prtn && p_prtn->sl != p_qos_level->sl) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: " + "QoS level SL (%u) overrides partition SL (%u)\n", + p_qos_level->sl, p_prtn->sl); + + } else if (pkey) { + /* + * No specific SL in request or in QoS level - use partition SL + */ + if (!p_prtn) { /* this may be possible when pkey tables are created somehow in - previous runs or things are going wrong here */ + previous runs or things are going wrong here */ osm_log(p_rcv->p_log, OSM_LOG_ERROR, - "__osm_pr_rcv_get_path_parms: ERR 1F1C: " - "No partition found for PKey 0x%04x - using default SL %d\n", - cl_ntoh16(pkey), sl); + "__osm_pr_rcv_get_path_parms: ERR 1F1C: " + "No partition found for PKey 0x%04x - using default SL %d\n", + cl_ntoh16(pkey), sl); + sl = OSM_DEFAULT_SL; + } else + sl = p_prtn->sl; + } else if (!p_rcv->p_subn->opt.no_qos) { + if (valid_sl_mask & (1 << OSM_DEFAULT_SL)) + sl = OSM_DEFAULT_SL; else { - if (p_rcv->p_subn->opt.routing_engine_name && - strcmp(p_rcv->p_subn->opt.routing_engine_name, - "lash") == 0) - /* slid and dest_lid are stored in network in lash */ - sl = osm_get_lash_sl(p_rcv->p_subn->p_osm, - p_src_port, p_dest_port); - else - sl = p_prtn->sl; + for (i = 0; i < IB_MAX_NUM_VLS; i++) + if (valid_sl_mask & (1 << i)) + break; + sl = i; } - - /* reset pkey when raw traffic */ - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && - cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31)) - pkey = 0; } + else + sl = OSM_DEFAULT_SL; - if ((comp_mask & IB_PR_COMPMASK_SL) && ib_path_rec_sl(p_pr) != sl) { + if (!p_rcv->p_subn->opt.no_qos && !(valid_sl_mask & (1 << sl))) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_path_parms: ERR 1F24: " + "Selected 
SL (%u) leads to VL15\n", sl);
 		status = IB_NOT_FOUND;
 		goto Exit;
 	}
+	/* reset pkey when raw traffic */
+	if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC &&
+	    cl_ntoh32(p_pr->hop_flow_raw) & (1 << 31))
+		pkey = 0;
+
 	p_parms->mtu = mtu;
 	p_parms->rate = rate;
 	p_parms->pkt_life = pkt_life;
-- 
1.5.1.4

From pradeeps at linux.vnet.ibm.com Thu Sep 6 18:09:18 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 06 Sep 2007 18:09:18 -0700
Subject: [ofa-general] Making NOSRQ RFC 4755 compliant
Message-ID: <46E0A4BE.9000207@linux.vnet.ibm.com>

I am in the process of developing a patch to make NOSRQ RFC 4755
compliant. I am trying to test this patch with a hack. The hack enables
the passive side to send to the tx_qp of the active side.

I am trying to ping the passive side and I find that the packet does get
to the active side. However, I see that the passive side gets a retry
exceeded error.

I have ensured that the passive side is indeed sending to the correct qp
(i.e. tx_qp of the active side) and that the psn is set to 0 (for now).
I also verified that the pkey is set to the default 0xffff (implying
full membership) by looking in
/sys/class/infiniband/ehca0/ports/1/pkeys/0. For RC, qkey should not be
an issue.

My suspicion is that the reply from the passive side is being dropped on
the active side (for some unknown reason), and that is why the passive
side sees a retry exceeded error.

From the IB spec I can't see anything that I am missing, unless the
packet on the active side is being dropped because of a failed packet
header validation. Should I be looking at some other parameters?

Any suggestions to debug this problem would be helpful.

Pradeep

From mst at dev.mellanox.co.il Thu Sep 6 20:32:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 7 Sep 2007 06:32:27 +0300
Subject: [ofa-general] [PATCHv2] IB/ipoib: S/G and HW checksum support
In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE061192DF634@EPEXCH2.qlogic.org>
References: <20070906171223.GC10559@mellanox.co.il> <4FB1BCCAE6CAED44A1DC005B1DE061192DF634@EPEXCH2.qlogic.org>
Message-ID: <20070907033227.GE10559@mellanox.co.il>

> Michael's proposal is a nice optimization for the direct host to host
> case.
>
> However as soon as a gateway/router (B above) is added there is a
> serious gap in the integrity domains. A hardware problem (or software
> bug) in B could undetectably corrupt the packet, but it would be
> delivered to C with a valid checksum. Hence an undetected data
> corruption for the overall network path A<->C.

Note that B can implement data integrity measures (e.g. ECC)
to protect against this.

> Undetected data corruption is a very nasty word for the enterprise and
> designs must strive to remove opportunities for such problems.
>
> Hence I agree with Roland's comment that the name should imply the
> serious risk that this option can introduce and it should clearly not be
> the default behavior.
>
> Michael's idea of doing this in a manner so the unchecksum'ed packets
> are unroutable may also be reasonable.

-- 
MST

From rdreier at cisco.com Thu Sep 6 23:42:54 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 06 Sep 2007 23:42:54 -0700
Subject: [ofa-general] [PATCH][RFC] P_Key support for umad
Message-ID:

Here is a long overdue patch to enable userspace to control the P_Key
index used for userspace MADs. I used the approach we discussed when
this first came up, namely adding an ioctl to enable the new
interface so that existing binaries don't break.

I haven't had a chance to make all the userspace library changes to
test the new interface and I likely won't until I return home (I
should be done traveling for a few months after this week). I have
tested existing code against a kernel with this patch applied and it
seems to be OK, and I wanted to at least get this out for review as
soon as I had it.

Please review/test. I would like to get this into 2.6.24 if possible
since we've known so long that we needed it.

Thanks,
Roland

diff --git a/Documentation/infiniband/user_mad.txt b/Documentation/infiniband/user_mad.txt
index 8ec54b9..a3450aa 100644
--- a/Documentation/infiniband/user_mad.txt
+++ b/Documentation/infiniband/user_mad.txt
@@ -99,6 +99,20 @@ Transaction IDs
   request/response pairs. The upper 32 bits are reserved for use by
   the kernel and will be overwritten before a MAD is sent.
 
+P_Key Index Handling
+
+  The old ib_umad interface did not allow setting the P_Key index for
+  MADs that are sent and did not provide a way for obtaining the P_Key
+  index of received MADs. A new layout for struct ib_user_mad_hdr
+  with a pkey_index member has been defined; however, to preserve
+  binary compatibility with older applications, this new layout will
+  not be used unless the IB_USER_MAD_ENABLE_PKEY ioctl is called
+  before a file descriptor is used for anything else.
+
+  In September 2008, the IB_USER_MAD_ABI_VERSION will be incremented
+  to 6, the new layout of struct ib_user_mad_hdr will be used by
+  default, and the IB_USER_MAD_ENABLE_PKEY ioctl will be removed.
+ Setting IsSM Capability Bit To set the IsSM capability bit for a port, simply open the diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index d97ded2..3a0e579 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -118,6 +118,8 @@ struct ib_umad_file { wait_queue_head_t recv_wait; struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; int agents_dead; + u8 use_pkey_index; + u8 already_used; }; struct ib_umad_packet { @@ -147,6 +149,12 @@ static void ib_umad_release_dev(struct kref *ref) kfree(dev); } +static int hdr_size(struct ib_umad_file *file) +{ + return file->use_pkey_index ? sizeof (struct ib_user_mad_hdr) : + sizeof (struct ib_user_mad_hdr_old); +} + /* caller must hold port->mutex at least for reading */ static struct ib_mad_agent *__get_agent(struct ib_umad_file *file, int id) { @@ -221,13 +229,13 @@ static void recv_handler(struct ib_mad_agent *agent, packet->length = mad_recv_wc->mad_len; packet->recv_wc = mad_recv_wc; - packet->mad.hdr.status = 0; - packet->mad.hdr.length = sizeof (struct ib_user_mad) + - mad_recv_wc->mad_len; - packet->mad.hdr.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); - packet->mad.hdr.lid = cpu_to_be16(mad_recv_wc->wc->slid); - packet->mad.hdr.sl = mad_recv_wc->wc->sl; - packet->mad.hdr.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.hdr.status = 0; + packet->mad.hdr.length = hdr_size(file) + mad_recv_wc->mad_len; + packet->mad.hdr.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.hdr.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.hdr.sl = mad_recv_wc->wc->sl; + packet->mad.hdr.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.hdr.pkey_index = mad_recv_wc->wc->pkey_index; packet->mad.hdr.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); if (packet->mad.hdr.grh_present) { struct ib_ah_attr ah_attr; @@ -253,8 +261,8 @@ err1: ib_free_recv_mad(mad_recv_wc); } -static ssize_t copy_recv_mad(char __user *buf, struct 
ib_umad_packet *packet, - size_t count) +static ssize_t copy_recv_mad(struct ib_umad_file *file, char __user *buf, + struct ib_umad_packet *packet, size_t count) { struct ib_mad_recv_buf *recv_buf; int left, seg_payload, offset, max_seg_payload; @@ -262,15 +270,15 @@ static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, /* We need enough room to copy the first (or only) MAD segment. */ recv_buf = &packet->recv_wc->recv_buf; if ((packet->length <= sizeof (*recv_buf->mad) && - count < sizeof (packet->mad) + packet->length) || + count < hdr_size(file) + packet->length) || (packet->length > sizeof (*recv_buf->mad) && - count < sizeof (packet->mad) + sizeof (*recv_buf->mad))) + count < hdr_size(file) + sizeof (*recv_buf->mad))) return -EINVAL; - if (copy_to_user(buf, &packet->mad, sizeof (packet->mad))) + if (copy_to_user(buf, &packet->mad, hdr_size(file))) return -EFAULT; - buf += sizeof (packet->mad); + buf += hdr_size(file); seg_payload = min_t(int, packet->length, sizeof (*recv_buf->mad)); if (copy_to_user(buf, recv_buf->mad, seg_payload)) return -EFAULT; @@ -280,7 +288,7 @@ static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, * Multipacket RMPP MAD message. Copy remainder of message. * Note that last segment may have a shorter payload. */ - if (count < sizeof (packet->mad) + packet->length) { + if (count < hdr_size(file) + packet->length) { /* * The buffer is too small, return the first RMPP segment, * which includes the RMPP message length. 
@@ -300,18 +308,23 @@ static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, return -EFAULT; } } - return sizeof (packet->mad) + packet->length; + return hdr_size(file) + packet->length; } -static ssize_t copy_send_mad(char __user *buf, struct ib_umad_packet *packet, - size_t count) +static ssize_t copy_send_mad(struct ib_umad_file *file, char __user *buf, + struct ib_umad_packet *packet, size_t count) { - ssize_t size = sizeof (packet->mad) + packet->length; + ssize_t size = hdr_size(file) + packet->length; if (count < size) return -EINVAL; - if (copy_to_user(buf, &packet->mad, size)) + if (copy_to_user(buf, &packet->mad, hdr_size(file))) + return -EFAULT; + + buf += hdr_size(file); + + if (copy_to_user(buf, packet->mad.data, packet->length)) return -EFAULT; return size; @@ -324,7 +337,7 @@ static ssize_t ib_umad_read(struct file *filp, char __user *buf, struct ib_umad_packet *packet; ssize_t ret; - if (count < sizeof (struct ib_user_mad)) + if (count < hdr_size(file)) return -EINVAL; spin_lock_irq(&file->recv_lock); @@ -348,9 +361,9 @@ static ssize_t ib_umad_read(struct file *filp, char __user *buf, spin_unlock_irq(&file->recv_lock); if (packet->recv_wc) - ret = copy_recv_mad(buf, packet, count); + ret = copy_recv_mad(file, buf, packet, count); else - ret = copy_send_mad(buf, packet, count); + ret = copy_send_mad(file, buf, packet, count); if (ret < 0) { /* Requeue packet */ @@ -442,15 +455,14 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, __be64 *tid; int ret, data_len, hdr_len, copy_offset, rmpp_active; - if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) + if (count < hdr_size(file) + IB_MGMT_RMPP_HDR) return -EINVAL; packet = kzalloc(sizeof *packet + IB_MGMT_RMPP_HDR, GFP_KERNEL); if (!packet) return -ENOMEM; - if (copy_from_user(&packet->mad, buf, - sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR)) { + if (copy_from_user(&packet->mad, buf, hdr_size(file))) { ret = -EFAULT; goto err; } @@ -461,6 
+473,13 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, goto err; } + buf += hdr_size(file); + + if (copy_from_user(packet->mad.data, buf, IB_MGMT_RMPP_HDR)) { + ret = -EFAULT; + goto err; + } + down_read(&file->port->mutex); agent = __get_agent(file, packet->mad.hdr.id); @@ -500,11 +519,11 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, IB_MGMT_RMPP_FLAG_ACTIVE; } - data_len = count - sizeof (struct ib_user_mad) - hdr_len; + data_len = count - hdr_size(file) - hdr_len; packet->msg = ib_create_send_mad(agent, be32_to_cpu(packet->mad.hdr.qpn), - 0, rmpp_active, hdr_len, - data_len, GFP_KERNEL); + packet->mad.hdr.pkey_index, rmpp_active, + hdr_len, data_len, GFP_KERNEL); if (IS_ERR(packet->msg)) { ret = PTR_ERR(packet->msg); goto err_ah; @@ -517,7 +536,6 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, /* Copy MAD header. Any RMPP header is already in place. */ memcpy(packet->msg->mad, packet->mad.data, IB_MGMT_MAD_HDR); - buf += sizeof (struct ib_user_mad); if (!rmpp_active) { if (copy_from_user(packet->msg->mad + copy_offset, @@ -646,6 +664,7 @@ found: goto out; } + file->already_used = 1; file->agent[agent_id] = agent; ret = 0; @@ -682,6 +701,20 @@ out: return ret; } +static long ib_umad_enable_pkey(struct ib_umad_file *file) +{ + int ret = 0; + + down_write(&file->port->mutex); + if (file->already_used) + ret = -EINVAL; + else + file->use_pkey_index = 1; + up_write(&file->port->mutex); + + return ret; +} + static long ib_umad_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { @@ -690,6 +723,8 @@ static long ib_umad_ioctl(struct file *filp, unsigned int cmd, return ib_umad_reg_agent(filp->private_data, arg); case IB_USER_MAD_UNREGISTER_AGENT: return ib_umad_unreg_agent(filp->private_data, arg); + case IB_USER_MAD_ENABLE_PKEY: + return ib_umad_enable_pkey(filp->private_data); default: return -ENOIOCTLCMD; } diff --git a/include/rdma/ib_user_mad.h 
b/include/rdma/ib_user_mad.h index d66b15e..2a32043 100644 --- a/include/rdma/ib_user_mad.h +++ b/include/rdma/ib_user_mad.h @@ -52,7 +52,50 @@ */ /** + * ib_user_mad_hdr_old - Old version of MAD packet header without pkey_index + * @id - ID of agent MAD received with/to be sent with + * @status - 0 on successful receive, ETIMEDOUT if no response + * received (transaction ID in data[] will be set to TID of original + * request) (ignored on send) + * @timeout_ms - Milliseconds to wait for response (unset on receive) + * @retries - Number of automatic retries to attempt + * @qpn - Remote QP number received from/to be sent to + * @qkey - Remote Q_Key to be sent with (unset on receive) + * @lid - Remote lid received from/to be sent to + * @sl - Service level received with/to be sent with + * @path_bits - Local path bits received with/to be sent with + * @grh_present - If set, GRH was received/should be sent + * @gid_index - Local GID index to send with (unset on receive) + * @hop_limit - Hop limit in GRH + * @traffic_class - Traffic class in GRH + * @gid - Remote GID in GRH + * @flow_label - Flow label in GRH + */ +struct ib_user_mad_hdr_old { + __u32 id; + __u32 status; + __u32 timeout_ms; + __u32 retries; + __u32 length; + __be32 qpn; + __be32 qkey; + __be16 lid; + __u8 sl; + __u8 path_bits; + __u8 grh_present; + __u8 gid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 gid[16]; + __be32 flow_label; +}; + +/** * ib_user_mad_hdr - MAD packet header + * This layout allows specifying/receiving the P_Key index. To use + * this capability, an application must call the + * IB_USER_MAD_ENABLE_PKEY ioctl on the user MAD file handle before + * any other actions with the file handle. 
* @id - ID of agent MAD received with/to be sent with * @status - 0 on successful receive, ETIMEDOUT if no response * received (transaction ID in data[] will be set to TID of original @@ -70,6 +113,7 @@ * @traffic_class - Traffic class in GRH * @gid - Remote GID in GRH * @flow_label - Flow label in GRH + * @pkey_index - P_Key index */ struct ib_user_mad_hdr { __u32 id; @@ -88,6 +132,8 @@ struct ib_user_mad_hdr { __u8 traffic_class; __u8 gid[16]; __be32 flow_label; + __u16 pkey_index; + __u8 reserved[6]; }; /** @@ -134,4 +180,6 @@ struct ib_user_mad_reg_req { #define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, __u32) +#define IB_USER_MAD_ENABLE_PKEY _IO(IB_IOCTL_MAGIC, 3) + #endif /* IB_USER_MAD_H */
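The header-size switch in the P_Key umad patch above can be checked from userspace with a small sketch. This is only an illustration under stated assumptions: the two structs below are hand-written mirrors of ib_user_mad_hdr_old and ib_user_mad_hdr as posted in the patch, with uint*_t standing in for the kernel's __u32/__be32 types, and the names umad_hdr_old, umad_hdr_new, umad_hdr_size and umad_payload_offset are hypothetical, not part of any library. The 56 and 64 byte sizes are what common ABIs with 4-byte alignment of 32-bit integers give; the sketch asserts them at compile time rather than taking them for granted.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct ib_user_mad_hdr_old from the patch
 * (the layout without pkey_index). Field order is copied from the
 * posted header; the type names are stand-ins for the kernel types. */
struct umad_hdr_old {
	uint32_t id;
	uint32_t status;
	uint32_t timeout_ms;
	uint32_t retries;
	uint32_t length;
	uint32_t qpn;          /* __be32 in the kernel header */
	uint32_t qkey;         /* __be32 in the kernel header */
	uint16_t lid;          /* __be16 in the kernel header */
	uint8_t  sl;
	uint8_t  path_bits;
	uint8_t  grh_present;
	uint8_t  gid_index;
	uint8_t  hop_limit;
	uint8_t  traffic_class;
	uint8_t  gid[16];
	uint32_t flow_label;   /* __be32 in the kernel header */
};

/* Hypothetical mirror of the new struct ib_user_mad_hdr: the old
 * fields followed by pkey_index and six reserved bytes. */
struct umad_hdr_new {
	uint32_t id;
	uint32_t status;
	uint32_t timeout_ms;
	uint32_t retries;
	uint32_t length;
	uint32_t qpn;
	uint32_t qkey;
	uint16_t lid;
	uint8_t  sl;
	uint8_t  path_bits;
	uint8_t  grh_present;
	uint8_t  gid_index;
	uint8_t  hop_limit;
	uint8_t  traffic_class;
	uint8_t  gid[16];
	uint32_t flow_label;
	uint16_t pkey_index;
	uint8_t  reserved[6];
};

/* Assumption being checked: no padding is inserted on this ABI, so
 * the new header is exactly 8 bytes longer than the old one. */
static_assert(sizeof(struct umad_hdr_old) == 56, "unexpected old header size");
static_assert(sizeof(struct umad_hdr_new) == 64, "unexpected new header size");

/* Same choice the patch's hdr_size() makes from file->use_pkey_index. */
size_t umad_hdr_size(int use_pkey_index)
{
	return use_pkey_index ? sizeof(struct umad_hdr_new)
			      : sizeof(struct umad_hdr_old);
}

/* In the patch, the MAD data is copied right after the header, so the
 * payload offset in a read()/write() buffer equals the header size. */
size_t umad_payload_offset(int use_pkey_index)
{
	return umad_hdr_size(use_pkey_index);
}
```

A caller that has not issued the IB_USER_MAD_ENABLE_PKEY ioctl would use umad_payload_offset(0); one that has issued it before any other use of the file would use umad_payload_offset(1).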
From vlad at lists.openfabrics.org Fri Sep 7 02:47:52 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 7 Sep 2007 02:47:52 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070907-0200 daily build status Message-ID: <20070907094753.1D00BE60849@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on x86_64
with linux-2.6.18 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 
---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070907-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From mst at dev.mellanox.co.il Fri Sep 7 03:24:35 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 7 Sep 2007 13:24:35 +0300 Subject: [ofa-general] Re: [PATCH][RFC] P_Key support for umad In-Reply-To: References: Message-ID: <20070907102435.GA9410@mellanox.co.il> > Quoting Roland Dreier : > Subject: [PATCH][RFC] P_Key support for umad > > Here is a long overdue patch to enable userspace to control the P_Key > index used for userspace MADs. I used the approach we discussed when > this first came up, namely adding an ioctl to enable to the new > interface so that existing binaries don't break. > > I haven't had a chance to make all the userspace library changes to > test the new interface and I likely won't until I return home (I > should be done traveling for a few months after this week). I have > tested existing code against a kernel with this patch applied and it > seems to be OK, and I wanted to at least get this out for review as > soon as I had it. > > Please review/test. I would like to get this into 2.6.24 if possible > since we've known so long that we needed it. > > Thanks, > Roland BTW Can this ioctl be used to address the 32/64 bit issues that we have, somehow? -- MST From mst at dev.mellanox.co.il Fri Sep 7 05:28:00 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 7 Sep 2007 15:28:00 +0300 Subject: [ofa-general] OFED 1.2.5 - GA release In-Reply-To: <46E08880.7070807@ichips.intel.com> References: <6C2C79E72C305246B504CBA17B5500C901563B5D@mtlexch01.mtl.com> <46DF1505.1020409@ichips.intel.com> <46E08880.7070807@ichips.intel.com> Message-ID: <20070907122800.GB9410@mellanox.co.il> > Quoting Arlin Davis : > Subject: Re: [ofa-general] OFED 1.2.5 - GA release > > > > > >How can I build/install OFED 1.2.5 with ib_local_sa.ko? It seems to > >build but does not install and I need SA caching options. > > > > Can anyone tell me how to get ib_local_sa.ko installed with OFED 1.2.5? > We cannot move to OFED 1.2.5 without SA caching options. ib_local_sa was merged with ib_sa in 1.2.5. There are no extra modules to load. 
-- MST From arne.redlich at xiranet.com Fri Sep 7 06:36:12 2007 From: arne.redlich at xiranet.com (Arne Redlich) Date: Fri, 07 Sep 2007 15:36:12 +0200 Subject: [ofa-general] [PATCH] Fix potential buffer overflow in umad_get_cas_names() Message-ID: <87abrylhrn.fsf@confield.dd.xiranet.com> umad_get_cas_names() currently ignores the max parameter - fix this. Signed-off-by: Arne Redlich --- diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index a6446bf..787aa92 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -519,11 +519,12 @@ umad_get_cas_names(char cas[][UMAD_CA_NAME_LEN], int max) n = scandir(SYS_INFINIBAND, &namelist, 0, alphasort); if (n > 0) { for (i = 0; i < n; i++) { - if (!strcmp(namelist[i]->d_name, ".") || - !strcmp(namelist[i]->d_name, "..")) { - } else - strncpy(cas[j++], namelist[i]->d_name, - UMAD_CA_NAME_LEN); + if (strcmp(namelist[i]->d_name, ".") && + strcmp(namelist[i]->d_name, "..")) { + if (j < max) + strncpy(cas[j++], namelist[i]->d_name, + UMAD_CA_NAME_LEN); + } free(namelist[i]); } DEBUG("return %d cas", j); -- 1.5.2.1 From arne.redlich at xiranet.com Fri Sep 7 06:36:14 2007 From: arne.redlich at xiranet.com (Arne Redlich) Date: Fri, 07 Sep 2007 15:36:14 +0200 Subject: [ofa-general] [PATCH] Fix umad_get_cas_names() usage in libibumad. Message-ID: <878x7ilhrl.fsf@confield.dd.xiranet.com> resolve_ca_name() passes a wrong "max" argument to umad_get_cas_names. 
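Both umad_get_cas_names() patches rest on the same rule: "max" counts rows in the caller's table, not bytes in a name, and the copy loop has to stop once that many rows are filled. A minimal standalone sketch of that rule follows; it is not the libibumad code. `copy_ca_names`, `CA_NAME_LEN` and the `entries` array are hypothetical stand-ins for `umad_get_cas_names()`, `UMAD_CA_NAME_LEN` and the `scandir()` results, and the explicit NUL termination is an extra precaution that the posted patch does not include.

```c
#include <stddef.h>
#include <string.h>

#define CA_NAME_LEN 16	/* hypothetical; stands in for UMAD_CA_NAME_LEN */

/*
 * Copy entry names into a caller-supplied table of `max` rows, skipping
 * "." and "..".  Never writes past row max - 1 and always leaves each
 * copied name NUL-terminated.  Returns the number of rows filled.
 */
int copy_ca_names(char cas[][CA_NAME_LEN], int max,
		  const char *const *entries, int n)
{
	int i, j = 0;

	for (i = 0; i < n && j < max; i++) {
		if (!strcmp(entries[i], ".") || !strcmp(entries[i], ".."))
			continue;
		/* strncpy() alone does not terminate a name that fills the row */
		strncpy(cas[j], entries[i], CA_NAME_LEN - 1);
		cas[j][CA_NAME_LEN - 1] = '\0';
		j++;
	}
	return j;
}
```

A caller that declares `char names[4][CA_NAME_LEN]` would pass 4 as `max`. Passing `CA_NAME_LEN` instead, as resolve_ca_name() did with `UMAD_CA_NAME_LEN`, is only safe when the two numbers happen to match.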
Signed-off-by: Arne Redlich --- diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 787aa92..589684c 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -307,7 +307,7 @@ resolve_ca_name(char *ca_name, int *best_port) } /* Get the list of CA names */ - if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0) + if ((n = umad_get_cas_names((void *)names, 20)) < 0) return 0; /* Find the first existing CA with an active port */ -- 1.5.2.1 From ardavis at ichips.intel.com Fri Sep 7 09:34:17 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 07 Sep 2007 09:34:17 -0700 Subject: [ofa-general] OFED 1.2.5 - GA release In-Reply-To: <20070907122800.GB9410@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C901563B5D@mtlexch01.mtl.com> <46DF1505.1020409@ichips.intel.com> <46E08880.7070807@ichips.intel.com> <20070907122800.GB9410@mellanox.co.il> Message-ID: <46E17D89.7050506@ichips.intel.com> Michael S. Tsirkin wrote: >>Can anyone tell me how to get ib_local_sa.ko installed with OFED 1.2.5? >>We cannot move to OFED 1.2.5 without SA caching options. >> >> > >ib_local_sa was merged with ib_sa in 1.2.5. >There are no extra modules to load. > > > Michael, thanks for the heads up. Sure would be nice if major changes like this would be mentioned somewhere in release notes. :-) -arlin From mshefty at ichips.intel.com Fri Sep 7 10:27:09 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Sep 2007 10:27:09 -0700 Subject: [ofa-general] [PATCH][RFC] P_Key support for umad In-Reply-To: References: Message-ID: <46E189ED.9030902@ichips.intel.com> > Please review/test. I would like to get this into 2.6.24 if possible > since we've known so long that we needed it. Thanks for writing this up. The patch itself looks good, and I didn't see any problems running with the existing userspace code. 
- Sean From swise at opengridcomputing.com Fri Sep 7 11:12:29 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 07 Sep 2007 13:12:29 -0500 Subject: [ofa-general] [PATH librdmacm] rping - persistent server support Message-ID: <1189188749.7812.3.camel@stevo-desktop> Sean, This might be a good patch for ofed-1.3. It adds a -P option to rping to allow a server to be persistent and accept many incoming rping client connections... Steve. ---------------------------- Persistent rping server. From: Steve Wise Support a rping server mode where the server handles many incoming connections by creating threads to process each new rping session. Signed-off-by: Steve Wise --- examples/rping.c | 130 +++++++++++++++++++++++++++++++++++++++++++++++++----- man/rping.1 | 6 ++ 2 files changed, 122 insertions(+), 14 deletions(-) diff --git a/examples/rping.c b/examples/rping.c index 5098ebc..983ce1c 100644 --- a/examples/rping.c +++ b/examples/rping.c @@ -112,6 +112,7 @@ #define RPING_MIN_BUFSIZE sizeof(s struct rping_cb { int server; /* 0 iff client */ pthread_t cqthread; + pthread_t persistent_server_thread; struct ibv_comp_channel *channel; struct ibv_cq *cq; struct ibv_pd *pd; @@ -591,24 +592,26 @@ static void *cq_thread(void *arg) DEBUG_LOG("cq_thread started.\n"); while (1) { + pthread_testcancel(); + ret = ibv_get_cq_event(cb->channel, &ev_cq, &ev_ctx); if (ret) { fprintf(stderr, "Failed to get cq event!\n"); - exit(ret); + pthread_exit(NULL); } if (ev_cq != cb->cq) { fprintf(stderr, "Unkown CQ!\n"); - exit(-1); + pthread_exit(NULL); } ret = ibv_req_notify_cq(cb->cq, 0); if (ret) { fprintf(stderr, "Failed to set notify!\n"); - exit(ret); + pthread_exit(NULL); } ret = rping_cq_event_handler(cb); ibv_ack_cq_events(cb->cq, 1); if (ret) - exit(ret); + pthread_exit(NULL); } } @@ -748,13 +751,99 @@ static int rping_bind_server(struct rpin return ret; } - sem_wait(&cb->sem); - if (cb->state != CONNECT_REQUEST) { - fprintf(stderr, "wait for CONNECT_REQUEST state %d\n", - 
cb->state); - return -1; + return 0; +} + +static struct rping_cb *clone_cb(struct rping_cb *listening_cb) +{ + struct rping_cb *cb = malloc(sizeof *cb); + if (!cb) + return NULL; + *cb = *listening_cb; + cb->child_cm_id->context = cb; + return cb; +} + +static void free_cb(struct rping_cb *cb) +{ + free(cb); +} + +static void *rping_persistent_server_thread(void *arg) +{ + struct rping_cb *cb = arg; + struct ibv_recv_wr *bad_wr; + int ret; + + ret = rping_setup_qp(cb, cb->child_cm_id); + if (ret) { + fprintf(stderr, "setup_qp failed: %d\n", ret); + goto err0; + } + + ret = rping_setup_buffers(cb); + if (ret) { + fprintf(stderr, "rping_setup_buffers failed: %d\n", ret); + goto err1; + } + + ret = ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + if (ret) { + fprintf(stderr, "ibv_post_recv failed: %d\n", ret); + goto err2; } + pthread_create(&cb->cqthread, NULL, cq_thread, cb); + + ret = rping_accept(cb); + if (ret) { + fprintf(stderr, "connect error %d\n", ret); + goto err3; + } + + rping_test_server(cb); + rdma_disconnect(cb->child_cm_id); + rping_free_buffers(cb); + rping_free_qp(cb); + pthread_cancel(cb->cqthread); + pthread_join(cb->cqthread, NULL); + rdma_destroy_id(cb->child_cm_id); + free_cb(cb); + return NULL; +err3: + pthread_cancel(cb->cqthread); + pthread_join(cb->cqthread, NULL); +err2: + rping_free_buffers(cb); +err1: + rping_free_qp(cb); +err0: + free_cb(cb); + return NULL; +} + +static int rping_run_persistent_server(struct rping_cb *listening_cb) +{ + int ret; + struct rping_cb *cb; + + ret = rping_bind_server(listening_cb); + if (ret) + return ret; + + while (1) { + sem_wait(&listening_cb->sem); + if (listening_cb->state != CONNECT_REQUEST) { + fprintf(stderr, "wait for CONNECT_REQUEST state %d\n", + listening_cb->state); + return -1; + } + + cb = clone_cb(listening_cb); + if (!cb) + return -1; + pthread_create(&cb->persistent_server_thread, NULL, rping_persistent_server_thread, cb); + } return 0; } @@ -767,6 +856,13 @@ static int 
rping_run_server(struct rping if (ret) return ret; + sem_wait(&cb->sem); + if (cb->state != CONNECT_REQUEST) { + fprintf(stderr, "wait for CONNECT_REQUEST state %d\n", + cb->state); + return -1; + } + ret = rping_setup_qp(cb, cb->child_cm_id); if (ret) { fprintf(stderr, "setup_qp failed: %d\n", ret); @@ -987,6 +1083,7 @@ static void usage(char *name) printf("\t-C count\tping count times\n"); printf("\t-a addr\t\taddress\n"); printf("\t-p port\t\tport\n"); + printf("\t-P\t\tpersistent server mode allowing multiple connections\n"); } int main(int argc, char *argv[]) @@ -994,6 +1091,7 @@ int main(int argc, char *argv[]) struct rping_cb *cb; int op; int ret = 0; + int persistent_server = 0; cb = malloc(sizeof(*cb)); if (!cb) @@ -1007,13 +1105,16 @@ int main(int argc, char *argv[]) sem_init(&cb->sem, 0, 0); opterr = 0; - while ((op=getopt(argc, argv, "a:p:C:S:t:scvVd")) != -1) { + while ((op=getopt(argc, argv, "a:Pp:C:S:t:scvVd")) != -1) { switch (op) { case 'a': cb->addr_str = optarg; cb->addr = inet_addr(optarg); DEBUG_LOG("ipaddr (%s)\n", optarg); break; + case 'P': + persistent_server = 1; + break; case 'p': cb->port = htons(atoi(optarg)); DEBUG_LOG("port %d\n", (int) atoi(optarg)); @@ -1089,9 +1190,12 @@ int main(int argc, char *argv[]) pthread_create(&cb->cmthread, NULL, cm_thread, cb); - if (cb->server) - ret = rping_run_server(cb); - else + if (cb->server) { + if (persistent_server) + ret = rping_run_persistent_server(cb); + else + ret = rping_run_server(cb); + } else ret = rping_run_client(cb); DEBUG_LOG("destroy cm_id %p\n", cb->cm_id); diff --git a/man/rping.1 b/man/rping.1 index 153436a..a2b7b6b 100644 --- a/man/rping.1 +++ b/man/rping.1 @@ -4,7 +4,7 @@ rping \- RDMA CM connection and RDMA pin .SH SYNOPSIS .sp .nf -\fIrping\fR -s [-v] [-V] [-d] [-a address] [-p port] +\fIrping\fR -s [-v] [-V] [-d] [-P] [-a address] [-p port] [-C message_count] [-S message_size] \fIrping\fR -c [-v] [-V] [-d] -a address [-p port] [-C message_count] [-S message_size] @@ -42,6 
+42,10 @@ The number of messages to transfer over .TP \-S message_size The size of each message transferred, in bytes. (default 100) +.TP +\-P +Run the server in persistent mode. This allows multiple rping clients +to connect to a single server instance. The server will run until killed. .SH "NOTES" Because this test maps RDMA resources to userspace, users must ensure that they have available system resources and permissions. See the From mshefty at ichips.intel.com Fri Sep 7 11:25:17 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Sep 2007 11:25:17 -0700 Subject: [ofa-general] [PATH librdmacm] rping - persistent server support In-Reply-To: <1189188749.7812.3.camel@stevo-desktop> References: <1189188749.7812.3.camel@stevo-desktop> Message-ID: <46E1978D.1060309@ichips.intel.com> Thanks - I've added this to my patch list for the librdmacm. I plan on releasing a new version of the librdmacm next week, pending the acceptance of the kernel quality of service changes, which I'll ask Roland to pull for 2.6.24 after he returns. - Sean From Ashish.Batwara at lsi.com Fri Sep 7 13:25:32 2007 From: Ashish.Batwara at lsi.com (Batwara, Ashish) Date: Fri, 7 Sep 2007 14:25:32 -0600 Subject: [ofa-general] Port State Change Event Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01B1F4F8@NAMAIL2.ad.lsil.com> Hi, I am looking for a single point in code where I can get the information about the port state change. We are using mthca driver. I can see port_change in mthca_eq.c, but here I can only see two states - Active and Down. Is there any place in the code where I can see about other states as well, e.g. Arm, Init, Active Defer. Thanks Ashish -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From weiny2 at llnl.gov Fri Sep 7 15:19:24 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 7 Sep 2007 15:19:24 -0700 Subject: [ofa-general] [PATCH] infiniband-diags/src/smpquery.c: fix compiler warning Message-ID: <20070907151924.0abb2e83.weiny2@llnl.gov> >From a20eaa1b0743aa1cc0c11372c2a989911cb5bcde Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Fri, 7 Sep 2007 15:10:51 -0700 Subject: [PATCH] infiniband-diags/src/smpquery.c: fix compiler warning Signed-off-by: Ira K. Weiny --- infiniband-diags/src/smpquery.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c index 55e0b3f..9e25255 100644 --- a/infiniband-diags/src/smpquery.c +++ b/infiniband-diags/src/smpquery.c @@ -94,7 +94,7 @@ node_desc(ib_portid_t *dest, char **argv, int argc) int node_type, l; uint64_t node_guid; char nd[IB_SMP_DATA_SIZE]; - char data[IB_SMP_DATA_SIZE]; + uint8_t data[IB_SMP_DATA_SIZE]; char dots[128]; char *nodename = NULL; -- 1.5.1 -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-infiniband-diags-src-smpquery.c-fix-compiler-warnin.patch Type: application/octet-stream Size: 837 bytes Desc: not available URL: From weiny2 at llnl.gov Fri Sep 7 15:19:25 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 7 Sep 2007 15:19:25 -0700 Subject: [ofa-general] [PATCH] infiniband-diags/src/ibdiag_common.c: do not print warning of failed default switch map open Message-ID: <20070907151925.2355abe8.weiny2@llnl.gov> >From 59f8772d60a4b061eb2e27ded9abecc9b9e83d5c Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Fri, 7 Sep 2007 15:08:20 -0700 Subject: [PATCH] infiniband-diags/src/ibdiag_common.c: do not print warning of failed default switch map open This really clutters up some of the diag scripts output now that more of the tools support the switch map functionality. Signed-off-by: Ira K. 
Weiny --- infiniband-diags/src/ibdiag_common.c | 5 ----- 1 files changed, 0 insertions(+), 5 deletions(-) diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index e4e381a..d216067 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -64,11 +64,6 @@ open_switch_map(char *switch_map) #ifdef HAVE_DEFAULT_SWITCH_MAP } else { rc = fopen(HAVE_DEFAULT_SWITCH_MAP, "r"); - if (rc == NULL) { - fprintf(stderr, - "WARNING failed to open switch map \"%s\" (%s)\n", - HAVE_DEFAULT_SWITCH_MAP, strerror(errno)); - } #endif /* HAVE_DEFAULT_SWITCH_MAP */ } return (rc); -- 1.5.1 -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-infiniband-diags-src-ibdiag_common.c-do-not-print-w.patch Type: application/octet-stream Size: 1100 bytes Desc: not available URL: From weiny2 at llnl.gov Fri Sep 7 15:21:21 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 7 Sep 2007 15:21:21 -0700 Subject: [ofa-general] [PATCH] Fix regexp's for new ibnetdiscover output Message-ID: <20070907152121.4ac611f5.weiny2@llnl.gov> The ibnetdiscover output has changed, so this command was failing. I am not sure when this happened, but no matter, this should fix it. Ira >From 9aadfb84826a5ea31107624b4b29e90d7c97e55b Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Fri, 7 Sep 2007 14:34:09 -0700 Subject: [PATCH] Fix regexp's for new ibnetdiscover output Signed-off-by: Ira K. 
Weiny --- infiniband-diags/scripts/IBswcountlimits.pm | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm index 6cfa76c..f1e16d2 100755 --- a/infiniband-diags/scripts/IBswcountlimits.pm +++ b/infiniband-diags/scripts/IBswcountlimits.pm @@ -251,7 +251,7 @@ sub get_link_ends if ( $in_switch eq "yes" ) { my $rec = undef; - if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) + if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) { $loc_port = $1; my $rem_guid = $2; @@ -262,7 +262,7 @@ sub get_link_ends loc_sw_lid => $loc_sw_lid, rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => "", rem_desc => $rem_desc }; } - if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) + if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) { $loc_port = $1; my $loc_ext_port = $2; @@ -274,7 +274,7 @@ sub get_link_ends loc_sw_lid => $loc_sw_lid, rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => "", rem_desc => $rem_desc }; } - if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) + if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) { $loc_port = $1; my $rem_guid = $2; @@ -286,7 +286,7 @@ sub get_link_ends loc_sw_lid => $loc_sw_lid, rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => $rem_ext_port, rem_desc => $rem_desc }; } - if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) + if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) { $loc_port = $1; my $loc_ext_port = $2; -- 1.5.1 -------------- next part -------------- A 
non-text attachment was scrubbed... Name: 0001-Fix-regexp-s-for-new-ibnetdiscover-output.patch Type: application/octet-stream Size: 2598 bytes Desc: not available URL: From weiny2 at llnl.gov Fri Sep 7 15:25:41 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 7 Sep 2007 15:25:41 -0700 Subject: [ofa-general] [PATCH] Add -C and -P options to perl diags to be able to use alternate CA's and ports Message-ID: <20070907152541.6dc1f27b.weiny2@llnl.gov> We have a few nodes which are connected to multiple fabrics. The perl diags were unable to specify which port or CA to use. In our case this left us unable to use these tools on one of the subnets attached. This patch adds that support. Ira >From b2f95d93e1c2a730f554275cf636ccd687d1106e Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 6 Sep 2007 09:37:10 -0700 Subject: [PATCH] Add -C and -P options to perl diags to be able to use alternate CA's and ports infiniband-diags/scripts/IBswcountlimits.pm infiniband-diags/scripts/ibfindnodesusing.pl infiniband-diags/scripts/iblinkinfo.pl infiniband-diags/scripts/ibqueryerrors.pl Signed-off-by: Ira K. 
Weiny --- infiniband-diags/scripts/IBswcountlimits.pm | 53 +++++++++++++++++++++++-- infiniband-diags/scripts/ibfindnodesusing.pl | 21 +++++++--- infiniband-diags/scripts/iblinkinfo.pl | 23 +++++++---- infiniband-diags/scripts/ibqueryerrors.pl | 31 +++++++++++---- 4 files changed, 100 insertions(+), 28 deletions(-) diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm index f1e16d2..ece1284 100755 --- a/infiniband-diags/scripts/IBswcountlimits.pm +++ b/infiniband-diags/scripts/IBswcountlimits.pm @@ -215,22 +215,59 @@ sub ensure_cache_dir } # ========================================================================= +# get_link_ends(ca_name, ca_port) # -sub generate_ibnetdiscover_topology +sub get_cache_file { + my $ca_name = $_[0]; + my $ca_port = $_[1]; ensure_cache_dir; - `ibnetdiscover -g > $IBswcountlimits::cache_dir/ibnetdiscover.topology`; + return ("$IBswcountlimits::cache_dir/ibnetdiscover-$ca_name-$ca_port.topology"); +} + +# ========================================================================= +# get_ca_name_port_param_string(ca_name, ca_port) +# +sub get_ca_name_port_param_string +{ + my $ca_name = $_[0]; + my $ca_port = $_[1]; + + if ("$ca_name" ne "") { $ca_name = "-C $ca_name"; } + if ("$ca_port" ne "") { $ca_port = "-P $ca_port"; } + + return ("$ca_name $ca_port"); +} + +# ========================================================================= +# generate_ibnetdiscover_topology(ca_name, ca_port) +# +sub generate_ibnetdiscover_topology +{ + my $ca_name = $_[0]; + my $ca_port = $_[1]; + my $cache_file = get_cache_file($ca_name, $ca_port); + my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); + + `ibnetdiscover -g $extra_params > $cache_file`; if ($? 
!= 0) { die "Execution of ibnetdiscover failed with errors\n"; } } # ========================================================================= +# get_link_ends(regenerate_map, ca_name, ca_port) # sub get_link_ends { - if (!(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) { generate_ibnetdiscover_topology; } - open IBNET_TOPO, "<$IBswcountlimits::cache_dir/ibnetdiscover.topology" or die "Failed to open ibnet topology: $!\n"; + my $regenerate_map = $_[0]; + my $ca_name = $_[1]; + my $ca_port = $_[2]; + + my $cache_file = get_cache_file($ca_name, $ca_port); + + if ($regenerate_map || !(-f "$cache_file")) { generate_ibnetdiscover_topology($ca_name, $ca_port); } + open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology: $!\n"; my $in_switch = "no"; my $desc = ""; my $guid = ""; @@ -310,12 +347,18 @@ sub get_link_ends close IBNET_TOPO; } +# ========================================================================= +# get_num_ports(switch_guid, ca_name, ca_port) +# sub get_num_ports { my $guid = $_[0]; + my $ca_name = $_[1]; + my $ca_port = $_[2]; my $num_ports = 0; + my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - my $data = `smpquery -G nodeinfo $guid`; + my $data = `smpquery $extra_params -G nodeinfo $guid`; my @lines = split("\n", $data); my $pkt_lifetime = ""; foreach my $line (@lines) { diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl index 1b60328..626bebe 100755 --- a/infiniband-diags/scripts/ibfindnodesusing.pl +++ b/infiniband-diags/scripts/ibfindnodesusing.pl @@ -42,6 +42,8 @@ use strict; use Getopt::Std; use IBswcountlimits; +my $ca_name = ""; +my $ca_port = ""; # ========================================================================= # @@ -50,10 +52,11 @@ sub get_hosts_routed my $sw_guid = $_[0]; my $sw_port = $_[1]; my @hosts = undef; + my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); if ($sw_guid eq "") { return (@hosts); } - my 
$data = `ibroute -G $sw_guid`; + my $data = `ibroute $extra_params -G $sw_guid`; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /\w+\s+(\d+)\s+:\s+\(Channel Adapter.*:\s+'(.*)'\)/) @@ -73,23 +76,27 @@ sub get_hosts_routed sub usage_and_exit { my $prog = $_[0]; - print "Usage: $prog \n"; + print "Usage: $prog [-R -C -P ] \n"; print " find a list of nodes which are routed through switch:port\n"; print " -R Recalculate ibnetdiscover information\n"; + print " -C use selected Channel Adaptor name for queries\n"; + print " -P use selected channel adaptor port for queries\n"; exit 0; } my $argv0 = `basename $0`; my $regenerate_map = undef; chomp $argv0; -if (!getopts("hR")) { usage_and_exit $argv0; } +if (!getopts("hRC:P:")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } +if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } +if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } my $target_switch = $ARGV[0]; my $target_port = $ARGV[1]; -if ($regenerate_map || !(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) { generate_ibnetdiscover_topology; } +get_link_ends($regenerate_map, $ca_name, $ca_port); if ($target_switch eq "" || $target_port eq "") { @@ -160,7 +167,8 @@ sub compress_hostlist sub main { my $found_switch = undef; - open IBNET_TOPO, "<$IBswcountlimits::cache_dir/ibnetdiscover.topology" or die "Failed to open ibnet topology\n"; + my $cache_file = get_cache_file($ca_name, $ca_port); + open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; my $in_switch = "no"; my $switch_guid = ""; my $desc = undef; @@ -191,13 +199,12 @@ FOUND: if (! 
$found_switch) { print "Switch \"$target_switch\" not found\n"; - print " Try running with the \"-R\" option.\n"; + print " Try running with the \"-R\" or \"-P\" option.\n"; exit 1; } $switch_guid = "0x$switch_guid"; - get_link_ends; my $hr = $IBswcountlimits::link_ends{$switch_guid}{$target_port}; my $rem_sw_guid = $hr->{rem_guid}; my $rem_sw_port = $hr->{rem_port}; diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl index 73ac585..1298f57 100755 --- a/infiniband-diags/scripts/iblinkinfo.pl +++ b/infiniband-diags/scripts/iblinkinfo.pl @@ -43,7 +43,7 @@ use IBswcountlimits; sub usage_and_exit { my $prog = $_[0]; - print "Usage: $prog [-Rhclp -S ]\n"; + print "Usage: $prog [-Rhclp -S -C -P ]\n"; print " Report link speed and connection for each port of each switch which is active\n"; print " -h This help message\n"; print " -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; @@ -52,6 +52,8 @@ sub usage_and_exit print " -l (line mode) print all information for each link on each line\n"; print " -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n"; print " -c print port capabilities (enabled/supported values)\n"; + print " -C use selected Channel Adaptor name for queries\n"; + print " -P use selected channel adaptor port for queries\n"; exit 0; } @@ -62,9 +64,11 @@ my $line_mode = undef; my $print_add_switch = undef; my $print_extended_cap = undef; my $only_down_links = undef; +my $ca_name = ""; +my $ca_port = ""; chomp $argv0; -if (!getopts("hcpldRS:")) { usage_and_exit $argv0; } +if (!getopts("hcpldRS:C:P:")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } if (defined $Getopt::Std::opt_S) { $single_switch = $Getopt::Std::opt_S; } @@ -72,18 +76,21 @@ if (defined $Getopt::Std::opt_d) { $only_down_links = $Getopt::Std::opt_d; } if (defined $Getopt::Std::opt_l) 
{ $line_mode = $Getopt::Std::opt_l; } if (defined $Getopt::Std::opt_p) { $print_add_switch = $Getopt::Std::opt_p; } if (defined $Getopt::Std::opt_c) { $print_extended_cap = $Getopt::Std::opt_c; } +if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } +if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } + +my $extra_smpquery_params = get_ca_name_port_param_string($ca_name, $ca_port); sub main { - if ($regenerate_map) { generate_ibnetdiscover_topology; } - get_link_ends; + get_link_ends($regenerate_map, $ca_name, $ca_port); foreach my $switch (sort (keys (%IBswcountlimits::link_ends))) { if ($single_switch && $switch ne $single_switch) { next; } my $switch_prompt = "no"; - my $num_ports = get_num_ports($switch); + my $num_ports = get_num_ports($switch, $ca_name, $ca_port); if ($num_ports == 0) { printf("ERROR: switch $switch has 0 ports???\n"); } @@ -95,7 +102,7 @@ sub main if ($only_down_links) { $print_switch = "no"; } if ($print_add_switch) { - my $data = `smpquery -G switchinfo $switch`; + my $data = `smpquery $extra_smpquery_params -G switchinfo $switch`; if ($data eq "") { printf("ERROR: failed to get switchinfo for $switch\n"); } @@ -111,7 +118,7 @@ sub main sprintf ("Switch %18s %s%s:\n", $switch, $hr->{loc_desc}, $pkt_life_prompt)); $switch_prompt = "yes"; } - my $data = `smpquery -G portinfo $switch $port`; + my $data = `smpquery $extra_smpquery_params -G portinfo $switch $port`; if ($data eq "") { printf("ERROR: failed to get portinfo for $switch port $port\n"); } @@ -147,7 +154,7 @@ sub main my $rem_width_enable = ""; if ($rem_lid ne "" && $rem_port ne "") { - $data = `smpquery portinfo $rem_lid $rem_port`; + $data = `smpquery $extra_smpquery_params portinfo $rem_lid $rem_port`; if ($data eq "") { printf("ERROR: failed to get portinfo for $switch port $port\n"); } diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl index 67e5f0f..bdb458d 100755 --- 
a/infiniband-diags/scripts/ibqueryerrors.pl +++ b/infiniband-diags/scripts/ibqueryerrors.pl @@ -43,6 +43,7 @@ my $print_action = "no"; my $report_port_info = undef; my $single_switch = undef; my $include_data_counters = undef; +my $cache_file = ""; # ========================================================================= # @@ -50,6 +51,9 @@ sub report_counts { my $addr = $_[0]; my $port = $_[1]; + my $ca_name = $_[2]; + my $ca_port = $_[3]; + my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); if (any_counts()) { @@ -65,7 +69,7 @@ sub report_counts my $lid = ""; my $speed = ""; my $width = ""; - my $data = `smpquery -G portinfo $addr $port`; + my $data = `smpquery $extra_params -G portinfo $addr $port`; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /^# Port info: Lid (\w+) port.*/) { $lid = $1; } @@ -94,7 +98,11 @@ sub get_counts { my $addr = $_[0]; my $port = $_[1]; - my $data = `perfquery -G $addr $port`; + my $ca_name = $_[2]; + my $ca_port = $_[3]; + my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); + + my $data = `perfquery $extra_params -G $addr $port`; my @lines = split("\n", $data); foreach my $line (@lines) { @@ -113,7 +121,7 @@ sub get_counts my %switches = (); sub get_switches { - my $data = `ibswitches $IBswcountlimits::cache_dir/ibnetdiscover.topology`; + my $data = `ibswitches $cache_file`; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /^Switch\s+:\s+(\w+)\s+ports\s+(\d+)\s+.*/) @@ -128,7 +136,7 @@ sub get_switches sub usage_and_exit { my $prog = $_[0]; - print "Usage: $prog [-a -c -r -R -s -S -d]\n"; + print "Usage: $prog [-a -c -r -R -s -S -d -C -P ]\n"; print " Report counters on all switches in subnet\n"; print " -a Report an action to take\n"; print " -c suppress some of the common counters\n"; @@ -137,15 +145,19 @@ sub usage_and_exit print " -s suppress errors listed\n"; print " -S query only \n"; print " -d include the data counters in the output\n"; + 
print " -C use selected Channel Adaptor name for queries\n"; + print " -P use selected channel adaptor port for queries\n"; exit 0; } my $argv0 = `basename $0`; my $regenerate_map = undef; my $single_switch = undef; +my $ca_name = ""; +my $ca_port = ""; chomp $argv0; -if (!getopts("has:crRS:d")) { usage_and_exit $argv0; } +if (!getopts("has:crRS:dC:P:")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_a) { $print_action = "yes"; } if (defined $Getopt::Std::opt_s) { @IBswcountlimits::suppress_errors = split (",", $Getopt::Std::opt_s); } @@ -157,6 +169,10 @@ if (defined $Getopt::Std::opt_r) { $report_port_info = $Getopt::Std::opt_r; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } if (defined $Getopt::Std::opt_S) { $single_switch = $Getopt::Std::opt_S; } if (defined $Getopt::Std::opt_d) { $include_data_counters = $Getopt::Std::opt_d; } +if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } +if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } + +$cache_file = get_cache_file($ca_name, $ca_port); sub main { @@ -165,16 +181,15 @@ sub main my $msg = join(",", @IBswcountlimits::suppress_errors); print "Suppressing: $msg\n"; } - if ($regenerate_map || !(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) { generate_ibnetdiscover_topology; } + get_link_ends($regenerate_map, $ca_name, $ca_port); get_switches; - get_link_ends; foreach my $sw_addr (keys %switches) { if ($single_switch && $sw_addr ne "$single_switch") { next; } my $switch_prompt = "no"; foreach my $sw_port (1 .. $switches{$sw_addr}) { clear_counters; - get_counts($sw_addr, $sw_port); + get_counts($sw_addr, $sw_port, $ca_name, $ca_port); if (any_counts() && $switch_prompt eq "no") { my $hr = $IBswcountlimits::link_ends{"$sw_addr"}{$sw_port}; -- 1.5.1 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0002-Add-C-and-P-options-to-perl-diags-to-be-able-to-us.patch Type: application/octet-stream Size: 15256 bytes Desc: not available URL: From hal.rosenstock at gmail.com Fri Sep 7 18:49:30 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 7 Sep 2007 21:49:30 -0400 Subject: [ofa-general] [PATCH][RFC] P_Key support for umad In-Reply-To: <46E189ED.9030902@ichips.intel.com> References: <46E189ED.9030902@ichips.intel.com> Message-ID: On 9/7/07, Sean Hefty wrote: > > Please review/test. I would like to get this into 2.6.24 if possible > > since we've known so long that we needed it. > > Thanks for writing this up. The patch itself looks good, and I didn't > see any problems running with the existing userspace code. What tests did you run ? -- Hal > > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Fri Sep 7 18:56:19 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 7 Sep 2007 21:56:19 -0400 Subject: [ofa-general] [PATCH] Fix regexp's for new ibnetdiscover output In-Reply-To: <20070907152121.4ac611f5.weiny2@llnl.gov> References: <20070907152121.4ac611f5.weiny2@llnl.gov> Message-ID: Hi Ira, On 9/7/07, Ira Weiny wrote: > The ibnetdiscover output has changed so this command was failing. I am not sure when this happened, but no matter, this should fix it. It matters as this may also be an OFED 1.2 issue. -- Hal > > Ira > > > >From 9aadfb84826a5ea31107624b4b29e90d7c97e55b Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Fri, 7 Sep 2007 14:34:09 -0700 > Subject: [PATCH] Fix regexp's for new ibnetdiscover output > > Signed-off-by: Ira K.
Weiny > --- > infiniband-diags/scripts/IBswcountlimits.pm | 8 ++++---- > 1 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm > index 6cfa76c..f1e16d2 100755 > --- a/infiniband-diags/scripts/IBswcountlimits.pm > +++ b/infiniband-diags/scripts/IBswcountlimits.pm > @@ -251,7 +251,7 @@ sub get_link_ends > if ( $in_switch eq "yes" ) > { > my $rec = undef; > - if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > + if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > { > $loc_port = $1; > my $rem_guid = $2; > @@ -262,7 +262,7 @@ sub get_link_ends > loc_sw_lid => $loc_sw_lid, > rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => "", rem_desc => $rem_desc }; > } > - if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > + if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > { > $loc_port = $1; > my $loc_ext_port = $2; > @@ -274,7 +274,7 @@ sub get_link_ends > loc_sw_lid => $loc_sw_lid, > rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => "", rem_desc => $rem_desc }; > } > - if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > + if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > { > $loc_port = $1; > my $rem_guid = $2; > @@ -286,7 +286,7 @@ sub get_link_ends > loc_sw_lid => $loc_sw_lid, > rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => $rem_ext_port, rem_desc => $rem_desc }; > } > - if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > + if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > { > 
$loc_port = $1; > my $loc_ext_port = $2; > -- > 1.5.1 > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From rdreier at cisco.com Fri Sep 7 22:07:55 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Sep 2007 22:07:55 -0700 Subject: [ofa-general] [PATCH][RFC] P_Key support for umad In-Reply-To: (Hal Rosenstock's message of "Fri, 7 Sep 2007 21:49:30 -0400") References: <46E189ED.9030902@ichips.intel.com> Message-ID: > What tests did you run ? Just ibsrpdm and I also started opensm to make sure it seemed to work. From rdreier at cisco.com Fri Sep 7 22:10:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 07 Sep 2007 22:10:13 -0700 Subject: [ofa-general] Re: [PATCH][RFC] P_Key support for umad In-Reply-To: <20070907102435.GA9410@mellanox.co.il> (Michael S.
Tsirkin's message of "Fri, 7 Sep 2007 13:24:35 +0300") References: <20070907102435.GA9410@mellanox.co.il> Message-ID: > Can this ioctl be used to address the 32/64 bit issues that we have, > somehow? I'm not sure I know which issue you're asking about or what you're suggesting, but I don't see how. Right now a 32-bit application sees a different ABI from 32-bit vs. 64-bit big endian kernels, and this ioctl doesn't change that. - R. From vlad at lists.openfabrics.org Sat Sep 8 02:47:44 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 8 Sep 2007 02:47:44 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070908-0200 daily build status Message-ID: <20070908094744.C6CE8E60884@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with 
linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.22 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070908-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ----------------------------------------------------------------------------------
From hal.rosenstock at gmail.com Sat Sep 8 11:06:09 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 8 Sep 2007 14:06:09 -0400 Subject: [ofa-general] [PATCH][RFC] P_Key support for umad In-Reply-To: References: <46E189ED.9030902@ichips.intel.com> Message-ID: On 9/8/07, Roland Dreier wrote: > > What tests did you run ? > > Just ibsrpdm and I also started opensm to make sure it seemed to work. Was RMPP exercised ? Just wondering... Hope to get a chance to test this next week. -- Hal From dotanb at dev.mellanox.co.il Sat Sep 8 23:07:44 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 09 Sep 2007 09:07:44 +0300 Subject: [ofa-general] Port State Change Event In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01B1F4F8@NAMAIL2.ad.lsil.com> References: <01B9E81EECACE94DBBD0A556E768FB8A01B1F4F8@NAMAIL2.ad.lsil.com> Message-ID: <46E38DB0.30608@dev.mellanox.co.il> Hi. Batwara, Ashish wrote: > > Hi, > > I am looking for a single point in code where I can get the > information about the port state change. We are using mthca driver. I > can see port_change in mthca_eq.c, but here I can only see two states > – Active and Down.
Is there any place in the code where I can see > about other states as well, e.g. Arm, Init, Active Defer. > What exactly do you need? I believe that you saw the code that produces the events (port active and port down). The entity that takes care of the state machine of the logical link is OpenSM (or any other Subnet Manager): it sends MADs to the IB ports of the nodes in the subnet, configures the ports' properties, and moves the logical link to the active state. Thanks, Dotan From jackm at dev.mellanox.co.il Sat Sep 8 23:29:20 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 9 Sep 2007 09:29:20 +0300 Subject: [ofa-general] [PATCHv4 RFC] Scalable Reliable Connection: API and documentation In-Reply-To: <20070809072418.GA5673@mellanox.co.il> References: <20070808071910.GC23514@mellanox.co.il> <20070809072418.GA5673@mellanox.co.il> Message-ID: <200709090929.21131.jackm@dev.mellanox.co.il> On Thursday 09 August 2007 10:24, Michael S. Tsirkin wrote: > +/** > + * ibv_open_src_domain - open an SRC domain > + * Returns a reference to an SRC domain. > + * > + * @context: Device context > + * @fd: descriptor for inode associated with the domain > + *     If fd == -1, no inode is associated with the domain; in this case, > + *     the only legal value for oflag is O_CREAT > + * > + * @oflag: oflag values are constructed by OR-ing flags from the following list > + * > + * O_CREAT > + *     If a domain belonging to device named by context is already associated > + *     with the inode, this flag has no effect, except as noted under O_EXCL > + *     below. Otherwise, a new SRC domain is created and is associated with > + *     inode specified by fd. > + * > + * O_EXCL > + *     If O_EXCL and O_CREAT are set, open will fail if a domain associated with > + *     the inode exists.
The check for the existence of the domain and creation > + *     of the domain if it does not exist is atomic with respect to other > + *     processes executing open with fd naming the same inode. > + */ > +struct ibv_src_domain *ibv_open_src_domain(struct ibv_context *context, > +                                          int fd, int oflag); > Michael, Why do we need the EXCL bit? If an app wishes to open exclusive, it can just set fd = -1, and the domain obtained is limited to that process. Is there some other intent for opening exclusive besides restricting the obtained domain to a single process? If we get rid of the EXCL flag, then we can eliminate the oflag parameter. - Jack From mst at dev.mellanox.co.il Sun Sep 9 00:00:09 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 9 Sep 2007 10:00:09 +0300 Subject: [ofa-general] Re: [PATCHv4 RFC] Scalable Reliable Connection: API and documentation In-Reply-To: <200709090929.21131.jackm@dev.mellanox.co.il> References: <20070808071910.GC23514@mellanox.co.il> <20070809072418.GA5673@mellanox.co.il> <200709090929.21131.jackm@dev.mellanox.co.il> Message-ID: <20070909070009.GC17902@mellanox.co.il> > Quoting Jack Morgenstein : > Subject: Re: [PATCHv4 RFC] Scalable Reliable Connection: API and documentation > > On Thursday 09 August 2007 10:24, Michael S. Tsirkin wrote: > > +/** > > + * ibv_open_src_domain - open an SRC domain > > + * Returns a reference to an SRC domain. > > + * > > + * @context: Device context > > + * @fd: descriptor for inode associated with the domain > > + *     If fd == -1, no inode is associated with the domain; in this case, > > + *     the only legal value for oflag is O_CREAT > > + * > > + * @oflag: oflag values are constructed by OR-ing flags from the following list > > + * > > + * O_CREAT > > + *     If a domain belonging to device named by context is already associated > > + *     with the inode, this flag has no effect, except as noted under O_EXCL > > + *     below. 
Otherwise, a new SRC domain is created and is associated with > > + *     inode specified by fd. > > + * > > + * O_EXCL > > + *     If O_EXCL and O_CREAT are set, open will fail if a domain associated with > > + *     the inode exists. The check for the existence of the domain and creation > > + *     of the domain if it does not exist is atomic with respect to other > > + *     processes executing open with fd naming the same inode. > > + */ > > +struct ibv_src_domain *ibv_open_src_domain(struct ibv_context *context, > > +                                          int fd, int oflag); > > > Michael, > > Why do we need the EXCL bit? > If an app wishes to open exclusive, it can just > set fd = -1, and the domain obtained is limited to that process. > > Is there some other intent for opening exclusive besides restricting the > obtained domain to a single process? O_EXCL is not used to restrict the domain to a single process. Rather, it is used to test for domain existence. > If we get rid of the EXCL flag, then we can eliminate the oflag parameter. We still want the O_CREAT flag, so we can't, anyway. -- MST From mst at mellanox.co.il Sun Sep 9 02:01:21 2007 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 9 Sep 2007 12:01:21 +0300 Subject: [ofa-general] [PATCHv2] libmlx4: Reset RQ doorbell counter after QP reset In-Reply-To: <20070904112837.GC23437@mellanox.co.il> References: <20070904112837.GC23437@mellanox.co.il> Message-ID: <20070909090121.GH17902@mellanox.co.il> Signed-off-by: Michael S. Tsirkin --- diff --git a/src/verbs.c b/src/verbs.c index 78dfabf..4e7beff 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -484,6 +484,8 @@ int mlx4_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, mlx4_cq_clean(to_mcq(qp->send_cq), qp->qp_num, NULL); mlx4_init_qp_indices(to_mqp(qp)); + if (!qp->srq) + *to_mqp(qp)->db = 0; } return ret; -- Michael S. Tsirkin - Staff Engineer, Mellanox Technologies Ltd. Of all the ways of starting a fire, the best is dry matches.
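A note on the O_CREAT / O_EXCL discussion in the SRC domain thread above: the wording proposed for ibv_open_src_domain() reads much like POSIX open(2), where O_EXCL combined with O_CREAT turns creation into an atomic "does it already exist?" test. The following is only a minimal sketch of that analogous behaviour on an ordinary file. It assumes nothing about how the SRC domain API would eventually be implemented, and create_exclusive() is a made-up helper name, not a libibverbs call.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration only (not part of libibverbs).
 * Tries to create `path` exclusively.
 * Returns 1 if this call created the file, 0 if the file already
 * existed, and -1 on any other error.
 */
static int create_exclusive(const char *path)
{
	/* With O_CREAT | O_EXCL the existence check and the creation
	 * are a single atomic step: open() fails with EEXIST if the
	 * file is already there. */
	int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);

	if (fd >= 0) {
		close(fd);
		return 1;	/* we were the creator */
	}
	return errno == EEXIST ? 0 : -1;
}
```

Calling create_exclusive() twice on the same fresh path would be expected to return 1 and then 0. A caller that gets 0 knows some other party created the object first, which is the kind of existence test Michael describes. With O_CREAT alone the second open would simply succeed, so the caller could not tell the two cases apart.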
From vlad at lists.openfabrics.org Sun Sep 9 02:49:46 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 9 Sep 2007 02:49:46 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070909-0200 daily build status Message-ID: <20070909094946.EBF82E6084A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on 
ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070909-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From mst at dev.mellanox.co.il Sun Sep 9 04:30:19 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 9 Sep 2007 14:30:19 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: shrinking WQE Message-ID: <20070909112917.GA25910@mellanox.co.il> ConnectX supports shrinking wqe, such that a single WR can include multiple units of wqe_shift. This way, WRs can differ in size, and do not have to be a power of 2 in size, saving memory and speeding up send WR posting. Unfortunately, if we do this, the wqe_index field in the CQE can't be used to look up the WR ID anymore, so we do this only if selective signalling is off. Further, on 32-bit platforms, we can't use vmap to make the QP buffer virtually contiguous. 
Thus we have to use constant-sized WRs to make sure a WR is always fully within a single page-sized chunk. Finally, we use NOP opcode to avoid wrap-around in the middle of WR. Since MLX QPs only support SEND, we use constant-sized WRs in this case. We look for the smallest value of wqe_shift such that the resulting number of wqes does not exceed device capabilities. Signed-off-by: Michael S. Tsirkin --- Added some missing hunks to make the code actually compile and work. diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 8bf44da..0981f3c 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -331,6 +331,11 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +358,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if ((*cur_qp)->ibqp.srq) { @@ -403,6 +410,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, case MLX4_OPCODE_BIND_MW: wc->opcode = IB_WC_BIND_MW; break; + default: + printk("Unrecognized send opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } } else { wc->byte_len = be32_to_cpu(cqe->byte_cnt); @@ -422,6 +433,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + default: + printk("Unrecognized recv opcode 
0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 705ff2f..a72ecb9 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -115,6 +115,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int sq_spare_wqes; struct mlx4_ib_wq sq; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index ba0428d..ff6c186 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,70 @@ static void *get_send_wqe(struct mlx4_ib_qp *qp, int n) /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). + * + * When max WR is than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. */ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { u32 *wqe = get_send_wqe(qp, n); int i; + int s; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift) / sizeof *wqe; + if (qp->sq_max_wqes_per_wr > 1) { + stamp = cpu_to_be32(0x7fffffff | (n & qp->sq.wqe_cnt ? 
0 : 1 << 31)); + for (i = 0; i < s; i += 16) + wqe[i] = stamp; + } else { + for (i = 16; i < s; i += 16) + wqe[i] = 0xffffffff; + } +} + +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + stamp_send_wqe(qp, (n + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1), size); + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = qp->ibqp.qp_type == IB_QPT_UD ? sizeof(struct mlx4_wqe_datagram_seg) : 0; + + /* Pad the remainder of the WQE with inline data segments. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); + + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); +} - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -234,9 +289,35 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, return 0; } +static int nop_wqe_shift(enum ib_qp_type type) +{ + /* + * WQE size is at least 0x20. + * UD WQEs must have a datagram segment. + * RC and UC WQEs must have control segment. + * MLX WQEs do not support NOP. 
+ */ + switch (type) { + case IB_QPT_UD: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_datagram_seg), + (size_t)0x20))); + case IB_QPT_SMI: + case IB_QPT_GSI: + return -EINVAL; + case IB_QPT_UC: + case IB_QPT_RC: + default: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg), + (size_t)0x20))); + } +} + static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +333,60 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this wqe_index field in CQE + * can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. + * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contigious. Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. 
+ * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * Since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + qp->sq.wqe_shift = nop_wqe_shift(type); + if (!qp->sq_signal_bits || BITS_PER_LONG != 64 || qp->sq.wqe_shift < 0) + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. + */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +398,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +437,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if 
(init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +533,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1027,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1228,14 +1351,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; - + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { err = -ENOMEM; @@ -1250,7 +1373,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? 
@@ -1266,7 +1389,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->imm = 0; wqe += sizeof *ctrl; - size = sizeof *ctrl / 16; + size = sizeof *ctrl; switch (ibqp->qp_type) { case IB_QPT_RC: @@ -1281,8 +1404,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_atomic_seg(wqe, wr); wqe += sizeof (struct mlx4_wqe_atomic_seg); - size += (sizeof (struct mlx4_wqe_raddr_seg) + - sizeof (struct mlx4_wqe_atomic_seg)) / 16; + size += sizeof (struct mlx4_wqe_raddr_seg) + + sizeof (struct mlx4_wqe_atomic_seg); break; @@ -1292,7 +1415,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_raddr_seg(wqe, wr->wr.rdma.remote_addr, wr->wr.rdma.rkey); wqe += sizeof (struct mlx4_wqe_raddr_seg); - size += sizeof (struct mlx4_wqe_raddr_seg) / 16; + size += sizeof (struct mlx4_wqe_raddr_seg); break; default: @@ -1304,7 +1427,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, case IB_QPT_UD: set_datagram_seg(wqe, wr); wqe += sizeof (struct mlx4_wqe_datagram_seg); - size += sizeof (struct mlx4_wqe_datagram_seg) / 16; + size += sizeof (struct mlx4_wqe_datagram_seg); break; case IB_QPT_SMI: @@ -1315,7 +1438,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, goto out; } wqe += err; - size += err / 16; + size += err; err = 0; break; @@ -1328,7 +1451,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_data_seg(wqe, wr->sg_list + i); wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; + size += sizeof (struct mlx4_wqe_data_seg); } /* Add one more inline data segment for ICRC for MLX sends */ @@ -1337,11 +1460,11 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, cpu_to_be32((1 << 31) | 4); ((u32 *) wqe)[1] = 0; wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; + size += sizeof (struct mlx4_wqe_data_seg); } ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ? 
- MLX4_WQE_CTRL_FENCE : 0) | size; + MLX4_WQE_CTRL_FENCE : 0) | (size / 16); /* * Make sure descriptor is fully written before @@ -1358,16 +1481,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = (ind + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1); + ind += DIV_ROUND_UP(size, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). */ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1389,8 +1519,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c index f8d63d3..3333c1b 100644 --- a/drivers/net/mlx4/alloc.c +++ b/drivers/net/mlx4/alloc.c @@ -151,6 +151,20 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + if (!buf->u.direct.buf) { + kfree(pages); + goto err_free; + } + } } return 0; @@ -170,6 +184,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf) dma_free_coherent(&dev->pdev->dev, 
size, buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index cfb78fb..bd3ed64 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -185,7 +185,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct mlx4_buf_list *page_list; } u; diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index 3968b94..bf37369 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -158,6 +158,7 @@ enum { MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, + MLX4_WQE_CTRL_NEC = 1 << 29, }; struct mlx4_wqe_ctrl_seg { -- MST From mst at dev.mellanox.co.il Sun Sep 9 04:55:11 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 9 Sep 2007 14:55:11 +0300 Subject: [ofa-general] [PATCH] IB/sa: error handling thinko fix Message-ID: <20070909115511.GC25910@mellanox.co.il> From: Ali Ayoub Subject: [PATCH] IB/sa: error handling thinko fix Signed-off-by: Michael S. 
Tsirkin --- diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index d271bd7..312c8ff 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -531,7 +531,7 @@ static int alloc_mad(struct ib_sa_query *query, gfp_t gfp_mask) query->sm_ah->pkey_index, 0, IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, gfp_mask); - if (!query->mad_buf) { + if (IS_ERR(query->mad_buf)) { kref_put(&query->sm_ah->ref, free_sm_ah); return -ENOMEM; } -- MST From kliteyn at dev.mellanox.co.il Sun Sep 9 05:57:42 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 09 Sep 2007 15:57:42 +0300 Subject: [ofa-general] [PATCH] osm: bugfix - IB_PR_COMPMASK was used in MPR Message-ID: <46E3EDC6.9070901@dev.mellanox.co.il> Hi Sasha, In several places in MPR implementation IB_PR_COMPMASK_* was used instead of IB_MPR_COMPMASK_* Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_sa_multipath_record.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_sa_multipath_record.c b/opensm/opensm/osm_sa_multipath_record.c index 889d7c6..c02feb5 100644 --- a/opensm/opensm/osm_sa_multipath_record.c +++ b/opensm/opensm/osm_sa_multipath_record.c @@ -181,7 +181,7 @@ __osm_sa_multipath_rec_apply_tavor_mtu_limit(IN const ib_multipath_rec_t * */ required_mtu = ib_multipath_rec_mtu(p_mpr); if ((comp_mask & IB_MPR_COMPMASK_MTUSELEC) && - (comp_mask & IB_PR_COMPMASK_MTU)) { + (comp_mask & IB_MPR_COMPMASK_MTU)) { switch (ib_multipath_rec_mtu_sel(p_mpr)) { case 0: /* must be greater than */ case 2: /* exact match */ @@ -322,7 +322,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, required_sl = p_prtn->sl; /* reset pkey when raw traffic */ - if (comp_mask & IB_PR_COMPMASK_RAWTRAFFIC && + if (comp_mask & IB_MPR_COMPMASK_RAWTRAFFIC && cl_ntoh32(p_mpr->hop_flow_raw) & (1 << 31)) required_pkey = 0; } @@ -591,7 +591,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, /* we silently 
ignore cases where only the Rate selector is defined */ if ((comp_mask & IB_MPR_COMPMASK_RATESELEC) && - (comp_mask & IB_PR_COMPMASK_RATE)) { + (comp_mask & IB_MPR_COMPMASK_RATE)) { required_rate = ib_multipath_rec_rate(p_mpr); switch (ib_multipath_rec_rate_sel(p_mpr)) { case 0: /* must be greater than */ -- 1.5.1.4 From mst at dev.mellanox.co.il Sun Sep 9 07:02:01 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 9 Sep 2007 17:02:01 +0300 Subject: [ofa-general] [PATCH v3] IB/mlx4: shrinking WQE In-Reply-To: <20070909112917.GA25910@mellanox.co.il> References: <20070909112917.GA25910@mellanox.co.il> Message-ID: <20070909140201.GD25910@mellanox.co.il> ConnectX supports shrinking wqe, such that a single WR can include multiple units of wqe_shift. This way, WRs can differ in size, and do not have to be a power of 2 in size, saving memory and speeding up send WR posting. Unfortunately, if we do this, the wqe_index field in the CQE can't be used to look up the WR ID anymore, so we do this only if selective signalling is off. Further, on 32-bit platforms, we can't use vmap to make the QP buffer virtually contiguous. Thus we have to use constant-sized WRs to make sure a WR is always fully within a single page-sized chunk. Finally, we use NOP opcode to avoid wrap-around in the middle of WR. Since MLX QPs only support SEND, we use constant-sized WRs in this case. We look for the smallest value of wqe_shift such that the resulting number of wqes does not exceed device capabilities. Signed-off-by: Michael S. Tsirkin --- Changes since v2: fix memory leak in mlx4_buf_alloc. Found by internal code review. 
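Illustration only: the search for the smallest workable wqe_shift described in the changelog above can be sketched as a standalone userspace function. This is a minimal sketch under stated assumptions, not the driver code. The names struct sq_caps, struct sq_layout, pick_sq_layout(), roundup_pow2() and div_round_up() are hypothetical stand-ins; only the arithmetic (2 KB plus one WR of headroom, rounding the WQE count up to a power of two, bumping the shift until the count fits) follows the loop this patch adds to set_kernel_sq_size().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the device limits consulted by the patch. */
struct sq_caps {
	unsigned max_wqes;       /* largest allowed number of WQE units per queue */
	unsigned max_sq_desc_sz; /* largest allowed size of one WQE unit, in bytes */
};

/* Hypothetical result record; the driver keeps these values in the QP. */
struct sq_layout {
	int      wqe_shift;       /* log2 of the WQE unit size */
	unsigned max_wqes_per_wr; /* units needed by the largest WR */
	unsigned spare_wqes;      /* headroom left for HW prefetch */
	unsigned wqe_cnt;         /* total units, a power of two */
};

/* Smallest power of two that is >= v (v must be non-zero). */
static unsigned roundup_pow2(unsigned v)
{
	unsigned r = 1;

	assert(v > 0);
	while (r < v)
		r <<= 1;
	return r;
}

static unsigned div_round_up(unsigned n, unsigned d)
{
	return (n + d - 1) / d;
}

/*
 * Starting from start_shift, find the smallest shift whose resulting
 * WQE count fits within caps->max_wqes.  wr_bytes is the size of the
 * largest WR.  Returns 0 on success, -1 if no layout fits.
 */
static int pick_sq_layout(const struct sq_caps *caps, unsigned max_send_wr,
			  unsigned wr_bytes, int start_shift,
			  struct sq_layout *out)
{
	int shift = start_shift;

	for (;;) {
		unsigned per_wr, spare, cnt;

		if ((1u << shift) > caps->max_sq_desc_sz)
			return -1;

		per_wr = div_round_up(wr_bytes, 1u << shift);
		/* 2 KB + 1 WR of headroom, expressed in WQE units */
		spare = (2048u >> shift) + per_wr;
		cnt = roundup_pow2(max_send_wr * per_wr + spare);

		if (cnt <= caps->max_wqes) {
			out->wqe_shift = shift;
			out->max_wqes_per_wr = per_wr;
			out->spare_wqes = spare;
			out->wqe_cnt = cnt;
			return 0;
		}

		/* One unit per WR already: a larger shift cannot help. */
		if (per_wr <= 1)
			return -1;

		++shift;
	}
}
```

With a 200-byte maximum WR, 100 send WRs and a starting unit of 32 bytes (shift 5), the sketch needs 7 units per WR and 1024 units in total. If the device limit were 512 units, the loop would move on to 64-byte units (shift 6), giving 4 units per WR and 512 units in total.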
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 8bf44da..0981f3c 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -331,6 +331,11 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +358,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if ((*cur_qp)->ibqp.srq) { @@ -403,6 +410,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, case MLX4_OPCODE_BIND_MW: wc->opcode = IB_WC_BIND_MW; break; + default: + printk("Unrecognized send opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } } else { wc->byte_len = be32_to_cpu(cqe->byte_cnt); @@ -422,6 +433,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + default: + printk("Unrecognized recv opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 705ff2f..a72ecb9 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -115,6 +115,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int sq_spare_wqes; struct 
mlx4_ib_wq sq; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index ba0428d..ff6c186 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,70 @@ static void *get_send_wqe(struct mlx4_ib_qp *qp, int n) /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). + * + * When max WR is than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. */ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { u32 *wqe = get_send_wqe(qp, n); int i; + int s; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift) / sizeof *wqe; + if (qp->sq_max_wqes_per_wr > 1) { + stamp = cpu_to_be32(0x7fffffff | (n & qp->sq.wqe_cnt ? 0 : 1 << 31)); + for (i = 0; i < s; i += 16) + wqe[i] = stamp; + } else { + for (i = 16; i < s; i += 16) + wqe[i] = 0xffffffff; + } +} + +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + stamp_send_wqe(qp, (n + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1), size); + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = qp->ibqp.qp_type == IB_QPT_UD ? 
sizeof(struct mlx4_wqe_datagram_seg) : 0; + + /* Pad the remainder of the WQE with inline data segments. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); + + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); +} - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -234,9 +289,35 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, return 0; } +static int nop_wqe_shift(enum ib_qp_type type) +{ + /* + * WQE size is at least 0x20. + * UD WQEs must have a datagram segment. + * RC and UC WQEs must have control segment. + * MLX WQEs do not support NOP. 
+ */ + switch (type) { + case IB_QPT_UD: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_datagram_seg), + (size_t)0x20))); + case IB_QPT_SMI: + case IB_QPT_GSI: + return -EINVAL; + case IB_QPT_UC: + case IB_QPT_RC: + default: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg), + (size_t)0x20))); + } +} + static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +333,60 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this wqe_index field in CQE + * can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. + * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contigious. Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. 
+ * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * Since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + qp->sq.wqe_shift = nop_wqe_shift(type); + if (!qp->sq_signal_bits || BITS_PER_LONG != 64 || qp->sq.wqe_shift < 0) + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. + */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +398,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +437,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if 
(init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +533,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1027,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1228,14 +1351,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; - + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { err = -ENOMEM; @@ -1250,7 +1373,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? 
@@ -1266,7 +1389,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->imm = 0; wqe += sizeof *ctrl; - size = sizeof *ctrl / 16; + size = sizeof *ctrl; switch (ibqp->qp_type) { case IB_QPT_RC: @@ -1281,8 +1404,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_atomic_seg(wqe, wr); wqe += sizeof (struct mlx4_wqe_atomic_seg); - size += (sizeof (struct mlx4_wqe_raddr_seg) + - sizeof (struct mlx4_wqe_atomic_seg)) / 16; + size += sizeof (struct mlx4_wqe_raddr_seg) + + sizeof (struct mlx4_wqe_atomic_seg); break; @@ -1292,7 +1415,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_raddr_seg(wqe, wr->wr.rdma.remote_addr, wr->wr.rdma.rkey); wqe += sizeof (struct mlx4_wqe_raddr_seg); - size += sizeof (struct mlx4_wqe_raddr_seg) / 16; + size += sizeof (struct mlx4_wqe_raddr_seg); break; default: @@ -1304,7 +1427,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, case IB_QPT_UD: set_datagram_seg(wqe, wr); wqe += sizeof (struct mlx4_wqe_datagram_seg); - size += sizeof (struct mlx4_wqe_datagram_seg) / 16; + size += sizeof (struct mlx4_wqe_datagram_seg); break; case IB_QPT_SMI: @@ -1315,7 +1438,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, goto out; } wqe += err; - size += err / 16; + size += err; err = 0; break; @@ -1328,7 +1451,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_data_seg(wqe, wr->sg_list + i); wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; + size += sizeof (struct mlx4_wqe_data_seg); } /* Add one more inline data segment for ICRC for MLX sends */ @@ -1337,11 +1460,11 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, cpu_to_be32((1 << 31) | 4); ((u32 *) wqe)[1] = 0; wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; + size += sizeof (struct mlx4_wqe_data_seg); } ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ? 
- MLX4_WQE_CTRL_FENCE : 0) | size; + MLX4_WQE_CTRL_FENCE : 0) | (size / 16); /* * Make sure descriptor is fully written before @@ -1358,16 +1481,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = (ind + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1); + ind += DIV_ROUND_UP(size, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). */ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1389,8 +1519,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c index f8d63d3..0fce74d 100644 --- a/drivers/net/mlx4/alloc.c +++ b/drivers/net/mlx4/alloc.c @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + kfree(pages); + if (!buf->u.direct.buf) + goto err_free; + } } return 0; @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf) dma_free_coherent(&dev->pdev->dev, size, 
buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index cfb78fb..bd3ed64 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -185,7 +185,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct mlx4_buf_list *page_list; } u; diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index 3968b94..bf37369 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -158,6 +158,7 @@ enum { MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, + MLX4_WQE_CTRL_NEC = 1 << 29, }; struct mlx4_wqe_ctrl_seg { -- MST From kliteyn at dev.mellanox.co.il Sun Sep 9 08:01:20 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 09 Sep 2007 18:01:20 +0300 Subject: [ofa-general] [PATCH] osm: QoS - MultiPathRecord selection according to QoS level Message-ID: <46E40AC0.3090609@dev.mellanox.co.il> Hi Sasha This patch implements the MultiPathRecord selection according to QoS level. 
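For readers following the QoS patch below, here is a minimal sketch of the SL filtering it performs along a path: at each hop, every SL whose SL2VL entry maps to the drop VL is cleared from a 16-bit mask, and the final SL is the default one if it survived, otherwise the lowest surviving SL. The names `prune_valid_sls`, `pick_sl`, `NUM_SLS`, `DROP_VL` and `DEFAULT_SL` are hypothetical stand-ins rather than OpenSM identifiers, and the constant values (16 SLs, VL15 as the drop VL, SL 0 as default) are assumptions made for illustration.

```c
#include <stdint.h>

#define NUM_SLS     16  /* stand-in for the loop bound used in the patch (assumed 16) */
#define DROP_VL     15  /* stand-in for IB_DROP_VL (assumed to be VL15) */
#define DEFAULT_SL  0   /* stand-in for OSM_DEFAULT_SL (assumed value) */

/* Clear every SL bit whose SL2VL entry on this hop maps to the drop VL. */
static uint16_t prune_valid_sls(uint16_t valid_sl_mask,
                                const uint8_t sl2vl[NUM_SLS])
{
    unsigned i;

    for (i = 0; i < NUM_SLS; i++)
        if ((valid_sl_mask & (1u << i)) && sl2vl[i] == DROP_VL)
            valid_sl_mask &= (uint16_t)~(1u << i);
    return valid_sl_mask;
}

/* Prefer the default SL if it survived; else the lowest surviving SL; -1 if none. */
static int pick_sl(uint16_t valid_sl_mask)
{
    unsigned i;

    if (valid_sl_mask & (1u << DEFAULT_SL))
        return DEFAULT_SL;
    for (i = 0; i < NUM_SLS; i++)
        if (valid_sl_mask & (1u << i))
            return (int)i;
    return -1;
}
```

Starting from a mask of 0xffff and calling `prune_valid_sls` once per hop mirrors the way the patch narrows `valid_sl_mask`; an empty mask corresponds to the "All the SLs lead to VL15" case.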
NOTE: this patch depends on another MPR patch that I sent earlier today: "osm: bugfix - IB_PR_COMPMASK was used in MPR" Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_sa_multipath_record.c | 441 ++++++++++++++++++++----------- 1 files changed, 289 insertions(+), 152 deletions(-) diff --git a/opensm/opensm/osm_sa_multipath_record.c b/opensm/opensm/osm_sa_multipath_record.c index c02feb5..690f9e7 100644 --- a/opensm/opensm/osm_sa_multipath_record.c +++ b/opensm/opensm/osm_sa_multipath_record.c @@ -64,6 +64,7 @@ #include #include #include +#include #define OSM_MPR_RCV_POOL_MIN_SIZE 64 #define OSM_MPR_RCV_POOL_GROW_SIZE 64 @@ -222,6 +223,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, { const osm_node_t *p_node; const osm_physp_t *p_physp; + const osm_physp_t *p_src_physp; const osm_physp_t *p_dest_physp; const osm_prtn_t *p_prtn; const ib_port_info_t *p_pi; @@ -232,13 +234,15 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, uint8_t pkt_life; uint8_t required_mtu; uint8_t required_rate; - uint16_t required_pkey; + ib_net16_t required_pkey; uint8_t required_sl; uint8_t required_pkt_life; ib_net16_t dest_lid; int hops = 0; int in_port_num = 0; - uint8_t vl; + uint8_t i; + osm_qos_level_t *p_qos_level = NULL; + uint16_t valid_sl_mask = 0xffff; OSM_LOG_ENTER(p_rcv->p_log, __osm_mpr_rcv_get_path_parms); @@ -246,6 +250,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, p_dest_physp = p_dest_port->p_physp; p_physp = p_src_port->p_physp; + p_src_physp = p_physp; p_pi = &p_physp->port_info; mtu = ib_port_info_get_mtu_cap(p_pi); @@ -268,71 +273,6 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, "Optimized Path MTU to 1K for Mellanox Tavor device\n"); } - if (comp_mask & IB_MPR_COMPMASK_RAWTRAFFIC && - cl_ntoh32(p_mpr->hop_flow_raw) & (1 << 31)) - required_pkey = - osm_physp_find_common_pkey(p_physp, p_dest_physp); - else if (comp_mask & IB_MPR_COMPMASK_PKEY) { - required_pkey = p_mpr->pkey; - if 
(!osm_physp_share_this_pkey - (p_physp, p_dest_physp, required_pkey)) { - osm_log(p_rcv->p_log, OSM_LOG_ERROR, - "__osm_mpr_rcv_get_path_parms: ERR 4518: " - "Ports do not share specified PKey 0x%04x\n" - "\t\tsrc %" PRIx64 " dst %" PRIx64 "\n", - cl_ntoh16(required_pkey), - cl_ntoh64(osm_physp_get_port_guid(p_physp)), - cl_ntoh64(osm_physp_get_port_guid - (p_dest_physp))); - status = IB_NOT_FOUND; - goto Exit; - } - } else { - required_pkey = - osm_physp_find_common_pkey(p_physp, p_dest_physp); - if (!required_pkey) { - osm_log(p_rcv->p_log, OSM_LOG_ERROR, - "__osm_mpr_rcv_get_path_parms: ERR 4519: " - "Ports do not have any shared PKeys\n" - "\t\tsrc %" PRIx64 " dst %" PRIx64 "\n", - cl_ntoh64(osm_physp_get_port_guid(p_physp)), - cl_ntoh64(osm_physp_get_port_guid - (p_dest_physp))); - status = IB_NOT_FOUND; - goto Exit; - } - } - - required_sl = OSM_DEFAULT_SL; - - if (required_pkey) { - p_prtn = - (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, - required_pkey & - cl_ntoh16((uint16_t) ~ 0x8000)); - if (p_prtn == - (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) - /* this may be possible when pkey tables are created somehow in - previous runs or things are going wrong here */ - osm_log(p_rcv->p_log, OSM_LOG_ERROR, - "__osm_mpr_rcv_get_path_parms: ERR 451A: " - "No partition found for PKey 0x%04x - using default SL %d\n", - cl_ntoh16(required_pkey), required_sl); - else - required_sl = p_prtn->sl; - - /* reset pkey when raw traffic */ - if (comp_mask & IB_MPR_COMPMASK_RAWTRAFFIC && - cl_ntoh32(p_mpr->hop_flow_raw) & (1 << 31)) - required_pkey = 0; - } - - if ((comp_mask & IB_MPR_COMPMASK_SL) - && ib_multipath_rec_sl(p_mpr) != required_sl) { - status = IB_NOT_FOUND; - goto Exit; - } - /* Walk the subnet object from source to destination, tracking the most restrictive rate and mtu values along the way... 
@@ -344,14 +284,12 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, p_node = osm_physp_get_node_ptr(p_physp); if (p_node->sw) { - /* - * If the dest_lid_ho is equal to the lid of the switch pointed by - * p_sw then p_physp will be the physical port of the switch port zero. + * Source node is a switch. + * Make sure that p_physp points to the out port of the + * switch that routes to the destination lid (dest_lid_ho) */ - p_physp = - osm_switch_get_route_by_lid(p_node->sw, - cl_ntoh16(dest_lid_ho)); + p_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); if (p_physp == 0) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_mpr_rcv_get_path_parms: ERR 4514: " @@ -363,16 +301,40 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, } } + if (!p_rcv->p_subn->opt.no_qos) { + + /* + * Whether this node is switch or CA, the IN port for + * the sl2vl table is 0, because this is a source node. + */ + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, 0); + + /* update valid SLs that still exist on this route */ + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sl_mask & (1 << i) && + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) + valid_sl_mask &= ~(1 << i); + } + if (!valid_sl_mask) { + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mpr_rcv_get_path_parms: " + "All the SLs lead to VL15 on this path\n"); + status = IB_NOT_FOUND; + goto Exit; + } + } + /* * Same as above */ p_node = osm_physp_get_node_ptr(p_dest_physp); if (p_node->sw) { - - p_dest_physp = - osm_switch_get_route_by_lid(p_node->sw, - cl_ntoh16(dest_lid_ho)); + /* + * if destination is switch, we want p_dest_physp to point to port 0 + */ + p_dest_physp = osm_switch_get_route_by_lid(p_node->sw, dest_lid); if (p_dest_physp == 0) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, @@ -386,7 +348,13 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, } + /* + * Now go through the path step by step + */ + while (p_physp != p_dest_physp) { 
+ + p_node = osm_physp_get_node_ptr(p_physp); p_physp = osm_physp_get_remote(p_physp); if (p_physp == 0) { @@ -400,6 +368,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, } hops++; + in_port_num = osm_physp_get_port_num(p_physp); /* This is point to point case (no switch in between) @@ -427,29 +396,11 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, */ p_pi = &p_physp->port_info; - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { + if (mtu > ib_port_info_get_mtu_cap(p_pi)) mtu = ib_port_info_get_mtu_cap(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_mpr_rcv_get_path_parms: " - "New smallest MTU = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", mtu, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } - if (rate > ib_port_info_compute_rate(p_pi)) { + if (rate > ib_port_info_compute_rate(p_pi)) rate = ib_port_info_compute_rate(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_mpr_rcv_get_path_parms: " - "New smallest rate = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", rate, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } /* Continue with the egress port on this switch. 
@@ -466,52 +417,36 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, goto Exit; } - CL_ASSERT(p_physp); CL_ASSERT(osm_physp_is_valid(p_physp)); - if (comp_mask & IB_MPR_COMPMASK_SL) { - in_port_num = osm_physp_get_port_num(p_physp); - p_slvl_tbl = - osm_physp_get_slvl_tbl(p_physp, in_port_num); - vl = ib_slvl_table_get(p_slvl_tbl, required_sl); - if (vl == IB_DROP_VL) { /* discard packet */ - osm_log(p_rcv->p_log, OSM_LOG_VERBOSE, - "__osm_mpr_rcv_get_path_parms: Path not found for SL %d\n" - "\t\tin_port_num %d port_guid %" PRIx64 - "\n", required_sl, in_port_num, - cl_ntoh64(osm_physp_get_port_guid - (p_physp))); - status = IB_NOT_FOUND; - goto Exit; - } - } - p_pi = &p_physp->port_info; - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { + if (mtu > ib_port_info_get_mtu_cap(p_pi)) mtu = ib_port_info_get_mtu_cap(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_mpr_rcv_get_path_parms: " - "New smallest MTU = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", mtu, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } - if (rate > ib_port_info_compute_rate(p_pi)) { + if (rate > ib_port_info_compute_rate(p_pi)) rate = ib_port_info_compute_rate(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_mpr_rcv_get_path_parms: " - "New smallest rate = %u at intervening port 0x%016" - PRIx64 " port num 0x%X\n", rate, - cl_ntoh64(osm_physp_get_port_guid - (p_physp)), - osm_physp_get_port_num(p_physp)); - } + if (!p_rcv->p_subn->opt.no_qos) { + /* + * Check SL2VL table of the switch and update valid SLs + */ + p_slvl_tbl = osm_physp_get_slvl_tbl(p_physp, in_port_num); + for (i = 0; i < IB_MAX_NUM_VLS; i++) { + if (valid_sl_mask & (1 << i) && + ib_slvl_table_get(p_slvl_tbl, i) == IB_DROP_VL) + valid_sl_mask &= ~(1 << i); + } + if (!valid_sl_mask) { + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + 
osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mpr_rcv_get_path_parms: " + "All the SLs lead to VL15 " + "on this path\n"); + status = IB_NOT_FOUND; + goto Exit; + } + } } /* @@ -519,25 +454,11 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, */ p_pi = &p_physp->port_info; - if (mtu > ib_port_info_get_mtu_cap(p_pi)) { + if (mtu > ib_port_info_get_mtu_cap(p_pi)) mtu = ib_port_info_get_mtu_cap(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_mpr_rcv_get_path_parms: " - "New smallest MTU = %u at destination port 0x%016" - PRIx64 "\n", mtu, - cl_ntoh64(osm_physp_get_port_guid(p_physp))); - } - if (rate > ib_port_info_compute_rate(p_pi)) { + if (rate > ib_port_info_compute_rate(p_pi)) rate = ib_port_info_compute_rate(p_pi); - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_mpr_rcv_get_path_parms: " - "New smallest rate = %u at destination port 0x%016" - PRIx64 "\n", rate, - cl_ntoh64(osm_physp_get_port_guid(p_physp))); - } if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) { osm_log(p_rcv->p_log, OSM_LOG_DEBUG, @@ -546,6 +467,53 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, } /* + * Get QoS Level object according to the MultiPath request + * and adjust MultiPath parameters according to QoS settings + */ + if ( !p_rcv->p_subn->opt.no_qos && + p_rcv->p_subn->p_qos_policy && + (p_qos_level = osm_qos_policy_get_qos_level_by_mpr( + p_rcv->p_subn->p_qos_policy, p_mpr, + p_src_physp, p_dest_physp, comp_mask)) ) { + + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mpr_rcv_get_path_parms: " + "MultiPathRecord request matches QoS Level '%s' (%s)\n", + p_qos_level->name, + (p_qos_level->use) ? 
p_qos_level-> + use : "no description"); + + if (p_qos_level->mtu_limit_set + && (mtu > p_qos_level->mtu_limit)) + mtu = p_qos_level->mtu_limit; + + if (p_qos_level->rate_limit_set + && (rate > p_qos_level->rate_limit)) + rate = p_qos_level->rate_limit; + + if (p_qos_level->pkt_life_set + && (pkt_life > p_qos_level->pkt_life)) + pkt_life = p_qos_level->pkt_life; + + if (p_qos_level->sl_set) { + required_sl = p_qos_level->sl; + if (!(valid_sl_mask & (1 << required_sl))) { + status = IB_NOT_FOUND; + goto Exit; + } + } + + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mpr_rcv_get_path_parms: " + "MultiPath params with QoS constaraints: " + "min MTU = %u, min rate = %u, " + "packet lifetime = %u, sl = %u\n", + mtu, rate, pkt_life, required_sl); + } + + /* Determine if these values meet the user criteria */ @@ -588,6 +556,8 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, break; } } + if (status != IB_SUCCESS) + goto Exit; /* we silently ignore cases where only the Rate selector is defined */ if ((comp_mask & IB_MPR_COMPMASK_RATESELEC) && @@ -628,13 +598,15 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, break; } } + if (status != IB_SUCCESS) + goto Exit; /* Verify the pkt_life_time */ /* According to spec definition IBA 1.2 Table 205 PacketLifeTime description, for loopback paths, packetLifeTime shall be zero. 
*/ if (p_src_port == p_dest_port) pkt_life = 0; /* loopback */ - else + else if ( !(p_qos_level && p_qos_level->pkt_life_set) ) pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; /* we silently ignore cases where only the PktLife selector is defined */ @@ -680,6 +652,171 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, if (status != IB_SUCCESS) goto Exit; + /* + * set Pkey for this MultiPath record request + */ + + if (comp_mask & IB_MPR_COMPMASK_RAWTRAFFIC && + cl_ntoh32(p_mpr->hop_flow_raw) & (1 << 31)) + required_pkey = + osm_physp_find_common_pkey(p_src_physp, p_dest_physp); + + else if (comp_mask & IB_MPR_COMPMASK_PKEY) { + /* + * MPR request has a specific pkey: + * Check that source and destination share this pkey. + * If QoS level has pkeys, check that this pkey exists + * in the QoS level pkeys. + * MPR returned pkey is the requested pkey. + */ + required_pkey = p_mpr->pkey; + if (!osm_physp_share_this_pkey + (p_src_physp, p_dest_physp, required_pkey)) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mpr_rcv_get_path_parms: ERR 4518: " + "Ports do not share specified PKey 0x%04x\n" + "\t\tsrc %" PRIx64 " dst %" PRIx64 "\n", + cl_ntoh16(required_pkey), + cl_ntoh64(osm_physp_get_port_guid(p_src_physp)), + cl_ntoh64(osm_physp_get_port_guid + (p_dest_physp))); + status = IB_NOT_FOUND; + goto Exit; + } + if (p_qos_level && p_qos_level->pkey_range_len && + !osm_qos_level_has_pkey(p_qos_level, required_pkey)) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mpr_rcv_get_path_parms: ERR 451C: " + "Ports do not share PKeys defined by QoS level\n"); + status = IB_NOT_FOUND; + goto Exit; + } + + } else if (p_qos_level && p_qos_level->pkey_range_len) { + /* + * MPR request doesn't have a specific pkey, but QoS level + * has pkeys - get shared pkey from QoS level pkeys + */ + required_pkey = osm_qos_level_get_shared_pkey(p_qos_level, + p_src_physp, + p_dest_physp); + if (!required_pkey) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mpr_rcv_get_path_parms: ERR 
451D: " + "Ports do not share PKeys defined by QoS level\n"); + status = IB_NOT_FOUND; + goto Exit; + } + + } else { + /* + * Neither MPR request nor QoS level have pkey. + * Just get any shared pkey. + */ + required_pkey = + osm_physp_find_common_pkey(p_src_physp, p_dest_physp); + if (!required_pkey) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mpr_rcv_get_path_parms: ERR 4519: " + "Ports do not have any shared PKeys\n" + "\t\tsrc %" PRIx64 " dst %" PRIx64 "\n", + cl_ntoh64(osm_physp_get_port_guid(p_physp)), + cl_ntoh64(osm_physp_get_port_guid + (p_dest_physp))); + status = IB_NOT_FOUND; + goto Exit; + } + } + + if (required_pkey) { + p_prtn = + (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, + required_pkey & cl_ntoh16((uint16_t) ~ + 0x8000)); + if (p_prtn == + (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) + p_prtn = NULL; + } + + /* + * Set MultiPathRecord SL. + */ + + if (comp_mask & IB_MPR_COMPMASK_SL) { + /* + * Specific SL was requested + */ + required_sl = ib_multipath_rec_sl(p_mpr); + + if (p_qos_level && p_qos_level->sl_set && + p_qos_level->sl != required_sl) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mpr_rcv_get_path_parms: ERR 451E: " + "QoS constaraints: required MultiPathRecord SL (%u) " + "doesn't match QoS policy SL (%u)\n", + required_sl, p_qos_level->sl); + status = IB_NOT_FOUND; + goto Exit; + } + + } else if (p_qos_level && p_qos_level->sl_set) { + /* + * No specific SL was requested, + * but there is an SL in QoS level. 
+ */ + required_sl = p_qos_level->sl; + + if (required_pkey && p_prtn && p_prtn->sl != p_qos_level->sl) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mpr_rcv_get_path_parms: " + "QoS level SL (%u) overrides partition SL (%u)\n", + p_qos_level->sl, p_prtn->sl); + + } else if (required_pkey) { + /* + * No specific SL in request or in QoS level - use partition SL + */ + p_prtn = + (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, + required_pkey & + cl_ntoh16((uint16_t) ~ 0x8000)); + if (!p_prtn) { + /* this may be possible when pkey tables are created somehow in + previous runs or things are going wrong here */ + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mpr_rcv_get_path_parms: ERR 451A: " + "No partition found for PKey 0x%04x - using default SL %d\n", + cl_ntoh16(required_pkey), required_sl); + required_sl = OSM_DEFAULT_SL; + } else + required_sl = p_prtn->sl; + + } else if (!p_rcv->p_subn->opt.no_qos) { + if (valid_sl_mask & (1 << OSM_DEFAULT_SL)) + required_sl = OSM_DEFAULT_SL; + else { + for (i = 0; i < IB_MAX_NUM_VLS; i++) + if (valid_sl_mask & (1 << i)) + break; + required_sl = i; + } + } + else + required_sl = OSM_DEFAULT_SL; + + if (!p_rcv->p_subn->opt.no_qos && !(valid_sl_mask & (1 << required_sl))) { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mpr_rcv_get_path_parms: ERR 451F: " + "Selected SL (%u) leads to VL15\n", required_sl); + status = IB_NOT_FOUND; + goto Exit; + } + + /* reset pkey when raw traffic */ + if (comp_mask & IB_MPR_COMPMASK_RAWTRAFFIC && + cl_ntoh32(p_mpr->hop_flow_raw) & (1 << 31)) + required_pkey = 0; + p_parms->mtu = mtu; p_parms->rate = rate; p_parms->pkey = required_pkey; -- 1.5.1.4 From tziporet at dev.mellanox.co.il Sun Sep 9 08:15:25 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 09 Sep 2007 18:15:25 +0300 Subject: [ewg] Re: [ofa-general] OFED 1.2.5 - GA release In-Reply-To: <46E17D89.7050506@ichips.intel.com> References: <6C2C79E72C305246B504CBA17B5500C901563B5D@mtlexch01.mtl.com> 
<46DF1505.1020409@ichips.intel.com> <46E08880.7070807@ichips.intel.com> <20070907122800.GB9410@mellanox.co.il> <46E17D89.7050506@ichips.intel.com> Message-ID: <46E40E0D.9070208@mellanox.co.il> Arlin Davis wrote: >> >> ib_local_sa was merged with ib_sa in 1.2.5. >> There are no extra modules to load. >> >> >> > Michael, thanks for the heads up. Sure would be nice if major changes > like this would be mentioned somewhere in release notes. :-) > My fault :-( Tziporet From vlad at dev.mellanox.co.il Sun Sep 9 09:17:26 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 09 Sep 2007 19:17:26 +0300 Subject: [ofa-general] OOO Message-ID: <46E41C96.2060409@dev.mellanox.co.il> Hi, I will be in vacation from 10 Sep. till 03 Oct. For OFED-1.3 issues please contact Tziporet Koren and Michael S. Tsirkin. Regards, Vladimir
From Ashish.Batwara at lsi.com Sun Sep 9 16:28:34 2007 From: Ashish.Batwara at lsi.com (Batwara, Ashish) Date: Sun, 9 Sep 2007 17:28:34 -0600 Subject: [ofa-general] Port State Change Event In-Reply-To: <46E38DB0.30608@dev.mellanox.co.il> Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01B1F59B@NAMAIL2.ad.lsil.com> Hi, We would like to simulate various port states (ARM, INIT, ACTIVE, DOWN, ACTIVE DEFER) on the target side without connecting to the network (no SM). Question: Is it true that functions in mad.c get called even for a local port (assuming that we have an SM in the network and one target port is directly connected to the initiator port running the SM)? In other words, where in the OFED code can we see port states getting changed? Thanks Ashish -----Original Message----- From: Dotan Barak [mailto:dotanb at dev.mellanox.co.il] Sent: Sunday, September 09, 2007 1:08 AM To: Batwara, Ashish Cc: openib-general at openib.org Subject: Re: [ofa-general] Port State Change Event Hi. Batwara, Ashish wrote: > > Hi, > > I am looking for a single point in code where I can get the > information about the port state change. We are using mthca driver. I > can see port_change in mthca_eq.c, but here I can only see two states > - Active and Down. Is there any place in the code where I can see > about other states as well, e.g. Arm, Init, Active Defer. > What exactly do you need? I believe that you saw the code that produces the event (port active and port down events). The entity that takes care of the state machine of the logical link is OpenSM (or any other Subnet Manager): it sends MADs to the IB ports of the nodes in the subnet, configures each port's properties and moves the logical link to the active state.
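To make the distinction in this thread concrete, the sketch below models the five logical port states with their PortInfo PortState numbering, next to the two events the reply says the driver reports (port active and port down). It is a simplified illustration under stated assumptions, not driver code: `port_state_name` and `event_for_transition` are hypothetical helpers, and the rule that only arriving at Active or Down raises an event is an assumption drawn from the discussion, not a description of exact hardware behaviour.

```c
#include <stddef.h>

/* Logical port states, numbered as in the PortInfo PortState field. */
enum port_state {
    PORT_DOWN         = 1,
    PORT_INIT         = 2,
    PORT_ARMED        = 3,
    PORT_ACTIVE       = 4,
    PORT_ACTIVE_DEFER = 5
};

/* The only two events the thread says are reported to consumers. */
enum port_event { EV_NONE, EV_PORT_ACTIVE, EV_PORT_DOWN };

/* Hypothetical helper: printable name for a logical port state. */
static const char *port_state_name(enum port_state s)
{
    switch (s) {
    case PORT_DOWN:         return "Down";
    case PORT_INIT:         return "Init";
    case PORT_ARMED:        return "Armed";
    case PORT_ACTIVE:       return "Active";
    case PORT_ACTIVE_DEFER: return "ActiveDefer";
    }
    return "Unknown";
}

/*
 * Hypothetical helper (assumed model): an event is raised only when the
 * port newly becomes Active or newly becomes Down. Intermediate steps such
 * as Down -> Init or Init -> Armed stay invisible at the event level.
 */
static enum port_event event_for_transition(enum port_state from,
                                            enum port_state to)
{
    if (to == PORT_ACTIVE && from != PORT_ACTIVE)
        return EV_PORT_ACTIVE;
    if (to == PORT_DOWN && from != PORT_DOWN)
        return EV_PORT_DOWN;
    return EV_NONE;
}
```

Under this model, observing Init or Armed requires reading the port state itself (for example by querying the port attributes) rather than waiting for an event.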
thanks Dotan From vlad at lists.openfabrics.org Mon Sep 10 02:55:00 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 10 Sep 2007 02:55:00 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070910-0200 daily build status Message-ID: <20070910095500.9D445E6084A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 
Log: /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_3_kernel-20070910-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From tziporet at mellanox.co.il Mon Sep 10 06:50:24 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 10 Sep 2007 16:50:24 +0300 Subject: [ofa-general] Agenda for the OFED meeting today Message-ID: <6C2C79E72C305246B504CBA17B5500C901563D34@mtlexch01.mtl.com> Agenda for the OFED meeting today: 1. Review OFED 1.3 features status main features that need update: NetEffect - done QoS: OSM - done QoS - need to merge Sean patches to the kernel XRC - 90% IPoIB: stateless offloads - 90% IPoIB: enable IGMP - ?? RDS - RDMA API - done QLVNIC update - done SDP: Keepalive - done; Asynch IO - done, Zero Copy - 80% Bonding -- ?? Management - ?? 2. Decide on feature freeze date (based on the status) 3. Close supported OS: Suggestion: * kernel.org: kernel 2.6.23 * Novell: SLES 10; SLES 10 SP1 * Redhat: RHEL 4 (up4 and up5); RHEL 5 - Do we want up1 too? * Free distros (Fedora, OpenSuSE, Ubuntu) - basic testing only Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Mon Sep 10 07:22:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Sep 2007 17:22:41 +0300 Subject: [ofa-general] [PATCH v4] IB/mlx4: shrinking WQE In-Reply-To: <20070909140201.GD25910@mellanox.co.il> References: <20070909112917.GA25910@mellanox.co.il> <20070909140201.GD25910@mellanox.co.il> Message-ID: <20070910142241.GA12546@mellanox.co.il> ConnectX supports shrinking wqe, such that a single WR can include multiple units of wqe_shift. 
This way, WRs can differ in size, and do not have to be a power of 2 in size, saving memory and speeding up send WR posting. Unfortunately, if we do this, the wqe_index field in the CQE can't be used to look up the WR ID anymore, so do this only if selective signalling is off. Further, on 32-bit platforms, we can't use vmap to make the QP buffer virtually contiguous. Thus we have to use constant-sized WRs to make sure a WR is always fully within a single page-sized chunk. Finally, we use the NOP opcode to avoid wrap-around in the middle of a WR. Since MLX QPs only support SEND, we use constant-sized WRs in this case. We look for the smallest value of wqe_shift such that the resulting number of wqes does not exceed device capabilities. Signed-off-by: Michael S. Tsirkin --- Changes since v3: fix nop formatting. Found by Eli Cohen. Changes since v2: fix memory leak in mlx4_buf_alloc. Found by internal code review. Changes since v1: add missing patch hunks diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 8bf44da..0981f3c 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -331,6 +331,11 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +358,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if ((*cur_qp)->ibqp.srq) { @@ -403,6 +410,10 @@ static 
int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, case MLX4_OPCODE_BIND_MW: wc->opcode = IB_WC_BIND_MW; break; + default: + printk("Unrecognized send opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } } else { wc->byte_len = be32_to_cpu(cqe->byte_cnt); @@ -422,6 +433,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + default: + printk("Unrecognized recv opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 705ff2f..a72ecb9 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -115,6 +115,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int sq_spare_wqes; struct mlx4_ib_wq sq; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index ba0428d..2afd48d 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,71 @@ static void *get_send_wqe(struct mlx4_ib_qp *qp, int n) /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). 
+ * + * When max WR is less than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. */ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { u32 *wqe = get_send_wqe(qp, n); int i; + int s; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift) / sizeof *wqe; + if (qp->sq_max_wqes_per_wr > 1) { + stamp = cpu_to_be32(0x7fffffff | (n & qp->sq.wqe_cnt ? 0 : 1 << 31)); + for (i = 0; i < s; i += 16) + wqe[i] = stamp; + } else { + for (i = 16; i < s; i += 16) + wqe[i] = 0xffffffff; + } +} + +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + stamp_send_wqe(qp, (n + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1), size); + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = sizeof(struct mlx4_wqe_ctrl_seg) + (qp->ibqp.qp_type == IB_QPT_UD ? + sizeof(struct mlx4_wqe_datagram_seg) : 0); + + /* Pad the remainder of the WQE with an inline data segment. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); + + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? 
cpu_to_be32(1 << 31) : 0); +} - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -234,9 +290,35 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, return 0; } +static int nop_wqe_shift(enum ib_qp_type type) +{ + /* + * WQE size is at least 0x20. + * UD WQEs must have a datagram segment. + * RC and UC WQEs must have control segment. + * MLX WQEs do not support NOP. + */ + switch (type) { + case IB_QPT_UD: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_datagram_seg), + (size_t)0x20))); + case IB_QPT_SMI: + case IB_QPT_GSI: + return -EINVAL; + case IB_QPT_UC: + case IB_QPT_RC: + default: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg), + (size_t)0x20))); + } +} + static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +334,60 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct 
mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this, the wqe_index field in + * the CQE can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. + * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contiguous. Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. + * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * Since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + qp->sq.wqe_shift = nop_wqe_shift(type); + if (!qp->sq_signal_bits || BITS_PER_LONG != 64 || qp->sq.wqe_shift < 0) + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. 
+ */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +399,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +438,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +534,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1028,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1228,14 +1352,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr, unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; - + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { err = -ENOMEM; @@ -1250,7 +1374,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? @@ -1266,7 +1390,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->imm = 0; wqe += sizeof *ctrl; - size = sizeof *ctrl / 16; + size = sizeof *ctrl; switch (ibqp->qp_type) { case IB_QPT_RC: @@ -1281,8 +1405,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_atomic_seg(wqe, wr); wqe += sizeof (struct mlx4_wqe_atomic_seg); - size += (sizeof (struct mlx4_wqe_raddr_seg) + - sizeof (struct mlx4_wqe_atomic_seg)) / 16; + size += sizeof (struct mlx4_wqe_raddr_seg) + + sizeof (struct mlx4_wqe_atomic_seg); break; @@ -1292,7 +1416,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_raddr_seg(wqe, wr->wr.rdma.remote_addr, wr->wr.rdma.rkey); wqe += sizeof (struct mlx4_wqe_raddr_seg); - size += sizeof (struct mlx4_wqe_raddr_seg) / 16; + size += sizeof (struct mlx4_wqe_raddr_seg); break; default: @@ -1304,7 +1428,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, case IB_QPT_UD: set_datagram_seg(wqe, wr); wqe += sizeof (struct mlx4_wqe_datagram_seg); - size += sizeof (struct mlx4_wqe_datagram_seg) / 16; + size += sizeof (struct mlx4_wqe_datagram_seg); break; case IB_QPT_SMI: @@ -1315,7 +1439,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, goto out; } wqe += err; - size += err / 
16; + size += err; err = 0; break; @@ -1328,7 +1452,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_data_seg(wqe, wr->sg_list + i); wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; + size += sizeof (struct mlx4_wqe_data_seg); } /* Add one more inline data segment for ICRC for MLX sends */ @@ -1337,11 +1461,11 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, cpu_to_be32((1 << 31) | 4); ((u32 *) wqe)[1] = 0; wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; + size += sizeof (struct mlx4_wqe_data_seg); } ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ? - MLX4_WQE_CTRL_FENCE : 0) | size; + MLX4_WQE_CTRL_FENCE : 0) | (size / 16); /* * Make sure descriptor is fully written before @@ -1358,16 +1482,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = (ind + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1); + ind += DIV_ROUND_UP(size, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). 
*/ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1389,8 +1520,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c index f8d63d3..0fce74d 100644 --- a/drivers/net/mlx4/alloc.c +++ b/drivers/net/mlx4/alloc.c @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + kfree(pages); + if (!buf->u.direct.buf) + goto err_free; + } } return 0; @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf) dma_free_coherent(&dev->pdev->dev, size, buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index cfb78fb..bd3ed64 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -185,7 +185,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct mlx4_buf_list *page_list; } u; diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index 3968b94..bf37369 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -158,6 +158,7 @@ enum { MLX4_WQE_CTRL_FENCE = 1 << 
6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, + MLX4_WQE_CTRL_NEC = 1 << 29, }; struct mlx4_wqe_ctrl_seg { -- MST From monis at voltaire.com Mon Sep 10 07:31:56 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 10 Sep 2007 17:31:56 +0300 Subject: [ofa-general] [PATCH V4 0/10] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <46C9B474.5020202@voltaire.com> References: <46C9B474.5020202@voltaire.com> Message-ID: <46E5555C.7060606@voltaire.com> Hi all, This patch series is a bit neglected. Since our goal is to have bonding support for IPoIB in kernel 2.6.24 it is very important for us to get comments soon. We would appreciate if you take some time to look at this and help us push this code upstream. thanks MoniS From swise at opengridcomputing.com Mon Sep 10 07:39:51 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 10 Sep 2007 09:39:51 -0500 Subject: [ofa-general] Re: [ewg] Agenda for the OFED meeting today In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563D34@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563D34@mtlexch01.mtl.com> Message-ID: <46E55737.7090106@opengridcomputing.com> Hey Tziporet, I cannot attend today's call. For the chelsio drivers, there will be a series of patches to be pulled into ofed-1.3 for the chelsio cxgb3 driver. They have been submitted upstream and ACKed by Garzik but he hasn't applied all of them yet. Once they are in his upstream branch, I'll pull them in for ofed-1.3 and ask Vlad (or you/michael in his absence) to pull these in. In addition, I want to pull these same patches into ofed-1.2.5 so that tree has the latest chelsio fixes as well. For the chelsio rdma driver iw_cxgb3, there will be a big patch to fix our port space issue. It is still under development and review, however. Steve. Tziporet Koren wrote: > Agenda for the OFED meeting today: > > 1. 
Review OFED 1.3 features status > main features that need update: > > NetEffect - done > QoS: OSM - done > QoS - need to merge Sean patches to the kernel > XRC - 90% > IPoIB: stateless offloads - 90% > IPoIB: enable IGMP - ?? > RDS - RDMA API - done > QLVNIC update - done > SDP: Keepalive - done; Asynch IO - done, Zero Copy - 80% > Bonding -- ?? > Management - ?? > > 2. Decide on feature freeze date (based on the status) > > 3. Close supported OS: > > Suggestion: > * kernel.org: kernel 2.6.23 > * Novell: SLES 10; SLES 10 SP1 > * Redhat: RHEL 4 (up4 and up5); RHEL 5 - Do we want up1 too? > * Free distros (Fedora, OpenSuSE, Ubuntu) - basic testing only > > Tziporet Koren > Software Director > Mellanox Technologies > mailto: _tziporet at mellanox.co.il_ > Tel +972-4-9097200, ext 380 > > > ------------------------------------------------------------------------ > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From sashak at voltaire.com Mon Sep 10 07:55:08 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 10 Sep 2007 17:55:08 +0300 Subject: [ofa-general] Re: [PATCH 0/3] osm: QoS - PathRecord and partial MultiPathRecord support In-Reply-To: <46E09CD2.2090906@dev.mellanox.co.il> References: <46E09CD2.2090906@dev.mellanox.co.il> Message-ID: <20070910145508.GC29384@sashak.voltaire.com> On 03:35 Fri 07 Sep , Yevgeny Kliteynik wrote: > Hi Sasha, > > The following is a series of three patches: > > [PATCH 1/3] Some modifications in qos policy as a step toward supporting MultiPathRecord: > - Added subnet object to the qos policy struct to remove dependency > on osm_pr_rcv_t (and later on osm_mpr_rcv_t). > - osm_qos_policy_get_qos_level_by_pr() turned into a wrapper function > that gets path record and extracts the relevant parameters. 
> > [PATCH 2/3] Added MultiPathRecord support in qos policy: > added osm_qos_policy_get_qos_level_by_mpr() wrapper function. > > [PATCH 3/3] Selecting PathRecord according to QoS policy level. > > These patches have *all* the changes that we've discussed recently, > so please disregard all the unapplied QoS-related patches that you have. All three patches are applied. Thanks. Sasha From parks at lanl.gov Mon Sep 10 08:16:11 2007 From: parks at lanl.gov (Parks Fields) Date: Mon, 10 Sep 2007 09:16:11 -0600 Subject: [ofa-general] Agenda for the OFED meeting today In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563D34@mtlexch01.mtl.com > References: <6C2C79E72C305246B504CBA17B5500C901563D34@mtlexch01.mtl.com> Message-ID: <7.0.1.0.2.20070910091547.028d6ec0@lanl.gov> At 07:50 AM 9/10/2007, Tziporet Koren wrote: >Content-class: urn:content-classes:message >Content-Type: multipart/alternative; > boundary="----_=_NextPart_001_01C7F3B1.89CA7A00" What is the call in number >Agenda for the OFED meeting today: > >1. Review OFED 1.3 features status > main features that need update: >NetEffect - done QoS: OSM - done QoS - need to merge Sean patches to >the kernel XRC - 90% IPoIB: stateless offloads - 90% IPoIB: enable >IGMP - ?? RDS - RDMA API - done QLVNIC update - done SDP: Keepalive >- done; Asynch IO - done, Zero Copy - 80% Bonding -- ?? Management - ?? > >2. Decide on feature freeze date (based on the status) > >3. Close supported OS: >Suggestion: * kernel.org: kernel 2.6.23 * Novell: SLES 10; >SLES 10 SP1 * Redhat: RHEL 4 (up4 and up5); RHEL 5 - Do we want >up1 too? 
* Free distros (Fedora, OpenSuSE, Ubuntu) - basic testing only > >Tziporet Koren >Software Director >Mellanox Technologies >mailto: tziporet at mellanox.co.il >Tel +972-4-9097200, ext 380 >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ***** Correspondence ***** This email contains no programmatic content that requires independent ADC review -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Mon Sep 10 08:32:42 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 10 Sep 2007 18:32:42 +0300 Subject: [ofa-general] Re: [PATCH] Fix potential buffer overflow in umad_get_cas_names() In-Reply-To: <87abrylhrn.fsf@confield.dd.xiranet.com> References: <87abrylhrn.fsf@confield.dd.xiranet.com> Message-ID: <20070910153242.GD29384@sashak.voltaire.com> On 15:36 Fri 07 Sep , Arne Redlich wrote: > umad_get_cas_names() currently ignores the max parameter - fix this. > > Signed-off-by: Arne Redlich Applied. Thanks. Sasha From sashak at voltaire.com Mon Sep 10 08:33:04 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 10 Sep 2007 18:33:04 +0300 Subject: [ofa-general] Re: [PATCH] Fix umad_get_cas_names() usage in libibumad. In-Reply-To: <878x7ilhrl.fsf@confield.dd.xiranet.com> References: <878x7ilhrl.fsf@confield.dd.xiranet.com> Message-ID: <20070910153304.GE29384@sashak.voltaire.com> On 15:36 Fri 07 Sep , Arne Redlich wrote: > resolve_ca_name() passes a wrong "max" argument to umad_get_cas_names. > > Signed-off-by: Arne Redlich Applied. Thanks. Sasha From hal.rosenstock at gmail.com Mon Sep 10 08:26:40 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 10 Sep 2007 11:26:40 -0400 Subject: [ofa-general] [PATCH] Fix umad_get_cas_names() usage in libibumad. 
In-Reply-To: <878x7ilhrl.fsf@confield.dd.xiranet.com> References: <878x7ilhrl.fsf@confield.dd.xiranet.com> Message-ID: On 9/7/07, Arne Redlich wrote: > resolve_ca_name() passes a wrong "max" argument to umad_get_cas_names. > > Signed-off-by: Arne Redlich > --- > diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c > index 787aa92..589684c 100644 > --- a/libibumad/src/umad.c > +++ b/libibumad/src/umad.c > @@ -307,7 +307,7 @@ resolve_ca_name(char *ca_name, int *best_port) > } > > /* Get the list of CA names */ > - if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0) > + if ((n = umad_get_cas_names((void *)names, 20)) < 0) Rather than the hard coded 20 here and elsewhere, should this be replaced by a #define ? -- Hal > return 0; > > /* Find the first existing CA with an active port */ > -- > 1.5.2.1 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Mon Sep 10 08:40:53 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 10 Sep 2007 18:40:53 +0300 Subject: [ofa-general] Re: [PATCH] Add -C and -P options to perl diags to be able to use alternate CA's and ports In-Reply-To: <20070907152541.6dc1f27b.weiny2@llnl.gov> References: <20070907152541.6dc1f27b.weiny2@llnl.gov> Message-ID: <20070910154053.GF29384@sashak.voltaire.com> On 15:25 Fri 07 Sep , Ira Weiny wrote: > We have a few nodes which are connected to multiple fabrics. The perl diags were unable to specify which port or CA to use. In our case this left us unable to use these tools on one of the subnets attached. This patch adds that support. > > Ira > > > From b2f95d93e1c2a730f554275cf636ccd687d1106e Mon Sep 17 00:00:00 2001 > From: Ira K. 
Weiny > Date: Thu, 6 Sep 2007 09:37:10 -0700 > Subject: [PATCH] Add -C and -P options to perl diags to be able to use alternate CA's and ports > > infiniband-diags/scripts/IBswcountlimits.pm > infiniband-diags/scripts/ibfindnodesusing.pl > infiniband-diags/scripts/iblinkinfo.pl > infiniband-diags/scripts/ibqueryerrors.pl > > Signed-off-by: Ira K. Weiny Applied. Thanks. Sasha From sashak at voltaire.com Mon Sep 10 08:42:10 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 10 Sep 2007 18:42:10 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/src/smpquery.c: fix compiler warning In-Reply-To: <20070907151924.0abb2e83.weiny2@llnl.gov> References: <20070907151924.0abb2e83.weiny2@llnl.gov> Message-ID: <20070910154210.GG29384@sashak.voltaire.com> On 15:19 Fri 07 Sep , Ira Weiny wrote: > From a20eaa1b0743aa1cc0c11372c2a989911cb5bcde Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Fri, 7 Sep 2007 15:10:51 -0700 > Subject: [PATCH] infiniband-diags/src/smpquery.c: fix compiler warning > > > Signed-off-by: Ira K. Weiny Applied. Thanks. Sasha From sashak at voltaire.com Mon Sep 10 08:45:13 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 10 Sep 2007 18:45:13 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/src/ibdiag_common.c: do not print warning of failed default switch map open In-Reply-To: <20070907151925.2355abe8.weiny2@llnl.gov> References: <20070907151925.2355abe8.weiny2@llnl.gov> Message-ID: <20070910154513.GH29384@sashak.voltaire.com> On 15:19 Fri 07 Sep , Ira Weiny wrote: > From 59f8772d60a4b061eb2e27ded9abecc9b9e83d5c Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Fri, 7 Sep 2007 15:08:20 -0700 > Subject: [PATCH] infiniband-diags/src/ibdiag_common.c: do not print warning of failed default > switch map open > > This really clutters up some of the diag scripts output now that more of the > tools support the switch map functionality. > > Signed-off-by: Ira K. Weiny Applied. Thanks. 
Sasha From sashak at voltaire.com Mon Sep 10 08:46:08 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 10 Sep 2007 18:46:08 +0300 Subject: [ofa-general] Re: [PATCH] Fix regexp's for new ibnetdiscover output In-Reply-To: <20070907152121.4ac611f5.weiny2@llnl.gov> References: <20070907152121.4ac611f5.weiny2@llnl.gov> Message-ID: <20070910154608.GI29384@sashak.voltaire.com> On 15:21 Fri 07 Sep , Ira Weiny wrote: > The ibnetdiscover output has changed so this command was failing. I am not sure when this happened but not matter this should fix it. > > Ira > > > From 9aadfb84826a5ea31107624b4b29e90d7c97e55b Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Fri, 7 Sep 2007 14:34:09 -0700 > Subject: [PATCH] Fix regexp's for new ibnetdiscover output > > Signed-off-by: Ira K. Weiny Applied. Thanks. Sasha From arne.redlich at xiranet.com Mon Sep 10 08:30:07 2007 From: arne.redlich at xiranet.com (Arne Redlich) Date: Mon, 10 Sep 2007 17:30:07 +0200 Subject: [ofa-general] [PATCH] Fix umad_get_cas_names() usage in libibumad. In-Reply-To: (Hal Rosenstock's message of "Mon\, 10 Sep 2007 11\:26\:40 -0400") References: <878x7ilhrl.fsf@confield.dd.xiranet.com> Message-ID: <87k5qysfls.fsf@confield.dd.xiranet.com> "Hal Rosenstock" writes: > On 9/7/07, Arne Redlich wrote: >> resolve_ca_name() passes a wrong "max" argument to umad_get_cas_names. >> >> Signed-off-by: Arne Redlich >> --- >> diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c >> index 787aa92..589684c 100644 >> --- a/libibumad/src/umad.c >> +++ b/libibumad/src/umad.c >> @@ -307,7 +307,7 @@ resolve_ca_name(char *ca_name, int *best_port) >> } >> >> /* Get the list of CA names */ >> - if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0) >> + if ((n = umad_get_cas_names((void *)names, 20)) < 0) > > Rather than the hard coded 20 here and elsewhere, should this be > replaced by a #define ? How about a umad_get_cas_count() helper instead? 
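A minimal sketch of the bug class under discussion, not the actual libibumad code: the second argument of a umad_get_cas_names()-style call is the number of name slots in the caller's array, so it can be derived from the array itself instead of passing the per-name length or a bare literal. MAX_CAS, CA_NAME_LEN, fake_get_cas_names() and count_cas() are hypothetical names invented for this illustration, and the stub only mimics the shape of the real call.

```c
#include <assert.h>
#include <string.h>

#define CA_NAME_LEN 20  /* hypothetical: bytes per name slot */
#define MAX_CAS     32  /* hypothetical: number of name slots */

/* Number of elements in a true array (does not work on pointers). */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Stand-in with the same shape as umad_get_cas_names(): fills at most
 * `max` slots and returns how many were filled. This is NOT the real
 * library function; the CA names below are made up.
 */
int fake_get_cas_names(char cas[][CA_NAME_LEN], int max)
{
    static const char *present[] = { "mthca0", "mthca1", "ehca0" };
    int n = 0;

    assert(max >= 0);
    while (n < max && n < (int)ARRAY_SIZE(present)) {
        strncpy(cas[n], present[n], CA_NAME_LEN - 1);
        cas[n][CA_NAME_LEN - 1] = '\0';
        n++;
    }
    return n;
}

/* Pass the slot count derived from the array, never the name length. */
int count_cas(void)
{
    char names[MAX_CAS][CA_NAME_LEN];

    return fake_get_cas_names(names, (int)ARRAY_SIZE(names));
}
```

With this shape, changing MAX_CAS in one place changes both the array and the limit passed to the call, which is the property the #define suggestion above is after.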
Arne From mshefty at ichips.intel.com Mon Sep 10 09:08:11 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Sep 2007 09:08:11 -0700 Subject: [ofa-general] [PATCH][RFC] P_Key support for umad In-Reply-To: References: <46E189ED.9030902@ichips.intel.com> Message-ID: <46E56BEB.5050208@ichips.intel.com> > What tests did you run ? I ran opensm and several of the management test apps (sminfo, saquery, perfquery, ibaddr, etc.). I don't recall all of them that I ran, or which options I used though. - Sean From sashak at voltaire.com Mon Sep 10 09:19:26 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 10 Sep 2007 19:19:26 +0300 Subject: [ofa-general] Re: [PATCH] osm: bugfix - IB_PR_COMPMASK was used in MPR In-Reply-To: <46E3EDC6.9070901@dev.mellanox.co.il> References: <46E3EDC6.9070901@dev.mellanox.co.il> Message-ID: <20070910161926.GJ29384@sashak.voltaire.com> On 15:57 Sun 09 Sep , Yevgeny Kliteynik wrote: > Hi Sasha, > > In several places in MPR implementation IB_PR_COMPMASK_* > was used instead of IB_MPR_COMPMASK_* > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From hal.rosenstock at gmail.com Mon Sep 10 09:13:46 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 10 Sep 2007 12:13:46 -0400 Subject: [ofa-general] Re: [PATCH] osm: bugfix - IB_PR_COMPMASK was used in MPR In-Reply-To: <20070910161926.GJ29384@sashak.voltaire.com> References: <46E3EDC6.9070901@dev.mellanox.co.il> <20070910161926.GJ29384@sashak.voltaire.com> Message-ID: Hi Sasha, On 9/10/07, Sasha Khapyorsky wrote: > On 15:57 Sun 09 Sep , Yevgeny Kliteynik wrote: > > Hi Sasha, > > > > In several places in MPR implementation IB_PR_COMPMASK_* > > was used instead of IB_MPR_COMPMASK_* > > > > Signed-off-by: Yevgeny Kliteynik > > Applied. Thanks. Shouldn't this also be applied to OFED 1.2 ? 
-- Hal > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Mon Sep 10 09:30:14 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 10 Sep 2007 19:30:14 +0300 Subject: [ofa-general] Re: [PATCH] osm: bugfix - IB_PR_COMPMASK was used in MPR In-Reply-To: References: <46E3EDC6.9070901@dev.mellanox.co.il> <20070910161926.GJ29384@sashak.voltaire.com> Message-ID: <20070910163014.GK29384@sashak.voltaire.com> Hi Hal, On 12:13 Mon 10 Sep , Hal Rosenstock wrote: > Hi Sasha, > > On 9/10/07, Sasha Khapyorsky wrote: > > On 15:57 Sun 09 Sep , Yevgeny Kliteynik wrote: > > > Hi Sasha, > > > > > > In several places in MPR implementation IB_PR_COMPMASK_* > > > was used instead of IB_MPR_COMPMASK_* > > > > > > Signed-off-by: Yevgeny Kliteynik > > > > Applied. Thanks. > > Shouldn't this also be applied to OFED 1.2 ? It does not look for me that any new OFED 1.2x distribution is planned. So how this could be useful? 
Sasha From hal.rosenstock at gmail.com Mon Sep 10 09:25:36 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 10 Sep 2007 12:25:36 -0400 Subject: [ofa-general] Re: [PATCH] osm: bugfix - IB_PR_COMPMASK was used in MPR In-Reply-To: <20070910163014.GK29384@sashak.voltaire.com> References: <46E3EDC6.9070901@dev.mellanox.co.il> <20070910161926.GJ29384@sashak.voltaire.com> <20070910163014.GK29384@sashak.voltaire.com> Message-ID: On 9/10/07, Sasha Khapyorsky wrote: > Hi Hal, > > On 12:13 Mon 10 Sep , Hal Rosenstock wrote: > > Hi Sasha, > > > > On 9/10/07, Sasha Khapyorsky wrote: > > > On 15:57 Sun 09 Sep , Yevgeny Kliteynik wrote: > > > > Hi Sasha, > > > > > > > > In several places in MPR implementation IB_PR_COMPMASK_* > > > > was used instead of IB_MPR_COMPMASK_* > > > > > > > > Signed-off-by: Yevgeny Kliteynik > > > > > > Applied. Thanks. > > > > Shouldn't this also be applied to OFED 1.2 ? > > It does not look for me that any new OFED 1.2x distribution is planned. Seems like this is an EWG issue. Should there be OFED 1.2.x fix release(s) ? -- Hal > So how this could be useful? > > Sasha > From swise at opengridcomputing.com Mon Sep 10 09:29:45 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 10 Sep 2007 11:29:45 -0500 Subject: [ewg] Re: [ofa-general] Re: [PATCH] osm: bugfix - IB_PR_COMPMASK was used in MPR In-Reply-To: References: <46E3EDC6.9070901@dev.mellanox.co.il> <20070910161926.GJ29384@sashak.voltaire.com> <20070910163014.GK29384@sashak.voltaire.com> Message-ID: <46E570F9.9060701@opengridcomputing.com> Hal Rosenstock wrote: > On 9/10/07, Sasha Khapyorsky wrote: >> Hi Hal, >> >> On 12:13 Mon 10 Sep , Hal Rosenstock wrote: >>> Hi Sasha, >>> >>> On 9/10/07, Sasha Khapyorsky wrote: >>>> On 15:57 Sun 09 Sep , Yevgeny Kliteynik wrote: >>>>> Hi Sasha, >>>>> >>>>> In several places in MPR implementation IB_PR_COMPMASK_* >>>>> was used instead of IB_MPR_COMPMASK_* >>>>> >>>>> Signed-off-by: Yevgeny Kliteynik >>>> Applied. Thanks. 
>>> Shouldn't this also be applied to OFED 1.2 ? >> It does not look for me that any new OFED 1.2x distribution is planned. > > Seems like this is an EWG issue. > > Should there be OFED 1.2.x fix release(s) ? > FYI: I plan to keep the ofed-1.2.5 tree up to date with chelsio driver fixes. So I vote for at least doing weekly ofed-1.2.5 builds and incorporating bug fixes... Steve. From weiny2 at llnl.gov Mon Sep 10 09:56:00 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 10 Sep 2007 09:56:00 -0700 Subject: [ofa-general] [PATCH] Fix regexp's for new ibnetdiscover output In-Reply-To: References: <20070907152121.4ac611f5.weiny2@llnl.gov> Message-ID: <20070910095600.6410ce3f.weiny2@llnl.gov> I don't see this format change in the 1.2 ibnetdiscover. Is version tag 1.2.4 going to go into 1.2? Your email made me search for the change and the commit ID is : f242dfb98c7ea73cbe8503061e28e6792c6a6e34 Since I did not see this in the 1.2 branch I did not think it was a big deal, perhaps there is a 1.2.5 branch I am missing? Ira On Fri, 7 Sep 2007 21:56:19 -0400 "Hal Rosenstock" wrote: > Hi Ira, > > On 9/7/07, Ira Weiny wrote: > > The ibnetdiscover output has changed so this command was failing. I am not sure when this happened but not matter this should fix it. > > It matters as this may also be an OFED 1.2 issue. > > -- Hal > > > > > Ira > > > > > > >From 9aadfb84826a5ea31107624b4b29e90d7c97e55b Mon Sep 17 00:00:00 2001 > > From: Ira K. Weiny > > Date: Fri, 7 Sep 2007 14:34:09 -0700 > > Subject: [PATCH] Fix regexp's for new ibnetdiscover output > > > > Signed-off-by: Ira K. 
Weiny > > --- > > infiniband-diags/scripts/IBswcountlimits.pm | 8 ++++---- > > 1 files changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm > > index 6cfa76c..f1e16d2 100755 > > --- a/infiniband-diags/scripts/IBswcountlimits.pm > > +++ b/infiniband-diags/scripts/IBswcountlimits.pm > > @@ -251,7 +251,7 @@ sub get_link_ends > > if ( $in_switch eq "yes" ) > > { > > my $rec = undef; > > - if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > > + if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > > { > > $loc_port = $1; > > my $rem_guid = $2; > > @@ -262,7 +262,7 @@ sub get_link_ends > > loc_sw_lid => $loc_sw_lid, > > rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => "", rem_desc => $rem_desc }; > > } > > - if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > > + if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > > { > > $loc_port = $1; > > my $loc_ext_port = $2; > > @@ -274,7 +274,7 @@ sub get_link_ends > > loc_sw_lid => $loc_sw_lid, > > rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => "", rem_desc => $rem_desc }; > > } > > - if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > > + if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > > { > > $loc_port = $1; > > my $rem_guid = $2; > > @@ -286,7 +286,7 @@ sub get_link_ends > > loc_sw_lid => $loc_sw_lid, > > rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => $rem_ext_port, rem_desc => $rem_desc }; > > } > > - if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > > + if ($line =~ /^\[(\d+)\]\[ext 
(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > > { > > $loc_port = $1; > > my $loc_ext_port = $2; > > -- > > 1.5.1 > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From hal.rosenstock at gmail.com Mon Sep 10 10:03:51 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 10 Sep 2007 13:03:51 -0400 Subject: [ofa-general] [PATCH] Fix regexp's for new ibnetdiscover output In-Reply-To: <20070910095600.6410ce3f.weiny2@llnl.gov> References: <20070907152121.4ac611f5.weiny2@llnl.gov> <20070910095600.6410ce3f.weiny2@llnl.gov> Message-ID: On 9/10/07, Ira Weiny wrote: > I don't see this format change in the 1.2 ibnetdiscover. Is version tag 1.2.4 > going to go into 1.2? Your email made me search for the change and the commit > ID is : f242dfb98c7ea73cbe8503061e28e6792c6a6e34 Can you elaborate on the format difference ? Thanks. -- Hal > Since I did not see this in the 1.2 branch I did not think it was a big deal, > perhaps there is a 1.2.5 branch I am missing? > > Ira > > > On Fri, 7 Sep 2007 21:56:19 -0400 > "Hal Rosenstock" wrote: > > > Hi Ira, > > > > On 9/7/07, Ira Weiny wrote: > > > The ibnetdiscover output has changed so this command was failing. I am not sure when this happened but not matter this should fix it. > > > > It matters as this may also be an OFED 1.2 issue. > > > > -- Hal > > > > > > > > Ira > > > > > > > > > >From 9aadfb84826a5ea31107624b4b29e90d7c97e55b Mon Sep 17 00:00:00 2001 > > > From: Ira K. Weiny > > > Date: Fri, 7 Sep 2007 14:34:09 -0700 > > > Subject: [PATCH] Fix regexp's for new ibnetdiscover output > > > > > > Signed-off-by: Ira K. 
Weiny > > > --- > > > infiniband-diags/scripts/IBswcountlimits.pm | 8 ++++---- > > > 1 files changed, 4 insertions(+), 4 deletions(-) > > > > > > diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm > > > index 6cfa76c..f1e16d2 100755 > > > --- a/infiniband-diags/scripts/IBswcountlimits.pm > > > +++ b/infiniband-diags/scripts/IBswcountlimits.pm > > > @@ -251,7 +251,7 @@ sub get_link_ends > > > if ( $in_switch eq "yes" ) > > > { > > > my $rec = undef; > > > - if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > > > + if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > > > { > > > $loc_port = $1; > > > my $rem_guid = $2; > > > @@ -262,7 +262,7 @@ sub get_link_ends > > > loc_sw_lid => $loc_sw_lid, > > > rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => "", rem_desc => $rem_desc }; > > > } > > > - if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > > > + if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > > > { > > > $loc_port = $1; > > > my $loc_ext_port = $2; > > > @@ -274,7 +274,7 @@ sub get_link_ends > > > loc_sw_lid => $loc_sw_lid, > > > rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => "", rem_desc => $rem_desc }; > > > } > > > - if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > > > + if ($line =~ /^\[(\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > > > { > > > $loc_port = $1; > > > my $rem_guid = $2; > > > @@ -286,7 +286,7 @@ sub get_link_ends > > > loc_sw_lid => $loc_sw_lid, > > > rem_guid => "0x$rem_guid", rem_lid => $rem_lid, rem_port => $rem_port, rem_ext_port => $rem_ext_port, rem_desc => $rem_desc }; > > > } > > > - if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext 
(\d+)\]\s+#.*\"(.*)\"\.* lid (\d+).*/) > > > + if ($line =~ /^\[(\d+)\]\[ext (\d+)\]\s+\"[HSR]-(.+)\"\[(\d+)\]\[ext (\d+)\]\(.+\)\s+#.*\"(.*)\"\.* lid (\d+).*/) > > > { > > > $loc_port = $1; > > > my $loc_ext_port = $2; > > > -- > > > 1.5.1 > > > > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > From mhagen at iol.unh.edu Mon Sep 10 10:51:23 2007 From: mhagen at iol.unh.edu (Mikkel Hagen) Date: Mon, 10 Sep 2007 13:51:23 -0400 Subject: [ofa-general] Upcoming OFA-IWG interop event Message-ID: <46E5841B.9020507@iol.unh.edu> The University of New Hampshire InterOperability Lab and Open Fabrics Alliance Interoperability Working Group would like to extend an invitation to all members to attend the upcoming Interoperability Event hosted at UNH-IOL facility. We will be performing the interoperability test plan developed within the OFA-IWG and granting logos to all qualified participants shortly after the event. All required information can be found at the following link regarding logistics, registration, test plan, etc: http://www.iol.unh.edu/services/testing/ofa/events/index.php Please download the Quick Start Guide (QSG) for all information and then feel free to forward any further questions to myself (mhagen at iol.unh.edu) or interop-wg at list.openfabrics.org. Thanks! 
-- 
Mikkel Hagen
Project Assistant - Fibre Channel/SAS/SATA Consortiums
Research and Development Engineer - iWARP Consortium
FC/SAS/SATA:1-603-862-0701
iWARP:1-603-862-5083
Fax:1-603-862-4181
UNH-IOL
121 Technology Drive, Suite 2
Durham, NH 03824

From weiny2 at llnl.gov Mon Sep 10 11:07:09 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 10 Sep 2007 11:07:09 -0700
Subject: [ofa-general] [PATCH] Fix regexp's for new ibnetdiscover output
In-Reply-To:
References: <20070907152121.4ac611f5.weiny2@llnl.gov> <20070910095600.6410ce3f.weiny2@llnl.gov>
Message-ID: <20070910110709.0659c333.weiny2@llnl.gov>

On Mon, 10 Sep 2007 13:03:51 -0400
"Hal Rosenstock" wrote:

> On 9/10/07, Ira Weiny wrote:
> > I don't see this format change in the 1.2 ibnetdiscover. Is version tag 1.2.4
> > going to go into 1.2? Your email made me search for the change and the commit
> > ID is : f242dfb98c7ea73cbe8503061e28e6792c6a6e34
>
> Can you elaborate on the format difference ? Thanks.
>

From the _new_ man page:

PortGUIDs are shown in parentheses (). For switches, this is shown on the
switchguid line. For CA and router ports, it is shown on the connectivity
lines.

From the patch I found:

-[22] "H-0008f10403961354"[1] # "MT23108 InfiniHost Mellanox Technologies" lid 4 4
+[22] "H-0008f10403961354"[1](8f10403961355) # "MT23108 InfiniHost Mellanox Techno

The addition of the GUID in parens caused my regexp to fail. I am thinking of
changing the scripts to look for the ibnetdiscover version reported with the
-V. However, since these tools are kept in the same package it should be ok to
simply ensure they are kept in sync. What do you think?

Ira

From Thomas.Talpey at netapp.com Mon Sep 10 11:14:37 2007
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Mon, 10 Sep 2007 14:14:37 -0400
Subject: [ofa-general] Fwd: [NFS] [PATCH 00/19] NFS/RDMA client support
Message-ID:

FYI. Comments, especially on the RDMA portions, are most welcome.

Tom.
> ---------- Forwarded Message ----------
>Date: Mon, 10 Sep 2007 13:41:55 -0400
>To: nfs at lists.sourceforge.net
>From: "Talpey, Thomas"
>Subject: [NFS] [PATCH 00/19] NFS/RDMA client support
>List-Id: "Discussion of NFS under Linux development, interoperability,
>  and testing."
>Sender: nfs-bounces at lists.sourceforge.net
>
>The following 19 messages contain a patch series to implement the
>fully integrated NFS/RDMA client into kernel 2.6.23-rc5 + NFS_ALL.
>They also integrate cleanly into the current .
>
>I would like to see them considered for inclusion into 2.6.24, most
>especially the first 14 which are purely infrastructure related.
>
>The patches are sequenced into the following functional groupings:
>
>01-02 implement RPCBIND netid's in each transport, instead of being
>decided by the rpcbind client before sending. This change also corrects
>the problem that IPv6 netid's were not supported.
>
>03-05 "invert" the parsing of NFS mount options in kernel to use the
>existing kernel-private nfs_parsed_mount_options structure for NFS[234],
>instead of continuing to use the legacy nfs_mount_data in both kernel
>and user space. Coupled with the new string-based NFS mount API, this
>allows extension of NFS mount options with no required changes to user
>space. It's needed for adding RDMA, but may also be useful for other
>NFS-related projects (such as fscache?).
>
>06 adds a flag to the xdrbuf to permit the rdma transport to marshal
>data appropriately.
>
>07-09 implement dynamic RPC transport registration, and logically
>reorganize the socket support as built-in TCP/UDP transports. These
>patches originated with Chuck Lever's transport switch work.
>
>10-11 rearrange a few sockets-specific RPC transport definitions into
>their own logical components. This is in preparation for adding the first
>new RPC transport.
>
>12-14 change the way RPC transports are selected from a raw IP
>protocol to a new RPC-layer identifier. Also #14 allows NFS to
>accurately print the "proto=" argument via /proc/mounts.
>
>15-19 implement the new RPC RDMA transport:
>
> 15 declares the RPC/RDMA protocol, RPC RDMA transport definitions,
>and configuration option.
>
> 16 adds support for the "-o rdma" option to string-based NFS mounts
>
> 17 adds the core RPC RDMA transport switch implementation. It has
>stubs for the RDMA API below it, and the RPC/RDMA protocol marshaling
>used when sending and receiving RPC messages.
>
> 18 implements the RPC message handling and connection management.
>
> 19 implements the kernel RDMA verbs interface.
>
>Since this version is fully integrated with the new string-based NFS
>kernel mount API, it requires the latest nfs-utils mount.nfs command,
>which must be invoked with the new "-i" flag. Additionally, until the
>server implements the necessary rpcbindv3 support, the RDMA port
>number must be provided in the mount command line. Currently, most
>NFS/RDMA servers are listening on 2050, this is likely to change.
>
> mount -i [-t nfs4] -o rdma,port=2050 server:/filesystem /mountpoint
>
>The core net/sunrpc/xprtrdma files are little changed from the July
>RFC Patches, except to fit into the above infrastructure, fix two
>issues in corner case RDMA chunk marshaling, and to correct kernel
>coding style nits.
>
>Comments on any and all issues are welcome.
>
>Tom.
>
> ---------- End of Forwarded Message ----------

From hal.rosenstock at gmail.com Mon Sep 10 11:20:43 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 10 Sep 2007 14:20:43 -0400
Subject: [ofa-general] [PATCH] Fix regexp's for new ibnetdiscover output
In-Reply-To: <20070910110709.0659c333.weiny2@llnl.gov>
References: <20070907152121.4ac611f5.weiny2@llnl.gov> <20070910095600.6410ce3f.weiny2@llnl.gov> <20070910110709.0659c333.weiny2@llnl.gov>
Message-ID:

On 9/10/07, Ira Weiny wrote:
> On Mon, 10 Sep 2007 13:03:51 -0400
> "Hal Rosenstock" wrote:
>
> > On 9/10/07, Ira Weiny wrote:
> > > I don't see this format change in the 1.2 ibnetdiscover. Is version tag 1.2.4
> > > going to go into 1.2?

Looks like the version in OFED 1.2 says:
ibnetdiscover -V
ibnetdiscover: BUILD VERSION 1.2.1

> > > Your email made me search for the change and the commit
> > > ID is : f242dfb98c7ea73cbe8503061e28e6792c6a6e34
> >
> > Can you elaborate on the format difference ? Thanks.
> >
> From the _new_ man page:
>
> PortGUIDs are shown in parentheses (). For switches, this is shown on the
> switchguid line. For CA and router ports, it is shown on the connectivity
> lines.
>
> From the patch I found:
>
> -[22] "H-0008f10403961354"[1] # "MT23108 InfiniHost Mellanox Technologies" lid 4 4
> +[22] "H-0008f10403961354"[1](8f10403961355) # "MT23108 InfiniHost Mellanox Techno
>
> The addition of the GUID in parens caused my regexp to fail. I am thinking of
> changing the scripts to look for the ibnetdiscover version reported with the
> -V. However, since these tools are kept in the same package it should be ok to
> simply ensure they are kept in sync. What do you think?
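To illustrate the format change quoted above, here is a minimal sketch in C using POSIX regcomp()/regexec() rather than the Perl from the actual patch. It accepts a connectivity line with or without the new parenthesized PortGUID by making that group optional, which is an assumption of this sketch; the applied patch instead requires the parenthesized field. struct link_end, copy_group() and parse_link_line() are hypothetical names, and the [ext N] variants are not handled.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Fields pulled out of one ibnetdiscover connectivity line. */
struct link_end {
    int  loc_port;
    char rem_guid[32];
    int  rem_port;
    char rem_port_guid[32];   /* "" when the line has no "(portguid)" */
};

/* Copy capture group m from line into buf; "" if the group did not match. */
static void copy_group(const char *line, const regmatch_t *m,
                       char *buf, size_t size)
{
    size_t len = 0;

    if (m->rm_so >= 0) {
        len = (size_t)(m->rm_eo - m->rm_so);
        if (len > size - 1)
            len = size - 1;
        memcpy(buf, line + m->rm_so, len);
    }
    buf[len] = '\0';
}

/* Returns 0 on a match, -1 if the line is not a plain connectivity line. */
int parse_link_line(const char *line, struct link_end *out)
{
    /* POSIX ERE; the "(\(...\))?" group makes the port GUID optional. */
    static const char pattern[] =
        "^\\[([0-9]+)][[:space:]]+\"[HSR]-([0-9a-fA-F]+)\"\\[([0-9]+)]"
        "(\\(([0-9a-fA-F]+)\\))?[[:space:]]+#";
    regex_t re;
    regmatch_t m[6];
    int rc;

    assert(line != NULL && out != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, line, 6, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    out->loc_port = (int)strtol(line + m[1].rm_so, NULL, 10);
    copy_group(line, &m[2], out->rem_guid, sizeof(out->rem_guid));
    out->rem_port = (int)strtol(line + m[3].rm_so, NULL, 10);
    copy_group(line, &m[5], out->rem_port_guid, sizeof(out->rem_port_guid));
    return 0;
}
```

A parser written this way keeps working against both the old and the new output, at the cost of not noticing when the tool and the script have drifted apart; keeping the two in sync, as discussed in this thread, avoids needing the optional group at all.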
If a format version needed to be determined, it could be done via -V
and parsed accordingly, but a better way would be to actually stick one
in the output file as a comment.

IMO the latter should be fine (keeping ibnetdiscover and script in
sync). It was just something that was not updated in the script when
that format change was made.

The former would only be needed if mixing and matching formats and
tools. Is that needed ?

-- Hal

> Ira
>

From tziporet at mellanox.co.il Mon Sep 10 12:38:33 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 10 Sep 2007 22:38:33 +0300
Subject: [ofa-general] OFED Sep 10 meeting summary on OFED 1.3 development status
Message-ID: <6C2C79E72C305246B504CBA17B5500C901563D3F@mtlexch01.mtl.com>

OFED Sep 10 meeting summary on OFED 1.3 development status

Meeting summary:

1. We reviewed OFED 1.3 features status:
   NetEffect - done
   QoS: OSM - done
   QoS - need to merge Sean patches to the kernel
   XRC - 90%
   IPoIB: stateless offloads - 90%
   IPoIB: enable IGMP - 90%
   RDS - RDMA API - done
   QLVNIC update - done
   SDP: Keepalive - done; Asynch IO - done, Zero Copy - 80%
   Bonding - 80%
   Management - done
   ehca - 90%
   mlx4 - 90%
   Chelsio - 90%

2. Based on the status, we decided to delay the feature freeze date to next week.
   The alpha release is expected on Sep 19.

3. We agreed on the following supported OS:
   * kernel.org: kernel 2.6.23
   * Novell: SLES 10; SLES 10 SP1
   * Redhat: RHEL 4 (up4 and up5); RHEL 5 (no up1)
   * Free distros (Fedora, OpenSuSE, Ubuntu) - basic testing only

4. OFED 1.3 plans can be found at:
   * The presentation from Sonoma can be found at:
     http://www.openfabrics.org/archives/april2007sonoma.htm
     (OFED 1.2 Lessons, 1.3 Planning and Field Support)
   * The new plans are also on the Wiki at (need some update from the meeting today):
     https://wiki.openfabrics.org/tiki-index.php?page=OFED+1.3+release+plan+and+features

Tziporet

From weiny2 at llnl.gov Mon Sep 10 13:51:09 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 10 Sep 2007 13:51:09 -0700
Subject: [ofa-general] [PATCH] Fix regexp's for new ibnetdiscover output
In-Reply-To:
References: <20070907152121.4ac611f5.weiny2@llnl.gov> <20070910095600.6410ce3f.weiny2@llnl.gov> <20070910110709.0659c333.weiny2@llnl.gov>
Message-ID: <20070910135109.31f8f198.weiny2@llnl.gov>

Hi Hal,

See below.

On Mon, 10 Sep 2007 14:20:43 -0400
"Hal Rosenstock" wrote:

> On 9/10/07, Ira Weiny wrote:
> > On Mon, 10 Sep 2007 13:03:51 -0400
> > "Hal Rosenstock" wrote:
> >
> > > On 9/10/07, Ira Weiny wrote:
> > > > I don't see this format change in the 1.2 ibnetdiscover. Is version tag 1.2.4
> > > > going to go into 1.2?
>
> Looks like the version in OFED 1.2 says:
> ibnetdiscover -V
> ibnetdiscover: BUILD VERSION 1.2.1
>
> > > > Your email made me search for the change and the commit
> > > > ID is : f242dfb98c7ea73cbe8503061e28e6792c6a6e34
> > >
> > > Can you elaborate on the format difference ? Thanks.
> > >
> > From the _new_ man page:
> >
> > PortGUIDs are shown in parentheses (). For switches, this is shown on the
> > switchguid line. For CA and router ports, it is shown on the connectivity
> > lines.
> >
> > From the patch I found:
> >
> > -[22] "H-0008f10403961354"[1] # "MT23108 InfiniHost Mellanox Technologies" lid 4 4
> > +[22] "H-0008f10403961354"[1](8f10403961355) # "MT23108 InfiniHost Mellanox Techno
> >
> > The addition of the GUID in parens caused my regexp to fail. I am thinking of
> > changing the scripts to look for the ibnetdiscover version reported with the
> > -V.
However, since these tools are kept in the same package it should be ok to > > simply ensure they are kept in sync. What do you think? > > If a format version needed to be determined, it could be done via -V > and parsed accordingly but a better way would be to actually stick on > in the output file as a comment. > > IMO the latter should be fine (keeping ibnetdiscover and script in > sync). It was just something missed to be updated in the script when > that format change was made. Yea, NP. > > The former would only be needed if mixing and matching formats and > tools. Is that needed ? > No, I don't think that is needed. Ira From sean.hefty at intel.com Mon Sep 10 16:04:47 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 10 Sep 2007 16:04:47 -0700 Subject: [ofa-general] [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch Message-ID: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> Roland, please pull from: git://git.openfabrics.org/~shefty/rdma-dev.git for-roland This will pick up QoS and CM scalability changes that I would like to get into 2.6.24 (and OFED 1.3). All have been posted to the list before, though the QoS patches have received more attention. 
Sean Hefty (7): ib/ipoib: specify Traffic Class with PR queries for QoS support ib/sa: add new QoS fields to path record rdma/cm: add ability to specify type of service rdma/ucm: export setting service type to user space ib/srp: add QoS support through service ID ib/cm: modify interface to send MRAs in response to duplicate messages rdma/cm: queue IB CM MRAs to avoid unnecessary remote retries drivers/infiniband/core/cm.c | 51 +++++++---------- drivers/infiniband/core/cma.c | 46 ++++++++++++--- drivers/infiniband/core/sa_query.c | 10 +-- drivers/infiniband/core/ucma.c | 74 ++++++++++++++++++++++++- drivers/infiniband/ulp/ipoib/ipoib.h | 22 +++++++ drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 +- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 22 ------- drivers/infiniband/ulp/srp/ib_srp.c | 2 include/rdma/ib_cm.h | 7 +- include/rdma/ib_sa.h | 11 +-- include/rdma/rdma_cm.h | 14 ++++ include/rdma/rdma_user_cm.h | 18 ++++++ 12 files changed, 205 insertions(+), 80 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 4df269f..2e39236 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -2219,6 +2219,9 @@ int ib_send_cm_mra(struct ib_cm_id *cm_id, { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; + enum ib_cm_state cm_state; + enum ib_cm_lap_state lap_state; + enum cm_msg_response msg_response; void *data; unsigned long flags; int ret; @@ -2235,48 +2238,40 @@ int ib_send_cm_mra(struct ib_cm_id *cm_id, spin_lock_irqsave(&cm_id_priv->lock, flags); switch(cm_id_priv->id.state) { case IB_CM_REQ_RCVD: - ret = cm_alloc_msg(cm_id_priv, &msg); - if (ret) - goto error1; - - cm_format_mra((struct cm_mra_msg *) msg->mad, cm_id_priv, - CM_MSG_RESPONSE_REQ, service_timeout, - private_data, private_data_len); - ret = ib_post_send_mad(msg, NULL); - if (ret) - goto error2; - cm_id->state = IB_CM_MRA_REQ_SENT; + cm_state = IB_CM_MRA_REQ_SENT; + lap_state = cm_id->lap_state; + msg_response = 
CM_MSG_RESPONSE_REQ; break; case IB_CM_REP_RCVD: - ret = cm_alloc_msg(cm_id_priv, &msg); - if (ret) - goto error1; - - cm_format_mra((struct cm_mra_msg *) msg->mad, cm_id_priv, - CM_MSG_RESPONSE_REP, service_timeout, - private_data, private_data_len); - ret = ib_post_send_mad(msg, NULL); - if (ret) - goto error2; - cm_id->state = IB_CM_MRA_REP_SENT; + cm_state = IB_CM_MRA_REP_SENT; + lap_state = cm_id->lap_state; + msg_response = CM_MSG_RESPONSE_REP; break; case IB_CM_ESTABLISHED: + cm_state = cm_id->state; + lap_state = IB_CM_MRA_LAP_SENT; + msg_response = CM_MSG_RESPONSE_OTHER; + break; + default: + ret = -EINVAL; + goto error1; + } + + if (!(service_timeout & IB_CM_MRA_FLAG_DELAY)) { ret = cm_alloc_msg(cm_id_priv, &msg); if (ret) goto error1; cm_format_mra((struct cm_mra_msg *) msg->mad, cm_id_priv, - CM_MSG_RESPONSE_OTHER, service_timeout, + msg_response, service_timeout, private_data, private_data_len); ret = ib_post_send_mad(msg, NULL); if (ret) goto error2; - cm_id->lap_state = IB_CM_MRA_LAP_SENT; - break; - default: - ret = -EINVAL; - goto error1; } + + cm_id->state = cm_state; + cm_id->lap_state = lap_state; cm_id_priv->service_timeout = service_timeout; cm_set_private_data(cm_id_priv, data, private_data_len); spin_unlock_irqrestore(&cm_id_priv->lock, flags); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 9ffb998..7253952 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -52,6 +52,7 @@ MODULE_LICENSE("Dual BSD/GPL"); #define CMA_CM_RESPONSE_TIMEOUT 20 #define CMA_MAX_CM_RETRIES 15 +#define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24) static void cma_add_one(struct ib_device *device); static void cma_remove_one(struct ib_device *device); @@ -138,6 +139,7 @@ struct rdma_id_private { u32 qkey; u32 qp_num; u8 srq; + u8 tos; }; struct cma_multicast { @@ -1089,6 +1091,7 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) event.param.ud.private_data_len = 
IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE - offset; } else { + ib_send_cm_mra(cm_id, CMA_CM_MRA_SETTING, NULL, 0); conn_id = cma_new_conn_id(&listen_id->id, ib_event); cma_set_req_event_data(&event, &ib_event->param.req_rcvd, ib_event->private_data, offset); @@ -1474,6 +1477,15 @@ err: } EXPORT_SYMBOL(rdma_listen); +void rdma_set_service_type(struct rdma_cm_id *id, int tos) +{ + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->tos = (u8) tos; +} +EXPORT_SYMBOL(rdma_set_service_type); + static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, void *context) { @@ -1498,23 +1510,37 @@ static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms, struct cma_work *work) { - struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; + struct rdma_addr *addr = &id_priv->id.route.addr; struct ib_sa_path_rec path_rec; + ib_sa_comp_mask comp_mask; + struct sockaddr_in6 *sin6; memset(&path_rec, 0, sizeof path_rec); - ib_addr_get_sgid(addr, &path_rec.sgid); - ib_addr_get_dgid(addr, &path_rec.dgid); - path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); + ib_addr_get_sgid(&addr->dev_addr, &path_rec.sgid); + ib_addr_get_dgid(&addr->dev_addr, &path_rec.dgid); + path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(&addr->dev_addr)); path_rec.numb_path = 1; path_rec.reversible = 1; + path_rec.service_id = cma_get_service_id(id_priv->id.ps, &addr->dst_addr); + + comp_mask = IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_REVERSIBLE | IB_SA_PATH_REC_SERVICE_ID; + + if (addr->src_addr.sa_family == AF_INET) { + path_rec.qos_class = cpu_to_be16((u16) id_priv->tos); + comp_mask |= IB_SA_PATH_REC_QOS_CLASS; + } else { + sin6 = (struct sockaddr_in6 *) &addr->src_addr; + path_rec.traffic_class = (u8) (be32_to_cpu(sin6->sin6_flowinfo) >> 20); + comp_mask |= 
IB_SA_PATH_REC_TRAFFIC_CLASS; + } id_priv->query_id = ib_sa_path_rec_get(&sa_client, id_priv->id.device, - id_priv->id.port_num, &path_rec, - IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | - IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH | - IB_SA_PATH_REC_REVERSIBLE, - timeout_ms, GFP_KERNEL, - cma_query_handler, work, &id_priv->query); + id_priv->id.port_num, &path_rec, + comp_mask, timeout_ms, + GFP_KERNEL, cma_query_handler, + work, &id_priv->query); return (id_priv->query_id < 0) ? id_priv->query_id : 0; } diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index d271bd7..6f56bb5 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -123,14 +123,10 @@ static u32 tid; .field_name = "sa_path_rec:" #field static const struct ib_field path_rec_table[] = { - { RESERVED, + { PATH_REC_FIELD(service_id), .offset_words = 0, .offset_bits = 0, - .size_bits = 32 }, - { RESERVED, - .offset_words = 1, - .offset_bits = 0, - .size_bits = 32 }, + .size_bits = 64 }, { PATH_REC_FIELD(dgid), .offset_words = 2, .offset_bits = 0, @@ -179,7 +175,7 @@ static const struct ib_field path_rec_table[] = { .offset_words = 12, .offset_bits = 16, .size_bits = 16 }, - { RESERVED, + { PATH_REC_FIELD(qos_class), .offset_words = 13, .offset_bits = 0, .size_bits = 12 }, diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 53b4c94..90d675a 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -792,6 +792,78 @@ out: return ret; } +static int ucma_set_option_id(struct ucma_context *ctx, int optname, + void *optval, size_t optlen) +{ + int ret = 0; + + switch (optname) { + case RDMA_OPTION_ID_TOS: + if (optlen != sizeof(u8)) { + ret = -EINVAL; + break; + } + rdma_set_service_type(ctx->cm_id, *((u8 *) optval)); + break; + default: + ret = -ENOSYS; + } + + return ret; +} + +static int ucma_set_option_level(struct ucma_context *ctx, int level, + int optname, void *optval, 
size_t optlen) +{ + int ret; + + switch (level) { + case RDMA_OPTION_ID: + ret = ucma_set_option_id(ctx, optname, optval, optlen); + break; + default: + ret = -ENOSYS; + } + + return ret; +} + +static ssize_t ucma_set_option(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_set_option cmd; + struct ucma_context *ctx; + void *optval; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + optval = kmalloc(cmd.optlen, GFP_KERNEL); + if (!optval) { + ret = -ENOMEM; + goto out1; + } + + if (copy_from_user(optval, (void __user *) (unsigned long) cmd.optval, + cmd.optlen)) { + ret = -EFAULT; + goto out2; + } + + ret = ucma_set_option_level(ctx, cmd.level, cmd.optname, optval, + cmd.optlen); +out2: + kfree(optval); +out1: + ucma_put_ctx(ctx); + return ret; +} + static ssize_t ucma_notify(struct ucma_file *file, const char __user *inbuf, int in_len, int out_len) { @@ -936,7 +1008,7 @@ static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, [RDMA_USER_CM_CMD_GET_OPTION] = NULL, - [RDMA_USER_CM_CMD_SET_OPTION] = NULL, + [RDMA_USER_CM_CMD_SET_OPTION] = ucma_set_option, [RDMA_USER_CM_CMD_NOTIFY] = ucma_notify, [RDMA_USER_CM_CMD_JOIN_MCAST] = ucma_join_multicast, [RDMA_USER_CM_CMD_LEAVE_MCAST] = ucma_leave_multicast, diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 285c143..fc16bce 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -113,7 +113,27 @@ struct ipoib_pseudoheader { u8 hwaddr[INFINIBAND_ALEN]; }; -struct ipoib_mcast; +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ib_sa_multicast *mc; + struct ipoib_ah *ah; + + struct rb_node 
rb_node; + struct list_head list; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; struct ipoib_rx_buf { struct sk_buff *skb; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 894b1dc..841e068 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -468,9 +468,10 @@ static struct ipoib_path *path_rec_create(struct net_device *dev, void *gid) INIT_LIST_HEAD(&path->neigh_list); memcpy(path->pathrec.dgid.raw, gid, sizeof (union ib_gid)); - path->pathrec.sgid = priv->local_gid; - path->pathrec.pkey = cpu_to_be16(priv->pkey); - path->pathrec.numb_path = 1; + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + path->pathrec.traffic_class = priv->broadcast->mcmember.traffic_class; return path; } @@ -491,6 +492,7 @@ static int path_rec_start(struct net_device *dev, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_TRAFFIC_CLASS | IB_SA_PATH_REC_PKEY, 1000, GFP_ATOMIC, path_rec_completion, diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index aae3670..94a5709 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -57,28 +57,6 @@ MODULE_PARM_DESC(mcast_debug_level, static DEFINE_MUTEX(mcast_mutex); -/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ -struct ipoib_mcast { - struct ib_sa_mcmember_rec mcmember; - struct ib_sa_multicast *mc; - struct ipoib_ah *ah; - - struct rb_node rb_node; - struct list_head list; - - unsigned long created; - unsigned long backoff; - - unsigned long flags; - unsigned char logcount; - - struct list_head neigh_list; - - struct sk_buff_head 
pkt_queue; - - struct net_device *dev; -}; - struct ipoib_mcast_iter { struct net_device *dev; union ib_gid mgid; diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index f6a0514..9ccc638 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -285,6 +285,7 @@ static int srp_lookup_path(struct srp_target_port *target) target->srp_host->dev->dev, target->srp_host->port, &target->path, + IB_SA_PATH_REC_SERVICE_ID | IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | @@ -1692,6 +1693,7 @@ static int srp_parse_options(const char *buf, struct srp_target_port *target) goto out; } target->service_id = cpu_to_be64(simple_strtoull(p, NULL, 16)); + target->path.service_id = target->service_id; kfree(p); break; diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 12243e8..a627c86 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -477,12 +477,15 @@ int ib_send_cm_rej(struct ib_cm_id *cm_id, const void *private_data, u8 private_data_len); +#define IB_CM_MRA_FLAG_DELAY 0x80 /* Send MRA only after a duplicate msg */ + /** * ib_send_cm_mra - Sends a message receipt acknowledgement to a connection * message. * @cm_id: Connection identifier associated with the connection message. - * @service_timeout: The maximum time required for the sender to reply to - * to the connection message. + * @service_timeout: The lower 5-bits specify the maximum time required for + * the sender to reply to to the connection message. The upper 3-bits + * specify additional control flags. * @private_data: Optional user-defined private data sent with the * message receipt acknowledgement. * @private_data_len: Size of the private data buffer, in bytes. 
diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 5e26b2f..942692b 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -109,8 +109,8 @@ enum ib_sa_selector { * Reserved rows are indicated with comments to help maintainability. */ -/* reserved: 0 */ -/* reserved: 1 */ +#define IB_SA_PATH_REC_SERVICE_ID (IB_SA_COMP_MASK( 0) |\ + IB_SA_COMP_MASK( 1)) #define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) #define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) #define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) @@ -123,7 +123,7 @@ enum ib_sa_selector { #define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) #define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) #define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) -/* reserved: 14 */ +#define IB_SA_PATH_REC_QOS_CLASS IB_SA_COMP_MASK(14) #define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) #define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) #define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) @@ -134,8 +134,7 @@ enum ib_sa_selector { #define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) struct ib_sa_path_rec { - /* reserved */ - /* reserved */ + __be64 service_id; union ib_gid dgid; union ib_gid sgid; __be16 dlid; @@ -148,7 +147,7 @@ struct ib_sa_path_rec { int reversible; u8 numb_path; __be16 pkey; - /* reserved */ + __be16 qos_class; u8 sl; u8 mtu_selector; u8 mtu; diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 2d6a770..010f876 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -314,4 +314,18 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, */ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr); +/** + * rdma_set_service_type - Set the type of service associated with a + * connection identifier. + * @id: Communication identifier to associate with the service type. + * @tos: Type of service. + * + * The type of service is interpreted as a differentiated service + * field (RFC 2474).
The service type should be specified before + * performing route resolution, as existing communication on the + * connection identifier may be unaffected. The type of service + * requested may not be supported by the network to all destinations. + */ +void rdma_set_service_type(struct rdma_cm_id *id, int tos); + #endif /* RDMA_CM_H */ diff --git a/include/rdma/rdma_user_cm.h b/include/rdma/rdma_user_cm.h index f632b0c..9749c1b 100644 --- a/include/rdma/rdma_user_cm.h +++ b/include/rdma/rdma_user_cm.h @@ -212,4 +212,22 @@ struct rdma_ucm_event_resp { } param; }; +/* Option levels */ +enum { + RDMA_OPTION_ID = 0 +}; + +/* Option details */ +enum { + RDMA_OPTION_ID_TOS = 0 +}; + +struct rdma_ucm_set_option { + __u64 optval; + __u32 id; + __u32 level; + __u32 optname; + __u32 optlen; +}; + #endif /* RDMA_USER_CM_H */ From hal.rosenstock at gmail.com Mon Sep 10 19:03:50 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 10 Sep 2007 22:03:50 -0400 Subject: [ofa-general] [PATCH][RFC] P_Key support for umad In-Reply-To: References: Message-ID: On 9/7/07, Roland Dreier wrote: > Here is a long overdue patch to enable userspace to control the P_Key > index used for userspace MADs. I used the approach we discussed when > this first came up, namely adding an ioctl to enable to the new > interface so that existing binaries don't break. > > I haven't had a chance to make all the userspace library changes to > test the new interface and I likely won't until I return home (I > should be done traveling for a few months after this week). I have > tested existing code against a kernel with this patch applied and it > seems to be OK, and I wanted to at least get this out for review as > soon as I had it. > > Please review/test. I would like to get this into 2.6.24 if possible > since we've known so long that we needed it. Thanks for doing this :-) One nit below in the doc. 
I spent some time testing it today in old mode and although my environment is limited, I did have trouble with an RMPP test as follows: Can someone try the following with OpenSM running: First, osmtest -f c and then osmtest -f a All on same node with new user_mad module. That seems to hangup rather than complete for me. I didn't have time to track this down any further. -- Hal > Thanks, > Roland > > > diff --git a/Documentation/infiniband/user_mad.txt b/Documentation/infiniband/user_mad.txt > index 8ec54b9..a3450aa 100644 > --- a/Documentation/infiniband/user_mad.txt > +++ b/Documentation/infiniband/user_mad.txt > @@ -99,6 +99,20 @@ Transaction IDs > request/response pairs. The upper 32 bits are reserved for use by > the kernel and will be overwritten before a MAD is sent. > > +P_Key Index Handling > + > + The old ib_umad interface did not allow setting the P_Key index for > + MADs that are sent and did not provide a way for obtaining the P_Key > + index of received MADs. A new layout for struct ib_user_mad_hdr > + with a pkey_index member has been defined; however, to preserve > + binary compatibility with older applications, this new layout will > + not be used unless the IB_USER_MAD_ENABLE_PKEY ioctl is called > + before a file description is used for anything else. Nit: Should this be "file descriptor" ? > + > + In September 2008, the IB_USER_MAD_ABI_VERSION will be incremented > + to 6, the new layout of struct ib_user_mad_hdr will be used by > + default, and the IB_USER_MAD_ENABLE_PKEY ioctl will be removed. 
> + > Setting IsSM Capability Bit > > To set the IsSM capability bit for a port, simply open the > diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c > index d97ded2..3a0e579 100644 > --- a/drivers/infiniband/core/user_mad.c > +++ b/drivers/infiniband/core/user_mad.c > @@ -118,6 +118,8 @@ struct ib_umad_file { > wait_queue_head_t recv_wait; > struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; > int agents_dead; > + u8 use_pkey_index; > + u8 already_used; > }; > > struct ib_umad_packet { > @@ -147,6 +149,12 @@ static void ib_umad_release_dev(struct kref *ref) > kfree(dev); > } > > +static int hdr_size(struct ib_umad_file *file) > +{ > + return file->use_pkey_index ? sizeof (struct ib_user_mad_hdr) : > + sizeof (struct ib_user_mad_hdr_old); > +} > + > /* caller must hold port->mutex at least for reading */ > static struct ib_mad_agent *__get_agent(struct ib_umad_file *file, int id) > { > @@ -221,13 +229,13 @@ static void recv_handler(struct ib_mad_agent *agent, > packet->length = mad_recv_wc->mad_len; > packet->recv_wc = mad_recv_wc; > > - packet->mad.hdr.status = 0; > - packet->mad.hdr.length = sizeof (struct ib_user_mad) + > - mad_recv_wc->mad_len; > - packet->mad.hdr.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); > - packet->mad.hdr.lid = cpu_to_be16(mad_recv_wc->wc->slid); > - packet->mad.hdr.sl = mad_recv_wc->wc->sl; > - packet->mad.hdr.path_bits = mad_recv_wc->wc->dlid_path_bits; > + packet->mad.hdr.status = 0; > + packet->mad.hdr.length = hdr_size(file) + mad_recv_wc->mad_len; > + packet->mad.hdr.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); > + packet->mad.hdr.lid = cpu_to_be16(mad_recv_wc->wc->slid); > + packet->mad.hdr.sl = mad_recv_wc->wc->sl; > + packet->mad.hdr.path_bits = mad_recv_wc->wc->dlid_path_bits; > + packet->mad.hdr.pkey_index = mad_recv_wc->wc->pkey_index; > packet->mad.hdr.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); > if (packet->mad.hdr.grh_present) { > struct ib_ah_attr ah_attr; > @@ -253,8 +261,8 @@ 
err1: > ib_free_recv_mad(mad_recv_wc); > } > > -static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, > - size_t count) > +static ssize_t copy_recv_mad(struct ib_umad_file *file, char __user *buf, > + struct ib_umad_packet *packet, size_t count) > { > struct ib_mad_recv_buf *recv_buf; > int left, seg_payload, offset, max_seg_payload; > @@ -262,15 +270,15 @@ static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, > /* We need enough room to copy the first (or only) MAD segment. */ > recv_buf = &packet->recv_wc->recv_buf; > if ((packet->length <= sizeof (*recv_buf->mad) && > - count < sizeof (packet->mad) + packet->length) || > + count < hdr_size(file) + packet->length) || > (packet->length > sizeof (*recv_buf->mad) && > - count < sizeof (packet->mad) + sizeof (*recv_buf->mad))) > + count < hdr_size(file) + sizeof (*recv_buf->mad))) > return -EINVAL; > > - if (copy_to_user(buf, &packet->mad, sizeof (packet->mad))) > + if (copy_to_user(buf, &packet->mad, hdr_size(file))) > return -EFAULT; > > - buf += sizeof (packet->mad); > + buf += hdr_size(file); > seg_payload = min_t(int, packet->length, sizeof (*recv_buf->mad)); > if (copy_to_user(buf, recv_buf->mad, seg_payload)) > return -EFAULT; > @@ -280,7 +288,7 @@ static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, > * Multipacket RMPP MAD message. Copy remainder of message. > * Note that last segment may have a shorter payload. > */ > - if (count < sizeof (packet->mad) + packet->length) { > + if (count < hdr_size(file) + packet->length) { > /* > * The buffer is too small, return the first RMPP segment, > * which includes the RMPP message length. 
> @@ -300,18 +308,23 @@ static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, > return -EFAULT; > } > } > - return sizeof (packet->mad) + packet->length; > + return hdr_size(file) + packet->length; > } > > -static ssize_t copy_send_mad(char __user *buf, struct ib_umad_packet *packet, > - size_t count) > +static ssize_t copy_send_mad(struct ib_umad_file *file, char __user *buf, > + struct ib_umad_packet *packet, size_t count) > { > - ssize_t size = sizeof (packet->mad) + packet->length; > + ssize_t size = hdr_size(file) + packet->length; > > if (count < size) > return -EINVAL; > > - if (copy_to_user(buf, &packet->mad, size)) > + if (copy_to_user(buf, &packet->mad, hdr_size(file))) > + return -EFAULT; > + > + buf += hdr_size(file); > + > + if (copy_to_user(buf, packet->mad.data, packet->length)) > return -EFAULT; > > return size; > @@ -324,7 +337,7 @@ static ssize_t ib_umad_read(struct file *filp, char __user *buf, > struct ib_umad_packet *packet; > ssize_t ret; > > - if (count < sizeof (struct ib_user_mad)) > + if (count < hdr_size(file)) > return -EINVAL; > > spin_lock_irq(&file->recv_lock); > @@ -348,9 +361,9 @@ static ssize_t ib_umad_read(struct file *filp, char __user *buf, > spin_unlock_irq(&file->recv_lock); > > if (packet->recv_wc) > - ret = copy_recv_mad(buf, packet, count); > + ret = copy_recv_mad(file, buf, packet, count); > else > - ret = copy_send_mad(buf, packet, count); > + ret = copy_send_mad(file, buf, packet, count); > > if (ret < 0) { > /* Requeue packet */ > @@ -442,15 +455,14 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, > __be64 *tid; > int ret, data_len, hdr_len, copy_offset, rmpp_active; > > - if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) > + if (count < hdr_size(file) + IB_MGMT_RMPP_HDR) > return -EINVAL; > > packet = kzalloc(sizeof *packet + IB_MGMT_RMPP_HDR, GFP_KERNEL); > if (!packet) > return -ENOMEM; > > - if (copy_from_user(&packet->mad, buf, > - sizeof (struct 
ib_user_mad) + IB_MGMT_RMPP_HDR)) { > + if (copy_from_user(&packet->mad, buf, hdr_size(file))) { > ret = -EFAULT; > goto err; > } > @@ -461,6 +473,13 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, > goto err; > } > > + buf += hdr_size(file); > + > + if (copy_from_user(packet->mad.data, buf, IB_MGMT_RMPP_HDR)) { > + ret = -EFAULT; > + goto err; > + } > + > down_read(&file->port->mutex); > > agent = __get_agent(file, packet->mad.hdr.id); > @@ -500,11 +519,11 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, > IB_MGMT_RMPP_FLAG_ACTIVE; > } > > - data_len = count - sizeof (struct ib_user_mad) - hdr_len; > + data_len = count - hdr_size(file) - hdr_len; > packet->msg = ib_create_send_mad(agent, > be32_to_cpu(packet->mad.hdr.qpn), > - 0, rmpp_active, hdr_len, > - data_len, GFP_KERNEL); > + packet->mad.hdr.pkey_index, rmpp_active, > + hdr_len, data_len, GFP_KERNEL); > if (IS_ERR(packet->msg)) { > ret = PTR_ERR(packet->msg); > goto err_ah; > @@ -517,7 +536,6 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, > > /* Copy MAD header. Any RMPP header is already in place. 
*/ > memcpy(packet->msg->mad, packet->mad.data, IB_MGMT_MAD_HDR); > - buf += sizeof (struct ib_user_mad); > > if (!rmpp_active) { > if (copy_from_user(packet->msg->mad + copy_offset, > @@ -646,6 +664,7 @@ found: > goto out; > } > > + file->already_used = 1; > file->agent[agent_id] = agent; > ret = 0; > > @@ -682,6 +701,20 @@ out: > return ret; > } > > +static long ib_umad_enable_pkey(struct ib_umad_file *file) > +{ > + int ret = 0; > + > + down_write(&file->port->mutex); > + if (file->already_used) > + ret = -EINVAL; > + else > + file->use_pkey_index = 1; > + up_write(&file->port->mutex); > + > + return ret; > +} > + > static long ib_umad_ioctl(struct file *filp, unsigned int cmd, > unsigned long arg) > { > @@ -690,6 +723,8 @@ static long ib_umad_ioctl(struct file *filp, unsigned int cmd, > return ib_umad_reg_agent(filp->private_data, arg); > case IB_USER_MAD_UNREGISTER_AGENT: > return ib_umad_unreg_agent(filp->private_data, arg); > + case IB_USER_MAD_ENABLE_PKEY: > + return ib_umad_enable_pkey(filp->private_data); > default: > return -ENOIOCTLCMD; > } > diff --git a/include/rdma/ib_user_mad.h b/include/rdma/ib_user_mad.h > index d66b15e..2a32043 100644 > --- a/include/rdma/ib_user_mad.h > +++ b/include/rdma/ib_user_mad.h > @@ -52,7 +52,50 @@ > */ > > /** > + * ib_user_mad_hdr_old - Old version of MAD packet header without pkey_index > + * @id - ID of agent MAD received with/to be sent with > + * @status - 0 on successful receive, ETIMEDOUT if no response > + * received (transaction ID in data[] will be set to TID of original > + * request) (ignored on send) > + * @timeout_ms - Milliseconds to wait for response (unset on receive) > + * @retries - Number of automatic retries to attempt > + * @qpn - Remote QP number received from/to be sent to > + * @qkey - Remote Q_Key to be sent with (unset on receive) > + * @lid - Remote lid received from/to be sent to > + * @sl - Service level received with/to be sent with > + * @path_bits - Local path bits received with/to be 
sent with > + * @grh_present - If set, GRH was received/should be sent > + * @gid_index - Local GID index to send with (unset on receive) > + * @hop_limit - Hop limit in GRH > + * @traffic_class - Traffic class in GRH > + * @gid - Remote GID in GRH > + * @flow_label - Flow label in GRH > + */ > +struct ib_user_mad_hdr_old { > + __u32 id; > + __u32 status; > + __u32 timeout_ms; > + __u32 retries; > + __u32 length; > + __be32 qpn; > + __be32 qkey; > + __be16 lid; > + __u8 sl; > + __u8 path_bits; > + __u8 grh_present; > + __u8 gid_index; > + __u8 hop_limit; > + __u8 traffic_class; > + __u8 gid[16]; > + __be32 flow_label; > +}; > + > +/** > * ib_user_mad_hdr - MAD packet header > + * This layout allows specifying/receiving the P_Key index. To use > + * this capability, an application must call the > + * IB_USER_MAD_ENABLE_PKEY ioctl on the user MAD file handle before > + * any other actions with the file handle. > * @id - ID of agent MAD received with/to be sent with > * @status - 0 on successful receive, ETIMEDOUT if no response > * received (transaction ID in data[] will be set to TID of original > @@ -70,6 +113,7 @@ > * @traffic_class - Traffic class in GRH > * @gid - Remote GID in GRH > * @flow_label - Flow label in GRH > + * @pkey_index - P_Key index > */ > struct ib_user_mad_hdr { > __u32 id; > @@ -88,6 +132,8 @@ struct ib_user_mad_hdr { > __u8 traffic_class; > __u8 gid[16]; > __be32 flow_label; > + __u16 pkey_index; > + __u8 reserved[6]; > }; > > /** > @@ -134,4 +180,6 @@ struct ib_user_mad_reg_req { > > #define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, __u32) > > +#define IB_USER_MAD_ENABLE_PKEY _IO(IB_IOCTL_MAGIC, 3) > + > #endif /* IB_USER_MAD_H */ > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at dev.mellanox.co.il Mon Sep 10 
20:20:54 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Sep 2007 06:20:54 +0300 Subject: [ofa-general] Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch In-Reply-To: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> References: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> Message-ID: <20070911032054.GA21811@mellanox.co.il> > Quoting Sean Hefty : > Subject: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch > > Roland, please pull from: > > git://git.openfabrics.org/~shefty/rdma-dev.git for-roland > > This will pick up QoS and CM scalability changes that I would like to get > into 2.6.24 (and OFED 1.3). Sean, where can I pull changes for ofed 1.3 from? The changes should go into kernel_patches/fixes for OFED. -- MST From kliteyn at mellanox.co.il Mon Sep 10 21:30:21 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 11 Sep 2007 07:30:21 +0300 Subject: [ofa-general] nightly osm_sim report 2007-09-11:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = OpenSM git rev = Sun_Sep_9_15:57:42_2007 [27f7ec84dbb1060397fa930569bc88d8f6e1d373] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=475 Fail=45 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo 8 FatTree merge-roots-4-ary-2-tree.topo 8 FatTree merge-root-4-ary-3-tree.topo 8 FatTree gnu-stallion-64.topo 8 
FatTree blend-4-ary-2-tree.topo 8 FatTree RhinoDDR.topo 8 FatTree FullGnu.topo 8 FatTree 4-ary-2-tree.topo 8 FatTree 2-ary-4-tree.topo 8 FatTree 12-node-spaced.topo Failures: 5 FatTree merge-roots-4-ary-2-tree.topo 5 FatTree merge-root-4-ary-3-tree.topo 5 FatTree gnu-stallion-64.topo 5 FatTree blend-4-ary-2-tree.topo 5 FatTree RhinoDDR.topo 5 FatTree FullGnu.topo 5 FatTree 4-ary-2-tree.topo 5 FatTree 2-ary-4-tree.topo 5 FatTree 12-node-spaced.topo From mst at dev.mellanox.co.il Mon Sep 10 23:01:34 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Sep 2007 09:01:34 +0300 Subject: [ofa-general] [PATCH v5] IB/mlx4: shrinking WQE Message-ID: <20070911060134.GB15363@mellanox.co.il> IB/mlx4: shrinking WQE ConnectX supports shrinking WQEs, such that a single WR can include multiple units of wqe_shift. This way, WRs can differ in size, and do not have to be a power of 2 in size, saving memory and speeding up send WR posting. Unfortunately, if we do this, the wqe_index field in the CQE can't be used to look up the WR ID anymore, so do this only if selective signalling is off. Further, on 32-bit platforms, we can't use vmap to make the QP buffer virtually contiguous. Thus we have to use constant-sized WRs to make sure a WR is always fully within a single page-sized chunk. Finally, we use the NOP opcode to avoid wrap-around in the middle of a WR. Since MLX QPs only support SEND, we use constant-sized WRs in this case. We look for the smallest value of wqe_shift such that the resulting number of WQEs does not exceed device capabilities. Signed-off-by: Michael S. Tsirkin --- Changes since v3: make shrinking WQE also work with latest firmware (newer than 2.2.0).
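To make the sizing rule in the description concrete, here is a minimal userspace sketch of the search for the smallest wqe_shift. It is not the driver code: struct sq_layout, size_sq() and round_up_pow2() are made-up names for illustration, and the device limits are passed in as plain parameters instead of being read from the device caps. Only the arithmetic (units per WR, 2 KB plus one WR of prefetch headroom, power-of-two queue size, bump the shift when the queue gets too big) is meant to follow the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result structure; field names are illustrative only. */
struct sq_layout {
	unsigned wqe_shift;   /* log2 of the basic WQE unit, in bytes */
	unsigned wqes_per_wr; /* units occupied by one maximum-size WR */
	unsigned spare_wqes;  /* prefetch headroom, in units */
	unsigned wqe_cnt;     /* total units in the SQ, a power of two */
	unsigned max_post;    /* WRs that may be outstanding at once */
};

/* Smallest power of two that is >= v (stand-in for roundup_pow_of_two). */
static unsigned round_up_pow2(unsigned v)
{
	unsigned r = 1;

	while (r < v)
		r <<= 1;
	return r;
}

/*
 * Find the smallest unit size, starting at start_shift, for which the
 * send queue still fits within max_wqes units.
 * Returns 0 and fills *out on success, -1 if no layout fits.
 */
static int size_sq(unsigned wr_bytes, unsigned max_send_wr,
		   unsigned start_shift, unsigned max_desc_sz,
		   unsigned max_wqes, struct sq_layout *out)
{
	unsigned shift = start_shift;

	assert(out != NULL && wr_bytes > 0);

	for (;;) {
		unsigned unit = 1u << shift;
		unsigned per_wr, spare, cnt;

		if (unit > max_desc_sz)
			return -1;

		/* Units needed by the largest WR (DIV_ROUND_UP). */
		per_wr = (wr_bytes + unit - 1) / unit;

		/* 2 KB plus one full WR of headroom for HW prefetch. */
		spare = (2048u >> shift) + per_wr;

		cnt = round_up_pow2(max_send_wr * per_wr + spare);
		if (cnt <= max_wqes) {
			out->wqe_shift = shift;
			out->wqes_per_wr = per_wr;
			out->spare_wqes = spare;
			out->wqe_cnt = cnt;
			out->max_post = (cnt - spare) / per_wr;
			return 0;
		}

		/* Already one unit per WR: a bigger unit cannot help. */
		if (per_wr <= 1)
			return -1;

		++shift;
	}
}
```

For example, with a 200-byte maximum WR, 100 requested WRs, a 32-byte starting unit and a limit of 16384 units, this gives 7 units per WR, 71 spare units and a 1024-entry queue. Lowering the limit to 512 pushes the unit up to 64 bytes (4 units per WR, 36 spare units, 512 entries).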
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 8bf44da..0981f3c 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -331,6 +331,11 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +358,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if ((*cur_qp)->ibqp.srq) { @@ -403,6 +410,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, case MLX4_OPCODE_BIND_MW: wc->opcode = IB_WC_BIND_MW; break; + default: + printk("Unrecognized send opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } } else { wc->byte_len = be32_to_cpu(cqe->byte_cnt); @@ -422,6 +433,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + default: + printk("Unrecognized recv opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 705ff2f..a72ecb9 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -115,6 +115,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int sq_spare_wqes; struct 
mlx4_ib_wq sq; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index ba0428d..2afd48d 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,71 @@ static void *get_send_wqe(struct mlx4_ib_qp *qp, int n) /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). + * + * When max WR is than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. */ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { u32 *wqe = get_send_wqe(qp, n); int i; + int s; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift) / sizeof *wqe; + if (qp->sq_max_wqes_per_wr > 1) { + stamp = cpu_to_be32(0x7fffffff | (n & qp->sq.wqe_cnt ? 0 : 1 << 31)); + for (i = 0; i < s; i += 16) + wqe[i] = stamp; + } else { + for (i = 16; i < s; i += 16) + wqe[i] = 0xffffffff; + } +} + +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + stamp_send_wqe(qp, (n + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1), size); + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = sizeof(struct mlx4_wqe_ctrl_seg) + (qp->ibqp.qp_type == IB_QPT_UD ? 
+ sizeof(struct mlx4_wqe_datagram_seg) : 0); + + /* Pad the remainder of the WQE with an inline data segment. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); + + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); +} - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -234,9 +290,35 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, return 0; } +static int nop_wqe_shift(enum ib_qp_type type) +{ + /* + * WQE size is at least 0x20. + * UD WQEs must have a datagram segment. + * RC and UC WQEs must have control segment. + * MLX WQEs do not support NOP. 
+ */ + switch (type) { + case IB_QPT_UD: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_datagram_seg), + (size_t)0x20))); + case IB_QPT_SMI: + case IB_QPT_GSI: + return -EINVAL; + case IB_QPT_UC: + case IB_QPT_RC: + default: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg), + (size_t)0x20))); + } +} + static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +334,60 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this wqe_index field in CQE + * can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. + * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contigious. Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. 
+ * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * Since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + qp->sq.wqe_shift = nop_wqe_shift(type); + if (!qp->sq_signal_bits || BITS_PER_LONG != 64 || qp->sq.wqe_shift < 0) + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. + */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +399,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +438,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if 
(init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +534,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1028,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1228,14 +1352,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; - + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { err = -ENOMEM; @@ -1250,7 +1374,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? 
@@ -1266,7 +1390,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->imm = 0; wqe += sizeof *ctrl; - size = sizeof *ctrl / 16; + size = sizeof *ctrl; switch (ibqp->qp_type) { case IB_QPT_RC: @@ -1281,8 +1405,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_atomic_seg(wqe, wr); wqe += sizeof (struct mlx4_wqe_atomic_seg); - size += (sizeof (struct mlx4_wqe_raddr_seg) + - sizeof (struct mlx4_wqe_atomic_seg)) / 16; + size += sizeof (struct mlx4_wqe_raddr_seg) + + sizeof (struct mlx4_wqe_atomic_seg); break; @@ -1292,7 +1416,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_raddr_seg(wqe, wr->wr.rdma.remote_addr, wr->wr.rdma.rkey); wqe += sizeof (struct mlx4_wqe_raddr_seg); - size += sizeof (struct mlx4_wqe_raddr_seg) / 16; + size += sizeof (struct mlx4_wqe_raddr_seg); break; default: @@ -1304,7 +1428,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, case IB_QPT_UD: set_datagram_seg(wqe, wr); wqe += sizeof (struct mlx4_wqe_datagram_seg); - size += sizeof (struct mlx4_wqe_datagram_seg) / 16; + size += sizeof (struct mlx4_wqe_datagram_seg); break; case IB_QPT_SMI: @@ -1315,7 +1439,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, goto out; } wqe += err; - size += err / 16; + size += err; err = 0; break; @@ -1328,7 +1452,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, set_data_seg(wqe, wr->sg_list + i); wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; + size += sizeof (struct mlx4_wqe_data_seg); } /* Add one more inline data segment for ICRC for MLX sends */ @@ -1337,11 +1461,11 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, cpu_to_be32((1 << 31) | 4); ((u32 *) wqe)[1] = 0; wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; + size += sizeof (struct mlx4_wqe_data_seg); } ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ? 
- MLX4_WQE_CTRL_FENCE : 0) | size; + MLX4_WQE_CTRL_FENCE : 0) | (size / 16); /* * Make sure descriptor is fully written before @@ -1358,16 +1482,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = (ind + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1); + ind += DIV_ROUND_UP(size, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). */ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1389,8 +1520,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c index f8d63d3..0fce74d 100644 --- a/drivers/net/mlx4/alloc.c +++ b/drivers/net/mlx4/alloc.c @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + kfree(pages); + if (!buf->u.direct.buf) + goto err_free; + } } return 0; @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf) dma_free_coherent(&dev->pdev->dev, size, 
buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index cfb78fb..bd3ed64 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -185,7 +185,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct mlx4_buf_list *page_list; } u; diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index 3968b94..bf37369 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -158,6 +158,7 @@ enum { MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, + MLX4_WQE_CTRL_NEC = 1 << 29, }; struct mlx4_wqe_ctrl_seg { -- MST From mst at dev.mellanox.co.il Mon Sep 10 23:28:51 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Sep 2007 09:28:51 +0300 Subject: [ofa-general] RFC: modify upstream code to make backporting easier Message-ID: <20070911062851.GC15363@mellanox.co.il> Roland, Ralph, all, I'd like to get your opinion on the following matter: OFED is backporting upstream rdma code to older kernels. While doing so, I really take pains to keep the ported code as close as possible to upstream original, mostly by using preprocessor to implement, as closely as possible, the APIs from recent kernels on top of older ones. As an example where this works well, see my backport of the new workqueue API to 2.6.19: http://www.openfabrics.org/git/?p=ofed_1_3/linux-2.6.git;a=blob;f=kernel_addons/backport/2.6.19/include/linux/workqueue.h;hb=HEAD However, sometimes I am forced to patch the upstream code. 
Here's an example of the patch needed to make ipath build on 2.6.22: diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 09c5fd8..94edb5d 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -287,6 +287,7 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, struct ipath_devdata *dd; unsigned long long addr; u32 bar0 = 0, bar1 = 0; + u8 rev; dd = ipath_alloc_devdata(pdev); if (IS_ERR(dd)) { @@ -448,7 +449,13 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, dd->ipath_deviceid = ent->device; /* save for later use */ dd->ipath_vendorid = ent->vendor; - dd->ipath_pcirev = pdev->revision; + ret = pci_read_config_byte(pdev, PCI_REVISION_ID, &rev); + if (ret) { + ipath_dev_err(dd, "Failed to read PCI revision ID unit " + "%u: err %d\n", dd->ipath_unit, -ret); + goto bail_regions; /* shouldn't ever happen */ + } + dd->ipath_pcirev = rev; #if defined(__powerpc__) /* There isn't a generic way to specify writethrough mappings */ As you can see, there's nothing I can do with macros outside the code to make it work without code changes. However, the patching mechanism is pretty fragile with respect to code reorgs etc. I wonder whether it's acceptable in cases such as this to add a wrapper in upstream code. For example, upstream could have: #ifndef pci_get_revision #define pci_get_revision(dev) ((dev)->revision) #endif and then all a 2.6.22 backport needs to do is define its own pci_get_revision macro. Upstream maintainers, can you pls comment ASAP on whether such an approach would be acceptable e.g. for 2.6.24? If I could get rid of backport patches, it might make sense to start thinking about converting fixes patches to git commits, post 1.3, as well.
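The wrapper idea above can be sketched in plain userspace C. This is only an illustration of the #ifndef override pattern, not kernel code: struct fake_pci_dev, FAKE_PCI_REVISION_ID, backport_read_revision() and driver_save_revision() are hypothetical names invented for the example, and the byte array merely stands in for PCI config space on a kernel whose struct pci_dev has no cached revision field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev on an older kernel: there is no
 * cached 'revision' member, only raw config space to read from. */
struct fake_pci_dev {
	unsigned char config_space[64];
};

/* Offset of the revision ID byte in PCI config space (assumed 0x08 here). */
#define FAKE_PCI_REVISION_ID 0x08

/* Backport side: seen by the compiler before the driver source, so the
 * upstream default further down is skipped. */
static inline unsigned char backport_read_revision(const struct fake_pci_dev *dev)
{
	assert(dev != NULL);
	return dev->config_space[FAKE_PCI_REVISION_ID];
}
#define pci_get_revision(dev) backport_read_revision(dev)

/* Upstream side: the default applies only when nobody overrode the macro.
 * Because the backport already defined it, this branch is never expanded,
 * so the missing 'revision' member causes no build error. */
#ifndef pci_get_revision
#define pci_get_revision(dev) ((dev)->revision)
#endif

/* Driver code calls the wrapper and needs no per-kernel patch. */
static unsigned char driver_save_revision(const struct fake_pci_dev *dev)
{
	return pci_get_revision(dev);
}
```

With this layout the driver line stays identical across kernels; only the header that defines pci_get_revision differs between upstream and the backport tree.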
Thanks, -- MST From kliteyn at mellanox.co.il Mon Sep 10 23:44:04 2007 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 11 Sep 2007 09:44:04 +0300 Subject: [ofa-general] RE: nightly osm_sim report 2007-09-11:normal completion In-Reply-To: References: Message-ID: <6C2C79E72C305246B504CBA17B5500C9023D3AD2@mtlexch01.mtl.com> Please disregard the FatTree failures - they are false negatives (I forgot to undo some change in the test). Regards, Yevgeny Kliteynik Mellanox Technologies LTD Tel: +972-4-909-7200 ext: 394 Fax: +972-4-959-3245 P.O. Box 586 Yokneam 20692 ISRAEL -----Original Message----- From: Yevgeny Kliteynik Sent: Tuesday, September 11, 2007 7:30 AM To: Yevgeny Kliteynik; sashak at voltaire.com Cc: Eitan Zahavi; general at lists.openfabrics.org Subject: nightly osm_sim report 2007-09-11:normal completion OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = OpenSM git rev = Sun_Sep_9_15:57:42_2007 [27f7ec84dbb1060397fa930569bc88d8f6e1d373] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=475 Fail=45 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo 8 FatTree merge-roots-4-ary-2-tree.topo 8 FatTree merge-root-4-ary-3-tree.topo 8 FatTree gnu-stallion-64.topo 8 FatTree blend-4-ary-2-tree.topo 8 FatTree RhinoDDR.topo 8 FatTree FullGnu.topo 8 FatTree 4-ary-2-tree.topo 8 FatTree 2-ary-4-tree.topo 8 FatTree 12-node-spaced.topo Failures: 5 FatTree 
merge-roots-4-ary-2-tree.topo 5 FatTree merge-root-4-ary-3-tree.topo 5 FatTree gnu-stallion-64.topo 5 FatTree blend-4-ary-2-tree.topo 5 FatTree RhinoDDR.topo 5 FatTree FullGnu.topo 5 FatTree 4-ary-2-tree.topo 5 FatTree 2-ary-4-tree.topo 5 FatTree 12-node-spaced.topo From glebn at voltaire.com Mon Sep 10 23:46:02 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Tue, 11 Sep 2007 09:46:02 +0300 Subject: [ofa-general] Re: OFED Sep 10 meeting summary on OFED 1.3 development status In-Reply-To: <46E638B5.7050207@voltaire.com> References: <46E638B5.7050207@voltaire.com> Message-ID: <20070911064602.GF1397@minantech.com> > XRC - 90% When we can expect to see this patch posted to ofa list? -- Gleb. From ogerlitz at voltaire.com Tue Sep 11 00:55:55 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 11 Sep 2007 10:55:55 +0300 Subject: [ofa-general] Re: [ewg] OFED Sep 10 meeting summary on OFED 1.3 development status In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563D3F@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563D3F@mtlexch01.mtl.com> Message-ID: <46E64A0B.5030607@voltaire.com> Tziporet Koren wrote: > OFED Sep 10 meeting summary on OFED 1.3 development status > Meeting summary: > 1. We reviewed OFED 1.3 features status: > IPoIB: stateless offloads - 90% Tziporet, Michael Based on the fact that OFED 1.3 is now based on 2.6.23-rc5, it simply does not make --any-- sense to have sixty "fix" patches from day one (full list below). Zooming in on IPoIB (twelve patches!) I see plenty of uncooked material: The discussion/thread that followed the "IB/ipoib: S/G and HW checksum support" patch posted by Michael is not over yet. There was a direction offered by Michael which is pending Roland's feedback. As for the "ipoib - add LSO support" patch(es) posted by Eli, I did not see any response from Roland or anyone else (except for me asking some questions...). Eli/Michael - where does this stand? Are you planning to push this for 2.6.24?
Also I understand that the patch set supports what you call "interrupt moderation". Does this go well with NAPI? If yes, how? And why put in OFED something you never sent for review on the general list nor attempted to push into the kernel? etc etc etc > RDS - RDMA API - done > SDP: Keepalive - done; Asynch IO - done, Zero Copy - 80% Have these patches ever been sent for review on the general or ewg lists? > 2. based on the status we decided to delay the feature freeze date to > next week > Alpha release is expected on Sep 19 I don't think we can start with sixty patches; sorry for not bringing up this input before yesterday. I would be happy to hear what others here think. Or. > 0009-mlx4_add_wc.patch > 0015_mlx4_set_cacheline_sz.patch > 0018_mlx4_qp_per_mcg.patch > 0020_mlx4_alloc_coherent.patch > 0022-mlx4-mr-direct-mtt.patch > 0023_mlx4_sg_stamp.patch > 0024_mlx4_fmr.patch > 0025_iw_cxgb3_writable-params.patch > 0025-mlx4-sysfs-dev-info.patch > 0026-mlx4-check-usr-sq-sz.patch > 0027-mlx4-sqp-no-bitmap.patch > 0028-cxgb3-fw-4.6.0 > cma_established1.patch > cma_response_timeout.patch > cma_tavor_quirk.patch > cmd_tout.patch > dma_map_sg.patch > ehca_add_mutex_h.patch > ipath-22-memcpy_cachebypass.patch > ipoib_crash_wa.patch > ipoib_dev_in_ipoib_neigh.patch > ipoib_selector_updated.patch > iwcm_ordird.patch > mlx4_msix.patch > mlx4_reset_msleep.patch > mthca_catas_wqueue_namelen.patch > mthca_msix.patch > mthca_wrid_swap.patch > mthca_x_qp_per_mcg.patch > qos_0_mthca.patch > scsi_transport_include_mutex.patch > sdp_cq_param.patch > sdp_post_credits.patch > sean_cm_limit_mra_timeout.patch > sean_local_sa_1_notifications.patch > sean_local_sa_2_cache.patch > sean_local_sa_3_disable.patch > sean_local_sa_4_fix_hang.patch > srp_1_recreate_at_reconnect.patch > t_0010_ipoib_high_dma.patch > t_0015_ipoib_sg.patch > t_0020_core_csum.patch > t_0030_mthca_checksum_offload.patch > t_0040_mlx4_checksum_offload.patch > t_0050_ipoib_checksum_offload.patch > 
t_0060_ipoib_qp_init_attr.patch > t_0080_mlx4_qp_max_msg.patch > t_0090_core_lso.patch > t_0100_mlx4_lso.patch > t_0110_ipoib_lso.patch > t_0120_ipoib_ethtool.patch > t_0130_ipoib_lro.patch > t_0140_core_modify_cq.patch > t_0150_mlx4_modify_cq.patch > t_0160_ipoib_modify_cq.patch > t_0170_cq_coal.patch > t_0180_ibcore_xrc.patch > t_0181_ibcore_srq_create.patch > t_0190_mlx4_xrc.patch > t_0200_create_send_mad.patch From ogerlitz at voltaire.com Tue Sep 11 01:05:53 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 11 Sep 2007 11:05:53 +0300 Subject: [ofa-general] [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch In-Reply-To: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> References: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> Message-ID: <46E64C61.7090607@voltaire.com> Sean Hefty wrote: > Roland, please pull from: > > git://git.openfabrics.org/~shefty/rdma-dev.git for-roland > > This will pick up QoS and CM scalability changes that I would like to get > into 2.6.24 (and OFED 1.3). All have been posted to the list before, though > the QoS patches have received more attention. Michael, I see the below non upstream patch in OFED 1.3 (I guess its also in 1.2 etc). Any reason not to push it upstream? Is this needed also for qos with connectx? Or. > encode SL in sched_queue field to improve hardware QoS guarantees > for connected QPs. > > Signed-off-by: Michael S. 
Tsirkin > > Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_qp.c > =================================================================== > --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_qp.c > +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_qp.c > @@ -49,6 +49,10 @@ > #include "mthca_memfree.h" > #include "mthca_wqe.h" > > +static int mthca_qos_support = 0; > +module_param_named(qos_support, mthca_qos_support, int, 0644); > +MODULE_PARM_DESC(qos_support, "Enable QoS support if > 0"); > + > enum { > MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, > MTHCA_ACK_REQ_FREQ = 10, > @@ -694,6 +698,19 @@ int mthca_modify_qp(struct ib_qp *ibqp, > goto out_mailbox; > > qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); > + if (mthca_qos_support) { > + u8 sl = attr->ah_attr.sl; > + u8 sched_queue = (sl & 0x8) | (sl & (~(sl >> 1)) & 0x4) | > + ((sl >> 1) & (sl >> 2) & 0x2) | ((sl >> 1) & 0x1); > + > + if (mthca_is_memfree(dev)) { > + qp_context->rlkey_arbel_sched_queue |= sched_queue; > + } else { > + qp_context->tavor_sched_queue |= cpu_to_be32(sched_queue); > + } > + qp_param->opt_param_mask |= > + cpu_to_be32(MTHCA_QP_OPTPAR_SCHED_QUEUE); > + } > } > > if (attr_mask & IB_QP_TIMEOUT) { From mst at dev.mellanox.co.il Tue Sep 11 02:03:13 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Sep 2007 12:03:13 +0300 Subject: [ofa-general] mlx4 violating radix tree API locking rules? Message-ID: <20070911090313.GE15363@mellanox.co.il> Roland, could you clarify the following please: include/linux/radix-tree.h says: * For API usage, in general, * - any function _modifying_ the tree or tags (inserting or deleting * items, setting or clearing tags must exclude other modifications, and * exclude any functions reading the tree. * - any function _reading_ the tree or tags (looking up items or tags, * gang lookups) must exclude modifications to the tree, but may occur * concurrently with other readers. 
* * The notable exceptions to this rule are the following functions: * radix_tree_lookup * radix_tree_tag_get * radix_tree_gang_lookup * radix_tree_gang_lookup_tag * radix_tree_tagged * * The first 4 functions are able to be called locklessly, using RCU. The * caller must ensure calls to these functions are made within rcu_read_lock() * regions. Other readers (lock-free or otherwise) and modifications may be * running concurrently. * * It is still required that the caller manage the synchronization and lifetimes * of the items. So if RCU lock-free lookups are used, typically this would mean * that the items have their own locks, or are amenable to lock-free access; and * that the items are freed by RCU (or only freed after having been deleted from * the radix tree *and* a synchronize_rcu() grace period). * * (Note, rcu_assign_pointer and rcu_dereference are not needed to control * access to data items when inserting into or looking up from the radix tree) * * radix_tree_tagged is able to be called without locking or RCU. OTOH, a comment in drivers/infiniband/hw/mlx4/cq.c says: /* * We do not have to take the QP table lock here, * because CQs will be locked while QPs are removed * from the table. */ I guess CQ spinlock implies rcu_read_lock - is that right? But I do not see any synchronize_rcu calls anywhere in mlx4. Should destroy QP and friends call synchronize_rcu after removing the QP from radix tree but before freeing the QP structure? 
Thanks, -- MST From vlad at lists.openfabrics.org Tue Sep 11 02:53:19 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 11 Sep 2007 02:53:19 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070911-0200 daily build status Message-ID: <20070911095320.421F3E60834@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 
with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070911-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From mst at dev.mellanox.co.il Tue Sep 11 04:52:21 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 11 Sep 2007 14:52:21 +0300 Subject: [ofa-general] Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch In-Reply-To: <46E64C61.7090607@voltaire.com> References: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> <46E64C61.7090607@voltaire.com> Message-ID: <20070911115221.GB31103@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch > > Sean Hefty wrote: > >Roland, please pull from: > > > > git://git.openfabrics.org/~shefty/rdma-dev.git for-roland > > > >This will pick up QoS and CM scalability changes that I would like to get > >into 2.6.24 (and OFED 1.3). All have been posted to the list before, > >though > >the QoS patches have received more attention. > > Michael, > > I see the below non upstream patch in OFED 1.3 (I guess its also in 1.2 > etc). I plan to remove it. Thanks for reminding me. > Any reason not to push it upstream? I posted this at some point. Mellanox has since decided against testing it on arbel, so this configuration won't be supported until further notice. > Is this needed also for qos > with connectx? No. -- MST From ogerlitz at voltaire.com Tue Sep 11 05:01:19 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 11 Sep 2007 15:01:19 +0300 Subject: [ofa-general] Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch In-Reply-To: <20070911115221.GB31103@mellanox.co.il> References: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> <46E64C61.7090607@voltaire.com> <20070911115221.GB31103@mellanox.co.il> Message-ID: <46E6838F.3060005@voltaire.com> Michael S. Tsirkin wrote: >> Quoting Or Gerlitz : >> I see the below non upstream patch in OFED 1.3 (I guess its also in 1.2 etc). > I plan to remove it. Thanks for reminding me. I see. Are there more patches in the kernel_patches/fixes which can be removed now? do you need help with this cleanup? >> Is this needed also for qos with connectx? > No. thanks for the clarification. Or. 
From mst at dev.mellanox.co.il Tue Sep 11 05:03:59 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Sep 2007 15:03:59 +0300 Subject: [ofa-general] Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch In-Reply-To: <46E6838F.3060005@voltaire.com> References: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> <46E64C61.7090607@voltaire.com> <20070911115221.GB31103@mellanox.co.il> <46E6838F.3060005@voltaire.com> Message-ID: <20070911120359.GC31103@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch > > Michael S. Tsirkin wrote: > >>Quoting Or Gerlitz : > >>I see the below non upstream patch in OFED 1.3 (I guess its also in 1.2 > >>etc). > > >I plan to remove it. Thanks for reminding me. > > I see. Are there more patches in the kernel_patches/fixes which can be > removed now? I don't think so. > do you need help with this cleanup? Help is always good. But I do not need someone to just ping me about every single patch in there, no. -- MST From tziporet at dev.mellanox.co.il Tue Sep 11 05:13:06 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 11 Sep 2007 15:13:06 +0300 Subject: [ofa-general] Re: [ewg] Re: OFED Sep 10 meeting summary on OFED 1.3 development status In-Reply-To: <20070911064602.GF1397@minantech.com> References: <46E638B5.7050207@voltaire.com> <20070911064602.GF1397@minantech.com> Message-ID: <46E68652.60002@mellanox.co.il> Gleb Natapov wrote: >> XRC - 90% >> > When we can expect to see this patch posted to ofa list? > > > Headers were already posted by Michael few weeks ago. 
The code itself will be posted on the list next week. Tziporet From snagai at jp.ibm.com Tue Sep 11 05:28:27 2007 From: snagai at jp.ibm.com (snagai at jp.ibm.com) Date: Tue, 11 Sep 2007 08:28:27 -0400 Subject: [ofa-general] DAPL Package Build Error on PPC64 Arch Message-ID: <13995234.1189513707210.JavaMail.root@wombat.diezmil.com> I am trying to build OFED with the DAPL package enabled, but the build process did not complete due to some errors. I just unpacked the tarball "OFED-1.2.tgz" and ran the build script "build.sh". I need to enable uDAPL on a ppc64 Linux machine, so if someone has already succeeded with this, please show me the way. My build environment and error messages are below. It seems the definition of "__PPC64__" is missing. [ build environment ] - machine arch: ppc64 - OS : Fedora Core6 - compiler: gcc4.1.1 [ error messages in build.log ] Make dapl started make -C src/userspace/dapl \ CPPFLAGS="-I../libibverbs/include/infiniband -I../librdmacm/include \ -I../libibverbs/include -I../../dat/include" \ AM_LDFLAGS="-L/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/libibverbs/src -libverbs -L/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/librdmacm/src/ -lrdmacm" make[1]: Entering directory `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' make all-recursive make[2]: Entering directory `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' Making all in . make[3]: Entering directory `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I.
-I../libibverbs/include/infiniband -I../librdmacm/include -I../libibverbs/include -I../../dat/include -Wall -g -D_GNU_SOURCE -DOS_RELEASE=131078 -DOPENIB -DCQ_WAIT_OBJECT -I./dat/include/ -I./dapl/include/ -I./dapl/common -I./dapl/udapl/linux -I./dapl/openib_cma -m32 -g -O2 -L/usr/lib -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP -MF ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" -c -o dapl_udapl_libdaplcma_la-dapl_init.lo `test -f 'dapl/udapl/dapl_init.c' || echo './'`dapl/udapl/dapl_init.c; \ then mv -f ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" ".deps/dapl_udapl_libdaplcma_la-dapl_init.Plo"; else rm -f ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo"; exit 1; fi mkdir .libs gcc -DHAVE_CONFIG_H -I. -I. -I. -I../libibverbs/include/infiniband -I../librdmacm/include -I../libibverbs/include -I../../dat/include -Wall -g -D_GNU_SOURCE -DOS_RELEASE=131078 -DOPENIB -DCQ_WAIT_OBJECT -I./dat/include/ -I./dapl/include/ -I./dapl/common -I./dapl/udapl/linux -I./dapl/openib_cma -m32 -g -O2 -L/usr/lib -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP -MF .deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo -c dapl/udapl/dapl_init.c -fPIC -DPIC -o .libs/dapl_udapl_libdaplcma_la-dapl_init.o In file included from ./dapl/include/dapl.h:50, from dapl/udapl/dapl_init.c:39: ./dapl/udapl/linux/dapl_osd.h:53:2: error: #error UNDEFINED ARCH make[3]: *** [dapl_udapl_libdaplcma_la-dapl_init.lo] Error 1 make[3]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' make[1]: *** [all] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' make: *** [dapl] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.33577 (%install) RPM build errors: user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from 
/var/tmp/rpm-tmp.33577 (%install) ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr' --define 'build_root /home/testuser/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libcxgb3 --with-libehca --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-sdpnetstat --with-srptools --with-perftest --sysconfdir=/etc --mandir=/usr/share/man' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libcxgb3 --with-libehca --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-sdpnetstat --with-srptools --with-mstflint --with-tvflash --sysconfdir=/etc --mandir=/usr/share/man' --define 'build_32bit 1' --define '_mandir /usr/share/man' /home/testuser/archives/OFED-1.2/SRPMS/ofa_user-1.2-0.src.rpm" -- This message was sent on behalf of snagai at jp.ibm.com at openSubscriber.com http://www.opensubscriber.com/messages/general at lists.openfabrics.org/topic.html From tziporet at dev.mellanox.co.il Tue Sep 11 05:33:53 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 11 Sep 2007 15:33:53 +0300 Subject: [ofa-general] Re: [ewg] OFED Sep 10 meeting summary on OFED 1.3 development status In-Reply-To: <46E64A0B.5030607@voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901563D3F@mtlexch01.mtl.com> <46E64A0B.5030607@voltaire.com> Message-ID: <46E68B31.1070502@mellanox.co.il> Or Gerlitz wrote: > > Based on the fact that OFED 1.3 is now based on 2.6.23-rc5, it simply > does not make --any-- sense to have sixty "fix" patches from day one ( > full list below). Zooming in on IPoIB (twelve patches!)
I see plenty > of non cooked materials: > The same was true of bonding when we started integrating it in OFED 1.2, but eventually we succeeded in making it work. > The discussion/thread that followed the "IB/ipoib: S/G and HW checksum > support" patch posted by Michael is not over yet. There was a > direction offered by Michael which is pending Roland's feedback. We will enhance the patches once we get Roland's feedback. > > As for the "ipoib - add LSO support" patch(es) posted by Eli, I did > not see any response from Roland nor from anyone else (except for me > asking some questions...). Eli/Michael - where does this stand? Are > you planning to push this for 2.6.24? Yes, we are. > > Also I understand that the patch set supports what you call "interrupt > moderation", does this go well with NAPI? If yes, how and why put in > OFED something you never sent for review on the general list nor > attempted to push into the kernel? > > We have sent this for review. > >> RDS - RDMA API - done SDP: Keepalive - done; Asynch IO - >> done, Zero Copy - 80% > > Have these patches ever been sent for review on the general or ewg lists? RDS - patches were reviewed by the RDS developers on the rds list. SDP - Jim sent patches for keepalive. Zcopy will be sent when ready. > > > I don't think we can start with sixty patches, sorry for not bringing > this input before yesterday. I would be happy to hear what others here > think. There are 101 patches in OFED 1.2, and I don't think it harmed anyone. Note that most of the patches are code that will make it into kernel 2.6.24. So the only way to avoid many patches is to change the OFED 1.3 base kernel to 2.6.24. I suggest we wait and see whether we want to do that once 2.6.24 is out. Tziporet From harake at cscs.ch Tue Sep 11 05:40:16 2007 From: harake at cscs.ch (H. N.
HARAKE) Date: Tue, 11 Sep 2007 14:40:16 +0200 Subject: [ofa-general] performance and Kernel support Message-ID: Hi, The second question is regarding performance parameters. Using netperf I reach 4GBit/s between two nodes using OFED version 1.2.51 and 3GBit/s using OFED version 1.1 (10 Gig Mellanox cards). Are there any parameters to apply for improving the performance, or is there any document around? I am running OFED version 1.1 on sles 9 using kernel 2.6.5-7.283. I tried to create an rpm with version 1.2.51 but it fails with a conflict error; the same happens with versions 1.2.5 and 1.2.0 (see below). /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/2.6.5_sles9_sp3/include/linux/slab.h:8: error: conflicting types for `kzalloc' /usr/src/linux-2.6.5-7.283_lustre.1.4.9/include/linux/slab.h:98: error: previous declaration of `kzalloc' /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/drivers/infiniband/core/addr.c:67: warning: initialization from incompatible pointer type /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/2.6.5_sles9_sp3/include/linux/device.h:48: warning: `class_create' defined but not used /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/2.6.5_sles9_sp3/include/linux/device.h:82: warning: `class_destroy' defined but not used /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/2.6.5_sles9_sp3/include/linux/device.h:108: warning: `class_device_create' defined but not used Thanks and Best regards H. N. Harake From mst at dev.mellanox.co.il Tue Sep 11 05:44:50 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Sep 2007 15:44:50 +0300 Subject: [ofa-general] [PATCH v6] IB/mlx4: shrinking WQE In-Reply-To: <20070911060134.GB15363@mellanox.co.il> References: <20070911060134.GB15363@mellanox.co.il> Message-ID: <20070911124450.GA3932@mellanox.co.il> IB/mlx4: shrinking WQE ConnectX supports shrinking wqe, such that a single WR can include multiple units of wqe_shift.
This way, WRs can differ in size, and do not have to be a power of 2 in size, saving memory and speeding up send WR posting. Unfortunately, if we do this, the wqe_index field in the CQE can't be used to look up the WR ID anymore, so do this only if selective signalling is off. Further, on 32-bit platforms, we can't use vmap to make the QP buffer virtually contiguous. Thus we have to use constant-sized WRs to make sure a WR is always fully within a single page-sized chunk. Finally, we use the NOP opcode to avoid wrap-around in the middle of a WR. Since MLX QPs only support SEND, we use constant-sized WRs in this case. We look for the smallest value of wqe_shift such that the resulting number of wqes does not exceed device capabilities. Signed-off-by: Michael S. Tsirkin --- Fixes since v5: - Remove a micro-optimization not directly related to shrinking WQE, making the patch much smaller - Fix stamping code to handle wrap-around in the middle of a single stamping call. Found by internal code review diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 8bf44da..0981f3c 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -331,6 +331,11 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +358,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if
((*cur_qp)->ibqp.srq) { @@ -403,6 +410,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, case MLX4_OPCODE_BIND_MW: wc->opcode = IB_WC_BIND_MW; break; + default: + printk("Unrecognized send opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } } else { wc->byte_len = be32_to_cpu(cqe->byte_cnt); @@ -422,6 +433,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + default: + printk("Unrecognized recv opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 705ff2f..a72ecb9 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -115,6 +115,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int sq_spare_wqes; struct mlx4_ib_wq sq; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index ba0428d..e76ae42 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,88 @@ static void *get_send_wqe(struct mlx4_ib_qp *qp, int n) /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). 
+ * + * When max WR is than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. */ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { - u32 *wqe = get_send_wqe(qp, n); + u32 *wqe; int i; + int s; + int ind; + void *buf; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift); + if (qp->sq_max_wqes_per_wr > 1) { + for (i = 0; i < s; i += 64) { + ind = (i >> qp->sq.wqe_shift) + n; + stamp = ind & qp->sq.wqe_cnt ? cpu_to_be32(0xffffffff) : + cpu_to_be32(0x7fffffff); + buf = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); + wqe = buf + (i & ((1 << qp->sq.wqe_shift) - 1)); + *wqe = stamp; + } + } else { + buf = get_send_wqe(qp, n); + for (i = 64; i < s; i += 64) { + wqe = buf + i; + *wqe = 0xffffffff; + } + } +} + +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + stamp_send_wqe(qp, (n + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1), size); + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = sizeof(struct mlx4_wqe_ctrl_seg); + + if (qp->ibqp.qp_type == IB_QPT_UD) { + struct mlx4_wqe_datagram_seg *dgram = wqe + sizeof *ctrl; + struct mlx4_av *av = (struct mlx4_av *)dgram->av; + memset(dgram, 0, sizeof *dgram); + av->port_pd = cpu_to_be32((qp->port << 24) | to_mpd(qp->ibqp.pd)->pdn); + s += sizeof(struct mlx4_wqe_datagram_seg); + } + + /* Pad the remainder of the WQE with an inline data segment. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). 
+ */ + wmb(); - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); +} + +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -234,9 +307,35 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, return 0; } +static int nop_wqe_shift(enum ib_qp_type type) +{ + /* + * WQE size is at least 0x20. + * UD WQEs must have a datagram segment. + * RC and UC WQEs must have control segment. + * MLX WQEs do not support NOP. + */ + switch (type) { + case IB_QPT_UD: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_datagram_seg), + (size_t)0x20))); + case IB_QPT_SMI: + case IB_QPT_GSI: + return -EINVAL; + case IB_QPT_UC: + case IB_QPT_RC: + default: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg), + (size_t)0x20))); + } +} + static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +351,60 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << 
qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this wqe_index field in CQE + * can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. + * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contigious. Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. + * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * Since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + qp->sq.wqe_shift = nop_wqe_shift(type); + if (!qp->sq_signal_bits || BITS_PER_LONG != 64 || qp->sq.wqe_shift < 0) + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. 
+ */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +416,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +455,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +551,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1045,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1228,13 +1369,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr, unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { @@ -1250,7 +1392,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? @@ -1358,16 +1500,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = (ind + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1); + ind += DIV_ROUND_UP(size * 16, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). 
*/ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size * 16); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1389,8 +1538,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size * 16); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c index f8d63d3..0fce74d 100644 --- a/drivers/net/mlx4/alloc.c +++ b/drivers/net/mlx4/alloc.c @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + kfree(pages); + if (!buf->u.direct.buf) + goto err_free; + } } return 0; @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf) dma_free_coherent(&dev->pdev->dev, size, buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index cfb78fb..bd3ed64 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -185,7 +185,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct mlx4_buf_list *page_list; } u; diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index 3968b94..53b34f7 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -155,6 +155,7 @@ struct mlx4_qp_context { 
}; enum { + MLX4_WQE_CTRL_NEC = 1 << 29, MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, -- MST From mst at dev.mellanox.co.il Tue Sep 11 05:55:04 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Sep 2007 15:55:04 +0300 Subject: [ofa-general] Re: performance and Kernel support In-Reply-To: References: Message-ID: <20070911125504.GA4237@mellanox.co.il> > I am running OFED version 1.1 on sles 9 using kernel 2.6.5-7.283, I > tried to to create rpm with version 1.2.51 > but it fails with conflict error, the same situation with version > 1.2.5 and 1.2.0 (check below) > > > /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/ > 2.6.5_sles9_sp3/include/linux/slab.h:8: error: conflicting types for > `kzalloc' > /usr/src/linux-2.6.5-7.283_lustre.1.4.9/include/linux/slab.h:98: > error: previous declaration of `kzalloc' > /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/drivers/infiniband/core/ > addr.c:67: warning: initialization from incompatible pointer type > /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/ > 2.6.5_sles9_sp3/include/linux/device.h:48: warning: `class_create' > defined but not used > /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/ > 2.6.5_sles9_sp3/include/linux/device.h:82: warning: `class_destroy' > defined but not used > /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/ > 2.6.5_sles9_sp3/include/linux/device.h:108: warning: > `class_device_create' defined but not used It seems clear that you are running a heavily patched kernel which is likely not supported by ofed, and not the vanilla sles9 sp3 one, which is. -- MST From harake at cscs.ch Tue Sep 11 05:58:30 2007 From: harake at cscs.ch (H. N. 
HARAKE) Date: Tue, 11 Sep 2007 14:58:30 +0200 Subject: [ofa-general] Re: performance and Kernel support In-Reply-To: <20070911125504.GA4237@mellanox.co.il> References: <20070911125504.GA4237@mellanox.co.il> Message-ID: <602F073C-E8E3-4D27-8BAF-5A7ACE72A80E@cscs.ch> But I have no problem using ofed version 1.1 Thanks On 11-Sep-2007, at 14:55, Michael S. Tsirkin wrote: >> I am running OFED version 1.1 on sles 9 using kernel 2.6.5-7.283, I >> tried to to create rpm with version 1.2.51 >> but it fails with conflict error, the same situation with version >> 1.2.5 and 1.2.0 (check below) >> >> >> /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/ >> 2.6.5_sles9_sp3/include/linux/slab.h:8: error: conflicting types for >> `kzalloc' >> /usr/src/linux-2.6.5-7.283_lustre.1.4.9/include/linux/slab.h:98: >> error: previous declaration of `kzalloc' >> /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/drivers/infiniband/core/ >> addr.c:67: warning: initialization from incompatible pointer type >> /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/ >> 2.6.5_sles9_sp3/include/linux/device.h:48: warning: `class_create' >> defined but not used >> /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/ >> 2.6.5_sles9_sp3/include/linux/device.h:82: warning: `class_destroy' >> defined but not used >> /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2.5.1/kernel_addons/backport/ >> 2.6.5_sles9_sp3/include/linux/device.h:108: warning: >> `class_device_create' defined but not used > > It seems clear that you are running a heavily patched kernel > which is likely not supported by ofed, and not the vanilla sles9 > sp3 one, > which is. 
> > -- > MST From kliteyn at dev.mellanox.co.il Tue Sep 11 06:01:29 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 11 Sep 2007 16:01:29 +0300 Subject: [ofa-general] [PATCH] osm: QoS - changing 'no_qos' option to 'qos' Message-ID: <46E691A9.90308@dev.mellanox.co.il> Changing OpenSM option "no_qos" with default value 'TRUE 'to "qos" with deafult value 'FALSE' Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_subnet.h | 4 ++-- opensm/opensm/main.c | 2 +- opensm/opensm/osm_link_mgr.c | 2 +- opensm/opensm/osm_prtn_config.c | 2 +- opensm/opensm/osm_qos.c | 2 +- opensm/opensm/osm_sa_multipath_record.c | 10 +++++----- opensm/opensm/osm_sa_path_record.c | 10 +++++----- opensm/opensm/osm_subnet.c | 14 +++++++------- 8 files changed, 23 insertions(+), 23 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 7e1a3e7..dada8bf 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -257,7 +257,7 @@ typedef struct _osm_subn_opt { unsigned long log_max_size; char *partition_config_file; boolean_t no_partition_enforcement; - boolean_t no_qos; + boolean_t qos; char *qos_policy_file; boolean_t accum_log_file; char *console; @@ -400,7 +400,7 @@ typedef struct _osm_subn_opt { * specified the log file will be truncated upon reaching * this limit. * -* no_qos +* qos * Boolean that specifies whether the OpenSM QoS functionality * should be off or on. 
* diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 08d654e..2d5e607 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -826,7 +826,7 @@ int main(int argc, char *argv[]) break; case 'Q': - opt.no_qos = FALSE; + opt.qos = TRUE; break; case 'Y': diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c index 5a6a81c..d5be7b5 100644 --- a/opensm/opensm/osm_link_mgr.c +++ b/opensm/opensm/osm_link_mgr.c @@ -352,7 +352,7 @@ __osm_link_mgr_set_physp_pi(IN osm_link_mgr_t * const p_mgr, send_set = TRUE; /* provide the vl_high_limit from the qos mgr */ - if (p_mgr->p_subn->opt.no_qos == FALSE && + if (p_mgr->p_subn->opt.qos && p_physp->vl_high_limit != p_old_pi->vl_high_limit) { send_set = TRUE; p_pi->vl_high_limit = p_physp->vl_high_limit; diff --git a/opensm/opensm/osm_prtn_config.c b/opensm/opensm/osm_prtn_config.c index 9abf3e8..5034aa0 100644 --- a/opensm/opensm/osm_prtn_config.c +++ b/opensm/opensm/osm_prtn_config.c @@ -109,7 +109,7 @@ static int partition_create(unsigned lineno, struct part_conf *conf, if (!conf->p_prtn) return -1; - if (conf->p_subn->opt.no_qos) { + if (!conf->p_subn->opt.qos) { if (conf->sl != OSM_DEFAULT_SL) { osm_log(conf->p_log, OSM_LOG_ERROR, "partition_create: Overriding SL %d to default SL %d on partition %s as QoS not enabled\n", diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index dff9996..c6641fc 100644 --- a/opensm/opensm/osm_qos.c +++ b/opensm/opensm/osm_qos.c @@ -280,7 +280,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) unsigned force_update; uint8_t i; - if (p_osm->subn.opt.no_qos) + if (!p_osm->subn.opt.qos) return OSM_SIGNAL_DONE; OSM_LOG_ENTER(&p_osm->log, osm_qos_setup); diff --git a/opensm/opensm/osm_sa_multipath_record.c b/opensm/opensm/osm_sa_multipath_record.c index 690f9e7..5c5155a 100644 --- a/opensm/opensm/osm_sa_multipath_record.c +++ b/opensm/opensm/osm_sa_multipath_record.c @@ -301,7 +301,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, } } - 
if (!p_rcv->p_subn->opt.no_qos) { + if (p_rcv->p_subn->opt.qos) { /* * Whether this node is switch or CA, the IN port for @@ -427,7 +427,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, if (rate > ib_port_info_compute_rate(p_pi)) rate = ib_port_info_compute_rate(p_pi); - if (!p_rcv->p_subn->opt.no_qos) { + if (p_rcv->p_subn->opt.qos) { /* * Check SL2VL table of the switch and update valid SLs */ @@ -470,7 +470,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, * Get QoS Level object according to the MultiPath request * and adjust MultiPath parameters according to QoS settings */ - if ( !p_rcv->p_subn->opt.no_qos && + if ( p_rcv->p_subn->opt.qos && p_rcv->p_subn->p_qos_policy && (p_qos_level = osm_qos_policy_get_qos_level_by_mpr( p_rcv->p_subn->p_qos_policy, p_mpr, @@ -791,7 +791,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, } else required_sl = p_prtn->sl; - } else if (!p_rcv->p_subn->opt.no_qos) { + } else if (p_rcv->p_subn->opt.qos) { if (valid_sl_mask & (1 << OSM_DEFAULT_SL)) required_sl = OSM_DEFAULT_SL; else { @@ -804,7 +804,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, else required_sl = OSM_DEFAULT_SL; - if (!p_rcv->p_subn->opt.no_qos && !(valid_sl_mask & (1 << required_sl))) { + if (p_rcv->p_subn->opt.qos && !(valid_sl_mask & (1 << required_sl))) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_mpr_rcv_get_path_parms: ERR 451F: " "Selected SL (%u) leads to VL15\n", required_sl); diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c index c60fb6e..c8b3892 100644 --- a/opensm/opensm/osm_sa_path_record.c +++ b/opensm/opensm/osm_sa_path_record.c @@ -313,7 +313,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, } } - if (!p_rcv->p_subn->opt.no_qos) { + if (p_rcv->p_subn->opt.qos) { /* * Whether this node is switch or CA, the IN port for @@ -438,7 +438,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, if (rate > 
ib_port_info_compute_rate(p_pi)) rate = ib_port_info_compute_rate(p_pi); - if (!p_rcv->p_subn->opt.no_qos) { + if (p_rcv->p_subn->opt.qos) { /* * Check SL2VL table of the switch and update valid SLs */ @@ -481,7 +481,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, * Get QoS Level object according to the path request * and adjust path parameters according to QoS settings */ - if ( !p_rcv->p_subn->opt.no_qos && + if ( p_rcv->p_subn->opt.qos && p_rcv->p_subn->p_qos_policy && (p_qos_level = osm_qos_policy_get_qos_level_by_pr( p_rcv->p_subn->p_qos_policy, p_pr, @@ -813,7 +813,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, sl = OSM_DEFAULT_SL; } else sl = p_prtn->sl; - } else if (!p_rcv->p_subn->opt.no_qos) { + } else if (p_rcv->p_subn->opt.qos) { if (valid_sl_mask & (1 << OSM_DEFAULT_SL)) sl = OSM_DEFAULT_SL; else { @@ -826,7 +826,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, else sl = OSM_DEFAULT_SL; - if (!p_rcv->p_subn->opt.no_qos && !(valid_sl_mask & (1 << sl))) { + if (p_rcv->p_subn->opt.qos && !(valid_sl_mask & (1 << sl))) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F24: " "Selected SL (%u) leads to VL15\n", sl); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 58803e1..3895732 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -454,7 +454,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->log_max_size = 0; p_opt->partition_config_file = OSM_DEFAULT_PARTITION_CONFIG_FILE; p_opt->no_partition_enforcement = FALSE; - p_opt->no_qos = TRUE; + p_opt->qos = FALSE; p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; p_opt->accum_log_file = TRUE; p_opt->port_profile_switch_nodes = FALSE; @@ -730,7 +730,7 @@ ib_api_status_t osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn) fclose(opts_file); /* read QoS policy config file */ - if (!p_subn->opt.no_qos) + if (p_subn->opt.qos) osm_qos_parse_policy_file(p_subn); 
return IB_SUCCESS; @@ -950,7 +950,7 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts) p_opts->console = OSM_DEFAULT_CONSOLE; } - if (p_opts->no_qos == FALSE) { + if (p_opts->qos) { subn_verify_max_vls(&(p_opts->qos_options.max_vls), "qos_max_vls"); subn_verify_max_vls(&(p_opts->qos_ca_options.max_vls), @@ -1184,7 +1184,7 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_boolean("no_partition_enforcement", p_key, p_val, &p_opts->no_partition_enforcement); - opts_unpack_boolean("no_qos", p_key, p_val, &p_opts->no_qos); + opts_unpack_boolean("qos", p_key, p_val, &p_opts->qos); opts_unpack_charp("qos_policy_file", p_key, p_val, &p_opts->qos_policy_file); @@ -1551,11 +1551,11 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) fprintf(opts_file, "#\n# QoS OPTIONS\n#\n" - "# Disable QoS setup\n" - "no_qos %s\n\n" + "# Enable QoS setup\n" + "qos %s\n\n" "# QoS policy file to be used\n" "qos_policy_file %s\n\n", - p_opts->no_qos ? "TRUE" : "FALSE", + p_opts->qos ? "TRUE" : "FALSE", p_opts->qos_policy_file); subn_dump_qos_options(opts_file, -- 1.5.1.4 From fenkes at de.ibm.com Tue Sep 11 06:18:25 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:18:25 +0200 Subject: [ofa-general] [PATCH 00/12] IB/ehca: New features and fixes for 2.6.24 Message-ID: <200709111518.26276.fenkes@de.ibm.com> Here are some fresh eHCA driver features and fixes for your reviewing pleasure.
They have passed internal testing and checkpatch.pl, so we think they are ready for inclusion.

[01/12] adds userspace support for small QPs
[02/12] changes a nit in firmware communication
[03/12] adds support for more than 4096 QPs/CQs in user space
[04/12] enables mapping firmware contexts into uspace on 64K-page kernels
[05/12] changes hvCall debug trace formatting
[06/12] outputs return codes as signed decimal integers
[07/12] makes warnings also appear in non-debug mode, like they should
[08/12] replaces get_paca()->paca_index by the portable smp_processor_id()
[09/12] checks the allowed max number of SGEs when creating a QP
[10/12] fixes some Path Migration problems
[11/12] works around a firmware race condition
[12/12] bumps the driver's version number

The patches should apply cleanly, in order, against Roland's git. Please review the changes and apply the patches for 2.6.24 if they are okay.

Regards,
  Joachim

-- 
Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer
IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev.
2) Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany eMail: fenkes at de.ibm.com From fenkes at de.ibm.com Tue Sep 11 06:26:33 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:26:33 +0200 Subject: [ofa-general] [PATCH 01/12] IB/ehca: Small QP userspace support In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111526.34394.fenkes@de.ibm.com> From: Stefan Roscher Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 7 +++---- drivers/infiniband/hw/ehca/ipz_pt_fn.c | 1 + 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 84d435a..13b61c3 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -273,6 +273,7 @@ static inline void queue2resp(struct ipzu_queue_resp *resp, resp->queue_length = queue->queue_length; resp->pagesize = queue->pagesize; resp->toggle_state = queue->toggle_state; + resp->offset = queue->offset; } /* @@ -598,8 +599,7 @@ static struct ehca_qp *internal_create_qp( parms.squeue.max_sge = max_send_sge; parms.rqueue.max_sge = max_recv_sge; - if (EHCA_BMASK_GET(HCA_CAP_MINI_QP, shca->hca_cap) - && !(context && udata)) { /* no small QP support in userspace ATM */ + if (EHCA_BMASK_GET(HCA_CAP_MINI_QP, shca->hca_cap)) { if (HAS_SQ(my_qp)) ehca_determine_small_queue( &parms.squeue, max_send_sge, is_llqp); @@ -741,8 +741,7 @@ static struct ehca_qp *internal_create_qp( resp.ext_type = my_qp->ext_type; resp.qkey = my_qp->qkey; resp.real_qp_num = my_qp->real_qp_num; - resp.ipz_rqueue.offset = my_qp->ipz_rqueue.offset; - resp.ipz_squeue.offset = my_qp->ipz_squeue.offset; + if (HAS_SQ(my_qp)) queue2resp(&resp.ipz_squeue, &my_qp->ipz_squeue); if (HAS_RQ(my_qp)) diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c index 29bd476..661f8db 100644 --- 
a/drivers/infiniband/hw/ehca/ipz_pt_fn.c +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c @@ -158,6 +158,7 @@ static int alloc_small_queue_page(struct ipz_queue *queue, struct ehca_pd *pd) queue->queue_pages[0] = (void *)(page->page | (bit << (order + 9))); queue->small_page = page; + queue->offset = bit << (order + 9); return 1; out: -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:29:07 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:29:07 +0200 Subject: [ofa-general] [PATCH 02/12] IB/ehca: Add 1 is not longer needed because of firmware interface change In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111529.07935.fenkes@de.ibm.com> From: Stefan Roscher Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/hcp_if.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 24f4541..8534061 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -317,9 +317,9 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, max_r10_reg = EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_SEND_WR, - parms->squeue.max_wr + 1) + parms->squeue.max_wr) | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_RECV_WR, - parms->rqueue.max_wr + 1) + parms->rqueue.max_wr) | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_SEND_SGE, parms->squeue.max_sge) | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_RECV_SGE, -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:29:39 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:29:39 +0200 Subject: [ofa-general] [PATCH 03/12] IB/ehca: Support more than 4k QPs for userspace and kernelspace In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111529.40292.fenkes@de.ibm.com> From: Stefan Roscher Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_cq.c 
| 7 ++++++- drivers/infiniband/hw/ehca/ehca_main.c | 2 +- drivers/infiniband/hw/ehca/ehca_qp.c | 9 +++++++-- drivers/infiniband/hw/ehca/ehca_uverbs.c | 22 +++++++++++----------- 4 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 81aff36..a6f17e4 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -166,7 +166,6 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, write_lock_irqsave(&ehca_cq_idr_lock, flags); ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token); write_unlock_irqrestore(&ehca_cq_idr_lock, flags); - } while (ret == -EAGAIN); if (ret) { @@ -176,6 +175,12 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, goto create_cq_exit1; } + if (my_cq->token > 0x1FFFFFF) { + cq = ERR_PTR(-ENOMEM); + ehca_err(device, "Invalid number of cq. device=%p", device); + goto create_cq_exit2; + } + /* * CQs maximum depth is 4GB-64, but we need additional 20 as buffer * for receiving errors CQEs. 
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 99036b6..1a2c542 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -380,7 +380,7 @@ int ehca_init_device(struct ehca_shca *shca) strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX); shca->ib_device.owner = THIS_MODULE; - shca->ib_device.uverbs_abi_ver = 7; + shca->ib_device.uverbs_abi_ver = 8; shca->ib_device.uverbs_cmd_mask = (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 13b61c3..e886e3b 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -557,7 +557,6 @@ static struct ehca_qp *internal_create_qp( write_lock_irqsave(&ehca_qp_idr_lock, flags); ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token); write_unlock_irqrestore(&ehca_qp_idr_lock, flags); - } while (ret == -EAGAIN); if (ret) { @@ -566,11 +565,17 @@ static struct ehca_qp *internal_create_qp( goto create_qp_exit0; } + if (my_qp->token > 0x1FFFFFF) { + ret = -EINVAL; + ehca_err(pd->device, "Invalid number of qp"); + goto create_qp_exit1; + } + parms.servicetype = ibqptype2servicetype(qp_type); if (parms.servicetype < 0) { ret = -EINVAL; ehca_err(pd->device, "Invalid qp_type=%x", qp_type); - goto create_qp_exit0; + goto create_qp_exit1; } if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index 4bc687f..3340f49 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -164,7 +164,7 @@ static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq, int ret; switch (rsrc_type) { - case 1: /* galpa fw handle */ + case 0: /* galpa fw handle */ ehca_dbg(cq->ib_cq.device, "cq_num=%x fw", cq->cq_number); ret = ehca_mmap_fw(vma, &cq->galpas, 
&cq->mm_count_galpa); if (unlikely(ret)) { @@ -175,7 +175,7 @@ static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq, } break; - case 2: /* cq queue_addr */ + case 1: /* cq queue_addr */ ehca_dbg(cq->ib_cq.device, "cq_num=%x queue", cq->cq_number); ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue); if (unlikely(ret)) { @@ -201,7 +201,7 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, int ret; switch (rsrc_type) { - case 1: /* galpa fw handle */ + case 0: /* galpa fw handle */ ehca_dbg(qp->ib_qp.device, "qp_num=%x fw", qp->ib_qp.qp_num); ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa); if (unlikely(ret)) { @@ -212,7 +212,7 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, } break; - case 2: /* qp rqueue_addr */ + case 1: /* qp rqueue_addr */ ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue", qp->ib_qp.qp_num); ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, @@ -225,7 +225,7 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, } break; - case 3: /* qp squeue_addr */ + case 2: /* qp squeue_addr */ ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue", qp->ib_qp.qp_num); ret = ehca_mmap_queue(vma, &qp->ipz_squeue, @@ -249,10 +249,10 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) { - u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; - u32 idr_handle = fileoffset >> 32; - u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... */ - u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u64 fileoffset = vma->vm_pgoff; + u32 idr_handle = fileoffset & 0x1FFFFFF; + u32 q_type = (fileoffset >> 27) & 0x1; /* CQ, QP,... 
*/ + u32 rsrc_type = (fileoffset >> 25) & 0x3; /* sq,rq,cmnd_window */ u32 cur_pid = current->tgid; u32 ret; struct ehca_cq *cq; @@ -261,7 +261,7 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) struct ib_uobject *uobject; switch (q_type) { - case 1: /* CQ */ + case 0: /* CQ */ read_lock(&ehca_cq_idr_lock); cq = idr_find(&ehca_cq_idr, idr_handle); read_unlock(&ehca_cq_idr_lock); @@ -289,7 +289,7 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) } break; - case 2: /* QP */ + case 1: /* QP */ read_lock(&ehca_qp_idr_lock); qp = idr_find(&ehca_qp_idr, idr_handle); read_unlock(&ehca_qp_idr_lock); -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:31:06 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:31:06 +0200 Subject: [ofa-general] [PATCH 04/12] IB/ehca: Use remap_4k_pfn() to map firmware contexts to user space In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111531.07604.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Use Paul's new remap_4k_pfn() function to map our 4K firmware contexts into user space on 64K-page machines without exposing neighboring firmware contexts. Return the context's offset within a 64K page to user space so it can determine the proper virtual address. 
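As a rough illustration of the point above, user space could combine the returned offset with the base address it received from mmap(). This is only a sketch under assumptions: fw_handle_offset() and fw_context_addr() are hypothetical helper names that do not appear in the patch, and the page size is passed in explicitly instead of being taken from the kernel's PAGE_SIZE.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Offset of a firmware context inside its (possibly 64K) page.
 * Mirrors the "fw_handle & (PAGE_SIZE - 1)" expression used in the patch.
 * page_size must be a power of two.
 */
static uint32_t fw_handle_offset(uint64_t fw_handle, uint64_t page_size)
{
	return (uint32_t)(fw_handle & (page_size - 1));
}

/*
 * Hypothetical user-space side: the mapping starts at a page boundary,
 * so the context itself lives at base + offset.
 */
static void *fw_context_addr(void *mapped_base, uint32_t fw_handle_ofs)
{
	return (char *)mapped_base + fw_handle_ofs;
}
```

With a 4K page size the offset is always zero for a 4K-aligned handle, which would match the behaviour before this change; only on 64K-page kernels does the offset become significant.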
For details about remap_4k_pfn(), see commit 721151d0 or http://patchwork.ozlabs.org/linuxppc/patch?id=10281 Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 4 +++- drivers/infiniband/hw/ehca/ehca_cq.c | 2 ++ drivers/infiniband/hw/ehca/ehca_qp.c | 2 ++ drivers/infiniband/hw/ehca/ehca_uverbs.c | 6 +++--- 4 files changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index b5e9603..206d4eb 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -337,6 +337,8 @@ struct ehca_create_cq_resp { u32 cq_number; u32 token; struct ipzu_queue_resp ipz_queue; + u32 fw_handle_ofs; + u32 dummy; }; struct ehca_create_qp_resp { @@ -347,7 +349,7 @@ struct ehca_create_qp_resp { u32 qkey; /* qp_num assigned by ehca: sqp0/1 may have got different numbers */ u32 real_qp_num; - u32 dummy; /* padding for 8 byte alignment */ + u32 fw_handle_ofs; struct ipzu_queue_resp ipz_squeue; struct ipzu_queue_resp ipz_rqueue; }; diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index a6f17e4..d68603d 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -281,6 +281,8 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, resp.ipz_queue.queue_length = ipz_queue->queue_length; resp.ipz_queue.pagesize = ipz_queue->pagesize; resp.ipz_queue.toggle_state = ipz_queue->toggle_state; + resp.fw_handle_ofs = (u32) + (my_cq->galpas.user.fw_handle & (PAGE_SIZE - 1)); if (ib_copy_to_udata(udata, &resp, sizeof(resp))) { ehca_err(device, "Copy to udata failed."); goto create_cq_exit4; diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index e886e3b..3a3880f 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -751,6 +751,8 @@ static struct ehca_qp *internal_create_qp( 
queue2resp(&resp.ipz_squeue, &my_qp->ipz_squeue); if (HAS_RQ(my_qp)) queue2resp(&resp.ipz_rqueue, &my_qp->ipz_rqueue); + resp.fw_handle_ofs = (u32) + (my_qp->galpas.user.fw_handle & (PAGE_SIZE - 1)); if (ib_copy_to_udata(udata, &resp, sizeof resp)) { ehca_err(pd->device, "Copy to udata failed"); diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index 3340f49..84a16bc 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -109,7 +109,7 @@ static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas, u64 vsize, physical; vsize = vma->vm_end - vma->vm_start; - if (vsize != EHCA_PAGESIZE) { + if (vsize < EHCA_PAGESIZE) { ehca_gen_err("invalid vsize=%lx", vma->vm_end - vma->vm_start); return -EINVAL; } @@ -118,8 +118,8 @@ static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas, vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical); /* VM_IO | VM_RESERVED are set by remap_pfn_range() */ - ret = remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, - vsize, vma->vm_page_prot); + ret = remap_4k_pfn(vma, vma->vm_start, physical >> EHCA_PAGESHIFT, + vma->vm_page_prot); if (unlikely(ret)) { ehca_gen_err("remap_pfn_range() failed ret=%x", ret); return -ENOMEM; -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:31:49 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:31:49 +0200 Subject: [ofa-general] [PATCH 05/12] IB/ehca: Refactor hvcall tracing In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111531.50039.fenkes@de.ibm.com> Change hvcall trace output towards better readability: reg numbers instead of argument numbers, return code as signed decimal instead of unsigned hex. 
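The register-format macros introduced in the diff below rely on the compiler concatenating adjacent string literals. A minimal user-space sketch of the same idea follows; the two macro definitions are copied from the patch, while format_hcall7() is a hypothetical stand-in for ehca_gen_dbg() that writes into a buffer so the result can be inspected.

```c
#include <stddef.h>
#include <stdio.h>

/* Same definitions as in the patch: the longer format reuses the shorter one. */
#define HCALL4_REGS_FORMAT "r4=%lx r5=%lx r6=%lx r7=%lx"
#define HCALL7_REGS_FORMAT HCALL4_REGS_FORMAT " r8=%lx r9=%lx r10=%lx"

/*
 * Hypothetical helper: formats an opcode and seven register arguments.
 * Adjacent string literals are merged at compile time, so the format
 * string below is a single literal by the time snprintf() sees it.
 */
static int format_hcall7(char *buf, size_t len, unsigned long opcode,
			 const unsigned long arg[7])
{
	return snprintf(buf, len, "opcode=%lx " HCALL7_REGS_FORMAT,
			opcode, arg[0], arg[1], arg[2], arg[3],
			arg[4], arg[5], arg[6]);
}
```

Defining the register list once keeps the debug and error traces in the same layout, which appears to be the motivation for the refactoring.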
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/hcp_if.c | 57 ++++++++++++++-------------------- 1 files changed, 24 insertions(+), 33 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 8534061..32f465b 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -84,6 +84,10 @@ #define H_MP_SHUTDOWN EHCA_BMASK_IBM(48, 48) #define H_MP_RESET_QKEY_CTR EHCA_BMASK_IBM(49, 49) +#define HCALL4_REGS_FORMAT "r4=%lx r5=%lx r6=%lx r7=%lx" +#define HCALL7_REGS_FORMAT HCALL4_REGS_FORMAT " r8=%lx r9=%lx r10=%lx" +#define HCALL9_REGS_FORMAT HCALL7_REGS_FORMAT " r11=%lx r12=%lx" + static DEFINE_SPINLOCK(hcall_lock); static u32 get_longbusy_msecs(int longbusy_rc) @@ -118,8 +122,7 @@ static long ehca_plpar_hcall_norets(unsigned long opcode, long ret; int i, sleep_msecs; - ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx " - "arg5=%lx arg6=%lx arg7=%lx", + ehca_gen_dbg("opcode=%lx " HCALL7_REGS_FORMAT, opcode, arg1, arg2, arg3, arg4, arg5, arg6, arg7); for (i = 0; i < 5; i++) { @@ -133,16 +136,13 @@ static long ehca_plpar_hcall_norets(unsigned long opcode, } if (ret < H_SUCCESS) - ehca_gen_err("opcode=%lx ret=%lx" - " arg1=%lx arg2=%lx arg3=%lx arg4=%lx" - " arg5=%lx arg6=%lx arg7=%lx ", - opcode, ret, - arg1, arg2, arg3, arg4, arg5, - arg6, arg7); - - ehca_gen_dbg("opcode=%lx ret=%lx", opcode, ret); - return ret; + ehca_gen_err("opcode=%lx ret=%li " HCALL7_REGS_FORMAT, + opcode, ret, arg1, arg2, arg3, + arg4, arg5, arg6, arg7); + else + ehca_gen_dbg("opcode=%lx ret=%li", opcode, ret); + return ret; } return H_BUSY; @@ -164,10 +164,8 @@ static long ehca_plpar_hcall9(unsigned long opcode, int i, sleep_msecs, lock_is_set = 0; unsigned long flags = 0; - ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx " - "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx", - opcode, arg1, arg2, arg3, arg4, arg5, arg6, arg7, - arg8, arg9); + ehca_gen_dbg("INPUT -- opcode=%lx " 
HCALL9_REGS_FORMAT, opcode, + arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8, arg9); for (i = 0; i < 5; i++) { if ((opcode == H_ALLOC_RESOURCE) && (arg2 == 5)) { @@ -188,26 +186,19 @@ static long ehca_plpar_hcall9(unsigned long opcode, continue; } - if (ret < H_SUCCESS) - ehca_gen_err("opcode=%lx ret=%lx" - " arg1=%lx arg2=%lx arg3=%lx arg4=%lx" - " arg5=%lx arg6=%lx arg7=%lx arg8=%lx" - " arg9=%lx" - " out1=%lx out2=%lx out3=%lx out4=%lx" - " out5=%lx out6=%lx out7=%lx out8=%lx" - " out9=%lx", - opcode, ret, - arg1, arg2, arg3, arg4, arg5, - arg6, arg7, arg8, arg9, - outs[0], outs[1], outs[2], outs[3], + if (ret < H_SUCCESS) { + ehca_gen_err("INPUT -- opcode=%lx " HCALL9_REGS_FORMAT, + opcode, arg1, arg2, arg3, arg4, arg5, + arg6, arg7, arg8, arg9); + ehca_gen_err("OUTPUT -- ret=%li " HCALL9_REGS_FORMAT, + ret, outs[0], outs[1], outs[2], outs[3], + outs[4], outs[5], outs[6], outs[7], + outs[8]); + } else + ehca_gen_dbg("OUTPUT -- ret=%li " HCALL9_REGS_FORMAT, + ret, outs[0], outs[1], outs[2], outs[3], outs[4], outs[5], outs[6], outs[7], outs[8]); - - ehca_gen_dbg("opcode=%lx ret=%lx out1=%lx out2=%lx out3=%lx " - "out4=%lx out5=%lx out6=%lx out7=%lx out8=%lx " - "out9=%lx", - opcode, ret, outs[0], outs[1], outs[2], outs[3], - outs[4], outs[5], outs[6], outs[7], outs[8]); return ret; } -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:32:22 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:32:22 +0200 Subject: [ofa-general] [PATCH 06/12] IB/ehca: Print return codes as signed decimal integers In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111532.23051.fenkes@de.ibm.com> ...because -12 is easier to read than FFFFFFF4. 
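To make the difference concrete, here is a small user-space sketch of the two conversions applied to an int return code. The helper names are made up for illustration, and the hex result assumes a 32-bit int, as on the usual Linux targets.

```c
#include <stddef.h>
#include <stdio.h>

/* Old style: the return code printed as unsigned hex. */
static void format_rc_hex(char *buf, size_t len, int rc)
{
	snprintf(buf, len, "ret=%x", (unsigned int)rc);
}

/* New style: the same return code printed as a signed decimal. */
static void format_rc_dec(char *buf, size_t len, int rc)
{
	snprintf(buf, len, "ret=%i", rc);
}
```

For rc = -12 (that is, -ENOMEM), the first helper yields "ret=fffffff4" and the second "ret=-12". For long values such as h_ret, the patch uses %li in the same way.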
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_cq.c | 14 +++--- drivers/infiniband/hw/ehca/ehca_hca.c | 2 +- drivers/infiniband/hw/ehca/ehca_main.c | 24 +++++----- drivers/infiniband/hw/ehca/ehca_mcast.c | 4 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 75 +++++++++++++++--------------- drivers/infiniband/hw/ehca/ehca_qp.c | 46 +++++++++--------- drivers/infiniband/hw/ehca/ehca_reqs.c | 2 +- drivers/infiniband/hw/ehca/ehca_sqp.c | 2 +- drivers/infiniband/hw/ehca/ehca_uverbs.c | 18 ++++---- drivers/infiniband/hw/ehca/hcp_if.c | 20 ++++---- 10 files changed, 103 insertions(+), 104 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index d68603d..79c25f5 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -190,7 +190,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (h_ret != H_SUCCESS) { ehca_err(device, "hipz_h_alloc_resource_cq() failed " - "h_ret=%lx device=%p", h_ret, device); + "h_ret=%li device=%p", h_ret, device); cq = ERR_PTR(ehca2ib_return_code(h_ret)); goto create_cq_exit2; } @@ -198,7 +198,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, ipz_rc = ipz_queue_ctor(NULL, &my_cq->ipz_queue, param.act_pages, EHCA_PAGESIZE, sizeof(struct ehca_cqe), 0, 0); if (!ipz_rc) { - ehca_err(device, "ipz_queue_ctor() failed ipz_rc=%x device=%p", + ehca_err(device, "ipz_queue_ctor() failed ipz_rc=%i device=%p", ipz_rc, device); cq = ERR_PTR(-EINVAL); goto create_cq_exit3; @@ -226,7 +226,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (h_ret < H_SUCCESS) { ehca_err(device, "hipz_h_register_rpage_cq() failed " - "ehca_cq=%p cq_num=%x h_ret=%lx counter=%i " + "ehca_cq=%p cq_num=%x h_ret=%li counter=%i " "act_pages=%i", my_cq, my_cq->cq_number, h_ret, counter, param.act_pages); cq = ERR_PTR(-EINVAL); @@ -238,7 +238,7 @@ struct ib_cq 
*ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if ((h_ret != H_SUCCESS) || vpage) { ehca_err(device, "Registration of pages not " "complete ehca_cq=%p cq_num=%x " - "h_ret=%lx", my_cq, my_cq->cq_number, + "h_ret=%li", my_cq, my_cq->cq_number, h_ret); cq = ERR_PTR(-EAGAIN); goto create_cq_exit4; @@ -246,7 +246,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, } else { if (h_ret != H_PAGE_REGISTERED) { ehca_err(device, "Registration of page failed " - "ehca_cq=%p cq_num=%x h_ret=%lx" + "ehca_cq=%p cq_num=%x h_ret=%li" "counter=%i act_pages=%i", my_cq, my_cq->cq_number, h_ret, counter, param.act_pages); @@ -298,7 +298,7 @@ create_cq_exit3: h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 1); if (h_ret != H_SUCCESS) ehca_err(device, "hipz_h_destroy_cq() failed ehca_cq=%p " - "cq_num=%x h_ret=%lx", my_cq, my_cq->cq_number, h_ret); + "cq_num=%x h_ret=%li", my_cq, my_cq->cq_number, h_ret); create_cq_exit2: write_lock_irqsave(&ehca_cq_idr_lock, flags); @@ -362,7 +362,7 @@ int ehca_destroy_cq(struct ib_cq *cq) cq_num); } if (h_ret != H_SUCCESS) { - ehca_err(device, "hipz_h_destroy_cq() failed h_ret=%lx " + ehca_err(device, "hipz_h_destroy_cq() failed h_ret=%li " "ehca_cq=%p cq_num=%x", h_ret, my_cq, cq_num); return ehca2ib_return_code(h_ret); } diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index cf22472..3436c49 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -352,7 +352,7 @@ int ehca_modify_port(struct ib_device *ibdev, hret = hipz_h_modify_port(shca->ipz_hca_handle, port, cap, props->init_type, port_modify_mask); if (hret != H_SUCCESS) { - ehca_err(&shca->ib_device, "Modify port failed hret=%lx", + ehca_err(&shca->ib_device, "Modify port failed h_ret=%li", hret); ret = -EINVAL; } diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 1a2c542..799f218 100644 --- 
a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -273,7 +273,7 @@ int ehca_sense_attributes(struct ehca_shca *shca) h_ret = hipz_h_query_hca(shca->ipz_hca_handle, rblock); if (h_ret != H_SUCCESS) { - ehca_gen_err("Cannot query device properties. h_ret=%lx", + ehca_gen_err("Cannot query device properties. h_ret=%li", h_ret); ret = -EPERM; goto sense_attributes1; @@ -332,7 +332,7 @@ int ehca_sense_attributes(struct ehca_shca *shca) port = (struct hipz_query_port *)rblock; h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port); if (h_ret != H_SUCCESS) { - ehca_gen_err("Cannot query port properties. h_ret=%lx", + ehca_gen_err("Cannot query port properties. h_ret=%li", h_ret); ret = -EPERM; goto sense_attributes1; @@ -526,13 +526,13 @@ static int ehca_destroy_aqp1(struct ehca_sport *sport) ret = ib_destroy_qp(sport->ibqp_aqp1); if (ret) { - ehca_gen_err("Cannot destroy AQP1 QP. ret=%x", ret); + ehca_gen_err("Cannot destroy AQP1 QP. ret=%i", ret); return ret; } ret = ib_destroy_cq(sport->ibcq_aqp1); if (ret) - ehca_gen_err("Cannot destroy AQP1 CQ. ret=%x", ret); + ehca_gen_err("Cannot destroy AQP1 CQ. ret=%i", ret); return ret; } @@ -728,7 +728,7 @@ static int __devinit ehca_probe(struct ibmebus_dev *dev, ret = ehca_reg_internal_maxmr(shca, shca->pd, &shca->maxmr); if (ret) { - ehca_err(&shca->ib_device, "Cannot create internal MR ret=%x", + ehca_err(&shca->ib_device, "Cannot create internal MR ret=%i", ret); goto probe5; } @@ -736,7 +736,7 @@ static int __devinit ehca_probe(struct ibmebus_dev *dev, ret = ib_register_device(&shca->ib_device); if (ret) { ehca_err(&shca->ib_device, - "ib_register_device() failed ret=%x", ret); + "ib_register_device() failed ret=%i", ret); goto probe6; } @@ -777,7 +777,7 @@ probe8: ret = ehca_destroy_aqp1(&shca->sport[0]); if (ret) ehca_err(&shca->ib_device, - "Cannot destroy AQP1 for port 1. ret=%x", ret); + "Cannot destroy AQP1 for port 1. 
ret=%i", ret); probe7: ib_unregister_device(&shca->ib_device); @@ -826,7 +826,7 @@ static int __devexit ehca_remove(struct ibmebus_dev *dev) if (ret) ehca_err(&shca->ib_device, "Cannot destroy AQP1 for port %x " - "ret=%x", ret, i); + "ret=%i", ret, i); } } @@ -835,20 +835,20 @@ static int __devexit ehca_remove(struct ibmebus_dev *dev) ret = ehca_dereg_internal_maxmr(shca); if (ret) ehca_err(&shca->ib_device, - "Cannot destroy internal MR. ret=%x", ret); + "Cannot destroy internal MR. ret=%i", ret); ret = ehca_dealloc_pd(&shca->pd->ib_pd); if (ret) ehca_err(&shca->ib_device, - "Cannot destroy internal PD. ret=%x", ret); + "Cannot destroy internal PD. ret=%i", ret); ret = ehca_destroy_eq(shca, &shca->eq); if (ret) - ehca_err(&shca->ib_device, "Cannot destroy EQ. ret=%x", ret); + ehca_err(&shca->ib_device, "Cannot destroy EQ. ret=%i", ret); ret = ehca_destroy_eq(shca, &shca->neq); if (ret) - ehca_err(&shca->ib_device, "Canot destroy NEQ. ret=%x", ret); + ehca_err(&shca->ib_device, "Canot destroy NEQ. 
ret=%i", ret); ib_dealloc_device(&shca->ib_device); diff --git a/drivers/infiniband/hw/ehca/ehca_mcast.c b/drivers/infiniband/hw/ehca/ehca_mcast.c index 32a8706..e3ef026 100644 --- a/drivers/infiniband/hw/ehca/ehca_mcast.c +++ b/drivers/infiniband/hw/ehca/ehca_mcast.c @@ -88,7 +88,7 @@ int ehca_attach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) if (h_ret != H_SUCCESS) ehca_err(ibqp->device, "ehca_qp=%p qp_num=%x hipz_h_attach_mcqp() failed " - "h_ret=%lx", my_qp, ibqp->qp_num, h_ret); + "h_ret=%li", my_qp, ibqp->qp_num, h_ret); return ehca2ib_return_code(h_ret); } @@ -125,7 +125,7 @@ int ehca_detach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) if (h_ret != H_SUCCESS) ehca_err(ibqp->device, "ehca_qp=%p qp_num=%x hipz_h_detach_mcqp() failed " - "h_ret=%lx", my_qp, ibqp->qp_num, h_ret); + "h_ret=%li", my_qp, ibqp->qp_num, h_ret); return ehca2ib_return_code(h_ret); } diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index d97eda3..4c8f3b3 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -159,7 +159,7 @@ struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags) get_dma_mr_exit0: if (IS_ERR(ib_mr)) - ehca_err(&shca->ib_device, "rc=%lx pd=%p mr_access_flags=%x ", + ehca_err(&shca->ib_device, "h_ret=%li pd=%p mr_access_flags=%x", PTR_ERR(ib_mr), pd, mr_access_flags); return ib_mr; } /* end ehca_get_dma_mr() */ @@ -271,7 +271,7 @@ reg_phys_mr_exit1: ehca_mr_delete(e_mr); reg_phys_mr_exit0: if (IS_ERR(ib_mr)) - ehca_err(pd->device, "rc=%lx pd=%p phys_buf_array=%p " + ehca_err(pd->device, "h_ret=%li pd=%p phys_buf_array=%p " "num_phys_buf=%x mr_access_flags=%x iova_start=%p", PTR_ERR(ib_mr), pd, phys_buf_array, num_phys_buf, mr_access_flags, iova_start); @@ -403,8 +403,7 @@ reg_user_mr_exit1: ehca_mr_delete(e_mr); reg_user_mr_exit0: if (IS_ERR(ib_mr)) - ehca_err(pd->device, "rc=%lx pd=%p mr_access_flags=%x" - " udata=%p", + ehca_err(pd->device, "rc=%li 
pd=%p mr_access_flags=%x udata=%p", PTR_ERR(ib_mr), pd, mr_access_flags, udata); return ib_mr; } /* end ehca_reg_user_mr() */ @@ -565,7 +564,7 @@ rereg_phys_mr_exit1: spin_unlock_irqrestore(&e_mr->mrlock, sl_flags); rereg_phys_mr_exit0: if (ret) - ehca_err(mr->device, "ret=%x mr=%p mr_rereg_mask=%x pd=%p " + ehca_err(mr->device, "ret=%i mr=%p mr_rereg_mask=%x pd=%p " "phys_buf_array=%p num_phys_buf=%x mr_access_flags=%x " "iova_start=%p", ret, mr, mr_rereg_mask, pd, phys_buf_array, @@ -607,7 +606,7 @@ int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) h_ret = hipz_h_query_mr(shca->ipz_hca_handle, e_mr, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(mr->device, "hipz_mr_query failed, h_ret=%lx mr=%p " + ehca_err(mr->device, "hipz_mr_query failed, h_ret=%li mr=%p " "hca_hndl=%lx mr_hndl=%lx lkey=%x", h_ret, mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, mr->lkey); @@ -625,7 +624,7 @@ query_mr_exit1: spin_unlock_irqrestore(&e_mr->mrlock, sl_flags); query_mr_exit0: if (ret) - ehca_err(mr->device, "ret=%x mr=%p mr_attr=%p", + ehca_err(mr->device, "ret=%i mr=%p mr_attr=%p", ret, mr, mr_attr); return ret; } /* end ehca_query_mr() */ @@ -667,7 +666,7 @@ int ehca_dereg_mr(struct ib_mr *mr) /* TODO: BUSY: MR still has bound window(s) */ h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); if (h_ret != H_SUCCESS) { - ehca_err(mr->device, "hipz_free_mr failed, h_ret=%lx shca=%p " + ehca_err(mr->device, "hipz_free_mr failed, h_ret=%li shca=%p " "e_mr=%p hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", h_ret, shca, e_mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, mr->lkey); @@ -683,7 +682,7 @@ int ehca_dereg_mr(struct ib_mr *mr) dereg_mr_exit0: if (ret) - ehca_err(mr->device, "ret=%x mr=%p", ret, mr); + ehca_err(mr->device, "ret=%i mr=%p", ret, mr); return ret; } /* end ehca_dereg_mr() */ @@ -708,7 +707,7 @@ struct ib_mw *ehca_alloc_mw(struct ib_pd *pd) h_ret = hipz_h_alloc_resource_mw(shca->ipz_hca_handle, e_mw, e_pd->fw_pd, &hipzout); if 
(h_ret != H_SUCCESS) { - ehca_err(pd->device, "hipz_mw_allocate failed, h_ret=%lx " + ehca_err(pd->device, "hipz_mw_allocate failed, h_ret=%li " "shca=%p hca_hndl=%lx mw=%p", h_ret, shca, shca->ipz_hca_handle.handle, e_mw); ib_mw = ERR_PTR(ehca2ib_return_code(h_ret)); @@ -723,7 +722,7 @@ alloc_mw_exit1: ehca_mw_delete(e_mw); alloc_mw_exit0: if (IS_ERR(ib_mw)) - ehca_err(pd->device, "rc=%lx pd=%p", PTR_ERR(ib_mw), pd); + ehca_err(pd->device, "h_ret=%li pd=%p", PTR_ERR(ib_mw), pd); return ib_mw; } /* end ehca_alloc_mw() */ @@ -750,7 +749,7 @@ int ehca_dealloc_mw(struct ib_mw *mw) h_ret = hipz_h_free_resource_mw(shca->ipz_hca_handle, e_mw); if (h_ret != H_SUCCESS) { - ehca_err(mw->device, "hipz_free_mw failed, h_ret=%lx shca=%p " + ehca_err(mw->device, "hipz_free_mw failed, h_ret=%li shca=%p " "mw=%p rkey=%x hca_hndl=%lx mw_hndl=%lx", h_ret, shca, mw, mw->rkey, shca->ipz_hca_handle.handle, e_mw->ipz_mw_handle.handle); @@ -847,7 +846,7 @@ alloc_fmr_exit1: ehca_mr_delete(e_fmr); alloc_fmr_exit0: if (IS_ERR(ib_fmr)) - ehca_err(pd->device, "rc=%lx pd=%p mr_access_flags=%x " + ehca_err(pd->device, "h_ret=%li pd=%p mr_access_flags=%x " "fmr_attr=%p", PTR_ERR(ib_fmr), pd, mr_access_flags, fmr_attr); return ib_fmr; @@ -916,7 +915,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, map_phys_fmr_exit0: if (ret) - ehca_err(fmr->device, "ret=%x fmr=%p page_list=%p list_len=%x " + ehca_err(fmr->device, "ret=%i fmr=%p page_list=%p list_len=%x " "iova=%lx", ret, fmr, page_list, list_len, iova); return ret; } /* end ehca_map_phys_fmr() */ @@ -979,7 +978,7 @@ int ehca_unmap_fmr(struct list_head *fmr_list) unmap_fmr_exit0: if (ret) - ehca_gen_err("ret=%x fmr_list=%p num_fmr=%x unmap_fmr_cnt=%x", + ehca_gen_err("ret=%i fmr_list=%p num_fmr=%x unmap_fmr_cnt=%x", ret, fmr_list, num_fmr, unmap_fmr_cnt); return ret; } /* end ehca_unmap_fmr() */ @@ -1003,7 +1002,7 @@ int ehca_dealloc_fmr(struct ib_fmr *fmr) h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); if (h_ret != H_SUCCESS) { - 
ehca_err(fmr->device, "hipz_free_mr failed, h_ret=%lx e_fmr=%p " + ehca_err(fmr->device, "hipz_free_mr failed, h_ret=%li e_fmr=%p " "hca_hndl=%lx fmr_hndl=%lx fmr->lkey=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, fmr->lkey); @@ -1016,7 +1015,7 @@ int ehca_dealloc_fmr(struct ib_fmr *fmr) free_fmr_exit0: if (ret) - ehca_err(&shca->ib_device, "ret=%x fmr=%p", ret, fmr); + ehca_err(&shca->ib_device, "ret=%i fmr=%p", ret, fmr); return ret; } /* end ehca_dealloc_fmr() */ @@ -1046,7 +1045,7 @@ int ehca_reg_mr(struct ehca_shca *shca, (u64)iova_start, size, hipz_acl, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_alloc_mr failed, h_ret=%lx " + ehca_err(&shca->ib_device, "hipz_alloc_mr failed, h_ret=%li " "hca_hndl=%lx", h_ret, shca->ipz_hca_handle.handle); ret = ehca2ib_return_code(h_ret); goto ehca_reg_mr_exit0; @@ -1072,9 +1071,9 @@ int ehca_reg_mr(struct ehca_shca *shca, ehca_reg_mr_exit1: h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "h_ret=%lx shca=%p e_mr=%p " + ehca_err(&shca->ib_device, "h_ret=%li shca=%p e_mr=%p " "iova_start=%p size=%lx acl=%x e_pd=%p lkey=%x " - "pginfo=%p num_kpages=%lx num_hwpages=%lx ret=%x", + "pginfo=%p num_kpages=%lx num_hwpages=%lx ret=%i", h_ret, shca, e_mr, iova_start, size, acl, e_pd, hipzout.lkey, pginfo, pginfo->num_kpages, pginfo->num_hwpages, ret); @@ -1083,7 +1082,7 @@ ehca_reg_mr_exit1: } ehca_reg_mr_exit0: if (ret) - ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p " + ehca_err(&shca->ib_device, "ret=%i shca=%p e_mr=%p " "iova_start=%p size=%lx acl=%x e_pd=%p pginfo=%p " "num_kpages=%lx num_hwpages=%lx", ret, shca, e_mr, iova_start, size, acl, e_pd, pginfo, @@ -1127,7 +1126,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, ret = ehca_set_pagebuf(pginfo, rnum, kpage); if (ret) { ehca_err(&shca->ib_device, "ehca_set_pagebuf " - "bad rc, ret=%x rnum=%x kpage=%p", + "bad rc, ret=%i rnum=%x 
kpage=%p", ret, rnum, kpage); goto ehca_reg_mr_rpages_exit1; } @@ -1155,7 +1154,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, */ if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "last " - "hipz_reg_rpage_mr failed, h_ret=%lx " + "hipz_reg_rpage_mr failed, h_ret=%li " "e_mr=%p i=%x hca_hndl=%lx mr_hndl=%lx" " lkey=%x", h_ret, e_mr, i, shca->ipz_hca_handle.handle, @@ -1167,7 +1166,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, ret = 0; } else if (h_ret != H_PAGE_REGISTERED) { ehca_err(&shca->ib_device, "hipz_reg_rpage_mr failed, " - "h_ret=%lx e_mr=%p i=%x lkey=%x hca_hndl=%lx " + "h_ret=%li e_mr=%p i=%x lkey=%x hca_hndl=%lx " "mr_hndl=%lx", h_ret, e_mr, i, e_mr->ib.ib_mr.lkey, shca->ipz_hca_handle.handle, @@ -1183,7 +1182,7 @@ ehca_reg_mr_rpages_exit1: ehca_free_fw_ctrlblock(kpage); ehca_reg_mr_rpages_exit0: if (ret) - ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p pginfo=%p " + ehca_err(&shca->ib_device, "ret=%i shca=%p e_mr=%p pginfo=%p " "num_kpages=%lx num_hwpages=%lx", ret, shca, e_mr, pginfo, pginfo->num_kpages, pginfo->num_hwpages); return ret; @@ -1244,7 +1243,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, * (MW bound or MR is shared) */ ehca_warn(&shca->ib_device, "hipz_h_reregister_pmr failed " - "(Rereg1), h_ret=%lx e_mr=%p", h_ret, e_mr); + "(Rereg1), h_ret=%li e_mr=%p", h_ret, e_mr); *pginfo = pginfo_save; ret = -EAGAIN; } else if ((u64 *)hipzout.vaddr != iova_start) { @@ -1273,7 +1272,7 @@ ehca_rereg_mr_rereg1_exit1: ehca_free_fw_ctrlblock(kpage); ehca_rereg_mr_rereg1_exit0: if ( ret && (ret != -EAGAIN) ) - ehca_err(&shca->ib_device, "ret=%x lkey=%x rkey=%x " + ehca_err(&shca->ib_device, "ret=%i lkey=%x rkey=%x " "pginfo=%p num_kpages=%lx num_hwpages=%lx", ret, *lkey, *rkey, pginfo, pginfo->num_kpages, pginfo->num_hwpages); @@ -1334,7 +1333,7 @@ int ehca_rereg_mr(struct ehca_shca *shca, h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_free_mr failed, 
" - "h_ret=%lx e_mr=%p hca_hndl=%lx mr_hndl=%lx " + "h_ret=%li e_mr=%p hca_hndl=%lx mr_hndl=%lx " "mr->lkey=%x", h_ret, e_mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, @@ -1366,7 +1365,7 @@ int ehca_rereg_mr(struct ehca_shca *shca, ehca_rereg_mr_exit0: if (ret) - ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p " + ehca_err(&shca->ib_device, "ret=%i shca=%p e_mr=%p " "iova_start=%p size=%lx acl=%x e_pd=%p pginfo=%p " "num_kpages=%lx lkey=%x rkey=%x rereg_1_hcall=%x " "rereg_3_hcall=%x", ret, shca, e_mr, iova_start, size, @@ -1410,7 +1409,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, * FMRs are not shared and no MW bound to FMRs */ ehca_err(&shca->ib_device, "hipz_reregister_pmr failed " - "(Rereg1), h_ret=%lx e_fmr=%p hca_hndl=%lx " + "(Rereg1), h_ret=%li e_fmr=%p hca_hndl=%lx " "mr_hndl=%lx lkey=%x lkey_out=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, @@ -1422,7 +1421,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_free_mr failed, " - "h_ret=%lx e_fmr=%p hca_hndl=%lx mr_hndl=%lx " + "h_ret=%li e_fmr=%p hca_hndl=%lx mr_hndl=%lx " "lkey=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, @@ -1457,7 +1456,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, ehca_unmap_one_fmr_exit0: if (ret) - ehca_err(&shca->ib_device, "ret=%x tmp_lkey=%x tmp_rkey=%x " + ehca_err(&shca->ib_device, "ret=%i tmp_lkey=%x tmp_rkey=%x " "fmr_max_pages=%x", ret, tmp_lkey, tmp_rkey, e_fmr->fmr_max_pages); return ret; @@ -1486,7 +1485,7 @@ int ehca_reg_smr(struct ehca_shca *shca, (u64)iova_start, hipz_acl, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%lx " + ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%li " "shca=%p e_origmr=%p e_newmr=%p iova_start=%p acl=%x " "e_pd=%p hca_hndl=%lx mr_hndl=%lx lkey=%x", h_ret, shca, 
e_origmr, e_newmr, iova_start, acl, e_pd, @@ -1510,7 +1509,7 @@ int ehca_reg_smr(struct ehca_shca *shca, ehca_reg_smr_exit0: if (ret) - ehca_err(&shca->ib_device, "ret=%x shca=%p e_origmr=%p " + ehca_err(&shca->ib_device, "ret=%i shca=%p e_origmr=%p " "e_newmr=%p iova_start=%p acl=%x e_pd=%p", ret, shca, e_origmr, e_newmr, iova_start, acl, e_pd); return ret; @@ -1585,7 +1584,7 @@ ehca_reg_internal_maxmr_exit1: ehca_mr_delete(e_mr); ehca_reg_internal_maxmr_exit0: if (ret) - ehca_err(&shca->ib_device, "ret=%x shca=%p e_pd=%p e_maxmr=%p", + ehca_err(&shca->ib_device, "ret=%i shca=%p e_pd=%p e_maxmr=%p", ret, shca, e_pd, e_maxmr); return ret; } /* end ehca_reg_internal_maxmr() */ @@ -1612,7 +1611,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca, (u64)iova_start, hipz_acl, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%lx " + ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%li " "e_origmr=%p hca_hndl=%lx mr_hndl=%lx lkey=%x", h_ret, e_origmr, shca->ipz_hca_handle.handle, e_origmr->ipz_mr_handle.handle, @@ -1653,7 +1652,7 @@ int ehca_dereg_internal_maxmr(struct ehca_shca *shca) ret = ehca_dereg_mr(&e_maxmr->ib.ib_mr); if (ret) { ehca_err(&shca->ib_device, "dereg internal max-MR failed, " - "ret=%x e_maxmr=%p shca=%p lkey=%x", + "ret=%i e_maxmr=%p shca=%p lkey=%x", ret, e_maxmr, shca, e_maxmr->ib.ib_mr.lkey); shca->maxmr = e_maxmr; goto ehca_dereg_internal_maxmr_exit0; @@ -1663,7 +1662,7 @@ int ehca_dereg_internal_maxmr(struct ehca_shca *shca) ehca_dereg_internal_maxmr_exit0: if (ret) - ehca_err(&shca->ib_device, "ret=%x shca=%p shca->maxmr=%p", + ehca_err(&shca->ib_device, "ret=%i shca=%p shca->maxmr=%p", ret, shca, shca->maxmr); return ret; } /* end ehca_dereg_internal_maxmr() */ diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 3a3880f..d2ab84a 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -310,7 +310,7 @@ 
static inline int init_qp_queue(struct ehca_shca *shca, } if (!ipz_rc) { - ehca_err(ib_dev, "Cannot allocate page for queue. ipz_rc=%x", + ehca_err(ib_dev, "Cannot allocate page for queue. ipz_rc=%i", ipz_rc); return -EBUSY; } @@ -334,7 +334,7 @@ static inline int init_qp_queue(struct ehca_shca *shca, if (cnt == (nr_q_pages - 1)) { /* last page! */ if (h_ret != expected_hret) { ehca_err(ib_dev, "hipz_qp_register_rpage() " - "h_ret= %lx ", h_ret); + "h_ret=%li", h_ret); ret = ehca2ib_return_code(h_ret); goto init_qp_queue1; } @@ -348,7 +348,7 @@ static inline int init_qp_queue(struct ehca_shca *shca, } else { if (h_ret != H_PAGE_REGISTERED) { ehca_err(ib_dev, "hipz_qp_register_rpage() " - "h_ret= %lx ", h_ret); + "h_ret=%li", h_ret); ret = ehca2ib_return_code(h_ret); goto init_qp_queue1; } @@ -617,7 +617,7 @@ static struct ehca_qp *internal_create_qp( h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, &parms); if (h_ret != H_SUCCESS) { - ehca_err(pd->device, "h_alloc_resource_qp() failed h_ret=%lx", + ehca_err(pd->device, "h_alloc_resource_qp() failed h_ret=%li", h_ret); ret = ehca2ib_return_code(h_ret); goto create_qp_exit1; @@ -671,7 +671,7 @@ static struct ehca_qp *internal_create_qp( &parms.squeue, swqe_size); if (ret) { ehca_err(pd->device, "Couldn't initialize squeue " - "and pages ret=%x", ret); + "and pages ret=%i", ret); goto create_qp_exit2; } } @@ -682,7 +682,7 @@ static struct ehca_qp *internal_create_qp( H_SUCCESS, &parms.rqueue, rwqe_size); if (ret) { ehca_err(pd->device, "Couldn't initialize rqueue " - "and pages ret=%x", ret); + "and pages ret=%i", ret); goto create_qp_exit3; } } @@ -719,8 +719,8 @@ static struct ehca_qp *internal_create_qp( if (qp_type == IB_QPT_GSI) { h_ret = ehca_define_sqp(shca, my_qp, init_attr); if (h_ret != H_SUCCESS) { - ehca_err(pd->device, "ehca_define_sqp() failed rc=%lx", - h_ret); + ehca_err(pd->device, "ehca_define_sqp() failed " + "h_ret=%li", h_ret); ret = ehca2ib_return_code(h_ret); goto create_qp_exit4; } @@ 
-730,7 +730,7 @@ static struct ehca_qp *internal_create_qp( ret = ehca_cq_assign_qp(my_qp->send_cq, my_qp); if (ret) { ehca_err(pd->device, - "Couldn't assign qp to send_cq ret=%x", ret); + "Couldn't assign qp to send_cq ret=%i", ret); goto create_qp_exit4; } } @@ -847,7 +847,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, mqpcb, my_qp->galpas.kernel); if (hret != H_SUCCESS) { ehca_err(pd->device, "Could not modify SRQ to INIT" - "ehca_qp=%p qp_num=%x hret=%lx", + "ehca_qp=%p qp_num=%x h_ret=%li", my_qp, my_qp->real_qp_num, hret); goto create_srq2; } @@ -861,7 +861,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, mqpcb, my_qp->galpas.kernel); if (hret != H_SUCCESS) { ehca_err(pd->device, "Could not enable SRQ" - "ehca_qp=%p qp_num=%x hret=%lx", + "ehca_qp=%p qp_num=%x h_ret=%li", my_qp, my_qp->real_qp_num, hret); goto create_srq2; } @@ -875,7 +875,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, mqpcb, my_qp->galpas.kernel); if (hret != H_SUCCESS) { ehca_err(pd->device, "Could not modify SRQ to RTR" - "ehca_qp=%p qp_num=%x hret=%lx", + "ehca_qp=%p qp_num=%x h_ret=%li", my_qp, my_qp->real_qp_num, hret); goto create_srq2; } @@ -913,7 +913,7 @@ static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, &bad_send_wqe_p, NULL, 2); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_h_disable_and_get_wqe() failed" - " ehca_qp=%p qp_num=%x h_ret=%lx", + " ehca_qp=%p qp_num=%x h_ret=%li", my_qp, qp_num, h_ret); return ehca2ib_return_code(h_ret); } @@ -991,7 +991,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, mqpcb, my_qp->galpas.kernel); if (h_ret != H_SUCCESS) { ehca_err(ibqp->device, "hipz_h_query_qp() failed " - "ehca_qp=%p qp_num=%x h_ret=%lx", + "ehca_qp=%p qp_num=%x h_ret=%li", my_qp, ibqp->qp_num, h_ret); ret = ehca2ib_return_code(h_ret); goto modify_qp_exit1; @@ -1027,7 +1027,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, ibqp, &smiqp_attr, smiqp_attr_mask, 1); if (smirc) { ehca_err(ibqp->device, "SMI RESET -> 
INIT failed. " - "ehca_modify_qp() rc=%x", smirc); + "ehca_modify_qp() rc=%i", smirc); ret = H_PARAMETER; goto modify_qp_exit1; } @@ -1129,7 +1129,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, ret = prepare_sqe_rts(my_qp, shca, &bad_wqe_cnt); if (ret) { ehca_err(ibqp->device, "prepare_sqe_rts() failed " - "ehca_qp=%p qp_num=%x ret=%x", + "ehca_qp=%p qp_num=%x ret=%i", my_qp, ibqp->qp_num, ret); goto modify_qp_exit2; } @@ -1354,7 +1354,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); - ehca_err(ibqp->device, "hipz_h_modify_qp() failed rc=%lx " + ehca_err(ibqp->device, "hipz_h_modify_qp() failed h_ret=%li " "ehca_qp=%p qp_num=%x", h_ret, my_qp, ibqp->qp_num); goto modify_qp_exit2; } @@ -1387,7 +1387,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, ret = ehca2ib_return_code(h_ret); ehca_err(ibqp->device, "ENABLE in context of " "RESET_2_INIT failed! Maybe you didn't get " - "a LID h_ret=%lx ehca_qp=%p qp_num=%x", + "a LID h_ret=%li ehca_qp=%p qp_num=%x", h_ret, my_qp, ibqp->qp_num); goto modify_qp_exit2; } @@ -1475,7 +1475,7 @@ int ehca_query_qp(struct ib_qp *qp, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); ehca_err(qp->device, "hipz_h_query_qp() failed " - "ehca_qp=%p qp_num=%x h_ret=%lx", + "ehca_qp=%p qp_num=%x h_ret=%li", my_qp, qp->qp_num, h_ret); goto query_qp_exit1; } @@ -1650,7 +1650,7 @@ int ehca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); - ehca_err(ibsrq->device, "hipz_h_modify_qp() failed rc=%lx " + ehca_err(ibsrq->device, "hipz_h_modify_qp() failed h_ret=%li " "ehca_qp=%p qp_num=%x", h_ret, my_qp, my_qp->real_qp_num); } @@ -1693,7 +1693,7 @@ int ehca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr) if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); ehca_err(srq->device, "hipz_h_query_qp() failed " - "ehca_qp=%p qp_num=%x h_ret=%lx", + "ehca_qp=%p qp_num=%x h_ret=%li", 
my_qp, my_qp->real_qp_num, h_ret); goto query_srq_exit1; } @@ -1743,7 +1743,7 @@ static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, ret = ehca_cq_unassign_qp(my_qp->send_cq, qp_num); if (ret) { ehca_err(dev, "Couldn't unassign qp from " - "send_cq ret=%x qp_num=%x cq_num=%x", ret, + "send_cq ret=%i qp_num=%x cq_num=%x", ret, qp_num, my_qp->send_cq->cq_number); return ret; } @@ -1755,7 +1755,7 @@ static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); if (h_ret != H_SUCCESS) { - ehca_err(dev, "hipz_h_destroy_qp() failed rc=%lx " + ehca_err(dev, "hipz_h_destroy_qp() failed h_ret=%li " "ehca_qp=%p qp_num=%x", h_ret, my_qp, qp_num); return ehca2ib_return_code(h_ret); } diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index 94eed70..ea91360 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -526,7 +526,7 @@ poll_cq_one_read_cqe: if (!cqe) { ret = -EAGAIN; ehca_dbg(cq->device, "Completion queue is empty ehca_cq=%p " - "cq_num=%x ret=%x", my_cq, my_cq->cq_number, ret); + "cq_num=%x ret=%i", my_cq, my_cq->cq_number, ret); goto poll_cq_one_exit0; } diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c b/drivers/infiniband/hw/ehca/ehca_sqp.c index 9f16e9c..f0792e5 100644 --- a/drivers/infiniband/hw/ehca/ehca_sqp.c +++ b/drivers/infiniband/hw/ehca/ehca_sqp.c @@ -82,7 +82,7 @@ u64 ehca_define_sqp(struct ehca_shca *shca, if (ret != H_SUCCESS) { ehca_err(&shca->ib_device, - "Can't define AQP1 for port %x. rc=%lx", + "Can't define AQP1 for port %x. 
h_ret=%li", port, ret); return ret; } diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index 84a16bc..5234d6c 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -121,7 +121,7 @@ static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas, ret = remap_4k_pfn(vma, vma->vm_start, physical >> EHCA_PAGESHIFT, vma->vm_page_prot); if (unlikely(ret)) { - ehca_gen_err("remap_pfn_range() failed ret=%x", ret); + ehca_gen_err("remap_pfn_range() failed ret=%i", ret); return -ENOMEM; } @@ -146,7 +146,7 @@ static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue, page = virt_to_page(virt_addr); ret = vm_insert_page(vma, start, page); if (unlikely(ret)) { - ehca_gen_err("vm_insert_page() failed rc=%x", ret); + ehca_gen_err("vm_insert_page() failed rc=%i", ret); return ret; } start += PAGE_SIZE; @@ -169,7 +169,7 @@ static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq, ret = ehca_mmap_fw(vma, &cq->galpas, &cq->mm_count_galpa); if (unlikely(ret)) { ehca_err(cq->ib_cq.device, - "ehca_mmap_fw() failed rc=%x cq_num=%x", + "ehca_mmap_fw() failed rc=%i cq_num=%x", ret, cq->cq_number); return ret; } @@ -180,7 +180,7 @@ static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq, ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue); if (unlikely(ret)) { ehca_err(cq->ib_cq.device, - "ehca_mmap_queue() failed rc=%x cq_num=%x", + "ehca_mmap_queue() failed rc=%i cq_num=%x", ret, cq->cq_number); return ret; } @@ -206,7 +206,7 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa); if (unlikely(ret)) { ehca_err(qp->ib_qp.device, - "remap_pfn_range() failed ret=%x qp_num=%x", + "remap_pfn_range() failed ret=%i qp_num=%x", ret, qp->ib_qp.qp_num); return -ENOMEM; } @@ -219,7 +219,7 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct 
ehca_qp *qp, &qp->mm_count_rqueue); if (unlikely(ret)) { ehca_err(qp->ib_qp.device, - "ehca_mmap_queue(rq) failed rc=%x qp_num=%x", + "ehca_mmap_queue(rq) failed rc=%i qp_num=%x", ret, qp->ib_qp.qp_num); return ret; } @@ -232,7 +232,7 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, &qp->mm_count_squeue); if (unlikely(ret)) { ehca_err(qp->ib_qp.device, - "ehca_mmap_queue(sq) failed rc=%x qp_num=%x", + "ehca_mmap_queue(sq) failed rc=%i qp_num=%x", ret, qp->ib_qp.qp_num); return ret; } @@ -283,7 +283,7 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) ret = ehca_mmap_cq(vma, cq, rsrc_type); if (unlikely(ret)) { ehca_err(cq->ib_cq.device, - "ehca_mmap_cq() failed rc=%x cq_num=%x", + "ehca_mmap_cq() failed rc=%i cq_num=%x", ret, cq->cq_number); return ret; } @@ -313,7 +313,7 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) ret = ehca_mmap_qp(vma, qp, rsrc_type); if (unlikely(ret)) { ehca_err(qp->ib_qp.device, - "ehca_mmap_qp() failed rc=%x qp_num=%x", + "ehca_mmap_qp() failed rc=%i qp_num=%x", ret, qp->ib_qp.qp_num); return ret; } diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 32f465b..a70a5ed 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -238,7 +238,7 @@ u64 hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, *eq_ist = (u32)outs[5]; if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resource - ret=%lx ", ret); + ehca_gen_err("Not enough resource - ret=%li ", ret); return ret; } @@ -276,7 +276,7 @@ u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle, hcp_galpas_ctor(&cq->galpas, outs[5], outs[6]); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resources. ret=%lx", ret); + ehca_gen_err("Not enough resources. 
ret=%li", ret); return ret; } @@ -351,7 +351,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, hcp_galpas_ctor(&parms->galpas, outs[6], outs[6]); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resources. ret=%lx", ret); + ehca_gen_err("Not enough resources. ret=%li", ret); return ret; } @@ -546,7 +546,7 @@ u64 hipz_h_modify_qp(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0, 0); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Insufficient resources ret=%lx", ret); + ehca_gen_err("Insufficient resources ret=%li", ret); return ret; } @@ -582,7 +582,7 @@ u64 hipz_h_destroy_qp(const struct ipz_adapter_handle adapter_handle, qp->ipz_qp_handle.handle, /* r6 */ 0, 0, 0, 0, 0, 0); if (ret == H_HARDWARE) - ehca_gen_err("HCA not operational. ret=%lx", ret); + ehca_gen_err("HCA not operational. ret=%li", ret); ret = ehca_plpar_hcall_norets(H_FREE_RESOURCE, adapter_handle.handle, /* r4 */ @@ -590,7 +590,7 @@ u64 hipz_h_destroy_qp(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0, 0); if (ret == H_RESOURCE) - ehca_gen_err("Resource still in use. ret=%lx", ret); + ehca_gen_err("Resource still in use. ret=%li", ret); return ret; } @@ -625,7 +625,7 @@ u64 hipz_h_define_aqp1(const struct ipz_adapter_handle adapter_handle, *bma_qp_nr = (u32)outs[1]; if (ret == H_ALIAS_EXIST) - ehca_gen_err("AQP1 already exists. ret=%lx", ret); + ehca_gen_err("AQP1 already exists. ret=%li", ret); return ret; } @@ -647,7 +647,7 @@ u64 hipz_h_attach_mcqp(const struct ipz_adapter_handle adapter_handle, 0, 0); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resources. ret=%lx", ret); + ehca_gen_err("Not enough resources. 
ret=%li", ret); return ret; } @@ -686,7 +686,7 @@ u64 hipz_h_destroy_cq(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0); if (ret == H_RESOURCE) - ehca_gen_err("H_FREE_RESOURCE failed ret=%lx ", ret); + ehca_gen_err("H_FREE_RESOURCE failed ret=%li ", ret); return ret; } @@ -708,7 +708,7 @@ u64 hipz_h_destroy_eq(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0, 0); if (ret == H_RESOURCE) - ehca_gen_err("Resource in use. ret=%lx ", ret); + ehca_gen_err("Resource in use. ret=%li ", ret); return ret; } -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:32:50 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:32:50 +0200 Subject: [ofa-general] [PATCH 07/12] IB/ehca: ehca_gen_warn() should always print In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111532.51179.fenkes@de.ibm.com> Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_tools.h | 9 +++------ 1 files changed, 3 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index 57c77a7..f9b264b 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -98,15 +98,12 @@ extern int ehca_debug_level; } while (0) #define ehca_gen_warn(format, arg...) \ - do { \ - if (unlikely(ehca_debug_level)) \ - printk(KERN_INFO "PU%04x EHCA_WARN:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg); \ - } while (0) + printk(KERN_INFO "PU%04x EHCA_WARN:%s " format "\n", \ + get_paca()->paca_index, __FUNCTION__, ## arg) #define ehca_gen_err(format, arg...) \ printk(KERN_ERR "PU%04x EHCA_ERR:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + get_paca()->paca_index, __FUNCTION__, ## arg) /** * ehca_dmp - printk a memory block, whose length is n*8 bytes. 
-- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:33:40 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:33:40 +0200 Subject: [ofa-general] [PATCH 09/12] IB/ehca: Add check for max #SGE to create_qp() In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111533.41250.fenkes@de.ibm.com> Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 14 +++++++++++++- 1 files changed, 13 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index d2ab84a..7154f62 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -513,7 +513,7 @@ static struct ehca_qp *internal_create_qp( } else if (init_attr->cap.max_send_wr > 255) { ehca_err(pd->device, "Invalid Number of " - "ax_send_wr=%x for UD QP_TYPE=%x", + "max_send_wr=%x for UD QP_TYPE=%x", init_attr->cap.max_send_wr, qp_type); return ERR_PTR(-EINVAL); } @@ -524,6 +524,18 @@ static struct ehca_qp *internal_create_qp( return ERR_PTR(-EINVAL); break; } + } else { + int max_sge = (qp_type == IB_QPT_UD || qp_type == IB_QPT_SMI + || qp_type == IB_QPT_GSI) ? 
250 : 252; + + if (init_attr->cap.max_send_sge > max_sge + || init_attr->cap.max_recv_sge > max_sge) { + ehca_err(pd->device, "Invalid number of SGEs requested " + "send_sge=%x recv_sge=%x max_sge=%x", + init_attr->cap.max_send_sge, + init_attr->cap.max_recv_sge, max_sge); + return ERR_PTR(-EINVAL); + } } if (pd->uobject && udata) -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:33:13 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:33:13 +0200 Subject: [ofa-general] [PATCH 08/12] IB/ehca: Replace get_paca()->paca_index by the more portable smp_processor_id() In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111533.14333.fenkes@de.ibm.com> Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_tools.h | 14 +++++++------- 1 files changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index f9b264b..863f972 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -73,37 +73,37 @@ extern int ehca_debug_level; if (unlikely(ehca_debug_level)) \ dev_printk(KERN_DEBUG, (ib_dev)->dma_device, \ "PU%04x EHCA_DBG:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, \ + smp_processor_id(), __FUNCTION__, \ ## arg); \ } while (0) #define ehca_info(ib_dev, format, arg...) \ dev_info((ib_dev)->dma_device, "PU%04x EHCA_INFO:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + smp_processor_id(), __FUNCTION__, ## arg) #define ehca_warn(ib_dev, format, arg...) \ dev_warn((ib_dev)->dma_device, "PU%04x EHCA_WARN:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + smp_processor_id(), __FUNCTION__, ## arg) #define ehca_err(ib_dev, format, arg...) 
\ dev_err((ib_dev)->dma_device, "PU%04x EHCA_ERR:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + smp_processor_id(), __FUNCTION__, ## arg) /* use this one only if no ib_dev available */ #define ehca_gen_dbg(format, arg...) \ do { \ if (unlikely(ehca_debug_level)) \ printk(KERN_DEBUG "PU%04x EHCA_DBG:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg); \ + smp_processor_id(), __FUNCTION__, ## arg); \ } while (0) #define ehca_gen_warn(format, arg...) \ printk(KERN_INFO "PU%04x EHCA_WARN:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + smp_processor_id(), __FUNCTION__, ## arg) #define ehca_gen_err(format, arg...) \ printk(KERN_ERR "PU%04x EHCA_ERR:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + smp_processor_id(), __FUNCTION__, ## arg) /** * ehca_dmp - printk a memory block, whose length is n*8 bytes. -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:34:04 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:34:04 +0200 Subject: [ofa-general] [PATCH 10/12] IB/ehca: Path migration support In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111534.06200.fenkes@de.ibm.com> Rectify some modify_qp() issues related to path migration. 
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_irq.c | 4 +- drivers/infiniband/hw/ehca/ehca_qp.c | 90 ++++++++++++++++++++++++--------- 2 files changed, 68 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index a925ea5..7093986 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -294,8 +294,8 @@ static void parse_identifier(struct ehca_shca *shca, u64 eqe) case 0x11: /* unaffiliated access error */ ehca_err(&shca->ib_device, "Unaffiliated access error."); break; - case 0x12: /* path migrating error */ - ehca_err(&shca->ib_device, "Path migration error."); + case 0x12: /* path migrating */ + ehca_err(&shca->ib_device, "Path migrating."); break; case 0x13: /* interface trace stopped */ ehca_err(&shca->ib_device, "Interface trace stopped."); diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 7154f62..6c70dee 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -1167,6 +1167,13 @@ static int internal_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_PKEY_INDEX) { + if (attr->pkey_index >= 16) { + ret = -EINVAL; + ehca_err(ibqp->device, "Invalid pkey_index=%x. " + "ehca_qp=%p qp_num=%x max_pkey_index=f", + attr->pkey_index, my_qp, ibqp->qp_num); + goto modify_qp_exit2; + } mqpcb->prim_p_key_idx = attr->pkey_index; update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PRIM_P_KEY_IDX, 1); } @@ -1275,50 +1282,78 @@ static int internal_modify_qp(struct ib_qp *ibqp, int ehca_mult = ib_rate_to_mult( shca->sport[my_qp->init_attr.port_num].rate); + if (attr->alt_port_num < 1 + || attr->alt_port_num > shca->num_ports) { + ret = -EINVAL; + ehca_err(ibqp->device, "Invalid alt_port=%x. 
" + "ehca_qp=%p qp_num=%x num_ports=%x", + attr->alt_port_num, my_qp, ibqp->qp_num, + shca->num_ports); + goto modify_qp_exit2; + } + mqpcb->alt_phys_port = attr->alt_port_num; + + if (attr->alt_pkey_index >= 16) { + ret = -EINVAL; + ehca_err(ibqp->device, "Invalid alt_pkey_index=%x. " + "ehca_qp=%p qp_num=%x max_pkey_index=f", + attr->pkey_index, my_qp, ibqp->qp_num); + goto modify_qp_exit2; + } + mqpcb->alt_p_key_idx = attr->alt_pkey_index; + + mqpcb->timeout_al = attr->alt_timeout; mqpcb->dlid_al = attr->alt_ah_attr.dlid; - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DLID_AL, 1); mqpcb->source_path_bits_al = attr->alt_ah_attr.src_path_bits; - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS_AL, 1); mqpcb->service_level_al = attr->alt_ah_attr.sl; - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL_AL, 1); - if (ah_mult < ehca_mult) - mqpcb->max_static_rate = (ah_mult > 0) ? - ((ehca_mult - 1) / ah_mult) : 0; + if (ah_mult > 0 && ah_mult < ehca_mult) + mqpcb->max_static_rate_al = (ehca_mult - 1) / ah_mult; else mqpcb->max_static_rate_al = 0; - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE_AL, 1); + /* OpenIB doesn't support alternate retry counts - copy them */ + mqpcb->retry_count_al = mqpcb->retry_count; + mqpcb->rnr_retry_count_al = mqpcb->rnr_retry_count; + + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_ALT_PHYS_PORT, 1) + | EHCA_BMASK_SET(MQPCB_MASK_ALT_P_KEY_IDX, 1) + | EHCA_BMASK_SET(MQPCB_MASK_TIMEOUT_AL, 1) + | EHCA_BMASK_SET(MQPCB_MASK_DLID_AL, 1) + | EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS_AL, 1) + | EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL_AL, 1) + | EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE_AL, 1) + | EHCA_BMASK_SET(MQPCB_MASK_RETRY_COUNT_AL, 1) + | EHCA_BMASK_SET(MQPCB_MASK_RNR_RETRY_COUNT_AL, 1); + + /* + * Always supply the GRH flag, even if it's zero, to give the + * hypervisor a clear "yes" or "no" instead of a "perhaps" + */ + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG_AL, 1); /* * only if GRH is TRUE we might 
consider SOURCE_GID_IDX * and DEST_GID otherwise phype will return H_ATTR_PARM!!! */ if (attr->alt_ah_attr.ah_flags == IB_AH_GRH) { - mqpcb->send_grh_flag_al = 1 << 31; - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG_AL, 1); - mqpcb->source_gid_idx_al = - attr->alt_ah_attr.grh.sgid_index; - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX_AL, 1); + mqpcb->send_grh_flag_al = 1; for (cnt = 0; cnt < 16; cnt++) mqpcb->dest_gid_al.byte[cnt] = attr->alt_ah_attr.grh.dgid.raw[cnt]; - - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_DEST_GID_AL, 1); + mqpcb->source_gid_idx_al = + attr->alt_ah_attr.grh.sgid_index; mqpcb->flow_label_al = attr->alt_ah_attr.grh.flow_label; - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_FLOW_LABEL_AL, 1); mqpcb->hop_limit_al = attr->alt_ah_attr.grh.hop_limit; - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_HOP_LIMIT_AL, 1); mqpcb->traffic_class_al = attr->alt_ah_attr.grh.traffic_class; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX_AL, 1) + | EHCA_BMASK_SET(MQPCB_MASK_DEST_GID_AL, 1) + | EHCA_BMASK_SET(MQPCB_MASK_FLOW_LABEL_AL, 1) + | EHCA_BMASK_SET(MQPCB_MASK_HOP_LIMIT_AL, 1) | EHCA_BMASK_SET(MQPCB_MASK_TRAFFIC_CLASS_AL, 1); } } @@ -1340,7 +1375,14 @@ static int internal_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_PATH_MIG_STATE) { - mqpcb->path_migration_state = attr->path_mig_state; + if (attr->path_mig_state != IB_MIG_REARM + && attr->path_mig_state != IB_MIG_MIGRATED) { + ret = -EINVAL; + ehca_err(ibqp->device, "Invalid mig_state=%x", + attr->path_mig_state); + goto modify_qp_exit2; + } + mqpcb->path_migration_state = attr->path_mig_state + 1; update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PATH_MIGRATION_STATE, 1); } @@ -1508,7 +1550,7 @@ int ehca_query_qp(struct ib_qp *qp, qp_attr->qkey = qpcb->qkey; qp_attr->path_mtu = qpcb->path_mtu; - qp_attr->path_mig_state = qpcb->path_migration_state; + qp_attr->path_mig_state = qpcb->path_migration_state - 1; qp_attr->rq_psn = qpcb->receive_psn; qp_attr->sq_psn = 
qpcb->send_psn; qp_attr->min_rnr_timer = qpcb->min_rnr_nak_timer_field; -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:34:35 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:34:35 +0200 Subject: [ofa-general] [PATCH 11/12] IB/ehca: Serialize MR alloc and MR free hvCalls In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111534.35916.fenkes@de.ibm.com> Some firmware levels exhibit a race condition between H_ALLOC_RESOURCE(MR) and H_FREE_RESOURCE(MR). Work around this problem by locking these hvCalls against each other. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/hcp_if.c | 28 +++++++++++++++++++++------- 1 files changed, 21 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index a70a5ed..d3d1ef2 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -120,15 +120,28 @@ static long ehca_plpar_hcall_norets(unsigned long opcode, unsigned long arg7) { long ret; - int i, sleep_msecs; + int i, sleep_msecs, do_lock = 0; + unsigned long flags; ehca_gen_dbg("opcode=%lx " HCALL7_REGS_FORMAT, opcode, arg1, arg2, arg3, arg4, arg5, arg6, arg7); + /* lock H_FREE_RESOURCE(MR) against itself and H_ALLOC_RESOURCE(MR) */ + if ((opcode == H_FREE_RESOURCE) && (arg7 == 5)) { + arg7 = 0; /* better not upset firmware */ + do_lock = 1; + } + for (i = 0; i < 5; i++) { + if (do_lock) + spin_lock_irqsave(&hcall_lock, flags); + ret = plpar_hcall_norets(opcode, arg1, arg2, arg3, arg4, arg5, arg6, arg7); + if (do_lock) + spin_unlock_irqrestore(&hcall_lock, flags); + if (H_IS_LONG_BUSY(ret)) { sleep_msecs = get_longbusy_msecs(ret); msleep_interruptible(sleep_msecs); @@ -161,23 +174,24 @@ static long ehca_plpar_hcall9(unsigned long opcode, unsigned long arg9) { long ret; - int i, sleep_msecs, lock_is_set = 0; + int i, sleep_msecs, do_lock; unsigned long flags = 0; ehca_gen_dbg("INPUT
-- opcode=%lx " HCALL9_REGS_FORMAT, opcode, arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8, arg9); + /* lock H_ALLOC_RESOURCE(MR) against itself and H_FREE_RESOURCE(MR) */ + do_lock = ((opcode == H_ALLOC_RESOURCE) && (arg2 == 5)); + for (i = 0; i < 5; i++) { - if ((opcode == H_ALLOC_RESOURCE) && (arg2 == 5)) { + if (do_lock) spin_lock_irqsave(&hcall_lock, flags); - lock_is_set = 1; - } ret = plpar_hcall9(opcode, outs, arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8, arg9); - if (lock_is_set) + if (do_lock) spin_unlock_irqrestore(&hcall_lock, flags); if (H_IS_LONG_BUSY(ret)) { @@ -807,7 +821,7 @@ u64 hipz_h_free_resource_mr(const struct ipz_adapter_handle adapter_handle, return ehca_plpar_hcall_norets(H_FREE_RESOURCE, adapter_handle.handle, /* r4 */ mr->ipz_mr_handle.handle, /* r5 */ - 0, 0, 0, 0, 0); + 0, 0, 0, 0, 5); } u64 hipz_h_reregister_pmr(const struct ipz_adapter_handle adapter_handle, -- 1.5.2 From fenkes at de.ibm.com Tue Sep 11 06:35:32 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 11 Sep 2007 15:35:32 +0200 Subject: [ofa-general] [PATCH 12/12] IB/ehca: Bump version number and change its format In-Reply-To: <200709111518.26276.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> Message-ID: <200709111535.32952.fenkes@de.ibm.com> Nobody needed the SVNEHCA_ prefix anyway. 
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_main.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 799f218..c84e310 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -49,10 +49,12 @@ #include "ehca_tools.h" #include "hcp_if.h" +#define HCAD_VERSION "0024" + MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Christoph Raisch "); MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); -MODULE_VERSION("SVNEHCA_0023"); +MODULE_VERSION(HCAD_VERSION); int ehca_open_aqp1 = 0; int ehca_debug_level = 0; @@ -909,7 +911,7 @@ int __init ehca_module_init(void) int ret; printk(KERN_INFO "eHCA Infiniband Device Driver " - "(Rel.: SVNEHCA_0023)\n"); + "(Version " HCAD_VERSION ")\n"); ret = ehca_create_comp_pool(); if (ret) { -- 1.5.2 From ntl at pobox.com Tue Sep 11 07:51:31 2007 From: ntl at pobox.com (Nathan Lynch) Date: Tue, 11 Sep 2007 09:51:31 -0500 Subject: [ofa-general] Re: [PATCH 08/12] IB/ehca: Replace get_paca()->paca_index by the more portable smp_processor_id() In-Reply-To: <200709111533.14333.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> <200709111533.14333.fenkes@de.ibm.com> Message-ID: <20070911145131.GN32388@localdomain> Hi, Joachim Fenkes wrote: > Signed-off-by: Joachim Fenkes > --- > drivers/infiniband/hw/ehca/ehca_tools.h | 14 +++++++------- > 1 files changed, 7 insertions(+), 7 deletions(-) > > diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h > index f9b264b..863f972 100644 > --- a/drivers/infiniband/hw/ehca/ehca_tools.h > +++ b/drivers/infiniband/hw/ehca/ehca_tools.h > @@ -73,37 +73,37 @@ extern int ehca_debug_level; > if (unlikely(ehca_debug_level)) \ > dev_printk(KERN_DEBUG, (ib_dev)->dma_device, \ > "PU%04x EHCA_DBG:%s " format "\n", \ > - get_paca()->paca_index, __FUNCTION__, \ > + smp_processor_id(), 
__FUNCTION__, \ > ## arg); \ > } while (0) > > #define ehca_info(ib_dev, format, arg...) \ > dev_info((ib_dev)->dma_device, "PU%04x EHCA_INFO:%s " format "\n", \ > - get_paca()->paca_index, __FUNCTION__, ## arg) > + smp_processor_id(), __FUNCTION__, ## arg) > > #define ehca_warn(ib_dev, format, arg...) \ > dev_warn((ib_dev)->dma_device, "PU%04x EHCA_WARN:%s " format "\n", \ > - get_paca()->paca_index, __FUNCTION__, ## arg) > + smp_processor_id(), __FUNCTION__, ## arg) > > #define ehca_err(ib_dev, format, arg...) \ > dev_err((ib_dev)->dma_device, "PU%04x EHCA_ERR:%s " format "\n", \ > - get_paca()->paca_index, __FUNCTION__, ## arg) > + smp_processor_id(), __FUNCTION__, ## arg) I think I see these macros used in preemptible code (e.g. ehca_probe), where smp_processor_id() will print a warning when CONFIG_DEBUG_PREEMPT=y. Probably better to use raw_smp_processor_id. From kliteyn at dev.mellanox.co.il Tue Sep 11 07:51:53 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 11 Sep 2007 17:51:53 +0300 Subject: [ofa-general] Re: [PATCH 3/7 V3] osm: QoS policy C & H files In-Reply-To: <20070828134044.GD18082@sashak.voltaire.com> References: <46D359BE.6040009@dev.mellanox.co.il> <20070828134044.GD18082@sashak.voltaire.com> Message-ID: <46E6AB89.6050102@dev.mellanox.co.il> Hi Sasha, >> +typedef struct _osm_qos_policy_t { >> + cl_list_t port_groups; /* list of osm_qos_port_group_t */ >> + cl_list_t sl2vl_tables; /* list of osm_qos_sl2vl_scope_t */ >> + cl_list_t vlarb_tables; /* list of osm_qos_vlarb_scope_t */ >> + cl_list_t qos_levels; /* list of osm_qos_level_t */ >> + cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ > > Here and above - where possible please use cl_qlist_t instead of > cl_list_t - it is _much_ faster (I did some benchmarking when worked > on up/down performance issues). What about cl_map_t vs cl_qmap_t? Is the difference there significant? 
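For context on the quoted cl_list_t remark: the usual explanation for the speed gap is that cl_list_t stores object pointers in separately managed list items, while cl_qlist_t links items that the caller embeds in its own objects, so insert and remove touch no allocator. The sketch below shows that embedded-link idea in plain C. It is only an illustration under that assumption; struct qitem, struct qlist, struct port_group and the qlist_* helpers are hypothetical stand-ins, not the complib types or calls, and it says nothing about cl_map_t versus cl_qmap_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; these are not complib types. */
struct qitem {
	struct qitem *next, *prev;
};

struct qlist {
	struct qitem end;	/* sentinel: end.next is the head, end.prev the tail */
	size_t count;
};

struct port_group {
	struct qitem link;	/* first member, so a qitem pointer converts back */
	const char *name;
};

static void qlist_init(struct qlist *l)
{
	l->end.next = l->end.prev = &l->end;
	l->count = 0;
}

/* O(1) append; nothing is allocated because the links live in the object. */
static void qlist_insert_tail(struct qlist *l, struct qitem *it)
{
	it->prev = l->end.prev;
	it->next = &l->end;
	l->end.prev->next = it;
	l->end.prev = it;
	l->count++;
}

/* O(1) unlink, given only the item itself. */
static void qlist_remove(struct qlist *l, struct qitem *it)
{
	assert(l->count > 0);
	it->prev->next = it->next;
	it->next->prev = it->prev;
	it->next = it->prev = NULL;
	l->count--;
}

/* Linear walk from head to sentinel, comparing group names. */
static struct port_group *port_group_find(struct qlist *l, const char *name)
{
	struct qitem *it;

	for (it = l->end.next; it != &l->end; it = it->next) {
		struct port_group *pg = (struct port_group *)it;

		if (strcmp(pg->name, name) == 0)
			return pg;
	}
	return NULL;
}
```

The trade-off is that an object with one embedded link can sit on only one such list at a time, which is presumably why the review says "where possible".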
-- Yevgeny Sasha Khapyorsky wrote: > On 02:09 Tue 28 Aug , Yevgeny Kliteynik wrote: >> QoS policy data structures and functions >> >> Signed-off-by: Yevgeny Kliteynik > > Applied. Thanks. > > I still have some comments (below) and expect it will be addressed. > > Sasha > >> --- >> opensm/include/opensm/osm_qos_policy.h | 189 +++++++ >> opensm/opensm/osm_qos_policy.c | 921 ++++++++++++++++++++++++++++++++ >> 2 files changed, 1110 insertions(+), 0 deletions(-) >> create mode 100644 opensm/include/opensm/osm_qos_policy.h >> create mode 100644 opensm/opensm/osm_qos_policy.c >> >> diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h >> new file mode 100644 >> index 0000000..dd15f99 >> --- /dev/null >> +++ b/opensm/include/opensm/osm_qos_policy.h >> @@ -0,0 +1,189 @@ >> +/* >> + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. >> + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >> + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. >> + * >> + * This software is available to you under a choice of one of two >> + * licenses. You may choose to be licensed under the terms of the GNU >> + * General Public License (GPL) Version 2, available from the file >> + * COPYING in the main directory of this source tree, or the >> + * OpenIB.org BSD license below: >> + * >> + * Redistribution and use in source and binary forms, with or >> + * without modification, are permitted provided that the following >> + * conditions are met: >> + * >> + * - Redistributions of source code must retain the above >> + * copyright notice, this list of conditions and the following >> + * disclaimer. >> + * >> + * - Redistributions in binary form must reproduce the above >> + * copyright notice, this list of conditions and the following >> + * disclaimer in the documentation and/or other materials >> + * provided with the distribution. 
>> + * >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, >> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF >> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND >> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS >> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN >> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN >> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE >> + * SOFTWARE. >> + * >> + */ >> + >> +/* >> + * Abstract: >> + * Declaration of OSM QoS Policy data types and functions. >> + * >> + * Environment: >> + * Linux User Mode >> + * >> + * Author: >> + * Yevgeny Kliteynik, Mellanox >> + */ >> + >> +#ifndef OSM_QOS_POLICY_H >> +#define OSM_QOS_POLICY_H >> + >> +#include >> +#include >> +#include >> +#include >> + >> +#define YYSTYPE char * >> +#define OSM_QOS_POLICY_MAX_PORTS_ON_SWITCH 128 >> +#define OSM_QOS_POLICY_DEFAULT_LEVEL_NAME "default" >> + >> +/***************************************************/ >> + >> +typedef struct _osm_qos_port_group_t { >> + char *name; /* single string (this port group name) */ >> + char *use; /* single string (description) */ >> + cl_list_t port_name_list; /* list of port names (.../.../...) */ >> + uint64_t **guid_range_arr; /* array of guid ranges (pair of 64-bit guids) */ > > Instead of uint64_t ** use something like: > > struct range { > uint64_t min, max; > }; > >> + unsigned guid_range_len; /* num of guid ranges in the array */ >> + cl_list_t partition_list; /* list of partition names */ > > Why not pkey range here? (by name partition search is extremely slow). > >> + boolean_t node_type_ca; >> + boolean_t node_type_switch; >> + boolean_t node_type_router; >> + boolean_t node_type_self; > > This probably could be optimized by using bitmask. Then instead of four > separate checks you will start with single 'if (node_type_mask) ...'.
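The two review remarks above (a flat range struct instead of uint64_t **, and a node-type bitmask instead of four booleans) could look roughly like the sketch below. It is a minimal illustration, assuming the ranges are inclusive, sorted by min and non-overlapping; struct qos_range, the QOS_NODE_TYPE_* bits and both helper names are hypothetical and are not part of OpenSM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; none of these exist in OpenSM. */
struct qos_range {
	uint64_t min, max;	/* inclusive bounds */
};

enum {
	QOS_NODE_TYPE_CA     = 1 << 0,
	QOS_NODE_TYPE_SWITCH = 1 << 1,
	QOS_NODE_TYPE_ROUTER = 1 << 2,
	QOS_NODE_TYPE_SELF   = 1 << 3,
};

/*
 * Binary search over a contiguous array of ranges.
 * Assumes the array is sorted by min and the ranges do not overlap.
 */
static int qos_range_contains(const struct qos_range *arr, size_t len,
			      uint64_t num)
{
	size_t lo = 0, hi = len;	/* search window is [lo, hi) */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (num < arr[mid].min)
			hi = mid;
		else if (num > arr[mid].max)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}

/* One mask test replaces the per-type boolean comparisons. */
static int qos_node_type_matches(unsigned group_mask, unsigned node_type_bit)
{
	return (group_mask & node_type_bit) != 0;
}
```

With a flat array there is one allocation per range list rather than one per range, and a group whose mask is zero can be skipped with a single test.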
> >> +} osm_qos_port_group_t; >> + >> +/***************************************************/ >> + >> +typedef struct _osm_qos_vlarb_scope_t { >> + cl_list_t group_list; /* list of group names (strings) */ >> + cl_list_t across_list; /* list of 'across' group names (strings) */ >> + cl_list_t vlarb_high_list; /* list of num pairs (n:m,...), 32-bit values */ >> + cl_list_t vlarb_low_list; /* list of num pairs (n:m,...), 32-bit values */ >> + uint32_t vl_high_limit; /* single integer */ >> + boolean_t vl_high_limit_set; >> +} osm_qos_vlarb_scope_t; >> + >> +/***************************************************/ >> + >> +typedef struct _osm_qos_sl2vl_scope_t { >> + cl_list_t group_list; /* list of strings (port group names) */ >> + boolean_t from[OSM_QOS_POLICY_MAX_PORTS_ON_SWITCH]; >> + boolean_t to[OSM_QOS_POLICY_MAX_PORTS_ON_SWITCH]; >> + cl_list_t across_from_list; /* list of strings (port group names) */ >> + cl_list_t across_to_list; /* list of strings (port group names) */ >> + uint8_t sl2vl_table[16]; /* array of sl2vl values */ >> + boolean_t sl2vl_table_set; >> +} osm_qos_sl2vl_scope_t; >> + >> +/***************************************************/ >> + >> +typedef struct _osm_qos_level_t { >> + char *use; >> + char *name; >> + uint8_t sl; >> + boolean_t sl_set; >> + uint8_t mtu_limit; >> + boolean_t mtu_limit_set; >> + uint8_t rate_limit; >> + boolean_t rate_limit_set; >> + uint8_t pkt_life; >> + boolean_t pkt_life_set; >> + uint64_t **path_bits_range_arr; /* array of bit ranges (real values are 32bits) */ >> + unsigned path_bits_range_len; /* num of bit ranges in the array */ >> + uint64_t **pkey_range_arr; /* array of PKey ranges (real values are 16bits) */ >> + unsigned pkey_range_len; >> +} osm_qos_level_t; >> + >> + >> +/***************************************************/ >> + >> +typedef struct _osm_qos_match_rule_t { >> + char *use; >> + cl_list_t source_list; /* list of strings */ >> + cl_list_t source_group_list; /* list of pointers to relevant 
port-group */ >> + cl_list_t destination_list; /* list of strings */ >> + cl_list_t destination_group_list; /* list of pointers to relevant port-group */ >> + char *qos_level_name; >> + osm_qos_level_t *p_qos_level; >> + uint64_t **service_id_range_arr; /* array of SID ranges (64-bit values) */ >> + unsigned service_id_range_len; >> + uint64_t **qos_class_range_arr; /* array of QoS Class ranges (real values are 16bits) */ >> + unsigned qos_class_range_len; >> + uint64_t **pkey_range_arr; /* array of PKey ranges (real values are 16bits) */ >> + unsigned pkey_range_len; >> +} osm_qos_match_rule_t; >> + >> +/***************************************************/ >> + >> +typedef struct _osm_qos_policy_t { >> + cl_list_t port_groups; /* list of osm_qos_port_group_t */ >> + cl_list_t sl2vl_tables; /* list of osm_qos_sl2vl_scope_t */ >> + cl_list_t vlarb_tables; /* list of osm_qos_vlarb_scope_t */ >> + cl_list_t qos_levels; /* list of osm_qos_level_t */ >> + cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ > > Here and above - where possible please use cl_qlist_t instead of > cl_list_t - it is _much_ faster (I did some benchmarking when worked > on up/down performance issues). 
> >> + osm_qos_level_t *p_default_qos_level; /* default QoS level */ >> +} osm_qos_policy_t; >> + >> +/***************************************************/ >> + >> +osm_qos_port_group_t * osm_qos_policy_port_group_create(); >> +void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p_port_group); >> + >> +osm_qos_vlarb_scope_t * osm_qos_policy_vlarb_scope_create(); >> +void osm_qos_policy_vlarb_scope_destroy(osm_qos_vlarb_scope_t * p_vlarb_scope); >> + >> +osm_qos_sl2vl_scope_t * osm_qos_policy_sl2vl_scope_create(); >> +void osm_qos_policy_sl2vl_scope_destroy(osm_qos_sl2vl_scope_t * p_sl2vl_scope); >> + >> +osm_qos_level_t * osm_qos_policy_qos_level_create(); >> +void osm_qos_policy_qos_level_destroy(osm_qos_level_t * p_qos_level); >> + >> +boolean_t osm_qos_level_has_pkey(IN const osm_qos_level_t * p_qos_level, >> + IN ib_net16_t pkey); >> + >> +ib_net16_t osm_qos_level_get_shared_pkey(IN const osm_qos_level_t * p_qos_level, >> + IN const osm_physp_t * p_src_physp, >> + IN const osm_physp_t * p_dest_physp); >> + >> +osm_qos_match_rule_t * osm_qos_policy_match_rule_create(); >> +void osm_qos_policy_match_rule_destroy(osm_qos_match_rule_t * p_match_rule); >> + >> +osm_qos_policy_t * osm_qos_policy_create(); >> +void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy); >> +int osm_qos_policy_validate(osm_qos_policy_t * p_qos_policy, osm_log_t * p_log); >> + >> +void osm_qos_policy_get_qos_level_by_pr(IN const osm_qos_policy_t * p_qos_policy, >> + IN const osm_pr_rcv_t * p_rcv, >> + IN const ib_path_rec_t * p_pr, >> + IN const osm_physp_t * p_src_physp, >> + IN const osm_physp_t * p_dest_physp, >> + IN ib_net64_t comp_mask, >> + OUT osm_qos_level_t ** pp_qos_level); >> + >> +/***************************************************/ >> + >> +int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn); >> + >> +/***************************************************/ >> + >> +#endif /* ifndef OSM_QOS_POLICY_H */ >> + >> diff --git a/opensm/opensm/osm_qos_policy.c 
b/opensm/opensm/osm_qos_policy.c >> new file mode 100644 >> index 0000000..a5a8856 >> --- /dev/null >> +++ b/opensm/opensm/osm_qos_policy.c >> @@ -0,0 +1,921 @@ >> +/* >> + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. >> + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >> + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. >> + * >> + * This software is available to you under a choice of one of two >> + * licenses. You may choose to be licensed under the terms of the GNU >> + * General Public License (GPL) Version 2, available from the file >> + * COPYING in the main directory of this source tree, or the >> + * OpenIB.org BSD license below: >> + * >> + * Redistribution and use in source and binary forms, with or >> + * without modification, are permitted provided that the following >> + * conditions are met: >> + * >> + * - Redistributions of source code must retain the above >> + * copyright notice, this list of conditions and the following >> + * disclaimer. >> + * >> + * - Redistributions in binary form must reproduce the above >> + * copyright notice, this list of conditions and the following >> + * disclaimer in the documentation and/or other materials >> + * provided with the distribution. >> + * >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, >> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF >> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND >> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS >> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN >> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN >> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE >> + * SOFTWARE. >> + * >> + */ >> + >> +/* >> + * Abstract: >> + * OSM QoS Policy functions. 
>> + * >> + * Environment: >> + * Linux User Mode >> + * >> + * Author: >> + * Yevgeny Kliteynik, Mellanox >> + */ >> + >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +static boolean_t >> +__is_num_in_range_arr(uint64_t ** range_arr, >> + unsigned range_arr_len, uint64_t num) >> +{ >> + unsigned ind_1 = 0; >> + unsigned ind_2 = range_arr_len - 1; >> + unsigned ind_mid; >> + >> + if (!range_arr || !range_arr_len) >> + return FALSE; >> + >> + while (ind_1 <= ind_2) { >> + if (num < range_arr[ind_1][0] || num > range_arr[ind_2][1]) >> + return FALSE; >> + else if (num <= range_arr[ind_1][1] || num >= range_arr[ind_2][0]) >> + return TRUE; >> + >> + ind_mid = ind_1 + (ind_2 - ind_1 + 1)/2; >> + >> + if (num < range_arr[ind_mid][0]) >> + ind_2 = ind_mid; >> + else if (num > range_arr[ind_mid][1]) >> + ind_1 = ind_mid; >> + else >> + return TRUE; >> + >> + ind_1++; >> + ind_2--; >> + } >> + >> + return FALSE; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +static void __free_single_element(void *p_element, void *context) >> +{ >> + if (p_element) >> + free(p_element); >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +osm_qos_port_group_t *osm_qos_policy_port_group_create() >> +{ >> + osm_qos_port_group_t *p = >> + (osm_qos_port_group_t *) malloc(sizeof(osm_qos_port_group_t)); >> + if (!p) >> + return NULL; >> + >> + memset(p, 0, sizeof(osm_qos_port_group_t)); >> + >> + cl_list_init(&p->port_name_list, 10); >> + cl_list_init(&p->partition_list, 10); >> + >> + return p; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> 
+ >> +void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) >> +{ >> + unsigned i; >> + >> + if (!p) >> + return; >> + >> + if (p->name) >> + free(p->name); >> + if (p->use) >> + free(p->use); >> + >> + for (i = 0; i < p->guid_range_len; i++) >> + free(p->guid_range_arr[i]); >> + if (p->guid_range_arr) >> + free(p->guid_range_arr); >> + >> + cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); >> + cl_list_remove_all(&p->port_name_list); >> + cl_list_destroy(&p->port_name_list); >> + >> + cl_list_apply_func(&p->partition_list, __free_single_element, NULL); >> + cl_list_remove_all(&p->partition_list); >> + cl_list_destroy(&p->partition_list); >> + >> + free(p); >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +osm_qos_vlarb_scope_t *osm_qos_policy_vlarb_scope_create() >> +{ >> + osm_qos_vlarb_scope_t *p = >> + (osm_qos_vlarb_scope_t *) malloc(sizeof(osm_qos_sl2vl_scope_t)); >> + if (!p) >> + return NULL; >> + >> + memset(p, 0, sizeof(osm_qos_vlarb_scope_t)); >> + >> + cl_list_init(&p->group_list, 10); >> + cl_list_init(&p->across_list, 10); >> + cl_list_init(&p->vlarb_high_list, 10); >> + cl_list_init(&p->vlarb_low_list, 10); >> + >> + return p; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +void osm_qos_policy_vlarb_scope_destroy(osm_qos_vlarb_scope_t * p) >> +{ >> + if (!p) >> + return; >> + >> + cl_list_apply_func(&p->group_list, __free_single_element, NULL); >> + cl_list_apply_func(&p->across_list, __free_single_element, NULL); >> + cl_list_apply_func(&p->vlarb_high_list, __free_single_element, NULL); >> + cl_list_apply_func(&p->vlarb_low_list, __free_single_element, NULL); >> + >> + cl_list_remove_all(&p->group_list); >> + cl_list_remove_all(&p->across_list); >> + cl_list_remove_all(&p->vlarb_high_list); >> + cl_list_remove_all(&p->vlarb_low_list); >> + >> + 
cl_list_destroy(&p->group_list); >> + cl_list_destroy(&p->across_list); >> + cl_list_destroy(&p->vlarb_high_list); >> + cl_list_destroy(&p->vlarb_low_list); >> + >> + free(p); >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +osm_qos_sl2vl_scope_t *osm_qos_policy_sl2vl_scope_create() >> +{ >> + osm_qos_sl2vl_scope_t *p = >> + (osm_qos_sl2vl_scope_t *) malloc(sizeof(osm_qos_sl2vl_scope_t)); >> + if (!p) >> + return NULL; >> + >> + memset(p, 0, sizeof(osm_qos_vlarb_scope_t)); >> + >> + cl_list_init(&p->group_list, 10); >> + cl_list_init(&p->across_from_list, 10); >> + cl_list_init(&p->across_to_list, 10); >> + >> + return p; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +void osm_qos_policy_sl2vl_scope_destroy(osm_qos_sl2vl_scope_t * p) >> +{ >> + if (!p) >> + return; >> + >> + cl_list_apply_func(&p->group_list, __free_single_element, NULL); >> + cl_list_apply_func(&p->across_from_list, __free_single_element, NULL); >> + cl_list_apply_func(&p->across_to_list, __free_single_element, NULL); >> + >> + cl_list_remove_all(&p->group_list); >> + cl_list_remove_all(&p->across_from_list); >> + cl_list_remove_all(&p->across_to_list); >> + >> + cl_list_destroy(&p->group_list); >> + cl_list_destroy(&p->across_from_list); >> + cl_list_destroy(&p->across_to_list); >> + >> + free(p); >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +osm_qos_level_t *osm_qos_policy_qos_level_create() >> +{ >> + osm_qos_level_t *p = >> + (osm_qos_level_t *) malloc(sizeof(osm_qos_level_t)); >> + if (!p) >> + return NULL; >> + memset(p, 0, sizeof(osm_qos_level_t)); >> + return p; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +void 
osm_qos_policy_qos_level_destroy(osm_qos_level_t * p) >> +{ >> + unsigned i; >> + >> + if (!p) >> + return; >> + >> + if (p->use) >> + free(p->use); >> + >> + for (i = 0; i < p->path_bits_range_len; i++) >> + free(p->path_bits_range_arr[i]); >> + if (p->path_bits_range_arr) >> + free(p->path_bits_range_arr); >> + >> + free(p); >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +boolean_t osm_qos_level_has_pkey(IN const osm_qos_level_t * p_qos_level, >> + IN ib_net16_t pkey) >> +{ >> + if (!p_qos_level || !p_qos_level->pkey_range_len) >> + return FALSE; >> + return __is_num_in_range_arr(p_qos_level->pkey_range_arr, >> + p_qos_level->pkey_range_len, >> + cl_ntoh16(pkey)); >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +ib_net16_t osm_qos_level_get_shared_pkey(IN const osm_qos_level_t * p_qos_level, >> + IN const osm_physp_t * p_src_physp, >> + IN const osm_physp_t * p_dest_physp) >> +{ >> + unsigned i; >> + uint16_t pkey_ho = 0; >> + >> + if (!p_qos_level || !p_qos_level->pkey_range_len) >> + return 0; >> + >> + /* >> + * ToDo: This approach is not optimal. >> + * Think how to find shared pkey that also exists >> + * in QoS level in less runtime. >> + */ > > When this "ToDo" will be addressed? 
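One possible direction for that ToDo, offered only as a sketch and not as something proposed in this thread: instead of stepping through every value inside every range and asking whether both ports share it, walk the source port's pkey entries and test each against the ranges and the destination port. The arrays below stand in for the ports' pkey tables; struct pkey_range and all three helper names are hypothetical, values are in host order, and the comparison deliberately ignores the membership bit that real pkey matching has to consider.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types and helpers for illustration; not OpenSM code. */
struct pkey_range {
	uint16_t min, max;	/* inclusive, host byte order */
};

static int pkey_in_ranges(const struct pkey_range *r, size_t nr, uint16_t pkey)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (pkey >= r[i].min && pkey <= r[i].max)
			return 1;
	return 0;
}

static int pkey_in_table(const uint16_t *tbl, size_t n, uint16_t pkey)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i] == pkey)
			return 1;
	return 0;
}

/*
 * Return the first source pkey that lies in one of the level's ranges and
 * is also present in the destination table, or 0 if there is none.
 * Work is bounded by the table sizes, not by the width of the ranges.
 */
static uint16_t shared_pkey_in_level(const uint16_t *src, size_t nsrc,
				     const uint16_t *dst, size_t ndst,
				     const struct pkey_range *r, size_t nr)
{
	size_t i;

	for (i = 0; i < nsrc; i++) {
		uint16_t pkey = src[i];

		if (pkey && pkey_in_ranges(r, nr, pkey) &&
		    pkey_in_table(dst, ndst, pkey))
			return pkey;
	}
	return 0;
}
```

Because no 16-bit counter is incremented across a range, this shape also avoids any wraparound concern for a range that ends at 0xffff.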
> >> + >> + for (i = 0; i < p_qos_level->pkey_range_len; i++) { >> + for (pkey_ho = p_qos_level->pkey_range_arr[i][0]; >> + pkey_ho <= p_qos_level->pkey_range_arr[i][1]; pkey_ho++) { >> + if (osm_physp_share_this_pkey >> + (p_src_physp, p_dest_physp, cl_hton16(pkey_ho))) >> + return cl_hton16(pkey_ho); >> + } >> + } >> + >> + return 0; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +osm_qos_match_rule_t *osm_qos_policy_match_rule_create() >> +{ >> + osm_qos_match_rule_t *p = >> + (osm_qos_match_rule_t *) malloc(sizeof(osm_qos_match_rule_t)); >> + if (!p) >> + return NULL; >> + >> + memset(p, 0, sizeof(osm_qos_match_rule_t)); >> + >> + cl_list_init(&p->source_list, 10); >> + cl_list_init(&p->source_group_list, 10); >> + cl_list_init(&p->destination_list, 10); >> + cl_list_init(&p->destination_group_list, 10); >> + >> + return p; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +void osm_qos_policy_match_rule_destroy(osm_qos_match_rule_t * p) >> +{ >> + unsigned i; >> + >> + if (!p) >> + return; >> + >> + if (p->qos_level_name) >> + free(p->qos_level_name); >> + if (p->use) >> + free(p->use); >> + >> + for (i = 0; i < p->service_id_range_len; i++) >> + free(p->service_id_range_arr[i]); >> + if (p->service_id_range_arr) >> + free(p->service_id_range_arr); >> + >> + for (i = 0; i < p->qos_class_range_len; i++) >> + free(p->qos_class_range_arr[i]); >> + if (p->qos_class_range_arr) >> + free(p->qos_class_range_arr); >> + >> + cl_list_apply_func(&p->source_list, __free_single_element, NULL); >> + cl_list_remove_all(&p->source_list); >> + cl_list_destroy(&p->source_list); >> + >> + cl_list_remove_all(&p->source_group_list); >> + cl_list_destroy(&p->source_group_list); >> + >> + cl_list_apply_func(&p->destination_list, __free_single_element, NULL); >> + cl_list_remove_all(&p->destination_list); >> + 
cl_list_destroy(&p->destination_list); >> + >> + cl_list_remove_all(&p->destination_group_list); >> + cl_list_destroy(&p->destination_group_list); >> + >> + free(p); >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +osm_qos_policy_t * osm_qos_policy_create() >> +{ >> + osm_qos_policy_t * p_qos_policy = (osm_qos_policy_t *)malloc(sizeof(osm_qos_policy_t)); >> + if (!p_qos_policy) >> + return NULL; >> + >> + memset(p_qos_policy, 0, sizeof(osm_qos_policy_t)); >> + >> + cl_list_construct(&p_qos_policy->port_groups); >> + cl_list_init(&p_qos_policy->port_groups, 10); >> + >> + cl_list_construct(&p_qos_policy->vlarb_tables); >> + cl_list_init(&p_qos_policy->vlarb_tables, 10); >> + >> + cl_list_construct(&p_qos_policy->sl2vl_tables); >> + cl_list_init(&p_qos_policy->sl2vl_tables, 10); >> + >> + cl_list_construct(&p_qos_policy->qos_levels); >> + cl_list_init(&p_qos_policy->qos_levels, 10); >> + >> + cl_list_construct(&p_qos_policy->qos_match_rules); >> + cl_list_init(&p_qos_policy->qos_match_rules, 10); >> + >> + return p_qos_policy; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy) >> +{ >> + cl_list_iterator_t list_iterator; >> + osm_qos_port_group_t *p_port_group = NULL; >> + osm_qos_vlarb_scope_t *p_vlarb_scope = NULL; >> + osm_qos_sl2vl_scope_t *p_sl2vl_scope = NULL; >> + osm_qos_level_t *p_qos_level = NULL; >> + osm_qos_match_rule_t *p_qos_match_rule = NULL; >> + >> + if (!p_qos_policy) >> + return; >> + >> + list_iterator = cl_list_head(&p_qos_policy->port_groups); >> + while (list_iterator != cl_list_end(&p_qos_policy->port_groups)) { >> + p_port_group = >> + (osm_qos_port_group_t *) cl_list_obj(list_iterator); >> + if (p_port_group) >> + osm_qos_policy_port_group_destroy(p_port_group); >> + list_iterator = cl_list_next(list_iterator); >> 
+ } >> + cl_list_remove_all(&p_qos_policy->port_groups); >> + cl_list_destroy(&p_qos_policy->port_groups); >> + >> + list_iterator = cl_list_head(&p_qos_policy->vlarb_tables); >> + while (list_iterator != cl_list_end(&p_qos_policy->vlarb_tables)) { >> + p_vlarb_scope = >> + (osm_qos_vlarb_scope_t *) cl_list_obj(list_iterator); >> + if (p_vlarb_scope) >> + osm_qos_policy_vlarb_scope_destroy(p_vlarb_scope); >> + list_iterator = cl_list_next(list_iterator); >> + } >> + cl_list_remove_all(&p_qos_policy->vlarb_tables); >> + cl_list_destroy(&p_qos_policy->vlarb_tables); >> + >> + list_iterator = cl_list_head(&p_qos_policy->sl2vl_tables); >> + while (list_iterator != cl_list_end(&p_qos_policy->sl2vl_tables)) { >> + p_sl2vl_scope = >> + (osm_qos_sl2vl_scope_t *) cl_list_obj(list_iterator); >> + if (p_sl2vl_scope) >> + osm_qos_policy_sl2vl_scope_destroy(p_sl2vl_scope); >> + list_iterator = cl_list_next(list_iterator); >> + } >> + cl_list_remove_all(&p_qos_policy->sl2vl_tables); >> + cl_list_destroy(&p_qos_policy->sl2vl_tables); >> + >> + list_iterator = cl_list_head(&p_qos_policy->qos_levels); >> + while (list_iterator != cl_list_end(&p_qos_policy->qos_levels)) { >> + p_qos_level = (osm_qos_level_t *) cl_list_obj(list_iterator); >> + if (p_qos_level) >> + osm_qos_policy_qos_level_destroy(p_qos_level); >> + list_iterator = cl_list_next(list_iterator); >> + } >> + cl_list_remove_all(&p_qos_policy->qos_levels); >> + cl_list_destroy(&p_qos_policy->qos_levels); >> + >> + list_iterator = cl_list_head(&p_qos_policy->qos_match_rules); >> + while (list_iterator != cl_list_end(&p_qos_policy->qos_match_rules)) { >> + p_qos_match_rule = >> + (osm_qos_match_rule_t *) cl_list_obj(list_iterator); >> + if (p_qos_match_rule) >> + osm_qos_policy_match_rule_destroy(p_qos_match_rule); >> + list_iterator = cl_list_next(list_iterator); >> + } >> + cl_list_remove_all(&p_qos_policy->qos_match_rules); >> + cl_list_destroy(&p_qos_policy->qos_match_rules); >> + >> + free(p_qos_policy); >> + >> + 
p_qos_policy = NULL; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +static boolean_t >> +__qos_policy_is_port_in_group(osm_subn_t * p_subn, >> + const osm_physp_t * p_physp, >> + osm_qos_port_group_t * p_port_group) >> +{ >> + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); >> + osm_prtn_t *p_prtn = NULL; >> + ib_net64_t port_guid = osm_physp_get_port_guid(p_physp); >> + uint64_t port_guid_ho = cl_ntoh64(port_guid); >> + uint8_t node_type = osm_node_get_type(p_node); >> + cl_list_iterator_t list_iterator; >> + char *partition_name; >> + >> + /* check whether this port's type matches any of group's types */ >> + >> + if ((node_type == IB_NODE_TYPE_CA && p_port_group->node_type_ca) || >> + (node_type == IB_NODE_TYPE_SWITCH && p_port_group->node_type_switch) >> + || (node_type == IB_NODE_TYPE_ROUTER >> + && p_port_group->node_type_router)) >> + return TRUE; >> + >> + /* check whether this port's guid is in range of this group's guids */ >> + >> + if (__is_num_in_range_arr(p_port_group->guid_range_arr, >> + p_port_group->guid_range_len, port_guid_ho)) >> + return TRUE; >> + >> + /* check whether this port is member of this group's partitions */ >> + >> + list_iterator = cl_list_head(&p_port_group->partition_list); >> + while (list_iterator != cl_list_end(&p_port_group->partition_list)) { >> + partition_name = (char *)cl_list_obj(list_iterator); >> + if (partition_name && strlen(partition_name)) { >> + p_prtn = osm_prtn_find_by_name(p_subn, partition_name); >> + if (p_prtn) { >> + if (osm_prtn_is_guid(p_prtn, port_guid)) >> + return TRUE; >> + } >> + } >> + list_iterator = cl_list_next(list_iterator); >> + } >> + >> + /* check whether this port's name matches any of group's names */ >> + >> + /* >> + * TODO: check port names >> + * >> + * char desc[IB_NODE_DESCRIPTION_SIZE + 1]; >> + * memcpy(desc, p_node->node_desc.description, IB_NODE_DESCRIPTION_SIZE); >> + * 
desc[IB_NODE_DESCRIPTION_SIZE] = '\0'; >> + */ >> + >> + return FALSE; >> +} /* __qos_policy_is_port_in_group() */ >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +static boolean_t >> +__qos_policy_is_port_in_group_list(const osm_pr_rcv_t * p_rcv, >> + const osm_physp_t * p_physp, >> + cl_list_t * p_port_group_list) >> +{ >> + osm_qos_port_group_t *p_port_group; >> + cl_list_iterator_t list_iterator; >> + >> + list_iterator = cl_list_head(p_port_group_list); >> + while (list_iterator != cl_list_end(p_port_group_list)) { >> + p_port_group = >> + (osm_qos_port_group_t *) cl_list_obj(list_iterator); >> + if (p_port_group) { >> + if (__qos_policy_is_port_in_group >> + (p_rcv->p_subn, p_physp, p_port_group)) >> + return TRUE; >> + } >> + list_iterator = cl_list_next(list_iterator); >> + } >> + return FALSE; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_pr( >> + const osm_qos_policy_t * p_qos_policy, >> + const osm_pr_rcv_t * p_rcv, >> + const ib_path_rec_t * p_pr, >> + const osm_physp_t * p_src_physp, >> + const osm_physp_t * p_dest_physp, >> + ib_net64_t comp_mask) >> +{ >> + osm_qos_match_rule_t *p_qos_match_rule = NULL; >> + cl_list_iterator_t list_iterator; >> + >> + if (!cl_list_count(&p_qos_policy->qos_match_rules)) >> + return NULL; >> + >> + /* Go over all QoS match rules and find the one that matches the request */ >> + >> + list_iterator = cl_list_head(&p_qos_policy->qos_match_rules); >> + while (list_iterator != cl_list_end(&p_qos_policy->qos_match_rules)) { >> + p_qos_match_rule = >> + (osm_qos_match_rule_t *) cl_list_obj(list_iterator); >> + if (!p_qos_match_rule) { >> + list_iterator = cl_list_next(list_iterator); >> + continue; >> + } >> + >> + /* If a match rule has Source groups, PR request source has to be in this list */ >> + 
>> + if (cl_list_count(&p_qos_match_rule->source_group_list)) { >> + if (!__qos_policy_is_port_in_group_list(p_rcv, >> + p_src_physp, >> + &p_qos_match_rule-> >> + source_group_list)) >> + { >> + list_iterator = cl_list_next(list_iterator); >> + continue; >> + } >> + } >> + >> + /* If a match rule has Destination groups, PR request dest. has to be in this list */ >> + >> + if (cl_list_count(&p_qos_match_rule->destination_group_list)) { >> + if (!__qos_policy_is_port_in_group_list(p_rcv, >> + p_dest_physp, >> + &p_qos_match_rule-> >> + destination_group_list)) >> + { >> + list_iterator = cl_list_next(list_iterator); >> + continue; >> + } >> + } >> + >> + /* If a match rule has QoS classes, PR request HAS >> + to have a matching QoS class to match the rule */ >> + >> + if (p_qos_match_rule->qos_class_range_len) { >> + if (!(comp_mask & IB_PR_COMPMASK_QOS_CLASS)) { >> + list_iterator = cl_list_next(list_iterator); >> + continue; >> + } >> + >> + if (!__is_num_in_range_arr >> + (p_qos_match_rule->qos_class_range_arr, >> + p_qos_match_rule->qos_class_range_len, >> + ib_path_rec_qos_class(p_pr))) { >> + list_iterator = cl_list_next(list_iterator); >> + continue; >> + } >> + >> + } >> + >> + /* If a match rule has Service IDs, PR request HAS >> + to have a matching Service ID to match the rule */ >> + >> + if (p_qos_match_rule->service_id_range_len) { >> + if (!(comp_mask & IB_PR_COMPMASK_SERVICEID)) { >> + list_iterator = cl_list_next(list_iterator); >> + continue; >> + } >> + >> + if (!__is_num_in_range_arr >> + (p_qos_match_rule->service_id_range_arr, >> + p_qos_match_rule->service_id_range_len, >> + p_pr->service_id)) { >> + list_iterator = cl_list_next(list_iterator); >> + continue; >> + } >> + >> + } >> + >> + /* If a match rule has PKeys, PR request HAS >> + to have a matching PKey to match the rule */ >> + >> + if (p_qos_match_rule->pkey_range_len) { >> + if (!(comp_mask & IB_PR_COMPMASK_PKEY)) { >> + list_iterator = cl_list_next(list_iterator); >> + continue; >> 
+ } >> + >> + if (!__is_num_in_range_arr >> + (p_qos_match_rule->pkey_range_arr, >> + p_qos_match_rule->pkey_range_len, >> + ib_path_rec_qos_class(p_pr))) { >> + list_iterator = cl_list_next(list_iterator); >> + continue; >> + } >> + >> + } >> + >> + /* if we got here, then this match-rule matched this PR request */ >> + break; >> + } >> + >> + if (list_iterator == cl_list_end(&p_qos_policy->qos_match_rules)) >> + return NULL; >> + >> + return p_qos_match_rule; >> +} /* __qos_policy_get_match_rule_by_pr() */ >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +static osm_qos_level_t *__qos_policy_get_qos_level_by_name(osm_qos_policy_t * p_qos_policy, >> + char *name) >> +{ >> + osm_qos_level_t *p_qos_level = NULL; >> + cl_list_iterator_t list_iterator; >> + >> + list_iterator = cl_list_head(&p_qos_policy->qos_levels); >> + while (list_iterator != cl_list_end(&p_qos_policy->qos_levels)) { >> + p_qos_level = (osm_qos_level_t *) cl_list_obj(list_iterator); >> + if (!p_qos_level) >> + continue; >> + >> + /* names are case INsensitive */ >> + if (strcasecmp(name, p_qos_level->name) == 0) >> + return p_qos_level; >> + >> + list_iterator = cl_list_next(list_iterator); >> + } >> + >> + return NULL; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +static osm_qos_port_group_t *__qos_policy_get_port_group_by_name(osm_qos_policy_t * p_qos_policy, >> + const char *const name) >> +{ >> + osm_qos_port_group_t *p_port_group = NULL; >> + cl_list_iterator_t list_iterator; >> + >> + list_iterator = cl_list_head(&p_qos_policy->port_groups); >> + while (list_iterator != cl_list_end(&p_qos_policy->port_groups)) { >> + p_port_group = >> + (osm_qos_port_group_t *) cl_list_obj(list_iterator); >> + if (!p_port_group) >> + continue; >> + >> + /* names are case INsensitive */ >> + if (strcasecmp(name, p_port_group->name) == 0) >> + 
return p_port_group; >> + >> + list_iterator = cl_list_next(list_iterator); >> + } >> + >> + return NULL; >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> +int osm_qos_policy_validate(osm_qos_policy_t * p_qos_policy, >> + osm_log_t *p_log) >> +{ >> + cl_list_iterator_t match_rules_list_iterator; >> + cl_list_iterator_t list_iterator; >> + osm_qos_port_group_t *p_port_group = NULL; >> + osm_qos_match_rule_t *p_qos_match_rule = NULL; >> + char *str; >> + unsigned i; >> + int res = 0; >> + >> + OSM_LOG_ENTER(p_log, osm_qos_policy_validate); >> + >> + /* set default qos level */ >> + >> + p_qos_policy->p_default_qos_level = >> + __qos_policy_get_qos_level_by_name(p_qos_policy, OSM_QOS_POLICY_DEFAULT_LEVEL_NAME); >> + if (!p_qos_policy->p_default_qos_level) { >> + osm_log(p_log, OSM_LOG_ERROR, >> + "osm_qos_policy_validate: ERR AC10: " >> + "Default qos-level (%s) not defined.\n", >> + OSM_QOS_POLICY_DEFAULT_LEVEL_NAME); >> + res = 1; >> + goto Exit; >> + } >> + >> + /* scan all the match rules, and fill the lists of pointers to >> + relevant qos levels and port groups to speed up PR matching */ >> + >> + i = 1; >> + match_rules_list_iterator = >> + cl_list_head(&p_qos_policy->qos_match_rules); >> + while (match_rules_list_iterator != >> + cl_list_end(&p_qos_policy->qos_match_rules)) { >> + p_qos_match_rule = >> + (osm_qos_match_rule_t *) >> + cl_list_obj(match_rules_list_iterator); >> + CL_ASSERT(p_qos_match_rule); >> + >> + /* find the matching qos-level for each match-rule */ >> + >> + p_qos_match_rule->p_qos_level = >> + __qos_policy_get_qos_level_by_name(p_qos_policy, >> + p_qos_match_rule->qos_level_name); >> + >> + if (!p_qos_match_rule->p_qos_level) { >> + osm_log(p_log, OSM_LOG_ERROR, >> + "osm_qos_policy_validate: ERR AC11: " >> + "qos-match-rule num %u: qos-level '%s' not found\n", >> + i, p_qos_match_rule->qos_level_name); >> + res = 1; >> + goto Exit; >> + } >> + >> + /* 
find the matching port-group for element of source_list */ >> + >> + if (cl_list_count(&p_qos_match_rule->source_list)) { >> + list_iterator = >> + cl_list_head(&p_qos_match_rule->source_list); >> + while (list_iterator != >> + cl_list_end(&p_qos_match_rule->source_list)) { >> + str = (char *)cl_list_obj(list_iterator); >> + CL_ASSERT(str); >> + >> + p_port_group = >> + __qos_policy_get_port_group_by_name(p_qos_policy, str); >> + if (!p_port_group) { >> + osm_log(p_log, >> + OSM_LOG_ERROR, >> + "osm_qos_policy_validate: ERR AC12: " >> + "qos-match-rule num %u: source port-group '%s' not found\n", >> + i, str); >> + res = 1; >> + goto Exit; >> + } >> + >> + cl_list_insert_tail(&p_qos_match_rule-> >> + source_group_list, >> + p_port_group); >> + >> + list_iterator = cl_list_next(list_iterator); >> + } >> + } >> + >> + /* find the matching port-group for element of destination_list */ >> + >> + if (cl_list_count(&p_qos_match_rule->destination_list)) { >> + list_iterator = >> + cl_list_head(&p_qos_match_rule->destination_list); >> + while (list_iterator != >> + cl_list_end(&p_qos_match_rule-> >> + destination_list)) { >> + str = (char *)cl_list_obj(list_iterator); >> + CL_ASSERT(str); >> + >> + p_port_group = >> + __qos_policy_get_port_group_by_name(p_qos_policy,str); >> + if (!p_port_group) { >> + osm_log(p_log, >> + OSM_LOG_ERROR, >> + "osm_qos_policy_validate: ERR AC13: " >> + "qos-match-rule num %u: destination port-group '%s' not found\n", >> + i, str); >> + res = 1; >> + goto Exit; >> + } >> + >> + cl_list_insert_tail(&p_qos_match_rule-> >> + destination_group_list, >> + p_port_group); >> + >> + list_iterator = cl_list_next(list_iterator); >> + } >> + } >> + >> + /* done with the current match-rule */ >> + >> + match_rules_list_iterator = >> + cl_list_next(match_rules_list_iterator); >> + i++; >> + } >> + >> + Exit: >> + OSM_LOG_EXIT(p_log); >> + return res; >> +} /* osm_qos_policy_validate() */ >> + >> +/*************************************************** >> + 
***************************************************/ >> + >> +void osm_qos_policy_get_qos_level_by_pr(IN const osm_qos_policy_t * p_qos_policy, >> + IN const osm_pr_rcv_t * p_rcv, >> + IN const ib_path_rec_t * p_pr, >> + IN const osm_physp_t * p_src_physp, >> + IN const osm_physp_t * p_dest_physp, >> + IN ib_net64_t comp_mask, >> + OUT osm_qos_level_t ** pp_qos_level) >> +{ >> + osm_qos_match_rule_t *p_qos_match_rule = NULL; >> + osm_qos_level_t *p_qos_level = NULL; >> + >> + OSM_LOG_ENTER(p_rcv->p_log, osm_qos_policy_get_qos_level_by_pr); >> + >> + *pp_qos_level = NULL; >> + >> + if (!p_qos_policy) >> + goto Exit; >> + >> + p_qos_match_rule = __qos_policy_get_match_rule_by_pr(p_qos_policy, >> + p_rcv, >> + p_pr, >> + p_src_physp, >> + p_dest_physp, >> + comp_mask); >> + >> + if (p_qos_match_rule) >> + p_qos_level = p_qos_match_rule->p_qos_level; >> + else >> + p_qos_level = p_qos_policy->p_default_qos_level; >> + >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "osm_qos_policy_get_qos_level_by_pr: " >> + "PathRecord request:" >> + "Src port 0x%016" PRIx64 ", " >> + "Dst port 0x%016" PRIx64 "\n", >> + cl_ntoh64(osm_physp_get_port_guid(p_src_physp)), >> + cl_ntoh64(osm_physp_get_port_guid(p_dest_physp))); >> + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, >> + "osm_qos_policy_get_qos_level_by_pr: " >> + "Applying QoS Level %s (%s)\n", >> + p_qos_level->name, >> + (p_qos_level->use) ? p_qos_level->use : "no description"); >> + >> + *pp_qos_level = p_qos_level; >> + >> + Exit: >> + OSM_LOG_EXIT(p_rcv->p_log); >> +} /* osm_qos_policy_get_qos_level_by_pr() */ >> + >> +/*************************************************** >> + ***************************************************/ >> -- >> 1.5.1.4 >> > From jackm at dev.mellanox.co.il Tue Sep 11 08:03:38 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 11 Sep 2007 18:03:38 +0300 Subject: [ofa-general] userspace "deadlock" bug in libmlx4? 
Message-ID: <200709111803.38431.jackm@dev.mellanox.co.il>

Roland,
I noticed the following in libmlx4, when destroying a qp:

file verbs.c, procedure mlx4_destroy_qp:

    mlx4_lock_cqs(ibqp);
    mlx4_clear_qp(to_mctx(ibqp->context), ibqp->qp_num);
    mlx4_unlock_cqs(ibqp);

(and mlx4_lock_cqs() takes pthread spinlocks).

Now, in function mlx4_clear_qp() (file src/qp.c), we see the following:

    pthread_mutex_lock(&ctx->qp_table_mutex);

    if (!--ctx->qp_table[tind].refcnt)
        free(ctx->qp_table[tind].table);
    else
        ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = NULL;

    pthread_mutex_unlock(&ctx->qp_table_mutex);

We're (potentially) waiting on a pthread mutex inside a pthread spinlock.

Is there a problem here?

- Jack

From sweitzen at cisco.com Tue Sep 11 08:48:10 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 11 Sep 2007 08:48:10 -0700
Subject: [ofa-general] DAPL Package Build Error on PPC64 Arch
In-Reply-To: <13995234.1189513707210.JavaMail.root@wombat.diezmil.com>
References: <13995234.1189513707210.JavaMail.root@wombat.diezmil.com>
Message-ID:

You are hitting https://bugs.openfabrics.org/show_bug.cgi?id=48.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> snagai at jp.ibm.com
> Sent: Tuesday, September 11, 2007 5:28 AM
> To: general at lists.openfabrics.org
> Subject: [ofa-general] DAPL Package Build Error on PPC64 Arch
>
> I am trying to build OFED with enabling DAPL package, but
> build process did not complete due to some errors.
>
> I just unzipped tar ball "OFED-1.2.tgz" and run build script
> "build.sh".
> Because I need to enable uDAPL on ppc64 linux machine, if
> someone has already succeeded it, please show me the way.
>
> My build environment and error messages are below. It seems
> the definition of "__PPC64__" is missing.
> > [ build environment ] > > - machine arch: ppc64 > - OS : Fedora Core6 > - compiler: gcc4.1.1 > > [ error messages in build.log ] > > Make dapl started > make -C src/userspace/dapl \ > CPPFLAGS="-I../libibverbs/include/infiniband > -I../librdmacm/include \ > -I../libibverbs/include -I../../dat/include" \ > > AM_LDFLAGS="-L/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspac e/libibverbs/src -libverbs -> L/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/librdmacm/s > rc/ -lrdmacm" > make[1]: Entering directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > make all-recursive > make[2]: Entering directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > Making all in . > make[3]: Entering directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > if /bin/sh ./libtool --tag=CC --mode=compile gcc > -DHAVE_CONFIG_H -I. -I. -I. > -I../libibverbs/include/infiniband -I../librdmacm/include > -I../libibverbs/include -I../../dat/include -Wall -g > -D_GNU_SOURCE -DOS_RELEASE=131078 -DOPENIB -DCQ_WAIT_OBJECT > -I./dat/include/ -I./dapl/include/ -I./dapl/common > -I./dapl/udapl/linux -I./dapl/openib_cma -m32 -g -O2 > -L/usr/lib -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP > -MF ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" -c -o > dapl_udapl_libdaplcma_la-dapl_init.lo `test -f > 'dapl/udapl/dapl_init.c' || echo './'`dapl/udapl/dapl_init.c; \ > then mv -f > ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" > ".deps/dapl_udapl_libdaplcma_la-dapl_init.Plo"; else rm -f > ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo"; exit 1; fi > mkdir .libs > gcc -DHAVE_CONFIG_H -I. -I. -I. 
> -I../libibverbs/include/infiniband -I../librdmacm/include > -I../libibverbs/include -I../../dat/include -Wall -g > -D_GNU_SOURCE -DOS_RELEASE=131078 -DOPENIB -DCQ_WAIT_OBJECT > -I./dat/include/ -I./dapl/include/ -I./dapl/common > -I./dapl/udapl/linux -I./dapl/openib_cma -m32 -g -O2 > -L/usr/lib -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP > -MF .deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo -c > dapl/udapl/dapl_init.c -fPIC -DPIC -o > .libs/dapl_udapl_libdaplcma_la-dapl_init.o > In file included from ./dapl/include/dapl.h:50, > from dapl/udapl/dapl_init.c:39: > ./dapl/udapl/linux/dapl_osd.h:53:2: error: #error UNDEFINED ARCH > make[3]: *** [dapl_udapl_libdaplcma_la-dapl_init.lo] Error 1 > make[3]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > make[1]: *** [all] Error 2 > make[1]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > make: *** [dapl] Error 2 > error: Bad exit status from /var/tmp/rpm-tmp.33577 (%install) > > > RPM build errors: > user vlad does not exist - using root > group vlad does not exist - using root > user vlad does not exist - using root > group vlad does not exist - using root > Bad exit status from /var/tmp/rpm-tmp.33577 (%install) > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define '_prefix /usr' --define > 'build_root /home/testuser/tmp/OFED' --define > 'configure_options --with-dapl --with-ipoibtools > --with-libcxgb3 --with-libehca --with-libibcm > --with-libibcommon --with-libibmad --with-libibumad > --with-libibverbs --with-libipathverbs --with-libmthca > --with-opensm --with-librdmacm --with-libsdp > --with-openib-diags --with-sdpnetstat --with-srptools > --with-perftest --sysconfdir=/etc --mandir=/usr/share/man' > --define 'configure_options32 --with-dapl --with-ipoibtools > --with-libcxgb3 
--with-libehca --with-libibcm
> --with-libibcommon --with-libibmad --with-libibumad
> --with-libibverbs --with-libipathverbs --with-libmthca
> --with-opensm --with-librdmacm --with-libsdp
> --with-openib-diags --with-sdpnetstat --with-srptools
> --with-mstflint --with-tvflash --sysconfdir=/etc
> --mandir=/usr/share/man' --define 'build_32bit 1' --define
> '_mandir /usr/share/man' /home/testuser/archives/OFED-1.2/SRPMS/ofa_user-1.2-0.src.rpm"
>
>
> --
> This message was sent on behalf of snagai at jp.ibm.com at
> openSubscriber.com
> http://www.opensubscriber.com/messages/general at lists.openfabri
> cs.org/topic.html
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From hal.rosenstock at gmail.com Tue Sep 11 08:50:36 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 11 Sep 2007 08:50:36 -0700
Subject: [ofa-general] [PATCH][RFC] P_Key support for umad
In-Reply-To:
References:
Message-ID:

On 9/10/07, Hal Rosenstock wrote:
> On 9/7/07, Roland Dreier wrote:
> > Here is a long overdue patch to enable userspace to control the P_Key
> > index used for userspace MADs. I used the approach we discussed when
> > this first came up, namely adding an ioctl to enable the new
> > interface so that existing binaries don't break.
> >
> > I haven't had a chance to make all the userspace library changes to
> > test the new interface and I likely won't until I return home (I
> > should be done traveling for a few months after this week). I have
> > tested existing code against a kernel with this patch applied and it
> > seems to be OK, and I wanted to at least get this out for review as
> > soon as I had it.
> >
> > Please review/test. I would like to get this into 2.6.24 if possible
> > since we've known so long that we needed it.
>
> Thanks for doing this :-) One nit below in the doc.
>
> I spent some time testing it today in old mode and although my
> environment is limited, I did have trouble with an RMPP test as
> follows:
>
> Can someone try the following with OpenSM running:
>
> First, osmtest -f c
> and then
> osmtest -f a
>
> All on same node with new user_mad module.
>
> That seems to hangup rather than complete for me. I didn't have time
> to track this down any further.

With clearer eyes this morning, I was able to see what my problem was.
This test now is working. So although I am unable to review the packet
contents on the wire, I am reasonably confident that hasn't changed
although I would feel better knowing someone explicitly did this. Bottom
line is this seems to work in old mode for me.

Sasha,

Will you be testing this?

-- Hal

> -- Hal
>
> > Thanks,
> > Roland
> >
> >
> > diff --git a/Documentation/infiniband/user_mad.txt b/Documentation/infiniband/user_mad.txt
> > index 8ec54b9..a3450aa 100644
> > --- a/Documentation/infiniband/user_mad.txt
> > +++ b/Documentation/infiniband/user_mad.txt
> > @@ -99,6 +99,20 @@ Transaction IDs
> > request/response pairs. The upper 32 bits are reserved for use by
> > the kernel and will be overwritten before a MAD is sent.
> >
> > +P_Key Index Handling
> > +
> > + The old ib_umad interface did not allow setting the P_Key index for
> > + MADs that are sent and did not provide a way for obtaining the P_Key
> > + index of received MADs. A new layout for struct ib_user_mad_hdr
> > + with a pkey_index member has been defined; however, to preserve
> > + binary compatibility with older applications, this new layout will
> > + not be used unless the IB_USER_MAD_ENABLE_PKEY ioctl is called
> > + before a file description is used for anything else.
>
> Nit: Should this be "file descriptor" ?
> > > + > > + In September 2008, the IB_USER_MAD_ABI_VERSION will be incremented > > + to 6, the new layout of struct ib_user_mad_hdr will be used by > > + default, and the IB_USER_MAD_ENABLE_PKEY ioctl will be removed. > > + > > Setting IsSM Capability Bit > > > > To set the IsSM capability bit for a port, simply open the > > diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c > > index d97ded2..3a0e579 100644 > > --- a/drivers/infiniband/core/user_mad.c > > +++ b/drivers/infiniband/core/user_mad.c > > @@ -118,6 +118,8 @@ struct ib_umad_file { > > wait_queue_head_t recv_wait; > > struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; > > int agents_dead; > > + u8 use_pkey_index; > > + u8 already_used; > > }; > > > > struct ib_umad_packet { > > @@ -147,6 +149,12 @@ static void ib_umad_release_dev(struct kref *ref) > > kfree(dev); > > } > > > > +static int hdr_size(struct ib_umad_file *file) > > +{ > > + return file->use_pkey_index ? sizeof (struct ib_user_mad_hdr) : > > + sizeof (struct ib_user_mad_hdr_old); > > +} > > + > > /* caller must hold port->mutex at least for reading */ > > static struct ib_mad_agent *__get_agent(struct ib_umad_file *file, int id) > > { > > @@ -221,13 +229,13 @@ static void recv_handler(struct ib_mad_agent *agent, > > packet->length = mad_recv_wc->mad_len; > > packet->recv_wc = mad_recv_wc; > > > > - packet->mad.hdr.status = 0; > > - packet->mad.hdr.length = sizeof (struct ib_user_mad) + > > - mad_recv_wc->mad_len; > > - packet->mad.hdr.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); > > - packet->mad.hdr.lid = cpu_to_be16(mad_recv_wc->wc->slid); > > - packet->mad.hdr.sl = mad_recv_wc->wc->sl; > > - packet->mad.hdr.path_bits = mad_recv_wc->wc->dlid_path_bits; > > + packet->mad.hdr.status = 0; > > + packet->mad.hdr.length = hdr_size(file) + mad_recv_wc->mad_len; > > + packet->mad.hdr.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); > > + packet->mad.hdr.lid = cpu_to_be16(mad_recv_wc->wc->slid); > > + packet->mad.hdr.sl = 
mad_recv_wc->wc->sl; > > + packet->mad.hdr.path_bits = mad_recv_wc->wc->dlid_path_bits; > > + packet->mad.hdr.pkey_index = mad_recv_wc->wc->pkey_index; > > packet->mad.hdr.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); > > if (packet->mad.hdr.grh_present) { > > struct ib_ah_attr ah_attr; > > @@ -253,8 +261,8 @@ err1: > > ib_free_recv_mad(mad_recv_wc); > > } > > > > -static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, > > - size_t count) > > +static ssize_t copy_recv_mad(struct ib_umad_file *file, char __user *buf, > > + struct ib_umad_packet *packet, size_t count) > > { > > struct ib_mad_recv_buf *recv_buf; > > int left, seg_payload, offset, max_seg_payload; > > @@ -262,15 +270,15 @@ static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, > > /* We need enough room to copy the first (or only) MAD segment. */ > > recv_buf = &packet->recv_wc->recv_buf; > > if ((packet->length <= sizeof (*recv_buf->mad) && > > - count < sizeof (packet->mad) + packet->length) || > > + count < hdr_size(file) + packet->length) || > > (packet->length > sizeof (*recv_buf->mad) && > > - count < sizeof (packet->mad) + sizeof (*recv_buf->mad))) > > + count < hdr_size(file) + sizeof (*recv_buf->mad))) > > return -EINVAL; > > > > - if (copy_to_user(buf, &packet->mad, sizeof (packet->mad))) > > + if (copy_to_user(buf, &packet->mad, hdr_size(file))) > > return -EFAULT; > > > > - buf += sizeof (packet->mad); > > + buf += hdr_size(file); > > seg_payload = min_t(int, packet->length, sizeof (*recv_buf->mad)); > > if (copy_to_user(buf, recv_buf->mad, seg_payload)) > > return -EFAULT; > > @@ -280,7 +288,7 @@ static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, > > * Multipacket RMPP MAD message. Copy remainder of message. > > * Note that last segment may have a shorter payload. 
> > */ > > - if (count < sizeof (packet->mad) + packet->length) { > > + if (count < hdr_size(file) + packet->length) { > > /* > > * The buffer is too small, return the first RMPP segment, > > * which includes the RMPP message length. > > @@ -300,18 +308,23 @@ static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, > > return -EFAULT; > > } > > } > > - return sizeof (packet->mad) + packet->length; > > + return hdr_size(file) + packet->length; > > } > > > > -static ssize_t copy_send_mad(char __user *buf, struct ib_umad_packet *packet, > > - size_t count) > > +static ssize_t copy_send_mad(struct ib_umad_file *file, char __user *buf, > > + struct ib_umad_packet *packet, size_t count) > > { > > - ssize_t size = sizeof (packet->mad) + packet->length; > > + ssize_t size = hdr_size(file) + packet->length; > > > > if (count < size) > > return -EINVAL; > > > > - if (copy_to_user(buf, &packet->mad, size)) > > + if (copy_to_user(buf, &packet->mad, hdr_size(file))) > > + return -EFAULT; > > + > > + buf += hdr_size(file); > > + > > + if (copy_to_user(buf, packet->mad.data, packet->length)) > > return -EFAULT; > > > > return size; > > @@ -324,7 +337,7 @@ static ssize_t ib_umad_read(struct file *filp, char __user *buf, > > struct ib_umad_packet *packet; > > ssize_t ret; > > > > - if (count < sizeof (struct ib_user_mad)) > > + if (count < hdr_size(file)) > > return -EINVAL; > > > > spin_lock_irq(&file->recv_lock); > > @@ -348,9 +361,9 @@ static ssize_t ib_umad_read(struct file *filp, char __user *buf, > > spin_unlock_irq(&file->recv_lock); > > > > if (packet->recv_wc) > > - ret = copy_recv_mad(buf, packet, count); > > + ret = copy_recv_mad(file, buf, packet, count); > > else > > - ret = copy_send_mad(buf, packet, count); > > + ret = copy_send_mad(file, buf, packet, count); > > > > if (ret < 0) { > > /* Requeue packet */ > > @@ -442,15 +455,14 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, > > __be64 *tid; > > int ret, data_len, 
hdr_len, copy_offset, rmpp_active; > > > > - if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) > > + if (count < hdr_size(file) + IB_MGMT_RMPP_HDR) > > return -EINVAL; > > > > packet = kzalloc(sizeof *packet + IB_MGMT_RMPP_HDR, GFP_KERNEL); > > if (!packet) > > return -ENOMEM; > > > > - if (copy_from_user(&packet->mad, buf, > > - sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR)) { > > + if (copy_from_user(&packet->mad, buf, hdr_size(file))) { > > ret = -EFAULT; > > goto err; > > } > > @@ -461,6 +473,13 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, > > goto err; > > } > > > > + buf += hdr_size(file); > > + > > + if (copy_from_user(packet->mad.data, buf, IB_MGMT_RMPP_HDR)) { > > + ret = -EFAULT; > > + goto err; > > + } > > + > > down_read(&file->port->mutex); > > > > agent = __get_agent(file, packet->mad.hdr.id); > > @@ -500,11 +519,11 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, > > IB_MGMT_RMPP_FLAG_ACTIVE; > > } > > > > - data_len = count - sizeof (struct ib_user_mad) - hdr_len; > > + data_len = count - hdr_size(file) - hdr_len; > > packet->msg = ib_create_send_mad(agent, > > be32_to_cpu(packet->mad.hdr.qpn), > > - 0, rmpp_active, hdr_len, > > - data_len, GFP_KERNEL); > > + packet->mad.hdr.pkey_index, rmpp_active, > > + hdr_len, data_len, GFP_KERNEL); > > if (IS_ERR(packet->msg)) { > > ret = PTR_ERR(packet->msg); > > goto err_ah; > > @@ -517,7 +536,6 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf, > > > > /* Copy MAD header. Any RMPP header is already in place. 
*/ > > memcpy(packet->msg->mad, packet->mad.data, IB_MGMT_MAD_HDR); > > - buf += sizeof (struct ib_user_mad); > > > > if (!rmpp_active) { > > if (copy_from_user(packet->msg->mad + copy_offset, > > @@ -646,6 +664,7 @@ found: > > goto out; > > } > > > > + file->already_used = 1; > > file->agent[agent_id] = agent; > > ret = 0; > > > > @@ -682,6 +701,20 @@ out: > > return ret; > > } > > > > +static long ib_umad_enable_pkey(struct ib_umad_file *file) > > +{ > > + int ret = 0; > > + > > + down_write(&file->port->mutex); > > + if (file->already_used) > > + ret = -EINVAL; > > + else > > + file->use_pkey_index = 1; > > + up_write(&file->port->mutex); > > + > > + return ret; > > +} > > + > > static long ib_umad_ioctl(struct file *filp, unsigned int cmd, > > unsigned long arg) > > { > > @@ -690,6 +723,8 @@ static long ib_umad_ioctl(struct file *filp, unsigned int cmd, > > return ib_umad_reg_agent(filp->private_data, arg); > > case IB_USER_MAD_UNREGISTER_AGENT: > > return ib_umad_unreg_agent(filp->private_data, arg); > > + case IB_USER_MAD_ENABLE_PKEY: > > + return ib_umad_enable_pkey(filp->private_data); > > default: > > return -ENOIOCTLCMD; > > } > > diff --git a/include/rdma/ib_user_mad.h b/include/rdma/ib_user_mad.h > > index d66b15e..2a32043 100644 > > --- a/include/rdma/ib_user_mad.h > > +++ b/include/rdma/ib_user_mad.h > > @@ -52,7 +52,50 @@ > > */ > > > > /** > > + * ib_user_mad_hdr_old - Old version of MAD packet header without pkey_index > > + * @id - ID of agent MAD received with/to be sent with > > + * @status - 0 on successful receive, ETIMEDOUT if no response > > + * received (transaction ID in data[] will be set to TID of original > > + * request) (ignored on send) > > + * @timeout_ms - Milliseconds to wait for response (unset on receive) > > + * @retries - Number of automatic retries to attempt > > + * @qpn - Remote QP number received from/to be sent to > > + * @qkey - Remote Q_Key to be sent with (unset on receive) > > + * @lid - Remote lid received from/to be 
sent to > > + * @sl - Service level received with/to be sent with > > + * @path_bits - Local path bits received with/to be sent with > > + * @grh_present - If set, GRH was received/should be sent > > + * @gid_index - Local GID index to send with (unset on receive) > > + * @hop_limit - Hop limit in GRH > > + * @traffic_class - Traffic class in GRH > > + * @gid - Remote GID in GRH > > + * @flow_label - Flow label in GRH > > + */ > > +struct ib_user_mad_hdr_old { > > + __u32 id; > > + __u32 status; > > + __u32 timeout_ms; > > + __u32 retries; > > + __u32 length; > > + __be32 qpn; > > + __be32 qkey; > > + __be16 lid; > > + __u8 sl; > > + __u8 path_bits; > > + __u8 grh_present; > > + __u8 gid_index; > > + __u8 hop_limit; > > + __u8 traffic_class; > > + __u8 gid[16]; > > + __be32 flow_label; > > +}; > > + > > +/** > > * ib_user_mad_hdr - MAD packet header > > + * This layout allows specifying/receiving the P_Key index. To use > > + * this capability, an application must call the > > + * IB_USER_MAD_ENABLE_PKEY ioctl on the user MAD file handle before > > + * any other actions with the file handle. 
> > * @id - ID of agent MAD received with/to be sent with > > * @status - 0 on successful receive, ETIMEDOUT if no response > > * received (transaction ID in data[] will be set to TID of original > > @@ -70,6 +113,7 @@ > > * @traffic_class - Traffic class in GRH > > * @gid - Remote GID in GRH > > * @flow_label - Flow label in GRH > > + * @pkey_index - P_Key index > > */ > > struct ib_user_mad_hdr { > > __u32 id; > > @@ -88,6 +132,8 @@ struct ib_user_mad_hdr { > > __u8 traffic_class; > > __u8 gid[16]; > > __be32 flow_label; > > + __u16 pkey_index; > > + __u8 reserved[6]; > > }; > > > > /** > > @@ -134,4 +180,6 @@ struct ib_user_mad_reg_req { > > > > #define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, __u32) > > > > +#define IB_USER_MAD_ENABLE_PKEY _IO(IB_IOCTL_MAGIC, 3) > > + > > #endif /* IB_USER_MAD_H */ > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From eli at mellanox.co.il Tue Sep 11 08:53:26 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:53:26 +0300 Subject: [ofa-general] IPOIB enhancements Message-ID: <1189526006.13053.110.camel@mtls03> Following this email is a list of patches with all ipoib offload enhancements we have currently in the ofa git as patches in the fixes directory. Most of them are resent and some are new. With these patches we saw 1350 MB/s of throughput in datagram mode and even higher rates in connected mode. We would like them to be used in 2.6.24. From eli at mellanox.co.il Tue Sep 11 08:53:40 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:53:40 +0300 Subject: [ofa-general] [PATCH 1 of 17] ib_ipoib: add high dma support Message-ID: <1189526020.13053.111.camel@mtls03> Add high dma support to ipoib This patch assumes all IB devices support 64 bit DMA. 
Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:17.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:24.000000000 +0300 @@ -1121,6 +1121,8 @@ static struct net_device *ipoib_add_port SET_NETDEV_DEV(priv->dev, hca->dma_device); + priv->dev->features |= NETIF_F_HIGHDMA; + result = ib_query_pkey(hca, port, 0, &priv->pkey); if (result) { printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", From eli at mellanox.co.il Tue Sep 11 08:53:51 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:53:51 +0300 Subject: [ofa-general] [PATCH 2 of 17] ipoib: hw csum and s/g support Message-ID: <1189526031.13053.112.camel@mtls03> From: Michael S. Tsirkin Add module option hw_csum: when set, IPoIB will report S/G support, and rely on hardware end-to-end transport checksum (ICRC) instead of software-level protocol checksums. Since this will not inter-operate with older IPoIB modules, this option is off by default. Signed-off-by: Michael S.
Tsirkin --- When applied on top of previously posted mlx4 patches, and with hw_csum enabled, this patch speeds up single-stream netperf bandwidth on connectx DDR from 1000 to 1250 MBytes/sec. I know some people find this approach controversial, but from my perspective, this is not worse than e.g. SDP which does not have SW checksums pretty much by design. Hopefully the option being off by default is enough to pacify the critics :). Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:17.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:25.000000000 +0300 @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_HW_CSUM = 11, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -104,9 +105,11 @@ enum { /* structs */ +#define IPOIB_HEADER_F_HWCSUM 0x1 + struct ipoib_header { __be16 proto; - u16 reserved; + __be16 flags; }; struct ipoib_pseudoheader { @@ -122,9 +125,54 @@ struct ipoib_rx_buf { struct ipoib_tx_buf { struct sk_buff *skb; - u64 mapping; + u64 mapping[MAX_SKB_FRAGS + 1]; }; +static inline int ipoib_dma_map_tx(struct ib_device *ca, struct ipoib_tx_buf *tx_req) +{ + struct sk_buff *skb = tx_req->skb; + u64 *mapping = tx_req->mapping; + int i, frags; + + mapping[0] = ib_dma_map_single(ca, skb->data, skb_headlen(skb), DMA_TO_DEVICE); + if (unlikely(ib_dma_mapping_error(ca, mapping[0]))) + return -EIO; + + frags = skb_shinfo(skb)->nr_frags; + for (i = 0; i < frags; ++i) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + mapping[i + 1] = ib_dma_map_page(ca, frag->page, frag->page_offset, + frag->size, DMA_TO_DEVICE); + if (unlikely(ib_dma_mapping_error(ca, mapping[i + 1]))) + goto partial_error; + } + return 0; + +partial_error: + ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + + for (; i > 0; 
--i) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1]; + ib_dma_unmap_page(ca, mapping[i], frag->size, DMA_TO_DEVICE); + } + return -EIO; +} + +static inline void ipoib_dma_unmap_tx(struct ib_device *ca, struct ipoib_tx_buf *tx_req) +{ + struct sk_buff *skb = tx_req->skb; + u64 *mapping = tx_req->mapping; + int i, frags; + + ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + + frags = skb_shinfo(skb)->nr_frags; + for (i = 0; i < frags; ++i) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + ib_dma_unmap_page(ca, mapping[i + 1], frag->size, DMA_TO_DEVICE); + } +} + struct ib_cm_id; struct ipoib_cm_data { @@ -269,7 +317,7 @@ struct ipoib_dev_priv { struct ipoib_tx_buf *tx_ring; unsigned tx_head; unsigned tx_tail; - struct ib_sge tx_sge; + struct ib_sge tx_sge[MAX_SKB_FRAGS + 1]; struct ib_send_wr tx_wr; struct ib_wc ibwc[IPOIB_NUM_WC]; Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-11 21:15:25.000000000 +0300 @@ -407,6 +407,7 @@ void ipoib_cm_handle_rx_wc(struct net_de unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; int frags; + struct ipoib_header *header; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); @@ -469,7 +470,10 @@ void ipoib_cm_handle_rx_wc(struct net_de skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); - skb->protocol = ((struct ipoib_header *) skb->data)->proto; + header = (struct ipoib_header *)skb->data; + skb->protocol = header->proto; + if (header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb->ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); @@ -491,14 +495,21 @@ repost: static inline int post_send(struct ipoib_dev_priv *priv, struct ipoib_cm_tx *tx, unsigned int 
wr_id, - u64 addr, int len) + u64 *mapping, int headlen, + skb_frag_t *frags, + int nr_frags) { struct ib_send_wr *bad_wr; + int i; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; - - priv->tx_wr.wr_id = wr_id; + priv->tx_sge[0].addr = mapping[0]; + priv->tx_sge[0].length = headlen; + for (i = 0; i < nr_frags; ++i) { + priv->tx_sge[i + 1].addr = mapping[i + 1]; + priv->tx_sge[i + 1].length = frags[i].size; + } + priv->tx_wr.num_sge = nr_frags + 1; + priv->tx_wr.wr_id = wr_id; return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr); } @@ -507,7 +518,6 @@ void ipoib_cm_send(struct net_device *de { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_tx_buf *tx_req; - u64 addr; if (unlikely(skb->len > tx->mtu)) { ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", @@ -530,20 +540,19 @@ void ipoib_cm_send(struct net_device *de */ tx_req = &tx->tx_ring[tx->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; - addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE); - if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { + if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); return; } - tx_req->mapping = addr; - if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), - addr, skb->len))) { + tx_req->mapping, skb_headlen(skb), + skb_shinfo(skb)->frags, + skb_shinfo(skb)->nr_frags))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; - ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(skb); } else { dev->trans_start = jiffies; @@ -577,7 +586,7 @@ static void ipoib_cm_handle_tx_wc(struct tx_req = &tx->tx_ring[wr_id]; - ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); /* FIXME: is this right? Shouldn't we only increment on success? 
*/ ++priv->stats.tx_packets; @@ -814,7 +823,7 @@ static struct ib_qp *ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; - attr.cap.max_send_sge = 1; + attr.cap.max_send_sge = dev->features & NETIF_F_SG ? MAX_SKB_FRAGS + 1 : 0; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -981,8 +990,7 @@ static void ipoib_cm_tx_destroy(struct i if (p->tx_ring) { while ((int) p->tx_tail - (int) p->tx_head < 0) { tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; - ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, - DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(tx_req->skb); ++p->tx_tail; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-11 21:15:25.000000000 +0300 @@ -170,6 +170,7 @@ static void ipoib_ib_handle_rx_wc(struct struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV; struct sk_buff *skb; + struct ipoib_header *header; u64 addr; ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n", @@ -220,7 +221,10 @@ static void ipoib_ib_handle_rx_wc(struct skb_put(skb, wc->byte_len); skb_pull(skb, IB_GRH_BYTES); - skb->protocol = ((struct ipoib_header *) skb->data)->proto; + header = (struct ipoib_header *)skb->data; + skb->protocol = header->proto; + if (header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb->ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); @@ -257,8 +261,7 @@ static void ipoib_ib_handle_tx_wc(struct tx_req = &priv->tx_ring[wr_id]; - ib_dma_unmap_single(priv->ca, tx_req->mapping, - tx_req->skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); 
++priv->stats.tx_packets; priv->stats.tx_bytes += tx_req->skb->len; @@ -343,16 +346,23 @@ void ipoib_ib_completion(struct ib_cq *c static inline int post_send(struct ipoib_dev_priv *priv, unsigned int wr_id, struct ib_ah *address, u32 qpn, - u64 addr, int len) + u64 *mapping, int headlen, + skb_frag_t *frags, + int nr_frags) { struct ib_send_wr *bad_wr; + int i; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; - - priv->tx_wr.wr_id = wr_id; - priv->tx_wr.wr.ud.remote_qpn = qpn; - priv->tx_wr.wr.ud.ah = address; + priv->tx_sge[0].addr = mapping[0]; + priv->tx_sge[0].length = headlen; + for (i = 0; i < nr_frags; ++i) { + priv->tx_sge[i + 1].addr = mapping[i + 1]; + priv->tx_sge[i + 1].length = frags[i].size; + } + priv->tx_wr.num_sge = nr_frags + 1; + priv->tx_wr.wr_id = wr_id; + priv->tx_wr.wr.ud.remote_qpn = qpn; + priv->tx_wr.wr.ud.ah = address; return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr); } @@ -362,7 +372,6 @@ void ipoib_send(struct net_device *dev, { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_tx_buf *tx_req; - u64 addr; if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", @@ -385,20 +394,19 @@ void ipoib_send(struct net_device *dev, */ tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; - addr = ib_dma_map_single(priv->ca, skb->data, skb->len, - DMA_TO_DEVICE); - if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { + if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); return; } - tx_req->mapping = addr; if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), - address->ah, qpn, addr, skb->len))) { + address->ah, qpn, + tx_req->mapping, skb_headlen(skb), + skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; - ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); + 
ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(skb); } else { dev->trans_start = jiffies; @@ -604,10 +612,7 @@ int ipoib_ib_dev_stop(struct net_device while ((int) priv->tx_tail - (int) priv->tx_head < 0) { tx_req = &priv->tx_ring[priv->tx_tail & (ipoib_sendq_size - 1)]; - ib_dma_unmap_single(priv->ca, - tx_req->mapping, - tx_req->skb->len, - DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(tx_req->skb); ++priv->tx_tail; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:24.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:25.000000000 +0300 @@ -55,11 +55,14 @@ MODULE_LICENSE("Dual BSD/GPL"); int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE; int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE; +static int ipoib_hw_csum __read_mostly = 0; module_param_named(send_queue_size, ipoib_sendq_size, int, 0444); MODULE_PARM_DESC(send_queue_size, "Number of descriptors in send queue"); module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444); MODULE_PARM_DESC(recv_queue_size, "Number of descriptors in receive queue"); +module_param_named(hw_csum, ipoib_hw_csum, int, 0444); +MODULE_PARM_DESC(hw_csum, "Rely on hardware end-to-end checksum (ICRC) if > 0"); #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -804,11 +807,16 @@ static int ipoib_hard_header(struct sk_b void *daddr, void *saddr, unsigned len) { struct ipoib_header *header; + struct ipoib_dev_priv *priv = netdev_priv(dev); header = (struct ipoib_header *) skb_push(skb, sizeof *header); header->proto = htons(type); - header->reserved = 0; + if (test_bit(IPOIB_FLAG_HW_CSUM, &priv->flags) && + skb->ip_summed == CHECKSUM_PARTIAL) + header->flags = cpu_to_be16(IPOIB_HEADER_F_HWCSUM); + else + header->flags = 0; /* * If we don't have a 
neighbour structure, stuff the @@ -1006,6 +1014,10 @@ static void ipoib_setup(struct net_devic dev->type = ARPHRD_INFINIBAND; dev->tx_queue_len = ipoib_sendq_size * 2; dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + if (ipoib_hw_csum) { + dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM; + set_bit(IPOIB_FLAG_HW_CSUM, &priv->flags); + } /* MTU will be reset when mcast join happens */ dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-11 21:15:25.000000000 +0300 @@ -149,14 +149,14 @@ int ipoib_transport_dev_init(struct net_ .cap = { .max_send_wr = ipoib_sendq_size, .max_recv_wr = ipoib_recvq_size, - .max_send_sge = 1, + .max_send_sge = MAX_SKB_FRAGS + 1, .max_recv_sge = 1 }, .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_UD }; - int ret, size; + int i, ret, size; priv->pd = ib_alloc_pd(priv->ca); if (IS_ERR(priv->pd)) { @@ -197,11 +197,11 @@ int ipoib_transport_dev_init(struct net_ priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; - priv->tx_sge.lkey = priv->mr->lkey; + for (i = 0; i < MAX_SKB_FRAGS + 1; ++i) + priv->tx_sge[i].lkey = priv->mr->lkey; priv->tx_wr.opcode = IB_WR_SEND; - priv->tx_wr.sg_list = &priv->tx_sge; - priv->tx_wr.num_sge = 1; + priv->tx_wr.sg_list = priv->tx_sge; priv->tx_wr.send_flags = IB_SEND_SIGNALED; return 0; From eli at mellanox.co.il Tue Sep 11 08:54:04 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:04 +0300 Subject: [ofa-general] [PATCH 3 of 17] ib_core: add checksum offload support Message-ID: <1189526044.13053.113.camel@mtls03> Add checksum offload support to the core Signed-off-by: Eli Cohen --- A device that publishes 
IB_DEVICE_IP_CSUM actually supports calculating checksum on transmit and provides indication whether the checksum is OK on receive. Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/rdma/ib_verbs.h 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/include/rdma/ib_verbs.h 2007-09-11 21:15:25.000000000 +0300 @@ -95,7 +95,8 @@ enum ib_device_cap_flags { IB_DEVICE_N_NOTIFY_CQ = (1<<14), IB_DEVICE_ZERO_STAG = (1<<15), IB_DEVICE_SEND_W_INV = (1<<16), - IB_DEVICE_MEM_WINDOW = (1<<17) + IB_DEVICE_MEM_WINDOW = (1<<17), + IB_DEVICE_IP_CSUM = (1<<18) }; enum ib_atomic_cap { @@ -431,6 +432,8 @@ struct ib_wc { u8 sl; u8 dlid_path_bits; u8 port_num; /* valid only for DR SMPs on switches */ + u16 csum; + int csum_ok; }; enum ib_cq_notify_flags { @@ -615,7 +618,9 @@ enum ib_send_flags { IB_SEND_FENCE = 1, IB_SEND_SIGNALED = (1<<1), IB_SEND_SOLICITED = (1<<2), - IB_SEND_INLINE = (1<<3) + IB_SEND_INLINE = (1<<3), + IB_SEND_IP_CSUM = (1<<4), + IB_SEND_UDP_TCP_CSUM = (1<<5) }; struct ib_sge { From eli at mellanox.co.il Tue Sep 11 08:54:11 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:11 +0300 Subject: [ofa-general] [PATCH 4 of 17] mthca: add checksum offload support Message-ID: <1189526051.13053.114.camel@mtls03> Add checksum offload support in mthca Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-09-11 21:15:25.000000000 +0300 @@ -1377,6 +1377,9 @@ int mthca_INIT_HCA(struct mthca_dev *dev MTHCA_PUT(inbox, param->uarc_base, INIT_HCA_UAR_CTX_BASE_OFFSET); } + if (dev->device_cap_flags & IB_DEVICE_IP_CSUM) + *(inbox + INIT_HCA_FLAGS2_OFFSET / 4) |= cpu_to_be32(7 << 
3); + err = mthca_cmd(dev, mailbox->dma, 0, 0, CMD_INIT_HCA, HZ, status); mthca_free_mailbox(dev, mailbox); Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cmd.h 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.h 2007-09-11 21:15:25.000000000 +0300 @@ -103,6 +103,7 @@ enum { DEV_LIM_FLAG_RAW_IPV6 = 1 << 4, DEV_LIM_FLAG_RAW_ETHER = 1 << 5, DEV_LIM_FLAG_SRQ = 1 << 6, + DEV_LIM_FLAG_IPOIB_CSUM = 1 << 7, DEV_LIM_FLAG_BAD_PKEY_CNTR = 1 << 8, DEV_LIM_FLAG_BAD_QKEY_CNTR = 1 << 9, DEV_LIM_FLAG_MW = 1 << 16, Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2007-09-11 21:15:20.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cq.c 2007-09-11 21:15:25.000000000 +0300 @@ -119,7 +119,8 @@ struct mthca_cqe { __be32 my_qpn; __be32 my_ee; __be32 rqpn; - __be16 sl_g_mlpath; + u8 sl_ipok; + u8 g_mlpath; __be16 rlid; __be32 imm_etype_pkey_eec; __be32 byte_cnt; @@ -498,6 +499,7 @@ static inline int mthca_poll_one(struct int is_send; int free_cqe = 1; int err = 0; + u16 checksum; cqe = next_cqe_sw(cq); if (!cqe) @@ -639,12 +641,15 @@ static inline int mthca_poll_one(struct break; } entry->slid = be16_to_cpu(cqe->rlid); - entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->sl = cqe->sl_ipok >> 4; entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; - entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->dlid_path_bits = cqe->g_mlpath & 0x7f; entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; - entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? - IB_WC_GRH : 0; + entry->wc_flags |= cqe->g_mlpath & 0x80 ? 
IB_WC_GRH : 0; + checksum = (be32_to_cpu(cqe->rqpn) >> 24) | + ((be32_to_cpu(cqe->my_ee) >> 16) & 0xff00); + entry->csum_ok = (cqe->sl_ipok & 1 && checksum == 0xffff); + entry->csum = checksum; } entry->status = IB_WC_SUCCESS; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_main.c 2007-09-11 21:15:19.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_main.c 2007-09-11 21:15:25.000000000 +0300 @@ -289,6 +289,10 @@ static int mthca_dev_lim(struct mthca_de if (dev_lim->flags & DEV_LIM_FLAG_SRQ) mdev->mthca_flags |= MTHCA_FLAG_SRQ; + if (mthca_is_memfree(mdev)) + if (dev_lim->flags & DEV_LIM_FLAG_IPOIB_CSUM) + mdev->device_cap_flags |= IB_DEVICE_IP_CSUM; + return 0; } @@ -1125,6 +1129,8 @@ static int __mthca_init_one(struct pci_d if (err) goto err_cmd; + mdev->ib_dev.flags = mdev->device_cap_flags; + if (mdev->fw_ver < mthca_hca_table[hca_type].latest_fw) { mthca_warn(mdev, "HCA FW version %d.%d.%03d is old (%d.%d.%03d is current).\n", (int) (mdev->fw_ver >> 32), (int) (mdev->fw_ver >> 16) & 0xffff, Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2007-09-11 21:15:20.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_qp.c 2007-09-11 21:15:25.000000000 +0300 @@ -2024,6 +2024,10 @@ int mthca_arbel_post_send(struct ib_qp * cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | ((wr->send_flags & IB_SEND_SOLICITED) ? cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + ((wr->send_flags & IB_SEND_IP_CSUM) ? + cpu_to_be32(MTHCA_NEXT_IP_CSUM) : 0) | + ((wr->send_flags & IB_SEND_UDP_TCP_CSUM) ? 
+ cpu_to_be32(MTHCA_NEXT_TCP_UDP_CSUM) : 0) | cpu_to_be32(1); if (wr->opcode == IB_WR_SEND_WITH_IMM || wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_wqe.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_wqe.h 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_wqe.h 2007-09-11 21:15:25.000000000 +0300 @@ -38,14 +38,15 @@ #include enum { - MTHCA_NEXT_DBD = 1 << 7, - MTHCA_NEXT_FENCE = 1 << 6, - MTHCA_NEXT_CQ_UPDATE = 1 << 3, - MTHCA_NEXT_EVENT_GEN = 1 << 2, - MTHCA_NEXT_SOLICIT = 1 << 1, - - MTHCA_MLX_VL15 = 1 << 17, - MTHCA_MLX_SLR = 1 << 16 + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + MTHCA_NEXT_IP_CSUM = 1 << 4, + MTHCA_NEXT_TCP_UDP_CSUM = 1 << 5, + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 }; enum { From eli at mellanox.co.il Tue Sep 11 08:54:18 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:18 +0300 Subject: [ofa-general] [PATCH 5 of 17] mlx4: add checksum offload support Message-ID: <1189526058.13053.115.camel@mtls03> Add checksum offload support to mlx4 Signed-off-by: Ali Ayub Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/include/linux/mlx4/cq.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/cq.h 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/include/linux/mlx4/cq.h 2007-09-11 21:15:26.000000000 +0300 @@ -45,11 +45,11 @@ struct mlx4_cqe { u8 sl; u8 reserved1; __be16 rlid; - u32 reserved2; + __be32 ipoib_status; __be32 byte_cnt; __be16 wqe_index; __be16 checksum; - u8 reserved3[3]; + u8 reserved2[3]; u8 owner_sr_opcode; }; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c =================================================================== --- 
ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/cq.c 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c 2007-09-11 21:15:26.000000000 +0300 @@ -431,6 +431,9 @@ static int mlx4_ib_poll_one(struct mlx4_ wc->wc_flags |= be32_to_cpu(cqe->g_mlpath_rqpn) & 0x80000000 ? IB_WC_GRH : 0; wc->pkey_index = be32_to_cpu(cqe->immed_rss_invalid) >> 16; + wc->csum = be16_to_cpu(cqe->checksum); + wc->csum_ok = be32_to_cpu(cqe->ipoib_status) & 0x10000000 && + wc->csum == 0xffff; } return 0; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-11 21:15:09.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c 2007-09-11 21:15:26.000000000 +0300 @@ -100,6 +100,8 @@ static int mlx4_ib_query_device(struct i props->device_cap_flags |= IB_DEVICE_AUTO_PATH_MIG; if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_UD_AV_PORT) props->device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE; + if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + props->device_cap_flags |= IB_DEVICE_IP_CSUM; props->vendor_id = be32_to_cpup((__be32 *) (out_mad->data + 36)) & 0xffffff; @@ -626,6 +628,9 @@ static void *mlx4_ib_add(struct mlx4_dev ibdev->ib_dev.unmap_fmr = mlx4_ib_unmap_fmr; ibdev->ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; + if (ibdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + ibdev->ib_dev.flags |= IB_DEVICE_IP_CSUM; + if (init_node_data(ibdev)) goto err_map; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-11 21:15:11.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-11 21:15:26.000000000 +0300 @@ -1278,6 +1278,10 @@ int mlx4_ib_post_send(struct ib_qp *ibqp cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE) : 0) | (wr->send_flags & 
IB_SEND_SOLICITED ? cpu_to_be32(MLX4_WQE_CTRL_SOLICITED) : 0) | + ((wr->send_flags & IB_SEND_IP_CSUM) ? + cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM) : 0) | + ((wr->send_flags & IB_SEND_UDP_TCP_CSUM) ? + cpu_to_be32(MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) | qp->sq_signal_bits; if (wr->opcode == IB_WR_SEND_WITH_IMM || Index: ofa_1_3_dev_kernel/include/linux/mlx4/qp.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/qp.h 2007-09-11 21:14:35.000000000 +0300 +++ ofa_1_3_dev_kernel/include/linux/mlx4/qp.h 2007-09-11 21:15:26.000000000 +0300 @@ -155,9 +155,11 @@ struct mlx4_qp_context { }; enum { - MLX4_WQE_CTRL_FENCE = 1 << 6, - MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, - MLX4_WQE_CTRL_SOLICITED = 1 << 1, + MLX4_WQE_CTRL_FENCE = 1 << 6, + MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, + MLX4_WQE_CTRL_SOLICITED = 1 << 1, + MLX4_WQE_CTRL_IP_CSUM = 1 << 4, + MLX4_WQE_CTRL_TCP_UDP_CSUM = 1 << 5, }; struct mlx4_wqe_ctrl_seg { Index: ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/fw.c 2007-09-11 21:15:03.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c 2007-09-11 21:15:26.000000000 +0300 @@ -741,6 +741,9 @@ int mlx4_INIT_HCA(struct mlx4_dev *dev, MLX4_PUT(inbox, (u8) (PAGE_SHIFT - 12), INIT_HCA_UAR_PAGE_SZ_OFFSET); MLX4_PUT(inbox, param->log_uar_sz, INIT_HCA_LOG_UAR_SZ_OFFSET); + if (dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 3); + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_INIT_HCA, 1000); if (err) From eli at mellanox.co.il Tue Sep 11 08:54:28 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:28 +0300 Subject: [ofa-general] [PATCH 7 of 17] ipoib: fix typo Message-ID: <1189526068.13053.117.camel@mtls03> Fix typo - comma instead of semicolon Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
=================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-11 21:15:25.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-11 21:15:27.000000000 +0300 @@ -185,7 +185,7 @@ int ipoib_transport_dev_init(struct net_ goto out_free_cq; init_attr.send_cq = priv->cq; - init_attr.recv_cq = priv->cq, + init_attr.recv_cq = priv->cq; priv->qp = ib_create_qp(priv->pd, &init_attr); if (IS_ERR(priv->qp)) { From eli at mellanox.co.il Tue Sep 11 08:54:23 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:23 +0300 Subject: [ofa-general] [PATCH 6 of 17] ipoib: add checksum offload support Message-ID: <1189526063.13053.116.camel@mtls03> Add checksum offload support to ipoib Signed-off-by: Eli Cohen Signed-off-by: Ali Ayub --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:25.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:26.000000000 +0300 @@ -87,6 +87,7 @@ enum { IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, IPOIB_FLAG_HW_CSUM = 11, + IPOIB_FLAG_RX_CSUM = 12, IPOIB_MAX_BACKOFF_SECONDS = 16, Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-11 21:15:25.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-11 21:15:26.000000000 +0300 @@ -1262,6 +1262,13 @@ static ssize_t set_mode(struct device *d set_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); ipoib_warn(priv, "enabling connected mode " "will cause multicast packet drops\n"); + + /* clear ipv6 
flag too */ + dev->features &= ~NETIF_F_IP_CSUM; + + priv->tx_wr.send_flags &= + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + ipoib_flush_paths(dev); return count; } @@ -1270,6 +1277,11 @@ static ssize_t set_mode(struct device *d clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); dev->mtu = min(priv->mcast_mtu, dev->mtu); ipoib_flush_paths(dev); + + if (priv->ca->flags & IB_DEVICE_IP_CSUM && + !test_bit(IPOIB_FLAG_HW_CSUM, &priv->flags)) + dev->features |= NETIF_F_IP_CSUM; /* ipv6 too */ + return count; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-11 21:15:25.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-11 21:15:26.000000000 +0300 @@ -37,6 +37,7 @@ #include #include +#include #include @@ -235,6 +236,11 @@ static void ipoib_ib_handle_rx_wc(struct skb->dev = dev; /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; + + /* check rx csum */ + if (test_bit(IPOIB_FLAG_RX_CSUM, &priv->flags) && likely(wc->csum_ok)) + skb->ip_summed = CHECKSUM_UNNECESSARY; + netif_receive_skb(skb); repost: @@ -400,6 +406,16 @@ void ipoib_send(struct net_device *dev, return; } + if (!test_bit(IPOIB_FLAG_HW_CSUM, &priv->flags) && + priv->ca->flags & IB_DEVICE_IP_CSUM && + skb->ip_summed == CHECKSUM_PARTIAL) + priv->tx_wr.send_flags |= + IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM; + else + priv->tx_wr.send_flags &= + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + + if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, tx_req->mapping, skb_headlen(skb), Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:25.000000000 +0300 +++ 
ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:26.000000000 +0300 @@ -1121,6 +1121,29 @@ int ipoib_add_pkey_attr(struct net_devic return device_create_file(&dev->dev, &dev_attr_pkey); } +static void set_tx_csum(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags) || ipoib_hw_csum) + return; + + if (!(priv->ca->flags & IB_DEVICE_IP_CSUM)) + return; + + dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; /* turn on ipv6 too */ +} + +static void set_rx_csum(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!(priv->ca->flags & IB_DEVICE_IP_CSUM)) + return; + + set_bit(IPOIB_FLAG_RX_CSUM, &priv->flags); +} + static struct net_device *ipoib_add_port(const char *format, struct ib_device *hca, u8 port) { @@ -1177,6 +1200,9 @@ static struct net_device *ipoib_add_port goto event_failed; } + set_tx_csum(priv->dev); + set_rx_csum(priv->dev); + result = register_netdev(priv->dev); if (result) { printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", From eli at mellanox.co.il Tue Sep 11 08:54:32 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:32 +0300 Subject: [ofa-general] [PATCH 8 of 17] mlx4: configure QP max msg size Message-ID: <1189526072.13053.118.camel@mtls03> Configure QP's max message size according to the value queried by query dev cap. 
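For readers following along, here is a rough sketch of the encoding this patch changes; it is an illustration under stated assumptions, not the driver code. The diff ORs `path_mtu << 5` with the base-2 logarithm of the queried maximum message size, where the old code hard-coded 31. The helper names `ilog2_u32` and `pack_mtu_msgmax` are hypothetical, the field is assumed to be 8 bits wide (3 bits of MTU enum, 5 bits of log2 size), and masking the low field to 5 bits is an extra safety step the patch itself does not perform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's ilog2(): floor(log2(v)), v > 0. */
static unsigned ilog2_u32(uint32_t v)
{
    unsigned r = 0;

    assert(v != 0);        /* log2(0) is undefined */
    while (v >>= 1)
        ++r;
    return r;
}

/*
 * Pack an IB MTU enum value (assumed: 1 = 256 bytes ... 5 = 4096 bytes)
 * and a maximum message size in bytes into one byte:
 *   bits 7..5 = MTU enum, bits 4..0 = log2(max message size).
 */
static uint8_t pack_mtu_msgmax(unsigned mtu_enum, uint32_t max_msg_sz)
{
    return (uint8_t)((mtu_enum << 5) | (ilog2_u32(max_msg_sz) & 0x1f));
}
```

Under these assumptions, `pack_mtu_msgmax(5, 1u << 30)` yields `0xbe`, whereas the previous hard-coded `| 31` would have produced `0xbf` for the same MTU regardless of what the device reported.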
Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-11 21:15:26.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-11 21:15:27.000000000 +0300 @@ -757,7 +757,8 @@ static int __mlx4_ib_modify_qp(struct ib attr->path_mtu); goto out; } - context->mtu_msgmax = (attr->path_mtu << 5) | 31; + context->mtu_msgmax = (attr->path_mtu << 5) | + ilog2(dev->dev->caps.max_msg_sz); } if (qp->rq.wqe_cnt) From eli at mellanox.co.il Tue Sep 11 08:54:37 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:37 +0300 Subject: [ofa-general] [PATCH 9 of 17] ib_core: add LSO support Message-ID: <1189526077.13053.119.camel@mtls03> Add LSO support at the core. This patch provides support at the core level for devices that support TCP large send offload fragmentation. Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/rdma/ib_verbs.h 2007-09-11 21:15:25.000000000 +0300 +++ ofa_1_3_dev_kernel/include/rdma/ib_verbs.h 2007-09-11 21:15:27.000000000 +0300 @@ -96,7 +96,8 @@ enum ib_device_cap_flags { IB_DEVICE_ZERO_STAG = (1<<15), IB_DEVICE_SEND_W_INV = (1<<16), IB_DEVICE_MEM_WINDOW = (1<<17), - IB_DEVICE_IP_CSUM = (1<<18) + IB_DEVICE_IP_CSUM = (1<<18), + IB_DEVICE_TCP_GSO = (1<<19) }; enum ib_atomic_cap { @@ -404,6 +405,7 @@ enum ib_wc_opcode { IB_WC_COMP_SWAP, IB_WC_FETCH_ADD, IB_WC_BIND_MW, + IB_WC_LSO, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). 
@@ -608,6 +610,7 @@ enum ib_wr_opcode { IB_WR_RDMA_WRITE, IB_WR_RDMA_WRITE_WITH_IMM, IB_WR_SEND, + IB_WR_LSO, IB_WR_SEND_WITH_IMM, IB_WR_RDMA_READ, IB_WR_ATOMIC_CMP_AND_SWP, @@ -620,7 +623,8 @@ enum ib_send_flags { IB_SEND_SOLICITED = (1<<2), IB_SEND_INLINE = (1<<3), IB_SEND_IP_CSUM = (1<<4), - IB_SEND_UDP_TCP_CSUM = (1<<5) + IB_SEND_UDP_TCP_CSUM = (1<<5), + IB_SEND_UDP_LSO = (1<<6) }; struct ib_sge { @@ -650,6 +654,9 @@ struct ib_send_wr { } atomic; struct { struct ib_ah *ah; + void *header; + int hlen; + int mss; u32 remote_qpn; u32 remote_qkey; u16 pkey_index; /* valid for GSI only */ From eli at mellanox.co.il Tue Sep 11 08:54:43 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:43 +0300 Subject: [ofa-general] [PATCH 10 of 17] mlx4: add LSO support Message-ID: <1189526083.13053.120.camel@mtls03> Add LSO support to mlx4 Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/fw.c 2007-09-11 21:15:26.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c 2007-09-11 21:15:28.000000000 +0300 @@ -133,6 +133,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev * #define QUERY_DEV_CAP_MAX_AV_OFFSET 0x27 #define QUERY_DEV_CAP_MAX_REQ_QP_OFFSET 0x29 #define QUERY_DEV_CAP_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_CAP_MAX_GSO_OFFSET 0x2d #define QUERY_DEV_CAP_MAX_RDMA_OFFSET 0x2f #define QUERY_DEV_CAP_RSZ_SRQ_OFFSET 0x33 #define QUERY_DEV_CAP_ACK_DELAY_OFFSET 0x35 @@ -215,6 +216,13 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev * dev_cap->max_requester_per_qp = 1 << (field & 0x3f); MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_RES_QP_OFFSET); dev_cap->max_responder_per_qp = 1 << (field & 0x3f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_GSO_OFFSET); + field &= 0x1f; + if (!field) + dev_cap->max_gso_sz = 0; + else + dev_cap->max_gso_sz = 1 << field; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_RDMA_OFFSET); 
dev_cap->max_rdma_global = 1 << (field & 0x3f); MLX4_GET(field, outbox, QUERY_DEV_CAP_ACK_DELAY_OFFSET); @@ -377,6 +385,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev * dev_cap->max_sq_desc_sz, dev_cap->max_sq_sg); mlx4_dbg(dev, "Max RQ desc size: %d, max RQ S/G: %d\n", dev_cap->max_rq_desc_sz, dev_cap->max_rq_sg); + mlx4_dbg(dev, "Max GSO size: %d\n", dev_cap->max_gso_sz); dump_dev_cap_flags(dev, dev_cap->flags); Index: ofa_1_3_dev_kernel/drivers/net/mlx4/fw.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/fw.h 2007-09-11 21:14:34.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/fw.h 2007-09-11 21:15:28.000000000 +0300 @@ -96,6 +96,7 @@ struct mlx4_dev_cap { u8 bmme_flags; u32 reserved_lkey; u64 max_icm_sz; + int max_gso_sz; }; struct mlx4_adapter { Index: ofa_1_3_dev_kernel/drivers/net/mlx4/main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/main.c 2007-09-11 21:15:18.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/main.c 2007-09-11 21:15:28.000000000 +0300 @@ -159,6 +159,7 @@ static int __devinit mlx4_dev_cap(struct dev->caps.page_size_cap = ~(u32) (dev_cap->min_page_sz - 1); dev->caps.flags = dev_cap->flags; dev->caps.stat_rate_support = dev_cap->stat_rate_support; + dev->caps.max_gso_sz = dev_cap->max_gso_sz; return 0; } Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-11 21:15:26.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c 2007-09-11 21:15:28.000000000 +0300 @@ -102,6 +102,8 @@ static int mlx4_ib_query_device(struct i props->device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE; if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM) props->device_cap_flags |= IB_DEVICE_IP_CSUM; + if (dev->dev->caps.max_gso_sz) + props->device_cap_flags |= 
IB_DEVICE_TCP_GSO; props->vendor_id = be32_to_cpup((__be32 *) (out_mad->data + 36)) & 0xffffff; @@ -630,6 +632,8 @@ static void *mlx4_ib_add(struct mlx4_dev if (ibdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM) ibdev->ib_dev.flags |= IB_DEVICE_IP_CSUM; + if (ibdev->dev->caps.max_gso_sz) + ibdev->ib_dev.flags |= IB_DEVICE_TCP_GSO; if (init_node_data(ibdev)) goto err_map; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-11 21:15:27.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-11 21:15:28.000000000 +0300 @@ -69,6 +69,7 @@ enum { static const __be32 mlx4_ib_opcode[] = { [IB_WR_SEND] = __constant_cpu_to_be32(MLX4_OPCODE_SEND), + [IB_WR_LSO] = __constant_cpu_to_be32(MLX4_OPCODE_LSO), [IB_WR_SEND_WITH_IMM] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_IMM), [IB_WR_RDMA_WRITE] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_WRITE), [IB_WR_RDMA_WRITE_WITH_IMM] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_WRITE_IMM), @@ -748,9 +749,11 @@ static int __mlx4_ib_modify_qp(struct ib } } - if (ibqp->qp_type == IB_QPT_GSI || ibqp->qp_type == IB_QPT_SMI || - ibqp->qp_type == IB_QPT_UD) + if (ibqp->qp_type == IB_QPT_GSI || ibqp->qp_type == IB_QPT_SMI) context->mtu_msgmax = (IB_MTU_4096 << 5) | 11; + else if (ibqp->qp_type == IB_QPT_UD) + context->mtu_msgmax = (IB_MTU_4096 << 5) | + ilog2(dev->dev->caps.max_gso_sz); else if (attr_mask & IB_QP_PATH_MTU) { if (attr->path_mtu < IB_MTU_256 || attr->path_mtu > IB_MTU_4096) { printk(KERN_ERR "path MTU (%u) is invalid\n", @@ -1240,6 +1243,29 @@ static void set_data_seg(struct mlx4_wqe dseg->byte_count = cpu_to_be32(sg->length); } +static int build_lso_seg(struct mlx4_lso_seg *wqe, struct ib_send_wr *wr, + struct mlx4_ib_qp *qp, int *lso_seg_len) +{ + int halign; + + memcpy(wqe->header, wr->wr.ud.header, wr->wr.ud.hlen); + + /* make sure LSO header is written before + 
overwriting stamping */ + wmb(); + + wqe->mss_hdr_size = cpu_to_be32(((wr->wr.ud.mss - wr->wr.ud.hlen) + << 16) | wr->wr.ud.hlen); + + halign = ALIGN(wr->wr.ud.hlen, 16); + + if (unlikely(wr->num_sge > qp->sq.max_gs - (halign >> 4))) + return -EINVAL; + + *lso_seg_len = halign; + return 0; +} + int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { @@ -1331,6 +1357,19 @@ int mlx4_ib_post_send(struct ib_qp *ibqp set_datagram_seg(wqe, wr); wqe += sizeof (struct mlx4_wqe_datagram_seg); size += sizeof (struct mlx4_wqe_datagram_seg) / 16; + + if (wr->opcode == IB_WR_LSO) { + int hlen; + + err = build_lso_seg(wqe, wr, qp, &hlen); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += hlen; + size += hlen >> 4; + } + break; case IB_QPT_SMI: Index: ofa_1_3_dev_kernel/include/linux/mlx4/device.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/device.h 2007-09-11 21:15:11.000000000 +0300 +++ ofa_1_3_dev_kernel/include/linux/mlx4/device.h 2007-09-11 21:15:28.000000000 +0300 @@ -181,6 +181,7 @@ struct mlx4_caps { u32 flags; u16 stat_rate_support; u8 port_width_cap[MLX4_MAX_PORTS + 1]; + int max_gso_sz; }; struct mlx4_buf_list { Index: ofa_1_3_dev_kernel/include/linux/mlx4/qp.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/qp.h 2007-09-11 21:15:26.000000000 +0300 +++ ofa_1_3_dev_kernel/include/linux/mlx4/qp.h 2007-09-11 21:15:28.000000000 +0300 @@ -215,6 +215,11 @@ struct mlx4_wqe_datagram_seg { __be32 reservd[2]; }; +struct mlx4_lso_seg { + __be32 mss_hdr_size; + __be32 header[0]; +}; + struct mlx4_wqe_bind_seg { __be32 flags1; __be32 flags2; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/cq.c 2007-09-11 21:15:26.000000000 +0300 +++ 
ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c 2007-09-11 21:15:28.000000000 +0300 @@ -403,6 +403,9 @@ static int mlx4_ib_poll_one(struct mlx4_ case MLX4_OPCODE_BIND_MW: wc->opcode = IB_WC_BIND_MW; break; + case MLX4_OPCODE_LSO: + wc->opcode = IB_WC_LSO; + break; } } else { wc->byte_len = be32_to_cpu(cqe->byte_cnt); From eli at mellanox.co.il Tue Sep 11 08:54:47 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:47 +0300 Subject: [ofa-general] [PATCH 11 of 17] ipoib: add LSO support Message-ID: <1189526087.13053.121.camel@mtls03> Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:26.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:28.000000000 +0300 @@ -133,28 +133,38 @@ static inline int ipoib_dma_map_tx(struc { struct sk_buff *skb = tx_req->skb; u64 *mapping = tx_req->mapping; - int i, frags; + int i, frags, off; - mapping[0] = ib_dma_map_single(ca, skb->data, skb_headlen(skb), DMA_TO_DEVICE); - if (unlikely(ib_dma_mapping_error(ca, mapping[0]))) - return -EIO; + if (!skb_is_gso(skb)) { + mapping[0] = ib_dma_map_single(ca, skb->data, skb_headlen(skb), DMA_TO_DEVICE); + if (unlikely(ib_dma_mapping_error(ca, mapping[0]))) + return -EIO; + off = 1; + } + else + off = 0; frags = skb_shinfo(skb)->nr_frags; for (i = 0; i < frags; ++i) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; - mapping[i + 1] = ib_dma_map_page(ca, frag->page, frag->page_offset, - frag->size, DMA_TO_DEVICE); - if (unlikely(ib_dma_mapping_error(ca, mapping[i + 1]))) + mapping[i + off] = ib_dma_map_page(ca, frag->page, frag->page_offset, + frag->size, DMA_TO_DEVICE); + if (unlikely(ib_dma_mapping_error(ca, mapping[i + off]))) goto partial_error; } return 0; partial_error: - ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + if (!skb_is_gso(skb)) { + 
ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + off = 0; + } + else + off = 1; for (; i > 0; --i) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1]; - ib_dma_unmap_page(ca, mapping[i], frag->size, DMA_TO_DEVICE); + ib_dma_unmap_page(ca, mapping[i - off], frag->size, DMA_TO_DEVICE); } return -EIO; } @@ -163,14 +173,19 @@ static inline void ipoib_dma_unmap_tx(st { struct sk_buff *skb = tx_req->skb; u64 *mapping = tx_req->mapping; - int i, frags; + int i, frags, off; - ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + if (!skb_is_gso(skb)) { + ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + off = 1; + } + else + off = 0; frags = skb_shinfo(skb)->nr_frags; for (i = 0; i < frags; ++i) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; - ib_dma_unmap_page(ca, mapping[i + 1], frag->size, DMA_TO_DEVICE); + ib_dma_unmap_page(ca, mapping[i + off], frag->size, DMA_TO_DEVICE); } } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-11 21:15:26.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-11 21:15:28.000000000 +0300 @@ -38,6 +38,7 @@ #include #include #include +#include #include @@ -354,22 +355,36 @@ static inline int post_send(struct ipoib struct ib_ah *address, u32 qpn, u64 *mapping, int headlen, skb_frag_t *frags, - int nr_frags) + int nr_frags, void *lso_header) { struct ib_send_wr *bad_wr; - int i; + int i, off; + + if (!lso_header) { + priv->tx_sge[0].addr = mapping[0]; + priv->tx_sge[0].length = headlen; + off = 1; + } + else + off = 0; - priv->tx_sge[0].addr = mapping[0]; - priv->tx_sge[0].length = headlen; for (i = 0; i < nr_frags; ++i) { - priv->tx_sge[i + 1].addr = mapping[i + 1]; - priv->tx_sge[i + 1].length = frags[i].size; + priv->tx_sge[i + off].addr = mapping[i + off]; + priv->tx_sge[i + 
off].length = frags[i].size; } - priv->tx_wr.num_sge = nr_frags + 1; + priv->tx_wr.num_sge = nr_frags + off; priv->tx_wr.wr_id = wr_id; priv->tx_wr.wr.ud.remote_qpn = qpn; priv->tx_wr.wr.ud.ah = address; + if (lso_header) { + priv->tx_wr.wr.ud.mss = priv->dev->mtu; + priv->tx_wr.wr.ud.header = lso_header; + priv->tx_wr.wr.ud.hlen = headlen; + priv->tx_wr.opcode = IB_WR_LSO; + } else + priv->tx_wr.opcode = IB_WR_SEND; + return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr); } @@ -379,13 +394,26 @@ void ipoib_send(struct net_device *dev, struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_tx_buf *tx_req; - if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { - ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", - skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN); - ++priv->stats.tx_dropped; - ++priv->stats.tx_errors; - ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu); - return; + if (!skb_is_gso(skb)) { + if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu); + return; + } + } else { + if (unlikely((skb_headlen(skb) - IPOIB_ENCAP_LEN) != + ((ip_hdr(skb)->ihl + tcp_hdr(skb)->doff) << 2))) { + ipoib_warn(priv, "headlen (%d) does not match ip (%d)and " + "tcp headers(%d), dropping skb\n", + skb_headlen(skb) - IPOIB_ENCAP_LEN, + ip_hdr(skb)->ihl << 2, tcp_hdr(skb)->doff << 2); + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } } ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", @@ -419,7 +447,9 @@ void ipoib_send(struct net_device *dev, if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, tx_req->mapping, skb_headlen(skb), - skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags))) { + skb_shinfo(skb)->frags, + skb_shinfo(skb)->nr_frags, + 
skb_is_gso(skb) ? skb->data : NULL))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; ipoib_dma_unmap_tx(priv->ca, tx_req); Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:26.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:28.000000000 +0300 @@ -733,7 +733,9 @@ static int ipoib_start_xmit(struct sk_bu goto out; } - ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + ipoib_send(dev, skb, neigh->ah, + IPOIB_QPN(skb->dst->neighbour->ha)); + goto out; } @@ -1203,6 +1205,11 @@ static struct net_device *ipoib_add_port set_tx_csum(priv->dev); set_rx_csum(priv->dev); + if (!ipoib_hw_csum && priv->dev->features & NETIF_F_SG && + priv->ca->flags & IB_DEVICE_TCP_GSO) + priv->dev->features |= NETIF_F_TSO; + + result = register_netdev(priv->dev); if (result) { printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-11 21:15:26.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-11 21:15:28.000000000 +0300 @@ -1264,7 +1264,7 @@ static ssize_t set_mode(struct device *d "will cause multicast packet drops\n"); /* clear ipv6 flag too */ - dev->features &= ~NETIF_F_IP_CSUM; + dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_TSO); priv->tx_wr.send_flags &= ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); @@ -1282,6 +1282,12 @@ static ssize_t set_mode(struct device *d !test_bit(IPOIB_FLAG_HW_CSUM, &priv->flags)) dev->features |= NETIF_F_IP_CSUM; /* ipv6 too */ + + if (!test_bit(IPOIB_FLAG_HW_CSUM, &priv->flags) && + priv->dev->features & NETIF_F_SG && + priv->ca->flags & 
IB_DEVICE_TCP_GSO) + priv->dev->features |= NETIF_F_TSO; + return count; } From eli at mellanox.co.il Tue Sep 11 08:54:51 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:51 +0300 Subject: [ofa-general] [PATCH 12 of 17] ipoib: ethtool support Message-ID: <1189526091.13053.122.camel@mtls03> Add ethtool support to ipoib Signed-off-by: Eli Cohen --- This one is actually just the foundation, with no real content yet. I think we can add here all the logic of whether to allow using a certain feature (e.g. checksum offload, scatter/gather, etc.) and decide on all the dependencies. Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/Makefile =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/Makefile 2007-09-11 21:14:34.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/Makefile 2007-09-11 21:15:29.000000000 +0300 @@ -4,7 +4,8 @@ ib_ipoib-y := ipoib_main.o \ ipoib_ib.o \ ipoib_multicast.o \ ipoib_verbs.o \ - ipoib_vlan.o + ipoib_vlan.o \ + ipoib_etool.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM) += ipoib_cm.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c 2007-09-11 21:15:29.000000000 +0300 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2007 Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_etool.c $ + */ + +#include +#include +#include + +#include "ipoib.h" + +static void ipoib_get_drvinfo(struct net_device *netdev, + struct ethtool_drvinfo *drvinfo) +{ + strncpy(drvinfo->driver, "ipoib", sizeof(drvinfo->driver) - 1); +} + +static const struct ethtool_ops ipoib_ethtool_ops = { + .get_drvinfo = ipoib_get_drvinfo, + .get_tso = ethtool_op_get_tso, +}; + +void ipoib_set_ethtool_ops(struct net_device *dev) +{ + SET_ETHTOOL_OPS(dev, &ipoib_ethtool_ops); +} Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:28.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:29.000000000 +0300 @@ -496,6 +496,8 @@ void ipoib_pkey_poll(struct work_struct int ipoib_pkey_dev_delay_open(struct net_device *dev); void ipoib_drain_cq(struct net_device *dev); +void ipoib_set_ethtool_ops(struct net_device *dev); + #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_FLAGS_RC 0x80 Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:28.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:29.000000000 +0300 @@ -1002,6 +1002,7 @@ static void ipoib_setup(struct net_devic dev->neigh_setup = ipoib_neigh_setup_dev; dev->poll = ipoib_poll; dev->weight = 100; + ipoib_set_ethtool_ops(dev); dev->watchdog_timeo = HZ; From eli at mellanox.co.il Tue Sep 11 08:54:55 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:55 +0300 Subject: [ofa-general] [PATCH 13 of 17]: add LRO support Message-ID: <1189526095.13053.123.camel@mtls03> Add Large Receive Offload support to IPOIB Reduce overhead incurred by handling many small packets by aggregating SKBs 
related to the same stream and passing them up. This patch is based on the work done for MTNIC by Liran Liss Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/Makefile =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/Makefile 2007-09-11 21:15:29.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/Makefile 2007-09-11 21:15:29.000000000 +0300 @@ -5,7 +5,8 @@ ib_ipoib-y := ipoib_main.o \ ipoib_multicast.o \ ipoib_verbs.o \ ipoib_vlan.o \ - ipoib_etool.o + ipoib_etool.o \ + ipoib_lro.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM) += ipoib_cm.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:29.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:29.000000000 +0300 @@ -95,6 +95,8 @@ enum { IPOIB_MCAST_FLAG_SENDONLY = 1, IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ IPOIB_MCAST_FLAG_ATTACHED = 3, + + MAX_LRO_SESSIONS = 1 << 5, /* must be power of 2 */ }; #define IPOIB_OP_RECV (1ul << 31) @@ -281,6 +283,30 @@ struct ipoib_cm_dev_priv { struct ib_recv_wr rx_wr; }; +struct ipoib_lro { + struct hlist_node node; + struct hlist_node flush_node; + + /* Id fields come first: */ + u32 saddr; + u32 daddr; + u32 sport_dport; + u32 next_seq; + u16 tot_len; + u8 psh; + + u32 tsval; + __be32 tsecr; + __be32 ack_seq; + __be16 window; + u16 has_vlan; + u16 has_timestamp; + + unsigned long expires; + struct sk_buff *head; + struct sk_buff *tail; +}; + /* * Device private locking: tx_lock protects members used in TX fast * path (and we use LLTX so upper layers don't do extra locking). 
@@ -357,6 +383,11 @@ struct ipoib_dev_priv { struct dentry *mcg_dentry; struct dentry *path_dentry; #endif + + struct hlist_head *lro_hash; + struct hlist_head lro_free; + struct hlist_head lro_flush; + int lro_sz; /* must be 2^x */ }; struct ipoib_ah { @@ -498,6 +529,11 @@ void ipoib_drain_cq(struct net_device *d void ipoib_set_ethtool_ops(struct net_device *dev); +int ipoib_lro_init(struct ipoib_dev_priv *priv, int num_lro); +void ipoib_lro_destroy(struct ipoib_dev_priv *priv); +int ipoib_lro_rx(struct ipoib_dev_priv *priv, struct sk_buff *skb); +void ipoib_lro_flush(struct ipoib_dev_priv *priv, int all); + #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_FLAGS_RC 0x80 Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-11 21:15:28.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-11 21:15:29.000000000 +0300 @@ -239,8 +239,11 @@ static void ipoib_ib_handle_rx_wc(struct skb->pkt_type = PACKET_HOST; /* check rx csum */ - if (test_bit(IPOIB_FLAG_RX_CSUM, &priv->flags) && likely(wc->csum_ok)) + if (test_bit(IPOIB_FLAG_RX_CSUM, &priv->flags) && likely(wc->csum_ok)) { skb->ip_summed = CHECKSUM_UNNECESSARY; + if (!ipoib_lro_rx(priv, skb)) + goto repost; + } netif_receive_skb(skb); @@ -332,13 +335,13 @@ int ipoib_poll(struct net_device *dev, i *budget -= done; if (empty) { + ipoib_lro_flush(priv, 1); netif_rx_complete(dev); if (unlikely(ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)) && netif_rx_reschedule(dev, 0)) return 1; - return 0; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 21:15:29.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-11 
21:15:29.000000000 +0300 @@ -1165,7 +1165,7 @@ static struct net_device *ipoib_add_port if (result) { printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", hca->name, port, result); - goto alloc_mem_failed; + goto device_init_failed; } /* @@ -1181,7 +1181,7 @@ static struct net_device *ipoib_add_port if (result) { printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", hca->name, port, result); - goto alloc_mem_failed; + goto device_init_failed; } else memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); @@ -1211,6 +1211,9 @@ static struct net_device *ipoib_add_port priv->dev->features |= NETIF_F_TSO; + if (ipoib_lro_init(priv, MAX_LRO_SESSIONS)) + goto lro_init_failed; + result = register_netdev(priv->dev); if (result) { printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", @@ -1236,6 +1239,9 @@ sysfs_failed: unregister_netdev(priv->dev); register_failed: + ipoib_lro_destroy(priv); + +lro_init_failed: ib_unregister_event_handler(&priv->event_handler); flush_scheduled_work(); @@ -1295,6 +1301,7 @@ static void ipoib_remove_one(struct ib_d dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { + ipoib_lro_destroy(priv); ib_unregister_event_handler(&priv->event_handler); flush_scheduled_work(); Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_lro.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_lro.c 2007-09-11 21:15:29.000000000 +0300 @@ -0,0 +1,392 @@ +/* + * Copyright (c) 2007 Mellanox Technologies LTD. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +static int data_debug_level; + +module_param_named(lro_data_debug_level, data_debug_level, int, 0644); +MODULE_PARM_DESC(lro_data_debug_level, + "Enable data path debug tracing for lro code if > 0"); +#endif + +#include "ipoib.h" + +#include +#include +#include +#include + + +/* LRO hash function - using sum of source and destination port LSBs is + * good enough */ +#define LRO_INDEX(th, size) \ + ((*((u8 *)&th->source + 1) + *((u8 *)&th->dest + 1)) & (size - 1)) + + +int ipoib_lro_init(struct ipoib_dev_priv *priv, int num_lro) +{ + struct ipoib_lro *lro; + int i; + + INIT_HLIST_HEAD(&priv->lro_free); + INIT_HLIST_HEAD(&priv->lro_flush); + priv->lro_hash = kmalloc(sizeof(struct hlist_head) * num_lro, + GFP_KERNEL); + if (!priv->lro_hash) + return -ENOMEM; + + for (i = 0; i < num_lro; ++i) { + INIT_HLIST_HEAD(&priv->lro_hash[i]); + lro = kzalloc(sizeof(struct ipoib_lro), GFP_KERNEL); + if (!lro) { + ipoib_lro_destroy(priv); + return -ENOMEM; + } + INIT_HLIST_NODE(&lro->node); + INIT_HLIST_NODE(&lro->flush_node); + hlist_add_head(&lro->node, &priv->lro_free); + } + priv->lro_sz = num_lro; + + return 0; +} + +void ipoib_lro_destroy(struct ipoib_dev_priv *priv) +{ + struct ipoib_lro *lro; + struct hlist_node *node, *tmp; + + hlist_for_each_entry_safe(lro, node, tmp, &priv->lro_free, node) { + hlist_del(&lro->node); + kfree(lro); + } + kfree(priv->lro_hash); +} + +static inline int skb_valid_for_lro(const struct sk_buff *skb) +{ + const struct iphdr *hdr = (struct iphdr *)(skb->data); + + /* FIXME: mlx4 hw can supply all these tests in the ipoib status + field - need to change implementation so that this value is passed + up to the caller */ + /* This packet is eligible for LRO if it is: + * - TCP/IP (v4) + * - without IP options + * - not an IP fragment */ + return hdr->protocol == IPPROTO_TCP && hdr->ihl == 5 && + !(hdr->frag_off & htons(0x2000)); +} + +static struct ipoib_lro *lro_find_session(struct
ipoib_dev_priv *priv, + const struct iphdr *iph, + const struct tcphdr *th) +{ + struct ipoib_lro *lro; + struct hlist_node *pos; + int index = LRO_INDEX(th, priv->lro_sz); + struct hlist_head *head = &priv->lro_hash[index]; + + ipoib_dbg_data(priv, "Searching session at index:%d\n", index); + + hlist_for_each_entry(lro, pos, head, node) { + if (lro->sport_dport == *((__be32 *)&th->source) && + lro->saddr == iph->saddr && + lro->daddr == iph->daddr) + return lro; + } + return NULL; +} + +static void lro_flush_single(struct ipoib_dev_priv *priv, + struct ipoib_lro *lro) +{ + struct sk_buff *skb = lro->head; + struct iphdr *iph = (struct iphdr *)skb->data; + struct tcphdr *th = (struct tcphdr *)(iph + 1); + struct net_device *dev = priv->dev; + u32 *ts; + + /* Update IP length and checksum */ + iph->tot_len = htons(lro->tot_len); + iph->check = 0; + iph->check = ip_fast_csum(iph, sizeof(*iph) >> 2); + + /* Update latest TCP ack, window, psh, and timestamp */ + th->ack_seq = lro->ack_seq; + th->window = lro->window; + th->psh = !!lro->psh; + if (lro->has_timestamp) { + ts = (u32 *) (th + 1); + ts[1] = htonl(lro->tsval); + ts[2] = lro->tsecr; + } + + ipoib_dbg_data(priv, "Flushing LRO session (%p) - tot_len:%d\n", + lro, lro->tot_len); + + netif_receive_skb(skb); + dev->last_rx = jiffies; + + /* TBD Increment stats ?? 
*/ + + /* Move session back to the free list */ + ipoib_dbg_data(priv, "Returning LRO session to free list\n"); + hlist_del(&lro->node); + hlist_del(&lro->flush_node); + hlist_add_head(&lro->node, &priv->lro_free); +} + +static void lro_append(struct ipoib_dev_priv *priv, struct ipoib_lro *lro, + struct sk_buff *skb, int tcp_len, int tcp_hlen) +{ + struct sk_buff *head = lro->head; + + ipoib_dbg_data(priv, "append %d bytes\n", tcp_len); + head->len += tcp_len; + head->data_len += tcp_len; + skb_pull(skb, tcp_hlen + sizeof(struct iphdr)); + if (skb_shinfo(head)->frag_list) + lro->tail->next = skb; + else + skb_shinfo(head)->frag_list = skb; + + head->truesize += skb->truesize; + lro->tail = skb; + return; +} + +static struct ipoib_lro *lro_alloc_session(struct ipoib_dev_priv *priv) +{ + struct ipoib_lro *lro; + + if (hlist_empty(&priv->lro_free)) + return NULL; + + lro = hlist_entry(priv->lro_free.first, struct ipoib_lro, node); + hlist_del(&lro->node); + + return lro; +} + +int ipoib_lro_rx(struct ipoib_dev_priv *priv, struct sk_buff *skb) +{ + struct ipoib_lro *lro; + const struct iphdr *iph; + const struct tcphdr *th; + int tcp_hlen; + int tcp_data_len; + u16 ip_len; + u32 *ts; + u32 seq; + u32 tsval = 0xffffffff; + __be32 tsecr = 0; + + if (unlikely(!skb_valid_for_lro(skb))) + return -1; + + /* Get pointer to TCP header */ + iph = (struct iphdr *)(skb->data); + th = (struct tcphdr *)(iph + 1); + + /* We only handle aligned timestamp options */ + tcp_hlen = th->doff << 2; + if (tcp_hlen == sizeof *th + TCPOLEN_TSTAMP_ALIGNED) { + ts = (u32 *)(th + 1); + if (unlikely(*ts != htonl((TCPOPT_NOP << 24) | + (TCPOPT_NOP << 16) | + (TCPOPT_TIMESTAMP << 8) | + TCPOLEN_TIMESTAMP))) + return -1; + + tsval = ntohl(ts[1]); + tsecr = ts[2]; + ipoib_dbg_data(priv, "Found ts:0x%x tsecr:0x%x\n", tsval, + ntohl(tsecr)); + } else if (tcp_hlen != sizeof(*th)) { + ipoib_dbg_data(priv, "Cannot LRO - tcp options\n"); + return -1; + } + + /* At this point we know we have a TCP packet 
that is likely to be + * eligible for LRO. Therefore, see now if we have an outstanding + * session that corresponds to this packet so we could flush it if + * something still prevents LRO */ + lro = lro_find_session(priv, iph, th); + ipoib_dbg_data(priv, "%s LRO session\n", lro ? "Found" : "Unrecognized"); + + /* ensure no bits set besides ack or psh */ + if (th->fin || th->syn || th->rst || th->urg || th->ece || + th->cwr || !th->ack) { + ipoib_dbg_data(priv, "Cannot LRO - tcp flags\n"); + if (lro) + lro_flush_single(priv, lro); + + return -1; + } + + ip_len = ntohs(iph->tot_len); + /* Get TCP payload length */ + tcp_data_len = ip_len - tcp_hlen - sizeof(struct iphdr); + seq = ntohl(th->seq); + ipoib_dbg_data(priv, "ip_len:%d ip_hlen:%d tcp_hlen:%d tcp_data_len:%d\n", + ip_len, iph->ihl * 4, tcp_hlen, tcp_data_len); + + if (lro) { + ipoib_dbg_data(priv, "Extending LRO (%p) session with " + "current tot_len:%d\n", lro, lro->tot_len); + + /* Check sequence number */ + if (unlikely(seq != lro->next_seq)) { + ipoib_dbg_data(priv, "Sequence mismatch (got: 0x%08x, " + "expected:0x%08x)\n", seq, lro->next_seq); + lro_flush_single(priv, lro); + return -1; + } + + /* If the cumulative IP length is over 64K, flush and start + * a new session */ + if (lro->tot_len + tcp_data_len > 0xffff) { + ipoib_dbg_data(priv, "LRO 64K exceeded - " + "starting new session\n"); + lro_flush_single(priv, lro); + goto new_session; + } + + /* Check timestamps */ + if (tcp_hlen != sizeof(*th)) { + if (unlikely(lro->tsval > tsval || !tsecr)) { + ipoib_dbg_data(priv, "LRO - bad timestamp\n"); + return -1; + } + } + + /* Update session */ + lro->psh |= th->psh; + lro->next_seq += tcp_data_len; + lro->tot_len += tcp_data_len; + lro->tsval = tsval; + lro->tsecr = tsecr; + lro->ack_seq = th->ack_seq; + lro->window = th->window; + + if (likely(tcp_data_len)) + lro_append(priv, lro, skb, tcp_data_len, tcp_hlen); + else + dev_kfree_skb_any(skb); + +#ifdef IPOIB_LRO_FLUSH_PSH + if (th->psh) +
lro_flush_single(priv, lro); +#endif + + return 0; + } + +new_session: + ipoib_dbg_data(priv, "LRO session not found - allocating new\n"); +#ifdef IPOIB_LRO_FLUSH_PSH + if (th->psh) { + ipoib_dbg_data(priv, "Aborting new session due to set psh bit\n"); + return -1; + } +#endif + + lro = lro_alloc_session(priv); + if (likely(lro)) { + int index; + + /* Add in the skb */ + lro->head = skb; + lro->tail = skb; + + /* Initialize session */ + lro->saddr = iph->saddr; + lro->daddr = iph->daddr; + lro->sport_dport = *((u32 *)&th->source); + + lro->next_seq = seq + tcp_data_len; + lro->tot_len = ip_len; + lro->psh = th->psh; + lro->ack_seq = th->ack_seq; + lro->window = th->window; + + /* Handle timestamps */ + if (tcp_hlen != sizeof(*th)) { + lro->tsval = tsval; + lro->tsecr = tsecr; + lro->has_timestamp = 1; + } else { + lro->tsval = 0xffffffff; + lro->has_timestamp = 0; + } + + /* Activate this session */ + lro->expires = jiffies + HZ / 25; + index = LRO_INDEX(th, priv->lro_sz); + + ipoib_dbg_data(priv, "Inserting session (%p) to list at index:%d\n", + lro, index); + hlist_add_head(&lro->node, &priv->lro_hash[index]); + hlist_add_head(&lro->flush_node, &priv->lro_flush); + return 0; + } else + ipoib_dbg_data(priv, "No more LRO sessions\n"); + + return -1; +} + + +void ipoib_lro_flush(struct ipoib_dev_priv *priv, int all) +{ + struct ipoib_lro *lro; + struct hlist_node *node, *tmp; + + ipoib_dbg_data(priv, "LRO flush called with all:%d at jiffies:%lu\n", + all, jiffies); + + hlist_for_each_entry_safe(lro, node, tmp, &priv->lro_flush, + flush_node) { + if (all || time_after(jiffies, lro->expires)) { + ipoib_dbg_data(priv, "Flushing session - saddr:0x%x " + "daddr:0x%x sport:%d dport:%d expires:%lu\n", + be32_to_cpu(lro->saddr), + be32_to_cpu(lro->daddr), + be16_to_cpu(*((__be16 *)&lro->sport_dport)), + be16_to_cpu(*((__be16 *)&lro->sport_dport + 1)), + lro->expires); + lro_flush_single(priv, lro); + } + } +} From eli at mellanox.co.il Tue Sep 11 08:54:59 2007 From: eli at 
mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:54:59 +0300 Subject: [ofa-general] [PATCH 14 of 17] ib_core: modify CQ moderation params Message-ID: <1189526099.13053.124.camel@mtls03> Add support for modifying CQ parameters for controlling event generation moderation. Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/rdma/ib_verbs.h 2007-09-11 21:15:27.000000000 +0300 +++ ofa_1_3_dev_kernel/include/rdma/ib_verbs.h 2007-09-11 21:15:29.000000000 +0300 @@ -975,6 +975,8 @@ struct ib_device { int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); + int (*modify_cq)(struct ib_cq *cq, u16 cq_count, + u16 cq_period); int (*destroy_cq)(struct ib_cq *cq); int (*resize_cq)(struct ib_cq *cq, int cqe, struct ib_udata *udata); @@ -1380,6 +1382,16 @@ struct ib_cq *ib_create_cq(struct ib_dev int ib_resize_cq(struct ib_cq *cq, int cqe); /** + * ib_modify_cq - Modifies moderation params of the CQ + * @cq: The CQ to modify. + * @cq_count: number of CQEs that will trigger an event + * @cq_period: max period of time before triggering an event + * + * Users can examine the cq structure to determine the actual CQ size. + */ +int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period); + +/** * ib_destroy_cq - Destroys the specified CQ. * @cq: The CQ to destroy. */ Index: ofa_1_3_dev_kernel/drivers/infiniband/core/verbs.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/core/verbs.c 2007-09-11 21:14:34.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/core/verbs.c 2007-09-11 21:15:29.000000000 +0300 @@ -628,6 +628,13 @@ struct ib_cq *ib_create_cq(struct ib_dev } EXPORT_SYMBOL(ib_create_cq); +int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period) +{ + return cq->device->modify_cq ?
+ cq->device->modify_cq(cq, cq_count, cq_period) : -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_cq); + int ib_destroy_cq(struct ib_cq *cq) { if (atomic_read(&cq->usecnt)) From eli at mellanox.co.il Tue Sep 11 08:55:03 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:55:03 +0300 Subject: [ofa-general] [PATCH 15 of 17] mlx4: support modify CQ Message-ID: <1189526103.13053.125.camel@mtls03> Add support for modifying CQ parameters. Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-11 21:15:28.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c 2007-09-11 21:15:30.000000000 +0300 @@ -615,6 +615,7 @@ static void *mlx4_ib_add(struct mlx4_dev ibdev->ib_dev.post_send = mlx4_ib_post_send; ibdev->ib_dev.post_recv = mlx4_ib_post_recv; ibdev->ib_dev.create_cq = mlx4_ib_create_cq; + ibdev->ib_dev.modify_cq = mlx4_ib_modify_cq; ibdev->ib_dev.destroy_cq = mlx4_ib_destroy_cq; ibdev->ib_dev.poll_cq = mlx4_ib_poll_cq; ibdev->ib_dev.req_notify_cq = mlx4_ib_arm_cq; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/cq.c 2007-09-11 21:15:28.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c 2007-09-11 21:15:30.000000000 +0300 @@ -91,6 +91,25 @@ static struct mlx4_cqe *next_cqe_sw(stru return get_sw_cqe(cq, cq->mcq.cons_index); } +int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period) +{ + struct mlx4_ib_cq *mcq = to_mcq(cq); + struct mlx4_ib_dev *dev = to_mdev(cq->device); + struct mlx4_cq_context *context; + int err; + + context = kzalloc(sizeof *context, GFP_KERNEL); + if (!context) + return -ENOMEM; + + context->cq_period =
cpu_to_be16(cq_period); + context->cq_max_count = cpu_to_be16(cq_count); + err = mlx4_cq_modify(dev->dev, &mcq->mcq, context, 1); + + kfree(context); + return err; +} + struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata) Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-11 21:15:08.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-11 21:15:30.000000000 +0300 @@ -247,6 +247,7 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct struct ib_udata *udata); int mlx4_ib_dereg_mr(struct ib_mr *mr); +int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period); struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata); Index: ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/cq.c 2007-09-11 21:14:34.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c 2007-09-11 21:15:30.000000000 +0300 @@ -38,33 +38,11 @@ #include #include +#include #include "mlx4.h" #include "icm.h" -struct mlx4_cq_context { - __be32 flags; - u16 reserved1[3]; - __be16 page_offset; - __be32 logsize_usrpage; - u8 reserved2; - u8 cq_period; - u8 reserved3; - u8 cq_max_count; - u8 reserved4[3]; - u8 comp_eqn; - u8 log_page_size; - u8 reserved5[2]; - u8 mtt_base_addr_h; - __be32 mtt_base_addr_l; - __be32 last_notified_index; - __be32 solicit_producer_index; - __be32 consumer_index; - __be32 producer_index; - u32 reserved6[2]; - __be64 db_rec_addr; -}; - #define MLX4_CQ_STATUS_OK ( 0 << 28) #define MLX4_CQ_STATUS_OVERFLOW ( 9 << 28) #define MLX4_CQ_STATUS_WRITE_FAIL (10 << 28) @@ -121,6 +99,13 @@ static int mlx4_SW2HW_CQ(struct mlx4_dev MLX4_CMD_TIME_CLASS_A); 
} +static int mlx4_MODIFY_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int cq_num, u32 opmod) +{ + return mlx4_cmd(dev, mailbox->dma, cq_num, opmod, MLX4_CMD_MODIFY_CQ, + MLX4_CMD_TIME_CLASS_A); +} + static int mlx4_HW2SW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, int cq_num) { @@ -206,6 +191,24 @@ err_out: } EXPORT_SYMBOL_GPL(mlx4_cq_alloc); +int mlx4_cq_modify(struct mlx4_dev *dev, struct mlx4_cq *cq, + struct mlx4_cq_context *context, int modify) +{ + struct mlx4_cmd_mailbox *mailbox; + int err; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + memcpy(mailbox->buf, context, sizeof *context); + err = mlx4_MODIFY_CQ(dev, mailbox, cq->cqn, modify); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} +EXPORT_SYMBOL_GPL(mlx4_cq_modify); + void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) { struct mlx4_priv *priv = mlx4_priv(dev); Index: ofa_1_3_dev_kernel/include/linux/mlx4/cq.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/cq.h 2007-09-11 21:15:26.000000000 +0300 +++ ofa_1_3_dev_kernel/include/linux/mlx4/cq.h 2007-09-11 21:15:30.000000000 +0300 @@ -38,6 +38,27 @@ #include #include +struct mlx4_cq_context { + __be32 flags; + u16 reserved1[3]; + __be16 page_offset; + __be32 logsize_usrpage; + u16 cq_period; + u16 cq_max_count; + u8 reserved4[3]; + u8 comp_eqn; + u8 log_page_size; + u8 reserved5[2]; + u8 mtt_base_addr_h; + __be32 mtt_base_addr_l; + __be32 last_notified_index; + __be32 solicit_producer_index; + __be32 consumer_index; + __be32 producer_index; + u32 reserved6[2]; + __be64 db_rec_addr; +}; + struct mlx4_cqe { __be32 my_qpn; __be32 immed_rss_invalid; @@ -120,4 +141,8 @@ enum { MLX4_CQ_DB_REQ_NOT = 2 << 24 }; + +int mlx4_cq_modify(struct mlx4_dev *dev, struct mlx4_cq *cq, + struct mlx4_cq_context *context, int resize); + #endif /* MLX4_CQ_H */ Index: ofa_1_3_dev_kernel/include/linux/mlx4/cmd.h 
=================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/cmd.h 2007-09-11 21:14:34.000000000 +0300 +++ ofa_1_3_dev_kernel/include/linux/mlx4/cmd.h 2007-09-11 21:15:30.000000000 +0300 @@ -81,7 +81,7 @@ enum { MLX4_CMD_SW2HW_CQ = 0x16, MLX4_CMD_HW2SW_CQ = 0x17, MLX4_CMD_QUERY_CQ = 0x18, - MLX4_CMD_RESIZE_CQ = 0x2c, + MLX4_CMD_MODIFY_CQ = 0x2c, /* SRQ commands */ MLX4_CMD_SW2HW_SRQ = 0x35, From eli at mellanox.co.il Tue Sep 11 08:55:07 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:55:07 +0300 Subject: [ofa-general] [PATCH 16 of 17] ipoib: modify CQ through ethtool Message-ID: <1189526107.13053.126.camel@mtls03> Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:29.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-11 21:15:30.000000000 +0300 @@ -283,6 +283,13 @@ struct ipoib_cm_dev_priv { struct ib_recv_wr rx_wr; }; +struct ipoib_ethtool_st { + u16 rx_coalesce_usecs; + u16 tx_coalesce_usecs; + u16 rx_max_coalesced_frames; + u16 tx_max_coalesced_frames; +}; + struct ipoib_lro { struct hlist_node node; struct hlist_node flush_node; @@ -388,6 +395,8 @@ struct ipoib_dev_priv { struct hlist_head lro_free; struct hlist_head lro_flush; int lro_sz; /* must be 2^x */ + + struct ipoib_ethtool_st etool; }; struct ipoib_ah { Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_etool.c 2007-09-11 21:15:29.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c 2007-09-11 21:15:30.000000000 +0300 @@ -44,9 +44,49 @@ static void ipoib_get_drvinfo(struct net strncpy(drvinfo->driver, "ipoib", sizeof(drvinfo->driver) - 1); } +static int 
ipoib_get_coalesce(struct net_device *dev, + struct ethtool_coalesce *coal) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + coal->rx_coalesce_usecs = priv->etool.rx_coalesce_usecs; + coal->tx_coalesce_usecs = priv->etool.tx_coalesce_usecs; + coal->rx_max_coalesced_frames = priv->etool.rx_max_coalesced_frames; + coal->tx_max_coalesced_frames = priv->etool.tx_max_coalesced_frames; + + return 0; +} + +static int ipoib_set_coalesce(struct net_device *dev, + struct ethtool_coalesce *coal) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + if (coal->rx_coalesce_usecs > 0xffff || + coal->tx_coalesce_usecs > 0xffff || + coal->rx_max_coalesced_frames > 0xffff || + coal->tx_max_coalesced_frames > 0xffff) + return -EINVAL; + + ret = ib_modify_cq(priv->cq, coal->rx_max_coalesced_frames, + coal->rx_coalesce_usecs); + if (ret) + return ret; + + priv->etool.rx_coalesce_usecs = coal->rx_coalesce_usecs; + priv->etool.tx_coalesce_usecs = coal->tx_coalesce_usecs; + priv->etool.rx_max_coalesced_frames = coal->rx_max_coalesced_frames; + priv->etool.tx_max_coalesced_frames = coal->tx_max_coalesced_frames; + + return 0; +} + static const struct ethtool_ops ipoib_ethtool_ops = { .get_drvinfo = ipoib_get_drvinfo, .get_tso = ethtool_op_get_tso, + .get_coalesce = ipoib_get_coalesce, + .set_coalesce = ipoib_set_coalesce, }; void ipoib_set_ethtool_ops(struct net_device *dev) From eli at mellanox.co.il Tue Sep 11 08:55:11 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 11 Sep 2007 18:55:11 +0300 Subject: [ofa-general] [PATCH 17 of 17] mlx4: config coalscing params as default Message-ID: <1189526111.13053.127.camel@mtls03> From: Michael S. Tsirkin Enable interrupt coalescing for CQs in mlx4. Signed-off-by: Michael S.
Tsirkin --- Index: ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/cq.c 2007-09-11 21:15:30.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c 2007-09-11 21:15:31.000000000 +0300 @@ -43,6 +43,14 @@ #include "mlx4.h" #include "icm.h" +static int cq_max_count = 16; +static int cq_period = 10; + +module_param(cq_max_count, int, 0444); +MODULE_PARM_DESC(cq_max_count, "number of CQEs to generate event"); +module_param(cq_period, int, 0444); +MODULE_PARM_DESC(cq_period, "time in usec for CQ event generation"); + #define MLX4_CQ_STATUS_OK ( 0 << 28) #define MLX4_CQ_STATUS_OVERFLOW ( 9 << 28) #define MLX4_CQ_STATUS_WRITE_FAIL (10 << 28) From rick.jones2 at hp.com Tue Sep 11 10:17:05 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Tue, 11 Sep 2007 10:17:05 -0700 Subject: [ofa-general] performance and Kernel support In-Reply-To: References: Message-ID: <46E6CD91.2030209@hp.com> H. N. HARAKE wrote: > The second question is regarding performance parameters using netperf > I reach 4GBit/s between two nodes using OFED version 1.2.51 and > 3GBit/s using OFED version 1.1 (10 Gig Mellanox cards) is their any > parameters to apply for improving the performance or is their any > document around. What is the CPU util being reported by netperf (-c and -C options for local and remote respectively) and how many cores are there in the system? 
Here are some numbers I get with a pair of rx2660's connected via an HP 4x IB switch:

RedHat Enterprise Linux 5 2.6.18-8.el5
Peak Single-Stream Performance

                    Bulk Transfer                              "Latency"
                    Unidir                Bidir
Card                Mbit/s  SDx    SDr    Mbit/s  SDx   SDr    Tran/s  SDx    SDr
-------------------------------------------------------------------------------------
AD313A IPoIB 1.1    2970    4.418  4.544  3530    3.59  3.95   19290   n/a    n/a
AD313A SDP 1.1      7810    0.453  1.048  12820   0.69  0.68   38030   26.29  26.29
AD313A SDP p0       7810    0.346  0.527  12670   0.42  0.43   19380   n/a    n/a
AD313A IPoIB 1.2    5510    0.426  1.593  5730    n/a   n/a    18990   n/a    n/a
AD313A SDP 1.2      7820    0.409  1.047  12890   0.64  0.68   41988   25.89  26.32
AD313A SDP p0 1.2   7820    0.309  0.517  12760   0.36  0.36   19800   15.47  15.72

The big change between 1.1 and 1.2 was, IIRC, the increase in the default IP MTU from 2044 to 65520 (?) bytes. The limitation in the 1.1 case at least was CPU saturation (although I don't show the CPU utils in the table above, just the service demands). Notice the very significant change in service demand (microseconds of CPU consumed per KB transferred) between 1.1 and 1.2. I suspect the receive side would go down even further with CKO support but alas I've none of those sorts of cards at my disposal... For those tests I was likely using -s 1M -S 1M -m 64K on the Unidir, and -s 1M -S 1M -r 64K -b 12 on the Bidir (TCP_RR ./configured with --enable-burst). The latency figures are the "standard" :) single-byte TCP_RR test. p0 means the SDP stuff was configured to sleep rather than sit and spin. happy benchmarking rick jones

From jgunthorpe at obsidianresearch.com Tue Sep 11 10:46:04 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 11 Sep 2007 11:46:04 -0600 Subject: [ofa-general] RFC: modify upstream code to make backporting easier In-Reply-To: <20070911062851.GC15363@mellanox.co.il> References: <20070911062851.GC15363@mellanox.co.il> Message-ID: <20070911174604.GG4472@obsidianresearch.com> On Tue, Sep 11, 2007 at 09:28:51AM +0300, Michael S.
Tsirkin wrote: > Upstream maintainers, can you pls comment ASAP on whether such > approach would be acceptable e.g. for 2.6.24? If I could get rid of > backport patches, it might make sense to start thinking about > converting fixes patches to git commits, post 1.3, as well. FWIW, I've seen arguments about this for other drivers over the years and the Kernel folks have pretty much always said that wrappers like this in the mainline to support backporting are not desired.. Jason From hrosenstock at xsigo.com Tue Sep 11 11:03:58 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 11 Sep 2007 11:03:58 -0700 Subject: [ofa-general] [PATCH] OpenSM/console: Support loopback in -console option Message-ID: <1189533839.11745.9.camel@hrosenstock-ws.xsigo.com> OpenSM/(osm_console main).c: Support loopback option to -console for local only telnet support Note: Patch is based on OFED 1.2 Signed-off-by: Hal Rosenstock diff --git a/osm/man/opensm.8 b/osm/man/opensm.8 index 38b49c1..5da53b5 100644 --- a/osm/man/opensm.8 +++ b/osm/man/opensm.8 @@ -1,11 +1,11 @@ -.TH OPENSM 8 "May 15, 2007" "OpenIB" "OpenIB Management" +.TH OPENSM 8 "August 8, 2007" "OpenIB" "OpenIB Management" .SH NAME opensm \- InfiniBand subnet manager and administration (SM/SA) .SH SYNOPSIS .B opensm -[\-c(ache-options)] [\-g(uid)[=]] [\-l(mc) ] [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] [\-R | \-routing_engine ] [\-M | \-lid_matrix_file ] [\-U | \-ucast_file ] [\-S | \-\-sadb_file ] [\-a(dd_guid_file) ] [\-o(nce)] [\-s(weep) ] [\-t(imeout) ] [\-maxsmps ] [\-console [off | local | socket]] [\-console-port ] [\-i(gnore-guids) ] [\-f | \-\-log_file] [\-L | \-\-log_limit ] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-v(erbose)] [\-V] [\-D ] [\-d(ebug) ] [\-h(elp)] [\-?] 
+[\-c(ache-options)] [\-g(uid)[=]] [\-l(mc) ] [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] [\-R | \-routing_engine ] [\-M | \-lid_matrix_file ] [\-U | \-ucast_file ] [\-S | \-\-sadb_file ] [\-a(dd_guid_file) ] [\-o(nce)] [\-s(weep) ] [\-t(imeout) ] [\-maxsmps ] [\-console [off | local | socket | loopback]] [\-console-port ] [\-i(gnore-guids) ] [\-f | \-\-log_file] [\-L | \-\-log_limit ] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-v(erbose)] [\-V] [\-D ] [\-d(ebug) ] [\-h(elp)] [\-?] .SH DESCRIPTION .PP @@ -132,10 +132,10 @@ SMPs. Without -maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs. .TP -\fB\-console [off | local | socket]\fR +\fB\-console [off | local | socket | loopback]\fR This option brings up the OpenSM console (default off). -Note that the socket option will only be available if OpenSM ---enable-console-socket. +Note that the socket and loopback options will only be available +if OpenSM was built with --enable-console-socket. .TP \fB\-console-port\fR Specify an alternate telnet port for the socket console (default 10000). 
diff --git a/osm/opensm/main.c b/osm/opensm/main.c index e38ea7f..7cbca74 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -214,7 +214,7 @@ show_usage(void) " 4 outstanding SMPs.\n\n" ); printf( "-console [off|local" #ifdef ENABLE_OSM_CONSOLE_SOCKET - "|socket" + "|socket|loopback" #endif "]\n This option activates the OpenSM console (default off).\n\n"); #ifdef ENABLE_OSM_CONSOLE_SOCKET @@ -676,17 +676,17 @@ main( /* * OpenSM interactive console */ - if (strcmp(optarg, "off") == 0) { - opt.console = "off"; - } else if (strcmp(optarg, "local") == 0) { - opt.console = "local"; + if (strcmp(optarg, "off") == 0 || + strcmp(optarg, "local") == 0 #ifdef ENABLE_OSM_CONSOLE_SOCKET - } else if (strcmp(optarg, "socket") == 0) { - opt.console = "socket"; + || + strcmp(optarg, "socket") == 0 || + strcmp(optarg, "loopback") == 0 #endif - } else { + ) + opt.console = optarg; + else printf("-console %s option not understood\n", optarg); - } break; #ifdef ENABLE_OSM_CONSOLE_SOCKET @@ -957,7 +957,8 @@ main( osm_console(&osm); #ifdef ENABLE_OSM_CONSOLE_SOCKET } - else if (strcmp(opt.console, "socket") == 0) + else if (strcmp(opt.console, "socket") == 0 || + strcmp(opt.console, "loopback") == 0) { osm_console(&osm); #endif diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c index 38b978a..5575425 100644 --- a/osm/opensm/osm_console.c +++ b/osm/opensm/osm_console.c @@ -520,7 +520,8 @@ void osm_console_init(osm_subn_opt_t *opt, osm_opensm_t *p_osm) osm_console_prompt(p_osm->console.out); #ifdef ENABLE_OSM_CONSOLE_SOCKET - } else if (strcmp(opt->console, "socket") == 0) { + } else if (strcmp(opt->console, "socket") == 0 || + strcmp(opt->console, "loopback") == 0) { struct sockaddr_in sin; int optval = 1; @@ -534,7 +535,10 @@ void osm_console_init(osm_subn_opt_t *opt, osm_opensm_t *p_osm) setsockopt(p_osm->console.socket, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval)); sin.sin_family = AF_INET; sin.sin_port = htons(opt->console_port); - sin.sin_addr.s_addr = 
htonl(INADDR_ANY); + if (strcmp(opt->console, "socket") == 0) + sin.sin_addr.s_addr = htonl(INADDR_ANY); + else + sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK); if (bind(p_osm->console.socket, &sin, sizeof(sin)) < 0) { osm_log(&(p_osm->log), OSM_LOG_ERROR, From hrosenstock at xsigo.com Tue Sep 11 11:04:16 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 11 Sep 2007 11:04:16 -0700 Subject: [ofa-general] [PATCH] OpenSM: Improve QP0 and QP1 counter accounting Message-ID: <1189533856.11745.10.camel@hrosenstock-ws.xsigo.com> OpenSM: Improve QP0 and QP1 counter accounting Note: Patch is based on OFED 1.2 Signed-off-by: Hal Rosenstock diff --git a/osm/include/opensm/osm_sa.h b/osm/include/opensm/osm_sa.h index ea60341..eced96b 100644 --- a/osm/include/opensm/osm_sa.h +++ b/osm/include/opensm/osm_sa.h @@ -209,6 +209,7 @@ typedef struct _osm_sa * FIELDS * state * State of this SA object +* * p_subn * Pointer to the Subnet object for this subnet. * @@ -448,6 +449,22 @@ osm_sa_bind( * SEE ALSO *********/ +/****f* OpenSM: SA/osm_sa_vendor_send +* NAME +* osm_sa_vendor_send +* +* DESCRIPTION +* Sends SA MAD via osm_vendor_call and maintains the QP1 sent statistic +* +* SYNOPSIS +*/ +ib_api_status_t +osm_sa_vendor_send( + IN osm_bind_handle_t h_bind, + IN osm_madw_t* const p_madw, + IN boolean_t const resp_expected, + IN osm_subn_t* const p_subn ); + struct _osm_opensm_t; /****f* OpenSM: SA/osm_sa_db_file_dump * NAME diff --git a/osm/include/opensm/osm_sa_guidinfo_record.h b/osm/include/opensm/osm_sa_guidinfo_record.h index 5c23cf9..d3cb23d 100644 --- a/osm/include/opensm/osm_sa_guidinfo_record.h +++ b/osm/include/opensm/osm_sa_guidinfo_record.h @@ -98,7 +98,7 @@ BEGIN_C_DECLS */ typedef struct _osm_gir_rcv { - const osm_subn_t *p_subn; + osm_subn_t *p_subn; osm_sa_resp_t *p_resp; osm_mad_pool_t *p_mad_pool; osm_log_t *p_log; @@ -209,7 +209,7 @@ osm_gir_rcv_init( IN osm_gir_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_mad_pool, - 
IN const osm_subn_t* const p_subn, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ); /* diff --git a/osm/include/opensm/osm_sa_node_record.h b/osm/include/opensm/osm_sa_node_record.h index c0e8988..0ee8ae1 100644 --- a/osm/include/opensm/osm_sa_node_record.h +++ b/osm/include/opensm/osm_sa_node_record.h @@ -99,7 +99,7 @@ BEGIN_C_DECLS */ typedef struct _osm_nr_recv { - const osm_subn_t *p_subn; + osm_subn_t *p_subn; osm_sa_resp_t *p_resp; osm_mad_pool_t *p_mad_pool; osm_log_t *p_log; @@ -206,7 +206,7 @@ ib_api_status_t osm_nr_rcv_init( IN osm_nr_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_mad_pool, - IN const osm_subn_t* const p_subn, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ); /* diff --git a/osm/include/opensm/osm_sa_pkey_record.h b/osm/include/opensm/osm_sa_pkey_record.h index aceab9a..08b7fee 100644 --- a/osm/include/opensm/osm_sa_pkey_record.h +++ b/osm/include/opensm/osm_sa_pkey_record.h @@ -87,7 +87,7 @@ BEGIN_C_DECLS */ typedef struct _osm_pkey_rec_rcv { - const osm_subn_t* p_subn; + osm_subn_t* p_subn; osm_sa_resp_t* p_resp; osm_mad_pool_t* p_mad_pool; osm_log_t* p_log; @@ -198,7 +198,7 @@ osm_pkey_rec_rcv_init( IN osm_pkey_rec_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_mad_pool, - IN const osm_subn_t* const p_subn, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ); /* diff --git a/osm/include/opensm/osm_sa_response.h b/osm/include/opensm/osm_sa_response.h index b9e84d1..d883c3b 100644 --- a/osm/include/opensm/osm_sa_response.h +++ b/osm/include/opensm/osm_sa_response.h @@ -52,6 +52,7 @@ #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -97,6 +98,7 @@ BEGIN_C_DECLS typedef struct _osm_sa_resp { osm_mad_pool_t *p_pool; + osm_subn_t *p_subn; osm_log_t *p_log; } osm_sa_resp_t; /* @@ -186,6 +188,7 @@ ib_api_status_t 
osm_sa_resp_init( IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_pool, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log ); /* * PARAMETERS @@ -195,8 +198,8 @@ osm_sa_resp_init( * p_mad_pool * [in] Pointer to the MAD pool. * -* p_vl15 -* [in] Pointer to the VL15 interface. +* p_subn +* [in] Pointer to Subnet object for this subnet. * * p_log * [in] Pointer to the log object. diff --git a/osm/include/opensm/osm_sa_slvl_record.h b/osm/include/opensm/osm_sa_slvl_record.h index a5ce9b4..fabd133 100644 --- a/osm/include/opensm/osm_sa_slvl_record.h +++ b/osm/include/opensm/osm_sa_slvl_record.h @@ -100,7 +100,7 @@ BEGIN_C_DECLS */ typedef struct _osm_slvl_rec_rcv { - const osm_subn_t *p_subn; + osm_subn_t *p_subn; osm_sa_resp_t *p_resp; osm_mad_pool_t *p_mad_pool; osm_log_t *p_log; @@ -211,7 +211,7 @@ osm_slvl_rec_rcv_init( IN osm_slvl_rec_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_mad_pool, - IN const osm_subn_t* const p_subn, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ); /* diff --git a/osm/include/opensm/osm_sa_vlarb_record.h b/osm/include/opensm/osm_sa_vlarb_record.h index 4aad76f..9796483 100644 --- a/osm/include/opensm/osm_sa_vlarb_record.h +++ b/osm/include/opensm/osm_sa_vlarb_record.h @@ -100,7 +100,7 @@ BEGIN_C_DECLS */ typedef struct _osm_vlarb_rec_rcv { - const osm_subn_t *p_subn; + osm_subn_t *p_subn; osm_sa_resp_t *p_resp; osm_mad_pool_t *p_mad_pool; osm_log_t *p_log; @@ -211,7 +211,7 @@ osm_vlarb_rec_rcv_init( IN osm_vlarb_rec_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_mad_pool, - IN const osm_subn_t* const p_subn, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ); /* diff --git a/osm/include/opensm/osm_stats.h b/osm/include/opensm/osm_stats.h index 5cffc00..15bc8e0 100644 --- a/osm/include/opensm/osm_stats.h +++ b/osm/include/opensm/osm_stats.h @@ -90,9 +90,12 @@ typedef struct _osm_stats 
atomic32_t qp0_mads_rcvd; atomic32_t qp0_mads_sent; atomic32_t qp0_unicasts_sent; + atomic32_t qp0_mads_rcvd_unknown; atomic32_t qp1_mads_outstanding; atomic32_t qp1_mads_rcvd; atomic32_t qp1_mads_sent; + atomic32_t qp1_mads_rcvd_unknown; + atomic32_t qp1_mads_ignored; } osm_stats_t; /* @@ -117,6 +120,27 @@ typedef struct _osm_stats * Total number of response-less MADs sent on the wire. This count * includes getresp(), send() and trap() methods. * +* qp0_mads_rcvd_unknown +* Total number of unknown QP0 MADs received. This includes +* unrecognized attribute IDs and methods. +* +* qp1_mads_outstanding +* Contains the number of MADs outstanding on QP1. +* +* qp1_mads_rcvd +* Total number of QP1 MADs received. +* +* qp1_mads_sent +* Total number of QP1 MADs sent. +* +* qp1_mads_rcvd_unknown +* Total number of unknown QP1 MADs received. This includes +* unrecognized attribute IDs and methods. +* +* qp1_mads_ignored +* Total number of QP1 MADs received because SM is not +* master or SM is in first time sweep. +* * SEE ALSO ***************/ diff --git a/osm/include/opensm/osm_version.h b/osm/include/opensm/osm_version.h index ef91e16..6d2c8ee 100644 --- a/osm/include/opensm/osm_version.h +++ b/osm/include/opensm/osm_version.h @@ -55,7 +55,7 @@ BEGIN_C_DECLS * * SYNOPSIS */ -#define OSM_VERSION "OpenSM Rev:openib-3.0.14-xsigo2" +#define OSM_VERSION "OpenSM Rev:openib-3.0.14-xsigo3" /********/ END_C_DECLS diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c index 5575425..7acfdf1 100644 --- a/osm/opensm/osm_console.c +++ b/osm/opensm/osm_console.c @@ -336,23 +336,29 @@ static void print_status(osm_opensm_t *p_osm, FILE *out) p_osm->routing_engine.name ? 
p_osm->routing_engine.name : "null (min-hop)"); fprintf(out, "\n MAD stats\n" " ---------\n" - " QP0 MADS outstanding : %d\n" - " QP0 MADS outstanding (on wire) : %d\n" - " QP0 MADS rcvd : %d\n" - " QP0 MADS sent : %d\n" + " QP0 MADs outstanding : %d\n" + " QP0 MADs outstanding (on wire) : %d\n" + " QP0 MADs rcvd : %d\n" + " QP0 MADs sent : %d\n" " QP0 unicasts sent : %d\n" - " QP1 MADS outstanding : %d\n" - " QP1 MADS rcvd : %d\n" - " QP1 MADS sent : %d\n" + " QP0 unknown MADs rcvd : %d\n" + " QP1 MADs outstanding : %d\n" + " QP1 MADs rcvd : %d\n" + " QP1 MADs sent : %d\n" + " QP1 unknown MADs rcvd : %d\n" + " QP1 MADs ignored : %d\n" , p_osm->stats.qp0_mads_outstanding, p_osm->stats.qp0_mads_outstanding_on_wire, p_osm->stats.qp0_mads_rcvd, p_osm->stats.qp0_mads_sent, p_osm->stats.qp0_unicasts_sent, + p_osm->stats.qp0_mads_rcvd_unknown, p_osm->stats.qp1_mads_outstanding, p_osm->stats.qp1_mads_rcvd, - p_osm->stats.qp1_mads_sent + p_osm->stats.qp1_mads_sent, + p_osm->stats.qp1_mads_rcvd_unknown, + p_osm->stats.qp1_mads_ignored ); fprintf(out, "\n Subnet flags\n" " ------------\n" diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c index f91fa49..e1e1dec 100644 --- a/osm/opensm/osm_inform.c +++ b/osm/opensm/osm_inform.c @@ -57,6 +57,7 @@ #include #include #include +#include typedef struct _osm_infr_match_ctxt { @@ -442,7 +443,8 @@ __osm_send_report( *p_report_ntc = *p_ntc; /* The TRUE is for: response is expected */ - status = osm_vendor_send( p_report_madw->h_bind, p_report_madw, TRUE ); + status = osm_sa_vendor_send( p_report_madw->h_bind, p_report_madw, TRUE, + p_infr_rec->p_infr_rcv->p_subn ); if ( status != IB_SUCCESS ) { osm_log( p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c index d856fb0..f10ed60 100644 --- a/osm/opensm/osm_lid_mgr.c +++ b/osm/opensm/osm_lid_mgr.c @@ -1163,15 +1163,19 @@ __osm_lid_mgr_set_physp_pi( if ( (mtu != ib_port_info_get_neighbor_mtu(p_old_pi)) || (op_vls != 
ib_port_info_get_op_vls(p_old_pi))) { - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) +#if 0 + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_ERROR ) ) { - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, +#endif + osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_lid_mgr_set_physp_pi: " - "Sending Link Down due to op_vls or mtu change. MTU:%u,%u VL_CAP:%u,%u\n", + "Setting Link Down due to op_vls or mtu change. MTU:%u,%u VL_CAP:%u,%u\n", mtu, ib_port_info_get_neighbor_mtu(p_old_pi), op_vls, ib_port_info_get_op_vls(p_old_pi) ); +#if 0 } +#endif /* we need to make sure the internal DB will follow the fact the remote diff --git a/osm/opensm/osm_sa.c b/osm/opensm/osm_sa.c index 6d68ed2..360ad70 100644 --- a/osm/opensm/osm_sa.c +++ b/osm/opensm/osm_sa.c @@ -69,6 +69,7 @@ #include #include #include +#include #define OSM_SA_INITIAL_TID_VALUE 0xabc @@ -202,6 +203,7 @@ osm_sa_init( status = osm_sa_resp_init(&p_sa->resp, p_sa->p_mad_pool, + p_subn, p_log); if( status != IB_SUCCESS ) goto Exit; @@ -519,6 +521,22 @@ osm_sa_bind( return( status ); } +ib_api_status_t +osm_sa_vendor_send( + IN osm_bind_handle_t h_bind, + IN osm_madw_t* const p_madw, + IN boolean_t const resp_expected, + IN osm_subn_t* const p_subn ) +{ + ib_api_status_t status; + + cl_atomic_inc( &p_subn->p_osm->stats.qp1_mads_sent ); + status = osm_vendor_send( h_bind, p_madw, resp_expected ); + if ( status != IB_SUCCESS ) + cl_atomic_dec( &p_subn->p_osm->stats.qp1_mads_sent ); + return status; +} + /********************************************************************** **********************************************************************/ /* diff --git a/osm/opensm/osm_sa_class_port_info.c b/osm/opensm/osm_sa_class_port_info.c index da107ee..9ee434a 100644 --- a/osm/opensm/osm_sa_class_port_info.c +++ b/osm/opensm/osm_sa_class_port_info.c @@ -60,6 +60,7 @@ #include #include #include +#include #define MAX_MSECS_TO_RTV 24 /* Precalculated table in msec (index is related to encoded value) */ @@ -223,7 +224,8 @@ 
__osm_cpi_rcv_respond( if( osm_log_is_active( p_rcv->p_log, OSM_LOG_FRAMES ) ) osm_dump_sa_mad( p_rcv->p_log, p_resp_sa_mad, OSM_LOG_FRAMES ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if( status != IB_SUCCESS ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_guidinfo_record.c b/osm/opensm/osm_sa_guidinfo_record.c index 10fac3c..fe85eff 100644 --- a/osm/opensm/osm_sa_guidinfo_record.c +++ b/osm/opensm/osm_sa_guidinfo_record.c @@ -33,7 +33,6 @@ * */ - /* * Abstract: * Implementation of osm_gir_rcv_t. @@ -61,6 +60,7 @@ #include #include #include +#include #define OSM_GIR_RCV_POOL_MIN_SIZE 32 #define OSM_GIR_RCV_POOL_GROW_SIZE 32 @@ -108,7 +108,7 @@ osm_gir_rcv_init( IN osm_gir_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_mad_pool, - IN const osm_subn_t* const p_subn, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ) { @@ -595,7 +595,8 @@ osm_gir_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if (status != IB_SUCCESS) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c index 340a7f1..dc999b3 100644 --- a/osm/opensm/osm_sa_informinfo.c +++ b/osm/opensm/osm_sa_informinfo.c @@ -339,7 +339,8 @@ __osm_infr_rcv_respond( p_resp_infr = (ib_inform_info_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if ( status != IB_SUCCESS ) { @@ -647,7 +648,8 @@ osm_infr_rcv_process_get_method( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, 
p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if (status != IB_SUCCESS) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_lft_record.c b/osm/opensm/osm_sa_lft_record.c index b6333e7..ed989a0 100644 --- a/osm/opensm/osm_sa_lft_record.c +++ b/osm/opensm/osm_sa_lft_record.c @@ -58,6 +58,7 @@ #include #include #include +#include #define OSM_LFTR_RCV_POOL_MIN_SIZE 32 #define OSM_LFTR_RCV_POOL_GROW_SIZE 32 @@ -502,7 +503,8 @@ osm_lftr_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if (status != IB_SUCCESS) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index 169e75e..058b6b2 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -60,6 +60,7 @@ #include #include #include +#include #define OSM_LR_RCV_POOL_MIN_SIZE 64 #define OSM_LR_RCV_POOL_GROW_SIZE 64 @@ -679,7 +680,8 @@ __osm_lr_rcv_respond( } } - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if (status != IB_SUCCESS) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c index d6518e4..579e8f1 100644 --- a/osm/opensm/osm_sa_mad_ctrl.c +++ b/osm/opensm/osm_sa_mad_ctrl.c @@ -269,6 +269,7 @@ __osm_sa_mad_ctrl_process( There is an unknown MAD attribute type for which there is no recipient. Simply retire the MAD here. 
*/ + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_rcvd_unknown ); osm_mad_pool_put( p_ctrl->p_mad_pool, p_madw ); } @@ -330,6 +331,7 @@ __osm_sa_mad_ctrl_rcv_callback( */ if ( p_ctrl->p_subn->sm_state != IB_SMINFO_STATE_MASTER ) { + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_ignored ); osm_log( p_ctrl->p_log, OSM_LOG_VERBOSE, "__osm_sa_mad_ctrl_rcv_callback: " "Received SA MAD while SM not MASTER. MAD ignored\n"); @@ -338,6 +340,7 @@ __osm_sa_mad_ctrl_rcv_callback( } if ( p_ctrl->p_subn->first_time_master_sweep == TRUE ) { + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_ignored ); osm_log( p_ctrl->p_log, OSM_LOG_VERBOSE, "__osm_sa_mad_ctrl_rcv_callback: " "Received SA MAD while SM in first sweep. MAD ignored\n"); @@ -394,6 +397,7 @@ __osm_sa_mad_ctrl_rcv_callback( break; default: + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_rcvd_unknown ); osm_log( p_ctrl->p_log, OSM_LOG_ERROR, "__osm_sa_mad_ctrl_rcv_callback: ERR 1A05: " "Unsupported method = 0x%X\n", diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index 50c4f22..260360f 100644 --- a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -68,6 +68,7 @@ #include #include #include +#include #define OSM_MCMR_RCV_POOL_MIN_SIZE 32 #define OSM_MCMR_RCV_POOL_GROW_SIZE 32 @@ -571,7 +572,8 @@ __osm_mcmr_rcv_respond( p_resp_mcmember_rec->pkt_life &= 0x3f; p_resp_mcmember_rec->pkt_life |= 2<<6; /* exactly */ - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if(status != IB_SUCCESS) { @@ -2266,7 +2268,8 @@ __osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* const p_rcv, CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if(status != IB_SUCCESS) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git 
a/osm/opensm/osm_sa_mft_record.c b/osm/opensm/osm_sa_mft_record.c index 005c9bd..d7c7544 100644 --- a/osm/opensm/osm_sa_mft_record.c +++ b/osm/opensm/osm_sa_mft_record.c @@ -57,6 +57,7 @@ #include #include #include +#include #define OSM_MFTR_RCV_POOL_MIN_SIZE 32 #define OSM_MFTR_RCV_POOL_GROW_SIZE 32 @@ -534,7 +535,8 @@ osm_mftr_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if (status != IB_SUCCESS) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c index 0c5643e..2df3699 100644 --- a/osm/opensm/osm_sa_multipath_record.c +++ b/osm/opensm/osm_sa_multipath_record.c @@ -64,6 +64,7 @@ #include #include #include +#include #define OSM_MPR_RCV_POOL_MIN_SIZE 64 #define OSM_MPR_RCV_POOL_GROW_SIZE 64 @@ -1536,7 +1537,8 @@ __osm_mpr_rcv_respond( osm_dump_sa_mad( p_rcv->p_log, p_resp_sa_mad, OSM_LOG_FRAMES ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if ( status != IB_SUCCESS ) { diff --git a/osm/opensm/osm_sa_node_record.c b/osm/opensm/osm_sa_node_record.c index 892582e..0d08a4c 100644 --- a/osm/opensm/osm_sa_node_record.c +++ b/osm/opensm/osm_sa_node_record.c @@ -58,6 +58,7 @@ #include #include #include +#include #define OSM_NR_RCV_POOL_MIN_SIZE 32 #define OSM_NR_RCV_POOL_GROW_SIZE 32 @@ -105,7 +106,7 @@ osm_nr_rcv_init( IN osm_nr_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_mad_pool, - IN const osm_subn_t* const p_subn, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ) { @@ -587,7 +588,8 @@ osm_nr_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = 
osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if (status != IB_SUCCESS) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c index 1b0f89f..b993fdd 100644 --- a/osm/opensm/osm_sa_path_record.c +++ b/osm/opensm/osm_sa_path_record.c @@ -67,6 +67,7 @@ #include #include #include +#include #ifdef ROUTER_EXP #include #include @@ -1892,7 +1893,8 @@ __osm_pr_rcv_respond( CL_ASSERT( cl_is_qlist_empty( p_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if( status != IB_SUCCESS ) { diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c index 5eb15df..2692d0c 100644 --- a/osm/opensm/osm_sa_pkey_record.c +++ b/osm/opensm/osm_sa_pkey_record.c @@ -49,6 +49,7 @@ #include #include #include +#include #define OSM_PKEY_REC_RCV_POOL_MIN_SIZE 32 #define OSM_PKEY_REC_RCV_POOL_GROW_SIZE 32 @@ -94,10 +95,10 @@ osm_pkey_rec_rcv_destroy( **********************************************************************/ ib_api_status_t osm_pkey_rec_rcv_init( - IN osm_pkey_rec_rcv_t* const p_rcv, + IN osm_pkey_rec_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, - IN osm_mad_pool_t* const p_mad_pool, - IN const osm_subn_t* const p_subn, + IN osm_mad_pool_t* const p_mad_pool, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ) { @@ -573,7 +574,8 @@ osm_pkey_rec_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if (status != IB_SUCCESS) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_portinfo_record.c b/osm/opensm/osm_sa_portinfo_record.c index 5d9b1b2..4aa1723 100644 --- a/osm/opensm/osm_sa_portinfo_record.c +++ 
b/osm/opensm/osm_sa_portinfo_record.c @@ -33,7 +33,6 @@ * */ - /* * Abstract: * Implementation of osm_pir_rcv_t. @@ -63,6 +62,7 @@ #include #include #include +#include #define OSM_PIR_RCV_POOL_MIN_SIZE 32 #define OSM_PIR_RCV_POOL_GROW_SIZE 32 @@ -865,7 +865,8 @@ osm_pir_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if (status != IB_SUCCESS) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_response.c b/osm/opensm/osm_sa_response.c index 4f158e9..fac2159 100644 --- a/osm/opensm/osm_sa_response.c +++ b/osm/opensm/osm_sa_response.c @@ -56,6 +56,7 @@ #include #include #include +#include /********************************************************************** **********************************************************************/ @@ -81,6 +82,7 @@ ib_api_status_t osm_sa_resp_init( IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_pool, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log ) { ib_api_status_t status = IB_SUCCESS; @@ -89,6 +91,7 @@ osm_sa_resp_init( osm_sa_resp_construct( p_resp ); + p_resp->p_subn = p_subn; p_resp->p_log = p_log; p_resp->p_pool = p_pool; @@ -158,8 +161,8 @@ osm_sa_send_error( if( osm_log_is_active( p_resp->p_log, OSM_LOG_FRAMES ) ) osm_dump_sa_mad( p_resp->p_log, p_resp_sa_mad, OSM_LOG_FRAMES ); - status = osm_vendor_send( osm_madw_get_bind_handle( p_resp_madw ), - p_resp_madw, FALSE ); + status = osm_sa_vendor_send( osm_madw_get_bind_handle( p_resp_madw ), + p_resp_madw, FALSE, p_resp->p_subn ); if( status != IB_SUCCESS ) { diff --git a/osm/opensm/osm_sa_service_record.c b/osm/opensm/osm_sa_service_record.c index b23a12d..4479f00 100644 --- a/osm/opensm/osm_sa_service_record.c +++ b/osm/opensm/osm_sa_service_record.c @@ -465,7 +465,8 @@ __osm_sr_rcv_respond( } } - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + 
status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if( status != IB_SUCCESS ) { diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c index d831ffd..885bdc5 100644 --- a/osm/opensm/osm_sa_slvl_record.c +++ b/osm/opensm/osm_sa_slvl_record.c @@ -61,6 +61,7 @@ #include #include #include +#include #define OSM_SLVL_REC_RCV_POOL_MIN_SIZE 32 #define OSM_SLVL_REC_RCV_POOL_GROW_SIZE 32 @@ -109,7 +110,7 @@ osm_slvl_rec_rcv_init( IN osm_slvl_rec_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_mad_pool, - IN const osm_subn_t* const p_subn, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ) { @@ -540,7 +541,8 @@ osm_slvl_rec_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if(status != IB_SUCCESS) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_sminfo_record.c b/osm/opensm/osm_sa_sminfo_record.c index 5e15f52..99e31c6 100644 --- a/osm/opensm/osm_sa_sminfo_record.c +++ b/osm/opensm/osm_sa_sminfo_record.c @@ -68,6 +68,7 @@ #include #include #include +#include #define OSM_SMIR_RCV_POOL_MIN_SIZE 32 #define OSM_SMIR_RCV_POOL_GROW_SIZE 32 @@ -570,7 +571,8 @@ osm_smir_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if( status != IB_SUCCESS ) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_sw_info_record.c b/osm/opensm/osm_sa_sw_info_record.c index da65864..1c2b6c7 100644 --- a/osm/opensm/osm_sa_sw_info_record.c +++ b/osm/opensm/osm_sa_sw_info_record.c @@ -57,6 +57,7 @@ #include #include #include +#include #define OSM_SIR_RCV_POOL_MIN_SIZE 32 #define OSM_SIR_RCV_POOL_GROW_SIZE 
32 @@ -522,7 +523,8 @@ osm_sir_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if (status != IB_SUCCESS) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c index f0ff957..fdb3d99 100644 --- a/osm/opensm/osm_sa_vlarb_record.c +++ b/osm/opensm/osm_sa_vlarb_record.c @@ -61,6 +61,7 @@ #include #include #include +#include #define OSM_VLARB_REC_RCV_POOL_MIN_SIZE 32 #define OSM_VLARB_REC_RCV_POOL_GROW_SIZE 32 @@ -109,7 +110,7 @@ osm_vlarb_rec_rcv_init( IN osm_vlarb_rec_rcv_t* const p_rcv, IN osm_sa_resp_t* const p_resp, IN osm_mad_pool_t* const p_mad_pool, - IN const osm_subn_t* const p_subn, + IN osm_subn_t* const p_subn, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ) { @@ -560,7 +561,8 @@ osm_vlarb_rec_rcv_process( CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, + p_rcv->p_subn ); if(status != IB_SUCCESS) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_sm_mad_ctrl.c b/osm/opensm/osm_sm_mad_ctrl.c index acd68d7..85729af 100644 --- a/osm/opensm/osm_sm_mad_ctrl.c +++ b/osm/opensm/osm_sm_mad_ctrl.c @@ -318,6 +318,7 @@ __osm_sm_mad_ctrl_process_get_resp( case IB_MAD_ATTR_NOTICE: case IB_MAD_ATTR_INFORM_INFO: default: + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); osm_log( p_ctrl->p_log, OSM_LOG_ERROR, "__osm_sm_mad_ctrl_process_get_resp: ERR 3103: " "Unsupported attribute = 0x%X\n", @@ -395,6 +396,7 @@ __osm_sm_mad_ctrl_process_get( break; default: + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); osm_log( p_ctrl->p_log, OSM_LOG_VERBOSE, "__osm_sm_mad_ctrl_process_get: " "Ignoring SubnGet MAD - unsupported attribute = 0x%X\n", @@ -487,6 +489,7 @@ 
__osm_sm_mad_ctrl_process_set( break; default: + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); osm_log( p_ctrl->p_log, OSM_LOG_ERROR, "__osm_sm_mad_ctrl_process_set: ERR 3107: " "Unsupported attribute = 0x%X\n", @@ -591,6 +594,7 @@ __osm_sm_mad_ctrl_process_trap( break; default: + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); osm_log( p_ctrl->p_log, OSM_LOG_ERROR, "__osm_sm_mad_ctrl_process_trap: ERR 3109: " "Unsupported attribute = 0x%X\n", @@ -763,6 +767,7 @@ __osm_sm_mad_ctrl_rcv_callback( case IB_MAD_METHOD_REPORT_RESP: case IB_MAD_METHOD_TRAP_REPRESS: default: + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); osm_log( p_ctrl->p_log, OSM_LOG_ERROR, "__osm_sm_mad_ctrl_rcv_callback: ERR 3112: " "Unsupported method = 0x%X\n", p_smp->method ); From ralph.campbell at qlogic.com Tue Sep 11 13:19:19 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 11 Sep 2007 13:19:19 -0700 Subject: [ofa-general] Re: RFC: modify upstream code to make backporting easier In-Reply-To: <20070911062851.GC15363@mellanox.co.il> References: <20070911062851.GC15363@mellanox.co.il> Message-ID: <1189541959.3650.6.camel@brick.pathscale.com> Looks OK to me from the InfiniPath side. Keeping the backported code as close as possible to the upstream code is a good thing in my view. On Tue, 2007-09-11 at 09:28 +0300, Michael S. Tsirkin wrote: > Roland, Ralph, all, > I'd like to get your opinion on the following matter: > OFED is backporting upstream rdma code to older kernels. > While doing so, I really take pains to keep the ported > code as close as possible to upstream original, > mostly by using preprocessor to implement, as closely > as possible, the APIs from recent kernels on top of > older ones. 
> > As an example where this works well, see my backport of the > new workqueue API to 2.6.19: > http://www.openfabrics.org/git/?p=ofed_1_3/linux-2.6.git;a=blob;f=kernel_addons/backport/2.6.19/include/linux/workqueue.h;hb=HEAD > > However, sometimes I am forced to patch the upstream code. Here's an > example of the patch needed to make ipath build on > 2.6.22: > > > diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c > index 09c5fd8..94edb5d 100644 > --- a/drivers/infiniband/hw/ipath/ipath_driver.c > +++ b/drivers/infiniband/hw/ipath/ipath_driver.c > @@ -287,6 +287,7 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, > struct ipath_devdata *dd; > unsigned long long addr; > u32 bar0 = 0, bar1 = 0; > + u8 rev; > > dd = ipath_alloc_devdata(pdev); > if (IS_ERR(dd)) { > @@ -448,7 +449,13 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, > dd->ipath_deviceid = ent->device; /* save for later use */ > dd->ipath_vendorid = ent->vendor; > > - dd->ipath_pcirev = pdev->revision; > + ret = pci_read_config_byte(pdev, PCI_REVISION_ID, &rev); > + if (ret) { > + ipath_dev_err(dd, "Failed to read PCI revision ID unit " > + "%u: err %d\n", dd->ipath_unit, -ret); > + goto bail_regions; /* shouldn't ever happen */ > + } > + dd->ipath_pcirev = rev; > > #if defined(__powerpc__) > /* There isn't a generic way to specify writethrough mappings */ > > > As you can see, there's nothing I can do with macros outside the code > to make it work without code changes. > However, the patching mechanism is pretty fragile with respect > to code reorgs etc. > I wonder whether it's acceptable in cases such as this to add > a wrapper in upstream code. For example, upstream could have: > > #ifndef pci_get_revision > #define pci_get_revision(dev) ((dev)->revision) > #endif > > and then all a 2.6.22 backport needs to do is define it's own > pci_get_revision macro. 
> > Upstream maintainers, can you pls comment ASAP on whether such approach would be > acceptable e.g. for 2.6.24? If I could get rid of backport > patches, it might make sense to start thinking about converting fixes > patches to git commits, post 1.3, as well. > > Thanks, > From rdreier at cisco.com Tue Sep 11 13:48:30 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Sep 2007 13:48:30 -0700 Subject: [ofa-general] Re: [PATCH] Export ehca module parameters in sysfs In-Reply-To: (Hoang-Nam Nguyen's message of "Thu, 30 Aug 2007 18:03:00 +0200") References: Message-ID: Thanks, applied both patches acked by Nam. From rdreier at cisco.com Tue Sep 11 14:11:50 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Sep 2007 14:11:50 -0700 Subject: [ofa-general] Re: userspace "deadlock" bug in libmlx4? In-Reply-To: <200709111803.38431.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 11 Sep 2007 18:03:38 +0300") References: <200709111803.38431.jackm@dev.mellanox.co.il> Message-ID: > We're (potentially) waiting on a pthread mutex inside a pthread spinlock. > Is there a problem here? I don't think so, since the CQ locks are never taken inside the qp_table_lock. So there's no chance of AB-BA deadlock. Or were you thinking of some other problem that I'm missing? - R. From rdreier at cisco.com Tue Sep 11 14:19:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Sep 2007 14:19:49 -0700 Subject: [ofa-general] Re: mlx4 violating radix tree API locking rules? In-Reply-To: <20070911090313.GE15363@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 11 Sep 2007 12:03:13 +0300") References: <20070911090313.GE15363@mellanox.co.il> Message-ID: > I guess CQ spinlock implies rcu_read_lock - is that right? > But I do not see any synchronize_rcu calls anywhere in mlx4. > Should destroy QP and friends call synchronize_rcu after > removing the QP from radix tree but before freeing the QP structure? 
Well, I don't think we're really trying to use RCU to synchronize the radix tree. It's the same locking scheme as in mthca, except without the home-grown sparse array stuff: we have a qp table lock that protects inserting and removing QPs, and then we use the CQ locks to avoid looking up a QP that is being removed. However, I think you're right: we do violate the radix tree locking rules. So maybe we need to fall back to our own homegrown array stuff as in mthca. From rdreier at cisco.com Tue Sep 11 14:21:37 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Sep 2007 14:21:37 -0700 Subject: [ofa-general] Re: RFC: modify upstream code to make backporting easier In-Reply-To: <20070911062851.GC15363@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 11 Sep 2007 09:28:51 +0300") References: <20070911062851.GC15363@mellanox.co.il> Message-ID: > I wonder whether it's acceptable in cases such as this to add > a wrapper in upstream code. For example, upstream could have: > > #ifndef pci_get_revision > #define pci_get_revision(dev) ((dev)->revision) > #endif My feeling is that this type of wrapper is just obfuscation that makes the driver harder to read and maintain. If there's a way to make backporting easier that also makes the upstream driver better, then I'm in favor of it, but this sounds like a bad example to me. - R. From rdreier at cisco.com Tue Sep 11 14:23:30 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Sep 2007 14:23:30 -0700 Subject: [ofa-general] Re: [PATCH] IB/sa: error handling thinko fix In-Reply-To: <20070909115511.GC25910@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 9 Sep 2007 14:55:11 +0300") References: <20070909115511.GC25910@mellanox.co.il> Message-ID: thanks, applied. 
From rdreier at cisco.com Tue Sep 11 14:25:11 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Sep 2007 14:25:11 -0700 Subject: [ofa-general] Re: question about ack of completion/async events in libibverbs In-Reply-To: <46DEC367.2050203@dev.mellanox.co.il> (Dotan Barak's message of "Wed, 05 Sep 2007 17:55:35 +0300") References: <46DEC367.2050203@dev.mellanox.co.il> Message-ID: > This code will cause for a careless programmer to loop forever if he > acked the events (completion or async) > too many times .... > > will you accept a patch that will fix this issue? Yes, as long as you're careful to handle integer wraparound.
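A minimal sketch of the wraparound concern raised in the event-ack thread, assuming the completed and reported event counters are free-running 32-bit unsigned values. The helper name events_caught_up() is hypothetical and is not libibverbs code; it only illustrates one way to compare two counters that may wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of libibverbs.
 *
 * Returns true when `completed` has reached or passed `reported`,
 * treating both as free-running 32-bit counters that may wrap.
 * Unsigned subtraction is modulo 2^32, so the difference stays
 * meaningful across a wrap as long as the two counters are less
 * than 2^31 apart.
 */
static bool events_caught_up(uint32_t completed, uint32_t reported)
{
	return (uint32_t)(completed - reported) < UINT32_C(0x80000000);
}
```

Under these assumptions, a wait loop written as "while (!events_caught_up(completed, reported))" would also terminate when a consumer has acked more events than were reported, whereas a plain "!=" comparison would keep waiting in that case.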
From kanojsarcar at yahoo.com Tue Sep 11 19:20:02 2007 From: kanojsarcar at yahoo.com (Kanoj Sarcar) Date: Tue, 11 Sep 2007 19:20:02 -0700 (PDT) Subject: [ofa-general] RDMA/iwarp CM question Message-ID: <585200.56399.qm@web32507.mail.mud.yahoo.com> Hello iwarp/rdmacm folks, If an iwarp driver sends a IW_CM_EVENT_CONNECT_REQUEST type event to the OFA stack, what synchronization (if any) is provided by OFA against a service destruct downcall to the driver that will attempt to destroy the listener for which this upcall was made? Will some layer in OFA ensure that accept/reject(s) on children of a listener will not go down to the iwarp provider if the service_destroy on the listener has already been invoked? Thanks. Kanoj From rdreier at cisco.com Tue Sep 11 19:58:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 11 Sep 2007 19:58:14 -0700 Subject: [ofa-general] Re: [PATCH 7 of 17] ipoib: fix typo In-Reply-To: <1189526068.13053.117.camel@mtls03> (Eli Cohen's message of "Tue, 11 Sep 2007 18:54:28 +0300") References: <1189526068.13053.117.camel@mtls03> Message-ID: this is a good catch but I've had it in my tree for a while already... From arlin.r.davis at intel.com Tue Sep 11 20:50:40 2007 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Tue, 11 Sep 2007 20:50:40 -0700 Subject: [ofa-general] scp performance over IPoIB Message-ID: Can someone explain why scp performance over IPoIB would be 10x slower then on GBE? The netperf numbers look normal.
Running OFED 1.2, IPoIB-cm [ardavis at C-27-61 ~]$ /sbin/ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 inet addr:36.102.27.61 Bcast:36.102.255.255 Mask:255.255.0.0 UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:399017 errors:0 dropped:0 overruns:0 frame:0 TX packets:509466 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:735453689 (701.3 MiB) TX bytes:11065632473 (10.3 GiB) scp performance: ethernet: [ardavis at C-27-61 ~]$ scp 36.101.27.60:/tmp/testfile3 /tmp/testfile3 testfile3 100% 215MB 53.7MB/s 00:04 infiniband: [ardavis at C-27-61 ~]$ scp 36.102.27.60:/tmp/testfile3 /tmp/testfile3 testfile3 100% 215MB 5.8MB/s 00:37 netperf performance: ethernet: [ardavis at C-27-61 ~]$ netperf -f -M -c -C -H 36.101.27.60 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 36.101.27.60 (36.101.27.60) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. KBytes /s % S % S us/KB us/KB 87380 16384 16384 10.00 114501.08 6.59 11.74 2.302 4.100 infiniband: [ardavis at C-27-61 ~]$ netperf -f -M -c -C -H 36.102.27.60 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 36.102.27.60 (36.102.27.60) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 
KBytes /s % S % S us/KB us/KB 87380 16384 16384 10.00 340012.36 15.12 16.59 1.778 1.952 Thanks, -arlin From dotanb at dev.mellanox.co.il Tue Sep 11 22:39:12 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 12 Sep 2007 08:39:12 +0300 Subject: [ofa-general] Re: [PATCH] librdmacm 1/2: add valgrind support to auto-tools configuration file In-Reply-To: <46E03176.3010209@ichips.intel.com> References: <200708151352.42026.dotanb@dev.mellanox.co.il> <000201c7f00b$5826e900$3c98070a@amr.corp.intel.com> <46DFE93B.60702@dev.mellanox.co.il> <46E03176.3010209@ichips.intel.com> Message-ID: <46E77B80.2030305@dev.mellanox.co.il> Hi Sean. Sean Hefty wrote: > I checked a couple of older valgrind releases, and you are correct. > There are versions where it is undefined. I've reverted this change > back to match your original patch. Thanks. Did you commit this change to the librdmacm git? I would like to add support to libibcm based on your final patch. thanks Dotan From SNAGAI at jp.ibm.com Tue Sep 11 23:55:19 2007 From: SNAGAI at jp.ibm.com (Shingo Nagai) Date: Wed, 12 Sep 2007 15:55:19 +0900 Subject: [ofa-general] DAPL Package Build Error on PPC64 Arch In-Reply-To: Message-ID: Scott, Thanks for your information. I tried to build Topspin uDAPL under my environment, but the makefile located in top directory need to include "vars.mk" and "rules.mk" under the directory "../../../make". I am wondering this dapl package cannot be built by itself and parent source code tree is needed to have these make configuration files. Could you kindly show me the way to build Topspin uDAPL ? "Scott Weitzenkamp \(sweitzen\)" 2007/09/12 00:48 To Shingo Nagai/Japan/IBM at IBMJP, cc Subject RE: [ofa-general] DAPL Package Build Error on PPC64 Arch You are hitting https://bugs.openfabrics.org/show_bug.cgi?id=48. 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > snagai at jp.ibm.com > Sent: Tuesday, September 11, 2007 5:28 AM > To: general at lists.openfabrics.org > Subject: [ofa-general] DAPL Package Build Error on PPC64 Arch > > I am trying to build OFED with enabling DAPL package, but > build proceess did not complete due to some errors. > > I just unzipped tar ball "OFED-1.2.tgz" and run build script > "build.sh". > Because I need to enable uDAPL on ppc64 linux machine, if > someone has already succeeded it, please show me the way. > > My build environment and error messages are below. It seems > the definition of "__PPC64__" is missing. > > [ build environment ] > > - machine arch: ppc64 > - OS : Fedora Core6 > - compiler: gcc4.1.1 > > [ error messages in build.log ] > > Make dapl started > make -C src/userspace/dapl \ > CPPFLAGS="-I../libibverbs/include/infiniband > -I../librdmacm/include \ > -I../libibverbs/include -I../../dat/include" \ > > AM_LDFLAGS="-L/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspac e/libibverbs/src -libverbs -> L/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/librdmacm/s > rc/ -lrdmacm" > make[1]: Entering directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > make all-recursive > make[2]: Entering directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > Making all in . > make[3]: Entering directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > if /bin/sh ./libtool --tag=CC --mode=compile gcc > -DHAVE_CONFIG_H -I. -I. -I. 
> -I../libibverbs/include/infiniband -I../librdmacm/include > -I../libibverbs/include -I../../dat/include -Wall -g > -D_GNU_SOURCE -DOS_RELEASE=131078 -DOPENIB -DCQ_WAIT_OBJECT > -I./dat/include/ -I./dapl/include/ -I./dapl/common > -I./dapl/udapl/linux -I./dapl/openib_cma -m32 -g -O2 > -L/usr/lib -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP > -MF ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" -c -o > dapl_udapl_libdaplcma_la-dapl_init.lo `test -f > 'dapl/udapl/dapl_init.c' || echo './'`dapl/udapl/dapl_init.c; \ > then mv -f > ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo" > ".deps/dapl_udapl_libdaplcma_la-dapl_init.Plo"; else rm -f > ".deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo"; exit 1; fi > mkdir .libs > gcc -DHAVE_CONFIG_H -I. -I. -I. > -I../libibverbs/include/infiniband -I../librdmacm/include > -I../libibverbs/include -I../../dat/include -Wall -g > -D_GNU_SOURCE -DOS_RELEASE=131078 -DOPENIB -DCQ_WAIT_OBJECT > -I./dat/include/ -I./dapl/include/ -I./dapl/common > -I./dapl/udapl/linux -I./dapl/openib_cma -m32 -g -O2 > -L/usr/lib -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP > -MF .deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo -c > dapl/udapl/dapl_init.c -fPIC -DPIC -o > .libs/dapl_udapl_libdaplcma_la-dapl_init.o > In file included from ./dapl/include/dapl.h:50, > from dapl/udapl/dapl_init.c:39: > ./dapl/udapl/linux/dapl_osd.h:53:2: error: #error UNDEFINED ARCH > make[3]: *** [dapl_udapl_libdaplcma_la-dapl_init.lo] Error 1 > make[3]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > make[1]: *** [all] Error 2 > make[1]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/ofa_user-1.2/src/userspace/dapl' > make: *** [dapl] Error 2 > error: Bad exit status from /var/tmp/rpm-tmp.33577 (%install) > > > RPM build errors: > user vlad does not exist - using root > group vlad does not exist - using root > user vlad does 
not exist - using root > group vlad does not exist - using root > Bad exit status from /var/tmp/rpm-tmp.33577 (%install) > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define '_prefix /usr' --define > 'build_root /home/testuser/tmp/OFED' --define > 'configure_options --with-dapl --with-ipoibtools > --with-libcxgb3 --with-libehca --with-libibcm > --with-libibcommon --with-libibmad --with-libibumad > --with-libibverbs --with-libipathverbs --with-libmthca > --with-opensm --with-librdmacm --with-libsdp > --with-openib-diags --with-sdpnetstat --with-srptools > --with-perftest --sysconfdir=/etc --mandir=/usr/share/man' > --define 'configure_options32 --with-dapl --with-ipoibtools > --with-libcxgb3 --with-libehca --with-libibcm > --with-libibcommon --with-libibmad --with-libibumad > --with-libibverbs --with-libipathverbs --with-libmthca > --with-opensm --with-librdmacm --with-libsdp > --with-openib-diags --with-sdpnetstat --with-srptools > --with-mstflint --with-tvflash --sysconfdir=/etc > --mandir=/usr/share/man' --define 'build_32bit 1' --define > '_mandir /usr/share/man' /home/testuser/archives/OFED-1.2/SRPMS/ofa_user-1.2-0.src.rpm" > > > -- > This message was sent on behalf of snagai at jp.ibm.com at > openSubscriber.com > http://www.opensubscriber.com/messages/general at lists.openfabri > cs.org/topic.html > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From harake at cscs.ch Wed Sep 12 02:13:07 2007 From: harake at cscs.ch (H. N. 
HARAKE) Date: Wed, 12 Sep 2007 11:13:07 +0200 Subject: [ofa-general] performance and Kernel support In-Reply-To: <46E6CD91.2030209@hp.com> References: <46E6CD91.2030209@hp.com> Message-ID: <5396BA25-8311-43CE-A41B-81DA384222E5@cscs.ch> Rick, I have dual core AMD opteron, the IB card are connected to 4X mellanox infiniscale 2400, I am running Sles 10. when i used to run the test without any message size (-m ) The SDP test didn't work for me I am try to figure out why also I had noproblem in loading sdp library. I also changed the txqueuelen to 4096 the default was 128, t1:~ # ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 80-00-04-04- FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:192.168.2.101 Bcast:192.168.2.255 Mask: 255.255.255.0 inet6 addr: fe80::202:c902:24:2909/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:11673701 errors:0 dropped:0 overruns:0 frame:0 TX packets:14069560 errors:0 dropped:5 overruns:0 carrier:0 collisions:0 txqueuelen:4096 RX bytes:59793476658 (57023.5 Mb) TX bytes:192547529035 (183627.6 Mb) t1:~ # netperf -H 192.168.2.100 -c -C -- TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.100 (192.168.2.100) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/ KB us/KB 87380 16384 16384 10.00 3989.07 12.81 23.66 0.526 0.972 adding the -m with 128K show a better results netperf -H 192.168.2.100 -c -C -- -m 128K bytes bytes bytes secs. 10^6bits/s % S % S us/ KB us/KB 87380 16384 131072 10.00 4646.27 12.91 24.76 0.455 0.873 in case of -m 1m the cpu util showed a real different in this case. bytes bytes bytes secs. 10^6bits/s % S % S us/ KB us/KB 87380 16384 1048576 10.00 5796.64 21.91 31.91 0.619 0.902 Thanks H. N. 
Harake On 11-Sep-2007, at 19:17, Rick Jones wrote: > SDP From vlad at lists.openfabrics.org Wed Sep 12 02:51:57 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 12 Sep 2007 02:51:57 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070912-0200 daily build status Message-ID: <20070912095158.1B026E60877@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.12 Passed on ia64 with 
linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: 
/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070912-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From swise at opengridcomputing.com Wed Sep 12 03:00:25 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 12 Sep 2007 05:00:25 -0500 Subject: [ofa-general] [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. Message-ID: <20070912100025.3190.89259.stgit@dell3.ogc.int> RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. Calling arp_send() to initiate neighbour discovery (ND) doesn't do the full ND protocol. Namely, it doesn't handle retransmitting the arp request if it is dropped. 
The function neigh_event_send() does all this. Without doing full ND, rdma address resolution fails in the presence of dropped arp bcast packets. Signed-off-by: Steve Wise --- drivers/infiniband/core/addr.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index c5c33d3..5381c80 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -161,8 +161,7 @@ static void addr_send_arp(struct sockadd if (ip_route_output_key(&rt, &fl)) return; - arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev, - rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL); + neigh_event_send(rt->u.dst.neighbour, NULL); ip_rt_put(rt); } From swise at opengridcomputing.com Wed Sep 12 03:51:16 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 12 Sep 2007 05:51:16 -0500 Subject: [ofa-general] RDMA/iwarp CM question In-Reply-To: <585200.56399.qm@web32507.mail.mud.yahoo.com> References: <585200.56399.qm@web32507.mail.mud.yahoo.com> Message-ID: <46E7C4A4.2060502@opengridcomputing.com> Kanoj Sarcar wrote: > Hello iwarp/rdmacm folks, > > If an iwarp driver sends a IW_CM_EVENT_CONNECT_REQUEST > type event to the OFA stack, what synchronization (if > any) is provided by OFA against a service destruct > downcall to the driver that will attempt to destroy > the listener for which this upcall was made? > > No synchronization is provided. The only thing I see is that a connect request will be dropped if the listening cm_id is being destroyed. So the iwarp cm protects its own data structures for this case. See iwcm.c cm_conn_req_handler() and destroy_cm_id(). But from the driver's perspective, one thread/cpu could be running a connect request event and be in the iwcm's event handler, while another thread/cpu is running a destroy on the listen cm_id and is in the driver's destroy_listen handler.
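As a rough illustration of the race described above, here is a user-space model, using pthreads, of how a provider might guard its own listener state when no outer layer serializes connect-request upcalls against listener destruction. All names (struct listener, listener_upcall_begin() and so on) are hypothetical and are not taken from any iwarp driver or from iwcm.c; a kernel driver would use its own locking primitives instead.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical listener state; names are illustrative only. */
struct listener {
	pthread_mutex_t lock;
	pthread_cond_t idle;     /* signalled when pending_upcalls reaches 0 */
	bool destroying;         /* set once destroy has begun */
	int pending_upcalls;     /* connect-request upcalls in flight */
};

static void listener_init(struct listener *l)
{
	pthread_mutex_init(&l->lock, NULL);
	pthread_cond_init(&l->idle, NULL);
	l->destroying = false;
	l->pending_upcalls = 0;
}

/*
 * Call before delivering a connect request upward. Returns false if
 * destroy has already started, in which case the request is dropped.
 */
static bool listener_upcall_begin(struct listener *l)
{
	bool ok;

	pthread_mutex_lock(&l->lock);
	ok = !l->destroying;
	if (ok)
		l->pending_upcalls++;
	pthread_mutex_unlock(&l->lock);
	return ok;
}

/* Call once the upcall has returned. */
static void listener_upcall_end(struct listener *l)
{
	pthread_mutex_lock(&l->lock);
	if (--l->pending_upcalls == 0)
		pthread_cond_broadcast(&l->idle);
	pthread_mutex_unlock(&l->lock);
}

/*
 * Mark the listener as going away and wait until no upcall is still
 * in flight. Freeing the structure is left to the caller.
 */
static void listener_destroy(struct listener *l)
{
	pthread_mutex_lock(&l->lock);
	l->destroying = true;
	while (l->pending_upcalls > 0)
		pthread_cond_wait(&l->idle, &l->lock);
	pthread_mutex_unlock(&l->lock);
}
```

In this model, a connect request that arrives after listener_destroy() has started is refused by listener_upcall_begin(), and the destroy path does not return while an earlier upcall is still running. It does not cover accepts or rejects issued later on child connections, which the thread treats as independent of the listener.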
> Will some layer in OFA ensure that accept/reject(s) on > children of a listener will not go down to the iwarp > provider if the service_destroy on the listener has > already been invoked? > > I don't think so. Once the connect request is passed up to the user, any association with the listening cm_id is gone. And I believe it should be valid that an application can get a connect request, then destroy the listen cm_id, then accept or reject the connect request. Steve. From fenkes at de.ibm.com Wed Sep 12 05:39:31 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 12 Sep 2007 14:39:31 +0200 Subject: [ofa-general] [PATCH] IB/ehca: Make sure user pages are from hugetlb before using MR large pages Message-ID: <200709121439.32641.fenkes@de.ibm.com> From: Hoang-Nam Nguyen ...because, on virtualized hardware like System p, we can't be sure that the physical pages behind them are contiguous. Signed-off-by: Joachim Fenkes --- Another patch for 2.6.24 that will apply cleanly on top of my previous patchset. Please review and apply. Thanks!
drivers/infiniband/hw/ehca/ehca_classes.h | 8 ++-- drivers/infiniband/hw/ehca/ehca_mrmw.c | 82 +++++++++++++++++++++++++---- 2 files changed, 75 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 206d4eb..c2edd4c 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -99,10 +99,10 @@ struct ehca_sport { struct ehca_sma_attr saved_attr; }; -#define HCA_CAP_MR_PGSIZE_4K 1 -#define HCA_CAP_MR_PGSIZE_64K 2 -#define HCA_CAP_MR_PGSIZE_1M 4 -#define HCA_CAP_MR_PGSIZE_16M 8 +#define HCA_CAP_MR_PGSIZE_4K 0x80000000 +#define HCA_CAP_MR_PGSIZE_64K 0x40000000 +#define HCA_CAP_MR_PGSIZE_1M 0x20000000 +#define HCA_CAP_MR_PGSIZE_16M 0x10000000 struct ehca_shca { struct ib_device ib_device; diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 4c8f3b3..1bb9d23 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -41,6 +41,8 @@ */ #include +#include +#include #include @@ -51,6 +53,7 @@ #define NUM_CHUNKS(length, chunk_size) \ (((length) + (chunk_size - 1)) / (chunk_size)) + /* max number of rpages (per hcall register_rpages) */ #define MAX_RPAGES 512 @@ -279,6 +282,52 @@ reg_phys_mr_exit0: } /* end ehca_reg_phys_mr() */ /*----------------------------------------------------------------------*/ +static int ehca_is_mem_hugetlb(unsigned long addr, unsigned long size) +{ + struct vm_area_struct **vma_list; + unsigned long cur_base; + unsigned long npages; + int ret, i; + + vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL); + if (!vma_list) { + ehca_gen_err("Can not alloc vma_list"); + return -ENOMEM; + } + + down_write(&current->mm->mmap_sem); + npages = PAGE_ALIGN(size + (addr & ~PAGE_MASK)) >> PAGE_SHIFT; + cur_base = addr & PAGE_MASK; + + while (npages) { + ret = get_user_pages(current, current->mm, cur_base, + min_t(int, npages, + PAGE_SIZE /
sizeof (*vma_list)), + 1, 0, NULL, vma_list); + + if (ret < 0) { + ehca_gen_err("get_user_pages() failed " + "ret=%x cur_base=%lx", ret, cur_base); + goto is_hugetlb_out; + } + + for (i = 0; i < ret; ++i) + if (!is_vm_hugetlb_page(vma_list[i])) { + ret = 0; + goto is_hugetlb_out; + } + + cur_base += ret * PAGE_SIZE; + npages -= ret; + } + ret = 1; + +is_hugetlb_out: + up_write(&current->mm->mmap_sem); + free_page((unsigned long) vma_list); + + return ret; +} struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, int mr_access_flags, @@ -346,18 +395,29 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, num_kpages = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE); /* select proper hw_pgsize */ if (ehca_mr_largepage && - (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)) { - if (length <= EHCA_MR_PGSIZE4K - && PAGE_SIZE == EHCA_MR_PGSIZE4K) - hwpage_size = EHCA_MR_PGSIZE4K; - else if (length <= EHCA_MR_PGSIZE64K) - hwpage_size = EHCA_MR_PGSIZE64K; - else if (length <= EHCA_MR_PGSIZE1M) - hwpage_size = EHCA_MR_PGSIZE1M; - else - hwpage_size = EHCA_MR_PGSIZE16M; + shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M) { + ret = ehca_is_mem_hugetlb(virt, length); + switch (ret) { + case 0: /* mem is not from hugetlb */ + hwpage_size = PAGE_SIZE; + break; + case 1: + if (length <= EHCA_MR_PGSIZE4K + && PAGE_SIZE == EHCA_MR_PGSIZE4K) + hwpage_size = EHCA_MR_PGSIZE4K; + else if (length <= EHCA_MR_PGSIZE64K) + hwpage_size = EHCA_MR_PGSIZE64K; + else if (length <= EHCA_MR_PGSIZE1M) + hwpage_size = EHCA_MR_PGSIZE1M; + else + hwpage_size = EHCA_MR_PGSIZE16M; + break; + default: /* out of mem */ + ib_mr = ERR_PTR(-ENOMEM); + goto reg_user_mr_exit1; + } } else - hwpage_size = EHCA_MR_PGSIZE4K; + hwpage_size = EHCA_MR_PGSIZE4K; /* ehca1 can only 4k */ ehca_dbg(pd->device, "hwpage_size=%lx", hwpage_size); reg_user_mr_fallback: -- 1.5.2 From fenkes at de.ibm.com Wed Sep 12 07:42:50 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed,
12 Sep 2007 16:42:50 +0200 Subject: [ofa-general] Re: [PATCH 08/12] IB/ehca: Replace get_paca()->paca_index by the more portable smp_processor_id() In-Reply-To: <20070911145131.GN32388@localdomain> References: <200709111518.26276.fenkes@de.ibm.com> <200709111533.14333.fenkes@de.ibm.com> <20070911145131.GN32388@localdomain> Message-ID: <200709121642.51198.fenkes@de.ibm.com> On Tuesday 11 September 2007 16:51, Nathan Lynch wrote: > > - get_paca()->paca_index, __FUNCTION__, \ > > + smp_processor_id(), __FUNCTION__, \ > > I think I see these macros used in preemptible code (e.g. ehca_probe), > where smp_processor_id() will print a warning when > CONFIG_DEBUG_PREEMPT=y. Probably better to use raw_smp_processor_id. You're right, man. The processor id doesn't need to be preemption-safe in this context, so that would be a bogus warning. Thanks for pointing this out. I'll post a new version of this patch. Joachim From fenkes at de.ibm.com Wed Sep 12 07:44:11 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Wed, 12 Sep 2007 16:44:11 +0200 Subject: [ofa-general] [PATCH 08/12] IB/ehca: Replace get_paca()->paca_index by the more portable raw_smp_processor_id() In-Reply-To: <200709111533.14333.fenkes@de.ibm.com> References: <200709111518.26276.fenkes@de.ibm.com> <200709111533.14333.fenkes@de.ibm.com> Message-ID: <200709121644.12717.fenkes@de.ibm.com> We can use raw_smp_processor_id() here because the processor ID is only used for debug output and may therefore be preemption-unsafe. Signed-off-by: Joachim Fenkes --- This is the same patch, but with smp_processor_id() replaced by raw_smp_processor_id(), as kindly pointed out to me by Nathan. Thanks! 
drivers/infiniband/hw/ehca/ehca_tools.h | 14 +++++++------- 1 files changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index f9b264b..4a8346a 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -73,37 +73,37 @@ extern int ehca_debug_level; if (unlikely(ehca_debug_level)) \ dev_printk(KERN_DEBUG, (ib_dev)->dma_device, \ "PU%04x EHCA_DBG:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, \ + raw_smp_processor_id(), __FUNCTION__, \ ## arg); \ } while (0) #define ehca_info(ib_dev, format, arg...) \ dev_info((ib_dev)->dma_device, "PU%04x EHCA_INFO:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + raw_smp_processor_id(), __FUNCTION__, ## arg) #define ehca_warn(ib_dev, format, arg...) \ dev_warn((ib_dev)->dma_device, "PU%04x EHCA_WARN:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + raw_smp_processor_id(), __FUNCTION__, ## arg) #define ehca_err(ib_dev, format, arg...) \ dev_err((ib_dev)->dma_device, "PU%04x EHCA_ERR:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + raw_smp_processor_id(), __FUNCTION__, ## arg) /* use this one only if no ib_dev available */ #define ehca_gen_dbg(format, arg...) \ do { \ if (unlikely(ehca_debug_level)) \ printk(KERN_DEBUG "PU%04x EHCA_DBG:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg); \ + raw_smp_processor_id(), __FUNCTION__, ## arg); \ } while (0) #define ehca_gen_warn(format, arg...) \ printk(KERN_INFO "PU%04x EHCA_WARN:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + raw_smp_processor_id(), __FUNCTION__, ## arg) #define ehca_gen_err(format, arg...) \ printk(KERN_ERR "PU%04x EHCA_ERR:%s " format "\n", \ - get_paca()->paca_index, __FUNCTION__, ## arg) + raw_smp_processor_id(), __FUNCTION__, ## arg) /** * ehca_dmp - printk a memory block, whose length is n*8 bytes. 
-- 1.5.2 From rick.jones2 at hp.com Wed Sep 12 09:36:16 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 12 Sep 2007 09:36:16 -0700 Subject: [ofa-general] scp performance over IPoIB In-Reply-To: References: Message-ID: <46E81580.4040900@hp.com> Davis, Arlin R wrote: > Can someone explain why scp performance over IPoIB would be 10x slower > then on GBE? The netperf numbers look normal. Might you be running into limitations of the app-level windowing in scp (ssl?). ISTR there is need of patching to get scp to work "well" with high bandwidth delay product links. If your IPoIB happens to have higher latency than your GbE setup - which you can check with a netperf TCP_RR test then perhaps the Tput < W/RTT thing is happening - with the app-level windowing. Might also check the send sizes in scp relative to the MTU - the OFED 1.2 IPoIB MTU is (IIRC) 65520 bytes and perhaps the scp sends aren't playing well with that. Netperf TCP_STREAM will just shove bytes into the socket until blocked, which scp may not do. So, you could try tweaking the MTU on the IPoIB interfaces. rick jones From sean.hefty at intel.com Wed Sep 12 11:06:45 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 12 Sep 2007 11:06:45 -0700 Subject: [ofa-general] Re: [PATCH] librdmacm 1/2: add valgrind support to auto-tools configuration file In-Reply-To: <46E77B80.2030305@dev.mellanox.co.il> References: <200708151352.42026.dotanb@dev.mellanox.co.il> <000201c7f00b$5826e900$3c98070a@amr.corp.intel.com> <46DFE93B.60702@dev.mellanox.co.il> <46E03176.3010209@ichips.intel.com> <46E77B80.2030305@dev.mellanox.co.il> Message-ID: <000001c7f567$ae596800$ff0da8c0@amr.corp.intel.com> >Did you commit this change to the librdmacm git? I just pushed this upstream. Also, here's what I've created so far for the libibcm. I haven't tested or completed it, but at least it's a starting point. I stopped working when I reached the event handling code in the libibcm. 
If you don't get to this, I'll try to finish it up early next week.

- Sean

diff --git a/configure.in b/configure.in
index e33a188..ae47451 100644
--- a/configure.in
+++ b/configure.in
@@ -9,6 +9,18 @@ AM_INIT_AUTOMAKE(libibcm, 1.0-1)
 
 AM_PROG_LIBTOOL
 
+AC_ARG_WITH([valgrind],
+    AC_HELP_STRING([--with-valgrind],
+        [Enable valgrind annotations - default NO]))
+
+if test "$with_valgrind" != "" && test "$with_valgrind" != "no"; then
+    AC_DEFINE([INCLUDE_VALGRIND], 1,
+        [Define to 1 to enable valgrind annotations])
+    if test -d $with_valgrind; then
+        CPPFLAGS="$CPPFLAGS -I$with_valgrind/include"
+    fi
+fi
+
 AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries],
 [ if test "$enableval" = "no"; then
     disable_libcheck=yes
@@ -38,6 +50,12 @@ AC_CHECK_HEADER(infiniband/verbs.h, [],
     AC_MSG_ERROR([<infiniband/verbs.h> not found. Is libibverbs installed?]))
 AC_CHECK_HEADER(infiniband/marshall.h, [],
     AC_MSG_ERROR([<infiniband/marshall.h> not found. Is libibverbs installed?]))
+
+if test "$with_valgrind" != "" && test "$with_valgrind" != "no"; then
+AC_CHECK_HEADER(valgrind/memcheck.h, [],
+    AC_MSG_ERROR([valgrind requested but not found.]))
+fi
+
 fi
 
 AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
diff --git a/src/cm.c b/src/cm.c
index 2572972..1dba773 100644
--- a/src/cm.c
+++ b/src/cm.c
@@ -51,6 +51,17 @@
 #include
 #include
 
+#ifdef INCLUDE_VALGRIND
+#   include <valgrind/memcheck.h>
+#   ifndef VALGRIND_MAKE_MEM_DEFINED
+#       warning "Valgrind requested, but VALGRIND_MAKE_MEM_DEFINED undefined"
+#   endif
+#endif
+
+#ifndef VALGRIND_MAKE_MEM_DEFINED
+#   define VALGRIND_MAKE_MEM_DEFINED(addr,len)
+#endif
+
 #define PFX "libibcm: "
 
 static int abi_ver;
@@ -226,6 +237,8 @@ int ib_cm_create_id(struct ib_cm_device *device,
     if (result != size)
         goto err;
 
+    VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp);
+
     cm_id_priv->id.handle = resp->id;
     *cm_id = &cm_id_priv->id;
     return 0;
@@ -250,6 +263,8 @@ int ib_cm_destroy_id(struct ib_cm_id *cm_id)
     if (result != size)
         return (result > 0) ? -ENODATA : result;
 
+    VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp);
+
     cm_id_priv = container_of(cm_id, struct cm_id_private, id);
 
     pthread_mutex_lock(&cm_id_priv->mut);
@@ -279,6 +294,8 @@ int ib_cm_attr_id(struct ib_cm_id *cm_id, struct ib_cm_attr_param *param)
     if (result != size)
         return (result > 0) ? -ENODATA : result;
 
+    VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp);
+
     param->service_id = resp->service_id;
     param->service_mask = resp->service_mask;
     param->local_id = resp->local_id;
@@ -307,6 +324,8 @@ int ib_cm_init_qp_attr(struct ib_cm_id *cm_id,
     if (result != size)
         return (result > 0) ? -ENODATA : result;
 
+    VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp);
+
     *qp_attr_mask = resp->qp_attr_mask;
     ibv_copy_qp_attr_from_kern(qp_attr, resp);
 
@@ -818,6 +837,9 @@ int ib_cm_get_event(struct ib_cm_device *device, struct ib_cm_event **event)
         result = (result > 0) ? -ENODATA : result;
         goto done;
     }
+
+    VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp);
+
     /*
      * decode event.
      */

From sean.hefty at intel.com Wed Sep 12 11:13:08 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 12 Sep 2007 11:13:08 -0700
Subject: [ofa-general] RE: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery.
In-Reply-To: <20070912100025.3190.89259.stgit@dell3.ogc.int>
References: <20070912100025.3190.89259.stgit@dell3.ogc.int>
Message-ID: <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com>

>RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery.
>
>Calling arp_send() to initiate neighbour discovery (ND) doesn't do the
>full ND protocol. Namely, it doesn't handle retransmitting the arp
>request if it is dropped. The function neigh_event_send() does all this.
>Without doing full ND, rdma address resolution fails in the presence of
>dropped arp bcast packets.
>
>Signed-off-by: Steve Wise

Acked-by: Sean Hefty

Roland - can you please queue this up for 2.6.24?
From rick.jones2 at hp.com Wed Sep 12 11:28:21 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 12 Sep 2007 11:28:21 -0700
Subject: [ofa-general] scp performance over IPoIB
In-Reply-To: <46E82C9F.9010602@ichips.intel.com>
References: <46E81580.4040900@hp.com> <46E82C9F.9010602@ichips.intel.com>
Message-ID: <46E82FC5.9030906@hp.com>

Arlin Davis wrote:
> Rick Jones wrote:
>
>> Davis, Arlin R wrote:
>>
>>> Can someone explain why scp performance over IPoIB would be 10x slower
>>> than on GBE? The netperf numbers look normal.
>>
>> So, you could try tweaking the MTU on the IPoIB interfaces.
>
> Rick,
>
> Thanks for the suggestion. Looks like we may need to change the default
> MTU for IPoIB. It would be interesting to see results from other
> distributions.
>
> (Woodcrest, Xeon 5160, DDR, RHEL4U4)
>
>    MTU    SCP        NetPerf
>
>   1024    41 MB/s    151 MB/s
>   2048    50 MB/s    313 MB/s
>   4096    50 MB/s    485 MB/s
>   8192    50 MB/s    641 MB/s
>  16384    25 MB/s    761 MB/s
>  32768    50 MB/s    700 MB/s
>  65520     8 MB/s    440 MB/s

I'm actually a trifle surprised that netperf was affected by the 65520
MTU - I'm guessing you were using all defaults, which on "linux" IIRC
means netperf was making 16KB (K == 1024) sends. I suspect that if you
were to make 64K sends from netperf (test specific -m 64K) that the
numbers for 65520 might be better.

I'm really shaky on scp behaviour knowledge, but suspect that perhaps
with the "HPN" (High Performance Network) patches in place (check the
archives pointed to by:
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev) it might be
possible to get good SCP performance out of a 65520 byte MTU. I'm
_guessing_ that by default scp isn't trying to put out more than 65520
bytes worth of data in the sum of its sends with its own windowing and so
gets hit by issues with Nagle. I.e. it is doing write, write, read and
the second write at least is sub-MSS. Some strace tracing of the scp
transfer could confirm/deny that hypothesis.
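To put rough numbers on the Tput < W/RTT limit raised earlier in this thread: if an application keeps at most W bytes in flight per round trip, its throughput cannot exceed W divided by the round-trip time, whatever the link speed. The sketch below is only an illustration of that arithmetic. The function name window_limit_mbs, the 64 KiB window and the three RTT values are assumptions picked for the example, not scp's actual window or latencies measured on this setup.

```shell
#!/bin/sh
# window_limit_mbs WINDOW_BYTES RTT_US
# Upper bound on throughput when at most WINDOW_BYTES can be in flight per
# round trip: Tput <= W / RTT. Bytes per microsecond equals 10^6 bytes per
# second, so the integer quotient is the bound in MB/s (rounded down).
window_limit_mbs() {
    window_bytes=$1
    rtt_us=$2
    echo $(( window_bytes / rtt_us ))
}

# Hypothetical inputs, not measurements from this thread:
# a 64 KiB application-level window at three round-trip times.
for rtt_us in 100 500 1000; do
    printf 'W=65536 bytes, RTT=%4d us -> at most %d MB/s\n' \
        "$rtt_us" "$(window_limit_mbs 65536 "$rtt_us")"
done
```

With these assumed inputs the bound falls from 655 MB/s at 100 us to 131 MB/s at 500 us and 65 MB/s at 1000 us, which shows how a fixed application window makes throughput sensitive to latency. A measured RTT could come from the netperf TCP_RR test suggested earlier in the thread, since the reciprocal of the reported transaction rate approximates one round trip.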
So, it may not be necessary to shrink the MTU.

rick jones

From kanojsarcar at yahoo.com Wed Sep 12 12:07:43 2007
From: kanojsarcar at yahoo.com (Kanoj Sarcar)
Date: Wed, 12 Sep 2007 12:07:43 -0700 (PDT)
Subject: [ofa-general] Re: RDMA/iwarp CM question
In-Reply-To: <585200.56399.qm@web32507.mail.mud.yahoo.com>
Message-ID: <193319.53854.qm@web32501.mail.mud.yahoo.com>

Response to original mail did not come to me, but I see it in the
archives, responding back to the archived response. Please reply all on
your responses.

If the driver detaches the incoming (child) connection request from the
listener at the point of sending the IW_CM_EVENT_CONNECT_REQUEST upcall,
then for on-card connection clean up and child state cleanup in driver,
OFA must guarantee that an accept/reject downcall will be made in the
future.

I don't believe that guarantee currently exists. There is exactly one
failure point in the call chain
cm_work_handler():process_event():cm_conn_req_handler() where the driver
reject interface is invoked, but at multiple other failure points, this
is not done.

Also, looking at ucma.c, on destruction of a listener, I believe
ucma_cleanup_events() will go around killing all pending
IW_CM_EVENT_CONNECT_REQUEST requests, so the app will never get a chance
to do the accept/reject.

Doesn't this sound like a problem (namely provider/card resource leak due
to races with listener destruct)?

Kanoj

--- Kanoj Sarcar wrote:

> Hello iwarp/rdmacm folks,
>
> If an iwarp driver sends an IW_CM_EVENT_CONNECT_REQUEST
> type event to the OFA stack, what synchronization (if
> any) is provided by OFA against a service destruct
> downcall to the driver that will attempt to destroy
> the listener for which this upcall was made?
>
> Will some layer in OFA ensure that accept/reject(s) on
> children of a listener will not go down to the iwarp
> provider if the service_destroy on the listener has
> already been invoked?
>
> Thanks.
>
> Kanoj

From swise at opengridcomputing.com Wed Sep 12 12:33:04 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 12 Sep 2007 14:33:04 -0500
Subject: [ofa-general] Re: RDMA/iwarp CM question
In-Reply-To: <193319.53854.qm@web32501.mail.mud.yahoo.com>
References: <193319.53854.qm@web32501.mail.mud.yahoo.com>
Message-ID: <46E83EF0.1010908@opengridcomputing.com>

Kanoj Sarcar wrote:
> Response to original mail did not come to me, but I
> see it in the archives, responding back to the
> archived response. Please reply all on your responses.
>

I did reply to all. My outgoing folder shows that it went to both of
your addresses...

> If the driver detaches the incoming (child) connection
> request from the listener at the point of sending the
> IW_CM_EVENT_CONNECT_REQUEST upcall, then for on-card
> connection clean up and child state cleanup in driver,
> OFA must guarantee that an accept/reject downcall will
> be made in the future.

Or you can time it out in your driver.

> I don't believe that guarantee currently exists. There
> is exactly one failure point in the call chain
> cm_work_handler():process_event():cm_conn_req_handler()
> where the driver reject interface is invoked, but at
> multiple other failure points, this is not done.
>
> Also, looking at ucma.c, on destruction of a listener,
> I believe ucma_cleanup_events() will go around killing
> all pending IW_CM_EVENT_CONNECT_REQUEST requests, so
> the app will never get a chance to do the
> accept/reject.
> > It looks to me like ucma_clean_events() calls rdma_destroy_id() / iw_destroy_cm_id() / destroy_cm_id() which calls the provider reject function. Or NOT! :) There's a comment in the IW_CM_STATE_CONN_RECV case inside destroy_cm_id(): > /* > * App called destroy before/without calling accept after > * receiving connection request event notification or > * returned non zero from the event callback function. > * In either case, must tell the provider to reject. > */ But I don't see the call to reject the connection... Maybe you could add it and see if it clears up your issue? > Doesn't this sound like a problem (namely > provider/card resource leak due to races with listener > destruct)? > It does. But MPA mandates a timeout so the connections will get aborted eventually by the provider or peer... But I think you've found a bug... Steve. From kanojsarcar at yahoo.com Wed Sep 12 12:49:42 2007 From: kanojsarcar at yahoo.com (Kanoj Sarcar) Date: Wed, 12 Sep 2007 12:49:42 -0700 (PDT) Subject: [ofa-general] Re: RDMA/iwarp CM question In-Reply-To: <46E83EF0.1010908@opengridcomputing.com> Message-ID: <626974.64916.qm@web32513.mail.mud.yahoo.com> --- Steve Wise wrote: > > > Kanoj Sarcar wrote: > > Response to original mail did not come to me, but > I > > see it in the archives, responding back to the > > archived response. Please reply all on your > responses. > > > > I did reply to all. My outgoing folder shows that > it went to both of > your addresses... > Hmmm, this mail arrived in my yahoo bulk folder, might have happened with thel last one too, I probably overlooked, sorry. > > > If the driver detaches the incoming (child) > connection > > request from the listener at the point of sending > the > > IW_CM_EVENT_CONNECT_REQUEST upcall, then for > on-card > > connection clean up and child state cleanup in > driver, > > OFA must guarantee that a accept/reject downcall > will > > be made in the future. > > Or you can time it out in your driver. > See below. 
> > > > > > > I don't believe that gurantee currently exists. > There > > is exactly one failure point in the call chain > > > cm_work_handler():process_event():cm_conn_req_handler() > > that driver reject interface is invoked, but at > > multiple other failure points, this is not done. > > > > > > > Also, looking at ucma.c, on destruction of a > listener, > > I believe ucma_cleanup_events() will go around > killing > > all pending IW_CM_EVENT_CONNECT_REQUEST requests, > so > > the app will never get a chance to do the > > accept/reject. > > > > > > > It looks to me like ucma_clean_events() calls > rdma_destroy_id() / > iw_destroy_cm_id() / destroy_cm_id() which calls the > provider reject > function. Or NOT! :) There's a comment in the > IW_CM_STATE_CONN_RECV > case inside destroy_cm_id(): > > > /* > > * App called destroy > before/without calling accept after > > * receiving connection request > event notification or > > * returned non zero from the > event callback function. > > * In either case, must tell the > provider to reject. > > */ > > But I don't see the call to reject the connection... > > Maybe you could add it and see if it clears up your > issue? I haven't hit a problem yet, I am looking at what my driver should/should not do ... > > > > Doesn't this sound like a problem (namely > > provider/card resource leak due to races with > listener > > destruct)? > > > > It does. > > But MPA mandates a timeout so the connections will > get aborted > eventually by the provider or peer... > I believe the timeout you are talking about applies to limiting how long it takes (on responder side) from an incoming SYN to receipt of complete MPA request. I don't believe there is much logic in having a timeout between the incoming-connect upcall send by the driver and an eventual accept/reject done by the app, but thats a seperate discussion. The core problem is this though. On a listener destruct, the driver can either do: a. 
destroy all children on which an accept/reject has not yet been invoked,
and OFA stack then must stop app from sending an accept/reject down in
such case. There is currently an attempt to do this at the ucma layer
(eg cleanup unpolled events), but it is not race free.

b. OFA guarantees that an eventual accept/reject downcall will be made,
and driver can rely on that to prevent resource leakage.

Any other solution will have some problem somewhere. E.g., in your
timeout suggestion, if the driver goes ahead and cleans up the state on
on-card resource for the child, due to the race mentioned in a) above,
the app might succeed in making an eventual accept/reject, leading to a
kernel crash.

> But I think you've found a bug...
>
> Steve.
>

Are folks filing bugs in bugzilla or similar?

Thanks.

Kanoj

From swise at opengridcomputing.com Wed Sep 12 13:08:43 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 12 Sep 2007 15:08:43 -0500
Subject: [ofa-general] Re: RDMA/iwarp CM question
In-Reply-To: <626974.64916.qm@web32513.mail.mud.yahoo.com>
References: <626974.64916.qm@web32513.mail.mud.yahoo.com>
Message-ID: <46E8474B.7030606@opengridcomputing.com>

>> It looks to me like ucma_clean_events() calls rdma_destroy_id() /
>> iw_destroy_cm_id() / destroy_cm_id() which calls the provider reject
>> function. Or NOT! :) There's a comment in the IW_CM_STATE_CONN_RECV
>> case inside destroy_cm_id():
>>
>>> /*
>>>  * App called destroy before/without calling accept after
>>>  * receiving connection request event notification or
>>>  * returned non zero from the event callback function.
>>>  * In either case, must tell the provider to reject.
>>>  */
>>
>> But I don't see the call to reject the connection...
>>
>> Maybe you could add it and see if it clears up your issue?
> > I haven't hit a problem yet, I am looking at what my > driver should/should not do ... > >> >>> Doesn't this sound like a problem (namely >>> provider/card resource leak due to races with >> listener >>> destruct)? >>> >> It does. >> >> But MPA mandates a timeout so the connections will >> get aborted >> eventually by the provider or peer... >> > > I believe the timeout you are talking about applies to > limiting how long it takes (on responder side) from an > incoming SYN to receipt of complete MPA request. I > don't believe there is much logic in having a timeout > between the incoming-connect upcall send by the driver > and an eventual accept/reject done by the app, but > thats a seperate discussion. > My point is the peer will abort the TCP connection if the passive side never accepts or rejects. > The core problem is this though. On a listener > destruct, the driver can either do: > > a. destroy all children on which an accept/reject has > not yet been invoked, and OFA stack then must stop app > from sending an accept/reject down in such case. There > is currently an attempt to do this at the ucma layer > (eg cleanup unpolled events), but it is not race free. > This code is only cleaning up cm_id's that have _not_ been reaped by the application via get_rdma_cm_event(). Any connection requests that have been reaped will stay around until the application disposes of them via rdma_accept(), rdma_reject(), rdma_destroy_id(), or when the process exists. > b. OFA guarantees than an eventual accept/reject > downcall will be made, and driver can rely on that to > prevent resource leakage. > Yes I think the rdma core must guarantee an eventual accept/reject downcall. > Any other solution will have some problem somewhere. 
> EG, in your timeout suggestion, if the driver goes > ahead and cleans up the state on on-card resource for > the child, due to the race mentioned in a) above, the > app might succeed in making an eventual accept/reject, > leading to a kernel crash. > > >> But I think you've found a bug... >> >> Steve. >> > > Are folks filing bugs in bugzilla or similar? > You can if you want. There is a bugzilla db on the ofa site... Or provide the fix and test it. That would be ideal... Steve. From rdreier at cisco.com Wed Sep 12 13:21:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 12 Sep 2007 13:21:54 -0700 Subject: [ofa-general] Re: [PATCH 02/12] IB/ehca: Add 1 is not longer needed because of firmware interface change In-Reply-To: <200709111529.07935.fenkes@de.ibm.com> (Joachim Fenkes's message of "Tue, 11 Sep 2007 15:29:07 +0200") References: <200709111518.26276.fenkes@de.ibm.com> <200709111529.07935.fenkes@de.ibm.com> Message-ID: What happens if someone runs the new driver with older firmware? Or what if someone upgrades the firmware without updating the driver? - R. From kanojsarcar at yahoo.com Wed Sep 12 14:07:19 2007 From: kanojsarcar at yahoo.com (Kanoj Sarcar) Date: Wed, 12 Sep 2007 14:07:19 -0700 (PDT) Subject: [ofa-general] Re: RDMA/iwarp CM question In-Reply-To: <46E8474B.7030606@opengridcomputing.com> Message-ID: <925161.64591.qm@web32509.mail.mud.yahoo.com> --- Steve Wise wrote: > > >>> > >> It looks to me like ucma_clean_events() calls > >> rdma_destroy_id() / > >> iw_destroy_cm_id() / destroy_cm_id() which calls > the > >> provider reject > >> function. Or NOT! :) There's a comment in the > >> IW_CM_STATE_CONN_RECV > >> case inside destroy_cm_id(): > >> > >>> /* > >>> * App called destroy > >> before/without calling accept after > >>> * receiving connection request > >> event notification or > >>> * returned non zero from the > >> event callback function. > >>> * In either case, must tell the > >> provider to reject. 
> >>> */ > >> But I don't see the call to reject the > connection... > >> > >> Maybe you could add it and see if it clears up > your > >> issue? > > > > I haven't hit a problem yet, I am looking at what > my > > driver should/should not do ... > > > >> > >>> Doesn't this sound like a problem (namely > >>> provider/card resource leak due to races with > >> listener > >>> destruct)? > >>> > >> It does. > >> > >> But MPA mandates a timeout so the connections > will > >> get aborted > >> eventually by the provider or peer... > >> > > > > I believe the timeout you are talking about > applies to > > limiting how long it takes (on responder side) > from an > > incoming SYN to receipt of complete MPA request. I > > don't believe there is much logic in having a > timeout > > between the incoming-connect upcall send by the > driver > > and an eventual accept/reject done by the app, but > > thats a seperate discussion. > > > > My point is the peer will abort the TCP connection > if the passive side > never accepts or rejects. Agreed, but even though the connection is aborted, the handle for it can not be deallocted unless we can guarantee the local system will not make accept/reject calls. > > > > The core problem is this though. On a listener > > destruct, the driver can either do: > > > > a. destroy all children on which an accept/reject > has > > not yet been invoked, and OFA stack then must stop > app > > from sending an accept/reject down in such case. > There > > is currently an attempt to do this at the ucma > layer > > (eg cleanup unpolled events), but it is not race > free. > > > > This code is only cleaning up cm_id's that have > _not_ been reaped by the > application via get_rdma_cm_event(). Any connection > requests that have > been reaped will stay around until the application > disposes of them via > rdma_accept(), rdma_reject(), rdma_destroy_id(), or > when the process exists. 
Exactly, that's why I mentioned there is a non race-free attempt; it is
quite possible that the app has polled some of the events and will make
eventual accept/rejects.

> > b. OFA guarantees that an eventual accept/reject
> > downcall will be made, and driver can rely on that to
> > prevent resource leakage.
> >
>
> Yes I think the rdma core must guarantee an eventual
> accept/reject downcall.
>
> > Any other solution will have some problem somewhere.
> > E.g., in your timeout suggestion, if the driver goes
> > ahead and cleans up the state on on-card resource for
> > the child, due to the race mentioned in a) above, the
> > app might succeed in making an eventual accept/reject,
> > leading to a kernel crash.
> >
> >> But I think you've found a bug...
> >>
> >> Steve.
> >
> > Are folks filing bugs in bugzilla or similar?
>
> You can if you want. There is a bugzilla db on the
> ofa site...
>
> Or provide the fix and test it. That would be
> ideal...
>
> Steve.

I will likely file a bug just to ensure this is not lost, since it's
unlikely I will be able to work on a solution "soon".

Kanoj

From ssufficool at rov.sbcounty.gov Wed Sep 12 14:14:47 2007
From: ssufficool at rov.sbcounty.gov (Sufficool, Stanley)
Date: Wed, 12 Sep 2007 14:14:47 -0700
Subject: [ofa-general] scp performance over IPoIB
In-Reply-To: <46E82FC5.9030906@hp.com>
References: <46E81580.4040900@hp.com><46E82C9F.9010602@ichips.intel.com> <46E82FC5.9030906@hp.com>
Message-ID:

How exactly do you set the MTU for ipoib?

I am running the latest unpatched git branch of vofed kernel 1.2.5 and I
get "SIOCSIFMTU: Invalid argument" when I try ifconfig ib0 mtu 65520.
Anything above the preset 2044 returns this issue.
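The "SIOCSIFMTU: Invalid argument" error above fits an interface that is still in IPoIB datagram mode, which is what the reply that follows in this thread asks about. Below is a minimal sketch of the check-and-switch step, assuming the sysfs file /sys/class/net/<iface>/mode named in that reply. The function name ipoib_ensure_connected and its optional second argument (an alternate sysfs root, useful for a dry run against a scratch directory) are hypothetical, and writing the real file needs root.

```shell
#!/bin/sh
# ipoib_ensure_connected IFACE [SYSFS_ROOT]
# Report the current IPoIB mode of IFACE and switch it to "connected" if it
# is anything else. SYSFS_ROOT is a hypothetical override for trying the
# function against a scratch directory; it defaults to /sys/class/net.
ipoib_ensure_connected() {
    iface=$1
    sysfs_root=${2:-/sys/class/net}
    mode_file="$sysfs_root/$iface/mode"

    if [ ! -f "$mode_file" ]; then
        echo "no mode file for $iface (not an IPoIB interface?)" >&2
        return 1
    fi

    mode=$(cat "$mode_file")
    if [ "$mode" = "connected" ]; then
        echo "$iface already in connected mode"
    else
        echo "$iface is in $mode mode, switching to connected"
        echo connected > "$mode_file"
    fi
}
```

On a real system this would be followed by the command from the message above, for example: ipoib_ensure_connected ib0 && ifconfig ib0 mtu 65520. Whether 65520 is then accepted depends on the driver build, so treat that as something to verify rather than a guarantee.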
From sweitzen at cisco.com Wed Sep 12 14:21:04 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 12 Sep 2007 14:21:04 -0700
Subject: [ofa-general] scp performance over IPoIB
In-Reply-To:
References: <46E81580.4040900@hp.com><46E82C9F.9010602@ichips.intel.com><46E82FC5.9030906@hp.com>
Message-ID:

What does "cat /sys/class/net/ib0/mode" report? If "datagram", you need
to run "echo connected > /sys/class/net/ib0/mode", then you can raise
the MTU.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

From Nathan.Dauchy at noaa.gov Wed Sep 12 15:08:37 2007
From: Nathan.Dauchy at noaa.gov (Nathan Dauchy)
Date: Wed, 12 Sep 2007 16:08:37 -0600
Subject: [ofa-general] scp performance over IPoIB
In-Reply-To:
References: <46E81580.4040900@hp.com> <46E82C9F.9010602@ichips.intel.com> <46E82FC5.9030906@hp.com>
Message-ID: <46E86365.7030009@noaa.gov>

Scott Weitzenkamp (sweitzen) wrote:
> What does "cat /sys/class/net/ib0/mode" report? If "datagram", you need
> to run "echo connected > /sys/class/net/ib0/mode", then you can raise
> the MTU.
>
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems

Do ALL the nodes on the IB network have to be run in connected mode? Or
is a mixed configuration supported?

On ethernet, I have generally run into problems where mis-matched MTU
settings caused problems. Is this the case on Infiniband?

Thanks,
Nathan

From rick.jones2 at hp.com Wed Sep 12 15:18:16 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 12 Sep 2007 15:18:16 -0700
Subject: [ofa-general] scp performance over IPoIB
In-Reply-To: <46E86365.7030009@noaa.gov>
References: <46E81580.4040900@hp.com> <46E82C9F.9010602@ichips.intel.com> <46E82FC5.9030906@hp.com> <46E86365.7030009@noaa.gov>
Message-ID: <46E865A8.7040200@hp.com>

> On ethernet, I have generally run into problems where mis-matched MTU
> settings caused problems. Is this the case on Infiniband?
I would think the issues would be very similar if not exactly the same. For TCP you are probably OK with different MTUs in the same subnet, but with UDP sorrow and woe can be the result. Just like with JumboFrames on a subset of the subnet. rick jones From sweitzen at cisco.com Wed Sep 12 15:20:20 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 12 Sep 2007 15:20:20 -0700 Subject: [ofa-general] scp performance over IPoIB In-Reply-To: <46E86365.7030009@noaa.gov> References: <46E81580.4040900@hp.com> <46E82C9F.9010602@ichips.intel.com> <46E82FC5.9030906@hp.com> <46E86365.7030009@noaa.gov> Message-ID: Mixed config should be OK. Scott > -----Original Message----- > From: Nathan Dauchy [mailto:Nathan.Dauchy at noaa.gov] > Sent: Wednesday, September 12, 2007 3:09 PM > To: Scott Weitzenkamp (sweitzen); general > Cc: Sufficool, Stanley; Arlin Davis > Subject: Re: [ofa-general] scp performance over IPoIB > > Scott Weitzenkamp (sweitzen) wrote: > > What does "cat /sys/class/net/ib0/mode" report? If > "datagram", you need > > to run "echo connected > /sys/class/net/ib0/mode", then > you can raise > > the MTU. > > > > Scott Weitzenkamp > > SQA and Release Manager > > Server Virtualization Business Unit > > Cisco Systems > > Do ALL the nodes on the IB network have to be run in > connected mode? Or > is a mixed configuration supported? > > On ethernet, I have generally run into problems where mis-matched MTU > settings caused problems. Is this the case on Infiniband? 
>
> Thanks,
> Nathan
>
From rdreier at cisco.com Wed Sep 12 17:36:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 12 Sep 2007 17:36:28 -0700
Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace
In-Reply-To: <46D78104.mailJY81GRONO@systemfabricworks.com> (swelch@systemfabricworks.com's message of "Thu, 30 Aug 2007 21:46:28 -0500")
References: <46D78104.mailJY81GRONO@systemfabricworks.com>
Message-ID:

Hal and Sean, what was the final feeling about this? I seem to recall
some changes were requested? Are there two independent changes mixed
up in one patch here?

> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..9ec910b 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > } > > /* Check to post send on QP or process locally */ > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > + smi_check_local_resp_smp(smp, device) == IB_SMI_DISCARD) > goto out; > > local = kmalloc(sizeof *local, GFP_ATOMIC); > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > if (port_priv) { > mad_priv->mad.mad.mad_hdr.tid = > ((struct ib_mad *)smp)->mad_hdr.tid; > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > recv_mad_agent = find_mad_agent(port_priv, > &mad_priv->mad.mad); > } > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > index 1cfc298..d96fc8e 100644 > --- a/drivers/infiniband/core/smi.h > +++ b/drivers/infiniband/core/smi.h > @@ -71,4 +71,18 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > (smp->hop_ptr == smp->hop_cnt + 1)) ? > IB_SMI_HANDLE : IB_SMI_DISCARD); > } > + > +/* > + * Return 1 if the SMP response should be handled by the local management stack > + */ > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp *smp, > + struct ib_device *device) > +{ > + /* C14-13:3 -- We're at the end of the DR segment of path */ > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > + return ((device->process_mad && > + ib_get_smp_direction(smp) && > + !smp->hop_ptr) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); > +} > + > #endif /* __SMI_H_ */ From kliteyn at mellanox.co.il Wed Sep 12 21:18:17 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 13 Sep 2007 07:18:17 +0300 Subject: [ofa-general] nightly osm_sim report 2007-09-13:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-12 OpenSM git rev = Sun_Sep_9_15:57:42_2007 [27f7ec84dbb1060397fa930569bc88d8f6e1d373] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From rdreier at cisco.com Wed Sep 12 21:33:45 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 12 Sep 2007 21:33:45 -0700 Subject: [ofa-general] Re: [PATCH] IB/ehca: Make sure user pages are from hugetlb before using MR large pages In-Reply-To: <200709121439.32641.fenkes@de.ibm.com> (Joachim Fenkes's message of "Wed, 12 Sep 2007 14:39:31 +0200") References: <200709121439.32641.fenkes@de.ibm.com> Message-ID: > -#define HCA_CAP_MR_PGSIZE_4K 1 > -#define HCA_CAP_MR_PGSIZE_64K 2 > -#define HCA_CAP_MR_PGSIZE_1M 4 > 
-#define HCA_CAP_MR_PGSIZE_16M 8
> +#define HCA_CAP_MR_PGSIZE_4K 0x80000000
> +#define HCA_CAP_MR_PGSIZE_64K 0x40000000
> +#define HCA_CAP_MR_PGSIZE_1M 0x20000000
> +#define HCA_CAP_MR_PGSIZE_16M 0x10000000

Not sure I understand what this has to do with things... is this an
unrelated fix?

> +static int ehca_is_mem_hugetlb(unsigned long addr, unsigned long size)

This is rather awful -- another call to get_user_pages() to iterate
over all the vmas...

I would suggest extending ib_umem_get() to check the vmas and adding a
member to struct ib_umem to say whether the memory is entirely covered
by hugetlb pages or not.

> +	ret = ehca_is_mem_hugetlb(virt, length);
> +	switch (ret) {
> +	case 0: /* mem is not from hugetlb */
> +		hwpage_size = PAGE_SIZE;
> +		break;
> +	case 1:
> +		if (length <= EHCA_MR_PGSIZE4K
> +		    && PAGE_SIZE == EHCA_MR_PGSIZE4K)
> +			hwpage_size = EHCA_MR_PGSIZE4K;
> +		else if (length <= EHCA_MR_PGSIZE64K)
> +			hwpage_size = EHCA_MR_PGSIZE64K;
> +		else if (length <= EHCA_MR_PGSIZE1M)
> +			hwpage_size = EHCA_MR_PGSIZE1M;
> +		else
> +			hwpage_size = EHCA_MR_PGSIZE16M;
> +		break;
> +	default: /* out of mem */
> +		ib_mr = ERR_PTR(-ENOMEM);
> +		goto reg_user_mr_exit1;

It seems like it would be better to just assume the memory is not from
a hugetlb if ehca_is_mem_hugetlb() fails its memory allocation and
fall back to the PAGE_SIZE case rather than failing entirely.

Also if someone runs a kernel with 64K pages on a machine where they
end up being simulated from 4K pages, do you have the same issue with
the hypervisor ganging together non-contiguous pages?

 - R.
From gtelzur at bgu.ac.il Wed Sep 12 23:59:33 2007
From: gtelzur at bgu.ac.il (Guy Tel-Zur)
Date: Thu, 13 Sep 2007 06:59:33 GMT
Subject: [ofa-general] Can not unsubscribe from the mailing list
Message-ID:

Can someone please sign me off the mailing list.
The link at the bottom is broken

Regards,
gtelzur at bgu.ac.il

http://openib.org/mailman/listinfo/openib-general

Guy Tel-Zur, Ph.D.
http://tel-zur.com
From cap at nsc.liu.se Thu Sep 13 01:32:28 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Thu, 13 Sep 2007 10:32:28 +0200
Subject: [ofa-general] Can not unsubscribe from the mailing list
In-Reply-To:
References:
Message-ID: <200709131032.28095.cap@nsc.liu.se>

On Thursday 13 September 2007, Guy Tel-Zur wrote:
> Can someone please sign me off the mailing list.
> The link at the bottom is broken

Someone forgot to update that.. The list currently lives at:
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

This needs to be fixed by someone with mailman admin rights.

/Peter
From RAISCH at de.ibm.com Thu Sep 13 02:49:19 2007
From: RAISCH at de.ibm.com (Christoph Raisch)
Date: Thu, 13 Sep 2007 11:49:19 +0200
Subject: [ofa-general] Re: [PATCH] IB/ehca: Make sure user pages are from hugetlb before using MR large pages
In-Reply-To:
References: <200709121439.32641.fenkes@de.ibm.com>
Message-ID:

Roland Dreier wrote on 13.09.2007 06:33:45:

>
> Also if someone runs a kernel with 64K pages on a machine where they
> end up being simulated from 4K pages, do you have the same issue with
> the hypervisor ganging together non-contiguous pages?

With today's hypervisor and today's page sizes and today's MMUs we don't
have this problem if eHCA is enabled.
It is difficult to make predictions about the future, but that's not
specific to driver development. ;-)

>
> - R.

- Christoph R.
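
[Editor's sketch] The page-size choice discussed in the ehca thread above, combined with Roland's suggestion to fall back to PAGE_SIZE when the hugetlb check itself fails, might look roughly like the user-space C below. This is only an illustration under stated assumptions: pick_hwpage_size(), its hugetlb_status argument, and the MR_PGSIZE_* values are hypothetical stand-ins chosen for this sketch, not the driver's actual code or constants.

```c
/* Hypothetical stand-ins for the driver's EHCA_MR_PGSIZE* constants. */
#define MR_PGSIZE_4K   0x1000UL
#define MR_PGSIZE_64K  0x10000UL
#define MR_PGSIZE_1M   0x100000UL
#define MR_PGSIZE_16M  0x1000000UL

/*
 * Choose the hardware page size for a memory region.
 *
 * hugetlb_status mirrors the return convention of the quoted patch:
 *   1             -> the region is entirely backed by hugetlb pages
 *   0             -> the region is not from hugetlb
 *   anything else -> the check itself failed (e.g. out of memory)
 *
 * Assumption for this sketch: a failed check is treated like "not
 * hugetlb" and falls back to the kernel page size instead of failing
 * the registration, as suggested in the review.
 */
unsigned long pick_hwpage_size(int hugetlb_status,
                               unsigned long length,
                               unsigned long kernel_page_size)
{
    if (hugetlb_status != 1)
        return kernel_page_size;

    /* Same ladder as the quoted switch statement, case 1. */
    if (length <= MR_PGSIZE_4K && kernel_page_size == MR_PGSIZE_4K)
        return MR_PGSIZE_4K;
    else if (length <= MR_PGSIZE_64K)
        return MR_PGSIZE_64K;
    else if (length <= MR_PGSIZE_1M)
        return MR_PGSIZE_1M;
    else
        return MR_PGSIZE_16M;
}
```

With these assumed values, a 48 KB hugetlb-backed region on a 4K-page kernel would get the 64K hardware page size, while any region whose check failed simply keeps the kernel page size.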
From vlad at lists.openfabrics.org Thu Sep 13 02:52:23 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 13 Sep 2007 02:52:23 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070913-0200 daily build status Message-ID: <20070913095223.E343AE6084A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with 
linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070913-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hrosenstock at xsigo.com Thu Sep 13 07:24:19 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 13 Sep 2007 07:24:19 -0700 Subject: [ofa-general] [PATCH] ibnetdiscover: Support Xsigo chassis grouping Message-ID: <1189693459.13110.132.camel@hrosenstock-ws.xsigo.com> ibnetdiscover: Support Xsigo chassis grouping I think this also fixes a bug with grouping of multiple non Voltaire chassis as well. Note: this patch is against OFED 1.2 Signed-off-by: Hal Rosenstock diff --git a/diags/include/grouping.h b/diags/include/grouping.h index 4666935..3ba872c 100644 --- a/diags/include/grouping.h +++ b/diags/include/grouping.h @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. 
+ * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -104,4 +105,8 @@ char *get_chassis_type(unsigned char chassistype); char *get_chassis_slot(unsigned char chassisslot); uint64_t get_chassis_guid(unsigned char chassisnum); +int is_xsigo_guid(uint64_t guid); +int is_xsigo_tca(uint64_t guid); +int is_xsigo_hca(uint64_t guid); + #endif /* _GROUPING_H_ */ diff --git a/diags/include/ibnetdiscover.h b/diags/include/ibnetdiscover.h index d13a666..bfbe7f5 100644 --- a/diags/include/ibnetdiscover.h +++ b/diags/include/ibnetdiscover.h @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -44,6 +45,7 @@ #define VTR_VENDOR_ID 0x8f1 /* Voltaire */ #define TS_VENDOR_ID 0x5ad /* Cisco */ #define SS_VENDOR_ID 0x66a /* InfiniCon */ +#define XS_VENDOR_ID 0x1397 /* Xsigo */ typedef struct Port Port; diff --git a/diags/src/grouping.c b/diags/src/grouping.c index 0e5bd78..6602f26 100644 --- a/diags/src/grouping.c +++ b/diags/src/grouping.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -96,20 +97,91 @@ static uint64_t topspin_chassisguid(uint64_t guid) return guid & 0xffffffff00ffffffULL; } -static uint64_t get_chassisguid(uint64_t guid, uint32_t vendid) +int is_xsigo_guid(uint64_t guid) { - if (vendid == TS_VENDOR_ID || vendid == SS_VENDOR_ID) - return topspin_chassisguid(guid); + if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) + return 1; else - return guid; + return 0; +} + +static int is_xsigo_leafone(uint64_t guid) +{ + if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) + return 1; + else + return 0; +} + +int is_xsigo_hca(uint64_t guid) +{ + /* NodeType 2 is HCA */ + if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) + return 1; + else + return 0; +} + +int is_xsigo_tca(uint64_t guid) +{ + /* NodeType 3 is TCA */ + if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL) + return 1; + else + return 0; +} + +static int is_xsigo_ca(uint64_t guid) +{ + if (is_xsigo_hca(guid) || is_xsigo_tca(guid)) + return 1; + else + return 0; +} + +static int is_xsigo_switch(uint64_t guid) +{ + if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) + return 1; + else + return 0; +} + +static uint64_t xsigo_chassisguid(Node *node) +{ + if (!is_xsigo_ca(node->sysimgguid)) { + /* Byte 3 is NodeType and byte 4 is PortType */ + /* If NodeType is 1 (switch), PortType is masked */ + if (is_xsigo_switch(node->sysimgguid)) + return node->sysimgguid & 0xffffffff00ffffffULL; + else + return node->sysimgguid; + } else { + /* If peer port is Leaf 1, use its chassis GUID */ + if (is_xsigo_leafone(node->ports->remoteport->node->sysimgguid)) + return node->ports->remoteport->node->sysimgguid & + 0xffffffff00ffffffULL; + else + return node->sysimgguid; + } } -static struct ChassisList *find_chassisguid(uint64_t guid, uint32_t vendid) +static uint64_t get_chassisguid(Node *node) +{ + if (node->vendid == TS_VENDOR_ID || node->vendid == SS_VENDOR_ID) + return 
topspin_chassisguid(node->sysimgguid); + else if (node->vendid == XS_VENDOR_ID || is_xsigo_guid(node->sysimgguid)) + return xsigo_chassisguid(node); + else + return node->sysimgguid; +} + +static struct ChassisList *find_chassisguid(Node *node) { ChassisList *current; uint64_t chguid; - chguid = get_chassisguid(guid, vendid); + chguid = get_chassisguid(node); for (current = mylist.first; current; current = current->next) { if (current->chassisguid == chguid) return current; @@ -668,14 +740,13 @@ ChassisList *group_nodes() if (node->vendid == VTR_VENDOR_ID) continue; if (node->sysimgguid) { - chassis = find_chassisguid(node->sysimgguid, - node->vendid); + chassis = find_chassisguid(node); if (chassis) chassis->nodecount++; else { /* Possible new chassis */ add_chassislist(); - mylist.current->chassisguid = get_chassisguid(node->sysimgguid, node->vendid); + mylist.current->chassisguid = get_chassisguid(node); mylist.current->nodecount = 1; } } @@ -684,13 +755,12 @@ ChassisList *group_nodes() /* now, make another pass to see which nodes are part of chassis */ /* (defined as chassis->nodecount > 1) */ - for (dist = 0; dist <= maxhops_discovered; dist++) { + for (dist = 0; dist <= MAXHOPS; ) { for (node = nodesdist[dist]; node; node = node->dnext) { if (node->vendid == VTR_VENDOR_ID) continue; if (node->sysimgguid) { - chassis = find_chassisguid(node->sysimgguid, - node->vendid); + chassis = find_chassisguid(node); if (chassis && chassis->nodecount > 1) { if (!chassis->chassisnum) chassis->chassisnum = ++chassisnum; @@ -702,6 +772,10 @@ ChassisList *group_nodes() } } } + if (dist == maxhops_discovered) + dist = MAXHOPS; /* skip to CAs */ + else + dist++; } return (mylist.first); diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c index cb62c44..ac4cecd 100644 --- a/diags/src/ibnetdiscover.c +++ b/diags/src/ibnetdiscover.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. 
All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -450,14 +451,25 @@ list_node(Node *node) } void -out_ids(Node *node) +out_ids(Node *node, int group, char *chname) { fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); if (node->sysimgguid) - fprintf(f, "sysimgguid=0x%" PRIx64 "\n", node->sysimgguid); + fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); + if (group) + if (node->chrecord) + if (node->chrecord->chassisnum) { + fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); + if (chname) + fprintf(f, " (%s)", clean_nodedesc(chname)); + if (is_xsigo_tca(node->nodeguid)) + if (node->ports->remoteport) + fprintf(f, " slot %d", node->ports->remoteport->portnum); + } + fprintf(f, "\n"); } -void +uint64_t out_chassis(int chassisnum) { uint64_t guid; @@ -467,20 +479,20 @@ out_chassis(int chassisnum) if (guid) fprintf(f, " (guid 0x%" PRIx64 ")", guid); fprintf(f, "\n"); + return guid; } void -out_switch(Node *node, int group) +out_switch(Node *node, int group, char *chname) { char *str; char *nodename = NULL; - out_ids(node); + out_ids(node, group, chname); fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); if (group) { if (node->chrecord) { if (node->chrecord->chassisnum) { - fprintf(f, "\t\t# Chassis %d ", node->chrecord->chassisnum); /* Currently, only if Voltaire chassis */ if (node->vendid == VTR_VENDOR_ID) { str = get_chassis_type(node->chrecord->chassistype); @@ -510,12 +522,12 @@ out_switch(Node *node, int group) } void -out_ca(Node *node) +out_ca(Node *node, int group, char *chname) { char *node_type; char *node_type2; - out_ids(node); + out_ids(node, group, chname); switch(node->type) { case CA_NODE: node_type = "ca"; @@ -572,12 +584,15 @@ out_switch_port(Port *port, int group) rem_nodename = clean_nodedesc(port->remoteport->node->nodedesc); ext_port_str = out_ext_port(port->remoteport, group); - fprintf(f, 
"\t%s[%d]%s\t\t# \"%s\" lid %d\n", + fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d", node_name(port->remoteport->node), port->remoteport->portnum, ext_port_str ? ext_port_str : "", rem_nodename, port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid); + if (is_xsigo_tca(port->remoteport->portguid)) + fprintf(f, " slot %d", port->portnum); + fprintf(f, "\n"); if (rem_nodename && (port->remoteport->node->type == SWITCH_NODE)) free(rem_nodename); @@ -616,6 +631,8 @@ dump_topology(int listtype, int group) Port *port; int i = 0, dist = 0; time_t t = time(0); + uint64_t chguid; + char *chname = NULL; if (!listtype) { fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); @@ -633,11 +650,31 @@ dump_topology(int listtype, int group) if (!ch->chassisnum) continue; - out_chassis(ch->chassisnum); + chguid = out_chassis(ch->chassisnum); + chname = NULL; + if (is_xsigo_guid(chguid)) { + /* !!! */ + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { + if (node->chrecord) { + if (!node->chrecord->chassisnum) + continue; + } else + continue; + + if (node->chrecord->chassisnum != ch->chassisnum) + continue; + + if (is_xsigo_hca(node->nodeguid)) { + chname = node->nodedesc; + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); + } + } + } + fprintf(f, "\n# Spine Nodes"); for (n = 1; n <= (SPINES_MAX_NUM+1); n++) { if (ch->spinenode[n]) { - out_switch(ch->spinenode[n], group); + out_switch(ch->spinenode[n], group, chname); for (port = ch->spinenode[n]->ports; port; port = port->next, i++) if (port->remoteport) out_switch_port(port, group); @@ -646,34 +683,57 @@ dump_topology(int listtype, int group) fprintf(f, "\n# Line Nodes"); for (n = 1; n <= (LINES_MAX_NUM+1); n++) { if (ch->linenode[n]) { - out_switch(ch->linenode[n], group); + out_switch(ch->linenode[n], group, chname); for (port = ch->linenode[n]->ports; port; port = port->next, i++) if (port->remoteport) out_switch_port(port, group); } } - } + fprintf(f, 
"\n# Chassis Switches"); + for (dist = 0; dist <= maxhops_discovered; dist++) { - for (dist = 0; dist <= maxhops_discovered; dist++) { + for (node = nodesdist[dist]; node; node = node->dnext) { - for (node = nodesdist[dist]; node; node = node->dnext) { + /* Non Voltaire chassis */ + if (node->vendid == VTR_VENDOR_ID) + continue; + if (node->chrecord) { + if (!node->chrecord->chassisnum) + continue; + } else + continue; - /* Non Voltaire chassis */ - if (node->vendid == VTR_VENDOR_ID) - continue; + if (node->chrecord->chassisnum != ch->chassisnum) + continue; + + out_switch(node, group, chname); + for (port = node->ports; port; port = port->next, i++) + if (port->remoteport) + out_switch_port(port, group); + + } + + } + + fprintf(f, "\n# Chassis CAs"); + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { if (node->chrecord) { if (!node->chrecord->chassisnum) continue; } else continue; - out_switch(node, group); + if (node->chrecord->chassisnum != ch->chassisnum) + continue; + + out_ca(node, group, chname); for (port = node->ports; port; port = port->next, i++) if (port->remoteport) - out_switch_port(port, group); + out_ca_port(port, group); } + } } else { @@ -683,7 +743,7 @@ dump_topology(int listtype, int group) DEBUG("SWITCH: dist %d node %p", dist, node); if (!listtype) { - out_switch(node, group); + out_switch(node, group, chname); } else { if (listtype & SWITCH_NODE) list_node(node); @@ -697,6 +757,7 @@ dump_topology(int listtype, int group) } } + chname = NULL; if (group && !listtype) { fprintf(f, "\nNon-Chassis Nodes\n"); @@ -710,7 +771,7 @@ dump_topology(int listtype, int group) if (node->chrecord) if (node->chrecord->chassisnum) continue; - out_switch(node, group); + out_switch(node, group, chname); for (port = node->ports; port; port = port->next, i++) if (port->remoteport) @@ -725,9 +786,14 @@ dump_topology(int listtype, int group) for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { DEBUG("CA: dist %d node %p", dist, node); - if 
(!listtype) - out_ca(node); - else { + if (!listtype) { + if (group) + /* Now, skip chassis based CAs */ + if (node->chrecord) + if (node->chrecord->chassisnum) + continue; + out_ca(node, group, chname); + } else { if (listtype & CA_NODE) list_node(node); continue; From FENKES at de.ibm.com Thu Sep 13 07:27:03 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Thu, 13 Sep 2007 16:27:03 +0200 Subject: [ofa-general] Re: [PATCH] IB/ehca: Make sure user pages are from hugetlb before using MR large pages In-Reply-To: Message-ID: Roland Dreier wrote on 13.09.2007 06:33:45: > > -#define HCA_CAP_MR_PGSIZE_4K 1 > > -#define HCA_CAP_MR_PGSIZE_64K 2 > > -#define HCA_CAP_MR_PGSIZE_1M 4 > > -#define HCA_CAP_MR_PGSIZE_16M 8 > > +#define HCA_CAP_MR_PGSIZE_4K 0x80000000 > > +#define HCA_CAP_MR_PGSIZE_64K 0x40000000 > > +#define HCA_CAP_MR_PGSIZE_1M 0x20000000 > > +#define HCA_CAP_MR_PGSIZE_16M 0x10000000 > > Not sure I understand what this has to do with things... is this an > unrelated fix? Kinda. I can put it into its own patch if you want. > I would suggest extending ib_umem_get() to check the vmas and adding a > member to struct ib_umem to say whether the memory is entirely covered > by hugetlb pages or not. I like that approach - one patch coming right up! =) > > + default: /* out of mem */ > > + ib_mr = ERR_PTR(-ENOMEM); > > + goto reg_user_mr_exit1; > > It seems like it would be better to just assume the memory is not from > a hugetlb is ehca_is_mem_hugetlb() fails its memory allocation and > fall back to the PAGE_SIZE case rather than failing entirely. If ehca_is_mem_hugetlb() runs out of memory, ehca_reg_mr() is rather unlikely to get the memory, but it's worth a try, I'll give you that. I'll make the umem patch work that way. 
Joachim
From swise at opengridcomputing.com Thu Sep 13 07:37:42 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 13 Sep 2007 09:37:42 -0500
Subject: [ofa-general] [GIT PULL ofed_1_2_c] cxgb3 bug fixes
Message-ID: <46E94B36.70406@opengridcomputing.com>

Vlad (Michael/Tziporet in Vlad's absence),

Please integrate the following cxgb3 bug fixes into ofed-1.2.5. All of
these patches are either in 2.6.23 or merged into Jeff Garzik's upstream
branch of netdev-2.6 and will go into 2.6.24. Chelsio recommends we
update ofed-1.2.5 and ofed-1.3 with all of these fixes. I'll send another
email with the ofed-1.3 changes as they will be slightly different.

Please pull the ofed_1_2_c changes from:

git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c

The patch files added to kernel_patches/fixes include:

> swise at dell3:~/git/ofed-1.2.5> stg series
> + 0029-cxgb3-engine-microcode-load
> + 0030-cxgb3-MAC-workaround-update
> + 0031-cxgb3-Update-rx-coalescing-length
> + 0032-cxgb3-SGE-doorbell-overflow-warning
> + 0033-cxgb3-use-immediate-data-for-offload-Tx
> + 0034-cxgb3-Expose-HW-memory-page-info
> + 0035-cxgb3-tighten-checks-on-TID-values
> + 0036-cxgb3-Fatal-error-update
> + 0037-cxgb3-log-adapter-serial-number
> + 0038-cxgb3-Update-internal-memory-management
> + 0039-cxgb3-update-firmware-version
> + 0040-cxgb3-log-and-clear-PEX-errors
> + 0041-cxgb3-remove-false-positive-in-xgmac-workaround
> + 0042-cxgb3-Set-the-CQ_ERR-bit-in-CQ-contexts
> + 0043-cxgb3-CQ-context-operations-time-out-too-soon
> + 0044-cxgb3-Add-T3C-rev
> + 0045-cxgb3-Update-engine-microcode-version
> > 0046-cxgb3-driver-version

Steve.
From hal.rosenstock at gmail.com Thu Sep 13 08:24:14 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 13 Sep 2007 11:24:14 -0400 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: References: <46D78104.mailJY81GRONO@systemfabricworks.com> Message-ID: On 9/12/07, Roland Dreier wrote: > Hal and Sean, what was the final feeling about this? I seem to recall some changes were requested? Yes, some mainly cosmetic changes were requested for more clarity. This was all based on just code review. I have not had a chance to test this out yet in what environments I can. I hope to get to this next week. > Are there two independent changes mixed > up in one patch here? Are you referring to the memcpy as one change and the rest as another ? -- Hal > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > > index 6f42877..9ec910b 100644 > > --- a/drivers/infiniband/core/mad.c > > +++ b/drivers/infiniband/core/mad.c > > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > } > > > > /* Check to post send on QP or process locally */ > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > + smi_check_local_resp_smp(smp, device) == IB_SMI_DISCARD) > > goto out; > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > if (port_priv) { > > mad_priv->mad.mad.mad_hdr.tid = > > ((struct ib_mad *)smp)->mad_hdr.tid; > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > > recv_mad_agent = find_mad_agent(port_priv, > > &mad_priv->mad.mad); > > } > > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > > index 1cfc298..d96fc8e 100644 > > --- a/drivers/infiniband/core/smi.h > > +++ b/drivers/infiniband/core/smi.h > > @@ -71,4 +71,18 @@ 
static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > } > > + > > +/* > > + * Return 1 if the SMP response should be handled by the local management stack > > + */ > > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp *smp, > > + struct ib_device *device) > > +{ > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > + return ((device->process_mad && > > + ib_get_smp_direction(smp) && > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > +} > > + > > #endif /* __SMI_H_ */ > From fenkes at de.ibm.com Thu Sep 13 09:14:13 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 13 Sep 2007 18:14:13 +0200 Subject: [ofa-general] [PATCH 0/3] IB/ehca: MR/MW fixes Message-ID: <200709131814.13937.fenkes@de.ibm.com> This patchset replaces Nam's previous MR/MW patch (posted by me). I split the #define fixes into a separate patch and moved the "is the memory from hugetlbfs?" code into ib_umem_get(). [1/3] fixes the page size HW cap defines [2/3] adds the hugetlb test to ib_umem_get() [3/3] finally uses the hugetlb flag in ehca_reg_user_mr() The patches should apply cleanly, in order, on top of my previous 12-patch set. Please review the changes and apply the patches for 2.6.24 if they are okay. Regards, Joachim -- Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 
2) Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany eMail: fenkes at de.ibm.com From mshefty at ichips.intel.com Thu Sep 13 09:14:32 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 13 Sep 2007 09:14:32 -0700 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: References: <46D78104.mailJY81GRONO@systemfabricworks.com> Message-ID: <46E961E8.4070004@ichips.intel.com> Roland Dreier wrote: > Hal and Sean, what was the final feeling about this? I seem to recall > some changes were requested? Are there two independent changes mixed > up in one patch here? I don't think there are necessarily two independent patches here, but there were two minor changes that I requested: > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > if (port_priv) { > > mad_priv->mad.mad.mad_hdr.tid = > > ((struct ib_mad *)smp)->mad_hdr.tid; > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); Remove setting the TID, since the memcpy will overwrite it anyway. It would be nice to test that this change doesn't break ehca or qlogic adapters, but it doesn't look like the existing code in this area would work. You could separate this change out as a bug fix, I guess. > > +/* > > + * Return 1 if the SMP response should be handled by the local management stack > > + */ > > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp *smp, > > + struct ib_device *device) > > +{ > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > + return ((device->process_mad && > > + ib_get_smp_direction(smp) && > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > +} Update the comment above the function to replace '1' with IB_SMI_HANDLE. 
- Sean From fenkes at de.ibm.com Thu Sep 13 09:14:58 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 13 Sep 2007 18:14:58 +0200 Subject: [ofa-general] [PATCH 1/3] IB/ehca: Fix large page HW cap defines In-Reply-To: <200709131814.13937.fenkes@de.ibm.com> References: <200709131814.13937.fenkes@de.ibm.com> Message-ID: <200709131814.59307.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 206d4eb..c2edd4c 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -99,10 +99,10 @@ struct ehca_sport { struct ehca_sma_attr saved_attr; }; -#define HCA_CAP_MR_PGSIZE_4K 1 -#define HCA_CAP_MR_PGSIZE_64K 2 -#define HCA_CAP_MR_PGSIZE_1M 4 -#define HCA_CAP_MR_PGSIZE_16M 8 +#define HCA_CAP_MR_PGSIZE_4K 0x80000000 +#define HCA_CAP_MR_PGSIZE_64K 0x40000000 +#define HCA_CAP_MR_PGSIZE_1M 0x20000000 +#define HCA_CAP_MR_PGSIZE_16M 0x10000000 struct ehca_shca { struct ib_device ib_device; -- 1.5.2 From fenkes at de.ibm.com Thu Sep 13 09:15:28 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 13 Sep 2007 18:15:28 +0200 Subject: [ofa-general] [PATCH 2/3] IB/umem: Add hugetlb flag to struct ib_umem In-Reply-To: <200709131814.13937.fenkes@de.ibm.com> References: <200709131814.13937.fenkes@de.ibm.com> Message-ID: <200709131815.29040.fenkes@de.ibm.com> During ib_umem_get(), determine whether all pages from the memory region are hugetlb pages and report this in the "hugetlb" field. Low-level driver can use this information if they need it. 
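A minimal userspace sketch of the "hugetlb until proved otherwise" idea described above. It is not the patch itself: `struct page_info` and `region_is_hugetlb()` are hypothetical stand-ins for the kernel's per-page VMA lookup, and a NULL array models the case where the lookup table could not be allocated.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for "is the VMA behind this page a hugetlb VMA?" */
struct page_info {
	bool from_hugetlb;
};

/*
 * Returns 1 only if every page in the region is hugetlb-backed.
 * A NULL `pages` array models a failed allocation of the lookup table;
 * in that case the region is conservatively reported as not hugetlb.
 */
static int region_is_hugetlb(const struct page_info *pages, size_t npages)
{
	int hugetlb = 1;	/* assume hugetlb until proved otherwise */
	size_t i;

	if (!pages)
		return 0;

	for (i = 0; i < npages; ++i)
		if (!pages[i].from_hugetlb)
			hugetlb = 0;	/* one ordinary page clears the flag */

	return hugetlb;
}
```

The flag is only ever cleared, never set again, so a single non-hugetlb page anywhere in the region is enough to make a driver fall back to its ordinary page size.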
Signed-off-by: Joachim Fenkes --- drivers/infiniband/core/umem.c | 20 +++++++++++++++++++- include/rdma/ib_umem.h | 1 + 2 files changed, 20 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 664d2fa..2f54e29 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -37,6 +37,7 @@ #include #include #include +#include #include "uverbs.h" @@ -75,6 +76,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, { struct ib_umem *umem; struct page **page_list; + struct vm_area_struct **vma_list; struct ib_umem_chunk *chunk; unsigned long locked; unsigned long lock_limit; @@ -104,6 +106,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, */ umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ); + /* We assume the memory is from hugetlb until proved otherwise */ + umem->hugetlb = 1; + INIT_LIST_HEAD(&umem->chunk_list); page_list = (struct page **) __get_free_page(GFP_KERNEL); @@ -112,6 +117,14 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, return ERR_PTR(-ENOMEM); } + /* + * if we can't alloc the vma_list, it's not so bad; + * just assume the memory is not hugetlb memory + */ + vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL); + if (!vma_list) + umem->hugetlb = 0; + npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT; down_write(&current->mm->mmap_sem); @@ -131,7 +144,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, ret = get_user_pages(current, current->mm, cur_base, min_t(int, npages, PAGE_SIZE / sizeof (struct page *)), - 1, !umem->writable, page_list, NULL); + 1, !umem->writable, page_list, vma_list); if (ret < 0) goto out; @@ -152,6 +165,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK); for (i = 0; i < chunk->nents; ++i) { + if (vma_list && + 
!is_vm_hugetlb_page(vma_list[i + off])) + umem->hugetlb = 0; chunk->page_list[i].page = page_list[i + off]; chunk->page_list[i].offset = 0; chunk->page_list[i].length = PAGE_SIZE; @@ -186,6 +202,8 @@ out: current->mm->locked_vm = locked; up_write(&current->mm->mmap_sem); + if (vma_list) + free_page((unsigned long) vma_list); free_page((unsigned long) page_list); return ret < 0 ? ERR_PTR(ret) : umem; diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index c533d6c..2229842 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -45,6 +45,7 @@ struct ib_umem { int offset; int page_size; int writable; + int hugetlb; struct list_head chunk_list; struct work_struct work; struct mm_struct *mm; -- 1.5.2 From fenkes at de.ibm.com Thu Sep 13 09:16:20 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 13 Sep 2007 18:16:20 +0200 Subject: [ofa-general] [PATCH 3/3] IB/ehca: Make sure user pages are from hugetlb before using MR large pages In-Reply-To: <200709131814.13937.fenkes@de.ibm.com> References: <200709131814.13937.fenkes@de.ibm.com> Message-ID: <200709131816.21162.fenkes@de.ibm.com> ...because, on virtualized hardware like System p, we can't be sure that the physical pages behind them are contiguous. 
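The page-size selection in the diff that follows can be sketched in userspace as below. This is an illustration under assumptions, not the driver code: `fls64_sketch()` is a hypothetical replacement for the kernel's fls64(), `pick_page_shift()` is an invented name, and the large-page capability check that guards this logic in the driver is left out.

```c
#include <stdint.h>

#define PGSHIFT_4K	12
#define PGSHIFT_16M	24

/*
 * Userspace stand-in for the kernel's fls64(): 1-based index of the
 * highest set bit, or 0 when x is 0.
 */
static int fls64_sketch(uint64_t x)
{
	int bit = 0;

	while (x) {
		++bit;
		x >>= 1;
	}
	return bit;
}

/*
 * Pick a page shift for a region of `length` bytes (length > 0).
 * Non-hugetlb memory keeps `default_shift`; hugetlb memory gets the
 * smallest multiple-of-4 shift that covers the length, clamped to the
 * 4K..16M range.
 */
static int pick_page_shift(uint64_t length, int is_hugetlb, int default_shift)
{
	int shift = default_shift;

	if (is_hugetlb) {
		/* round the bit count of (length - 1) up to a multiple of 4 */
		shift = (fls64_sketch(length - 1) + 3) & ~3;
		if (shift < PGSHIFT_4K)
			shift = PGSHIFT_4K;
		if (shift > PGSHIFT_16M)
			shift = PGSHIFT_16M;
	}
	return shift;
}
```

Rounding up to a multiple of 4 works because the four supported sizes (4K, 64K, 1M, 16M) have shifts 12, 16, 20 and 24, so a 100000-byte hugetlb region, for example, lands on shift 20 (1M).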
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 25 +++++++++++++++---------- 1 files changed, 15 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 4c8f3b3..4ba8b7c 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -51,6 +51,7 @@ #define NUM_CHUNKS(length, chunk_size) \ (((length) + (chunk_size - 1)) / (chunk_size)) + /* max number of rpages (per hcall register_rpages) */ #define MAX_RPAGES 512 @@ -64,6 +65,11 @@ enum ehca_mr_pgsize { EHCA_MR_PGSIZE16M = 0x1000000L }; +#define EHCA_MR_PGSHIFT4K 12 +#define EHCA_MR_PGSHIFT64K 16 +#define EHCA_MR_PGSHIFT1M 20 +#define EHCA_MR_PGSHIFT16M 24 + static u32 ehca_encode_hwpage_size(u32 pgsize) { u32 idx = 0; @@ -347,17 +353,16 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, /* select proper hw_pgsize */ if (ehca_mr_largepage && (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)) { - if (length <= EHCA_MR_PGSIZE4K - && PAGE_SIZE == EHCA_MR_PGSIZE4K) - hwpage_size = EHCA_MR_PGSIZE4K; - else if (length <= EHCA_MR_PGSIZE64K) - hwpage_size = EHCA_MR_PGSIZE64K; - else if (length <= EHCA_MR_PGSIZE1M) - hwpage_size = EHCA_MR_PGSIZE1M; - else - hwpage_size = EHCA_MR_PGSIZE16M; + int page_shift = PAGE_SHIFT; + if (e_mr->umem->hugetlb) { + /* determine page_shift, clamp between 4K and 16M */ + page_shift = (fls64(length - 1) + 3) & ~3; + page_shift = min(max(page_shift, EHCA_MR_PGSHIFT4K), + EHCA_MR_PGSHIFT16M); + } + hwpage_size = 1UL << page_shift; } else - hwpage_size = EHCA_MR_PGSIZE4K; + hwpage_size = EHCA_MR_PGSIZE4K; /* ehca1 only supports 4k */ ehca_dbg(pd->device, "hwpage_size=%lx", hwpage_size); reg_user_mr_fallback: -- 1.5.2 From sean.hefty at intel.com Thu Sep 13 10:32:33 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 13 Sep 2007 10:32:33 -0700 Subject: [ofa-general] [RFC 0/2] ib/cm: add message counters Message-ID: 
<000001c7f62c$121c31a0$65cc180a@amr.corp.intel.com> The following patches provide an initial attempt at providing performance counters to the ib_cm. I still need to export the counters, but wanted to get feedback about the counters that were selected, along with how they are being gathered. The cm tracks the number of sends, receives, sent retries, and received duplicates. It does this per message (REQ, REP, etc.), per port. In total, there are 11 cm message types, giving 44 counters per port, though a small number of these are not possible (e.g. REJ retries). I did not want to add or change the cm state tracking just for message counters, so detecting a received duplicate is not always possible. A message is counted as a duplicate when it is clearly a duplicate, or when it has a high likelihood of being a duplicate. Signed-off-by: Sean Hefty From sean.hefty at intel.com Thu Sep 13 10:36:05 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 13 Sep 2007 10:36:05 -0700 Subject: [ofa-general] [RFC 1/2] ib/mad: report number of times a mad was retried In-Reply-To: <000001c7f62c$121c31a0$65cc180a@amr.corp.intel.com> References: <000001c7f62c$121c31a0$65cc180a@amr.corp.intel.com> Message-ID: <000101c7f62c$90534a40$65cc180a@amr.corp.intel.com> To allow ULPs to tune timeout values and capture retry statistics, report the number of times that a mad send operation was retried. For RMPP mads, report the total number of times that any portion (send window) of the send operation was retried. 
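A small sketch of the retry accounting described above, assuming a simplified state structure. The names `send_state`, `send_init()`, `send_retry()` and `send_window_acked()` are hypothetical, and the sketch returns -1 where the patch below returns -ETIMEDOUT.

```c
/* Hypothetical, simplified view of the per-send retry state. */
struct send_state {
	int max_retries;	/* limit requested by the caller */
	int retries_left;	/* remaining budget for the current window */
	int retries_used;	/* total retries, reported on completion */
};

static void send_init(struct send_state *s, int requested)
{
	s->max_retries = requested;
	s->retries_left = requested;
	s->retries_used = 0;
}

/* Returns 0 if a retry may be sent, -1 once the budget is exhausted. */
static int send_retry(struct send_state *s)
{
	if (!s->retries_left)
		return -1;

	s->retries_left--;
	s->retries_used++;
	return 0;
}

/* For RMPP, an ACK that advances the window restores the full budget. */
static void send_window_acked(struct send_state *s)
{
	s->retries_left = s->max_retries;
}
```

Keeping the limit and the running total in separate fields is what lets the total keep growing across RMPP windows while each window still gets the full budget.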
Signed-off-by: Sean Hefty --- drivers/infiniband/core/mad.c | 9 +++++++-- drivers/infiniband/core/mad_priv.h | 3 ++- drivers/infiniband/core/mad_rmpp.c | 2 +- include/rdma/ib_mad.h | 4 +++- 4 files changed, 13 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..91e62c3 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1100,7 +1100,9 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf, mad_send_wr->tid = ((struct ib_mad_hdr *) send_buf->mad)->tid; /* Timeout will be updated after send completes */ mad_send_wr->timeout = msecs_to_jiffies(send_buf->timeout_ms); - mad_send_wr->retries = send_buf->retries; + mad_send_wr->max_retries = send_buf->retries; + mad_send_wr->retries_left = send_buf->retries; + send_buf->retries = 0; /* Reference for work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); mad_send_wr->status = IB_WC_SUCCESS; @@ -2445,9 +2447,12 @@ static int retry_send(struct ib_mad_send_wr_private *mad_send_wr) { int ret; - if (!mad_send_wr->retries--) + if (!mad_send_wr->retries_left) return -ETIMEDOUT; + mad_send_wr->retries_left--; + mad_send_wr->send_buf.retries++; + mad_send_wr->timeout = msecs_to_jiffies(mad_send_wr->send_buf.timeout_ms); if (mad_send_wr->mad_agent_priv->agent.rmpp_version) { diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h index 9be5cc0..8b75010 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -131,7 +131,8 @@ struct ib_mad_send_wr_private { struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; __be64 tid; unsigned long timeout; - int retries; + int max_retries; + int retries_left; int retry; int refcount; enum ib_wc_status status; diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c index d43bc62..a5e2a31 100644 --- a/drivers/infiniband/core/mad_rmpp.c +++ b/drivers/infiniband/core/mad_rmpp.c @@ 
-684,7 +684,7 @@ static void process_rmpp_ack(struct ib_mad_agent_private *agent, if (seg_num > mad_send_wr->last_ack) { adjust_last_ack(mad_send_wr, seg_num); - mad_send_wr->retries = mad_send_wr->send_buf.retries; + mad_send_wr->retries_left = mad_send_wr->max_retries; } mad_send_wr->newwin = newwin; if (mad_send_wr->last_ack == mad_send_wr->send_buf.seg_count) { diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h index 8ec3799..7228c05 100644 --- a/include/rdma/ib_mad.h +++ b/include/rdma/ib_mad.h @@ -230,7 +230,9 @@ struct ib_class_port_info * @seg_count: The number of RMPP segments allocated for this send. * @seg_size: Size of each RMPP segment. * @timeout_ms: Time to wait for a response. - * @retries: Number of times to retry a request for a response. + * @retries: Number of times to retry a request for a response. For MADs + * using RMPP, this applies per window. On completion, returns the number + * of retries needed to complete the transfer. * * Users are responsible for initializing the MAD buffer itself, with the * exception of any RMPP header. Additional segment buffer space allocated From sean.hefty at intel.com Thu Sep 13 10:40:00 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 13 Sep 2007 10:40:00 -0700 Subject: [ofa-general] [RFC 2/2] ib/cm: add basic performance counters In-Reply-To: <000001c7f62c$121c31a0$65cc180a@amr.corp.intel.com> References: <000001c7f62c$121c31a0$65cc180a@amr.corp.intel.com> Message-ID: <000201c7f62d$1c004750$65cc180a@amr.corp.intel.com> Add performance/debug counters to track sent/received messages, retries, and duplicates. Counters are tracked per CM message type, per port. The counters are always enabled, so intrusive state tracking is not done. 
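The per-message bookkeeping described above can be sketched as follows. This is a single-threaded illustration only: it uses plain long counters where the patch uses atomic_long_t, the enum names carry an `_IDX` suffix to mark them as hypothetical, and `count_send()` / `count_recv()` are invented helpers.

```c
/* Counter indexes in attribute-ID order, mirroring the enum in the patch. */
enum {
	CM_REQ_IDX, CM_MRA_IDX, CM_REJ_IDX, CM_REP_IDX, CM_RTU_IDX,
	CM_DREQ_IDX, CM_DREP_IDX, CM_SIDR_REQ_IDX, CM_SIDR_REP_IDX,
	CM_LAP_IDX, CM_APR_IDX,
	CM_NUM_COUNTERS
};

#define ATTR_ID_OFFSET 0x0010u	/* attribute ID of the first CM message */

struct cm_counter {
	long xmit;
	long xmit_retries;
	long rcv;
	long rcv_duplicates;
};

/* Account for one completed send that needed `retries` retransmissions. */
static int count_send(struct cm_counter *c, unsigned int attr_id, long retries)
{
	unsigned int idx = attr_id - ATTR_ID_OFFSET;

	if (idx >= CM_NUM_COUNTERS)
		return -1;	/* not a CM attribute ID */

	c[idx].xmit += 1 + retries;
	c[idx].xmit_retries += retries;
	return 0;
}

/* Account for one received message, optionally flagged as a duplicate. */
static int count_recv(struct cm_counter *c, unsigned int attr_id, int duplicate)
{
	unsigned int idx = attr_id - ATTR_ID_OFFSET;

	if (idx >= CM_NUM_COUNTERS)
		return -1;

	c[idx].rcv++;
	if (duplicate)
		c[idx].rcv_duplicates++;
	return 0;
}
```

Because the indexes follow attribute-ID order, a single subtraction maps a MAD header to its counter slot; the unsigned arithmetic also makes IDs below the offset fail the range check.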
Signed-off-by: Sean Hefty --- drivers/infiniband/core/cm.c | 87 ++++++++++++++++++++++++++++++++++++++++-- 1 files changed, 83 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 2e39236..0cebcb3 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2006 Intel Corporation. All rights reserved. + * Copyright (c) 2004-2007 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. @@ -78,10 +78,35 @@ static struct ib_cm { struct workqueue_struct *wq; } cm; +/* Counter indexes ordered by attribute ID */ +enum { + CM_REQ_COUNTER, + CM_MRA_COUNTER, + CM_REJ_COUNTER, + CM_REP_COUNTER, + CM_RTU_COUNTER, + CM_DREQ_COUNTER, + CM_DREP_COUNTER, + CM_SIDR_REQ_COUNTER, + CM_SIDR_REP_COUNTER, + CM_LAP_COUNTER, + CM_APR_COUNTER, + CM_COUNTERS, + CM_ATTR_ID_OFFSET = 0x0010 +}; + +struct cm_counter { + atomic_long_t xmit; + atomic_long_t xmit_retries; + atomic_long_t rcv; + atomic_long_t rcv_duplicates; +}; + struct cm_port { struct cm_device *cm_dev; struct ib_mad_agent *mad_agent; u8 port_num; + struct cm_counter counters[CM_COUNTERS]; }; struct cm_device { @@ -1270,6 +1295,8 @@ static void cm_dup_req_handler(struct cm_work *work, struct ib_mad_send_buf *msg = NULL; int ret; + atomic_long_inc(&work->port->counters[CM_REQ_COUNTER].rcv_duplicates); + /* Quick state check to discard duplicate REQs. 
*/ if (cm_id_priv->id.state == IB_CM_REQ_RCVD) return; @@ -1616,6 +1643,7 @@ static void cm_dup_rep_handler(struct cm_work *work) if (!cm_id_priv) return; + atomic_long_inc(&work->port->counters[CM_REP_COUNTER].rcv_duplicates); ret = cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg); if (ret) goto deref; @@ -1781,6 +1809,8 @@ static int cm_rtu_handler(struct cm_work *work) if (cm_id_priv->id.state != IB_CM_REP_SENT && cm_id_priv->id.state != IB_CM_MRA_REP_RCVD) { spin_unlock_irq(&cm_id_priv->lock); + atomic_long_inc(&work->port->counters[CM_RTU_COUNTER]. + rcv_duplicates); goto out; } cm_id_priv->id.state = IB_CM_ESTABLISHED; @@ -1958,6 +1988,8 @@ static int cm_dreq_handler(struct cm_work *work) cm_id_priv = cm_acquire_id(dreq_msg->remote_comm_id, dreq_msg->local_comm_id); if (!cm_id_priv) { + atomic_long_inc(&work->port->counters[CM_DREQ_COUNTER]. + rcv_duplicates); cm_issue_drep(work->port, work->mad_recv_wc); return -EINVAL; } @@ -1977,6 +2009,8 @@ static int cm_dreq_handler(struct cm_work *work) case IB_CM_MRA_REP_RCVD: break; case IB_CM_TIMEWAIT: + atomic_long_inc(&work->port->counters[CM_DREQ_COUNTER]. + rcv_duplicates); if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg)) goto unlock; @@ -1988,6 +2022,10 @@ static int cm_dreq_handler(struct cm_work *work) if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); goto deref; + case IB_CM_DREQ_RCVD: + atomic_long_inc(&work->port->counters[CM_DREQ_COUNTER]. 
+ rcv_duplicates); + goto unlock; default: goto unlock; } @@ -2339,10 +2377,19 @@ static int cm_mra_handler(struct cm_work *work) if (cm_mra_get_msg_mraed(mra_msg) != CM_MSG_RESPONSE_OTHER || cm_id_priv->id.lap_state != IB_CM_LAP_SENT || ib_modify_mad(cm_id_priv->av.port->mad_agent, - cm_id_priv->msg, timeout)) + cm_id_priv->msg, timeout)) { + if (cm_id_priv->id.lap_state == IB_CM_MRA_LAP_RCVD) + atomic_long_inc(&work->port->counters + [CM_MRA_COUNTER].rcv_duplicates); goto out; + } cm_id_priv->id.lap_state = IB_CM_MRA_LAP_RCVD; break; + case IB_CM_MRA_REQ_RCVD: + case IB_CM_MRA_REP_RCVD: + atomic_long_inc(&work->port->counters[CM_MRA_COUNTER]. + rcv_duplicates); + /* fall through */ default: goto out; } @@ -2502,6 +2549,8 @@ static int cm_lap_handler(struct cm_work *work) case IB_CM_LAP_IDLE: break; case IB_CM_MRA_LAP_SENT: + atomic_long_inc(&work->port->counters[CM_LAP_COUNTER]. + rcv_duplicates); if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg)) goto unlock; @@ -2515,6 +2564,10 @@ static int cm_lap_handler(struct cm_work *work) if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); goto deref; + case IB_CM_LAP_RCVD: + atomic_long_inc(&work->port->counters[CM_LAP_COUNTER]. + rcv_duplicates); + goto unlock; default: goto unlock; } @@ -2796,6 +2849,8 @@ static int cm_sidr_req_handler(struct cm_work *work) cur_cm_id_priv = cm_insert_remote_sidr(cm_id_priv); if (cur_cm_id_priv) { spin_unlock_irq(&cm.lock); + atomic_long_inc(&work->port->counters[CM_SIDR_REQ_COUNTER]. + rcv_duplicates); goto out; /* Duplicate message. 
*/ } cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD; @@ -2990,6 +3045,25 @@ static void cm_send_handler(struct ib_mad_agent *mad_agent, struct ib_mad_send_wc *mad_send_wc) { struct ib_mad_send_buf *msg = mad_send_wc->send_buf; + struct cm_port *port; + u16 attr_index; + + port = mad_agent->context; + attr_index = be16_to_cpu(((struct ib_mad_hdr *) + msg->mad)->attr_id) - CM_ATTR_ID_OFFSET; + + /* + * If the send was in response to a received message (context[0] is not + * set to a cm_id), and is not a REJ, then it is a send that was + * manually retried. + */ + if (!msg->context[0] && (attr_index != CM_REJ_COUNTER)) + msg->retries = 1; + + atomic_long_add(1 + msg->retries, &port->counters[attr_index].xmit); + if (msg->retries) + atomic_long_add(msg->retries, + &port->counters[attr_index].xmit_retries); switch (mad_send_wc->status) { case IB_WC_SUCCESS: @@ -3148,8 +3222,10 @@ EXPORT_SYMBOL(ib_cm_notify); static void cm_recv_handler(struct ib_mad_agent *mad_agent, struct ib_mad_recv_wc *mad_recv_wc) { + struct cm_port *port = mad_agent->context; struct cm_work *work; enum ib_cm_event_type event; + u16 attr_id; int paths = 0; switch (mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) { @@ -3194,6 +3270,9 @@ static void cm_recv_handler(struct ib_mad_agent *mad_agent, return; } + attr_id = be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.attr_id); + atomic_long_inc(&port->counters[attr_id - CM_ATTR_ID_OFFSET].rcv); + work = kmalloc(sizeof *work + sizeof(struct ib_sa_path_rec) * paths, GFP_KERNEL); if (!work) { @@ -3204,7 +3283,7 @@ static void cm_recv_handler(struct ib_mad_agent *mad_agent, INIT_DELAYED_WORK(&work->work, cm_work_handler); work->cm_event.event = event; work->mad_recv_wc = mad_recv_wc; - work->port = (struct cm_port *)mad_agent->context; + work->port = port; queue_delayed_work(cm.wq, &work->work, 0); } @@ -3397,7 +3476,7 @@ static void cm_add_one(struct ib_device *device) if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; - cm_dev = 
kmalloc(sizeof(*cm_dev) + sizeof(*port) * + cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) return; From rdreier at cisco.com Thu Sep 13 10:57:27 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Sep 2007 10:57:27 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 Message-ID: With 2.6.24 probably opening in the not-too-distant future, it's probably a good time to review what my plans are for when the merge window opens. At the kernel summit, we discussed patch review (doing a web search for "kernel summit" "reviewed-by:" should turn up lots of info on this). Due to an unfortunate combination of vacation and conference travel, summer colds, and other inconveniences, I am very backed up on reviewing. And in any case, I've allowed too much code review to be dumped on me -- when there are dozens of people working on IB and RDMA stuff, it obviously doesn't work to expect me to do all the reviewing. Unfortunately, due to the length of the backlog and the fact that 2.6.23 seems fairly close, some of the things listed below are going to miss the 2.6.24 merge window. So, although the plan is to phase in requiring "Reviewed-by:" gently, for this merge, if you can get someone other than me to review your work, then the chances of it being merged increase dramatically. I'm talking about a real review-- ideally, someone independent (from another company would be good) who is willing to provide a "Reviewed-by:" line that means the reviewer has really looked at and thought about the patch. There should be a mailing list thread you can point me at where the reviewer comments on the patch and a new version of that patch addressing all comments is posted (or in exceptional cases, where the patch is perfect to start with, where the reviewer says the patch is great). 
For example, given the number of IPoIB changes pending, it might be a good idea for the people submitting them to get together and trade reviews (ie "If you review my patch, I'll review your patch"). There are a few cases where getting a review may not be necessary. First of all, trivial and obvious patches don't need a review. It's a judgement call what is trivial or obvious, and it's always a good idea to provide a changelog that makes it clear why a patch is trivial and obviously correct. Second, hardware driver patches may not make sense to anyone outside of the company whose hardware the driver is for. Still, in this case, an internal Reviewed-by: would be nice, and also a changelog that explains the reason for the change always helps (don't just tell me what your patch does, but also explain what the patch fixes and what the impact of the current situation is). Anyway, here are all the pending things that I'm aware of. As usual, if something isn't already in my tree and isn't listed below, I probably missed it or dropped it by mistake. Please remind me again in that case. Core: - My user_mad P_Key index support patch. I'll test the ioctl to change to the new mode and merge this I guess, since Hal and Sean have tested this out. - A fix to the user_mad 32-bit big-endian userspace 64/32 problem with the method_mask when registering agents. I'll write a patch to handle this in a way that doesn't change the ABI for anything other than the broken case and hope to get someone to review this so it can be merged. - Sean's QoS changes. These look fine at first glance, and I just plan to understand the backwards compatibility story (ie how this works with an old SM) and merge. Anyone who objects let me know. - Sean's IB CM MRA interface changes. Don't know at this point. It seems OK but I'm not clear on what if any real-world improvement this gives us. ULPs: - Pradeep's IPoIB CM support for devices that don't have SRQs. 
I think the basic approach makes sense (I don't think faking SRQs at some other layer is really feasible) and I need to find time to look at the details to see if the current patch looks workable. I'm likely to merge this; getting an independent Reviewed-by: would certainly be appreciated too. - Moni's IPoIB bonding support. This seems mostly an issue of getting the core bonding maintainer's attention. However getting a Reviewed-by: for the IPoIB changes wouldn't hurt too. - Rolf's IPoIB MGID scope changes. Certainly we want to fix this issue but the specific changes need review. - Eli and Michael's IPoIB stateless offload (checksum offload, LSO, LRO, etc). It's a big series that makes quite a few core changes. I think it needs some careful review and is probably at risk of missing this merge window. Sorting in order of invasiveness so we can merge at least some of it (if splitting it makes sense) might be a good idea. HW specific: - I already merged patches to enable MSI-X by default for mthca and mlx4. I hope there aren't too many systems that get hosed if a MSI-X interrupt is generated. - Jack and Michael's mlx4 FMR support. Will merge I guess, although I do hope to have time to address the DMA API abuse that is being copied from mthca, so that mlx4 and mthca work in Xen domU. - ehca patch queue. Will merge, pending fixes for the few minor issues I commented on. - Steve's mthca router mode support. Would be nice to see a review from someone at Mellanox. - Arthur's mthca doorbell alignment fixes. I will experiment with a few different approaches and post what I like (and fix mlx4 as well). I hope Arthur can review. - Michael's mlx4 WQE shrinking patch. Not sure yet; I'll reply to the latest patch directly. Here are a few topics that I believe will not be ready in time for the 2.6.24 window and will need to wait for 2.6.25: - Multiple CQ event vector support. 
I haven't seen any discussions about how ULPs or userspace apps should decide which vector to use, and hence no progress has been made since we deferred this during the 2.6.23 merge window. - XRC. Given the length of the backlog above and the fact that a first draft of this code has not been posted yet, I don't see any way that we could have something this major ready in time. Here is the complete list of patches I have in my for-2.6.24 branch waiting for the merge window so far. Mostly I haven't merged anything big out of my backlog, so this is essentially all Ali Ayoub (1): IB/sa: Error handling thinko fix Anton Blanchard (3): IB/fmr_pool: Clean up some error messages in fmr_pool.c IB/ehca: Make output clearer by removing some debug messages IB/ehca: Export module parameters in sysfs Dotan Barak (1): mlx4_core: Use enum value GO_BIT_TIMEOUT_MSECS Eli Cohen (2): IPoIB: Fix typo to end statement with ';' instead of ',' IPoIB: Fix error path memory leak Michael S. Tsirkin (2): mlx4_core: Enable MSI-X by default IB/mthca: Enable MSI-X by default Peter Oruba (1): IB/mthca: Use PCI-X/PCI-Express read control interfaces Roland Dreier (6): IPoIB: Make sure no receives are handled when stopping device IB: find_first_zero_bit() takes unsigned pointer mlx4_core: Don't free special QPs in QP number bitmap IB/mlx4: Use set_data_seg() in mlx4_ib_post_recv() IB/ehca: Include from ehca_classes.h IB/mlx4: Fix up SRQ limit_watermark endianness Steve Wise (1): RDMA/cxgb3: Make the iw_cxgb3 module parameters writable From rdreier at cisco.com Thu Sep 13 11:00:02 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Sep 2007 11:00:02 -0700 Subject: [ofa-general] [RFC 2/2] ib/cm: add basic performance counters In-Reply-To: <000201c7f62d$1c004750$65cc180a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 13 Sep 2007 10:40:00 -0700") References: <000001c7f62c$121c31a0$65cc180a@amr.corp.intel.com> <000201c7f62d$1c004750$65cc180a@amr.corp.intel.com> Message-ID: Am I missing 
something, or is there no way to actually read the counters? From swise at opengridcomputing.com Thu Sep 13 11:04:32 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 13 Sep 2007 13:04:32 -0500 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: Message-ID: <46E97BB0.9030106@opengridcomputing.com> Hey Roland, I was about to post v2 of my patch to avoid port space collisions with the native stack. Can we get that into 2.6.24? It is high priority IMO. I've tried to solicit review on it, but I think folks are reluctant... ;-) Steve. Roland Dreier wrote: > With 2.6.24 probably opening in the not-too-distant future, it's > probably a good time to review what my plans are for when the merge > window opens. > > At the kernel summit, we discussed patch review (doing a web search > for "kernel summit" "reviewed-by:" should turn up lots of info on > this). Due to an unfortunate combination of vacation and conference > travel, summer colds, and other inconveniences, I am very backed up on > reviewing. And in any case, I've allowed too much code review to be > dumped on me -- when there are dozens of people working on IB and RDMA > stuff, it obviously doesn't work to expect me to do all the reviewing. > > Unfortunately, due to the length of the backlog and the fact that > 2.6.23 seems fairly close, some of the things listed below are going > to miss the 2.6.24 merge window. So, although the plan is to phase in > requiring "Reviewed-by:" gently, for this merge, if you can get > someone other than me to review your work, then the chances of it > being merged increase dramatically. I'm talking about a real review-- > ideally, someone independent (from another company would be good) who > is willing to provide a "Reviewed-by:" line that means the reviewer > has really looked at and thought about the patch. 
There should be a
> mailing list thread you can point me at where the reviewer comments on
> the patch and a new version of that patch addressing all comments is
> posted (or in exceptional cases, where the patch is perfect to start
> with, where the reviewer says the patch is great).
>
> For example, given the number of IPoIB changes pending, it might be a
> good idea for the people submitting them to get together and trade
> reviews (ie "If you review my patch, I'll review your patch"). There
> are a few cases where getting a review may not be necessary. First of
> all, trivial and obvious patches don't need a review. It's a
> judgement call what is trivial or obvious, and it's always a good idea
> to provide a changelog that makes it clear why a patch is trivial and
> obviously correct. Second, hardware driver patches may not make sense
> to anyone outside of the company whose hardware the driver is for.
> Still, in this case, an internal Reviewed-by: would be nice, and also
> a changelog that explains the reason for the change always helps
> (don't just tell me what your patch does, but also explain what the
> patch fixes and what the impact of the current situation is).
>
> Anyway, here are all the pending things that I'm aware of. As usual,
> if something isn't already in my tree and isn't listed below, I
> probably missed it or dropped it by mistake. Please remind me again
> in that case.
>
> Core:
>
>  - My user_mad P_Key index support patch. I'll test the ioctl to
>    change to the new mode and merge this I guess, since Hal and Sean
>    have tested this out.
>
>  - A fix to the user_mad 32-bit big-endian userspace 64/32 problem
>    with the method_mask when registering agents. I'll write a patch
>    to handle this in a way that doesn't change the ABI for anything
>    other than the broken case and hope to get someone to review this
>    so it can be merged.
>
>  - Sean's QoS changes.
These look fine at first glance, and I just
>    plan to understand the backwards compatibility story (ie how this
>    works with an old SM) and merge. Anyone who objects let me know.
>
>  - Sean's IB CM MRA interface changes. Don't know at this point. It
>    seems OK but I'm not clear on what if any real-world improvement
>    this gives us.
>
> ULPs:
>
>  - Pradeep's IPoIB CM support for devices that don't have SRQs. I
>    think the basic approach makes sense (I don't think faking SRQs at
>    some other layer is really feasible) and I need to find time to
>    look at the details to see if the current patch looks workable. I'm
>    likely to merge this; getting an independent Reviewed-by: would
>    certainly be appreciated too.
>
>  - Moni's IPoIB bonding support. This seems mostly an issue of
>    getting the core bonding maintainer's attention. However getting a
>    Reviewed-by: for the IPoIB changes wouldn't hurt too.
>
>  - Rolf's IPoIB MGID scope changes. Certainly we want to fix this
>    issue but the specific changes need review.
>
>  - Eli and Michael's IPoIB stateless offload (checksum offload, LSO,
>    LRO, etc). It's a big series that makes quite a few core changes.
>    I think it needs some careful review and is probably at risk of
>    missing this merge window. Sorting in order of invasiveness so we
>    can merge at least some of it (if splitting it makes sense) might
>    be a good idea.
>
> HW specific:
>
>  - I already merged patches to enable MSI-X by default for mthca and
>    mlx4. I hope there aren't too many systems that get hosed if a
>    MSI-X interrupt is generated.
>
>  - Jack and Michael's mlx4 FMR support. Will merge I guess, although
>    I do hope to have time to address the DMA API abuse that is being
>    copied from mthca, so that mlx4 and mthca work in Xen domU.
>
>  - ehca patch queue. Will merge, pending fixes for the few minor
>    issues I commented on.
>
>  - Steve's mthca router mode support. Would be nice to see a review
>    from someone at Mellanox.
>
>  - Arthur's mthca doorbell alignment fixes. I will experiment with a
>    few different approaches and post what I like (and fix mlx4 as
>    well). I hope Arthur can review.
>
>  - Michael's mlx4 WQE shrinking patch. Not sure yet; I'll reply to
>    the latest patch directly.
>
> Here are a few topics that I believe will not be ready in time for the
> 2.6.24 window and will need to wait for 2.6.25:
>
>  - Multiple CQ event vector support. I haven't seen any discussions
>    about how ULPs or userspace apps should decide which vector to use,
>    and hence no progress has been made since we deferred this during
>    the 2.6.23 merge window.
>
>  - XRC. Given the length of the backlog above and the fact that a
>    first draft of this code has not been posted yet, I don't see any
>    way that we could have something this major ready in time.
>
> Here is the complete list of patches I have in my for-2.6.24 branch
> waiting for the merge window so far. Mostly I haven't merged anything
> big out of my backlog, so this is essentially all
>
> Ali Ayoub (1):
>       IB/sa: Error handling thinko fix
>
> Anton Blanchard (3):
>       IB/fmr_pool: Clean up some error messages in fmr_pool.c
>       IB/ehca: Make output clearer by removing some debug messages
>       IB/ehca: Export module parameters in sysfs
>
> Dotan Barak (1):
>       mlx4_core: Use enum value GO_BIT_TIMEOUT_MSECS
>
> Eli Cohen (2):
>       IPoIB: Fix typo to end statement with ';' instead of ','
>       IPoIB: Fix error path memory leak
>
> Michael S.
Tsirkin (2):
>       mlx4_core: Enable MSI-X by default
>       IB/mthca: Enable MSI-X by default
>
> Peter Oruba (1):
>       IB/mthca: Use PCI-X/PCI-Express read control interfaces
>
> Roland Dreier (6):
>       IPoIB: Make sure no receives are handled when stopping device
>       IB: find_first_zero_bit() takes unsigned pointer
>       mlx4_core: Don't free special QPs in QP number bitmap
>       IB/mlx4: Use set_data_seg() in mlx4_ib_post_recv()
>       IB/ehca: Include from ehca_classes.h
>       IB/mlx4: Fix up SRQ limit_watermark endianness
>
> Steve Wise (1):
>       RDMA/cxgb3: Make the iw_cxgb3 module parameters writable
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From sean.hefty at intel.com Thu Sep 13 11:06:45 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 13 Sep 2007 11:06:45 -0700
Subject: [ofa-general] [RFC 2/2] ib/cm: add basic performance counters
In-Reply-To:
References: <000001c7f62c$121c31a0$65cc180a@amr.corp.intel.com><000201c7f62d$1c004750$65cc180a@amr.corp.intel.com>
Message-ID: <000301c7f630$d8ac1d90$65cc180a@amr.corp.intel.com>

>Am I missing something, or is there no way to actually read the counters?

There's no way to read the counters yet. See the comment from RFC 0/2:

I still need to export the counters, but wanted to get feedback about the
counters that were selected, along with how they are being gathered. Any
ideas on the best approach there would be appreciated as well.

- Sean

From sean.hefty at intel.com Thu Sep 13 11:20:38 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 13 Sep 2007 11:20:38 -0700
Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To:
References:
Message-ID: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com>

>  - My user_mad P_Key index support patch.
I'll test the ioctl to
>    change to the new mode and merge this I guess, since Hal and Sean
>    have tested this out.

I can give this patch a reviewed-by: too, and I will also try to review a
couple of the pending ipoib patches.

>  - Sean's QoS changes. These look fine at first glance, and I just
>    plan to understand the backwards compatibility story (ie how this
>    works with an old SM) and merge. Anyone who objects let me know.

The new QoS fields fall into fields that are currently reserved, which
should be ignored by an older SM. I've only tested this against openSM
however.

>  - Sean's IB CM MRA interface changes. Don't know at this point. It
>    seems OK but I'm not clear on what if any real-world improvement
>    this gives us.

This patch was generated in response to an Intel MPI issue. We've seen MPI
take several minutes to respond to a connection request during the middle
of large application runs. When this happens, the active side times out
the connection. In OFED, we added module parameters to adjust the rdma_cm
connection timeout on the active side, but I believe that sending an MRA
from the passive side is a better solution.

- Sean

From xma at us.ibm.com Thu Sep 13 11:22:16 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 13 Sep 2007 11:22:16 -0700
Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To:
Message-ID:

Hello Roland,

Since ehca can support 4K MTU, we would like to see a patch in IPoIB that
allows the link MTU to be up to 4K, instead of the current 2K, for the
2.6.24 kernel. The idea is that the IPoIB link MTU will pick up the value
returned from the SM's default broadcast MTU. This should be a small
patch; I hope you are OK with this.
Thanks
Shirley

From jeff at garzik.org Thu Sep 13 11:56:32 2007
From: jeff at garzik.org (Jeff Garzik)
Date: Thu, 13 Sep 2007 14:56:32 -0400
Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To: <46E97BB0.9030106@opengridcomputing.com>
References: <46E97BB0.9030106@opengridcomputing.com>
Message-ID: <46E987E0.2010605@garzik.org>

Steve Wise wrote:
> I was about to post v2 of my patch to avoid port space collisions with
> the native stack. Can we get that into 2.6.24? It is high priority IMO.
> I've tried to solicit review on it, but I think folks are reluctant... ;-)

Well, if it involves /sharing/ port space with the native stack, i.e.
where port 1234 is IB but 1235 is Linux, pretty much all the networking
devs have NAK'd that approach AFAICS.

Jeff

From swise at opengridcomputing.com Thu Sep 13 11:59:21 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 13 Sep 2007 13:59:21 -0500
Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To: <46E987E0.2010605@garzik.org>
References: <46E97BB0.9030106@opengridcomputing.com> <46E987E0.2010605@garzik.org>
Message-ID: <46E98889.1080706@opengridcomputing.com>

Jeff Garzik wrote:
> Steve Wise wrote:
>> I was about to post v2 of my patch to avoid port space collisions with
>> the native stack. Can we get that into 2.6.24? It is high priority IMO.
>> I've tried to solicit review on it, but I think folks are reluctant...
>> ;-)
>
> Well, if it involves /sharing/ port space with the native stack, i.e.
> where port 1234 is IB but 1235 is Linux, pretty much all the networking
> devs have NAK'd that approach AFAICS.
>

Jeff, I posted a fix that doesn't do this. No port sharing. The iwarp
device will use its own ip address and subnet to avoid collisions. You
should review the patch when I post v2.

Thanks,

Steve.
From swise at opengridcomputing.com Thu Sep 13 12:07:06 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 13 Sep 2007 14:07:06 -0500
Subject: [ofa-general] [GIT PULL ofed-1.3] cxgb3 bug fixes
In-Reply-To: <46E94B36.70406@opengridcomputing.com>
References: <46E94B36.70406@opengridcomputing.com>
Message-ID: <46E98A5A.1000507@opengridcomputing.com>

For ofed-1.3, please pull from:

git://git.openfabrics.org/~swise/ofed-1.3 ofed_kernel

The 1.3 patch series is identical to the ofed_1_2_c series except that
the first patch, 0029-*, isn't needed since it's already in ofed-1.3 from
2.6.23.

Thanks,

Steve.

Steve Wise wrote:
> Vlad (Michael/Tziporet in Vlad's absence),
>
> Please integrate the following cxgb3 bug fixes into ofed-1.2.5. All of
> these patches are either in 2.6.23 or merged into Jeff Garzik's upstream
> branch of netdev-2.6 and will go into 2.6.24. Chelsio recommends we
> update ofed-1.2.5 and ofed-1.3 with all of these fixes.
>
> I'll send another email with the ofed-1.3 changes as they will be
> slightly different.
>
> Please pull the ofed_1_2_c changes from:
>
> git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c
>
> The patch files added to kernel_patches/fixes include:
>
>> swise at dell3:~/git/ofed-1.2.5> stg series
>> + 0029-cxgb3-engine-microcode-load
>> + 0030-cxgb3-MAC-workaround-update
>> + 0031-cxgb3-Update-rx-coalescing-length
>> + 0032-cxgb3-SGE-doorbell-overflow-warning
>> + 0033-cxgb3-use-immediate-data-for-offload-Tx
>> + 0034-cxgb3-Expose-HW-memory-page-info
>> + 0035-cxgb3-tighten-checks-on-TID-values
>> + 0036-cxgb3-Fatal-error-update
>> + 0037-cxgb3-log-adapter-serial-number
>> + 0038-cxgb3-Update-internal-memory-management
>> + 0039-cxgb3-update-firmware-version
>> + 0040-cxgb3-log-and-clear-PEX-errors
>> + 0041-cxgb3-remove-false-positive-in-xgmac-workaround
>> + 0042-cxgb3-Set-the-CQ_ERR-bit-in-CQ-contexts
>> + 0043-cxgb3-CQ-context-operations-time-out-too-soon
>> + 0044-cxgb3-Add-T3C-rev
>> + 0045-cxgb3-Update-engine-microcode-version
>> > 0046-cxgb3-driver-version
>
> Steve.
>

From swise at opengridcomputing.com Thu Sep 13 12:16:17 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 13 Sep 2007 14:16:17 -0500
Subject: [ofa-general] [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
Message-ID: <20070913191617.30937.95960.stgit@dell3.ogc.int>

iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

Version 2:

- added a per-device mutex for the address and listening endpoints lists.

- wait for all replies if sending multiple passive_open requests to rnic.

- log warning if no addresses are available when a listen is issued.

- tested

---

Design:

The sysadmin creates "for iwarp use only" alias interfaces of the form
"devname:iw*" where devname is the native interface name (eg eth0) for the
iwarp netdev device. The alias label can be anything starting with "iw".
The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver.
EG:
	ifconfig eth0 192.168.70.123 up
	ifconfig eth0:iw1 192.168.71.123 up
	ifconfig eth0:iw2 192.168.72.123 up

In the above example, 192.168.70/24 is for TCP traffic, while
192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use.

The rdma-only interface must be on its own IP subnet. This allows routing
all rdma traffic onto this interface.

The iWARP driver must translate all listens on address 0.0.0.0 to the
set of rdma-only ip addresses for the device in question. This prevents
incoming connect requests to the TCP ip addresses from going up the
rdma stack.

Implementation Details:

- The iw_cxgb3 driver registers for inetaddr events via
register_inetaddr_notifier(). This allows tracking the iwarp-only
addresses/subnets as they get added and deleted. The iwarp driver
maintains a list of the current iwarp-only addresses.

- The iw_cxgb3 driver builds the list of iwarp-only addresses for its
devices at module insert time. This is needed because the inetaddr
notifier callbacks don't "replay" address-add events when someone
registers. So the driver must build the initial list at module load time.

- When a listen is done on address 0.0.0.0, then the iw_cxgb3 driver must
translate that into a set of listens on the iwarp-only addresses. This is
implemented by maintaining a list of stid/addr entries per listening
endpoint.

- When a new iwarp-only address is added or removed, the iw_cxgb3 driver
must traverse the set of listening endpoints and update them accordingly.
This allows an application to bind to 0.0.0.0 prior to the iwarp-only
interfaces being configured. It also allows changing the iwarp-only set
of addresses and getting the expected behavior for apps already bound to
0.0.0.0. This is done by maintaining a list of listening endpoints off
the device struct.

- The address list, the listening endpoint list, and each list of
stid/addrs in use per listening endpoint are all protected via a mutex
per iw_cxgb3 device.
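
[Editorial illustration, not part of the posted patch.] The label test and
the 0.0.0.0 translation described above can be sketched in plain userspace
C. This is only a sketch under stated assumptions: struct alias and
expand_listen() are hypothetical names invented for the illustration, the
addresses are kept in host byte order for readability, and the real driver
works on struct in_ifaddr entries under the per-device mutex. Only
is_iwarp_label() mirrors a helper that appears in the diff.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-in for an interface alias; not a kernel type. */
struct alias {
	const char *label;	/* e.g. "eth0:iw1" */
	uint32_t addr;		/* IPv4 address, host byte order for readability */
};

/* Same test as is_iwarp_label() in the patch: "iw" right after the ':'. */
static int is_iwarp_label(const char *label)
{
	const char *colon = strchr(label, ':');

	return colon && strncmp(colon + 1, "iw", 2) == 0;
}

/*
 * Expand a listen on bind_addr into the per-address listens it implies.
 * bind_addr == 0 (0.0.0.0) matches every iwarp-only alias; a specific
 * address matches only an iwarp-only alias carrying that address.
 * Writes at most max addresses to out and returns how many were written.
 */
static size_t expand_listen(const struct alias *aliases, size_t n,
			    uint32_t bind_addr, uint32_t *out, size_t max)
{
	size_t i, count = 0;

	for (i = 0; i < n && count < max; i++) {
		if (!is_iwarp_label(aliases[i].label))
			continue;	/* native TCP address: never listened on */
		if (bind_addr == 0 || bind_addr == aliases[i].addr)
			out[count++] = aliases[i].addr;
	}
	return count;
}
```

With the three interfaces from the example, a wildcard listen expands to
the two eth0:iw* addresses, and a listen on 192.168.70.123 expands to
nothing. In the posted patch the latter case is reported as -EADDRNOTAVAIL,
while a wildcard listen with no iwarp-only addresses only logs a warning.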
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch.c | 125 ++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch.h | 11 + drivers/infiniband/hw/cxgb3/iwch_cm.c | 259 +++++++++++++++++++++++++++------ drivers/infiniband/hw/cxgb3/iwch_cm.h | 15 ++ 4 files changed, 360 insertions(+), 50 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c index 0315c9d..296fb66 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.c +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -63,6 +63,123 @@ struct cxgb3_client t3c_client = { static LIST_HEAD(dev_list); static DEFINE_MUTEX(dev_mutex); +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) +{ + struct iwch_addrlist *addr; + + addr = kmalloc(sizeof *addr, GFP_KERNEL); + if (!addr) { + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", + __FUNCTION__); + return; + } + addr->ifa = ifa; + mutex_lock(&rnicp->mutex); + list_add_tail(&addr->entry, &rnicp->addrlist); + mutex_unlock(&rnicp->mutex); +} + +static void remove_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) +{ + struct iwch_addrlist *addr, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { + if (addr->ifa == ifa) { + list_del_init(&addr->entry); + kfree(addr); + goto out; + } + } +out: + mutex_unlock(&rnicp->mutex); +} + +static int netdev_is_ours(struct iwch_dev *rnicp, struct net_device *netdev) +{ + int i; + + for (i = 0; i < rnicp->rdev.port_info.nports; i++) + if (netdev == rnicp->rdev.port_info.lldevs[i]) + return 1; + return 0; +} + +static inline int is_iwarp_label(char *label) +{ + char *colon; + + colon = strchr(label, ':'); + if (colon && !strncmp(colon+1, "iw", 2)) + return 1; + return 0; +} + +static int nb_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct in_ifaddr *ifa = ctx; + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); + + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); + + switch 
(event) { + case NETDEV_UP: + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && + is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x added\n", + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); + insert_ifa(rnicp, ifa); + iwch_listeners_add_addr(rnicp, ifa->ifa_address); + } + break; + case NETDEV_DOWN: + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && + is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x deleted\n", + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); + iwch_listeners_del_addr(rnicp, ifa->ifa_address); + remove_ifa(rnicp, ifa); + } + break; + default: + break; + } + return 0; +} + +static void delete_addrlist(struct iwch_dev *rnicp) +{ + struct iwch_addrlist *addr, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { + list_del_init(&addr->entry); + kfree(addr); + } + mutex_unlock(&rnicp->mutex); +} + +static void populate_addrlist(struct iwch_dev *rnicp) +{ + int i; + struct in_device *indev; + + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); + if (!indev) + continue; + for_ifa(indev) + if (is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x added\n", + __FUNCTION__, ifa->ifa_label, + ifa->ifa_address); + insert_ifa(rnicp, ifa); + } + endfor_ifa(indev); + } +} + static void rnic_init(struct iwch_dev *rnicp) { PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r idr_init(&rnicp->qpidr); idr_init(&rnicp->mmidr); spin_lock_init(&rnicp->lock); + INIT_LIST_HEAD(&rnicp->addrlist); + INIT_LIST_HEAD(&rnicp->listen_eps); + mutex_init(&rnicp->mutex); + rnicp->nb.notifier_call = nb_callback; + populate_addrlist(rnicp); + register_inetaddr_notifier(&rnicp->nb); rnicp->attr.vendor_id = 0x168; rnicp->attr.vendor_part_id = 7; @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev mutex_lock(&dev_mutex); list_for_each_entry_safe(dev, tmp, &dev_list, entry) { if 
(dev->rdev.t3cdev_p == tdev) { + unregister_inetaddr_notifier(&dev->nb); + delete_addrlist(dev); list_del(&dev->entry); iwch_unregister_device(dev); cxio_rdev_close(&dev->rdev); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h index caf4e60..7fa0a47 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.h +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -36,6 +36,8 @@ #include #include #include #include +#include +#include #include @@ -101,6 +103,11 @@ struct iwch_rnic_attributes { u32 cq_overflow_detection; }; +struct iwch_addrlist { + struct list_head entry; + struct in_ifaddr *ifa; +}; + struct iwch_dev { struct ib_device ibdev; struct cxio_rdev rdev; @@ -111,6 +118,10 @@ struct iwch_dev { struct idr mmidr; spinlock_t lock; struct list_head entry; + struct notifier_block nb; + struct list_head addrlist; + struct list_head listen_eps; + struct mutex mutex; }; static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 1cdfcd4..954069f 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1127,23 +1127,149 @@ static int act_open_rpl(struct t3cdev *t return CPL_RET_BUF_DONE; } -static int listen_start(struct iwch_listen_ep *ep) +static int wait_for_reply(struct iwch_ep_common *epc) +{ + PDBG("%s ep %p waiting\n", __FUNCTION__, epc); + wait_event(epc->waitq, epc->rpl_done); + PDBG("%s ep %p done waiting err %d\n", __FUNCTION__, epc, epc->rpl_err); + return epc->rpl_err; +} + +static struct iwch_listen_entry *alloc_listener(struct iwch_listen_ep *ep, + __be32 addr) +{ + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + struct iwch_listen_entry *le; + + le = kmalloc(sizeof *le, GFP_KERNEL); + if (!le) { + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", + __FUNCTION__); + return NULL; + } + le->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, + &t3c_client, ep); + if (le->stid == -1) { + 
printk(KERN_ERR MOD "%s - cannot alloc stid.\n", + __FUNCTION__); + kfree(le); + return NULL; + } + le->addr = addr; + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); + return le; +} + +static void dealloc_listener(struct iwch_listen_ep *ep, + struct iwch_listen_entry *le) +{ + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); + cxgb3_free_stid(ep->com.tdev, le->stid); + kfree(le); +} + +static void dealloc_listener_list(struct iwch_listen_ep *ep) +{ + struct iwch_listen_entry *le, *tmp; + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + + mutex_lock(&h->mutex); + list_for_each_entry_safe(le, tmp, &ep->listeners, entry) { + list_del_init(&le->entry); + dealloc_listener(ep, le); + } + mutex_unlock(&h->mutex); +} + +static int alloc_listener_list(struct iwch_listen_ep *ep) +{ + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + struct iwch_addrlist *addr; + struct iwch_listen_entry *le; + int err = 0; + int added=0; + mutex_lock(&h->mutex); + list_for_each_entry(addr, &h->addrlist, entry) { + if (ep->com.local_addr.sin_addr.s_addr == 0 || + ep->com.local_addr.sin_addr.s_addr == + addr->ifa->ifa_address) { + le = alloc_listener(ep, addr->ifa->ifa_address); + if (!le) + break; + list_add_tail(&le->entry, &ep->listeners); + added++; + } + } + mutex_unlock(&h->mutex); + if (ep->com.local_addr.sin_addr.s_addr != 0 && !added) + err = -EADDRNOTAVAIL; + if (!err && !added) + printk(KERN_WARNING MOD + "No RDMA interface found for device %s\n", + pci_name(h->rdev.rnic_info.pdev)); + return err; +} + +static int listen_stop_one(struct iwch_listen_ep *ep, unsigned int stid) { struct sk_buff *skb; - struct cpl_pass_open_req *req; + struct cpl_close_listserv_req *req; + + PDBG("%s stid %u\n", __FUNCTION__, stid); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", 
__FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->cpu_idx = 0; + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, stid)); + skb->priority = 1; + ep->com.rpl_err = 0; + ep->com.rpl_done = 0; + cxgb3_ofld_send(ep->com.tdev, skb); + return wait_for_reply(&ep->com); +} + +static int listen_stop(struct iwch_listen_ep *ep) +{ + struct iwch_listen_entry *le; + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + int err = 0; PDBG("%s ep %p\n", __FUNCTION__, ep); + mutex_lock(&h->mutex); + list_for_each_entry(le, &ep->listeners, entry) { + err = listen_stop_one(ep, le->stid); + if (err) + break; + } + mutex_unlock(&h->mutex); + return err; +} + +static int listen_start_one(struct iwch_listen_ep *ep, unsigned int stid, + __be32 addr, __be16 port) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, stid, ntohl(addr), + ntohs(port)); skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); if (!skb) { - printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); return -ENOMEM; } req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); - req->local_port = ep->com.local_addr.sin_port; - req->local_ip = ep->com.local_addr.sin_addr.s_addr; + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, stid)); + req->local_port = port; + req->local_ip = addr; req->peer_port = 0; req->peer_ip = 0; req->peer_netmask = 0; @@ -1152,8 +1278,32 @@ static int listen_start(struct iwch_list req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); skb->priority = 1; + ep->com.rpl_err = 0; + ep->com.rpl_done = 0; cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return wait_for_reply(&ep->com); +} + +static int 
listen_start(struct iwch_listen_ep *ep) +{ + struct iwch_listen_entry *le; + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + int err = 0; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + mutex_lock(&h->mutex); + list_for_each_entry(le, &ep->listeners, entry) { + err = listen_start_one(ep, le->stid, le->addr, + ep->com.local_addr.sin_port); + if (err) + goto fail; + } + mutex_unlock(&h->mutex); + return err; +fail: + mutex_unlock(&h->mutex); + listen_stop(ep); + return err; } static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) @@ -1170,39 +1320,59 @@ static int pass_open_rpl(struct t3cdev * return CPL_RET_BUF_DONE; } -static int listen_stop(struct iwch_listen_ep *ep) -{ - struct sk_buff *skb; - struct cpl_close_listserv_req *req; - - PDBG("%s ep %p\n", __FUNCTION__, ep); - skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); - if (!skb) { - printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); - return -ENOMEM; - } - req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); - req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); - req->cpu_idx = 0; - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); - skb->priority = 1; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; -} - static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) { struct iwch_listen_ep *ep = ctx; struct cpl_close_listserv_rpl *rpl = cplhdr(skb); - PDBG("%s ep %p\n", __FUNCTION__, ep); + PDBG("%s ep %p stid %u\n", __FUNCTION__, ep, GET_TID(rpl)); + ep->com.rpl_err = status2errno(rpl->status); ep->com.rpl_done = 1; wake_up(&ep->com.waitq); return CPL_RET_BUF_DONE; } +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr) +{ + struct iwch_listen_ep *listen_ep; + struct iwch_listen_entry *le; + + mutex_lock(&rnicp->mutex); + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { + if (listen_ep->com.local_addr.sin_addr.s_addr) + continue; + le = alloc_listener(listen_ep, addr); + if (le) { + 
list_add_tail(&le->entry, &listen_ep->listeners); + listen_start_one(listen_ep, le->stid, addr, + listen_ep->com.local_addr.sin_port); + } + } + mutex_unlock(&rnicp->mutex); +} + +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr) +{ + struct iwch_listen_ep *listen_ep; + struct iwch_listen_entry *le, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { + if (listen_ep->com.local_addr.sin_addr.s_addr) + continue; + list_for_each_entry_safe(le, tmp, &listen_ep->listeners, + entry) + if (le->addr == addr) { + listen_stop_one(listen_ep, le->stid); + list_del_init(&le->entry); + dealloc_listener(listen_ep, le); + } + } + mutex_unlock(&rnicp->mutex); +} + static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) { struct cpl_pass_accept_rpl *rpl; @@ -1767,8 +1937,7 @@ int iwch_accept_cr(struct iw_cm_id *cm_i goto err; /* wait for wr_ack */ - wait_event(ep->com.waitq, ep->com.rpl_done); - err = ep->com.rpl_err; + err = wait_for_reply(&ep->com); if (err) goto err; @@ -1887,31 +2056,23 @@ int iwch_create_listen(struct iw_cm_id * ep->com.cm_id = cm_id; ep->backlog = backlog; ep->com.local_addr = cm_id->local_addr; + INIT_LIST_HEAD(&ep->listeners); - /* - * Allocate a server TID. 
- */ - ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); - if (ep->stid == -1) { - printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); - err = -ENOMEM; + err = alloc_listener_list(ep); + if (err) goto fail2; - } state_set(&ep->com, LISTEN); err = listen_start(ep); - if (err) - goto fail3; - /* wait for pass_open_rpl */ - wait_event(ep->com.waitq, ep->com.rpl_done); - err = ep->com.rpl_err; if (!err) { cm_id->provider_data = ep; + mutex_lock(&h->mutex); + list_add_tail(&ep->entry, &h->listen_eps); + mutex_unlock(&h->mutex); goto out; } -fail3: - cxgb3_free_stid(ep->com.tdev, ep->stid); + dealloc_listener_list(ep); fail2: cm_id->rem_ref(cm_id); put_ep(&ep->com); @@ -1923,18 +2084,20 @@ out: int iwch_destroy_listen(struct iw_cm_id *cm_id) { int err; + struct iwch_dev *h = to_iwch_dev(cm_id->device); struct iwch_listen_ep *ep = to_listen_ep(cm_id); PDBG("%s ep %p\n", __FUNCTION__, ep); might_sleep(); + mutex_lock(&h->mutex); + list_del_init(&ep->entry); + mutex_unlock(&h->mutex); state_set(&ep->com, DEAD); ep->com.rpl_done = 0; ep->com.rpl_err = 0; err = listen_stop(ep); - wait_event(ep->com.waitq, ep->com.rpl_done); - cxgb3_free_stid(ep->com.tdev, ep->stid); - err = ep->com.rpl_err; + dealloc_listener_list(ep); cm_id->rem_ref(cm_id); put_ep(&ep->com); return err; diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h index 6107e7c..23e5a22 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -162,10 +162,19 @@ struct iwch_ep_common { int rpl_err; }; -struct iwch_listen_ep { - struct iwch_ep_common com; +struct iwch_listen_entry { + struct list_head entry; unsigned int stid; + __be32 addr; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; /* Must be first entry! 
*/ + struct list_head entry; + struct list_head listeners; int backlog; + int listen_count; + int listen_rpls; }; struct iwch_ep { @@ -222,6 +231,8 @@ int iwch_resume_tid(struct iwch_ep *ep); void __free_ep(struct kref *kref); void iwch_rearp(struct iwch_ep *ep); int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, struct l2t_entry *l2t); +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr); +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr); int __init iwch_cm_init(void); void __exit iwch_cm_term(void);

From mshefty at ichips.intel.com Thu Sep 13 12:54:46 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 13 Sep 2007 12:54:46 -0700
Subject: [ofa-general] [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
In-Reply-To: <20070913191617.30937.95960.stgit@dell3.ogc.int>
References: <20070913191617.30937.95960.stgit@dell3.ogc.int>
Message-ID: <46E99586.90905@ichips.intel.com>

> The iWARP driver must translate all listens on address 0.0.0.0 to the
> set of rdma-only ip addresses for the device in question. This prevents
> incoming connect requests to the TCP ipaddresses from going up the
> rdma stack.

I've only given this a high level review at this point, and while the
patch looks okay on first pass, is there a way to move some of this
functionality to either the rdma_cm or iw_cm? I don't like the idea of
every iwarp driver having to implement address/listen list maintenance.
I may have some ideas after re-examining it.

> Implementation Details:

There are a couple of areas that I made a note to look at in more detail
(because I didn't understand everything that was happening), but I did
have one minor nit - most uses of list_del_init can just be list_del.
- Sean From jeff at garzik.org Thu Sep 13 12:55:02 2007 From: jeff at garzik.org (Jeff Garzik) Date: Thu, 13 Sep 2007 15:55:02 -0400 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <46E98889.1080706@opengridcomputing.com> References: <46E97BB0.9030106@opengridcomputing.com> <46E987E0.2010605@garzik.org> <46E98889.1080706@opengridcomputing.com> Message-ID: <46E99596.8000904@garzik.org> Steve Wise wrote: > Jeff Garzik wrote: >> Steve Wise wrote: >>> I was about to post v2 of my patch to avoid port space collisions >>> with the native stack. Can we get that 2.6.24? It is high priority >>> IMO. I've tried to solicit review on it, but I think folks are >>> reluctant... ;-) >> Well, if it involves /sharing/ port space with the native stack, i.e. >> where port 1234 is IB but 1235 is Linux, pretty much all the >> networking devs have NAK'd that approach AFAICS. > Jeff, I posted a fix that doesn't do this. No port sharing. The iwarp > device will use its own ip address and subnet to avoid collisions. You > should review the patch when I post v2. Sounds promising, then! :) Jeff From sean.hefty at intel.com Thu Sep 13 13:09:25 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 13 Sep 2007 13:09:25 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: Message-ID: <000001c7f641$fbbcccb0$55cc180a@amr.corp.intel.com> > - Pradeep's IPoIB CM support for devices that don't have SRQs. I > think the basic approach makes sense (I don't think faking SRQs at > some other layer is really feasible) and I need to find time to > look at the details to see if the current patch looks workable. I'm > likely to merge this; getting an independent Reviewed-by: would > certainly be appreciated too. Are the latest patches available anywhere (git tree or other)? If not, Pradeep, can you confirm if these link to the latest? 
patch 0 - description: http://lists.openfabrics.org/pipermail/general/2007-August/039707.html patch 1 - reworked: http://lists.openfabrics.org/pipermail/general/2007-August/039884.html patch 2 http://lists.openfabrics.org/pipermail/general/2007-August/039710.html - Sean From pradeeps at linux.vnet.ibm.com Thu Sep 13 13:58:41 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 13 Sep 2007 13:58:41 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <000001c7f641$fbbcccb0$55cc180a@amr.corp.intel.com> References: <000001c7f641$fbbcccb0$55cc180a@amr.corp.intel.com> Message-ID: <46E9A481.2090301@linux.vnet.ibm.com> Sean Hefty wrote: >> - Pradeep's IPoIB CM support for devices that don't have SRQs. I >> think the basic approach makes sense (I don't think faking SRQs at >> some other layer is really feasible) and I need to find time to >> look at the details to see if the current patch looks workable. I'm >> likely to merge this; getting an independent Reviewed-by: would >> certainly be appreciated too. > > Are the latest patches available anywhere (git tree or other)? If not, Pradeep, > can you confirm if these link to the latest? > > patch 0 - description: > http://lists.openfabrics.org/pipermail/general/2007-August/039707.html > patch 1 - reworked: > http://lists.openfabrics.org/pipermail/general/2007-August/039884.html > patch 2 > http://lists.openfabrics.org/pipermail/general/2007-August/039710.html > > - Sean > Sean, Thanks for volunteering to review these! Patch 2 is the one that you mentioned. However, I incorporated a few minor changes (Roland's comments about additional spaces and module parameter being readable only. Plus some minor event handler changes) and resubmitted on Aug 20th. 
http://lists.openfabrics.org/pipermail/general/2007-August/039884.html Pradeep From rdreier at cisco.com Thu Sep 13 14:00:43 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Sep 2007 14:00:43 -0700 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: (Shirley Ma's message of "Thu, 13 Sep 2007 11:22:16 -0700") References: Message-ID: > Since ehca can support 4K MTU, we would like to see a patch in > IPoIB to allow link MTU to be up to 4K instead of current 2K for 2.6.24 > kernel. The idea is IPoIB link MTU will pick up a return value from SM's > default broadcast MTU. This patch should be a small patch, I hope you are > OK with this. It's actually not small, since it turns the skb allocation into a 4100-byte buffer, which ends up being more than 1 page usually, which means it fails if memory is fragmented. Anyway given the backlog anything substantial that hasn't been posted already is almost surely going to have to wait until 2.6.25. From rdreier at cisco.com Thu Sep 13 14:02:57 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Sep 2007 14:02:57 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 13 Sep 2007 11:20:38 -0700") References: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com> Message-ID: > > - My user_mad P_Key index support patch. I'll test the ioctl to > > change to the new mode and merge this I guess, since Hal and Sean > > have tested this out. > > I can give this patch a reviewed-by: too, and I will also try to review a couple > of the pending ipoib patches. Thanks! > > - Sean's QoS changes. These look fine at first glance, and I just > > plan to understand the backwards compatibility story (ie how this > > works with an old SM) and merge. Anyone who objects let me know. > > The new QoS fields fall into fields that are currently reserved, which should be > ignored by an older SM. 
I've only tested this against openSM however. That seems OK -- I'm OK with breaking things if an SM is clearly buggy (and not ignoring fields that are defined to be ignored in the spec would certainly be a clear bug to me). > This patch was generated in response to an Intel MPI issue. We've seen MPI take > several minutes to respond to a connection request during the middle of large > application runs. When this happens, the active side times out the connection. > In OFED, we added module parameters to adjust the rdma_cm connection timeout on > the active side, but I believe that sending an MRA from the passive side is a > better solution. OK -- just to make sure I'm understanding what you're saying: have you confirmed that your proposed patches actually fix the issue? - R. From rdreier at cisco.com Thu Sep 13 14:11:29 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Sep 2007 14:11:29 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <46E97BB0.9030106@opengridcomputing.com> (Steve Wise's message of "Thu, 13 Sep 2007 13:04:32 -0500") References: <46E97BB0.9030106@opengridcomputing.com> Message-ID: > I was about to post v2 of my patch to avoid port space collisions with > the native stack. Can we get that 2.6.24? It is high priority > IMO. I've tried to solicit review on it, but I think folks are > reluctant... ;-) I would like to get this in, but I'm still at least a little reluctant, since we would be committing to a user interface that seems a little awkward at best, so I'd like to try and find something better. Just to summarize my understanding: - your patch requires the administration to configure an ethX:iwY alias address to use iwarp. (By the way is there anything other than "don't do that" that avoids assigning the same address to the iwarp alias and a non-iwarp interface?) - it would be nicer to create the alias automatically, but an alias without an address doesn't make sense. 
Creating a whole separate net device causes problems because the iwarp stuff still needs to use the main net device to do ARP etc. - so I'm out of better ideas but I still want to push back a little before we commit to something ugly. I've been meaning to track down the bnx2 iscsi offload patch to look and see if this issue is addressed, since the same problem seems to exist: it seems an iscsi connection and a main stack tcp connection might share the same 4-tuple unless something is done to avoid that happening. Also, I think it behooves us to get some agreement on this approach with NetEffect and Kanoj (NetXen?) at least, since their iwarp drivers seem to be imminent. - R. From rdreier at cisco.com Thu Sep 13 14:12:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Sep 2007 14:12:35 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <46E987E0.2010605@garzik.org> (Jeff Garzik's message of "Thu, 13 Sep 2007 14:56:32 -0400") References: <46E97BB0.9030106@opengridcomputing.com> <46E987E0.2010605@garzik.org> Message-ID: > Well, if it involves /sharing/ port space with the native stack, > i.e. where port 1234 is IB but 1235 is Linux, pretty much all the > networking devs have NAK'd that approach AFAICS. Just to be clear, InfiniBand has no problem; the issue is port collisions involving iWARP connections. - R. 
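To make the 4-tuple concern in this thread concrete: a TCP connection is identified by local address, local port, remote address and remote port, and an offloaded connection can only collide with a native-stack connection when all four match. The sketch below is illustrative only; struct four_tuple, four_tuple_equal() and the IPV4() macro are hypothetical names, not code from iw_cxgb3, the iw_cm or the rdma_cm.

```c
#include <assert.h>
#include <stdint.h>

/* Build an IPv4 address in host byte order (hypothetical helper). */
#define IPV4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical connection identifier; not a structure from any driver. */
struct four_tuple {
	uint32_t laddr;	/* local IPv4 address */
	uint32_t raddr;	/* remote IPv4 address */
	uint16_t lport;	/* local TCP port */
	uint16_t rport;	/* remote TCP port */
};

/* Two connections collide only if every field of the tuple matches. */
static int four_tuple_equal(const struct four_tuple *a,
			    const struct four_tuple *b)
{
	return a->laddr == b->laddr && a->raddr == b->raddr &&
	       a->lport == b->lport && a->rport == b->rport;
}
```

Under the scheme described in the thread, where the iWARP device uses its own IP address and subnet, laddr differs between the two stacks, so the tuples cannot be equal even when both stacks pick the same port.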
From rdreier at cisco.com Thu Sep 13 14:14:17 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 13 Sep 2007 14:14:17 -0700 Subject: [ofa-general] [RFC 2/2] ib/cm: add basic performance counters In-Reply-To: <000301c7f630$d8ac1d90$65cc180a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 13 Sep 2007 11:06:45 -0700") References: <000001c7f62c$121c31a0$65cc180a@amr.corp.intel.com> <000201c7f62d$1c004750$65cc180a@amr.corp.intel.com> <000301c7f630$d8ac1d90$65cc180a@amr.corp.intel.com> Message-ID: > I still need to export the counters, but wanted to get feedback > about the counters that were selected, along with how they are > being gathered. Sorry, I'll learn to read one of these days. > Any ideas on the best approach there would be appreciated as well. My first reaction would be to stick them somewhere in debugfs. (I'm assuming this feature is for diagnostics etc) - R. From mchan at broadcom.com Thu Sep 13 15:59:18 2007 From: mchan at broadcom.com (Michael Chan) Date: Thu, 13 Sep 2007 15:59:18 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: <46E97BB0.9030106@opengridcomputing.com> Message-ID: <1189724358.9540.113.camel@dell> On Thu, 2007-09-13 at 14:11 -0700, Roland Dreier wrote: > > I've been meaning to track down the bnx2 iscsi offload patch to look > and see if this issue is addressed, since the same problem seems to > exist: it seems an iscsi connection and a main stack tcp connection > might share the same 4-tuple unless something is done to avoid that > happening. > iSCSI does not do passive listens, only active connections to the target. But you're right, the port space is still shared between iSCSI and the main stack. We currently rely on user apps binding to the main stack to reserve certain ephemeral ports, and telling the iSCSI driver which ports to use. 
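The port reservation described above might look roughly like the following from user space. This is a sketch under assumptions: reserve_tcp_port() is a hypothetical helper, it uses only standard sockets calls, and how the reserved port number is then handed to the iSCSI driver is not covered in the thread and is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Bind a TCP socket to an ephemeral port chosen by the native stack.
 * Returns the socket descriptor, or -1 on failure; *port receives the
 * port number in host byte order. The port stays reserved in the native
 * stack for as long as the caller keeps the descriptor open.
 */
static int reserve_tcp_port(uint16_t *port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = 0;	/* 0 asks the kernel to pick a free port */

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&sin, &len) < 0) {
		close(fd);
		return -1;
	}

	*port = ntohs(sin.sin_port);
	return fd;
}
```

The caller would keep the returned descriptor open for the lifetime of the offloaded connection and close it afterwards to release the port.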
From xma at us.ibm.com Thu Sep 13 15:16:40 2007 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 13 Sep 2007 15:16:40 -0700 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: Message-ID: netdev-owner at vger.kernel.org wrote on 09/13/2007 02:00:43 PM: > > Since ehca can support 4K MTU, we would like to see a patch in > > IPoIB to allow link MTU to be up to 4K instead of current 2K for 2.6.24 > > kernel. The idea is IPoIB link MTU will pick up a return value from SM's > > default broadcast MTU. This patch should be a small patch, I hope you are > > OK with this. > > It's actually not small, since it turns the skb allocation into a > 4100-byte buffer, which ends up being more than 1 page usually, which > means it fails if memory is fragmented. > > Anyway given the backlog anything substantial that hasn't been posted > already is almost surely going to have to wait until 2.6.25. The patch is just needed to pick up broadcast MTU size instead of hard coding 2K right now. SKB allocation shouldn't be different with Ethernet Jambo Frame and IPoIB-CM which 64K MTU. I don't understand why it's different. Could you please explain this? Thanks Shirley
From hrosenstock at xsigo.com Thu Sep 13 17:36:34 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 13 Sep 2007 17:36:34 -0700 Subject: [ofa-general] [PATCHv2] ibnetdiscover: Support Xsigo chassis grouping Message-ID: <1189730194.6062.1.camel@hrosenstock-ws.xsigo.com> ibnetdiscover: Support Xsigo chassis grouping I think this also fixes a bug with grouping of multiple non Voltaire chassis as well. Note: this patch is against OFED 1.2 Signed-off-by: Hal Rosenstock diff --git a/diags/include/grouping.h b/diags/include/grouping.h index 4666935..3ba872c 100644 --- a/diags/include/grouping.h +++ b/diags/include/grouping.h @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses.
You may choose to be licensed under the terms of the GNU @@ -104,4 +105,8 @@ char *get_chassis_type(unsigned char chassistype); char *get_chassis_slot(unsigned char chassisslot); uint64_t get_chassis_guid(unsigned char chassisnum); +int is_xsigo_guid(uint64_t guid); +int is_xsigo_tca(uint64_t guid); +int is_xsigo_hca(uint64_t guid); + #endif /* _GROUPING_H_ */ diff --git a/diags/include/ibnetdiscover.h b/diags/include/ibnetdiscover.h index d13a666..bfbe7f5 100644 --- a/diags/include/ibnetdiscover.h +++ b/diags/include/ibnetdiscover.h @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -44,6 +45,7 @@ #define VTR_VENDOR_ID 0x8f1 /* Voltaire */ #define TS_VENDOR_ID 0x5ad /* Cisco */ #define SS_VENDOR_ID 0x66a /* InfiniCon */ +#define XS_VENDOR_ID 0x1397 /* Xsigo */ typedef struct Port Port; diff --git a/diags/src/grouping.c b/diags/src/grouping.c index 0e5bd78..6602f26 100644 --- a/diags/src/grouping.c +++ b/diags/src/grouping.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -96,20 +97,91 @@ static uint64_t topspin_chassisguid(uint64_t guid) return guid & 0xffffffff00ffffffULL; } -static uint64_t get_chassisguid(uint64_t guid, uint32_t vendid) +int is_xsigo_guid(uint64_t guid) { - if (vendid == TS_VENDOR_ID || vendid == SS_VENDOR_ID) - return topspin_chassisguid(guid); + if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) + return 1; else - return guid; + return 0; +} + +static int is_xsigo_leafone(uint64_t guid) +{ + if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) + return 1; + else + return 0; +} + +int is_xsigo_hca(uint64_t guid) +{ + /* NodeType 2 is HCA */ + if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) + return 1; + else + return 0; +} + +int is_xsigo_tca(uint64_t guid) +{ + /* NodeType 3 is TCA */ + if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL) + return 1; + else + return 0; +} + +static int is_xsigo_ca(uint64_t guid) +{ + if (is_xsigo_hca(guid) || is_xsigo_tca(guid)) + return 1; + else + return 0; +} + +static int is_xsigo_switch(uint64_t guid) +{ + if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) + return 1; + else + return 0; +} + +static uint64_t xsigo_chassisguid(Node *node) +{ + if (!is_xsigo_ca(node->sysimgguid)) { + /* Byte 3 is NodeType and byte 4 is PortType */ + /* If NodeType is 1 (switch), PortType is masked */ + if (is_xsigo_switch(node->sysimgguid)) + return node->sysimgguid & 0xffffffff00ffffffULL; + else + return node->sysimgguid; + } else { + /* If peer port is Leaf 1, use its chassis GUID */ + if (is_xsigo_leafone(node->ports->remoteport->node->sysimgguid)) + return node->ports->remoteport->node->sysimgguid & + 0xffffffff00ffffffULL; + else + return node->sysimgguid; + } } -static struct ChassisList *find_chassisguid(uint64_t guid, uint32_t vendid) +static uint64_t get_chassisguid(Node *node) +{ + if (node->vendid == TS_VENDOR_ID || node->vendid == SS_VENDOR_ID) + return 
topspin_chassisguid(node->sysimgguid); + else if (node->vendid == XS_VENDOR_ID || is_xsigo_guid(node->sysimgguid)) + return xsigo_chassisguid(node); + else + return node->sysimgguid; +} + +static struct ChassisList *find_chassisguid(Node *node) { ChassisList *current; uint64_t chguid; - chguid = get_chassisguid(guid, vendid); + chguid = get_chassisguid(node); for (current = mylist.first; current; current = current->next) { if (current->chassisguid == chguid) return current; @@ -668,14 +740,13 @@ ChassisList *group_nodes() if (node->vendid == VTR_VENDOR_ID) continue; if (node->sysimgguid) { - chassis = find_chassisguid(node->sysimgguid, - node->vendid); + chassis = find_chassisguid(node); if (chassis) chassis->nodecount++; else { /* Possible new chassis */ add_chassislist(); - mylist.current->chassisguid = get_chassisguid(node->sysimgguid, node->vendid); + mylist.current->chassisguid = get_chassisguid(node); mylist.current->nodecount = 1; } } @@ -684,13 +755,12 @@ ChassisList *group_nodes() /* now, make another pass to see which nodes are part of chassis */ /* (defined as chassis->nodecount > 1) */ - for (dist = 0; dist <= maxhops_discovered; dist++) { + for (dist = 0; dist <= MAXHOPS; ) { for (node = nodesdist[dist]; node; node = node->dnext) { if (node->vendid == VTR_VENDOR_ID) continue; if (node->sysimgguid) { - chassis = find_chassisguid(node->sysimgguid, - node->vendid); + chassis = find_chassisguid(node); if (chassis && chassis->nodecount > 1) { if (!chassis->chassisnum) chassis->chassisnum = ++chassisnum; @@ -702,6 +772,10 @@ ChassisList *group_nodes() } } } + if (dist == maxhops_discovered) + dist = MAXHOPS; /* skip to CAs */ + else + dist++; } return (mylist.first); diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c index cb62c44..2cff87e 100644 --- a/diags/src/ibnetdiscover.c +++ b/diags/src/ibnetdiscover.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. 
All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -450,14 +451,26 @@ list_node(Node *node) } void -out_ids(Node *node) +out_ids(Node *node, int group, char *chname) { fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); if (node->sysimgguid) - fprintf(f, "sysimgguid=0x%" PRIx64 "\n", node->sysimgguid); + fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); + if (group) + if (node->chrecord) + if (node->chrecord->chassisnum) { + fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); + if (chname) + fprintf(f, " (%s)", clean_nodedesc(chname)); + if (is_xsigo_tca(node->nodeguid)) { + if (node->ports->remoteport) + fprintf(f, " slot %d", node->ports->remoteport->portnum); + } + } + fprintf(f, "\n"); } -void +uint64_t out_chassis(int chassisnum) { uint64_t guid; @@ -467,20 +480,20 @@ out_chassis(int chassisnum) if (guid) fprintf(f, " (guid 0x%" PRIx64 ")", guid); fprintf(f, "\n"); + return guid; } void -out_switch(Node *node, int group) +out_switch(Node *node, int group, char *chname) { char *str; char *nodename = NULL; - out_ids(node); + out_ids(node, group, chname); fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); if (group) { if (node->chrecord) { if (node->chrecord->chassisnum) { - fprintf(f, "\t\t# Chassis %d ", node->chrecord->chassisnum); /* Currently, only if Voltaire chassis */ if (node->vendid == VTR_VENDOR_ID) { str = get_chassis_type(node->chrecord->chassistype); @@ -510,12 +523,12 @@ out_switch(Node *node, int group) } void -out_ca(Node *node) +out_ca(Node *node, int group, char *chname) { char *node_type; char *node_type2; - out_ids(node); + out_ids(node, group, chname); switch(node->type) { case CA_NODE: node_type = "ca"; @@ -532,9 +545,12 @@ out_ca(Node *node) } fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); - fprintf(f, "%s\t%d %s\t\t# \"%s\"\n", + fprintf(f, "%s\t%d %s\t\t# \"%s\"", node_type2, 
node->numports, node_name(node), clean_nodedesc(node->nodedesc)); + if (group && is_xsigo_hca(node->nodeguid)) + fprintf(f, " (scp)"); + fprintf(f, "\n"); } static char * @@ -572,12 +588,17 @@ out_switch_port(Port *port, int group) rem_nodename = clean_nodedesc(port->remoteport->node->nodedesc); ext_port_str = out_ext_port(port->remoteport, group); - fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d\n", + fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d", node_name(port->remoteport->node), port->remoteport->portnum, ext_port_str ? ext_port_str : "", rem_nodename, port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid); + if (is_xsigo_tca(port->remoteport->portguid)) + fprintf(f, " slot %d", port->portnum); + else if (is_xsigo_hca(port->remoteport->portguid)) + fprintf(f, " (scp)"); + fprintf(f, "\n"); if (rem_nodename && (port->remoteport->node->type == SWITCH_NODE)) free(rem_nodename); @@ -616,6 +637,8 @@ dump_topology(int listtype, int group) Port *port; int i = 0, dist = 0; time_t t = time(0); + uint64_t chguid; + char *chname = NULL; if (!listtype) { fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); @@ -633,11 +656,31 @@ dump_topology(int listtype, int group) if (!ch->chassisnum) continue; - out_chassis(ch->chassisnum); + chguid = out_chassis(ch->chassisnum); + chname = NULL; + if (is_xsigo_guid(chguid)) { + /* !!! 
*/ + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { + if (node->chrecord) { + if (!node->chrecord->chassisnum) + continue; + } else + continue; + + if (node->chrecord->chassisnum != ch->chassisnum) + continue; + + if (is_xsigo_hca(node->nodeguid)) { + chname = node->nodedesc; + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); + } + } + } + fprintf(f, "\n# Spine Nodes"); for (n = 1; n <= (SPINES_MAX_NUM+1); n++) { if (ch->spinenode[n]) { - out_switch(ch->spinenode[n], group); + out_switch(ch->spinenode[n], group, chname); for (port = ch->spinenode[n]->ports; port; port = port->next, i++) if (port->remoteport) out_switch_port(port, group); @@ -646,34 +689,57 @@ dump_topology(int listtype, int group) fprintf(f, "\n# Line Nodes"); for (n = 1; n <= (LINES_MAX_NUM+1); n++) { if (ch->linenode[n]) { - out_switch(ch->linenode[n], group); + out_switch(ch->linenode[n], group, chname); for (port = ch->linenode[n]->ports; port; port = port->next, i++) if (port->remoteport) out_switch_port(port, group); } } - } + fprintf(f, "\n# Chassis Switches"); + for (dist = 0; dist <= maxhops_discovered; dist++) { - for (dist = 0; dist <= maxhops_discovered; dist++) { + for (node = nodesdist[dist]; node; node = node->dnext) { - for (node = nodesdist[dist]; node; node = node->dnext) { + /* Non Voltaire chassis */ + if (node->vendid == VTR_VENDOR_ID) + continue; + if (node->chrecord) { + if (!node->chrecord->chassisnum) + continue; + } else + continue; - /* Non Voltaire chassis */ - if (node->vendid == VTR_VENDOR_ID) - continue; + if (node->chrecord->chassisnum != ch->chassisnum) + continue; + + out_switch(node, group, chname); + for (port = node->ports; port; port = port->next, i++) + if (port->remoteport) + out_switch_port(port, group); + + } + + } + + fprintf(f, "\n# Chassis CAs"); + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { if (node->chrecord) { if (!node->chrecord->chassisnum) continue; } else continue; - out_switch(node, group); + if 
(node->chrecord->chassisnum != ch->chassisnum) + continue; + + out_ca(node, group, chname); for (port = node->ports; port; port = port->next, i++) if (port->remoteport) - out_switch_port(port, group); + out_ca_port(port, group); } + } } else { @@ -683,7 +749,7 @@ dump_topology(int listtype, int group) DEBUG("SWITCH: dist %d node %p", dist, node); if (!listtype) { - out_switch(node, group); + out_switch(node, group, chname); } else { if (listtype & SWITCH_NODE) list_node(node); @@ -697,6 +763,7 @@ dump_topology(int listtype, int group) } } + chname = NULL; if (group && !listtype) { fprintf(f, "\nNon-Chassis Nodes\n"); @@ -710,7 +777,7 @@ dump_topology(int listtype, int group) if (node->chrecord) if (node->chrecord->chassisnum) continue; - out_switch(node, group); + out_switch(node, group, chname); for (port = node->ports; port; port = port->next, i++) if (port->remoteport) @@ -725,9 +792,14 @@ dump_topology(int listtype, int group) for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { DEBUG("CA: dist %d node %p", dist, node); - if (!listtype) - out_ca(node); - else { + if (!listtype) { + if (group) + /* Now, skip chassis based CAs */ + if (node->chrecord) + if (node->chrecord->chassisnum) + continue; + out_ca(node, group, chname); + } else { if (listtype & CA_NODE) list_node(node); continue; From kliteyn at mellanox.co.il Thu Sep 13 21:08:19 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 14 Sep 2007 07:08:19 +0300 Subject: [ofa-general] nightly osm_sim report 2007-09-14:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-13 OpenSM git rev = Sun_Sep_9_15:57:42_2007 [27f7ec84dbb1060397fa930569bc88d8f6e1d373] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 
13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures:
From stanleysufficool at roadrunner.com Thu Sep 13 22:37:46 2007 From: stanleysufficool at roadrunner.com (Stanley Sufficool) Date: Thu, 13 Sep 2007 22:37:46 -0700 Subject: [ofa-general] Out of tree SRP Target Module Message-ID: <1189748266.23403.16.camel@gentoo-linux.localdomain> I changed IB_SRPT to compile without patching the kernel tree. This is like the other SCST target drivers (SCST-iSCSI, qlogic, etc...) compile. It requires that it is copied into the SCST trunk. Is there a reason to keep this as a kernel patch, or would it be acceptable to have this as a standalone module?
From billfink at mindspring.com Fri Sep 14 00:20:55 2007
From: billfink at mindspring.com (Bill Fink)
Date: Fri, 14 Sep 2007 03:20:55 -0400
Subject: [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
In-Reply-To: <1188257019.4250.55.camel@localhost>
References: <46CF7B13.3020701@psc.edu> <20070826044134.eabd18cf.billfink@mindspring.com> <46D229AA.6020900@psc.edu> <20070826.190420.41652839.davem@davemloft.net> <1188257019.4250.55.camel@localhost>
Message-ID: <20070914032055.8f96449b.billfink@mindspring.com>

On Mon, 27 Aug 2007, jamal wrote:

> On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:
>
> > The transfer is much better behaved if we ACK every two full sized
> > frames we copy into the receiver, and therefore don't stretch ACK, but
> > at the cost of cpu utilization.
>
> The rx coalescing in theory should help by accumulating more ACKs on the
> rx side of the sender. But it doesnt seem to do that i.e For the 9K MTU,
> you are better off to turn off the coalescing if you want higher
> numbers. Also some of the TOE vendors (chelsio?) claim to have fixed
> this by reducing bursts on outgoing packets.
>
> Bill:
> who suggested (as per your email) the 75usec value and what was it based
> on measurement-wise?

Belatedly getting back to this thread. There was a recent myri10ge
patch that changed the default value for tx/rx interrupt coalescing
to 75 usec claiming it was an optimum value for maximum throughput
(and is also mentioned in their external README documentation).

I also did some empirical testing to determine the effect of different
values of TX/RX interrupt coalescing on 10-GigE network performance,
both with TSO enabled and with TSO disabled. The actual test runs
are attached at the end of this message, but the results are summarized
in the following table (network performance in Mbps).
TX/RX interrupt coalescing in usec (both sides)

usec              0     15     30     45     60     75     90    105
TSO enabled    8909   9682   9716   9725   9739   9745   9688   9648
TSO disabled   9113   9910   9910   9910   9910   9910   9910   9910

TSO disabled performance is always better than equivalent TSO enabled
performance. With TSO enabled, the optimum performance is indeed at
a TX/RX interrupt coalescing value of 75 usec. With TSO disabled,
performance is the full 10-GigE line rate of 9910 Mbps for any value
of TX/RX interrupt coalescing from 15 usec to 105 usec.

> BTW, thanks for the finding the energy to run those tests and a very
> refreshing perspective. I dont mean to add more work, but i had some
> queries;
> On your earlier tests, i think that Reno showed some significant
> differences on the lower MTU case over BIC. I wonder if this is
> consistent?

Here's a retest (5 tests each):

TSO enabled:

TCP Cubic (initial_ssthresh set to 0):

[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5007.6295 MB / 10.06 sec = 4176.1807 Mbps 36 %TX 100 %RX
[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4950.9279 MB / 10.06 sec = 4130.2528 Mbps 36 %TX 99 %RX
[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4917.1742 MB / 10.05 sec = 4102.5772 Mbps 35 %TX 99 %RX
[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4948.7920 MB / 10.05 sec = 4128.7990 Mbps 36 %TX 100 %RX
[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4937.5765 MB / 10.05 sec = 4120.6460 Mbps 35 %TX 99 %RX

TCP Bic (initial_ssthresh set to 0):

[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5005.5335 MB / 10.06 sec = 4172.9571 Mbps 36 %TX 99 %RX
[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5001.0625 MB / 10.06 sec = 4169.2960 Mbps 36 %TX 99 %RX
[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4957.7500 MB / 10.06 sec = 4135.7355 Mbps 36 %TX 99 %RX
[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
4957.3777 MB / 10.06 sec = 4135.6252 Mbps 36 %TX 99 %RX
[root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
5059.1815 MB / 10.05 sec = 4221.3546
Mbps 37 %TX 99 %RX TCP Reno: [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 4973.3532 MB / 10.06 sec = 4147.3589 Mbps 36 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 4984.4375 MB / 10.06 sec = 4155.2131 Mbps 36 %TX 99 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 4995.6841 MB / 10.06 sec = 4166.2734 Mbps 36 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 4982.2500 MB / 10.05 sec = 4156.7586 Mbps 36 %TX 99 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 4989.9796 MB / 10.05 sec = 4163.0949 Mbps 36 %TX 99 %RX TSO disabled: TCP Cubic (initial_ssthresh set to 0): [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5075.8125 MB / 10.02 sec = 4247.3408 Mbps 99 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5056.0000 MB / 10.03 sec = 4229.9621 Mbps 100 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5047.4375 MB / 10.03 sec = 4223.1203 Mbps 99 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5066.1875 MB / 10.03 sec = 4239.1659 Mbps 100 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 4986.3750 MB / 10.03 sec = 4171.9906 Mbps 99 %TX 100 %RX TCP Bic (initial_ssthresh set to 0): [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5040.5625 MB / 10.03 sec = 4217.3521 Mbps 100 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5049.7500 MB / 10.03 sec = 4225.4585 Mbps 99 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5076.5000 MB / 10.03 sec = 4247.6632 Mbps 100 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5017.2500 MB / 10.03 sec = 4197.4990 Mbps 100 %TX 99 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5013.3125 MB / 10.03 sec = 4194.8851 Mbps 100 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5036.0625 MB / 10.03 sec = 4213.9195 Mbps 100 %TX 100 %RX TCP Reno: [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5006.8750 MB / 10.02 sec = 4189.6051 Mbps 99 %TX 99 
%RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5028.1250 MB / 10.02 sec = 4207.4553 Mbps 100 %TX 99 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5021.9375 MB / 10.02 sec = 4202.2668 Mbps 99 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5000.5625 MB / 10.03 sec = 4184.3109 Mbps 99 %TX 100 %RX [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5025.1250 MB / 10.03 sec = 4204.7378 Mbps 99 %TX 100 %RX Not too much variation here, and not quite as high results as previously. Some further testing reveals that while this time I mainly get results like (here for TCP Bic with TSO disabled): [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 4958.0625 MB / 10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX I also sometimes get results like: [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 5882.1875 MB / 10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX The higher performing results seem to correspond to when there's a somewhat lower receiver CPU utilization. I'm not sure but there could also have been an effect from running the "-M1460" test after the 9000 byte jumbo frame test (no jumbo tests were done at all prior to running the above sets of 5 tests, although I did always discard an initial "warmup" test, and now that I think about it some of those initial discarded "warmup" tests did have somewhat anomalously high results). > A side note: Although the experimentation reduces the variables (eg > tying all to CPU0), it would be more exciting to see multi-cpu and > multi-flow sender effect (which IMO is more real world). These systems are intended as test systems for 10-GigE networks, and as such it's important to get as consistently close to full 10-GigE line rate as possible, and that's why the interrupts and nuttcp application are tied to CPU0, with almost all other system applications tied to CPU1. 
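The "not too much variation" remark about the retest can be checked with a little arithmetic. The following is only an illustrative sketch, not something that was run as part of these tests: it computes the mean and the max-minus-min spread of the TSO-enabled "-M1460" runs quoted earlier in this message. The function and array names (mean_mbps, spread_pct, cubic_tso, bic_tso, reno_tso) are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n throughput samples (Mbps). */
double mean_mbps(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Spread (max - min) of n samples, as a percentage of their mean. */
double spread_pct(const double *v, size_t n)
{
    double lo, hi;

    assert(n > 0);
    lo = hi = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] < lo)
            lo = v[i];
        if (v[i] > hi)
            hi = v[i];
    }
    return 100.0 * (hi - lo) / mean_mbps(v, n);
}

/* TSO-enabled "-M1460" results copied from the runs above (Mbps). */
const double cubic_tso[5] = { 4176.1807, 4130.2528, 4102.5772, 4128.7990, 4120.6460 };
const double bic_tso[5]   = { 4172.9571, 4169.2960, 4135.7355, 4135.6252, 4221.3546 };
const double reno_tso[5]  = { 4147.3589, 4155.2131, 4166.2734, 4156.7586, 4163.0949 };
```

On these figures the three means come out within about 1% of each other, and the spread within each set of five runs is roughly 2% or less, which is consistent with the remark above.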
Now, another system that's intended as a 10-GigE firewall system has
2 Myricom 10-GigE NICs, with the interrupts for eth2 tied to CPU0 and
the interrupts for the second NIC tied to CPU1. In IP forwarding tests
of this system, I have basically achieved full bidirectional 10-GigE
line rate IP forwarding with 9000 byte jumbo frames.

chance4 -> chance6 -> chance9 4.85 Gbps rate limited TCP stream
chance5 -> chance6 -> chance9 4.85 Gbps rate limited TCP stream
chance7 <- chance6 <- chance8 10.0 Gbps non-rate limited TCP stream

[root at chance7 ~]# nuttcp -Ic4tc9 -Ri4.85g -w10m 192.168.88.8 192.168.89.16 & \
nuttcp -Ic5tc9 -Ri4.85g -w10m -P5100 -p5101 192.168.88.9 192.168.89.16 & \
nuttcp -Ic7rc8 -r -w10m 192.168.89.15
c4tc9: 5778.6875 MB / 10.01 sec = 4842.7158 Mbps 100 %TX 42 %RX
c5tc9: 5778.9375 MB / 10.01 sec = 4843.1595 Mbps 100 %TX 40 %RX
c7rc8: 11509.1875 MB / 10.00 sec = 9650.8009 Mbps 99 %TX 74 %RX

If there's some other specific test you'd like to see, and it's not
too difficult to set up and I have some spare time, I'll see what I
can do.
-Bill Testing of effect of RX/TX interrupt coalescing on 10-GigE network performance (both with TSO enabled and with TSO disabled): -------------------------------------------------------------------------------- No RX/TX interrupt coalescing (either side): TSO enabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 10649.8750 MB / 10.03 sec = 8908.9806 Mbps 97 %TX 100 %RX TSO disabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 10879.5000 MB / 10.02 sec = 9112.5141 Mbps 99 %TX 99 %RX RX/TX interrupt coalescing set to 15 usec (both sides): TSO enabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11546.7500 MB / 10.00 sec = 9682.0785 Mbps 99 %TX 90 %RX TSO disabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11818.9375 MB / 10.00 sec = 9910.3702 Mbps 100 %TX 92 %RX RX/TX interrupt coalescing set to 30 usec (both sides): TSO enabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11587.1250 MB / 10.00 sec = 9715.9489 Mbps 99 %TX 81 %RX TSO disabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11818.8125 MB / 10.00 sec = 9910.3040 Mbps 100 %TX 81 %RX RX/TX interrupt coalescing set to 45 usec (both sides): TSO enabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11597.8750 MB / 10.00 sec = 9724.9902 Mbps 99 %TX 76 %RX TSO disabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11818.6250 MB / 10.00 sec = 9910.0933 Mbps 100 %TX 77 %RX RX/TX interrupt coalescing set to 60 usec (both sides): TSO enabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11614.7500 MB / 10.00 sec = 9739.1323 Mbps 100 %TX 74 %RX TSO disabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11818.4375 MB / 10.00 sec = 9909.9995 Mbps 100 %TX 76 %RX RX/TX interrupt coalescing set to 75 usec (both sides): TSO enabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11621.7500 MB / 10.00 sec = 9745.0993 Mbps 100 %TX 72 %RX TSO disabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11818.0625 MB / 10.00 sec = 9909.7881 Mbps 100 %TX 75 %RX RX/TX interrupt coalescing set to 90 usec (both sides): 
TSO enabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11553.1250 MB / 10.00 sec = 9687.6458 Mbps 100 %TX 71 %RX TSO disabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11818.4375 MB / 10.00 sec = 9910.0837 Mbps 100 %TX 73 %RX RX/TX interrupt coalescing set to 105 usec (both sides): TSO enabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11505.7500 MB / 10.00 sec = 9647.8558 Mbps 99 %TX 69 %RX TSO disabled: [root at lang2 ~]# nuttcp -w10m 192.168.88.16 11818.4375 MB / 10.00 sec = 9910.0530 Mbps 100 %TX 74 %RX From krkumar2 at in.ibm.com Fri Sep 14 01:52:34 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 14 Sep 2007 14:22:34 +0530 Subject: [ofa-general] [RESEND] Implement batching skb API and support in IPoIB Message-ID: Hi Dave, I am re-sending in case you didn't get this earlier. Also sending REV5 of the patch. I will send patch for e1000e on monday or tuesday after making the changes and testing over the weekend. thanks, - KK __________________ Hi Dave, David Miller wrote on 08/29/2007 10:21:50 AM: > From: Krishna Kumar2 > Date: Wed, 29 Aug 2007 08:53:30 +0530 > > > I am scp'ng from 192.168.1.1 to 192.168.1.2 and captured at the send > > side. > > Bad choice of test, this is cpu limited since the scp > has to encrypt and MAC hash all the data it sends. > > Use something like straight ftp or "bw_tcp" from lmbench. I used bw_tcp from lmbench-3. I transfered 500MB and captured the tcpdump, and analysis at various points gave pipeline sizes: 26064, 27792, 22888, 23168, 23448, 20272, 23168, 4344, 10136, 164792, 35920, 26344, 24336, 24336, 23168, 25784, 23168, There was one huge 164K, otherwise most were in smaller ranges like 20-30K. 
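As a quick check of the "most were in the 20-30K range" observation, the listed pipeline sizes can be summarized with a median and a range count. This is a minimal sketch for illustration only, not part of the original analysis: the helper names (median, count_in_range) are made up, and "20-30K" is assumed here to mean 20000 to 30000 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Pipeline sizes (bytes) as listed in the message above. */
const long pipeline_sizes[] = {
    26064, 27792, 22888, 23168, 23448, 20272, 23168, 4344, 10136,
    164792, 35920, 26344, 24336, 24336, 23168, 25784, 23168,
};
#define N_SIZES (sizeof(pipeline_sizes) / sizeof(pipeline_sizes[0]))

/* qsort comparator for long values, ascending. */
static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;

    return (x > y) - (x < y);
}

/* Median of n values; sorts a private copy so the input is untouched. */
long median(const long *v, size_t n)
{
    long *tmp, m;

    assert(n > 0);
    tmp = malloc(n * sizeof(*tmp));
    assert(tmp != NULL);
    memcpy(tmp, v, n * sizeof(*tmp));
    qsort(tmp, n, sizeof(*tmp), cmp_long);
    m = (n % 2) ? tmp[n / 2] : (tmp[n / 2 - 1] + tmp[n / 2]) / 2;
    free(tmp);
    return m;
}

/* Number of values that lie in the closed interval [lo, hi]. */
size_t count_in_range(const long *v, size_t n, long lo, long hi)
{
    size_t c = 0;

    for (size_t i = 0; i < n; i++)
        if (v[i] >= lo && v[i] <= hi)
            c++;
    return c;
}
```

Under that assumption, 13 of the 17 samples fall between 20000 and 30000 bytes and the median is 23448 bytes, so the single 164792-byte sample is an outlier rather than typical.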
I ran the following test script:

SERVER=192.168.1.2
BYTES=100m
BUFFERSIZES="4096 16384 65536 131072 262144"
PROCS="1 8"
ITERATIONS=5

for m in $BUFFERSIZES
do
	for procs in $PROCS
	do
		echo TEST: Size:$m Procs:$procs
		bw_tcp -N $ITERATIONS -m $m -M $BYTES -P $procs $SERVER
	done
done

Result is:

Test without batching:
#    Size     Procs   BW (MB/s)
1    4096     1       117.39
2    16384    1       117.49
3    65536    1       117.55
4    131072   1       117.55
5    262144   1       117.58
6    4096     8       117.18
7    16384    8       117.47
8    65536    8       117.54
9    131072   8       117.59
10   262144   8       117.55

Test with batching:
#    Size     Procs   BW (MB/s)
1    4096     1       117.39
2    16384    1       117.48
3    65536    1       117.55
4    131072   1       117.58
5    262144   1       117.58
6    4096     8       117.19
7    16384    8       117.46
8    65536    8       117.53
9    131072   8       117.55
10   262144   8       117.60

So it doesn't seem to harm e1000. Can someone give a link to the E1000E
driver? I couldn't find it after downloading Jeff's netdev-2.6 tree.

Thanks,

- KK

From krkumar2 at in.ibm.com Fri Sep 14 02:00:58 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 14 Sep 2007 14:30:58 +0530
Subject: [ofa-general] [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
Message-ID: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>

This set of patches implements the batching xmit capability, and adds support
for batching in IPoIB and E1000 (E1000 driver changes is ported, thanks to
changes taken from Jamal's code from an old kernel).

List of changes from previous revision:
----------------------------------------
1. [Dave] Enable batching as default (change in register_netdev).
2. [Randy] Update documentation (however ethtool cmd to get/set batching is
   not implemented, hence I am guessing the usage).
3. [KK] When changing tx_batch_skb, qdisc xmits need to be blocked since
   qdisc_restart() drops queue_lock before calling driver xmit, and
   driver could find blist change under it.
4.
[KK] sched: requeue could wrongly requeue an skb already put in the
   batching list (in case a single skb was sent to the device but not
   sent as the device was full, resulting in the skb getting added to
   blist). This also results in a slight optimization of batching
   behavior: fetching skbs #2 onwards doesn't require checking for
   gso_skb, as that is the first skb that is processed.
5. [KK] Change documentation to explain this behavior.
6. [KK] sched: Fix panic when GSO is enabled in driver.
7. [KK] IPoIB: Small optimization in ipoib_ib_handle_tx_wc
8. [KK] netdevice: Needed to change NETIF_F_GSO_SHIFT/NETIF_F_GSO_MASK as
   BATCH_SKBS is now defined as 65536 (earlier it was using 8192 which was
   taken up by NETIF_F_NETNS_LOCAL).

Will submit in the next 1-2 days:
---------------------------------
1. [Auke] Enable batching in e1000e.

Extras that I can do later:
---------------------------
1. [Patrick] Use skb_blist statically in netdevice. This could also be used
   to integrate GSO and batching.
2. [Evgeniy] Useful to splice lists dev_add_skb_to_blist (and this can be
   done for regular xmit's of GSO skbs too for #1 above).

Patches are described as:
Mail 0/10: This mail
Mail 1/10: HOWTO documentation
Mail 2/10: Introduce skb_blist, NETIF_F_BATCH_SKBS, use single API for
           batching/no-batching, etc.
Mail 3/10: Modify qdisc_run() to support batching
Mail 4/10: Add ethtool support to enable/disable batching
Mail 5/10: IPoIB: Header file changes to use batching
Mail 6/10: IPoIB: CM & Multicast changes
Mail 7/10: IPoIB: Verbs changes to use batching
Mail 8/10: IPoIB: Internal post and work completion handler
Mail 9/10: IPoIB: Implement the new batching capability
Mail 10/10: E1000: Implement the new batching capability

Issues:
--------
The retransmission problem reported earlier seems to happen when mthca is
used as the underlying device, but when I tested ehca the retransmissions
dropped to normal levels (around 2 times the regular code).
The performance improvement is around 55% for TCP.

Please review and provide feedback; and consider for inclusion.

Thanks,

- KK

----------------------------------------------------
TCP
----
Size:32      Procs:1   2728      3544      29.91
Size:128     Procs:1   11803     13679     15.89
Size:512     Procs:1   43279     49665     14.75
Size:4096    Procs:1   147952    101246    -31.56
Size:16384   Procs:1   149852    141897    -5.30

Size:32      Procs:4   10562     11349     7.45
Size:128     Procs:4   41010     40832     -.43
Size:512     Procs:4   75374     130943    73.72
Size:4096    Procs:4   167996    368218    119.18
Size:16384   Procs:4   123176    379524    208.11

Size:32      Procs:8   21125     21990     4.09
Size:128     Procs:8   77419     78605     1.53
Size:512     Procs:8   234678    265047    12.94
Size:4096    Procs:8   218063    367604    68.57
Size:16384   Procs:8   184283    370972    101.30

Average: 1509300 -> 2345115 = 55.38%
----------------------------------------------------

From krkumar2 at in.ibm.com Fri Sep 14 02:01:18 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 14 Sep 2007 14:31:18 +0530
Subject: [ofa-general] [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching
In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070914090118.17589.43799.sendpatchset@K50wks273871wss.in.ibm.com>

Add Documentation describing batching skb xmit capability.
Signed-off-by: Krishna Kumar --- batching_skb_xmit.txt | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 107 insertions(+) diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt --- org/Documentation/networking/batching_skb_xmit.txt 1970-01-01 05:30:00.000000000 +0530 +++ new/Documentation/networking/batching_skb_xmit.txt 2007-09-14 10:25:36.000000000 +0530 @@ -0,0 +1,107 @@ + HOWTO for batching skb xmit support + ----------------------------------- + +Section 1: What is batching skb xmit +Section 2: How batching xmit works vs the regular xmit +Section 3: How drivers can support batching +Section 4: Nitty gritty details for drivers +Section 5: How users can work with batching + + +Introduction: Kernel support for batching skb +---------------------------------------------- + +A new capability to support xmit of multiple skbs is provided in the netdevice +layer. Drivers which enable this capability should be able to process multiple +skbs in a single call to their xmit handler. + + +Section 1: What is batching skb xmit +------------------------------------- + + This capability is optionally enabled by a driver by setting the + NETIF_F_BATCH_SKBS bit in dev->features. The prerequisite for a + driver to use this capability is that it should have a reasonably- + sized hardware queue that can process multiple skbs. + + +Section 2: How batching xmit works vs the regular xmit +------------------------------------------------------- + + The network stack gets called from upper layer protocols with a single + skb to transmit. This skb is first enqueued and an attempt is made to + transmit it immediately (via qdisc_run). However, events like tx lock + contention, tx queue stopped, etc., can result in the skb not getting + sent out and it remains in the queue. 
When the next xmit is called or + when the queue is re-enabled, qdisc_run could potentially find + multiple packets in the queue, and iteratively send them all out + one-by-one. + + Batching skb xmit is a mechanism to exploit this situation where all + skbs can be passed in one shot to the device. This reduces driver + processing, locking at the driver (or in stack for ~LLTX drivers) + gets amortized over multiple skbs, and in case of specific drivers + where every xmit results in a completion processing (like IPoIB) - + optimizations can be made in the driver to request a completion for + only the last skb that was sent which results in saving interrupts + for every (but the last) skb that was sent in the same batch. + + Batching can result in significant performance gains for systems that + have multiple data stream paths over the same network interface card. + + +Section 3: How drivers can support batching +--------------------------------------------- + + Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in + dev->features. + + The driver's xmit handler should be modified to process multiple skbs + instead of one skb. The driver's xmit handler is called either with + an skb to transmit or NULL skb, where the latter case should be + handled as a call to xmit multiple skbs. This is done by sending out + all skbs in the dev->skb_blist list (where it was added by the core + stack). + + +Section 4: Nitty gritty details for driver writers +-------------------------------------------------- + + Batching is enabled from core networking stack only from softirq + context (NET_TX_SOFTIRQ), and dev_queue_xmit() doesn't use batching. + + This leads to the following situation: + A skb was not sent out as either driver lock was contested or + the device was blocked. When the softirq handler runs, it + moves all skbs from the device queue to the batch list, but + then it too could fail to send due to lock contention. 
The + next xmit (of a single skb) called from dev_queue_xmit() will + not use batching and try to xmit skb, while previous skbs are + still present in the batch list. This results in the receiver + getting out-of-order packets, and in case of TCP the sender + would have unnecessary retransmissions. + + To fix this problem, error cases where driver xmit gets called with a + skb must code as follows: + 1. If driver xmit cannot get tx lock, return NETDEV_TX_LOCKED + as usual. This allows qdisc to requeue the skb. + 2. If driver xmit got the lock but failed to send the skb, it + should return NETDEV_TX_BUSY but before that it should have + queue'd the skb to the batch list. In this case, the qdisc + does not requeue the skb. + + +Section 5: How users can work with batching +-------------------------------------------- + + Batching can be disabled for a particular device, e.g. on desktop + systems if only one stream of network activity for that device is + taking place, since performance could be slightly affected due to + extra processing that batching adds (unless packets are getting + sent fast resulting in queue getting stopped). Batching can be enabled + if more than one stream of network activity per device is being done, + e.g. on servers; or even desktop usage with multiple browser, chat, + file transfer sessions, etc. + + Per device batching can be enabled/disabled by: + ethtool batching on/off From krkumar2 at in.ibm.com Fri Sep 14 02:01:37 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 14 Sep 2007 14:31:37 +0530 Subject: [ofa-general] [PATCH 2/10 REV5] [core] Add skb_blist & support for batching In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914090137.17589.60322.sendpatchset@K50wks273871wss.in.ibm.com> Introduce skb_blist, NETIF_F_BATCH_SKBS, use single API for batching/no-batching, etc. 
Signed-off-by: Krishna Kumar --- include/linux/netdevice.h | 8 ++++++-- net/core/dev.c | 29 ++++++++++++++++++++++++++--- 2 files changed, 32 insertions(+), 5 deletions(-) diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h --- org/include/linux/netdevice.h 2007-09-13 09:11:09.000000000 +0530 +++ new/include/linux/netdevice.h 2007-09-14 10:26:21.000000000 +0530 @@ -439,10 +439,11 @@ struct net_device #define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */ #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ #define NETIF_F_LRO 32768 /* large receive offload */ +#define NETIF_F_BATCH_SKBS 65536 /* Driver supports multiple skbs/xmit */ /* Segmentation offload features */ -#define NETIF_F_GSO_SHIFT 16 -#define NETIF_F_GSO_MASK 0xffff0000 +#define NETIF_F_GSO_SHIFT 17 +#define NETIF_F_GSO_MASK 0xfffe0000 #define NETIF_F_TSO (SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT) #define NETIF_F_UFO (SKB_GSO_UDP << NETIF_F_GSO_SHIFT) #define NETIF_F_GSO_ROBUST (SKB_GSO_DODGY << NETIF_F_GSO_SHIFT) @@ -548,6 +549,9 @@ struct net_device /* Partially transmitted GSO packet. 
*/ struct sk_buff *gso_skb; + /* List of batch skbs (optional, used if driver supports skb batching */ + struct sk_buff_head *skb_blist; + /* ingress path synchronizer */ spinlock_t ingress_lock; struct Qdisc *qdisc_ingress; diff -ruNp org/net/core/dev.c new/net/core/dev.c --- org/net/core/dev.c 2007-09-14 10:24:27.000000000 +0530 +++ new/net/core/dev.c 2007-09-14 10:25:36.000000000 +0530 @@ -953,6 +953,16 @@ void netdev_state_change(struct net_devi } } +static void free_batching(struct net_device *dev) +{ + if (dev->skb_blist) { + if (!skb_queue_empty(dev->skb_blist)) + skb_queue_purge(dev->skb_blist); + kfree(dev->skb_blist); + dev->skb_blist = NULL; + } +} + /** * dev_load - load a network module * @name: name of interface @@ -1534,7 +1544,10 @@ static int dev_gso_segment(struct sk_buf int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev) { - if (likely(!skb->next)) { + if (likely(skb)) { + if (unlikely(skb->next)) + goto gso; + if (!list_empty(&ptype_all)) dev_queue_xmit_nit(skb, dev); @@ -1544,10 +1557,10 @@ int dev_hard_start_xmit(struct sk_buff * if (skb->next) goto gso; } - - return dev->hard_start_xmit(skb, dev); } + return dev->hard_start_xmit(skb, dev); + gso: do { struct sk_buff *nskb = skb->next; @@ -3566,6 +3579,13 @@ int register_netdevice(struct net_device } } + if (dev->features & NETIF_F_BATCH_SKBS) { + /* Driver supports batching skb */ + dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL); + if (dev->skb_blist) + skb_queue_head_init(dev->skb_blist); + } + /* * nil rebuild_header routine, * that should be never called and used as just bug trap. @@ -3901,6 +3921,9 @@ void unregister_netdevice(struct net_dev synchronize_net(); + /* Deallocate batching structure */ + free_batching(dev); + /* Shutdown queueing discipline. 
*/ dev_shutdown(dev); From krkumar2 at in.ibm.com Fri Sep 14 02:01:56 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 14 Sep 2007 14:31:56 +0530 Subject: [ofa-general] [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914090156.17589.61701.sendpatchset@K50wks273871wss.in.ibm.com> Modify qdisc_run() to support batching. Modify callers of qdisc_run to use batching, modify qdisc_restart to implement batching. Signed-off-by: Krishna Kumar --- include/linux/netdevice.h | 2 include/net/pkt_sched.h | 17 +++++-- net/core/dev.c | 45 ++++++++++++++++++ net/sched/sch_generic.c | 109 ++++++++++++++++++++++++++++++++++++---------- 4 files changed, 145 insertions(+), 28 deletions(-) diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h --- org/include/net/pkt_sched.h 2007-09-13 09:11:09.000000000 +0530 +++ new/include/net/pkt_sched.h 2007-09-14 10:25:36.000000000 +0530 @@ -80,13 +80,24 @@ extern struct qdisc_rate_table *qdisc_ge struct rtattr *tab); extern void qdisc_put_rtab(struct qdisc_rate_table *tab); -extern void __qdisc_run(struct net_device *dev); +static inline void qdisc_block(struct net_device *dev) +{ + while (test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state)) + yield(); +} + +static inline void qdisc_unblock(struct net_device *dev) +{ + clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state); +} + +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist); -static inline void qdisc_run(struct net_device *dev) +static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist) { if (!netif_queue_stopped(dev) && !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state)) - __qdisc_run(dev); + __qdisc_run(dev, blist); } extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp, diff -ruNp 
org/include/linux/netdevice.h new/include/linux/netdevice.h --- org/include/linux/netdevice.h 2007-09-13 09:11:09.000000000 +0530 +++ new/include/linux/netdevice.h 2007-09-14 10:26:21.000000000 +0530 @@ -1013,6 +1013,8 @@ extern int dev_set_mac_address(struct n struct sockaddr *); extern int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev); +extern int dev_add_skb_to_blist(struct sk_buff *skb, + struct net_device *dev); extern int netdev_budget; diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c --- org/net/sched/sch_generic.c 2007-09-13 09:11:10.000000000 +0530 +++ new/net/sched/sch_generic.c 2007-09-14 10:25:36.000000000 +0530 @@ -59,26 +59,30 @@ static inline int qdisc_qlen(struct Qdis static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { - if (unlikely(skb->next)) - dev->gso_skb = skb; - else - q->ops->requeue(skb, q); + if (skb) { + if (unlikely(skb->next)) + dev->gso_skb = skb; + else + q->ops->requeue(skb, q); + } netif_schedule(dev); return 0; } -static inline struct sk_buff *dev_dequeue_skb(struct net_device *dev, - struct Qdisc *q) +static inline int dev_requeue_skb_wrapper(struct sk_buff *skb, + struct net_device *dev, + struct Qdisc *q) { - struct sk_buff *skb; - - if ((skb = dev->gso_skb)) - dev->gso_skb = NULL; - else - skb = q->dequeue(q); + if (dev->skb_blist) { + /* + * In case of tx full, batching drivers would have put all + * skbs into skb_blist so there is no skb to requeue. + */ + skb = NULL; + } - return skb; + return dev_requeue_skb(skb, dev, q); } static inline int handle_dev_cpu_collision(struct sk_buff *skb, @@ -91,10 +95,15 @@ static inline int handle_dev_cpu_collisi /* * Same CPU holding the lock. It may be a transient * configuration error, when hard_start_xmit() recurses. We - * detect it by checking xmit owner and drop the packet when - * deadloop is detected. Return OK to try the next skb. 
+ * detect it by checking xmit owner and drop the packet (or + * all packets in batching case) when deadloop is detected. + * Return OK to try the next skb. */ - kfree_skb(skb); + if (likely(skb)) + kfree_skb(skb); + else if (!skb_queue_empty(dev->skb_blist)) + skb_queue_purge(dev->skb_blist); + if (net_ratelimit()) printk(KERN_WARNING "Dead loop on netdevice %s, " "fix it urgently!\n", dev->name); @@ -111,6 +120,53 @@ static inline int handle_dev_cpu_collisi return ret; } +#define DEQUEUE_SKB(q) (q->dequeue(q)) + +static inline struct sk_buff *get_gso_skb(struct net_device *dev) +{ + struct sk_buff *skb; + + if ((skb = dev->gso_skb)) + dev->gso_skb = NULL; + + return skb; +} + +/* + * Algorithm to get skb(s) is: + * - If gso skb present, return it. + * - Non batching drivers, or if the batch list is empty and there is + * 1 skb in the queue - dequeue skb and put it in *skbp to tell the + * caller to use the single xmit API. + * - Batching drivers where the batch list already contains atleast one + * skb, or if there are multiple skbs in the queue: keep dequeue'ing + * skb's upto a limit and set *skbp to NULL to tell the caller to use + * the multiple xmit API. + * + * Returns: + * 1 - atleast one skb is to be sent out, *skbp contains skb or NULL + * (in case >1 skbs present in blist for batching) + * 0 - no skbs to be sent. + */ +static inline int get_skb(struct net_device *dev, struct Qdisc *q, + struct sk_buff_head *blist, struct sk_buff **skbp) +{ + if ((*skbp = get_gso_skb(dev)) != NULL) + return 1; + + if (!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) { + return likely((*skbp = DEQUEUE_SKB(q)) != NULL); + } else { + struct sk_buff *skb; + int max = dev->tx_queue_len - skb_queue_len(blist); + + while (max > 0 && (skb = DEQUEUE_SKB(q)) != NULL) + max -= dev_add_skb_to_blist(skb, dev); + + return 1; /* there is atleast one skb in skb_blist */ + } +} + /* * NOTE: Called under dev->queue_lock with locally disabled BH. 
* @@ -130,7 +186,8 @@ static inline int handle_dev_cpu_collisi * >0 - queue is not empty. * */ -static inline int qdisc_restart(struct net_device *dev) +static inline int qdisc_restart(struct net_device *dev, + struct sk_buff_head *blist) { struct Qdisc *q = dev->qdisc; struct sk_buff *skb; @@ -138,7 +195,7 @@ static inline int qdisc_restart(struct n int ret; /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) + if (unlikely(get_skb(dev, q, blist, &skb) == 0)) return 0; /* @@ -168,7 +225,7 @@ static inline int qdisc_restart(struct n switch (ret) { case NETDEV_TX_OK: - /* Driver sent out skb successfully */ + /* Driver sent out skb (or entire skb_blist) successfully */ ret = qdisc_qlen(q); break; @@ -183,21 +240,21 @@ static inline int qdisc_restart(struct n printk(KERN_WARNING "BUG %s code %d qlen %d\n", dev->name, ret, q->q.qlen); - ret = dev_requeue_skb(skb, dev, q); + ret = dev_requeue_skb_wrapper(skb, dev, q); break; } return ret; } -void __qdisc_run(struct net_device *dev) +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist) { do { - if (!qdisc_restart(dev)) + if (!qdisc_restart(dev, blist)) break; } while (!netif_queue_stopped(dev)); - clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state); + qdisc_unblock(dev); } static void dev_watchdog(unsigned long arg) @@ -575,6 +632,12 @@ void dev_deactivate(struct net_device *d qdisc = dev->qdisc; dev->qdisc = &noop_qdisc; + if (dev->skb_blist) { + /* Release skbs on batch list */ + if (!skb_queue_empty(dev->skb_blist)) + skb_queue_purge(dev->skb_blist); + } + qdisc_reset(qdisc); skb = dev->gso_skb; diff -ruNp org/net/core/dev.c new/net/core/dev.c --- org/net/core/dev.c 2007-09-14 10:24:27.000000000 +0530 +++ new/net/core/dev.c 2007-09-14 10:25:36.000000000 +0530 @@ -1542,6 +1542,46 @@ static int dev_gso_segment(struct sk_buf return 0; } +/* + * Add skb (skbs in case segmentation is required) to dev->skb_blist. 
No one + * can add to this list simultaneously since we are holding QDISC RUNNING + * bit. Also list is safe from simultaneous deletes too since skbs are + * dequeued only when the driver is invoked. + * + * Returns count of successful skb(s) added to skb_blist. + */ +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev) +{ + if (!list_empty(&ptype_all)) + dev_queue_xmit_nit(skb, dev); + + if (netif_needs_gso(dev, skb)) { + if (unlikely(dev_gso_segment(skb))) { + kfree_skb(skb); + return 0; + } + + if (skb->next) { + int count = 0; + + do { + struct sk_buff *nskb = skb->next; + + skb->next = nskb->next; + __skb_queue_tail(dev->skb_blist, nskb); + count++; + } while (skb->next); + + /* Reset destructor for kfree_skb to work */ + skb->destructor = DEV_GSO_CB(skb)->destructor; + kfree_skb(skb); + return count; + } + } + __skb_queue_tail(dev->skb_blist, skb); + return 1; +} + int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev) { if (likely(skb)) { @@ -1697,7 +1737,7 @@ gso: /* reset queue_mapping to zero */ skb->queue_mapping = 0; rc = q->enqueue(skb, q); - qdisc_run(dev); + qdisc_run(dev, NULL); spin_unlock(&dev->queue_lock); rc = rc == NET_XMIT_BYPASS ? 
NET_XMIT_SUCCESS : rc; @@ -1895,7 +1935,8 @@ static void net_tx_action(struct softirq clear_bit(__LINK_STATE_SCHED, &dev->state); if (spin_trylock(&dev->queue_lock)) { - qdisc_run(dev); + /* Send all skbs if driver supports batching */ + qdisc_run(dev, dev->skb_blist); spin_unlock(&dev->queue_lock); } else { netif_schedule(dev); From krkumar2 at in.ibm.com Fri Sep 14 02:02:25 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 14 Sep 2007 14:32:25 +0530 Subject: [ofa-general] [PATCH 4/10 REV5] [ethtool] Add ethtool support In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914090215.17589.53243.sendpatchset@K50wks273871wss.in.ibm.com> Add ethtool support to enable/disable batching. Signed-off-by: Krishna Kumar --- include/linux/ethtool.h | 2 ++ include/linux/netdevice.h | 2 ++ net/core/dev.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ net/core/ethtool.c | 27 +++++++++++++++++++++++++++ 4 files changed, 75 insertions(+) diff -ruNp org/include/linux/ethtool.h new/include/linux/ethtool.h --- org/include/linux/ethtool.h 2007-09-13 09:11:09.000000000 +0530 +++ new/include/linux/ethtool.h 2007-09-14 10:25:36.000000000 +0530 @@ -440,6 +440,8 @@ struct ethtool_ops { #define ETHTOOL_SFLAGS 0x00000026 /* Set flags bitmap(ethtool_value) */ #define ETHTOOL_GPFLAGS 0x00000027 /* Get driver-private flags bitmap */ #define ETHTOOL_SPFLAGS 0x00000028 /* Set driver-private flags bitmap */ +#define ETHTOOL_GBATCH 0x00000029 /* Get Batching (ethtool_value) */ +#define ETHTOOL_SBATCH 0x00000030 /* Set Batching (ethtool_value) */ /* compatibility with older code */ #define SPARC_ETH_GSET ETHTOOL_GSET diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h --- org/include/linux/netdevice.h 2007-09-13 09:11:09.000000000 +0530 +++ new/include/linux/netdevice.h 2007-09-14 10:26:21.000000000 +0530 @@ -1331,6 +1331,8 @@ extern void 
dev_set_promiscuity(struct extern void dev_set_allmulti(struct net_device *dev, int inc); extern void netdev_state_change(struct net_device *dev); extern void netdev_features_change(struct net_device *dev); +extern int dev_change_tx_batch_skb(struct net_device *dev, + unsigned long new_batch_skb); /* Load a device via the kmod */ extern void dev_load(struct net *net, const char *name); extern void dev_mcast_init(void); diff -ruNp org/net/core/dev.c new/net/core/dev.c --- org/net/core/dev.c 2007-09-14 10:24:27.000000000 +0530 +++ new/net/core/dev.c 2007-09-14 10:25:36.000000000 +0530 @@ -963,6 +963,50 @@ void free_batching(struct net_dev } } +int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb) +{ + int ret = 0; + struct sk_buff_head *blist = NULL; + + if (!(dev->features & NETIF_F_BATCH_SKBS)) { + /* Driver doesn't support batching skb API */ + ret = -EINVAL; + goto out; + } + + /* + * Check if new value is same as the current (paranoia to use !! for + * new_batch_skb as that will be boolean via ethtool). + */ + if (!!dev->skb_blist == !!new_batch_skb) + goto out; + + if (new_batch_skb && + (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) { + ret = -ENOMEM; + goto out; + } + + /* + * Block xmit as qdisc_restart() drops queue_lock before calling + * driver xmit, and driver could find blist change under it. 
+ */ + qdisc_block(dev); + + spin_lock_bh(&dev->queue_lock); + if (new_batch_skb) { + skb_queue_head_init(blist); + dev->skb_blist = blist; + } else + free_batching(dev); + spin_unlock_bh(&dev->queue_lock); + + qdisc_unblock(dev); + +out: + return ret; +} + /** * dev_load - load a network module * @name: name of interface diff -ruNp org/net/core/ethtool.c new/net/core/ethtool.c --- org/net/core/ethtool.c 2007-09-13 09:11:10.000000000 +0530 +++ new/net/core/ethtool.c 2007-09-14 10:25:36.000000000 +0530 @@ -556,6 +556,26 @@ static int ethtool_set_gso(struct net_de return 0; } +static int ethtool_get_batch(struct net_device *dev, char __user *useraddr) +{ + struct ethtool_value edata = { ETHTOOL_GBATCH }; + + edata.data = dev->skb_blist != NULL; + if (copy_to_user(useraddr, &edata, sizeof(edata))) + return -EFAULT; + return 0; +} + +static int ethtool_set_batch(struct net_device *dev, char __user *useraddr) +{ + struct ethtool_value edata; + + if (copy_from_user(&edata, useraddr, sizeof(edata))) + return -EFAULT; + + return dev_change_tx_batch_skb(dev, edata.data); +} + static int ethtool_self_test(struct net_device *dev, char __user *useraddr) { struct ethtool_test test; @@ -813,6 +833,7 @@ int dev_ethtool(struct net *net, struct case ETHTOOL_GGSO: case ETHTOOL_GFLAGS: case ETHTOOL_GPFLAGS: + case ETHTOOL_GBATCH: break; default: if (!capable(CAP_NET_ADMIN)) @@ -956,6 +977,12 @@ int dev_ethtool(struct net *net, struct rc = ethtool_set_value(dev, useraddr, dev->ethtool_ops->set_priv_flags); break; + case ETHTOOL_GBATCH: + rc = ethtool_get_batch(dev, useraddr); + break; + case ETHTOOL_SBATCH: + rc = ethtool_set_batch(dev, useraddr); + break; default: rc = -EOPNOTSUPP; } From krkumar2 at in.ibm.com Fri Sep 14 02:02:46 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 14 Sep 2007 14:32:46 +0530 Subject: [ofa-general] [PATCH 5/10 REV5] [IPoIB] Header file changes In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> References: 
<20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070914090246.17589.74932.sendpatchset@K50wks273871wss.in.ibm.com>

IPoIB header file changes to use batching.

Signed-off-by: Krishna Kumar
---
 ipoib.h | 9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h new/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-14 10:25:36.000000000 +0530
@@ -271,8 +271,8 @@ struct ipoib_dev_priv {
 	struct ipoib_tx_buf *tx_ring;
 	unsigned tx_head;
 	unsigned tx_tail;
-	struct ib_sge tx_sge;
-	struct ib_send_wr tx_wr;
+	struct ib_sge *tx_sge;
+	struct ib_send_wr *tx_wr;
 
 	struct ib_wc ibwc[IPOIB_NUM_WC];
 
@@ -367,8 +367,11 @@ static inline void ipoib_put_ah(struct i
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+		      u32 qpn, int wr_num);
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn);
+		struct ipoib_ah *address, u32 qpn, int num_skbs);
 void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_flush_paths(struct net_device *dev);

From krkumar2 at in.ibm.com Fri Sep 14 02:03:15 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 14 Sep 2007 14:33:15 +0530
Subject: [ofa-general] [PATCH 6/10 REV5] [IPoIB] CM & Multicast changes
In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070914090310.17589.31185.sendpatchset@K50wks273871wss.in.ibm.com>

IPoIB CM & Multicast changes based on header file changes.

Signed-off-by: Krishna Kumar
---
 ipoib_cm.c | 13 +++++++++----
 ipoib_multicast.c | 4 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c new/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-09-14 10:25:36.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
 			    unsigned int wr_id,
 			    u64 addr, int len)
 {
+	int ret;
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge.addr = addr;
-	priv->tx_sge.length = len;
+	priv->tx_sge[0].addr = addr;
+	priv->tx_sge[0].length = len;
+
+	priv->tx_wr[0].wr_id = wr_id;
 
-	priv->tx_wr.wr_id = wr_id;
+	priv->tx_wr[0].next = NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next = &priv->tx_wr[1];
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-09-14 10:25:36.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
 			}
 		}
 
-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
 	}
 
 unlock:

From krkumar2 at in.ibm.com Fri Sep 14 02:03:34 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 14 Sep 2007 14:33:34 +0530
Subject: [ofa-general] [PATCH 7/10 REV5]
[IPoIB] Verbs changes
In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070914090334.17589.95279.sendpatchset@K50wks273871wss.in.ibm.com>

IPoIB verb changes to use batching.

Signed-off-by: Krishna Kumar
---
 ipoib_verbs.c | 23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-09-14 10:25:36.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* 11.2.4.1 */
 		.qp_type = IB_QPT_UD
 	};
-
-	int ret, size;
+	struct ib_send_wr *next_wr = NULL;
+	int i, ret, size;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff;
 
-	priv->tx_sge.lkey = priv->mr->lkey;
-
-	priv->tx_wr.opcode = IB_WR_SEND;
-	priv->tx_wr.sg_list = &priv->tx_sge;
-	priv->tx_wr.num_sge = 1;
-	priv->tx_wr.send_flags = IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey = priv->mr->lkey;
+		priv->tx_wr[i].opcode = IB_WR_SEND;
+		priv->tx_wr[i].sg_list = &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge = 1;
+		priv->tx_wr[i].send_flags = 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next = next_wr;
+		next_wr = &priv->tx_wr[i];
+	}
 
 	return 0;
 

From krkumar2 at in.ibm.com Fri Sep 14 02:03:58 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 14 Sep 2007 14:33:58 +0530
Subject: [ofa-general] [PATCH 8/10 REV5] [IPoIB] Post and work completion handler
changes In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914090353.17589.26052.sendpatchset@K50wks273871wss.in.ibm.com> IPoIB internal post and work completion handler changes. Signed-off-by: Krishna Kumar --- ipoib_ib.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++------------- 1 files changed, 168 insertions(+), 44 deletions(-) diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new/drivers/infiniband/ulp/ipoib/ipoib_ib.c --- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-13 09:10:58.000000000 +0530 +++ new/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-14 10:25:36.000000000 +0530 @@ -242,6 +242,8 @@ repost: static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); + int num_completions, to_process; + unsigned int tx_ring_index; unsigned int wr_id = wc->wr_id; struct ipoib_tx_buf *tx_req; unsigned long flags; @@ -255,18 +257,51 @@ static void ipoib_ib_handle_tx_wc(struct return; } - tx_req = &priv->tx_ring[wr_id]; + /* Get first WC to process (no one can update tx_tail at this time) */ + tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1); - ib_dma_unmap_single(priv->ca, tx_req->mapping, - tx_req->skb->len, DMA_TO_DEVICE); + /* Find number of WC's to process */ + num_completions = wr_id - tx_ring_index + 1; + if (unlikely(num_completions <= 0)) + num_completions += ipoib_sendq_size; + to_process = num_completions; - ++priv->stats.tx_packets; - priv->stats.tx_bytes += tx_req->skb->len; + /* + * Handle WC's from earlier (possibly multiple) post_sends in this + * iteration as we move from tx_tail to wr_id, since if the last WR + * (which is the one which requested completion notification) failed + * to be sent for any of those earlier request(s), no completion + * notification is generated for successful WR's of those earlier + * request(s). 
Use a infinite loop to handle the regular case of + * one skb processing faster. + */ + tx_req = &priv->tx_ring[tx_ring_index]; + while (1) { + if (likely(tx_req->skb)) { + ib_dma_unmap_single(priv->ca, tx_req->mapping, + tx_req->skb->len, DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + } + /* + * else this skb failed synchronously when posted and was + * freed immediately. + */ + + if (--to_process == 0) + break; - dev_kfree_skb_any(tx_req->skb); + if (likely(++tx_ring_index != ipoib_sendq_size)) + tx_req++; + else + tx_req = &priv->tx_ring[0]; + } spin_lock_irqsave(&priv->tx_lock, flags); - ++priv->tx_tail; + priv->tx_tail += num_completions; if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) && priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) { clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); @@ -335,29 +370,57 @@ void ipoib_ib_completion(struct ib_cq *c netif_rx_schedule(dev, &priv->napi); } -static inline int post_send(struct ipoib_dev_priv *priv, - unsigned int wr_id, - struct ib_ah *address, u32 qpn, - u64 addr, int len) +/* + * post_send : Post WR(s) to the device. + * + * num_skbs is the number of WR's, first_wr is the first slot in tx_wr[] (or + * tx_sge[]). first_wr is normally zero unless a previous post_send returned + * error and we are trying to post the untried WR's, in which case first_wr + * is the index to the first untried WR. + * + * Break the WR link before posting so that provider knows how many WR's to + * process, and this is set back after the post. 
+ */ +static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn, + int first_wr, int num_skbs, + struct ib_send_wr **bad_wr) { - struct ib_send_wr *bad_wr; + int ret; + struct ib_send_wr *last_wr, *next_wr; + + last_wr = &priv->tx_wr[first_wr + num_skbs - 1]; + + /* Set Completion Notification for last WR */ + last_wr->send_flags = IB_SEND_SIGNALED; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; + /* Terminate the last WR */ + next_wr = last_wr->next; + last_wr->next = NULL; - priv->tx_wr.wr_id = wr_id; - priv->tx_wr.wr.ud.remote_qpn = qpn; - priv->tx_wr.wr.ud.ah = address; + /* Send all the WR's in one doorbell */ + ret = ib_post_send(priv->qp, &priv->tx_wr[first_wr], bad_wr); - return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr); + /* Restore send_flags & WR chain */ + last_wr->send_flags = 0; + last_wr->next = next_wr; + + return ret; } -void ipoib_send(struct net_device *dev, struct sk_buff *skb, - struct ipoib_ah *address, u32 qpn) +/* + * Map skb & store skb/mapping in tx_ring; and details of the WR in tx_wr + * to pass to the provider. + * + * Returns: + * 1: Error and the skb is freed. + * 0 skb processed successfully. 
+ */ +int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb, + struct ipoib_dev_priv *priv, struct ipoib_ah *address, + u32 qpn, int wr_num) { - struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_tx_buf *tx_req; u64 addr; + unsigned int tx_ring_index; if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", @@ -365,7 +428,7 @@ void ipoib_send(struct net_device *dev, ++priv->stats.tx_dropped; ++priv->stats.tx_errors; ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu); - return; + return 1; } ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", @@ -378,35 +441,96 @@ void ipoib_send(struct net_device *dev, * means we have to make sure everything is properly recorded and * our state is consistent before we call post_send(). */ - tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; - tx_req->skb = skb; - addr = ib_dma_map_single(priv->ca, skb->data, skb->len, - DMA_TO_DEVICE); + addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE); if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); - return; + return 1; } - tx_req->mapping = addr; - if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), - address->ah, qpn, addr, skb->len))) { - ipoib_warn(priv, "post_send failed\n"); - ++priv->stats.tx_errors; - ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); - dev_kfree_skb_any(skb); - } else { - dev->trans_start = jiffies; + tx_ring_index = priv->tx_head & (ipoib_sendq_size - 1); + + /* Save till completion handler executes */ + priv->tx_ring[tx_ring_index].skb = skb; + priv->tx_ring[tx_ring_index].mapping = addr; + + /* Set WR values for the provider to use */ + priv->tx_sge[wr_num].addr = addr; + priv->tx_sge[wr_num].length = skb->len; + + priv->tx_wr[wr_num].wr_id = tx_ring_index; + priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn; + priv->tx_wr[wr_num].wr.ud.ah = 
address->ah; + + priv->tx_head++; + + if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); + } - address->last_send = priv->tx_head; - ++priv->tx_head; + return 0; +} - if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { - ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); - netif_stop_queue(dev); - set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); +/* + * Send num_skbs to the device. If an skb is passed to this function, it is + * single, unprocessed skb send case; otherwise it means that all skbs are + * already processed and put on priv->tx_wr,tx_sge,tx_ring, etc. + */ +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn, int num_skbs) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int first_wr = 0; + + if (skb && ipoib_process_skb(dev, skb, priv, address, qpn, 0)) + return; + + /* Send all skb's in one post */ + do { + struct ib_send_wr *bad_wr; + + if (unlikely((post_send(priv, qpn, first_wr, num_skbs, + &bad_wr)))) { + int done; + + ipoib_warn(priv, "post_send failed\n"); + + /* Get number of WR's that finished successfully */ + done = bad_wr - &priv->tx_wr[first_wr]; + + /* Handle 1 error */ + priv->stats.tx_errors++; + ib_dma_unmap_single(priv->ca, + priv->tx_sge[first_wr + done].addr, + priv->tx_sge[first_wr + done].length, + DMA_TO_DEVICE); + + /* Free failed WR & reset for WC handler to recognize */ + dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb); + priv->tx_ring[bad_wr->wr_id].skb = NULL; + + /* Handle 'n' successes */ + if (done) { + dev->trans_start = jiffies; + address->last_send = priv->tx_head - (num_skbs - + done) - 1; + } + + /* Get count of skbs that were not tried */ + num_skbs -= (done + 1); + /* + 1 for WR that was tried & failed */ + + /* Get start index for next iteration */ + first_wr += (done + 1); + } else { + 
dev->trans_start = jiffies; + + address->last_send = priv->tx_head - 1; + num_skbs = 0; } - } + } while (num_skbs); } static void __ipoib_reap_ah(struct net_device *dev) From krkumar2 at in.ibm.com Fri Sep 14 02:04:23 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 14 Sep 2007 14:34:23 +0530 Subject: [ofa-general] [PATCH 9/10 REV5] [IPoIB] Implement batching In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914090423.17589.77448.sendpatchset@K50wks273871wss.in.ibm.com> IPoIB: implement the new batching API. Signed-off-by: Krishna Kumar --- ipoib_main.c | 248 +++++++++++++++++++++++++++++++++++++++-------------------- 1 files changed, 168 insertions(+), 80 deletions(-) diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c new/drivers/infiniband/ulp/ipoib/ipoib_main.c --- org/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-13 09:10:58.000000000 +0530 +++ new/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-14 10:25:36.000000000 +0530 @@ -563,7 +563,8 @@ static void neigh_add_path(struct sk_buf goto err_drop; } } else - ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + ipoib_send(dev, skb, path->ah, + IPOIB_QPN(skb->dst->neighbour->ha), 1); } else { neigh->ah = NULL; @@ -643,7 +644,7 @@ static void unicast_arp_send(struct sk_b ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); - ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr)); + ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1); } else if ((path->query || !path_rec_start(dev, path)) && skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { /* put pseudoheader back on for next time */ @@ -657,105 +658,163 @@ static void unicast_arp_send(struct sk_b spin_unlock(&priv->lock); } +#define XMIT_PROCESSED_SKBS() \ + do { \ + if (wr_num) { \ + ipoib_send(dev, NULL, old_neigh->ah, old_qpn, \ + wr_num); \ + 
wr_num = 0; \ + } \ + } while (0) + static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_neigh *neigh; + struct sk_buff_head *blist; + int max_skbs, wr_num = 0; + u32 qpn, old_qpn = 0; + struct ipoib_neigh *neigh, *old_neigh = NULL; unsigned long flags; if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags))) return NETDEV_TX_LOCKED; - /* - * Check if our queue is stopped. Since we have the LLTX bit - * set, we can't rely on netif_stop_queue() preventing our - * xmit function from being called with a full queue. - */ - if (unlikely(netif_queue_stopped(dev))) { - spin_unlock_irqrestore(&priv->tx_lock, flags); - return NETDEV_TX_BUSY; + blist = dev->skb_blist; + if (!skb || (blist && skb_queue_len(blist))) { + /* + * Either batching xmit call, or single skb case but there are + * skbs already in the batch list from previous failure to + * xmit - send the earlier skbs first to avoid out of order. + */ + + if (skb) + __skb_queue_tail(blist, skb); + + /* + * Figure out how many skbs can be sent. This prevents the + * device getting full and avoids checking for stopped queue + * after each iteration. Now the queue can get stopped atmost + * after xmit of the last skb. 
+ */ + max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail); + skb = __skb_dequeue(blist); + } else { + blist = NULL; + max_skbs = 1; } - if (likely(skb->dst && skb->dst->neighbour)) { - if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { - ipoib_path_lookup(skb, dev); - goto out; - } - - neigh = *to_ipoib_neigh(skb->dst->neighbour); - - if (ipoib_cm_get(neigh)) { - if (ipoib_cm_up(neigh)) { - ipoib_cm_send(dev, skb, ipoib_cm_get(neigh)); - goto out; - } - } else if (neigh->ah) { - if (unlikely(memcmp(&neigh->dgid.raw, - skb->dst->neighbour->ha + 4, - sizeof(union ib_gid)))) { - spin_lock(&priv->lock); - /* - * It's safe to call ipoib_put_ah() inside - * priv->lock here, because we know that - * path->ah will always hold one more reference, - * so ipoib_put_ah() will never do more than - * decrement the ref count. - */ - ipoib_put_ah(neigh->ah); - list_del(&neigh->list); - ipoib_neigh_free(dev, neigh); - spin_unlock(&priv->lock); + do { + if (likely(skb->dst && skb->dst->neighbour)) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + XMIT_PROCESSED_SKBS(); ipoib_path_lookup(skb, dev); - goto out; + continue; } - ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha)); - goto out; - } - - if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { - spin_lock(&priv->lock); - __skb_queue_tail(&neigh->queue, skb); - spin_unlock(&priv->lock); - } else { - ++priv->stats.tx_dropped; - dev_kfree_skb_any(skb); - } - } else { - struct ipoib_pseudoheader *phdr = - (struct ipoib_pseudoheader *) skb->data; - skb_pull(skb, sizeof *phdr); + neigh = *to_ipoib_neigh(skb->dst->neighbour); - if (phdr->hwaddr[4] == 0xff) { - /* Add in the P_Key for multicast*/ - phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; - phdr->hwaddr[9] = priv->pkey & 0xff; + if (ipoib_cm_get(neigh)) { + if (ipoib_cm_up(neigh)) { + XMIT_PROCESSED_SKBS(); + ipoib_cm_send(dev, skb, + ipoib_cm_get(neigh)); + continue; + } + } else if (neigh->ah) { + if 
(unlikely(memcmp(&neigh->dgid.raw, + skb->dst->neighbour->ha + 4, + sizeof(union ib_gid)))) { + spin_lock(&priv->lock); + /* + * It's safe to call ipoib_put_ah() + * inside priv->lock here, because we + * know that path->ah will always hold + * one more reference, so ipoib_put_ah() + * will never do more than decrement + * the ref count. + */ + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + ipoib_neigh_free(dev, neigh); + spin_unlock(&priv->lock); + XMIT_PROCESSED_SKBS(); + ipoib_path_lookup(skb, dev); + continue; + } + + qpn = IPOIB_QPN(skb->dst->neighbour->ha); + if (neigh != old_neigh || qpn != old_qpn) { + /* + * Sending to a different destination + * from earlier skb's (or this is the + * first skb) - send all existing skbs. + */ + XMIT_PROCESSED_SKBS(); + old_neigh = neigh; + old_qpn = qpn; + } + + if (likely(!ipoib_process_skb(dev, skb, priv, + neigh->ah, qpn, + wr_num))) + wr_num++; - ipoib_mcast_send(dev, phdr->hwaddr + 4, skb); - } else { - /* unicast GID -- should be ARP or RARP reply */ + continue; + } - if ((be16_to_cpup((__be16 *) skb->data) != ETH_P_ARP) && - (be16_to_cpup((__be16 *) skb->data) != ETH_P_RARP)) { - ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " - IPOIB_GID_FMT "\n", - skb->dst ? 
"neigh" : "dst", - be16_to_cpup((__be16 *) skb->data), - IPOIB_QPN(phdr->hwaddr), - IPOIB_GID_RAW_ARG(phdr->hwaddr + 4)); + if (skb_queue_len(&neigh->queue) < + IPOIB_MAX_PATH_REC_QUEUE) { + spin_lock(&priv->lock); + __skb_queue_tail(&neigh->queue, skb); + spin_unlock(&priv->lock); + } else { dev_kfree_skb_any(skb); ++priv->stats.tx_dropped; - goto out; } - - unicast_arp_send(skb, dev, phdr); + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + XMIT_PROCESSED_SKBS(); + ipoib_mcast_send(dev, phdr->hwaddr + 4, skb); + } else { + /* unicast GID -- should be ARP or RARP reply */ + + if ((be16_to_cpup((__be16 *) skb->data) != + ETH_P_ARP) && + (be16_to_cpup((__be16 *) skb->data) != + ETH_P_RARP)) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? "neigh" : "dst", + be16_to_cpup((__be16 *) + skb->data), + IPOIB_QPN(phdr->hwaddr), + IPOIB_GID_RAW_ARG(phdr->hwaddr + + 4)); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + continue; + } + XMIT_PROCESSED_SKBS(); + unicast_arp_send(skb, dev, phdr); + } } - } + } while (--max_skbs > 0 && (skb = __skb_dequeue(blist)) != NULL); + + /* Send out last packets (if any) */ + XMIT_PROCESSED_SKBS(); -out: spin_unlock_irqrestore(&priv->tx_lock, flags); - return NETDEV_TX_OK; + return (!blist || !skb_queue_len(blist)) ? 
NETDEV_TX_OK : + NETDEV_TX_BUSY; } static struct net_device_stats *ipoib_get_stats(struct net_device *dev) @@ -903,11 +962,35 @@ int ipoib_dev_init(struct net_device *de /* priv->tx_head & tx_tail are already 0 */ - if (ipoib_ib_dev_init(dev, ca, port)) + /* Allocate tx_sge */ + priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge, + GFP_KERNEL); + if (!priv->tx_sge) { + printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n", + ca->name, ipoib_sendq_size); goto out_tx_ring_cleanup; + } + + /* Allocate tx_wr */ + priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr, + GFP_KERNEL); + if (!priv->tx_wr) { + printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n", + ca->name, ipoib_sendq_size); + goto out_tx_sge_cleanup; + } + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_wr_cleanup; return 0; +out_tx_wr_cleanup: + kfree(priv->tx_wr); + +out_tx_sge_cleanup: + kfree(priv->tx_sge); + out_tx_ring_cleanup: kfree(priv->tx_ring); @@ -935,9 +1018,13 @@ void ipoib_dev_cleanup(struct net_device kfree(priv->rx_ring); kfree(priv->tx_ring); + kfree(priv->tx_sge); + kfree(priv->tx_wr); priv->rx_ring = NULL; priv->tx_ring = NULL; + priv->tx_sge = NULL; + priv->tx_wr = NULL; } static void ipoib_setup(struct net_device *dev) @@ -968,7 +1055,8 @@ static void ipoib_setup(struct net_devic dev->addr_len = INFINIBAND_ALEN; dev->type = ARPHRD_INFINIBAND; dev->tx_queue_len = ipoib_sendq_size * 2; - dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX | + NETIF_F_BATCH_SKBS; /* MTU will be reset when mcast join happens */ dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; From krkumar2 at in.ibm.com Fri Sep 14 02:04:42 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 14 Sep 2007 14:34:42 +0530 Subject: [ofa-general] [PATCH 10/10 REV5] [E1000] Implement batching In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> References: 
<20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914090442.17589.23005.sendpatchset@K50wks273871wss.in.ibm.com> E1000: Implement batching capability (ported thanks to changes taken from Jamal). Signed-off-by: Krishna Kumar --- e1000_main.c | 104 ++++++++++++++++++++++++++++++++++++++++++----------------- 1 files changed, 75 insertions(+), 29 deletions(-) diff -ruNp org/drivers/net/e1000/e1000_main.c new/drivers/net/e1000/e1000_main.c --- org/drivers/net/e1000/e1000_main.c 2007-09-14 10:30:57.000000000 +0530 +++ new/drivers/net/e1000/e1000_main.c 2007-09-14 10:31:02.000000000 +0530 @@ -990,7 +990,7 @@ e1000_probe(struct pci_dev *pdev, if (pci_using_dac) netdev->features |= NETIF_F_HIGHDMA; - netdev->features |= NETIF_F_LLTX; + netdev->features |= NETIF_F_LLTX | NETIF_F_BATCH_SKBS; adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw); @@ -3092,6 +3092,17 @@ e1000_tx_map(struct e1000_adapter *adapt return count; } +static void e1000_kick_DMA(struct e1000_adapter *adapter, + struct e1000_tx_ring *tx_ring, int i) +{ + wmb(); + + writel(i, adapter->hw.hw_addr + tx_ring->tdt); + /* we need this if more than one processor can write to our tail + * at a time, it syncronizes IO on IA64/Altix systems */ + mmiowb(); +} + static void e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring, int tx_flags, int count) @@ -3138,13 +3149,7 @@ e1000_tx_queue(struct e1000_adapter *ada * know there are new descriptors to fetch. (Only * applicable for weak-ordered memory model archs, * such as IA-64). 
*/ - wmb(); - tx_ring->next_to_use = i; - writel(i, adapter->hw.hw_addr + tx_ring->tdt); - /* we need this if more than one processor can write to our tail - * at a time, it syncronizes IO on IA64/Altix systems */ - mmiowb(); } /** @@ -3251,22 +3256,23 @@ static int e1000_maybe_stop_tx(struct ne } #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 ) + +#define NETDEV_TX_DROPPED -5 + static int -e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *netdev) { struct e1000_adapter *adapter = netdev_priv(netdev); struct e1000_tx_ring *tx_ring; unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD; unsigned int max_txd_pwr = E1000_MAX_TXD_PWR; unsigned int tx_flags = 0; - unsigned int len = skb->len; - unsigned long flags; - unsigned int nr_frags = 0; - unsigned int mss = 0; + unsigned int len = skb->len - skb->data_len; + unsigned int nr_frags; + unsigned int mss; int count = 0; int tso; unsigned int f; - len -= skb->data_len; /* This goes back to the question of how to logically map a tx queue * to a flow. 
Right now, performance is impacted slightly negatively @@ -3276,7 +3282,7 @@ e1000_xmit_frame(struct sk_buff *skb, st if (unlikely(skb->len <= 0)) { dev_kfree_skb_any(skb); - return NETDEV_TX_OK; + return NETDEV_TX_DROPPED; } /* 82571 and newer doesn't need the workaround that limited descriptor @@ -3322,7 +3328,7 @@ e1000_xmit_frame(struct sk_buff *skb, st DPRINTK(DRV, ERR, "__pskb_pull_tail failed.\n"); dev_kfree_skb_any(skb); - return NETDEV_TX_OK; + return NETDEV_TX_DROPPED; } len = skb->len - skb->data_len; break; @@ -3366,22 +3372,15 @@ e1000_xmit_frame(struct sk_buff *skb, st (adapter->hw.mac_type == e1000_82573)) e1000_transfer_dhcp_info(adapter, skb); - if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) - /* Collision - tell upper layer to requeue */ - return NETDEV_TX_LOCKED; - /* need: count + 2 desc gap to keep tail from touching * head, otherwise try next time */ - if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) { - spin_unlock_irqrestore(&tx_ring->tx_lock, flags); + if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) return NETDEV_TX_BUSY; - } if (unlikely(adapter->hw.mac_type == e1000_82547)) { if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) { netif_stop_queue(netdev); mod_timer(&adapter->tx_fifo_stall_timer, jiffies + 1); - spin_unlock_irqrestore(&tx_ring->tx_lock, flags); return NETDEV_TX_BUSY; } } @@ -3396,8 +3395,7 @@ e1000_xmit_frame(struct sk_buff *skb, st tso = e1000_tso(adapter, tx_ring, skb); if (tso < 0) { dev_kfree_skb_any(skb); - spin_unlock_irqrestore(&tx_ring->tx_lock, flags); - return NETDEV_TX_OK; + return NETDEV_TX_DROPPED; } if (likely(tso)) { @@ -3416,13 +3414,61 @@ e1000_xmit_frame(struct sk_buff *skb, st e1000_tx_map(adapter, tx_ring, skb, first, max_per_txd, nr_frags, mss)); - netdev->trans_start = jiffies; + return NETDEV_TX_OK; +} + +static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct e1000_adapter *adapter = netdev_priv(netdev); + struct e1000_tx_ring 
*tx_ring = adapter->tx_ring; + struct sk_buff_head *blist; + int ret, skbs_done = 0; + unsigned long flags; + + if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) { + /* Collision - tell upper layer to requeue */ + return NETDEV_TX_LOCKED; + } - /* Make sure there is space in the ring for the next send. */ - e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2); + blist = netdev->skb_blist; + + if (!skb || (blist && skb_queue_len(blist))) { + /* + * Either batching xmit call, or single skb case but there are + * skbs already in the batch list from previous failure to + * xmit - send the earlier skbs first to avoid out of order. + */ + if (skb) + __skb_queue_tail(blist, skb); + skb = __skb_dequeue(blist); + } else { + blist = NULL; + } + + do { + ret = e1000_prep_queue_frame(skb, netdev); + if (likely(ret == NETDEV_TX_OK)) + skbs_done++; + else { + if (ret == NETDEV_TX_BUSY) { + if (blist) + __skb_queue_head(blist, skb); + break; + } + /* skb dropped, not a TX error */ + ret = NETDEV_TX_OK; + } + } while (blist && (skb = __skb_dequeue(blist)) != NULL); + + if (skbs_done) { + e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use); + netdev->trans_start = jiffies; + /* Make sure there is space in the ring for the next send. 
*/ + e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2); + } spin_unlock_irqrestore(&tx_ring->tx_lock, flags); - return NETDEV_TX_OK; + return ret; } /** From vlad at lists.openfabrics.org Fri Sep 14 02:51:37 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 14 Sep 2007 02:51:37 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070914-0200 daily build status Message-ID: <20070914095137.8AC23E6086A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 
Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: 
/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070914-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From johnpol at 2ka.mipt.ru Fri Sep 14 05:15:19 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Fri, 14 Sep 2007 16:15:19 +0400 Subject: [ofa-general] Re: [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching In-Reply-To: <20070914090156.17589.61701.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070914090156.17589.61701.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914121518.GB18517@2ka.mipt.ru> Hi Krishna. 
On Fri, Sep 14, 2007 at 02:31:56PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote: > +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev) > +{ > + if (!list_empty(&ptype_all)) > + dev_queue_xmit_nit(skb, dev); > + > + if (netif_needs_gso(dev, skb)) { > + if (unlikely(dev_gso_segment(skb))) { > + kfree_skb(skb); > + return 0; > + } > + > + if (skb->next) { > + int count = 0; > + > + do { > + struct sk_buff *nskb = skb->next; > + > + skb->next = nskb->next; > + __skb_queue_tail(dev->skb_blist, nskb); > + count++; > + } while (skb->next); Could it be list_move()-like function for skb lists? I'm pretty sure if you change first and the last skbs and ke of the queue in one shot, result will be the same. Actually how many skbs are usually batched in your load? > + /* Reset destructor for kfree_skb to work */ > + skb->destructor = DEV_GSO_CB(skb)->destructor; > + kfree_skb(skb); Why do you free first skb in the chain? > + return count; > + } > + } > + __skb_queue_tail(dev->skb_blist, skb); > + return 1; > +} -- Evgeniy Polyakov From johnpol at 2ka.mipt.ru Fri Sep 14 05:46:38 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Fri, 14 Sep 2007 16:46:38 +0400 Subject: [ofa-general] Re: [PATCH 2/10 REV5] [core] Add skb_blist & support for batching In-Reply-To: <20070914090137.17589.60322.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070914090137.17589.60322.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914124637.GC18517@2ka.mipt.ru> On Fri, Sep 14, 2007 at 02:31:37PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote: > @@ -3566,6 +3579,13 @@ int register_netdevice(struct net_device > } > } > > + if (dev->features & NETIF_F_BATCH_SKBS) { > + /* Driver supports batching skb */ > + dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL); > + if (dev->skb_blist) > + skb_queue_head_init(dev->skb_blist); > + } > + A nitpick is that you should use 
sizeof(struct ...), and I think the feature flag needs to be cleared in case the allocation fails? > /* > * nil rebuild_header routine, > * that should be never called and used as just bug trap. -- Evgeniy Polyakov From johnpol at 2ka.mipt.ru Fri Sep 14 05:47:14 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Fri, 14 Sep 2007 16:47:14 +0400 Subject: [ofa-general] Re: [PATCH 10/10 REV5] [E1000] Implement batching In-Reply-To: <20070914090442.17589.23005.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070914090442.17589.23005.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914124714.GD18517@2ka.mipt.ru> On Fri, Sep 14, 2007 at 02:34:42PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote: > @@ -3276,7 +3282,7 @@ e1000_xmit_frame(struct sk_buff *skb, st > > if (unlikely(skb->len <= 0)) { > dev_kfree_skb_any(skb); > - return NETDEV_TX_OK; > + return NETDEV_TX_DROPPED; > } This change could actually go in as its own patch, although I am not sure it is ever used. Just a thought, not a stopper. 
> /* 82571 and newer doesn't need the workaround that limited descriptor > @@ -3322,7 +3328,7 @@ e1000_xmit_frame(struct sk_buff *skb, st > DPRINTK(DRV, ERR, > "__pskb_pull_tail failed.\n"); > dev_kfree_skb_any(skb); > - return NETDEV_TX_OK; > + return NETDEV_TX_DROPPED; > } > len = skb->len - skb->data_len; > break; > @@ -3366,22 +3372,15 @@ e1000_xmit_frame(struct sk_buff *skb, st > (adapter->hw.mac_type == e1000_82573)) > e1000_transfer_dhcp_info(adapter, skb); > > - if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) > - /* Collision - tell upper layer to requeue */ > - return NETDEV_TX_LOCKED; > - > /* need: count + 2 desc gap to keep tail from touching > * head, otherwise try next time */ > - if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) { > - spin_unlock_irqrestore(&tx_ring->tx_lock, flags); > + if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) > return NETDEV_TX_BUSY; > - } > > if (unlikely(adapter->hw.mac_type == e1000_82547)) { > if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) { > netif_stop_queue(netdev); > mod_timer(&adapter->tx_fifo_stall_timer, jiffies + 1); > - spin_unlock_irqrestore(&tx_ring->tx_lock, flags); > return NETDEV_TX_BUSY; > } > } > @@ -3396,8 +3395,7 @@ e1000_xmit_frame(struct sk_buff *skb, st > tso = e1000_tso(adapter, tx_ring, skb); > if (tso < 0) { > dev_kfree_skb_any(skb); > - spin_unlock_irqrestore(&tx_ring->tx_lock, flags); > - return NETDEV_TX_OK; > + return NETDEV_TX_DROPPED; > } > > if (likely(tso)) { > @@ -3416,13 +3414,61 @@ e1000_xmit_frame(struct sk_buff *skb, st > e1000_tx_map(adapter, tx_ring, skb, first, > max_per_txd, nr_frags, mss)); > > - netdev->trans_start = jiffies; > + return NETDEV_TX_OK; > +} > + > +static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) > +{ > + struct e1000_adapter *adapter = netdev_priv(netdev); > + struct e1000_tx_ring *tx_ring = adapter->tx_ring; > + struct sk_buff_head *blist; > + int ret, skbs_done = 0; > + unsigned long flags; > 
+ > + if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) { > + /* Collision - tell upper layer to requeue */ > + return NETDEV_TX_LOCKED; > + } > > - /* Make sure there is space in the ring for the next send. */ > - e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2); > + blist = netdev->skb_blist; > + > + if (!skb || (blist && skb_queue_len(blist))) { > + /* > + * Either batching xmit call, or single skb case but there are > + * skbs already in the batch list from previous failure to > + * xmit - send the earlier skbs first to avoid out of order. > + */ > + if (skb) > + __skb_queue_tail(blist, skb); > + skb = __skb_dequeue(blist); Why is it put at the end? -- Evgeniy Polyakov From johnpol at 2ka.mipt.ru Fri Sep 14 05:49:07 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Fri, 14 Sep 2007 16:49:07 +0400 Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914124906.GE18517@2ka.mipt.ru> Hi Krishna. On Fri, Sep 14, 2007 at 02:30:58PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote: > -------- > The retransmission problem reported earlier seems to happen when mthca is > used as the underlying device, but when I tested ehca the retransmissions > dropped to normal levels (around 2 times the regular code). The performance > improvement is around 55% for TCP. And what about latency for this patchset? 
-- Evgeniy Polyakov From johnpol at 2ka.mipt.ru Fri Sep 14 05:55:38 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Fri, 14 Sep 2007 16:55:38 +0400 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <46E98889.1080706@opengridcomputing.com> References: <46E97BB0.9030106@opengridcomputing.com> <46E987E0.2010605@garzik.org> <46E98889.1080706@opengridcomputing.com> Message-ID: <20070914125536.GF18517@2ka.mipt.ru> On Thu, Sep 13, 2007 at 01:59:21PM -0500, Steve Wise (swise at opengridcomputing.com) wrote: > >Well, if it involves /sharing/ port space with the native stack, i.e. > >where port 1234 is IB but 1235 is Linux, pretty much all the networking > >devs have NAK'd that approach AFAICS. > > Jeff, I posted a fix that doesn't do this. No port sharing. The iwarp > device will use its own ip address and subnet to avoid collisions. You > should review the patch when I post v2. Could you please resend it, since I missed it in netdev at . -- Evgeniy Polyakov From johnpol at 2ka.mipt.ru Fri Sep 14 06:09:41 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Fri, 14 Sep 2007 17:09:41 +0400 Subject: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. In-Reply-To: <20070913191617.30937.95960.stgit@dell3.ogc.int> References: <20070913191617.30937.95960.stgit@dell3.ogc.int> Message-ID: <20070914130941.GG18517@2ka.mipt.ru> On Thu, Sep 13, 2007 at 02:16:17PM -0500, Steve Wise (swise at opengridcomputing.com) wrote: > > iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. > > Version 2: > > - added a per-device mutex for the address and listening endpoints lists. > > - wait for all replies if sending multiple passive_open requests to rnic. > > - log warning if no addresses are available when a listen is issued. 
> > - tested > > --- > > Design: > > The sysadmin creates "for iwarp use only" alias interfaces of the form > "devname:iw*" where devname is the native interface name (eg eth0) for the > iwarp netdev device. The alias label can be anything starting with "iw". > The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver. > > EG: > ifconfig eth0 192.168.70.123 up > ifconfig eth0:iw1 192.168.71.123 up > ifconfig eth0:iw2 192.168.72.123 up > > In the above example, 192.168.70/24 is for TCP traffic, while > 192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use. > > The rdma-only interface must be on its own IP subnet. This allows routing > all rdma traffic onto this interface. > > The iWARP driver must translate all listens on address 0.0.0.0 to the > set of rdma-only ip addresses for the device in question. This prevents > incoming connect requests to the TCP ipaddresses from going up the > rdma stack. If the only ways to solve a problem with the hardware are to steal packets or to become a real device, then a real device is much more appropriate. Is that correct? > +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) > +{ > + struct iwch_addrlist *addr; > + > + addr = kmalloc(sizeof *addr, GFP_KERNEL); As a small nitpick: this wants to be sizeof(struct iwch_addrlist) > + if (!addr) { > + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", > + __FUNCTION__); > + return; > + } > + addr->ifa = ifa; > + mutex_lock(&rnicp->mutex); > + list_add_tail(&addr->entry, &rnicp->addrlist); > + mutex_unlock(&rnicp->mutex); > +} What about returning an error to the caller and failing the registration? 
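The error-propagation idea can be tried outside the kernel. What follows is only a userspace sketch under stated assumptions, not the iw_cxgb3 code: struct addr_entry, struct dev_ctx, register_addr(), free_addrlist() and the fail_alloc hook are hypothetical stand-ins, malloc() replaces kmalloc(), and the locking from the patch is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's struct in_ifaddr. */
struct in_ifaddr {
	unsigned int ifa_address;
};

/* Hypothetical stand-in for struct iwch_addrlist (singly linked here). */
struct addr_entry {
	struct addr_entry *next;
	struct in_ifaddr *ifa;
};

/* Hypothetical stand-in for the per-device state. */
struct dev_ctx {
	struct addr_entry *head;
	int fail_alloc;	/* test hook: pretend the allocation failed */
};

/* Returns 0 on success or -ENOMEM, instead of logging and returning void. */
static int insert_ifa(struct dev_ctx *dev, struct in_ifaddr *ifa)
{
	struct addr_entry *e;

	assert(dev != NULL && ifa != NULL);
	e = dev->fail_alloc ? NULL : malloc(sizeof(*e));
	if (!e)
		return -ENOMEM;
	e->ifa = ifa;
	e->next = dev->head;	/* list order does not matter for this sketch */
	dev->head = e;
	return 0;
}

/* The caller sees the failure and can refuse to register the address. */
static int register_addr(struct dev_ctx *dev, struct in_ifaddr *ifa)
{
	int err = insert_ifa(dev, ifa);

	if (err)
		return err;
	/* further registration steps would go here */
	return 0;
}

/* Release every entry; mirrors the role of delete_addrlist() in the patch. */
static void free_addrlist(struct dev_ctx *dev)
{
	while (dev->head) {
		struct addr_entry *e = dev->head;

		dev->head = e->next;
		free(e);
	}
}
```

With this shape, a notifier or init path could undo its own work when register_addr() returns a negative errno, rather than continuing with an address that never made it onto the list.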
> +static void remove_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) > +{ > + struct iwch_addrlist *addr, *tmp; > + > + mutex_lock(&rnicp->mutex); > + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { > + if (addr->ifa == ifa) { > + list_del_init(&addr->entry); > + kfree(addr); > + goto out; > + } > + } > +out: > + mutex_unlock(&rnicp->mutex); > +} > + > +static int netdev_is_ours(struct iwch_dev *rnicp, struct net_device *netdev) > +{ > + int i; > + > + for (i = 0; i < rnicp->rdev.port_info.nports; i++) > + if (netdev == rnicp->rdev.port_info.lldevs[i]) > + return 1; > + return 0; > +} > + > +static inline int is_iwarp_label(char *label) > +{ > + char *colon; > + > + colon = strchr(label, ':'); > + if (colon && !strncmp(colon+1, "iw", 2)) > + return 1; > + return 0; > +} I.e. it is not allowed to create ':iw' alias for anyone else? Well, looks crappy, but if it is the only solution... > +static int nb_callback(struct notifier_block *self, unsigned long event, > + void *ctx) > +{ > + struct in_ifaddr *ifa = ctx; > + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); > + > + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); > + > + switch (event) { > + case NETDEV_UP: > + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && > + is_iwarp_label(ifa->ifa_label)) { > + PDBG("%s label %s addr 0x%x added\n", > + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); > + insert_ifa(rnicp, ifa); > + iwch_listeners_add_addr(rnicp, ifa->ifa_address); > + } > + break; > + case NETDEV_DOWN: > + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && > + is_iwarp_label(ifa->ifa_label)) { > + PDBG("%s label %s addr 0x%x deleted\n", > + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); > + iwch_listeners_del_addr(rnicp, ifa->ifa_address); > + remove_ifa(rnicp, ifa); > + } > + break; > + default: > + break; > + } > + return 0; > +} > + > +static void delete_addrlist(struct iwch_dev *rnicp) > +{ > + struct iwch_addrlist *addr, *tmp; > + > + 
mutex_lock(&rnicp->mutex); > + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { > + list_del_init(&addr->entry); > + kfree(addr); > + } > + mutex_unlock(&rnicp->mutex); > +} > + > +static void populate_addrlist(struct iwch_dev *rnicp) > +{ > + int i; > + struct in_device *indev; > + > + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { > + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); > + if (!indev) > + continue; > + for_ifa(indev) > + if (is_iwarp_label(ifa->ifa_label)) { > + PDBG("%s label %s addr 0x%x added\n", > + __FUNCTION__, ifa->ifa_label, > + ifa->ifa_address); > + insert_ifa(rnicp, ifa); > + } > + endfor_ifa(indev); > + } > +} > + > static void rnic_init(struct iwch_dev *rnicp) > { > PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); > @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r > idr_init(&rnicp->qpidr); > idr_init(&rnicp->mmidr); > spin_lock_init(&rnicp->lock); > + INIT_LIST_HEAD(&rnicp->addrlist); > + INIT_LIST_HEAD(&rnicp->listen_eps); > + mutex_init(&rnicp->mutex); > + rnicp->nb.notifier_call = nb_callback; > + populate_addrlist(rnicp); > + register_inetaddr_notifier(&rnicp->nb); > > rnicp->attr.vendor_id = 0x168; > rnicp->attr.vendor_part_id = 7; > @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev > mutex_lock(&dev_mutex); > list_for_each_entry_safe(dev, tmp, &dev_list, entry) { > if (dev->rdev.t3cdev_p == tdev) { > + unregister_inetaddr_notifier(&dev->nb); > + delete_addrlist(dev); > list_del(&dev->entry); > iwch_unregister_device(dev); > cxio_rdev_close(&dev->rdev); > diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h > index caf4e60..7fa0a47 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch.h > +++ b/drivers/infiniband/hw/cxgb3/iwch.h > @@ -36,6 +36,8 @@ #include > #include > #include > #include > +#include > +#include > > #include > > @@ -101,6 +103,11 @@ struct iwch_rnic_attributes { > u32 cq_overflow_detection; > }; > > +struct iwch_addrlist { > + struct list_head 
entry; > + struct in_ifaddr *ifa; > +}; > + > struct iwch_dev { > struct ib_device ibdev; > struct cxio_rdev rdev; > @@ -111,6 +118,10 @@ struct iwch_dev { > struct idr mmidr; > spinlock_t lock; > struct list_head entry; > + struct notifier_block nb; > + struct list_head addrlist; > + struct list_head listen_eps; > + struct mutex mutex; > }; > > static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) > diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c > index 1cdfcd4..954069f 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c > +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c > @@ -1127,23 +1127,149 @@ static int act_open_rpl(struct t3cdev *t > return CPL_RET_BUF_DONE; > } > > -static int listen_start(struct iwch_listen_ep *ep) > +static int wait_for_reply(struct iwch_ep_common *epc) > +{ > + PDBG("%s ep %p waiting\n", __FUNCTION__, epc); > + wait_event(epc->waitq, epc->rpl_done); > + PDBG("%s ep %p done waiting err %d\n", __FUNCTION__, epc, epc->rpl_err); > + return epc->rpl_err; > +} > + > +static struct iwch_listen_entry *alloc_listener(struct iwch_listen_ep *ep, > + __be32 addr) Do you know, that cxgb3 function names suck? :) Especially get_skb(). > +{ > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + struct iwch_listen_entry *le; > + > + le = kmalloc(sizeof *le, GFP_KERNEL); Wants to be sizeof(struct iwch_listen_entry) and in other places too. I skipped rdma internals of the patch, since I do not know it enough to judge, but your approach looks good from core network point of view. Maybe you should automatically create an alias each time new interface is added so that admin would not care about proper aliases? 
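To see which alias names the label check accepts, it can be run in userspace. This is a minimal sketch that copies is_iwarp_label() from the quoted patch; the const qualifier and the assert are additions made here for illustration, and the sample labels in the note below are made up.

```c
#include <assert.h>
#include <string.h>

/*
 * Userspace copy of the check from the quoted patch: a label counts as
 * iWARP-only when the text right after the first ':' starts with "iw".
 */
static int is_iwarp_label(const char *label)
{
	const char *colon;

	assert(label != NULL);
	colon = strchr(label, ':');
	if (colon && !strncmp(colon + 1, "iw", 2))
		return 1;
	return 0;
}
```

Under this check "eth0:iw1" and "eth0:iwarp" match, while "eth0", "eth0:1" and "eth0:i" do not. Any alias whose name happens to start with "iw" after the colon is claimed, which is the concern raised in the review.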
-- Evgeniy Polyakov From hadi at cyberus.ca Fri Sep 14 06:44:00 2007 From: hadi at cyberus.ca (jamal) Date: Fri, 14 Sep 2007 09:44:00 -0400 Subject: [ofa-general] TSO, TCP Cong control etc In-Reply-To: <20070914032055.8f96449b.billfink@mindspring.com> References: <46CF7B13.3020701@psc.edu> <20070826044134.eabd18cf.billfink@mindspring.com> <46D229AA.6020900@psc.edu> <20070826.190420.41652839.davem@davemloft.net> <1188257019.4250.55.camel@localhost> <20070914032055.8f96449b.billfink@mindspring.com> Message-ID: <1189777440.4266.77.camel@localhost> I've changed the subject to match the content. On Fri, 2007-14-09 at 03:20 -0400, Bill Fink wrote: > On Mon, 27 Aug 2007, jamal wrote: > > > Bill: > > who suggested (as per your email) the 75usec value and what was it based > > on measurement-wise? > > Belatedly getting back to this thread. There was a recent myri10ge > patch that changed the default value for tx/rx interrupt coalescing > to 75 usec claiming it was an optimum value for maximum throughput > (and is also mentioned in their external README documentation). I would think such a value would be very specific to the ring size and maybe even the machine in use. > I also did some empirical testing to determine the effect of different > values of TX/RX interrupt coalescing on 10-GigE network performance, > both with TSO enabled and with TSO disabled. The actual test runs > are attached at the end of this message, but the results are summarized > in the following table (network performance in Mbps). >
>                 TX/RX interrupt coalescing in usec (both sides)
>                    0    15    30    45    60    75    90   105
>
> TSO enabled     8909  9682  9716  9725  9739  9745  9688  9648
> TSO disabled    9113  9910  9910  9910  9910  9910  9910  9910
>
> TSO disabled performance is always better than equivalent TSO enabled > performance. With TSO enabled, the optimum performance is indeed at > a TX/RX interrupt coalescing value of 75 usec. 
With TSO disabled, > performance is the full 10-GigE line rate of 9910 Mbps for any value > of TX/RX interrupt coalescing from 15 usec to 105 usec. Interesting results. I think J Heffner made a very compelling description the other day based on your netstat results at the receiver as to what is going on (refer to the comments on stretch ACKs). If the receiver is fixed, then you'd see better numbers from TSO. The 75 microsecs is very benchmarky in my opinion. If i was to pick a different app or different NIC or run on many cpus with many apps doing TSO, i highly doubt that will be the right number. > Here's a retest (5 tests each): > > TSO enabled: > > TCP Cubic (initial_ssthresh set to 0): [..] > TCP Bic (initial_ssthresh set to 0): [..] > > TCP Reno: > [..] > TSO disabled: > > TCP Cubic (initial_ssthresh set to 0): > [..] > TCP Bic (initial_ssthresh set to 0): > [..] > TCP Reno: > [..] > Not too much variation here, and not quite as high results > as previously. BIC seems to be on average better followed by CUBIC followed by Reno. The difference this time may be because you set the ssthresh to 0 (hopefully every run) and so Reno is definitely going to perform worse since it is a lot less aggressive than the other two. > Some further testing reveals that while this > time I mainly get results like (here for TCP Bic with TSO > disabled): > > [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 > 4958.0625 MB / 10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX > > I also sometimes get results like: > > [root at lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16 > 5882.1875 MB / 10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX > not good. > The higher performing results seem to correspond to when there's a > somewhat lower receiver CPU utilization. 
I'm not sure but there > could also have been an effect from running the "-M1460" test after > the 9000 byte jumbo frame test (no jumbo tests were done at all prior > to running the above sets of 5 tests, although I did always discard > an initial "warmup" test, and now that I think about it some of > those initial discarded "warmup" tests did have somewhat anomalously > high results). If you didnt reset the ssthresh on every run, could it have been cached and used on subsequent runs? > > A side note: Although the experimentation reduces the variables (eg > > tying all to CPU0), it would be more exciting to see multi-cpu and > > multi-flow sender effect (which IMO is more real world). > > These systems are intended as test systems for 10-GigE networks, > and as such it's important to get as consistently close to full > 10-GigE line rate as possible, and that's why the interrupts and > nuttcp application are tied to CPU0, with almost all other system > applications tied to CPU1. Sure, good benchmark. You get to know how well you can do. > Now on another system that's intended as a 10-GigE firewall system, > it has 2 Myricom 10-GigE NICs with the interrupts for eth2 tied to > CPU0 and the interrupts for CPU1 tied to CPU1. In IP forwarding > tests of this system, I have basically achieved full bidirectional > 10-GigE line rate IP forwarding with 9000 byte jumbo frames. In forwarding a more meaningful metric would be pps. The cost per packet tends to dominate the results over the cost/byte. 9K jumbo frames at 10G is less than 500Kpps - so i dont see that machine you are using sweating at all. To give you a comparison on a lower end opteron a single CPU i can generate with batching pktgen 1Mpps; Robert says he can do that even without batching on an opteron closer to what you are using. So if you want to run that test, youd need to use incrementally smaller packets. 
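As a rough sanity check on the packet-rate figure above, here is a minimal sketch in C. The helper name frames_per_second is made up for illustration and is not part of nuttcp, pktgen or any tool mentioned in this thread; it also assumes that preamble, inter-frame gap and header overhead can be ignored, so it only gives an upper bound.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from any tool in this thread): upper bound on
 * frames per second for a given link rate and frame size. Assumes the whole
 * link rate carries frame bytes, i.e. it ignores preamble, inter-frame gap
 * and header overhead.
 */
uint64_t frames_per_second(uint64_t link_bps, uint64_t frame_bytes)
{
	assert(frame_bytes > 0);		/* avoid dividing by zero */
	return link_bps / (frame_bytes * 8);	/* integer division, rounds down */
}
```

Under these assumptions, frames_per_second(10000000000ULL, 9000) comes out at roughly 139 Kpps, well under the 500 Kpps bound mentioned above, while 64-byte frames at the same rate would need about 19.5 Mpps. That gap is why smaller packets stress the per-packet cost much harder.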
> If there's some other specific test you'd like to see, and it's not > too difficult to set up and I have some spare time, I'll see what I > can do. Well, the more interesting tests would be to go full throttle on all CPUs you have and target one (or more) receivers. i.e you simulate a real server. Can the utility you have be bound to a cpu? If yes, you should be able to achieve this without much effort. Thanks a lot Bill for the effort. cheers, jamal From FENKES at de.ibm.com Fri Sep 14 06:48:00 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Fri, 14 Sep 2007 15:48:00 +0200 Subject: [ofa-general] Re: [PATCH 02/12] IB/ehca: Add 1 is not longer needed because of firmware interface change In-Reply-To: Message-ID: Roland Dreier wrote on 12.09.2007 22:21:54: > What happens if someone runs the new driver with older firmware? Or > what if someone upgrades the firmware without updating the driver? Thanks for pointing our noses to this. Your comment triggered some further internal discussions about the meaning of the values for the current system implementation. We'll think this one over again and repost the final solution in time for 2.6.24-rc1. If the rest of this patchset is okay with you, could you apply it and leave out this one patch? The patchset will apply cleanly without it. Thanks, Joachim From rdreier at cisco.com Fri Sep 14 09:05:05 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 14 Sep 2007 09:05:05 -0700 Subject: [ofa-general] Re: [PATCH 02/12] IB/ehca: Add 1 is not longer needed because of firmware interface change In-Reply-To: (Joachim Fenkes's message of "Fri, 14 Sep 2007 15:48:00 +0200") References: Message-ID: > If the rest of this patchset is okay with you, could you apply it and > leave out this one patch? The patchset will apply cleanly without it. Yes, no problem, I'll drop this patch for now. - R. 
From rdreier at cisco.com Fri Sep 14 09:06:56 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 14 Sep 2007 09:06:56 -0700 Subject: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. In-Reply-To: <20070914130941.GG18517@2ka.mipt.ru> (Evgeniy Polyakov's message of "Fri, 14 Sep 2007 17:09:41 +0400") References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <20070914130941.GG18517@2ka.mipt.ru> Message-ID: > Maybe you should automatically create an alias each time new interface > is added so that admin would not care about proper aliases? I agree that makes much more sense from a user interface point of view. Unfortunately an alias without an address doesn't make sense, so there doesn't seem to be a way to implement that. - R. From rdreier at cisco.com Fri Sep 14 09:09:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 14 Sep 2007 09:09:13 -0700 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: (Shirley Ma's message of "Thu, 13 Sep 2007 15:16:40 -0700") References: Message-ID: > The patch is just needed to pick up broadcast MTU size instead of hard > coding 2K right now. SKB allocation shouldn't be different with Ethernet > Jambo Frame and IPoIB-CM which 64K MTU. I don't understand why it's > different. Could you please explain this? It's exactly the same problem as ethernet jumbo frames. A web search for '"order 1" failure e1000' might be interesting. IPoIB CM handles this properly by gathering together single pages in skbs' fragment lists. - R. 
From rdreier at cisco.com Fri Sep 14 09:18:01 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 14 Sep 2007 09:18:01 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <1189724358.9540.113.camel@dell> (Michael Chan's message of "Thu, 13 Sep 2007 15:59:18 -0700") References: <46E97BB0.9030106@opengridcomputing.com> <1189724358.9540.113.camel@dell> Message-ID: > > I've been meaning to track down the bnx2 iscsi offload patch to look > > and see if this issue is addressed, since the same problem seems to > > exist: it seems an iscsi connection and a main stack tcp connection > > might share the same 4-tuple unless something is done to avoid that > > happening. > iSCSI does not do passive listens, only active connections to the > target. But you're right, the port space is still shared between iSCSI > and the main stack. We currently rely on user apps binding to the main > stack to reserve certain ephemeral ports, and telling the iSCSI driver > which ports to use. Got it... I wasn't thinking that clearly, but it is clear that a full 4-tuple collision with only active connections is quite unlikely. I guess you would have to make both an offloaded and a non-offloaded iSCSI connection to the same target and get really unlucky with ephemeral port allocation. So in practice I guess it's not an issue at all with your driver yet. However, do you have any plans to support iSCSI offload for targets? Also, looking at the first CNIC patch, I can't help but notice that you seem to have at least some support for iWARP there. How does the CNIC look? Does it share the same interface/addresses as the non-offload NIC, or does it create a completely separate netdevice? I want to make sure that whatever solution we come up with for cxgb3 doesn't cause problems for you. And of course if you have a better idea than what Steve has come up with, that would be great :) - R. 
From sean.hefty at intel.com Fri Sep 14 10:45:23 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 14 Sep 2007 10:45:23 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com> Message-ID: <000001c7f6f7$074584e0$9c98070a@amr.corp.intel.com> >OK -- just to make sure I'm understanding what you're saying: have you >confirmed that your proposed patches actually fix the issue? Not directly. I cannot easily test kernel patches on our larger, production clusters. We've seen the issue with specific applications on 512 and 1024 cores, but I've only been able to test the patch on a 48-core cluster. I have verified that it successfully increases the timeout to where it *should* work, but cannot absolutely confirm that it will fix the problem. I'm unlikely to know that until the production clusters move to an OFED release (1.3?) containing this patch. - Sean From mshefty at ichips.intel.com Fri Sep 14 11:01:15 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 14 Sep 2007 11:01:15 -0700 Subject: [ofa-general] [PATCH][RFC] P_Key support for umad In-Reply-To: References: Message-ID: <46EACC6B.5060702@ichips.intel.com> I didn't notice any issues with this patch, or anything missing from it. Reviewed-by: Sean Hefty From gjong at yahoo-inc.com Fri Sep 14 12:07:41 2007 From: gjong at yahoo-inc.com (Gary Jong) Date: Fri, 14 Sep 2007 12:07:41 -0700 Subject: [ofa-general] User Experience Designers - Yahoo! - California Message-ID: Hal: I'm a member of the Yahoo! Inc Talent Acquisition team and have been able to make contact with you through Internet research techniques. I'm contacting you to inform you of variety of exciting career opportunities within Yahoo! which I thought may be of interest to you. 
As a result of continued growth and the creation of new business models at Yahoo!, we're looking for UE Designers with both visual and interaction design skills to work across Yahoo! on short-term design projects in various business units. Our Designers create various artifacts to support the design process, including, personas, storyboards, and/or design representations of appropriate fidelity, then iterate the interfaces and document the designs with interaction and/or visual specifications as required by the various projects. This role offers the opportunity to collaborate with a variety of stakeholders and highly talented colleagues to design and assess proposed solutions, all in the interest of creating the best user experience for hundreds of millions of Yahoo! users. To perform these roles, we seek a BS/MS in Graphic Design, Interaction Design, HCI or a related field, with a minimum of 3 years of experience as a key member of a User Experience team. This experience should include significant involvement in the complete product development life cycle of several successfully launched web and/or software applications. We'd also like to see familiarity with field and lab-based usability research methodologies, the ability to create prototypes at a variety of levels, and a solid understanding of web application and website design with working knowledge of HTML. We have a range of opportunities available for designers in both Northern and Southern California. At Yahoo! we offer tremendous overall breadth, which translates to more opportunity to impact a wider diversity of areas from mail to music to social media to search. At Yahoo!, we develop products very quickly and release them to large audiences so that services ARE NOT in a perpetual state of beta. Since you've been involved with similar activities during your career, I thought there was a reasonable chance that you may be interested in exploring these opportunities with us. 
If so, we'd be interested in learning more about your background and interests relative to these positions, and provide you with additional information about the organization and our company. Thank you for considering this inquiry. I'll look forward to your response! Regards, Gary Jong Talent Scout Yahoo! -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From randy.dunlap at oracle.com Fri Sep 14 11:37:09 2007 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Fri, 14 Sep 2007 11:37:09 -0700 Subject: [ofa-general] Re: [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching In-Reply-To: <20070914090118.17589.43799.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070914090118.17589.43799.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070914113709.80baba4d.randy.dunlap@oracle.com> On Fri, 14 Sep 2007 14:31:18 +0530 Krishna Kumar wrote: > Add Documentation describing batching skb xmit capability. > > Signed-off-by: Krishna Kumar > --- > batching_skb_xmit.txt | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 files changed, 107 insertions(+) > > diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt > --- org/Documentation/networking/batching_skb_xmit.txt 1970-01-01 05:30:00.000000000 +0530 > +++ new/Documentation/networking/batching_skb_xmit.txt 2007-09-14 10:25:36.000000000 +0530 > @@ -0,0 +1,107 @@ > + > +Section 4: Nitty gritty details for driver writers > +-------------------------------------------------- > + > + Batching is enabled from core networking stack only from softirq > + context (NET_TX_SOFTIRQ), and dev_queue_xmit() doesn't use batching. > + > + This leads to the following situation: > + A skb was not sent out as either driver lock was contested or > + the device was blocked. When the softirq handler runs, it > + moves all skbs from the device queue to the batch list, but > + then it too could fail to send due to lock contention. The > + next xmit (of a single skb) called from dev_queue_xmit() will > + not use batching and try to xmit skb, while previous skbs are > + still present in the batch list. This results in the receiver > + getting out-of-order packets, and in case of TCP the sender > + would have unnecessary retransmissions. 
> + > + To fix this problem, error cases where driver xmit gets called with a > + skb must code as follows: > + 1. If driver xmit cannot get tx lock, return NETDEV_TX_LOCKED > + as usual. This allows qdisc to requeue the skb. > + 2. If driver xmit got the lock but failed to send the skb, it > + should return NETDEV_TX_BUSY but before that it should have > + queue'd the skb to the batch list. In this case, the qdisc queued > + does not requeue the skb. and then Acked-by: Randy Dunlap Thanks, --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** From mchan at broadcom.com Fri Sep 14 14:09:46 2007 From: mchan at broadcom.com (Michael Chan) Date: Fri, 14 Sep 2007 14:09:46 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: <46E97BB0.9030106@opengridcomputing.com> <1189724358.9540.113.camel@dell> Message-ID: <1189804186.9540.173.camel@dell> On Fri, 2007-09-14 at 09:18 -0700, Roland Dreier wrote: > However, do you have any plans to support iSCSI offload for targets? > Also, looking at the first CNIC patch, I can't help but notice that > you seem to have at least some support for iWARP there. How does the > CNIC look? Does it share the same interface/addresses as the > non-offload NIC, or does it create a completely separate netdevice? We will support iWARP in the future and it should be similar to the way we do iSCSI - using the same interface/addresses as the bnx2 NIC. > > I want to make sure that whatever solution we come up with for cxgb3 > doesn't cause problems for you. And of course if you have a better > idea than what Steve has come up with, that would be great :) > We are looking at these discussions with great interest. If we have any new ideas, we'll definitely let everyone know. Thanks. 
From fubar at us.ibm.com Fri Sep 14 16:40:21 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:21 -0700 Subject: [ofa-general] [PATCH 02/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: <11898132322950-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> Message-ID: <1189813234208-git-send-email-fubar@us.ibm.com> From: Moni Shoua When the bonding device enslaves IPoIB devices it takes pointers to functions in the ib_ipoib module. This is fine as long as the ib_ipoib module remains loaded while the references to its functions exist. So, to help bonding do a cleanup on time, when the IPoIB net device is a slave of a bonding master, let the master know that the IPoIB device is about to unregister (but before calling unregister). Signed-off-by: Moni Shoua Acked-by: Jay Vosburgh --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 15 +++++++++++++++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 894b1dc..97a9661 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -48,6 +48,7 @@ #include #include +#include MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); @@ -772,6 +773,18 @@ static void ipoib_timeout(struct net_device *dev) /* XXX reset QP, etc. 
*/ } +static int ipoib_slave_detach(struct net_device *dev) +{ + int ret = 0; + if (dev->flags & IFF_SLAVE) { + dev->priv_flags |= IFF_SLAVE_DETACH; + rtnl_lock(); + ret = call_netdevice_notifiers(NETDEV_CHANGE, dev); + rtnl_unlock(); + } + return ret; +} + static int ipoib_hard_header(struct sk_buff *skb, struct net_device *dev, unsigned short type, @@ -921,6 +934,7 @@ void ipoib_dev_cleanup(struct net_device *dev) /* Delete any child interfaces first */ list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + ipoib_slave_detach(cpriv->dev); unregister_netdev(cpriv->dev); ipoib_dev_cleanup(cpriv->dev); free_netdev(cpriv->dev); @@ -1208,6 +1222,7 @@ static void ipoib_remove_one(struct ib_device *device) ib_unregister_event_handler(&priv->event_handler); flush_scheduled_work(); + ipoib_slave_detach(priv->dev); unregister_netdev(priv->dev); ipoib_dev_cleanup(priv->dev); free_netdev(priv->dev); -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:20 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:20 -0700 Subject: [ofa-general] [PATCH 01/11] IB/ipoib: Export call to call_netdevice_notifiers and add new private flag In-Reply-To: <11898132301664-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> Message-ID: <11898132322950-git-send-email-fubar@us.ibm.com> From: Moni Shoua Export the call to raw_notifier_call_chain so modules can send notifications on netdev events to the netdev_chain. Add IFF_SLAVE_DETACH to the list of priv_flags for net_device. This flag is set by a slave that is about to unregister from the kernel. Both changes are used in bonding slaves that wish to inform the bonding master about the coming detachment. 
Signed-off-by: Moni Shoua Acked-by: Jay Vosburgh --- include/linux/if.h | 1 + net/core/dev.c | 1 + 2 files changed, 2 insertions(+), 0 deletions(-) diff --git a/include/linux/if.h b/include/linux/if.h index 32bf419..b302b22 100644 --- a/include/linux/if.h +++ b/include/linux/if.h @@ -61,6 +61,7 @@ #define IFF_MASTER_ALB 0x10 /* bonding master, balance-alb. */ #define IFF_BONDING 0x20 /* bonding master or slave */ #define IFF_SLAVE_NEEDARP 0x40 /* need ARPs for validation */ +#define IFF_SLAVE_DETACH 0x80 /* slave is about to unregister */ #define IF_GET_IFACE 0x0001 /* for querying only */ #define IF_GET_PROTO 0x0002 diff --git a/net/core/dev.c b/net/core/dev.c index a76021c..5322add 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1148,6 +1148,7 @@ int call_netdevice_notifiers(unsigned long val, void *v) { return raw_notifier_call_chain(&netdev_chain, val, v); } +EXPORT_SYMBOL(call_netdevice_notifiers); /* When > 0 there are consumers of rx skb time stamps */ static atomic_t netstamp_needed = ATOMIC_INIT(0); -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:19 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:19 -0700 Subject: [ofa-general] [PATCH 00/11] IPoIB support for bonding Message-ID: <11898132301664-git-send-email-fubar@us.ibm.com> Following is patch set to provide IPoIB support for bonding in active-backup mode. Patches 1 - 10 were originally posted by Moni Shoua . The changes look reasonable to me, but others (for IB and net/core changes) probably need to ack. Patch 11 modifies the IB "don't copy MAC to all slaves" code in bonding to also be optional for ethernet devices; this is occasionally useful. Original preface for patches 1 - 10 from Moni Shoua : This patch series is the fourth version (see below link to V3) of the suggested changes to the bonding driver so it would be able to support non ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. 
The motivation is to enable the bonding driver in its HA mode to work with the IP over Infiniband (IPoIB) driver. With these patches I was able to enslave IPoIB netdevices and run TCP, UDP, IP (UDP) Multicast and ICMP traffic with fail-over and fail-back working fine. The working environment was the net-2.6 git. Moreover, as IPoIB is also the IB ARP provider for the RDMA CM driver which is used by native IB ULPs whose addressing scheme is based on IP (e.g. iSER, SDP, Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices **enables** HA for these ULPs. This holds because, when the ULP is informed by the IB HW of the failure of the current IB connection, it just needs to reconnect, and the bonding device will now issue the IB ARP over the active IPoIB slave. This series also includes patches to the IPoIB driver that fix some neighboring-related issues. Major changes from the previous version: 1) Addressing the issue of safety when unloading the IPoIB module before the bonding module 2) style changes Links to earlier discussion: 1. A discussion in netdev about bonding support for IPoIB. http://lists.openwall.net/netdev/2006/11/30/46 2. A discussion in openfabrics regarding changes in the IPoIB that enable using it as a slave for bonding. 
http://lists.openfabrics.org/pipermail/general/2007-July/038914.html From fubar at us.ibm.com Fri Sep 14 16:40:22 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:22 -0700 Subject: [ofa-general] [PATCH 03/11] IB/ipoib: Bound the net device to the ipoib_neigh structue In-Reply-To: <1189813234208-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> Message-ID: <11898132352341-git-send-email-fubar@us.ibm.com> From: Moni Shoua IPoIB uses a two layer neighboring scheme, such that for each struct neighbour whose device is an ipoib one, there is a struct ipoib_neigh buddy which is created on demand at the tx flow by an ipoib_neigh_alloc(skb->dst->neighbour) call. When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device. On the tx flow the bonding code gets an skb such that skb->dev points to the master device, it changes this skb to point on the slave device and calls the slave hard_start_xmit function. Under this scheme, ipoib_neigh_destructor assumption that for each struct neighbour it gets, n->dev is an ipoib device and hence netdev_priv(n->dev) can be casted to struct ipoib_dev_priv is buggy. To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one, when n->dev->flags has the IFF_MASTER bit set. 
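The device selection described above can be sketched as a small user-space model. Everything prefixed with sketch_ or SKETCH_ is a simplified stand-in invented for illustration, not the kernel's struct net_device, struct neighbour or struct ipoib_neigh, and the function only models which device the cleanup path would take its private data from.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IFF_MASTER 0x400u	/* stand-in for the kernel's IFF_MASTER */

struct sketch_netdev {			/* stand-in for struct net_device */
	unsigned int flags;
	const char *name;
};

struct sketch_ipoib_neigh {		/* stand-in for struct ipoib_neigh */
	struct sketch_netdev *dev;	/* the added field: device seen on tx */
};

struct sketch_neighbour {		/* stand-in for struct neighbour */
	struct sketch_netdev *dev;		/* bonding master when bonded */
	struct sketch_ipoib_neigh *ipoib;	/* NULL until the first tx */
};

/*
 * Return the IPoIB device whose private data the cleanup path may use,
 * or NULL when the neighbour belongs to a bonding master and no
 * ipoib_neigh was ever attached (nothing to clean up).
 */
struct sketch_netdev *sketch_cleanup_dev(const struct sketch_neighbour *n)
{
	assert(n != NULL && n->dev != NULL);
	if (n->dev->flags & SKETCH_IFF_MASTER)
		return n->ipoib ? n->ipoib->dev : NULL;
	return n->dev;	/* plain IPoIB case: n->dev is already the right device */
}
```

The point of the sketch is only that the master device is never treated as an IPoIB device: when the IFF_MASTER bit is set, the slave recorded at allocation time is used instead.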
Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz Acked-by: Jay Vosburgh --- drivers/infiniband/ulp/ipoib/ipoib.h | 4 +++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 17 +++++++++++++++-- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++- 3 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 285c143..a13730c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -328,6 +328,7 @@ struct ipoib_neigh { struct sk_buff_head queue; struct neighbour *neighbour; + struct net_device *dev; struct list_head list; }; @@ -344,7 +345,8 @@ static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) INFINIBAND_ALEN, sizeof(void *)); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh, + struct net_device *dev); void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 97a9661..cb26cfd 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -511,7 +511,7 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = ipoib_neigh_alloc(skb->dst->neighbour); + neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -830,6 +830,17 @@ static void ipoib_neigh_cleanup(struct neighbour *n) unsigned long flags; struct ipoib_ah *ah = NULL; + if (n->dev->flags & IFF_MASTER) { + /* n->dev is not an IPoIB device and we have + to take priv from elsewhere */ + neigh = *to_ipoib_neigh(n); + if (neigh) { + priv = netdev_priv(neigh->dev); + ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n", + n->dev->name); 
+ } else + return; + } ipoib_dbg(priv, "neigh_cleanup for %06x " IPOIB_GID_FMT "\n", IPOIB_QPN(n->ha), @@ -851,7 +862,8 @@ static void ipoib_neigh_cleanup(struct neighbour *n) ipoib_put_ah(ah); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour, + struct net_device *dev) { struct ipoib_neigh *neigh; @@ -860,6 +872,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) return NULL; neigh->neighbour = neighbour; + neigh->dev = dev; *to_ipoib_neigh(neighbour) = neigh; skb_queue_head_init(&neigh->queue); ipoib_cm_set(neigh, NULL); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index aae3670..ed0f0bb 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -727,7 +727,8 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour, + skb->dev); if (neigh) { kref_get(&mcast->ah->ref); -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:23 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:23 -0700 Subject: [ofa-general] [PATCH 04/11] IB/ipoib: Verify address handle validity on send In-Reply-To: <11898132352341-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <11898132352341-git-send-email-fubar@us.ibm.com> Message-ID: <11898132372856-git-send-email-fubar@us.ibm.com> From: Moni Shoua When the bonding device senses a carrier loss of its active slave it replaces that slave with a new one. 
In between the times when the carrier of an IPoIB device goes down and ipoib_neigh is destroyed, it is possible that the bonding driver will send a packet on a new slave that uses an old ipoib_neigh. This patch detects this case and prevents it from happening. Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz Acked-by: Jay Vosburgh --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cb26cfd..6c4e9fb 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -686,9 +686,10 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) goto out; } } else if (neigh->ah) { - if (unlikely(memcmp(&neigh->dgid.raw, + if (unlikely((memcmp(&neigh->dgid.raw, skb->dst->neighbour->ha + 4, - sizeof(union ib_gid)))) { + sizeof(union ib_gid))) || + (neigh->dev != dev))) { spin_lock(&priv->lock); /* * It's safe to call ipoib_put_ah() inside -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:24 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:24 -0700 Subject: [ofa-general] [PATCH 05/11] net/bonding: Enable bonding to enslave non ARPHRD_ETHER In-Reply-To: <11898132372856-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <11898132352341-git-send-email-fubar@us.ibm.com> <11898132372856-git-send-email-fubar@us.ibm.com> Message-ID: <11898132411426-git-send-email-fubar@us.ibm.com> From: Moni Shoua This patch changes some of the bond netdevice attributes and functions to be that of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. 
Basically it overrides those setting done by ether_setup(), which are netdevice **type** dependent and hence might be not appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves from dissimilar ether types, as was concluded over the v1 discussion. IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3 bytes IB QP (Queue Pair) number and 16 bytes IB port GID (Global ID) of the port this IPoIB device is bounded to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (i have omitted here some details which are not important for the bonding RFC). Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz Acked-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 files changed, 39 insertions(+), 0 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 1afda32..13ec73d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1237,6 +1237,26 @@ static int bond_compute_features(struct bonding *bond) return 0; } + +static void bond_setup_by_slave(struct net_device *bond_dev, + struct net_device *slave_dev) +{ + bond_dev->hard_header = slave_dev->hard_header; + bond_dev->rebuild_header = slave_dev->rebuild_header; + bond_dev->hard_header_cache = slave_dev->hard_header_cache; + bond_dev->header_cache_update = slave_dev->header_cache_update; + bond_dev->hard_header_parse = slave_dev->hard_header_parse; + + bond_dev->neigh_setup = slave_dev->neigh_setup; + + bond_dev->type = slave_dev->type; + bond_dev->hard_header_len = slave_dev->hard_header_len; + bond_dev->addr_len = slave_dev->addr_len; + + memcpy(bond_dev->broadcast, slave_dev->broadcast, + slave_dev->addr_len); +} + /* enslave device to bond device */ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) { @@ -1311,6 +1331,25 @@ int bond_enslave(struct net_device *bond_dev, struct net_device 
*slave_dev) goto err_undo_flags; } + /* set bonding device ether type by slave - bonding netdevices are + * created with ether_setup, so when the slave type is not ARPHRD_ETHER + * there is a need to override some of the type dependent attribs/funcs. + * + * bond ether type mutual exclusion - don't allow slaves of dissimilar + * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond + */ + if (bond->slave_cnt == 0) { + if (slave_dev->type != ARPHRD_ETHER) + bond_setup_by_slave(bond_dev, slave_dev); + } else if (bond_dev->type != slave_dev->type) { + printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different " + "from other slaves (%d), can not enslave it.\n", + slave_dev->name, + slave_dev->type, bond_dev->type); + res = -EINVAL; + goto err_undo_flags; + } + if (slave_dev->set_mac_address == NULL) { printk(KERN_ERR DRV_NAME ": %s: Error: The slave device you specified does " -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:25 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:25 -0700 Subject: [ofa-general] [PATCH 06/11] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address() In-Reply-To: <11898132411426-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <11898132352341-git-send-email-fubar@us.ibm.com> <11898132372856-git-send-email-fubar@us.ibm.com> <11898132411426-git-send-email-fubar@us.ibm.com> Message-ID: <1189813242354-git-send-email-fubar@us.ibm.com> From: Moni Shoua This patch allows for enslaving netdevices which do not support the set_mac_address() function. In that case the bond mac address is the one of the active slave, where remote peers are notified on the mac address (neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code). 
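The behaviour described above can be illustrated outside the kernel with a small user-space model. This is only a sketch under stated assumptions: the types `model_dev` and `model_bond`, the functions `model_enslave()` and `model_change_active()`, and the fixed `MODEL_ADDR_LEN` are hypothetical names invented for illustration; only the `do_set_mac_addr` flag and the copy-on-fail-over idea are taken from the patch.

```c
#include <assert.h>
#include <string.h>

#define MODEL_ADDR_LEN 6 /* hypothetical; the patch uses the device's addr_len */

/* Hypothetical stand-in for a slave net_device. */
struct model_dev {
    unsigned char addr[MODEL_ADDR_LEN];
    int can_set_mac; /* models slave_dev->set_mac_address != NULL */
};

/* Hypothetical stand-in for struct bonding. */
struct model_bond {
    unsigned char addr[MODEL_ADDR_LEN];
    int do_set_mac_addr; /* same meaning as the flag added by the patch */
    int slave_cnt;
    int grat_arp_sent;   /* counts modelled gratuitous ARPs */
};

/* Returns 0 on success, -1 if the slave cannot join this bond. */
int model_enslave(struct model_bond *bond, struct model_dev *slave)
{
    if (!slave->can_set_mac) {
        if (bond->slave_cnt == 0)
            bond->do_set_mac_addr = 0; /* first slave decides the policy */
        else if (bond->do_set_mac_addr)
            return -1;                 /* bond already rewrites slave MACs */
    }
    if (bond->do_set_mac_addr)
        memcpy(slave->addr, bond->addr, MODEL_ADDR_LEN);
    bond->slave_cnt++;
    return 0;
}

/* On fail-over the bond adopts the new active slave's address. */
void model_change_active(struct model_bond *bond,
                         const struct model_dev *new_active)
{
    if (new_active && !bond->do_set_mac_addr)
        memcpy(bond->addr, new_active->addr, MODEL_ADDR_LEN);
    bond->grat_arp_sent++; /* peers learn the new address via gratuitous ARP */
}
```

In this model a bond whose first slave cannot change its MAC never touches slave addresses; the bond address simply follows whichever slave is active.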
Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz Acked-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 87 ++++++++++++++++++++++++++------------ drivers/net/bonding/bonding.h | 1 + 2 files changed, 60 insertions(+), 28 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 13ec73d..d937bae 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1095,6 +1095,14 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active) if (new_active) { bond_set_slave_active_flags(new_active); } + + /* when bonding does not set the slave MAC address, the bond MAC + * address is the one of the active slave. + */ + if (new_active && !bond->do_set_mac_addr) + memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, + new_active->dev->addr_len); + bond_send_gratuitous_arp(bond); } } @@ -1351,13 +1359,22 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) } if (slave_dev->set_mac_address == NULL) { - printk(KERN_ERR DRV_NAME - ": %s: Error: The slave device you specified does " - "not support setting the MAC address. " - "Your kernel likely does not support slave " - "devices.\n", bond_dev->name); - res = -EOPNOTSUPP; - goto err_undo_flags; + if (bond->slave_cnt == 0) { + printk(KERN_WARNING DRV_NAME + ": %s: Warning: The first slave device you " + "specified does not support setting the MAC " + "address. This bond MAC address would be that " + "of the active slave.\n", bond_dev->name); + bond->do_set_mac_addr = 0; + } else if (bond->do_set_mac_addr) { + printk(KERN_ERR DRV_NAME + ": %s: Error: The slave device you specified " + "does not support setting the MAC addres,." + "but this bond uses this practice. 
\n" + , bond_dev->name); + res = -EOPNOTSUPP; + goto err_undo_flags; + } } new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL); @@ -1378,16 +1395,18 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) */ memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); - /* - * Set slave to master's mac address. The application already - * set the master's mac address to that of the first slave - */ - memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); - addr.sa_family = slave_dev->type; - res = dev_set_mac_address(slave_dev, &addr); - if (res) { - dprintk("Error %d calling set_mac_address\n", res); - goto err_free; + if (bond->do_set_mac_addr) { + /* + * Set slave to master's mac address. The application already + * set the master's mac address to that of the first slave + */ + memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); + addr.sa_family = slave_dev->type; + res = dev_set_mac_address(slave_dev, &addr); + if (res) { + dprintk("Error %d calling set_mac_address\n", res); + goto err_free; + } } res = netdev_set_master(slave_dev, bond_dev); @@ -1612,9 +1631,11 @@ err_close: dev_close(slave_dev); err_restore_mac: - memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } err_free: kfree(new_slave); @@ -1792,10 +1813,12 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev) /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address */ - memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address */ + memcpy(addr.sa_data, slave->perm_hwaddr, 
ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB | IFF_SLAVE_INACTIVE | IFF_BONDING | @@ -1882,10 +1905,12 @@ static int bond_release_all(struct net_device *bond_dev) /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address*/ - memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address*/ + memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB | IFF_SLAVE_INACTIVE); @@ -3922,6 +3947,9 @@ static int bond_set_mac_address(struct net_device *bond_dev, void *addr) dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None")); + if (!bond->do_set_mac_addr) + return -EOPNOTSUPP; + if (!is_valid_ether_addr(sa->sa_data)) { return -EADDRNOTAVAIL; } @@ -4312,6 +4340,9 @@ static int bond_init(struct net_device *bond_dev, struct bond_params *params) bond_create_proc_entry(bond); #endif + /* set do_set_mac_addr to true on startup */ + bond->do_set_mac_addr = 1; + list_add_tail(&bond->bond_list, &bond_dev_list); return 0; diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index 6dcbd25..700d40a 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -185,6 +185,7 @@ struct bonding { struct timer_list mii_timer; struct timer_list arp_timer; s8 kill_timers; + s8 do_set_mac_addr; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:26 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:26 -0700 Subject: [ofa-general] [PATCH 07/11] net/bonding: Enable IP 
multicast for bonding IPoIB devices In-Reply-To: <1189813242354-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <11898132352341-git-send-email-fubar@us.ibm.com> <11898132372856-git-send-email-fubar@us.ibm.com> <11898132411426-git-send-email-fubar@us.ibm.com> <1189813242354-git-send-email-fubar@us.ibm.com> Message-ID: <11898132441599-git-send-email-fubar@us.ibm.com> From: Moni Shoua Allow enslaving devices when the bonding device is not up. In the discussion of the previous post this seemed to be the cleanest way to go, and it is not expected to cause instabilities. Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (e.g. the all-hosts group 224.0.0.1). Now, since ether_setup() has set the bonding device type to ARPHRD_ETHER and the address length to ETH_ALEN, the net core code computes a wrong multicast link address.
This is because ip_eth_mc_map() is called for those early joins, whereas for multicast joins taking place after the enslavement a different ip_xxx_mc_map() is called (e.g. ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND). Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz Acked-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 5 +++-- drivers/net/bonding/bond_sysfs.c | 6 ++---- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index d937bae..a1fe87a 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) /* bond must be initialized by bond_open() before enslaving */ if (!(bond_dev->flags & IFF_UP)) { - dprintk("Error, master_dev is not up\n"); - return -EPERM; + printk(KERN_WARNING DRV_NAME + " %s: master_dev is not up in bond_enslave\n", + bond_dev->name); } /* already enslaved */ diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 9afd172..073841f 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -265,11 +265,9 @@ static ssize_t bonding_store_slaves(struct device *d, /* Quick sanity check -- is the bond interface up? */ if (!(bond->dev->flags & IFF_UP)) { - printk(KERN_ERR DRV_NAME - ": %s: Unable to update slaves because interface is down.\n", + printk(KERN_WARNING DRV_NAME + ": %s: doing slave updates when interface is down.\n", bond->dev->name); - ret = -EPERM; - goto out; } /* Note: We can't hold bond->lock here, as bond_create grabs it.
*/ -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:27 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:27 -0700 Subject: [ofa-general] [PATCH 08/11] net/bonding: Handle wrong assumptions that slave is always an Ethernet device In-Reply-To: <11898132441599-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <11898132352341-git-send-email-fubar@us.ibm.com> <11898132372856-git-send-email-fubar@us.ibm.com> <11898132411426-git-send-email-fubar@us.ibm.com> <1189813242354-git-send-email-fubar@us.ibm.com> <11898132441599-git-send-email-fubar@us.ibm.com> Message-ID: <11898132452802-git-send-email-fubar@us.ibm.com> From: Moni Shoua bonding sometimes uses Ethernet constants (such as MTU and address length) which are not good when it enslaves non Ethernet devices (such as InfiniBand). Signed-off-by: Moni Shoua Acked-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 3 ++- drivers/net/bonding/bond_sysfs.c | 19 +++++++++++++------ drivers/net/bonding/bonding.h | 1 + 3 files changed, 16 insertions(+), 7 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index a1fe87a..9ff2cf6 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1224,7 +1224,8 @@ static int bond_compute_features(struct bonding *bond) struct slave *slave; struct net_device *bond_dev = bond->dev; unsigned long features = bond_dev->features; - unsigned short max_hard_header_len = ETH_HLEN; + unsigned short max_hard_header_len = max((u16)ETH_HLEN, + bond_dev->hard_header_len); int i; features &= ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES); diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 073841f..71db5d9 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -163,9 +163,7 @@ static ssize_t 
bonding_store_bonds(struct class *cls, const char *buffer, size_t printk(KERN_INFO DRV_NAME ": %s is being deleted...\n", bond->dev->name); - bond_deinit(bond->dev); - bond_destroy_sysfs_entry(bond); - unregister_netdevice(bond->dev); + bond_destroy(bond); rtnl_unlock(); goto out; } @@ -259,6 +257,7 @@ static ssize_t bonding_store_slaves(struct device *d, char command[IFNAMSIZ + 1] = { 0, }; char *ifname; int i, res, found, ret = count; + u32 original_mtu; struct slave *slave; struct net_device *dev = NULL; struct bonding *bond = to_bond(d); @@ -324,6 +323,7 @@ static ssize_t bonding_store_slaves(struct device *d, } /* Set the slave's MTU to match the bond */ + original_mtu = dev->mtu; if (dev->mtu != bond->dev->mtu) { if (dev->change_mtu) { res = dev->change_mtu(dev, @@ -338,6 +338,9 @@ static ssize_t bonding_store_slaves(struct device *d, } rtnl_lock(); res = bond_enslave(bond->dev, dev); + bond_for_each_slave(bond, slave, i) + if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) + slave->original_mtu = original_mtu; rtnl_unlock(); if (res) { ret = res; @@ -350,13 +353,17 @@ static ssize_t bonding_store_slaves(struct device *d, bond_for_each_slave(bond, slave, i) if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) { dev = slave->dev; + original_mtu = slave->original_mtu; break; } if (dev) { printk(KERN_INFO DRV_NAME ": %s: Removing slave %s\n", bond->dev->name, dev->name); rtnl_lock(); - res = bond_release(bond->dev, dev); + if (bond->setup_by_slave) + res = bond_release_and_destroy(bond->dev, dev); + else + res = bond_release(bond->dev, dev); rtnl_unlock(); if (res) { ret = res; @@ -364,9 +371,9 @@ static ssize_t bonding_store_slaves(struct device *d, } /* set the slave MTU to the default */ if (dev->change_mtu) { - dev->change_mtu(dev, 1500); + dev->change_mtu(dev, original_mtu); } else { - dev->mtu = 1500; + dev->mtu = original_mtu; } } else { diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index 700d40a..b7b4f4a 100644 --- 
a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -156,6 +156,7 @@ struct slave { s8 link; /* one of BOND_LINK_XXXX */ s8 state; /* one of BOND_STATE_XXXX */ u32 original_flags; + u32 original_mtu; u32 link_failure_count; u16 speed; u8 duplex; -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:28 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:28 -0700 Subject: [ofa-general] [PATCH 9/11] net/bonding: Delay sending of gratuitous ARP to avoid failure In-Reply-To: <11898132452802-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <11898132352341-git-send-email-fubar@us.ibm.com> <11898132372856-git-send-email-fubar@us.ibm.com> <11898132411426-git-send-email-fubar@us.ibm.com> <1189813242354-git-send-email-fubar@us.ibm.com> <11898132441599-git-send-email-fubar@us.ibm.com> <11898132452802-git-send-email-fubar@us.ibm.com> Message-ID: <11898132472055-git-send-email-fubar@us.ibm.com> From: Moni Shoua Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit in dev->state field is on. This improves the chances for the arp packet to be transmitted. 
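The deferral described above can be modelled in user space as a flag that is set at fail-over time and drained by a periodic monitor. This is a sketch, not the driver code: `grat_slave`, `grat_bond`, `grat_on_failover()` and `grat_monitor_tick()` are hypothetical names, and `linkwatch_pending` stands in for testing the `__LINK_STATE_LINKWATCH_PENDING` bit in `dev->state`; only `send_grat_arp` and `curr_active_slave` are names from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the active slave's net_device state. */
struct grat_slave {
    int linkwatch_pending; /* models the LINKWATCH_PENDING state bit */
};

/* Hypothetical stand-in for struct bonding. */
struct grat_bond {
    struct grat_slave *curr_active_slave;
    int send_grat_arp; /* deferred-send flag, as in the patch */
    int arps_sent;     /* counts modelled gratuitous ARPs */
};

/* Called when the active slave changes: send now, or remember to send. */
void grat_on_failover(struct grat_bond *bond)
{
    if (bond->curr_active_slave &&
        bond->curr_active_slave->linkwatch_pending)
        bond->send_grat_arp = 1; /* link state not settled yet: defer */
    else
        bond->arps_sent++;       /* safe to transmit immediately */
}

/* Called periodically (the patch does this from the MII monitor). */
void grat_monitor_tick(struct grat_bond *bond)
{
    if (!bond->send_grat_arp)
        return;
    if (bond->curr_active_slave &&
        bond->curr_active_slave->linkwatch_pending)
        return;                  /* still pending: try again next tick */
    bond->arps_sent++;
    bond->send_grat_arp = 0;
}
```

The model sends exactly one gratuitous ARP per fail-over, either immediately or on the first monitor tick after the pending state clears.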
Signed-off-by: Moni Shoua Acked-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 24 +++++++++++++++++++++--- drivers/net/bonding/bonding.h | 1 + 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 9ff2cf6..dfbfb00 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1102,8 +1102,14 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active) if (new_active && !bond->do_set_mac_addr) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); - - bond_send_gratuitous_arp(bond); + if (bond->curr_active_slave && + test_bit(__LINK_STATE_LINKWATCH_PENDING, + &bond->curr_active_slave->dev->state)) { + dprintk("delaying gratuitous arp on %s\n", + bond->curr_active_slave->dev->name); + bond->send_grat_arp = 1; + } else + bond_send_gratuitous_arp(bond); } } @@ -2083,6 +2089,17 @@ void bond_mii_monitor(struct net_device *bond_dev) * program could monitor the link itself if needed. 
*/ + if (bond->send_grat_arp) { + if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, + &bond->curr_active_slave->dev->state)) + dprintk("Needs to send gratuitous arp but not yet\n"); + else { + dprintk("sending delayed gratuitous arp on on %s\n", + bond->curr_active_slave->dev->name); + bond_send_gratuitous_arp(bond); + bond->send_grat_arp = 0; + } + } read_lock(&bond->curr_slave_lock); oldcurrent = bond->curr_active_slave; read_unlock(&bond->curr_slave_lock); @@ -2484,7 +2501,7 @@ static void bond_send_gratuitous_arp(struct bonding *bond) if (bond->master_ip) { bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip, - bond->master_ip, 0); + bond->master_ip, 0); } list_for_each_entry(vlan, &bond->vlan_list, vlan_list) { @@ -4293,6 +4310,7 @@ static int bond_init(struct net_device *bond_dev, struct bond_params *params) bond->current_arp_slave = NULL; bond->primary_slave = NULL; bond->dev = bond_dev; + bond->send_grat_arp = 0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index b7b4f4a..b1cdb1f 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -187,6 +187,7 @@ struct bonding { struct timer_list arp_timer; s8 kill_timers; s8 do_set_mac_addr; + s8 send_grat_arp; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:29 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:29 -0700 Subject: [ofa-general] [PATCH 10/11] net/bonding: Destroy bonding master when last slave is gone In-Reply-To: <11898132472055-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <11898132352341-git-send-email-fubar@us.ibm.com> <11898132372856-git-send-email-fubar@us.ibm.com> 
<11898132411426-git-send-email-fubar@us.ibm.com> <1189813242354-git-send-email-fubar@us.ibm.com> <11898132441599-git-send-email-fubar@us.ibm.com> <11898132452802-git-send-email-fubar@us.ibm.com> <11898132472055-git-send-email-fubar@us.ibm.com> Message-ID: <11898132492312-git-send-email-fubar@us.ibm.com> From: Moni Shoua When bonding enslaves non-Ethernet devices it takes pointers to functions in the module that owns the slaves. In this case it becomes unsafe to keep the bonding master registered after the last slave has been released, because we don't know if the pointers are still valid. Destroying the bond when slave_cnt is zero ensures that these functions are not used anymore. Signed-off-by: Moni Shoua Acked-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 45 ++++++++++++++++++++++++++++++++++++++- drivers/net/bonding/bonding.h | 3 ++ 2 files changed, 47 insertions(+), 1 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index dfbfb00..77caca3 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1256,6 +1256,7 @@ static int bond_compute_features(struct bonding *bond) static void bond_setup_by_slave(struct net_device *bond_dev, struct net_device *slave_dev) { + struct bonding *bond = bond_dev->priv; bond_dev->hard_header = slave_dev->hard_header; bond_dev->rebuild_header = slave_dev->rebuild_header; bond_dev->hard_header_cache = slave_dev->hard_header_cache; @@ -1270,6 +1271,7 @@ static void bond_setup_by_slave(struct net_device *bond_dev, memcpy(bond_dev->broadcast, slave_dev->broadcast, slave_dev->addr_len); + bond->setup_by_slave = 1; } /* enslave device to bond device */ @@ -1838,6 +1840,35 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev) } /* +* Destroy a bonding device. +* Must be under rtnl_lock when this function is called.
+*/ +void bond_destroy(struct bonding *bond) +{ + bond_deinit(bond->dev); + bond_destroy_sysfs_entry(bond); + unregister_netdevice(bond->dev); +} + +/* +* First release a slave and than destroy the bond if no more slaves iare left. +* Must be under rtnl_lock when this function is called. +*/ +int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev) +{ + struct bonding *bond = bond_dev->priv; + int ret; + + ret = bond_release(bond_dev, slave_dev); + if ((ret == 0) && (bond->slave_cnt == 0)) { + printk(KERN_INFO DRV_NAME " %s: destroying bond for.\n", + bond_dev->name); + bond_destroy(bond); + } + return ret; +} + +/* * This function releases all slaves. */ static int bond_release_all(struct net_device *bond_dev) @@ -3322,7 +3353,11 @@ static int bond_slave_netdev_event(unsigned long event, struct net_device *slave switch (event) { case NETDEV_UNREGISTER: if (bond_dev) { - bond_release(bond_dev, slave_dev); + dprintk("slave %s unregisters\n", slave_dev->name); + if (bond->setup_by_slave) + bond_release_and_destroy(bond_dev, slave_dev); + else + bond_release(bond_dev, slave_dev); } break; case NETDEV_CHANGE: @@ -3331,6 +3366,13 @@ static int bond_slave_netdev_event(unsigned long event, struct net_device *slave * sets up a hierarchical bond, then rmmod's * one of the slave bonding devices? 
*/ + if (slave_dev->priv_flags & IFF_SLAVE_DETACH) { + dprintk("slave %s detaching\n", slave_dev->name); + if (bond->setup_by_slave) + bond_release_and_destroy(bond_dev, slave_dev); + else + bond_release(bond_dev, slave_dev); + } break; case NETDEV_DOWN: /* @@ -4311,6 +4353,7 @@ static int bond_init(struct net_device *bond_dev, struct bond_params *params) bond->primary_slave = NULL; bond->dev = bond_dev; bond->send_grat_arp = 0; + bond->setup_by_slave = 0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index b1cdb1f..ed0f587 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -188,6 +188,7 @@ struct bonding { s8 kill_timers; s8 do_set_mac_addr; s8 send_grat_arp; + s8 setup_by_slave; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; @@ -295,6 +296,8 @@ static inline void bond_unset_master_alb_flags(struct bonding *bond) struct vlan_entry *bond_next_vlan(struct bonding *bond, struct vlan_entry *curr); int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, struct net_device *slave_dev); int bond_create(char *name, struct bond_params *params, struct bonding **newbond); +void bond_destroy(struct bonding *bond); +int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev); void bond_deinit(struct net_device *bond_dev); int bond_create_sysfs(void); void bond_destroy_sysfs(void); -- 1.5.2-rc2.GIT From fubar at us.ibm.com Fri Sep 14 16:40:30 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Fri, 14 Sep 2007 16:40:30 -0700 Subject: [ofa-general] [PATCH 11/11] bonding: Optionally allow ethernet slaves to keep own MAC In-Reply-To: <11898132492312-git-send-email-fubar@us.ibm.com> References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> 
<11898132352341-git-send-email-fubar@us.ibm.com> <11898132372856-git-send-email-fubar@us.ibm.com> <11898132411426-git-send-email-fubar@us.ibm.com> <1189813242354-git-send-email-fubar@us.ibm.com> <11898132441599-git-send-email-fubar@us.ibm.com> <11898132452802-git-send-email-fubar@us.ibm.com> <11898132472055-git-send-email-fubar@us.ibm.com> <11898132492312-git-send-email-fubar@us.ibm.com> Message-ID: <11898132504165-git-send-email-fubar@us.ibm.com> Update the "don't change MAC of slaves" functionality added in previous changes to be a generic option, rather than something tied to IB devices, as it's occasionally useful for regular ethernet devices as well. Adds "fail_over_mac" option (which is automatically enabled for IB slaves), applicable only to active-backup mode. Includes documentation update. Updates bonding driver version to 3.2.0. Signed-off-by: Jay Vosburgh --- Documentation/networking/bonding.txt | 33 +++++++++++++++++++ drivers/net/bonding/bond_main.c | 57 +++++++++++++++++++++------------ drivers/net/bonding/bond_sysfs.c | 49 +++++++++++++++++++++++++++++ drivers/net/bonding/bonding.h | 6 ++-- 4 files changed, 121 insertions(+), 24 deletions(-) diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 1da5666..1134062 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -281,6 +281,39 @@ downdelay will be rounded down to the nearest multiple. The default value is 0. +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address (the traditional behavior), or, when + enabled, change the bond's MAC address when changing the + active interface (i.e., fail over the MAC address itself). + + Fail over MAC is useful for devices that cannot ever alter + their MAC address, or for devices that refuse incoming + broadcasts with their own source MAC (which interferes with + the ARP monitor). 
+ + The down side of fail over MAC is that every device on the + network must be updated via gratuitous ARP, vs. just updating + a switch or set of switches (which often takes place for any + traffic, not just ARP traffic, if the switch snoops incoming + traffic to update its tables) for the traditional method. If + the gratuitous ARP is lost, communication may be disrupted. + + When fail over MAC is used in conjuction with the mii monitor, + devices which assert link up prior to being able to actually + transmit and receive are particularly susecptible to loss of + the gratuitous ARP, and an appropriate updelay setting may be + required. + + A value of 0 disables fail over MAC, and is the default. A + value of 1 enables fail over MAC. This option is enabled + automatically if the first slave added cannot change its MAC + address. This option may be modified via sysfs only when no + slaves are present in the bond. + + This option was added in bonding version 3.2.0. + lacp_rate Option specifying the rate in which we'll ask our link partner diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 77caca3..c01ff9d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -97,6 +97,7 @@ static char *xmit_hash_policy = NULL; static int arp_interval = BOND_LINK_ARP_INTERV; static char *arp_ip_target[BOND_MAX_ARP_TARGETS] = { NULL, }; static char *arp_validate = NULL; +static int fail_over_mac = 0; struct bond_params bonding_defaults; module_param(max_bonds, int, 0); @@ -130,6 +131,8 @@ module_param_array(arp_ip_target, charp, NULL, 0); MODULE_PARM_DESC(arp_ip_target, "arp targets in n.n.n.n form"); module_param(arp_validate, charp, 0); MODULE_PARM_DESC(arp_validate, "validate src/dst of ARP probes: none (default), active, backup or all"); +module_param(fail_over_mac, int, 0); +MODULE_PARM_DESC(fail_over_mac, "For active-backup, do not set all slaves to the same MAC. 
0 of off (default), 1 for on."); /*----------------------------- Global variables ----------------------------*/ @@ -1099,7 +1102,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active) /* when bonding does not set the slave MAC address, the bond MAC * address is the one of the active slave. */ - if (new_active && !bond->do_set_mac_addr) + if (new_active && bond->params.fail_over_mac) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); if (bond->curr_active_slave && @@ -1371,16 +1374,16 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) if (slave_dev->set_mac_address == NULL) { if (bond->slave_cnt == 0) { printk(KERN_WARNING DRV_NAME - ": %s: Warning: The first slave device you " - "specified does not support setting the MAC " - "address. This bond MAC address would be that " - "of the active slave.\n", bond_dev->name); - bond->do_set_mac_addr = 0; - } else if (bond->do_set_mac_addr) { + ": %s: Warning: The first slave device " + "specified does not support setting the MAC " + "address. Enabling the fail_over_mac option.", + bond_dev->name); + bond->params.fail_over_mac = 1; + } else if (!bond->params.fail_over_mac) { printk(KERN_ERR DRV_NAME - ": %s: Error: The slave device you specified " - "does not support setting the MAC addres,." - "but this bond uses this practice. \n" + ": %s: Error: The slave device specified " + "does not support setting the MAC address, " + "but fail_over_mac is not enabled.\n" , bond_dev->name); res = -EOPNOTSUPP; goto err_undo_flags; @@ -1405,7 +1408,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) */ memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { /* * Set slave to master's mac address. 
The application already * set the master's mac address to that of the first slave @@ -1641,7 +1644,7 @@ err_close: dev_close(slave_dev); err_restore_mac: - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); addr.sa_family = slave_dev->type; dev_set_mac_address(slave_dev, &addr); @@ -1823,7 +1826,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev) /* close slave before restoring its mac address */ dev_close(slave_dev); - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { /* restore original ("permanent") mac address */ memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); addr.sa_family = slave_dev->type; @@ -1944,7 +1947,7 @@ static int bond_release_all(struct net_device *bond_dev) /* close slave before restoring its mac address */ dev_close(slave_dev); - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { /* restore original ("permanent") mac address*/ memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); addr.sa_family = slave_dev->type; @@ -3066,9 +3069,15 @@ static void bond_info_show_master(struct seq_file *seq) curr = bond->curr_active_slave; read_unlock(&bond->curr_slave_lock); - seq_printf(seq, "Bonding Mode: %s\n", + seq_printf(seq, "Bonding Mode: %s", bond_mode_name(bond->params.mode)); + if (bond->params.mode == BOND_MODE_ACTIVEBACKUP && + bond->params.fail_over_mac) + seq_printf(seq, " (fail_over_mac)"); + + seq_printf(seq, "\n"); + if (bond->params.mode == BOND_MODE_XOR || bond->params.mode == BOND_MODE_8023AD) { seq_printf(seq, "Transmit Hash Policy: %s (%d)\n", @@ -4008,8 +4017,12 @@ static int bond_set_mac_address(struct net_device *bond_dev, void *addr) dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None")); - if (!bond->do_set_mac_addr) - return -EOPNOTSUPP; + /* + * If fail_over_mac is enabled, do nothing and return success. + * Returning an error causes ifenslave to fail. 
+ */ + if (bond->params.fail_over_mac) + return 0; if (!is_valid_ether_addr(sa->sa_data)) { return -EADDRNOTAVAIL; @@ -4402,10 +4415,6 @@ static int bond_init(struct net_device *bond_dev, struct bond_params *params) #ifdef CONFIG_PROC_FS bond_create_proc_entry(bond); #endif - - /* set do_set_mac_addr to true on startup */ - bond->do_set_mac_addr = 1; - list_add_tail(&bond->bond_list, &bond_dev_list); return 0; @@ -4739,6 +4748,11 @@ static int bond_check_params(struct bond_params *params) primary = NULL; } + if (fail_over_mac && (bond_mode != BOND_MODE_ACTIVEBACKUP)) + printk(KERN_WARNING DRV_NAME + ": Warning: fail_over_mac only affects " + "active-backup mode.\n"); + /* fill params struct with the proper values */ params->mode = bond_mode; params->xmit_policy = xmit_hashtype; @@ -4750,6 +4764,7 @@ static int bond_check_params(struct bond_params *params) params->use_carrier = use_carrier; params->lacp_fast = lacp_fast; params->primary[0] = 0; + params->fail_over_mac = fail_over_mac; if (primary) { strncpy(params->primary, primary, IFNAMSIZ); diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 71db5d9..a907b68 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -567,6 +567,54 @@ static ssize_t bonding_store_arp_validate(struct device *d, static DEVICE_ATTR(arp_validate, S_IRUGO | S_IWUSR, bonding_show_arp_validate, bonding_store_arp_validate); /* + * Show and store fail_over_mac. User only allowed to change the + * value when there are no slaves. 
+ */ +static ssize_t bonding_show_fail_over_mac(struct device *d, struct device_attribute *attr, char *buf) +{ + struct bonding *bond = to_bond(d); + + return sprintf(buf, "%d\n", bond->params.fail_over_mac) + 1; +} + +static ssize_t bonding_store_fail_over_mac(struct device *d, struct device_attribute *attr, const char *buf, size_t count) +{ + int new_value; + int ret = count; + struct bonding *bond = to_bond(d); + + if (bond->slave_cnt != 0) { + printk(KERN_ERR DRV_NAME + ": %s: Can't alter fail_over_mac with slaves in bond.\n", + bond->dev->name); + ret = -EPERM; + goto out; + } + + if (sscanf(buf, "%d", &new_value) != 1) { + printk(KERN_ERR DRV_NAME + ": %s: no fail_over_mac value specified.\n", + bond->dev->name); + ret = -EINVAL; + goto out; + } + + if ((new_value == 0) || (new_value == 1)) { + bond->params.fail_over_mac = new_value; + printk(KERN_INFO DRV_NAME ": %s: Setting fail_over_mac to %d.\n", + bond->dev->name, new_value); + } else { + printk(KERN_INFO DRV_NAME + ": %s: Ignoring invalid fail_over_mac value %d.\n", + bond->dev->name, new_value); + } +out: + return ret; +} + +static DEVICE_ATTR(fail_over_mac, S_IRUGO | S_IWUSR, bonding_show_fail_over_mac, bonding_store_fail_over_mac); + +/* * Show and set the arp timer interval. There are two tricky bits * here. First, if ARP monitoring is activated, then we must disable * MII monitoring. 
Second, if the ARP timer isn't running, we must @@ -1390,6 +1438,7 @@ static DEVICE_ATTR(ad_partner_mac, S_IRUGO, bonding_show_ad_partner_mac, NULL); static struct attribute *per_bond_attrs[] = { &dev_attr_slaves.attr, &dev_attr_mode.attr, + &dev_attr_fail_over_mac.attr, &dev_attr_arp_validate.attr, &dev_attr_arp_interval.attr, &dev_attr_arp_ip_target.attr, diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index ed0f587..9d6153e 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -22,8 +22,8 @@ #include "bond_3ad.h" #include "bond_alb.h" -#define DRV_VERSION "3.1.3" -#define DRV_RELDATE "June 13, 2007" +#define DRV_VERSION "3.2.0" +#define DRV_RELDATE "September 13, 2007" #define DRV_NAME "bonding" #define DRV_DESCRIPTION "Ethernet Channel Bonding Driver" @@ -128,6 +128,7 @@ struct bond_params { int arp_interval; int arp_validate; int use_carrier; + int fail_over_mac; int updelay; int downdelay; int lacp_fast; @@ -186,7 +187,6 @@ struct bonding { struct timer_list mii_timer; struct timer_list arp_timer; s8 kill_timers; - s8 do_set_mac_addr; s8 send_grat_arp; s8 setup_by_slave; struct net_device_stats stats; -- 1.5.2-rc2.GIT From kliteyn at mellanox.co.il Fri Sep 14 21:32:15 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 15 Sep 2007 07:32:15 +0300 Subject: [ofa-general] nightly osm_sim report 2007-09-15:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-14 OpenSM git rev = Sun_Sep_9_15:57:42_2007 [27f7ec84dbb1060397fa930569bc88d8f6e1d373] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=519 Fail=1 Pass: 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 38 Stability IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest 
IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: 1 Stability IS1-16.topo From vlad at lists.openfabrics.org Sat Sep 15 02:52:45 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 15 Sep 2007 02:52:45 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070915-0200 daily build status Message-ID: <20070915095245.A4F25E60856@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on 
x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type 
argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070915-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From swise at 
opengridcomputing.com Sat Sep 15 07:03:53 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 15 Sep 2007 09:03:53 -0500
Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To:
References: <46E97BB0.9030106@opengridcomputing.com>
Message-ID: <46EBE649.3040303@opengridcomputing.com>

Roland Dreier wrote:
> > I was about to post v2 of my patch to avoid port space collisions with
> > the native stack. Can we get that 2.6.24? It is high priority
> > IMO. I've tried to solicit review on it, but I think folks are
> > reluctant... ;-)
>
> I would like to get this in, but I'm still at least a little
> reluctant, since we would be committing to a user interface that seems
> a little awkward at best, so I'd like to try and find something
> better. Just to summarize my understanding:
>
> - your patch requires the administration to configure an ethX:iwY
> alias address to use iwarp. (By the way is there anything other
> than "don't do that" that avoids assigning the same address to the
> iwarp alias and a non-iwarp interface?)
>

Nope. It's totally up to the admin to create the ethX:iwY interface
-and- to segment his services so host TCP runs on the ethX subnet(s)
and the iwarp rdma ones run on ethX:iwY subnet(s). Without changing the
core network services, I don't see any way around this.

> - it would be nicer to create the alias automatically, but an alias
> without an address doesn't make sense. Creating a whole separate
> net device causes problems because the iwarp stuff still needs to
> use the main net device to do ARP etc.
>

I do log a warning if an iwarp application binds to address 0.0.0.0 and
there are no ethX:iwY addresses available.

> - so I'm out of better ideas but I still want to push back a little
> before we commit to something ugly.
>

Me 2. :-(

> I've been meaning to track down the bnx2 iscsi offload patch to look
> and see if this issue is addressed, since the same problem seems to
> exist: it seems an iscsi connection and a main stack tcp connection
> might share the same 4-tuple unless something is done to avoid that
> happening.
>
> Also, I think it behooves us to get some agreement on this approach
> with NetEffect and Kanoj (NetXen?) at least, since their iwarp drivers
> seem to be imminent.
>
> - R.

From swise at opengridcomputing.com Sat Sep 15 07:07:06 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 15 Sep 2007 09:07:06 -0500
Subject: [ofa-general] [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
In-Reply-To: <46E99586.90905@ichips.intel.com>
References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <46E99586.90905@ichips.intel.com>
Message-ID: <46EBE70A.6040901@opengridcomputing.com>

Sean Hefty wrote:
>> The iWARP driver must translate all listens on address 0.0.0.0 to the
>> set of rdma-only ip addresses for the device in question. This prevents
>> incoming connect requests to the TCP ipaddresses from going up the
>> rdma stack.
>
> I've only given this a high level review at this point, and while the
> patch looks okay on first pass, is there a way to move some of this
> functionality to either the rdma_cm or iw_cm? I don't like the idea of
> every iwarp driver having to implement address/listen list maintenance.
> I may have some ideas after re-examining it.

I think the translating of listen requests from 0.0.0.0->specific
addresses could be moved to the iwcm...

>
>> Implementation Details:
>
> There are a couple of areas that I made a note to look at in more detail
> (because I didn't understand everything that was happening), but I did
> have one minor nit - most uses of list_del_init can just be list_del.
>

Ok.
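
The translation of a listen on 0.0.0.0 into listens on specific
addresses, as discussed above, can be sketched in user-space C. This is
only an illustration under stated assumptions, not the iw_cxgb3 or iwcm
code: expand_listen() and ADDR_ANY are hypothetical names, addresses are
plain 32-bit values, and passing a specific address through unchanged is
an assumption about the intended behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_ANY 0u	/* hypothetical stand-in for 0.0.0.0 */

/*
 * Hypothetical helper: expand one listen address into the concrete
 * addresses to listen on. A wildcard listen becomes one entry per
 * rdma-only address; a specific address is passed through unchanged
 * (an assumption of this sketch). Returns the number of entries written
 * to out, never more than cap. A return of 0 for a wildcard listen
 * means no rdma-only address is configured, which is where a caller
 * would log a warning.
 */
static size_t expand_listen(uint32_t addr, const uint32_t *rdma_addrs,
			    size_t n_rdma, uint32_t *out, size_t cap)
{
	size_t i, count = 0;

	assert(out != NULL || cap == 0);

	if (addr != ADDR_ANY) {
		if (cap > 0)
			out[count++] = addr;
		return count;
	}
	for (i = 0; i < n_rdma && count < cap; i++)
		out[count++] = rdma_addrs[i];
	return count;
}
```

Keeping the expansion in one helper like this is what would let it live
in a common layer instead of being repeated in each iwarp driver.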
From swise at opengridcomputing.com Sat Sep 15 08:56:46 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 15 Sep 2007 10:56:46 -0500 Subject: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. In-Reply-To: <20070914130941.GG18517@2ka.mipt.ru> References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <20070914130941.GG18517@2ka.mipt.ru> Message-ID: <46EC00BE.3020801@opengridcomputing.com> Evgeniy Polyakov wrote: > On Thu, Sep 13, 2007 at 02:16:17PM -0500, Steve Wise (swise at opengridcomputing.com) wrote: >> iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. >> >> Version 2: >> >> - added a per-device mutex for the address and listening endpoints lists. >> >> - wait for all replies if sending multiple passive_open requests to rnic. >> >> - log warning if no addresses are available when a listen is issued. >> >> - tested >> >> --- >> >> Design: >> >> The sysadmin creates "for iwarp use only" alias interfaces of the form >> "devname:iw*" where devname is the native interface name (eg eth0) for the >> iwarp netdev device. The alias label can be anything starting with "iw". >> The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver. >> >> EG: >> ifconfig eth0 192.168.70.123 up >> ifconfig eth0:iw1 192.168.71.123 up >> ifconfig eth0:iw2 192.168.72.123 up >> >> In the above example, 192.168.70/24 is for TCP traffic, while >> 192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use. >> >> The rdma-only interface must be on its own IP subnet. This allows routing >> all rdma traffic onto this interface. >> >> The iWARP driver must translate all listens on address 0.0.0.0 to the >> set of rdma-only ip addresses for the device in question. This prevents >> incoming connect requests to the TCP ipaddresses from going up the >> rdma stack. 
>
> If the only solutions to solve a problem with hardware are to steal
> packets or became a real device, then real device is much more
> appropriate. Is that correct?
>

This is a real device. I don't understand your question? Packets
aren't being stolen.

>> +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
>> +{
>> +	struct iwch_addrlist *addr;
>> +
>> +	addr = kmalloc(sizeof *addr, GFP_KERNEL);
>
> As a small nitpick: this wants to be sizeof(struct in_ifaddr)
>

No, insert_ifa() allocates a struct iwch_addrlist, which has 2 fields:
a list_head for linking, and a struct in_ifaddr pointer.

>> +	if (!addr) {
>> +		printk(KERN_ERR MOD "%s - failed to alloc memory!\n",
>> +		       __FUNCTION__);
>> +		return;
>> +	}
>> +	addr->ifa = ifa;
>> +	mutex_lock(&rnicp->mutex);
>> +	list_add_tail(&addr->entry, &rnicp->addrlist);
>> +	mutex_unlock(&rnicp->mutex);
>> +}
>
> What about providing error back to caller and fail to register?
>

There are two cases where this is called:

1) during module init to populate the list of iwarp addresses. If we
failed in that case, I _could_ then not register.

2) we get called via the notifier mechanism when an address is added.
If that fails, the caller doesn't care (since we're on the notifier
callout thread). But the code could perhaps unregister the device.

I prefer just logging an error in case 2. I'll look into not
registering if we cannot get any address due to lack of memory. But
there's another case: we load the module and the admin hasn't yet
created the ethX:iw interface. Perhaps I should change the code to only
register as a working rdma device _when_ we get at least one ethX:iwY
interface created? Whatchathink?
>> +static void remove_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
>> +{
>> +	struct iwch_addrlist *addr, *tmp;
>> +
>> +	mutex_lock(&rnicp->mutex);
>> +	list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) {
>> +		if (addr->ifa == ifa) {
>> +			list_del_init(&addr->entry);
>> +			kfree(addr);
>> +			goto out;
>> +		}
>> +	}
>> +out:
>> +	mutex_unlock(&rnicp->mutex);
>> +}
>> +
>> +static int netdev_is_ours(struct iwch_dev *rnicp, struct net_device *netdev)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < rnicp->rdev.port_info.nports; i++)
>> +		if (netdev == rnicp->rdev.port_info.lldevs[i])
>> +			return 1;
>> +	return 0;
>> +}
>> +
>> +static inline int is_iwarp_label(char *label)
>> +{
>> +	char *colon;
>> +
>> +	colon = strchr(label, ':');
>> +	if (colon && !strncmp(colon+1, "iw", 2))
>> +		return 1;
>> +	return 0;
>> +}
>
> I.e. it is not allowed to create ':iw' alias for anyone else?
> Well, looks crappy, but if it is the only solution...
>

It is kinda crappy. But I don't see a better solution. Any ideas?
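
To make the concern about the ':iw' prefix concrete, the label check
quoted above can be tried outside the kernel. The following is a
minimal user-space sketch: the function body follows the quoted
is_iwarp_label(), with const added to the parameter and an assert on
the argument as assumptions of this sketch.

```c
#include <assert.h>
#include <string.h>

/*
 * User-space copy of the label test from the quoted patch (const and
 * the assert are additions for this sketch). A label counts as
 * iwarp-only when the two characters right after the first ':' are
 * "iw"; anything following those two characters is ignored.
 */
static int is_iwarp_label(const char *label)
{
	const char *colon;

	assert(label != NULL);
	colon = strchr(label, ':');
	if (colon && !strncmp(colon + 1, "iw", 2))
		return 1;
	return 0;
}
```

With this check, eth0:iw1 and eth0:iwfoo both match, while eth0 and
eth0:1 do not. Any alias whose label begins with "iw" after the colon
is therefore treated as iwarp-only, regardless of who created it.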
>> +static int nb_callback(struct notifier_block *self, unsigned long event, >> + void *ctx) >> +{ >> + struct in_ifaddr *ifa = ctx; >> + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); >> + >> + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); >> + >> + switch (event) { >> + case NETDEV_UP: >> + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && >> + is_iwarp_label(ifa->ifa_label)) { >> + PDBG("%s label %s addr 0x%x added\n", >> + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); >> + insert_ifa(rnicp, ifa); >> + iwch_listeners_add_addr(rnicp, ifa->ifa_address); >> + } >> + break; >> + case NETDEV_DOWN: >> + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && >> + is_iwarp_label(ifa->ifa_label)) { >> + PDBG("%s label %s addr 0x%x deleted\n", >> + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); >> + iwch_listeners_del_addr(rnicp, ifa->ifa_address); >> + remove_ifa(rnicp, ifa); >> + } >> + break; >> + default: >> + break; >> + } >> + return 0; >> +} >> + >> +static void delete_addrlist(struct iwch_dev *rnicp) >> +{ >> + struct iwch_addrlist *addr, *tmp; >> + >> + mutex_lock(&rnicp->mutex); >> + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { >> + list_del_init(&addr->entry); >> + kfree(addr); >> + } >> + mutex_unlock(&rnicp->mutex); >> +} >> + >> +static void populate_addrlist(struct iwch_dev *rnicp) >> +{ >> + int i; >> + struct in_device *indev; >> + >> + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { >> + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); >> + if (!indev) >> + continue; >> + for_ifa(indev) >> + if (is_iwarp_label(ifa->ifa_label)) { >> + PDBG("%s label %s addr 0x%x added\n", >> + __FUNCTION__, ifa->ifa_label, >> + ifa->ifa_address); >> + insert_ifa(rnicp, ifa); >> + } >> + endfor_ifa(indev); >> + } >> +} >> + >> static void rnic_init(struct iwch_dev *rnicp) >> { >> PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); >> @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r >> idr_init(&rnicp->qpidr); 
>> idr_init(&rnicp->mmidr); >> spin_lock_init(&rnicp->lock); >> + INIT_LIST_HEAD(&rnicp->addrlist); >> + INIT_LIST_HEAD(&rnicp->listen_eps); >> + mutex_init(&rnicp->mutex); >> + rnicp->nb.notifier_call = nb_callback; >> + populate_addrlist(rnicp); >> + register_inetaddr_notifier(&rnicp->nb); >> >> rnicp->attr.vendor_id = 0x168; >> rnicp->attr.vendor_part_id = 7; >> @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev >> mutex_lock(&dev_mutex); >> list_for_each_entry_safe(dev, tmp, &dev_list, entry) { >> if (dev->rdev.t3cdev_p == tdev) { >> + unregister_inetaddr_notifier(&dev->nb); >> + delete_addrlist(dev); >> list_del(&dev->entry); >> iwch_unregister_device(dev); >> cxio_rdev_close(&dev->rdev); >> diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h >> index caf4e60..7fa0a47 100644 >> --- a/drivers/infiniband/hw/cxgb3/iwch.h >> +++ b/drivers/infiniband/hw/cxgb3/iwch.h >> @@ -36,6 +36,8 @@ #include >> #include >> #include >> #include >> +#include >> +#include >> >> #include >> >> @@ -101,6 +103,11 @@ struct iwch_rnic_attributes { >> u32 cq_overflow_detection; >> }; >> >> +struct iwch_addrlist { >> + struct list_head entry; >> + struct in_ifaddr *ifa; >> +}; >> + >> struct iwch_dev { >> struct ib_device ibdev; >> struct cxio_rdev rdev; >> @@ -111,6 +118,10 @@ struct iwch_dev { >> struct idr mmidr; >> spinlock_t lock; >> struct list_head entry; >> + struct notifier_block nb; >> + struct list_head addrlist; >> + struct list_head listen_eps; >> + struct mutex mutex; >> }; >> >> static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) >> diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c >> index 1cdfcd4..954069f 100644 >> --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c >> +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c >> @@ -1127,23 +1127,149 @@ static int act_open_rpl(struct t3cdev *t >> return CPL_RET_BUF_DONE; >> } >> >> -static int listen_start(struct iwch_listen_ep *ep) >> +static 
int wait_for_reply(struct iwch_ep_common *epc) >> +{ >> + PDBG("%s ep %p waiting\n", __FUNCTION__, epc); >> + wait_event(epc->waitq, epc->rpl_done); >> + PDBG("%s ep %p done waiting err %d\n", __FUNCTION__, epc, epc->rpl_err); >> + return epc->rpl_err; >> +} >> + >> +static struct iwch_listen_entry *alloc_listener(struct iwch_listen_ep *ep, >> + __be32 addr) > > Do you know, that cxgb3 function names suck? :) > Especially get_skb(). > >> +{ >> + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); >> + struct iwch_listen_entry *le; >> + >> + le = kmalloc(sizeof *le, GFP_KERNEL); > > Wants to be sizeof(struct iwch_listen_entry) and in other places too. > Do you mean I shouldn't use sizeof *le, but rather sizeof(struct iwch_listen_entry)? Is that the preferred coding style? > I skipped rdma internals of the patch, since I do not know it enough > to judge, but your approach looks good from core network point of view. > Maybe you should automatically create an alias each time new interface > is added so that admin would not care about proper aliases? > That would be much better IMO, but the problem is that I cannot create an alias without an actual ip address. Unless we change the core services to allow it. Thanks for reviewing! Steve. From sashak at voltaire.com Sat Sep 15 11:35:42 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 15 Sep 2007 21:35:42 +0300 Subject: [ofa-general] [PATCH] opensm: configure scripts merge Message-ID: <20070915183542.GA6891@sashak.voltaire.com> This merges all subdirectories configure.in scripts into one toplevel directory script. Separate configuring per subdirectory is not needed anymore. 
Signed-off-by: Sasha Khapyorsky --- opensm/Makefile.am | 4 +- opensm/autogen.sh | 34 +++------------------ opensm/complib/Makefile.am | 2 + opensm/configure.in | 60 ++++++++++++++++++++++++++++++------ opensm/libvendor/Makefile.am | 2 + opensm/opensm/Makefile.am | 2 + opensm/osmeventplugin/Makefile.am | 2 + 7 files changed, 65 insertions(+), 41 deletions(-) diff --git a/opensm/Makefile.am b/opensm/Makefile.am index f99e78b..9cbce3a 100644 --- a/opensm/Makefile.am +++ b/opensm/Makefile.am @@ -1,12 +1,12 @@ # note that order matters: make the libs first then use them -SUBDIRS = complib libvendor opensm osmtest include $(DEFAULT_EVENT_PLUGIN) +SUBDIRS = complib libvendor opensm osmtest include $(DEFAULT_EVENT_PLUGIN) DIST_SUBDIRS = complib libvendor opensm osmtest include osmeventplugin # this will control the update of the files in order MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in -ACLOCAL = aclocal -I $(ac_aux_dir) +ACLOCAL = aclocal -I $(ac_aux_dir) # we should provide a hint for other apps about the build mode of this project install-exec-hook: diff --git a/opensm/autogen.sh b/opensm/autogen.sh index 3ae89b4..fee8800 100755 --- a/opensm/autogen.sh +++ b/opensm/autogen.sh @@ -50,32 +50,8 @@ fi # cleanup find . \( -name Makefile.in -o -name aclocal.m4 -o -name autom4te.cache -o -name configure -o -name aclocal.m4 \) -exec \rm -rf {} \; -prune -# handle our own autoconf: -(aclocal -I config 2>&1 ) && \ -(automake --add-missing --gnu --copy ) && \ -(autoconf 2>&1 ) -if test $? 
!= 0; then - exit 1 -fi - - - -# visit all sub directories with autogen.sh -anyErr=0 -for a in include complib libvendor opensm osmtest osmeventplugin ; do - dir=`dirname $a` - test -d ${dir}/config || mkdir ${dir}/config - echo Visiting $a - ( cd $a && \ - set -x && \ - aclocal -I config -I ../config && \ - libtoolize --force --copy && \ - autoheader && \ - automake --foreign --add-missing --copy && \ - autoconf ) \ - 2>&1 | sed 's/^/| /' | grep -v "arning: underquoted definition" - if test $? != 0; then - echo $a failed - anyErr=1 - fi -done +aclocal -I config && \ +libtoolize --force --copy && \ +autoheader && \ +automake --foreign --add-missing --copy && \ +autoconf diff --git a/opensm/complib/Makefile.am b/opensm/complib/Makefile.am index fce797a..a77964e 100644 --- a/opensm/complib/Makefile.am +++ b/opensm/complib/Makefile.am @@ -17,6 +17,8 @@ else libosmcomp_version_script = endif +complib_api_version=$(shell grep LIBVERSION= $(srcdir)/libosmcomp.ver | sed 's/LIBVERSION=//') + libosmcomp_la_SOURCES = cl_complib.c cl_dispatcher.c \ cl_event.c cl_event_wheel.c \ cl_list.c cl_log.c cl_map.c \ diff --git a/opensm/configure.in b/opensm/configure.in index 2efd867..6c4db9f 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -4,6 +4,7 @@ AC_PREREQ(2.57) AC_INIT(opensm, 3.1.1, general at lists.openfabrics.org) AC_CONFIG_SRCDIR([opensm/osm_opensm.c]) AC_CONFIG_AUX_DIR(config) +AC_CONFIG_HEADERS(include/config.h) AM_INIT_AUTOMAKE(opensm, 3.1.1) dnl Defines the Language @@ -16,17 +17,50 @@ AM_MAINTAINER_MODE dnl Required for cases make defines a MAKE=make ??? Why AC_PROG_MAKE_SET +AC_PROG_CC +AC_PROG_LIBTOOL +AC_PROG_INSTALL +AC_PROG_LN_S +AC_PROG_MAKE_SET +AC_PROG_YACC +AC_PROG_LEX + +dnl Checks for libraries +AC_CHECK_LIB(pthread, pthread_mutex_init, [], + AC_MSG_ERROR([pthread_mutex_init() not found. libosmcomp requires libpthread.])) + +dnl Checks for typedefs, structures, and compiler characteristics. 
+AC_C_CONST +AC_C_INLINE +AC_TYPE_PID_T +AC_TYPE_SIZE_T +AC_HEADER_TIME +AC_STRUCT_TM +AC_C_VOLATILE + +dnl We use --version-script with ld if possible +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, +if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then + ac_cv_version_script=yes +else + ac_cv_version_script=no +fi) +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") dnl Define an input config option to control debug compile -AC_ARG_ENABLE(debug, -[ --enable-debug Turn on debugging], +AC_ARG_ENABLE(debug, [ --enable-debug Turn on debugging], [case "${enableval}" in - yes) debug=true ;; - no) debug=false ;; - *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; -esac],[debug=false]) + yes) debug=true ;; + no) debug=false ;; + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; +esac],debug=false) AM_CONDITIONAL(DEBUG, test x$debug = xtrue) +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], +[if test x$enableval = xno ; then + disable_libcheck=yes +fi]) + dnl check if they want the socket console OPENIB_OSM_CONSOLE_SOCKET_SEL @@ -39,9 +73,15 @@ OPENIB_OSM_DEFAULT_EVENT_PLUGIN_SEL dnl Provide user option to select vendor OPENIB_APP_OSMV_SEL -dnl Configure the following subdirs -AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include osmeventplugin) +dnl Checks for headers and libraries +OPENIB_APP_OSMV_CHECK_HEADER +OPENIB_APP_OSMV_CHECK_LIB + +# we have to revive the env CFLAGS as some how they are being overwritten... +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering +# for why they should NEVER be modified by the configure to allow for user +# overrides. 
+CFLAGS=$ac_env_CFLAGS_value dnl Create the following Makefiles -AC_OUTPUT(Makefile) -AC_OUTPUT(opensm.spec) +AC_OUTPUT([Makefile include/Makefile complib/Makefile libvendor/Makefile opensm/Makefile osmeventplugin/Makefile osmtest/Makefile opensm.spec]) diff --git a/opensm/libvendor/Makefile.am b/opensm/libvendor/Makefile.am index 3b8c3af..cb8baaa 100644 --- a/opensm/libvendor/Makefile.am +++ b/opensm/libvendor/Makefile.am @@ -23,6 +23,8 @@ else libosmvendor_version_script = endif +osmvendor_api_version=$(shell grep LIBVERSION= $(srcdir)/libosmvendor.ver | sed 's/LIBVERSION=//') + COMM_HDRS= $(srcdir)/../include/vendor/osm_vendor_api.h \ $(srcdir)/../include/vendor/osm_vendor.h \ $(srcdir)/../include/vendor/osm_vendor_select.h \ diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am index 5e4229d..8440b4a 100644 --- a/opensm/opensm/Makefile.am +++ b/opensm/opensm/Makefile.am @@ -21,6 +21,8 @@ else libopensm_version_script = endif +opensm_api_version=$(shell grep LIBVERSION= $(srcdir)/libopensm.ver | sed 's/LIBVERSION=//') + libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c libopensm_la_LDFLAGS = -version-info $(opensm_api_version) \ -export-dynamic $(libopensm_version_script) diff --git a/opensm/osmeventplugin/Makefile.am b/opensm/osmeventplugin/Makefile.am index bbb012f..1b7dad0 100644 --- a/opensm/osmeventplugin/Makefile.am +++ b/opensm/osmeventplugin/Makefile.am @@ -18,6 +18,8 @@ else libosmeventplugin_version_script = endif +osmeventplugin_api_version=$(shell grep LIBVERSION= $(srcdir)/libosmeventplugin.ver | sed 's/LIBVERSION=//') + libosmeventplugin_la_SOURCES = src/osmeventplugin.c libosmeventplugin_la_LDFLAGS = -version-info $(osmeventplugin_api_version) \ -export-dynamic $(libosmeventplugin_version_script) -- 1.5.3.1.91.gd3392 From sashak at voltaire.com Sat Sep 15 11:36:40 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 15 Sep 2007 21:36:40 +0300 Subject: [ofa-general] [PATCH] opensm: remove obsolete configure.in and 
	spec.in files
In-Reply-To: <20070915183542.GA6891@sashak.voltaire.com>
References: <20070915183542.GA6891@sashak.voltaire.com>
Message-ID: <20070915183640.GB6891@sashak.voltaire.com>

This removes not used configure.in and *.spec.in files from opensm
subdirectories.

Signed-off-by: Sasha Khapyorsky
---
 opensm/complib/Makefile.am                      |    6 +--
 opensm/complib/configure.in                     |   69 ------------------
 opensm/complib/libosmcomp.spec.in               |   38 ----------
 opensm/include/configure.in                     |   44 ------------
 opensm/libvendor/Makefile.am                    |    5 +-
 opensm/libvendor/configure.in                   |   80 ---------------------
 opensm/libvendor/libosmvendor.spec.in           |   38 ----------
 opensm/opensm/configure.in                      |   86 -----------------------
 opensm/osmeventplugin/Makefile.am               |    7 +--
 opensm/osmeventplugin/configure.in              |   66 -----------------
 opensm/osmeventplugin/libosmeventplugin.spec.in |   38 ----------
 opensm/osmtest/configure.in                     |   75 --------------------
 12 files changed, 3 insertions(+), 549 deletions(-)
 delete mode 100644 opensm/complib/configure.in
 delete mode 100644 opensm/complib/libosmcomp.spec.in
 delete mode 100644 opensm/include/configure.in
 delete mode 100644 opensm/libvendor/configure.in
 delete mode 100644 opensm/libvendor/libosmvendor.spec.in
 delete mode 100644 opensm/opensm/configure.in
 delete mode 100644 opensm/osmeventplugin/configure.in
 delete mode 100644 opensm/osmeventplugin/libosmeventplugin.spec.in
 delete mode 100644 opensm/osmtest/configure.in

diff --git a/opensm/complib/Makefile.am b/opensm/complib/Makefile.am
index a77964e..2967c87 100644
--- a/opensm/complib/Makefile.am
+++ b/opensm/complib/Makefile.am
@@ -76,11 +76,7 @@ libosmcompinclude_HEADERS = $(srcdir)/../include/complib/cl_atomic.h \
 	$(srcdir)/../include/complib/cl_vector.h
 
 # headers are distributed as part of the include dir
-EXTRA_DIST = $(srcdir)/libosmcomp.spec.in $(srcdir)/libosmcomp.map \
-	$(srcdir)/libosmcomp.ver
-
-dist-hook: libosmcomp.spec
-	cp libosmcomp.spec $(distdir)
+EXTRA_DIST = $(srcdir)/libosmcomp.map $(srcdir)/libosmcomp.ver
 
 # as we can not use libtool -release since it actually changes the SONAME
 # to the full release name instead of keeping it to the original
diff --git a/opensm/complib/configure.in b/opensm/complib/configure.in
deleted file mode 100644
index 33d5ffc..0000000
--- a/opensm/complib/configure.in
+++ /dev/null
@@ -1,69 +0,0 @@
-dnl Process this file with autoconf to produce a configure script.
-
-AC_PREREQ(2.57)
-AC_INIT(complib, 2.3.0, general at lists.openfabrics.org)
-AC_CONFIG_SRCDIR([cl_spinlock.c])
-AC_CONFIG_AUX_DIR(config)
-AM_CONFIG_HEADER(config.h)
-AM_INIT_AUTOMAKE
-
-dnl the library version info is available in the file: libosmcomp.ver
-complib_api_version=`grep LIBVERSION $srcdir/libosmcomp.ver | sed 's/LIBVERSION=//'`
-if test -z $complib_api_version; then
-    complib_api_version=1:0:0
-fi
-AC_SUBST(complib_api_version)
-
-dnl Checks for programs
-AC_PROG_CC
-AC_PROG_GCC_TRADITIONAL
-AC_PROG_LIBTOOL
-
-dnl Checks for libraries
-AC_CHECK_LIB(pthread, pthread_mutex_init, [],
-    AC_MSG_ERROR([pthread_mutex_init() not found. libosmcomp requires libpthread.]))
-
-dnl Checks for header files.
-AC_HEADER_STDC
-AC_CHECK_HEADERS([fcntl.h stdlib.h string.h sys/ioctl.h sys/time.h syslog.h unistd.h])
-
-dnl Checks for library functions
-AC_FUNC_MALLOC
-AC_FUNC_MEMCMP
-AC_CHECK_FUNCS([gettimeofday memset strerror])
-
-dnl Checks for typedefs, structures, and compiler characteristics.
-AC_C_CONST
-AC_C_INLINE
-AC_TYPE_SIZE_T
-AC_HEADER_TIME
-
-dnl We use --version-script with ld if possible
-AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
-    if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then
-        ac_cv_version_script=yes
-    else
-        ac_cv_version_script=no
-    fi)
-
-AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes")
-
-dnl Support debug mode build - if enable-debug provided the DEBUG variable is set
-AC_ARG_ENABLE(debug,
-[ --enable-debug Turn on debug mode],
-[case "${enableval}" in
-    yes) debug=true ;;
-    no) debug=false ;;
-    *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;;
-esac],[debug=false])
-AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
-
-# we have to revive the env CFLAGS as some how they are being overwritten...
-# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering
-# for why they should NEVER be modified by the configure to allow for user
-# overrides.
-CFLAGS=$ac_env_CFLAGS_value
-
-
-AC_CONFIG_FILES([Makefile libosmcomp.spec])
-AC_OUTPUT
diff --git a/opensm/complib/libosmcomp.spec.in b/opensm/complib/libosmcomp.spec.in
deleted file mode 100644
index 12d581f..0000000
--- a/opensm/complib/libosmcomp.spec.in
+++ /dev/null
@@ -1,38 +0,0 @@
-
-%define ver @VERSION@
-%define RELEASE 1
-%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}
-
-Summary: OpenIB InfiniBand OpenSM Component Library
-Name: libosmcomp
-Version: %ver
-Release: %rel%{?dist}
-License: GPL/BSD
-Group: System Environment/Libraries
-BuildRoot: %{_tmppath}/%{name}-%{version}-root
-Source: http://openfabrics.org/~halr/management/%{name}-%{version}.tar.gz
-Url: http://openfabrics.org/
-Requires:
-
-%description
-libosmcomp provides the OS component library for OpenSM.
-
-%prep
-%setup -q
-
-%build
-%configure
-make
-
-%install
-make DESTDIR=${RPM_BUILD_ROOT} install
-# remove unpackaged files from the buildroot
-rm -f $RPM_BUILD_ROOT%{_libdir}/*.la
-
-%clean
-rm -rf $RPM_BUILD_ROOT
-
-%files
-%defattr(-,root,root)
-%{_libdir}/libosmcomp*.so.*
-%doc ChangeLog
diff --git a/opensm/include/configure.in b/opensm/include/configure.in
deleted file mode 100644
index 195923a..0000000
--- a/opensm/include/configure.in
+++ /dev/null
@@ -1,44 +0,0 @@
-dnl Process this file with autoconf to produce a configure script.
-
-AC_PREREQ(2.57)
-AC_INIT(libinc, 2.2.1, general at lists.openfabrics.org)
-AC_CONFIG_SRCDIR()
-AC_CONFIG_AUX_DIR(config)
-AM_CONFIG_HEADER(config.h)
-AM_INIT_AUTOMAKE()
-AM_INIT_AUTOMAKE(infiniband, 2.2.1)
-AM_PROG_LIBTOOL
-
-dnl Checks for programs
-AC_PROG_CC
-
-dnl Checks for libraries
-dnl AC_CHECK_LIB - need to provide symbol and library... what do we depend on?
-
-dnl Checks for header files.
-AC_HEADER_STDC
-
-dnl Checks for library functions
-AC_CHECK_FUNCS()
-
-dnl Checks for typedefs, structures, and compiler characteristics.
-AC_C_CONST
-AC_C_INLINE
-
-AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
-    if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then
-        ac_cv_version_script=yes
-    else
-        ac_cv_version_script=no
-    fi)
-
-AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes")
-
-# we have to revive the env CFALGS as some how they are being overwritten...
-# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering
-# for why they should NEVER be modified by the configure to allow for user
-# overrides.
-CFLAGS=$ac_env_CFLAGS_value
-
-AC_CONFIG_FILES([Makefile])
-AC_OUTPUT
diff --git a/opensm/libvendor/Makefile.am b/opensm/libvendor/Makefile.am
index cb8baaa..9fbfc9b 100644
--- a/opensm/libvendor/Makefile.am
+++ b/opensm/libvendor/Makefile.am
@@ -87,10 +87,7 @@ libosmvendorincludedir = $(includedir)/infiniband/vendor
 libosmvendorinclude_HEADERS = $(HDRS)
 
 # headers are distributed as part of the include dir
-EXTRA_DIST = libosmvendor.spec.in $(srcdir)/libosmvendor.map $(srcdir)/libosmvendor.ver
-
-dist-hook: libosmvendor.spec
-	cp libosmvendor.spec $(distdir)
+EXTRA_DIST = $(srcdir)/libosmvendor.map $(srcdir)/libosmvendor.ver
 
 # as we can not use libtool -release since it actually changes the SONAME
 # to the full release name instead of keeping it to the original
diff --git a/opensm/libvendor/configure.in b/opensm/libvendor/configure.in
deleted file mode 100644
index e7730cd..0000000
--- a/opensm/libvendor/configure.in
+++ /dev/null
@@ -1,80 +0,0 @@
-dnl Process this file with autoconf to produce a configure script.
-
-AC_PREREQ(2.57)
-AC_INIT(libosmvendor, 2.2.1, general at lists.openfabrics.org)
-AC_CONFIG_SRCDIR([osm_vendor_ibumad.c])
-AC_CONFIG_AUX_DIR(config)
-AM_CONFIG_HEADER(config.h)
-AM_INIT_AUTOMAKE
-
-dnl the library version info is available in the file: libosmvendor.ver
-osmvendor_api_version=`grep LIBVERSION $srcdir/libosmvendor.ver |sed 's/LIBVERSION=//'`
-if test -z $osmvendor_api_version; then
-    osmvendor_api_version=1:0:0
-fi
-AC_SUBST(osmvendor_api_version)
-
-AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of
-ib libraries],
-[ if test x$enableval = xno ; then
-    disable_libcheck=yes
-  fi
-])
-
-dnl Checks for programs
-AC_PROG_CC
-AC_PROG_GCC_TRADITIONAL
-AC_PROG_CPP
-AC_PROG_INSTALL
-AC_PROG_LN_S
-AC_PROG_MAKE_SET
-AC_PROG_LIBTOOL
-
-dnl Select appropriate vendor type
-OPENIB_APP_OSMV_SEL
-
-dnl Checks for libraries
-OPENIB_APP_OSMV_CHECK_LIB
-
-dnl Checks for header files.
-AC_HEADER_DIRENT
-AC_HEADER_STDC
-OPENIB_APP_OSMV_CHECK_HEADER
-AC_CHECK_HEADERS([fcntl.h stddef.h stdint.h sys/ioctl.h])
-
-dnl Checks for library functions
-AC_FUNC_CLOSEDIR_VOID
-AC_CHECK_FUNCS([memset strerror strstr])
-
-dnl Checks for typedefs, structures, and compiler characteristics.
-AC_C_CONST
-AC_C_INLINE
-AC_TYPE_SIZE_T
-
-AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
-    if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then
-        ac_cv_version_script=yes
-    else
-        ac_cv_version_script=no
-    fi)
-
-AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes")
-
-dnl support debug mode
-AC_ARG_ENABLE(debug,
-[ --enable-debug Turn on debug mode],
-[case "${enableval}" in
-    yes) debug=true ;;
-    no) debug=false ;;
-    *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;;
-esac],[debug=false])
-AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
-
-# we have to revive the env CFLAGS as some how they are being overwritten...
-# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering
-# for why they should NEVER be modified by the configure to allow for user
-# overrides.
-CFLAGS=$ac_env_CFLAGS_value
-
-AC_CONFIG_FILES([Makefile libosmvendor.spec])
-AC_OUTPUT
diff --git a/opensm/libvendor/libosmvendor.spec.in b/opensm/libvendor/libosmvendor.spec.in
deleted file mode 100644
index 5753afb..0000000
--- a/opensm/libvendor/libosmvendor.spec.in
+++ /dev/null
@@ -1,38 +0,0 @@
-
-%define ver @VERSION@
-%define RELEASE 1
-%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}
-
-Summary: OpenIB InfiniBand OpenSM Vendor Library
-Name: libosmvendor
-Version: %ver
-Release: %rel%{?dist}
-License: GPL/BSD
-Group: System Environment/Libraries
-BuildRoot: %{_tmppath}/%{name}-%{version}-root
-Source: http://openfabrics.org/~halr/management/%{name}-%{version}.tar.gz
-Url: http://openfabrics.org/
-Requires:
-
-%description
-libosmvendor provides the vendor library for OpenSM.
-
-%prep
-%setup -q
-
-%build
-%configure
-make
-
-%install
-make DESTDIR=${RPM_BUILD_ROOT} install
-# remove unpackaged files from the buildroot
-rm -f $RPM_BUILD_ROOT%{_libdir}/*.la
-
-%clean
-rm -rf $RPM_BUILD_ROOT
-
-%files
-%defattr(-,root,root)
-%{_libdir}/libosmvendor*.so.*
-%doc ChangeLog
diff --git a/opensm/opensm/configure.in b/opensm/opensm/configure.in
deleted file mode 100644
index a49538d..0000000
--- a/opensm/opensm/configure.in
+++ /dev/null
@@ -1,86 +0,0 @@
-dnl Process this file with autoconf to produce a configure script.
-
-AC_PREREQ(2.57)
-AC_INIT(opensm, 2.2.1, general at lists.openfabrics.org)
-AC_CONFIG_SRCDIR([osm_opensm.c])
-AC_CONFIG_AUX_DIR(config)
-AM_CONFIG_HEADER(config.h)
-AM_INIT_AUTOMAKE
-
-dnl the library version info is available in the file: libopensm.ver
-opensm_api_version=`grep LIBVERSION $srcdir/libopensm.ver | sed 's/LIBVERSION=//'`
-if test -z $opensm_api_version; then
-    opensm_api_version=1:0:0
-fi
-AC_SUBST(opensm_api_version)
-
-dnl Checks for programs
-AC_PROG_CXX
-AC_PROG_CC
-AC_PROG_CPP
-AC_PROG_INSTALL
-AC_PROG_LN_S
-AC_PROG_MAKE_SET
-AC_PROG_LIBTOOL
-AM_PROG_LEX
-AC_PROG_YACC
-
-dnl Checks for libraries
-
-dnl Checks for header files.
-AC_HEADER_STDC
-AC_CHECK_HEADERS([fcntl.h stdlib.h sys/time.h unistd.h])
-
-dnl Checks for library functions
-#AC_FUNC_MALLOC
-AC_FUNC_VPRINTF
-AC_CHECK_FUNCS([gettimeofday localtime_r strcspn strtol strtoull])
-
-dnl Checks for typedefs, structures, and compiler characteristics.
-AC_C_CONST
-AC_C_INLINE
-AC_TYPE_PID_T
-AC_TYPE_SIZE_T
-AC_HEADER_TIME
-AC_STRUCT_TM
-AC_C_VOLATILE
-
-AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
-    if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then
-        ac_cv_version_script=yes
-    else
-        ac_cv_version_script=no
-    fi)
-
-AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes")
-
-dnl support debug mode
-AC_ARG_ENABLE(debug,
-[ --enable-debug Turn on debug mode],
-[case "${enableval}" in
-    yes) debug=true ;;
-    no) debug=false ;;
-    *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;;
-esac],[debug=false])
-AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
-
-dnl check if they want the socket console
-OPENIB_OSM_CONSOLE_SOCKET_SEL
-
-dnl select performance manager or not
-OPENIB_OSM_PERF_MGR_SEL
-
-dnl select example event plugin or not
-OPENIB_OSM_DEFAULT_EVENT_PLUGIN_SEL
-
-dnl Provide user option to select vendor
-OPENIB_APP_OSMV_SEL
-
-# we have to revive the env CFALGS as some how they are being overwritten...
-# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering
-# for why they should NEVER be modified by the configure to allow for user
-# overrides.
-CFLAGS=$ac_env_CFLAGS_value
-
-AC_CONFIG_FILES([Makefile])
-AC_OUTPUT
diff --git a/opensm/osmeventplugin/Makefile.am b/opensm/osmeventplugin/Makefile.am
index 1b7dad0..1404c11 100644
--- a/opensm/osmeventplugin/Makefile.am
+++ b/opensm/osmeventplugin/Makefile.am
@@ -31,9 +31,4 @@ libosmeventpluginincludedir = $(includedir)/infiniband/complib
 libosmeventplugininclude_HEADERS =
 
 # headers are distributed as part of the include dir
-EXTRA_DIST = $(srcdir)/libosmeventplugin.spec.in $(srcdir)/libosmeventplugin.map \
-	$(srcdir)/libosmeventplugin.ver
-
-dist-hook: libosmeventplugin.spec
-	cp libosmeventplugin.spec $(distdir)
-
+EXTRA_DIST = $(srcdir)/libosmeventplugin.map $(srcdir)/libosmeventplugin.ver
diff --git a/opensm/osmeventplugin/configure.in b/opensm/osmeventplugin/configure.in
deleted file mode 100644
index bf86d35..0000000
--- a/opensm/osmeventplugin/configure.in
+++ /dev/null
@@ -1,66 +0,0 @@
-dnl Process this file with autoconf to produce a configure script.
-
-AC_PREREQ(2.57)
-AC_INIT(libosmeventplugin, 1.0.0, general at lists.openfabrics.org)
-AC_CONFIG_AUX_DIR(config)
-AM_CONFIG_HEADER(config.h)
-AM_INIT_AUTOMAKE
-
-dnl the library version info is available in the file: libosmeventplugin.ver
-osmeventplugin_api_version=`grep LIBVERSION $srcdir/libosmeventplugin.ver | sed 's/LIBVERSION=//'`
-if test -z $osmeventplugin_api_version; then
-    osmeventplugin_api_version=1:0:0
-fi
-AC_SUBST(osmeventplugin_api_version)
-
-dnl Checks for programs
-AC_PROG_CC
-AC_PROG_GCC_TRADITIONAL
-AC_PROG_LIBTOOL
-
-dnl Checks for header files.
-AC_HEADER_STDC
-AC_CHECK_HEADERS([fcntl.h stdlib.h string.h sys/ioctl.h sys/time.h syslog.h unistd.h])
-
-dnl Checks for library functions
-AC_FUNC_MALLOC
-AC_FUNC_MEMCMP
-AC_CHECK_FUNC([time])
-dnl AC_CHECK_FUNC([cl_plock_excl_acquire], [],
-dnl AC_MSG_ERROR([cl_plock_excl_acquire not found, libosmeventplugin requires libosmcomp]))
-
-dnl Checks for typedefs, structures, and compiler characteristics.
-AC_C_CONST
-AC_C_INLINE
-AC_TYPE_SIZE_T
-AC_HEADER_TIME
-
-dnl We use --version-script with ld if possible
-AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
-    if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then
-        ac_cv_version_script=yes
-    else
-        ac_cv_version_script=no
-    fi)
-
-AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes")
-
-dnl Support debug mode build - if enable-debug provided the DEBUG variable is set
-AC_ARG_ENABLE(debug,
-[ --enable-debug Turn on debug mode],
-[case "${enableval}" in
-    yes) debug=true ;;
-    no) debug=false ;;
-    *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;;
-esac],[debug=false])
-AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
-
-# we have to revive the env CFLAGS as some how they are being overwritten...
-# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering
-# for why they should NEVER be modified by the configure to allow for user
-# overrides.
-CFLAGS=$ac_env_CFLAGS_value
-
-
-AC_CONFIG_FILES([Makefile libosmeventplugin.spec])
-AC_OUTPUT
diff --git a/opensm/osmeventplugin/libosmeventplugin.spec.in b/opensm/osmeventplugin/libosmeventplugin.spec.in
deleted file mode 100644
index 60ab1b7..0000000
--- a/opensm/osmeventplugin/libosmeventplugin.spec.in
+++ /dev/null
@@ -1,38 +0,0 @@
-
-%define ver @VERSION@
-%define RELEASE 1
-%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}
-
-Summary: OpenIB InfiniBand OpenSM Event Database Library
-Name: libibeventdb
-Version: %ver
-Release: %rel%{?dist}
-License: GPL/BSD
-Group: System Environment/Libraries
-BuildRoot: %{_tmppath}/%{name}-%{version}-root
-Source: http://openfabrics.org/~halr/management/%{name}-%{version}.tar.gz
-Url: http://openfabrics.org/
-Requires: opensm
-
-%description
-libibeventdb provides a default plugin for the OpenSM event database
-
-%prep
-%setup -q
-
-%build
-%configure
-make
-
-%install
-make DESTDIR=${RPM_BUILD_ROOT} install
-# remove unpackaged files from the buildroot
-rm -f $RPM_BUILD_ROOT%{_libdir}/*.la
-
-%clean
-rm -rf $RPM_BUILD_ROOT
-
-%files
-%defattr(-,root,root)
-%{_libdir}/libibeventdb*.so.*
-%doc ChangeLog
diff --git a/opensm/osmtest/configure.in b/opensm/osmtest/configure.in
deleted file mode 100644
index 8470487..0000000
--- a/opensm/osmtest/configure.in
+++ /dev/null
@@ -1,75 +0,0 @@
-dnl Process this file with autoconf to produce a configure script.
-
-AC_PREREQ(2.57)
-AC_INIT([osmtest.c])
-
-#AC_INIT(opensm, 0.9.0, general at lists.openfabrics.org)
-#AC_CONFIG_SRCDIR([osm_sa_service_record_ctrl.c])
-AC_CONFIG_AUX_DIR(config)
-AM_CONFIG_HEADER(config.h)
-AM_INIT_AUTOMAKE(osmtest, 0.9.0)
-
-dnl Checks for programs
-AC_PROG_CXX
-AC_PROG_CC
-AC_PROG_CPP
-AC_PROG_INSTALL
-AC_PROG_LN_S
-AC_PROG_MAKE_SET
-AC_PROG_LIBTOOL
-
-dnl Checks for libraries
-#AC_CHECK_LIB(osmcomp, cl_thread_pool_init, [],
-#    AC_MSG_ERROR([cl_thread_pool_init() not found. opensm requires libosmcomp.]))
-#AC_CHECK_LIB(osmvendor, osm_vendor_init, [],
-#    AC_MSG_ERROR([osm_vendor_init() not found. opensm requires libosmvendor.]))
-
-dnl Checks for header files.
-AC_HEADER_STDC
-AC_CHECK_HEADERS([fcntl.h stdlib.h sys/time.h unistd.h])
-
-dnl Checks for library functions
-#AC_FUNC_MALLOC
-AC_FUNC_VPRINTF
-AC_CHECK_FUNCS([gettimeofday localtime_r strcspn strtol strtoull])
-
-dnl Checks for typedefs, structures, and compiler characteristics.
-AC_C_CONST
-AC_C_INLINE
-AC_TYPE_PID_T
-AC_TYPE_SIZE_T
-AC_HEADER_TIME
-AC_STRUCT_TM
-AC_C_VOLATILE
-
-AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
-    if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then
-        ac_cv_version_script=yes
-    else
-        ac_cv_version_script=no
-    fi)
-
-AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes")
-
-dnl support debug mode
-AC_ARG_ENABLE(debug,
-[ --enable-debug Turn on debug mode],
-[case "${enableval}" in
-    yes) debug=true ;;
-    no) debug=false ;;
-    *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;;
-esac],[debug=false])
-AM_CONDITIONAL(DEBUG, test x$debug = xtrue)
-
-
-dnl Provide user option to select vendor
-OPENIB_APP_OSMV_SEL
-
-# we have to revive the env CFALGS as some how they are being overwritten...
-# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering
-# for why they should NEVER be modified by the configure to allow for user
-# overrides.
-CFLAGS=$ac_env_CFLAGS_value
-
-AC_CONFIG_FILES([Makefile])
-AC_OUTPUT
--
1.5.3.1.91.gd3392

From sashak at voltaire.com Sat Sep 15 11:37:29 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 15 Sep 2007 21:37:29 +0300
Subject: [ofa-general] [PATCH] opensm: fix broken make dist
Message-ID: <20070915183729.GC6891@sashak.voltaire.com>

This adds qos parser lex and yacc files to the opensm tarball created by
'make dist' to make it buildable.

Signed-off-by: Sasha Khapyorsky
---
 opensm/opensm/Makefile.am |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am
index 8440b4a..5e1abd5 100644
--- a/opensm/opensm/Makefile.am
+++ b/opensm/opensm/Makefile.am
@@ -105,7 +105,8 @@ opensminclude_HEADERS = $(srcdir)/../include/opensm/osm_base.h \
 
 # headers are distributed as part of the include dir
 EXTRA_DIST = $(srcdir)/libopensm.map $(srcdir)/libopensm.ver \
-	$(srcdir)/ChangeLog
+	$(srcdir)/ChangeLog \
+	$(srcdir)/osm_qos_parser.y $(srcdir)/osm_qos_parser.l
 
 # as we can not use libtool -release since it actually changes the SONAME
 # to the full release name instead of keeping it to the original
--
1.5.3.1.91.gd3392

From sashak at voltaire.com Sat Sep 15 11:38:54 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 15 Sep 2007 21:38:54 +0300
Subject: [ofa-general] [PATCH] libibcommon|umad: remove AC_PROG_CXX config check
Message-ID: <20070915183854.GD6891@sashak.voltaire.com>

This removes AC_PROG_CXX from configure.in.

Signed-off-by: Sasha Khapyorsky
---
 libibcommon/configure.in |    1 -
 libibumad/configure.in   |    1 -
 2 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/libibcommon/configure.in b/libibcommon/configure.in
index 50de5ff..78f615d 100644
--- a/libibcommon/configure.in
+++ b/libibcommon/configure.in
@@ -15,7 +15,6 @@ fi
 AC_SUBST(ibcommon_api_version)
 
 dnl Checks for programs
-AC_PROG_CXX
 AC_PROG_CC
 AC_PROG_CPP
 AC_PROG_INSTALL
diff --git a/libibumad/configure.in b/libibumad/configure.in
index eb30c8b..d5ebe5b 100644
--- a/libibumad/configure.in
+++ b/libibumad/configure.in
@@ -34,7 +34,6 @@ else
 fi
 
 dnl Checks for programs
-AC_PROG_CXX
 AC_PROG_CC
 AC_PROG_CPP
 AC_PROG_INSTALL
--
1.5.3.1.91.gd3392

From sashak at voltaire.com Sat Sep 15 11:40:13 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 15 Sep 2007 21:40:13 +0300
Subject: [ofa-general] [PATCH] opensm: build improvements
Message-ID: <20070915184013.GE6891@sashak.voltaire.com>

Build and link OpenSM against libibumad. libibcommon header files and
libraries in order: local tree, then installed.

Signed-off-by: Sasha Khapyorsky
---
 opensm/config/osmvsel.m4 |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/opensm/config/osmvsel.m4 b/opensm/config/osmvsel.m4
index 47ad36f..36c5ddf 100644
--- a/opensm/config/osmvsel.m4
+++ b/opensm/config/osmvsel.m4
@@ -61,11 +61,12 @@ with_sim="/usr")
 dnl based on the with_osmv we can try the vendor flag
 if test $with_osmv = "openib"; then
    OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB"
-   OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband"
-   if test "x$with_umad_libs" = "x"; then
-      OSMV_LDADD="-libumad"
-   else
-      OSMV_LDADD="-L$with_umad_libs -libumad"
+   OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include -I\$(srcdir)/../../libibumad/include -I\$(includedir)"
+   OSMV_LDADD="-L\$(abs_srcdir)/../../libibumad/.libs -L\$(abs_srcdir)/../../libibcommon/.libs -L\$(libdir) -libumad -libcommon"
+   OSMV_LDADD="-Wl,--rpath -Wl,\$(abs_srcdir)/../../libibumad/.libs -Wl,--rpath -Wl,\$(abs_srcdir)/../../libibcommon/.libs -Wl,--rpath -Wl,\$(libdir) $OSMV_LDADD"
+
+   if test "x$with_umad_libs" != "x"; then
+      OSMV_LDADD="-L$with_umad_libs $OSMV_LDADD"
    fi
 
    if test "x$with_umad_includes" != "x"; then
--
1.5.3.1.91.gd3392

From sashak at voltaire.com Sat Sep 15 12:23:39 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 15 Sep 2007 22:23:39 +0300
Subject: [ofa-general] [PATCH] Fix umad_get_cas_names() usage in libibumad.
In-Reply-To: <87k5qysfls.fsf@confield.dd.xiranet.com>
References: <878x7ilhrl.fsf@confield.dd.xiranet.com> <87k5qysfls.fsf@confield.dd.xiranet.com>
Message-ID: <20070915192339.GF6891@sashak.voltaire.com>

On 17:30 Mon 10 Sep , Arne Redlich wrote:
> "Hal Rosenstock" writes:
>
> > On 9/7/07, Arne Redlich wrote:
> >> resolve_ca_name() passes a wrong "max" argument to umad_get_cas_names.
> >>
> >> Signed-off-by: Arne Redlich
> >> ---
> >> diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c
> >> index 787aa92..589684c 100644
> >> --- a/libibumad/src/umad.c
> >> +++ b/libibumad/src/umad.c
> >> @@ -307,7 +307,7 @@ resolve_ca_name(char *ca_name, int *best_port)
> >>         }
> >>
> >>         /* Get the list of CA names */
> >> -       if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0)
> >> +       if ((n = umad_get_cas_names((void *)names, 20)) < 0)
> >
> > Rather than the hard coded 20 here and elsewhere, should this be
> > replaced by a #define ?
>
> How about a umad_get_cas_count() helper instead?

I'm not against using '20' here, since this fixed-size array is declared
just a few lines above. A helper function could be nicer, but what do you
mean? Something like (sizeof(names)/UMAD_CA_NAME_LEN)?

Sasha

From sashak at voltaire.com Sat Sep 15 12:51:19 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 15 Sep 2007 22:51:19 +0300
Subject: [ofa-general] Re: [PATCH] osm: bugfix - IB_PR_COMPMASK was used in MPR
In-Reply-To:
References: <46E3EDC6.9070901@dev.mellanox.co.il> <20070910161926.GJ29384@sashak.voltaire.com> <20070910163014.GK29384@sashak.voltaire.com>
Message-ID: <20070915195119.GG6891@sashak.voltaire.com>

On 12:25 Mon 10 Sep , Hal Rosenstock wrote:
> On 9/10/07, Sasha Khapyorsky wrote:
> > Hi Hal,
> >
> > On 12:13 Mon 10 Sep , Hal Rosenstock wrote:
> > > Hi Sasha,
> > >
> > > On 9/10/07, Sasha Khapyorsky wrote:
> > > > On 15:57 Sun 09 Sep , Yevgeny Kliteynik wrote:
> > > > > Hi Sasha,
> > > > >
> > > > > In several places in MPR implementation IB_PR_COMPMASK_*
> > > > > was used instead of IB_MPR_COMPMASK_*
> > > > >
> > > > > Signed-off-by: Yevgeny Kliteynik
> > > >
> > > > Applied. Thanks.
> > >
> > > Shouldn't this also be applied to OFED 1.2 ?
> >
> > It does not look for me that any new OFED 1.2x distribution is planned.
>
> Seems like this is an EWG issue.
>
> Should there be OFED 1.2.x fix release(s) ?

At least none that I know of.
If there is one, I could incorporate critical fixes from 'master' into
the 'ofed_1_2' branch. For users who prefer to use OpenSM separately from
OFED I would suggest 'master' anyway. Seems reasonable?

Sasha

From sashak at voltaire.com Sat Sep 15 19:16:59 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Sep 2007 04:16:59 +0200
Subject: [ofa-general] Re: [PATCH 3/7 V3] osm: QoS policy C & H files
In-Reply-To: <46E6AB89.6050102@dev.mellanox.co.il>
References: <46D359BE.6040009@dev.mellanox.co.il> <20070828134044.GD18082@sashak.voltaire.com> <46E6AB89.6050102@dev.mellanox.co.il>
Message-ID: <20070916021659.GK6891@sashak.voltaire.com>

Hi Yevgeny,

On 17:51 Tue 11 Sep , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> >> +typedef struct _osm_qos_policy_t {
> >> +	cl_list_t port_groups;      /* list of osm_qos_port_group_t */
> >> +	cl_list_t sl2vl_tables;     /* list of osm_qos_sl2vl_scope_t */
> >> +	cl_list_t vlarb_tables;     /* list of osm_qos_vlarb_scope_t */
> >> +	cl_list_t qos_levels;       /* list of osm_qos_level_t */
> >> +	cl_list_t qos_match_rules;  /* list of osm_qos_match_rule_t */
> >
> > Here and above - where possible please use cl_qlist_t instead of
> > cl_list_t - it is _much_ faster (I did some benchmarking when worked
> > on up/down performance issues).
>
> What about cl_map_t vs cl_qmap_t?
> Is the difference there significant?

I guess it should be (I never checked, however): basically cl_map* does
what cl_qmap* does plus additional things (including memory allocations,
etc.).

Sasha

From ogerlitz at voltaire.com Sun Sep 16 00:34:23 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 16 Sep 2007 10:34:23 +0300
Subject: [ofa-general] [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery.
In-Reply-To: <20070912100025.3190.89259.stgit@dell3.ogc.int>
References: <20070912100025.3190.89259.stgit@dell3.ogc.int>
Message-ID: <46ECDC7F.2070805@voltaire.com>

Steve Wise wrote:
> RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery.
>
> Calling arp_send() to initiate neighbour discovery (ND) doesn't do the
> full ND protocol. Namely, it doesn't handle retransmitting the arp
> request if it is dropped. The function neigh_event_send() does all this.
> Without doing full ND, rdma address resolution fails in the presence of
> dropped arp bcast packets.

Jay,

Is there a way to deploy something similar for the gratuitous arp being
sent by the bonding driver at bond_arp_send()? We have seen rare
situations where the skb was dropped by the stack and hence bonding
fail-over was detected by the remote peer only when its neighboring
subsystem probe failures dictated that a new arp must be issued.

Or.

>
> Signed-off-by: Steve Wise
> ---
>
>  drivers/infiniband/core/addr.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
> index c5c33d3..5381c80 100644
> --- a/drivers/infiniband/core/addr.c
> +++ b/drivers/infiniband/core/addr.c
> @@ -161,8 +161,7 @@ static void addr_send_arp(struct sockadd
>  	if (ip_route_output_key(&rt, &fl))
>  		return;
>
> -	arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev,
> -		 rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL);
> +	neigh_event_send(rt->u.dst.neighbour, NULL);
>  	ip_rt_put(rt);
>  }

From kliteyn at dev.mellanox.co.il Sun Sep 16 01:54:32 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 16 Sep 2007 10:54:32 +0200
Subject: [ofa-general] [PATCH] opensm: build improvements
In-Reply-To: <20070915184013.GE6891@sashak.voltaire.com>
References: <20070915184013.GE6891@sashak.voltaire.com>
Message-ID: <46ECEF48.9010504@dev.mellanox.co.il>

Great, thanks !!!
This and the bunch of previous build-related patches really simplifies
the build.

-- Yevgeny

Sasha Khapyorsky wrote:
> Build and link OpenSM against libibumad. libibcommon header files and
> libraries in order: local tree, then installed.
>
> Signed-off-by: Sasha Khapyorsky
> ---
>  opensm/config/osmvsel.m4 |   11 ++++++-----
>  1 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/opensm/config/osmvsel.m4 b/opensm/config/osmvsel.m4
> index 47ad36f..36c5ddf 100644
> --- a/opensm/config/osmvsel.m4
> +++ b/opensm/config/osmvsel.m4
> @@ -61,11 +61,12 @@ with_sim="/usr")
>  dnl based on the with_osmv we can try the vendor flag
>  if test $with_osmv = "openib"; then
>     OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB"
> -   OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband"
> -   if test "x$with_umad_libs" = "x"; then
> -      OSMV_LDADD="-libumad"
> -   else
> -      OSMV_LDADD="-L$with_umad_libs -libumad"
> +   OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include -I\$(srcdir)/../../libibumad/include -I\$(includedir)"
> +   OSMV_LDADD="-L\$(abs_srcdir)/../../libibumad/.libs -L\$(abs_srcdir)/../../libibcommon/.libs -L\$(libdir) -libumad -libcommon"
> +   OSMV_LDADD="-Wl,--rpath -Wl,\$(abs_srcdir)/../../libibumad/.libs -Wl,--rpath -Wl,\$(abs_srcdir)/../../libibcommon/.libs -Wl,--rpath -Wl,\$(libdir) $OSMV_LDADD"
> +
> +   if test "x$with_umad_libs" != "x"; then
> +      OSMV_LDADD="-L$with_umad_libs $OSMV_LDADD"
>     fi
>
>     if test "x$with_umad_includes" != "x"; then

From ogerlitz at voltaire.com Sun Sep 16 00:57:05 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 16 Sep 2007 10:57:05 +0300
Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <46ECE1D1.4020503@voltaire.com>

> Krishna Kumar
> date Aug 22, 2007 11:28 AM
> subject [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB

> Issues:
> --------
> I am getting a huge amount of retransmissions for both TCP and TCP No Delay
> cases for IPoIB (which explains the slight degradation for some test cases
> mentioned in previous mail). After a full test run, there were 18500
> retransmissions for every 1 in regular code. But there is 20.7% overall
> improvement in BW even with this huge amount of retransmissions (which implies
> batching could improve results even more if this problem is fixed). Results of
> experiments are:
>     a. With batching set to maximum 2 skbs, I get almost the same number
>        of retransmissions (implies receiver probably is not dropping skbs).
>        ifconfig/netstat on receiver gives no clue (drop/errors, etc).
>     b. Making the IPoIB xmit create single work requests for each skb on
>        blist reduces retrans to same as in regular code.
>     c. Similar retransmission increase is not seen for E1000.

Krishna Kumar wrote:
> Issues:
> --------
> The retransmission problem reported earlier seems to happen when mthca is
> used as the underlying device, but when I tested ehca the retransmissions
> dropped to normal levels (around 2 times the regular code). The performance
> improvement is around 55% for TCP.

Hi,

So with ipoib/mthca you still see this 1 : 18.5K retransmission rate
(with no noticeable retransmission increase for E1000) you were reporting
at the V4 post?! if this is the case, I think it calls for further
examination, where help from Mellanox could ease things, I guess.

By saying that with ehca you see "normal level retransmissions - 2 times
the regular code" do you mean 1 : 2 retransmission rate between batching
to no batching?

I am not sure this was mentioned over the threads, but clearly two sides
are needed for the dance here, namely I think you want to do your tests
(both the no batching and with batching) with something like NAPI enabled
at the receiver side, 2.6.23-rc5 has NAPI

> ----------------------------------------------------
> TCP
> ----

is this with no delay set or not? connected or datagram mode? mtu?
netperf command?
system spec (specifically hca device id and fw version), etc?

> Size:32    Procs:1      2728      3544     29.91
> Size:128   Procs:1     11803     13679     15.89
> Size:512   Procs:1     43279     49665     14.75
> Size:4096  Procs:1    147952    101246    -31.56
> Size:16384 Procs:1    149852    141897     -5.30
>
> Size:32    Procs:4     10562     11349      7.45
> Size:128   Procs:4     41010     40832      -.43
> Size:512   Procs:4     75374    130943     73.72
> Size:4096  Procs:4    167996    368218    119.18
> Size:16384 Procs:4    123176    379524    208.11
>
> Size:32    Procs:8     21125     21990      4.09
> Size:128   Procs:8     77419     78605      1.53
> Size:512   Procs:8    234678    265047     12.94
> Size:4096  Procs:8    218063    367604     68.57
> Size:16384 Procs:8    184283    370972    101.30
>
> Average: 1509300 -> 2345115 = 55.38%

From ogerlitz at voltaire.com Sun Sep 16 01:04:25 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 16 Sep 2007 11:04:25 +0300
Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To:
References:
Message-ID: <46ECE389.4020308@voltaire.com>

Shirley Ma wrote:
> Since ehca can support 4K MTU, we would like to see a patch in
> IPoIB to allow link MTU to be up to 4K instead of current 2K for 2.6.24
> kernel. The idea is IPoIB link MTU will pick up a return value from SM's
> default broadcast MTU. This patch should be a small patch, I hope you are
> OK with this.

The only IB switching chip I know does not support 4K IB MTU so you
would be able to use it only in p2p connections, correct?

Or.

From mst at dev.mellanox.co.il Sun Sep 16 02:10:02 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 16 Sep 2007 11:10:02 +0200
Subject: [ofa-general] Re: [GIT PULL ofed-1.3] cxgb3 bug fixes
In-Reply-To: <46E98A5A.1000507@opengridcomputing.com>
References: <46E94B36.70406@opengridcomputing.com> <46E98A5A.1000507@opengridcomputing.com>
Message-ID: <20070916091002.GE30150@mellanox.co.il>

OK, I'll be doing this by merging ofed_1_2_c into ofed_kernel and then
removing 029.
Quoting Steve Wise : Subject: [GIT PULL ofed-1.3] cxgb3 bug fixes For ofed-1.3, please pull from: git://git.openfabrics.org/~swise/ofed-1.3 ofed_kernel The 1.3 patch series is identical to the ofed_1_2_c series except that the first patch, 0029-*, isn't needed since its already in ofed-1.3 from 2.6.23. Thanks, Steve. Steve Wise wrote: >Vlad (Michael/Tziporet in Vlad's absence), > >Please integrate the following cxgb3 bug fixes into ofed-1.2.5. All of >these patches are either in 2.6.23 or merged into Jeff Garzik's upstream >branch of netdev-2.6 and will go into 2.6.24. Chelsio recommends we >update ofed-1.2.5 and ofed-1.3 will all of these fixes. > >I'll send another email with the ofed-1.3 changes as they will be >slightly different. > >Please pull the ofed_1_2_c changes from: > >git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c > >The patch files added to kernel_patches/fixes include: > >>swise at dell3:~/git/ofed-1.2.5> stg series >>+ 0029-cxgb3-engine-microcode-load >>+ 0030-cxgb3-MAC-workaround-update >>+ 0031-cxgb3-Update-rx-coalescing-length >>+ 0032-cxgb3-SGE-doorbell-overflow-warning >>+ 0033-cxgb3-use-immediate-data-for-offload-Tx >>+ 0034-cxgb3-Expose-HW-memory-page-info >>+ 0035-cxgb3-tighten-checks-on-TID-values >>+ 0036-cxgb3-Fatal-error-update >>+ 0037-cxgb3-log-adapter-serial-number >>+ 0038-cxgb3-Update-internal-memory-management >>+ 0039-cxgb3-update-firmware-version >>+ 0040-cxgb3-log-and-clear-PEX-errors >>+ 0041-cxgb3-remove-false-positive-in-xgmac-workaround >>+ 0042-cxgb3-Set-the-CQ_ERR-bit-in-CQ-contexts >>+ 0043-cxgb3-CQ-context-operations-time-out-too-soon >>+ 0044-cxgb3-Add-T3C-rev >>+ 0045-cxgb3-Update-engine-microcode-version >>> 0046-cxgb3-driver-version > >Steve. > _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- MST From mst at dev.mellanox.co.il Sun Sep 16 02:10:26 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 16 Sep 2007 11:10:26 +0200 Subject: [ofa-general] Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes In-Reply-To: <46E94B36.70406@opengridcomputing.com> References: <46E94B36.70406@opengridcomputing.com> Message-ID: <20070916091024.GF30150@mellanox.co.il> Done. I'll push soon. Quoting Steve Wise : Subject: [GIT PULL ofed_1_2_c] cxgb3 bug fixes Vlad (Michael/Tziporet in Vlad's absence), Please integrate the following cxgb3 bug fixes into ofed-1.2.5. All of these patches are either in 2.6.23 or merged into Jeff Garzik's upstream branch of netdev-2.6 and will go into 2.6.24. Chelsio recommends we update ofed-1.2.5 and ofed-1.3 will all of these fixes. I'll send another email with the ofed-1.3 changes as they will be slightly different. Please pull the ofed_1_2_c changes from: git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c The patch files added to kernel_patches/fixes include: >swise at dell3:~/git/ofed-1.2.5> stg series >+ 0029-cxgb3-engine-microcode-load >+ 0030-cxgb3-MAC-workaround-update >+ 0031-cxgb3-Update-rx-coalescing-length >+ 0032-cxgb3-SGE-doorbell-overflow-warning >+ 0033-cxgb3-use-immediate-data-for-offload-Tx >+ 0034-cxgb3-Expose-HW-memory-page-info >+ 0035-cxgb3-tighten-checks-on-TID-values >+ 0036-cxgb3-Fatal-error-update >+ 0037-cxgb3-log-adapter-serial-number >+ 0038-cxgb3-Update-internal-memory-management >+ 0039-cxgb3-update-firmware-version >+ 0040-cxgb3-log-and-clear-PEX-errors >+ 0041-cxgb3-remove-false-positive-in-xgmac-workaround >+ 0042-cxgb3-Set-the-CQ_ERR-bit-in-CQ-contexts >+ 0043-cxgb3-CQ-context-operations-time-out-too-soon >+ 0044-cxgb3-Add-T3C-rev >+ 0045-cxgb3-Update-engine-microcode-version >> 0046-cxgb3-driver-version Steve. 
--
MST

From ogerlitz at voltaire.com Sun Sep 16 01:50:07 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 16 Sep 2007 11:50:07 +0300
Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To:
References:
Message-ID: <46ECEE3F.60301@voltaire.com>

Roland Dreier wrote:
> With 2.6.24 probably opening in the not-too-distant future, it's
> probably a good time to review what my plans are for when the merge
> window opens.

> Core:

> - Sean's QoS changes. These look fine at first glance, and I just
> plan to understand the backwards compatibility story (ie how this
> works with an old SM) and merge. Anyone who objects let me know.

Hi Roland,

I have reviewed the qos patches and provided comments, which were
incorporated in v2 of the series. I also tested it (ipoib and iser,
which is rdma-cm based) against the Voltaire SM/SA to see that nothing
was broken. I will send you a "reviewed by:" signature.

> ULPs:

> [ofa-general] [PATCH RFC] IB/ipoib: enable IGMP for userpsace multicast IB apps

The IGMP enabling patch I posted on September 2nd isn't on your list:

http://lists.openfabrics.org/pipermail/general/2007-September/040250.html

Can you add it?

> - Moni's IPoIB bonding support. This seems mostly an issue of
> getting the core bonding maintainer's attention. However getting a
> Reviewed-by: for the IPoIB changes wouldn't hurt too.

Jay Vosburgh, the bonding driver maintainer, just sent an ack on the
whole patch series. As for the IPoIB changes, there are three patches,
two of which, namely

> [PATCH 02/11] IB/ipoib: Notify the world before doing unregister
> [PATCH 04/11] IB/ipoib: Verify address handle validity on send

handle corner-case problems pointed out by Michael Tsirkin. Michael,
will you be able to look at them and provide a reviewed-by signature?
The third patch

> [PATCH 03/11] IB/ipoib: Bound the net device to the ipoib_neigh structue

is much simpler; I don't think more review is needed for it.

> - Eli and Michael's IPoIB stateless offload (checksum offload, LSO,
> LRO, etc). It's a big series that makes quite a few core changes.
> I think it needs some careful review and is probably at risk of
> missing this merge window. Sorting in order of invasiveness so we
> can merge at least some of it (if splitting it makes sense) might
> be a good idea.

Just for the record, the 'etc' above relates to the interrupt moderation
support (mlx4, core, ipoib: config through ethtool, usage). Among other
things, what is not clear to me here is if/how this goes hand-in-hand
with NAPI.

As you saw, the patch adding checksum offload support had a long thread,
and I think the discussion has reached the point where Michael is
waiting for your take on it.

As for the LSO and LRO patches, I did not see any review comments. I
will see what I can review from the series and, to begin with, will send
Eli some comments and questions.

> HW specific:

> - Jack and Michael's mlx4 FMR support. Will merge I guess, although
> I do hope to have time to address the DMA API abuse that is being
> copied from mthca, so that mlx4 and mthca work in Xen domU.

This patch series is rather important, as without it iser is useless
over ConnectX. It would be nice if you merged this and, at most, fixed
the abuse later.

Or.
From vlad at lists.openfabrics.org Sun Sep 16 02:53:01 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 16 Sep 2007 02:53:01 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070916-0200 daily build status Message-ID: <20070916095302.1668FE6083B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 
with linux-2.6.22 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070916-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From mst at dev.mellanox.co.il Sun Sep 16 02:59:31 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 16 Sep 2007 11:59:31 +0200 Subject: [ofa-general] Re: RFC: modify upstream code to make backporting easier In-Reply-To: References: <20070911062851.GC15363@mellanox.co.il> Message-ID: <20070916095930.GI30150@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: RFC: modify upstream code to make backporting easier > > > I wonder whether it's acceptable in cases such as this to add > > a wrapper in upstream code. 
For example, upstream could have:
> >
> > #ifndef pci_get_revision
> > #define pci_get_revision(dev) ((dev)->revision)
> > #endif
>
> My feeling is that this type of wrapper is just obfuscation that makes
> the driver harder to read and maintain.

Note that some people only run backported drivers, so making it easier
to read and maintain *the backport* is also important.

> If there's a way to make
> backporting easier that also makes the upstream driver better, then
> I'm in favor of it, but this sounds like a bad example to me.

Do you think applying a patch as we do now is the best way to do it
then? Or do you have other ideas on how to make backporting this example
better?

--
MST

From ogerlitz at voltaire.com Sun Sep 16 02:04:54 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 16 Sep 2007 12:04:54 +0300
Subject: [ofa-general] Re: [RFC] [PATCH 1/5 v3] ib/ipoib: specify Traffic Class with PR queries for QoS support
In-Reply-To: <000101c7f009$6472de50$3c98070a@amr.corp.intel.com>
References: <000101c7f009$6472de50$3c98070a@amr.corp.intel.com>
Message-ID: <46ECF1B6.3020802@voltaire.com>

Sean Hefty wrote:
> To support QoS within and between subnets, modify IPoIB to request
> specific Traffic Class values with path record queries, using
> the value associated with the IPoIB broadcast group.
>
> Signed-off-by: Sean Hefty

See the comments I made on this at v1 and v2 of the posts:

http://lists.openfabrics.org/pipermail/general/2007-August/039275.html
http://lists.openfabrics.org/pipermail/general/2007-September/040312.html

Reviewed-by: Or Gerlitz

> ---
> Added missing traffic class to PR component mask.
> > drivers/infiniband/ulp/ipoib/ipoib.h | 22 > +++++++++++++++++++++- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 +++++--- > drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 22 > ---------------------- > 3 files changed, 26 insertions(+), 26 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h > b/drivers/infiniband/ulp/ipoib/ipoib.h > index 285c143..fc16bce 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -113,7 +113,27 @@ struct ipoib_pseudoheader { > u8 hwaddr[INFINIBAND_ALEN]; > }; > > -struct ipoib_mcast; > +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ > +struct ipoib_mcast { > + struct ib_sa_mcmember_rec mcmember; > + struct ib_sa_multicast *mc; > + struct ipoib_ah *ah; > + > + struct rb_node rb_node; > + struct list_head list; > + > + unsigned long created; > + unsigned long backoff; > + > + unsigned long flags; > + unsigned char logcount; > + > + struct list_head neigh_list; > + > + struct sk_buff_head pkt_queue; > + > + struct net_device *dev; > +}; > > struct ipoib_rx_buf { > struct sk_buff *skb; > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c > b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 894b1dc..841e068 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -468,9 +468,10 @@ static struct ipoib_path *path_rec_create(struct > net_device *dev, void *gid) > INIT_LIST_HEAD(&path->neigh_list); > > memcpy(path->pathrec.dgid.raw, gid, sizeof (union ib_gid)); > - path->pathrec.sgid = priv->local_gid; > - path->pathrec.pkey = cpu_to_be16(priv->pkey); > - path->pathrec.numb_path = 1; > + path->pathrec.sgid = priv->local_gid; > + path->pathrec.pkey = cpu_to_be16(priv->pkey); > + path->pathrec.numb_path = 1; > + path->pathrec.traffic_class = > priv->broadcast->mcmember.traffic_class; > > return path; > } > @@ -491,6 +492,7 @@ static int path_rec_start(struct net_device *dev, > 
IB_SA_PATH_REC_DGID | > IB_SA_PATH_REC_SGID | > IB_SA_PATH_REC_NUMB_PATH | > + IB_SA_PATH_REC_TRAFFIC_CLASS | > IB_SA_PATH_REC_PKEY, > 1000, GFP_ATOMIC, > path_rec_completion, > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > index aae3670..94a5709 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > @@ -57,28 +57,6 @@ MODULE_PARM_DESC(mcast_debug_level, > > static DEFINE_MUTEX(mcast_mutex); > > -/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ > -struct ipoib_mcast { > - struct ib_sa_mcmember_rec mcmember; > - struct ib_sa_multicast *mc; > - struct ipoib_ah *ah; > - > - struct rb_node rb_node; > - struct list_head list; > - > - unsigned long created; > - unsigned long backoff; > - > - unsigned long flags; > - unsigned char logcount; > - > - struct list_head neigh_list; > - > - struct sk_buff_head pkt_queue; > - > - struct net_device *dev; > -}; > - > struct ipoib_mcast_iter { > struct net_device *dev; > union ib_gid mgid; > From ogerlitz at voltaire.com Sun Sep 16 02:06:31 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 16 Sep 2007 12:06:31 +0300 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] rdma/cm: add ability to specifytype of service In-Reply-To: <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> Message-ID: <46ECF217.900@voltaire.com> Sean Hefty wrote: > Provide support to specify a type of service for a communication > identifier. A new function call is used when dealing with IPv4 > addresses. For IPv6 addresses, the ToS is specified through the > traffic class field in the sockaddr_in6 structure. 
> > Signed-off-by: Sean Hefty The comments Eitan Zahavi and myself have made over the v1 post at http://lists.openfabrics.org/pipermail/general/2007-August/039247.html were fully addressed. Reviewed-by: Or Gerlitz > --- > > drivers/infiniband/core/cma.c | 44 > ++++++++++++++++++++++++++++++++--------- > include/rdma/rdma_cm.h | 14 +++++++++++++ > 2 files changed, 48 insertions(+), 10 deletions(-) > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index 9ffb998..19c9172 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -138,6 +138,7 @@ struct rdma_id_private { > u32 qkey; > u32 qp_num; > u8 srq; > + u8 tos; > }; > > struct cma_multicast { > @@ -1474,6 +1475,15 @@ err: > } > EXPORT_SYMBOL(rdma_listen); > > +void rdma_set_service_type(struct rdma_cm_id *id, int tos) > +{ > + struct rdma_id_private *id_priv; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + id_priv->tos = (u8) tos; > +} > +EXPORT_SYMBOL(rdma_set_service_type); > + > static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, > void *context) > { > @@ -1498,23 +1508,37 @@ static void cma_query_handler(int status, struct > ib_sa_path_rec > *path_rec, > static int cma_query_ib_route(struct rdma_id_private *id_priv, int > timeout_ms, > struct cma_work *work) > { > - struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; > + struct rdma_addr *addr = &id_priv->id.route.addr; > struct ib_sa_path_rec path_rec; > + ib_sa_comp_mask comp_mask; > + struct sockaddr_in6 *sin6; > > memset(&path_rec, 0, sizeof path_rec); > - ib_addr_get_sgid(addr, &path_rec.sgid); > - ib_addr_get_dgid(addr, &path_rec.dgid); > - path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); > + ib_addr_get_sgid(&addr->dev_addr, &path_rec.sgid); > + ib_addr_get_dgid(&addr->dev_addr, &path_rec.dgid); > + path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(&addr->dev_addr)); > path_rec.numb_path = 1; > path_rec.reversible = 1; > + 
path_rec.service_id = cma_get_service_id(id_priv->id.ps, > &addr->dst_addr); > + > + comp_mask = IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | > + IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH | > + IB_SA_PATH_REC_REVERSIBLE | IB_SA_PATH_REC_SERVICE_ID; > + > + if (addr->src_addr.sa_family == AF_INET) { > + path_rec.qos_class = cpu_to_be16((u16) id_priv->tos); > + comp_mask |= IB_SA_PATH_REC_QOS_CLASS; > + } else { > + sin6 = (struct sockaddr_in6 *) &addr->src_addr; > + path_rec.traffic_class = (u8) > (be32_to_cpu(sin6->sin6_flowinfo) >> 20); > + comp_mask |= IB_SA_PATH_REC_TRAFFIC_CLASS; > + } > > id_priv->query_id = ib_sa_path_rec_get(&sa_client, > id_priv->id.device, > - id_priv->id.port_num, &path_rec, > - IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | > - IB_SA_PATH_REC_PKEY | > IB_SA_PATH_REC_NUMB_PATH | > - IB_SA_PATH_REC_REVERSIBLE, > - timeout_ms, GFP_KERNEL, > - cma_query_handler, work, &id_priv->query); > + id_priv->id.port_num, > &path_rec, > + comp_mask, timeout_ms, > + GFP_KERNEL, > cma_query_handler, > + work, &id_priv->query); > > return (id_priv->query_id < 0) ? id_priv->query_id : 0; > } > diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h > index 2d6a770..010f876 100644 > --- a/include/rdma/rdma_cm.h > +++ b/include/rdma/rdma_cm.h > @@ -314,4 +314,18 @@ int rdma_join_multicast(struct rdma_cm_id *id, > struct sockaddr *addr, > */ > void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr); > > +/** > + * rdma_set_service_type - Set the type of service associated with a > + * connection identifier. > + * @id: Communication identifier to associated with service type. > + * @tos: Type of service. > + * > + * The type of service is interpretted as a differentiated service > + * field (RFC 2474). The service type should be specified before > + * performing route resolution, as existing communication on the > + * connection identifier may be unaffected. 
The type of service > + * requested may not be supported by the network to all destinations. > + */ > +void rdma_set_service_type(struct rdma_cm_id *id, int tos); > + > #endif /* RDMA_CM_H */ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From ogerlitz at voltaire.com Sun Sep 16 02:09:44 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 16 Sep 2007 12:09:44 +0300 Subject: [ofa-general] [RFC] [PATCH 2/5 v2] ib/sa: add new QoS fields to path record In-Reply-To: <000701c7ef3b$d16562e0$3c98070a@amr.corp.intel.com> References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000701c7ef3b$d16562e0$3c98070a@amr.corp.intel.com> Message-ID: <46ECF2D8.9000803@voltaire.com> Sean Hefty wrote: > The QoS annex defines new fields for path records. Add them to the > ib_sa for consumers that want to use them. 
> > Signed-off-by: Sean Hefty Reviewed-by: Or Gerlitz > --- > > drivers/infiniband/core/sa_query.c | 10 +++------- > include/rdma/ib_sa.h | 11 +++++------ > 2 files changed, 8 insertions(+), 13 deletions(-) > > diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c > index d271bd7..6f56bb5 100644 > --- a/drivers/infiniband/core/sa_query.c > +++ b/drivers/infiniband/core/sa_query.c > @@ -123,14 +123,10 @@ static u32 tid; > .field_name = "sa_path_rec:" #field > > static const struct ib_field path_rec_table[] = { > - { RESERVED, > + { PATH_REC_FIELD(service_id), > .offset_words = 0, > .offset_bits = 0, > - .size_bits = 32 }, > - { RESERVED, > - .offset_words = 1, > - .offset_bits = 0, > - .size_bits = 32 }, > + .size_bits = 64 }, > { PATH_REC_FIELD(dgid), > .offset_words = 2, > .offset_bits = 0, > @@ -179,7 +175,7 @@ static const struct ib_field path_rec_table[] = { > .offset_words = 12, > .offset_bits = 16, > .size_bits = 16 }, > - { RESERVED, > + { PATH_REC_FIELD(qos_class), > .offset_words = 13, > .offset_bits = 0, > .size_bits = 12 }, > diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h > index 5e26b2f..942692b 100644 > --- a/include/rdma/ib_sa.h > +++ b/include/rdma/ib_sa.h > @@ -109,8 +109,8 @@ enum ib_sa_selector { > * Reserved rows are indicated with comments to help maintainability. 
> */ > > -/* reserved: 0 */ > -/* reserved: 1 */ > +#define IB_SA_PATH_REC_SERVICE_ID (IB_SA_COMP_MASK( 0) |\ > + IB_SA_COMP_MASK( 1)) > #define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) > #define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) > #define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) > @@ -123,7 +123,7 @@ enum ib_sa_selector { > #define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) > #define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) > #define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) > -/* reserved: 14 */ > +#define IB_SA_PATH_REC_QOS_CLASS IB_SA_COMP_MASK(14) > #define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) > #define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) > #define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) > @@ -134,8 +134,7 @@ enum ib_sa_selector { > #define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) > > struct ib_sa_path_rec { > - /* reserved */ > - /* reserved */ > + __be64 service_id; > union ib_gid dgid; > union ib_gid sgid; > __be16 dlid; > @@ -148,7 +147,7 @@ struct ib_sa_path_rec { > int reversible; > u8 numb_path; > __be16 pkey; > - /* reserved */ > + __be16 qos_class; > u8 sl; > u8 mtu_selector; > u8 mtu; > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From johnpol at 2ka.mipt.ru Sun Sep 16 07:22:41 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Sun, 16 Sep 2007 18:22:41 +0400 Subject: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. In-Reply-To: <46EC00BE.3020801@opengridcomputing.com> References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <20070914130941.GG18517@2ka.mipt.ru> <46EC00BE.3020801@opengridcomputing.com> Message-ID: <20070916142241.GA26848@2ka.mipt.ru> Hi Steve. 
On Sat, Sep 15, 2007 at 10:56:46AM -0500, Steve Wise (swise at opengridcomputing.com) wrote: > >>The iWARP driver must translate all listens on address 0.0.0.0 to the > >>set of rdma-only ip addresses for the device in question. This prevents > >>incoming connect requests to the TCP ipaddresses from going up the > >>rdma stack. > > > >If the only solutions to solve a problem with hardware are to steal > >packets or became a real device, then real device is much more > >appropriate. Is that correct? > > > > This is a real device. I don't understand your question? Packets > aren't being stolen. I meant port from main network stack. Sorry for confusion. > >>+static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) > >>+{ > >>+ struct iwch_addrlist *addr; > >>+ > >>+ addr = kmalloc(sizeof *addr, GFP_KERNEL); > > > >As a small nitpick: this wants to be sizeof(struct in_ifaddr) > > > > No, insert_ifa() allocates a struct iwch_addrlist, which has 2 fields: a > list_head for linking, and a struct in_ifaddr pointer. sizeof(struct iwch_addrlist) of course, not (*addr). To simplify grep. > >>+ if (!addr) { > >>+ printk(KERN_ERR MOD "%s - failed to alloc memory!\n", > >>+ __FUNCTION__); > >>+ return; > >>+ } > >>+ addr->ifa = ifa; > >>+ mutex_lock(&rnicp->mutex); > >>+ list_add_tail(&addr->entry, &rnicp->addrlist); > >>+ mutex_unlock(&rnicp->mutex); > >>+} > > > >What about providing error back to caller and fail to register? > > > > There are two causes where this is called: 1) during module init to > populate the list of iwarp addresses. If we failed in that case then, I > _could_ then not register. 2) we get called via the notifier mechanism > when an address is added. If that fails, the caller doesn't care (since > we're on the notifier callout thread). But the code could perhaps > unregister the device. I prefer just logging an error in case 2. I'll > look into not registering if we cannot get any address due to lack of > memory. 
But there's another case: we load the module and the admin > hasn't yet created the ethX:iw interface. > > Perhaps I should change the code to only register as a working rdma > device _when_ we get at least one ethX:iwY interface created? Whatchathink? Does second case ends up with problem you described in the initial e-mail not being fixed? > >>+static inline int is_iwarp_label(char *label) > >>+{ > >>+ char *colon; > >>+ > >>+ colon = strchr(label, ':'); > >>+ if (colon && !strncmp(colon+1, "iw", 2)) > >>+ return 1; > >>+ return 0; > >>+} > > > >I.e. it is not allowed to create ':iw' alias for anyone else? > >Well, looks crappy, but if it is the only solution... > > > > It is kinda crappy. But I don't see a better solution. Any ideas? Does creating the whole new netdevice is a too big overhead, or is it considered bad idea? > >>+static struct iwch_listen_entry *alloc_listener(struct iwch_listen_ep > >>*ep, > >>+ __be32 addr) > > > >Do you know, that cxgb3 function names suck? :) > >Especially get_skb(). > > > >>+{ > >>+ struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > >>+ struct iwch_listen_entry *le; > >>+ > >>+ le = kmalloc(sizeof *le, GFP_KERNEL); > > > >Wants to be sizeof(struct iwch_listen_entry) and in other places too. > > > > Do you mean I shouldn't use sizeof *le, but rather sizeof(struct > iwch_listen_entry)? Is that the preferred coding style? Yes, exactly. > >I skipped rdma internals of the patch, since I do not know it enough > >to judge, but your approach looks good from core network point of view. > >Maybe you should automatically create an alias each time new interface > >is added so that admin would not care about proper aliases? > > > > That would be much better IMO, but the problem is that I cannot create > an alias without an actual ip address. Unless we change the core > services to allow it. > > Thanks for reviewing! > > Steve. 
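A minimal user-space sketch of the label check quoted above, assuming only the standard C library. It is an illustration rather than the driver code, and the sample labels in the usage note are made up. It shows that any alias whose suffix begins with "iw" is accepted, which is the concern raised about other users creating such aliases.

```c
#include <string.h>

/*
 * User-space copy of the check from the quoted patch: a label is
 * treated as iWARP-only when the text right after the first ':'
 * starts with "iw". Returns 1 on a match, 0 otherwise.
 */
static int is_iwarp_label(const char *label)
{
	const char *colon = strchr(label, ':');

	if (colon && !strncmp(colon + 1, "iw", 2))
		return 1;
	return 0;
}
```

With this check, hypothetical labels such as "eth0:iw" and "eth0:iw1" match, a plain "eth0" or "eth0:1" does not, and an unrelated alias such as "eth0:iwfoo" also matches.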
> -- Evgeniy Polyakov
From sashak at voltaire.com Sun Sep 16 09:49:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Sep 2007 18:49:51 +0200 Subject: [ofa-general] Re: [PATCH] osm: QoS - MultiPathRecord selection according to QoS level In-Reply-To: <46E40AC0.3090609@dev.mellanox.co.il> References: <46E40AC0.3090609@dev.mellanox.co.il> Message-ID: <20070916164951.GM6891@sashak.voltaire.com> On 18:01 Sun 09 Sep , Yevgeny Kliteynik wrote: > Hi Sasha > > This patch implements the MultiPathRecord selection according to QoS level. > > NOTE: this patch depends on another MPR patch that I sent earlier today: > "osm: bugfix - IB_PR_COMPMASK was used in MPR" > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From sashak at voltaire.com Sun Sep 16 10:01:35 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 16 Sep 2007 19:01:35 +0200 Subject: [ofa-general] Re: [PATCH] osm: QoS - changing 'no_qos' option to 'qos' In-Reply-To: <46E691A9.90308@dev.mellanox.co.il> References: <46E691A9.90308@dev.mellanox.co.il> Message-ID: <20070916170135.GN6891@sashak.voltaire.com> On 16:01 Tue 11 Sep , Yevgeny Kliteynik wrote: > Changing OpenSM option "no_qos" with default > value 'TRUE 'to "qos" with deafult value 'FALSE' > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks.
Sasha From davem at davemloft.net Sun Sep 16 16:17:48 2007 From: davem at davemloft.net (David Miller) Date: Sun, 16 Sep 2007 16:17:48 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070916.161748.48388692.davem@davemloft.net> From: Krishna Kumar Date: Fri, 14 Sep 2007 14:30:58 +0530 > This set of patches implements the batching xmit capability, and > adds support for batching in IPoIB and E1000 (E1000 driver changes > are ported, thanks to changes taken from Jamal's code from an old > kernel). The only major complaint I have about this patch series is that the IPoIB part should just be one big changeset. Otherwise the tree is not bisectable, for example the initial ipoib header file change breaks the build. The tree must compile and work properly after every single patch. On a lower priority, I question the indirection of skb_blist by making it a pointer. For what? Saving 12 bytes on 64-bit? That kmalloc()'d thing is a nearly guaranteed cache and/or TLB miss. Just inline the thing, we generally don't do crap like this anywhere else. From hadi at cyberus.ca Sun Sep 16 17:29:18 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 16 Sep 2007 20:29:18 -0400 Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <20070916.161748.48388692.davem@davemloft.net> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> Message-ID: <1189988958.4230.55.camel@localhost> On Sun, 2007-16-09 at 16:17 -0700, David Miller wrote: > The only major complaint I have about this patch series is that > the IPoIB part should just be one big changeset.
Dave, you do realize that i have been investing my time working on batching as well, right? cheers, jamal From davem at davemloft.net Sun Sep 16 18:02:32 2007 From: davem at davemloft.net (David Miller) Date: Sun, 16 Sep 2007 18:02:32 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <1189988958.4230.55.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> Message-ID: <20070916.180232.101592975.davem@davemloft.net> From: jamal Date: Sun, 16 Sep 2007 20:29:18 -0400 > On Sun, 2007-16-09 at 16:17 -0700, David Miller wrote: > > > The only major complaint I have about this patch series is that > > the IPoIB part should just be one big changeset. > > Dave, you do realize that i have been investing my time working on > batching as well, right? I do. And I'm reviewing and applying several hundred patches a day. What's the point? :-) From hadi at cyberus.ca Sun Sep 16 19:14:21 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 16 Sep 2007 22:14:21 -0400 Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <20070916.180232.101592975.davem@davemloft.net> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <20070916.180232.101592975.davem@davemloft.net> Message-ID: <1189995261.4230.61.camel@localhost> On Sun, 2007-16-09 at 18:02 -0700, David Miller wrote: > I do. > > And I'm reviewing and applying several hundred patches a day. > > What's the point? :-) Reading the commentary made me think you were about to swallow that with one more change by the time i wake up;-> I still think this work - despite my vested interest - needs more scrutiny from a performance perspective. 
I tend to send a url to my work, but it may be time to start posting patches. cheers, jamal From davem at davemloft.net Sun Sep 16 19:25:02 2007 From: davem at davemloft.net (David Miller) Date: Sun, 16 Sep 2007 19:25:02 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <1189995261.4230.61.camel@localhost> References: <1189988958.4230.55.camel@localhost> <20070916.180232.101592975.davem@davemloft.net> <1189995261.4230.61.camel@localhost> Message-ID: <20070916.192502.123919711.davem@davemloft.net> From: jamal Date: Sun, 16 Sep 2007 22:14:21 -0400 > I still think this work - despite my vested interest - needs more > scrutiny from a performance perspective. Absolutely. There are tertiary issues I'm personally interested in, for example how well this stuff works when we enable software GSO on a non-TSO capable card. In such a case the GSO segment should be split right before we hit the driver and then all the sub-segments of the original GSO frame batched in one shot down to the device driver. In this way you'll get a large chunk of the benefit of TSO without explicit hardware support for the feature. There are several cards (some even 10GB) that will benefit immensely from this. From hadi at cyberus.ca Sun Sep 16 20:01:43 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 16 Sep 2007 23:01:43 -0400 Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <20070916.192502.123919711.davem@davemloft.net> References: <1189988958.4230.55.camel@localhost> <20070916.180232.101592975.davem@davemloft.net> <1189995261.4230.61.camel@localhost> <20070916.192502.123919711.davem@davemloft.net> Message-ID: <1189998103.4230.76.camel@localhost> On Sun, 2007-16-09 at 19:25 -0700, David Miller wrote: > There are tertiary issues I'm personally interested in, for example > how well this stuff works when we enable software GSO on a non-TSO > capable card. 
> > In such a case the GSO segment should be split right before we hit the > driver and then all the sub-segments of the original GSO frame batched > in one shot down to the device driver. I think GSO is still useful on top of this. In my patches anything with gso gets put into the batch list and shot down the driver. Ive never considered checking whether the nic is TSO capable, that may be something worth checking into. The netiron allows you to shove upto 128 skbs utilizing one tx descriptor, which makes for interesting possibilities. > In this way you'll get a large chunk of the benefit of TSO without > explicit hardware support for the feature. > > There are several cards (some even 10GB) that will benefit immensely > from this. indeed - ive always wondered if batching this way would make the NICs behave differently from the way TSO does. On a side note: My observation is that with large packets on a very busy system; bulk transfer type app, one approaches wire speed; with or without batching, the apps are mostly idling (Ive seen upto 90% idle time polling at the socket level for write to complete with a really busy system). This is the case with or without batching. cpu seems a little better with batching. As the aggregation of the apps gets more aggressive (achievable by reducing their packet sizes), one can achieve improved throughput and reduced cpu utilization. This all with UDP; i am still studying tcp. 
cheers, jamal From davem at davemloft.net Sun Sep 16 20:13:18 2007 From: davem at davemloft.net (David Miller) Date: Sun, 16 Sep 2007 20:13:18 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <1189998103.4230.76.camel@localhost> References: <1189995261.4230.61.camel@localhost> <20070916.192502.123919711.davem@davemloft.net> <1189998103.4230.76.camel@localhost> Message-ID: <20070916.201318.71091570.davem@davemloft.net> From: jamal Date: Sun, 16 Sep 2007 23:01:43 -0400 > I think GSO is still useful on top of this. > In my patches anything with gso gets put into the batch list and shot > down the driver. Ive never considered checking whether the nic is TSO > capable, that may be something worth checking into. The netiron allows > you to shove upto 128 skbs utilizing one tx descriptor, which makes for > interesting possibilities. We're talking past each other, but I'm happy to hear that for sure your code does the right thing :-) Right now only TSO capable hardware sets the TSO capable bit, except perhaps for the XEN netfront driver. What Herbert and I want to do is basically turn on TSO for devices that can't do it in hardware, and rely upon the GSO framework to do the segmenting in software right before we hit the device. This only makes sense for devices which can 1) scatter-gather and 2) checksum on transmit. Otherwise we make too many copies and/or passes over the data. And we can only get the full benefit if we can pass all the sub-segments down to the driver in one ->hard_start_xmit() call. > On a side note: My observation is that with large packets on a very busy > system; bulk transfer type app, one approaches wire speed; with or > without batching, the apps are mostly idling (Ive seen upto 90% idle > time polling at the socket level for write to complete with a really > busy system). This is the case with or without batching. cpu seems a > little better with batching. 
As the aggregation of the apps gets more > aggressive (achievable by reducing their packet sizes), one can achieve > improved throughput and reduced cpu utilization. This all with UDP; i am > still studying tcp. UDP apps spraying data tend to naturally batch well and load balance amongst themselves because each socket fills up to it's socket send buffer limit, then sleeps, and we then get a stream from the next UDP socket up to it's limit, and so on and so forth. UDP is too easy a test case in fact :-) From krkumar2 at in.ibm.com Sun Sep 16 20:49:36 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 17 Sep 2007 09:19:36 +0530 Subject: [ofa-general] Re: [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching In-Reply-To: <20070914121518.GB18517@2ka.mipt.ru> Message-ID: Hi Evgeniy, Evgeniy Polyakov wrote on 09/14/2007 05:45:19 PM: > > + if (skb->next) { > > + int count = 0; > > + > > + do { > > + struct sk_buff *nskb = skb->next; > > + > > + skb->next = nskb->next; > > + __skb_queue_tail(dev->skb_blist, nskb); > > + count++; > > + } while (skb->next); > > Could it be list_move()-like function for skb lists? > I'm pretty sure if you change first and the last skbs and ke of the > queue in one shot, result will be the same. I have to do a bit more like update count, etc, but I agree it is do-able. I had mentioned in my PATCH 0/10 that I will later try this suggestion that you provided last time. > Actually how many skbs are usually batched in your load? It depends, eg when the tx lock is not got, I get batching of upto 8-10 skbs (assuming that tx lock was not got quite a few times). But when the queue gets blocked, I have seen batching upto 4K skbs (if tx_queue_len is 4K). > > + /* Reset destructor for kfree_skb to work */ > > + skb->destructor = DEV_GSO_CB(skb)->destructor; > > + kfree_skb(skb); > > Why do you free first skb in the chain? 
This is the gso code which has segmented 'skb' to skb'1-n', and those skb'1-n' are sent out and freed by driver, which means the dummy 'skb' (without any data) remains to be freed. Thanks, - KK From krkumar2 at in.ibm.com Sun Sep 16 20:51:45 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 17 Sep 2007 09:21:45 +0530 Subject: [ofa-general] Re: [PATCH 2/10 REV5] [core] Add skb_blist & support for batching In-Reply-To: <20070914124637.GC18517@2ka.mipt.ru> Message-ID: Hi Evgeniy, Evgeniy Polyakov wrote on 09/14/2007 06:16:38 PM: > > + if (dev->features & NETIF_F_BATCH_SKBS) { > > + /* Driver supports batching skb */ > > + dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL); > > + if (dev->skb_blist) > > + skb_queue_head_init(dev->skb_blist); > > + } > > + > > A nitpick is that you should use sizeof(struct ...) and I think it > requires flag clearing in case of failed initialization? I thought it is better to use *var name in case the name of the structure changes. Also, the flag is not cleared since I could try to enable batching later, and it could succeed at that time. When skb_blist is allocated, then batching is enabled otherwise it is disabled (while features flag just indicates that driver supports batching). Thanks, - KK From krkumar2 at in.ibm.com Sun Sep 16 20:56:47 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 17 Sep 2007 09:26:47 +0530 Subject: [ofa-general] Re: [PATCH 10/10 REV5] [E1000] Implement batching In-Reply-To: <20070914124714.GD18517@2ka.mipt.ru> Message-ID: Hi Evgeniy, Evgeniy Polyakov wrote on 09/14/2007 06:17:14 PM: > > if (unlikely(skb->len <= 0)) { > > dev_kfree_skb_any(skb); > > - return NETDEV_TX_OK; > > + return NETDEV_TX_DROPPED; > > } > > This change could actually go as its own patch, although not sure it is > ever used. Just a thought, not a stopper. Since this flag is new and useful only for batching, I feel it is OK to include it in this patch.
> > + if (!skb || (blist && skb_queue_len(blist))) { > > + /* > > + * Either batching xmit call, or single skb case but there are > > + * skbs already in the batch list from previous failure to > > + * xmit - send the earlier skbs first to avoid out of order. > > + */ > > + if (skb) > > + __skb_queue_tail(blist, skb); > > + skb = __skb_dequeue(blist); > > Why is it put at the end? There is a bug that I had explained in rev4 (see XXX below) resulting in sending out skbs out of order. The fix is that if the driver gets called with a skb but there are older skbs already in the batch list (which failed to get sent out), send those skbs first before this one. Thanks, - KK [XXX] Dave had suggested to use batching only in the net_tx_action case. When I implemented that in earlier revisions, there were lots of TCP retransmissions (about 18,000 to every 1 in regular code). I found the reason for part of that problem as: skbs get queued up in dev->qdisc (when tx lock was not got or queue blocked); when net_tx_action is called later, it passes the batch list as argument to qdisc_run and this results in skbs being moved to the batch list; then batching xmit also fails due to tx lock failure; the next many regular xmit of a single skb will go through the fast path (pass NULL batch list to qdisc_run) and send those skbs out to the device while previous skbs are cooling their heels in the batch list. The first fix was to not pass NULL/batch-list to qdisc_run() but to always check whether skbs are present in batch list when trying to xmit. This reduced retransmissions by a third (from 18,000 to around 12,000), but led to another problem while testing - iperf transmits almost zero data for higher # of parallel flows like 64 or more (and when I run iperf for a 2 min run, it takes about 5-6 mins, and reports that it ran 0 secs and the amount of data transferred is a few MB's). I don't know why this happens with this being the only change (any ideas are very appreciated).
The second fix that resolved this was to revert back to Dave's suggestion to use batching only in net_tx_action case, and modify the driver to see if skbs are present in batch list and to send them out first before sending the current skb. I still see huge retransmissions for IPoIB (but not for E1000), though it has reduced to 12,000 from the earlier 18,000 number. From krkumar2 at in.ibm.com Sun Sep 16 21:08:36 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 17 Sep 2007 09:38:36 +0530 Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <20070916.161748.48388692.davem@davemloft.net> Message-ID: Hi Dave, David Miller wrote on 09/17/2007 04:47:48 AM: > The only major complaint I have about this patch series is that > the IPoIB part should just be one big changeset. Otherwise the > tree is not bisectable, for example the initial ipoib header file > change breaks the build. Right, I will change it accordingly. > On a lower priority, I question the indirection of skb_blist by making > it a pointer. For what? Saving 12 bytes on 64-bit? That kmalloc()'d > thing is a nearly guaranteed cache and/or TLB miss. Just inline the > thing, we generally don't do crap like this anywhere else. The intention was to avoid having two flags (one that driver supports batching and second to indicate that batching is on/off). So I could test skb_blist as an indication of whether batching is on/off. But your point on cache miss is absolutely correct, and I will change this part to be inline.
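The trade-off discussed above (a separately allocated batch list whose NULL pointer doubles as the on/off indication, versus a list embedded in the device with an explicit flag) can be sketched in user space as follows. This is only an illustration under stated assumptions: struct pkt_queue, struct dev_ptr, struct dev_inline and the helper names are hypothetical stand-ins, not the kernel's sk_buff_head or net_device.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel-style circular list head. */
struct pkt_queue {
	struct pkt_queue *next;
	struct pkt_queue *prev;
	unsigned int qlen;
};

/* Variant A: list allocated separately; a NULL pointer means "batching off".
 * Every access to the list costs an extra pointer dereference. */
struct dev_ptr {
	struct pkt_queue *blist;
};

/* Variant B: list embedded in the device; an explicit flag says on/off.
 * The list head sits in the same object as the rest of the device state. */
struct dev_inline {
	struct pkt_queue blist;
	bool batching;
};

/* Initialise an empty queue that points at itself. */
static void pkt_queue_init(struct pkt_queue *q)
{
	q->next = q;
	q->prev = q;
	q->qlen = 0;
}

static bool batching_on_ptr(const struct dev_ptr *d)
{
	return d->blist != NULL;
}

static bool batching_on_inline(const struct dev_inline *d)
{
	return d->batching;
}
```

Variant B needs the second flag that the pointer scheme was meant to avoid, but it removes the separate allocation and the extra dereference on the transmit path.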
thanks, - KK From krkumar2 at in.ibm.com Sun Sep 16 21:10:36 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 17 Sep 2007 09:40:36 +0530 Subject: [ofa-general] Re: [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching In-Reply-To: <20070914113709.80baba4d.randy.dunlap@oracle.com> Message-ID: Hi Randy, Randy Dunlap wrote on 09/15/2007 12:07:09 AM: > > + To fix this problem, error cases where driver xmit gets called with a > > + skb must code as follows: > > + 1. If driver xmit cannot get tx lock, return NETDEV_TX_LOCKED > > + as usual. This allows qdisc to requeue the skb. > > + 2. If driver xmit got the lock but failed to send the skb, it > > + should return NETDEV_TX_BUSY but before that it should have > > + queue'd the skb to the batch list. In this case, the qdisc > > queued > > > + does not requeue the skb. Since this was a new section that I added to the documentation, this error crept in. Thanks for catching it, and review comments/ack-off :) thanks, - KK From jeff at garzik.org Sun Sep 16 21:13:05 2007 From: jeff at garzik.org (Jeff Garzik) Date: Mon, 17 Sep 2007 00:13:05 -0400 Subject: [ofa-general] Re: [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching In-Reply-To: References: Message-ID: <46EDFED1.6010000@garzik.org> Please remove me from the CC list. I get this via netdev, and not having said a single thing in the thread, I don't feel the need to be CC'd on every email. The CC list is pretty massive as it is, anyway. Jeff From krkumar2 at in.ibm.com Sun Sep 16 21:35:22 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 17 Sep 2007 10:05:22 +0530 Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <46ECE1D1.4020503@voltaire.com> Message-ID: Hi Or, > So with ipoib/mthca you still see this 1 : 18.5K retransmission rate > (with no noticeable retransmission increase for E1000) you were > reporting at the V4 post?!
if this is the case, I think it calls for > further examination, where help from Mellanox could ease things, I guess. What I will do today/tomorrow is to run the rev5 (which I didn't run for mthca) on both ehca and mthca and get statistics and send it out. Otherwise what you stated is correct as far as rev4 goes. After giving latest details, I will appreciate any help from Mellanox developers. > By saying that with ehca you see "normal level retransmissions - 2 times > the regular code" do you mean 1 : 2 retransmission rate between batching > to no batching? Correct, for every 1 retransmission in the regular code, I see two retransmissions in batching case (which I assume is due to overflow at the receiver side as I batch sometimes upto 4K skbs). I will post the exact numbers in the next post. > I am not sure this was mentioned over the threads, but clearly two sides > are needed for the dance here, namely I think you want to do your tests > (both the no batching and with batching) with something like NAPI > enabled at the receiver side, 2.6.23-rc5 has NAPI I was using 2.6.23-rc1 on receiver (which also has NAPI, but uses the old API - the same fn ipoib_poll()). > is this with no delay set or not? connected or datagram mode? mtu? > netperf command? system spec (specifically hca device id and fw > version), etc? This is TCP (without No Delay), datagram mode, I didn't change mtu from the default (is it 2K?). Command is iperf with various options for different test buffer-size/threads. Regarding id/etc, this is what dmesg has: Sep 16 22:49:26 elm3b39 kernel: eHCA Infiniband Device Driver (Rel.: SVNEHCA_0023) Sep 16 22:49:26 elm3b39 kernel: xics_enable_irq: irq=36868: ibm_int_on returned -3 There are *fw* files for mthca0, but I don't see for ehca in /sys/class, so I am not sure (since these are pci-e cards, nothing shows up in lspci -v). What should I look for? 
Thanks, - KK From krkumar2 at in.ibm.com Sun Sep 16 21:46:02 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 17 Sep 2007 10:16:02 +0530 Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <20070916.192502.123919711.davem@davemloft.net> Message-ID: [Removing Jeff as requested from thread :) ] Hi Dave, David Miller wrote on 09/17/2007 07:55:02 AM: > From: jamal > Date: Sun, 16 Sep 2007 22:14:21 -0400 > > > I still think this work - despite my vested interest - needs more > > scrutiny from a performance perspective. > > Absolutely. > > There are tertiary issues I'm personally interested in, for example > how well this stuff works when we enable software GSO on a non-TSO > capable card. > > In such a case the GSO segment should be split right before we hit the > driver and then all the sub-segments of the original GSO frame batched > in one shot down to the device driver. > > In this way you'll get a large chunk of the benefit of TSO without > explicit hardware support for the feature. > > There are several cards (some even 10GB) that will benefit immensely > from this. I have tried this on ehca which does not support TSO. I added GSO flag at the ipoib layer (and that resulted in a panic/fix that is mentioned in this patchset). I will re-run tests for this and submit results. 
Thanks, - KK From kliteyn at mellanox.co.il Sun Sep 16 22:22:26 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 17 Sep 2007 07:22:26 +0200 Subject: [ofa-general] nightly osm_sim report 2007-09-17:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-16 OpenSM git rev = Sun_Sep_16_18:47:46_2007 [8224cc5e3f6e5ce03d783e674b4eaa6e1cf37acd] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From mst at dev.mellanox.co.il Sun Sep 16 23:22:52 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 17 Sep 2007 08:22:52 +0200 Subject: [ofa-general] Re: mlx4 violating radix tree API locking rules? In-Reply-To: References: <20070911090313.GE15363@mellanox.co.il> Message-ID: <20070917062252.GA30842@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: mlx4 violating radix tree API locking rules? > > > I guess CQ spinlock implies rcu_read_lock - is that right? > > But I do not see any synchronize_rcu calls anywhere in mlx4. 
> > Should destroy QP and friends call synchronize_rcu after > > removing the QP from radix tree but before freeing the QP structure? > > Well, I don't think we're really trying to use RCU to synchronize the > radix tree. It's the same locking scheme as in mthca, except without > the home-grown sparse array stuff: we have a qp table lock that > protects inserting and removing QPs, and then we use the CQ locks to > avoid looking up a QP that is being removed. > > However, I think you're right: we do violate the radix tree locking > rules. So maybe we need to fall back to our own homegrown array stuff > as in mthca. Why not just call synchronize_rcu instead? -- MST From arne.redlich at xiranet.com Mon Sep 17 02:25:08 2007 From: arne.redlich at xiranet.com (Arne Redlich) Date: Mon, 17 Sep 2007 11:25:08 +0200 Subject: [ofa-general] [PATCH] Fix umad_get_cas_names() usage in libibumad. In-Reply-To: <20070915192339.GF6891@sashak.voltaire.com> (Sasha Khapyorsky's message of "Sat\, 15 Sep 2007 22\:23\:39 +0300") References: <878x7ilhrl.fsf@confield.dd.xiranet.com> <87k5qysfls.fsf@confield.dd.xiranet.com> <20070915192339.GF6891@sashak.voltaire.com> Message-ID: <87wsuptzij.fsf@confield.dd.xiranet.com> Sasha Khapyorsky writes: > On 17:30 Mon 10 Sep , Arne Redlich wrote: >> "Hal Rosenstock" writes: >> >> > On 9/7/07, Arne Redlich wrote: >> >> resolve_ca_name() passes a wrong "max" argument to umad_get_cas_names.
>> >> >> >> Signed-off-by: Arne Redlich >> >> --- >> >> diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c >> >> index 787aa92..589684c 100644 >> >> --- a/libibumad/src/umad.c >> >> +++ b/libibumad/src/umad.c >> >> @@ -307,7 +307,7 @@ resolve_ca_name(char *ca_name, int *best_port) >> >> } >> >> >> >> /* Get the list of CA names */ >> >> - if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0) >> >> + if ((n = umad_get_cas_names((void *)names, 20)) < 0) >> > >> > Rather than the hard coded 20 here and elsewhere, should this be >> > replaced by a #define ? >> >> How about a umad_get_cas_count() helper instead? > > I'm not against using '20' here since this fixed size array is declared > just few lines above. A helper function could be nicer, but what do > you mean? Something like (sizeof(names)/UMAD_CA_NAME_LEN)? No, sorry for being unclear. What I had in mind was a function that returns the actual number of CAs in the system, so users of umad_get_cas_names() don't need to take guesses anymore, i.e.: nca = umad_get_cas_count(); cas = calloc(nca, UMAD_CA_NAME_LEN); ret = umad_get_cas_names(cas, nca); Arne From vlad at lists.openfabrics.org Mon Sep 17 02:52:54 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 17 Sep 2007 02:52:54 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070917-0200 daily build status Message-ID: <20070917095254.AF345E60854@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 
with linux-2.6.22 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070917-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From ogerlitz at voltaire.com Mon Sep 17 02:52:58 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 17 Sep 2007 11:52:58 +0200 Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: References: Message-ID: <46EE4E7A.1000700@voltaire.com> Hi Krishna, Krishna Kumar2 wrote: > What I will do today/tomorrow is to run the rev5 (which I didn't run > for mthca) on both ehca and mthca and get statistics and send it out. > Otherwise what you stated is correct as far as rev4 goes. After giving > latest details, I will appreciate any help from Mellanox developers. Good, please test with rev5 and let us know. > Correct, for every 1 retransmission in the regular code, I see two > retransmissions in batching case (which I assume is due to overflow at the > receiver side as I batch sometimes upto 4K skbs). I will post the exact > numbers in the next post. Transmission of 4K batched packets sounds like a real problem for the receiver side: with a 0.5K send/recv queue size, it's 8 batches of 512 packets each, where for each RX there is a completion (WC) to process and an SKB to alloc and post to the QP, whereas for the TX there's only posting to the QP, processing one (?) WC and freeing 512 SKBs. If indeed the situation is so asymmetrical, I am starting to think that the CPU utilization at the sender side might be much higher with batching than without batching; have you looked into that?
> I was using 2.6.23-rc1 on receiver (which also has NAPI, but uses the > old API - the same fn ipoib_poll()). I am not with you. Looking at 2.6.22 and 2.6.23-rc5, in both the ipoib NAPI mechanism is implemented through the function ipoib_poll, which is the polling API for the network stack etc., so what is the old API and where does this difference exist? > This is TCP (without No Delay), datagram mode, I didn't change mtu from > the default (is it 2K?). Command is iperf with various options for different > test buffer-size/threads. You might want to try something lighter such as an iperf udp test, where a good criterion would be to compare bandwidth AND packet loss between no-batching and batching. As for the MTU, the default is indeed 2K (2044), but it's always good to know the facts, namely what the mtu was during the test. > Regarding id/etc, this is what dmesg has: If you have the user space libraries installed, load ib_uverbs and run the command ibv_devinfo; you will see all the infiniband devices on your system and, for each, its device id and firmware version. If not, you should be looking at /sys/class/infiniband/$device/hca_type and /sys/class/infiniband/$device/fw_ver > Sep 16 22:49:26 elm3b39 kernel: eHCA Infiniband Device Driver (Rel.: > SVNEHCA_0023) > There are *fw* files for mthca0, but I don't see for ehca in /sys/class, so > I am not sure (since these are pci-e cards, nothing shows up in lspci -v). > What should I look for? The above print seems to be from the ehca driver, whereas you are talking about mthca0, which is quite confusing. If you want to be sure which hca is being used by the netdevice you are testing with (e.g. ib0), take a look at the directory /sys/class/net/$netdevice/device/ If you have an hca which is not reported in lspci and/or in /sys/class/infiniband, it sounds like you have a problem or you found a bug. Or.
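The sysfs lookup suggested above (hca_type and fw_ver under /sys/class/infiniband/$device/) could be scripted roughly as follows. This is only a minimal sketch: the helper names ib_attr_path() and read_attr_file() are hypothetical, not part of any OFED library, and it assumes each attribute is a one-line text file, as is usual for sysfs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<root>/<device>/<attr>", for example
 * /sys/class/infiniband/mthca0/fw_ver.
 * Returns 0 on success, -1 if the path does not fit into out. */
int ib_attr_path(const char *root, const char *device, const char *attr,
                 char *out, size_t outlen)
{
    assert(root && device && attr && out && outlen > 0);
    int n = snprintf(out, outlen, "%s/%s/%s", root, device, attr);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Hypothetical helper: read the first line of a sysfs-style text file
 * into buf and strip the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
int read_attr_file(const char *path, char *buf, size_t buflen)
{
    assert(path && buf && buflen > 0);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)buflen, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* drop the newline, if any */
    return 0;
}
```

A caller would build the path with root "/sys/class/infiniband", a device name such as mthca0 and the attribute hca_type or fw_ver, then pass it to read_attr_file(); a return of -1 for a device you expect to exist would match the "problem or bug" case mentioned above.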
From hrosenstock at xsigo.com Mon Sep 17 05:34:17 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 05:34:17 -0700 Subject: [Fwd: [ofa-general] nightly osm_sim report 2007-09-15:normal completion] Message-ID: <1190032458.6272.67.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, Is the failure below a simulator or OpenSM issue ? Thanks. -- Hal -------- Forwarded Message -------- From: kliteyn at mellanox.co.il To: sashak at voltaire.com Cc: general at lists.openfabrics.org Subject: [ofa-general] nightly osm_sim report 2007-09-15:normal completion Date: 15 Sep 2007 07:32:15 +0300 OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-14 OpenSM git rev = Sun_Sep_9_15:57:42_2007 [27f7ec84dbb1060397fa930569bc88d8f6e1d373] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=519 Fail=1 Pass: 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 38 Stability IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: 1 Stability IS1-16.topo _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit 
http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Mon Sep 17 05:36:46 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 05:36:46 -0700 Subject: [ewg] Re: [ofa-general] Re: [PATCH] osm: bugfix - IB_PR_COMPMASK was used in MPR In-Reply-To: <20070915195119.GG6891@sashak.voltaire.com> References: <46E3EDC6.9070901@dev.mellanox.co.il> <20070910161926.GJ29384@sashak.voltaire.com> <20070910163014.GK29384@sashak.voltaire.com> <20070915195119.GG6891@sashak.voltaire.com> Message-ID: <1190032606.6272.71.camel@hrosenstock-ws.xsigo.com> Hi Sasha, On Sat, 2007-09-15 at 22:51 +0300, Sasha Khapyorsky wrote: > On 12:25 Mon 10 Sep , Hal Rosenstock wrote: > > On 9/10/07, Sasha Khapyorsky wrote: > > > Hi Hal, > > > > > > On 12:13 Mon 10 Sep , Hal Rosenstock wrote: > > > > Hi Sasha, > > > > > > > > On 9/10/07, Sasha Khapyorsky wrote: > > > > > On 15:57 Sun 09 Sep , Yevgeny Kliteynik wrote: > > > > > > Hi Sasha, > > > > > > > > > > > > In several places in MPR implementation IB_PR_COMPMASK_* > > > > > > was used instead of IB_MPR_COMPMASK_* > > > > > > > > > > > > Signed-off-by: Yevgeny Kliteynik > > > > > > > > > > Applied. Thanks. > > > > > > > > Shouldn't this also be applied to OFED 1.2 ? > > > > > > It does not look for me that any new OFED 1.2x distribution is planned. > > > > Seems like this is an EWG issue. > > > > Should there be OFED 1.2.x fix release(s) ? > > At least I don't know about. In case if there will I could incorporate > critical fixes from 'master' into 'ofed_1_2' branch. Your OFED 1.2 branch, right ? Vlad could pick this up if there is to be an OFED 1.2 fix release. > For users who prefer > to use OpenSM separately from OFED I would suggest 'master' anyway. Master is not as "baked"/tested as OFED 1.2 although it has other goodies. > Seems reasonable? For your part, yes but still haven't heard on EWG. Seems like there is not much interest there in maintaining OFED 1.2. 
-- Hal > Sasha > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From hrosenstock at xsigo.com Mon Sep 17 05:42:09 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 05:42:09 -0700 Subject: [ofa-general] [PATCH] opensm: configure scripts merge In-Reply-To: <20070915183542.GA6891@sashak.voltaire.com> References: <20070915183542.GA6891@sashak.voltaire.com> Message-ID: <1190032929.6272.75.camel@hrosenstock-ws.xsigo.com> Hi Sasha, On Sat, 2007-09-15 at 21:35 +0300, Sasha Khapyorsky wrote: > This merges all subdirectories configure.in scripts into one toplevel > directory script. Separate configuring per subdirectory is not needed > anymore. How is the requirement for separate OpenSM libraries (complib, libosmvendor, and libopensm) now met ? There are some tools (e.g. ibutils and others) which require these libraries with OpenSM itself. -- Hal > Signed-off-by: Sasha Khapyorsky > --- > opensm/Makefile.am | 4 +- > opensm/autogen.sh | 34 +++------------------ > opensm/complib/Makefile.am | 2 + > opensm/configure.in | 60 ++++++++++++++++++++++++++++++------ > opensm/libvendor/Makefile.am | 2 + > opensm/opensm/Makefile.am | 2 + > opensm/osmeventplugin/Makefile.am | 2 + > 7 files changed, 65 insertions(+), 41 deletions(-) > > diff --git a/opensm/Makefile.am b/opensm/Makefile.am > index f99e78b..9cbce3a 100644 > --- a/opensm/Makefile.am > +++ b/opensm/Makefile.am > @@ -1,12 +1,12 @@ > > # note that order matters: make the libs first then use them > -SUBDIRS = complib libvendor opensm osmtest include $(DEFAULT_EVENT_PLUGIN) > +SUBDIRS = complib libvendor opensm osmtest include $(DEFAULT_EVENT_PLUGIN) > DIST_SUBDIRS = complib libvendor opensm osmtest include osmeventplugin > > # this will control the update of the files in order > MAINTAINERCLEANFILES = Makefile.in aclocal.m4 configure config-h.in > > -ACLOCAL = aclocal -I $(ac_aux_dir) > 
+ACLOCAL = aclocal -I $(ac_aux_dir) > > # we should provide a hint for other apps about the build mode of this project > install-exec-hook: > diff --git a/opensm/autogen.sh b/opensm/autogen.sh > index 3ae89b4..fee8800 100755 > --- a/opensm/autogen.sh > +++ b/opensm/autogen.sh > @@ -50,32 +50,8 @@ fi > # cleanup > find . \( -name Makefile.in -o -name aclocal.m4 -o -name autom4te.cache -o -name configure -o -name aclocal.m4 \) -exec \rm -rf {} \; -prune > > -# handle our own autoconf: > -(aclocal -I config 2>&1 ) && \ > -(automake --add-missing --gnu --copy ) && \ > -(autoconf 2>&1 ) > -if test $? != 0; then > - exit 1 > -fi > - > - > - > -# visit all sub directories with autogen.sh > -anyErr=0 > -for a in include complib libvendor opensm osmtest osmeventplugin ; do > - dir=`dirname $a` > - test -d ${dir}/config || mkdir ${dir}/config > - echo Visiting $a > - ( cd $a && \ > - set -x && \ > - aclocal -I config -I ../config && \ > - libtoolize --force --copy && \ > - autoheader && \ > - automake --foreign --add-missing --copy && \ > - autoconf ) \ > - 2>&1 | sed 's/^/| /' | grep -v "arning: underquoted definition" > - if test $? 
!= 0; then > - echo $a failed > - anyErr=1 > - fi > -done > +aclocal -I config && \ > +libtoolize --force --copy && \ > +autoheader && \ > +automake --foreign --add-missing --copy && \ > +autoconf > diff --git a/opensm/complib/Makefile.am b/opensm/complib/Makefile.am > index fce797a..a77964e 100644 > --- a/opensm/complib/Makefile.am > +++ b/opensm/complib/Makefile.am > @@ -17,6 +17,8 @@ else > libosmcomp_version_script = > endif > > +complib_api_version=$(shell grep LIBVERSION= $(srcdir)/libosmcomp.ver | sed 's/LIBVERSION=//') > + > libosmcomp_la_SOURCES = cl_complib.c cl_dispatcher.c \ > cl_event.c cl_event_wheel.c \ > cl_list.c cl_log.c cl_map.c \ > diff --git a/opensm/configure.in b/opensm/configure.in > index 2efd867..6c4db9f 100644 > --- a/opensm/configure.in > +++ b/opensm/configure.in > @@ -4,6 +4,7 @@ AC_PREREQ(2.57) > AC_INIT(opensm, 3.1.1, general at lists.openfabrics.org) > AC_CONFIG_SRCDIR([opensm/osm_opensm.c]) > AC_CONFIG_AUX_DIR(config) > +AC_CONFIG_HEADERS(include/config.h) > AM_INIT_AUTOMAKE(opensm, 3.1.1) > > dnl Defines the Language > @@ -16,17 +17,50 @@ AM_MAINTAINER_MODE > > dnl Required for cases make defines a MAKE=make ??? Why > AC_PROG_MAKE_SET > +AC_PROG_CC > +AC_PROG_LIBTOOL > +AC_PROG_INSTALL > +AC_PROG_LN_S > +AC_PROG_MAKE_SET > +AC_PROG_YACC > +AC_PROG_LEX > + > +dnl Checks for libraries > +AC_CHECK_LIB(pthread, pthread_mutex_init, [], > + AC_MSG_ERROR([pthread_mutex_init() not found. libosmcomp requires libpthread.])) > + > +dnl Checks for typedefs, structures, and compiler characteristics. 
> +AC_C_CONST > +AC_C_INLINE > +AC_TYPE_PID_T > +AC_TYPE_SIZE_T > +AC_HEADER_TIME > +AC_STRUCT_TM > +AC_C_VOLATILE > + > +dnl We use --version-script with ld if possible > +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, > +if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then > + ac_cv_version_script=yes > +else > + ac_cv_version_script=no > +fi) > +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") > > dnl Define an input config option to control debug compile > -AC_ARG_ENABLE(debug, > -[ --enable-debug Turn on debugging], > +AC_ARG_ENABLE(debug, [ --enable-debug Turn on debugging], > [case "${enableval}" in > - yes) debug=true ;; > - no) debug=false ;; > - *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; > -esac],[debug=false]) > + yes) debug=true ;; > + no) debug=false ;; > + *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; > +esac],debug=false) > AM_CONDITIONAL(DEBUG, test x$debug = xtrue) > > +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], > +[if test x$enableval = xno ; then > + disable_libcheck=yes > +fi]) > + > dnl check if they want the socket console > OPENIB_OSM_CONSOLE_SOCKET_SEL > > @@ -39,9 +73,15 @@ OPENIB_OSM_DEFAULT_EVENT_PLUGIN_SEL > dnl Provide user option to select vendor > OPENIB_APP_OSMV_SEL > > -dnl Configure the following subdirs > -AC_CONFIG_SUBDIRS(complib libvendor opensm osmtest include osmeventplugin) > +dnl Checks for headers and libraries > +OPENIB_APP_OSMV_CHECK_HEADER > +OPENIB_APP_OSMV_CHECK_LIB > + > +# we have to revive the env CFLAGS as some how they are being overwritten... > +# see http://sources.redhat.com/automake/automake.html#Flag-Variables-Ordering > +# for why they should NEVER be modified by the configure to allow for user > +# overrides. 
> +CFLAGS=$ac_env_CFLAGS_value > > dnl Create the following Makefiles > -AC_OUTPUT(Makefile) > -AC_OUTPUT(opensm.spec) > +AC_OUTPUT([Makefile include/Makefile complib/Makefile libvendor/Makefile opensm/Makefile osmeventplugin/Makefile osmtest/Makefile opensm.spec]) > diff --git a/opensm/libvendor/Makefile.am b/opensm/libvendor/Makefile.am > index 3b8c3af..cb8baaa 100644 > --- a/opensm/libvendor/Makefile.am > +++ b/opensm/libvendor/Makefile.am > @@ -23,6 +23,8 @@ else > libosmvendor_version_script = > endif > > +osmvendor_api_version=$(shell grep LIBVERSION= $(srcdir)/libosmvendor.ver | sed 's/LIBVERSION=//') > + > COMM_HDRS= $(srcdir)/../include/vendor/osm_vendor_api.h \ > $(srcdir)/../include/vendor/osm_vendor.h \ > $(srcdir)/../include/vendor/osm_vendor_select.h \ > diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am > index 5e4229d..8440b4a 100644 > --- a/opensm/opensm/Makefile.am > +++ b/opensm/opensm/Makefile.am > @@ -21,6 +21,8 @@ else > libopensm_version_script = > endif > > +opensm_api_version=$(shell grep LIBVERSION= $(srcdir)/libopensm.ver | sed 's/LIBVERSION=//') > + > libopensm_la_SOURCES = osm_log.c osm_mad_pool.c osm_helper.c > libopensm_la_LDFLAGS = -version-info $(opensm_api_version) \ > -export-dynamic $(libopensm_version_script) > diff --git a/opensm/osmeventplugin/Makefile.am b/opensm/osmeventplugin/Makefile.am > index bbb012f..1b7dad0 100644 > --- a/opensm/osmeventplugin/Makefile.am > +++ b/opensm/osmeventplugin/Makefile.am > @@ -18,6 +18,8 @@ else > libosmeventplugin_version_script = > endif > > +osmeventplugin_api_version=$(shell grep LIBVERSION= $(srcdir)/libosmeventplugin.ver | sed 's/LIBVERSION=//') > + > libosmeventplugin_la_SOURCES = src/osmeventplugin.c > libosmeventplugin_la_LDFLAGS = -version-info $(osmeventplugin_api_version) \ > -export-dynamic $(libosmeventplugin_version_script) From hadi at cyberus.ca Mon Sep 17 05:51:40 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 17 Sep 2007 08:51:40 -0400 Subject: 
[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <20070916.201318.71091570.davem@davemloft.net> References: <1189995261.4230.61.camel@localhost> <20070916.192502.123919711.davem@davemloft.net> <1189998103.4230.76.camel@localhost> <20070916.201318.71091570.davem@davemloft.net> Message-ID: <1190033500.4230.102.camel@localhost> On Sun, 2007-16-09 at 20:13 -0700, David Miller wrote: > What Herbert and I want to do is basically turn on TSO for > devices that can't do it in hardware, and rely upon the GSO > framework to do the segmenting in software right before we > hit the device. Sensible. > This only makes sense for devices which can 1) scatter-gather > and 2) checksum on transmit. If you have knowledge there are enough descriptors in the driver to cover all skbs you are passing, do you need to have #1? Note i dont touch fragments, i am assuming the driver is smart enough to handle them otherwise it wont advertise it can handle scatter-gather > Otherwise we make too many copies and/or passes over the data. I didnt understand this last bit - you are still going to go over the list regardless of whether you call ->hard_start_xmit() once or multiple times over the same list, no? In the later case i am assuming a trimmed down ->hard_start_xmit() > UDP is too easy a test case in fact :-) I learnt a lot about the behavior out of doing udp (and before that with pktgen); theres a lot of driver habits that may need to be tuned before batching becomes really effective - which is easier to see with udp than with tcp. 
cheers, jamal From hrosenstock at xsigo.com Mon Sep 17 06:00:15 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 06:00:15 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com> References: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com> Message-ID: <1190034015.6272.83.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-09-13 at 11:20 -0700, Sean Hefty wrote: > > - My user_mad P_Key index support patch. I'll test the ioctl to > > change to the new mode and merge this I guess, since Hal and Sean > > have tested this out. > > I can give this patch a reviewed-by: too, and I will also try to review a couple > of the pending ipoib patches. > > > - Sean's QoS changes. These look fine at first glance, and I just > > plan to understand the backwards compatibility story (ie how this > > works with an old SM) and merge. Anyone who objects let me know. > > The new QoS fields fall into fields that are currently reserved, which should be > ignored by an older SM. By older, you mean one which doesn't support QoS (as indicated by the setting in SA's ClassPortInfo). > I've only tested this against openSM however. in non QoS mode, right ? Has anyone tested these with QoS actually be used ? I suppose this requires Connect-X. -- Hal > > - Sean's IB CM MRA interface changes. Don't know at this point. It > > seems OK but I'm not clear on what if any real-world improvement > > this gives us. > > This patch was generated in response to an Intel MPI issue. We've seen MPI take > several minutes to respond to a connection request during the middle of large > application runs. When this happens, the active side times out the connection. > In OFED, we added module parameters to adjust the rdma_cm connection timeout on > the active side, but I believe that sending an MRA from the passive side is a > better solution. 
> > - Sean From hrosenstock at xsigo.com Mon Sep 17 06:14:24 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 06:14:24 -0700 Subject: [ofa-general] [PATCH][RFC] P_Key support for umad In-Reply-To: <46EACC6B.5060702@ichips.intel.com> References: <46EACC6B.5060702@ichips.intel.com> Message-ID: <1190034864.6272.86.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-09-14 at 11:01 -0700, Sean Hefty wrote: > I didn't notice any issues with this patch, or anything missing from it. > > Reviewed-by: Sean Hefty I'll ditto the above and mention it was tested in old (coexistence) mode. Reviewed-by: Hal Rosenstock From swise at opengridcomputing.com Mon Sep 17 07:59:49 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 17 Sep 2007 09:59:49 -0500 Subject: [ofa-general] Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes In-Reply-To: <20070916091024.GF30150@mellanox.co.il> References: <46E94B36.70406@opengridcomputing.com> <20070916091024.GF30150@mellanox.co.il> Message-ID: <46EE9665.7090807@opengridcomputing.com> After this is pushed, can you build and publish a new ofed-1.2.5.x tarball? Or at least a daily build of the full ofed-1.2.5 kit? Thanks, Steve. Michael S. Tsirkin wrote: > Done. I'll push soon. > > Quoting Steve Wise : > Subject: [GIT PULL ofed_1_2_c] cxgb3 bug fixes > > Vlad (Michael/Tziporet in Vlad's absence), > > Please integrate the following cxgb3 bug fixes into ofed-1.2.5. All of > these patches are either in 2.6.23 or merged into Jeff Garzik's upstream > branch of netdev-2.6 and will go into 2.6.24. Chelsio recommends we > update ofed-1.2.5 and ofed-1.3 will all of these fixes. > > I'll send another email with the ofed-1.3 changes as they will be > slightly different.
> > Please pull the ofed_1_2_c changes from: > > git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c > > The patch files added to kernel_patches/fixes include: > >> swise at dell3:~/git/ofed-1.2.5> stg series >> + 0029-cxgb3-engine-microcode-load >> + 0030-cxgb3-MAC-workaround-update >> + 0031-cxgb3-Update-rx-coalescing-length >> + 0032-cxgb3-SGE-doorbell-overflow-warning >> + 0033-cxgb3-use-immediate-data-for-offload-Tx >> + 0034-cxgb3-Expose-HW-memory-page-info >> + 0035-cxgb3-tighten-checks-on-TID-values >> + 0036-cxgb3-Fatal-error-update >> + 0037-cxgb3-log-adapter-serial-number >> + 0038-cxgb3-Update-internal-memory-management >> + 0039-cxgb3-update-firmware-version >> + 0040-cxgb3-log-and-clear-PEX-errors >> + 0041-cxgb3-remove-false-positive-in-xgmac-workaround >> + 0042-cxgb3-Set-the-CQ_ERR-bit-in-CQ-contexts >> + 0043-cxgb3-CQ-context-operations-time-out-too-soon >> + 0044-cxgb3-Add-T3C-rev >> + 0045-cxgb3-Update-engine-microcode-version >>> 0046-cxgb3-driver-version > > Steve. > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > From hnguyen at linux.vnet.ibm.com Mon Sep 17 08:11:09 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Mon, 17 Sep 2007 17:11:09 +0200 Subject: [ofa-general] ofed-1.3 daily build package's content Message-ID: <200709171711.09316.hnguyen@linux.vnet.ibm.com> Hello Vlad and Michael! 
Just downloaded daily build package OFED-1.3-20070917-0600 and saw in SRPMS: localhost:/home/nguyen/tmp/OFED-1.3-20070917-0600/SRPMS # ls -l ofa_kernel-1.3-ofed2007091* -rw-r--r-- 1 1011 1011 1967453 2007-09-10 15:27 ofa_kernel-1.3-ofed20070910.src.rpm -rw-r--r-- 1 1011 1011 1960701 2007-09-11 15:02 ofa_kernel-1.3-ofed20070911.src.rpm -rw-r--r-- 1 1011 1011 1966672 2007-09-12 15:02 ofa_kernel-1.3-ofed20070912.src.rpm -rw-r--r-- 1 1011 1011 1957624 2007-09-13 15:02 ofa_kernel-1.3-ofed20070913.src.rpm -rw-r--r-- 1 1011 1011 1963469 2007-09-14 15:02 ofa_kernel-1.3-ofed20070914.src.rpm -rw-r--r-- 1 1011 1011 1965865 2007-09-15 15:02 ofa_kernel-1.3-ofed20070915.src.rpm -rw-r--r-- 1 1011 1011 1963044 2007-09-16 15:01 ofa_kernel-1.3-ofed20070916.src.rpm -rw-r--r-- 1 1011 1011 1959261 2007-09-17 15:01 ofa_kernel-1.3-ofed20070917.src.rpm Is there a reason to include earlier versions of ofa_kernel-1.3? Are they needed by the build script? Nam From hnguyen at linux.vnet.ibm.com Mon Sep 17 08:12:29 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Mon, 17 Sep 2007 17:12:29 +0200 Subject: [ofa-general] Fwd: [PATCH] ehca patches inclusion for ofed-1.3-alpha Message-ID: <200709171712.30087.hnguyen@linux.vnet.ibm.com> Sorry, forgot put this to the list. Nam ---------- Forwarded Message ---------- Subject: [PATCH] ehca patches inclusion for ofed-1.3-alpha Date: Monday 17 September 2007 16:36 From: Hoang-Nam Nguyen To: Vladimir Sokolovsky Cc: tziporet at dev.mellanox.co.il, mst at mellanox.co.il, raisch at de.ibm.com, stefan.roscher at de.ibm.com Hello Vladimir! Please include the ehca patches being queued for 2.6.24 in ofed-1.3-alpha. They are listed below for your convenience. Thanks! 
Nam [ofa-general] [PATCH 01/12] IB/ehca: Small QP userspace support http://lists.openfabrics.org/pipermail/general/2007-September/040567.html [ofa-general] [PATCH 03/12] IB/ehca: Support more than 4k QPs for userspace and kernelspace http://lists.openfabrics.org/pipermail/general/2007-September/040569.html [ofa-general] [PATCH 04/12] IB/ehca: Use remap_4k_pfn() to map firmware contexts to user space http://lists.openfabrics.org/pipermail/general/2007-September/040570.html [ofa-general] [PATCH 05/12] IB/ehca: Refactor hvcall tracing http://lists.openfabrics.org/pipermail/general/2007-September/040571.html [ofa-general] [PATCH 06/12] IB/ehca: Print return codes as signed decimal integers http://lists.openfabrics.org/pipermail/general/2007-September/040572.html [ofa-general] [PATCH 07/12] IB/ehca: ehca_gen_warn() should always print http://lists.openfabrics.org/pipermail/general/2007-September/040573.html [ofa-general] [PATCH 08/12] IB/ehca: Replace get_paca()->paca_index by the more portable raw_smp_processor_id() http://lists.openfabrics.org/pipermail/general/2007-September/040626.html [ofa-general] [PATCH 09/12] IB/ehca: Add check for max #SGE to create_qp() http://lists.openfabrics.org/pipermail/general/2007-September/040574.html [ofa-general] [PATCH 10/12] IB/ehca: Path migration support http://lists.openfabrics.org/pipermail/general/2007-September/040576.html [ofa-general] [PATCH 11/12] IB/ehca: Serialize MR alloc and MR free hvCalls http://lists.openfabrics.org/pipermail/general/2007-September/040577.html [ofa-general] [PATCH 12/12] IB/ehca: Bump version number and change its format http://lists.openfabrics.org/pipermail/general/2007-September/040578.html [ofa-general] [PATCH 0/3] IB/ehca: MR/MW fixes *** Please include this patch set entirely *** http://lists.openfabrics.org/pipermail/general/2007-September/040654.html ------------------------------------------------------- From john.blackwood at ccur.com Mon Sep 17 08:22:52 2007 From: john.blackwood at 
ccur.com (John Blackwood)
Date: Mon, 17 Sep 2007 11:22:52 -0400
Subject: [ofa-general] [PATCH] [WORKAROUND] CONFIG_PREEMPT_RT and ib_umad_close() issue
Message-ID: <46EE9BCC.1040301@ccur.com>

When using OFED-1.2.5 based infiniband kernel modules on 2.6.22 based
kernels with Ingo Molnar's CONFIG_PREEMPT_RT patch applied, commands
such as ibnetdiscover, smpquery, sminfo, etc. will hang. The problem
is with the downgrade_write() rw semaphore usage in the
ib_umad_close() routine.

This patch is a temporary workaround that gets around this issue by
changing the ib_umad_port mutex from a rw_semaphore to a
compat_rw_semaphore. This is admittedly only a temporary solution.

An example of the BUG console message output and the workaround patch
are shown below.

bowser> ------------[ cut here ]------------
kernel BUG at kernel/rt.c:352!
invalid opcode: 0000 [#1]
PREEMPT SMP
last sysfs file: /class/infiniband_mad/umad0/port
Modules linked in: rdma_ucm(F) rds(F) ib_ucm(F) ib_srp(F) ib_sdp(F) rdma_cm(F) iw_cm(F) ib_addr(F) ib_ipoib(F) ib_cm(F) ib_sa(F) ib_uverbs(F) ib_umad(F) ib_mthca(F) ib_mad(F) ib_core(F)
CPU: 1
EIP: 0060:[] Tainted: GF N VLI
EFLAGS: 00210282 (2.6.22.6-rt_shield_trace #1)
EIP is at rt_downgrade_write+0x0/0x10
eax: f705df7c ebx: 00000008 ecx: f7171014 edx: f705df9c
esi: f7170ffc edi: f7171000 ebp: f7171004 esp: f5fcdef0
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 preempt:00000001
Process ibnetdiscover (pid: 10842, ti=f5fcc000 task=f6cd00f0 task.ti=f5fcc000)
Stack: f883a4a8 f705df40 00000000 00000008 f6ca1680 f63adf20 f734e7dc c018c240
       00000000 00000000 f734e7dc c2d442c0 f63adf20 f6ca1680 f71d06c0 00000000
       00000001 c018a2ec 00000000 00000001 00000003 f44af440 c012bd20 f71d06c0
Call Trace:
 [] ib_umad_close+0x98/0xf0 [ib_umad]
 [] __fput+0x170/0x1a0
 [] filp_close+0x3c/0x80
 [] close_files+0x50/0x60
 [] put_files_struct+0x28/0x80
 [] do_exit+0x1c6/0x570
 [] sys_write+0x6a/0xf0
 [] do_group_exit+0x26/0x70
 [] sysenter_past_esp+0x68/0x99
=======================
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [] .... __spin_lock_irqsave+0x19/0x50
.....[<00000000>] .. ( <= 0x0)

Code: 5b e9 45 7a 51 00 90 8d 74 26 00 8b 53 1c 85 d2 74 e3 89 d8 89 ca e8 c0 84 51 00 ff 4b 1c 5b c3 8d 74 26 00 8d bc 27 00 00 00 00 <0f> 0b eb fe 8d b6 00 00 00 00 8d bf 00 00 00 00 e8 db 79 51 00
EIP: [] rt_downgrade_write+0x0/0x10 SS:ESP 0068:f5fcdef0
BUG: sleeping function called from invalid context ibnetdiscover(10842) at kernel/rtmutex.c:636
in_atomic():1 [00000001], irqs_disabled():1
 [] __might_sleep+0xe1/0x100
 [] set_palette+0x2b/0x60
 [] __rt_spin_lock+0x36/0x50
 [] __wake_up+0x1e/0x70
 [] wake_up_klogd+0x3b/0x40
 [] die+0x166/0x240
 [] do_trap+0x1b1/0x260
 [] raw_notifier_call_chain+0x17/0x20
 [] notify_die+0x30/0x40
 [] do_invalid_op+0x0/0x90
 [] do_invalid_op+0x83/0x90
 [] rt_downgrade_write+0x0/0x10
 [] add_preempt_count+0x12/0xe0
 [] add_preempt_count+0x12/0xe0
 [] __rt_spin_lock+0x36/0x50
 [] lock_list_del_init+0x55/0x80
 [] file_kill+0x18d/0x1a0
 [] error_code+0x72/0x80
 [] load_module+0x33b/0xe10
 [] rt_downgrade_write+0x0/0x10
 [] ib_umad_close+0x98/0xf0 [ib_umad]
 [] __fput+0x170/0x1a0
 [] filp_close+0x3c/0x80
 [] close_files+0x50/0x60
 [] put_files_struct+0x28/0x80
 [] do_exit+0x1c6/0x570
 [] sys_write+0x6a/0xf0
 [] do_group_exit+0x26/0x70
 [] sysenter_past_esp+0x68/0x99
=======================
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [] .... __spin_lock_irqsave+0x19/0x50
.....[<00000000>] .. ( <= 0x0)

Fixing recursive fault but reboot is needed!
--- linux-2.6.22/drivers/infiniband/core/user_mad.c 2007-09-17 09:48:45.000000000 -0400 +++ new/drivers/infiniband/core/user_mad.c 2007-09-17 09:50:41.000000000 -0400 @@ -93,7 +93,7 @@ struct ib_umad_port { struct class_device *sm_class_dev; struct semaphore sm_sem; - struct rw_semaphore mutex; + struct compat_rw_semaphore mutex; struct list_head file_list; struct ib_device *ib_dev; @@ -159,7 +159,7 @@ static int queue_packet(struct ib_umad_f { int ret = 1; - down_read(&file->port->mutex); + compat_down_read(&file->port->mutex); for (packet->mad.hdr.id = 0; packet->mad.hdr.id < IB_UMAD_MAX_AGENTS; @@ -173,7 +173,7 @@ static int queue_packet(struct ib_umad_f break; } - up_read(&file->port->mutex); + compat_up_read(&file->port->mutex); return ret; } @@ -461,7 +461,7 @@ static ssize_t ib_umad_write(struct file goto err; } - down_read(&file->port->mutex); + compat_down_read(&file->port->mutex); agent = __get_agent(file, packet->mad.hdr.id); if (!agent) { @@ -558,7 +558,7 @@ static ssize_t ib_umad_write(struct file if (ret) goto err_send; - up_read(&file->port->mutex); + compat_up_read(&file->port->mutex); return count; err_send: @@ -568,7 +568,7 @@ err_msg: err_ah: ib_destroy_ah(ah); err_up: - up_read(&file->port->mutex); + compat_up_read(&file->port->mutex); err: kfree(packet); return ret; @@ -597,7 +597,7 @@ static int ib_umad_reg_agent(struct ib_u int agent_id; int ret; - down_write(&file->port->mutex); + compat_down_write(&file->port->mutex); if (!file->port->ib_dev) { ret = -EPIPE; @@ -650,7 +650,7 @@ found: ret = 0; out: - up_write(&file->port->mutex); + compat_up_write(&file->port->mutex); return ret; } @@ -663,7 +663,7 @@ static int ib_umad_unreg_agent(struct ib if (get_user(id, (u32 __user *) arg)) return -EFAULT; - down_write(&file->port->mutex); + compat_down_write(&file->port->mutex); if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !__get_agent(file, id)) { ret = -EINVAL; @@ -674,7 +674,7 @@ static int ib_umad_unreg_agent(struct ib file->agent[id] = NULL; out: - 
up_write(&file->port->mutex); + compat_up_write(&file->port->mutex); if (agent) ib_unregister_mad_agent(agent); @@ -710,7 +710,7 @@ static int ib_umad_open(struct inode *in if (!port) return -ENXIO; - down_write(&port->mutex); + compat_down_write(&port->mutex); if (!port->ib_dev) { ret = -ENXIO; @@ -736,7 +736,7 @@ static int ib_umad_open(struct inode *in list_add_tail(&file->port_list, &port->file_list); out: - up_write(&port->mutex); + compat_up_write(&port->mutex); return ret; } @@ -748,7 +748,7 @@ static int ib_umad_close(struct inode *i int already_dead; int i; - down_write(&file->port->mutex); + compat_down_write(&file->port->mutex); already_dead = file->agents_dead; file->agents_dead = 1; @@ -761,14 +761,14 @@ static int ib_umad_close(struct inode *i list_del(&file->port_list); - downgrade_write(&file->port->mutex); + compat_downgrade_write(&file->port->mutex); if (!already_dead) for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) if (file->agent[i]) ib_unregister_mad_agent(file->agent[i]); - up_read(&file->port->mutex); + compat_up_read(&file->port->mutex); kfree(file); kref_put(&dev->ref, ib_umad_release_dev); @@ -839,10 +839,10 @@ static int ib_umad_sm_close(struct inode }; int ret = 0; - down_write(&port->mutex); + compat_down_write(&port->mutex); if (port->ib_dev) ret = ib_modify_port(port->ib_dev, port->port_num, 0, &props); - up_write(&port->mutex); + compat_up_write(&port->mutex); up(&port->sm_sem); @@ -906,7 +906,7 @@ static int ib_umad_init_port(struct ib_d port->ib_dev = device; port->port_num = port_num; init_MUTEX(&port->sm_sem); - init_rwsem(&port->mutex); + compat_init_rwsem(&port->mutex); INIT_LIST_HEAD(&port->file_list); port->dev = cdev_alloc(); @@ -992,7 +992,7 @@ static void ib_umad_kill_port(struct ib_ umad_port[port->dev_num] = NULL; spin_unlock(&port_lock); - down_write(&port->mutex); + compat_down_write(&port->mutex); port->ib_dev = NULL; @@ -1017,17 +1017,17 @@ static void ib_umad_kill_port(struct ib_ file->agents_dead = 1; 
list_del_init(&file->port_list); - downgrade_write(&port->mutex); + compat_downgrade_write(&port->mutex); for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) if (file->agent[id]) ib_unregister_mad_agent(file->agent[id]); - up_read(&port->mutex); - down_write(&port->mutex); + compat_up_read(&port->mutex); + compat_down_write(&port->mutex); } - up_write(&port->mutex); + compat_up_write(&port->mutex); clear_bit(port->dev_num, dev_map); } From swise at opengridcomputing.com Mon Sep 17 08:25:04 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 17 Sep 2007 10:25:04 -0500 Subject: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. In-Reply-To: <20070916142241.GA26848@2ka.mipt.ru> References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <20070914130941.GG18517@2ka.mipt.ru> <46EC00BE.3020801@opengridcomputing.com> <20070916142241.GA26848@2ka.mipt.ru> Message-ID: <46EE9C50.7070406@opengridcomputing.com> Evgeniy Polyakov wrote: > Hi Steve. > > On Sat, Sep 15, 2007 at 10:56:46AM -0500, Steve Wise (swise at opengridcomputing.com) wrote: >>>> The iWARP driver must translate all listens on address 0.0.0.0 to the >>>> set of rdma-only ip addresses for the device in question. This prevents >>>> incoming connect requests to the TCP ipaddresses from going up the >>>> rdma stack. >>> If the only solutions to solve a problem with hardware are to steal >>> packets or became a real device, then real device is much more >>> appropriate. Is that correct? >>> >> This is a real device. I don't understand your question? Packets >> aren't being stolen. > > I meant port from main network stack. Sorry for confusion. 
> >>>> +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) >>>> +{ >>>> + struct iwch_addrlist *addr; >>>> + >>>> + addr = kmalloc(sizeof *addr, GFP_KERNEL); >>> As a small nitpick: this wants to be sizeof(struct in_ifaddr) >>> >> No, insert_ifa() allocates a struct iwch_addrlist, which has 2 fields: a >> list_head for linking, and a struct in_ifaddr pointer. > > sizeof(struct iwch_addrlist) of course, not (*addr). > To simplify grep. > >>>> + if (!addr) { >>>> + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", >>>> + __FUNCTION__); >>>> + return; >>>> + } >>>> + addr->ifa = ifa; >>>> + mutex_lock(&rnicp->mutex); >>>> + list_add_tail(&addr->entry, &rnicp->addrlist); >>>> + mutex_unlock(&rnicp->mutex); >>>> +} >>> What about providing error back to caller and fail to register? >>> >> There are two causes where this is called: 1) during module init to >> populate the list of iwarp addresses. If we failed in that case then, I >> _could_ then not register. 2) we get called via the notifier mechanism >> when an address is added. If that fails, the caller doesn't care (since >> we're on the notifier callout thread). But the code could perhaps >> unregister the device. I prefer just logging an error in case 2. I'll >> look into not registering if we cannot get any address due to lack of >> memory. But there's another case: we load the module and the admin >> hasn't yet created the ethX:iw interface. >> >> Perhaps I should change the code to only register as a working rdma >> device _when_ we get at least one ethX:iwY interface created? Whatchathink? > > Does second case ends up with problem you described in the initial > e-mail not being fixed? No, the 2nd case (a failure to get the list of iwarp-only ip addresses) will cause rdma apps to not receive any incoming connections. Consider we have eth1 with 1.1.1.1/24. 
Next, the admin creates the iwarp interface thusly: ifconfig eth1:iw 2.2.2.2 netmask 255.255.255.0 up The iw_cxgb3 driver gets a netblock notifier event for the addition of the eth1:iw interface, but FAILS to alloc the memory to keep track that 2.2.2.2 is the iwarp-only address. Next an rdma app binds to 0.0.0.0, port X. The iw_cxgb3 attempts to map 0.0.0.0 to the set of valid iwarp addresses, but there are none. iw_cxgb3 logs a warning saying no addresses are available. The application ends up not listening to any address. So TCP apps aren't affected. Also, if the admin notes the error log entry, and re-initializes the eth1:iw interface -and- this time the kmalloc() works, then the existing rdma app bound to 0.0.0.0 port X will then start receiving connect requests to 2.2.2.2 port X. Hope this makes sense. > >>>> +static inline int is_iwarp_label(char *label) >>>> +{ >>>> + char *colon; >>>> + >>>> + colon = strchr(label, ':'); >>>> + if (colon && !strncmp(colon+1, "iw", 2)) >>>> + return 1; >>>> + return 0; >>>> +} >>> I.e. it is not allowed to create ':iw' alias for anyone else? >>> Well, looks crappy, but if it is the only solution... >>> >> It is kinda crappy. But I don't see a better solution. Any ideas? > > Does creating the whole new netdevice is a too big overhead, or is it > considered bad idea? I think its too big overhead, and pretty invasive on the low level cxgb3 driver. I think having a device in the 'ifconfig -a' after iw_cxgb3 is loaded and devices discovered would be a good thing for the admin. This is the angle Roland suggested. I'm just not sure how to implement it. But if someone could explain how I might create this full netdevice as a pseudo device on top of the real one, maybe I could implement it. Note that non TCP traffic still needs to utilize this interface for ND to work properly with the RDMA core. 
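For readers following the alias-label discussion, the check quoted in this thread can be tried out in user space. The sketch below mirrors the quoted is_iwarp_label() logic; the const qualifier and the example labels (eth1:iw and so on) are assumptions made for illustration, and this is not the driver's final code.

```c
#include <string.h>

/*
 * User-space copy of the label test quoted in this thread: a label is
 * treated as "iwarp-only" when the text after the first ':' starts
 * with "iw". The const qualifier is an addition for this sketch.
 */
static int is_iwarp_label(const char *label)
{
	const char *colon = strchr(label, ':');

	/* No ':' means a plain interface name such as "eth1". */
	return colon && strncmp(colon + 1, "iw", 2) == 0;
}
```

Because only the first two characters after the colon are compared, any alias whose suffix begins with "iw" matches, for example eth1:iw0 or eth1:iwfoo. That is the namespace concern raised in the review.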
> >>>> +static struct iwch_listen_entry *alloc_listener(struct iwch_listen_ep >>>> *ep, >>>> + __be32 addr) >>> Do you know, that cxgb3 function names suck? :) >>> Especially get_skb(). >>> >>>> +{ >>>> + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); >>>> + struct iwch_listen_entry *le; >>>> + >>>> + le = kmalloc(sizeof *le, GFP_KERNEL); >>> Wants to be sizeof(struct iwch_listen_entry) and in other places too. >>> >> Do you mean I shouldn't use sizeof *le, but rather sizeof(struct >> iwch_listen_entry)? Is that the preferred coding style? > > Yes, exactly. > Ok, now I get it. :) Steve. From rdreier at cisco.com Mon Sep 17 08:56:01 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 08:56:01 -0700 Subject: [ofa-general] [PATCH] [WORKAROUND] CONFIG_PREEMPT_RT and ib_umad_close() issue In-Reply-To: <46EE9BCC.1040301@ccur.com> (John Blackwood's message of "Mon, 17 Sep 2007 11:22:52 -0400") References: <46EE9BCC.1040301@ccur.com> Message-ID: > When using OFED-1.2.5 based infiniband kernel modules on 2.6.22 based > kernels with the Ingo Molnar CONFIG_PREEMPT_RT applied, then commands > such as ibnetdiscvoer, smpquery, sminfo, etc. will hang. The problem > is with the downgrade_write() rw semaphore usage in the > ib_umad_close() routine. Can you give a few more details on how PREEMPT_RT changes locking rules (or just exposes existing bugs maybe?) so that the downgrade_write() causes the issue? I would like to fix this cleanly but I don't really understand what the problem is. - R. From mshefty at ichips.intel.com Mon Sep 17 09:09:06 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Sep 2007 09:09:06 -0700 Subject: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. 
In-Reply-To: <46EC00BE.3020801@opengridcomputing.com>
References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <20070914130941.GG18517@2ka.mipt.ru> <46EC00BE.3020801@opengridcomputing.com>
Message-ID: <46EEA6A2.2080001@ichips.intel.com>

>>> + addr = kmalloc(sizeof *addr, GFP_KERNEL);
>>
>> As a small nitpick: this wants to be sizeof(struct in_ifaddr)

See chapter 14 of CodingStyle document. kmalloc(sizeof *addr... is correct.

- Sean

From johnpol at 2ka.mipt.ru Mon Sep 17 09:17:07 2007
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Mon, 17 Sep 2007 20:17:07 +0400
Subject: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
In-Reply-To: <46EEA6A2.2080001@ichips.intel.com>
References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <20070914130941.GG18517@2ka.mipt.ru> <46EC00BE.3020801@opengridcomputing.com> <46EEA6A2.2080001@ichips.intel.com>
Message-ID: <20070917161706.GA28431@2ka.mipt.ru>

On Mon, Sep 17, 2007 at 09:09:06AM -0700, Sean Hefty (mshefty at ichips.intel.com) wrote:
> >>>+ addr = kmalloc(sizeof *addr, GFP_KERNEL);
> >>
> >>As a small nitpick: this wants to be sizeof(struct in_ifaddr)
>
> See chapter 14 of CodingStyle document. kmalloc(sizeof *addr... is correct.

Come on, do not start a flame war about how parameters into kmalloc
should be provided - there are much more serious issues unresolved yet.
It does help grepping the code, but if you feel that this is a serious
threat, then use your preferred way.
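
To make the two allocation styles under discussion concrete, here is a small user-space sketch. It is only an illustration: malloc() stands in for kmalloc(), and struct addr_entry is a hypothetical stand-in for the driver's struct iwch_addrlist.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the driver's struct iwch_addrlist. */
struct addr_entry {
	struct addr_entry *next;
	const void *ifa;
};

/*
 * Size taken from the pointed-to object: the allocation stays correct
 * even if the declared type of the pointer changes later.
 */
static struct addr_entry *alloc_by_object(void)
{
	struct addr_entry *addr = malloc(sizeof *addr);

	return addr;
}

/*
 * Size taken from the spelled-out type: easier to find with grep, but
 * it has to be kept in sync with the pointer's type by hand.
 */
static struct addr_entry *alloc_by_type(void)
{
	return malloc(sizeof(struct addr_entry));
}
```

Both functions request the same number of bytes here, so the disagreement in the thread is about maintainability and searchability, not behaviour.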
-- 
	Evgeniy Polyakov

From sashak at voltaire.com Mon Sep 17 09:43:31 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 17 Sep 2007 18:43:31 +0200
Subject: [ofa-general] [PATCH] opensm: configure scripts merge
In-Reply-To: <1190032929.6272.75.camel@hrosenstock-ws.xsigo.com>
References: <20070915183542.GA6891@sashak.voltaire.com> <1190032929.6272.75.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20070917164331.GP6891@sashak.voltaire.com>

Hi Hal,

On 05:42 Mon 17 Sep , Hal Rosenstock wrote:
> Hi Sasha,
>
> On Sat, 2007-09-15 at 21:35 +0300, Sasha Khapyorsky wrote:
> > This merges all subdirectories configure.in scripts into one toplevel
> > directory script. Separate configuring per subdirectory is not needed
> > anymore.
>
> How is the requirement for separate OpenSM libraries (complib,
> libosmvendor, and libopensm) now met ? There are some tools (e.g.
> ibutils and others) which require these libraries with OpenSM itself.

Yes, it does not change the existing libraries, only the way it is
configured.

Sasha

From davem at davemloft.net Mon Sep 17 09:37:15 2007
From: davem at davemloft.net (David Miller)
Date: Mon, 17 Sep 2007 09:37:15 -0700 (PDT)
Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
In-Reply-To: <1190033500.4230.102.camel@localhost>
References: <1189998103.4230.76.camel@localhost> <20070916.201318.71091570.davem@davemloft.net> <1190033500.4230.102.camel@localhost>
Message-ID: <20070917.093715.106267826.davem@davemloft.net>

From: jamal
Date: Mon, 17 Sep 2007 08:51:40 -0400

> On Sun, 2007-16-09 at 20:13 -0700, David Miller wrote:
>
> > This only makes sense for devices which can 1) scatter-gather
> > and 2) checksum on transmit.
>
> If you have knowledge there are enough descriptors in the driver to
> cover all skbs you are passing, do you need to have #1?
> Note i dont touch fragments, i am assuming the driver is smart enough to > handle them otherwise it wont advertise it can handle scatter-gather Yes, because you can have multiple descriptors per SKB because we have the head part in skb->data and the rest in the page vector. Thus the device must be able to handle multiple descriptors representing one packet. > > Otherwise we make too many copies and/or passes over the data. > > I didnt understand this last bit - you are still going to go over the > list regardless of whether you call ->hard_start_xmit() once or > multiple times over the same list, no? In the later case i am assuming > a trimmed down ->hard_start_xmit() If the device can't checksum, we have to pass over the data to compute the checksum and stick it into the headers. If the device can't scatter-gather, we have to allocate and copy into a linear buffer. Otherwise it's just bumping page reference counts and adjusting offsets, no data touching at all. From hal.rosenstock at gmail.com Mon Sep 17 09:37:23 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 12:37:23 -0400 Subject: [ewg] Re: [ofa-general] [PATCH] opensm: configure scripts merge In-Reply-To: <20070917164331.GP6891@sashak.voltaire.com> References: <20070915183542.GA6891@sashak.voltaire.com> <1190032929.6272.75.camel@hrosenstock-ws.xsigo.com> <20070917164331.GP6891@sashak.voltaire.com> Message-ID: Hi Sasha, On 9/17/07, Sasha Khapyorsky wrote: > Hi Hal, > > On 05:42 Mon 17 Sep , Hal Rosenstock wrote: > > Hi Sasha, > > > > On Sat, 2007-09-15 at 21:35 +0300, Sasha Khapyorsky wrote: > > > This merges all subdirectories configure.in scripts into one toplevel > > > directory script. Separate configuring per subdirectory is not needed > > > anymore. > > > > How is the requirement for separate OpenSM libraries (complib, > > libosmvendor, and libopensm) now met ? There are some tools (e.g. > > ibutils and others) which require these libraries with OpenSM itself. 
> > Yes, it does not change existing libraries. Only way how it is > configuired. I understand. How can library only RPM be produced ? -- Hal > Sasha > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > From sashak at voltaire.com Mon Sep 17 10:01:42 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 17 Sep 2007 19:01:42 +0200 Subject: [ewg] Re: [ofa-general] [PATCH] opensm: configure scripts merge In-Reply-To: References: <20070915183542.GA6891@sashak.voltaire.com> <1190032929.6272.75.camel@hrosenstock-ws.xsigo.com> <20070917164331.GP6891@sashak.voltaire.com> Message-ID: <20070917170142.GQ6891@sashak.voltaire.com> On 12:37 Mon 17 Sep , Hal Rosenstock wrote: > Hi Sasha, > > On 9/17/07, Sasha Khapyorsky wrote: > > Hi Hal, > > > > On 05:42 Mon 17 Sep , Hal Rosenstock wrote: > > > Hi Sasha, > > > > > > On Sat, 2007-09-15 at 21:35 +0300, Sasha Khapyorsky wrote: > > > > This merges all subdirectories configure.in scripts into one toplevel > > > > directory script. Separate configuring per subdirectory is not needed > > > > anymore. > > > > > > How is the requirement for separate OpenSM libraries (complib, > > > libosmvendor, and libopensm) now met ? There are some tools (e.g. > > > ibutils and others) which require these libraries with OpenSM itself. > > > > Yes, it does not change existing libraries. Only way how it is > > configuired. > > I understand. How can library only RPM be produced ? As usual - top level spec file describes opensm-libs rpm package. 
Sasha

>
> -- Hal
>
> > Sasha
> > _______________________________________________
> > ewg mailing list
> > ewg at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >

From dwalker at mvista.com Mon Sep 17 10:07:56 2007
From: dwalker at mvista.com (Daniel Walker)
Date: Mon, 17 Sep 2007 10:07:56 -0700
Subject: [ofa-general] [PATCH] [WORKAROUND] CONFIG_PREEMPT_RT and ib_umad_close() issue
In-Reply-To: 
References: <46EE9BCC.1040301@ccur.com>
Message-ID: <1190048876.3253.41.camel@imap.mvista.com>

On Mon, 2007-09-17 at 08:56 -0700, Roland Dreier wrote:
> > When using OFED-1.2.5 based infiniband kernel modules on 2.6.22 based
> > kernels with the Ingo Molnar CONFIG_PREEMPT_RT applied, then commands
> > such as ibnetdiscover, smpquery, sminfo, etc. will hang. The problem
> > is with the downgrade_write() rw semaphore usage in the
> > ib_umad_close() routine.
>
> Can you give a few more details on how PREEMPT_RT changes locking
> rules (or just exposes existing bugs maybe?) so that the
> downgrade_write() causes the issue? I would like to fix this cleanly
> but I don't really understand what the problem is.

The read/write semaphore functionality is basically reduced to just a
binary semaphore, i.e. one reader or one writer. I think the BUG(); in
downgrade_write() is likely part of a removal plan for
downgrade_write() (that's just a guess, though).

Daniel

From sean.hefty at intel.com Mon Sep 17 10:10:57 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 17 Sep 2007 10:10:57 -0700
Subject: [ofa-general] RE: [PATCH] core/cm: improve request message interpretation of subnet local fields
In-Reply-To: <011d01c7f938$56e03ed0$04c8c8c8@olympus>
References: <011d01c7f938$56e03ed0$04c8c8c8@olympus>
Message-ID: <000001c7f94d$b73c9f70$9c98070a@amr.corp.intel.com>

(I don't think this made it to the mailing list, so re-posting.)

I don't disagree with the concept here, but can you explain the problem that you're seeing?
Is it that the path is assumed to be routed based on the hop_limit (set in ib_init_ah_from_path)? Are any changes needed for active side processing? Btw, I'd prefer something more like: if (cm_req_get_primary_subnet_local. ) primary_path->hop_limit = 1; else primary_path->hop_limit = req_msg->primary_hop_limit; (or '? :' equivalent), versus setting hop_limit, then overriding it in the common case. And I'm fine if we don't keep the comment. - Sean _____ From: Jim Hall [mailto:jhalljr at systemfabricworks.com] Sent: Monday, September 17, 2007 7:38 AM To: general at openfabrics Cc: Hefty, Sean Subject: [PATCH] core/cm: improve request message interpertation of subnet local fields When parsing a CMA connect request message, if the subnet local is 1 (both nodes on same subnet), then explicitly set the hop limit in the corresponding path record to 1. This avoids a Global/Local mis-configuration problem with Solaris infinband CMA sessions. Signed-off-by: Jim L Hall < jhalljr at systemfabricworks.com> --- drivers/infiniband/core/cm.c | 16 ++++++++++++++++ 1 files changed, 16 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index d446998..3d8740c 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1109,6 +1109,14 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, cm_req_get_primary_local_ack_timeout(req_msg); primary_path->packet_life_time -= (primary_path->packet_life_time > 0); + if (cm_req_get_primary_subnet_local(req_msg) == 1) { + + /* At this point we know that both sides are on the same + * subnet, any hop limits above 1 don't make much sense + */ + primary_path->hop_limit = 1; + } + if (req_msg->alt_local_lid) { memset(alt_path, 0, sizeof *alt_path); alt_path->dgid = req_msg->alt_local_gid; @@ -1129,6 +1137,14 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, alt_path->packet_life_time = cm_req_get_alt_local_ack_timeout(req_msg); 
 		alt_path->packet_life_time -= (alt_path->packet_life_time > 0);
+
+		if (cm_req_get_alt_subnet_local(req_msg) == 1) {
+
+			/* At this point we know that both sides are on the same
+			 * subnet, any hop limits above 1 don't make much sense
+			 */
+			alt_path->hop_limit = 1;
+		}
 	}
 }

From john.blackwood at ccur.com Mon Sep 17 10:19:17 2007
From: john.blackwood at ccur.com (John Blackwood)
Date: Mon, 17 Sep 2007 13:19:17 -0400
Subject: [ofa-general] [PATCH] [WORKAROUND] CONFIG_PREEMPT_RT and ib_umad_close() issue
Message-ID: <46EEB715.7060509@ccur.com>

> Subject: Re: [ofa-general] [PATCH] [WORKAROUND] CONFIG_PREEMPT_RT and ib_umad_close() issue
> From: Roland Dreier
> Date: Mon, 17 Sep 2007 08:56:01 -0700
> To: John Blackwood
> CC: linux-rt-users at vger.kernel.org, linux-kernel at vger.kernel.org, general at lists.openfabrics.org, Sven-Thorsten Dietrich
>
> > When using OFED-1.2.5 based infiniband kernel modules on 2.6.22 based
> > kernels with the Ingo Molnar CONFIG_PREEMPT_RT applied, then commands
> > such as ibnetdiscover, smpquery, sminfo, etc. will hang. The problem
> > is with the downgrade_write() rw semaphore usage in the
> > ib_umad_close() routine.
>
> Can you give a few more details on how PREEMPT_RT changes locking
> rules (or just exposes existing bugs maybe?) so that the
> downgrade_write() causes the issue? I would like to fix this cleanly
> but I don't really understand what the problem is.
>
> - R.

Hi Roland,

Thanks for your interest in this matter.

I'm not one of the preempt rt experts, so others may want to speak up ... (thanks Daniel...)

But basically, with CONFIG_PREEMPT_RT enabled, the lock points, such as acquiring a spinlock, potentially become places where the current task may be context switched out / preempted.
Therefore, when a call is made to lock a spinlock for example, the caller should not currently have irqs disabled, or preemption disabled, since a context switch may occur.

I believe that in the case of rw_semaphores, the comments in include/linux/rt_lock.h with the rt preempt patch applied say:

/*
 * RW-semaphores are a spinlock plus a reader-depth count.
 *
 * Note that the semantics are different from the usual
 * Linux rw-sems, in PREEMPT_RT mode we do not allow
 * multiple readers to hold the lock at once, we only allow
 * a read-lock owner to read-lock recursively. This is
 * better for latency, makes the implementation inherently
 * fair and makes it simpler as well:
 */

So I believe that a read lock on a rw_semaphore is just as exclusive as the old write lock, except that the read locks may nest.

And with the preempt patch enabled, the downgrade_write() becomes:

void fastcall rt_downgrade_write(struct rw_semaphore *rwsem)
{
        BUG();
}
EXPORT_SYMBOL(rt_downgrade_write);

So I think code such as:

ib_umad_close()
{
        ...
        down_write(&file->port->mutex);
        ... do exclusive stuff
        downgrade_write(&file->port->mutex);
        ... do potentially recursive stuff
        up_read(&file->port->mutex);
        ...
}

Could probably become (only when CONFIG_PREEMPT_RT is enabled):

ib_umad_close()
{
        ...
        down_read(&file->port->mutex);
        ... do exclusive stuff
        ... do potentially recursive stuff
        up_read(&file->port->mutex);
        ...
}

since the down_read will not allow other readers at the same time, but will allow nesting.

I'm not aware of any tools that find these issues, other than just running through the code. I do know that Ingo's preempt rt patch can be found at

http://www.kernel.org/pub/linux/kernel/projects/rt

and applied to an infiniband kernel. If you enable CONFIG_PREEMPT_RT, and maybe also enable parameters such as CONFIG_DEBUG_PREEMPT, CONFIG_DEBUG_SPINLOCK, etc., you should see the issue with something like an ibnetdiscover invocation.

Thanks.
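
The semantics described in the rt_lock.h comment above (one owner at a time, with recursive read locking by that owner) can be illustrated with a toy user-space model. This is a single-threaded sketch under stated assumptions, not kernel code: the toy_ names are hypothetical, nothing blocks, and a "task" is identified by an arbitrary pointer.

```c
#include <stddef.h>

/*
 * Toy model of an rw-semaphore under the PREEMPT_RT rules quoted
 * above: a single owner plus a reader-depth count.
 */
struct toy_rt_rwsem {
	const void *owner;	/* NULL while the lock is free */
	int read_depth;		/* nesting level held by the owner */
};

/* Try to read-lock on behalf of task; returns 1 on success, 0 if busy. */
static int toy_down_read_trylock(struct toy_rt_rwsem *sem, const void *task)
{
	if (sem->owner != NULL && sem->owner != task)
		return 0;	/* a second reader is excluded */
	sem->owner = task;
	sem->read_depth++;	/* the owner may nest read locks */
	return 1;
}

/* Drop one nesting level; the lock is free again at depth zero. */
static void toy_up_read(struct toy_rt_rwsem *sem)
{
	if (sem->read_depth > 0 && --sem->read_depth == 0)
		sem->owner = NULL;
}
```

In this model a read lock is already exclusive against other tasks, which is why a write-then-downgrade sequence would gain nothing over taking the read lock directly.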
From jhalljr at systemfabricworks.com Mon Sep 17 10:49:20 2007 From: jhalljr at systemfabricworks.com (Jim Hall) Date: Mon, 17 Sep 2007 12:49:20 -0500 Subject: [ofa-general] Re: [PATCH] core/cm: improve request message interpretation of subnet local fields References: <011d01c7f938$56e03ed0$04c8c8c8@olympus> <000001c7f94d$b73c9f70$9c98070a@amr.corp.intel.com> Message-ID: <01d201c7f953$13a8cd60$04c8c8c8@olympus> Hi Sean, The problem arises when the active Solaris client is sending a connection request to a passive OFED server instance. Solaris will set the hop_limit field to 0xFF and will not expect or enable GRH routing. The subsequent exchange of RC messages are therefore silently dropped since one side expects GRH traffic and the other doesn't. The active side seems to work ok for local only subnets so nothing needs to be changed there.
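
The passive-side rule under discussion can be sketched in isolation as follows. This is only an illustration: struct toy_req and toy_path_hop_limit() are hypothetical stand-ins for the CM REQ accessors and the path-record field, not the actual cm.c code.

```c
#include <stdint.h>

/* Hypothetical, simplified view of the two REQ fields involved. */
struct toy_req {
	uint8_t subnet_local;	/* 1 when both ends are on the same subnet */
	uint8_t hop_limit;	/* value carried in the REQ, e.g. 0xFF */
};

/*
 * Choose the hop limit for the path record built from a received REQ:
 * a subnet-local path is forced to 1 so that it is not mistaken for a
 * routed (GRH) path, otherwise the sender's value is kept.
 */
static uint8_t toy_path_hop_limit(const struct toy_req *req)
{
	return req->subnet_local == 1 ? 1 : req->hop_limit;
}
```

With the Solaris behaviour described here (subnet local set, hop_limit 0xFF), this selection yields 1 rather than 0xFF.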
Here is an updated patch: Signed-off-by: Jim L Hall --- drivers/infiniband/core/cm.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index d446998..25a77ec 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1095,7 +1095,10 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, primary_path->dlid = req_msg->primary_local_lid; primary_path->slid = req_msg->primary_remote_lid; primary_path->flow_label = cm_req_get_primary_flow_label(req_msg); - primary_path->hop_limit = req_msg->primary_hop_limit; + if (cm_req_get_primary_subnet_local(req_msg) == 1) + primary_path->hop_limit = 1; + else + primary_path->hop_limit = req_msg->primary_hop_limit; primary_path->traffic_class = req_msg->primary_traffic_class; primary_path->reversible = 1; primary_path->pkey = req_msg->pkey; @@ -1116,7 +1119,10 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, alt_path->dlid = req_msg->alt_local_lid; alt_path->slid = req_msg->alt_remote_lid; alt_path->flow_label = cm_req_get_alt_flow_label(req_msg); - alt_path->hop_limit = req_msg->alt_hop_limit; + if (cm_req_get_alt_subnet_local(req_msg) == 1) + alt_path->hop_limit = 1; + else + alt_path->hop_limit = req_msg->alt_hop_limit; alt_path->traffic_class = req_msg->alt_traffic_class; alt_path->reversible = 1; alt_path->pkey = req_msg->pkey; Thanks, - Jim H. ----- Original Message ----- From: Sean Hefty To: 'Jim Hall' ; general at lists.openfabrics.org Sent: Monday, September 17, 2007 12:10 PM Subject: RE: [PATCH] core/cm: improve request message interpretation of subnet local fields (I don't think this made it to the mailng list, so re-posting.) I don't disagree with the concept here, but can you explain the problem that you're seeing? Is it that the path is assumed to be routed based on the hop_limit (set in ib_init_ah_from_path)? Are any changes needed for active side processing? 
Btw, I'd prefer something more like: if (cm_req_get_primary_subnet_local... ) primary_path->hop_limit = 1; else primary_path->hop_limit = req_msg->primary_hop_limit; (or '? :' equivalent), versus setting hop_limit, then overriding it in the common case. And I'm fine if we don't keep the comment. - Sean ------------------------------------------------------------------------------ From: Jim Hall [mailto:jhalljr at systemfabricworks.com] Sent: Monday, September 17, 2007 7:38 AM To: general at openfabrics Cc: Hefty, Sean Subject: [PATCH] core/cm: improve request message interpertation of subnet local fields When parsing a CMA connect request message, if the subnet local is 1 (both nodes on same subnet), then explicitly set the hop limit in the corresponding path record to 1. This avoids a Global/Local mis-configuration problem with Solaris infinband CMA sessions. Signed-off-by: Jim L Hall --- drivers/infiniband/core/cm.c | 16 ++++++++++++++++ 1 files changed, 16 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index d446998..3d8740c 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1109,6 +1109,14 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, cm_req_get_primary_local_ack_timeout(req_msg); primary_path->packet_life_time -= (primary_path->packet_life_time > 0); + if (cm_req_get_primary_subnet_local(req_msg) == 1) { + + /* At this point we know that both sides are on the same + * subnet, any hop limits above 1 don't make much sense + */ + primary_path->hop_limit = 1; + } + if (req_msg->alt_local_lid) { memset(alt_path, 0, sizeof *alt_path); alt_path->dgid = req_msg->alt_local_gid; @@ -1129,6 +1137,14 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, alt_path->packet_life_time = cm_req_get_alt_local_ack_timeout(req_msg); alt_path->packet_life_time -= (alt_path->packet_life_time > 0); + + if (cm_req_get_alt_subnet_local(req_msg) == 1) { + + /* At this 
point we know that both sides are on the same + * subnet, any hop limits above 1 don't make much sense + */ + alt_path->hop_limit = 1; + } } } From hal.rosenstock at gmail.com Mon Sep 17 10:56:07 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 13:56:07 -0400 Subject: [ofa-general] Re: [PATCH] core/cm: improve request message interpretation of subnet local fields In-Reply-To: <01d201c7f953$13a8cd60$04c8c8c8@olympus> References: <011d01c7f938$56e03ed0$04c8c8c8@olympus> <000001c7f94d$b73c9f70$9c98070a@amr.corp.intel.com> <01d201c7f953$13a8cd60$04c8c8c8@olympus> Message-ID: Jim, On 9/17/07, Jim Hall wrote: > > Hi Sean, > > The problem arises when the active Solaris client is sending a connection > request to a passive OFED server instance. Solaris will set the hop_limit > field to 0xFF and will not expect or enable GRH routing. The subsequent > exchange of RC messages are therefore silently dropped since one side > expects GRH traffic and the other doesn't. Sounds like a Solaris compliance bug as you are required to be able to receive either LRH only or not (GRH + LRH) even if subnet local. -- Hal > > The active side seems to work ok for local only subnets so nothing needs to > be changed there. 
From mshefty at ichips.intel.com Mon Sep 17 11:01:47 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Sep 2007 11:01:47 -0700 Subject: [ofa-general] Re: [PATCH] core/cm: improve request message interpretation of subnet local fields In-Reply-To: <01d201c7f953$13a8cd60$04c8c8c8@olympus> References: 
<011d01c7f938$56e03ed0$04c8c8c8@olympus> <000001c7f94d$b73c9f70$9c98070a@amr.corp.intel.com> <01d201c7f953$13a8cd60$04c8c8c8@olympus> Message-ID: <46EEC10B.1060704@ichips.intel.com> > The problem arises when the active Solaris client is sending a > connection request to a passive OFED server instance. Solaris will set > the hop_limit field to 0xFF and will not expect or enable GRH routing. > The subsequent exchange of RC messages are therefore silently dropped > since one side expects GRH traffic and the other doesn't. Is this an issue with the SM setting the hop_limit to 0xff or the active CM? Currently the ib_cm sets the local_subnet value to 1 on the active side. I have a patch that sets it based on the hop_limit in the path record. I'm trying to determine if a more complicated solution will be needed for ib router support. (Those changes can be separate if needed.) > The active side seems to work ok for local only subnets so nothing needs > to be changed there. > > Here is an updated patch: > > When parsing a CMA connect request message, if the subnet local is 1 > (both nodes on same subnet), then explicitly set the hop limit in > the corresponding path record to 1. > This avoids a Global/Local mis-configuration problem with Solaris > infinband CMA sessions. 
> Signed-off-by: Jim L Hall > > > --- > drivers/infiniband/core/cm.c | 10 ++++++++-- > 1 files changed, 8 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index d446998..25a77ec 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -1095,7 +1095,10 @@ static void cm_format_paths_from_req(struct > cm_req_msg *req_msg, > primary_path->dlid = req_msg->primary_local_lid; > primary_path->slid = req_msg->primary_remote_lid; > primary_path->flow_label = cm_req_get_primary_flow_label(req_msg); > - primary_path->hop_limit = req_msg->primary_hop_limit; > + if (cm_req_get_primary_subnet_local(req_msg) == 1) > + primary_path->hop_limit = 1; > + else > + primary_path->hop_limit = req_msg->primary_hop_limit; > primary_path->traffic_class = req_msg->primary_traffic_class; > primary_path->reversible = 1; > primary_path->pkey = req_msg->pkey; > @@ -1116,7 +1119,10 @@ static void cm_format_paths_from_req(struct > cm_req_msg *req_msg, > alt_path->dlid = req_msg->alt_local_lid; > alt_path->slid = req_msg->alt_remote_lid; > alt_path->flow_label = cm_req_get_alt_flow_label(req_msg); > - alt_path->hop_limit = req_msg->alt_hop_limit; > + if (cm_req_get_alt_subnet_local(req_msg) == 1) > + alt_path->hop_limit = 1; > + else > + alt_path->hop_limit = req_msg->alt_hop_limit; > alt_path->traffic_class = req_msg->alt_traffic_class; > alt_path->reversible = 1; > alt_path->pkey = req_msg->pkey; > > From jhalljr at systemfabricworks.com Mon Sep 17 11:07:38 2007 From: jhalljr at systemfabricworks.com (Jim Hall) Date: Mon, 17 Sep 2007 13:07:38 -0500 Subject: [ofa-general] Re: [PATCH] core/cm: improve request message interpretation of subnet local fields References: <011d01c7f938$56e03ed0$04c8c8c8@olympus> <000001c7f94d$b73c9f70$9c98070a@amr.corp.intel.com> <01d201c7f953$13a8cd60$04c8c8c8@olympus> <46EEC10B.1060704@ichips.intel.com> Message-ID: <01f101c7f955$a22ef030$04c8c8c8@olympus> The issue is between the 
active CM (in this case Solaris) and passive OFED. The SM doesn't look to be involved. ----- Original Message ----- From: "Sean Hefty" To: "Jim Hall" Cc: "Sean Hefty" ; Sent: Monday, September 17, 2007 1:01 PM Subject: Re: [ofa-general] Re: [PATCH] core/cm: improve request message interpretation of subnet local fields >> The problem arises when the active Solaris client is sending a connection >> request to a passive OFED server instance. Solaris will set the hop_limit >> field to 0xFF and will not expect or enable GRH routing. The subsequent >> exchange of RC messages are therefore silently dropped since one side >> expects GRH traffic and the other doesn't. > > Is this an issue with the SM setting the hop_limit to 0xff or the active > CM? Currently the ib_cm sets the local_subnet value to 1 on the active > side. I have a patch that sets it based on the hop_limit in the path > record. I'm trying to determine if a more complicated solution will be > needed for ib router support. (Those changes can be separate if needed.) > >> The active side seems to work ok for local only subnets so nothing needs >> to be changed there. >> Here is an updated patch: >> When parsing a CMA connect request message, if the subnet local is 1 >> (both nodes on same subnet), then explicitly set the hop limit in >> the corresponding path record to 1. >> This avoids a Global/Local mis-configuration problem with Solaris >> infinband CMA sessions. 
Signed-off-by: Jim L Hall >> > Acked-by: Sean Hefty > >> > >> --- >> drivers/infiniband/core/cm.c | 10 ++++++++-- >> 1 files changed, 8 insertions(+), 2 deletions(-) >> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c >> index d446998..25a77ec 100644 >> --- a/drivers/infiniband/core/cm.c >> +++ b/drivers/infiniband/core/cm.c >> @@ -1095,7 +1095,10 @@ static void cm_format_paths_from_req(struct >> cm_req_msg *req_msg, >> primary_path->dlid = req_msg->primary_local_lid; >> primary_path->slid = req_msg->primary_remote_lid; >> primary_path->flow_label = >> cm_req_get_primary_flow_label(req_msg); >> - primary_path->hop_limit = req_msg->primary_hop_limit; >> + if (cm_req_get_primary_subnet_local(req_msg) == 1) >> + primary_path->hop_limit = 1; >> + else >> + primary_path->hop_limit = req_msg->primary_hop_limit; >> primary_path->traffic_class = req_msg->primary_traffic_class; >> primary_path->reversible = 1; >> primary_path->pkey = req_msg->pkey; >> @@ -1116,7 +1119,10 @@ static void cm_format_paths_from_req(struct >> cm_req_msg *req_msg, >> alt_path->dlid = req_msg->alt_local_lid; >> alt_path->slid = req_msg->alt_remote_lid; >> alt_path->flow_label = >> cm_req_get_alt_flow_label(req_msg); >> - alt_path->hop_limit = req_msg->alt_hop_limit; >> + if (cm_req_get_alt_subnet_local(req_msg) == 1) >> + alt_path->hop_limit = 1; >> + else >> + alt_path->hop_limit = req_msg->alt_hop_limit; >> alt_path->traffic_class = req_msg->alt_traffic_class; >> alt_path->reversible = 1; >> alt_path->pkey = req_msg->pkey; >> > From jgunthorpe at obsidianresearch.com Mon Sep 17 11:30:45 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 17 Sep 2007 12:30:45 -0600 Subject: [ofa-general] Re: [PATCH] core/cm: improve request message interpretation of subnet local fields In-Reply-To: <46EEC10B.1060704@ichips.intel.com> References: <011d01c7f938$56e03ed0$04c8c8c8@olympus> <000001c7f94d$b73c9f70$9c98070a@amr.corp.intel.com> 
<01d201c7f953$13a8cd60$04c8c8c8@olympus> <46EEC10B.1060704@ichips.intel.com>
Message-ID: <20070917183045.GY4472@obsidianresearch.com>

On Mon, Sep 17, 2007 at 11:01:47AM -0700, Sean Hefty wrote:
> >The problem arises when the active Solaris client is sending a
> >connection request to a passive OFED server instance. Solaris will set
> >the hop_limit field to 0xFF and will not expect or enable GRH routing.
> >The subsequent exchange of RC messages are therefore silently dropped
> >since one side expects GRH traffic and the other doesn't.
>
> Is this an issue with the SM setting the hop_limit to 0xff or the active
> CM? Currently the ib_cm sets the local_subnet value to 1 on the active
> side. I have a patch that sets it based on the hop_limit in the path
> record. I'm trying to determine if a more complicated solution will be
> needed for ib router support. (Those changes can be separate if needed.)

I'm with Hal on this - why does this cause a problem? There is no IB packet verification check that tests if a GRH is present, only if it is present it must be valid - so how can an extra correctly filled in GRH cause anything but degraded performance?

So, it either must be that Solaris is not configuring the card to validate the GRH properly, or the GRH fields produced by Linux are incorrect.

I'm generally leery about overriding GRH insertion outside of the control of the SM. I'd much prefer it if the clients never tested the prefix to select if a GRH is needed or not. There may be useful HA situations where a SM could route traffic for an apparently on-link GID to a router port - same use cases as proxy arp in IP land.
Jason

From sean.hefty at intel.com Mon Sep 17 11:39:46 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 17 Sep 2007 11:39:46 -0700
Subject: [ofa-general] Re: [PATCH] core/cm: improve request message interpretation of subnet local fields
In-Reply-To: <20070917183045.GY4472@obsidianresearch.com>
References: <011d01c7f938$56e03ed0$04c8c8c8@olympus><000001c7f94d$b73c9f70$9c98070a@amr.corp.intel.com><01d201c7f953$13a8cd60$04c8c8c8@olympus><46EEC10B.1060704@ichips.intel.com> <20070917183045.GY4472@obsidianresearch.com>
Message-ID: <000501c7f95a$1f4ea890$9c98070a@amr.corp.intel.com>

>I'm with Hal on this - why does this cause a problem? There is no IB
>packet verification check that tests if a GRH is present, only if it
>is present it must be valid - so how can an extra correctly filled in
>GRH cause anything but degraded performance?

ib_init_ah_from_path() uses the hop_limit in the path record to determine if a GRH should be used. It sets the address handle attributes (used to configure the QP) based on hop_limit > 1. If hop_limit is set incorrectly in the CM REQ, the path record formed by the CM based on data carried in the REQ could have invalid GRH values.

It's possible that this is an active side CM issue, but that's not clear to me.

- Sean

From jgunthorpe at obsidianresearch.com Mon Sep 17 11:51:34 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 17 Sep 2007 12:51:34 -0600
Subject: [ofa-general] Re: [PATCH] core/cm: improve request message interpretation of subnet local fields
In-Reply-To: <000501c7f95a$1f4ea890$9c98070a@amr.corp.intel.com>
References: <20070917183045.GY4472@obsidianresearch.com> <000501c7f95a$1f4ea890$9c98070a@amr.corp.intel.com>
Message-ID: <20070917185134.GA4472@obsidianresearch.com>

On Mon, Sep 17, 2007 at 11:39:46AM -0700, Sean Hefty wrote:
> >I'm with Hal on this - why does this cause a problem?
There is no IB
> >packet verification check that tests if a GRH is present, only if it
> >is present it must be valid - so how can an extra correctly filled in
> >GRH cause anything but degraded performance?
>
> ib_init_ah_from_path() uses the hop_limit in the path record to
> determine if a GRH should be used. It sets the address handle
> attributes (used to configure the QP) based on hop_limit > 1. If
> hop_limit is set incorrectly in the CM REQ, the path record formed
> by the CM based on data carried in the REQ could have invalid GRH
> values.

But using the hop limit this way is just a convention we have adopted. There is nothing in IBA that allows the active side to send a REQ with invalid GIDs/etc in any situation. The passive side should always be able to form a correct GRH based on any REQ.

Jason

From sean.hefty at intel.com Mon Sep 17 11:56:03 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 17 Sep 2007 11:56:03 -0700
Subject: [ofa-general] Re: [PATCH] core/cm: improve request message interpretation of subnet local fields
In-Reply-To: <20070917185134.GA4472@obsidianresearch.com>
References: <20070917183045.GY4472@obsidianresearch.com> <000501c7f95a$1f4ea890$9c98070a@amr.corp.intel.com> <20070917185134.GA4472@obsidianresearch.com>
Message-ID: <000601c7f95c$652b9150$9c98070a@amr.corp.intel.com>

>But using the hop limit this way is just a convention we have
>adopted. There is nothing in IBA that allows the active side to send a
>REQ with invalid GIDs/etc in any situation. The passive side should
>always be able to form a correct GRH based on any REQ.

I don't think TC and FL need to be valid, though I can't say off the top of my head if that would break the connection.
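For readers following the thread, the two rules being argued about can be put side by side in a few lines of standalone C. This is only an illustrative sketch: struct req_path_fields and the function names are hypothetical stand-ins, not the kernel's cm_req_msg accessors. The first function mirrors the passive-side choice in the updated patch, the second the "GRH only when hop_limit > 1" convention attributed to ib_init_ah_from_path() above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two REQ fields under discussion. */
struct req_path_fields {
	bool    subnet_local; /* SubnetLocal bit carried in the REQ */
	uint8_t hop_limit;    /* hop limit carried in the REQ */
};

/* Passive-side choice from the updated patch: force 1 when subnet local. */
static uint8_t passive_hop_limit(const struct req_path_fields *req)
{
	return req->subnet_local ? 1 : req->hop_limit;
}

/* Convention described in the thread: a GRH is used only if hop_limit > 1. */
static bool path_uses_grh(uint8_t hop_limit)
{
	return hop_limit > 1;
}

/* The reported Solaris case: subnet local, but hop_limit sent as 0xFF. */
static void hop_limit_example(void)
{
	struct req_path_fields req = { .subnet_local = true, .hop_limit = 0xFF };

	/* Taking the REQ value as-is would enable the GRH ... */
	assert(path_uses_grh(req.hop_limit));
	/* ... while the patched selection keeps the path LRH-only. */
	assert(!path_uses_grh(passive_hop_limit(&req)));
}
```

Whether the passive side should override the value at all is exactly the open question in the messages above; the sketch only shows what the patch does, not that it is the right fix.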
From sashak at voltaire.com Mon Sep 17 12:13:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 17 Sep 2007 21:13:51 +0200 Subject: [ofa-general] [PATCH] opensm: fix iba/*.h installation path Message-ID: <20070917191351.GS6891@sashak.voltaire.com> Fix iba/*.h installation path. Signed-off-by: Sasha Khapyorsky --- opensm/include/Makefile.am | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/opensm/include/Makefile.am b/opensm/include/Makefile.am index fc2d7ca..5c41126 100644 --- a/opensm/include/Makefile.am +++ b/opensm/include/Makefile.am @@ -156,4 +156,6 @@ EXTRA_DIST = \ $(srcdir)/vendor/osm_vendor_sa_api.h \ $(srcdir)/vendor/osm_mtl_bind.h +pkgincludedir = $(includedir)/infiniband + dist-hook: -- 1.5.3.1.91.gd3392 From kliteyn at dev.mellanox.co.il Mon Sep 17 12:06:06 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 17 Sep 2007 21:06:06 +0200 Subject: [ofa-general] [PATCH] osm: mkey lease period description in options file Message-ID: <46EED01E.2060104@dev.mellanox.co.il> M_Key lease period description should be in [sec] instead of [msec]. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_subnet.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 3895732..9456f22 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1298,7 +1298,7 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "guid 0x%016" PRIx64 "\n\n" "# M_Key value sent to all ports qualifying all Set(PortInfo)\n" "m_key 0x%016" PRIx64 "\n\n" - "# The lease period used for the M_Key on this subnet in [msec]\n" + "# The lease period used for the M_Key on this subnet in [sec]\n" "m_key_lease_period %u\n\n" "# SM_Key value of the SM to qualify rcv SA queries as 'trusted'\n" "sm_key 0x%016" PRIx64 "\n\n" -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Mon Sep 17 12:09:55 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 17 Sep 2007 21:09:55 +0200 Subject: [ofa-general] [PATCH] osm: TrapRepress was failing for mkey != 0 Message-ID: <46EED103.9010808@dev.mellanox.co.il> TrapRepress always had mkey 0, which was copied from trap notice's mkey (which is always 0). 
As a result, TrapRepress was failing for port with mkey != 0 Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_trap_rcv.c | 10 ++++++++++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index 3323a83..9c28005 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -439,6 +439,16 @@ __osm_trap_rcv_process_request(IN osm_trap_rcv_t * const p_rcv, osm_dump_notice(p_rcv->p_log, p_ntci, OSM_LOG_VERBOSE); + p_physp = osm_get_physp_by_mad_addr(p_rcv->p_log, + p_rcv->p_subn, + &tmp_madw.mad_addr); + if (p_physp) + p_smp->m_key = p_physp->port_info.m_key; + else + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3809: " + "Failed to find source physical port for trap\n"); + status = osm_resp_send(p_rcv->p_resp, &tmp_madw, 0, payload); if (status != IB_SUCCESS) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, -- 1.5.1.4 From mshefty at ichips.intel.com Mon Sep 17 12:11:36 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Sep 2007 12:11:36 -0700 Subject: [ofa-general] IPoIB CM (NOSRQ) [PATCH 1] review Message-ID: <46EED168.3050102@ichips.intel.com> Copied from web link. I didn't have this in my inbox anymore. I should also mention that I'm not up on all of the IPoIB RFCs, so if some of my comments don't apply to an RFC, just ignore them. :) Most of these comments are about code organization, to make the with and without srq code cleaner. > --- a/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib.h 2007-08-20 17:39:25.000000000 -0400 > +++ b/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib.h 2007-08-20 17:49:14.000000000 -0400 > @@ -95,11 +95,14 @@ enum { > IPOIB_MCAST_FLAG_ATTACHED = 3, > }; > > +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) > #define IPOIB_OP_RECV (1ul << 31) Inserting a blank line here... 
> #ifdef CONFIG_INFINIBAND_IPOIB_CM > -#define IPOIB_CM_OP_SRQ (1ul << 30) > +#define IPOIB_CM_OP_RECV (1ul << 30) > + ...but not here, would make it easier to follow the ifdef's. > +#define NOSRQ_INDEX_TABLE_SIZE 128 I'm not fond of this name. It's really just the default number of supported connected QPs. The value is only used in one place - to set max_rc_qp, which is a module parameter anyway. We can just remove this definition. > #else > -#define IPOIB_CM_OP_SRQ (0) > +#define IPOIB_CM_OP_RECV (0) > #endif > > /* structs */ > @@ -166,11 +169,14 @@ enum ipoib_cm_state { > }; > > struct ipoib_cm_rx { > - struct ib_cm_id *id; > - struct ib_qp *qp; > - struct list_head list; > - struct net_device *dev; > - unsigned long jiffies; > + struct ib_cm_id *id; > + struct ib_qp *qp; > + struct ipoib_cm_rx_buf *rx_ring; /* Used by NOSRQ only */ Nit: can we use 'no SRQ' or 'without SRQ', rather than 'NOSRQ' as a single string? Or, alternately, only call out when SRQ is in use? > + struct list_head list; > + struct net_device *dev; > + unsigned long jiffies; > + u32 index; /* wr_ids are distinguished by index > + * to identify the QP -NOSRQ only */ > enum ipoib_cm_state state; > }; > > @@ -215,6 +221,8 @@ struct ipoib_cm_dev_priv { > struct ib_wc ibwc[IPOIB_NUM_WC]; > struct ib_sge rx_sge[IPOIB_CM_RX_SG]; > struct ib_recv_wr rx_wr; > + struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() > + *for usage of this element */ Just call this rx_table. We're not storing indices... Also, linking the entries would avoid the linear search through the table. E.g. struct ipoib_cm_rx_entry { int next; /* -1 = end of list */ struct ipoib_cm_rx *rx; }; struct ipoib_cm_dev_priv { ... int free_entry; /* -1 = none free */ struct ipoib_cm_rx_entry **rx_table; ... 
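A standalone sketch of the linked free-entry idea suggested above, so that finding a free slot does not need a linear search. Assumptions, all hypothetical: the table is a flat array of entries rather than the pointer-to-pointer layout in the outline, the rx pointer is a plain void *, and there is no locking; a driver would need to protect the table with its own lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the suggested table entry. */
struct rx_entry {
	int   next; /* index of next free entry, -1 = end of list */
	void *rx;   /* would be struct ipoib_cm_rx * in the driver */
};

struct rx_table {
	int              free_entry; /* head of free list, -1 = none free */
	int              size;
	struct rx_entry *entries;
};

/* Chain every entry onto the free list: 0 -> 1 -> ... -> size-1 -> -1. */
static void rx_table_init(struct rx_table *t, struct rx_entry *entries, int size)
{
	int i;

	t->entries = entries;
	t->size = size;
	for (i = 0; i < size; i++) {
		entries[i].rx = NULL;
		entries[i].next = (i + 1 < size) ? i + 1 : -1;
	}
	t->free_entry = (size > 0) ? 0 : -1;
}

/* Pop the head of the free list in O(1); returns the index or -1 if full. */
static int rx_table_alloc(struct rx_table *t, void *rx)
{
	int idx = t->free_entry;

	if (idx < 0)
		return -1;
	t->free_entry = t->entries[idx].next;
	t->entries[idx].next = -1;
	t->entries[idx].rx = rx;
	return idx;
}

/* Push an entry back onto the head of the free list. */
static void rx_table_free(struct rx_table *t, int idx)
{
	assert(idx >= 0 && idx < t->size);
	t->entries[idx].rx = NULL;
	t->entries[idx].next = t->free_entry;
	t->free_entry = idx;
}
```

The index returned by rx_table_alloc() is what would be carried in the wr_id to get from a completion back to its connection.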
> }; > > /* > @@ -438,6 +446,7 @@ void ipoib_drain_cq(struct net_device *d > /* We don't support UC connections at the moment */ > #define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC)) > > +extern int max_rc_qp; > static inline int ipoib_cm_admin_enabled(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > --- a/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-08-20 17:39:25.000000000 -0400 > +++ b/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-08-20 17:51:46.000000000 -0400 > @@ -49,6 +49,18 @@ MODULE_PARM_DESC(cm_data_debug_level, > > #include "ipoib.h" > > +int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE; > +static int max_recv_buf = 1024; /* Default is 1024 MB */ > + > +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0444); > +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported; must be a power of 2"); Would multiple values be better here? Something like: max_conn_qp, qp_type, and use_srq. We're getting into a lot of possible options: UD, UD with SRQ (?), UC, RC, RC with SRQ, UC with SRQ and spec changes... I'm guessing that each one is useful under different configurations (fabric size, application load, etc.) It would be nice if the framework moved in the direction of supporting any of these. E.g. use 'conn' in place of 'rc'. Also, why is max_rc_qp restricted to a power of 2? We can just let the lower (30?) bits of a wr_id match the ipoib_cm_rx index. max_rc_qp just needs to be less than 2^30, which is required anyway. > +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); > +MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB"); Do we really need a new parameter here? What controls does the user have access over? If they can set the max number of QPs, size of each QP, and the size of each message, then I think we should eliminate this. 
(And if they can't set each of these, then maybe we should look at adding those parameters versus an all encompassing max memory type of value.)

Btw, the naming and description are a little misleading. This is a limit on all allocated receive buffers for all connected QPs. The name and description make it sound like a limitation for a single buffer.

> +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for NOSRQ */

Taking some of the changes above, we can drop this variable. Even without the changes above, there's no point in setting max_rc_qp higher than the number of QPs that the user could create because of max_recv_buf limitations.

In other words, I would rather see current_rc_qp and max_recv_buf go away, but even if max_recv_buf were kept, the recv_mem_used check in allocate_and_post_rbuf_nosrq() should instead be used to limit setting max_rc_qp. (Hope this makes sense.)

> +
> +#define NOSRQ_INDEX_MASK (max_rc_qp -1)

This goes away by just reserving the lower bits of the wr_id for the rx_table index.
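As an illustration of reserving fixed low bits of the wr_id for the table index, here is a standalone sketch. The layout is an assumption pieced together from the quoted patch and the comments above, not something the thread settles: bits 0-29 hold the rx table index, bit 30 mirrors IPOIB_CM_OP_RECV, bit 31 is left for IPOIB_OP_RECV, and the upper 32 bits hold the ring slot (the patch uses id >> 32 for that). All macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: [63..32] ring slot | [31] OP_RECV | [30] CM_OP_RECV | [29..0] index */
#define WR_ID_INDEX_BITS 30
#define WR_ID_INDEX_MASK ((UINT64_C(1) << WR_ID_INDEX_BITS) - 1)
#define WR_ID_CM_OP_RECV (UINT64_C(1) << 30) /* mirrors IPOIB_CM_OP_RECV */
#define WR_ID_OP_RECV    (UINT64_C(1) << 31) /* mirrors IPOIB_OP_RECV */

/* Build a wr_id from a ring slot and an rx table index. */
static uint64_t wr_id_pack(uint32_t ring_slot, uint32_t rx_index)
{
	return ((uint64_t)ring_slot << 32) | WR_ID_CM_OP_RECV |
	       ((uint64_t)rx_index & WR_ID_INDEX_MASK);
}

/* Recover the rx table index; the mask is fixed, independent of table size. */
static uint32_t wr_id_rx_index(uint64_t wr_id)
{
	return (uint32_t)(wr_id & WR_ID_INDEX_MASK);
}

/* Recover the ring slot from the upper 32 bits. */
static uint32_t wr_id_ring_slot(uint64_t wr_id)
{
	return (uint32_t)(wr_id >> 32);
}

static void wr_id_example(void)
{
	/* 100 is a valid index even if the table size is not a power of 2. */
	uint64_t id = wr_id_pack(5, 100);

	assert(wr_id_rx_index(id) == 100);
	assert(wr_id_ring_slot(id) == 5);
	assert((id & WR_ID_CM_OP_RECV) != 0);
	assert((id & WR_ID_OP_RECV) == 0);
}
```

Because the mask no longer depends on max_rc_qp, the only constraint left on the table size is that it stays below 2^30.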
> #define IPOIB_CM_IETF_ID 0x1000000000000000ULL > > #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) > @@ -81,20 +93,21 @@ static void ipoib_cm_dma_unmap_rx(struct > ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); > } > > -static int ipoib_cm_post_receive(struct net_device *dev, int id) > +static int post_receive_srq(struct net_device *dev, u64 id) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ib_recv_wr *bad_wr; > int i, ret; > > - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; > + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; > > for (i = 0; i < IPOIB_CM_RX_SG; ++i) > priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; > > ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); > if (unlikely(ret)) { > - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); > + ipoib_warn(priv, "post srq failed for buf %lld (%d)\n", > + (unsigned long long)id, ret); > ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, > priv->cm.srq_ring[id].mapping); > dev_kfree_skb_any(priv->cm.srq_ring[id].skb); I see that the code was already this way, but it's not clear why unmap and free_skb are called within this function. The mapping and skb allocation are not done here. I would rather see the function that does the mapping and allocation do the cleanup, rather than assuming that it's done in a called routine. Optionally, move the mapping and allocation into this routine. 
> -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, > +static int post_receive_nosrq(struct net_device *dev, u64 id) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ib_recv_wr *bad_wr; > + int i, ret; > + u32 index; > + u32 wr_id; > + struct ipoib_cm_rx *rx_ptr; > + > + index = id & NOSRQ_INDEX_MASK; > + wr_id = id >> 32; > + > + rx_ptr = priv->cm.rx_index_table[index]; > + > + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; > + > + for (i = 0; i < IPOIB_CM_RX_SG; ++i) > + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; > + > + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); > + if (unlikely(ret)) { > + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", > + wr_id, ret); > + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, > + rx_ptr->rx_ring[wr_id].mapping); > + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); > + rx_ptr->rx_ring[wr_id].skb = NULL; > + } same cleanup issue as above > + > + return ret; > +} > + > +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, > + int frags, > u64 mapping[IPOIB_CM_RX_SG]) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct sk_buff *skb; > int i; > + struct ipoib_cm_rx *rx_ptr; > + u32 index, wr_id; > > skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); > if (unlikely(!skb)) > @@ -141,7 +189,14 @@ static struct sk_buff *ipoib_cm_alloc_rx > goto partial_error; > } > > - priv->cm.srq_ring[id].skb = skb; > + if (priv->cm.srq) > + priv->cm.srq_ring[id].skb = skb; > + else { > + index = id & NOSRQ_INDEX_MASK; > + wr_id = id >> 32; > + rx_ptr = priv->cm.rx_index_table[index]; > + rx_ptr->rx_ring[wr_id].skb = skb; > + } > return skb; > > partial_error: > @@ -203,11 +258,14 @@ static struct ib_qp *ipoib_cm_create_rx_ > .recv_cq = priv->cq, > .srq = priv->cm.srq, > .cap.max_send_wr = 1, /* For drain WR */ > + .cap.max_recv_wr = ipoib_recvq_size + 1, > .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ > .sq_sig_type = 
IB_SIGNAL_ALL_WR, > .qp_type = IB_QPT_RC, > .qp_context = p, > }; > + if (!priv->cm.srq) > + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; We can just set max_recv_sge here without the check. It is ignored if the QP is associated with an srq. > return ib_create_qp(priv->pd, &attr); > } > > @@ -281,12 +339,130 @@ static int ipoib_cm_send_rep(struct net_ > rep.private_data_len = sizeof data; > rep.flow_control = 0; > rep.rnr_retry_count = req->rnr_retry_count; > - rep.srq = 1; > rep.qp_num = qp->qp_num; > rep.starting_psn = psn; > + rep.srq = !!priv->cm.srq; > return ib_send_cm_rep(cm_id, &rep); > } > > +static void init_context_and_add_list(struct ib_cm_id *cm_id, I'm really not sure what this function is trying to do. There's got to be a better name for this function, or a better way to organize the code. This looks more like just a blob of code, rather than code performing a well defined task. See additional comment below (6-7 comments down) about locking as well. > + struct ipoib_cm_rx *p, Some naming consistency for struct ipoib_cm_rx would be nice. 'rx_ptr' or just 'rx' are fine. In a lot of places, the variable is just called 'p', which is really bad IMO. Some of this already exists, so applying a patch which renames 'p' to something useful, either before or after applying this patch would be nice. > + struct ipoib_dev_priv *priv) > +{ > + cm_id->context = p; > + p->jiffies = jiffies; > + spin_lock_irq(&priv->lock); > + if (list_empty(&priv->cm.passive_ids)) > + queue_delayed_work(ipoib_workqueue, > + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > + if (priv->cm.srq) { > + /* Add this entry to passive ids list head, but do not re-add > + * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush > + * list. > + */ > + if (p->state == IPOIB_CM_RX_LIVE) > + list_move(&p->list, &priv->cm.passive_ids); Should there be a state change here? It just seems cleaner to me if the state indicated which list the rx were located on. 
I don't like that the state only seems to be used consistently in the srq case. I would think it could apply to all cases. > + } > + spin_unlock_irq(&priv->lock); > +} > + > +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, > + struct ipoib_cm_rx *p, unsigned psn) Function name is a little long. Maybe there should be multiple functions here. (Use of 'and' in the function name points to multiple functions that are grouped together. Maybe we should add a function naming rule: if the function name contains 'and', create separate functions...) > +{ > + struct net_device *dev = cm_id->context; > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + int ret; > + u32 qp_num, index; > + u64 i, recv_mem_used; > + > + qp_num = p->qp->qp_num; qp_num is only used in one place in this function, and only for a debug print. > + > + /* In the SRQ case there is a common rx buffer called the srq_ring. > + * However, for the NOSRQ case we create an rx_ring for every > + * struct ipoib_cm_rx. > + */ > + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL); > + if (!p->rx_ring) { > + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", > + qp_num); > + return -ENOMEM; > + } > + > + spin_lock_irq(&priv->lock); > + list_add(&p->list, &priv->cm.passive_ids); > + spin_unlock_irq(&priv->lock); > + > + init_context_and_add_list(cm_id, p, priv); stale_task thread could be executing on 'p' at this point. Is that acceptable? (I'm pretty sure I pointed this out before, but I don't remember what the response was.) We just added 'p' to the passive_ids list here, but init_context_and_add_list() also adds it to the list, but only in the srq case. It would be cleaner to always just add it to the list in init_context_and_add_list() or always do it outside of the list. 
> + spin_lock_irq(&priv->lock); Including the call above, we end up acquiring this lock 3 times in a row, setting 2 variables between the first and second time, and doing nothing between the second and third time. > + > + for (index = 0; index < max_rc_qp; index++) > + if (priv->cm.rx_index_table[index] == NULL) > + break; See previous comment about avoiding a linear search. > + > + recv_mem_used = (u64)ipoib_recvq_size * > + (u64)atomic_inc_return(&current_rc_qp) * CM_PACKET_SIZE; > + if ((index == max_rc_qp) || > + (recv_mem_used >= max_recv_buf * (1ul << 20))) { I would prefer a single check against max_rc_qp. (Fold memory constraints into limiting the value of max_rc_qp.) Otherwise, we can end up allocating a larger array of rx_index_table than is actually usable. > + spin_unlock_irq(&priv->lock); > + ipoib_warn(priv, "NOSRQ has reached the configurable limit " > + "of either %d RC QPs or, max recv buf size of " > + "0x%x MB\n", max_rc_qp, max_recv_buf); > + > + /* We send a REJ to the remote side indicating that we > + * have no more free RC QPs and leave it to the remote side > + * to take appropriate action. This should leave the > + * current set of QPs unaffected and any subsequent REQs > + * will be able to use RC QPs if they are available. > + */ > + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); > + ret = -EINVAL; > + goto err_alloc_and_post; > + } > + > + priv->cm.rx_index_table[index] = p; > + spin_unlock_irq(&priv->lock); > + > + /* We will subsequently use this stored pointer while freeing > + * resources in stale task > + */ > + p->index = index; Is it dangerous to have this not set before releasing the lock? (It doesn't look like it, but wanted to check.) Could anything be iterating over the table expecting p->index to be set? 
> + > + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); > + if (ret) { > + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); > + ipoib_cm_dev_cleanup(dev); > + goto err_alloc_and_post; > + } > + > + for (i = 0; i < ipoib_recvq_size; ++i) { > + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, > + IPOIB_CM_RX_SG - 1, > + p->rx_ring[i].mapping)) { > + ipoib_warn(priv, "failed to allocate receive " > + "buffer %d\n", (int)i); > + ipoib_cm_dev_cleanup(dev); > + ret = -ENOMEM; > + goto err_alloc_and_post; > + } > + > + if (post_receive_nosrq(dev, i << 32 | index)) { > + ipoib_warn(priv, "post_receive_nosrq " > + "failed for buf %lld\n", (unsigned long long)i); > + ipoib_cm_dev_cleanup(dev); > + ret = -EIO; Why not just do: ret = post_receive_nosrq()? if (ret) ... > + goto err_alloc_and_post; > + } > + } > + > + return 0; > + > +err_alloc_and_post: > + atomic_dec(&current_rc_qp); > + kfree(p->rx_ring); > + list_del_init(&p->list); We need a lock here. Is priv->cm.rx_index_table[index] cleaned up anywhere? 
> + return ret; > +} > + > static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) > { > struct net_device *dev = cm_id->context; > @@ -301,9 +477,6 @@ static int ipoib_cm_req_handler(struct i > return -ENOMEM; > p->dev = dev; > p->id = cm_id; > - cm_id->context = p; > - p->state = IPOIB_CM_RX_LIVE; > - p->jiffies = jiffies; > INIT_LIST_HEAD(&p->list); > > p->qp = ipoib_cm_create_rx_qp(dev, p); > @@ -313,19 +486,21 @@ static int ipoib_cm_req_handler(struct i > } > > psn = random32() & 0xffffff; > - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); > - if (ret) > - goto err_modify; > + if (!priv->cm.srq) { > + ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); > + if (ret) > + goto err_modify; > + } else { > + p->rx_ring = NULL; > + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); > + if (ret) > + goto err_modify; > + } > > - spin_lock_irq(&priv->lock); > - queue_delayed_work(ipoib_workqueue, > - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > - /* Add this entry to passive ids list head, but do not re-add it > - * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ > - p->jiffies = jiffies; > - if (p->state == IPOIB_CM_RX_LIVE) > - list_move(&p->list, &priv->cm.passive_ids); > - spin_unlock_irq(&priv->lock); > + if (priv->cm.srq) { > + p->state = IPOIB_CM_RX_LIVE; This if can be merged with the previous if statement above, which performs a similar check. Does it matter that the state is set outside of any locks? 
> + init_context_and_add_list(cm_id, p, priv); > + } > > ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); > if (ret) { > @@ -398,29 +573,60 @@ static void skb_put_frags(struct sk_buff > } > } > > -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > +static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) > +{ > + unsigned long flags; > + > + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { > + spin_lock_irqsave(&priv->lock, flags); > + p->jiffies = jiffies; > + /* Move this entry to list head, but do > + * not re-add it if it has been removed. > + */ > + if (p->state == IPOIB_CM_RX_LIVE) > + list_move(&p->list, &priv->cm.passive_ids); > + spin_unlock_irqrestore(&priv->lock, flags); > + } > +} > + > +static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) > +{ > + unsigned long flags; > + > + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { > + spin_lock_irqsave(&priv->lock, flags); > + p->jiffies = jiffies; > + /* Move this entry to list head, but do > + * not re-add it if it has been removed. */ > + if (!list_empty(&p->list)) > + list_move(&p->list, &priv->cm.passive_ids); > + spin_unlock_irqrestore(&priv->lock, flags); > + } > +} Letting the srq and no srq cases use the same state lets us combine these routines. It seems cleaner to act in the no srq case based on an explicit state of the ipoib_cm_rx, rather than the state of a list item, given that the state tracking is already there. 
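A minimal userspace sketch of a single timer check driven by an explicit state rather than by list membership. The names struct rx_ctx, enum rx_state, ticks_after_eq() and rx_timer_check() are hypothetical; locking and the actual list_move() are left out, so the function only reports whether the caller should move the entry to the list head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical connection states shared by the srq and no-srq cases. */
enum rx_state { RX_LIVE, RX_ERROR, RX_FLUSH };

struct rx_ctx {
	enum rx_state state;
	unsigned long last_seen;	/* tick of the last refresh */
};

/* Wraparound-tolerant "a is at or after b", in the spirit of time_after_eq(). */
static bool ticks_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Refreshes the timestamp once the update interval has elapsed.
 * Returns true when the caller should move rx to the head of the
 * passive list, which is only the case while rx is still live. */
static bool rx_timer_check(struct rx_ctx *rx, unsigned long now,
			   unsigned long update_interval)
{
	if (rx == NULL || !ticks_after_eq(now, rx->last_seen + update_interval))
		return false;

	rx->last_seen = now;
	return rx->state == RX_LIVE;
}
```

Because the decision depends only on the state field, the same routine would serve both cases as long as the no-srq path sets and clears the state the way the srq path does.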
> + > +void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; > + u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV; > struct sk_buff *skb, *newskb; > struct ipoib_cm_rx *p; > unsigned long flags; > u64 mapping[IPOIB_CM_RX_SG]; > - int frags; > + int frags, ret; > > - ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", > - wr_id, wc->status); > + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", > + (unsigned long long)wr_id, wc->status); > > if (unlikely(wr_id >= ipoib_recvq_size)) { Why would this ever occur? > - if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { > + if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) { > spin_lock_irqsave(&priv->lock, flags); > list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); > ipoib_cm_start_rx_drain(priv); > queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); > spin_unlock_irqrestore(&priv->lock, flags); > } else > - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", > - wr_id, ipoib_recvq_size); > + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", > + (unsigned long long)wr_id, ipoib_recvq_size); > return; > } > > @@ -428,23 +634,15 @@ void ipoib_cm_handle_rx_wc(struct net_de > > if (unlikely(wc->status != IB_WC_SUCCESS)) { > ipoib_dbg(priv, "cm recv error " > - "(status=%d, wrid=%d vend_err %x)\n", > - wc->status, wr_id, wc->vendor_err); > + "(status=%d, wrid=%lld vend_err %x)\n", > + wc->status, (unsigned long long)wr_id, wc->vendor_err); > ++priv->stats.rx_dropped; > - goto repost; > + goto repost_srq; > } > > if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { > p = wc->qp->qp_context; > - if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { > - spin_lock_irqsave(&priv->lock, flags); > - p->jiffies = jiffies; > - /* Move this entry to list head, but do not re-add it > - * if it has been moved out of 
list. */ > - if (p->state == IPOIB_CM_RX_LIVE) > - list_move(&p->list, &priv->cm.passive_ids); > - spin_unlock_irqrestore(&priv->lock, flags); > - } > + timer_check_srq(priv, p); This looks like noise at the moment. (See previous comment about timer_check_srq.) > } > > frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, > @@ -456,13 +654,112 @@ void ipoib_cm_handle_rx_wc(struct net_de > * If we can't allocate a new RX buffer, dump > * this packet and reuse the old buffer. > */ > - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); > + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", > + (unsigned long long)wr_id); > ++priv->stats.rx_dropped; > - goto repost; > + goto repost_srq; > } > > - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); > - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); > + ipoib_cm_dma_unmap_rx(priv, frags, > + priv->cm.srq_ring[wr_id].mapping); > + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, > + (frags + 1) * sizeof *mapping); > + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", > + wc->byte_len, wc->slid); > + > + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); > + > + skb->protocol = ((struct ipoib_header *) skb->data)->proto; > + skb_reset_mac_header(skb); > + skb_pull(skb, IPOIB_ENCAP_LEN); > + > + dev->last_rx = jiffies; > + ++priv->stats.rx_packets; > + priv->stats.rx_bytes += skb->len; > + > + skb->dev = dev; > + /* XXX get correct PACKET_ type here */ > + skb->pkt_type = PACKET_HOST; > + netif_receive_skb(skb); > + > +repost_srq: > + ret = post_receive_srq(dev, wr_id); > + > + if (unlikely(ret)) > + ipoib_warn(priv, "post_receive_srq failed for buf %lld\n", > + (unsigned long long)wr_id); > + > +} Some of the changes to this call look like noise. There's a lot of code at the end of this routine (shows as 'new' code in the diff) that's duplicated in handle_rx_wc_nosrq. 
Can we pull out the common code into a function or merge these two routines? One possibility is to store a function pointer with the ipoib_cm_rx that's invoked for posting receive buffers. (Even ib_post_recv and ib_post_srq_recv are similar, if you want to carry this concept further to allow better code sharing.) > + > +static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct sk_buff *skb, *newskb; > + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32; > + u32 index; > + struct ipoib_cm_rx *rx_ptr; > + int frags, ret; > + > + extra white space > + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", > + (unsigned long long)wr_id, wc->status); > + > + if (unlikely(wr_id >= ipoib_recvq_size)) { Why would this ever occur? > + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", > + (unsigned long long)wr_id, ipoib_recvq_size); > + return; > + } > + > + index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK; > + > + /* This is the only place where rx_ptr could be a NULL - could > + * have just received a packet from a connection that has become > + * stale and so is going away. We will simply drop the packet and > + * let the hardware (it s IB_QPT_RC) handle the dropped packet. I don't understand this comment. How can the hardware handle a packet dropped by software? If the completion can be for a connection that has gone away, what's to prevent a new connection from grabbing the same slot in the rx_index_table. If this occurs, then the completion will reference the wrong connection. > + * In the timer_check() function below, p->jiffies is updated and > + * hence the connection will not be stale after that. > + */ > + rx_ptr = priv->cm.rx_index_table[index]; > + if (unlikely(!rx_ptr)) { > + ipoib_warn(priv, "Received packet from a connection " > + "that is going away. 
Hardware will handle it.\n"); > + return; > + } > + > + skb = rx_ptr->rx_ring[wr_id].skb; > + > + if (unlikely(wc->status != IB_WC_SUCCESS)) { > + ipoib_dbg(priv, "cm recv error " > + "(status=%d, wrid=%lld vend_err %x)\n", > + wc->status, (unsigned long long)wr_id, wc->vendor_err); > + ++priv->stats.rx_dropped; > + goto repost_nosrq; > + } > + > + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) > + /* There are no guarantees that wc->qp is not NULL for HCAs > + * that do not support SRQ. */ This comment seems kind of random here... First, it would help to word it without 'no' 'not' 'NULL' 'not', to help with deciphering. Second, I don't see how it relates to the surrounding code. > + timer_check_nosrq(priv, rx_ptr); > + > + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, > + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; > + > + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, > + mapping); > + if (unlikely(!newskb)) { > + /* > + * If we can't allocate a new RX buffer, dump > + * this packet and reuse the old buffer. 
> + */ > + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", > + (unsigned long long)wr_id); > + ++priv->stats.rx_dropped; > + goto repost_nosrq; > + } > + > + ipoib_cm_dma_unmap_rx(priv, frags, rx_ptr->rx_ring[wr_id].mapping); > + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, > + (frags + 1) * sizeof *mapping); > > ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", > wc->byte_len, wc->slid); > @@ -482,10 +779,22 @@ void ipoib_cm_handle_rx_wc(struct net_de > skb->pkt_type = PACKET_HOST; > netif_receive_skb(skb); > > -repost: > - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) > - ipoib_warn(priv, "ipoib_cm_post_receive failed " > - "for buf %d\n", wr_id); > +repost_nosrq: > + ret = post_receive_nosrq(dev, wr_id << 32 | index); > + > + if (unlikely(ret)) > + ipoib_warn(priv, "post_receive_nosrq failed for buf %lld\n", > + (unsigned long long)wr_id); > +} > + > +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + > + if (priv->cm.srq) > + handle_rx_wc_srq(dev, wc); > + else > + handle_rx_wc_nosrq(dev, wc); > } We're taking a branch here to two functions that contain a fair amount of identical code. My personal preference is to have a single handle_rx_wc call with 2-3 if (srq) checks if needed than the current duplication. There are so many if (unlikely...) if (likely... checks in the rx_wc handlers, that I have a hard time believing that an additional 1-2 if (srq) checks will impact performance worse than the code increase. 
> > static inline int post_send(struct ipoib_dev_priv *priv, > @@ -677,6 +986,43 @@ err_cm: > return ret; > } > > +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) > +{ > + int i; > + > + for (i = 0; i < ipoib_recvq_size; ++i) > + if (p->rx_ring[i].skb) { > + ipoib_cm_dma_unmap_rx(priv, > + IPOIB_CM_RX_SG - 1, > + p->rx_ring[i].mapping); > + dev_kfree_skb_any(p->rx_ring[i].skb); > + p->rx_ring[i].skb = NULL; We're freeing rx_ring, so setting skb to NULL seems unnecessary. > + } > + kfree(p->rx_ring); > +} > + > +void dev_stop_nosrq(struct ipoib_dev_priv *priv) > +{ > + struct ipoib_cm_rx *p; > + > + spin_lock_irq(&priv->lock); > + while (!list_empty(&priv->cm.passive_ids)) { > + p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); > + free_resources_nosrq(priv, p); Does this call need to be made under lock? Or can we just remove it from the list, release the lock, then cleanup? > + list_del(&p->list); > + spin_unlock_irq(&priv->lock); > + ib_destroy_cm_id(p->id); > + ib_destroy_qp(p->qp); > + atomic_dec(&current_rc_qp); > + kfree(p); > + spin_lock_irq(&priv->lock); > + } > + spin_unlock_irq(&priv->lock); > + > + cancel_delayed_work(&priv->cm.stale_task); > + kfree(priv->cm.rx_index_table); > +} > + > void ipoib_cm_dev_stop(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -691,6 +1037,11 @@ void ipoib_cm_dev_stop(struct net_device > ib_destroy_cm_id(priv->cm.id); > priv->cm.id = NULL; > > + if (!priv->cm.srq) { > + dev_stop_nosrq(priv); > + return; > + } > + Maybe it would be better to create two dev_stop calls, versus the current code flow where srq is done entirely within dev_stop, but no srq jumps to a different routine. (Is there any way to make this cleanup code a little more similar?) 
> spin_lock_irq(&priv->lock); > while (!list_empty(&priv->cm.passive_ids)) { > p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); > @@ -814,7 +1165,9 @@ static struct ib_qp *ipoib_cm_create_tx_ > attr.recv_cq = priv->cq; > attr.srq = priv->cm.srq; > attr.cap.max_send_wr = ipoib_sendq_size; > + attr.cap.max_recv_wr = 0; > attr.cap.max_send_sge = 1; > + attr.cap.max_recv_sge = 0; > attr.sq_sig_type = IB_SIGNAL_ALL_WR; > attr.qp_type = IB_QPT_RC; > attr.send_cq = cq; > @@ -854,7 +1207,7 @@ static int ipoib_cm_send_req(struct net_ > req.retry_count = 0; /* RFC draft warns against retries */ > req.rnr_retry_count = 0; /* RFC draft warns against retries */ > req.max_cm_retries = 15; > - req.srq = 1; > + req.srq = !!priv->cm.srq; > return ib_send_cm_req(id, &req); > } > > @@ -1198,6 +1551,8 @@ static void ipoib_cm_rx_reap(struct work > list_for_each_entry_safe(p, n, &list, list) { > ib_destroy_cm_id(p->id); > ib_destroy_qp(p->qp); > + if (!priv->cm.srq) > + atomic_dec(&current_rc_qp); I think QP limitations should apply independent of SRQ. See comments at the top of mail about separating the limitations. 
> kfree(p); > } > } > @@ -1216,12 +1571,19 @@ static void ipoib_cm_stale_task(struct w > p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); > if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) > break; > - list_move(&p->list, &priv->cm.rx_error_list); > - p->state = IPOIB_CM_RX_ERROR; > - spin_unlock_irq(&priv->lock); > - ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); > - if (ret) > - ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); > + if (!priv->cm.srq) { > + free_resources_nosrq(priv, p); > + list_del_init(&p->list); > + priv->cm.rx_index_table[p->index] = NULL; > + spin_unlock_irq(&priv->lock); > + } else { > + list_move(&p->list, &priv->cm.rx_error_list); > + p->state = IPOIB_CM_RX_ERROR; > + spin_unlock_irq(&priv->lock); > + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); > + if (ret) > + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); > + } I don't understand the differences between srq and no srq. Both have a list of QPs? Why does one track state, the other just remove itself from a list? Why not just have both transition into the error state? > spin_lock_irq(&priv->lock); > } > > @@ -1275,16 +1637,40 @@ int ipoib_cm_add_mode_attr(struct net_de > return device_create_file(&dev->dev, &dev_attr_mode); > } > > +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) > +{ > + struct ib_srq_init_attr srq_init_attr; > + int ret; > + > + srq_init_attr.attr.max_wr = ipoib_recvq_size; > + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; > + > + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); > + if (IS_ERR(priv->cm.srq)) { > + ret = PTR_ERR(priv->cm.srq); > + priv->cm.srq = NULL; > + return ret; > + } Can a failure here result in trying to use no SRQ mode? 
> + > + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * > + sizeof *priv->cm.srq_ring, > + GFP_KERNEL); > + if (!priv->cm.srq_ring) { > + printk(KERN_WARNING "%s: failed to allocate CM ring " > + "(%d entries)\n", > + priv->ca->name, ipoib_recvq_size); > + ipoib_cm_dev_cleanup(dev); I think we should limit the cleanup of this function to only what it creates. Only destroy the srq here if we can't allocate srq_ring. > + return -ENOMEM; > + } > + > + return 0; > +} > + > int ipoib_cm_dev_init(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > - struct ib_srq_init_attr srq_init_attr = { > - .attr = { > - .max_wr = ipoib_recvq_size, > - .max_sge = IPOIB_CM_RX_SG > - } > - }; > int ret, i; > + struct ib_device_attr attr; > > INIT_LIST_HEAD(&priv->cm.passive_ids); > INIT_LIST_HEAD(&priv->cm.reap_list); > @@ -1301,20 +1687,32 @@ int ipoib_cm_dev_init(struct net_device > > skb_queue_head_init(&priv->cm.skb_queue); > > - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); > - if (IS_ERR(priv->cm.srq)) { > - ret = PTR_ERR(priv->cm.srq); > - priv->cm.srq = NULL; > + ret = ib_query_device(priv->ca, &attr); > + if (ret) > return ret; > - } > > - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, > - GFP_KERNEL); > - if (!priv->cm.srq_ring) { > - printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", > - priv->ca->name, ipoib_recvq_size); > - ipoib_cm_dev_cleanup(dev); > - return -ENOMEM; > + if (attr.max_srq) { > + /* This device supports SRQ */ > + ret = create_srq(dev, priv); > + if (ret) > + return ret; > + priv->cm.rx_index_table = NULL; > + } else { > + priv->cm.srq = NULL; > + priv->cm.srq_ring = NULL; > + > + /* Every new REQ that arrives creates a struct ipoib_cm_rx. > + * These structures form a link list starting with the > + * passive_ids. 
For quick and easy access we maintain a table > + * of pointers to struct ipoib_cm_rx called the rx_index_table > + */ > + priv->cm.rx_index_table = kcalloc(max_rc_qp, > + sizeof *priv->cm.rx_index_table, > + GFP_KERNEL); > + if (!priv->cm.rx_index_table) { > + printk(KERN_WARNING "Failed to allocate rx_index_table\n"); > + return -ENOMEM; > + } > } > > for (i = 0; i < IPOIB_CM_RX_SG; ++i) > @@ -1327,17 +1725,24 @@ int ipoib_cm_dev_init(struct net_device > priv->cm.rx_wr.sg_list = priv->cm.rx_sge; > priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; > > - for (i = 0; i < ipoib_recvq_size; ++i) { > - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, > + /* One can post receive buffers even before the RX QP is created > + * only in the SRQ case. Therefore for NOSRQ we skip the rest of init > + * and do that in ipoib_cm_req_handler() > + */ > + > + if (priv->cm.srq) { > + for (i = 0; i < ipoib_recvq_size; ++i) { > + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, > priv->cm.srq_ring[i].mapping)) { > - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); > - ipoib_cm_dev_cleanup(dev); > - return -ENOMEM; > - } > - if (ipoib_cm_post_receive(dev, i)) { > - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); > - ipoib_cm_dev_cleanup(dev); > - return -EIO; > + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); > + ipoib_cm_dev_cleanup(dev); > + return -ENOMEM; > + } > + if (post_receive_srq(dev, i)) { > + ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i); > + ipoib_cm_dev_cleanup(dev); > + return -EIO; > + } Why not just wait until the req_handler to post receives in both cases? There's no need to consume the resources until a connection has been made. 
> } > } > > --- a/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-08-20 17:39:25.000000000 -0400 > +++ b/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-08-14 19:53:16.000000000 -0400 > @@ -300,7 +300,7 @@ int ipoib_poll(struct net_device *dev, i > for (i = 0; i < n; ++i) { > struct ib_wc *wc = priv->ibwc + i; > > - if (wc->wr_id & IPOIB_CM_OP_SRQ) { > + if (wc->wr_id & IPOIB_CM_OP_RECV) { > ++done; > --max; > ipoib_cm_handle_rx_wc(dev, wc); > @@ -558,7 +558,7 @@ void ipoib_drain_cq(struct net_device *d > do { > n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); > for (i = 0; i < n; ++i) { > - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) > + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV) > ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); > else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) > ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); > --- a/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-08-20 17:39:25.000000000 -0400 > +++ b/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-08-14 19:53:16.000000000 -0400 > @@ -175,6 +175,18 @@ int ipoib_transport_dev_init(struct net_ > if (!ret) > size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; > > +#ifdef CONFIG_INFINIBAND_IPOIB_CM > + > + /* We increase the size of the CQ in the NOSRQ case to prevent CQ > + * overflow. Every new REQ creates a new RX QP and each QP has an > + * RX ring associated with it. Therefore we could have > + * max_rc_qp*ipoib_recvq_size + ipoib_sendq_size CQEs > + * in a CQ. 
> + */ > + if (!priv->cm.srq) > + size += (max_rc_qp - 1) * ipoib_recvq_size; > +#endif > + > priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); > if (IS_ERR(priv->cq)) { > printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); > --- a/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-20 17:39:25.000000000 -0400 > +++ b/linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-20 17:42:13.000000000 -0400 > @@ -1227,6 +1227,7 @@ static int __init ipoib_init_module(void > ipoib_sendq_size = roundup_pow_of_two(ipoib_sendq_size); > ipoib_sendq_size = min(ipoib_sendq_size, IPOIB_MAX_QUEUE_SIZE); > ipoib_sendq_size = max(ipoib_sendq_size, IPOIB_MIN_QUEUE_SIZE); > + max_rc_qp = roundup_pow_of_two(max_rc_qp); > > ret = ipoib_register_debugfs(); > if (ret) - Sean From sashak at voltaire.com Mon Sep 17 12:25:01 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 17 Sep 2007 21:25:01 +0200 Subject: [ofa-general] Re: [PATCH] osm: mkey lease period description in options file In-Reply-To: <46EED01E.2060104@dev.mellanox.co.il> References: <46EED01E.2060104@dev.mellanox.co.il> Message-ID: <20070917192501.GU6891@sashak.voltaire.com> On 21:06 Mon 17 Sep , Yevgeny Kliteynik wrote: > M_Key lease period description should > be in [sec] instead of [msec]. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From sashak at voltaire.com Mon Sep 17 12:29:27 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 17 Sep 2007 21:29:27 +0200 Subject: [ofa-general] Re: [PATCH] osm: TrapRepress was failing for mkey != 0 In-Reply-To: <46EED103.9010808@dev.mellanox.co.il> References: <46EED103.9010808@dev.mellanox.co.il> Message-ID: <20070917192927.GV6891@sashak.voltaire.com> On 21:09 Mon 17 Sep , Yevgeny Kliteynik wrote: > TrapRepress always had mkey 0, which was copied from trap > notice's mkey (which is always 0). 
> As a result, TrapRepress was failing for port with mkey != 0 > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From kliteyn at dev.mellanox.co.il Mon Sep 17 12:22:26 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 17 Sep 2007 21:22:26 +0200 Subject: [ofa-general] [PATCH 1/2] osm: QoS - replace guid ranges and partition list by port map Message-ID: <46EED3F2.1020308@dev.mellanox.co.il> QoS policy optimization: replacing partition list and guid ranges in a port group by port map indexed by port guid. The port map is filled at parse time, thus checking whether some port belongs to a group becomes a single map query. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_qos_policy.h | 11 ++- opensm/opensm/osm_qos_parser.y | 115 ++++++++++++++++++++++++++----- opensm/opensm/osm_qos_policy.c | 59 ++++++++--------- 3 files changed, 131 insertions(+), 54 deletions(-) diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h index 0c220ee..680bf71 100644 --- a/opensm/include/opensm/osm_qos_policy.h +++ b/opensm/include/opensm/osm_qos_policy.h @@ -50,6 +50,7 @@ #include #include #include +#include #include #include @@ -59,17 +60,20 @@ /***************************************************/ +typedef struct _osm_qos_port_t { + cl_map_item_t map_item; + osm_physp_t * p_physp; +} osm_qos_port_t; + typedef struct _osm_qos_port_group_t { char *name; /* single string (this port group name) */ char *use; /* single string (description) */ cl_list_t port_name_list; /* list of port names (.../.../...) 
*/ - uint64_t **guid_range_arr; /* array of guid ranges (pair of 64-bit guids) */ - unsigned guid_range_len; /* num of guid ranges in the array */ - cl_list_t partition_list; /* list of partition names */ boolean_t node_type_ca; boolean_t node_type_switch; boolean_t node_type_router; boolean_t node_type_self; + cl_qmap_t port_map; } osm_qos_port_group_t; /***************************************************/ @@ -147,6 +151,7 @@ typedef struct _osm_qos_policy_t { /***************************************************/ +osm_qos_port_t *osm_qos_policy_port_create(osm_physp_t * p_physp); osm_qos_port_group_t * osm_qos_policy_port_group_create(); void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p_port_group); diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index a477084..ca77536 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -103,6 +103,19 @@ static void __merge_rangearr( uint64_t ** * p_arr, unsigned * p_arr_len ); +static void __parser_add_port_to_port_map( + cl_qmap_t * p_map, + osm_physp_t * p_physp); + +static void __parser_add_range_to_port_map( + cl_qmap_t * p_map, + uint64_t ** range_arr, + unsigned range_len); + +static void __parser_add_map_to_port_map( + cl_qmap_t * p_dmap, + cl_map_t * p_smap); + extern char * __qos_parser_text; extern void __qos_parser_error (char *s); extern int __qos_parser_lex (void); @@ -612,24 +625,9 @@ port_group_port_guid: port_group_port_guid_start list_of_ranges { &range_arr, &range_len ); - if ( !p_current_port_group->guid_range_len ) - { - p_current_port_group->guid_range_arr = range_arr; - p_current_port_group->guid_range_len = range_len; - } - else - { - uint64_t ** new_range_arr; - unsigned new_range_len; - __merge_rangearr( p_current_port_group->guid_range_arr, - p_current_port_group->guid_range_len, - range_arr, - range_len, - &new_range_arr, - &new_range_len ); - p_current_port_group->guid_range_arr = new_range_arr; - 
p_current_port_group->guid_range_len = new_range_len; - } + __parser_add_range_to_port_map(&p_current_port_group->port_map, + range_arr, + range_len); } } ; @@ -643,13 +641,26 @@ port_group_partition: port_group_partition_start string_list { /* 'partition' in 'port-group' - any num of instances */ cl_list_iterator_t list_iterator; char * tmp_str; + osm_prtn_t * p_prtn; + /* extract all the ports from the partition + to the port map of this port group */ list_iterator = cl_list_head(&tmp_parser_struct.str_list); while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) { tmp_str = (char*)cl_list_obj(list_iterator); if (tmp_str) - cl_list_insert_tail(&p_current_port_group->partition_list,tmp_str); + { + p_prtn = osm_prtn_find_by_name(p_qos_policy->p_subn, tmp_str); + if (p_prtn) + { + __parser_add_map_to_port_map(&p_current_port_group->port_map, + &p_prtn->part_guid_tbl); + __parser_add_map_to_port_map(&p_current_port_group->port_map, + &p_prtn->full_guid_tbl); + } + free(tmp_str); + } list_iterator = cl_list_next(list_iterator); } cl_list_remove_all(&tmp_parser_struct.str_list); @@ -2185,3 +2196,69 @@ static void __merge_rangearr( /*************************************************** ***************************************************/ + +static void __parser_add_port_to_port_map( + cl_qmap_t * p_map, + osm_physp_t * p_physp) +{ + if (p_physp && osm_physp_is_valid(p_physp) && + cl_qmap_get(p_map, cl_ntoh64( + osm_physp_get_port_guid(p_physp))) == cl_qmap_end(p_map)) + { + osm_qos_port_t * p_port = osm_qos_policy_port_create(p_physp); + cl_qmap_insert(p_map, + cl_ntoh64(osm_physp_get_port_guid(p_physp)), + &p_port->map_item); + } +} + +/*************************************************** + ***************************************************/ + +static void __parser_add_range_to_port_map( + cl_qmap_t * p_map, + uint64_t ** range_arr, + unsigned range_len) +{ + unsigned i; + uint64_t guid_ho; + osm_port_t * p_osm_port; + + if (!range_arr || !range_len) + 
return; + + for (i = 0; i < range_len; i++) { + for (guid_ho = range_arr[i][0]; guid_ho <= range_arr[i][1]; guid_ho++) { + p_osm_port = + osm_get_port_by_guid(p_qos_policy->p_subn, cl_hton64(guid_ho)); + if (p_osm_port) + __parser_add_port_to_port_map(p_map, p_osm_port->p_physp); + } + free(range_arr[i]); + } + free(range_arr); +} + +/*************************************************** + ***************************************************/ + +static void __parser_add_map_to_port_map( + cl_qmap_t * p_dmap, + cl_map_t * p_smap) +{ + cl_map_iterator_t map_iterator; + osm_physp_t * p_physp; + + if (!p_dmap || !p_smap) + return; + + map_iterator = cl_map_head(p_smap); + while (map_iterator != cl_map_end(p_smap)) { + p_physp = (osm_physp_t*)cl_map_obj(map_iterator); + __parser_add_port_to_port_map(p_dmap, p_physp); + map_iterator = cl_map_next(map_iterator); + } +} + +/*************************************************** + ***************************************************/ diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index b2d1622..d1b227f 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -101,6 +101,27 @@ static void __free_single_element(void *p_element, void *context) free(p_element); } +static void __free_port_map_element(cl_map_item_t *p_element, void *context) +{ + if (p_element) + free(p_element); +} + +/*************************************************** + ***************************************************/ + +osm_qos_port_t *osm_qos_policy_port_create(osm_physp_t *p_physp) +{ + osm_qos_port_t *p = + (osm_qos_port_t *) malloc(sizeof(osm_qos_port_t)); + if (!p) + return NULL; + memset(p, 0, sizeof(osm_qos_port_t)); + + p->p_physp = p_physp; + return p; +} + /*************************************************** ***************************************************/ @@ -114,7 +135,7 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() memset(p, 0, sizeof(osm_qos_port_group_t)); 
cl_list_init(&p->port_name_list, 10); - cl_list_init(&p->partition_list, 10); + cl_qmap_init(&p->port_map); return p; } @@ -124,8 +145,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) { - unsigned i; - if (!p) return; @@ -134,18 +153,12 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) if (p->use) free(p->use); - for (i = 0; i < p->guid_range_len; i++) - free(p->guid_range_arr[i]); - if (p->guid_range_arr) - free(p->guid_range_arr); - cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); cl_list_remove_all(&p->port_name_list); cl_list_destroy(&p->port_name_list); - cl_list_apply_func(&p->partition_list, __free_single_element, NULL); - cl_list_remove_all(&p->partition_list); - cl_list_destroy(&p->partition_list); + cl_qmap_apply_func(&p->port_map, __free_port_map_element, NULL); + cl_qmap_remove_all(&p->port_map); free(p); } @@ -491,12 +504,9 @@ __qos_policy_is_port_in_group(osm_subn_t * p_subn, osm_qos_port_group_t * p_port_group) { osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); - osm_prtn_t *p_prtn = NULL; ib_net64_t port_guid = osm_physp_get_port_guid(p_physp); uint64_t port_guid_ho = cl_ntoh64(port_guid); uint8_t node_type = osm_node_get_type(p_node); - cl_list_iterator_t list_iterator; - char *partition_name; /* check whether this port's type matches any of group's types */ @@ -506,27 +516,12 @@ __qos_policy_is_port_in_group(osm_subn_t * p_subn, && p_port_group->node_type_router)) return TRUE; - /* check whether this port's guid is in range of this group's guids */ + /* check whether this port's guid is in group's port map */ - if (__is_num_in_range_arr(p_port_group->guid_range_arr, - p_port_group->guid_range_len, port_guid_ho)) + if (cl_qmap_get(&p_port_group->port_map, port_guid_ho) != + cl_qmap_end(&p_port_group->port_map)) return TRUE; - /* check whether this port is member of this group's partitions */ - - list_iterator = 
cl_list_head(&p_port_group->partition_list); - while (list_iterator != cl_list_end(&p_port_group->partition_list)) { - partition_name = (char *)cl_list_obj(list_iterator); - if (partition_name && strlen(partition_name)) { - p_prtn = osm_prtn_find_by_name(p_subn, partition_name); - if (p_prtn) { - if (osm_prtn_is_guid(p_prtn, port_guid)) - return TRUE; - } - } - list_iterator = cl_list_next(list_iterator); - } - /* check whether this port's name matches any of group's names */ /* -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Mon Sep 17 12:23:27 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 17 Sep 2007 21:23:27 +0200 Subject: [ofa-general] [PATCH 2/2] osm: QoS - reworked node types in port group Message-ID: <46EED42F.8050504@dev.mellanox.co.il> QoS policy optimization: replaced node types with single bitmask, and node_type_self is implemented as an additional guid in group's port map. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_qos_policy.h | 9 +++++---- opensm/opensm/osm_qos_parser.y | 24 +++++++++++++++++------- opensm/opensm/osm_qos_policy.c | 7 ++----- 3 files changed, 24 insertions(+), 16 deletions(-) diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h index 680bf71..30c2e6d 100644 --- a/opensm/include/opensm/osm_qos_policy.h +++ b/opensm/include/opensm/osm_qos_policy.h @@ -58,6 +58,10 @@ #define OSM_QOS_POLICY_MAX_PORTS_ON_SWITCH 128 #define OSM_QOS_POLICY_DEFAULT_LEVEL_NAME "default" +#define OSM_QOS_POLICY_NODE_TYPE_CA (((uint8_t)1)<node_type_ca = TRUE;; + p_current_port_group->node_types |= + OSM_QOS_POLICY_NODE_TYPE_CA; } ; node_type_switch: TK_NODE_TYPE_SWITCH { - p_current_port_group->node_type_switch = TRUE; + p_current_port_group->node_types |= + OSM_QOS_POLICY_NODE_TYPE_SWITCH; } ; node_type_router: TK_NODE_TYPE_ROUTER { - p_current_port_group->node_type_router = TRUE; + p_current_port_group->node_types |= + OSM_QOS_POLICY_NODE_TYPE_ROUTER; } ; node_type_all: 
TK_NODE_TYPE_ALL { - p_current_port_group->node_type_ca = TRUE; - p_current_port_group->node_type_switch = TRUE; - p_current_port_group->node_type_router = TRUE; + p_current_port_group->node_types |= + (OSM_QOS_POLICY_NODE_TYPE_CA | + OSM_QOS_POLICY_NODE_TYPE_SWITCH | + OSM_QOS_POLICY_NODE_TYPE_ROUTER); } ; node_type_self: TK_NODE_TYPE_SELF { - p_current_port_group->node_type_self = TRUE; + osm_port_t * p_osm_port = + osm_get_port_by_guid(p_qos_policy->p_subn, + p_qos_policy->p_subn->sm_port_guid); + if (p_osm_port) + __parser_add_port_to_port_map( + &p_current_port_group->port_map, + p_osm_port->p_physp); } ; diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index d1b227f..c84fb8b 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -506,14 +506,11 @@ __qos_policy_is_port_in_group(osm_subn_t * p_subn, osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); ib_net64_t port_guid = osm_physp_get_port_guid(p_physp); uint64_t port_guid_ho = cl_ntoh64(port_guid); - uint8_t node_type = osm_node_get_type(p_node); /* check whether this port's type matches any of group's types */ - if ((node_type == IB_NODE_TYPE_CA && p_port_group->node_type_ca) || - (node_type == IB_NODE_TYPE_SWITCH && p_port_group->node_type_switch) - || (node_type == IB_NODE_TYPE_ROUTER - && p_port_group->node_type_router)) + if ( p_port_group->node_types & + (((uint8_t)1)<<000001c7f94d$b73c9f70$9c98070a@amr.corp.intel.com><01d201c7f953$13a8cd60$04c8c8c8@olympus><46EEC10B.1060704@ichips.intel.com><20070917183045.GY4472@obsidianresearch.com> <000501c7f95a$1f4ea890$9c98070a@amr.corp.intel.com> Message-ID: <023901c7f961$090a91a0$04c8c8c8@olympus> Is it valid to have a CM request message with subnet local = 1 and hop limit > 1? If so, then it's not clear in the spec on how each side of the QP creation are to determine if they use GRHs or not. Can Mellanox be configured to accept both (LRH,LRH_GRH) on a RC QP at the same time? 
- Jim ----- Original Message ----- From: "Sean Hefty" To: "'Jason Gunthorpe'" ; "Sean Hefty" Cc: Sent: Monday, September 17, 2007 1:39 PM Subject: RE: [ofa-general] Re: [PATCH] core/cm: improverequestmessage interpretation of subnet local fields > >I'm with Hal on this - why does this cause a problem? There is no IB >>packet verification check that tests if a GRH is present, only if it >>is present it must be valid - so how can an extra correctly filled in >>GRH cause anything but degraded performance? > > ib_init_ah_from_path() uses the hop_limit in the path record to determine > if a > GRH should be used. It sets the address handle attributes (used to > configure > the QP) based on hop_limit > 1. If hop_limit is set incorrectly in the CM > REQ, > the path record formed by the CM based on data carried in the REQ could > have > invalid GRH values. > > It's possible that this is an active side CM issue, but that's not clear > to me. > > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From hrosenstock at xsigo.com Mon Sep 17 12:47:05 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 12:47:05 -0700 Subject: [ofa-general] libibumad and valgrind Message-ID: <1190058425.12099.46.camel@hrosenstock-ws.xsigo.com> Hi Sasha, Should the default for valgrind be off (in libibumad)? 
It seems to be the other way around: src/umad.c:59:6: warning: #warning "Valgrind support requested, but VALGRIND_MAKE_MEM_DEFINED not available" -- Hal From jgunthorpe at obsidianresearch.com Mon Sep 17 12:47:37 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 17 Sep 2007 13:47:37 -0600 Subject: [ofa-general] Re: [PATCH] core/cm: improverequestmessage interpretation of subnet local fields In-Reply-To: <023901c7f961$090a91a0$04c8c8c8@olympus> References: <000501c7f95a$1f4ea890$9c98070a@amr.corp.intel.com> <023901c7f961$090a91a0$04c8c8c8@olympus> Message-ID: <20070917194737.GC4472@obsidianresearch.com> On Mon, Sep 17, 2007 at 02:29:15PM -0500, Jim Hall wrote: > Is it valid to have a CM request message with subnet local = 1 and hop > limit > 1? Oops, I hadn't considered the subnet local field. That should probably unconditionally control the GRH, like your patch. Hmm, you know, there is compliance statement C9-43.1.2 that is not reflected in the flow diagram of Figure 81, so the GRH presence/absence is explicitly matched. My bad. > C9-43.1.2: For RC, RD and UC services, if a received packet is > consistent with the configuration of the QP (or EEC) with respect to > the presence or absence of a GRH, then the packet shall be > considered to have passed the GRH check, subject to the remaining > GRH checks described in the rest of Section 9.6.1.2 GRH Checks on > page 274. One thing though: if subnet local = 0 and the incoming hop limit <= 1, a GRH will still not be used. Your patch should probably also force the hop limit to 2 in this case and include a note for later fixup. 
Jason From sashak at voltaire.com Mon Sep 17 13:20:38 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 17 Sep 2007 22:20:38 +0200 Subject: [ofa-general] Re: libibumad and valgrind In-Reply-To: <1190058425.12099.46.camel@hrosenstock-ws.xsigo.com> References: <1190058425.12099.46.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070917202038.GX6891@sashak.voltaire.com> Hi Hal, On 12:47 Mon 17 Sep , Hal Rosenstock wrote: > Hi Sasha, > > Should the default for valgrind be off (in libibumad) ? It seems to be > the other way around: > > src/umad.c:59:6: warning: #warning "Valgrind support requested, but > VALGRIND_MAKE_MEM_DEFINED not available" Hmm, it is "off" on my machine, but valgrind is not installed. I'll check how it will with valgrind. Sasha From sashak at voltaire.com Mon Sep 17 13:32:33 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 17 Sep 2007 22:32:33 +0200 Subject: [ofa-general] Re: libibumad and valgrind In-Reply-To: <20070917202038.GX6891@sashak.voltaire.com> References: <1190058425.12099.46.camel@hrosenstock-ws.xsigo.com> <20070917202038.GX6891@sashak.voltaire.com> Message-ID: <20070917203233.GY6891@sashak.voltaire.com> On 22:20 Mon 17 Sep , Sasha Khapyorsky wrote: > Hi Hal, > > On 12:47 Mon 17 Sep , Hal Rosenstock wrote: > > Hi Sasha, > > > > Should the default for valgrind be off (in libibumad) ? It seems to be > > the other way around: > > > > src/umad.c:59:6: warning: #warning "Valgrind support requested, but > > VALGRIND_MAKE_MEM_DEFINED not available" > > Hmm, it is "off" on my machine, but valgrind is not installed. I'll > check how it will with valgrind. 
Actually, valgrind support is off by default, but there is also a check for the valgrind/memcheck.h header file, and it is included by umad.h (even with valgrind "off"). It seems your version of valgrind does not define the VALGRIND_MAKE_MEM_DEFINED() macro for the case when valgrind is disabled. The warning message is not very clear. Sasha From sean.hefty at intel.com Mon Sep 17 14:01:22 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 17 Sep 2007 14:01:22 -0700 Subject: [ofa-general] Re: [PATCH] core/cm: improverequestmessage interpretation of subnet local fields In-Reply-To: <20070917194737.GC4472@obsidianresearch.com> References: <000501c7f95a$1f4ea890$9c98070a@amr.corp.intel.com> <023901c7f961$090a91a0$04c8c8c8@olympus> <20070917194737.GC4472@obsidianresearch.com> Message-ID: <000701c7f96d$e6ef1c50$9c98070a@amr.corp.intel.com> >One thing though, if the subnet local = 0 and the incoming hop limit ><= 1 a GRH will still not be used. Your patch should probably also >force the hop limit to 2 in this case and include a note for later >fixup. I think this case would be better handled by rejecting the REQ. - Sean From kliteyn at mellanox.co.il Mon Sep 17 14:07:22 2007 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 17 Sep 2007 23:07:22 +0200 Subject: [Fwd: [ofa-general] nightly osm_sim report 2007-09-15:normal completion] In-Reply-To: <1190032458.6272.67.camel@hrosenstock-ws.xsigo.com> References: <1190032458.6272.67.camel@hrosenstock-ws.xsigo.com> Message-ID: <46EEEC8A.1020502@mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > Is the failure below a simulator or OpenSM issue ? Thanks. > Not sure yet. I'll try to recreate the failure and update you with more info. 
-- Yevgeny > -- Hal > > -------- Forwarded Message -------- > From: kliteyn at mellanox.co.il > To: sashak at voltaire.com > Cc: general at lists.openfabrics.org > Subject: [ofa-general] nightly osm_sim report 2007-09-15:normal > completion > Date: 15 Sep 2007 07:32:15 +0300 > OSM Simulation Regression Summary > > [Generated mail - please do NOT reply] > > > OpenSM binary date = 2007-09-14 > OpenSM git rev = Sun_Sep_9_15:57:42_2007 [27f7ec84dbb1060397fa930569bc88d8f6e1d373] > ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] > > > Total=520 Pass=519 Fail=1 > > > Pass: > 39 Pkey IS1-16.topo > 39 OsmTest IS1-16.topo > 39 OsmStress IS1-16.topo > 39 Multicast IS1-16.topo > 39 LidMgr IS1-16.topo > 38 Stability IS1-16.topo > 13 Stability IS3-loop.topo > 13 Stability IS3-128.topo > 13 Pkey IS3-128.topo > 13 OsmTest IS3-loop.topo > 13 OsmTest IS3-128.topo > 13 OsmStress IS3-128.topo > 13 Multicast IS3-loop.topo > 13 Multicast IS3-128.topo > 13 LidMgr IS3-128.topo > 13 FatTree merge-roots-4-ary-2-tree.topo > 13 FatTree merge-root-4-ary-3-tree.topo > 13 FatTree gnu-stallion-64.topo > 13 FatTree blend-4-ary-2-tree.topo > 13 FatTree RhinoDDR.topo > 13 FatTree FullGnu.topo > 13 FatTree 4-ary-2-tree.topo > 13 FatTree 2-ary-4-tree.topo > 13 FatTree 12-node-spaced.topo > 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo > 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo > 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo > 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo > > Failures: > 1 Stability IS1-16.topo > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To 
unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From mshefty at ichips.intel.com Mon Sep 17 14:11:09 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Sep 2007 14:11:09 -0700 Subject: [ofa-general] CM REQ: subnet local versus hop limit meaning In-Reply-To: <023901c7f961$090a91a0$04c8c8c8@olympus> References: <011d01c7f938$56e03ed0$04c8c8c8@olympus><000001c7f94d$b73c9f70$9c98070a@amr.corp.intel.com><01d201c7f953$13a8cd60$04c8c8c8@olympus><46EEC10B.1060704@ichips.intel.com><20070917183045.GY4472@obsidianresearch.com> <000501c7f95a$1f4ea890$9c98070a@amr.corp.intel.com> <023901c7f961$090a91a0$04c8c8c8@olympus> Message-ID: <46EEED6D.5040803@ichips.intel.com> Jim Hall wrote: > Is it valid to have a CM request message with subnet local = 1 and hop > limit > 1? I don't see this explicitly called out as invalid, even if it doesn't make any sense. As a related note, there are reject codes for invalid hop limit and TC. I guess we have a choice between ignoring hop limit, TC, and FL if subnet local = 1; versus validating those values and rejecting the connection if they are unsupported. Right now, I'm leaning more towards rejecting the connection if subnet local = 1 and hop limit > 1. Does anyone else have any thoughts? - Sean From rdreier at cisco.com Mon Sep 17 14:40:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 14:40:08 -0700 Subject: [ofa-general] [PATCH] [WORKAROUND] CONFIG_PREEMPT_RT and ib_umad_close() issue In-Reply-To: <46EEB715.7060509@ccur.com> (John Blackwood's message of "Mon, 17 Sep 2007 13:19:17 -0400") References: <46EEB715.7060509@ccur.com> Message-ID: Thanks for the explanation... > But basically, with CONFIG_PREEMPT_RT enabled, the lock points, such as > acquiring a spinlock, potentially become places where the current task > may be context switched out / preempted. 
> > Therefore, when a call is made to lock a spinlock for example, the > caller should not currently have irqs disabled, or preemption disabled, > since a context switch may occur. this doesn't seem relevant here... > void fastcall rt_downgrade_write(struct rw_semaphore *rwsem) > { > BUG(); > } this seems to be the problem... the -rt patch turns downgrade_write() into a BUG(). I need to look at the locking in user_mad.c again, but I think it may be possible to replace both places that do downgrade_write() with up_write() followed by down_read(). - R. From sashak at voltaire.com Mon Sep 17 14:51:47 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 17 Sep 2007 23:51:47 +0200 Subject: [ofa-general] Re: [PATCHv2] ibnetdiscover: Support Xsigo chassis grouping In-Reply-To: <1189730194.6062.1.camel@hrosenstock-ws.xsigo.com> References: <1189730194.6062.1.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070917215147.GZ6891@sashak.voltaire.com> Hi Hal, On 17:36 Thu 13 Sep , Hal Rosenstock wrote: > ibnetdiscover: Support Xsigo chassis grouping > > I think this also fixes a bug with grouping of multiple non Voltaire > chassis as well. Could you provide more details about this bug? Should this be a separate patch? > Note: this patch is against OFED 1.2 Hal, you know, patches for master should be made against master (I spent some time on this). Some comments are below. > > Signed-off-by: Hal Rosenstock > > diff --git a/diags/include/grouping.h b/diags/include/grouping.h > index 4666935..3ba872c 100644 > --- a/diags/include/grouping.h > +++ b/diags/include/grouping.h > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -104,4 +105,8 @@ char *get_chassis_type(unsigned char chassistype); > char *get_chassis_slot(unsigned char chassisslot); > uint64_t get_chassis_guid(unsigned char chassisnum); > > +int is_xsigo_guid(uint64_t guid); > +int is_xsigo_tca(uint64_t guid); > +int is_xsigo_hca(uint64_t guid); > + > #endif /* _GROUPING_H_ */ > diff --git a/diags/include/ibnetdiscover.h b/diags/include/ibnetdiscover.h > index d13a666..bfbe7f5 100644 > --- a/diags/include/ibnetdiscover.h > +++ b/diags/include/ibnetdiscover.h > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -44,6 +45,7 @@ > #define VTR_VENDOR_ID 0x8f1 /* Voltaire */ > #define TS_VENDOR_ID 0x5ad /* Cisco */ > #define SS_VENDOR_ID 0x66a /* InfiniCon */ > +#define XS_VENDOR_ID 0x1397 /* Xsigo */ > > > typedef struct Port Port; > diff --git a/diags/src/grouping.c b/diags/src/grouping.c > index 0e5bd78..6602f26 100644 > --- a/diags/src/grouping.c > +++ b/diags/src/grouping.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -96,20 +97,91 @@ static uint64_t topspin_chassisguid(uint64_t guid) > return guid & 0xffffffff00ffffffULL; > } > > -static uint64_t get_chassisguid(uint64_t guid, uint32_t vendid) > +int is_xsigo_guid(uint64_t guid) > { > - if (vendid == TS_VENDOR_ID || vendid == SS_VENDOR_ID) > - return topspin_chassisguid(guid); > + if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) > + return 1; > else > - return guid; > + return 0; > +} > + > +static int is_xsigo_leafone(uint64_t guid) > +{ > + if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) > + return 1; > + else > + return 0; > +} > + > +int is_xsigo_hca(uint64_t guid) > +{ > + /* NodeType 2 is HCA */ > + if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) > + return 1; > + else > + return 0; > +} > + > +int is_xsigo_tca(uint64_t guid) > +{ > + /* NodeType 3 is TCA */ > + if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL) > + return 1; > + else > + return 0; > +} > + > +static int is_xsigo_ca(uint64_t guid) > +{ > + if (is_xsigo_hca(guid) || is_xsigo_tca(guid)) > + return 1; > + else > + return 0; > +} > + > +static int is_xsigo_switch(uint64_t guid) > +{ > + if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) > + return 1; > + else > + return 0; > +} > + > +static uint64_t xsigo_chassisguid(Node *node) > +{ > + if (!is_xsigo_ca(node->sysimgguid)) { > + /* Byte 3 is NodeType and byte 4 is PortType */ > + /* If NodeType is 1 (switch), PortType is masked */ > + if (is_xsigo_switch(node->sysimgguid)) > + return node->sysimgguid & 0xffffffff00ffffffULL; > + else > + return node->sysimgguid; > + } else { > + /* If peer port is Leaf 1, use its chassis GUID */ > + if (is_xsigo_leafone(node->ports->remoteport->node->sysimgguid)) > + return node->ports->remoteport->node->sysimgguid & > + 0xffffffff00ffffffULL; > + else > + return node->sysimgguid; > + } > } > > -static struct ChassisList *find_chassisguid(uint64_t guid, 
uint32_t vendid) > +static uint64_t get_chassisguid(Node *node) > +{ > + if (node->vendid == TS_VENDOR_ID || node->vendid == SS_VENDOR_ID) > + return topspin_chassisguid(node->sysimgguid); > + else if (node->vendid == XS_VENDOR_ID || is_xsigo_guid(node->sysimgguid)) > + return xsigo_chassisguid(node); > + else > + return node->sysimgguid; > +} > + > +static struct ChassisList *find_chassisguid(Node *node) > { > ChassisList *current; > uint64_t chguid; > > - chguid = get_chassisguid(guid, vendid); > + chguid = get_chassisguid(node); > for (current = mylist.first; current; current = current->next) { > if (current->chassisguid == chguid) > return current; > @@ -668,14 +740,13 @@ ChassisList *group_nodes() > if (node->vendid == VTR_VENDOR_ID) > continue; > if (node->sysimgguid) { > - chassis = find_chassisguid(node->sysimgguid, > - node->vendid); > + chassis = find_chassisguid(node); > if (chassis) > chassis->nodecount++; > else { > /* Possible new chassis */ > add_chassislist(); > - mylist.current->chassisguid = get_chassisguid(node->sysimgguid, node->vendid); > + mylist.current->chassisguid = get_chassisguid(node); > mylist.current->nodecount = 1; > } > } > @@ -684,13 +755,12 @@ ChassisList *group_nodes() > > /* now, make another pass to see which nodes are part of chassis */ > /* (defined as chassis->nodecount > 1) */ > - for (dist = 0; dist <= maxhops_discovered; dist++) { > + for (dist = 0; dist <= MAXHOPS; ) { > for (node = nodesdist[dist]; node; node = node->dnext) { > if (node->vendid == VTR_VENDOR_ID) > continue; > if (node->sysimgguid) { > - chassis = find_chassisguid(node->sysimgguid, > - node->vendid); > + chassis = find_chassisguid(node); > if (chassis && chassis->nodecount > 1) { > if (!chassis->chassisnum) > chassis->chassisnum = ++chassisnum; > @@ -702,6 +772,10 @@ ChassisList *group_nodes() > } > } > } > + if (dist == maxhops_discovered) > + dist = MAXHOPS; /* skip to CAs */ > + else > + dist++; > } > > return (mylist.first); > diff --git 
a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c > index cb62c44..2cff87e 100644 > --- a/diags/src/ibnetdiscover.c > +++ b/diags/src/ibnetdiscover.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -450,14 +451,26 @@ list_node(Node *node) > } > > void > -out_ids(Node *node) > +out_ids(Node *node, int group, char *chname) > { > fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); > if (node->sysimgguid) > - fprintf(f, "sysimgguid=0x%" PRIx64 "\n", node->sysimgguid); > + fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); > + if (group) > + if (node->chrecord) > + if (node->chrecord->chassisnum) { > + fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); > + if (chname) > + fprintf(f, " (%s)", clean_nodedesc(chname)); > + if (is_xsigo_tca(node->nodeguid)) { > + if (node->ports->remoteport) > + fprintf(f, " slot %d", node->ports->remoteport->portnum); > + } > + } > + fprintf(f, "\n"); > } > > -void > +uint64_t > out_chassis(int chassisnum) > { > uint64_t guid; > @@ -467,20 +480,20 @@ out_chassis(int chassisnum) > if (guid) > fprintf(f, " (guid 0x%" PRIx64 ")", guid); > fprintf(f, "\n"); > + return guid; > } > > void > -out_switch(Node *node, int group) > +out_switch(Node *node, int group, char *chname) > { > char *str; > char *nodename = NULL; > > - out_ids(node); > + out_ids(node, group, chname); > fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); > if (group) { > if (node->chrecord) { > if (node->chrecord->chassisnum) { > - fprintf(f, "\t\t# Chassis %d ", node->chrecord->chassisnum); > /* Currently, only if Voltaire chassis */ > if (node->vendid == VTR_VENDOR_ID) { > str = get_chassis_type(node->chrecord->chassistype); > @@ -510,12 +523,12 @@ out_switch(Node *node, int group) > } > > 
void > -out_ca(Node *node) > +out_ca(Node *node, int group, char *chname) > { > char *node_type; > char *node_type2; > > - out_ids(node); > + out_ids(node, group, chname); > switch(node->type) { > case CA_NODE: > node_type = "ca"; > @@ -532,9 +545,12 @@ out_ca(Node *node) > } > > fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); > - fprintf(f, "%s\t%d %s\t\t# \"%s\"\n", > + fprintf(f, "%s\t%d %s\t\t# \"%s\"", > node_type2, node->numports, node_name(node), > clean_nodedesc(node->nodedesc)); > + if (group && is_xsigo_hca(node->nodeguid)) > + fprintf(f, " (scp)"); > + fprintf(f, "\n"); > } > > static char * > @@ -572,12 +588,17 @@ out_switch_port(Port *port, int group) > rem_nodename = clean_nodedesc(port->remoteport->node->nodedesc); > > ext_port_str = out_ext_port(port->remoteport, group); > - fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d\n", > + fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d", > node_name(port->remoteport->node), > port->remoteport->portnum, > ext_port_str ? ext_port_str : "", > rem_nodename, > port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid); > + if (is_xsigo_tca(port->remoteport->portguid)) > + fprintf(f, " slot %d", port->portnum); > + else if (is_xsigo_hca(port->remoteport->portguid)) > + fprintf(f, " (scp)"); > + fprintf(f, "\n"); > > if (rem_nodename && (port->remoteport->node->type == SWITCH_NODE)) > free(rem_nodename); > @@ -616,6 +637,8 @@ dump_topology(int listtype, int group) > Port *port; > int i = 0, dist = 0; > time_t t = time(0); > + uint64_t chguid; > + char *chname = NULL; > > if (!listtype) { > fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); > @@ -633,11 +656,31 @@ dump_topology(int listtype, int group) > > if (!ch->chassisnum) > continue; > - out_chassis(ch->chassisnum); > + chguid = out_chassis(ch->chassisnum); > + chname = NULL; > + if (is_xsigo_guid(chguid)) { > + /* !!! 
*/ > + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > + if (node->chrecord) { > + if (!node->chrecord->chassisnum) > + continue; > + } else > + continue; > + > + if (node->chrecord->chassisnum != ch->chassisnum) > + continue; > + > + if (is_xsigo_hca(node->nodeguid)) { > + chname = node->nodedesc; > + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); > + } > + } > + } > + Not sure I understand this code correctly, but is it Xsigo only? I mean where is_xsigo_hca() is used. Anyway why to not hide all this section inside out_chassis()? Sasha > fprintf(f, "\n# Spine Nodes"); > for (n = 1; n <= (SPINES_MAX_NUM+1); n++) { > if (ch->spinenode[n]) { > - out_switch(ch->spinenode[n], group); > + out_switch(ch->spinenode[n], group, chname); > for (port = ch->spinenode[n]->ports; port; port = port->next, i++) > if (port->remoteport) > out_switch_port(port, group); > @@ -646,34 +689,57 @@ dump_topology(int listtype, int group) > fprintf(f, "\n# Line Nodes"); > for (n = 1; n <= (LINES_MAX_NUM+1); n++) { > if (ch->linenode[n]) { > - out_switch(ch->linenode[n], group); > + out_switch(ch->linenode[n], group, chname); > for (port = ch->linenode[n]->ports; port; port = port->next, i++) > if (port->remoteport) > out_switch_port(port, group); > } > } > > - } > + fprintf(f, "\n# Chassis Switches"); > + for (dist = 0; dist <= maxhops_discovered; dist++) { > > - for (dist = 0; dist <= maxhops_discovered; dist++) { > + for (node = nodesdist[dist]; node; node = node->dnext) { > > - for (node = nodesdist[dist]; node; node = node->dnext) { > + /* Non Voltaire chassis */ > + if (node->vendid == VTR_VENDOR_ID) > + continue; > + if (node->chrecord) { > + if (!node->chrecord->chassisnum) > + continue; > + } else > + continue; > > - /* Non Voltaire chassis */ > - if (node->vendid == VTR_VENDOR_ID) > - continue; > + if (node->chrecord->chassisnum != ch->chassisnum) > + continue; > + > + out_switch(node, group, chname); > + for (port = node->ports; port; port = port->next, 
i++) > + if (port->remoteport) > + out_switch_port(port, group); > + > + } > + > + } > + > + fprintf(f, "\n# Chassis CAs"); > + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > if (node->chrecord) { > if (!node->chrecord->chassisnum) > continue; > } else > continue; > > - out_switch(node, group); > + if (node->chrecord->chassisnum != ch->chassisnum) > + continue; > + > + out_ca(node, group, chname); > for (port = node->ports; port; port = port->next, i++) > if (port->remoteport) > - out_switch_port(port, group); > + out_ca_port(port, group); > > } > + > } > > } else { > @@ -683,7 +749,7 @@ dump_topology(int listtype, int group) > > DEBUG("SWITCH: dist %d node %p", dist, node); > if (!listtype) { > - out_switch(node, group); > + out_switch(node, group, chname); > } else { > if (listtype & SWITCH_NODE) > list_node(node); > @@ -697,6 +763,7 @@ dump_topology(int listtype, int group) > } > } > > + chname = NULL; > if (group && !listtype) { > > fprintf(f, "\nNon-Chassis Nodes\n"); > @@ -710,7 +777,7 @@ dump_topology(int listtype, int group) > if (node->chrecord) > if (node->chrecord->chassisnum) > continue; > - out_switch(node, group); > + out_switch(node, group, chname); > > for (port = node->ports; port; port = port->next, i++) > if (port->remoteport) > @@ -725,9 +792,14 @@ dump_topology(int listtype, int group) > for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > > DEBUG("CA: dist %d node %p", dist, node); > - if (!listtype) > - out_ca(node); > - else { > + if (!listtype) { > + if (group) > + /* Now, skip chassis based CAs */ > + if (node->chrecord) > + if (node->chrecord->chassisnum) > + continue; > + out_ca(node, group, chname); > + } else { > if (listtype & CA_NODE) > list_node(node); > continue; From rdreier at cisco.com Mon Sep 17 14:43:45 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 14:43:45 -0700 Subject: [ofa-general] Re: mlx4 violating radix tree API locking rules? 
In-Reply-To: <20070917062252.GA30842@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 17 Sep 2007 08:22:52 +0200")
References: <20070911090313.GE15363@mellanox.co.il> <20070917062252.GA30842@mellanox.co.il>
Message-ID:

> Why not just call synchronize_rcu instead?

Not sure I understand. Where would you put the synchronize_rcu and what
would it protect against? RCU is being used to protect the radix tree
internals, not the mlx4 data structures.

> > I guess CQ spinlock implies rcu_read_lock - is that right?
> > But I do not see any synchronize_rcu calls anywhere in mlx4.
> > Should destroy QP and friends call synchronize_rcu after
> > removing the QP from radix tree but before freeing the QP structure?

By the way, replying to this earlier bit: I don't think the CQ spinlock
is equivalent to an rcu_read_lock(). In most configurations it may be
but I suspect the assumption would be broken by PREEMPT_RT or the like.

 - R.

From rdreier at cisco.com Mon Sep 17 14:47:01 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 17 Sep 2007 14:47:01 -0700
Subject: [ofa-general] Re: RFC: modify upstream code to make backporting easier
In-Reply-To: <20070916095930.GI30150@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 16 Sep 2007 11:59:31 +0200")
References: <20070911062851.GC15363@mellanox.co.il> <20070916095930.GI30150@mellanox.co.il>
Message-ID:

> Note that some people only run
> backported drivers, so making it easier to read and maintain
> *the backport* is also important.

The philosophy of the kernel has always been that the backport needs to
bear the cost, and we don't want to add extra #ifdefs to the standard
kernel that don't do anything except in non-standard situations.

> Do you think applying a patch as we do now is the best way to do it then?

To be honest the patch that started this thread looked very reasonable
and easy to maintain to me.
The pain of backporting gets higher as the kernel you're backporting to
gets older, but that's just the way things are.

From rdreier at cisco.com Mon Sep 17 14:47:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 17 Sep 2007 14:47:42 -0700
Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To: (Shirley Ma's message of "Fri, 14 Sep 2007 11:36:12 -0700")
References:
Message-ID:

> > IPoIB CM handles this properly by gathering together single pages in
> > skbs' fragment lists.

> Then can we reuse IPoIB CM code here?

Yes, if possible, refactoring things so that the rx skb allocation code
becomes common between CM and non-CM would definitely make sense.

From rdreier at cisco.com Mon Sep 17 14:48:43 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 17 Sep 2007 14:48:43 -0700
Subject: [ofa-general] Re: mlx4 violating radix tree API locking rules?
In-Reply-To: (Roland Dreier's message of "Mon, 17 Sep 2007 14:43:45 -0700")
References: <20070911090313.GE15363@mellanox.co.il> <20070917062252.GA30842@mellanox.co.il>
Message-ID:

By the way, in the past we've gotten push-back against using RCU in dual
GPL/BSD code. I have no problem relicensing mlx4 to GPL-only and then
sticking in the rcu_read_lock() stuff to handle this I guess.

 - R.

From sashak at voltaire.com Mon Sep 17 15:10:03 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Sep 2007 00:10:03 +0200
Subject: [ofa-general] Re: [PATCH 1/2] osm: QoS - replace guid ranges and partition list by port map
In-Reply-To: <46EED3F2.1020308@dev.mellanox.co.il>
References: <46EED3F2.1020308@dev.mellanox.co.il>
Message-ID: <20070917221003.GA6891@sashak.voltaire.com>

Hi Yevgeny,

Small comment is below.

On 21:22 Mon 17 Sep , Yevgeny Kliteynik wrote:
> QoS policy optimization: replacing partition list and guid
> ranges in a port group by port map indexed by port guid.
> The port map is filled at parse time, thus checking whether > some port belongs to a group becomes a single map query. > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/include/opensm/osm_qos_policy.h | 11 ++- > opensm/opensm/osm_qos_parser.y | 115 ++++++++++++++++++++++++++----- > opensm/opensm/osm_qos_policy.c | 59 ++++++++--------- > 3 files changed, 131 insertions(+), 54 deletions(-) > > diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h > index 0c220ee..680bf71 100644 > --- a/opensm/include/opensm/osm_qos_policy.h > +++ b/opensm/include/opensm/osm_qos_policy.h > @@ -50,6 +50,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -59,17 +60,20 @@ > > /***************************************************/ > > +typedef struct _osm_qos_port_t { > + cl_map_item_t map_item; > + osm_physp_t * p_physp; > +} osm_qos_port_t; > + > typedef struct _osm_qos_port_group_t { > char *name; /* single string (this port group name) */ > char *use; /* single string (description) */ > cl_list_t port_name_list; /* list of port names (.../.../...) 
*/ > - uint64_t **guid_range_arr; /* array of guid ranges (pair of 64-bit guids) */ > - unsigned guid_range_len; /* num of guid ranges in the array */ > - cl_list_t partition_list; /* list of partition names */ > boolean_t node_type_ca; > boolean_t node_type_switch; > boolean_t node_type_router; > boolean_t node_type_self; > + cl_qmap_t port_map; > } osm_qos_port_group_t; > > /***************************************************/ > @@ -147,6 +151,7 @@ typedef struct _osm_qos_policy_t { > > /***************************************************/ > > +osm_qos_port_t *osm_qos_policy_port_create(osm_physp_t * p_physp); > osm_qos_port_group_t * osm_qos_policy_port_group_create(); > void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p_port_group); > > diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y > index a477084..ca77536 100644 > --- a/opensm/opensm/osm_qos_parser.y > +++ b/opensm/opensm/osm_qos_parser.y > @@ -103,6 +103,19 @@ static void __merge_rangearr( > uint64_t ** * p_arr, > unsigned * p_arr_len ); > > +static void __parser_add_port_to_port_map( > + cl_qmap_t * p_map, > + osm_physp_t * p_physp); > + > +static void __parser_add_range_to_port_map( > + cl_qmap_t * p_map, > + uint64_t ** range_arr, > + unsigned range_len); > + > +static void __parser_add_map_to_port_map( > + cl_qmap_t * p_dmap, > + cl_map_t * p_smap); > + > extern char * __qos_parser_text; > extern void __qos_parser_error (char *s); > extern int __qos_parser_lex (void); > @@ -612,24 +625,9 @@ port_group_port_guid: port_group_port_guid_start list_of_ranges { > &range_arr, > &range_len ); > > - if ( !p_current_port_group->guid_range_len ) > - { > - p_current_port_group->guid_range_arr = range_arr; > - p_current_port_group->guid_range_len = range_len; > - } > - else > - { > - uint64_t ** new_range_arr; > - unsigned new_range_len; > - __merge_rangearr( p_current_port_group->guid_range_arr, > - p_current_port_group->guid_range_len, > - range_arr, > - range_len, > - 
&new_range_arr, > - &new_range_len ); > - p_current_port_group->guid_range_arr = new_range_arr; > - p_current_port_group->guid_range_len = new_range_len; > - } > + __parser_add_range_to_port_map(&p_current_port_group->port_map, > + range_arr, > + range_len); > } > } > ; > @@ -643,13 +641,26 @@ port_group_partition: port_group_partition_start string_list { > /* 'partition' in 'port-group' - any num of instances */ > cl_list_iterator_t list_iterator; > char * tmp_str; > + osm_prtn_t * p_prtn; > > + /* extract all the ports from the partition > + to the port map of this port group */ > list_iterator = cl_list_head(&tmp_parser_struct.str_list); > while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) > { > tmp_str = (char*)cl_list_obj(list_iterator); > if (tmp_str) > - cl_list_insert_tail(&p_current_port_group->partition_list,tmp_str); > + { > + p_prtn = osm_prtn_find_by_name(p_qos_policy->p_subn, tmp_str); > + if (p_prtn) > + { > + __parser_add_map_to_port_map(&p_current_port_group->port_map, > + &p_prtn->part_guid_tbl); > + __parser_add_map_to_port_map(&p_current_port_group->port_map, > + &p_prtn->full_guid_tbl); > + } > + free(tmp_str); > + } > list_iterator = cl_list_next(list_iterator); > } > cl_list_remove_all(&tmp_parser_struct.str_list); > @@ -2185,3 +2196,69 @@ static void __merge_rangearr( > > /*************************************************** > ***************************************************/ > + > +static void __parser_add_port_to_port_map( > + cl_qmap_t * p_map, > + osm_physp_t * p_physp) > +{ > + if (p_physp && osm_physp_is_valid(p_physp) && > + cl_qmap_get(p_map, cl_ntoh64( > + osm_physp_get_port_guid(p_physp))) == cl_qmap_end(p_map)) > + { > + osm_qos_port_t * p_port = osm_qos_policy_port_create(p_physp); Here mem allocation can fail and p_port will be NULL. Don't need to resubmit the patch for this, just send subsequent patch. 
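The missing check could look roughly like the sketch below. It is a standalone illustration and not opensm code: port_item_t, port_map_t, map_add_port() and the alloc_fn hook are made-up stand-ins for osm_qos_port_t, cl_qmap_t and the parser helper, so that the allocation-failure path can be shown without complib.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the opensm/complib types; not the real API. */
typedef struct port_item {
	uint64_t guid;
	struct port_item *next;
} port_item_t;

typedef struct port_map {
	port_item_t *head;
	size_t count;
} port_map_t;

/* Allocator hook (assumption of this sketch) so failure can be simulated. */
static void *(*alloc_fn)(size_t) = malloc;

static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/* Mirrors the shape of osm_qos_policy_port_create(): may return NULL. */
static port_item_t *port_item_create(uint64_t guid)
{
	port_item_t *p = alloc_fn(sizeof(*p));
	if (!p)
		return NULL;
	p->guid = guid;
	p->next = NULL;
	return p;
}

/* Returns 0 on success, -1 if allocation failed (map is left untouched). */
static int map_add_port(port_map_t *map, uint64_t guid)
{
	port_item_t *p;

	assert(map != NULL);
	p = port_item_create(guid);
	if (!p)
		return -1;	/* guard: never dereference a failed allocation */
	p->next = map->head;
	map->head = p;
	map->count++;
	return 0;
}
```

In the patch itself this would amount to testing p_port before the cl_qmap_insert() call; how the parser should report the failure is left open here.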
Sasha > + cl_qmap_insert(p_map, > + cl_ntoh64(osm_physp_get_port_guid(p_physp)), > + &p_port->map_item); > + } > +} > + > +/*************************************************** > + ***************************************************/ > + > +static void __parser_add_range_to_port_map( > + cl_qmap_t * p_map, > + uint64_t ** range_arr, > + unsigned range_len) > +{ > + unsigned i; > + uint64_t guid_ho; > + osm_port_t * p_osm_port; > + > + if (!range_arr || !range_len) > + return; > + > + for (i = 0; i < range_len; i++) { > + for (guid_ho = range_arr[i][0]; guid_ho <= range_arr[i][1]; guid_ho++) { > + p_osm_port = > + osm_get_port_by_guid(p_qos_policy->p_subn, cl_hton64(guid_ho)); > + if (p_osm_port) > + __parser_add_port_to_port_map(p_map, p_osm_port->p_physp); > + } > + free(range_arr[i]); > + } > + free(range_arr); > +} > + > +/*************************************************** > + ***************************************************/ > + > +static void __parser_add_map_to_port_map( > + cl_qmap_t * p_dmap, > + cl_map_t * p_smap) > +{ > + cl_map_iterator_t map_iterator; > + osm_physp_t * p_physp; > + > + if (!p_dmap || !p_smap) > + return; > + > + map_iterator = cl_map_head(p_smap); > + while (map_iterator != cl_map_end(p_smap)) { > + p_physp = (osm_physp_t*)cl_map_obj(map_iterator); > + __parser_add_port_to_port_map(p_dmap, p_physp); > + map_iterator = cl_map_next(map_iterator); > + } > +} > + > +/*************************************************** > + ***************************************************/ > diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c > index b2d1622..d1b227f 100644 > --- a/opensm/opensm/osm_qos_policy.c > +++ b/opensm/opensm/osm_qos_policy.c > @@ -101,6 +101,27 @@ static void __free_single_element(void *p_element, void *context) > free(p_element); > } > > +static void __free_port_map_element(cl_map_item_t *p_element, void *context) > +{ > + if (p_element) > + free(p_element); > +} > + > 
+/*************************************************** > + ***************************************************/ > + > +osm_qos_port_t *osm_qos_policy_port_create(osm_physp_t *p_physp) > +{ > + osm_qos_port_t *p = > + (osm_qos_port_t *) malloc(sizeof(osm_qos_port_t)); > + if (!p) > + return NULL; > + memset(p, 0, sizeof(osm_qos_port_t)); > + > + p->p_physp = p_physp; > + return p; > +} > + > /*************************************************** > ***************************************************/ > > @@ -114,7 +135,7 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() > memset(p, 0, sizeof(osm_qos_port_group_t)); > > cl_list_init(&p->port_name_list, 10); > - cl_list_init(&p->partition_list, 10); > + cl_qmap_init(&p->port_map); > > return p; > } > @@ -124,8 +145,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() > > void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) > { > - unsigned i; > - > if (!p) > return; > > @@ -134,18 +153,12 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) > if (p->use) > free(p->use); > > - for (i = 0; i < p->guid_range_len; i++) > - free(p->guid_range_arr[i]); > - if (p->guid_range_arr) > - free(p->guid_range_arr); > - > cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); > cl_list_remove_all(&p->port_name_list); > cl_list_destroy(&p->port_name_list); > > - cl_list_apply_func(&p->partition_list, __free_single_element, NULL); > - cl_list_remove_all(&p->partition_list); > - cl_list_destroy(&p->partition_list); > + cl_qmap_apply_func(&p->port_map, __free_port_map_element, NULL); > + cl_qmap_remove_all(&p->port_map); > > free(p); > } > @@ -491,12 +504,9 @@ __qos_policy_is_port_in_group(osm_subn_t * p_subn, > osm_qos_port_group_t * p_port_group) > { > osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); > - osm_prtn_t *p_prtn = NULL; > ib_net64_t port_guid = osm_physp_get_port_guid(p_physp); > uint64_t port_guid_ho = cl_ntoh64(port_guid); > uint8_t node_type = 
osm_node_get_type(p_node); > - cl_list_iterator_t list_iterator; > - char *partition_name; > > /* check whether this port's type matches any of group's types */ > > @@ -506,27 +516,12 @@ __qos_policy_is_port_in_group(osm_subn_t * p_subn, > && p_port_group->node_type_router)) > return TRUE; > > - /* check whether this port's guid is in range of this group's guids */ > + /* check whether this port's guid is in group's port map */ > > - if (__is_num_in_range_arr(p_port_group->guid_range_arr, > - p_port_group->guid_range_len, port_guid_ho)) > + if (cl_qmap_get(&p_port_group->port_map, port_guid_ho) != > + cl_qmap_end(&p_port_group->port_map)) > return TRUE; > > - /* check whether this port is member of this group's partitions */ > - > - list_iterator = cl_list_head(&p_port_group->partition_list); > - while (list_iterator != cl_list_end(&p_port_group->partition_list)) { > - partition_name = (char *)cl_list_obj(list_iterator); > - if (partition_name && strlen(partition_name)) { > - p_prtn = osm_prtn_find_by_name(p_subn, partition_name); > - if (p_prtn) { > - if (osm_prtn_is_guid(p_prtn, port_guid)) > - return TRUE; > - } > - } > - list_iterator = cl_list_next(list_iterator); > - } > - > /* check whether this port's name matches any of group's names */ > > /* > -- > 1.5.1.4 > > From sashak at voltaire.com Mon Sep 17 15:10:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 00:10:51 +0200 Subject: [ofa-general] Re: [PATCH 1/2] osm: QoS - replace guid ranges and partition list by port map In-Reply-To: <46EED3F2.1020308@dev.mellanox.co.il> References: <46EED3F2.1020308@dev.mellanox.co.il> Message-ID: <20070917221051.GB6891@sashak.voltaire.com> On 21:22 Mon 17 Sep , Yevgeny Kliteynik wrote: > QoS policy optimization: replacing partition list and guid > ranges in a port group by port map indexed by port guid. > The port map is filled at parse time, thus checking whether > some port belongs to a group becomes a single map query. 
> > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From sashak at voltaire.com Mon Sep 17 15:12:28 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 00:12:28 +0200 Subject: [ofa-general] Re: [PATCH 2/2] osm: QoS - reworked node types in port group In-Reply-To: <46EED42F.8050504@dev.mellanox.co.il> References: <46EED42F.8050504@dev.mellanox.co.il> Message-ID: <20070917221228.GC6891@sashak.voltaire.com> On 21:23 Mon 17 Sep , Yevgeny Kliteynik wrote: > QoS policy optimization: replaced node types with > single bitmask, and node_type_self is implemented > as an additional guid in group's port map. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From hrosenstock at xsigo.com Mon Sep 17 15:08:51 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 15:08:51 -0700 Subject: [ofa-general] Re: [PATCHv2] ibnetdiscover: Support Xsigo chassis grouping In-Reply-To: <20070917215147.GZ6891@sashak.voltaire.com> References: <1189730194.6062.1.camel@hrosenstock-ws.xsigo.com> <20070917215147.GZ6891@sashak.voltaire.com> Message-ID: <1190066931.12099.65.camel@hrosenstock-ws.xsigo.com> Hi Sasha, On Mon, 2007-09-17 at 23:51 +0200, Sasha Khapyorsky wrote: > Hi Hal, > > On 17:36 Thu 13 Sep , Hal Rosenstock wrote: > > ibnetdiscover: Support Xsigo chassis grouping > > > > I think this also fixes a bug with grouping of multiple non Voltaire > > chassis as well. > > Could you provide more details about this bug. I found it because the Xsigo grouping is similar to the non Voltaire grouping and tested a multiple chassis case which did not work. > Should this be a separate patch? Is this really needed ? I have no way of testing this independently of the (other) Xsigo changes. > > Note: this patch is against OFED 1.2 > > Hal, you know - the patches for master should be against master (I spent > some time). Thanks. As you know, we are working with OFED 1.2. > Some comments are below. 
> > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/diags/include/grouping.h b/diags/include/grouping.h > > index 4666935..3ba872c 100644 > > --- a/diags/include/grouping.h > > +++ b/diags/include/grouping.h > > @@ -1,5 +1,6 @@ > > /* > > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. You may choose to be licensed under the terms of the GNU > > @@ -104,4 +105,8 @@ char *get_chassis_type(unsigned char chassistype); > > char *get_chassis_slot(unsigned char chassisslot); > > uint64_t get_chassis_guid(unsigned char chassisnum); > > > > +int is_xsigo_guid(uint64_t guid); > > +int is_xsigo_tca(uint64_t guid); > > +int is_xsigo_hca(uint64_t guid); > > + > > #endif /* _GROUPING_H_ */ > > diff --git a/diags/include/ibnetdiscover.h b/diags/include/ibnetdiscover.h > > index d13a666..bfbe7f5 100644 > > --- a/diags/include/ibnetdiscover.h > > +++ b/diags/include/ibnetdiscover.h > > @@ -1,5 +1,6 @@ > > /* > > * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. You may choose to be licensed under the terms of the GNU > > @@ -44,6 +45,7 @@ > > #define VTR_VENDOR_ID 0x8f1 /* Voltaire */ > > #define TS_VENDOR_ID 0x5ad /* Cisco */ > > #define SS_VENDOR_ID 0x66a /* InfiniCon */ > > +#define XS_VENDOR_ID 0x1397 /* Xsigo */ > > > > > > typedef struct Port Port; > > diff --git a/diags/src/grouping.c b/diags/src/grouping.c > > index 0e5bd78..6602f26 100644 > > --- a/diags/src/grouping.c > > +++ b/diags/src/grouping.c > > @@ -1,5 +1,6 @@ > > /* > > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. 
> > * > > * This software is available to you under a choice of one of two > > * licenses. You may choose to be licensed under the terms of the GNU > > @@ -96,20 +97,91 @@ static uint64_t topspin_chassisguid(uint64_t guid) > > return guid & 0xffffffff00ffffffULL; > > } > > > > -static uint64_t get_chassisguid(uint64_t guid, uint32_t vendid) > > +int is_xsigo_guid(uint64_t guid) > > { > > - if (vendid == TS_VENDOR_ID || vendid == SS_VENDOR_ID) > > - return topspin_chassisguid(guid); > > + if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) > > + return 1; > > else > > - return guid; > > + return 0; > > +} > > + > > +static int is_xsigo_leafone(uint64_t guid) > > +{ > > + if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) > > + return 1; > > + else > > + return 0; > > +} > > + > > +int is_xsigo_hca(uint64_t guid) > > +{ > > + /* NodeType 2 is HCA */ > > + if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) > > + return 1; > > + else > > + return 0; > > +} > > + > > +int is_xsigo_tca(uint64_t guid) > > +{ > > + /* NodeType 3 is TCA */ > > + if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL) > > + return 1; > > + else > > + return 0; > > +} > > + > > +static int is_xsigo_ca(uint64_t guid) > > +{ > > + if (is_xsigo_hca(guid) || is_xsigo_tca(guid)) > > + return 1; > > + else > > + return 0; > > +} > > + > > +static int is_xsigo_switch(uint64_t guid) > > +{ > > + if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) > > + return 1; > > + else > > + return 0; > > +} > > + > > +static uint64_t xsigo_chassisguid(Node *node) > > +{ > > + if (!is_xsigo_ca(node->sysimgguid)) { > > + /* Byte 3 is NodeType and byte 4 is PortType */ > > + /* If NodeType is 1 (switch), PortType is masked */ > > + if (is_xsigo_switch(node->sysimgguid)) > > + return node->sysimgguid & 0xffffffff00ffffffULL; > > + else > > + return node->sysimgguid; > > + } else { > > + /* If peer port is Leaf 1, use its chassis GUID */ > > + if 
(is_xsigo_leafone(node->ports->remoteport->node->sysimgguid)) > > + return node->ports->remoteport->node->sysimgguid & > > + 0xffffffff00ffffffULL; > > + else > > + return node->sysimgguid; > > + } > > } > > > > -static struct ChassisList *find_chassisguid(uint64_t guid, uint32_t vendid) > > +static uint64_t get_chassisguid(Node *node) > > +{ > > + if (node->vendid == TS_VENDOR_ID || node->vendid == SS_VENDOR_ID) > > + return topspin_chassisguid(node->sysimgguid); > > + else if (node->vendid == XS_VENDOR_ID || is_xsigo_guid(node->sysimgguid)) > > + return xsigo_chassisguid(node); > > + else > > + return node->sysimgguid; > > +} > > + > > +static struct ChassisList *find_chassisguid(Node *node) > > { > > ChassisList *current; > > uint64_t chguid; > > > > - chguid = get_chassisguid(guid, vendid); > > + chguid = get_chassisguid(node); > > for (current = mylist.first; current; current = current->next) { > > if (current->chassisguid == chguid) > > return current; > > @@ -668,14 +740,13 @@ ChassisList *group_nodes() > > if (node->vendid == VTR_VENDOR_ID) > > continue; > > if (node->sysimgguid) { > > - chassis = find_chassisguid(node->sysimgguid, > > - node->vendid); > > + chassis = find_chassisguid(node); > > if (chassis) > > chassis->nodecount++; > > else { > > /* Possible new chassis */ > > add_chassislist(); > > - mylist.current->chassisguid = get_chassisguid(node->sysimgguid, node->vendid); > > + mylist.current->chassisguid = get_chassisguid(node); > > mylist.current->nodecount = 1; > > } > > } > > @@ -684,13 +755,12 @@ ChassisList *group_nodes() > > > > /* now, make another pass to see which nodes are part of chassis */ > > /* (defined as chassis->nodecount > 1) */ > > - for (dist = 0; dist <= maxhops_discovered; dist++) { > > + for (dist = 0; dist <= MAXHOPS; ) { > > for (node = nodesdist[dist]; node; node = node->dnext) { > > if (node->vendid == VTR_VENDOR_ID) > > continue; > > if (node->sysimgguid) { > > - chassis = find_chassisguid(node->sysimgguid, > > - 
node->vendid); > > + chassis = find_chassisguid(node); > > if (chassis && chassis->nodecount > 1) { > > if (!chassis->chassisnum) > > chassis->chassisnum = ++chassisnum; > > @@ -702,6 +772,10 @@ ChassisList *group_nodes() > > } > > } > > } > > + if (dist == maxhops_discovered) > > + dist = MAXHOPS; /* skip to CAs */ > > + else > > + dist++; > > } > > > > return (mylist.first); > > diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c > > index cb62c44..2cff87e 100644 > > --- a/diags/src/ibnetdiscover.c > > +++ b/diags/src/ibnetdiscover.c > > @@ -1,5 +1,6 @@ > > /* > > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. You may choose to be licensed under the terms of the GNU > > @@ -450,14 +451,26 @@ list_node(Node *node) > > } > > > > void > > -out_ids(Node *node) > > +out_ids(Node *node, int group, char *chname) > > { > > fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); > > if (node->sysimgguid) > > - fprintf(f, "sysimgguid=0x%" PRIx64 "\n", node->sysimgguid); > > + fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); > > + if (group) > > + if (node->chrecord) > > + if (node->chrecord->chassisnum) { > > + fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); > > + if (chname) > > + fprintf(f, " (%s)", clean_nodedesc(chname)); > > + if (is_xsigo_tca(node->nodeguid)) { > > + if (node->ports->remoteport) > > + fprintf(f, " slot %d", node->ports->remoteport->portnum); > > + } > > + } > > + fprintf(f, "\n"); > > } > > > > -void > > +uint64_t > > out_chassis(int chassisnum) > > { > > uint64_t guid; > > @@ -467,20 +480,20 @@ out_chassis(int chassisnum) > > if (guid) > > fprintf(f, " (guid 0x%" PRIx64 ")", guid); > > fprintf(f, "\n"); > > + return guid; > > } > > > > void > > -out_switch(Node *node, int group) > > +out_switch(Node *node, int group, char *chname) > > 
{ > > char *str; > > char *nodename = NULL; > > > > - out_ids(node); > > + out_ids(node, group, chname); > > fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); > > if (group) { > > if (node->chrecord) { > > if (node->chrecord->chassisnum) { > > - fprintf(f, "\t\t# Chassis %d ", node->chrecord->chassisnum); > > /* Currently, only if Voltaire chassis */ > > if (node->vendid == VTR_VENDOR_ID) { > > str = get_chassis_type(node->chrecord->chassistype); > > @@ -510,12 +523,12 @@ out_switch(Node *node, int group) > > } > > > > void > > -out_ca(Node *node) > > +out_ca(Node *node, int group, char *chname) > > { > > char *node_type; > > char *node_type2; > > > > - out_ids(node); > > + out_ids(node, group, chname); > > switch(node->type) { > > case CA_NODE: > > node_type = "ca"; > > @@ -532,9 +545,12 @@ out_ca(Node *node) > > } > > > > fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); > > - fprintf(f, "%s\t%d %s\t\t# \"%s\"\n", > > + fprintf(f, "%s\t%d %s\t\t# \"%s\"", > > node_type2, node->numports, node_name(node), > > clean_nodedesc(node->nodedesc)); > > + if (group && is_xsigo_hca(node->nodeguid)) > > + fprintf(f, " (scp)"); > > + fprintf(f, "\n"); > > } > > > > static char * > > @@ -572,12 +588,17 @@ out_switch_port(Port *port, int group) > > rem_nodename = clean_nodedesc(port->remoteport->node->nodedesc); > > > > ext_port_str = out_ext_port(port->remoteport, group); > > - fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d\n", > > + fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d", > > node_name(port->remoteport->node), > > port->remoteport->portnum, > > ext_port_str ? ext_port_str : "", > > rem_nodename, > > port->remoteport->node->type == SWITCH_NODE ? 
port->remoteport->node->smalid : port->remoteport->lid); > > + if (is_xsigo_tca(port->remoteport->portguid)) > > + fprintf(f, " slot %d", port->portnum); > > + else if (is_xsigo_hca(port->remoteport->portguid)) > > + fprintf(f, " (scp)"); > > + fprintf(f, "\n"); > > > > if (rem_nodename && (port->remoteport->node->type == SWITCH_NODE)) > > free(rem_nodename); > > @@ -616,6 +637,8 @@ dump_topology(int listtype, int group) > > Port *port; > > int i = 0, dist = 0; > > time_t t = time(0); > > + uint64_t chguid; > > + char *chname = NULL; > > > > if (!listtype) { > > fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); > > @@ -633,11 +656,31 @@ dump_topology(int listtype, int group) > > > > if (!ch->chassisnum) > > continue; > > - out_chassis(ch->chassisnum); > > + chguid = out_chassis(ch->chassisnum); > > + chname = NULL; > > + if (is_xsigo_guid(chguid)) { > > + /* !!! */ > > + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > > + if (node->chrecord) { > > + if (!node->chrecord->chassisnum) > > + continue; > > + } else > > + continue; > > + > > + if (node->chrecord->chassisnum != ch->chassisnum) > > + continue; > > + > > + if (is_xsigo_hca(node->nodeguid)) { > > + chname = node->nodedesc; > > + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); > > + } > > + } > > + } > > + > > Not sure I understand this code correctly, but is it Xsigo only? I mean > where is_xsigo_hca() is used. Yes, this is specific to Xsigo. > Anyway why to not hide all this section inside out_chassis()? It looks like it could be done as you suggest but it is currently done similar to other code slightly lower down which loop in a similar manner (Chassis Switches, Chassis CAs). 
-- Hal > Sasha > > > fprintf(f, "\n# Spine Nodes"); > > for (n = 1; n <= (SPINES_MAX_NUM+1); n++) { > > if (ch->spinenode[n]) { > > - out_switch(ch->spinenode[n], group); > > + out_switch(ch->spinenode[n], group, chname); > > for (port = ch->spinenode[n]->ports; port; port = port->next, i++) > > if (port->remoteport) > > out_switch_port(port, group); > > @@ -646,34 +689,57 @@ dump_topology(int listtype, int group) > > fprintf(f, "\n# Line Nodes"); > > for (n = 1; n <= (LINES_MAX_NUM+1); n++) { > > if (ch->linenode[n]) { > > - out_switch(ch->linenode[n], group); > > + out_switch(ch->linenode[n], group, chname); > > for (port = ch->linenode[n]->ports; port; port = port->next, i++) > > if (port->remoteport) > > out_switch_port(port, group); > > } > > } > > > > - } > > + fprintf(f, "\n# Chassis Switches"); > > + for (dist = 0; dist <= maxhops_discovered; dist++) { > > > > - for (dist = 0; dist <= maxhops_discovered; dist++) { > > + for (node = nodesdist[dist]; node; node = node->dnext) { > > > > - for (node = nodesdist[dist]; node; node = node->dnext) { > > + /* Non Voltaire chassis */ > > + if (node->vendid == VTR_VENDOR_ID) > > + continue; > > + if (node->chrecord) { > > + if (!node->chrecord->chassisnum) > > + continue; > > + } else > > + continue; > > > > - /* Non Voltaire chassis */ > > - if (node->vendid == VTR_VENDOR_ID) > > - continue; > > + if (node->chrecord->chassisnum != ch->chassisnum) > > + continue; > > + > > + out_switch(node, group, chname); > > + for (port = node->ports; port; port = port->next, i++) > > + if (port->remoteport) > > + out_switch_port(port, group); > > + > > + } > > + > > + } > > + > > + fprintf(f, "\n# Chassis CAs"); > > + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > > if (node->chrecord) { > > if (!node->chrecord->chassisnum) > > continue; > > } else > > continue; > > > > - out_switch(node, group); > > + if (node->chrecord->chassisnum != ch->chassisnum) > > + continue; > > + > > + out_ca(node, group, chname); > > for 
(port = node->ports; port; port = port->next, i++) > > if (port->remoteport) > > - out_switch_port(port, group); > > + out_ca_port(port, group); > > > > } > > + > > } > > > > } else { > > @@ -683,7 +749,7 @@ dump_topology(int listtype, int group) > > > > DEBUG("SWITCH: dist %d node %p", dist, node); > > if (!listtype) { > > - out_switch(node, group); > > + out_switch(node, group, chname); > > } else { > > if (listtype & SWITCH_NODE) > > list_node(node); > > @@ -697,6 +763,7 @@ dump_topology(int listtype, int group) > > } > > } > > > > + chname = NULL; > > if (group && !listtype) { > > > > fprintf(f, "\nNon-Chassis Nodes\n"); > > @@ -710,7 +777,7 @@ dump_topology(int listtype, int group) > > if (node->chrecord) > > if (node->chrecord->chassisnum) > > continue; > > - out_switch(node, group); > > + out_switch(node, group, chname); > > > > for (port = node->ports; port; port = port->next, i++) > > if (port->remoteport) > > @@ -725,9 +792,14 @@ dump_topology(int listtype, int group) > > for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > > > > DEBUG("CA: dist %d node %p", dist, node); > > - if (!listtype) > > - out_ca(node); > > - else { > > + if (!listtype) { > > + if (group) > > + /* Now, skip chassis based CAs */ > > + if (node->chrecord) > > + if (node->chrecord->chassisnum) > > + continue; > > + out_ca(node, group, chname); > > + } else { > > if (listtype & CA_NODE) > > list_node(node); > > continue; > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Mon Sep 17 15:11:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 15:11:14 -0700 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <46ECEE3F.60301@voltaire.com> (Or Gerlitz's message of "Sun, 16 Sep 2007 11:50:07 +0300") References: 
<46ECEE3F.60301@voltaire.com> Message-ID: > The IGMP enabling patch posted by me on September 2nd isn't on your list > http://lists.openfabrics.org/pipermail/general/2007-September/040250.html > can you add it? Yes, I lost that somehow. I will add it to my list of things to take a look at (no opinion yet). - R. From rdreier at cisco.com Mon Sep 17 15:12:18 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 15:12:18 -0700 Subject: [ofa-general] Re: [PATCH 1/3] IB/ehca: Fix large page HW cap defines In-Reply-To: <200709131814.59307.fenkes@de.ibm.com> (Joachim Fenkes's message of "Thu, 13 Sep 2007 18:14:58 +0200") References: <200709131814.13937.fenkes@de.ibm.com> <200709131814.59307.fenkes@de.ibm.com> Message-ID: obviously OK...applied. From rdreier at cisco.com Mon Sep 17 15:17:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 15:17:14 -0700 Subject: [ofa-general] Re: [PATCH 01/11] IB/ipoib: Export call to call_netdevice_notifiers and add new private flag In-Reply-To: <11898132322950-git-send-email-fubar@us.ibm.com> (Jay Vosburgh's message of "Fri, 14 Sep 2007 16:40:20 -0700") References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> Message-ID: I tried to look at the ipoib stuff in this series... this patch looks fine but it doesn't actually touch ipoib, so the subject line is a bit misleading... 
From rdreier at cisco.com Mon Sep 17 15:20:48 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 15:20:48 -0700 Subject: [ofa-general] Re: [PATCH 04/11] IB/ipoib: Verify address handle validity on send In-Reply-To: <11898132372856-git-send-email-fubar@us.ibm.com> (Jay Vosburgh's message of "Fri, 14 Sep 2007 16:40:23 -0700") References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <11898132352341-git-send-email-fubar@us.ibm.com> <11898132372856-git-send-email-fubar@us.ibm.com> Message-ID: Looks fine overall, with one minor nitpick: > - if (unlikely(memcmp(&neigh->dgid.raw, > + if (unlikely((memcmp(&neigh->dgid.raw, > skb->dst->neighbour->ha + 4, > - sizeof(union ib_gid)))) { > + sizeof(union ib_gid))) || > + (neigh->dev != dev))) { the indentation here makes this confusing to read -- I would just do: } else if (neigh->ah) { if (unlikely(memcmp(&neigh->dgid.raw, skb->dst->neighbour->ha + 4, - sizeof(union ib_gid)))) { + sizeof(union ib_gid)) || + neigh->dev != dev)) { From rdreier at cisco.com Mon Sep 17 15:22:00 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 15:22:00 -0700 Subject: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: <1189813234208-git-send-email-fubar@us.ibm.com> (Jay Vosburgh's message of "Fri, 14 Sep 2007 16:40:21 -0700") References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> Message-ID: OK with me. 
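Editorial sketch of the condition suggested in the PATCH 04/11 review above, pulled out into a stand-alone helper so the flattened check is easier to read. This is a minimal user-space illustration under stated assumptions, not the IPoIB code: `struct gid16`, `struct neigh_entry`, and `neigh_is_stale()` are hypothetical stand-ins for `union ib_gid`, `struct ipoib_neigh`, and the inline test in the send path, and the kernel's `unlikely()` hint is left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for union ib_gid: a 16-byte GID. */
struct gid16 {
	unsigned char raw[16];
};

/* Hypothetical stand-in for struct ipoib_neigh. */
struct neigh_entry {
	struct gid16 dgid;	/* destination GID cached with the address handle */
	const void *dev;	/* net device the entry was created for */
};

/*
 * Returns 1 when the cached entry no longer matches the packet, i.e. the
 * destination GID differs or the entry belongs to another device; 0 otherwise.
 * hw_gid points at the 16 GID bytes of the hardware address
 * (skb->dst->neighbour->ha + 4 in the patch under review).
 */
static int neigh_is_stale(const struct neigh_entry *neigh,
			  const unsigned char *hw_gid, const void *dev)
{
	assert(neigh != NULL && hw_gid != NULL);
	return memcmp(neigh->dgid.raw, hw_gid, sizeof(struct gid16)) != 0 ||
	       neigh->dev != dev;
}
```

Written this way, both tests sit at the same nesting level inside one boolean expression, which is the point of the nitpick: no extra parentheses are needed around either operand of `||`.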
From rdreier at cisco.com Mon Sep 17 15:23:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 15:23:58 -0700 Subject: [ofa-general] Re: [PATCH 03/11] IB/ipoib: Bound the net device to the ipoib_neigh structue In-Reply-To: <11898132352341-git-send-email-fubar@us.ibm.com> (Jay Vosburgh's message of "Fri, 14 Sep 2007 16:40:22 -0700") References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <11898132352341-git-send-email-fubar@us.ibm.com> Message-ID: Overall idea looks good... one comment: > + if (n->dev->flags & IFF_MASTER) { > + /* n->dev is not an IPoIB device and we have > + to take priv from elsewhere */ > + neigh = *to_ipoib_neigh(n); > + if (neigh) { > + priv = netdev_priv(neigh->dev); > + ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n", > + n->dev->name); > + } else > + return; > + } seems like it would be cleaner not to worry about bonding here and just use neigh->dev all the time rather than having two ways to look up the device. From xma at us.ibm.com Mon Sep 17 15:24:12 2007 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 17 Sep 2007 15:24:12 -0700 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: Message-ID: netdev-owner at vger.kernel.org wrote on 09/17/2007 02:47:42 PM: > > > IPoIB CM handles this properly by gathering together single pages in > > > skbs' fragment lists. > > > Then can we reuse IPoIB CM code here? > > Yes, if possible, refactoring things so that the rx skb allocation > code becomes common between CM and non-CM would definitely make sense. This also applies to MTU=2K, since a page size greater than 2K is not guaranteed on every platform. The skb issue is independent of supporting MTU=4K. We need to address the non-CM skb issue in general. 
Let's have a simple patch to enable MTU=4K support first in 2.6.24, then address the non-CM skb issue in a different patch next. Do you agree? Thanks Shirley From kliteyn at dev.mellanox.co.il Mon Sep 17 15:23:44 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 18 Sep 2007 00:23:44 +0200 Subject: [ofa-general] [PATCH] osm: QoS parser - checking allocation status Message-ID: <46EEFE70.4050600@dev.mellanox.co.il> Checking memory allocation status in qos policy parser to prevent seg. fault. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_qos_parser.y | 7 ++++--- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index a73cf6b..3c54205 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -2216,9 +2216,10 @@ static void __parser_add_port_to_port_map( osm_physp_get_port_guid(p_physp))) == cl_qmap_end(p_map)) { osm_qos_port_t * p_port = osm_qos_policy_port_create(p_physp); - cl_qmap_insert(p_map, - cl_ntoh64(osm_physp_get_port_guid(p_physp)), - &p_port->map_item); + if (p_port) + cl_qmap_insert(p_map, + cl_ntoh64(osm_physp_get_port_guid(p_physp)), + &p_port->map_item); } } -- 1.5.1.4 From rdreier at cisco.com Mon Sep 17 15:25:55 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 15:25:55 -0700 Subject: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: <1189813234208-git-send-email-fubar@us.ibm.com> (Jay Vosburgh's message of "Fri, 14 Sep 2007 16:40:21 -0700") References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> Message-ID: Actually, thinking about this some more... would it be cleaner to move the knowledge about bonding out of the ipoib driver? 
in other words, export something similar to > +static int ipoib_slave_detach(struct net_device *dev) > +{ > + int ret = 0; > + if (dev->flags & IFF_SLAVE) { > + dev->priv_flags |= IFF_SLAVE_DETACH; > + rtnl_lock(); > + ret = call_netdevice_notifiers(NETDEV_CHANGE, dev); > + rtnl_unlock(); > + } > + return ret; > +} for drivers to use, rather than putting use of IFF_SLAVE and IFF_SLAVE_DETACH outside of the bonding driver. Also it seems this function could return void, since both call sites ignore the return value and I don't see anything sensible that IPoIB could do with the notifier chain return value anyway. - R. From sean.hefty at intel.com Mon Sep 17 15:26:47 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 17 Sep 2007 15:26:47 -0700 Subject: [ofa-general] [RFC 2/2] ib/cm: add basic performance counters In-Reply-To: References: <000001c7f62c$121c31a0$65cc180a@amr.corp.intel.com><000201c7f62d$1c004750$65cc180a@amr.corp.intel.com><000301c7f630$d8ac1d90$65cc180a@amr.corp.intel.com> Message-ID: <000901c7f979$d57370a0$9c98070a@amr.corp.intel.com> >My first reaction would be to stick them somewhere in debugfs. (I'm >assuming this feature is for diagnostics etc) I'm looking into using debugfs, since the counters are mainly for diagnostics, but are these much different than exposing counters through: /sys/class/infiniband/mthca0/ports/1/counters or /sys/class/net/ib0/statistics ? I'm concerned about not having access to these counters because debugfs wasn't installed. (I don't see a great solution atm.) 
- Sean From sashak at voltaire.com Mon Sep 17 15:40:00 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 00:40:00 +0200 Subject: [ofa-general] Re: [PATCHv2] ibnetdiscover: Support Xsigo chassis grouping In-Reply-To: <1190066931.12099.65.camel@hrosenstock-ws.xsigo.com> References: <1189730194.6062.1.camel@hrosenstock-ws.xsigo.com> <20070917215147.GZ6891@sashak.voltaire.com> <1190066931.12099.65.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070917224000.GE6891@sashak.voltaire.com> On 15:08 Mon 17 Sep , Hal Rosenstock wrote: > Hi Sasha, > > On Mon, 2007-09-17 at 23:51 +0200, Sasha Khapyorsky wrote: > > Hi Hal, > > > > On 17:36 Thu 13 Sep , Hal Rosenstock wrote: > > > ibnetdiscover: Support Xsigo chassis grouping > > > > > > I think this also fixes a bug with grouping of multiple non Voltaire > > > chassis as well. > > > > Could you provide more details about this bug. > > I found it because the Xsigo grouping is similar to the non Voltaire > grouping and tested a multiple chassis case which did not work. But what is the bug? > > Should this be a separate patch? > > Is this really needed ? I have no way of testing this independently of > the (other) Xsigo changes. > > > > Note: this patch is against OFED 1.2 > > > > Hal, you know - the patches for master should be against master (I spent > > some time). > > Thanks. As you know, we are working with OFED 1.2. But this patch targets master, not OFED 1.2. It is not something new - the patches should be generated against branch they are targeted. > > > Some comments are below. > > > > > > > > Signed-off-by: Hal Rosenstock > > > > > > diff --git a/diags/include/grouping.h b/diags/include/grouping.h > > > index 4666935..3ba872c 100644 > > > --- a/diags/include/grouping.h > > > +++ b/diags/include/grouping.h > > > @@ -1,5 +1,6 @@ > > > /* > > > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. 
> > > * > > > * This software is available to you under a choice of one of two > > > * licenses. You may choose to be licensed under the terms of the GNU > > > @@ -104,4 +105,8 @@ char *get_chassis_type(unsigned char chassistype); > > > char *get_chassis_slot(unsigned char chassisslot); > > > uint64_t get_chassis_guid(unsigned char chassisnum); > > > > > > +int is_xsigo_guid(uint64_t guid); > > > +int is_xsigo_tca(uint64_t guid); > > > +int is_xsigo_hca(uint64_t guid); > > > + > > > #endif /* _GROUPING_H_ */ > > > diff --git a/diags/include/ibnetdiscover.h b/diags/include/ibnetdiscover.h > > > index d13a666..bfbe7f5 100644 > > > --- a/diags/include/ibnetdiscover.h > > > +++ b/diags/include/ibnetdiscover.h > > > @@ -1,5 +1,6 @@ > > > /* > > > * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. > > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > > * > > > * This software is available to you under a choice of one of two > > > * licenses. You may choose to be licensed under the terms of the GNU > > > @@ -44,6 +45,7 @@ > > > #define VTR_VENDOR_ID 0x8f1 /* Voltaire */ > > > #define TS_VENDOR_ID 0x5ad /* Cisco */ > > > #define SS_VENDOR_ID 0x66a /* InfiniCon */ > > > +#define XS_VENDOR_ID 0x1397 /* Xsigo */ > > > > > > > > > typedef struct Port Port; > > > diff --git a/diags/src/grouping.c b/diags/src/grouping.c > > > index 0e5bd78..6602f26 100644 > > > --- a/diags/src/grouping.c > > > +++ b/diags/src/grouping.c > > > @@ -1,5 +1,6 @@ > > > /* > > > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > > * > > > * This software is available to you under a choice of one of two > > > * licenses. 
You may choose to be licensed under the terms of the GNU > > > @@ -96,20 +97,91 @@ static uint64_t topspin_chassisguid(uint64_t guid) > > > return guid & 0xffffffff00ffffffULL; > > > } > > > > > > -static uint64_t get_chassisguid(uint64_t guid, uint32_t vendid) > > > +int is_xsigo_guid(uint64_t guid) > > > { > > > - if (vendid == TS_VENDOR_ID || vendid == SS_VENDOR_ID) > > > - return topspin_chassisguid(guid); > > > + if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) > > > + return 1; > > > else > > > - return guid; > > > + return 0; > > > +} > > > + > > > +static int is_xsigo_leafone(uint64_t guid) > > > +{ > > > + if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) > > > + return 1; > > > + else > > > + return 0; > > > +} > > > + > > > +int is_xsigo_hca(uint64_t guid) > > > +{ > > > + /* NodeType 2 is HCA */ > > > + if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) > > > + return 1; > > > + else > > > + return 0; > > > +} > > > + > > > +int is_xsigo_tca(uint64_t guid) > > > +{ > > > + /* NodeType 3 is TCA */ > > > + if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL) > > > + return 1; > > > + else > > > + return 0; > > > +} > > > + > > > +static int is_xsigo_ca(uint64_t guid) > > > +{ > > > + if (is_xsigo_hca(guid) || is_xsigo_tca(guid)) > > > + return 1; > > > + else > > > + return 0; > > > +} > > > + > > > +static int is_xsigo_switch(uint64_t guid) > > > +{ > > > + if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) > > > + return 1; > > > + else > > > + return 0; > > > +} > > > + > > > +static uint64_t xsigo_chassisguid(Node *node) > > > +{ > > > + if (!is_xsigo_ca(node->sysimgguid)) { > > > + /* Byte 3 is NodeType and byte 4 is PortType */ > > > + /* If NodeType is 1 (switch), PortType is masked */ > > > + if (is_xsigo_switch(node->sysimgguid)) > > > + return node->sysimgguid & 0xffffffff00ffffffULL; > > > + else > > > + return node->sysimgguid; > > > + } else { > > > + /* If peer port is Leaf 1, use its 
chassis GUID */ > > > + if (is_xsigo_leafone(node->ports->remoteport->node->sysimgguid)) > > > + return node->ports->remoteport->node->sysimgguid & > > > + 0xffffffff00ffffffULL; > > > + else > > > + return node->sysimgguid; > > > + } > > > } > > > > > > -static struct ChassisList *find_chassisguid(uint64_t guid, uint32_t vendid) > > > +static uint64_t get_chassisguid(Node *node) > > > +{ > > > + if (node->vendid == TS_VENDOR_ID || node->vendid == SS_VENDOR_ID) > > > + return topspin_chassisguid(node->sysimgguid); > > > + else if (node->vendid == XS_VENDOR_ID || is_xsigo_guid(node->sysimgguid)) > > > + return xsigo_chassisguid(node); > > > + else > > > + return node->sysimgguid; > > > +} > > > + > > > +static struct ChassisList *find_chassisguid(Node *node) > > > { > > > ChassisList *current; > > > uint64_t chguid; > > > > > > - chguid = get_chassisguid(guid, vendid); > > > + chguid = get_chassisguid(node); > > > for (current = mylist.first; current; current = current->next) { > > > if (current->chassisguid == chguid) > > > return current; > > > @@ -668,14 +740,13 @@ ChassisList *group_nodes() > > > if (node->vendid == VTR_VENDOR_ID) > > > continue; > > > if (node->sysimgguid) { > > > - chassis = find_chassisguid(node->sysimgguid, > > > - node->vendid); > > > + chassis = find_chassisguid(node); > > > if (chassis) > > > chassis->nodecount++; > > > else { > > > /* Possible new chassis */ > > > add_chassislist(); > > > - mylist.current->chassisguid = get_chassisguid(node->sysimgguid, node->vendid); > > > + mylist.current->chassisguid = get_chassisguid(node); > > > mylist.current->nodecount = 1; > > > } > > > } > > > @@ -684,13 +755,12 @@ ChassisList *group_nodes() > > > > > > /* now, make another pass to see which nodes are part of chassis */ > > > /* (defined as chassis->nodecount > 1) */ > > > - for (dist = 0; dist <= maxhops_discovered; dist++) { > > > + for (dist = 0; dist <= MAXHOPS; ) { > > > for (node = nodesdist[dist]; node; node = node->dnext) { > > > if 
(node->vendid == VTR_VENDOR_ID) > > > continue; > > > if (node->sysimgguid) { > > > - chassis = find_chassisguid(node->sysimgguid, > > > - node->vendid); > > > + chassis = find_chassisguid(node); > > > if (chassis && chassis->nodecount > 1) { > > > if (!chassis->chassisnum) > > > chassis->chassisnum = ++chassisnum; > > > @@ -702,6 +772,10 @@ ChassisList *group_nodes() > > > } > > > } > > > } > > > + if (dist == maxhops_discovered) > > > + dist = MAXHOPS; /* skip to CAs */ > > > + else > > > + dist++; > > > } > > > > > > return (mylist.first); > > > diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c > > > index cb62c44..2cff87e 100644 > > > --- a/diags/src/ibnetdiscover.c > > > +++ b/diags/src/ibnetdiscover.c > > > @@ -1,5 +1,6 @@ > > > /* > > > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > > * > > > * This software is available to you under a choice of one of two > > > * licenses. You may choose to be licensed under the terms of the GNU > > > @@ -450,14 +451,26 @@ list_node(Node *node) > > > } > > > > > > void > > > -out_ids(Node *node) > > > +out_ids(Node *node, int group, char *chname) > > > { > > > fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); > > > if (node->sysimgguid) > > > - fprintf(f, "sysimgguid=0x%" PRIx64 "\n", node->sysimgguid); > > > + fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); > > > + if (group) > > > + if (node->chrecord) > > > + if (node->chrecord->chassisnum) { > > > + fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); > > > + if (chname) > > > + fprintf(f, " (%s)", clean_nodedesc(chname)); > > > + if (is_xsigo_tca(node->nodeguid)) { > > > + if (node->ports->remoteport) > > > + fprintf(f, " slot %d", node->ports->remoteport->portnum); > > > + } > > > + } > > > + fprintf(f, "\n"); > > > } > > > > > > -void > > > +uint64_t > > > out_chassis(int chassisnum) > > > { > > > uint64_t guid; > > > @@ -467,20 
+480,20 @@ out_chassis(int chassisnum) > > > if (guid) > > > fprintf(f, " (guid 0x%" PRIx64 ")", guid); > > > fprintf(f, "\n"); > > > + return guid; > > > } > > > > > > void > > > -out_switch(Node *node, int group) > > > +out_switch(Node *node, int group, char *chname) > > > { > > > char *str; > > > char *nodename = NULL; > > > > > > - out_ids(node); > > > + out_ids(node, group, chname); > > > fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); > > > if (group) { > > > if (node->chrecord) { > > > if (node->chrecord->chassisnum) { > > > - fprintf(f, "\t\t# Chassis %d ", node->chrecord->chassisnum); > > > /* Currently, only if Voltaire chassis */ > > > if (node->vendid == VTR_VENDOR_ID) { > > > str = get_chassis_type(node->chrecord->chassistype); > > > @@ -510,12 +523,12 @@ out_switch(Node *node, int group) > > > } > > > > > > void > > > -out_ca(Node *node) > > > +out_ca(Node *node, int group, char *chname) > > > { > > > char *node_type; > > > char *node_type2; > > > > > > - out_ids(node); > > > + out_ids(node, group, chname); > > > switch(node->type) { > > > case CA_NODE: > > > node_type = "ca"; > > > @@ -532,9 +545,12 @@ out_ca(Node *node) > > > } > > > > > > fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); > > > - fprintf(f, "%s\t%d %s\t\t# \"%s\"\n", > > > + fprintf(f, "%s\t%d %s\t\t# \"%s\"", > > > node_type2, node->numports, node_name(node), > > > clean_nodedesc(node->nodedesc)); > > > + if (group && is_xsigo_hca(node->nodeguid)) > > > + fprintf(f, " (scp)"); > > > + fprintf(f, "\n"); > > > } > > > > > > static char * > > > @@ -572,12 +588,17 @@ out_switch_port(Port *port, int group) > > > rem_nodename = clean_nodedesc(port->remoteport->node->nodedesc); > > > > > > ext_port_str = out_ext_port(port->remoteport, group); > > > - fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d\n", > > > + fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d", > > > node_name(port->remoteport->node), > > > port->remoteport->portnum, > > > ext_port_str ? 
ext_port_str : "", > > > rem_nodename, > > > port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid); > > > + if (is_xsigo_tca(port->remoteport->portguid)) > > > + fprintf(f, " slot %d", port->portnum); > > > + else if (is_xsigo_hca(port->remoteport->portguid)) > > > + fprintf(f, " (scp)"); > > > + fprintf(f, "\n"); > > > > > > if (rem_nodename && (port->remoteport->node->type == SWITCH_NODE)) > > > free(rem_nodename); > > > @@ -616,6 +637,8 @@ dump_topology(int listtype, int group) > > > Port *port; > > > int i = 0, dist = 0; > > > time_t t = time(0); > > > + uint64_t chguid; > > > + char *chname = NULL; > > > > > > if (!listtype) { > > > fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); > > > @@ -633,11 +656,31 @@ dump_topology(int listtype, int group) > > > > > > if (!ch->chassisnum) > > > continue; > > > - out_chassis(ch->chassisnum); > > > + chguid = out_chassis(ch->chassisnum); > > > + chname = NULL; > > > + if (is_xsigo_guid(chguid)) { > > > + /* !!! */ > > > + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > > > + if (node->chrecord) { > > > + if (!node->chrecord->chassisnum) > > > + continue; > > > + } else > > > + continue; > > > + > > > + if (node->chrecord->chassisnum != ch->chassisnum) > > > + continue; > > > + > > > + if (is_xsigo_hca(node->nodeguid)) { > > > + chname = node->nodedesc; > > > + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); > > > + } > > > + } > > > + } > > > + > > > > Not sure I understand this code correctly, but is it Xsigo only? I mean > > where is_xsigo_hca() is used. > > Yes, this is specific to Xsigo. > > > Anyway why to not hide all this section inside out_chassis()? > > It looks like it could be done as you suggest but it is currently done > similar to other code slightly lower down which loop in a similar manner > (Chassis Switches, Chassis CAs). Ok. 
Sasha From sashak at voltaire.com Mon Sep 17 15:42:28 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 00:42:28 +0200 Subject: [ofa-general] Re: [PATCH] osm: QoS parser - checking allocation status In-Reply-To: <46EEFE70.4050600@dev.mellanox.co.il> References: <46EEFE70.4050600@dev.mellanox.co.il> Message-ID: <20070917224228.GF6891@sashak.voltaire.com> On 00:23 Tue 18 Sep , Yevgeny Kliteynik wrote: > Checking memory allocation status in qos > policy parser to prevent seg. fault. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From hrosenstock at xsigo.com Mon Sep 17 15:38:37 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 15:38:37 -0700 Subject: [ofa-general] Re: [PATCHv2] ibnetdiscover: Support Xsigo chassis grouping In-Reply-To: <20070917224000.GE6891@sashak.voltaire.com> References: <1189730194.6062.1.camel@hrosenstock-ws.xsigo.com> <20070917215147.GZ6891@sashak.voltaire.com> <1190066931.12099.65.camel@hrosenstock-ws.xsigo.com> <20070917224000.GE6891@sashak.voltaire.com> Message-ID: <1190068717.12099.84.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-09-18 at 00:40 +0200, Sasha Khapyorsky wrote: > On 15:08 Mon 17 Sep , Hal Rosenstock wrote: > > Hi Sasha, > > > > On Mon, 2007-09-17 at 23:51 +0200, Sasha Khapyorsky wrote: > > > Hi Hal, > > > > > > On 17:36 Thu 13 Sep , Hal Rosenstock wrote: > > > > ibnetdiscover: Support Xsigo chassis grouping > > > > > > > > I think this also fixes a bug with grouping of multiple non Voltaire > > > > chassis as well. > > > > > > Could you provide more details about this bug. > > > > I found it because the Xsigo grouping is similar to the non Voltaire > > grouping and tested a multiple chassis case which did not work. > > But what is the bug? The bug was that with multiple non Voltaire chassis, it would display the chassis numbers (and some other basic information) and then list all the switches not organized by chassis number. > > > Should this be a separate patch? 
> > > > Is this really needed ? I have no way of testing this independently of > > the (other) Xsigo changes. > > > > > > Note: this patch is against OFED 1.2 > > > > > > Hal, you know - the patches for master should be against master (I spent > > > some time). > > > > Thanks. As you know, we are working with OFED 1.2. > > But this patch targets master, not OFED 1.2. It is not something new - > the patches should be generated against branch they are targeted. I know; in the future, I will endeavor to take the time to up rev the changes to the master. -- Hal > > > Some comments are below. > > > > > > > > > > > Signed-off-by: Hal Rosenstock > > > > > > > > diff --git a/diags/include/grouping.h b/diags/include/grouping.h > > > > index 4666935..3ba872c 100644 > > > > --- a/diags/include/grouping.h > > > > +++ b/diags/include/grouping.h > > > > @@ -1,5 +1,6 @@ > > > > /* > > > > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > > > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > > > * > > > > * This software is available to you under a choice of one of two > > > > * licenses. You may choose to be licensed under the terms of the GNU > > > > @@ -104,4 +105,8 @@ char *get_chassis_type(unsigned char chassistype); > > > > char *get_chassis_slot(unsigned char chassisslot); > > > > uint64_t get_chassis_guid(unsigned char chassisnum); > > > > > > > > +int is_xsigo_guid(uint64_t guid); > > > > +int is_xsigo_tca(uint64_t guid); > > > > +int is_xsigo_hca(uint64_t guid); > > > > + > > > > #endif /* _GROUPING_H_ */ > > > > diff --git a/diags/include/ibnetdiscover.h b/diags/include/ibnetdiscover.h > > > > index d13a666..bfbe7f5 100644 > > > > --- a/diags/include/ibnetdiscover.h > > > > +++ b/diags/include/ibnetdiscover.h > > > > @@ -1,5 +1,6 @@ > > > > /* > > > > * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. > > > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. 
> > > > * > > > > * This software is available to you under a choice of one of two > > > > * licenses. You may choose to be licensed under the terms of the GNU > > > > @@ -44,6 +45,7 @@ > > > > #define VTR_VENDOR_ID 0x8f1 /* Voltaire */ > > > > #define TS_VENDOR_ID 0x5ad /* Cisco */ > > > > #define SS_VENDOR_ID 0x66a /* InfiniCon */ > > > > +#define XS_VENDOR_ID 0x1397 /* Xsigo */ > > > > > > > > > > > > typedef struct Port Port; > > > > diff --git a/diags/src/grouping.c b/diags/src/grouping.c > > > > index 0e5bd78..6602f26 100644 > > > > --- a/diags/src/grouping.c > > > > +++ b/diags/src/grouping.c > > > > @@ -1,5 +1,6 @@ > > > > /* > > > > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > > > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > > > * > > > > * This software is available to you under a choice of one of two > > > > * licenses. You may choose to be licensed under the terms of the GNU > > > > @@ -96,20 +97,91 @@ static uint64_t topspin_chassisguid(uint64_t guid) > > > > return guid & 0xffffffff00ffffffULL; > > > > } > > > > > > > > -static uint64_t get_chassisguid(uint64_t guid, uint32_t vendid) > > > > +int is_xsigo_guid(uint64_t guid) > > > > { > > > > - if (vendid == TS_VENDOR_ID || vendid == SS_VENDOR_ID) > > > > - return topspin_chassisguid(guid); > > > > + if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) > > > > + return 1; > > > > else > > > > - return guid; > > > > + return 0; > > > > +} > > > > + > > > > +static int is_xsigo_leafone(uint64_t guid) > > > > +{ > > > > + if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) > > > > + return 1; > > > > + else > > > > + return 0; > > > > +} > > > > + > > > > +int is_xsigo_hca(uint64_t guid) > > > > +{ > > > > + /* NodeType 2 is HCA */ > > > > + if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) > > > > + return 1; > > > > + else > > > > + return 0; > > > > +} > > > > + > > > > +int is_xsigo_tca(uint64_t guid) > > > > +{ > > > > + /* 
NodeType 3 is TCA */ > > > > + if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL) > > > > + return 1; > > > > + else > > > > + return 0; > > > > +} > > > > + > > > > +static int is_xsigo_ca(uint64_t guid) > > > > +{ > > > > + if (is_xsigo_hca(guid) || is_xsigo_tca(guid)) > > > > + return 1; > > > > + else > > > > + return 0; > > > > +} > > > > + > > > > +static int is_xsigo_switch(uint64_t guid) > > > > +{ > > > > + if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) > > > > + return 1; > > > > + else > > > > + return 0; > > > > +} > > > > + > > > > +static uint64_t xsigo_chassisguid(Node *node) > > > > +{ > > > > + if (!is_xsigo_ca(node->sysimgguid)) { > > > > + /* Byte 3 is NodeType and byte 4 is PortType */ > > > > + /* If NodeType is 1 (switch), PortType is masked */ > > > > + if (is_xsigo_switch(node->sysimgguid)) > > > > + return node->sysimgguid & 0xffffffff00ffffffULL; > > > > + else > > > > + return node->sysimgguid; > > > > + } else { > > > > + /* If peer port is Leaf 1, use its chassis GUID */ > > > > + if (is_xsigo_leafone(node->ports->remoteport->node->sysimgguid)) > > > > + return node->ports->remoteport->node->sysimgguid & > > > > + 0xffffffff00ffffffULL; > > > > + else > > > > + return node->sysimgguid; > > > > + } > > > > } > > > > > > > > -static struct ChassisList *find_chassisguid(uint64_t guid, uint32_t vendid) > > > > +static uint64_t get_chassisguid(Node *node) > > > > +{ > > > > + if (node->vendid == TS_VENDOR_ID || node->vendid == SS_VENDOR_ID) > > > > + return topspin_chassisguid(node->sysimgguid); > > > > + else if (node->vendid == XS_VENDOR_ID || is_xsigo_guid(node->sysimgguid)) > > > > + return xsigo_chassisguid(node); > > > > + else > > > > + return node->sysimgguid; > > > > +} > > > > + > > > > +static struct ChassisList *find_chassisguid(Node *node) > > > > { > > > > ChassisList *current; > > > > uint64_t chguid; > > > > > > > > - chguid = get_chassisguid(guid, vendid); > > > > + chguid = get_chassisguid(node); > > 
> > for (current = mylist.first; current; current = current->next) { > > > > if (current->chassisguid == chguid) > > > > return current; > > > > @@ -668,14 +740,13 @@ ChassisList *group_nodes() > > > > if (node->vendid == VTR_VENDOR_ID) > > > > continue; > > > > if (node->sysimgguid) { > > > > - chassis = find_chassisguid(node->sysimgguid, > > > > - node->vendid); > > > > + chassis = find_chassisguid(node); > > > > if (chassis) > > > > chassis->nodecount++; > > > > else { > > > > /* Possible new chassis */ > > > > add_chassislist(); > > > > - mylist.current->chassisguid = get_chassisguid(node->sysimgguid, node->vendid); > > > > + mylist.current->chassisguid = get_chassisguid(node); > > > > mylist.current->nodecount = 1; > > > > } > > > > } > > > > @@ -684,13 +755,12 @@ ChassisList *group_nodes() > > > > > > > > /* now, make another pass to see which nodes are part of chassis */ > > > > /* (defined as chassis->nodecount > 1) */ > > > > - for (dist = 0; dist <= maxhops_discovered; dist++) { > > > > + for (dist = 0; dist <= MAXHOPS; ) { > > > > for (node = nodesdist[dist]; node; node = node->dnext) { > > > > if (node->vendid == VTR_VENDOR_ID) > > > > continue; > > > > if (node->sysimgguid) { > > > > - chassis = find_chassisguid(node->sysimgguid, > > > > - node->vendid); > > > > + chassis = find_chassisguid(node); > > > > if (chassis && chassis->nodecount > 1) { > > > > if (!chassis->chassisnum) > > > > chassis->chassisnum = ++chassisnum; > > > > @@ -702,6 +772,10 @@ ChassisList *group_nodes() > > > > } > > > > } > > > > } > > > > + if (dist == maxhops_discovered) > > > > + dist = MAXHOPS; /* skip to CAs */ > > > > + else > > > > + dist++; > > > > } > > > > > > > > return (mylist.first); > > > > diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c > > > > index cb62c44..2cff87e 100644 > > > > --- a/diags/src/ibnetdiscover.c > > > > +++ b/diags/src/ibnetdiscover.c > > > > @@ -1,5 +1,6 @@ > > > > /* > > > > * Copyright (c) 2004-2007 Voltaire Inc. 
All rights reserved. > > > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > > > * > > > > * This software is available to you under a choice of one of two > > > > * licenses. You may choose to be licensed under the terms of the GNU > > > > @@ -450,14 +451,26 @@ list_node(Node *node) > > > > } > > > > > > > > void > > > > -out_ids(Node *node) > > > > +out_ids(Node *node, int group, char *chname) > > > > { > > > > fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); > > > > if (node->sysimgguid) > > > > - fprintf(f, "sysimgguid=0x%" PRIx64 "\n", node->sysimgguid); > > > > + fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); > > > > + if (group) > > > > + if (node->chrecord) > > > > + if (node->chrecord->chassisnum) { > > > > + fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); > > > > + if (chname) > > > > + fprintf(f, " (%s)", clean_nodedesc(chname)); > > > > + if (is_xsigo_tca(node->nodeguid)) { > > > > + if (node->ports->remoteport) > > > > + fprintf(f, " slot %d", node->ports->remoteport->portnum); > > > > + } > > > > + } > > > > + fprintf(f, "\n"); > > > > } > > > > > > > > -void > > > > +uint64_t > > > > out_chassis(int chassisnum) > > > > { > > > > uint64_t guid; > > > > @@ -467,20 +480,20 @@ out_chassis(int chassisnum) > > > > if (guid) > > > > fprintf(f, " (guid 0x%" PRIx64 ")", guid); > > > > fprintf(f, "\n"); > > > > + return guid; > > > > } > > > > > > > > void > > > > -out_switch(Node *node, int group) > > > > +out_switch(Node *node, int group, char *chname) > > > > { > > > > char *str; > > > > char *nodename = NULL; > > > > > > > > - out_ids(node); > > > > + out_ids(node, group, chname); > > > > fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); > > > > if (group) { > > > > if (node->chrecord) { > > > > if (node->chrecord->chassisnum) { > > > > - fprintf(f, "\t\t# Chassis %d ", node->chrecord->chassisnum); > > > > /* Currently, only if Voltaire chassis */ > > > > if (node->vendid == VTR_VENDOR_ID) { > 
> > > str = get_chassis_type(node->chrecord->chassistype); > > > > @@ -510,12 +523,12 @@ out_switch(Node *node, int group) > > > > } > > > > > > > > void > > > > -out_ca(Node *node) > > > > +out_ca(Node *node, int group, char *chname) > > > > { > > > > char *node_type; > > > > char *node_type2; > > > > > > > > - out_ids(node); > > > > + out_ids(node, group, chname); > > > > switch(node->type) { > > > > case CA_NODE: > > > > node_type = "ca"; > > > > @@ -532,9 +545,12 @@ out_ca(Node *node) > > > > } > > > > > > > > fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); > > > > - fprintf(f, "%s\t%d %s\t\t# \"%s\"\n", > > > > + fprintf(f, "%s\t%d %s\t\t# \"%s\"", > > > > node_type2, node->numports, node_name(node), > > > > clean_nodedesc(node->nodedesc)); > > > > + if (group && is_xsigo_hca(node->nodeguid)) > > > > + fprintf(f, " (scp)"); > > > > + fprintf(f, "\n"); > > > > } > > > > > > > > static char * > > > > @@ -572,12 +588,17 @@ out_switch_port(Port *port, int group) > > > > rem_nodename = clean_nodedesc(port->remoteport->node->nodedesc); > > > > > > > > ext_port_str = out_ext_port(port->remoteport, group); > > > > - fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d\n", > > > > + fprintf(f, "\t%s[%d]%s\t\t# \"%s\" lid %d", > > > > node_name(port->remoteport->node), > > > > port->remoteport->portnum, > > > > ext_port_str ? ext_port_str : "", > > > > rem_nodename, > > > > port->remoteport->node->type == SWITCH_NODE ? 
port->remoteport->node->smalid : port->remoteport->lid); > > > > + if (is_xsigo_tca(port->remoteport->portguid)) > > > > + fprintf(f, " slot %d", port->portnum); > > > > + else if (is_xsigo_hca(port->remoteport->portguid)) > > > > + fprintf(f, " (scp)"); > > > > + fprintf(f, "\n"); > > > > > > > > if (rem_nodename && (port->remoteport->node->type == SWITCH_NODE)) > > > > free(rem_nodename); > > > > @@ -616,6 +637,8 @@ dump_topology(int listtype, int group) > > > > Port *port; > > > > int i = 0, dist = 0; > > > > time_t t = time(0); > > > > + uint64_t chguid; > > > > + char *chname = NULL; > > > > > > > > if (!listtype) { > > > > fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); > > > > @@ -633,11 +656,31 @@ dump_topology(int listtype, int group) > > > > > > > > if (!ch->chassisnum) > > > > continue; > > > > - out_chassis(ch->chassisnum); > > > > + chguid = out_chassis(ch->chassisnum); > > > > + chname = NULL; > > > > + if (is_xsigo_guid(chguid)) { > > > > + /* !!! */ > > > > + for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > > > > + if (node->chrecord) { > > > > + if (!node->chrecord->chassisnum) > > > > + continue; > > > > + } else > > > > + continue; > > > > + > > > > + if (node->chrecord->chassisnum != ch->chassisnum) > > > > + continue; > > > > + > > > > + if (is_xsigo_hca(node->nodeguid)) { > > > > + chname = node->nodedesc; > > > > + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); > > > > + } > > > > + } > > > > + } > > > > + > > > > > > Not sure I understand this code correctly, but is it Xsigo only? I mean > > > where is_xsigo_hca() is used. > > > > Yes, this is specific to Xsigo. > > > > > Anyway why to not hide all this section inside out_chassis()? > > > > It looks like it could be done as you suggest but it is currently done > > similar to other code slightly lower down which loop in a similar manner > > (Chassis Switches, Chassis CAs). > > Ok. 
> > Sasha From sashak at voltaire.com Mon Sep 17 16:29:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 01:29:51 +0200 Subject: [ofa-general] [PATCH] OpenSM/console: Support loopback in -console option In-Reply-To: <1189533839.11745.9.camel@hrosenstock-ws.xsigo.com> References: <1189533839.11745.9.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070917232951.GG6891@sashak.voltaire.com> On 11:03 Tue 11 Sep , Hal Rosenstock wrote: > OpenSM/(osm_console main).c: Support loopback option to -console for > local only telnet support > > Note: Patch is based on OFED 1.2 > > Signed-off-by: Hal Rosenstock Applied. Thanks. Please next time generate patches against master (unless it is for OFED 1.x) Sasha From sashak at voltaire.com Mon Sep 17 16:31:10 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 01:31:10 +0200 Subject: [ofa-general] Re: [PATCHv2] ibnetdiscover: Support Xsigo chassis grouping In-Reply-To: <1190068717.12099.84.camel@hrosenstock-ws.xsigo.com> References: <1189730194.6062.1.camel@hrosenstock-ws.xsigo.com> <20070917215147.GZ6891@sashak.voltaire.com> <1190066931.12099.65.camel@hrosenstock-ws.xsigo.com> <20070917224000.GE6891@sashak.voltaire.com> <1190068717.12099.84.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070917233110.GH6891@sashak.voltaire.com> On 15:38 Mon 17 Sep , Hal Rosenstock wrote: > On Tue, 2007-09-18 at 00:40 +0200, Sasha Khapyorsky wrote: > > On 15:08 Mon 17 Sep , Hal Rosenstock wrote: > > > Hi Sasha, > > > > > > On Mon, 2007-09-17 at 23:51 +0200, Sasha Khapyorsky wrote: > > > > Hi Hal, > > > > > > > > On 17:36 Thu 13 Sep , Hal Rosenstock wrote: > > > > > ibnetdiscover: Support Xsigo chassis grouping > > > > > > > > > > I think this also
fixes a bug with grouping of multiple non Voltaire > > > > > chassis as well. > > > > > > > > Could you provide more details about this bug. > > > > > > I found it because the Xsigo grouping is similar to the non Voltaire > > > grouping and tested a multiple chassis case which did not work. > > > > But what the bug is? > > The bug was that with multiple non Voltaire chassis, it would display > the chassis numbers (and some other basic information) and then list all > the switches not organized by chassis number. > > > > > Should this be a separate patch? > > > > > > Is this really needed ? I have no way of testing this independently of > > > the (other) Xsigo changes. > > > > > > > > Note: this patch is against OFED 1.2 > > > > > > > > Hal, you know - the patches for master should be against master (I spent > > > > some time). > > > > > > Thanks. As you know, we are working with OFED 1.2. > > > > But this patch targets master, not OFED 1.2. It is not something new - > > the patches should be generated against branch they are targeted. > > I know; in the future, I will endeavor to take the time to up rev the > changes to the master. Thanks. Sasha From sashak at voltaire.com Mon Sep 17 16:31:30 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 01:31:30 +0200 Subject: [ofa-general] Re: [PATCHv2] ibnetdiscover: Support Xsigo chassis grouping In-Reply-To: <1189730194.6062.1.camel@hrosenstock-ws.xsigo.com> References: <1189730194.6062.1.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070917233130.GI6891@sashak.voltaire.com> On 17:36 Thu 13 Sep , Hal Rosenstock wrote: > ibnetdiscover: Support Xsigo chassis grouping > > I think this also fixes a bug with grouping of multiple non Voltaire > chassis as well. > > Note: this patch is against OFED 1.2 > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
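For reference, the Xsigo grouping in this patch rests on masked comparisons of the system image GUID. A minimal standalone sketch follows, using only the switch and TCA prefixes and the PortType mask that appear in the quoted diff. The macro names, switch_chassis_guid() and xsigo_self_check() are hypothetical and not part of ibnetdiscover, and the HCA prefix is left out because it is not shown in this part of the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits of the GUID: the 0x001397 prefix plus the NodeType byte (byte 3). */
#define XSIGO_PREFIX_MASK   0xffffffff00000000ULL
#define XSIGO_SWITCH_PREFIX 0x0013970100000000ULL /* NodeType 1 is switch */
#define XSIGO_TCA_PREFIX    0x0013970300000000ULL /* NodeType 3 is TCA */
/* Clears byte 4 (PortType), as the patch does for switch GUIDs. */
#define XSIGO_PORTTYPE_MASK 0xffffffff00ffffffULL

int is_xsigo_switch(uint64_t guid)
{
	return (guid & XSIGO_PREFIX_MASK) == XSIGO_SWITCH_PREFIX;
}

int is_xsigo_tca(uint64_t guid)
{
	return (guid & XSIGO_PREFIX_MASK) == XSIGO_TCA_PREFIX;
}

/*
 * Hypothetical helper mirroring only the non-CA branch of
 * xsigo_chassisguid(): a switch GUID has its PortType byte masked,
 * any other GUID is returned unchanged.
 */
uint64_t switch_chassis_guid(uint64_t sysimgguid)
{
	if (is_xsigo_switch(sysimgguid))
		return sysimgguid & XSIGO_PORTTYPE_MASK;
	return sysimgguid;
}

/* Usage example with made-up GUID values. */
void xsigo_self_check(void)
{
	assert(is_xsigo_switch(0x00139701AB234567ULL));
	assert(!is_xsigo_tca(0x00139701AB234567ULL));
	assert(switch_chassis_guid(0x00139701AB234567ULL) == 0x0013970100234567ULL);
}
```

Masking the PortType byte is what lets switch GUIDs that differ only in that byte compare equal, so they can be grouped under one chassis GUID.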
Sasha From fubar at us.ibm.com Mon Sep 17 16:23:58 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Mon, 17 Sep 2007 16:23:58 -0700 Subject: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> Message-ID: <18593.1190071438@death> Roland Dreier wrote: >Actually, thinking about this some more... would it be cleaner to move >the knowledge about bonding out of the ipoib driver? in other words, >export something similar to > > > +static int ipoib_slave_detach(struct net_device *dev) > > +{ > > + int ret = 0; > > + if (dev->flags & IFF_SLAVE) { > > + dev->priv_flags |= IFF_SLAVE_DETACH; > > + rtnl_lock(); > > + ret = call_netdevice_notifiers(NETDEV_CHANGE, dev); > > + rtnl_unlock(); > > + } > > + return ret; > > +} > >for drivers to use, rather than putting use of IFF_SLAVE and >IFF_SLAVE_DETACH outside of the bonding driver. Conceptually, I see your point and I'm ok with doing it either way. My only question is, would this change would make the ipoib module dependent upon having the bonding module loaded (to resolve all of the symbols)? -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com From rdreier at cisco.com Mon Sep 17 16:33:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 17 Sep 2007 16:33:39 -0700 Subject: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: <18593.1190071438@death> (Jay Vosburgh's message of "Mon, 17 Sep 2007 16:23:58 -0700") References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <18593.1190071438@death> Message-ID: > Conceptually, I see your point and I'm ok with doing it either > way.
My only question is, would this change would make the ipoib module > dependent upon having the bonding module loaded (to resolve all of the > symbols)? Yes, I guess so, if that function is in bonding. Hmm, that wouldn't be a good change. Maybe this new notification function should be in net/core/dev.c instead of exporting call_netdevice_notifiers()? - R.
From john.blackwood at ccur.com Mon Sep 17 16:41:22 2007 From: john.blackwood at ccur.com (John Blackwood) Date: Mon, 17 Sep 2007 19:41:22 -0400 Subject: [ofa-general] [PATCH] [WORKAROUND] CONFIG_PREEMPT_RT and ib_umad_close() issue In-Reply-To: References: <46EEB715.7060509@ccur.com> Message-ID: <46EF10A2.40905@ccur.com> Roland Dreier wrote: > Thanks for the explanation... > > > But basically, with CONFIG_PREEMPT_RT enabled, the lock points, such as > > aqcuiring a spinlock, potentially become places where the current task > > may be context switched out / preempted. > > > > Therefore, when a call is made to lock a spinlock for example, the > > caller should not currently have irqs disabled, or preemption disabled, > > since a context switch may occur. > > this doesn't seem relevant here... Hi Roland, right. just some background info. > > void fastcall rt_downgrade_write(struct rw_semaphore *rwsem) > > { > > BUG(); > > } > > this seems to be the problem... the -rt patch turns downgrade_write() > into a BUG().
> > I need to look at the locking in user_mad.c again, but I think it may > be possible to replace both places that do downgrade_write() with > up_write() followed by down_read(). > > - R. that sounds like it would be a good solution for both preempt rt and non-preempt rt kernels. thanks again for looking at this for us. From hrosenstock at xsigo.com Mon Sep 17 16:59:58 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 16:59:58 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM/main.c: Fix compile error Message-ID: <1190073598.12099.97.camel@hrosenstock-ws.xsigo.com> OpenSM/main.c: Fix compile error Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 9597cf1..0005531 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -1043,7 +1043,7 @@ int main(int argc, char *argv[]) if (strcmp(opt.console, "local") == 0 #ifdef ENABLE_OSM_CONSOLE_SOCKET || strcmp(opt.console, "socket") == 0 - || strcmp(opt.console, "loopback") = 0 + || strcmp(opt.console, "loopback") == 0 #endif ) osm_console(&osm); From sashak at voltaire.com Mon Sep 17 17:58:20 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 02:58:20 +0200 Subject: [ofa-general] [PATCH] OpenSM: Improve QP0 and QP1 counter accounting In-Reply-To: <1189533856.11745.10.camel@hrosenstock-ws.xsigo.com> References: <1189533856.11745.10.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070918005820.GJ6891@sashak.voltaire.com> Hi Hal, On 11:04 Tue 11 Sep , Hal Rosenstock wrote: > OpenSM: Improve QP0 and QP1 counter accounting > > Note: Patch is based on OFED 1.2 Well, you know :) The question is below. 
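The accounting pattern at the centre of this patch, as it appears in the quoted osm_sa_vendor_send() hunk further down (count the MAD as sent, attempt the send, undo the count if the send fails), can be sketched on its own roughly as follows. This is only a sketch under stated assumptions: it uses C11 atomics where the patch uses complib's cl_atomic_inc() and cl_atomic_dec(), and stats_t, send_fn, counted_send(), send_ok() and send_fail() are hypothetical stand-ins, not OpenSM names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for osm_stats_t; only the counter used here. */
typedef struct {
	atomic_int qp1_mads_sent;
} stats_t;

/* Hypothetical stand-in for osm_vendor_send(); returns 0 on success. */
typedef int (*send_fn)(void *mad);

/*
 * Count the MAD as sent before handing it to the transport, and roll
 * the counter back if the send reports failure.
 */
int counted_send(stats_t *stats, send_fn send, void *mad)
{
	int status;

	assert(stats != NULL && send != NULL);

	atomic_fetch_add(&stats->qp1_mads_sent, 1);
	status = send(mad);
	if (status != 0)
		atomic_fetch_sub(&stats->qp1_mads_sent, 1);
	return status;
}

/* Fake transports for exercising both paths. */
int send_ok(void *mad)   { (void)mad; return 0; }
int send_fail(void *mad) { (void)mad; return -1; }
```

Counting before the send means the counter can briefly overstate the number of MADs sent when a send fails, but a reader never observes a completed send that has not yet been counted.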
> > Signed-off-by: Hal Rosenstock > > diff --git a/osm/include/opensm/osm_sa.h b/osm/include/opensm/osm_sa.h > index ea60341..eced96b 100644 > --- a/osm/include/opensm/osm_sa.h > +++ b/osm/include/opensm/osm_sa.h > @@ -209,6 +209,7 @@ typedef struct _osm_sa > * FIELDS > * state > * State of this SA object > +* > * p_subn > * Pointer to the Subnet object for this subnet. > * > @@ -448,6 +449,22 @@ osm_sa_bind( > * SEE ALSO > *********/ > > +/****f* OpenSM: SA/osm_sa_vendor_send > +* NAME > +* osm_sa_vendor_send > +* > +* DESCRIPTION > +* Sends SA MAD via osm_vendor_call and maintains the QP1 sent statistic > +* > +* SYNOPSIS > +*/ > +ib_api_status_t > +osm_sa_vendor_send( > + IN osm_bind_handle_t h_bind, > + IN osm_madw_t* const p_madw, > + IN boolean_t const resp_expected, > + IN osm_subn_t* const p_subn ); > + > struct _osm_opensm_t; > /****f* OpenSM: SA/osm_sa_db_file_dump > * NAME > diff --git a/osm/include/opensm/osm_sa_guidinfo_record.h b/osm/include/opensm/osm_sa_guidinfo_record.h > index 5c23cf9..d3cb23d 100644 > --- a/osm/include/opensm/osm_sa_guidinfo_record.h > +++ b/osm/include/opensm/osm_sa_guidinfo_record.h > @@ -98,7 +98,7 @@ BEGIN_C_DECLS > */ > typedef struct _osm_gir_rcv > { > - const osm_subn_t *p_subn; > + osm_subn_t *p_subn; > osm_sa_resp_t *p_resp; > osm_mad_pool_t *p_mad_pool; > osm_log_t *p_log; > @@ -209,7 +209,7 @@ osm_gir_rcv_init( > IN osm_gir_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ); > /* > diff --git a/osm/include/opensm/osm_sa_node_record.h b/osm/include/opensm/osm_sa_node_record.h > index c0e8988..0ee8ae1 100644 > --- a/osm/include/opensm/osm_sa_node_record.h > +++ b/osm/include/opensm/osm_sa_node_record.h > @@ -99,7 +99,7 @@ BEGIN_C_DECLS > */ > typedef struct _osm_nr_recv > { > - const osm_subn_t *p_subn; > + osm_subn_t *p_subn; > osm_sa_resp_t 
*p_resp; > osm_mad_pool_t *p_mad_pool; > osm_log_t *p_log; > @@ -206,7 +206,7 @@ ib_api_status_t osm_nr_rcv_init( > IN osm_nr_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ); > /* > diff --git a/osm/include/opensm/osm_sa_pkey_record.h b/osm/include/opensm/osm_sa_pkey_record.h > index aceab9a..08b7fee 100644 > --- a/osm/include/opensm/osm_sa_pkey_record.h > +++ b/osm/include/opensm/osm_sa_pkey_record.h > @@ -87,7 +87,7 @@ BEGIN_C_DECLS > */ > typedef struct _osm_pkey_rec_rcv > { > - const osm_subn_t* p_subn; > + osm_subn_t* p_subn; > osm_sa_resp_t* p_resp; > osm_mad_pool_t* p_mad_pool; > osm_log_t* p_log; > @@ -198,7 +198,7 @@ osm_pkey_rec_rcv_init( > IN osm_pkey_rec_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ); > /* > diff --git a/osm/include/opensm/osm_sa_response.h b/osm/include/opensm/osm_sa_response.h > index b9e84d1..d883c3b 100644 > --- a/osm/include/opensm/osm_sa_response.h > +++ b/osm/include/opensm/osm_sa_response.h > @@ -52,6 +52,7 @@ > #include > #include > #include > +#include > > #ifdef __cplusplus > # define BEGIN_C_DECLS extern "C" { > @@ -97,6 +98,7 @@ BEGIN_C_DECLS > typedef struct _osm_sa_resp > { > osm_mad_pool_t *p_pool; > + osm_subn_t *p_subn; > osm_log_t *p_log; > } osm_sa_resp_t; > /* > @@ -186,6 +188,7 @@ ib_api_status_t > osm_sa_resp_init( > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_pool, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log ); > /* > * PARAMETERS > @@ -195,8 +198,8 @@ osm_sa_resp_init( > * p_mad_pool > * [in] Pointer to the MAD pool. > * > -* p_vl15 > -* [in] Pointer to the VL15 interface. 
> +* p_subn > +* [in] Pointer to Subnet object for this subnet. > * > * p_log > * [in] Pointer to the log object. > diff --git a/osm/include/opensm/osm_sa_slvl_record.h b/osm/include/opensm/osm_sa_slvl_record.h > index a5ce9b4..fabd133 100644 > --- a/osm/include/opensm/osm_sa_slvl_record.h > +++ b/osm/include/opensm/osm_sa_slvl_record.h > @@ -100,7 +100,7 @@ BEGIN_C_DECLS > */ > typedef struct _osm_slvl_rec_rcv > { > - const osm_subn_t *p_subn; > + osm_subn_t *p_subn; > osm_sa_resp_t *p_resp; > osm_mad_pool_t *p_mad_pool; > osm_log_t *p_log; > @@ -211,7 +211,7 @@ osm_slvl_rec_rcv_init( > IN osm_slvl_rec_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ); > /* > diff --git a/osm/include/opensm/osm_sa_vlarb_record.h b/osm/include/opensm/osm_sa_vlarb_record.h > index 4aad76f..9796483 100644 > --- a/osm/include/opensm/osm_sa_vlarb_record.h > +++ b/osm/include/opensm/osm_sa_vlarb_record.h > @@ -100,7 +100,7 @@ BEGIN_C_DECLS > */ > typedef struct _osm_vlarb_rec_rcv > { > - const osm_subn_t *p_subn; > + osm_subn_t *p_subn; > osm_sa_resp_t *p_resp; > osm_mad_pool_t *p_mad_pool; > osm_log_t *p_log; > @@ -211,7 +211,7 @@ osm_vlarb_rec_rcv_init( > IN osm_vlarb_rec_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ); > /* > diff --git a/osm/include/opensm/osm_stats.h b/osm/include/opensm/osm_stats.h > index 5cffc00..15bc8e0 100644 > --- a/osm/include/opensm/osm_stats.h > +++ b/osm/include/opensm/osm_stats.h > @@ -90,9 +90,12 @@ typedef struct _osm_stats > atomic32_t qp0_mads_rcvd; > atomic32_t qp0_mads_sent; > atomic32_t qp0_unicasts_sent; > + atomic32_t qp0_mads_rcvd_unknown; > atomic32_t qp1_mads_outstanding; > atomic32_t 
qp1_mads_rcvd; > atomic32_t qp1_mads_sent; > + atomic32_t qp1_mads_rcvd_unknown; > + atomic32_t qp1_mads_ignored; > > } osm_stats_t; > /* > @@ -117,6 +120,27 @@ typedef struct _osm_stats > * Total number of response-less MADs sent on the wire. This count > * includes getresp(), send() and trap() methods. > * > +* qp0_mads_rcvd_unknown > +* Total number of unknown QP0 MADs received. This includes > +* unrecognized attribute IDs and methods. > +* > +* qp1_mads_outstanding > +* Contains the number of MADs outstanding on QP1. > +* > +* qp1_mads_rcvd > +* Total number of QP1 MADs received. > +* > +* qp1_mads_sent > +* Total number of QP1 MADs sent. > +* > +* qp1_mads_rcvd_unknown > +* Total number of unknown QP1 MADs received. This includes > +* unrecognized attribute IDs and methods. > +* > +* qp1_mads_ignored > +* Total number of QP1 MADs received because SM is not > +* master or SM is in first time sweep. > +* > * SEE ALSO > ***************/ > > diff --git a/osm/include/opensm/osm_version.h b/osm/include/opensm/osm_version.h > index ef91e16..6d2c8ee 100644 > --- a/osm/include/opensm/osm_version.h > +++ b/osm/include/opensm/osm_version.h > @@ -55,7 +55,7 @@ BEGIN_C_DECLS > * > * SYNOPSIS > */ > -#define OSM_VERSION "OpenSM Rev:openib-3.0.14-xsigo2" > +#define OSM_VERSION "OpenSM Rev:openib-3.0.14-xsigo3" > /********/ > > END_C_DECLS > diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c > index 5575425..7acfdf1 100644 > --- a/osm/opensm/osm_console.c > +++ b/osm/opensm/osm_console.c > @@ -336,23 +336,29 @@ static void print_status(osm_opensm_t *p_osm, FILE *out) > p_osm->routing_engine.name ? 
p_osm->routing_engine.name : "null (min-hop)"); > fprintf(out, "\n MAD stats\n" > " ---------\n" > - " QP0 MADS outstanding : %d\n" > - " QP0 MADS outstanding (on wire) : %d\n" > - " QP0 MADS rcvd : %d\n" > - " QP0 MADS sent : %d\n" > + " QP0 MADs outstanding : %d\n" > + " QP0 MADs outstanding (on wire) : %d\n" > + " QP0 MADs rcvd : %d\n" > + " QP0 MADs sent : %d\n" > " QP0 unicasts sent : %d\n" > - " QP1 MADS outstanding : %d\n" > - " QP1 MADS rcvd : %d\n" > - " QP1 MADS sent : %d\n" > + " QP0 unknown MADs rcvd : %d\n" > + " QP1 MADs outstanding : %d\n" > + " QP1 MADs rcvd : %d\n" > + " QP1 MADs sent : %d\n" > + " QP1 unknown MADs rcvd : %d\n" > + " QP1 MADs ignored : %d\n" > , > p_osm->stats.qp0_mads_outstanding, > p_osm->stats.qp0_mads_outstanding_on_wire, > p_osm->stats.qp0_mads_rcvd, > p_osm->stats.qp0_mads_sent, > p_osm->stats.qp0_unicasts_sent, > + p_osm->stats.qp0_mads_rcvd_unknown, > p_osm->stats.qp1_mads_outstanding, > p_osm->stats.qp1_mads_rcvd, > - p_osm->stats.qp1_mads_sent > + p_osm->stats.qp1_mads_sent, > + p_osm->stats.qp1_mads_rcvd_unknown, > + p_osm->stats.qp1_mads_ignored > ); > fprintf(out, "\n Subnet flags\n" > " ------------\n" > diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c > index f91fa49..e1e1dec 100644 > --- a/osm/opensm/osm_inform.c > +++ b/osm/opensm/osm_inform.c > @@ -57,6 +57,7 @@ > #include > #include > #include > +#include > > typedef struct _osm_infr_match_ctxt > { > @@ -442,7 +443,8 @@ __osm_send_report( > *p_report_ntc = *p_ntc; > > /* The TRUE is for: response is expected */ > - status = osm_vendor_send( p_report_madw->h_bind, p_report_madw, TRUE ); > + status = osm_sa_vendor_send( p_report_madw->h_bind, p_report_madw, TRUE, > + p_infr_rec->p_infr_rcv->p_subn ); > if ( status != IB_SUCCESS ) > { > osm_log( p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c > index d856fb0..f10ed60 100644 > --- a/osm/opensm/osm_lid_mgr.c > +++ b/osm/opensm/osm_lid_mgr.c > @@ -1163,15 +1163,19 
@@ __osm_lid_mgr_set_physp_pi( > if ( (mtu != ib_port_info_get_neighbor_mtu(p_old_pi)) || > (op_vls != ib_port_info_get_op_vls(p_old_pi))) > { > - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > +#if 0 > + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_ERROR ) ) > { > - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > +#endif > + osm_log( p_mgr->p_log, OSM_LOG_ERROR, > "__osm_lid_mgr_set_physp_pi: " > - "Sending Link Down due to op_vls or mtu change. MTU:%u,%u VL_CAP:%u,%u\n", > + "Setting Link Down due to op_vls or mtu change. MTU:%u,%u VL_CAP:%u,%u\n", > mtu, ib_port_info_get_neighbor_mtu(p_old_pi), > op_vls, ib_port_info_get_op_vls(p_old_pi) > ); > +#if 0 > } > +#endif Why those #if 0? Should it be here? Sasha > > /* > we need to make sure the internal DB will follow the fact the remote > diff --git a/osm/opensm/osm_sa.c b/osm/opensm/osm_sa.c > index 6d68ed2..360ad70 100644 > --- a/osm/opensm/osm_sa.c > +++ b/osm/opensm/osm_sa.c > @@ -69,6 +69,7 @@ > #include > #include > #include > +#include > > #define OSM_SA_INITIAL_TID_VALUE 0xabc > > @@ -202,6 +203,7 @@ osm_sa_init( > > status = osm_sa_resp_init(&p_sa->resp, > p_sa->p_mad_pool, > + p_subn, > p_log); > if( status != IB_SUCCESS ) > goto Exit; > @@ -519,6 +521,22 @@ osm_sa_bind( > return( status ); > } > > +ib_api_status_t > +osm_sa_vendor_send( > + IN osm_bind_handle_t h_bind, > + IN osm_madw_t* const p_madw, > + IN boolean_t const resp_expected, > + IN osm_subn_t* const p_subn ) > +{ > + ib_api_status_t status; > + > + cl_atomic_inc( &p_subn->p_osm->stats.qp1_mads_sent ); > + status = osm_vendor_send( h_bind, p_madw, resp_expected ); > + if ( status != IB_SUCCESS ) > + cl_atomic_dec( &p_subn->p_osm->stats.qp1_mads_sent ); > + return status; > +} > + > /********************************************************************** > **********************************************************************/ > /* > diff --git a/osm/opensm/osm_sa_class_port_info.c b/osm/opensm/osm_sa_class_port_info.c > index 
da107ee..9ee434a 100644 > --- a/osm/opensm/osm_sa_class_port_info.c > +++ b/osm/opensm/osm_sa_class_port_info.c > @@ -60,6 +60,7 @@ > #include > #include > #include > +#include > > #define MAX_MSECS_TO_RTV 24 > /* Precalculated table in msec (index is related to encoded value) */ > @@ -223,7 +224,8 @@ __osm_cpi_rcv_respond( > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_FRAMES ) ) > osm_dump_sa_mad( p_rcv->p_log, p_resp_sa_mad, OSM_LOG_FRAMES ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if( status != IB_SUCCESS ) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_guidinfo_record.c b/osm/opensm/osm_sa_guidinfo_record.c > index 10fac3c..fe85eff 100644 > --- a/osm/opensm/osm_sa_guidinfo_record.c > +++ b/osm/opensm/osm_sa_guidinfo_record.c > @@ -33,7 +33,6 @@ > * > */ > > - > /* > * Abstract: > * Implementation of osm_gir_rcv_t. > @@ -61,6 +60,7 @@ > #include > #include > #include > +#include > > #define OSM_GIR_RCV_POOL_MIN_SIZE 32 > #define OSM_GIR_RCV_POOL_GROW_SIZE 32 > @@ -108,7 +108,7 @@ osm_gir_rcv_init( > IN osm_gir_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ) > { > @@ -595,7 +595,8 @@ osm_gir_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if (status != IB_SUCCESS) > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c > index 340a7f1..dc999b3 100644 > --- a/osm/opensm/osm_sa_informinfo.c > +++ b/osm/opensm/osm_sa_informinfo.c > @@ -339,7 +339,8 @@ __osm_infr_rcv_respond( > > 
p_resp_infr = (ib_inform_info_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > > if ( status != IB_SUCCESS ) > { > @@ -647,7 +648,8 @@ osm_infr_rcv_process_get_method( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if (status != IB_SUCCESS) > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_lft_record.c b/osm/opensm/osm_sa_lft_record.c > index b6333e7..ed989a0 100644 > --- a/osm/opensm/osm_sa_lft_record.c > +++ b/osm/opensm/osm_sa_lft_record.c > @@ -58,6 +58,7 @@ > #include > #include > #include > +#include > > #define OSM_LFTR_RCV_POOL_MIN_SIZE 32 > #define OSM_LFTR_RCV_POOL_GROW_SIZE 32 > @@ -502,7 +503,8 @@ osm_lftr_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if (status != IB_SUCCESS) > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c > index 169e75e..058b6b2 100644 > --- a/osm/opensm/osm_sa_link_record.c > +++ b/osm/opensm/osm_sa_link_record.c > @@ -60,6 +60,7 @@ > #include > #include > #include > +#include > > #define OSM_LR_RCV_POOL_MIN_SIZE 64 > #define OSM_LR_RCV_POOL_GROW_SIZE 64 > @@ -679,7 +680,8 @@ __osm_lr_rcv_respond( > } > } > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if (status != IB_SUCCESS) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_mad_ctrl.c 
b/osm/opensm/osm_sa_mad_ctrl.c > index d6518e4..579e8f1 100644 > --- a/osm/opensm/osm_sa_mad_ctrl.c > +++ b/osm/opensm/osm_sa_mad_ctrl.c > @@ -269,6 +269,7 @@ __osm_sa_mad_ctrl_process( > There is an unknown MAD attribute type for which there is > no recipient. Simply retire the MAD here. > */ > + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_rcvd_unknown ); > osm_mad_pool_put( p_ctrl->p_mad_pool, p_madw ); > } > > @@ -330,6 +331,7 @@ __osm_sa_mad_ctrl_rcv_callback( > */ > if ( p_ctrl->p_subn->sm_state != IB_SMINFO_STATE_MASTER ) > { > + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_ignored ); > osm_log( p_ctrl->p_log, OSM_LOG_VERBOSE, > "__osm_sa_mad_ctrl_rcv_callback: " > "Received SA MAD while SM not MASTER. MAD ignored\n"); > @@ -338,6 +340,7 @@ __osm_sa_mad_ctrl_rcv_callback( > } > if ( p_ctrl->p_subn->first_time_master_sweep == TRUE ) > { > + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_ignored ); > osm_log( p_ctrl->p_log, OSM_LOG_VERBOSE, > "__osm_sa_mad_ctrl_rcv_callback: " > "Received SA MAD while SM in first sweep. 
MAD ignored\n"); > @@ -394,6 +397,7 @@ __osm_sa_mad_ctrl_rcv_callback( > break; > > default: > + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_rcvd_unknown ); > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > "__osm_sa_mad_ctrl_rcv_callback: ERR 1A05: " > "Unsupported method = 0x%X\n", > diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c > index 50c4f22..260360f 100644 > --- a/osm/opensm/osm_sa_mcmember_record.c > +++ b/osm/opensm/osm_sa_mcmember_record.c > @@ -68,6 +68,7 @@ > #include > #include > #include > +#include > > #define OSM_MCMR_RCV_POOL_MIN_SIZE 32 > #define OSM_MCMR_RCV_POOL_GROW_SIZE 32 > @@ -571,7 +572,8 @@ __osm_mcmr_rcv_respond( > p_resp_mcmember_rec->pkt_life &= 0x3f; > p_resp_mcmember_rec->pkt_life |= 2<<6; /* exactly */ > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > > if(status != IB_SUCCESS) > { > @@ -2266,7 +2268,8 @@ __osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* const p_rcv, > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if(status != IB_SUCCESS) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_mft_record.c b/osm/opensm/osm_sa_mft_record.c > index 005c9bd..d7c7544 100644 > --- a/osm/opensm/osm_sa_mft_record.c > +++ b/osm/opensm/osm_sa_mft_record.c > @@ -57,6 +57,7 @@ > #include > #include > #include > +#include > > #define OSM_MFTR_RCV_POOL_MIN_SIZE 32 > #define OSM_MFTR_RCV_POOL_GROW_SIZE 32 > @@ -534,7 +535,8 @@ osm_mftr_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if (status != IB_SUCCESS) > { > 
osm_log(p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c > index 0c5643e..2df3699 100644 > --- a/osm/opensm/osm_sa_multipath_record.c > +++ b/osm/opensm/osm_sa_multipath_record.c > @@ -64,6 +64,7 @@ > #include > #include > #include > +#include > > #define OSM_MPR_RCV_POOL_MIN_SIZE 64 > #define OSM_MPR_RCV_POOL_GROW_SIZE 64 > @@ -1536,7 +1537,8 @@ __osm_mpr_rcv_respond( > > osm_dump_sa_mad( p_rcv->p_log, p_resp_sa_mad, OSM_LOG_FRAMES ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > > if ( status != IB_SUCCESS ) > { > diff --git a/osm/opensm/osm_sa_node_record.c b/osm/opensm/osm_sa_node_record.c > index 892582e..0d08a4c 100644 > --- a/osm/opensm/osm_sa_node_record.c > +++ b/osm/opensm/osm_sa_node_record.c > @@ -58,6 +58,7 @@ > #include > #include > #include > +#include > > #define OSM_NR_RCV_POOL_MIN_SIZE 32 > #define OSM_NR_RCV_POOL_GROW_SIZE 32 > @@ -105,7 +106,7 @@ osm_nr_rcv_init( > IN osm_nr_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ) > { > @@ -587,7 +588,8 @@ osm_nr_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if (status != IB_SUCCESS) > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c > index 1b0f89f..b993fdd 100644 > --- a/osm/opensm/osm_sa_path_record.c > +++ b/osm/opensm/osm_sa_path_record.c > @@ -67,6 +67,7 @@ > #include > #include > #include > +#include > #ifdef ROUTER_EXP > #include > #include > @@ -1892,7 +1893,8 @@ 
__osm_pr_rcv_respond( > > CL_ASSERT( cl_is_qlist_empty( p_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > > if( status != IB_SUCCESS ) > { > diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c > index 5eb15df..2692d0c 100644 > --- a/osm/opensm/osm_sa_pkey_record.c > +++ b/osm/opensm/osm_sa_pkey_record.c > @@ -49,6 +49,7 @@ > #include > #include > #include > +#include > > #define OSM_PKEY_REC_RCV_POOL_MIN_SIZE 32 > #define OSM_PKEY_REC_RCV_POOL_GROW_SIZE 32 > @@ -94,10 +95,10 @@ osm_pkey_rec_rcv_destroy( > **********************************************************************/ > ib_api_status_t > osm_pkey_rec_rcv_init( > - IN osm_pkey_rec_rcv_t* const p_rcv, > + IN osm_pkey_rec_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > - IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_mad_pool_t* const p_mad_pool, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ) > { > @@ -573,7 +574,8 @@ osm_pkey_rec_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if (status != IB_SUCCESS) > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_portinfo_record.c b/osm/opensm/osm_sa_portinfo_record.c > index 5d9b1b2..4aa1723 100644 > --- a/osm/opensm/osm_sa_portinfo_record.c > +++ b/osm/opensm/osm_sa_portinfo_record.c > @@ -33,7 +33,6 @@ > * > */ > > - > /* > * Abstract: > * Implementation of osm_pir_rcv_t. 
> @@ -63,6 +62,7 @@ > #include > #include > #include > +#include > > #define OSM_PIR_RCV_POOL_MIN_SIZE 32 > #define OSM_PIR_RCV_POOL_GROW_SIZE 32 > @@ -865,7 +865,8 @@ osm_pir_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if (status != IB_SUCCESS) > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_response.c b/osm/opensm/osm_sa_response.c > index 4f158e9..fac2159 100644 > --- a/osm/opensm/osm_sa_response.c > +++ b/osm/opensm/osm_sa_response.c > @@ -56,6 +56,7 @@ > #include > #include > #include > +#include > > /********************************************************************** > **********************************************************************/ > @@ -81,6 +82,7 @@ ib_api_status_t > osm_sa_resp_init( > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_pool, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log ) > { > ib_api_status_t status = IB_SUCCESS; > @@ -89,6 +91,7 @@ osm_sa_resp_init( > > osm_sa_resp_construct( p_resp ); > > + p_resp->p_subn = p_subn; > p_resp->p_log = p_log; > p_resp->p_pool = p_pool; > > @@ -158,8 +161,8 @@ osm_sa_send_error( > if( osm_log_is_active( p_resp->p_log, OSM_LOG_FRAMES ) ) > osm_dump_sa_mad( p_resp->p_log, p_resp_sa_mad, OSM_LOG_FRAMES ); > > - status = osm_vendor_send( osm_madw_get_bind_handle( p_resp_madw ), > - p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( osm_madw_get_bind_handle( p_resp_madw ), > + p_resp_madw, FALSE, p_resp->p_subn ); > > if( status != IB_SUCCESS ) > { > diff --git a/osm/opensm/osm_sa_service_record.c b/osm/opensm/osm_sa_service_record.c > index b23a12d..4479f00 100644 > --- a/osm/opensm/osm_sa_service_record.c > +++ b/osm/opensm/osm_sa_service_record.c > @@ -465,7 +465,8 @@ __osm_sr_rcv_respond( > } > } > > - status = osm_vendor_send( p_resp_madw->h_bind, 
p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > > if( status != IB_SUCCESS ) > { > diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c > index d831ffd..885bdc5 100644 > --- a/osm/opensm/osm_sa_slvl_record.c > +++ b/osm/opensm/osm_sa_slvl_record.c > @@ -61,6 +61,7 @@ > #include > #include > #include > +#include > > #define OSM_SLVL_REC_RCV_POOL_MIN_SIZE 32 > #define OSM_SLVL_REC_RCV_POOL_GROW_SIZE 32 > @@ -109,7 +110,7 @@ osm_slvl_rec_rcv_init( > IN osm_slvl_rec_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ) > { > @@ -540,7 +541,8 @@ osm_slvl_rec_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if(status != IB_SUCCESS) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_sminfo_record.c b/osm/opensm/osm_sa_sminfo_record.c > index 5e15f52..99e31c6 100644 > --- a/osm/opensm/osm_sa_sminfo_record.c > +++ b/osm/opensm/osm_sa_sminfo_record.c > @@ -68,6 +68,7 @@ > #include > #include > #include > +#include > > #define OSM_SMIR_RCV_POOL_MIN_SIZE 32 > #define OSM_SMIR_RCV_POOL_GROW_SIZE 32 > @@ -570,7 +571,8 @@ osm_smir_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if( status != IB_SUCCESS ) > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_sw_info_record.c b/osm/opensm/osm_sa_sw_info_record.c > index da65864..1c2b6c7 100644 > --- a/osm/opensm/osm_sa_sw_info_record.c > +++ 
b/osm/opensm/osm_sa_sw_info_record.c > @@ -57,6 +57,7 @@ > #include > #include > #include > +#include > > #define OSM_SIR_RCV_POOL_MIN_SIZE 32 > #define OSM_SIR_RCV_POOL_GROW_SIZE 32 > @@ -522,7 +523,8 @@ osm_sir_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if (status != IB_SUCCESS) > { > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c > index f0ff957..fdb3d99 100644 > --- a/osm/opensm/osm_sa_vlarb_record.c > +++ b/osm/opensm/osm_sa_vlarb_record.c > @@ -61,6 +61,7 @@ > #include > #include > #include > +#include > > #define OSM_VLARB_REC_RCV_POOL_MIN_SIZE 32 > #define OSM_VLARB_REC_RCV_POOL_GROW_SIZE 32 > @@ -109,7 +110,7 @@ osm_vlarb_rec_rcv_init( > IN osm_vlarb_rec_rcv_t* const p_rcv, > IN osm_sa_resp_t* const p_resp, > IN osm_mad_pool_t* const p_mad_pool, > - IN const osm_subn_t* const p_subn, > + IN osm_subn_t* const p_subn, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ) > { > @@ -560,7 +561,8 @@ osm_vlarb_rec_rcv_process( > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > + p_rcv->p_subn ); > if(status != IB_SUCCESS) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > diff --git a/osm/opensm/osm_sm_mad_ctrl.c b/osm/opensm/osm_sm_mad_ctrl.c > index acd68d7..85729af 100644 > --- a/osm/opensm/osm_sm_mad_ctrl.c > +++ b/osm/opensm/osm_sm_mad_ctrl.c > @@ -318,6 +318,7 @@ __osm_sm_mad_ctrl_process_get_resp( > case IB_MAD_ATTR_NOTICE: > case IB_MAD_ATTR_INFORM_INFO: > default: > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > "__osm_sm_mad_ctrl_process_get_resp: ERR 3103: " > "Unsupported attribute = 
0x%X\n", > @@ -395,6 +396,7 @@ __osm_sm_mad_ctrl_process_get( > break; > > default: > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > osm_log( p_ctrl->p_log, OSM_LOG_VERBOSE, > "__osm_sm_mad_ctrl_process_get: " > "Ignoring SubnGet MAD - unsupported attribute = 0x%X\n", > @@ -487,6 +489,7 @@ __osm_sm_mad_ctrl_process_set( > break; > > default: > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > "__osm_sm_mad_ctrl_process_set: ERR 3107: " > "Unsupported attribute = 0x%X\n", > @@ -591,6 +594,7 @@ __osm_sm_mad_ctrl_process_trap( > break; > > default: > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > "__osm_sm_mad_ctrl_process_trap: ERR 3109: " > "Unsupported attribute = 0x%X\n", > @@ -763,6 +767,7 @@ __osm_sm_mad_ctrl_rcv_callback( > case IB_MAD_METHOD_REPORT_RESP: > case IB_MAD_METHOD_TRAP_REPRESS: > default: > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > "__osm_sm_mad_ctrl_rcv_callback: ERR 3112: " > "Unsupported method = 0x%X\n", p_smp->method ); > > From sashak at voltaire.com Mon Sep 17 18:05:41 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 03:05:41 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM/main.c: Fix compile error In-Reply-To: <1190073598.12099.97.camel@hrosenstock-ws.xsigo.com> References: <1190073598.12099.97.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070918010541.GK6891@sashak.voltaire.com> On 16:59 Mon 17 Sep , Hal Rosenstock wrote: > OpenSM/main.c: Fix compile error > > Signed-off-by: Hal Rosenstock Nice catch. Applied. Thanks.
Sasha From hrosenstock at xsigo.com Mon Sep 17 21:14:59 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 17 Sep 2007 21:14:59 -0700 Subject: [ofa-general] [PATCH] OpenSM: Improve QP0 and QP1 counter accounting In-Reply-To: <20070918005820.GJ6891@sashak.voltaire.com> References: <1189533856.11745.10.camel@hrosenstock-ws.xsigo.com> <20070918005820.GJ6891@sashak.voltaire.com> Message-ID: <1190088899.12099.103.camel@hrosenstock-ws.xsigo.com> Hi Sasha, On Tue, 2007-09-18 at 02:58 +0200, Sasha Khapyorsky wrote: > Hi Hal, > > On 11:04 Tue 11 Sep , Hal Rosenstock wrote: > > OpenSM: Improve QP0 and QP1 counter accounting > > > > Note: Patch is based on OFED 1.2 > > Well, you know :) > > The question is below. > > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/osm/include/opensm/osm_sa.h b/osm/include/opensm/osm_sa.h > > index ea60341..eced96b 100644 > > --- a/osm/include/opensm/osm_sa.h > > +++ b/osm/include/opensm/osm_sa.h > > @@ -209,6 +209,7 @@ typedef struct _osm_sa > > * FIELDS > > * state > > * State of this SA object > > +* > > * p_subn > > * Pointer to the Subnet object for this subnet. 
> > * > > @@ -448,6 +449,22 @@ osm_sa_bind( > > * SEE ALSO > > *********/ > > > > +/****f* OpenSM: SA/osm_sa_vendor_send > > +* NAME > > +* osm_sa_vendor_send > > +* > > +* DESCRIPTION > > +* Sends SA MAD via osm_vendor_call and maintains the QP1 sent statistic > > +* > > +* SYNOPSIS > > +*/ > > +ib_api_status_t > > +osm_sa_vendor_send( > > + IN osm_bind_handle_t h_bind, > > + IN osm_madw_t* const p_madw, > > + IN boolean_t const resp_expected, > > + IN osm_subn_t* const p_subn ); > > + > > struct _osm_opensm_t; > > /****f* OpenSM: SA/osm_sa_db_file_dump > > * NAME > > diff --git a/osm/include/opensm/osm_sa_guidinfo_record.h b/osm/include/opensm/osm_sa_guidinfo_record.h > > index 5c23cf9..d3cb23d 100644 > > --- a/osm/include/opensm/osm_sa_guidinfo_record.h > > +++ b/osm/include/opensm/osm_sa_guidinfo_record.h > > @@ -98,7 +98,7 @@ BEGIN_C_DECLS > > */ > > typedef struct _osm_gir_rcv > > { > > - const osm_subn_t *p_subn; > > + osm_subn_t *p_subn; > > osm_sa_resp_t *p_resp; > > osm_mad_pool_t *p_mad_pool; > > osm_log_t *p_log; > > @@ -209,7 +209,7 @@ osm_gir_rcv_init( > > IN osm_gir_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const p_subn, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ); > > /* > > diff --git a/osm/include/opensm/osm_sa_node_record.h b/osm/include/opensm/osm_sa_node_record.h > > index c0e8988..0ee8ae1 100644 > > --- a/osm/include/opensm/osm_sa_node_record.h > > +++ b/osm/include/opensm/osm_sa_node_record.h > > @@ -99,7 +99,7 @@ BEGIN_C_DECLS > > */ > > typedef struct _osm_nr_recv > > { > > - const osm_subn_t *p_subn; > > + osm_subn_t *p_subn; > > osm_sa_resp_t *p_resp; > > osm_mad_pool_t *p_mad_pool; > > osm_log_t *p_log; > > @@ -206,7 +206,7 @@ ib_api_status_t osm_nr_rcv_init( > > IN osm_nr_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const 
p_subn, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ); > > /* > > diff --git a/osm/include/opensm/osm_sa_pkey_record.h b/osm/include/opensm/osm_sa_pkey_record.h > > index aceab9a..08b7fee 100644 > > --- a/osm/include/opensm/osm_sa_pkey_record.h > > +++ b/osm/include/opensm/osm_sa_pkey_record.h > > @@ -87,7 +87,7 @@ BEGIN_C_DECLS > > */ > > typedef struct _osm_pkey_rec_rcv > > { > > - const osm_subn_t* p_subn; > > + osm_subn_t* p_subn; > > osm_sa_resp_t* p_resp; > > osm_mad_pool_t* p_mad_pool; > > osm_log_t* p_log; > > @@ -198,7 +198,7 @@ osm_pkey_rec_rcv_init( > > IN osm_pkey_rec_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const p_subn, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ); > > /* > > diff --git a/osm/include/opensm/osm_sa_response.h b/osm/include/opensm/osm_sa_response.h > > index b9e84d1..d883c3b 100644 > > --- a/osm/include/opensm/osm_sa_response.h > > +++ b/osm/include/opensm/osm_sa_response.h > > @@ -52,6 +52,7 @@ > > #include > > #include > > #include > > +#include > > > > #ifdef __cplusplus > > # define BEGIN_C_DECLS extern "C" { > > @@ -97,6 +98,7 @@ BEGIN_C_DECLS > > typedef struct _osm_sa_resp > > { > > osm_mad_pool_t *p_pool; > > + osm_subn_t *p_subn; > > osm_log_t *p_log; > > } osm_sa_resp_t; > > /* > > @@ -186,6 +188,7 @@ ib_api_status_t > > osm_sa_resp_init( > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_pool, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log ); > > /* > > * PARAMETERS > > @@ -195,8 +198,8 @@ osm_sa_resp_init( > > * p_mad_pool > > * [in] Pointer to the MAD pool. > > * > > -* p_vl15 > > -* [in] Pointer to the VL15 interface. > > +* p_subn > > +* [in] Pointer to Subnet object for this subnet. > > * > > * p_log > > * [in] Pointer to the log object. 
> > diff --git a/osm/include/opensm/osm_sa_slvl_record.h b/osm/include/opensm/osm_sa_slvl_record.h > > index a5ce9b4..fabd133 100644 > > --- a/osm/include/opensm/osm_sa_slvl_record.h > > +++ b/osm/include/opensm/osm_sa_slvl_record.h > > @@ -100,7 +100,7 @@ BEGIN_C_DECLS > > */ > > typedef struct _osm_slvl_rec_rcv > > { > > - const osm_subn_t *p_subn; > > + osm_subn_t *p_subn; > > osm_sa_resp_t *p_resp; > > osm_mad_pool_t *p_mad_pool; > > osm_log_t *p_log; > > @@ -211,7 +211,7 @@ osm_slvl_rec_rcv_init( > > IN osm_slvl_rec_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const p_subn, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ); > > /* > > diff --git a/osm/include/opensm/osm_sa_vlarb_record.h b/osm/include/opensm/osm_sa_vlarb_record.h > > index 4aad76f..9796483 100644 > > --- a/osm/include/opensm/osm_sa_vlarb_record.h > > +++ b/osm/include/opensm/osm_sa_vlarb_record.h > > @@ -100,7 +100,7 @@ BEGIN_C_DECLS > > */ > > typedef struct _osm_vlarb_rec_rcv > > { > > - const osm_subn_t *p_subn; > > + osm_subn_t *p_subn; > > osm_sa_resp_t *p_resp; > > osm_mad_pool_t *p_mad_pool; > > osm_log_t *p_log; > > @@ -211,7 +211,7 @@ osm_vlarb_rec_rcv_init( > > IN osm_vlarb_rec_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const p_subn, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ); > > /* > > diff --git a/osm/include/opensm/osm_stats.h b/osm/include/opensm/osm_stats.h > > index 5cffc00..15bc8e0 100644 > > --- a/osm/include/opensm/osm_stats.h > > +++ b/osm/include/opensm/osm_stats.h > > @@ -90,9 +90,12 @@ typedef struct _osm_stats > > atomic32_t qp0_mads_rcvd; > > atomic32_t qp0_mads_sent; > > atomic32_t qp0_unicasts_sent; > > + atomic32_t qp0_mads_rcvd_unknown; > > atomic32_t qp1_mads_outstanding; > > atomic32_t qp1_mads_rcvd; 
> > atomic32_t qp1_mads_sent; > > + atomic32_t qp1_mads_rcvd_unknown; > > + atomic32_t qp1_mads_ignored; > > > > } osm_stats_t; > > /* > > @@ -117,6 +120,27 @@ typedef struct _osm_stats > > * Total number of response-less MADs sent on the wire. This count > > * includes getresp(), send() and trap() methods. > > * > > +* qp0_mads_rcvd_unknown > > +* Total number of unknown QP0 MADs received. This includes > > +* unrecognized attribute IDs and methods. > > +* > > +* qp1_mads_outstanding > > +* Contains the number of MADs outstanding on QP1. > > +* > > +* qp1_mads_rcvd > > +* Total number of QP1 MADs received. > > +* > > +* qp1_mads_sent > > +* Total number of QP1 MADs sent. > > +* > > +* qp1_mads_rcvd_unknown > > +* Total number of unknown QP1 MADs received. This includes > > +* unrecognized attribute IDs and methods. > > +* > > +* qp1_mads_ignored > > +* Total number of QP1 MADs received because SM is not > > +* master or SM is in first time sweep. > > +* > > * SEE ALSO > > ***************/ > > > > diff --git a/osm/include/opensm/osm_version.h b/osm/include/opensm/osm_version.h > > index ef91e16..6d2c8ee 100644 > > --- a/osm/include/opensm/osm_version.h > > +++ b/osm/include/opensm/osm_version.h > > @@ -55,7 +55,7 @@ BEGIN_C_DECLS > > * > > * SYNOPSIS > > */ > > -#define OSM_VERSION "OpenSM Rev:openib-3.0.14-xsigo2" > > +#define OSM_VERSION "OpenSM Rev:openib-3.0.14-xsigo3" This shouldn't be part of this patch. Missed that before. > > /********/ > > > > END_C_DECLS > > diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c > > index 5575425..7acfdf1 100644 > > --- a/osm/opensm/osm_console.c > > +++ b/osm/opensm/osm_console.c > > @@ -336,23 +336,29 @@ static void print_status(osm_opensm_t *p_osm, FILE *out) > > p_osm->routing_engine.name ? 
p_osm->routing_engine.name : "null (min-hop)"); > > fprintf(out, "\n MAD stats\n" > > " ---------\n" > > - " QP0 MADS outstanding : %d\n" > > - " QP0 MADS outstanding (on wire) : %d\n" > > - " QP0 MADS rcvd : %d\n" > > - " QP0 MADS sent : %d\n" > > + " QP0 MADs outstanding : %d\n" > > + " QP0 MADs outstanding (on wire) : %d\n" > > + " QP0 MADs rcvd : %d\n" > > + " QP0 MADs sent : %d\n" > > " QP0 unicasts sent : %d\n" > > - " QP1 MADS outstanding : %d\n" > > - " QP1 MADS rcvd : %d\n" > > - " QP1 MADS sent : %d\n" > > + " QP0 unknown MADs rcvd : %d\n" > > + " QP1 MADs outstanding : %d\n" > > + " QP1 MADs rcvd : %d\n" > > + " QP1 MADs sent : %d\n" > > + " QP1 unknown MADs rcvd : %d\n" > > + " QP1 MADs ignored : %d\n" > > , > > p_osm->stats.qp0_mads_outstanding, > > p_osm->stats.qp0_mads_outstanding_on_wire, > > p_osm->stats.qp0_mads_rcvd, > > p_osm->stats.qp0_mads_sent, > > p_osm->stats.qp0_unicasts_sent, > > + p_osm->stats.qp0_mads_rcvd_unknown, > > p_osm->stats.qp1_mads_outstanding, > > p_osm->stats.qp1_mads_rcvd, > > - p_osm->stats.qp1_mads_sent > > + p_osm->stats.qp1_mads_sent, > > + p_osm->stats.qp1_mads_rcvd_unknown, > > + p_osm->stats.qp1_mads_ignored > > ); > > fprintf(out, "\n Subnet flags\n" > > " ------------\n" > > diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c > > index f91fa49..e1e1dec 100644 > > --- a/osm/opensm/osm_inform.c > > +++ b/osm/opensm/osm_inform.c > > @@ -57,6 +57,7 @@ > > #include > > #include > > #include > > +#include > > > > typedef struct _osm_infr_match_ctxt > > { > > @@ -442,7 +443,8 @@ __osm_send_report( > > *p_report_ntc = *p_ntc; > > > > /* The TRUE is for: response is expected */ > > - status = osm_vendor_send( p_report_madw->h_bind, p_report_madw, TRUE ); > > + status = osm_sa_vendor_send( p_report_madw->h_bind, p_report_madw, TRUE, > > + p_infr_rec->p_infr_rcv->p_subn ); > > if ( status != IB_SUCCESS ) > > { > > osm_log( p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c > > 
index d856fb0..f10ed60 100644 > > --- a/osm/opensm/osm_lid_mgr.c > > +++ b/osm/opensm/osm_lid_mgr.c > > @@ -1163,15 +1163,19 @@ __osm_lid_mgr_set_physp_pi( > > if ( (mtu != ib_port_info_get_neighbor_mtu(p_old_pi)) || > > (op_vls != ib_port_info_get_op_vls(p_old_pi))) > > { > > - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > > +#if 0 > > + if( osm_log_is_active( p_mgr->p_log, OSM_LOG_ERROR ) ) > > { > > - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > > +#endif > > + osm_log( p_mgr->p_log, OSM_LOG_ERROR, > > "__osm_lid_mgr_set_physp_pi: " > > - "Sending Link Down due to op_vls or mtu change. MTU:%u,%u VL_CAP:%u,%u\n", > > + "Setting Link Down due to op_vls or mtu change. MTU:%u,%u VL_CAP:%u,%u\n", > > mtu, ib_port_info_get_neighbor_mtu(p_old_pi), > > op_vls, ib_port_info_get_op_vls(p_old_pi) > > ); > > +#if 0 > > } > > +#endif > > Why those #if 0? Should it be here? No; this osm_lid_mgr.c change is not part of this patch. Sorry. -- Hal > Sasha > > > > > /* > > we need to make sure the internal DB will follow the fact the remote > > diff --git a/osm/opensm/osm_sa.c b/osm/opensm/osm_sa.c > > index 6d68ed2..360ad70 100644 > > --- a/osm/opensm/osm_sa.c > > +++ b/osm/opensm/osm_sa.c > > @@ -69,6 +69,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_SA_INITIAL_TID_VALUE 0xabc > > > > @@ -202,6 +203,7 @@ osm_sa_init( > > > > status = osm_sa_resp_init(&p_sa->resp, > > p_sa->p_mad_pool, > > + p_subn, > > p_log); > > if( status != IB_SUCCESS ) > > goto Exit; > > @@ -519,6 +521,22 @@ osm_sa_bind( > > return( status ); > > } > > > > +ib_api_status_t > > +osm_sa_vendor_send( > > + IN osm_bind_handle_t h_bind, > > + IN osm_madw_t* const p_madw, > > + IN boolean_t const resp_expected, > > + IN osm_subn_t* const p_subn ) > > +{ > > + ib_api_status_t status; > > + > > + cl_atomic_inc( &p_subn->p_osm->stats.qp1_mads_sent ); > > + status = osm_vendor_send( h_bind, p_madw, resp_expected ); > > + if ( status != IB_SUCCESS ) > > + cl_atomic_dec( 
&p_subn->p_osm->stats.qp1_mads_sent ); > > + return status; > > +} > > + > > /********************************************************************** > > **********************************************************************/ > > /* > > diff --git a/osm/opensm/osm_sa_class_port_info.c b/osm/opensm/osm_sa_class_port_info.c > > index da107ee..9ee434a 100644 > > --- a/osm/opensm/osm_sa_class_port_info.c > > +++ b/osm/opensm/osm_sa_class_port_info.c > > @@ -60,6 +60,7 @@ > > #include > > #include > > #include > > +#include > > > > #define MAX_MSECS_TO_RTV 24 > > /* Precalculated table in msec (index is related to encoded value) */ > > @@ -223,7 +224,8 @@ __osm_cpi_rcv_respond( > > if( osm_log_is_active( p_rcv->p_log, OSM_LOG_FRAMES ) ) > > osm_dump_sa_mad( p_rcv->p_log, p_resp_sa_mad, OSM_LOG_FRAMES ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if( status != IB_SUCCESS ) > > { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_guidinfo_record.c b/osm/opensm/osm_sa_guidinfo_record.c > > index 10fac3c..fe85eff 100644 > > --- a/osm/opensm/osm_sa_guidinfo_record.c > > +++ b/osm/opensm/osm_sa_guidinfo_record.c > > @@ -33,7 +33,6 @@ > > * > > */ > > > > - > > /* > > * Abstract: > > * Implementation of osm_gir_rcv_t. 
> > @@ -61,6 +60,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_GIR_RCV_POOL_MIN_SIZE 32 > > #define OSM_GIR_RCV_POOL_GROW_SIZE 32 > > @@ -108,7 +108,7 @@ osm_gir_rcv_init( > > IN osm_gir_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const p_subn, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ) > > { > > @@ -595,7 +595,8 @@ osm_gir_rcv_process( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if (status != IB_SUCCESS) > > { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c > > index 340a7f1..dc999b3 100644 > > --- a/osm/opensm/osm_sa_informinfo.c > > +++ b/osm/opensm/osm_sa_informinfo.c > > @@ -339,7 +339,8 @@ __osm_infr_rcv_respond( > > > > p_resp_infr = (ib_inform_info_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > > > if ( status != IB_SUCCESS ) > > { > > @@ -647,7 +648,8 @@ osm_infr_rcv_process_get_method( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if (status != IB_SUCCESS) > > { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_lft_record.c b/osm/opensm/osm_sa_lft_record.c > > index b6333e7..ed989a0 100644 > > --- a/osm/opensm/osm_sa_lft_record.c > > +++ b/osm/opensm/osm_sa_lft_record.c > > @@ -58,6 +58,7 @@ > > #include > > #include > > #include > > 
+#include > > > > #define OSM_LFTR_RCV_POOL_MIN_SIZE 32 > > #define OSM_LFTR_RCV_POOL_GROW_SIZE 32 > > @@ -502,7 +503,8 @@ osm_lftr_rcv_process( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if (status != IB_SUCCESS) > > { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c > > index 169e75e..058b6b2 100644 > > --- a/osm/opensm/osm_sa_link_record.c > > +++ b/osm/opensm/osm_sa_link_record.c > > @@ -60,6 +60,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_LR_RCV_POOL_MIN_SIZE 64 > > #define OSM_LR_RCV_POOL_GROW_SIZE 64 > > @@ -679,7 +680,8 @@ __osm_lr_rcv_respond( > > } > > } > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if (status != IB_SUCCESS) > > { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c > > index d6518e4..579e8f1 100644 > > --- a/osm/opensm/osm_sa_mad_ctrl.c > > +++ b/osm/opensm/osm_sa_mad_ctrl.c > > @@ -269,6 +269,7 @@ __osm_sa_mad_ctrl_process( > > There is an unknown MAD attribute type for which there is > > no recipient. Simply retire the MAD here. > > */ > > + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_rcvd_unknown ); > > osm_mad_pool_put( p_ctrl->p_mad_pool, p_madw ); > > } > > > > @@ -330,6 +331,7 @@ __osm_sa_mad_ctrl_rcv_callback( > > */ > > if ( p_ctrl->p_subn->sm_state != IB_SMINFO_STATE_MASTER ) > > { > > + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_ignored ); > > osm_log( p_ctrl->p_log, OSM_LOG_VERBOSE, > > "__osm_sa_mad_ctrl_rcv_callback: " > > "Received SA MAD while SM not MASTER. 
MAD ignored\n"); > > @@ -338,6 +340,7 @@ __osm_sa_mad_ctrl_rcv_callback( > > } > > if ( p_ctrl->p_subn->first_time_master_sweep == TRUE ) > > { > > + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_ignored ); > > osm_log( p_ctrl->p_log, OSM_LOG_VERBOSE, > > "__osm_sa_mad_ctrl_rcv_callback: " > > "Received SA MAD while SM in first sweep. MAD ignored\n"); > > @@ -394,6 +397,7 @@ __osm_sa_mad_ctrl_rcv_callback( > > break; > > > > default: > > + cl_atomic_inc( &p_ctrl->p_stats->qp1_mads_rcvd_unknown ); > > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > > "__osm_sa_mad_ctrl_rcv_callback: ERR 1A05: " > > "Unsupported method = 0x%X\n", > > diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c > > index 50c4f22..260360f 100644 > > --- a/osm/opensm/osm_sa_mcmember_record.c > > +++ b/osm/opensm/osm_sa_mcmember_record.c > > @@ -68,6 +68,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_MCMR_RCV_POOL_MIN_SIZE 32 > > #define OSM_MCMR_RCV_POOL_GROW_SIZE 32 > > @@ -571,7 +572,8 @@ __osm_mcmr_rcv_respond( > > p_resp_mcmember_rec->pkt_life &= 0x3f; > > p_resp_mcmember_rec->pkt_life |= 2<<6; /* exactly */ > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > > > if(status != IB_SUCCESS) > > { > > @@ -2266,7 +2268,8 @@ __osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* const p_rcv, > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if(status != IB_SUCCESS) > > { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_mft_record.c b/osm/opensm/osm_sa_mft_record.c > > index 005c9bd..d7c7544 100644 > > --- a/osm/opensm/osm_sa_mft_record.c > > +++ b/osm/opensm/osm_sa_mft_record.c > > @@ -57,6 +57,7 @@ > > #include > > 
#include > > #include > > +#include > > > > #define OSM_MFTR_RCV_POOL_MIN_SIZE 32 > > #define OSM_MFTR_RCV_POOL_GROW_SIZE 32 > > @@ -534,7 +535,8 @@ osm_mftr_rcv_process( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if (status != IB_SUCCESS) > > { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c > > index 0c5643e..2df3699 100644 > > --- a/osm/opensm/osm_sa_multipath_record.c > > +++ b/osm/opensm/osm_sa_multipath_record.c > > @@ -64,6 +64,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_MPR_RCV_POOL_MIN_SIZE 64 > > #define OSM_MPR_RCV_POOL_GROW_SIZE 64 > > @@ -1536,7 +1537,8 @@ __osm_mpr_rcv_respond( > > > > osm_dump_sa_mad( p_rcv->p_log, p_resp_sa_mad, OSM_LOG_FRAMES ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > > > if ( status != IB_SUCCESS ) > > { > > diff --git a/osm/opensm/osm_sa_node_record.c b/osm/opensm/osm_sa_node_record.c > > index 892582e..0d08a4c 100644 > > --- a/osm/opensm/osm_sa_node_record.c > > +++ b/osm/opensm/osm_sa_node_record.c > > @@ -58,6 +58,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_NR_RCV_POOL_MIN_SIZE 32 > > #define OSM_NR_RCV_POOL_GROW_SIZE 32 > > @@ -105,7 +106,7 @@ osm_nr_rcv_init( > > IN osm_nr_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const p_subn, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ) > > { > > @@ -587,7 +588,8 @@ osm_nr_rcv_process( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( 
p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if (status != IB_SUCCESS) > > { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c > > index 1b0f89f..b993fdd 100644 > > --- a/osm/opensm/osm_sa_path_record.c > > +++ b/osm/opensm/osm_sa_path_record.c > > @@ -67,6 +67,7 @@ > > #include > > #include > > #include > > +#include > > #ifdef ROUTER_EXP > > #include > > #include > > @@ -1892,7 +1893,8 @@ __osm_pr_rcv_respond( > > > > CL_ASSERT( cl_is_qlist_empty( p_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > > > if( status != IB_SUCCESS ) > > { > > diff --git a/osm/opensm/osm_sa_pkey_record.c b/osm/opensm/osm_sa_pkey_record.c > > index 5eb15df..2692d0c 100644 > > --- a/osm/opensm/osm_sa_pkey_record.c > > +++ b/osm/opensm/osm_sa_pkey_record.c > > @@ -49,6 +49,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_PKEY_REC_RCV_POOL_MIN_SIZE 32 > > #define OSM_PKEY_REC_RCV_POOL_GROW_SIZE 32 > > @@ -94,10 +95,10 @@ osm_pkey_rec_rcv_destroy( > > **********************************************************************/ > > ib_api_status_t > > osm_pkey_rec_rcv_init( > > - IN osm_pkey_rec_rcv_t* const p_rcv, > > + IN osm_pkey_rec_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > - IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const p_subn, > > + IN osm_mad_pool_t* const p_mad_pool, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ) > > { > > @@ -573,7 +574,8 @@ osm_pkey_rec_rcv_process( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, 
p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if (status != IB_SUCCESS) > > { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_portinfo_record.c b/osm/opensm/osm_sa_portinfo_record.c > > index 5d9b1b2..4aa1723 100644 > > --- a/osm/opensm/osm_sa_portinfo_record.c > > +++ b/osm/opensm/osm_sa_portinfo_record.c > > @@ -33,7 +33,6 @@ > > * > > */ > > > > - > > /* > > * Abstract: > > * Implementation of osm_pir_rcv_t. > > @@ -63,6 +62,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_PIR_RCV_POOL_MIN_SIZE 32 > > #define OSM_PIR_RCV_POOL_GROW_SIZE 32 > > @@ -865,7 +865,8 @@ osm_pir_rcv_process( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if (status != IB_SUCCESS) > > { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_response.c b/osm/opensm/osm_sa_response.c > > index 4f158e9..fac2159 100644 > > --- a/osm/opensm/osm_sa_response.c > > +++ b/osm/opensm/osm_sa_response.c > > @@ -56,6 +56,7 @@ > > #include > > #include > > #include > > +#include > > > > /********************************************************************** > > **********************************************************************/ > > @@ -81,6 +82,7 @@ ib_api_status_t > > osm_sa_resp_init( > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_pool, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log ) > > { > > ib_api_status_t status = IB_SUCCESS; > > @@ -89,6 +91,7 @@ osm_sa_resp_init( > > > > osm_sa_resp_construct( p_resp ); > > > > + p_resp->p_subn = p_subn; > > p_resp->p_log = p_log; > > p_resp->p_pool = p_pool; > > > > @@ -158,8 +161,8 @@ osm_sa_send_error( > > if( osm_log_is_active( p_resp->p_log, OSM_LOG_FRAMES ) ) > > osm_dump_sa_mad( p_resp->p_log, p_resp_sa_mad, OSM_LOG_FRAMES ); > > > > - status = 
osm_vendor_send( osm_madw_get_bind_handle( p_resp_madw ), > > - p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( osm_madw_get_bind_handle( p_resp_madw ), > > + p_resp_madw, FALSE, p_resp->p_subn ); > > > > if( status != IB_SUCCESS ) > > { > > diff --git a/osm/opensm/osm_sa_service_record.c b/osm/opensm/osm_sa_service_record.c > > index b23a12d..4479f00 100644 > > --- a/osm/opensm/osm_sa_service_record.c > > +++ b/osm/opensm/osm_sa_service_record.c > > @@ -465,7 +465,8 @@ __osm_sr_rcv_respond( > > } > > } > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > > > if( status != IB_SUCCESS ) > > { > > diff --git a/osm/opensm/osm_sa_slvl_record.c b/osm/opensm/osm_sa_slvl_record.c > > index d831ffd..885bdc5 100644 > > --- a/osm/opensm/osm_sa_slvl_record.c > > +++ b/osm/opensm/osm_sa_slvl_record.c > > @@ -61,6 +61,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_SLVL_REC_RCV_POOL_MIN_SIZE 32 > > #define OSM_SLVL_REC_RCV_POOL_GROW_SIZE 32 > > @@ -109,7 +110,7 @@ osm_slvl_rec_rcv_init( > > IN osm_slvl_rec_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const p_subn, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ) > > { > > @@ -540,7 +541,8 @@ osm_slvl_rec_rcv_process( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if(status != IB_SUCCESS) > > { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_sminfo_record.c b/osm/opensm/osm_sa_sminfo_record.c > > index 5e15f52..99e31c6 100644 > > --- a/osm/opensm/osm_sa_sminfo_record.c > > +++ b/osm/opensm/osm_sa_sminfo_record.c > > @@ -68,6 
+68,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_SMIR_RCV_POOL_MIN_SIZE 32 > > #define OSM_SMIR_RCV_POOL_GROW_SIZE 32 > > @@ -570,7 +571,8 @@ osm_smir_rcv_process( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if( status != IB_SUCCESS ) > > { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_sw_info_record.c b/osm/opensm/osm_sa_sw_info_record.c > > index da65864..1c2b6c7 100644 > > --- a/osm/opensm/osm_sa_sw_info_record.c > > +++ b/osm/opensm/osm_sa_sw_info_record.c > > @@ -57,6 +57,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_SIR_RCV_POOL_MIN_SIZE 32 > > #define OSM_SIR_RCV_POOL_GROW_SIZE 32 > > @@ -522,7 +523,8 @@ osm_sir_rcv_process( > > > > CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if (status != IB_SUCCESS) > > { > > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sa_vlarb_record.c b/osm/opensm/osm_sa_vlarb_record.c > > index f0ff957..fdb3d99 100644 > > --- a/osm/opensm/osm_sa_vlarb_record.c > > +++ b/osm/opensm/osm_sa_vlarb_record.c > > @@ -61,6 +61,7 @@ > > #include > > #include > > #include > > +#include > > > > #define OSM_VLARB_REC_RCV_POOL_MIN_SIZE 32 > > #define OSM_VLARB_REC_RCV_POOL_GROW_SIZE 32 > > @@ -109,7 +110,7 @@ osm_vlarb_rec_rcv_init( > > IN osm_vlarb_rec_rcv_t* const p_rcv, > > IN osm_sa_resp_t* const p_resp, > > IN osm_mad_pool_t* const p_mad_pool, > > - IN const osm_subn_t* const p_subn, > > + IN osm_subn_t* const p_subn, > > IN osm_log_t* const p_log, > > IN cl_plock_t* const p_lock ) > > { > > @@ -560,7 +561,8 @@ osm_vlarb_rec_rcv_process( > > > > CL_ASSERT( 
cl_is_qlist_empty( &rec_list ) ); > > > > - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > + status = osm_sa_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE, > > + p_rcv->p_subn ); > > if(status != IB_SUCCESS) > > { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > diff --git a/osm/opensm/osm_sm_mad_ctrl.c b/osm/opensm/osm_sm_mad_ctrl.c > > index acd68d7..85729af 100644 > > --- a/osm/opensm/osm_sm_mad_ctrl.c > > +++ b/osm/opensm/osm_sm_mad_ctrl.c > > @@ -318,6 +318,7 @@ __osm_sm_mad_ctrl_process_get_resp( > > case IB_MAD_ATTR_NOTICE: > > case IB_MAD_ATTR_INFORM_INFO: > > default: > > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > > "__osm_sm_mad_ctrl_process_get_resp: ERR 3103: " > > "Unsupported attribute = 0x%X\n", > > @@ -395,6 +396,7 @@ __osm_sm_mad_ctrl_process_get( > > break; > > > > default: > > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > > osm_log( p_ctrl->p_log, OSM_LOG_VERBOSE, > > "__osm_sm_mad_ctrl_process_get: " > > "Ignoring SubnGet MAD - unsupported attribute = 0x%X\n", > > @@ -487,6 +489,7 @@ __osm_sm_mad_ctrl_process_set( > > break; > > > > default: > > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > > "__osm_sm_mad_ctrl_process_set: ERR 3107: " > > "Unsupported attribute = 0x%X\n", > > @@ -591,6 +594,7 @@ __osm_sm_mad_ctrl_process_trap( > > break; > > > > default: > > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > > "__osm_sm_mad_ctrl_process_trap: ERR 3109: " > > "Unsupported attribute = 0x%X\n", > > @@ -763,6 +767,7 @@ __osm_sm_mad_ctrl_rcv_callback( > > case IB_MAD_METHOD_REPORT_RESP: > > case IB_MAD_METHOD_TRAP_REPRESS: > > default: > > + cl_atomic_inc( &p_ctrl->p_stats->qp0_mads_rcvd_unknown ); > > osm_log( p_ctrl->p_log, OSM_LOG_ERROR, > > "__osm_sm_mad_ctrl_rcv_callback: ERR 3112: " > > "Unsupported method = 0x%X\n", 
p_smp->method );
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From bramesh at vt.edu Mon Sep 17 21:22:02 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 18 Sep 2007 00:22:02 -0400
Subject: [ofa-general] IBV_WC_LOC_PROT_ERROR in receive
Message-ID: <20070918042202.GA8660@vt.edu>

I am getting this error when I am trying to do a bunch of send/receives.
I have registered the receive buffer. I printed the address of the
buffers and their respective lkeys, they all match but I am still
getting this error.

The code snippet looks as follows:

struct ibv_mr *mr;
struct ibv_sge sge;
struct ibv_recv_wr wr;
struct ibv_recv_wr *wr_bad;

// registering buffers
mr = ibv_reg_mr (ib_pd, buf, size, IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_READ |
                 IBV_ACCESS_REMOTE_WRITE);

//Post the receive buffer
sge.addr = (uintptr_t) buf;
sge.length = size;
sge.lkey = mr->lkey;
wr.wr_id = WR_ID;
wr.next = NULL;
wr.sg_list = &sge;
wr.num_sge = 1;
if (ibv_post_recv (ib_qp, &wr, &wr_bad) != 0) {
    printf ("ERROR: Unable to post receiver buffer.\n");
    return -1;
}

When I poll for the completion event I get this error. Any help on this
is appreciated. I am not subscribed to this list, I would appreciate if
you please cc me on the reply.
Thanks, Bharath --- Bharath Ramesh http://people.cs.vt.edu/~bramesh From kliteyn at mellanox.co.il Mon Sep 17 22:35:42 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 18 Sep 2007 07:35:42 +0200 Subject: [ofa-general] nightly osm_sim report 2007-09-18:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-17 OpenSM git rev = Sun_Sep_16_18:47:46_2007 [8224cc5e3f6e5ce03d783e674b4eaa6e1cf37acd] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From mst at dev.mellanox.co.il Mon Sep 17 22:50:05 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 18 Sep 2007 07:50:05 +0200 Subject: [ofa-general] Re: mlx4 violating radix tree API locking rules? In-Reply-To: References: <20070911090313.GE15363@mellanox.co.il> <20070917062252.GA30842@mellanox.co.il> Message-ID: <20070918055005.GG24414@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: mlx4 violating radix tree API locking rules? 
> > By the way, in the past we've gotten push-back against using RCU in > dual GPL/BSD code. I don't know what official Mellanox position is on this. I think that in this case it's OK, simply because RCU here is just a requirement from radix tree code, not from mlx4 itself, so anyone who wants to use this code under BSD license, can just implement radix tree in a way that does not require RCU, and then rip out RCU calls. > I have no problem relicensing mlx4 to GPL-only and > then sticking in the rcu_read_lock() stuff to handle this I guess. I don't see how that would help. BTW I looked for BSD code in kernel today that uses RCU: $grep -i RCU `grep -l 'source and binary forms' . -rI after manually filtering out words such as circular, gave me: ./net/sunrpc/auth_gss/auth_gss.c ./fs/nfs/nfs4proc.c ./drivers/infiniband/hw/ipath/ipath_verbs_mcast.c -- MST From mst at dev.mellanox.co.il Mon Sep 17 23:09:52 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 18 Sep 2007 08:09:52 +0200 Subject: [ofa-general] Re: ofed-1.3 daily build package's content In-Reply-To: <200709171711.09316.hnguyen@linux.vnet.ibm.com> References: <200709171711.09316.hnguyen@linux.vnet.ibm.com> Message-ID: <20070918060952.GI24414@mellanox.co.il> > Quoting Hoang-Nam Nguyen : > Subject: ofed-1.3 daily build package's content > > Hello Vlad and Michael! 
> Just downloaded daily build package OFED-1.3-20070917-0600 and saw > in SRPMS: > localhost:/home/nguyen/tmp/OFED-1.3-20070917-0600/SRPMS # ls -l ofa_kernel-1.3-ofed2007091* > -rw-r--r-- 1 1011 1011 1967453 2007-09-10 15:27 ofa_kernel-1.3-ofed20070910.src.rpm > -rw-r--r-- 1 1011 1011 1960701 2007-09-11 15:02 ofa_kernel-1.3-ofed20070911.src.rpm > -rw-r--r-- 1 1011 1011 1966672 2007-09-12 15:02 ofa_kernel-1.3-ofed20070912.src.rpm > -rw-r--r-- 1 1011 1011 1957624 2007-09-13 15:02 ofa_kernel-1.3-ofed20070913.src.rpm > -rw-r--r-- 1 1011 1011 1963469 2007-09-14 15:02 ofa_kernel-1.3-ofed20070914.src.rpm > -rw-r--r-- 1 1011 1011 1965865 2007-09-15 15:02 ofa_kernel-1.3-ofed20070915.src.rpm > -rw-r--r-- 1 1011 1011 1963044 2007-09-16 15:01 ofa_kernel-1.3-ofed20070916.src.rpm > -rw-r--r-- 1 1011 1011 1959261 2007-09-17 15:01 ofa_kernel-1.3-ofed20070917.src.rpm I see this too tar tvzf OFED-1.3-20070917-0600.tgz | grep kernel -rw-r--r-- vlad/vlad 1967453 2007-09-10 16:27:48 OFED-1.3-20070917-0600/SRPMS/ofa_kernel-1.3-ofed20070910.src.rpm -rw-r--r-- vlad/vlad 1960701 2007-09-11 16:02:55 OFED-1.3-20070917-0600/SRPMS/ofa_kernel-1.3-ofed20070911.src.rpm -rw-r--r-- vlad/vlad 1966672 2007-09-12 16:02:32 OFED-1.3-20070917-0600/SRPMS/ofa_kernel-1.3-ofed20070912.src.rpm -rw-r--r-- vlad/vlad 1957624 2007-09-13 16:02:46 OFED-1.3-20070917-0600/SRPMS/ofa_kernel-1.3-ofed20070913.src.rpm -rw-r--r-- vlad/vlad 1963469 2007-09-14 16:02:30 OFED-1.3-20070917-0600/SRPMS/ofa_kernel-1.3-ofed20070914.src.rpm -rw-r--r-- vlad/vlad 1965865 2007-09-15 16:02:32 OFED-1.3-20070917-0600/SRPMS/ofa_kernel-1.3-ofed20070915.src.rpm -rw-r--r-- vlad/vlad 1963044 2007-09-16 15:01:56 OFED-1.3-20070917-0600/SRPMS/ofa_kernel-1.3-ofed20070916.src.rpm -rw-r--r-- vlad/vlad 1959261 2007-09-17 15:01:58 OFED-1.3-20070917-0600/SRPMS/ofa_kernel-1.3-ofed20070917.src.rpm > Is there a reason to include earlier versions of ofa_kernel-1.3? Are they > needed by the build script? I don't think so. 
--
MST

From dotanb at dev.mellanox.co.il Mon Sep 17 22:17:34 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 18 Sep 2007 08:17:34 +0300
Subject: [ofa-general] IBV_WC_LOC_PROT_ERROR in receive
In-Reply-To: <20070918042202.GA8660@vt.edu>
References: <20070918042202.GA8660@vt.edu>
Message-ID: <46EF5F6E.3080708@dev.mellanox.co.il>

Hi.

Bharath Ramesh wrote:
> I am getting this error when I am trying to do a bunch of send/receives.
> I have registered the receive buffer. I printed the address of the
> buffers and their respective lkeys, they all match but I am still
> getting this error.
>
> The code snippet looks as follows:
>
> struct ibv_mr *mr;
> struct ibv_sge sge;
> struct ibv_recv_wr wr;
> struct ibv_recv_wr *wr_bad;
>
> // registering buffers
> mr = ibv_reg_mr (ib_pd, buf, size, IBV_ACCESS_LOCAL_WRITE |
>                  IBV_ACCESS_REMOTE_READ |
>                  IBV_ACCESS_REMOTE_WRITE);
>
>
> //Post the receive buffer
> sge.addr = (uintptr_t) buf;
> sge.length = size;
> sge.lkey = mr->lkey;
> wr.wr_id = WR_ID;
> wr.next = NULL;
> wr.sg_list = &sge;
> wr.num_sge = 1;
> if (ibv_post_recv (ib_qp, &wr, &wr_bad) != 0) {
>     printf ("ERROR: Unable to post receiver buffer.\n");
>     return -1;
> }
>
> When I poll for the completion event I get this error. Any help on this
> is appreciated. I am not subscribed to this list, I would appreciate if
> you please cc me on the reply.
>

If the address that you gave in the RR is valid (i.e. you didn't
deregister this MR), you should check the following things:

* If this is a UD QP, the extra 40 bytes (for the GRH) may be missing
  from the receive buffer.
* The incoming message may be larger than the receive buffer.
* The PD of the QP and the PD of the MR may not be the same.

If this doesn't help, the value of vendor_err in the completion
structure may help me....
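
The first two checks in the list above come down to a size comparison, which can be sketched in plain C without touching the verbs API. This is only an illustration under stated assumptions: the helper names min_recv_buf_len() and recv_buf_is_large_enough() are hypothetical and not part of libibverbs, and the only number taken from the message above is the 40 bytes reserved for the GRH on a UD QP.

```c
#include <stddef.h>

/* Bytes taken by the GRH at the start of a UD receive buffer
 * (the "extra 40 bytes" mentioned in the checklist). */
#define GRH_BYTES 40

/* Hypothetical helper: smallest receive buffer that can hold an
 * incoming message carrying msg_len bytes of payload. */
static size_t min_recv_buf_len(int is_ud_qp, size_t msg_len)
{
    return is_ud_qp ? msg_len + GRH_BYTES : msg_len;
}

/* Hypothetical helper: non-zero if a posted buffer of buf_len bytes
 * is large enough for the incoming message. */
static int recv_buf_is_large_enough(int is_ud_qp, size_t msg_len,
                                    size_t buf_len)
{
    return buf_len >= min_recv_buf_len(is_ud_qp, msg_len);
}
```

With these helpers, a 4096-byte buffer posted on a UD QP would be reported as too small for a 4096-byte payload, while the same buffer would be sufficient on a connected QP. The third check (matching PDs) cannot be expressed this way; it needs the actual QP and MR objects.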
Dotan

From jackm at dev.mellanox.co.il Tue Sep 18 00:09:49 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 18 Sep 2007 09:09:49 +0200
Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To:
References:
Message-ID: <200709180909.50029.jackm@dev.mellanox.co.il>

On Thursday 13 September 2007 20:57, Roland Dreier wrote:
> HW specific:
>
>  - I already merged patches to enable MSI-X by default for mthca and
>    mlx4.  I hope there aren't too many systems that get hosed if a
>    MSI-X interrupt is generated.
>
>  - Jack and Michael's mlx4 FMR support.  Will merge I guess, although
>    I do hope to have time to address the DMA API abuse that is being
>    copied from mthca, so that mlx4 and mthca work in Xen domU.
>
>  - ehca patch queue.  Will merge, pending fixes for the few minor
>    issues I commented on.
>
>  - Steve's mthca router mode support.  Would be nice to see a review
>    from someone at Mellanox.
>
>  - Arthur's mthca doorbell alignment fixes.  I will experiment with a
>    few different approaches and post what I like (and fix mlx4 as
>    well).  I hope Arthur can review.
>
>  - Michael's mlx4 WQE shrinking patch.  Not sure yet; I'll reply to
>    the latest patch directly.
>

Missing from this list (IMPORTANT patch!):
[ofa-general] [PATCH 2 of 2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists
(Posted by me to list on Sept 4)

{patch header:
This is an addendum to Roland's commit 0e6e74162164d908edf7889ac66dca09e7505745 (June 18).
This addendum adds prefetch headroom marking processing for s/g segments.

We write s/g segments in reverse order into the WQE, in order to guarantee
that the first dword of all cachelines containing s/g segments is written last
(overwriting the headroom invalidation pattern). The entire cacheline will thus
contain valid data when the invalidation pattern is overwritten.
}

This is a two-patch series (1 of 2 is for libmlx4, the same issue).
============================================================

Also, I'm now posting (in a separate post) the following patch to mlx4, which is important:

display the following device information via sysfs:
board_id, fw_ver, hw_rev, hca_type.

The info is displayed under directory /sys/class/infiniband/mlx4_x, where x is
the pci bus sequence number (starting from zero).

This patch makes information available to ibstat and ibv_devinfo under
the same directory as is used for tavor/arbel/sinai -- thus requiring no
userspace modifications.

- Jack

From jackm at dev.mellanox.co.il Tue Sep 18 00:14:18 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 18 Sep 2007 09:14:18 +0200
Subject: [ofa-general] [PATCH] mlx4: display misc device information via sysfs under /sys/class/infiniband/mlx4_x, for ibstat and ibv_devinfo
Message-ID: <200709180914.18560.jackm@dev.mellanox.co.il>

display the following device information via sysfs:
board_id, fw_ver, hw_rev, hca_type.

The info is displayed under directory /sys/class/infiniband/mlx4_x, where x is
the pci bus sequence number (starting from zero).

This patch makes information available to ibstat and ibv_devinfo under
the same directory as is used for tavor/arbel/sinai -- thus requiring no
userspace modifications.
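
As a rough illustration of the fw_ver attribute listed above: the show_fw_ver() handler in the patch that follows treats the firmware version as a packed 64-bit value, with the major number above bit 32, the minor number in bits 16..31 and the sub-minor number in bits 0..15. The user-space sketch below mirrors that unpacking. The name format_fw_ver() is hypothetical, and it uses snprintf() into a caller-supplied buffer rather than the kernel's sprintf() into a sysfs page, so it is an approximation of the idea and not the driver code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical user-space helper: render a packed firmware version
 * as "major.minor.subminor\n", the same layout the patch's
 * show_fw_ver() prints.  Returns the snprintf() result. */
static int format_fw_ver(uint64_t fw_ver, char *buf, size_t len)
{
    return snprintf(buf, len, "%d.%d.%d\n",
                    (int) (fw_ver >> 32),             /* major    */
                    (int) ((fw_ver >> 16) & 0xffff),  /* minor    */
                    (int) (fw_ver & 0xffff));         /* subminor */
}
```

A tool reading /sys/class/infiniband/mlx4_x/fw_ver would see the already formatted string, so this unpacking only matters on the driver side.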
Signed-off-by: Jack Morgenstein Index: connectx_kernel/drivers/infiniband/hw/mlx4/main.c =================================================================== --- connectx_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-08-02 13:58:37.000000000 +0300 +++ connectx_kernel/drivers/infiniband/hw/mlx4/main.c 2007-08-02 14:04:28.000000000 +0300 @@ -477,9 +477,61 @@ return err; } +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mlx4_ib_dev *dev = container_of(cdev, struct mlx4_ib_dev, ib_dev.class_dev); + switch (dev->dev->pdev->device) { + case 0x6340: + return sprintf(buf, "MT25408\n"); + case 0x634a: + return sprintf(buf, "MT25418\n"); + case 0x6354: + return sprintf(buf, "MT25428\n"); + case 0x6732: + return sprintf(buf, "MT26418\n"); + case 0x673c: + return sprintf(buf, "MT26428\n"); + default: + return sprintf(buf, "unknown\n"); + } +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mlx4_ib_dev *dev = container_of(cdev, struct mlx4_ib_dev, ib_dev.class_dev); + return sprintf(buf, "%d.%d.%d\n", (int) (dev->dev->caps.fw_ver >> 32), + (int) (dev->dev->caps.fw_ver >> 16) & 0xffff, + (int) dev->dev->caps.fw_ver & 0xffff); +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mlx4_ib_dev *dev = container_of(cdev, struct mlx4_ib_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->dev->rev_id); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct mlx4_ib_dev *dev = container_of(cdev, struct mlx4_ib_dev, ib_dev.class_dev); + return sprintf(buf, "%.*s\n", MLX4_BOARD_ID_LEN, dev->dev->board_id); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *mlx4_class_attributes[] = { + &class_device_attr_hw_rev, + 
&class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + static void *mlx4_ib_add(struct mlx4_dev *dev) { struct mlx4_ib_dev *ibdev; + int i; ibdev = (struct mlx4_ib_dev *) ib_alloc_device(sizeof *ibdev); if (!ibdev) { @@ -586,6 +642,12 @@ if (mlx4_ib_mad_init(ibdev)) goto err_reg; + for (i = 0; i < ARRAY_SIZE(mlx4_class_attributes); ++i) { + if (class_device_create_file(&ibdev->ib_dev.class_dev, + mlx4_class_attributes[i])) + goto err_reg; + } + return ibdev; err_reg: Index: connectx_kernel/include/linux/mlx4/device.h =================================================================== --- connectx_kernel.orig/include/linux/mlx4/device.h 2007-08-02 13:58:37.000000000 +0300 +++ connectx_kernel/include/linux/mlx4/device.h 2007-08-02 14:04:28.000000000 +0300 @@ -49,6 +49,10 @@ }; enum { + MLX4_BOARD_ID_LEN = 64 +}; + +enum { MLX4_DEV_CAP_FLAG_RC = 1 << 0, MLX4_DEV_CAP_FLAG_UC = 1 << 1, MLX4_DEV_CAP_FLAG_UD = 1 << 2, @@ -283,6 +287,8 @@ unsigned long flags; struct mlx4_caps caps; struct radix_tree_root qp_table_tree; + u32 rev_id; + char board_id[MLX4_BOARD_ID_LEN]; }; struct mlx4_init_port_param { Index: connectx_kernel/drivers/net/mlx4/main.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/main.c 2007-08-02 13:58:37.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/main.c 2007-08-02 14:04:28.000000000 +0300 @@ -536,8 +536,8 @@ } priv->eq_table.inta_pin = adapter.inta_pin; - priv->rev_id = adapter.revision_id; - memcpy(priv->board_id, adapter.board_id, sizeof priv->board_id); + priv->dev.rev_id = adapter.revision_id; + memcpy(priv->dev.board_id, adapter.board_id, sizeof priv->dev.board_id); return 0; Index: connectx_kernel/drivers/net/mlx4/mlx4.h =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/mlx4.h 2007-08-02 13:58:37.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/mlx4.h 2007-08-02 
14:04:28.000000000 +0300 @@ -56,10 +56,6 @@ }; enum { - MLX4_BOARD_ID_LEN = 64 -}; - -enum { MLX4_MGM_ENTRY_SIZE = 0x100, MLX4_QP_PER_MGM = 4 * (MLX4_MGM_ENTRY_SIZE / 16 - 2), MLX4_MTT_ENTRY_PER_SEG = 8 @@ -279,9 +275,6 @@ struct mlx4_uar driver_uar; void __iomem *kar; - - u32 rev_id; - char board_id[MLX4_BOARD_ID_LEN]; }; static inline struct mlx4_priv *mlx4_priv(struct mlx4_dev *dev) From mst at dev.mellanox.co.il Tue Sep 18 00:16:58 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 18 Sep 2007 09:16:58 +0200 Subject: [ofa-general] Re: [PATCH] mlx4: display misc device information via sysfs under /sys/class/infiniband/mlx4_x, for ibstat and ibv_devinfo In-Reply-To: <200709180914.18560.jackm@dev.mellanox.co.il> References: <200709180914.18560.jackm@dev.mellanox.co.il> Message-ID: <20070918071658.GA32109@mellanox.co.il> > +static ssize_t show_hca(struct class_device *cdev, char *buf) > +{ > + struct mlx4_ib_dev *dev = container_of(cdev, struct mlx4_ib_dev, ib_dev.class_dev); > + switch (dev->dev->pdev->device) { > + case 0x6340: > + return sprintf(buf, "MT25408\n"); > + case 0x634a: > + return sprintf(buf, "MT25418\n"); > + case 0x6354: > + return sprintf(buf, "MT25428\n"); > + case 0x6732: > + return sprintf(buf, "MT26418\n"); > + case 0x673c: > + return sprintf(buf, "MT26428\n"); > + default: > + return sprintf(buf, "unknown\n"); > + } > +} How about just static ssize_t show_hca(struct class_device *cdev, char *buf) { struct mlx4_ib_dev *dev = container_of(cdev, struct mlx4_ib_dev, ib_dev.class_dev); return sprintf(buf, "MT%d\n", dev->dev->pdev->device); } -- MST From jackm at dev.mellanox.co.il Tue Sep 18 01:43:07 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 18 Sep 2007 10:43:07 +0200 Subject: [ofa-general] Re: [PATCH] mlx4: display misc device information via sysfs under /sys/class/infiniband/mlx4_x, for ibstat and ibv_devinfo In-Reply-To: <20070918071658.GA32109@mellanox.co.il> References: 
<200709180914.18560.jackm@dev.mellanox.co.il> <20070918071658.GA32109@mellanox.co.il> Message-ID: <200709181043.08150.jackm@dev.mellanox.co.il> On Tuesday 18 September 2007 09:16, Michael S. Tsirkin wrote: > > +static ssize_t show_hca(struct class_device *cdev, char *buf) > > +{ > > + struct mlx4_ib_dev *dev = container_of(cdev, struct mlx4_ib_dev, ib_dev.class_dev); > > + switch (dev->dev->pdev->device) { > > + case 0x6340: > > + return sprintf(buf, "MT25408\n"); > > + case 0x634a: > > + return sprintf(buf, "MT25418\n"); > > + case 0x6354: > > + return sprintf(buf, "MT25428\n"); > > + case 0x6732: > > + return sprintf(buf, "MT26418\n"); > > + case 0x673c: > > + return sprintf(buf, "MT26428\n"); > > + default: > > + return sprintf(buf, "unknown\n"); > > + } > > +} > > How about just > > static ssize_t show_hca(struct class_device *cdev, char *buf) > { > struct mlx4_ib_dev *dev = container_of(cdev, struct mlx4_ib_dev, ib_dev.class_dev); > return sprintf(buf, "MT%d\n", dev->dev->pdev->device); > } > Looks OK. Don't need the "default" case, since the kernel will only invoke the mlx4 driver for the device-id's it registers for. (see static struct pci_device_id mlx4_pci_table[] in file drivers/net/mlx4/main.c) - Jack From tziporet at dev.mellanox.co.il Tue Sep 18 02:25:07 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 18 Sep 2007 11:25:07 +0200 Subject: [ofa-general] Re: [ewg] Re: ofed-1.3 daily build package's content In-Reply-To: <20070918060952.GI24414@mellanox.co.il> References: <200709171711.09316.hnguyen@linux.vnet.ibm.com> <20070918060952.GI24414@mellanox.co.il> Message-ID: <46EF9973.6020703@mellanox.co.il> Michael S. Tsirkin wrote: >> Quoting Hoang-Nam Nguyen : >> Subject: ofed-1.3 daily build package's content >> >> Hello Vlad and Michael! 
>> Just downloaded daily build package OFED-1.3-20070917-0600 and saw >> in SRPMS: >> localhost:/home/nguyen/tmp/OFED-1.3-20070917-0600/SRPMS # ls -l ofa_kernel-1.3-ofed2007091* >> -rw-r--r-- 1 1011 1011 1967453 2007-09-10 15:27 ofa_kernel-1.3-ofed20070910.src.rpm >> -rw-r--r-- 1 1011 1011 1960701 2007-09-11 15:02 ofa_kernel-1.3-ofed20070911.src.rpm >> -rw-r--r-- 1 1011 1011 1966672 2007-09-12 15:02 ofa_kernel-1.3-ofed20070912.src.rpm >> -rw-r--r-- 1 1011 1011 1957624 2007-09-13 15:02 ofa_kernel-1.3-ofed20070913.src.rpm >> -rw-r--r-- 1 1011 1011 1963469 2007-09-14 15:02 ofa_kernel-1.3-ofed20070914.src.rpm >> -rw-r--r-- 1 1011 1011 1965865 2007-09-15 15:02 ofa_kernel-1.3-ofed20070915.src.rpm >> -rw-r--r-- 1 1011 1011 1963044 2007-09-16 15:01 ofa_kernel-1.3-ofed20070916.src.rpm >> -rw-r--r-- 1 1011 1011 1959261 2007-09-17 15:01 ofa_kernel-1.3-ofed20070917.src.rpm >> > > It's a bug in the build script I will try to fix this before Vlad is back from vacation. I can use help from someone - need to look at the script: ~vlad/scripts/ofed_1_3/build_ofed_daily.sh Tziporet From tziporet at dev.mellanox.co.il Tue Sep 18 02:48:38 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 18 Sep 2007 11:48:38 +0200 Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <1190034015.6272.83.camel@hrosenstock-ws.xsigo.com> References: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com> <1190034015.6272.83.camel@hrosenstock-ws.xsigo.com> Message-ID: <46EF9EF6.7090805@mellanox.co.il> Hal Rosenstock wrote: > > Has anyone tested these with QoS actually be used ? I suppose this > requires Connect-X. > You can test it with a switch without ConnectX. 
If you want that the HCA will react to the QoS setting too then you should have ConnectX Tziporet From tziporet at dev.mellanox.co.il Tue Sep 18 02:55:10 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 18 Sep 2007 11:55:10 +0200 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <46ECE389.4020308@voltaire.com> References: <46ECE389.4020308@voltaire.com> Message-ID: <46EFA07E.9090102@mellanox.co.il> Or Gerlitz wrote: > Shirley Ma wrote: >> Since ehca can support 4K MTU, we would like to see a patch >> in IPoIB to allow link MTU to be up to 4K instead of current 2K for >> 2.6.24 kernel. The idea is IPoIB link MTU will pick up a return value >> from SM's default broadcast MTU. This patch should be a small patch, >> I hope you are OK with this. > > The only IB switching chip I know does not support 4K IB MTU so you > would be able to use it only in p2p connections, correct? > > Or. > > > ConnectX can support 4K MTU too (need a configured ini file) Anafa II switch can support 4K MTU with a special configuration. 
Tziporet From vlad at lists.openfabrics.org Tue Sep 18 02:55:34 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 18 Sep 2007 02:55:34 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070918-0200 daily build status Message-ID: <20070918095535.1FCFDE608AB@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with 
linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070918-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From ogerlitz at voltaire.com Tue Sep 18 03:09:37 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 18 Sep 2007 12:09:37 +0200 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <46EFA07E.9090102@mellanox.co.il> References: <46ECE389.4020308@voltaire.com> <46EFA07E.9090102@mellanox.co.il> Message-ID: <46EFA3E1.4080005@voltaire.com> Tziporet Koren wrote: > Or Gerlitz wrote: >> The only IB switching chip I know does not support 4K IB MTU so you >> would be able to use it only in p2p connections, correct? > ConnectX can support 4K MTU too (need a configured ini file) > Anafa II switch can support 4K MTU with a special configuration. Hi Tziporet, Thanks for the clarification. 
Is the Anafa II configuration/firmware that supports 4K MTU officially supported by Mellanox? If yes, I think it makes much sense to work in the direction suggested by Shirley, namely to implement mtu > page-size support for ipoib datagram mode. Are you going to look into that? Or. From ogerlitz at voltaire.com Tue Sep 18 03:44:07 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 18 Sep 2007 12:44:07 +0200 (IST) Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management Message-ID: Hi Sean, We see a problem related to the core multicast management which seems to be a bug: It is possible for the multicast consumer to call ib_sa_free_multicast() where this leave request is queued to be later processed by the workqueue thread, and then call ib_sa_join_multicast() which calls acquire_group() --before-- the leave request was executed by the thread. So the lookup done by acquire_group() succeeds, the code goes to the found: label and the group reference count climbs to (e.g.) 2. Following that the leave work-element causes the thread to just decrement the reference count to 1 in release_group() and do nothing else, and the join work-element causes the thread to return the cached address-handle attributes to the consumer. So no SA query is being sent to the SA. We saw the bug on a uniprocessor system running the ipath driver, where the consumer is ipoib and the group is the IPv4 broadcast group. When we take down the link of the switch port connected to the device across the cable, ipoib rushes to leave the group and then join it. On this system the join "crosses the leave" and the SA does not take into account the node when computing the multicast routing of the group --> the node does not get the broadcast traffic. For now we have applied a workaround which causes the multicast code to call release_group() from ib_sa_free_multicast().
The workaround is implemented by using the patch below which causes mcast_groups_lost() to be called also when the port actually goes up, and set the group state to MCAST_ERROR such that the call to release_group() is not deferred (ipoib does leave/join for every event, namely both on link down and up). Please let me know what is your thinking on this issue, thanks! Or. From: Matty Kadosh Index: linux-2.6.23-rc5/drivers/infiniband/core/multicast.c =================================================================== --- linux-2.6.23-rc5.orig/drivers/infiniband/core/multicast.c 2007-09-18 12:32:08.000000000 +0300 +++ linux-2.6.23-rc5/drivers/infiniband/core/multicast.c 2007-09-18 13:31:35.000000000 +0300 @@ -735,6 +735,7 @@ static void mcast_event_handler(struct i dev = container_of(handler, struct mcast_device, event_handler); switch (event->event) { + case IB_EVENT_PORT_ACTIVE: case IB_EVENT_PORT_ERR: case IB_EVENT_LID_CHANGE: case IB_EVENT_SM_CHANGE: From krkumar2 at in.ibm.com Tue Sep 18 04:18:03 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Tue, 18 Sep 2007 16:48:03 +0530 Subject: [ofa-general] [PATCH 1/2] IPoIB: Fix unregister_netdev hang Message-ID: <20070918111803.1769.60619.sendpatchset@localhost.localdomain> While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with the message: "waiting for ib2 to become free. Usage count = -515276", etc. The problem is that the poll handler does netif_rx_complete (which does a dev_put) followed by netif_rx_reschedule() to schedule for more receives (which again does a dev_put). This reduces refcount to < 0 (depending on how many times netif_rx_complete followed by netif_rx_reschedule was called). The following patch fixes the bug, but I don't know if there is some specific IB issue that prevents this approach. 
Signed-off-by: Krishna Kumar --- ipoib_ib.c | 11 ++++------- 1 files changed, 4 insertions(+), 7 deletions(-) diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c --- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 15:50:09.000000000 +0530 +++ new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 16:14:20.000000000 +0530 @@ -291,7 +291,6 @@ int ipoib_poll(struct napi_struct *napi, done = 0; -poll_more: while (done < budget) { int max = (budget - done); @@ -316,12 +315,10 @@ poll_more: } if (done < budget) { - netif_rx_complete(dev, napi); - if (unlikely(ib_req_notify_cq(priv->cq, - IB_CQ_NEXT_COMP | - IB_CQ_REPORT_MISSED_EVENTS)) && - netif_rx_reschedule(napi)) - goto poll_more; + if (likely(!ib_req_notify_cq(priv->cq, + IB_CQ_NEXT_COMP | + IB_CQ_REPORT_MISSED_EVENTS))) + netif_rx_complete(dev, napi); } return done; From krkumar2 at in.ibm.com Tue Sep 18 04:18:17 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Tue, 18 Sep 2007 16:48:17 +0530 Subject: [ofa-general] [PATCH 2/2] IPoIB: Code cleanup In-Reply-To: <20070918111803.1769.60619.sendpatchset@localhost.localdomain> References: <20070918111803.1769.60619.sendpatchset@localhost.localdomain> Message-ID: <20070918111817.1769.1042.sendpatchset@localhost.localdomain> Follow-up cleanup and "while loop" optimization in the poll handler. net_rx_action guarantees that 'budget' is atleast 1. Note: This could also be done for poll handlers of other drivers. 
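The do/while form depends on 'budget' being at least 1, so that the first batch request is never for zero entries. A minimal user-space sketch of the batching logic follows. It is not the driver code: struct model_cq, model_poll_cq(), model_poll() and MODEL_NUM_WC are hypothetical stand-ins for priv->cq, ib_poll_cq(), ipoib_poll() and IPOIB_NUM_WC, and the model assumes every completion is a receive, so each one counts against the budget.

```c
/* Hypothetical batch size, standing in for IPOIB_NUM_WC. */
#define MODEL_NUM_WC 4

/* Toy completion queue: only tracks how many completions are waiting. */
struct model_cq {
	int pending;
};

/* Hand out at most max_entries completions, like a CQ poll would. */
int model_poll_cq(struct model_cq *cq, int max_entries)
{
	int n = cq->pending < max_entries ? cq->pending : max_entries;

	cq->pending -= n;
	return n;
}

/*
 * Poll in batches until the budget is used up or a batch comes back
 * short (the queue is drained). Assumes budget >= 1.
 */
int model_poll(struct model_cq *cq, int budget)
{
	int num_wc, max_wc;
	int done = 0;

	do {
		max_wc = budget - done < MODEL_NUM_WC ?
			 budget - done : MODEL_NUM_WC;
		num_wc = model_poll_cq(cq, max_wc);
		/* In this model every completion is a receive. */
		done += num_wc;
	} while (num_wc == max_wc && done < budget);

	return done;
}
```

With 10 pending completions and a budget of 6, the loop runs twice (batches of 4 and 2) and stops on the budget; with 3 pending and a budget of 16 it stops after one short batch.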
Signed-off-by: Krishna Kumar --- ipoib_ib.c | 22 ++++++++-------------- 1 files changed, 8 insertions(+), 14 deletions(-) diff -ruNp new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c new2/drivers/infiniband/ulp/ipoib/ipoib_ib.c --- new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 16:14:20.000000000 +0530 +++ new2/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 16:31:42.000000000 +0530 @@ -285,19 +285,16 @@ int ipoib_poll(struct napi_struct *napi, { struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, napi); struct net_device *dev = priv->dev; - int done; - int t; - int n, i; + int num_wc, max_wc; + int done = 0; - done = 0; - - while (done < budget) { - int max = (budget - done); + do { + int i; - t = min(IPOIB_NUM_WC, max); - n = ib_poll_cq(priv->cq, t, priv->ibwc); + max_wc = min(IPOIB_NUM_WC, budget - done); + num_wc = ib_poll_cq(priv->cq, max_wc, priv->ibwc); - for (i = 0; i < n; i++) { + for (i = 0; i < num_wc; i++) { struct ib_wc *wc = priv->ibwc + i; if (wc->wr_id & IPOIB_CM_OP_SRQ) { @@ -309,10 +306,7 @@ int ipoib_poll(struct napi_struct *napi, } else ipoib_ib_handle_tx_wc(dev, wc); } - - if (n != t) - break; - } + } while (num_wc == max_wc && done < budget); if (done < budget) { if (likely(!ib_req_notify_cq(priv->cq, From hrosenstock at xsigo.com Tue Sep 18 04:31:56 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 18 Sep 2007 04:31:56 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] infiniband-diags/ibnetdiscover: Bump build version Message-ID: <1190115116.12099.119.camel@hrosenstock-ws.xsigo.com> infiniband-diags/ibnetdiscover: Bump build version Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index da15523..6574f2b 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -47,7 +47,7 @@ #include #include -#define __BUILD_VERSION_TAG__ 1.2.4 +#define __BUILD_VERSION_TAG__ 1.2.5 #include #include #include @@ 
-704,7 +704,6 @@ dump_topology(int listtype, int group) chguid = out_chassis(ch->chassisnum); chname = NULL; if (is_xsigo_guid(chguid)) { - /* !!! */ for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { if (!node->chrecord || !node->chrecord->chassisnum) From krkumar2 at in.ibm.com Tue Sep 18 04:39:16 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Tue, 18 Sep 2007 17:09:16 +0530 Subject: [ofa-general] [PATCH] IPoIB: Optimizations in poll handler. Message-ID: <20070918113916.2065.14065.sendpatchset@localhost.localdomain> Final follow-up optimizations: If the poll loop executes more than once (and it happens on my system with two flood pings): - no need to calculate "budget - done" on every iteration (but will require to do this once, when returning from fn) - check for one variable being non-zero instead of comparing two vars for every iteration. Signed-off-by: Krishna Kumar --- ipoib_ib.c | 15 ++++++++------- 1 files changed, 8 insertions(+), 7 deletions(-) diff -ruNp new2/drivers/infiniband/ulp/ipoib/ipoib_ib.c new3/drivers/infiniband/ulp/ipoib/ipoib_ib.c --- new2/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 16:31:42.000000000 +0530 +++ new3/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 17:01:44.000000000 +0530 @@ -286,36 +286,37 @@ int ipoib_poll(struct napi_struct *napi, struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, napi); struct net_device *dev = priv->dev; int num_wc, max_wc; - int done = 0; + int remaining = budget; do { int i; - max_wc = min(IPOIB_NUM_WC, budget - done); + max_wc = min(IPOIB_NUM_WC, remaining); num_wc = ib_poll_cq(priv->cq, max_wc, priv->ibwc); for (i = 0; i < num_wc; i++) { struct ib_wc *wc = priv->ibwc + i; if (wc->wr_id & IPOIB_CM_OP_SRQ) { - ++done; + --remaining; ipoib_cm_handle_rx_wc(dev, wc); } else if (wc->wr_id & IPOIB_OP_RECV) { - ++done; + --remaining; ipoib_ib_handle_rx_wc(dev, wc); } else ipoib_ib_handle_tx_wc(dev, wc); } - } while (num_wc == max_wc && done < budget); + } 
while (num_wc == max_wc && remaining); - if (done < budget) { + if (remaining) { if (likely(!ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS))) netif_rx_complete(dev, napi); } - return done; + /* return number of receives processed */ + return budget - remaining; } void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) From hnguyen at linux.vnet.ibm.com Tue Sep 18 06:33:14 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Tue, 18 Sep 2007 15:33:14 +0200 Subject: [ofa-general] Re: [ewg] Re: ofed-1.3 daily build package's content In-Reply-To: <46EF9973.6020703@mellanox.co.il> References: <200709171711.09316.hnguyen@linux.vnet.ibm.com> <20070918060952.GI24414@mellanox.co.il> <46EF9973.6020703@mellanox.co.il> Message-ID: <200709181533.14764.hnguyen@linux.vnet.ibm.com> Hi Tziporet! On Tuesday 18 September 2007 11:25, Tziporet Koren wrote: > I will try to fix this before Vlad is back from vacation. > I can use help from someone - need to look at the script: > ~vlad/scripts/ofed_1_3/build_ofed_daily.sh Hope the patch below helps.
Nam diff -Nurp scripts_orig/ofed_1_3/build_ofed.sh scripts/ofed_1_3/build_ofed.sh --- scripts_orig/ofed_1_3/build_ofed.sh 2007-09-18 05:23:45.000000000 -0700 +++ scripts/ofed_1_3/build_ofed.sh 2007-09-18 05:31:55.000000000 -0700 @@ -272,7 +272,7 @@ build_ofa_kernel() ex rpmbuild -bs --define \'_topdir ${CWD}/topdir\' ${CWD}/topdir/SPECS/${kernel_spec} - ex cp -a ${CWD}/topdir/SRPMS/${kernel_proj}*src.rpm ${CWD}/${PACKAGE}-${PACKAGE_VERSION}/SRPMS + ex cp -a ${CWD}/topdir/SRPMS/${kernel_proj}-${kernel_proj_ver}-${kernel_proj_rel}.src.rpm ${CWD}/${PACKAGE}-${PACKAGE_VERSION}/SRPMS # Update BUILD_ID file tar xzf ${CWD}/builds/${kernel_proj}-${kernel_proj_ver}/${kernel_proj}-${kernel_proj_ver}.tgz ${kernel_proj}-${kernel_proj_ver}/BUILD_ID From jlentini at netapp.com Tue Sep 18 07:16:30 2007 From: jlentini at netapp.com (James Lentini) Date: Tue, 18 Sep 2007 10:16:30 -0400 (EDT) Subject: [ofa-general] Re: mlx4 violating radix tree API locking rules? In-Reply-To: <20070918055005.GG24414@mellanox.co.il> References: <20070911090313.GE15363@mellanox.co.il> <20070917062252.GA30842@mellanox.co.il> <20070918055005.GG24414@mellanox.co.il> Message-ID: On Tue, 18 Sep 2007, Michael S. Tsirkin wrote: > > Quoting Roland Dreier : > > Subject: Re: mlx4 violating radix tree API locking rules? > > > > By the way, in the past we've gotten push-back against using RCU in > > dual GPL/BSD code. > > I don't know what official Mellanox position is on this. > I think that in this case it's OK, simply because > RCU here is just a requirement from radix tree code, not > from mlx4 itself, so anyone who wants to use this code under BSD license, > can just implement radix tree in a way > that does not require RCU, and then rip out RCU calls. I agree. I don't see an issue with the currect licensing terms. When the code is compiled as part of Linux, the code is licensed under the GPL. 
> > I have no problem relicensing mlx4 to GPL-only and > > then sticking in the rcu_read_lock() stuff to handle this I guess. > > I don't see how that would help. > > BTW I looked for BSD code in kernel today that uses RCU: > $grep -i RCU `grep -l 'source and binary forms' . -rI > after manually filtering out words such as circular, gave me: > ./net/sunrpc/auth_gss/auth_gss.c > ./fs/nfs/nfs4proc.c > ./drivers/infiniband/hw/ipath/ipath_verbs_mcast.c > > > -- > MST > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at dev.mellanox.co.il Tue Sep 18 07:16:48 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 18 Sep 2007 16:16:48 +0200 Subject: [ofa-general] ANNOUNCE orenk taking over mstflint/imgen Message-ID: <20070918141648.GJ2050@mellanox.co.il> Oren Kladnitsky is taking over maintaining mstflint and imgen tools from me. His trees: git://git.openfabrics.org/~orenk/mstflint.git git://git.openfabrics.org/~orenk/imgen.git are, starting now, the authoritative source for these tools. Oren is the internal maintainer of Mellanox FW tools (MFT) and now he is assuming ownership on the OFED tools too. Thanks, -- MST From sashak at voltaire.com Tue Sep 18 07:31:12 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 16:31:12 +0200 Subject: [ofa-general] [PATCH] OpenSM: Improve QP0 and QP1 counter accounting In-Reply-To: <1189533856.11745.10.camel@hrosenstock-ws.xsigo.com> References: <1189533856.11745.10.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070918143112.GA31938@sashak.voltaire.com> On 11:04 Tue 11 Sep , Hal Rosenstock wrote: > OpenSM: Improve QP0 and QP1 counter accounting > > Note: Patch is based on OFED 1.2 > > Signed-off-by: Hal Rosenstock Applied with discussed removals. Thanks. 
Also I rebased this patch to master. So please look it still be fine. Sasha From rdreier at cisco.com Tue Sep 18 07:27:24 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Sep 2007 07:27:24 -0700 Subject: [ofa-general] Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang In-Reply-To: <20070918111803.1769.60619.sendpatchset@localhost.localdomain> (Krishna Kumar's message of "Tue, 18 Sep 2007 16:48:03 +0530") References: <20070918111803.1769.60619.sendpatchset@localhost.localdomain> Message-ID: Thanks for testing on ehca... > While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with I don't think you're actually using rc6 bits, since in your patch you have: > -poll_more: and I think that is only in Dave's net-2.6.24 tree now, right? > The problem is that the poll handler does netif_rx_complete (which > does a dev_put) followed by netif_rx_reschedule() to schedule for > more receives (which again does a dev_put). This reduces refcount to > < 0 (depending on how many times netif_rx_complete followed by > netif_rx_reschedule was called). Dave, the real problem seems to be that netif_rx_reschedule() calls __napi_schedule() rather than __netif_rx_schedule(), so it misses the call to dev_hold() that is needed to balance the dev_put() in netif_rx_complete(). The current netif_rx_reschedule() looks like it really should be napi_reschedule(), and we need a new function that takes a netdev too. Or am I misunderstanding the refcounting? I'll send a patch once I've had some breakfast and had a chance to at least compile it...
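The imbalance can be seen in a small user-space model. This is only a sketch under stated assumptions, not kernel code: struct model_dev and all model_* functions are hypothetical and mimic nothing more than which path takes a reference (as dev_hold() would) and which drops one (as dev_put() would).

```c
/* Toy stand-in for a net_device reference count. */
struct model_dev {
	int refcnt;
};

void model_hold(struct model_dev *d) { d->refcnt++; }
void model_put(struct model_dev *d)  { d->refcnt--; }

/* Scheduling the poll takes a reference. */
void model_rx_schedule(struct model_dev *d) { model_hold(d); }

/* Completing the poll drops the reference taken at schedule time. */
void model_rx_complete(struct model_dev *d) { model_put(d); }

/* Reschedule that re-queues the poll without taking a reference. */
void model_rx_reschedule_unbalanced(struct model_dev *d) { (void)d; }

/* Reschedule that takes the reference the next complete will drop. */
void model_rx_reschedule_balanced(struct model_dev *d) { model_hold(d); }

/*
 * One initial schedule, then 'rounds' complete+reschedule pairs, then
 * a final complete. Returns the resulting count, which should be 1
 * (the reference the model starts with) if the paths are balanced.
 */
int model_run(int rounds, int balanced)
{
	struct model_dev d = { .refcnt = 1 };

	model_rx_schedule(&d);
	for (int i = 0; i < rounds; i++) {
		model_rx_complete(&d);
		if (balanced)
			model_rx_reschedule_balanced(&d);
		else
			model_rx_reschedule_unbalanced(&d);
	}
	model_rx_complete(&d);
	return d.refcnt;
}
```

In this model the unbalanced variant loses one reference per complete+reschedule round, so three rounds leave the count at -2, while the balanced variant always ends back at 1.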
Krishna, unfortunately your proposed fix has a race: > - netif_rx_complete(dev, napi); > - if (unlikely(ib_req_notify_cq(priv->cq, > - IB_CQ_NEXT_COMP | > - IB_CQ_REPORT_MISSED_EVENTS)) && > - netif_rx_reschedule(napi)) > - goto poll_more; > + if (likely(!ib_req_notify_cq(priv->cq, > + IB_CQ_NEXT_COMP | > + IB_CQ_REPORT_MISSED_EVENTS))) It is possible for an interrupt to happen immediately right here, before the netif_rx_complete(), so that netif_rx_schedule() gets called while we are still on the poll list. > + netif_rx_complete(dev, napi); - R. From sashak at voltaire.com Tue Sep 18 08:14:37 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 17:14:37 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] infiniband-diags/ibnetdiscover: Bump build version In-Reply-To: <1190115116.12099.119.camel@hrosenstock-ws.xsigo.com> References: <1190115116.12099.119.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070918151437.GB31938@sashak.voltaire.com> On 04:31 Tue 18 Sep , Hal Rosenstock wrote: > infiniband-diags/ibnetdiscover: Bump build version > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue Sep 18 08:19:15 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 17:19:15 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM/man: Update email contact info In-Reply-To: <1190115281.12099.121.camel@hrosenstock-ws.xsigo.com> References: <1190115281.12099.121.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070918151915.GC31938@sashak.voltaire.com> On 04:34 Tue 18 Sep , Hal Rosenstock wrote: > OpenSM/man: Update email contact info > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From hrosenstock at xsigo.com Tue Sep 18 08:32:21 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 18 Sep 2007 08:32:21 -0700 Subject: [ofa-general] [PATCH] OpenSM: Improve QP0 and QP1 counter accounting In-Reply-To: <20070918143112.GA31938@sashak.voltaire.com> References: <1189533856.11745.10.camel@hrosenstock-ws.xsigo.com> <20070918143112.GA31938@sashak.voltaire.com> Message-ID: <1190129541.12099.144.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-09-18 at 16:31 +0200, Sasha Khapyorsky wrote: > On 11:04 Tue 11 Sep , Hal Rosenstock wrote: > > OpenSM: Improve QP0 and QP1 counter accounting > > > > Note: Patch is based on OFED 1.2 > > > > Signed-off-by: Hal Rosenstock > > Applied with discussed removals. Thanks. > > Also I rebased this patch to master. Thanks! > So please look it still be fine. Looks fine; a few cosmetic changes to follow shortly. -- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Tue Sep 18 08:32:25 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 18 Sep 2007 08:32:25 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM: Trivial comment changes and other cleanup Message-ID: <1190129545.12099.146.camel@hrosenstock-ws.xsigo.com> OpenSM: Trivial comment changes and other cleanup Signed-off-by: Hal Rosenstock diff --git a/opensm/include/opensm/osm_sa.h b/opensm/include/opensm/osm_sa.h index eb009eb..6c7b9c6 100644 --- a/opensm/include/opensm/osm_sa.h +++ b/opensm/include/opensm/osm_sa.h @@ -442,7 +442,7 @@ osm_sa_bind(IN osm_sa_t * const p_sa, IN const ib_net64_t port_guid); * osm_sa_vendor_send * * DESCRIPTION -* Sends SA MAD via osm_vendor_call and maintains the QP1 sent statistic +* Sends SA MAD via osm_vendor_send and maintains the QP1 sent statistic * * SYNOPSIS */ diff 
--git a/opensm/include/opensm/osm_stats.h b/opensm/include/opensm/osm_stats.h index 787d511..51424d1 100644 --- a/opensm/include/opensm/osm_stats.h +++ b/opensm/include/opensm/osm_stats.h @@ -128,20 +128,20 @@ typedef struct _osm_stats { * unrecognized attribute IDs and methods. * * sa_mads_outstanding -* Contains the number of MADs outstanding on QP1. +* Contains the number of SA MADs outstanding on QP1. * * sa_mads_rcvd -* Total number of QP1 MADs received. +* Total number of SA MADs received. * * sa_mads_sent -* Total number of QP1 MADs sent. +* Total number of SA MADs sent. * * sa_mads_rcvd_unknown -* Total number of unknown QP1 MADs received. This includes +* Total number of unknown SA MADs received. This includes * unrecognized attribute IDs and methods. * * sa_mads_ignored -* Total number of QP1 MADs received because SM is not +* Total number of SA MADs received because SM is not * master or SM is in first time sweep. * * SEE ALSO diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index ad5662b..34b69a0 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -390,18 +390,17 @@ static void print_status(osm_opensm_t * p_osm, FILE * out) #endif fprintf(out, "\n MAD stats\n" " ---------\n" - " QP0 MADS outstanding : %d\n" - " QP0 MADS outstanding (on wire) : %d\n" - " QP0 MADS rcvd : %d\n" - " QP0 MADS sent : %d\n" + " QP0 MADs outstanding : %d\n" + " QP0 MADs outstanding (on wire) : %d\n" + " QP0 MADs rcvd : %d\n" + " QP0 MADs sent : %d\n" " QP0 unicasts sent : %d\n" " QP0 unknown MADs rcvd : %d\n" - " SA MADS outstanding : %d\n" - " SA MADS rcvd : %d\n" - " SA MADS sent : %d\n" - " QP1 MADs sent : %d\n" - " QP1 unknown MADs rcvd : %d\n" - " QP1 MADs ignored : %d\n", + " SA MADs outstanding : %d\n" + " SA MADs rcvd : %d\n" + " SA MADs sent : %d\n" + " SA unknown MADs rcvd : %d\n" + " SA MADs ignored : %d\n", p_osm->stats.qp0_mads_outstanding, p_osm->stats.qp0_mads_outstanding_on_wire, p_osm->stats.qp0_mads_rcvd, @@ 
-409,7 +408,7 @@ static void print_status(osm_opensm_t * p_osm, FILE * out) p_osm->stats.qp0_unicasts_sent, p_osm->stats.qp0_mads_rcvd_unknown, p_osm->stats.sa_mads_outstanding, - p_osm->stats.sa_mads_rcvd, p_osm->stats.sa_mads_sent, + p_osm->stats.sa_mads_rcvd, p_osm->stats.sa_mads_sent, p_osm->stats.sa_mads_rcvd_unknown, p_osm->stats.sa_mads_ignored); From sashak at voltaire.com Tue Sep 18 09:19:12 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 18 Sep 2007 18:19:12 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM: Trivial comment changes and other cleanup In-Reply-To: <1190129545.12099.146.camel@hrosenstock-ws.xsigo.com> References: <1190129545.12099.146.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070918161912.GD31938@sashak.voltaire.com> On 08:32 Tue 18 Sep , Hal Rosenstock wrote: > OpenSM: Trivial comment changes and other cleanup > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From mst at dev.mellanox.co.il Tue Sep 18 09:34:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 18 Sep 2007 18:34:33 +0200 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: Message-ID: <20070918163433.GL2050@mellanox.co.il> > Quoting Roland Dreier : > Subject: InfiniBand/RDMA merge plans for 2.6.24 > > With 2.6.24 probably opening in the not-too-distant future, it's > probably a good time to review what my plans are for when the merge > window opens. Roland, could you merge the common TX CQ patch please? It actually fixes a real problem. 
-- MST From ardavis at ichips.intel.com Tue Sep 18 10:08:30 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 18 Sep 2007 10:08:30 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> Message-ID: <46F0060E.1080505@ichips.intel.com> Jeff Becker wrote: > Hi all. I have a first cut. > > If you view "http://www.openfabrics.org/listdir.php" in your browser, > all the download directories are given as links, and I list the > contents of WEB_README if it exists. Please let me know what you > think. Thanks. > Jeff, When can you move this to the downloads page? I would like to wrap this up this week. Maintainers, Please move your packages and update your WEB_README. Currently we only have rdmacm, dapl, cxgb3, and WinOF updated for this process. Thanks, -arlin From mshefty at ichips.intel.com Tue Sep 18 10:12:21 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 18 Sep 2007 10:12:21 -0700 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: References: Message-ID: <46F006F5.5090801@ichips.intel.com> > It is possible for the multicast consumer to call ib_sa_free_multicast() where > this leave request is queued to be later processed by the workqueue thread, and > then call ib_sa_join_multicast() which calls acquire_group() --before-- the leave > request was executed by the thread. So the lookup done by acquire_group() succeeds, > the code goes to the found: label and the group reference count climbs to (e.g.) 2. Yes - this is possible. Note that although the group reference count is 2, joins are tracked in different lists: active_list or pending_list.
The second join doesn't move to the active_list until it's processed by the callback thread, to synchronize against errors and leaves. > Following that the leave work-element causes the thread to just decrement the > reference count to 1 in release_group() and do nothing else, and the join > work-element causes the thread to return the cached address-handle attributes > to the consumer. So no SA query is being sent to the SA. This sounds like the correct behavior. > We saw the bug on a uniprocessor system running the ipath driver, where the > consumer is ipoib and the group is the IPv4 broadcast group. When we take down > the link of the switch port connected to the device across the cable, ipoib > rushes to leave the group and then join it. On this system the join "crosses > the leave" and the SA does not take into account the node when computing the > multicast routing of the group --> the node does not get the broadcast traffic. Does the SA remove the node from the multicast group? If the HCA port goes down, the multicast code will transition all existing multicast groups to the error state. An error will be reported on active joins. Pending joins will be processed normally after error handling has completed. > For now we have applied a workaround which causes the multicast code to > call release_group() from ib_sa_free_multicast(). The workaround is > implemented by using the patch below which causes mcast_groups_lost() > to be called also when the port actually goes up, and set the group state > to MCAST_ERROR such that the call to release_group() is not deferred (ipoib > does leave/join for every event, namely both on link down and up). I'm wondering if the problem isn't in ipoib. When an error occurs on a multicast group, the group transitions into the error state, and the user is called back to let them know that they need to rejoin the group.
Since ipoib responds directly to port events and not multicast callback errors, is there a chance ipoib missed the error notification? In short, I'm still not sure where the problem lies. - Sean From rdreier at cisco.com Tue Sep 18 10:18:03 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Sep 2007 10:18:03 -0700 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <20070918163433.GL2050@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 18 Sep 2007 18:34:33 +0200") References: <20070918163433.GL2050@mellanox.co.il> Message-ID: > Roland, could you merge the common TX CQ patch please? > It actually fixes a real problem. Yes, I will, but it collides with the net-2.6.24 NAPI rework I think, so it may not go in until a few days after the merge window. Have you verified that the patch cures the interrupt overload issues? From jackm at dev.mellanox.co.il Tue Sep 18 10:24:57 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 18 Sep 2007 19:24:57 +0200 Subject: [ofa-general] [PATCH 0 of 5] XRC implementation patches (libibverbs, libmlx4, core, mlx4) Message-ID: <200709181924.57665.jackm@dev.mellanox.co.il> The implementation is according to Michael Tsirkin's post of August 9: [ofa-general] [PATCHv4 RFC] Scalable Reliable Connection: API and documentation The Kernel patches (core and mlx4) all assume that Eli Cohen's 17-patch list, posted on September 11, has been applied (at least those patches which apply to the core and mlx4 modules). The core patch has been divided in two, for easier understanding. The first patch implements XRC only for the case that ibv_open_xrc_domain() is called with fd = -1 (i.e., the mechanism for sharing xrc domains between processes on the same host is disabled). The second core patch adds support for fd indicating a created/opened file. 
- Jack From jackm at dev.mellanox.co.il Tue Sep 18 10:25:01 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 18 Sep 2007 19:25:01 +0200 Subject: [ofa-general] [PATCH 1 of 5] libibverbs: XRC implementation Message-ID: <200709181925.02387.jackm@dev.mellanox.co.il> Implement eXtended Reliable Connections. Signed-off-by: Michael S. Tsirkin Signed-off-by: Jack Morgenstein diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h index 67a3bf8..30ba79f 100644 --- a/include/infiniband/driver.h +++ b/include/infiniband/driver.h @@ -99,6 +99,11 @@ int ibv_cmd_create_srq(struct ibv_pd *pd, struct ibv_srq *srq, struct ibv_srq_init_attr *attr, struct ibv_create_srq *cmd, size_t cmd_size, struct ibv_create_srq_resp *resp, size_t resp_size); +int ibv_cmd_create_xrc_srq(struct ibv_pd *pd, + struct ibv_srq *srq, struct ibv_srq_init_attr *attr, + uint32_t xrc_domain, uint32_t xrc_cq, + struct ibv_create_xrc_srq *cmd, size_t cmd_size, + struct ibv_create_srq_resp *resp, size_t resp_size); int ibv_cmd_modify_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, enum ibv_srq_attr_mask srq_attr_mask, @@ -134,6 +139,12 @@ int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); int ibv_dontfork_range(void *base, size_t size); int ibv_dofork_range(void *base, size_t size); +int ibv_cmd_open_xrc_domain(struct ibv_context *context, int fd, int oflag, + struct ibv_xrc_domain *d, + struct ibv_open_xrc_domain_resp *resp, + size_t resp_size); +int ibv_cmd_close_xrc_domain(struct ibv_xrc_domain *d); + /* * sysfs helper functions diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h index 0db083a..3845a4c 100644 --- a/include/infiniband/kern-abi.h +++ b/include/infiniband/kern-abi.h @@ -85,7 +85,10 @@ enum { IB_USER_VERBS_CMD_MODIFY_SRQ, IB_USER_VERBS_CMD_QUERY_SRQ, IB_USER_VERBS_CMD_DESTROY_SRQ, - IB_USER_VERBS_CMD_POST_SRQ_RECV + IB_USER_VERBS_CMD_POST_SRQ_RECV, + IB_USER_VERBS_CMD_CREATE_XRC_SRQ, + 
IB_USER_VERBS_CMD_OPEN_XRC_DOMAIN, + IB_USER_VERBS_CMD_CLOSE_XRC_DOMAIN }; /* @@ -706,6 +709,21 @@ struct ibv_create_srq { __u64 driver_data[0]; }; +struct ibv_create_xrc_srq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u64 user_handle; + __u32 pd_handle; + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u32 xrcd_handle; + __u32 xrc_cq; + __u64 driver_data[0]; +}; + struct ibv_create_srq_resp { __u32 srq_handle; __u32 max_wr; @@ -754,6 +772,29 @@ struct ibv_destroy_srq_resp { __u32 events_reported; }; +struct ibv_open_xrc_domain { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 fd; + __u32 oflags; + __u64 driver_data[0]; +}; + +struct ibv_open_xrc_domain_resp { + __u32 xrcd_handle; +}; + +struct ibv_close_xrc_domain { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 xrcd_handle; + __u64 driver_data[0]; +}; + /* * Compatibility with older ABI versions */ @@ -803,6 +844,9 @@ enum { * trick opcodes in IBV_INIT_CMD() doesn't break. 
*/ IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL_V2 = -1, + IB_USER_VERBS_CMD_CREATE_XRC_SRQ_V2 = -1, + IB_USER_VERBS_CMD_OPEN_XRC_DOMAIN_V2 = -1, + IB_USER_VERBS_CMD_CLOSE_XRC_DOMAIN_V2 = -1, }; struct ibv_destroy_cq_v1 { diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h index acc1b82..4c63208 100644 --- a/include/infiniband/verbs.h +++ b/include/infiniband/verbs.h @@ -92,7 +92,8 @@ enum ibv_device_cap_flags { IBV_DEVICE_SYS_IMAGE_GUID = 1 << 11, IBV_DEVICE_RC_RNR_NAK_GEN = 1 << 12, IBV_DEVICE_SRQ_RESIZE = 1 << 13, - IBV_DEVICE_N_NOTIFY_CQ = 1 << 14 + IBV_DEVICE_N_NOTIFY_CQ = 1 << 14, + IBV_DEVICE_XRC = 1 << 18 }; enum ibv_atomic_cap { @@ -370,6 +371,11 @@ struct ibv_ah_attr { uint8_t port_num; }; +struct ibv_xrc_domain { + struct ibv_context *context; + uint32_t handle; +}; + enum ibv_srq_attr_mask { IBV_SRQ_MAX_WR = 1 << 0, IBV_SRQ_LIMIT = 1 << 1 @@ -389,7 +395,8 @@ struct ibv_srq_init_attr { enum ibv_qp_type { IBV_QPT_RC = 2, IBV_QPT_UC, - IBV_QPT_UD + IBV_QPT_UD, + IBV_QPT_XRC }; struct ibv_qp_cap { @@ -408,6 +415,7 @@ struct ibv_qp_init_attr { struct ibv_qp_cap cap; enum ibv_qp_type qp_type; int sq_sig_all; + struct ibv_xrc_domain *xrc_domain; }; enum ibv_qp_attr_mask { @@ -526,6 +534,7 @@ struct ibv_send_wr { uint32_t remote_qkey; } ud; } wr; + uint32_t xrc_remote_srq_num; }; struct ibv_recv_wr { @@ -553,6 +562,10 @@ struct ibv_srq { pthread_mutex_t mutex; pthread_cond_t cond; uint32_t events_completed; + + uint32_t xrc_srq_num; + struct ibv_xrc_domain *xrc_domain; + struct ibv_cq *xrc_cq; }; struct ibv_qp { @@ -570,6 +583,8 @@ struct ibv_qp { pthread_mutex_t mutex; pthread_cond_t cond; uint32_t events_completed; + + struct ibv_xrc_domain *xrc_domain; }; struct ibv_comp_channel { @@ -624,6 +639,7 @@ struct ibv_device { char ibdev_path[IBV_SYSFS_PATH_MAX]; }; +#define HAVE_IBV_CREATE_XRC_SRQ struct ibv_context_ops { int (*query_device)(struct ibv_context *context, struct ibv_device_attr *device_attr); @@ -680,6 +696,13 @@ struct ibv_context_ops { 
int (*detach_mcast)(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); void (*async_event)(struct ibv_async_event *event); + struct ibv_srq * (*create_xrc_srq)(struct ibv_pd *pd, + struct ibv_xrc_domain *xrc_domain, + struct ibv_cq *xrc_cq, + struct ibv_srq_init_attr *srq_init_attr); + struct ibv_xrc_domain * (*open_xrc_domain)(struct ibv_context *context, + int fd, int oflag); + int (*close_xrc_domain)(struct ibv_xrc_domain *d); }; struct ibv_context { @@ -912,6 +935,25 @@ struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *srq_init_attr); /** + * ibv_create_xrc_srq - Creates a SRQ associated with the specified protection + * domain and xrc domain. + * @pd: The protection domain associated with the SRQ. + * @xrc_domain: The XRC domain associated with the SRQ. + * @xrc_cq: CQ to report completions for XRC packets on. + * + * @srq_init_attr: A list of initial attributes required to create the SRQ. + * + * srq_attr->max_wr and srq_attr->max_sge are read to determine the + * requested size of the SRQ, and set to the actual values allocated + * on return. If ibv_create_xrc_srq() succeeds, then max_wr and max_sge + * will always be at least as large as the requested values. + */ +struct ibv_srq *ibv_create_xrc_srq(struct ibv_pd *pd, + struct ibv_xrc_domain *xrc_domain, + struct ibv_cq *xrc_cq, + struct ibv_srq_init_attr *srq_init_attr); + +/** * ibv_modify_srq - Modifies the attributes for the specified SRQ. * @srq: The SRQ to modify. * @srq_attr: On input, specifies the SRQ attributes to modify. On output, @@ -1074,6 +1116,42 @@ int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); */ int ibv_fork_init(void); +/** + * ibv_open_xrc_domain - open an XRC domain + * Returns a reference to an XRC domain.
+ * + * @context: Device context + * @fd: descriptor for inode associated with the domain + * If fd == -1, no inode is associated with the domain; in this case, + * the only legal value for oflag is O_CREAT + * + * @oflag: oflag values are constructed by OR-ing flags from the following list + * + * O_CREAT + * If a domain belonging to device named by context is already associated + * with the inode, this flag has no effect, except as noted under O_EXCL + * below. Otherwise, a new XRC domain is created and is associated with + * inode specified by fd. + * + * O_EXCL + * If O_EXCL and O_CREAT are set, open will fail if a domain associated with + * the inode exists. The check for the existence of the domain and creation + * of the domain if it does not exist is atomic with respect to other + * processes executing open with fd naming the same inode. + */ +struct ibv_xrc_domain *ibv_open_xrc_domain(struct ibv_context *context, + int fd, int oflag); + +/** + * ibv_close_xrc_domain - close an XRC domain + * If this is the last reference, destroys the domain. + * + * @d: reference to XRC domain to close + * + * close is implicitly performed at process exit. 
+ */ +int ibv_close_xrc_domain(struct ibv_xrc_domain *d); + END_C_DECLS # undef __attribute_const diff --git a/src/cmd.c b/src/cmd.c index 6d4331f..d6b2a4b 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -482,6 +482,34 @@ int ibv_cmd_create_srq(struct ibv_pd *pd, return 0; } +int ibv_cmd_create_xrc_srq(struct ibv_pd *pd, + struct ibv_srq *srq, struct ibv_srq_init_attr *attr, + uint32_t xrcd_handle, uint32_t xrc_cq, + struct ibv_create_xrc_srq *cmd, size_t cmd_size, + struct ibv_create_srq_resp *resp, size_t resp_size) +{ + IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_XRC_SRQ, resp, resp_size); + cmd->user_handle = (uintptr_t) srq; + cmd->pd_handle = pd->handle; + cmd->max_wr = attr->attr.max_wr; + cmd->max_sge = attr->attr.max_sge; + cmd->srq_limit = attr->attr.srq_limit; + cmd->xrcd_handle = xrcd_handle; + cmd->xrc_cq = xrc_cq; + + if (write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size) + return errno; + + VALGRIND_MAKE_MEM_DEFINED(resp, resp_size); + + srq->handle = resp->srq_handle; + srq->context = pd->context; + attr->attr.max_wr = resp->max_wr; + attr->attr.max_sge = resp->max_sge; + + return 0; +} + static int ibv_cmd_modify_srq_v3(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, enum ibv_srq_attr_mask srq_attr_mask, @@ -596,7 +624,6 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, cmd->pd_handle = pd->handle; cmd->send_cq_handle = attr->send_cq->handle; cmd->recv_cq_handle = attr->recv_cq->handle; - cmd->srq_handle = attr->srq ? attr->srq->handle : 0; cmd->max_send_wr = attr->cap.max_send_wr; cmd->max_recv_wr = attr->cap.max_recv_wr; cmd->max_send_sge = attr->cap.max_send_sge; @@ -604,7 +631,11 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, cmd->max_inline_data = attr->cap.max_inline_data; cmd->sq_sig_all = attr->sq_sig_all; cmd->qp_type = attr->qp_type; - cmd->is_srq = !!attr->srq; + cmd->is_srq = attr->qp_type == IBV_QPT_XRC ? + !!attr->xrc_domain : !!attr->srq; + cmd->srq_handle = attr->qp_type == IBV_QPT_XRC ? + (attr->xrc_domain ? 
attr->xrc_domain->handle : 0) : + (attr->srq ? attr->srq->handle : 0); cmd->reserved = 0; if (write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size) @@ -1107,3 +1138,41 @@ int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) return 0; } + +int ibv_cmd_open_xrc_domain(struct ibv_context *context, int fd, int oflag, + struct ibv_xrc_domain *d, + struct ibv_open_xrc_domain_resp *resp, + size_t resp_size) +{ + struct ibv_open_xrc_domain cmd; + + if (abi_ver < 6) + return ENOSYS; + + IBV_INIT_CMD_RESP(&cmd, sizeof cmd, OPEN_XRC_DOMAIN, resp, resp_size); + cmd.fd = fd; + cmd.oflags = oflag; + + if (write(context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) + return errno; + + d->handle = resp->xrcd_handle; + + return 0; +} + +int ibv_cmd_close_xrc_domain(struct ibv_xrc_domain *d) +{ + struct ibv_close_xrc_domain cmd; + + if (abi_ver < 6) + return ENOSYS; + + IBV_INIT_CMD(&cmd, sizeof cmd, CLOSE_XRC_DOMAIN); + cmd.xrcd_handle = d->handle; + + if (write(d->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) + return errno; + return 0; +} + diff --git a/src/libibverbs.map b/src/libibverbs.map index 3a346ed..fea3ff7 100644 --- a/src/libibverbs.map +++ b/src/libibverbs.map @@ -91,4 +91,10 @@ IBVERBS_1.1 { ibv_dontfork_range; ibv_dofork_range; ibv_register_driver; + ibv_create_xrc_srq; + ibv_cmd_create_xrc_srq; + ibv_open_xrc_domain; + ibv_cmd_open_xrc_domain; + ibv_close_xrc_domain; + ibv_cmd_close_xrc_domain; } IBVERBS_1.0; diff --git a/src/verbs.c b/src/verbs.c index f5cf4d3..4083fcf 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -364,6 +364,9 @@ struct ibv_srq *__ibv_create_srq(struct ibv_pd *pd, srq->context = pd->context; srq->srq_context = srq_init_attr->srq_context; srq->pd = pd; + srq->xrc_domain = NULL; + srq->xrc_cq = NULL; + srq->xrc_srq_num = 0; srq->events_completed = 0; pthread_mutex_init(&srq->mutex, NULL); pthread_cond_init(&srq->cond, NULL); @@ -373,6 +376,32 @@ struct ibv_srq *__ibv_create_srq(struct ibv_pd *pd, } 
default_symver(__ibv_create_srq, ibv_create_srq); +struct ibv_srq *__ibv_create_xrc_srq(struct ibv_pd *pd, + struct ibv_xrc_domain *xrc_domain, + struct ibv_cq *xrc_cq, + struct ibv_srq_init_attr *srq_init_attr) +{ + struct ibv_srq *srq; + + if (!pd->context->ops.create_xrc_srq) + return NULL; + + srq = pd->context->ops.create_xrc_srq(pd, xrc_domain, xrc_cq, srq_init_attr); + if (srq) { + srq->context = pd->context; + srq->srq_context = srq_init_attr->srq_context; + srq->pd = pd; + srq->xrc_domain = xrc_domain; + srq->xrc_cq = xrc_cq; + srq->events_completed = 0; + pthread_mutex_init(&srq->mutex, NULL); + pthread_cond_init(&srq->cond, NULL); + } + + return srq; +} +default_symver(__ibv_create_xrc_srq, ibv_create_xrc_srq); + int __ibv_modify_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, enum ibv_srq_attr_mask srq_attr_mask) @@ -396,8 +425,9 @@ default_symver(__ibv_destroy_srq, ibv_destroy_srq); struct ibv_qp *__ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr) { - struct ibv_qp *qp = pd->context->ops.create_qp(pd, qp_init_attr); + struct ibv_qp *qp; + qp = pd->context->ops.create_qp(pd, qp_init_attr); if (qp) { qp->context = pd->context; qp->qp_context = qp_init_attr->qp_context; @@ -408,6 +438,8 @@ struct ibv_qp *__ibv_create_qp(struct ibv_pd *pd, qp->qp_type = qp_init_attr->qp_type; qp->state = IBV_QPS_RESET; qp->events_completed = 0; + qp->xrc_domain = qp_init_attr->qp_type == IBV_QPT_XRC ? 
+ qp_init_attr->xrc_domain : NULL; pthread_mutex_init(&qp->mutex, NULL); pthread_cond_init(&qp->cond, NULL); } @@ -541,3 +573,28 @@ int __ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) return qp->context->ops.detach_mcast(qp, gid, lid); } default_symver(__ibv_detach_mcast, ibv_detach_mcast); + +struct ibv_xrc_domain *__ibv_open_xrc_domain(struct ibv_context *context, + int fd, int oflag) +{ + struct ibv_xrc_domain *d; + + if (!context->ops.open_xrc_domain) + return NULL; + + d = context->ops.open_xrc_domain(context, fd, oflag); + if (d) + d->context = context; + + return d; +} +default_symver(__ibv_open_xrc_domain, ibv_open_xrc_domain); + +int __ibv_close_xrc_domain(struct ibv_xrc_domain *d) +{ + if (!d->context->ops.close_xrc_domain) + return 0; + + return d->context->ops.close_xrc_domain(d); +} +default_symver(__ibv_close_xrc_domain, ibv_close_xrc_domain); From jackm at dev.mellanox.co.il Tue Sep 18 10:25:05 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 18 Sep 2007 19:25:05 +0200 Subject: [ofa-general] [PATCH 2 of 5] libmlx4: XRC implementation Message-ID: <200709181925.06466.jackm@dev.mellanox.co.il> libmlx4: Implement XRC (eXtended RC) support. Signed-off-by: Jack Morgenstein diff --git a/src/cq.c b/src/cq.c index 06fd2ca..c0d7a8b 100644 --- a/src/cq.c +++ b/src/cq.c @@ -196,9 +196,11 @@ static int mlx4_poll_one(struct mlx4_cq *cq, struct mlx4_cqe *cqe; struct mlx4_srq *srq; uint32_t qpn; + uint32_t srqn; uint16_t wqe_index; int is_error; int is_send; + int is_src_recv = 0; cqe = next_cqe_sw(cq); if (!cqe) @@ -220,20 +222,30 @@ static int mlx4_poll_one(struct mlx4_cq *cq, is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; - if (!*cur_qp || - (ntohl(cqe->my_qpn) & 0xffffff) != (*cur_qp)->ibv_qp.qp_num) { - /* - * We do not have to take the QP table lock here, - * because CQs will be locked while QPs are removed - * from the table. 
- */ - *cur_qp = mlx4_find_qp(to_mctx(cq->ibv_cq.context), - ntohl(cqe->my_qpn) & 0xffffff); - if (!*cur_qp) + if (qpn & MLX4_XRC_QPN_BIT && !is_send) { + srqn = ntohl(cqe->g_mlpath_rqpn) & 0xffffff; + /* + * We do not have to take the XRC SRQ table lock here, + * because CQs will be locked while XRC SRQs are removed + * from the table. + */ + srq = mlx4_find_xrc_srq(to_mctx(cq->ibv_cq.context), srqn); + if (!srq) return CQ_POLL_ERR; - } - - wc->qp_num = (*cur_qp)->ibv_qp.qp_num; + is_src_recv = 1; + } else if (!*cur_qp || (qpn & 0xffffff) != (*cur_qp)->ibv_qp.qp_num) { + /* + * We do not have to take the QP table lock here, + * because CQs will be locked while QPs are removed + * from the table. + */ + *cur_qp = mlx4_find_qp(to_mctx(cq->ibv_cq.context), + qpn & 0xffffff); + if (!*cur_qp) + return CQ_POLL_ERR; + } + + wc->qp_num = qpn & 0xffffff; if (is_send) { wq = &(*cur_qp)->sq; @@ -241,6 +253,10 @@ static int mlx4_poll_one(struct mlx4_cq *cq, wq->tail += (uint16_t) (wqe_index - (uint16_t) wq->tail); wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; + } else if (is_src_recv) { + wqe_index = htons(cqe->wqe_index); + wc->wr_id = srq->wrid[wqe_index]; + mlx4_free_srq_wqe(srq, wqe_index); } else if ((*cur_qp)->ibv_qp.srq) { srq = to_msrq((*cur_qp)->ibv_qp.srq); wqe_index = htons(cqe->wqe_index); @@ -386,6 +402,10 @@ void mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq) uint32_t prod_index; uint8_t owner_bit; int nfreed = 0; + int is_xrc_srq = 0; + + if (srq && srq->ibv_srq.xrc_cq) + is_xrc_srq = 1; pthread_spin_lock(&cq->lock); @@ -406,7 +426,12 @@ void mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq) */ while ((int) --prod_index - (int) cq->cons_index >= 0) { cqe = get_cqe(cq, prod_index & cq->ibv_cq.cqe); - if ((ntohl(cqe->my_qpn) & 0xffffff) == qpn) { + if (is_xrc_srq && + ((ntohl(cqe->g_mlpath_rqpn) & 0xffffff) == srq->srqn) && + !(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK)) { + mlx4_free_srq_wqe(srq,
ntohs(cqe->wqe_index)); + ++nfreed; + } else if ((ntohl(cqe->my_qpn) & 0xffffff) == qpn) { if (srq && !(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK)) mlx4_free_srq_wqe(srq, ntohs(cqe->wqe_index)); ++nfreed; diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h index 20a40c9..5c04145 100644 --- a/src/mlx4-abi.h +++ b/src/mlx4-abi.h @@ -68,6 +68,14 @@ struct mlx4_resize_cq { __u64 buf_addr; }; +#ifdef HAVE_IBV_CREATE_XRC_SRQ +struct mlx4_create_xrc_srq { + struct ibv_create_xrc_srq ibv_cmd; + __u64 buf_addr; + __u64 db_addr; +}; +#endif + struct mlx4_create_srq { struct ibv_create_srq ibv_cmd; __u64 buf_addr; @@ -90,4 +98,12 @@ struct mlx4_create_qp { __u8 reserved[5]; }; +#ifdef HAVE_IBV_CREATE_XRC_SRQ +struct mlx4_open_xrc_domain_resp { + struct ibv_open_xrc_domain_resp ibv_resp; + __u32 xrcdn; + __u32 reserved; +}; +#endif + #endif /* MLX4_ABI_H */ diff --git a/src/mlx4.c b/src/mlx4.c index b2e2ba9..95902cd 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -94,6 +94,11 @@ static struct ibv_context_ops mlx4_ctx_ops = { .post_recv = mlx4_post_recv, .create_ah = mlx4_create_ah, .destroy_ah = mlx4_destroy_ah, +#ifdef HAVE_IBV_CREATE_XRC_SRQ + .create_xrc_srq = mlx4_create_xrc_srq, + .open_xrc_domain = mlx4_open_xrc_domain, + .close_xrc_domain = mlx4_close_xrc_domain, +#endif .attach_mcast = mlx4_attach_mcast, .detach_mcast = mlx4_detach_mcast }; @@ -123,6 +128,15 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_ for (i = 0; i < MLX4_QP_TABLE_SIZE; ++i) context->qp_table[i].refcnt = 0; + context->num_xrc_srqs = resp.qp_tab_size; + context->xrc_srq_table_shift = ffs(context->num_xrc_srqs) - 1 + - MLX4_XRC_SRQ_TABLE_BITS; + context->xrc_srq_table_mask = (1 << context->xrc_srq_table_shift) - 1; + + pthread_mutex_init(&context->xrc_srq_table_mutex, NULL); + for (i = 0; i < MLX4_XRC_SRQ_TABLE_SIZE; ++i) + context->xrc_srq_table[i].refcnt = 0; + for (i = 0; i < MLX4_NUM_DB_TYPE; ++i) context->db_list[i] = NULL; diff --git a/src/mlx4.h b/src/mlx4.h index
3710a17..deb0f55 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -97,6 +97,16 @@ enum { MLX4_QP_TABLE_MASK = MLX4_QP_TABLE_SIZE - 1 }; +enum { + MLX4_XRC_SRQ_TABLE_BITS = 8, + MLX4_XRC_SRQ_TABLE_SIZE = 1 << MLX4_XRC_SRQ_TABLE_BITS, + MLX4_XRC_SRQ_TABLE_MASK = MLX4_XRC_SRQ_TABLE_SIZE - 1 +}; + +enum { + MLX4_XRC_QPN_BIT = (1 << 23) +}; + enum mlx4_db_type { MLX4_DB_TYPE_CQ, MLX4_DB_TYPE_RQ, @@ -157,6 +167,15 @@ struct mlx4_context { int qp_table_shift; int qp_table_mask; + struct { + struct mlx4_srq **table; + int refcnt; + } xrc_srq_table[MLX4_XRC_SRQ_TABLE_SIZE]; + pthread_mutex_t xrc_srq_table_mutex; + int num_xrc_srqs; + int xrc_srq_table_shift; + int xrc_srq_table_mask; + struct mlx4_db_page *db_list[MLX4_NUM_DB_TYPE]; pthread_mutex_t db_list_mutex; }; @@ -242,6 +261,11 @@ struct mlx4_ah { struct mlx4_av av; }; +struct mlx4_xrc_domain { + struct ibv_xrc_domain ibv_xrcd; + uint32_t xrcdn; +}; + static inline unsigned long align(unsigned long val, unsigned long align) { return (val + align - 1) & ~(align - 1); @@ -286,6 +310,13 @@ static inline struct mlx4_ah *to_mah(struct ibv_ah *ibah) return to_mxxx(ah, ah); } +#ifdef HAVE_IBV_CREATE_XRC_SRQ +static inline struct mlx4_xrc_domain *to_mxrcd(struct ibv_xrc_domain *ibxrcd) +{ + return to_mxxx(xrcd, xrc_domain); +} +#endif + int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size); void mlx4_free_buf(struct mlx4_buf *buf); @@ -317,7 +348,7 @@ void mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, void mlx4_cq_resize_copy_cqes(struct mlx4_cq *cq, void *buf, int new_cqe); struct ibv_srq *mlx4_create_srq(struct ibv_pd *pd, - struct ibv_srq_init_attr *attr); + struct ibv_srq_init_attr *attr); int mlx4_modify_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr, enum ibv_srq_attr_mask mask); @@ -330,6 +361,10 @@ void mlx4_free_srq_wqe(struct mlx4_srq *srq, int ind); int mlx4_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr); +struct mlx4_srq *mlx4_find_xrc_srq(struct 
mlx4_context *ctx, uint32_t xrc_srqn); +int mlx4_store_xrc_srq(struct mlx4_context *ctx, uint32_t xrc_srqn, + struct mlx4_srq *srq); +void mlx4_clear_xrc_srq(struct mlx4_context *ctx, uint32_t xrc_srqn); struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); int mlx4_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, @@ -360,5 +395,16 @@ int mlx4_alloc_av(struct mlx4_pd *pd, struct ibv_ah_attr *attr, void mlx4_free_av(struct mlx4_ah *ah); int mlx4_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); +#ifdef HAVE_IBV_CREATE_XRC_SRQ +struct ibv_srq *mlx4_create_xrc_srq(struct ibv_pd *pd, + struct ibv_xrc_domain *xrc_domain, + struct ibv_cq *xrc_cq, + struct ibv_srq_init_attr *attr); +struct ibv_xrc_domain *mlx4_open_xrc_domain(struct ibv_context *context, + int fd, int oflag); + +int mlx4_close_xrc_domain(struct ibv_xrc_domain *d); +#endif + #endif /* MLX4_H */ diff --git a/src/qp.c b/src/qp.c index ae0ae82..defe346 100644 --- a/src/qp.c +++ b/src/qp.c @@ -157,7 +157,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; - ctrl->srcrb_flags = + ctrl->xrcrb_flags = (wr->send_flags & IBV_SEND_SIGNALED ? htonl(MLX4_WQE_CTRL_CQ_UPDATE) : 0) | (wr->send_flags & IBV_SEND_SOLICITED ? 
@@ -174,6 +174,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, size = sizeof *ctrl / 16; switch (ibqp->qp_type) { + case IBV_QPT_XRC: + ctrl->xrcrb_flags |= htonl(wr->xrc_remote_srq_num << 8); + /* fall thru */ case IBV_QPT_RC: case IBV_QPT_UC: switch (wr->opcode) { diff --git a/src/srq.c b/src/srq.c index ba2ceb9..3cd1a95 100644 --- a/src/srq.c +++ b/src/srq.c @@ -167,3 +167,53 @@ int mlx4_alloc_srq_buf(struct ibv_pd *pd, struct ibv_srq_attr *attr, return 0; } + +struct mlx4_srq *mlx4_find_xrc_srq(struct mlx4_context *ctx, uint32_t xrc_srqn) +{ + int tind = (xrc_srqn & (ctx->num_xrc_srqs - 1)) >> ctx->xrc_srq_table_shift; + + if (ctx->xrc_srq_table[tind].refcnt) + return ctx->xrc_srq_table[tind].table[xrc_srqn & ctx->xrc_srq_table_mask]; + else + return NULL; +} + +int mlx4_store_xrc_srq(struct mlx4_context *ctx, uint32_t xrc_srqn, + struct mlx4_srq *srq) +{ + int tind = (xrc_srqn & (ctx->num_xrc_srqs - 1)) >> ctx->xrc_srq_table_shift; + int ret = 0; + + pthread_mutex_lock(&ctx->xrc_srq_table_mutex); + + if (!ctx->xrc_srq_table[tind].refcnt) { + ctx->xrc_srq_table[tind].table = calloc(ctx->xrc_srq_table_mask + 1, + sizeof (struct mlx4_srq *)); + if (!ctx->xrc_srq_table[tind].table) { + ret = -1; + goto out; + } + } + + ++ctx->xrc_srq_table[tind].refcnt; + ctx->xrc_srq_table[tind].table[xrc_srqn & ctx->xrc_srq_table_mask] = srq; + +out: + pthread_mutex_unlock(&ctx->xrc_srq_table_mutex); + return ret; +} + +void mlx4_clear_xrc_srq(struct mlx4_context *ctx, uint32_t xrc_srqn) +{ + int tind = (xrc_srqn & (ctx->num_xrc_srqs - 1)) >> ctx->xrc_srq_table_shift; + + pthread_mutex_lock(&ctx->xrc_srq_table_mutex); + + if (!--ctx->xrc_srq_table[tind].refcnt) + free(ctx->xrc_srq_table[tind].table); + else + ctx->xrc_srq_table[tind].table[xrc_srqn & ctx->xrc_srq_table_mask] = NULL; + + pthread_mutex_unlock(&ctx->xrc_srq_table_mutex); +} + diff --git a/src/verbs.c b/src/verbs.c index b0273a1..728da64 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -246,7 +246,7 @@
int mlx4_destroy_cq(struct ibv_cq *cq) } struct ibv_srq *mlx4_create_srq(struct ibv_pd *pd, - struct ibv_srq_init_attr *attr) + struct ibv_srq_init_attr *attr) { struct mlx4_create_srq cmd; struct mlx4_create_srq_resp resp; @@ -287,7 +287,6 @@ struct ibv_srq *mlx4_create_srq(struct ibv_pd *pd, goto err_db; srq->srqn = resp.srqn; - return &srq->ibv_srq; err_db: @@ -320,18 +319,36 @@ int mlx4_query_srq(struct ibv_srq *srq, return ibv_cmd_query_srq(srq, attr, &cmd, sizeof cmd); } -int mlx4_destroy_srq(struct ibv_srq *srq) +int mlx4_destroy_srq(struct ibv_srq *ibsrq) { + struct mlx4_srq *srq = to_msrq(ibsrq); + struct mlx4_cq *mcq = NULL; int ret; - ret = ibv_cmd_destroy_srq(srq); - if (ret) + if (ibsrq->xrc_cq) { + /* is an xrc_srq */ + mcq = to_mcq(ibsrq->xrc_cq); + mlx4_cq_clean(mcq, 0, srq); + pthread_spin_lock(&mcq->lock); + mlx4_clear_xrc_srq(to_mctx(ibsrq->context), srq->srqn); + pthread_spin_unlock(&mcq->lock); + } + + ret = ibv_cmd_destroy_srq(ibsrq); + if (ret) { + if (ibsrq->xrc_cq) { + pthread_spin_lock(&mcq->lock); + mlx4_store_xrc_srq(to_mctx(ibsrq->context), + srq->srqn, srq); + pthread_spin_unlock(&mcq->lock); + } return ret; + } - mlx4_free_db(to_mctx(srq->context), MLX4_DB_TYPE_RQ, to_msrq(srq)->db); - mlx4_free_buf(&to_msrq(srq)->buf); - free(to_msrq(srq)->wrid); - free(to_msrq(srq)); + mlx4_free_db(to_mctx(ibsrq->context), MLX4_DB_TYPE_RQ, srq->db); + mlx4_free_buf(&srq->buf); + free(srq->wrid); + free(srq); return 0; } @@ -606,3 +623,103 @@ int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) { return ibv_cmd_detach_mcast(qp, gid, lid); } + +#ifdef HAVE_IBV_CREATE_XRC_SRQ +struct ibv_srq *mlx4_create_xrc_srq(struct ibv_pd *pd, + struct ibv_xrc_domain *xrc_domain, + struct ibv_cq *xrc_cq, + struct ibv_srq_init_attr *attr) +{ + struct mlx4_create_xrc_srq cmd; + struct mlx4_create_srq_resp resp; + struct mlx4_srq *srq; + int ret; + + /* Sanity check SRQ size before proceeding */ + if (attr->attr.max_wr > 1 << 16 || 
attr->attr.max_sge > 64) + return NULL; + + srq = malloc(sizeof *srq); + if (!srq) + return NULL; + + if (pthread_spin_init(&srq->lock, PTHREAD_PROCESS_PRIVATE)) + goto err; + + srq->max = align_queue_size(attr->attr.max_wr + 1); + srq->max_gs = attr->attr.max_sge; + srq->counter = 0; + + if (mlx4_alloc_srq_buf(pd, &attr->attr, srq)) + goto err; + + srq->db = mlx4_alloc_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ); + if (!srq->db) + goto err_free; + + *srq->db = 0; + + cmd.buf_addr = (uintptr_t) srq->buf.buf; + cmd.db_addr = (uintptr_t) srq->db; + + ret = ibv_cmd_create_xrc_srq(pd, &srq->ibv_srq, attr, + xrc_domain->handle, + xrc_cq->handle, + &cmd.ibv_cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) + goto err_db; + + srq->ibv_srq.xrc_srq_num = srq->srqn = resp.srqn; + + ret = mlx4_store_xrc_srq(to_mctx(pd->context), srq->ibv_srq.xrc_srq_num, srq); + if (ret) + goto err_destroy; + + return &srq->ibv_srq; + +err_destroy: + ibv_cmd_destroy_srq(&srq->ibv_srq); + +err_db: + mlx4_free_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ, srq->db); + +err_free: + free(srq->wrid); + mlx4_free_buf(&srq->buf); + +err: + free(srq); + + return NULL; +} + +struct ibv_xrc_domain *mlx4_open_xrc_domain(struct ibv_context *context, + int fd, int oflag) +{ + int ret; + struct mlx4_open_xrc_domain_resp resp; + struct mlx4_xrc_domain *xrcd; + + xrcd = malloc(sizeof *xrcd); + if (!xrcd) + return NULL; + + ret = ibv_cmd_open_xrc_domain(context, fd, oflag, &xrcd->ibv_xrcd, + &resp.ibv_resp, sizeof resp); + if (ret) { + free(xrcd); + return NULL; + } + + xrcd->xrcdn = resp.xrcdn; + return &xrcd->ibv_xrcd; +} + +int mlx4_close_xrc_domain(struct ibv_xrc_domain *d) +{ + ibv_cmd_close_xrc_domain(d); + free(d); + return 0; +} +#endif diff --git a/src/wqe.h b/src/wqe.h index 6f7f309..fa2f8ac 100644 --- a/src/wqe.h +++ b/src/wqe.h @@ -65,7 +65,7 @@ struct mlx4_wqe_ctrl_seg { * [1] SE (solicited event) * [0] FL (force loopback) */ - uint32_t srcrb_flags; + uint32_t xrcrb_flags; /* * imm is 
immediate data for send/RDMA write w/ immediate; * also invalidation key for send with invalidate; input From jackm at dev.mellanox.co.il Tue Sep 18 10:25:08 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 18 Sep 2007 19:25:08 +0200 Subject: [ofa-general] [PATCH 3 of 5] core: XRC implementation for fd = -1 when opening an xrc domain Message-ID: <200709181925.09218.jackm@dev.mellanox.co.il> IB/core: Implement XRC support at verbs layer (for case in which fd is not used when opening an xrc_domain). Signed-off-by: Jack Morgenstein Index: ofed_kernel/drivers/infiniband/core/uverbs_main.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/core/uverbs_main.c 2007-09-18 12:22:24.264017000 +0200 +++ ofed_kernel/drivers/infiniband/core/uverbs_main.c 2007-09-18 18:47:53.280999000 +0200 @@ -74,6 +74,7 @@ DEFINE_IDR(ib_uverbs_ah_idr); DEFINE_IDR(ib_uverbs_cq_idr); DEFINE_IDR(ib_uverbs_qp_idr); DEFINE_IDR(ib_uverbs_srq_idr); +DEFINE_IDR(ib_uverbs_xrc_domain_idr); static spinlock_t map_lock; static struct ib_uverbs_device *dev_table[IB_UVERBS_MAX_DEVICES]; @@ -110,6 +111,9 @@ static ssize_t (*uverbs_cmd_table[])(str [IB_USER_VERBS_CMD_MODIFY_SRQ] = ib_uverbs_modify_srq, [IB_USER_VERBS_CMD_QUERY_SRQ] = ib_uverbs_query_srq, [IB_USER_VERBS_CMD_DESTROY_SRQ] = ib_uverbs_destroy_srq, + [IB_USER_VERBS_CMD_CREATE_XRC_SRQ] = ib_uverbs_create_xrc_srq, + [IB_USER_VERBS_CMD_OPEN_XRC_DOMAIN] = ib_uverbs_open_xrc_domain, + [IB_USER_VERBS_CMD_CLOSE_XRC_DOMAIN] = ib_uverbs_close_xrc_domain, }; static struct vfsmount *uverbs_event_mnt; @@ -205,17 +209,6 @@ static int ib_uverbs_cleanup_ucontext(st kfree(uqp); } - list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { - struct ib_cq *cq = uobj->object; - struct ib_uverbs_event_file *ev_file = cq->cq_context; - struct ib_ucq_object *ucq = - container_of(uobj, struct ib_ucq_object, uobject); - - idr_remove_uobj(&ib_uverbs_cq_idr, uobj); - ib_destroy_cq(cq); - 
ib_uverbs_release_ucq(file, ev_file, ucq); - kfree(ucq); - } list_for_each_entry_safe(uobj, tmp, &context->srq_list, list) { struct ib_srq *srq = uobj->object; @@ -228,6 +221,18 @@ static int ib_uverbs_cleanup_ucontext(st kfree(uevent); } + list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { + struct ib_cq *cq = uobj->object; + struct ib_uverbs_event_file *ev_file = cq->cq_context; + struct ib_ucq_object *ucq = + container_of(uobj, struct ib_ucq_object, uobject); + + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); + ib_destroy_cq(cq); + ib_uverbs_release_ucq(file, ev_file, ucq); + kfree(ucq); + } + /* XXX Free MWs */ list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { @@ -238,6 +243,14 @@ static int ib_uverbs_cleanup_ucontext(st kfree(uobj); } + list_for_each_entry_safe(uobj, tmp, &context->xrc_domain_list, list) { + struct ib_xrcd *xrcd = uobj->object; + + idr_remove_uobj(&ib_uverbs_xrc_domain_idr, uobj); + ib_dealloc_xrcd(xrcd); + kfree(uobj); + } + list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { struct ib_pd *pd = uobj->object; Index: ofed_kernel/include/rdma/ib_user_verbs.h =================================================================== --- ofed_kernel.orig/include/rdma/ib_user_verbs.h 2007-09-18 12:22:24.277021000 +0200 +++ ofed_kernel/include/rdma/ib_user_verbs.h 2007-09-18 18:37:35.652206000 +0200 @@ -83,7 +83,10 @@ enum { IB_USER_VERBS_CMD_MODIFY_SRQ, IB_USER_VERBS_CMD_QUERY_SRQ, IB_USER_VERBS_CMD_DESTROY_SRQ, - IB_USER_VERBS_CMD_POST_SRQ_RECV + IB_USER_VERBS_CMD_POST_SRQ_RECV, + IB_USER_VERBS_CMD_CREATE_XRC_SRQ, + IB_USER_VERBS_CMD_OPEN_XRC_DOMAIN, + IB_USER_VERBS_CMD_CLOSE_XRC_DOMAIN }; /* @@ -643,6 +646,18 @@ struct ib_uverbs_create_srq { __u64 driver_data[0]; }; +struct ib_uverbs_create_xrc_srq { + __u64 response; + __u64 user_handle; + __u32 pd_handle; + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u32 xrcd_handle; + __u32 xrc_cq; + __u64 driver_data[0]; +}; + struct ib_uverbs_create_srq_resp { __u32 
srq_handle; __u32 max_wr; @@ -682,4 +697,23 @@ struct ib_uverbs_destroy_srq_resp { __u32 events_reported; }; +struct ib_uverbs_open_xrc_domain { + __u64 response; + __u32 fd; + __u32 oflags; + __u64 driver_data[0]; +}; + +struct ib_uverbs_open_xrc_domain_resp { + __u32 xrcd_handle; +}; + +struct ib_uverbs_close_xrc_domain { + __u64 response; + __u32 xrcd_handle; + __u64 driver_data[0]; +}; + + + #endif /* IB_USER_VERBS_H */ Index: ofed_kernel/drivers/infiniband/core/uverbs_cmd.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/core/uverbs_cmd.c 2007-09-18 12:22:24.267016000 +0200 +++ ofed_kernel/drivers/infiniband/core/uverbs_cmd.c 2007-09-18 18:47:53.275995000 +0200 @@ -252,6 +252,16 @@ static void put_srq_read(struct ib_srq * put_uobj_read(srq->uobject); } +static struct ib_xrcd *idr_read_xrcd(int xrcd_handle, struct ib_ucontext *context) +{ + return idr_read_obj(&ib_uverbs_xrc_domain_idr, xrcd_handle, context, 0); +} + +static void put_xrcd_read(struct ib_xrcd *xrcd) +{ + put_uobj_read(xrcd->uobject); +} + ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -295,6 +305,7 @@ ssize_t ib_uverbs_get_context(struct ib_ INIT_LIST_HEAD(&ucontext->qp_list); INIT_LIST_HEAD(&ucontext->srq_list); INIT_LIST_HEAD(&ucontext->ah_list); + INIT_LIST_HEAD(&ucontext->xrc_domain_list); ucontext->closing = 0; resp.num_comp_vectors = file->device->num_comp_vectors; @@ -1024,6 +1035,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_srq *srq; struct ib_qp *qp; struct ib_qp_init_attr attr; + struct ib_xrcd *xrcd; int ret; if (out_len < sizeof resp) @@ -1043,13 +1055,16 @@ ssize_t ib_uverbs_create_qp(struct ib_uv init_uobj(&obj->uevent.uobject, cmd.user_handle, file->ucontext, &qp_lock_key); down_write(&obj->uevent.uobject.mutex); - srq = cmd.is_srq ? idr_read_srq(cmd.srq_handle, file->ucontext) : NULL; + srq = (cmd.is_srq && cmd.qp_type != IB_QPT_XRC) ? 
+ idr_read_srq(cmd.srq_handle, file->ucontext) : NULL; + xrcd = (cmd.is_srq && cmd.qp_type == IB_QPT_XRC) ? + idr_read_xrcd(cmd.srq_handle, file->ucontext) : NULL; pd = idr_read_pd(cmd.pd_handle, file->ucontext); scq = idr_read_cq(cmd.send_cq_handle, file->ucontext, 0); rcq = cmd.recv_cq_handle == cmd.send_cq_handle ? scq : idr_read_cq(cmd.recv_cq_handle, file->ucontext, 1); - if (!pd || !scq || !rcq || (cmd.is_srq && !srq)) { + if (!pd || !scq || !rcq || (cmd.is_srq && !srq && !xrcd)) { ret = -EINVAL; goto err_put; } @@ -1061,6 +1076,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.srq = srq; attr.sq_sig_type = cmd.sq_sig_all ? IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR; attr.qp_type = cmd.qp_type; + attr.xrc_domain = xrcd; attr.cap.max_send_wr = cmd.max_send_wr; attr.cap.max_recv_wr = cmd.max_recv_wr; @@ -1087,11 +1103,14 @@ ssize_t ib_uverbs_create_qp(struct ib_uv qp->event_handler = attr.event_handler; qp->qp_context = attr.qp_context; qp->qp_type = attr.qp_type; + qp->xrcd = attr.xrc_domain; atomic_inc(&pd->usecnt); atomic_inc(&attr.send_cq->usecnt); atomic_inc(&attr.recv_cq->usecnt); if (attr.srq) atomic_inc(&attr.srq->usecnt); + else if (attr.xrc_domain) + atomic_inc(&attr.xrc_domain->usecnt); obj->uevent.uobject.object = qp; ret = idr_add_uobj(&ib_uverbs_qp_idr, &obj->uevent.uobject); @@ -1119,6 +1138,8 @@ ssize_t ib_uverbs_create_qp(struct ib_uv put_cq_read(rcq); if (srq) put_srq_read(srq); + if (xrcd) + put_xrcd_read(xrcd); mutex_lock(&file->mutex); list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); @@ -1145,6 +1166,8 @@ err_put: put_cq_read(rcq); if (srq) put_srq_read(srq); + if (xrcd) + put_xrcd_read(xrcd); put_uobj_write(&obj->uevent.uobject); return ret; @@ -1988,6 +2011,8 @@ ssize_t ib_uverbs_create_srq(struct ib_u srq->uobject = &obj->uobject; srq->event_handler = attr.event_handler; srq->srq_context = attr.srq_context; + srq->xrc_cq = NULL; + srq->xrcd = NULL; atomic_inc(&pd->usecnt); atomic_set(&srq->usecnt, 0); @@ -2033,6 +2058,135 @@ 
err: return ret; } +ssize_t ib_uverbs_create_xrc_srq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_create_xrc_srq cmd; + struct ib_uverbs_create_srq_resp resp; + struct ib_udata udata; + struct ib_uevent_object *obj; + struct ib_pd *pd; + struct ib_srq *srq; + struct ib_cq *xrc_cq; + struct ib_xrcd *xrcd; + struct ib_srq_init_attr attr; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + INIT_UDATA(&udata, buf + sizeof cmd, + (unsigned long) cmd.response + sizeof resp, + in_len - sizeof cmd, out_len - sizeof resp); + + obj = kmalloc(sizeof *obj, GFP_KERNEL); + if (!obj) + return -ENOMEM; + + init_uobj(&obj->uobject, cmd.user_handle, file->ucontext, &srq_lock_key); + down_write(&obj->uobject.mutex); + + pd = idr_read_pd(cmd.pd_handle, file->ucontext); + if (!pd) { + ret = -EINVAL; + goto err; + } + + xrc_cq = idr_read_cq(cmd.xrc_cq, file->ucontext, 0); + if (!xrc_cq) { + ret = -EINVAL; + goto err_put_pd; + } + + xrcd = idr_read_xrcd(cmd.xrcd_handle, file->ucontext); + if (!xrcd) { + ret = -EINVAL; + goto err_put_cq; + } + + + attr.event_handler = ib_uverbs_srq_event_handler; + attr.srq_context = file; + attr.attr.max_wr = cmd.max_wr; + attr.attr.max_sge = cmd.max_sge; + attr.attr.srq_limit = cmd.srq_limit; + + obj->events_reported = 0; + INIT_LIST_HEAD(&obj->event_list); + + srq = pd->device->create_xrc_srq(pd, xrc_cq, xrcd, &attr, &udata); + if (IS_ERR(srq)) { + ret = PTR_ERR(srq); + goto err_put; + } + + srq->device = pd->device; + srq->pd = pd; + srq->uobject = &obj->uobject; + srq->event_handler = attr.event_handler; + srq->srq_context = attr.srq_context; + srq->xrc_cq = xrc_cq; + srq->xrcd = xrcd; + atomic_inc(&pd->usecnt); + atomic_inc(&xrc_cq->usecnt); + atomic_inc(&xrcd->usecnt); + + atomic_set(&srq->usecnt, 0); + + obj->uobject.object = srq; + ret = idr_add_uobj(&ib_uverbs_srq_idr, &obj->uobject); + if (ret) + goto 
err_destroy; + + memset(&resp, 0, sizeof resp); + resp.srq_handle = obj->uobject.id; + resp.max_wr = attr.attr.max_wr; + resp.max_sge = attr.attr.max_sge; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) { + ret = -EFAULT; + goto err_copy; + } + + put_xrcd_read(xrcd); + put_cq_read(xrc_cq); + put_pd_read(pd); + + mutex_lock(&file->mutex); + list_add_tail(&obj->uobject.list, &file->ucontext->srq_list); + mutex_unlock(&file->mutex); + + obj->uobject.live = 1; + + up_write(&obj->uobject.mutex); + + return in_len; + +err_copy: + idr_remove_uobj(&ib_uverbs_srq_idr, &obj->uobject); + +err_destroy: + ib_destroy_srq(srq); + +err_put: + put_xrcd_read(xrcd); + +err_put_cq: + put_cq_read(xrc_cq); + +err_put_pd: + put_pd_read(pd); + +err: + put_uobj_write(&obj->uobject); + return ret; +} + ssize_t ib_uverbs_modify_srq(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -2151,3 +2305,120 @@ ssize_t ib_uverbs_destroy_srq(struct ib_ return ret ? 
ret : in_len; } + +ssize_t ib_uverbs_open_xrc_domain(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_open_xrc_domain cmd; + struct ib_uverbs_open_xrc_domain_resp resp; + struct ib_udata udata; + struct ib_uobject *uobj; + struct ib_xrcd *xrcd; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + /* file descriptors/inodes not yet implemented */ + if (cmd.fd != (u32) (-1)) + return -ENOSYS; + + INIT_UDATA(&udata, buf + sizeof cmd, + (unsigned long) cmd.response + sizeof resp, + in_len - sizeof cmd, out_len - sizeof resp); + + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) + return -ENOMEM; + + init_uobj(uobj, 0, file->ucontext, &pd_lock_key); + down_write(&uobj->mutex); + + + xrcd = file->device->ib_dev->alloc_xrcd(file->device->ib_dev, + file->ucontext, &udata); + if (IS_ERR(xrcd)) { + ret = PTR_ERR(xrcd); + goto err; + } + + xrcd->fd = cmd.fd; + xrcd->flags = cmd.oflags; + xrcd->uobject = uobj; + xrcd->device = file->device->ib_dev; + atomic_set(&xrcd->usecnt, 0); + + uobj->object = xrcd; + ret = idr_add_uobj(&ib_uverbs_xrc_domain_idr, uobj); + if (ret) + goto err_idr; + + memset(&resp, 0, sizeof resp); + resp.xrcd_handle = uobj->id; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) { + ret = -EFAULT; + goto err_copy; + } + + mutex_lock(&file->mutex); + list_add_tail(&uobj->list, &file->ucontext->xrc_domain_list); + mutex_unlock(&file->mutex); + + uobj->live = 1; + + up_write(&uobj->mutex); + + return in_len; + +err_copy: + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); + +err_idr: + ib_dealloc_xrcd(xrcd); + +err: + put_uobj_write(uobj); + return ret; +} + +ssize_t ib_uverbs_close_xrc_domain(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_close_xrc_domain cmd; + struct ib_uobject *uobj; + int ret; + + if (copy_from_user(&cmd, buf, sizeof 
cmd)) + return -EFAULT; + + uobj = idr_write_uobj(&ib_uverbs_xrc_domain_idr, cmd.xrcd_handle, file->ucontext); + if (!uobj) + return -EINVAL; + + ret = ib_dealloc_xrcd(uobj->object); + if (!ret) + uobj->live = 0; + + put_uobj_write(uobj); + + if (ret) + return ret; + + idr_remove_uobj(&ib_uverbs_xrc_domain_idr, uobj); + + mutex_lock(&file->mutex); + list_del(&uobj->list); + mutex_unlock(&file->mutex); + + put_uobj(uobj); + + return in_len; +} + Index: ofed_kernel/drivers/infiniband/core/verbs.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/core/verbs.c 2007-09-18 18:37:34.772206000 +0200 +++ ofed_kernel/drivers/infiniband/core/verbs.c 2007-09-18 18:49:01.540510000 +0200 @@ -236,6 +236,8 @@ struct ib_srq *ib_create_srq(struct ib_p srq->uobject = NULL; srq->event_handler = srq_init_attr->event_handler; srq->srq_context = srq_init_attr->srq_context; + srq->xrc_cq = NULL; + srq->xrcd = NULL; atomic_inc(&pd->usecnt); atomic_set(&srq->usecnt, 0); } @@ -263,16 +265,25 @@ EXPORT_SYMBOL(ib_query_srq); int ib_destroy_srq(struct ib_srq *srq) { struct ib_pd *pd; + struct ib_cq *xrc_cq; + struct ib_xrcd *xrcd; int ret; if (atomic_read(&srq->usecnt)) return -EBUSY; pd = srq->pd; + xrc_cq = srq->xrc_cq; + xrcd = srq->xrcd; ret = srq->device->destroy_srq(srq); - if (!ret) + if (!ret) { atomic_dec(&pd->usecnt); + if (xrc_cq) + atomic_dec(&xrc_cq->usecnt); + if (xrcd) + atomic_dec(&xrcd->usecnt); + } return ret; } @@ -297,6 +308,7 @@ struct ib_qp *ib_create_qp(struct ib_pd qp->event_handler = qp_init_attr->event_handler; qp->qp_context = qp_init_attr->qp_context; qp->qp_type = qp_init_attr->qp_type; + qp->xrcd = NULL; atomic_inc(&pd->usecnt); atomic_inc(&qp_init_attr->send_cq->usecnt); atomic_inc(&qp_init_attr->recv_cq->usecnt); @@ -328,6 +340,9 @@ static const struct { [IB_QPT_RC] = (IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_ACCESS_FLAGS), + [IB_QPT_XRC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), 
[IB_QPT_SMI] = (IB_QP_PKEY_INDEX | IB_QP_QKEY), [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | @@ -350,6 +365,9 @@ static const struct { [IB_QPT_RC] = (IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_ACCESS_FLAGS), + [IB_QPT_XRC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | IB_QP_QKEY), [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | @@ -369,6 +387,12 @@ static const struct { IB_QP_RQ_PSN | IB_QP_MAX_DEST_RD_ATOMIC | IB_QP_MIN_RNR_TIMER), + [IB_QPT_XRC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), }, .opt_param = { [IB_QPT_UD] = (IB_QP_PKEY_INDEX | @@ -379,6 +403,9 @@ static const struct { [IB_QPT_RC] = (IB_QP_ALT_PATH | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX), + [IB_QPT_XRC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | IB_QP_QKEY), [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | @@ -399,6 +426,11 @@ static const struct { IB_QP_RNR_RETRY | IB_QP_SQ_PSN | IB_QP_MAX_QP_RD_ATOMIC), + [IB_QPT_XRC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), [IB_QPT_SMI] = IB_QP_SQ_PSN, [IB_QPT_GSI] = IB_QP_SQ_PSN, }, @@ -414,6 +446,11 @@ static const struct { IB_QP_ACCESS_FLAGS | IB_QP_MIN_RNR_TIMER | IB_QP_PATH_MIG_STATE), + [IB_QPT_XRC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), [IB_QPT_GSI] = (IB_QP_CUR_STATE | @@ -438,6 +475,11 @@ static const struct { IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE | IB_QP_MIN_RNR_TIMER), + [IB_QPT_XRC] = (IB_QP_CUR_STATE | + IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), [IB_QPT_GSI] = (IB_QP_CUR_STATE | @@ -450,6 +492,7 @@ static const struct { [IB_QPT_UD] = IB_QP_EN_SQD_ASYNC_NOTIFY, [IB_QPT_UC] = IB_QP_EN_SQD_ASYNC_NOTIFY, [IB_QPT_RC] = IB_QP_EN_SQD_ASYNC_NOTIFY, + 
[IB_QPT_XRC] = IB_QP_EN_SQD_ASYNC_NOTIFY, [IB_QPT_SMI] = IB_QP_EN_SQD_ASYNC_NOTIFY, [IB_QPT_GSI] = IB_QP_EN_SQD_ASYNC_NOTIFY } @@ -472,6 +515,11 @@ static const struct { IB_QP_ACCESS_FLAGS | IB_QP_MIN_RNR_TIMER | IB_QP_PATH_MIG_STATE), + [IB_QPT_XRC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), [IB_QPT_GSI] = (IB_QP_CUR_STATE | @@ -500,6 +548,18 @@ static const struct { IB_QP_PKEY_INDEX | IB_QP_MIN_RNR_TIMER | IB_QP_PATH_MIG_STATE), + [IB_QPT_XRC] = (IB_QP_PORT | + IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | IB_QP_QKEY), [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | @@ -584,12 +644,14 @@ int ib_destroy_qp(struct ib_qp *qp) struct ib_pd *pd; struct ib_cq *scq, *rcq; struct ib_srq *srq; + struct ib_xrcd *xrcd; int ret; pd = qp->pd; scq = qp->send_cq; rcq = qp->recv_cq; srq = qp->srq; + xrcd = qp->xrcd; ret = qp->device->destroy_qp(qp); if (!ret) { @@ -598,6 +660,8 @@ int ib_destroy_qp(struct ib_qp *qp) atomic_dec(&rcq->usecnt); if (srq) atomic_dec(&srq->usecnt); + if (xrcd) + atomic_dec(&xrcd->usecnt); } return ret; @@ -856,3 +920,14 @@ int ib_detach_mcast(struct ib_qp *qp, un return qp->device->detach_mcast(qp, gid, lid); } EXPORT_SYMBOL(ib_detach_mcast); + +int ib_dealloc_xrcd(struct ib_xrcd *xrcd) +{ + if (atomic_read(&xrcd->usecnt)) + return -EBUSY; + + return xrcd->device->dealloc_xrcd(xrcd); +} +EXPORT_SYMBOL(ib_dealloc_xrcd); + + Index: ofed_kernel/include/rdma/ib_verbs.h =================================================================== --- ofed_kernel.orig/include/rdma/ib_verbs.h 2007-09-18 18:37:34.708242000 +0200 +++ ofed_kernel/include/rdma/ib_verbs.h 2007-09-18 18:47:53.289995000 +0200 @@ -97,7 +97,8 @@ enum ib_device_cap_flags 
{ IB_DEVICE_SEND_W_INV = (1<<16), IB_DEVICE_MEM_WINDOW = (1<<17), IB_DEVICE_IP_CSUM = (1<<18), - IB_DEVICE_TCP_GSO = (1<<19) + IB_DEVICE_TCP_GSO = (1<<19), + IB_DEVICE_XRC = (1<<20) }; enum ib_atomic_cap { @@ -487,6 +488,7 @@ enum ib_qp_type { IB_QPT_RC, IB_QPT_UC, IB_QPT_UD, + IB_QPT_XRC, IB_QPT_RAW_IPV6, IB_QPT_RAW_ETY }; @@ -500,6 +502,7 @@ struct ib_qp_init_attr { struct ib_qp_cap cap; enum ib_sig_type sq_sig_type; enum ib_qp_type qp_type; + struct ib_xrcd *xrc_domain; /* XRC qp's only */ u8 port_num; /* special QP types only */ }; @@ -724,6 +727,7 @@ struct ib_ucontext { struct list_head qp_list; struct list_head srq_list; struct list_head ah_list; + struct list_head xrc_domain_list; int closing; }; @@ -751,6 +755,18 @@ struct ib_pd { atomic_t usecnt; /* count all resources */ }; +struct ib_xrcd { + struct ib_device *device; + struct ib_uobject *uobject; + struct rb_node node; + u32 xrc_domain_num; + struct inode *inode; + int fd; + u32 flags; + atomic_t usecnt; /* count all resources */ +}; + + struct ib_ah { struct ib_device *device; struct ib_pd *pd; @@ -772,6 +788,8 @@ struct ib_cq { struct ib_srq { struct ib_device *device; struct ib_pd *pd; + struct ib_cq *xrc_cq; + struct ib_xrcd *xrcd; struct ib_uobject *uobject; void (*event_handler)(struct ib_event *, void *); void *srq_context; @@ -789,6 +807,7 @@ struct ib_qp { void *qp_context; u32 qp_num; enum ib_qp_type qp_type; + struct ib_xrcd *xrcd; /* XRC QPs only */ }; struct ib_mr { @@ -1035,6 +1054,15 @@ struct ib_device { struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); + struct ib_srq * (*create_xrc_srq)(struct ib_pd *pd, + struct ib_cq *xrc_cq, + struct ib_xrcd *xrcd, + struct ib_srq_init_attr *srq_init_attr, + struct ib_udata *udata); + struct ib_xrcd * (*alloc_xrcd)(struct ib_device *device, + struct ib_ucontext *context, + struct ib_udata *udata); + int (*dealloc_xrcd)(struct ib_xrcd *xrcd); struct ib_dma_mapping_ops *dma_ops; @@ -1855,4 +1883,11 @@ int ib_attach_mcast(struct 
ib_qp *qp, un */ int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +/** + * ib_dealloc_xrcd - Deallocates an extended reliable connected (XRC) domain. + * @xrcd: The xrc domain to deallocate. + */ +int ib_dealloc_xrcd(struct ib_xrcd *xrcd); + #endif /* IB_VERBS_H */ Index: ofed_kernel/drivers/infiniband/core/uverbs.h =================================================================== --- ofed_kernel.orig/drivers/infiniband/core/uverbs.h 2007-09-18 12:22:24.274015000 +0200 +++ ofed_kernel/drivers/infiniband/core/uverbs.h 2007-09-18 18:47:53.284997000 +0200 @@ -143,6 +143,7 @@ extern struct idr ib_uverbs_ah_idr; extern struct idr ib_uverbs_cq_idr; extern struct idr ib_uverbs_qp_idr; extern struct idr ib_uverbs_srq_idr; +extern struct idr ib_uverbs_xrc_domain_idr; void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj); @@ -197,5 +198,9 @@ IB_UVERBS_DECLARE_CMD(create_srq); IB_UVERBS_DECLARE_CMD(modify_srq); IB_UVERBS_DECLARE_CMD(query_srq); IB_UVERBS_DECLARE_CMD(destroy_srq); +IB_UVERBS_DECLARE_CMD(create_xrc_srq); +IB_UVERBS_DECLARE_CMD(open_xrc_domain); +IB_UVERBS_DECLARE_CMD(close_xrc_domain); + #endif /* UVERBS_H */ From jackm at dev.mellanox.co.il Tue Sep 18 10:25:27 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 18 Sep 2007 19:25:27 +0200 Subject: [ofa-general] [PATCH 4 of 5] core: XRC implementation -- add support for working with file descriptors Message-ID: <200709181925.27455.jackm@dev.mellanox.co.il> Add XRC support for working with file descriptors, to allow sharing XRC domains between processes. 
Signed-off-by: Jack Morgenstein Index: ofed_kernel/drivers/infiniband/core/uverbs_cmd.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/core/uverbs_cmd.c 2007-09-16 16:32:22.844587000 +0200 +++ ofed_kernel/drivers/infiniband/core/uverbs_cmd.c 2007-09-18 11:09:20.590991000 +0200 @@ -39,6 +39,7 @@ #include #include +#include #include "uverbs.h" @@ -252,14 +253,18 @@ static void put_srq_read(struct ib_srq * put_uobj_read(srq->uobject); } -static struct ib_xrcd *idr_read_xrcd(int xrcd_handle, struct ib_ucontext *context) +static struct ib_xrcd *idr_read_xrcd(int xrcd_handle, + struct ib_ucontext *context, + struct ib_uobject **uobj) { - return idr_read_obj(&ib_uverbs_xrc_domain_idr, xrcd_handle, context, 0); + *uobj = idr_read_uobj(&ib_uverbs_xrc_domain_idr, xrcd_handle, + context, 0); + return *uobj ? (*uobj)->object : NULL; } -static void put_xrcd_read(struct ib_xrcd *xrcd) +static void put_xrcd_read(struct ib_uobject *uobj) { - put_uobj_read(xrcd->uobject); + put_uobj_read(uobj); } ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, @@ -1036,6 +1041,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_qp *qp; struct ib_qp_init_attr attr; struct ib_xrcd *xrcd; + struct ib_uobject *xrcd_uobj; int ret; if (out_len < sizeof resp) @@ -1058,7 +1064,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv srq = (cmd.is_srq && cmd.qp_type != IB_QPT_XRC) ? idr_read_srq(cmd.srq_handle, file->ucontext) : NULL; xrcd = (cmd.is_srq && cmd.qp_type == IB_QPT_XRC) ? - idr_read_xrcd(cmd.srq_handle, file->ucontext) : NULL; + idr_read_xrcd(cmd.srq_handle, file->ucontext, &xrcd_uobj) : NULL; pd = idr_read_pd(cmd.pd_handle, file->ucontext); scq = idr_read_cq(cmd.send_cq_handle, file->ucontext, 0); rcq = cmd.recv_cq_handle == cmd.send_cq_handle ? 
@@ -1139,7 +1145,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv if (srq) put_srq_read(srq); if (xrcd) - put_xrcd_read(xrcd); + put_xrcd_read(xrcd_uobj); mutex_lock(&file->mutex); list_add_tail(&obj->uevent.uobject.list, &file->ucontext->qp_list); @@ -1167,7 +1173,7 @@ err_put: if (srq) put_srq_read(srq); if (xrcd) - put_xrcd_read(xrcd); + put_xrcd_read(xrcd_uobj); put_uobj_write(&obj->uevent.uobject); return ret; @@ -2071,6 +2077,7 @@ ssize_t ib_uverbs_create_xrc_srq(struct struct ib_cq *xrc_cq; struct ib_xrcd *xrcd; struct ib_srq_init_attr attr; + struct ib_uobject *xrcd_uobj; int ret; if (out_len < sizeof resp) @@ -2102,7 +2109,7 @@ ssize_t ib_uverbs_create_xrc_srq(struct goto err_put_pd; } - xrcd = idr_read_xrcd(cmd.xrcd_handle, file->ucontext); + xrcd = idr_read_xrcd(cmd.xrcd_handle, file->ucontext, &xrcd_uobj); if (!xrcd) { ret = -EINVAL; goto err_put_cq; @@ -2153,7 +2160,7 @@ ssize_t ib_uverbs_create_xrc_srq(struct goto err_copy; } - put_xrcd_read(xrcd); + put_xrcd_read(xrcd_uobj); put_cq_read(xrc_cq); put_pd_read(pd); @@ -2174,7 +2181,7 @@ err_destroy: ib_destroy_srq(srq); err_put: - put_xrcd_read(xrcd); + put_xrcd_read(xrcd_uobj); err_put_cq: put_cq_read(xrc_cq); @@ -2306,6 +2313,117 @@ ssize_t ib_uverbs_destroy_srq(struct ib_ return ret ? 
ret : in_len; } +static struct inode * xrc_fd2inode(unsigned int fd) +{ + struct file * f = fget(fd); + + if (!f) + return NULL; + + return f->f_dentry->d_inode; +} + +struct xrcd_table_entry { + struct rb_node node; + struct inode * inode; + struct ib_xrcd *xrcd; +}; + +static int xrcd_table_insert(struct ib_device *dev, + struct inode *i_n, + struct ib_xrcd *xrcd) +{ + struct xrcd_table_entry *entry, *scan; + struct rb_node **p = &dev->ib_uverbs_xrcd_table.rb_node; + struct rb_node *parent = NULL; + + entry = kmalloc(sizeof(struct xrcd_table_entry), GFP_KERNEL); + if (!entry) + return -ENOMEM; + + entry->inode = i_n; + entry->xrcd = xrcd; + + while (*p) + { + parent = *p; + scan = rb_entry(parent, struct xrcd_table_entry, node); + + if (i_n < scan->inode) + p = &(*p)->rb_left; + else if (i_n > scan->inode) + p = &(*p)->rb_right; + else { + kfree(entry); + return -EEXIST; + } + } + + rb_link_node(&entry->node, parent, p); + rb_insert_color(&entry->node, &dev->ib_uverbs_xrcd_table); + return 0; +} + +static int insert_xrcd(struct ib_device *dev, struct inode *i_n, + struct ib_xrcd *xrcd) +{ + int ret; + + ret = xrcd_table_insert(dev, i_n, xrcd); + if (!ret) + igrab(i_n); + + return ret; +} + +static struct xrcd_table_entry * xrcd_table_search(struct ib_device *dev, + struct inode *i_n) +{ + struct xrcd_table_entry *scan; + struct rb_node **p = &dev->ib_uverbs_xrcd_table.rb_node; + struct rb_node *parent = NULL; + + while (*p) + { + parent = *p; + scan = rb_entry(parent, struct xrcd_table_entry, node); + + if (i_n < scan->inode) + p = &(*p)->rb_left; + else if (i_n > scan->inode) + p = &(*p)->rb_right; + else + return scan; + } + return NULL; +} + +static int find_xrcd(struct ib_device *dev, struct inode *i_n, + struct ib_xrcd **xrcd) +{ + struct xrcd_table_entry *entry; + + entry = xrcd_table_search(dev, i_n); + if (!entry) + return -EINVAL; + + *xrcd = entry->xrcd; + return 0; +} + + +static void xrcd_table_delete(struct ib_device *dev, + struct inode *i_n) +{ + 
struct xrcd_table_entry *entry = xrcd_table_search(dev, i_n); + + if (entry) { + iput(i_n); + rb_erase(&entry->node, &dev->ib_uverbs_xrcd_table); + kfree(entry); + } +} + ssize_t ib_uverbs_open_xrc_domain(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -2314,8 +2432,10 @@ ssize_t ib_uverbs_open_xrc_domain(struct struct ib_uverbs_open_xrc_domain_resp resp; struct ib_udata udata; struct ib_uobject *uobj; - struct ib_xrcd *xrcd; - int ret; + struct ib_xrcd *xrcd = NULL; + struct inode *inode = NULL; + int ret = 0; + int new_xrcd = 0; if (out_len < sizeof resp) return -ENOSPC; @@ -2323,35 +2443,55 @@ ssize_t ib_uverbs_open_xrc_domain(struct if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; - /* file descriptors/inodes not yet implemented */ - if (cmd.fd != (u32) (-1)) - return -ENOSYS; - INIT_UDATA(&udata, buf + sizeof cmd, (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); + mutex_lock(&file->device->ib_dev->xrcd_table_mutex); + if (cmd.fd != (u32) (-1)) { + /* search for file descriptor */ + inode = xrc_fd2inode(cmd.fd); + if (!inode) { + ret = -EBADF; + goto err_table_mutex_unlock; + } + + ret = find_xrcd(file->device->ib_dev, inode, &xrcd); + if (ret && !(cmd.oflags & O_CREAT)) { + /* no file descriptor. 
Need CREATE flag */ + ret = -EAGAIN; + goto err_table_mutex_unlock; + } + + if (xrcd && cmd.oflags & O_EXCL){ + ret = -EINVAL; + goto err_table_mutex_unlock; + } + } + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); - if (!uobj) - return -ENOMEM; + if (!uobj) { + ret = -ENOMEM; + goto err_table_mutex_unlock; + } init_uobj(uobj, 0, file->ucontext, &pd_lock_key); down_write(&uobj->mutex); - - xrcd = file->device->ib_dev->alloc_xrcd(file->device->ib_dev, - file->ucontext, &udata); - if (IS_ERR(xrcd)) { - ret = PTR_ERR(xrcd); - goto err; + if (!xrcd) { + xrcd = file->device->ib_dev->alloc_xrcd(file->device->ib_dev, + file->ucontext, &udata); + if (IS_ERR(xrcd)) { + ret = PTR_ERR(xrcd); + goto err; + } + xrcd->uobject = (cmd.fd == -1) ? uobj : NULL; + xrcd->inode = inode; + xrcd->device = file->device->ib_dev; + atomic_set(&xrcd->usecnt, 0); + new_xrcd = 1; } - xrcd->fd = cmd.fd; - xrcd->flags = cmd.oflags; - xrcd->uobject = uobj; - xrcd->device = file->device->ib_dev; - atomic_set(&xrcd->usecnt, 0); - uobj->object = xrcd; ret = idr_add_uobj(&ib_uverbs_xrc_domain_idr, uobj); if (ret) @@ -2360,6 +2500,16 @@ ssize_t ib_uverbs_open_xrc_domain(struct memset(&resp, 0, sizeof resp); resp.xrcd_handle = uobj->id; + if (inode) { + if (new_xrcd) { + /* create new inode/xrcd table entry */ + ret = insert_xrcd(file->device->ib_dev, inode, xrcd); + if (ret) + goto err_insert_xrcd; + } + atomic_inc(&xrcd->usecnt); + } + if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { ret = -EFAULT; @@ -2374,16 +2524,29 @@ ssize_t ib_uverbs_open_xrc_domain(struct up_write(&uobj->mutex); + mutex_unlock(&file->device->ib_dev->xrcd_table_mutex); return in_len; err_copy: - idr_remove_uobj(&ib_uverbs_pd_idr, uobj); + + if (inode) { + if (new_xrcd) + xrcd_table_delete(file->device->ib_dev, inode); + atomic_dec(&xrcd->usecnt); + } + +err_insert_xrcd: + idr_remove_uobj(&ib_uverbs_xrc_domain_idr, uobj); err_idr: ib_dealloc_xrcd(xrcd); err: put_uobj_write(uobj); + 
+err_table_mutex_unlock: + + mutex_unlock(&file->device->ib_dev->xrcd_table_mutex); return ret; } @@ -2393,14 +2556,25 @@ ssize_t ib_uverbs_close_xrc_domain(struc { struct ib_uverbs_close_xrc_domain cmd; struct ib_uobject *uobj; - int ret; + struct ib_xrcd *xrcd = NULL; + struct inode *inode = NULL; + int ret = 0; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + mutex_lock(&file->device->ib_dev->xrcd_table_mutex); uobj = idr_write_uobj(&ib_uverbs_xrc_domain_idr, cmd.xrcd_handle, file->ucontext); - if (!uobj) - return -EINVAL; + if (!uobj) { + ret = -EINVAL; + goto err_unlock_mutex; + } + + xrcd = (struct ib_xrcd *) (uobj->object); + inode = xrcd->inode; + + if (inode) + atomic_dec(&xrcd->usecnt); ret = ib_dealloc_xrcd(uobj->object); if (!ret) @@ -2408,8 +2582,11 @@ ssize_t ib_uverbs_close_xrc_domain(struc put_uobj_write(uobj); - if (ret) - return ret; + if (ret && !inode) + goto err_unlock_mutex; + + if (!ret && inode) + xrcd_table_delete(file->device->ib_dev, inode); idr_remove_uobj(&ib_uverbs_xrc_domain_idr, uobj); @@ -2419,6 +2596,27 @@ ssize_t ib_uverbs_close_xrc_domain(struc put_uobj(uobj); + mutex_unlock(&file->device->ib_dev->xrcd_table_mutex); return in_len; + +err_unlock_mutex: + mutex_unlock(&file->device->ib_dev->xrcd_table_mutex); + return ret; } +void ib_uverbs_dealloc_xrcd(struct ib_device *ib_dev, + struct ib_xrcd *xrcd) +{ + struct inode *inode = NULL; + int ret = 0; + + inode = xrcd->inode; + if (inode) + atomic_dec(&xrcd->usecnt); + + ret = ib_dealloc_xrcd(xrcd); + if (!ret && inode) + xrcd_table_delete(ib_dev, inode); +} + + Index: ofed_kernel/include/rdma/ib_verbs.h =================================================================== --- ofed_kernel.orig/include/rdma/ib_verbs.h 2007-09-16 16:32:43.674747000 +0200 +++ ofed_kernel/include/rdma/ib_verbs.h 2007-09-17 12:31:55.239267000 +0200 @@ -52,6 +52,8 @@ #include #include #include +#include +#include union ib_gid { u8 raw[16]; @@ -758,11 +760,8 @@ struct ib_pd { struct ib_xrcd { 
struct ib_device *device; struct ib_uobject *uobject; - struct rb_node node; - u32 xrc_domain_num; struct inode *inode; - int fd; - u32 flags; + struct rb_node node; atomic_t usecnt; /* count all resources */ }; @@ -1084,6 +1083,8 @@ struct ib_device { __be64 node_guid; u8 node_type; u8 phys_port_cnt; + struct rb_root ib_uverbs_xrcd_table; + struct mutex xrcd_table_mutex; }; struct ib_client { Index: ofed_kernel/drivers/infiniband/core/device.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/core/device.c 2007-09-10 09:07:21.951463000 +0300 +++ ofed_kernel/drivers/infiniband/core/device.c 2007-09-17 15:06:09.213698000 +0200 @@ -290,6 +290,8 @@ int ib_register_device(struct ib_device INIT_LIST_HEAD(&device->client_data_list); spin_lock_init(&device->event_handler_lock); spin_lock_init(&device->client_data_lock); + device->ib_uverbs_xrcd_table = RB_ROOT; + mutex_init(&device->xrcd_table_mutex); ret = read_port_table_lengths(device); if (ret) { Index: ofed_kernel/drivers/infiniband/core/uverbs_main.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/core/uverbs_main.c 2007-09-16 16:32:22.000000000 +0200 +++ ofed_kernel/drivers/infiniband/core/uverbs_main.c 2007-09-17 15:17:28.552933000 +0200 @@ -243,13 +243,15 @@ static int ib_uverbs_cleanup_ucontext(st kfree(uobj); } + mutex_lock(&file->device->ib_dev->xrcd_table_mutex); list_for_each_entry_safe(uobj, tmp, &context->xrc_domain_list, list) { struct ib_xrcd *xrcd = uobj->object; idr_remove_uobj(&ib_uverbs_xrc_domain_idr, uobj); - ib_dealloc_xrcd(xrcd); + ib_uverbs_dealloc_xrcd(file->device->ib_dev, xrcd); kfree(uobj); } + mutex_unlock(&file->device->ib_dev->xrcd_table_mutex); list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { struct ib_pd *pd = uobj->object; Index: ofed_kernel/drivers/infiniband/core/uverbs.h =================================================================== --- 
ofed_kernel.orig/drivers/infiniband/core/uverbs.h 2007-09-17 15:18:32.000000000 +0200 +++ ofed_kernel/drivers/infiniband/core/uverbs.h 2007-09-17 15:19:16.885160000 +0200 @@ -164,6 +164,8 @@ void ib_uverbs_qp_event_handler(struct i void ib_uverbs_srq_event_handler(struct ib_event *event, void *context_ptr); void ib_uverbs_event_handler(struct ib_event_handler *handler, struct ib_event *event); +void ib_uverbs_dealloc_xrcd(struct ib_device *ib_dev, + struct ib_xrcd *xrcd); #define IB_UVERBS_DECLARE_CMD(name) \ ssize_t ib_uverbs_##name(struct ib_uverbs_file *file, \ From mst at dev.mellanox.co.il Tue Sep 18 10:22:39 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 18 Sep 2007 19:22:39 +0200 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: <20070918163433.GL2050@mellanox.co.il> Message-ID: <20070918172239.GA23320@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: InfiniBand/RDMA merge plans for 2.6.24 > > > Roland, could you merge the common TX CQ patch please? > > It actually fixes a real problem. > > Yes, I will, but it collides with the net-2.6.24 NAPI rework I think, > so it may not go in until a few days after the merge window. > > Have you verified that the patch cures the interrupt overload issues? Yes. -- MST From jackm at dev.mellanox.co.il Tue Sep 18 10:27:36 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 18 Sep 2007 19:27:36 +0200 Subject: [ofa-general] [PATCH 5 of 5] mlx4: XRC implementation Message-ID: <200709181927.36968.jackm@dev.mellanox.co.il> mlx4: Implements XRC support. 
Signed-off-by: Jack Morgenstein Index: ofed_kernel/include/linux/mlx4/device.h =================================================================== --- ofed_kernel.orig/include/linux/mlx4/device.h 2007-09-18 12:14:40.223721000 +0200 +++ ofed_kernel/include/linux/mlx4/device.h 2007-09-18 12:15:27.919989000 +0200 @@ -56,6 +56,7 @@ enum { MLX4_DEV_CAP_FLAG_RC = 1 << 0, MLX4_DEV_CAP_FLAG_UC = 1 << 1, MLX4_DEV_CAP_FLAG_UD = 1 << 2, + MLX4_DEV_CAP_FLAG_XRC = 1 << 3, MLX4_DEV_CAP_FLAG_SRQ = 1 << 6, MLX4_DEV_CAP_FLAG_IPOIB_CSUM = 1 << 7, MLX4_DEV_CAP_FLAG_BAD_PKEY_CNTR = 1 << 8, @@ -176,6 +177,8 @@ struct mlx4_caps { int num_pds; int reserved_pds; int mtt_entry_sz; + int reserved_xrcds; + int max_xrcds; u32 max_msg_sz; u32 page_size_cap; u32 flags; @@ -313,6 +316,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, int mlx4_pd_alloc(struct mlx4_dev *dev, u32 *pdn); void mlx4_pd_free(struct mlx4_dev *dev, u32 pdn); +int mlx4_xrcd_alloc(struct mlx4_dev *dev, u32 *xrcdn); +void mlx4_xrcd_free(struct mlx4_dev *dev, u32 xrcdn); + int mlx4_uar_alloc(struct mlx4_dev *dev, struct mlx4_uar *uar); void mlx4_uar_free(struct mlx4_dev *dev, struct mlx4_uar *uar); @@ -337,8 +343,8 @@ void mlx4_cq_free(struct mlx4_dev *dev, int mlx4_qp_alloc(struct mlx4_dev *dev, int sqpn, struct mlx4_qp *qp); void mlx4_qp_free(struct mlx4_dev *dev, int is_sqp, struct mlx4_qp *qp); -int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, struct mlx4_mtt *mtt, - u64 db_rec, struct mlx4_srq *srq); +int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, u32 cqn, u16 xrcd, + struct mlx4_mtt *mtt, u64 db_rec, struct mlx4_srq *srq); void mlx4_srq_free(struct mlx4_dev *dev, struct mlx4_srq *srq); int mlx4_srq_arm(struct mlx4_dev *dev, struct mlx4_srq *srq, int limit_watermark); int mlx4_srq_query(struct mlx4_dev *dev, struct mlx4_srq *srq, int *limit_watermark); Index: ofed_kernel/drivers/infiniband/hw/mlx4/main.c =================================================================== --- 
ofed_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-18 12:14:40.231717000 +0200 +++ ofed_kernel/drivers/infiniband/hw/mlx4/main.c 2007-09-18 12:15:27.927990000 +0200 @@ -104,6 +104,8 @@ static int mlx4_ib_query_device(struct i props->device_cap_flags |= IB_DEVICE_IP_CSUM; if (dev->dev->caps.max_gso_sz) props->device_cap_flags |= IB_DEVICE_TCP_GSO; + if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_XRC) + props->device_cap_flags |= IB_DEVICE_XRC; props->vendor_id = be32_to_cpup((__be32 *) (out_mad->data + 36)) & 0xffffff; @@ -447,6 +449,46 @@ static int mlx4_ib_mcg_detach(struct ib_ &to_mqp(ibqp)->mqp, gid->raw); } +static struct ib_xrcd *mlx4_ib_alloc_xrcd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct mlx4_ib_xrcd *xrcd; + struct mlx4_ib_dev *mdev = to_mdev(ibdev); + int err; + + if (!(mdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_XRC)) + return ERR_PTR(-ENOSYS); + + xrcd = kmalloc(sizeof *xrcd, GFP_KERNEL); + if (!xrcd) + return ERR_PTR(-ENOMEM); + + err = mlx4_xrcd_alloc(mdev->dev, &xrcd->xrcdn); + if (err) { + kfree(xrcd); + return ERR_PTR(err); + } + + if (context) + if (ib_copy_to_udata(udata, &xrcd->xrcdn, sizeof (__u32))) { + mlx4_xrcd_free(mdev->dev, xrcd->xrcdn); + kfree(xrcd); + return ERR_PTR(-EFAULT); + } + + return &xrcd->ibxrcd; +} + +static int mlx4_ib_dealloc_xrcd(struct ib_xrcd *xrcd) +{ + mlx4_xrcd_free(to_mdev(xrcd->device)->dev, to_mxrcd(xrcd)->xrcdn); + kfree(xrcd); + + return 0; +} + + static int init_node_data(struct mlx4_ib_dev *dev) { struct ib_smp *in_mad = NULL; @@ -630,6 +672,16 @@ static void *mlx4_ib_add(struct mlx4_dev ibdev->ib_dev.map_phys_fmr = mlx4_ib_map_phys_fmr; ibdev->ib_dev.unmap_fmr = mlx4_ib_unmap_fmr; ibdev->ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; + if (dev->caps.flags & MLX4_DEV_CAP_FLAG_XRC) { + ibdev->ib_dev.create_xrc_srq = mlx4_ib_create_xrc_srq; + ibdev->ib_dev.alloc_xrcd = mlx4_ib_alloc_xrcd; + ibdev->ib_dev.dealloc_xrcd = mlx4_ib_dealloc_xrcd; + 
ibdev->ib_dev.uverbs_cmd_mask |= + (1ull << IB_USER_VERBS_CMD_CREATE_XRC_SRQ) | + (1ull << IB_USER_VERBS_CMD_OPEN_XRC_DOMAIN) | + (1ull << IB_USER_VERBS_CMD_CLOSE_XRC_DOMAIN); + } + if (ibdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM) ibdev->ib_dev.flags |= IB_DEVICE_IP_CSUM; Index: ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h =================================================================== --- ofed_kernel.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-18 12:14:40.233722000 +0200 +++ ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-18 12:15:27.931990000 +0200 @@ -73,6 +73,11 @@ struct mlx4_ib_pd { u32 pdn; }; +struct mlx4_ib_xrcd { + struct ib_xrcd ibxrcd; + u32 xrcdn; +}; + struct mlx4_ib_cq_buf { struct mlx4_buf buf; struct mlx4_mtt mtt; @@ -127,6 +132,7 @@ struct mlx4_ib_qp { struct mlx4_mtt mtt; int buf_size; struct mutex mutex; + u16 xrcdn; u8 port; u8 alt_port; u8 atomic_rd_en; @@ -189,6 +195,11 @@ static inline struct mlx4_ib_pd *to_mpd( return container_of(ibpd, struct mlx4_ib_pd, ibpd); } +static inline struct mlx4_ib_xrcd *to_mxrcd(struct ib_xrcd *ibxrcd) +{ + return container_of(ibxrcd, struct mlx4_ib_xrcd, ibxrcd); +} + static inline struct mlx4_ib_cq *to_mcq(struct ib_cq *ibcq) { return container_of(ibcq, struct mlx4_ib_cq, ibcq); @@ -264,6 +275,11 @@ int mlx4_ib_destroy_ah(struct ib_ah *ah) struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd, struct ib_srq_init_attr *init_attr, struct ib_udata *udata); +struct ib_srq *mlx4_ib_create_xrc_srq(struct ib_pd *pd, + struct ib_cq *xrc_cq, + struct ib_xrcd *xrcd, + struct ib_srq_init_attr *init_attr, + struct ib_udata *udata); int mlx4_ib_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, enum ib_srq_attr_mask attr_mask, struct ib_udata *udata); int mlx4_ib_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); Index: ofed_kernel/drivers/net/mlx4/xrcd.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 
ofed_kernel/drivers/net/mlx4/xrcd.c 2007-09-18 12:15:27.936991000 +0200 @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2006, 2007 Cisco Systems, Inc. All rights reserved. + * Copyright (c) 2007 Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "mlx4.h" + +int mlx4_xrcd_alloc(struct mlx4_dev *dev, u32 *xrcdn) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + *xrcdn = mlx4_bitmap_alloc(&priv->xrcd_bitmap); + if (*xrcdn == -1) + return -ENOMEM; + + return 0; +} +EXPORT_SYMBOL_GPL(mlx4_xrcd_alloc); + +void mlx4_xrcd_free(struct mlx4_dev *dev, u32 xrcdn) +{ + mlx4_bitmap_free(&mlx4_priv(dev)->xrcd_bitmap, xrcdn); +} +EXPORT_SYMBOL_GPL(mlx4_xrcd_free); + +int __devinit mlx4_init_xrcd_table(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + + return mlx4_bitmap_init(&priv->xrcd_bitmap, (1 << 16), + (1 << 16) - 1, dev->caps.reserved_xrcds + 1); +} + +void mlx4_cleanup_xrcd_table(struct mlx4_dev *dev) +{ + mlx4_bitmap_cleanup(&mlx4_priv(dev)->xrcd_bitmap); +} + + Index: ofed_kernel/drivers/net/mlx4/mlx4.h =================================================================== --- ofed_kernel.orig/drivers/net/mlx4/mlx4.h 2007-09-18 12:14:40.244721000 +0200 +++ ofed_kernel/drivers/net/mlx4/mlx4.h 2007-09-18 12:15:27.939990000 +0200 @@ -260,6 +260,7 @@ struct mlx4_priv { struct mlx4_cmd cmd; struct mlx4_bitmap pd_bitmap; + struct mlx4_bitmap xrcd_bitmap; struct mlx4_uar_table uar_table; struct mlx4_mr_table mr_table; struct mlx4_cq_table cq_table; @@ -289,6 +290,7 @@ void mlx4_bitmap_cleanup(struct mlx4_bit int mlx4_reset(struct mlx4_dev *dev); int mlx4_init_pd_table(struct mlx4_dev *dev); +int mlx4_init_xrcd_table(struct mlx4_dev *dev); int mlx4_init_uar_table(struct mlx4_dev *dev); int mlx4_init_mr_table(struct mlx4_dev *dev); int mlx4_init_eq_table(struct mlx4_dev *dev); @@ -305,6 +307,7 @@ void mlx4_cleanup_cq_table(struct mlx4_d void mlx4_cleanup_qp_table(struct mlx4_dev *dev); void mlx4_cleanup_srq_table(struct mlx4_dev *dev); void mlx4_cleanup_mcg_table(struct mlx4_dev *dev); +void mlx4_cleanup_xrcd_table(struct mlx4_dev *dev); void mlx4_start_catas_poll(struct mlx4_dev *dev); void mlx4_stop_catas_poll(struct mlx4_dev *dev); Index: 
ofed_kernel/drivers/net/mlx4/main.c =================================================================== --- ofed_kernel.orig/drivers/net/mlx4/main.c 2007-09-18 12:14:40.247718000 +0200 +++ ofed_kernel/drivers/net/mlx4/main.c 2007-09-18 12:15:27.945990000 +0200 @@ -160,6 +160,10 @@ static int __devinit mlx4_dev_cap(struct dev->caps.flags = dev_cap->flags; dev->caps.stat_rate_support = dev_cap->stat_rate_support; dev->caps.max_gso_sz = dev_cap->max_gso_sz; + dev->caps.reserved_xrcds = (dev->caps.flags & MLX4_DEV_CAP_FLAG_XRC) ? + dev_cap->reserved_xrcds : 0; + dev->caps.max_xrcds = (dev->caps.flags & MLX4_DEV_CAP_FLAG_XRC) ? + dev_cap->max_xrcds : 0; return 0; } @@ -589,11 +593,18 @@ static int __devinit mlx4_setup_hca(stru goto err_kar_unmap; } + err = mlx4_init_xrcd_table(dev); + if (err) { + mlx4_err(dev, "Failed to initialize " + "extended reliably connected domain table, aborting.\n"); + goto err_pd_table_free; + } + err = mlx4_init_mr_table(dev); if (err) { mlx4_err(dev, "Failed to initialize " "memory region table, aborting.\n"); - goto err_pd_table_free; + goto err_xrcd_table_free; } err = mlx4_init_eq_table(dev); @@ -670,6 +681,9 @@ err_eq_table_free: err_mr_table_free: mlx4_cleanup_mr_table(dev); +err_xrcd_table_free: + mlx4_cleanup_xrcd_table(dev); + err_pd_table_free: mlx4_cleanup_pd_table(dev); @@ -851,6 +865,7 @@ err_cleanup: mlx4_cmd_use_polling(dev); mlx4_cleanup_eq_table(dev); mlx4_cleanup_mr_table(dev); + mlx4_cleanup_xrcd_table(dev); mlx4_cleanup_pd_table(dev); mlx4_cleanup_uar_table(dev); @@ -897,6 +912,7 @@ static void __devexit mlx4_remove_one(st mlx4_cmd_use_polling(dev); mlx4_cleanup_eq_table(dev); mlx4_cleanup_mr_table(dev); + mlx4_cleanup_xrcd_table(dev); mlx4_cleanup_pd_table(dev); iounmap(priv->kar); Index: ofed_kernel/drivers/net/mlx4/srq.c =================================================================== --- ofed_kernel.orig/drivers/net/mlx4/srq.c 2007-09-18 12:14:40.249721000 +0200 +++ ofed_kernel/drivers/net/mlx4/srq.c 2007-09-18 
12:15:27.949990000 +0200 @@ -40,20 +40,20 @@ struct mlx4_srq_context { __be32 state_logsize_srqn; u8 logstride; - u8 reserved1[3]; - u8 pg_offset; - u8 reserved2[3]; - u32 reserved3; + u8 reserved1; + __be16 xrc_domain; + __be32 pg_offset_cqn; + u32 reserved2; u8 log_page_size; - u8 reserved4[2]; + u8 reserved3[2]; u8 mtt_base_addr_h; __be32 mtt_base_addr_l; __be32 pd; __be16 limit_watermark; __be16 wqe_cnt; - u16 reserved5; + u16 reserved4; __be16 wqe_counter; - u32 reserved6; + u32 reserved5; __be64 db_rec_addr; }; @@ -109,8 +109,8 @@ static int mlx4_QUERY_SRQ(struct mlx4_de MLX4_CMD_TIME_CLASS_A); } -int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, struct mlx4_mtt *mtt, - u64 db_rec, struct mlx4_srq *srq) +int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, u32 cqn, u16 xrcd, + struct mlx4_mtt *mtt, u64 db_rec, struct mlx4_srq *srq) { struct mlx4_srq_table *srq_table = &mlx4_priv(dev)->srq_table; struct mlx4_cmd_mailbox *mailbox; @@ -148,6 +148,8 @@ int mlx4_srq_alloc(struct mlx4_dev *dev, srq_context->state_logsize_srqn = cpu_to_be32((ilog2(srq->max) << 24) | srq->srqn); srq_context->logstride = srq->wqe_shift - 4; + srq_context->xrc_domain = cpu_to_be16(xrcd); + srq_context->pg_offset_cqn = cpu_to_be32(cqn & 0xffffff); srq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; mtt_addr = mlx4_mtt_addr(dev, mtt); Index: ofed_kernel/drivers/net/mlx4/fw.c =================================================================== --- ofed_kernel.orig/drivers/net/mlx4/fw.c 2007-09-18 12:14:40.252717000 +0200 +++ ofed_kernel/drivers/net/mlx4/fw.c 2007-09-18 12:15:27.954992000 +0200 @@ -160,6 +160,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev * #define QUERY_DEV_CAP_MAX_MCG_OFFSET 0x63 #define QUERY_DEV_CAP_RSVD_PD_OFFSET 0x64 #define QUERY_DEV_CAP_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_CAP_RSVD_XRC_OFFSET 0x66 +#define QUERY_DEV_CAP_MAX_XRC_OFFSET 0x67 #define QUERY_DEV_CAP_RDMARC_ENTRY_SZ_OFFSET 0x80 #define QUERY_DEV_CAP_QPC_ENTRY_SZ_OFFSET 0x82 #define 
QUERY_DEV_CAP_AUX_ENTRY_SZ_OFFSET 0x84 @@ -270,6 +272,11 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev * MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_PD_OFFSET); dev_cap->max_pds = 1 << (field & 0x3f); + MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_XRC_OFFSET); + dev_cap->reserved_xrcds = field >> 4; + MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_XRC_OFFSET); + dev_cap->max_xrcds = 1 << (field & 0x1f); + MLX4_GET(size, outbox, QUERY_DEV_CAP_RDMARC_ENTRY_SZ_OFFSET); dev_cap->rdmarc_entry_sz = size; MLX4_GET(size, outbox, QUERY_DEV_CAP_QPC_ENTRY_SZ_OFFSET); Index: ofed_kernel/drivers/net/mlx4/fw.h =================================================================== --- ofed_kernel.orig/drivers/net/mlx4/fw.h 2007-09-18 12:14:40.254721000 +0200 +++ ofed_kernel/drivers/net/mlx4/fw.h 2007-09-18 12:15:27.958990000 +0200 @@ -82,6 +82,8 @@ struct mlx4_dev_cap { int max_mcgs; int reserved_pds; int max_pds; + int reserved_xrcds; + int max_xrcds; int qpc_entry_sz; int rdmarc_entry_sz; int altc_entry_sz; Index: ofed_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-18 12:14:40.236718000 +0200 +++ ofed_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-18 12:18:47.788826000 +0200 @@ -410,6 +410,9 @@ static int create_qp_common(struct mlx4_ if (err) goto err_wrid; + if (init_attr->qp_type == IB_QPT_XRC) + qp->mqp.qpn |= (1 << 23); + /* * Hardware wants QPN written in big-endian order (after * shifting) for send doorbell. 
Precompute this value to save @@ -547,6 +550,9 @@ struct ib_qp *mlx4_ib_create_qp(struct i int err; switch (init_attr->qp_type) { + case IB_QPT_XRC: + if (!(dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_XRC)) + return ERR_PTR(-ENOSYS); case IB_QPT_RC: case IB_QPT_UC: case IB_QPT_UD: @@ -561,6 +567,11 @@ struct ib_qp *mlx4_ib_create_qp(struct i return ERR_PTR(err); } + if (init_attr->qp_type == IB_QPT_XRC) + qp->xrcdn = to_mxrcd(init_attr->xrc_domain)->xrcdn; + else + qp->xrcdn = 0; + qp->ibqp.qp_num = qp->mqp.qpn; break; @@ -625,6 +636,7 @@ static int to_mlx4_st(enum ib_qp_type ty case IB_QPT_RC: return MLX4_QP_ST_RC; case IB_QPT_UC: return MLX4_QP_ST_UC; case IB_QPT_UD: return MLX4_QP_ST_UD; + case IB_QPT_XRC: return MLX4_QP_ST_XRC; case IB_QPT_SMI: case IB_QPT_GSI: return MLX4_QP_ST_MLX; default: return -1; @@ -772,8 +784,11 @@ static int __mlx4_ib_modify_qp(struct ib context->sq_size_stride = ilog2(qp->sq.wqe_cnt) << 3; context->sq_size_stride |= qp->sq.wqe_shift - 4; - if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) + if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) { context->sq_size_stride |= !!qp->sq_no_prefetch << 7; + if (ibqp->qp_type == IB_QPT_XRC) + context->xrcd = cpu_to_be32((u32) qp->xrcdn); + } if (qp->ibqp.uobject) context->usr_page = cpu_to_be32(to_mucontext(ibqp->uobject->context)->uar.index); Index: ofed_kernel/drivers/infiniband/hw/mlx4/srq.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/hw/mlx4/srq.c 2007-09-18 12:14:40.240718000 +0200 +++ ofed_kernel/drivers/infiniband/hw/mlx4/srq.c 2007-09-18 12:15:27.970990000 +0200 @@ -72,13 +72,17 @@ static void mlx4_ib_srq_event(struct mlx } } -struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd, - struct ib_srq_init_attr *init_attr, - struct ib_udata *udata) +struct ib_srq *mlx4_ib_create_xrc_srq(struct ib_pd *pd, + struct ib_cq *xrc_cq, + struct ib_xrcd *xrcd, + struct ib_srq_init_attr *init_attr, + struct ib_udata *udata) { 
struct mlx4_ib_dev *dev = to_mdev(pd->device); struct mlx4_ib_srq *srq; struct mlx4_wqe_srq_next_seg *next; + u32 cqn; + u16 xrcdn; int desc_size; int buf_size; int err; @@ -172,7 +176,11 @@ struct ib_srq *mlx4_ib_create_srq(struct } } - err = mlx4_srq_alloc(dev->dev, to_mpd(pd)->pdn, &srq->mtt, + cqn = xrc_cq ? (u32) (to_mcq(xrc_cq)->mcq.cqn) : 0; + xrcdn = xrcd ? (u16) (to_mxrcd(xrcd)->xrcdn) : + (u16) dev->dev->caps.reserved_xrcds; + + err = mlx4_srq_alloc(dev->dev, to_mpd(pd)->pdn, cqn, xrcdn, &srq->mtt, srq->db.dma, &srq->msrq); if (err) goto err_wrid; @@ -240,6 +248,13 @@ int mlx4_ib_modify_srq(struct ib_srq *ib return 0; } +struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd, + struct ib_srq_init_attr *init_attr, + struct ib_udata *udata) +{ + return mlx4_ib_create_xrc_srq(pd, NULL, NULL, init_attr, udata); +} + int mlx4_ib_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *srq_attr) { struct mlx4_ib_dev *dev = to_mdev(ibsrq->device); Index: ofed_kernel/include/linux/mlx4/qp.h =================================================================== --- ofed_kernel.orig/include/linux/mlx4/qp.h 2007-09-18 12:14:40.227721000 +0200 +++ ofed_kernel/include/linux/mlx4/qp.h 2007-09-18 12:15:27.973990000 +0200 @@ -74,6 +74,7 @@ enum { MLX4_QP_ST_UC = 0x1, MLX4_QP_ST_RD = 0x2, MLX4_QP_ST_UD = 0x3, + MLX4_QP_ST_XRC = 0x6, MLX4_QP_ST_MLX = 0x7 }; @@ -136,7 +137,7 @@ struct mlx4_qp_context { __be32 ssn; __be32 params2; __be32 rnr_nextrecvpsn; - __be32 srcd; + __be32 xrcd; __be32 cqn_recv; __be64 db_rec_addr; __be32 qkey; Index: ofed_kernel/drivers/net/mlx4/Makefile =================================================================== --- ofed_kernel.orig/drivers/net/mlx4/Makefile 2007-09-18 12:14:40.257717000 +0200 +++ ofed_kernel/drivers/net/mlx4/Makefile 2007-09-18 12:15:27.976991000 +0200 @@ -1,4 +1,4 @@ obj-$(CONFIG_MLX4_CORE) += mlx4_core.o mlx4_core-y := alloc.o catas.o cmd.o cq.o eq.o fw.o icm.o intf.o main.o mcg.o \ - mr.o pd.o profile.o qp.o reset.o srq.o + mr.o pd.o 
profile.o qp.o reset.o srq.o xrcd.o Index: ofed_kernel/drivers/net/mlx4/qp.c =================================================================== --- ofed_kernel.orig/drivers/net/mlx4/qp.c 2007-09-18 12:14:40.000000000 +0200 +++ ofed_kernel/drivers/net/mlx4/qp.c 2007-09-18 12:20:43.390650000 +0200 @@ -263,10 +263,12 @@ int __devinit mlx4_init_qp_table(struct * We reserve 2 extra QPs per port for the special QPs. The * block of special QPs must be aligned to a multiple of 8, so * round up. + * We also reserve the MSB of the 24-bit QP number to indicate + * an XRC qp. */ dev->caps.sqp_start = ALIGN(dev->caps.reserved_qps, 8); err = mlx4_bitmap_init(&qp_table->bitmap, dev->caps.num_qps, - (1 << 24) - 1, dev->caps.sqp_start + 8); + (1 << 23) - 1, dev->caps.sqp_start + 8); if (err) return err; ------------------------------------------------------- From swise at opengridcomputing.com Tue Sep 18 10:23:31 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 18 Sep 2007 12:23:31 -0500 Subject: [ofa-general] Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. In-Reply-To: <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com> References: <20070912100025.3190.89259.stgit@dell3.ogc.int> <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com> Message-ID: <46F00993.9080706@opengridcomputing.com> Once this is applied upstream, I can pull it back in to ofed-1.2.5 and ofed-1.3. Steve. Sean Hefty wrote: >> RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. >> >> Calling arp_send() to initiate neighbour discovery (ND) doesn't do the >> full ND protocol. Namely, it doesn't handle retransmitting the arp >> request if it is dropped. The function neigh_event_send() does all this. >> Without doing full ND, rdma address resolution fails in the presence of >> dropped arp bcast packets. >> >> Signed-off-by: Steve Wise > > Acked-by: Sean Hefty > > Roland - can you please queue this up for 2.6.24? 
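As a rough illustration of the point made in the quoted patch description above (a single ARP request is lost for good if it is dropped, while full neighbour discovery retransmits), here is a minimal userspace sketch in C. It is a toy model, not kernel code: `struct lossy_link`, `link_send()`, `resolve_once()` and `resolve_with_retries()` are hypothetical names invented for this example, and the retry limit is an arbitrary parameter rather than the kernel's actual probe settings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a lossy link (not a kernel structure):
 * dropped[i] says whether the i-th request put on the wire is lost. */
struct lossy_link {
	const bool *dropped;	/* drop pattern, one entry per request */
	size_t len;		/* number of entries in the pattern */
	size_t sent;		/* requests transmitted so far */
};

/* Transmit one request; returns true if it reached the peer.
 * Requests beyond the end of the pattern are treated as delivered. */
static bool link_send(struct lossy_link *l)
{
	size_t i = l->sent++;

	return i >= l->len || !l->dropped[i];
}

/* One-shot discovery: a single request and no retransmission. */
static bool resolve_once(struct lossy_link *l)
{
	return link_send(l);
}

/* Discovery with retransmission: send up to max_probes requests and
 * stop at the first one that gets through. */
static bool resolve_with_retries(struct lossy_link *l, unsigned int max_probes)
{
	unsigned int i;

	assert(max_probes > 0);
	for (i = 0; i < max_probes; i++)
		if (link_send(l))
			return true;
	return false;
}
```

With a drop pattern of { true, true, false }, `resolve_once()` fails because its only request is lost, whereas `resolve_with_retries()` with a limit of 3 succeeds on the third request.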
From rdreier at cisco.com Tue Sep 18 10:42:57 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Sep 2007 10:42:57 -0700
Subject: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister
In-Reply-To: (Roland Dreier's message of "Mon, 17 Sep 2007 16:33:39 -0700")
References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <18593.1190071438@death>
Message-ID:

 > Maybe this new notification function should be in net/core/dev.c
 > instead of exporting call_netdevice_notifiers()?

Or actually, does it work to add the call to the notifiers directly in
unregister_netdev() so that device drivers don't have to worry about
it?

(And is the existing patch missing a call to notifiers in
ipoib_vlan_delete()?)

 - R.

From rdreier at cisco.com Tue Sep 18 10:58:37 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Sep 2007 10:58:37 -0700
Subject: [ofa-general] [PATCH net-2.6.24] Fix refcounting problem with netif_rx_reschedule()
In-Reply-To: <20070918111803.1769.60619.sendpatchset@localhost.localdomain> (Krishna Kumar's message of "Tue, 18 Sep 2007 16:48:03 +0530")
References: <20070918111803.1769.60619.sendpatchset@localhost.localdomain>
Message-ID:

netif_rx_complete() takes a netdev parameter and does dev_put() on
that netdev, so netif_rx_reschedule() needs to also take a netdev
parameter and do dev_hold() on it to keep reference counts from
becoming negative because of unbalanced dev_put()s.

This should fix the problem reported by Krishna Kumar
with IPoIB waiting forever for netdev refcounts to become 0 during
module unload.

Signed-off-by: Roland Dreier
---
Dave, feel free to roll this up into earlier NAPI conversion patches
(assuming I'm understanding things correctly and this patch actually
makes sense!).

BTW, it looks like drivers/net/ibm_emac/ibm_emac_mal.c would not have
built in the current net-2.6.24 tree, since its call to
netif_rx_reschedule() was left with the netdev parameter.  So that
file does not need to be touched in this patch.

 drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2 +-
 drivers/net/arm/ep93xx_eth.c            |    2 +-
 drivers/net/ehea/ehea_main.c            |    2 +-
 drivers/net/ibmveth.c                   |    2 +-
 include/linux/netdevice.h               |   21 +++++++++++----------
 5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 6a2bff4..481e4b6 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -320,7 +320,7 @@ poll_more:
 		if (unlikely(ib_req_notify_cq(priv->cq,
 					      IB_CQ_NEXT_COMP |
 					      IB_CQ_REPORT_MISSED_EVENTS)) &&
-		    netif_rx_reschedule(napi))
+		    netif_rx_reschedule(dev, napi))
 			goto poll_more;
 	}
 
diff --git a/drivers/net/arm/ep93xx_eth.c b/drivers/net/arm/ep93xx_eth.c
index f3858d1..7f016f3 100644
--- a/drivers/net/arm/ep93xx_eth.c
+++ b/drivers/net/arm/ep93xx_eth.c
@@ -309,7 +309,7 @@ poll_some_more:
 		}
 		spin_unlock_irq(&ep->rx_lock);
 
-		if (more && netif_rx_reschedule(napi))
+		if (more && netif_rx_reschedule(dev, napi))
 			goto poll_some_more;
 	}
 
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 4a5ab4a..9a499f4 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -636,7 +636,7 @@ static int ehea_poll(struct napi_struct *napi, int budget)
 		if (!cqe && !cqe_skb)
 			return 0;
 
-		if (!netif_rx_reschedule(napi))
+		if (!netif_rx_reschedule(dev, napi))
 			return 0;
 	}
 
diff --git a/drivers/net/ibmveth.c b/drivers/net/ibmveth.c
index b8d7cec..b94f266 100644
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -973,7 +973,7 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
 		netif_rx_complete(netdev, napi);
 
 		if (ibmveth_rxq_pending_buffer(adapter) &&
-		    netif_rx_reschedule(napi)) {
+		    netif_rx_reschedule(netdev, napi)) {
 			lpar_rc = h_vio_signal(adapter->vdev->unit_address,
 					       VIO_IRQ_DISABLE);
 			goto restart_poll;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index be5fe05..0dbf185 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1198,16 +1198,6 @@ static inline u32 netif_msg_init(int debug_value, int default_msg_enable_bits)
 	return (1 << debug_value) - 1;
 }
 
-/* Try to reschedule poll. Called by dev->poll() after netif_rx_complete().  */
-static inline int netif_rx_reschedule(struct napi_struct *n)
-{
-	if (napi_schedule_prep(n)) {
-		__napi_schedule(n);
-		return 1;
-	}
-	return 0;
-}
-
 /* Test if receive needs to be scheduled but only if up */
 static inline int netif_rx_schedule_prep(struct net_device *dev,
 					 struct napi_struct *napi)
@@ -1234,6 +1224,17 @@ static inline void netif_rx_schedule(struct net_device *dev,
 		__netif_rx_schedule(dev, napi);
 }
 
+/* Try to reschedule poll. Called by dev->poll() after netif_rx_complete().  */
+static inline int netif_rx_reschedule(struct net_device *dev,
+				      struct napi_struct *napi)
+{
+	if (napi_schedule_prep(napi)) {
+		__netif_rx_schedule(dev, napi);
+		return 1;
+	}
+	return 0;
+}
+
 /* same as netif_rx_complete, except that local_irq_save(flags)
  * has already been issued
  */

From rdreier at cisco.com Tue Sep 18 11:04:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Sep 2007 11:04:22 -0700
Subject: [ofa-general] [PATCH net-2.6.24] Fix documentation for dev_put()/dev_hold()
In-Reply-To: (Roland Dreier's message of "Tue, 18 Sep 2007 10:58:37 -0700")
References: <20070918111803.1769.60619.sendpatchset@localhost.localdomain>
Message-ID:

It looks like the comments for dev_put() and dev_hold() got reversed
somehow.

Signed-off-by: Roland Dreier --- include/linux/netdevice.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index be5fe05..239ae68 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1030,7 +1030,7 @@ extern int netdev_budget; extern void netdev_run_todo(void); /** - * dev_put - get reference to device + * dev_put - release reference to device * @dev: network device * * Hold reference to device to keep it from being freed. @@ -1041,7 +1041,7 @@ static inline void dev_put(struct net_device *dev) } /** - * dev_hold - release reference to device + * dev_hold - get reference to device * @dev: network device * * Release reference to device to allow it to be freed. From bramesh at vt.edu Tue Sep 18 11:02:53 2007 From: bramesh at vt.edu (Bharath Ramesh) Date: Tue, 18 Sep 2007 14:02:53 -0400 Subject: [ofa-general] IBV_WC_LOC_PROT_ERROR in receive In-Reply-To: <46EF5F6E.3080708@dev.mellanox.co.il> References: <20070918042202.GA8660@vt.edu> <46EF5F6E.3080708@dev.mellanox.co.il> Message-ID: <20070918180253.GA18113@vt.edu> I checked for the following: 1) I havent deregistered the MR. 2) I am using a RC QP 3) The messages size are the same 40 bytes. 4) I only have one PD for the entire application, i.e both QP and MR belong to the same PD 5) The vendor error that I get in the WC is error code 52. 6) I forgot to mention this in the earlier mail the snippet for my send is as follows: struct ibv_sge sge; struct ibv_wc wc; struct ibv_send_wr wr; struct ibv_send_wr *wr_bad; sge.addr = (uintptr_t) buf; sge.length = size; wr.wr_id = WR_ID; wr.next = NULL; wr.opcode = IBV_WR_SEND; wr.send_flags = IBV_SEND_INLINE; wr.num_sge = 1; wr.sg_list = &sge; if (ibv_post_send (ib_qp, &wr, &wr_bad) != 0) { printf ("ERROR: Unable to post send WR to queue.\n"); return -1; } Thanks, Bharath * Dotan Barak (dotanb at dev.mellanox.co.il) wrote: > Hi. 
> > Bharath Ramesh wrote: >> I am getting this error when I am trying to do a bunch of send/receives. >> I have registered the receive buffer. I printed the address of the >> buffers and their respective lkeys, they all match but I am still >> getting this error. >> >> The code snippet looks as follows: >> >> struct ibv_mr *mr; >> struct ibv_sge sge; >> struct ibv_recv_wr wr; >> struct ibv_recv_wr *wr_bad; >> >> // registering buffers >> mr = ibv_reg_mr (ib_pd, buf, size, IBV_ACCESS_LOCAL_WRITE | >> IBV_ACCESS_REMOTE_READ | >> IBV_ACCESS_REMOTE_WRITE); >> >> >> //Post the receive buffer >> sge.addr = (uintptr_t) buf; >> sge.length = size; >> sge.lkey = mr->lkey; >> wr.wr_id = WR_ID; >> wr.next = NULL; >> wr.sg_list = &sge; >> wr.num_sge = 1; >> if (ibv_post_recv (ib_qp, &wr, &wr_bad) != 0) { >> printf ("ERROR: Unable to post receiver buffer.\n"); >> return -1; >> } >> >> When I poll for the completion event I get this error. Any help on this >> is appreciated. I am not subscribed to this list, I would appreciate if >> you please cc me on the reply. >> > > > If the address that you given in the RR is valid (you didn't deregister > this MR): > You should check the following things: > * If this is a UD QP, maybe the extra 40 bytes (for the GRH) is missing in > the recv buffer. > * Maybe the incoming message is larger than the receive buffer > * maybe the PD of the QP and the MR are not the same > > > If this didn't help you, the value of the vendor_err in the completion > structure may help me.... > > Dotan > --- Bharath Ramesh http://people.cs.vt.edu/~bramesh From mshefty at ichips.intel.com Tue Sep 18 12:34:29 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 18 Sep 2007 12:34:29 -0700 Subject: [ofa-general] Re: IPoIB CM (NOSRQ) [PATCH 1] review In-Reply-To: References: Message-ID: <46F02845.9000103@ichips.intel.com> > A linear search accommodates intermittent connection and disconnection of > QPs, if ever there is a need. 
This search does not happen upon the packet > receive path and hence will not pose performance issues. I'm not sure how frequently connections are made, or if they're long lived. Avoiding the linear search costs an extra integer per QP, but I do think this optimization can come later. >> Would multiple values be better here? Something like: max_conn_qp, >> qp_type, and use_srq. > > > One of the goals was to keep the number of module parameters to a minimum. > Currently, UD is the default qp_type and when one switches to connected > mode > RC qps are used. At init we determine if the HCA support srq or not. So, > this > too is not required as a module param. Personally, I don't believe that automatic use of SRQ is desirable. And the code differences needed to support RC versus UC (without SRQ) seem trivial. So, rather than having one parameter mean: the maximum number of RC QPs with SRQ if it exists, but without SRQ if it doesn't exist, let's separate these values. Even if UC support isn't added right away, it would be better to expose the right values initially then have to change them later. > We compute the mask NOSRQ_INDEX_MASK based on max_rc_qp. This is used to > compute the wr_id through a bitwise AND. Hence we need that to be a power > of 2. I'm saying that we don't need to restrict the number of QPs to a power of 2. We only need to restrict it to less than 2^(number of bits that we want to dedicate from the wr_id to find the QP). E.g. it's okay to have 4-bit or 30-bit masks, but only support 12 QPs. > It is illustrative to see how people view the same thing differently :) > That had never occurred to me. Any suggestions? Yes - multiple parameters :) qp_type use_srq max_qp qp_size (covered by send_queue_size & recv_queue_size?) message_size (covered by mtu?) Yes - it's a lot of parameters, but I think they're needed to support connected mode. If an admin can control the QP sizes and MTU, then they only need to limit the number of QPs. 
It'd be nice to get some other's input. >>> + >>> + /* In the SRQ case there is a common rx buffer called the > srq_ring. >>> + * However, for the NOSRQ case we create an rx_ring for every >>> + * struct ipoib_cm_rx. >>> + */ >>> + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, >>> GFP_KERNEL); >>> + if (!p->rx_ring) { >>> + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", >>> + qp_num); >>> + return -ENOMEM; >>> + } >>> + >>> + spin_lock_irq(&priv->lock); >>> + list_add(&p->list, &priv->cm.passive_ids); >>> + spin_unlock_irq(&priv->lock); >>> + >>> + init_context_and_add_list(cm_id, p, priv); > >> stale_task thread could be executing on 'p' at this point. Is that >> acceptable? (I'm pretty sure I pointed this out before, but I don't >> remember what the response was.) > > > In the previous review of version v6, you had caught bug (which I > concurred) > That has been fixed now. > > >> We just added 'p' to the passive_ids list here, but >> init_context_and_add_list() also adds it to the list, but only in the >> srq case. It would be cleaner to always just add it to the list in >> init_context_and_add_list() or always do it outside of the list. > > > I am not sure I understand this. init_context_and_add_list() adds to the > list > conditionally. The end result of the code is that p->list is always added to priv->cm.passive_ids list: no SRQ case - allocate_and_post_rbuf_nosrq(): spin_lock_irq(&priv->lock); list_add(&p->list, &priv->cm.passive_ids); spin_unlock_irq(&priv->lock); init_context_and_add_list(cm_id, p, priv); SRQ case - ipoib_cm_req_handler(): if (priv->cm.srq) { p->state = IPOIB_CM_RX_LIVE; init_context_and_add_list(cm_id, p, priv); } init_context_and_add_list(...) { ... if (priv->cm.srq) { if (p->state == IPOIB_CM_RX_LIVE) list_move(&p->list, &priv->cm.passive_ids); Why can't this always just be done as: static void init_context_and_add_list(...) { ... list_add(&p->list, &priv->cm.passive_ids); ... 
} > Would it better to not drop the lock at all, but hold it till all 3 are > done? > This is not in the packet receive path, and hence not critical. It's not the performance that's bothering me so much as the code being structured in such a way that the same lock is acquired/released 3 times back to back to back. > As mentioned previously the state applies to srq only. I did not combine > the routines since they are quite small and this does get used in the > packet receive path. The state applying to srq only was a design decision. It looks like it could be used in the no srq case, but wasn't. >>> if (unlikely(wr_id >= ipoib_recvq_size)) { > >> Why would this ever occur? > > > If you see the previous IPoIB code (even UD) this has always been there- > probably to detect WQE corruption? IMO, I would toss these checks then. It only protects some of the wr_id bits against a fairly limited type of corruption. >>> + ipoib_warn(priv, "cm recv completion event with wrid %lld (> >>> %d)\n", >>> + (unsigned long long)wr_id, ipoib_recvq_size); >>> + return; >>> + } >>> + >>> + index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK; >>> + >>> + /* This is the only place where rx_ptr could be a NULL - could >>> + * have just received a packet from a connection that has become >>> + * stale and so is going away. We will simply drop the packet and >>> + * let the hardware (it s IB_QPT_RC) handle the dropped packet. > >> I don't understand this comment. How can the hardware handle a packet >> dropped by software? > > > Under the conditions described we drop the packet and since it is an RC > connection, the remote side will detect a timeout and the hardware will > detect it and automatically initiate a retransmission -till a > RETRY_EXCEEDED > error occurs. This still doesn't make sense to me. An ACK was already generated by the local hardware. Tossing the receive doesn't cause the remote hardware to resend the packet. 
>> If the completion can be for a connection that has gone away, what's to >> prevent a new connection from grabbing the same slot in the >> rx_index_table. If this occurs, then the completion will reference the >> wrong connection. > > > It does not matter if after a connection has gone away if a new connection > grabs > the same slot (that is likely to happen with the linear search). If the > old > connection comes back it will get a new slot in the rx_index_tabe. Yes - but a receive for the old connection will reference the rx_table index for the new connection. See below: >>> + * In the timer_check() function below, p->jiffies is updated and >>> + * hence the connection will not be stale after that. >>> + */ >>> + rx_ptr = priv->cm.rx_index_table[index]; >>> + if (unlikely(!rx_ptr)) { >>> + ipoib_warn(priv, "Received packet from a connection " >>> + "that is going away. Hardware will handle it.\n"); >>> + return; >>> + } If this check can ever succeed, then it's also possible for rx_ptr to reference the wrong connection. rx_table[index] should not be freed until all receives associated with that QP have been processed. > There have been a lot of discussions about this very issue. It was > strongly > suggested that I keep the if(srq) checks to a bare minimum, especially > since > this is in the packet receive path. I agree that we should limit checks in the receive path, but duplicating a fair amount of code to avoid 1 extra check doesn't seem worth it. We'll end up taking the same branch every time anyway. I'd rather make the code easy to maintain first, with the burden placed on showing the performance gained by removing the branch. > That may be possible. However, in the no srq case we need to do that upon > receipt > of every REQ, whereas for srq we need to do that only once. That is why it > is > convenient to do it here. It ends up using a lot of memory. For the SRQ case, having something adaptable may be better. 
Rather than posting the maximum up front, post only recv_queue_size receives with each connection (up to some total maximum). As connections go away, let the number of posted receives drop back down as buffers are consumed. This change is outside this patch though. - Sean From phillips.ken at gmail.com Tue Sep 18 12:47:42 2007 From: phillips.ken at gmail.com (Ken Phillips) Date: Tue, 18 Sep 2007 15:47:42 -0400 Subject: [ofa-general] SDP memory allocation policy problem? Message-ID: Greetings, Teammates here report the following: Problem The method SDP uses to allocate socket buffers may cause the node to hang under memory pressure. Details Each kernel level socket has an allocation flag to specify the memory allocation policy for socket buffers, the default is GFP_ATOMIC (or GFP_KERNEL for SDP). If the caller creates a socket with the policy set to GFP_NOFS or GFP_NOIO this should be the allocation policy used by the SDP layer. The problem we are seeing is that if a node is under load, and a memory allocation fails (say in sock_sendmsg()), the kernel will use the allocation policy to decide how to proceed with the allocation. If GFP_KERNEL is specified, then the kernel may attempt to free pages through the iSCSI block device that is making the socket call, which would result in a deadlock. Use of GFP_NOIO should prevent the kernel from using the IO backend to free memory resources. 
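To make the allocation-policy point above concrete, here is a minimal user-space sketch of "use the socket's policy if one was set, otherwise fall back to the default". Everything prefixed with x_/X_ is hypothetical and exists only for illustration: the flag bits are stand-ins (the real GFP_* values live in the kernel's linux/gfp.h), and struct x_sock is a cut-down stand-in for the real socket structure. It is not the SDP code, only a model of the selection logic being proposed.

```c
#include <assert.h>

/* Stand-in flag bits for illustration only; the real GFP_* values are
 * defined by the kernel and differ from these. */
#define X_GFP_WAIT 0x1u  /* allocation may sleep */
#define X_GFP_IO   0x2u  /* allocation may start block I/O to reclaim memory */
#define X_GFP_FS   0x4u  /* allocation may call into the filesystem */

#define X_GFP_NOIO   (X_GFP_WAIT)
#define X_GFP_NOFS   (X_GFP_WAIT | X_GFP_IO)
#define X_GFP_KERNEL (X_GFP_WAIT | X_GFP_IO | X_GFP_FS)

/* Hypothetical, cut-down socket: only the allocation policy field. */
struct x_sock {
    unsigned int sk_allocation;  /* 0 means the caller did not set a policy */
};

/* Use the socket's own policy when it is set, else the supplied default. */
static unsigned int x_alloc_flags(const struct x_sock *sk, unsigned int dflt)
{
    return sk->sk_allocation ? sk->sk_allocation : dflt;
}

/* Nonzero if an allocation with these flags may recurse into block I/O. */
static int x_may_start_io(unsigned int flags)
{
    return (flags & X_GFP_IO) != 0;
}
```

With a socket whose policy is X_GFP_NOIO, x_alloc_flags() returns flags for which x_may_start_io() is false, which is the property an iSCSI initiator sitting under the block layer needs; a socket with no policy set still gets the X_GFP_KERNEL default.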
here is a sample stack trace from Alt-Sysrq during one of these lockups, > tx_worker D ffffff0014d14000 0 10195 1 10196 10194 > (L-TLB) > 00000100707e98d8 0000000000000046 0000000000000004 0000000000000212 > 0000000000000212 ffffffffa018ccae 0000000000000246 0000000000000246 > 000001007873c7f0 0000000000000320 > Call Trace:{:ib_mthca:mthca_poll_cq+2258} > {schedule_timeout+224} > {lock_sock+152} > {autoremove_wake_function+0} > {:ib_sdp:sdp_poll_cq+58} > {autoremove_wake_function+0} > {release_sock+16} > {:ib_sdp:sdp_sendmsg+33} > {sock_sendmsg+271} > {:ib_sdp:sdp_post_sends+619} > {release_sock+16} > {:ib_sdp:sdp_sendmsg+2222} > {autoremove_wake_function+0} > {:rs_iscsi:iscsi_sock_msg+1265} > {:rs_iscsi:iscsi_sock_msg+1261} > {recalc_task_prio+337} > {:rs_iscsi:scsi_command_i+5283} > {thread_return+0} > {thread_return+88} > {del_timer+107} > {del_singleshot_timer_sync+9} > {schedule_timeout+375} > {:rs_iscsi:tx_worker_proc_i+6819} > {child_rip+8} > {:rs_iscsi:tx_worker_proc_i+0} > {child_rip+0} > > We still don't know the scope of changes to be made, but we think, at minimum that some of the memory allocation in SDP should be changed, for example. diff -Naur old/drivers/infiniband/ulp/sdp/sdp_bcopy.c new/drivers/infiniband/ulp/sdp/sdp_bcopy.c --- old/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-06-21 10:38:47.000000000 -0400 +++ new/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-08-31 12:25:58.000000000 -0400 @@ -224,13 +224,27 @@ /* Now, allocate and repost recv */ /* TODO: allocate from cache */ + +#if (PROPOSED_SDP_FIX == 1) + skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : + ssk->isk.sk.sk_allocation); +#else skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, GFP_KERNEL); +#endif /* FIXME */ BUG_ON(!skb); h = (struct sdp_bsdh *)skb->head; for (i = 0; i < ssk->recv_frags; ++i) { +#if (PROPOSED_SDP_FIX == 1) + page = alloc_pages((ssk->isk.sk.sk_allocation == 0) + ? 
(GFP_HIGHUSER) : + (ssk->isk.sk.sk_allocation | (__GFP_HIGHMEM)), + 0); +#else page = alloc_pages(GFP_HIGHUSER, 0); +#endif BUG_ON(!page); frag = &skb_shinfo(skb)->frags[i]; frag->page = page; @@ -406,10 +420,18 @@ ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) { struct sdp_chrecvbuf *resp_size; ssk->recv_request = 0; +#if (PROPOSED_SDP_FIX == 1) + skb = sk_stream_alloc_skb(&ssk->isk.sk, + sizeof(struct sdp_bsdh) + + sizeof(*resp_size), + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : + ssk->isk.sk.sk_allocation); +#else skb = sk_stream_alloc_skb(&ssk->isk.sk, sizeof(struct sdp_bsdh) + sizeof(*resp_size), GFP_KERNEL); +#endif /* FIXME */ BUG_ON(!skb); resp_size = (struct sdp_chrecvbuf *)skb_put(skb, sizeof *resp_size); @@ -431,10 +453,18 @@ ssk->tx_head > ssk->sent_request_head + SDP_RESIZE_WAIT && ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) { struct sdp_chrecvbuf *req_size; +#if (PROPOSED_SDP_FIX == 1) + skb = sk_stream_alloc_skb(&ssk->isk.sk, + sizeof(struct sdp_bsdh) + + sizeof(*req_size), + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : + ssk->isk.sk.sk_allocation); +#else skb = sk_stream_alloc_skb(&ssk->isk.sk, sizeof(struct sdp_bsdh) + sizeof(*req_size), GFP_KERNEL); +#endif /* FIXME */ BUG_ON(!skb); ssk->sent_request = SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE; @@ -463,9 +493,16 @@ (TCPF_FIN_WAIT1 | TCPF_LAST_ACK)) && !ssk->isk.sk.sk_send_head && ssk->bufs) { +#if (PROPOSED_SDP_FIX == 1) + skb = sk_stream_alloc_skb(&ssk->isk.sk, + sizeof(struct sdp_bsdh), + (ssk->isk.sk.sk_allocation == 0) ? 
(GFP_KERNEL) : + ssk->isk.sk.sk_allocation); +#else skb = sk_stream_alloc_skb(&ssk->isk.sk, sizeof(struct sdp_bsdh), GFP_KERNEL); +#endif /* FIXME */ BUG_ON(!skb); sdp_post_send(ssk, skb, SDP_MID_DISCONN); diff -Naur old/drivers/infiniband/ulp/sdp/sdp.h new/drivers/infiniband/ulp/sdp/sdp.h --- old/drivers/infiniband/ulp/sdp/sdp.h 2007-06-21 10:38:47.000000000 -0400 +++ new/drivers/infiniband/ulp/sdp/sdp.h 2007-08-31 12:25:45.000000000 -0400 @@ -7,6 +7,8 @@ #include /* For urgent data flags */ #include +#define PROPOSED_SDP_FIX 1 + #define sdp_printk(level, sk, format, arg...) \ printk(level "sdp_sock(%d:%d): " format, \ (sk) ? inet_sk(sk)->num : -1, \ --------------------- Best Regards K Phillips From pradeeps at linux.vnet.ibm.com Tue Sep 18 12:54:29 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 18 Sep 2007 12:54:29 -0700 Subject: [ofa-general] Re: IPoIB CM (NOSRQ) [PATCH 1] review In-Reply-To: <46EF606D.7030206@linux.vnet.ibm.com> References: <46EED168.3050102@ichips.intel.com> <46EF606D.7030206@linux.vnet.ibm.com> Message-ID: <46F02CF5.8060409@linux.vnet.ibm.com> Sean, In my earlier reply I missed two issues that you had pointed out. They are addressed here. Pradeep >>> +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, >>> + struct ipoib_cm_rx *p, unsigned psn) >> Function name is a little long. Maybe there should be multiple >> functions here. (Use of 'and' in the function name points to multiple >> functions that are grouped together. Maybe we should add a function >> naming rule: if the function name contains 'and', create separate >> functions...) > > Agreed, this can be dealt with the rest of the restructure. >>> +{ >>> + struct net_device *dev = cm_id->context; >>> + struct ipoib_dev_priv *priv = netdev_priv(dev); >>> + int ret; >>> + u32 qp_num, index; >>> + u64 i, recv_mem_used; >>> + >>> + qp_num = p->qp->qp_num; >> qp_num is only used in one place in this function, and only for a debug >> print. 
> > OK > >>> + >>> + /* In the SRQ case there is a common rx buffer called the srq_ring. >>> + * However, for the NOSRQ case we create an rx_ring for every >>> + * struct ipoib_cm_rx. >>> + */ >>> + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, >>> GFP_KERNEL); >>> + if (!p->rx_ring) { >>> + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", >>> + qp_num); >>> + return -ENOMEM; >>> + } >>> + >>> + spin_lock_irq(&priv->lock); >>> + list_add(&p->list, &priv->cm.passive_ids); >>> + spin_unlock_irq(&priv->lock); >>> + >>> + init_context_and_add_list(cm_id, p, priv); >> stale_task thread could be executing on 'p' at this point. Is that >> acceptable? (I'm pretty sure I pointed this out before, but I don't >> remember what the response was.) > > In the previous review of version v6, you had caught bug (which I concurred) > That has been fixed now. > >> We just added 'p' to the passive_ids list here, but >> init_context_and_add_list() also adds it to the list, but only in the >> srq case. It would be cleaner to always just add it to the list in >> init_context_and_add_list() or always do it outside of the list. > > I am not sure I understand this. init_context_and_add_list() adds to the list > conditionally. > >>> + spin_lock_irq(&priv->lock); >> Including the call above, we end up acquiring this lock 3 times in a >> row, setting 2 variables between the first and second time, and doing >> nothing between the second and third time. > > Would it better to not drop the lock at all, but hold it till all 3 are done? > This is not in the packet receive path, and hence not critical. > >>> + >>> + for (index = 0; index < max_rc_qp; index++) >>> + if (priv->cm.rx_index_table[index] == NULL) >>> + break; >> See previous comment about avoiding a linear search. 
>> >>> + >>> + recv_mem_used = (u64)ipoib_recvq_size * >>> + (u64)atomic_inc_return(&current_rc_qp) * CM_PACKET_SIZE; >>> + if ((index == max_rc_qp) || >>> + (recv_mem_used >= max_recv_buf * (1ul << 20))) { >> I would prefer a single check against max_rc_qp. (Fold memory >> constraints into limiting the value of max_rc_qp.) Otherwise, we can >> end up allocating a larger array of rx_index_table than is actually >> usable. >> >>> + spin_unlock_irq(&priv->lock); >>> + ipoib_warn(priv, "NOSRQ has reached the configurable limit " >>> + "of either %d RC QPs or, max recv buf size of " >>> + "0x%x MB\n", max_rc_qp, max_recv_buf); >>> + >>> + /* We send a REJ to the remote side indicating that we >>> + * have no more free RC QPs and leave it to the remote side >>> + * to take appropriate action. This should leave the >>> + * current set of QPs unaffected and any subsequent REQs >>> + * will be able to use RC QPs if they are available. >>> + */ >>> + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); >>> + ret = -EINVAL; >>> + goto err_alloc_and_post; >>> + } >>> + >>> + priv->cm.rx_index_table[index] = p; >>> + spin_unlock_irq(&priv->lock); >>> + >>> + /* We will subsequently use this stored pointer while freeing >>> + * resources in stale task >>> + */ >>> + p->index = index; >> Is it dangerous to have this not set before releasing the lock? (It >> doesn't look like it, but wanted to check.) Could anything be iterating >> over the table expecting p->index to be set. Only the stale task will iterate over the table. This initialization happens when REQ is received. So, if this thread gets scheduled out before p->index is set there may be a possibility of a race. Good catch! 
>> >>> + >>> + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); >>> + if (ret) { >>> + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); >>> + ipoib_cm_dev_cleanup(dev); >>> + goto err_alloc_and_post; >>> + } >>> + >>> + for (i = 0; i < ipoib_recvq_size; ++i) { >>> + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, >>> + IPOIB_CM_RX_SG - 1, >>> + p->rx_ring[i].mapping)) { >>> + ipoib_warn(priv, "failed to allocate receive " >>> + "buffer %d\n", (int)i); >>> + ipoib_cm_dev_cleanup(dev); >>> + ret = -ENOMEM; >>> + goto err_alloc_and_post; >>> + } >>> + >>> + if (post_receive_nosrq(dev, i << 32 | index)) { >>> + ipoib_warn(priv, "post_receive_nosrq " >>> + "failed for buf %lld\n", (unsigned long long)i); >>> + ipoib_cm_dev_cleanup(dev); >>> + ret = -EIO; >> >> Why not just do: >> >> ret = post_receive_nosrq()? >> if (ret) ... >> >>> + goto err_alloc_and_post; >>> + } >>> + } >>> + >>> + return 0; >>> + >>> +err_alloc_and_post: >>> + atomic_dec(&current_rc_qp); >>> + kfree(p->rx_ring); >>> + list_del_init(&p->list); >> We need a lock here. > > Agreed. You are correct. > >> Is priv->cm.rx_index_table[index] cleaned up anywhere? > > Yes, in dev_stop_nosrq(). 
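As a rough illustration of the race acknowledged just above, the following user-space sketch publishes the table entry and the back-pointer index inside the same critical section, so anything walking the table under that lock never sees an entry whose index is still unset. This is an assumption about one way to close the window, not the actual fix that went into the patch; x_rx, x_table, x_claim_slot and x_release_slot are hypothetical names, and a pthread mutex stands in for the kernel spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define X_MAX_QP 4   /* hypothetical table size */

struct x_rx {
    int index;       /* slot in the table, -1 while unassigned */
};

struct x_table {
    pthread_mutex_t lock;
    struct x_rx *slot[X_MAX_QP];
};

/* Claim the first free slot. The table entry and p->index are both written
 * while the lock is held. Returns the slot number, or -1 if the table is
 * full. */
static int x_claim_slot(struct x_table *t, struct x_rx *p)
{
    int i, ret = -1;

    pthread_mutex_lock(&t->lock);
    for (i = 0; i < X_MAX_QP; i++) {
        if (t->slot[i] == NULL) {
            t->slot[i] = p;
            p->index = i;
            ret = i;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return ret;
}

/* Release a slot, again under the lock, and clear the back-pointer. */
static void x_release_slot(struct x_table *t, struct x_rx *p)
{
    pthread_mutex_lock(&t->lock);
    if (p->index >= 0 && t->slot[p->index] == p) {
        t->slot[p->index] = NULL;
        p->index = -1;
    }
    pthread_mutex_unlock(&t->lock);
}
```

The linear search is kept here only because it mirrors the code under review; the locking point is independent of how the free slot is found.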
> >>> + return ret; >>> +} >>> + >>> static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct >>> ib_cm_event *event) >>> { >>> struct net_device *dev = cm_id->context; >>> @@ -301,9 +477,6 @@ static int ipoib_cm_req_handler(struct i >>> return -ENOMEM; >>> p->dev = dev; >>> p->id = cm_id; >>> - cm_id->context = p; >>> - p->state = IPOIB_CM_RX_LIVE; >>> - p->jiffies = jiffies; >>> INIT_LIST_HEAD(&p->list); >>> >>> p->qp = ipoib_cm_create_rx_qp(dev, p); >>> @@ -313,19 +486,21 @@ static int ipoib_cm_req_handler(struct i >>> } >>> >>> psn = random32() & 0xffffff; >>> - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); >>> - if (ret) >>> - goto err_modify; >>> + if (!priv->cm.srq) { >>> + ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); >>> + if (ret) >>> + goto err_modify; >>> + } else { >>> + p->rx_ring = NULL; >>> + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); >>> + if (ret) >>> + goto err_modify; >>> + } >>> >>> - spin_lock_irq(&priv->lock); >>> - queue_delayed_work(ipoib_workqueue, >>> - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); >>> - /* Add this entry to passive ids list head, but do not re-add it >>> - * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ >>> - p->jiffies = jiffies; >>> - if (p->state == IPOIB_CM_RX_LIVE) >>> - list_move(&p->list, &priv->cm.passive_ids); >>> - spin_unlock_irq(&priv->lock); >>> + if (priv->cm.srq) { >>> + p->state = IPOIB_CM_RX_LIVE; >> This if can be merged with the previous if statement above, which >> performs a similar check. >> >> Does it matter that the state is set outside of any locks? No it does not matter, since we have not yet added p into the list as yet. 
>> >>> + init_context_and_add_list(cm_id, p, priv); >>> + } >>> >>> ret = ipoib_cm_send_rep(dev, cm_id, p->qp, >>> &event->param.req_rcvd, psn); >>> if (ret) { >>> @@ -398,29 +573,60 @@ static void skb_put_frags(struct sk_buff >>> } >>> } >>> > From davem at davemloft.net Tue Sep 18 13:15:05 2007 From: davem at davemloft.net (David Miller) Date: Tue, 18 Sep 2007 13:15:05 -0700 (PDT) Subject: [ofa-general] Re: [PATCH net-2.6.24] Fix refcounting problem with netif_rx_reschedule() In-Reply-To: References: <20070918111803.1769.60619.sendpatchset@localhost.localdomain> Message-ID: <20070918.131505.91210517.davem@davemloft.net> From: Roland Dreier Date: Tue, 18 Sep 2007 10:58:37 -0700 > netif_rx_complete() takes a netdev parameter and does dev_put() on > that netdev, so netif_rx_reschedule() needs to also take a netdev > parameter and do dev_hold() on it to avoid reference counts from > getting becoming negative because of unbalanced dev_put()s. > > This should fix the problem reported by Krishna Kumar > with IPoIB waiting forever for netdev refcounts > to become 0 during module unload. > > Signed-off-by: Roland Dreier Applied to net-2.6.24, thanks Roland. > BTW, it looks like drivers/net/ibm_emac/ibm_emac_mal.c would not have > built in the current net-2.6.24 tree, since its call to > netif_rx_reschedule() was left with the netdev parameter. So that > file does not need to be touched in this patch. Yes, I know, this is the one NAPI driver that hasn't been converted. It's a complicated conversion because of how the driver and the data structures have been arranged (in short, a mess) which makes it insanely difficult to get from a queue instance back up to a network device or similar. Further complicating things is that you need to setup a ppc32 cross-build environment to even build test a conversion, and I'm not comfortable doing the surgery until I can test build the thing. 
And this may be hard to believe, but other things have been more pressing than setting up a ppc32 cross-build environment :-) This is a hint to anyone looking for something to do: it would be much appreciated if someone tackled the ibm_emac conversion. Thanks. From pradeeps at linux.vnet.ibm.com Tue Sep 18 14:01:21 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 18 Sep 2007 14:01:21 -0700 Subject: [ofa-general] Re: IPoIB CM (NOSRQ) [PATCH 1] review In-Reply-To: <46F02845.9000103@ichips.intel.com> References: <46F02845.9000103@ichips.intel.com> Message-ID: <46F03CA1.7040701@linux.vnet.ibm.com> > >> We compute the mask NOSRQ_INDEX_MASK based on max_rc_qp. This is used >> to compute the wr_id through a bitwise AND. Hence we need that to be a >> power of 2. > > I'm saying that we don't need to restrict the number of QPs to a power > of 2. We only need to restrict it to less than 2^(number of bits that > we want to dedicate from the wr_id to find the QP). E.g. it's okay to > have 4-bit or 30-bit masks, but only support 12 QPs. OK. Got it, I will unlink the two. .... > >>>> + ipoib_warn(priv, "cm recv completion event with wrid %lld (> >>>> %d)\n", >>>> + (unsigned long long)wr_id, ipoib_recvq_size); >>>> + return; >>>> + } >>>> + >>>> + index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK; >>>> + >>>> + /* This is the only place where rx_ptr could be a NULL - could >>>> + * have just received a packet from a connection that has become >>>> + * stale and so is going away. We will simply drop the packet and >>>> + * let the hardware (it s IB_QPT_RC) handle the dropped packet. >> >>> I don't understand this comment. How can the hardware handle a packet >>> dropped by software? >> >> >> Under the conditions described we drop the packet and since it is an RC >> connection, the remote side will detect a timeout and the hardware >> will detect it and automatically initiate a retransmission -till a >> RETRY_EXCEEDED >> error occurs. 
> > This still doesn't make sense to me. An ACK was already generated by > the local hardware. Tossing the receive doesn't cause the remote > hardware to resend the packet. So, at the hardware level an ack goes through, but we drop it at the software level. Is there any way we can force the remote end to resend? TCP should be OK. What about UDP? Do we depend upon the application at the remote end? Would it be more appropriate that I rephrase it something along the lines ... "We will simply drop the packet and let the remote end handle the dropped packet" > >>> If the completion can be for a connection that has gone away, what's to >>> prevent a new connection from grabbing the same slot in the >>> rx_index_table. If this occurs, then the completion will reference the >>> wrong connection. >> >> >> It does not matter if after a connection has gone away if a new >> connection grabs >> the same slot (that is likely to happen with the linear search). If >> the old >> connection comes back it will get a new slot in the rx_index_tabe. > > Yes - but a receive for the old connection will reference the rx_table > index for the new connection. See below: > >>>> + * In the timer_check() function below, p->jiffies is updated and >>>> + * hence the connection will not be stale after that. >>>> + */ >>>> + rx_ptr = priv->cm.rx_index_table[index]; >>>> + if (unlikely(!rx_ptr)) { >>>> + ipoib_warn(priv, "Received packet from a connection " >>>> + "that is going away. Hardware will handle it.\n"); >>>> + return; >>>> + } > > If this check can ever succeed, then it's also possible for rx_ptr to > reference the wrong connection. rx_table[index] should not be freed > until all receives associated with that QP have been processed. rx_index_table[index] is freed only in the stale task. So, that means all receives have been processed by this time. 
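For readers following the wr_id masking point agreed on earlier in this message, here is a small sketch of why the connection limit does not have to be a power of 2. The field width, the names (X_INDEX_BITS, x_make_wr_id and friends) and the layout are assumptions for illustration; the patch under review packs the ring position as i << 32 | index and also carries an IPOIB_CM_OP_RECV flag bit that is ignored here.

```c
#include <assert.h>
#include <stdint.h>

#define X_INDEX_BITS 8u                                  /* hypothetical field width */
#define X_INDEX_MASK ((UINT64_C(1) << X_INDEX_BITS) - 1)
#define X_MAX_QP     12u                                 /* need not be a power of 2 */

/* Pack a ring position and a connection index into one 64-bit wr_id. */
static uint64_t x_make_wr_id(uint32_t ring_pos, uint32_t qp_index)
{
    return ((uint64_t)ring_pos << 32) | (qp_index & X_INDEX_MASK);
}

/* Recover the connection index from the low bits. */
static uint32_t x_wr_id_index(uint64_t wr_id)
{
    return (uint32_t)(wr_id & X_INDEX_MASK);
}

/* Recover the ring position from the high 32 bits. */
static uint32_t x_wr_id_ring_pos(uint64_t wr_id)
{
    return (uint32_t)(wr_id >> 32);
}

/* The only constraint linking the two values: the largest valid index
 * (max_qp - 1) must fit inside the mask. */
static int x_limit_fits_mask(uint32_t max_qp)
{
    return max_qp >= 1 && (uint64_t)(max_qp - 1) <= X_INDEX_MASK;
}
```

The mask is derived from the number of bits set aside in the wr_id, while the limit (12 in this sketch) is only checked against it, so the two can be configured independently.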
From mshefty at ichips.intel.com Tue Sep 18 14:06:49 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 18 Sep 2007 14:06:49 -0700 Subject: [ofa-general] Re: IPoIB CM (NOSRQ) [PATCH 1] review In-Reply-To: <46F03CA1.7040701@linux.vnet.ibm.com> References: <46F02845.9000103@ichips.intel.com> <46F03CA1.7040701@linux.vnet.ibm.com> Message-ID: <46F03DE9.1010500@ichips.intel.com> > Would it be more appropriate that I rephrase it something along the lines ... > "We will simply drop the packet and let the remote end handle the dropped packet" Ok - this makes more sense to me. >>>>> + rx_ptr = priv->cm.rx_index_table[index]; >>>>> + if (unlikely(!rx_ptr)) { >>>>> + ipoib_warn(priv, "Received packet from a connection " >>>>> + "that is going away. Hardware will handle it.\n"); >>>>> + return; >>>>> + } >> If this check can ever succeed, then it's also possible for rx_ptr to >> reference the wrong connection. rx_table[index] should not be freed >> until all receives associated with that QP have been processed. > > rx_index_table[index] is freed only in the stale task. So, that means > all receives have been processed by this time. Then it sounds like we can remove this check. 
- Sean From becker at nas.nasa.gov Tue Sep 18 14:32:30 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Tue, 18 Sep 2007 14:32:30 -0700 Subject: [ofa-general] Fwd: Announcing the release of MVAPICH2 1.0 (fwd) In-Reply-To: <795c49870709181425u34e1b224j7437c264b25c013e@mail.gmail.com> References: <200709181420.l8IEKPFi002663@xi.cse.ohio-state.edu> <795c49870709181204o41c991bexcd34f0fe52357fa3@mail.gmail.com> <795c49870709181330w3255fa22o3d2e0144410dfb7a@mail.gmail.com> <795c49870709181404t726c6984kfd288ca07ca6147a@mail.gmail.com> <795c49870709181425u34e1b224j7437c264b25c013e@mail.gmail.com> Message-ID: <795c49870709181432i773f85d4n95e9b50167d416b4@mail.gmail.com> > > The MVAPICH team is pleased to announce the availability of > MVAPICH2-1.0 with the following NEW features: > > - Message coalescing support to enable reduction of per Queue-pair > send queues for reduction in memory requirement on large scale > clusters. This design also increases the small message messaging > rate significantly. Available for Open Fabrics Gen2-IB. > > - Hot-Spot Avoidance Mechanism (HSAM) for alleviating > network congestion in large scale clusters. Available for > Open Fabrics Gen2-IB. > > - RDMA CM based on-demand connection management for large scale > clusters. Available for OpenFabrics Gen2-IB and Gen2-iWARP. > > - uDAPL on-demand connection management for large scale clusters. > Available for uDAPL interface (including Solaris IB implementation). > > - RDMA Read support for increased overlap of computation and > communication. Available for OpenFabrics Gen2-IB and Gen2-iWARP. > > - Application-initiated system-level (synchronous) check-pointing in > addition to the user-transparent check-pointing. User application > can now request a whole program checkpoint synchronously with BLCR > by calling special functions within the application. Available for > OpenFabrics Gen2-IB. 
> > - Network-Level fault tolerance with Automatic Path Migration (APM) > for tolerating intermittent network failures over InfiniBand. > Available for OpenFabrics Gen2-IB. > > - Integrated multi-rail communication support for OpenFabrics > Gen2-iWARP and RDMA CM (with Gen2-IB). > > - RDMA based Direct One-sided communication support for OpenFabrics > Gen2-iWARP and RDMA CM (with Gen2-IB). > > - Blocking mode of communication progress. Available for OpenFabrics > Gen2-IB. > > - Based on MPICH2 1.0.5p4. > > More details on all features and supported platforms can be obtained > by visiting the following URL: > > http://mvapich.cse.ohio-state.edu/overview/mvapich2/features.shtml > > MVAPICH2 1.0 is tested with OFED 1.1, OFED 1.2 and OFED 1.2.5 (for > ConnectX). It continues to deliver excellent performance. Sample > performance numbers include: > > OpenFabrics/Gen2 on EM64T quad-core with PCIe and ConnectX-DDR: > Two-sided operations: > - 1.66 microsec one-way latency (4 bytes) > - 1405 MB/sec unidirectional bandwidth > - 2716 MB/sec bidirectional bandwidth > > One-sided operations: > - 3.19 microsec Put latency > - 1405 MB/sec unidirectional Put bandwidth > - 2716 MB/sec bidirectional Put bandwidth > > Performance numbers for all other platforms, system configurations and > operations can be viewed by visiting `Performance' section of the > project's web page. > > For downloading MVAPICH2 1.0 package and accessing the anonymous SVN, > please visit the following URL: > > http://mvapich.cse.ohio-state.edu/ > > All feedbacks, including bug reports, hints for performance tuning, > patches and enhancements are welcome. Please post it to > mvapich-discuss mailing list. > > Thanks, > > MVAPICH Team at OSU/NBCL > > ====================================================================== > MVAPICH/MVAPICH2 project is currently supported with funding from > U.S. National Science Foundation, U.S. 
DOE Office of Science, > Mellanox, Intel, Cisco Systems, QLogic, Sun Microsystems and Linux > Networx; and with equipment support from Advanced Clustering, AMD, > Appro, Chelsio, Dell, Fujitsu, Fulcrum, IBM, Intel, Mellanox, > Microway, NetEffect, QLogic and Sun Microsystems. Other technology > partner includes Etnus. > ====================================================================== > > > From pradeeps at linux.vnet.ibm.com Tue Sep 18 15:43:02 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 18 Sep 2007 15:43:02 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch Message-ID: <46F05476.4090809@linux.vnet.ibm.com> This version incorporates some of Sean's comments, especially relating to locking. Sean's comments regarding module parameters, code restructure, ipoib_cm_rx state and the like will require more discussion and subsequent testing. They will be addressed with additional set of patches later on. This patch has been tested with linux-2.6.23-rc5 derived from Roland's for-2.6.24 git tree on ppc64 machines using IBM HCA. 
Signed-off-by: Pradeep Satyanarayana --- --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-31 12:14:30.000000000 -0500 +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 14:31:07.000000000 -0500 @@ -95,11 +95,13 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) #define IPOIB_OP_RECV (1ul << 31) + #ifdef CONFIG_INFINIBAND_IPOIB_CM -#define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_CM_OP_RECV (1ul << 30) #else -#define IPOIB_CM_OP_SRQ (0) +#define IPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -166,11 +168,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp *qp; - struct list_head list; - struct net_device *dev; - unsigned long jiffies; + struct ib_cm_id *id; + struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by no srq only */ + struct list_head list; + struct net_device *dev; + unsigned long jiffies; + u32 index; /* wr_ids are distinguished by index + * to identify the QP -no srq only */ enum ipoib_cm_state state; }; @@ -215,6 +220,8 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -438,6 +445,7 @@ void ipoib_drain_cq(struct net_device *d /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC)) +extern int max_rc_qp; static inline int ipoib_cm_admin_enabled(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-31 12:14:30.000000000 -0500 +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-18 17:04:06.000000000 -0500 @@ -49,6 +49,18 @@ MODULE_PARM_DESC(cm_data_debug_level, #include "ipoib.h" +int max_rc_qp = 128; +static int max_recv_buf = 1024; /* Default is 1024 MB */ + 
+module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0444); +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of no srq RC QPs supported; must be a power of 2"); + +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); +MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB"); + +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for no srq */ + +#define NOSRQ_INDEX_MASK (max_rc_qp -1) #define IPOIB_CM_IETF_ID 0x1000000000000000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +93,21 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); if (unlikely(ret)) { - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); + ipoib_warn(priv, "post srq failed for buf %lld (%d)\n", + (unsigned long long)id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[id].mapping); dev_kfree_skb_any(priv->cm.srq_ring[id].skb); @@ -104,12 +117,47 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index; + u32 wr_id; + struct ipoib_cm_rx *rx_ptr; + + index = id & NOSRQ_INDEX_MASK; + wr_id = id >> 32; + + rx_ptr = priv->cm.rx_index_table[index]; + + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; + + for (i = 0; i < 
IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; + + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", + wr_id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ptr->rx_ring[wr_id].mapping); + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); + rx_ptr->rx_ring[wr_id].skb = NULL; + } + + return ret; +} + +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, + int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; int i; + struct ipoib_cm_rx *rx_ptr; + u32 index, wr_id; skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if (unlikely(!skb)) @@ -141,7 +189,14 @@ static struct sk_buff *ipoib_cm_alloc_rx goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (priv->cm.srq) + priv->cm.srq_ring[id].skb = skb; + else { + index = id & NOSRQ_INDEX_MASK; + wr_id = id >> 32; + rx_ptr = priv->cm.rx_index_table[index]; + rx_ptr->rx_ring[wr_id].skb = skb; + } return skb; partial_error: @@ -203,11 +258,14 @@ static struct ib_qp *ipoib_cm_create_rx_ .recv_cq = priv->cq, .srq = priv->cm.srq, .cap.max_send_wr = 1, /* For drain WR */ + .cap.max_recv_wr = ipoib_recvq_size + 1, .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, .qp_context = p, }; + if (!priv->cm.srq) + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; return ib_create_qp(priv->pd, &attr); } @@ -281,12 +339,131 @@ static int ipoib_cm_send_rep(struct net_ rep.private_data_len = sizeof data; rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; - rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; + rep.srq = !!priv->cm.srq; return ib_send_cm_rep(cm_id, &rep); } +static void init_context_and_add_list(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, + struct ipoib_dev_priv *priv) +{ + cm_id->context = p; + p->jiffies = jiffies; + 
spin_lock_irq(&priv->lock); + if (list_empty(&priv->cm.passive_ids)) + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); + if (priv->cm.srq) { + /* Add this entry to passive ids list head, but do not re-add + * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush + * list. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + } + spin_unlock_irq(&priv->lock); +} + +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, unsigned psn) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u32 index; + u64 i, recv_mem_used; + + /* In the SRQ case there is a common rx buffer called the srq_ring. + * However, for the no srq case we create an rx_ring for every + * struct ipoib_cm_rx. + */ + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL); + if (!p->rx_ring) { + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", + p->qp->qp_num); + return -ENOMEM; + } + + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + spin_unlock_irq(&priv->lock); + + init_context_and_add_list(cm_id, p, priv); + spin_lock_irq(&priv->lock); + + for (index = 0; index < max_rc_qp; index++) + if (priv->cm.rx_index_table[index] == NULL) + break; + + recv_mem_used = (u64)ipoib_recvq_size * + (u64)atomic_inc_return(&current_rc_qp) * CM_PACKET_SIZE; + if ((index == max_rc_qp) || + (recv_mem_used >= max_recv_buf * (1ul << 20))) { + spin_unlock_irq(&priv->lock); + ipoib_warn(priv, "no srq has reached the configurable limit " + "of either %d RC QPs or, max recv buf size of " + "0x%x MB\n", max_rc_qp, max_recv_buf); + + /* We send a REJ to the remote side indicating that we + * have no more free RC QPs and leave it to the remote side + * to take appropriate action. This should leave the + * current set of QPs unaffected and any subsequent REQs + * will be able to use RC QPs if they are available. 
+ */ + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); + ret = -EINVAL; + goto err_alloc_and_post; + } + + priv->cm.rx_index_table[index] = p; + + /* We will subsequently use this stored pointer while freeing + * resources in stale task + */ + p->index = index; + spin_unlock_irq(&priv->lock); + + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) { + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); + ipoib_cm_dev_cleanup(dev); + goto err_alloc_and_post; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive " + "buffer %d\n", (int)i); + ipoib_cm_dev_cleanup(dev); + ret = -ENOMEM; + goto err_alloc_and_post; + } + + ret = post_receive_nosrq(dev, i << 32 | index); + if (ret) { + ipoib_warn(priv, "post_receive_nosrq " + "failed for buf %lld\n", (unsigned long long)i); + ipoib_cm_dev_cleanup(dev); + ret = -EIO; + goto err_alloc_and_post; + } + } + + return 0; + +err_alloc_and_post: + atomic_dec(&current_rc_qp); + kfree(p->rx_ring); + spin_lock_irq(&priv->lock); + list_del_init(&p->list); + spin_unlock_irq(&priv->lock); + return ret; +} + static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { struct net_device *dev = cm_id->context; @@ -301,9 +478,6 @@ static int ipoib_cm_req_handler(struct i return -ENOMEM; p->dev = dev; p->id = cm_id; - cm_id->context = p; - p->state = IPOIB_CM_RX_LIVE; - p->jiffies = jiffies; INIT_LIST_HEAD(&p->list); p->qp = ipoib_cm_create_rx_qp(dev, p); @@ -313,19 +487,21 @@ static int ipoib_cm_req_handler(struct i } psn = random32() & 0xffffff; - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); - if (ret) - goto err_modify; + if (!priv->cm.srq) { + ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); + if (ret) + goto err_modify; + } else { + p->rx_ring = NULL; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) + goto err_modify; + 
} - spin_lock_irq(&priv->lock); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); - /* Add this entry to passive ids list head, but do not re-add it - * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ - p->jiffies = jiffies; - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irq(&priv->lock); + if (priv->cm.srq) { + p->state = IPOIB_CM_RX_LIVE; + init_context_and_add_list(cm_id, p, priv); + } ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); if (ret) { @@ -398,29 +574,60 @@ static void skb_put_frags(struct sk_buff } } -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. 
*/ + if (!list_empty(&p->list)) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; + u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV; struct sk_buff *skb, *newskb; struct ipoib_cm_rx *p; unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; - int frags; + int frags, ret; - ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", - wr_id, wc->status); + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", + (unsigned long long)wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { + if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) { spin_lock_irqsave(&priv->lock, flags); list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); ipoib_cm_start_rx_drain(priv); queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); spin_unlock_irqrestore(&priv->lock, flags); } else - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); return; } @@ -428,23 +635,15 @@ void ipoib_cm_handle_rx_wc(struct net_de if (unlikely(wc->status != IB_WC_SUCCESS)) { ipoib_dbg(priv, "cm recv error " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); + "(status=%d, wrid=%lld vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); ++priv->stats.rx_dropped; - goto repost; + goto repost_srq; } if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; - if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { - spin_lock_irqsave(&priv->lock, flags); - p->jiffies = jiffies; - /* Move this entry to list head, but do not re-add it - * if it has been moved out of list. 
*/ - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irqrestore(&priv->lock, flags); - } + timer_check_srq(priv, p); } frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, @@ -456,13 +655,109 @@ void ipoib_cm_handle_rx_wc(struct net_de * If we can't allocate a new RX buffer, dump * this packet and reuse the old buffer. */ - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", + (unsigned long long)wr_id); + ++priv->stats.rx_dropped; + goto repost_srq; + } + + ipoib_cm_dma_unmap_rx(priv, frags, + priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb_reset_mac_header(skb); + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_receive_skb(skb); + +repost_srq: + ret = post_receive_srq(dev, wr_id); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_srq failed for buf %lld\n", + (unsigned long long)wr_id); + +} + +static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32; + u32 index; + struct ipoib_cm_rx *rx_ptr; + int frags, ret; + + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", + (unsigned long long)wr_id, wc->status); + + if (unlikely(wr_id >= ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); + return; + } + + index = (wc->wr_id & 
~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK; + + /* This is the only place where rx_ptr could be a NULL - could + * have just received a packet from a connection that has become + * stale and so is going away. We will simply drop the packet and + * let the remote end handle the dropped packet. + * In the timer_check() function below, p->jiffies is updated and + * hence the connection will not be stale after that. + */ + rx_ptr = priv->cm.rx_index_table[index]; + if (unlikely(!rx_ptr)) { + ipoib_warn(priv, "Received packet from a connection " + "that is going away. Remote end will handle it.\n"); + return; + } + + skb = rx_ptr->rx_ring[wr_id].skb; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ipoib_dbg(priv, "cm recv error " + "(status=%d, wrid=%lld vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); + ++priv->stats.rx_dropped; + goto repost_nosrq; + } + + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) + timer_check_nosrq(priv, rx_ptr); + + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, + mapping); + if (unlikely(!newskb)) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. 
+ */ + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", + (unsigned long long)wr_id); ++priv->stats.rx_dropped; - goto repost; + goto repost_nosrq; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + ipoib_cm_dma_unmap_rx(priv, frags, rx_ptr->rx_ring[wr_id].mapping); + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); @@ -482,10 +777,22 @@ void ipoib_cm_handle_rx_wc(struct net_de skb->pkt_type = PACKET_HOST; netif_receive_skb(skb); -repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); +repost_nosrq: + ret = post_receive_nosrq(dev, wr_id << 32 | index); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_nosrq failed for buf %lld\n", + (unsigned long long)wr_id); +} + +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->cm.srq) + handle_rx_wc_srq(dev, wc); + else + handle_rx_wc_nosrq(dev, wc); } static inline int post_send(struct ipoib_dev_priv *priv, @@ -677,6 +984,43 @@ err_cm: return ret; } +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + int i; + + for (i = 0; i < ipoib_recvq_size; ++i) + if (p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + kfree(p->rx_ring); +} + +void dev_stop_nosrq(struct ipoib_dev_priv *priv) +{ + struct ipoib_cm_rx *p; + + spin_lock_irq(&priv->lock); + while (!list_empty(&priv->cm.passive_ids)) { + p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + free_resources_nosrq(priv, p); + list_del(&p->list); + spin_unlock_irq(&priv->lock); + ib_destroy_cm_id(p->id); + 
ib_destroy_qp(p->qp); + atomic_dec(&current_rc_qp); + kfree(p); + spin_lock_irq(&priv->lock); + } + spin_unlock_irq(&priv->lock); + + cancel_delayed_work(&priv->cm.stale_task); + kfree(priv->cm.rx_index_table); +} + void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -691,6 +1035,11 @@ void ipoib_cm_dev_stop(struct net_device ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + if (!priv->cm.srq) { + dev_stop_nosrq(priv); + return; + } + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); @@ -814,7 +1163,9 @@ static struct ib_qp *ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_recv_wr = 0; attr.cap.max_send_sge = 1; + attr.cap.max_recv_sge = 0; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -854,7 +1205,7 @@ static int ipoib_cm_send_req(struct net_ req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + req.srq = !!priv->cm.srq; return ib_send_cm_req(id, &req); } @@ -1198,6 +1549,8 @@ static void ipoib_cm_rx_reap(struct work list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); + if (!priv->cm.srq) + atomic_dec(&current_rc_qp); kfree(p); } } @@ -1216,12 +1569,19 @@ static void ipoib_cm_stale_task(struct w p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_move(&p->list, &priv->cm.rx_error_list); - p->state = IPOIB_CM_RX_ERROR; - spin_unlock_irq(&priv->lock); - ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); - if (ret) - ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + if (!priv->cm.srq) { + free_resources_nosrq(priv, p); + list_del_init(&p->list); + 
priv->cm.rx_index_table[p->index] = NULL; + spin_unlock_irq(&priv->lock); + } else { + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; + spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + } spin_lock_irq(&priv->lock); } @@ -1275,16 +1635,40 @@ int ipoib_cm_add_mode_attr(struct net_de return device_create_file(&dev->dev, &dev_attr_mode); } +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) +{ + struct ib_srq_init_attr srq_init_attr; + int ret; + + srq_init_attr.attr.max_wr = ipoib_recvq_size; + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * + sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring " + "(%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + return 0; +} + int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_srq_init_attr srq_init_attr = { - .attr = { - .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG - } - }; int ret, i; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1301,20 +1685,32 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; + ret = ib_query_device(priv->ca, &attr); + if (ret) return ret; - } - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - printk(KERN_WARNING "%s: failed to 
allocate CM ring (%d entries)\n", - priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; + if (attr.max_srq) { + /* This device supports SRQ */ + ret = create_srq(dev, priv); + if (ret) + return ret; + priv->cm.rx_index_table = NULL; + } else { + priv->cm.srq = NULL; + priv->cm.srq_ring = NULL; + + /* Every new REQ that arrives creates a struct ipoib_cm_rx. + * These structures form a link list starting with the + * passive_ids. For quick and easy access we maintain a table + * of pointers to struct ipoib_cm_rx called the rx_index_table + */ + priv->cm.rx_index_table = kcalloc(max_rc_qp, + sizeof *priv->cm.rx_index_table, + GFP_KERNEL); + if (!priv->cm.rx_index_table) { + printk(KERN_WARNING "Failed to allocate rx_index_table\n"); + return -ENOMEM; + } } for (i = 0; i < IPOIB_CM_RX_SG; ++i) @@ -1327,17 +1723,24 @@ int ipoib_cm_dev_init(struct net_device priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + /* One can post receive buffers even before the RX QP is created + * only in the SRQ case. 
Therefore for no srq we skip the rest of init + * and do that in ipoib_cm_req_handler() + */ + + if (priv->cm.srq) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (post_receive_srq(dev, i)) { + ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } } } --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 12:39:12.000000000 -0500 +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 14:08:33.000000000 -0500 @@ -300,7 +300,7 @@ int ipoib_poll(struct net_device *dev, i for (i = 0; i < n; ++i) { struct ib_wc *wc = priv->ibwc + i; - if (wc->wr_id & IPOIB_CM_OP_SRQ) { + if (wc->wr_id & IPOIB_CM_OP_RECV) { ++done; --max; ipoib_cm_handle_rx_wc(dev, wc); @@ -566,7 +566,7 @@ void ipoib_drain_cq(struct net_device *d if (priv->ibwc[i].status == IB_WC_SUCCESS) priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR; - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV) ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-18 12:39:12.000000000 -0500 +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-18 14:02:03.000000000 -0500 @@ -175,6 +175,18 @@ int ipoib_transport_dev_init(struct net_ if (!ret) size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; +#ifdef CONFIG_INFINIBAND_IPOIB_CM + + /* We increase the size of 
the CQ in the NOSRQ case to prevent CQ + * overflow. Every new REQ creates a new RX QP and each QP has an + * RX ring associated with it. Therefore we could have + * max_rc_qp*ipoib_recvq_size + ipoib_sendq_size CQEs + * in a CQ. + */ + if (!priv->cm.srq) + size += (max_rc_qp - 1) * ipoib_recvq_size; +#endif + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 12:39:12.000000000 -0500 +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 14:02:03.000000000 -0500 @@ -1227,6 +1227,7 @@ static int __init ipoib_init_module(void ipoib_sendq_size = roundup_pow_of_two(ipoib_sendq_size); ipoib_sendq_size = min(ipoib_sendq_size, IPOIB_MAX_QUEUE_SIZE); ipoib_sendq_size = max(ipoib_sendq_size, IPOIB_MIN_QUEUE_SIZE); + max_rc_qp = roundup_pow_of_two(max_rc_qp); ret = ipoib_register_debugfs(); if (ret) From rdreier at cisco.com Tue Sep 18 15:46:36 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Sep 2007 15:46:36 -0700 Subject: [ofa-general] Re: [PATCH net-2.6.24] Fix refcounting problem with netif_rx_reschedule() In-Reply-To: <20070918.131505.91210517.davem@davemloft.net> (David Miller's message of "Tue, 18 Sep 2007 13:15:05 -0700 (PDT)") References: <20070918111803.1769.60619.sendpatchset@localhost.localdomain> <20070918.131505.91210517.davem@davemloft.net> Message-ID: > Further complicating things is that you need to setup a ppc32 > cross-build environment to even build test a conversion, and I'm not > comfortable doing the surgery until I can test build the thing. OK, I actually have a system with a ppc 440 SoC that uses this driver, so I'll try to get things to the stage where I can boot net-2.6.24 on it and see if I can get the driver working... 
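Editorial aside on the IPoIB CM (NOSRQ) V9 patch earlier in this digest: the patch packs two values into each 64-bit wr_id (the rx ring slot in the upper 32 bits, the rx_index_table index in the low bits) and grows the CQ by one rx ring per extra RC QP. The following is only a rough user-space sketch of that arithmetic, not code from the patch. Every sketch_* name is hypothetical, SKETCH_CM_OP_RECV stands in for the patch's IPOIB_CM_OP_RECV (bit 30), and the power-of-two helper merely imitates what roundup_pow_of_two() is used for in ipoib_init_module().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's IPOIB_CM_OP_RECV flag (bit 30). */
#define SKETCH_CM_OP_RECV (UINT64_C(1) << 30)

/* Round n up to the next power of two (assumed role of roundup_pow_of_two). */
static uint32_t sketch_roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Pack ring slot (upper 32 bits) and QP table index (low bits) into a wr_id. */
static uint64_t sketch_pack_wr_id(uint32_t slot, uint32_t index,
				  uint32_t max_rc_qp)
{
	/* The (max_rc_qp - 1) mask below only works for powers of two. */
	assert(max_rc_qp != 0 && (max_rc_qp & (max_rc_qp - 1)) == 0);
	/* The index must stay below the flag bit and inside the table. */
	assert(max_rc_qp <= (UINT32_C(1) << 30) && index < max_rc_qp);
	return ((uint64_t)slot << 32) | index | SKETCH_CM_OP_RECV;
}

/* Recover the ring slot: the flag and index live in the low word. */
static uint32_t sketch_wr_id_slot(uint64_t wr_id)
{
	return (uint32_t)(wr_id >> 32);
}

/* Recover the table index: clear the flag, then apply the index mask. */
static uint32_t sketch_wr_id_index(uint64_t wr_id, uint32_t max_rc_qp)
{
	return (uint32_t)(wr_id & ~SKETCH_CM_OP_RECV) & (max_rc_qp - 1);
}

/*
 * CQ entries added for connected mode without SRQ, following the two
 * "size +=" statements in the patch: one rx ring plus one drain entry,
 * then (max_rc_qp - 1) further rx rings.
 */
static uint64_t sketch_cm_cq_extra(uint32_t recvq_size, uint32_t max_rc_qp)
{
	return (uint64_t)recvq_size + 1 +
	       (uint64_t)(max_rc_qp - 1) * recvq_size;
}
```

With the default nosrq_max_rc_qp of 128 and an assumed receive queue size of 128, sketch_cm_cq_extra() gives 128 * 128 + 1 entries on top of the send ring, which matches the "max_rc_qp*ipoib_recvq_size" term in the patch comment. The mask-based index recovery is also why the module parameter description asks for a power of 2.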
From davem at davemloft.net Tue Sep 18 15:50:51 2007 From: davem at davemloft.net (David Miller) Date: Tue, 18 Sep 2007 15:50:51 -0700 (PDT) Subject: [ofa-general] Re: [PATCH net-2.6.24] Fix refcounting problem with netif_rx_reschedule() In-Reply-To: References: <20070918.131505.91210517.davem@davemloft.net> Message-ID: <20070918.155051.122059976.davem@davemloft.net> From: Roland Dreier Date: Tue, 18 Sep 2007 15:46:36 -0700 > OK, I actually have a system with a ppc 440 SoC that uses this driver, > so I'll try to get things to the stage where I can boot net-2.6.24 on > it and see if I can get the driver working... Thanks a lot Roland. From sashak at voltaire.com Tue Sep 18 16:57:55 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 01:57:55 +0200 Subject: [ofa-general] [PATCH] opensm/Makefile.am: more 'make dist' fixes Message-ID: <20070918235755.GJ31938@sashak.voltaire.com> Add all scripts to EXTRA_DIST list - it is used in spec file when rpm is generated. Signed-off-by: Sasha Khapyorsky --- opensm/Makefile.am | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/opensm/Makefile.am b/opensm/Makefile.am index 9cbce3a..b7b6e6a 100644 --- a/opensm/Makefile.am +++ b/opensm/Makefile.am @@ -23,7 +23,9 @@ endif man_MANS = man/opensm.8 man/osmtest.8 -EXTRA_DIST = opensm.spec scripts/opensm.init scripts/opensm.sysconfig $(man_MANS) +various_scripts = $(wildcard scripts/*) + +EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) dist-hook: $(EXTRA_DIST) mkdir -p $(distdir)/scripts -- 1.5.3.1.91.gd3392 From sashak at voltaire.com Tue Sep 18 17:25:52 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 02:25:52 +0200 Subject: [ofa-general] [PATCH] management/*/*.spec.in: don't run autogen.sh Message-ID: <20070919002552.GK31938@sashak.voltaire.com> Don't run autogen.sh script for rpm generation - valid tarballs should have already generated ./configure. 
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/infiniband-diags.spec.in | 3 +-- libibcommon/libibcommon.spec.in | 3 +-- libibmad/libibmad.spec.in | 3 +-- libibumad/libibumad.spec.in | 3 +-- opensm/opensm.spec.in | 3 +-- 5 files changed, 5 insertions(+), 10 deletions(-) diff --git a/infiniband-diags/infiniband-diags.spec.in b/infiniband-diags/infiniband-diags.spec.in index c43eb58..4abf761 100644 --- a/infiniband-diags/infiniband-diags.spec.in +++ b/infiniband-diags/infiniband-diags.spec.in @@ -11,7 +11,7 @@ Group: System Environment/Libraries BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) Source: git://git.openfabrics.org/~sashak/management/@TARBALL@ Url: http://openfabrics.org/ -BuildRequires: libibmad-devel, opensm-devel, autoconf, automake +BuildRequires: libibmad-devel, opensm-devel Provides: perl(IBswcountlimits) %description @@ -26,7 +26,6 @@ diagnose an IB subnet. %endif %build -./autogen.sh %configure %{?_enable_switch_map} make diff --git a/libibcommon/libibcommon.spec.in b/libibcommon/libibcommon.spec.in index e7c5d76..bc7995f 100644 --- a/libibcommon/libibcommon.spec.in +++ b/libibcommon/libibcommon.spec.in @@ -13,7 +13,7 @@ Source: git://git.openfabrics.org/~sashak/management/@TARBALL@ Url: http://openfabrics.org/ Requires(post): /sbin/ldconfig Requires(postun): /sbin/ldconfig -BuildRequires: autoconf, libtool, automake +BuildRequires: libtool %description libibcommon provides common utility functions for the OFA diagnostic and @@ -41,7 +41,6 @@ Static library files for the libibcommon library. 
%setup -q %build -./autogen.sh %configure make %{?_smp_mflags} diff --git a/libibmad/libibmad.spec.in b/libibmad/libibmad.spec.in index 4263eac..93f8b20 100644 --- a/libibmad/libibmad.spec.in +++ b/libibmad/libibmad.spec.in @@ -11,7 +11,7 @@ Group: System Environment/Libraries BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) Source: git://git.openfabrics.org/~sashak/management/@TARBALL@ Url: http://openfabrics.org/ -BuildRequires: libibumad-devel, autoconf, libtool, automake +BuildRequires: libibumad-devel, libtool Requires(post): /sbin/ldconfig Requires(postun): /sbin/ldconfig @@ -42,7 +42,6 @@ Static version of the libibmad library %setup -q %build -./autogen.sh %configure make %{?_smp_mflags} diff --git a/libibumad/libibumad.spec.in b/libibumad/libibumad.spec.in index 87bd071..64a7293 100644 --- a/libibumad/libibumad.spec.in +++ b/libibumad/libibumad.spec.in @@ -13,7 +13,7 @@ Source: git://git.openfabrics.org/~sashak/management/@TARBALL@ Url: http://openfabrics.org Requires(post): /sbin/ldconfig Requires(postun): /sbin/ldconfig -BuildRequires: libibcommon-devel, autoconf, libtool, automake +BuildRequires: libibcommon-devel, libtool %description libibumad provides the user MAD library functions which sit on top of @@ -42,7 +42,6 @@ Static version of the libibumad library. 
%setup -q %build -./autogen.sh %configure make %{?_smp_mflags} diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index 4179b27..9b9409d 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -30,7 +30,7 @@ Group: System Environment/Daemons URL: http://openfabrics.org/ Source: git://git.openfabrics.org/~sashak/management/@TARBALL@ BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) -BuildRequires: libibumad-devel, autoconf, libtool, automake +BuildRequires: libibumad-devel, libtool Requires: %{name}-libs = %{version}-%{release}, logrotate Requires(post): /sbin/service, /sbin/chkconfig Requires(preun): /sbin/chkconfig, /sbin/service @@ -72,7 +72,6 @@ Static version of the opensm libraries %setup -q %build -./autogen.sh %configure \ %{?_enable_console_socket} \ %{?_disable_console_socket} \ -- 1.5.3.1.91.gd3392 From sashak at voltaire.com Tue Sep 18 17:31:15 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 02:31:15 +0200 Subject: [ofa-general] [PATCH] management/*/Makefile.am: add autogen.sh script to 'make dist' Message-ID: <20070919003115.GL31938@sashak.voltaire.com> Someone may want to regenerate the auto* files from a distributed tarball, so add autogen.sh to the archive.
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/Makefile.am | 3 +-- libibcommon/Makefile.am | 2 +- libibmad/Makefile.am | 2 +- libibumad/Makefile.am | 2 +- 4 files changed, 4 insertions(+), 5 deletions(-) diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index f6b6292..aad0020 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -93,7 +93,7 @@ man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ man/ibdatacounts.8 man/ibdatacounters.8 \ man/ibrouters.8 man/ibprintrt.8 man/ibidsverify.8 -EXTRA_DIST = scripts include infiniband-diags.spec.in $(man_MANS) +EXTRA_DIST = scripts include infiniband-diags.spec.in $(man_MANS) autogen.sh dist-hook: infiniband-diags.spec cp infiniband-diags.spec $(distdir) @@ -101,4 +101,3 @@ dist-hook: infiniband-diags.spec # install this to a default location. install-data-hook: $(top_srcdir)/config/install-sh -m 444 scripts/IBswcountlimits.pm $(DESTDIR)/$(PERL_INSTALLDIR)/IBswcountlimits.pm - diff --git a/libibcommon/Makefile.am b/libibcommon/Makefile.am index 7bd264c..c2cd087 100644 --- a/libibcommon/Makefile.am +++ b/libibcommon/Makefile.am @@ -23,7 +23,7 @@ libibcommonincludedir = $(includedir)/infiniband libibcommoninclude_HEADERS = $(srcdir)/include/infiniband/common.h EXTRA_DIST = $(srcdir)/include/infiniband/common.h libibcommon.spec.in \ - $(srcdir)/src/libibcommon.map libibcommon.ver + $(srcdir)/src/libibcommon.map libibcommon.ver autogen.sh dist-hook: libibcommon.spec cp libibcommon.spec $(distdir) diff --git a/libibmad/Makefile.am b/libibmad/Makefile.am index f12e1f9..86d233d 100644 --- a/libibmad/Makefile.am +++ b/libibmad/Makefile.am @@ -30,7 +30,7 @@ libibmadincludedir = $(includedir)/infiniband libibmadinclude_HEADERS = $(srcdir)/include/infiniband/mad.h EXTRA_DIST = $(srcdir)/include/infiniband/mad.h libibmad.spec.in \ - $(srcdir)/src/libibmad.map libibmad.ver + $(srcdir)/src/libibmad.map libibmad.ver autogen.sh dist-hook: libibmad.spec cp libibmad.spec 
$(distdir) diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am index 25495c3..6b0a59b 100644 --- a/libibumad/Makefile.am +++ b/libibumad/Makefile.am @@ -38,7 +38,7 @@ libibumadinclude_HEADERS = $(srcdir)/include/infiniband/umad.h EXTRA_DIST = $(srcdir)/include/infiniband/umad.h libibumad.spec.in \ $(srcdir)/src/libibumad.map libibumad.ver \ - $(man_MANS) + $(man_MANS) autogen.sh dist-hook: libibumad.spec cp libibumad.spec $(distdir) -- 1.5.3.1.91.gd3392 From ggrundstrom at neteffect.com Tue Sep 18 17:22:37 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Tue, 18 Sep 2007 19:22:37 -0500 Subject: [ofa-general] [PATCH] RDMA/CMA: Implement rdma_resolve_ip retry enhancement. Message-ID: <200709190022.l8J0MbWt024754@neteffect.com> RDMA/CMA: Implement rdma_resolve_ip retry enhancement. If an application is calling rdma_resolve_ip() and a status of -ENODATA is returned from addr_resolve_local/remote(), the timeout mechanism waits until the application's timeout occurs before rechecking the address resolution status; the application will wait until its full timeout occurs. This case is seen when the work thread call to process_req() is made before the ARP packet is processed. This patch is in addition to Steve Wise's neigh_event_send patch to initiate neighbour discovery sent on 9/12/2007.
Signed-off-by: Glenn Grundstrom --- drivers/infiniband/core/addr.c | 11 +++++++++-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index c5c33d3..a953780 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -55,6 +55,7 @@ struct addr_req { int status; }; +#define MIN_ADDR_TIMEOUT_MS 500 static void process_req(struct work_struct *work); static DEFINE_MUTEX(lock); @@ -136,6 +137,7 @@ static void set_timeout(unsigned long ti static void queue_req(struct addr_req *req) { struct addr_req *temp_req; + unsigned long req_timeout = msecs_to_jiffies(MIN_ADDR_TIMEOUT_MS) + jiffies; mutex_lock(&lock); list_for_each_entry_reverse(temp_req, &req_list, list) { @@ -145,8 +147,10 @@ static void queue_req(struct addr_req *r list_add(&req->list, &temp_req->list); - if (req_list.next == &req->list) + if (req_list.next == &req->list) { + req_timeout = min(req_timeout, req->timeout); set_timeout(req->timeout); + } mutex_unlock(&lock); } @@ -220,6 +224,7 @@ static void process_req(struct work_stru struct addr_req *req, *temp_req; struct sockaddr_in *src_in, *dst_in; struct list_head done_list; + unsigned long req_timeout; INIT_LIST_HEAD(&done_list); @@ -238,9 +243,11 @@ static void process_req(struct work_stru list_move_tail(&req->list, &done_list); } + req_timeout = msecs_to_jiffies(MIN_ADDR_TIMEOUT_MS) + jiffies; if (!list_empty(&req_list)) { req = list_entry(req_list.next, struct addr_req, list); - set_timeout(req->timeout); + req_timeout = min(req_timeout, req->timeout); + set_timeout(req_timeout); } mutex_unlock(&lock); From sashak at voltaire.com Tue Sep 18 18:29:28 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 03:29:28 +0200 Subject: [ofa-general] [PATCH] management/gen_chlog.sh: fixes and improvements Message-ID: <20070919012928.GM31938@sashak.voltaire.com> Parse tags properly and make ChangeLog output similar to existing one. 
Signed-off-by: Sasha Khapyorsky --- gen_chlog.sh | 8 +++++--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/gen_chlog.sh b/gen_chlog.sh index 9d60081..c54028a 100755 --- a/gen_chlog.sh +++ b/gen_chlog.sh @@ -33,7 +33,7 @@ mkchlog() prev_tag="" - for tag in `git-tag -l $target` ; do + for tag in `git-tag -l ${target}-'*'` ; do obj=`git-cat-file tag $tag | awk '/^object /{print $2}'` base=`git-merge-base $obj HEAD` if [ -z "$base" -o "$base" != $obj ] ; then @@ -51,7 +51,8 @@ mkchlog() for ver in $all_vers ; do ver_name=`echo $ver | sed -e 's/^.*\.\.//'` - echo "* Version: $ver_name" + echo "" + echo "** Version: $ver_name" echo "" git-log --no-merges "${format}" $ver -- $target prev_t=$tag.. @@ -60,7 +61,8 @@ mkchlog() if [ -z "$spec_format" ] ; then - mkchlog $TARGET --pretty=format:"commit %H%n%ad %an%n%n %s%n" + mkchlog $TARGET --pretty=format:"%ad %an%n%H%n%n* %s%n" \ + | sed -e 's/^\* /\t* /' else echo "%changelog" mkchlog $TARGET --pretty=format:"- %ad %an: %s" -- 1.5.3.1.91.gd3392 From sashak at voltaire.com Tue Sep 18 18:30:20 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 03:30:20 +0200 Subject: [ofa-general] [PATCH] opensm/Makefile.am: generate ChangeLog on 'make dist' Message-ID: <20070919013020.GN31938@sashak.voltaire.com> This generates up-to-date ChangeLog file from git repo and places it in distribution directory. Activated via dist-hook:. 
Signed-off-by: Sasha Khapyorsky --- opensm/Makefile.am | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/opensm/Makefile.am b/opensm/Makefile.am index b7b6e6a..f0ef88a 100644 --- a/opensm/Makefile.am +++ b/opensm/Makefile.am @@ -30,3 +30,5 @@ EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) dist-hook: $(EXTRA_DIST) mkdir -p $(distdir)/scripts cp -r --parents $< $(distdir) + test -x ../$(top_srcdir)/gen_chlog.sh \ + && ../$(top_srcdir)/gen_chlog.sh $(PACKAGE) > $(distdir)/ChangeLog -- 1.5.3.1.91.gd3392 From sashak at voltaire.com Tue Sep 18 19:16:03 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 04:16:03 +0200 Subject: [ofa-general] [PATCH] management/*/configure: RELEASE and TARBALL for spec files Message-ID: <20070919021603.GO31938@sashak.voltaire.com> The RELEASE and TARBALL variables are used as substitution patterns in *.spec.in files, but are never defined. As a result, the ./configure script produces invalid *.spec files for management sub-projects. This patch fixes this.
Values are defined as "unknown" and "$PACKAGE-$VERSION.tar.gz", but could be overwritten with environment variables, like this: RELEASE=ofed_7.8 TARBALL=opensm-7.8.9-customized.tar.gz ./configure Signed-off-by: Sasha Khapyorsky --- infiniband-diags/configure.in | 3 +++ libibcommon/configure.in | 3 +++ libibmad/configure.in | 3 +++ libibumad/configure.in | 3 +++ opensm/configure.in | 3 +++ 5 files changed, 15 insertions(+), 0 deletions(-) diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in index 171dec7..2da23de 100644 --- a/infiniband-diags/configure.in +++ b/infiniband-diags/configure.in @@ -6,6 +6,9 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE(infiniband-diags, 1.3.1) +AC_SUBST(RELEASE, ${RELEASE:-unknown}) +AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz}) + AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], [ if test x$enableval = xno ; then disable_libcheck=yes diff --git a/libibcommon/configure.in b/libibcommon/configure.in index 78f615d..5d08725 100644 --- a/libibcommon/configure.in +++ b/libibcommon/configure.in @@ -7,6 +7,9 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE +AC_SUBST(RELEASE, ${RELEASE:-unknown}) +AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz}) + dnl the library version info is available in the file: libibcommon.ver ibcommon_api_version=`grep LIBVERSION $srcdir/libibcommon.ver | sed 's/LIBVERSION=//'` if test -z $ibcommon_api_version; then diff --git a/libibmad/configure.in b/libibmad/configure.in index 83d4bfc..3232472 100644 --- a/libibmad/configure.in +++ b/libibmad/configure.in @@ -7,6 +7,9 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE +AC_SUBST(RELEASE, ${RELEASE:-unknown}) +AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz}) + dnl the library version info is available in the file: libibmad.ver ibmad_api_version=`grep LIBVERSION $srcdir/libibmad.ver | sed 
's/LIBVERSION=//'` if test -z $ibmad_api_version; then diff --git a/libibumad/configure.in b/libibumad/configure.in index d5ebe5b..c42a2b3 100644 --- a/libibumad/configure.in +++ b/libibumad/configure.in @@ -7,6 +7,9 @@ AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) AM_INIT_AUTOMAKE +AC_SUBST(RELEASE, ${RELEASE:-unknown}) +AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz}) + dnl the library version info is available in the file: libibumad.ver ibumad_api_version=`grep LIBVERSION $srcdir/libibumad.ver | sed 's/LIBVERSION=//'` if test -z $ibumad_api_version; then diff --git a/opensm/configure.in b/opensm/configure.in index 6c4db9f..cb27ffd 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -7,6 +7,9 @@ AC_CONFIG_AUX_DIR(config) AC_CONFIG_HEADERS(include/config.h) AM_INIT_AUTOMAKE(opensm, 3.1.1) +AC_SUBST(RELEASE, ${RELEASE:-unknown}) +AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz}) + dnl Defines the Language AC_LANG_C -- 1.5.3.1.91.gd3392 From krkumar2 at in.ibm.com Tue Sep 18 20:23:55 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Wed, 19 Sep 2007 08:53:55 +0530 Subject: [ofa-general] Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang In-Reply-To: Message-ID: Hi Roland, Roland Dreier wrote on 09/18/2007 07:57:24 PM: > > While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with > > I don't think you're actually using rc6 bits, since in your patch you have: > > > -poll_more: > > and I think that is only in Dave's net-2.6.24 tree now, right? Nope, that was what I downloaded yesterday: VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 23 EXTRAVERSION =-rc6 NAME = Pink Farting Weasel > > + if (likely(!ib_req_notify_cq(priv->cq, > > + IB_CQ_NEXT_COMP | > > + IB_CQ_REPORT_MISSED_EVENTS))) > > It is possible for an interrupt to happen immediately right here, > before the netif_rx_complete(), so that netif_rx_schedule() gets > called while we are still on the poll list. 
> > > + netif_rx_complete(dev, napi); To be clear, netif_rx_schedule while we are still in the poll list will not do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared by netif_rx_complete which has not yet run). Effectively we lost/delayed processing an interrupt, if I understood the code right. I agree with you on the new patch. thanks, - KK From rdreier at cisco.com Tue Sep 18 20:30:36 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 18 Sep 2007 20:30:36 -0700 Subject: [ofa-general] Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang In-Reply-To: (Krishna Kumar2's message of "Wed, 19 Sep 2007 08:53:55 +0530") References: Message-ID: > > and I think that is only in Dave's net-2.6.24 tree now, right? > > Nope, that was what I downloaded yesterday: > > VERSION = 2 > PATCHLEVEL = 6 > SUBLEVEL = 23 > EXTRAVERSION =-rc6 > NAME = Pink Farting Weasel Please double check your tree. I just very carefully looked at my trees, and the poll_more: label is added in commit 6b460a71 ("[NET]: Make NAPI polling independent of struct net_device objects.") which is only in the net-2.6.24 tree. Of course Dave did not change the version information in the Makefile since he wouldn't want Linus to pick up any extra strange changes when he pulls, so a net-2.6.24 tree will look like 2.6.23-rc6 as you quoted. And the refcounting bug I fixed is only in net-2.6.24. > To be clear, netif_rx_schedule while we are still in the poll list will not > do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared > by netif_rx_complete which has not yet run). Effectively we lost/delayed > processing an interrupt, if I understood the code right. Right, we lose an interrupt, and since the CQ events are one-shot, we never get another one, and the interface is effectively dead. - R. 
From krkumar2 at in.ibm.com Tue Sep 18 21:24:18 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Wed, 19 Sep 2007 09:54:18 +0530 Subject: [ofa-general] Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang In-Reply-To: Message-ID: Hi Roland, > Please double check your tree. I just very carefully looked at my > trees, and the poll_more: label is added in commit 6b460a71 ("[NET]: > Make NAPI polling independent of struct net_device objects.") which is > only in the net-2.6.24 tree. Of course Dave did not change the > version information in the Makefile since he wouldn't want Linus to > pick up any extra strange changes when he pulls, so a net-2.6.24 tree > will look like 2.6.23-rc6 as you quoted. > > And the refcounting bug I fixed is only in net-2.6.24. You are absolutely right. My wording was incorrect; I should have said net-2.6.24 (which is *at* rev rc6). > > To be clear, netif_rx_schedule while we are still in the poll list will not > > do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared > > by netif_rx_complete which has not yet run). Effectively we lost/delayed > > processing an interrupt, if I understood the code right. > > Right, we lose an interrupt, and since the CQ events are one-shot, we > never get another one, and the interface is effectively dead. Thanks, - KK From sashak at voltaire.com Tue Sep 18 21:37:50 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 06:37:50 +0200 Subject: [ofa-general] [PATCH] libibmad,infiniband-diags: fix include paths Message-ID: <20070919043750.GP31938@sashak.voltaire.com> Use include <infiniband/*.h> (rather than just <*.h>) for external management header files from libibcommon and libibumad (also libibmad for diags).
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/grouping.c | 4 ++-- infiniband-diags/src/ibaddr.c | 6 +++--- infiniband-diags/src/ibnetdiscover.c | 6 +++--- infiniband-diags/src/ibping.c | 6 +++--- infiniband-diags/src/ibportstate.c | 6 +++--- infiniband-diags/src/ibroute.c | 6 +++--- infiniband-diags/src/ibstat.c | 6 +++--- infiniband-diags/src/ibsysstat.c | 6 +++--- infiniband-diags/src/ibtracert.c | 6 +++--- infiniband-diags/src/perfquery.c | 6 +++--- infiniband-diags/src/sminfo.c | 6 +++--- infiniband-diags/src/smpdump.c | 6 +++--- infiniband-diags/src/smpquery.c | 6 +++--- infiniband-diags/src/vendstat.c | 6 +++--- libibmad/src/gs.c | 2 +- libibmad/src/mad.c | 4 ++-- libibmad/src/register.c | 2 +- libibmad/src/resolve.c | 4 ++-- libibmad/src/rpc.c | 2 +- libibmad/src/serv.c | 4 ++-- 20 files changed, 50 insertions(+), 50 deletions(-) diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c index 89b7ea0..621d49e 100644 --- a/infiniband-diags/src/grouping.c +++ b/infiniband-diags/src/grouping.c @@ -44,8 +44,8 @@ #include #include -#include -#include +#include +#include #include "ibnetdiscover.h" #include "grouping.h" diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c index 04aa2fa..c61b6b7 100644 --- a/infiniband-diags/src/ibaddr.c +++ b/infiniband-diags/src/ibaddr.c @@ -42,9 +42,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 6574f2b..e627e84 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -48,9 +48,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2.5 -#include -#include -#include +#include +#include +#include #include "ibnetdiscover.h" #include "grouping.h" diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c index 76f4258..ea46002 100644 --- 
a/infiniband-diags/src/ibping.c +++ b/infiniband-diags/src/ibping.c @@ -45,9 +45,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c index 45af970..9ea7529 100644 --- a/infiniband-diags/src/ibportstate.c +++ b/infiniband-diags/src/ibportstate.c @@ -44,9 +44,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2.2 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c index 77beb36..44d2fc8 100644 --- a/infiniband-diags/src/ibroute.c +++ b/infiniband-diags/src/ibroute.c @@ -47,9 +47,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c index 9f5e15d..4653390 100644 --- a/infiniband-diags/src/ibstat.c +++ b/infiniband-diags/src/ibstat.c @@ -58,9 +58,9 @@ #include #define __BUILD_VERSION_TAG__ 1.1 -#include -#include -#include +#include +#include +#include #define DEBUG if (debug) IBWARN diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c index 832c0e7..2435c87 100644 --- a/infiniband-diags/src/ibsysstat.c +++ b/infiniband-diags/src/ibsysstat.c @@ -44,9 +44,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c index f085fd6..e553f4f 100644 --- a/infiniband-diags/src/ibtracert.c +++ b/infiniband-diags/src/ibtracert.c @@ -47,9 +47,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2.1 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 
315a900..2ae3281 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -42,9 +42,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2.1 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c index 9a0e5a7..0cd63f9 100644 --- a/infiniband-diags/src/sminfo.c +++ b/infiniband-diags/src/sminfo.c @@ -43,9 +43,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2.1 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c index 7ed621a..5eceea7 100644 --- a/infiniband-diags/src/smpdump.c +++ b/infiniband-diags/src/smpdump.c @@ -57,9 +57,9 @@ #include #define __BUILD_VERSION_TAG__ 1.1 -#include -#include -#include +#include +#include +#include #define DEBUG if (debug) IBWARN diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c index 9e25255..73e880b 100644 --- a/infiniband-diags/src/smpquery.c +++ b/infiniband-diags/src/smpquery.c @@ -48,9 +48,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2.2 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c index 2c8ef3b..fa0206c 100644 --- a/infiniband-diags/src/vendstat.c +++ b/infiniband-diags/src/vendstat.c @@ -43,9 +43,9 @@ #include #define __BUILD_VERSION_TAG__ 1.2.1 -#include -#include -#include +#include +#include +#include #include "ibdiag_common.h" diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c index 3489885..7e9f4f4 100644 --- a/libibmad/src/gs.c +++ b/libibmad/src/gs.c @@ -42,7 +42,7 @@ #include #include -#include +#include #include "mad.h" #undef DEBUG diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c index bbd39d8..1137fb5 100644 --- a/libibmad/src/mad.c +++ b/libibmad/src/mad.c @@ -42,9 +42,9 @@ #include #include -#include -#include #include 
+#include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/register.c b/libibmad/src/register.c index 08e781a..3d1285a 100644 --- a/libibmad/src/register.c +++ b/libibmad/src/register.c @@ -43,7 +43,7 @@ #include #include -#include +#include #include "mad.h" #undef DEBUG diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index 4a782ec..05b443d 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -42,9 +42,9 @@ #include #include -#include -#include #include +#include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index b9daa47..17f43b8 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -43,7 +43,7 @@ #include #include -#include +#include #include "mad.h" #define MAX_CLASS 256 diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c index e59cb63..9b20cb6 100644 --- a/libibmad/src/serv.c +++ b/libibmad/src/serv.c @@ -43,9 +43,9 @@ #include #include -#include -#include #include +#include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN -- 1.5.3.rc2.29.gc4640f From kliteyn at mellanox.co.il Tue Sep 18 22:17:53 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 19 Sep 2007 07:17:53 +0200 Subject: [ofa-general] nightly osm_sim report 2007-09-19:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-18 OpenSM git rev = Tue_Sep_18_08:32:25_2007 [6bc4b03807b6fe7e12121cbbca1a200b66f72d0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 
LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From dotanb at dev.mellanox.co.il Tue Sep 18 23:14:32 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 19 Sep 2007 08:14:32 +0200 Subject: [ofa-general] IBV_WC_LOC_PROT_ERROR in receive In-Reply-To: <20070918180253.GA18113@vt.edu> References: <20070918042202.GA8660@vt.edu> <46EF5F6E.3080708@dev.mellanox.co.il> <20070918180253.GA18113@vt.edu> Message-ID: <46F0BE48.50005@dev.mellanox.co.il> Bharath Ramesh wrote: > I checked for the following: > 1) I havent deregistered the MR. > 2) I am using a RC QP > 3) The messages size are the same 40 bytes. > 4) I only have one PD for the entire application, i.e both QP and MR > belong to the same PD > 5) The vendor error that I get in the WC is error code 52. > 6) I forgot to mention this in the earlier mail the snippet for my send > is as follows: > I believe the problem is related to the incoming message size and the attributes given in the scatter entry: either the buffer size specified there is smaller than the message size, or the MR is smaller than the message. I suggest checking all of the values in the scatter entry of the RR again. If you would like me to review your source, you are welcome to send it. Dotan From mst at dev.mellanox.co.il Tue Sep 18 23:19:08 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 19 Sep 2007 08:19:08 +0200 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <200709180909.50029.jackm@dev.mellanox.co.il> References: <200709180909.50029.jackm@dev.mellanox.co.il> Message-ID: <20070919061908.GA31331@mellanox.co.il> > Missing from this list (IMPORTANT patch!): > [ofa-general] [PATCH 2 of 2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists > (Posted by me to list on Sept 4) > {patch header: > This is an addendum to Roland's commit 0e6e74162164d908edf7889ac66dca09e7505745 > (June 18). This addendum adds prefetch headroom marking processing for s/g segments. > > We write s/g segments in reverse order into the WQE, in order to guarantee > that the first dword of all cachelines containing s/g segments is written last > (overwriting the headroom invalidation pattern). The entire cacheline will thus > contain valid data when the invalidation pattern is overwritten. This actually looks like a bugfix that might even have been appropriate for 2.6.23. Roland, do you have this patch? Can you comment on it please? -- MST From mst at dev.mellanox.co.il Tue Sep 18 23:25:08 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 19 Sep 2007 08:25:08 +0200 Subject: [ofa-general] [PATCH repost] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists Message-ID: <20070919062508.GB31331@mellanox.co.il> From: Jack Morgenstein This is an addendum to Roland's commit 0e6e74162164d908edf7889ac66dca09e7505745 (June 18). This addendum adds prefetch headroom marking processing for s/g segments. We write s/g segments in reverse order into the WQE, in order to guarantee that the first dword of all cachelines containing s/g segments is written last (overwriting the headroom invalidation pattern). The entire cacheline will thus contain valid data when the invalidation pattern is overwritten. 
Signed-off-by: Jack Morgenstein --- Roland, there were some ideas on optimizing this patch, which you later decided to hold off on. Meanwhile, it seems the whole patch has been lost, which is a problem since it fixes data corruption observed under stress. What do you say? Index: ofed_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-07-30 16:35:01.000000000 +0300 +++ ofed_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-07-30 17:05:47.000000000 +0300 @@ -1215,9 +1215,18 @@ static void set_datagram_seg(struct mlx4 static void set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ib_sge *sg) { - dseg->byte_count = cpu_to_be32(sg->length); dseg->lkey = cpu_to_be32(sg->lkey); dseg->addr = cpu_to_be64(sg->addr); + + /* Need a barrier before writing the byte_count field + * to make sure that all the data is visible before the + * byte_count field is set. Otherwise, if the segment + * begins a new cacheline, the HCA prefetcher could + * grab the 64-byte chunk and get a valid (!= * 0xffffffff) + * byte count but stale data, and end up sending the wrong + * data. */ + wmb(); + dseg->byte_count = cpu_to_be32(sg->length); } int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, @@ -1226,6 +1235,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp struct mlx4_ib_qp *qp = to_mqp(ibqp); void *wqe; struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_data_seg *seg; unsigned long flags; int nreq; int err = 0; @@ -1325,19 +1335,22 @@ int mlx4_ib_post_send(struct ib_qp *ibqp break; } - for (i = 0; i < wr->num_sge; ++i) { - set_data_seg(wqe, wr->sg_list + i); - - wqe += sizeof (struct mlx4_wqe_data_seg); + seg = (struct mlx4_wqe_data_seg *) wqe; + /* Add one more inline data segment for ICRC for MLX sends. + * Write this inline and all s/g segments in reverse order, + * so as to overwrite cacheline stamp last within each + * cacheline. 
*/ + if (qp->ibqp.qp_type == IB_QPT_SMI || qp->ibqp.qp_type == IB_QPT_GSI) { + void *t = wqe + (wr->num_sge) * sizeof(struct mlx4_wqe_data_seg); + ((u32 *) t)[1] = 0; + wmb(); + ((struct mlx4_wqe_inline_seg *) t)->byte_count = + cpu_to_be32((1 << 31) | 4); size += sizeof (struct mlx4_wqe_data_seg) / 16; } - /* Add one more inline data segment for ICRC for MLX sends */ - if (qp->ibqp.qp_type == IB_QPT_SMI || qp->ibqp.qp_type == IB_QPT_GSI) { - ((struct mlx4_wqe_inline_seg *) wqe)->byte_count = - cpu_to_be32((1 << 31) | 4); - ((u32 *) wqe)[1] = 0; - wqe += sizeof (struct mlx4_wqe_data_seg); + for (i = wr->num_sge - 1; i >= 0; --i) { + set_data_seg(seg + i, wr->sg_list + i); size += sizeof (struct mlx4_wqe_data_seg) / 16; } _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ----- End forwarded message ----- -- MST From mst at dev.mellanox.co.il Tue Sep 18 23:34:21 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 19 Sep 2007 08:34:21 +0200 Subject: [ofa-general] [PATCHv2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists In-Reply-To: <200709041047.32062.jackm@dev.mellanox.co.il> References: <200709041047.32062.jackm@dev.mellanox.co.il> Message-ID: <20070919063421.GA6185@mellanox.co.il> From: Jack Morgenstein : Subject: IB/mlx4: fix data corruption triggered by wrong headroom marking order This is an addendum to Roland's commit 0e6e74162164d908edf7889ac66dca09e7505745 (June 18). This addendum adds prefetch headroom marking processing for s/g segments. We write s/g segments in reverse order into the WQE, in order to guarantee that the first dword of all cachelines containing s/g segments is written last (overwriting the headroom invalidation pattern). 
The entire cacheline will thus contain valid data when the invalidation pattern is overwritten. Signed-off-by: Jack Morgenstein --- The previous patch version turned out to contain a space followed by a tab. Here's a fixed one. Index: ofed_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-07-30 16:35:01.000000000 +0300 +++ ofed_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-07-30 17:05:47.000000000 +0300 @@ -1215,9 +1215,18 @@ static void set_datagram_seg(struct mlx4 static void set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ib_sge *sg) { - dseg->byte_count = cpu_to_be32(sg->length); dseg->lkey = cpu_to_be32(sg->lkey); dseg->addr = cpu_to_be64(sg->addr); + + /* Need a barrier before writing the byte_count field + * to make sure that all the data is visible before the + * byte_count field is set. Otherwise, if the segment + * begins a new cacheline, the HCA prefetcher could + * grab the 64-byte chunk and get a valid (!= * 0xffffffff) + * byte count but stale data, and end up sending the wrong + * data. */ + wmb(); + dseg->byte_count = cpu_to_be32(sg->length); } int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, @@ -1226,6 +1235,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp struct mlx4_ib_qp *qp = to_mqp(ibqp); void *wqe; struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_data_seg *seg; unsigned long flags; int nreq; int err = 0; @@ -1325,19 +1335,22 @@ int mlx4_ib_post_send(struct ib_qp *ibqp break; } - for (i = 0; i < wr->num_sge; ++i) { - set_data_seg(wqe, wr->sg_list + i); - - wqe += sizeof (struct mlx4_wqe_data_seg); + seg = (struct mlx4_wqe_data_seg *) wqe; + /* Add one more inline data segment for ICRC for MLX sends. + * Write this inline and all s/g segments in reverse order, + * so as to overwrite cacheline stamp last within each + * cacheline. 
 */ + if (qp->ibqp.qp_type == IB_QPT_SMI || qp->ibqp.qp_type == IB_QPT_GSI) { + void *t = wqe + (wr->num_sge) * sizeof(struct mlx4_wqe_data_seg); + ((u32 *) t)[1] = 0; + wmb(); + ((struct mlx4_wqe_inline_seg *) t)->byte_count = + cpu_to_be32((1 << 31) | 4); size += sizeof (struct mlx4_wqe_data_seg) / 16; } - /* Add one more inline data segment for ICRC for MLX sends */ - if (qp->ibqp.qp_type == IB_QPT_SMI || qp->ibqp.qp_type == IB_QPT_GSI) { - ((struct mlx4_wqe_inline_seg *) wqe)->byte_count = - cpu_to_be32((1 << 31) | 4); - ((u32 *) wqe)[1] = 0; - wqe += sizeof (struct mlx4_wqe_data_seg); + for (i = wr->num_sge - 1; i >= 0; --i) { + set_data_seg(seg + i, wr->sg_list + i); size += sizeof (struct mlx4_wqe_data_seg) / 16; } -- MST
From vlad at lists.openfabrics.org Wed Sep 19 02:51:54 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 19 Sep 2007 02:51:54 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070919-0200 daily build status Message-ID: <20070919095154.C8DD2E608C5@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with
linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070919-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sashak at voltaire.com Wed Sep 19 
03:47:55 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 12:47:55 +0200 Subject: [ofa-general] [PATCH] management/*/Makefile.am: fix include paths Message-ID: <20070919104755.GL29384@sashak.voltaire.com> Fix include paths for management libraries and diags builds - since it will always linked against installed libs, we will use only installed external header files for build. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/Makefile.am | 7 +------ libibmad/Makefile.am | 5 +---- libibumad/Makefile.am | 3 +-- 3 files changed, 3 insertions(+), 12 deletions(-) diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index aad0020..ca18178 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -1,10 +1,5 @@ -INCLUDES = -I include \ - -I$(srcdir)/../libibcommon/include/infiniband \ - -I$(srcdir)/../libibumad/include/infiniband \ - -I$(srcdir)/../libibmad/include/infiniband \ - -I$(srcdir)/../opensm/include \ - -I$(includedir)/infiniband +INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband if DEBUG DBGFLAGS = -ggdb -D_DEBUG_ diff --git a/libibmad/Makefile.am b/libibmad/Makefile.am index 6f4fb95..974616e 100644 --- a/libibmad/Makefile.am +++ b/libibmad/Makefile.am @@ -1,10 +1,7 @@ SUBDIRS = . -INCLUDES = -I$(srcdir)/include/infiniband \ - -I$(srcdir)/../libibcommon/include/infiniband \ - -I$(srcdir)/../libibumad/include/infiniband \ - -I$(includedir)/infiniband +INCLUDES = -I$(srcdir)/include/infiniband -I$(includedir) lib_LTLIBRARIES = libibmad.la diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am index be65673..dad6168 100644 --- a/libibumad/Makefile.am +++ b/libibumad/Makefile.am @@ -1,8 +1,7 @@ SUBDIRS = . 
-INCLUDES = -I$(srcdir)/include/infiniband \ - -I$(srcdir)/../libibcommon/include/infiniband +INCLUDES = -I$(srcdir)/include/infiniband -I$(includedir) man_MANS = man/umad_debug.3 man/umad_get_ca.3 \ man/umad_get_ca_portguids.3 man/umad_get_cas_names.3 \ -- 1.5.3.rc2.29.gc4640f From johnpol at 2ka.mipt.ru Wed Sep 19 03:56:13 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Wed, 19 Sep 2007 14:56:13 +0400 Subject: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. In-Reply-To: <46EE9C50.7070406@opengridcomputing.com> References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <20070914130941.GG18517@2ka.mipt.ru> <46EC00BE.3020801@opengridcomputing.com> <20070916142241.GA26848@2ka.mipt.ru> <46EE9C50.7070406@opengridcomputing.com> Message-ID: <20070919105612.GA31158@2ka.mipt.ru> Hi Steve. On Mon, Sep 17, 2007 at 10:25:04AM -0500, Steve Wise (swise at opengridcomputing.com) wrote: > >Does creating the whole new netdevice is a too big overhead, or is it > >considered bad idea? > > I think its too big overhead, and pretty invasive on the low level cxgb3 > driver. I think having a device in the 'ifconfig -a' after iw_cxgb3 is > loaded and devices discovered would be a good thing for the admin. This > is the angle Roland suggested. I'm just not sure how to implement it. > > But if someone could explain how I might create this full netdevice as a > pseudo device on top of the real one, maybe I could implement it. > > Note that non TCP traffic still needs to utilize this interface for ND > to work properly with the RDMA core. Just a thought: what about allowing secondary addresses with the same address as the main one? I.e. change a bit of the core code to allow creating aliases with the same address as the main device, so that you would be able to create an ':iw' alias during rdma device initialization? If this solution is not acceptable, then I believe your alias change is the way to go.
-- Evgeniy Polyakov From krkumar2 at in.ibm.com Wed Sep 19 04:54:03 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Wed, 19 Sep 2007 17:24:03 +0530 Subject: [ofa-general] [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule() Message-ID: <20070919115403.19455.65941.sendpatchset@K50wks273871wss.in.ibm.com> Hi Dave, After applying Roland's NAPI patch, system panics when I run multiple thread iperf (no stack trace at this time, it shows that the panic is in net_tx_action). I think the problem is: In the "done < budget" case, ipoib_poll calls netif_rx_complete() netif_rx_complete() __netif_rx_complete() __napi_complete() list_del() __list_del() entry->next = LIST_POISON1; entry->prev = LIST_POISON2; Due to race with another completion (explained at end of the patch), net_rx_action finds quota==0 (even though done < budget earlier): net_rx_action() if (unlikely(!n->quota)) { n->quota = n->weight; list_move_tail() __list_del(POISON, POISON) } while IPoIB calling netif_rx_reschedule() works fine due to: netif_rx_reschedule __netif_rx_schedule __napi_schedule list_add_tail (this is OK) Patch that fixes this: diff -ruNp a/include/linux/netdevice.h b/include/linux/netdevice.h --- a/include/linux/netdevice.h 2007-09-19 16:50:35.000000000 +0530 +++ b/include/linux/netdevice.h 2007-09-19 16:51:28.000000000 +0530 @@ -346,7 +346,7 @@ static inline void napi_schedule(struct static inline void __napi_complete(struct napi_struct *n) { BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state)); - list_del(&n->poll_list); + __list_del(&n->poll_list); smp_mb__before_clear_bit(); clear_bit(NAPI_STATE_SCHED, &n->state); } When I apply this patch, things work fine but I get napi_check_quota_bug() warning. This race seems to happen as follows: CPU#1: ipoib_poll(budget=100) { A. process 100 skbs B. netif_rx_complete() F. ib_req_notify_cq() (no missed completions, do nothing) G. return 100 H. return to net_rx_action, quota=99, subtract 100, quota=-1, BUG. 
} CPU#2: ipoib_ib_completion() : (starts and finishes entire line of execution *after* step B and *before* H executes). { C. New skb comes, call netif_rx_schedule; set quota=100 D. do ipoib_poll(), process one skb, return work=1 to net_rx_action E. net_rx_action: set quota=99 } The reason both CPUs can execute poll simultaneously is that netpoll_poll_lock() returns NULL (dev->npinfo == NULL). This results in a negative napi refcount and the warning. I verified this is the reason by saving the original quota before calling poll (in net_tx_action) and comparing it with the final value after poll (before it gets updated), and it gets changed very often in multiple thread testing (at least 4 threads, haven't seen it with 2). In most cases, the quota becomes -1, and I have seen up to -9 but those are rarer. Note: during steps F-H and C-E, priv/napi is read/modified by both CPUs, which is another bug relating to the same race. I guess the above patch is not required if this bug (in IPoIB) is fixed? Roland, why can't we get rid of "poll_more"? We will get called again after netif_rx_reschedule, and it is cleaner to let the new execution handle fresh completions. Is there a reason why this goto is required? Thanks, - KK From dotanb at dev.mellanox.co.il Wed Sep 19 05:50:23 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 19 Sep 2007 14:50:23 +0200 Subject: [ofa-general] [PATCH] libibcm: add valgrind support to the libibcm Message-ID: <200709191450.23784.dotanb@dev.mellanox.co.il> Added valgrind support to the libibcm.
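For readers unfamiliar with the technique, here is a minimal sketch of the kind of annotation such support involves. It assumes a build without Valgrind headers, so the annotation macro falls back to a no-op; `struct fake_resp`, `fake_kernel_call()` and `create_id()` are hypothetical names invented for this illustration and are not libibcm functions.

```c
#include <assert.h>
#include <string.h>

/* With Valgrind enabled, this macro would come from <valgrind/memcheck.h>.
 * Without it, define a no-op so call sites compile unchanged. */
#ifndef VALGRIND_MAKE_MEM_DEFINED
# define VALGRIND_MAKE_MEM_DEFINED(addr, len)
#endif

/* Hypothetical response structure filled in by the kernel. */
struct fake_resp {
	unsigned int id;
};

/* Hypothetical stand-in for a write() to the device file: the kernel
 * fills *resp through a pointer passed in the command, which memcheck
 * cannot observe, so it would still treat the buffer as uninitialised. */
static int fake_kernel_call(struct fake_resp *resp)
{
	struct fake_resp from_kernel = { 42 };

	memcpy(resp, &from_kernel, sizeof from_kernel);
	return (int) sizeof from_kernel;
}

/* Issue the call, then mark the response as defined before reading it. */
static int create_id(unsigned int *handle)
{
	struct fake_resp resp;
	int result;

	assert(handle != NULL);

	result = fake_kernel_call(&resp);
	if (result != (int) sizeof resp)
		return -1;

	VALGRIND_MAKE_MEM_DEFINED(&resp, sizeof resp);

	*handle = resp.id;
	return 0;
}
```

Because the fallback macro expands to nothing, the annotated call sites cost nothing in a normal build.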
Signed-off-by: Dotan Barak Signed-off-by: Sean Hefty --- Index: ofa_1_3_dev_user/src/userspace/libibcm/configure.in =================================================================== --- ofa_1_3_dev_user.orig/src/userspace/libibcm/configure.in 2007-09-19 08:31:54.000000000 +0200 +++ ofa_1_3_dev_user/src/userspace/libibcm/configure.in 2007-09-19 12:10:50.000000000 +0200 @@ -9,6 +9,18 @@ AM_INIT_AUTOMAKE(libibcm, 1.0-1) AM_PROG_LIBTOOL +AC_ARG_WITH([valgrind], + AC_HELP_STRING([--with-valgrind], + [Enable valgrind annotations - default NO])) + +if test "$with_valgrind" != "" && test "$with_valgrind" != "no"; then + AC_DEFINE([INCLUDE_VALGRIND], 1, + [Define to 1 to enable valgrind annotations]) + if test -d $with_valgrind; then + CPPFLAGS="$CPPFLAGS -I$with_valgrind/include" + fi +fi + AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], [ if test "$enableval" = "no"; then disable_libcheck=yes @@ -38,6 +50,12 @@ AC_CHECK_HEADER(infiniband/verbs.h, [], AC_MSG_ERROR([ not found. Is libibverbs installed?])) AC_CHECK_HEADER(infiniband/marshall.h, [], AC_MSG_ERROR([ not found.
Is libibverbs installed?])) + +if test "$with_valgrind" != "" && test "$with_valgrind" != "no"; then +AC_CHECK_HEADER(valgrind/memcheck.h, [], + AC_MSG_ERROR([valgrind requested but not found.])) +fi + fi AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, Index: ofa_1_3_dev_user/src/userspace/libibcm/src/cm.c =================================================================== --- ofa_1_3_dev_user.orig/src/userspace/libibcm/src/cm.c 2007-09-19 08:31:54.000000000 +0200 +++ ofa_1_3_dev_user/src/userspace/libibcm/src/cm.c 2007-09-19 12:15:51.000000000 +0200 @@ -51,6 +51,17 @@ #include #include +#ifdef INCLUDE_VALGRIND +# include +# ifndef VALGRIND_MAKE_MEM_DEFINED +# warning "Valgrind requested, but VALGRIND_MAKE_MEM_DEFINED undefined" +# endif +#endif + +#ifndef VALGRIND_MAKE_MEM_DEFINED +# define VALGRIND_MAKE_MEM_DEFINED(addr,len) +#endif + #define PFX "libibcm: " static int abi_ver; @@ -226,6 +237,8 @@ int ib_cm_create_id(struct ib_cm_device if (result != size) goto err; + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + cm_id_priv->id.handle = resp->id; *cm_id = &cm_id_priv->id; return 0; @@ -250,6 +263,8 @@ int ib_cm_destroy_id(struct ib_cm_id *cm if (result != size) return (result > 0) ? -ENODATA : result; + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + cm_id_priv = container_of(cm_id, struct cm_id_private, id); pthread_mutex_lock(&cm_id_priv->mut); @@ -279,6 +294,8 @@ int ib_cm_attr_id(struct ib_cm_id *cm_id if (result != size) return (result > 0) ? -ENODATA : result; + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + param->service_id = resp->service_id; param->service_mask = resp->service_mask; param->local_id = resp->local_id; @@ -307,6 +324,8 @@ int ib_cm_init_qp_attr(struct ib_cm_id * if (result != size) return (result > 0) ? 
-ENODATA : result; + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + *qp_attr_mask = resp->qp_attr_mask; ibv_copy_qp_attr_from_kern(qp_attr, resp); @@ -782,7 +801,7 @@ int ib_cm_get_event(struct ib_cm_device msg = alloca(size); if (!msg) return -ENOMEM; - + hdr = msg; cmd = msg + sizeof(*hdr); @@ -790,6 +809,8 @@ int ib_cm_get_event(struct ib_cm_device hdr->in = sizeof(*cmd); hdr->out = sizeof(*resp); + memset(cmd, 0, sizeof(*cmd)); + resp = alloca(sizeof(*resp)); if (!resp) return -ENOMEM; @@ -818,6 +839,9 @@ int ib_cm_get_event(struct ib_cm_device result = (result > 0) ? -ENODATA : result; goto done; } + + VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp); + /* * decode event. */ From hrosenstock at xsigo.com Wed Sep 19 06:00:46 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 19 Sep 2007 06:00:46 -0700 Subject: [ofa-general] [PATCH] OpenSM/osm_mcast_mgr.c: Possible NULL ptr seg fault Message-ID: <1190206846.7075.34.camel@hrosenstock-ws.xsigo.com> In osm_mcast_mgr.c:__osm_mcast_mgr_branch, protect against NULL return from osm_node_get_remote_node Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index 0fad02b..50b95fd 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -765,6 +765,8 @@ static osm_mtree_node_t *__osm_mcast_mgr_branch(osm_mcast_mgr_t * const p_mgr, p_node = p_sw->p_node; p_remote_node = osm_node_get_remote_node(p_node, i, NULL); + if (!p_remote_node) + continue; if (osm_node_get_type(p_remote_node) == IB_NODE_TYPE_SWITCH) { /* From hrosenstock at xsigo.com Wed Sep 19 06:00:55 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 19 Sep 2007 06:00:55 -0700 Subject: [ofa-general] [PATCH] OpenSM/osm_ucast_ftree.c: Possible NULL ptr seg fault Message-ID: <1190206855.7075.35.camel@hrosenstock-ws.xsigo.com> In osm_ucast_ftree.c:__osm_ftree_rank_leaf_switches, protect against NULL return from osm_node_get_remote_node Signed-off-by: Hal Rosenstock 
diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 5d48c6e..d8ba368 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -2885,6 +2885,8 @@ __osm_ftree_rank_leaf_switches(IN ftree_fabric_t * p_ftree, p_remote_osm_node = osm_node_get_remote_node(p_osm_node, i, NULL); + if (!p_remote_osm_node) + continue; switch (osm_node_get_type(p_remote_osm_node)) { case IB_NODE_TYPE_CA: From ossthema at de.ibm.com Wed Sep 19 06:23:48 2007 From: ossthema at de.ibm.com (Jan-Bernd Themann) Date: Wed, 19 Sep 2007 15:23:48 +0200 Subject: [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule() In-Reply-To: <20070919115403.19455.65941.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070919115403.19455.65941.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <200709191523.48874.ossthema@de.ibm.com> Hi, On Wednesday 19 September 2007 13:54, Krishna Kumar wrote: > CPU#1: ipoib_poll(budget=100) > { > A. process 100 skbs > B. netif_rx_complete() > CPU#2> > F. ib_req_notify_cq() (no missed completions, do nothing) > G. return 100 > H. return to net_rx_action, quota=99, subtract 100, quota=-1, BUG. > } > > CPU#2: ipoib_ib_completion() : (starts and finishes entire line of execution > *after* step B and *before* H executes). > { > C. New skb comes, call netif_rx_schedule; set quota=100 > D. do ipoib_poll(), process one skb, return work=1 to net_rx_action > E. net_rx_action: set quota=99 > } If I understood it right, the problem you describe (quota update in __napi_schedule) can cause further problems when you choose the following numbers: CPU1: A. process 99 pkts CPU1: B. netif_rx_complete() CPU2: interrupt occurs, netif_rx_schedule is called, net_rx_action triggered: CPU2: C. set quota = 100 (__napi_schedule) CPU2: D. call poll(), process 1 pkt CPU2: D.2 call netif_rx_complete() (quota not exceeded) CPU2: E. net_rx_action: set quota=99 CPU1: F.
net_rx_action: set quota=99 - 99 = 0 CPU1: G. modify list (list_move_tail) although netif_rx_complete has been called Step G would fail as the device is not in the list due to netif_rx_complete. This case can occur for all devices running on an SMP system where interrupts are not pinned. Regards, Jan-Bernd From sashak at voltaire.com Wed Sep 19 06:43:42 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 15:43:42 +0200 Subject: [ofa-general] Re: [PATCH] OpenSM/osm_mcast_mgr.c: Possible NULL ptr seg fault In-Reply-To: <1190206846.7075.34.camel@hrosenstock-ws.xsigo.com> References: <1190206846.7075.34.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070919134342.GM29384@sashak.voltaire.com> On 06:00 Wed 19 Sep , Hal Rosenstock wrote: > In osm_mcast_mgr.c:__osm_mcast_mgr_branch, protect against NULL return > from osm_node_get_remote_node > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Sep 19 06:44:01 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 15:44:01 +0200 Subject: [ofa-general] Re: [PATCH] OpenSM/osm_ucast_ftree.c: Possible NULL ptr seg fault In-Reply-To: <1190206855.7075.35.camel@hrosenstock-ws.xsigo.com> References: <1190206855.7075.35.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070919134401.GN29384@sashak.voltaire.com> On 06:00 Wed 19 Sep , Hal Rosenstock wrote: > In osm_ucast_ftree.c:__osm_ftree_rank_leaf_switches, protect against > NULL return from osm_node_get_remote_node > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Sep 19 07:43:09 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 16:43:09 +0200 Subject: [ofa-general] [PATCH] management/make.dist: use 'make dist' for tarballs generation Message-ID: <20070919144309.GO29384@sashak.voltaire.com> Use 'make dist' in components directories for tarballs generation - this creates "clean" archives.
make.dist script functionality and usage still be same as before. Signed-off-by: Sasha Khapyorsky --- make.dist | 25 +++++++++++-------------- 1 files changed, 11 insertions(+), 14 deletions(-) diff --git a/make.dist b/make.dist index 42e38ce..4f9c2a9 100755 --- a/make.dist +++ b/make.dist @@ -1,6 +1,6 @@ #!/bin/bash -TMPDIR=dist +TMPDIR=`pwd`/dist if [ ! -d $TMPDIR ]; then mkdir $TMPDIR; fi usage() { @@ -8,9 +8,9 @@ echo "$0 daily | release [ signed | ]" echo echo " You must specify either release or daily in order for this script" echo "to make tarballs. If this is a daily release, the tarballs will" -echo "be named -git.tgz and will overwrite existing tarballs." +echo "be named -git.tar.gz and will overwrite existing tarballs." echo "If this is a release build, then the tarball will be named" -echo "-.tgz and must be a new file. In addition," +echo "-.tar.gz and must be a new file. In addition," echo "the script will add a new set of symbolic tags to the git repo" echo "that correspond to the - of each tarball." echo @@ -77,8 +77,8 @@ for target in $TARGETS; do exit 0 fi # Check versions to make sure that we can proceed - if [ -f $TMPDIR/$target-$VERSION.tgz ]; then - echo "Target $target-$VERSION.tgz already exists, please update the version on" + if [ -f $TMPDIR/$target-$VERSION.tar.gz ]; then + echo "Target $target-$VERSION.tar.gz already exists, please update the version on" echo "component $target" exit 2 fi @@ -92,7 +92,7 @@ for target in $TARGETS; do # incrementally higher than the last officially released tarball. 
RELEASE=1 echo $RELEASE > $TMPDIR/$target.release - TARBALL=$target-$VERSION.tgz + TARBALL=$target-$VERSION.tar.gz elif [ "$1" = "daily" ]; then DATE=`date +%Y%m%d` if [ -f $TMPDIR/$target.release ]; then @@ -103,17 +103,14 @@ for target in $TARGETS; do fi echo $RELEASE > $TMPDIR/$target.release RELEASE=0.${RELEASE}.${DATE}git - TARBALL=$target-git.tgz + TARBALL=$target-$VERSION-$RELEASE.tar.gz fi - cp -a $target $target-$VERSION - sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $target/$target.spec.in > $target-$VERSION/$target.spec - cd $target-$VERSION - ./autogen.sh - cd .. echo "Creating $TMPDIR/$TARBALL" - tar -czf $TMPDIR/$TARBALL --exclude=.git $target-$VERSION - rm -rf $target-$VERSION + ( cd $target && ./autogen.sh && + RELEASE=$RELEASE TARBALL=$TARBALL ./configure && + make dist && mv $target-$VERSION.tar.gz $TMPDIR/$TARBALL ) || + exit $? if [ $1 = release ]; then if [ ! -z "$2" ]; then -- 1.5.3.rc2.29.gc4640f From mst at dev.mellanox.co.il Wed Sep 19 08:31:43 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 19 Sep 2007 17:31:43 +0200 Subject: [ofa-general] [PATCH v7] IB/mlx4: shrinking WQE Message-ID: <20070919153143.GF31061@mellanox.co.il> ConnectX supports shrinking wqe, such that a single WR can include multiple units of wqe_shift. This way, WRs can differ in size, and do not have to be a power of 2 in size, saving memory and speeding up send WR posting. Unfortunately, if we do this wqe_index field in CQE can't be used to look up the WR ID anymore, so do this only if selective signalling is off. Further, on 32-bit platforms, we can't use vmap to make the QP buffer virtually contigious. Thus we have to use constant-sized WRs to make sure a WR is always fully within a single page-sized chunk. Finally, we use WR with NOP opcode to avoid wrap-around in the middle of WR. We set NoErrorCompletion bit to avoid getting completions with error for NOP WRs. 
Since NEC is only supported starting with firmware 2.2.232, we use constant-sized WRs for older firmware. And, since MLX QPs only support SEND, we use constant-sized WRs in this case. Signed-off-by: Michael S. Tsirkin --- Changes since v4: - avoid mis-detecting recv write with immediate completion as NOP - increase min. wqe_shift for RC QPs to 64 bytes, so that stamping (which is done each 64 bytes) invalidates all WQEs - disable WQE shrinking if FW version is < 2.2.232, otherwise we could get CQE with error for NOP, which might overflow the CQ diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 8bf44da..0981f3c 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -331,6 +331,11 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +358,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if ((*cur_qp)->ibqp.srq) { @@ -403,6 +410,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, case MLX4_OPCODE_BIND_MW: wc->opcode = IB_WC_BIND_MW; break; + default: + printk("Unrecognized send opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } } else { wc->byte_len = be32_to_cpu(cqe->byte_cnt); @@ -422,6 +433,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = 
cqe->immed_rss_invalid; break; + default: + printk("Unrecognized recv opcode 0x%x!\n", + cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK); + return -EINVAL; } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 705ff2f..a72ecb9 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -115,6 +115,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int sq_spare_wqes; struct mlx4_ib_wq sq; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 158507d..95f8c48 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,88 @@ static void *get_send_wqe(struct mlx4_ib_qp *qp, int n) /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). + * + * When max WR is than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. 
*/ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { - u32 *wqe = get_send_wqe(qp, n); + u32 *wqe; int i; + int s; + int ind; + void *buf; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift); + if (qp->sq_max_wqes_per_wr > 1) { + for (i = 0; i < s; i += 64) { + ind = (i >> qp->sq.wqe_shift) + n; + stamp = ind & qp->sq.wqe_cnt ? cpu_to_be32(0xffffffff) : + cpu_to_be32(0x7fffffff); + buf = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); + wqe = buf + (i & ((1 << qp->sq.wqe_shift) - 1)); + *wqe = stamp; + } + } else { + buf = get_send_wqe(qp, n); + for (i = 64; i < s; i += 64) { + wqe = buf + i; + *wqe = 0xffffffff; + } + } +} + +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + stamp_send_wqe(qp, (n + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1), size); + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = sizeof(struct mlx4_wqe_ctrl_seg); + + if (qp->ibqp.qp_type == IB_QPT_UD) { + struct mlx4_wqe_datagram_seg *dgram = wqe + sizeof *ctrl; + struct mlx4_av *av = (struct mlx4_av *)dgram->av; + memset(dgram, 0, sizeof *dgram); + av->port_pd = cpu_to_be32((qp->port << 24) | to_mpd(qp->ibqp.pd)->pdn); + s += sizeof(struct mlx4_wqe_datagram_seg); + } + + /* Pad the remainder of the WQE with an inline data segment. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? 
cpu_to_be32(1 << 31) : 0); +} + +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -234,9 +307,35 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, return 0; } +static int nop_wqe_shift(enum ib_qp_type type) +{ + /* + * WQE size is at least 0x20. + * UD WQEs must have a datagram segment. + * RC and UC WQEs must have control segment. + * MLX WQEs do not support NOP. + */ + switch (type) { + case IB_QPT_UD: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_datagram_seg), + (size_t)0x20))); + case IB_QPT_SMI: + case IB_QPT_GSI: + return -EINVAL; + case IB_QPT_UC: + case IB_QPT_RC: + default: + return ilog2(roundup_pow_of_two(max(sizeof (struct mlx4_wqe_ctrl_seg), + (size_t)0x20))); + } +} + static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +351,60 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct 
mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this wqe_index field in CQE + * can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. + * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contigious. Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. + * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * Since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + qp->sq.wqe_shift = nop_wqe_shift(type); + if (!qp->sq_signal_bits || BITS_PER_LONG != 64 || qp->sq.wqe_shift < 0) + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. 
+ */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +416,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +455,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +551,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1045,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1238,13 +1379,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr, unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { @@ -1260,7 +1402,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? @@ -1371,16 +1513,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = (ind + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1); + ind += DIV_ROUND_UP(size * 16, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). 
*/ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size * 16); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1402,8 +1551,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size * 16); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c index f8d63d3..0fce74d 100644 --- a/drivers/net/mlx4/alloc.c +++ b/drivers/net/mlx4/alloc.c @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + kfree(pages); + if (!buf->u.direct.buf) + goto err_free; + } } return 0; @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf) dma_free_coherent(&dev->pdev->dev, size, buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index cfb78fb..bd3ed64 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -185,7 +185,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct mlx4_buf_list *page_list; } u; diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index 3968b94..09a2230 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -154,7 +154,11 @@ struct mlx4_qp_context { 
u32 reserved5[10]; }; +/* Which firmware version adds support for NEC (NoErrorCompletion) bit */ +#define MLX4_FW_VER_WQE_CTRL_NEC mlx4_fw_ver(2, 2, 232) + enum { + MLX4_WQE_CTRL_NEC = 1 << 29, MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, -- MST From rdreier at cisco.com Wed Sep 19 08:43:50 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 08:43:50 -0700 Subject: [ofa-general] [PATCH] RDMA/CMA: Implement rdma_resolve_ip retry enhancement. In-Reply-To: <200709190022.l8J0MbWt024754@neteffect.com> (ggrundstrom@neteffect.com's message of "Tue, 18 Sep 2007 19:22:37 -0500") References: <200709190022.l8J0MbWt024754@neteffect.com> Message-ID: Thanks for the patch... > If an application is calling rdma_resolve_ip() and a status of -ENODATA is returned from addr_resolve_local/remote(), the timeout mechanism waits until the application's timeout occurs before rechecking the address resolution status; the application will wait until it's full timeout occurs. This case is seen when the work thread call to process_req() is made before the arp packet is processed. I'm having a hard time understanding this changelog. Could you please resend with a description that lets me understand: - What the current behavior is and what is wrong with that; - What the behavior should be; - And how your patch changes the behavior to be correct. > This patch is in addition to Steve Wise's neigh_event_send patch to initiate neighbour discovery sent on 9/12/2007. Does this mean it depends on Steve's patch being applied first? Also please try to keep lines in the changelog to 72 characters or so... 
> @@ -136,6 +137,7 @@ static void set_timeout(unsigned long ti > static void queue_req(struct addr_req *req) > { > struct addr_req *temp_req; > + unsigned long req_timeout = msecs_to_jiffies(MIN_ADDR_TIMEOUT_MS) + jiffies; > > mutex_lock(&lock); > list_for_each_entry_reverse(temp_req, &req_list, list) { > @@ -145,8 +147,10 @@ static void queue_req(struct addr_req *r > > list_add(&req->list, &temp_req->list); > > - if (req_list.next == &req->list) > + if (req_list.next == &req->list) { > + req_timeout = min(req_timeout, req->timeout); > set_timeout(req->timeout); > + } > mutex_unlock(&lock); > } I don't understand this code. It seems you keep track of the minimum timeout, and then ignore the value you computed. What am I missing? Thanks, Roland From davem at davemloft.net Wed Sep 19 09:05:57 2007 From: davem at davemloft.net (David Miller) Date: Wed, 19 Sep 2007 09:05:57 -0700 (PDT) Subject: [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule() In-Reply-To: <20070919115403.19455.65941.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070919115403.19455.65941.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070919.090557.24612742.davem@davemloft.net> From: Krishna Kumar Date: Wed, 19 Sep 2007 17:24:03 +0530 > Note: during steps F-H and C-E, priv/napi is read/modified by both cpu's > which is another bug relating to the same race. > > I guess the above patch is not required if this bug (in IPoIB) is fixed? The NAPI_STATE_SCHED flag bit should provide all of the necessary synchronization. Only the setter of that bit should add the NAPI instance to the polling list. The polling loop runs atomically on the cpu where the NAPI instance got added to the per-cpu polling list. And therefore decisions to complete NAPI are serialized too. That serialized completion decision is also when the list deletion occurs. 
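The rule described above (only the caller that sets the scheduled bit may queue the instance, and completion both unlinks it and clears the bit) can be illustrated with a small userspace sketch. This is not the kernel code: the struct, the bit value and the sketch_* function names are all hypothetical stand-ins, and C11 atomics are used in place of the kernel's test_and_set_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the scheduled flag; not the kernel's definition. */
#define SKETCH_STATE_SCHED 1u

struct sketch_napi {
    atomic_uint state;  /* bit 0 plays the role of NAPI_STATE_SCHED */
    int on_poll_list;   /* how many times the instance is currently queued */
};

/* Returns true only for the caller that flipped the bit from 0 to 1. */
static bool sketch_schedule_prep(struct sketch_napi *n)
{
    unsigned old = atomic_fetch_or(&n->state, SKETCH_STATE_SCHED);
    return !(old & SKETCH_STATE_SCHED);
}

/* Only the winner of the test-and-set adds the instance to the list. */
static bool sketch_schedule(struct sketch_napi *n)
{
    if (!sketch_schedule_prep(n))
        return false;
    n->on_poll_list++;
    return true;
}

/* Completion removes the instance from the list, then clears the bit. */
static void sketch_complete(struct sketch_napi *n)
{
    assert(atomic_load(&n->state) & SKETCH_STATE_SCHED);
    n->on_poll_list--;
    atomic_fetch_and(&n->state, ~SKETCH_STATE_SCHED);
}
```

With a zero-initialized struct sketch_napi, a second sketch_schedule() call before sketch_complete() returns false and leaves the queue count at one, which is the property the flag bit is meant to guarantee.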
I'm starting to suspect the whole problem comes from the resched facility, and now I really don't blame Stephen for trying to delete it. Semantically it really makes things very difficult, especially wrt. the atomicity of the list handling. From jim at mellanox.com Wed Sep 19 09:14:44 2007 From: jim at mellanox.com (Jim Mott) Date: Wed, 19 Sep 2007 09:14:44 -0700 Subject: [ofa-general] FW: Updated SDP AIO test Message-ID: In preparation for SDP updates I have reworked the ttcp.aio.c program to include some extra options for PREADV/PWRITEV. While kernel support for these functions, especially for sockets, is in transition from the kernel in the current distributions to the OFED 1.3 target, there is some ugly code at the beginning that will be removed once it is in the distributed libaio.h. Or maybe I should remove it now and have people only run this with up to date libaio? The other change is to report performance info in a comma separated format easy to import into a spreadsheet. An option exists to render human readable stuff like ttcp. Missing is the code that actually checks the data for correctness. It is probably something useful to add before the release (-: Signed-off-by: Jim Mott ======================================================================== ==== README.txt sdp_aio.c This is a modification of ttcp.aio.c (above) that includes some new options, mostly around vector IO, and a reformatted reporting format that allows direct importing of results into a spreadsheet. The two applications are interoperable. -Build instructions: gcc -g -o sdp_aio sdp_aio.c -laio Usage: ./sdp_aio -t [options] host (sending side. 
Send to 'host') ./sdp_aio -r [optoins] (receive side - default) Common options: -v: Generate (-t) or check (-v) data (default no) -d: Set SO_DEBUG socket option (default no) -S: Use SDP protcol sockets explicitly (default no) -p n: Use port 'n' (default 5001) -l n: Send and receive chunks of size 'n' (default 8K) -a n: Number of IOs/request (default 1) -O n: Buffer offset (default 0) -A n: Buffer alignment (default 16K) -b n: SO_SNDBUF and SO_RECVBUF socket buffer size -I n: Use n element iovec[] and PREADV / PWRITEV -x n: Number of buffers to allocate (default -a value) -w n: Set warning level (default 0 - no warnings) Sending side (-t) options: -D: Set TCP_NODELAY socket option -n n: Number of buffers to send (use -n or -N; default 2K) -N n: Number of seconds to run test for (use -n or -N) Receive side (-r) options: -R: Set SO_REUSEADDR socket option (default no) -L n: Set RCVLOWAT (Receive low water mark) to 'n' (default no) Output: Human readable with -w1 Comma seperated line: 1 - TX/RX Role of this instance 2 - buf_len -l n 3 - buf_off -O n 4 - buf_align -A n 5 - num_conc -a n 6 - num_sec -N n 7 - num_cnt -n n 8 - opt_iovec -I n 9 - opt_cnt -x n 10 - "Options and info" 11 - bytes Number of bytes transfered 12 - calls Number of system (io_submit, io_getevents) 13 - buffs Number of buffers transfered 14 - us_wall Wall clock time in uS 15 - us_user User space CPU time 16 - us_sys System CPU time ======================================================================== ==== /* * sdp_aio - Test Linux libaio on SDP (and non-SDP) sockets. * * Based on ttcp.c; T.C. Slattery, USNA * */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* * Additional commands * While libaio.h does not include the vector IO commands in the * io_iocb_cmd{} enumeration, the kernel includes support for them. 
* * Start of libaio.h extensions */ enum { IO_CMD_PREADV = 7, /* IOCB_CMD_PREADV */ IO_CMD_PWRITEV = 8 /* IOCB_CMD_PWRITEV */ }; static inline void io_prep_preadv(struct iocb *iocb, int fd, struct iovec *iov, int nr_segs, long long offset) { memset(iocb, 0, sizeof(*iocb)); iocb->aio_fildes = fd; iocb->aio_lio_opcode = IO_CMD_PREADV; iocb->u.v.vec = iov; iocb->u.v.nr = nr_segs; iocb->u.v.offset = offset; } static inline void io_prep_pwritev(struct iocb *iocb, int fd, struct iovec *iov, int nr_segs, long long offset) { memset(iocb, 0, sizeof(*iocb)); iocb->aio_fildes = fd; iocb->aio_lio_opcode = IO_CMD_PWRITEV; iocb->u.v.vec = iov; iocb->u.v.nr = nr_segs; iocb->u.v.offset = offset; } /* End of libaio.h extensions */ struct app_hdr { uint32_t buf_num; /* Send size buffer number */ uint32_t buf_len; /* Total size of sent data */ uint32_t opt_cnt; /* Send side opt_cnt */ uint32_t filler; /* Not used - keep size nice */ uint64_t ip; /* Used by verify only */ }; struct buff_ptr { char *base; char *offset; int len; struct iovec *iov; int icnt; }; /* Needs to be in system include file - someday */ #ifndef AF_INET_SDP #define AF_INET_SDP 27 #endif static char base_pattern[] = {0x11, 055, 0xCC, 0x5C, 0xC5, 0x22, 0x81}; /* * PARAMETERS * These are the parameters that can be overridden by command line * line options. The default values are established here. 
*/ int mode_rx = 1; /* -t 0 / -r 1: Set app mode send/receive */ int num_sec = 0; /* -N n: seconds to run */ int num_cnt = 2 * 1024; /* -n n: Number of buffers to send */ int num_conc = 1; /* -a n: Number of cuncurrent IOs */ int buf_len = 8 * 1024; /* -l n: Length of buffers to send */ int buf_off = 0; /* -O n: Buffer offset */ int buf_align = 16 * 1024; /* -A n: Alignment */ short port = 5001; /* -p n: Port number to use */ int domain = AF_INET; /* -S: Use SDP protocol explicitly */ int opt_ver = 0; /* -v: Verify data transfered */ int opt_nodel = 0; /* -D: Set TCP_NODELAY */ int opt_reuse = 0; /* -R: SO_REUSEADDR */ int opt_dbg = 0; /* -d: SO_DEBUG */ int opt_bytes = 0; /* -B: Format Bytes/sec else bytes/sec */ int opt_lingr = 0; /* -G: SOLINGER so close waits for data */ int opt_sbuf = -1; /* -b n: SO_SNDBUF */ int opt_rbuf = -1; /* -b n: SO_RCVBUF */ int opt_rlow = -1; /* -L n: SO_RCVLOWAT */ int opt_iovec = 0; /* -I n: n per-iovec, Use PREADV & PWRITEV */ int opt_cnt = 0; /* -x n: Number of unique buffers */ /* * Holds output string */ static int warn = 0; /* How chatty on errors */ static char *name = NULL; /* Name of the application */ static char label[1000]; static char *lba; /* Global variables */ static int fd; static struct sockaddr_in s_in; static int cur_data = 0; /* Pointer to next data[] buffer */ static struct buff_ptr **data; static io_context_t io_ctx = NULL; static struct io_event *events; static struct iocb **iocbs; static struct timeval in_time, out_time; static struct rusage in_usage, out_usage; static sig_atomic_t done = 0; /* Set by timer or error to end test */ static char usage_txt[] = "\ Usage: %s -t [options] host (sending side. 
Send to 'host')\n\ %s -r [optoins] (receive side - default)\n\n\ Common options:\n\ -v: Generate (-t) or check (-v) data (default no)\n\ -d: Set SO_DEBUG socket option (default no)\n\ -S: Use SDP protcol sockets explicitly (default no)\n\ -p n: Use port 'n' (default 5001)\n\ -l n: Send and receive chunks of size 'n' (default 8K)\n\ -a n: Number of IOs/request (default 1)\n\ -O n: Buffer offset (default 0)\n\ -A n: Buffer alignment (default 16K)\n\ -b n: SO_SNDBUF and SO_RECVBUF socket buffer size\n\n\ -I n: Use n element iovec[] and PREADV / PWRITEV\n\ -x n: Number of buffers to allocate (default -a value)\n\ -w n: Set warning level (default 0 - no warnings)\n\ Sending side (-t) options:\n\ -D: Set TCP_NODELAY socket option\n\ -n n: Number of buffers to send (use -n or -N; default 2K)\n\ -N n: Number of seconds to run test for (use -n or -N)\n\n\ Receive side (-r) options:\n\ -R: Set SO_REUSEADDR socket option (default no)\n\ -L n: Set RCVLOWAT (Receive low water mark) to 'n' (default no)\n\ \n\ Output:\n\ Human readable with -w1\n\ \n\ Command seperated line:\n\ 1 - TX/RX Role of this instance\n\ 2 - buf_len -l n\n\ 3 - buf_off -O n\n\ 4 - buf_align -A n\n\ 5 - num_conc -a n\n\ 6 - num_sec -N n\n\ 7 - num_cnt -n n\n\ 8 - opt_iovec -I n\n\ 9 - opt_cnt -x n\n\ 10 - \"Options and info\"\n\ 11 - bytes Number of bytes transfered\n\ 12 - calls Number of system (io_submit, io_getevents)\n\ 13 - buffs Number of buffers transfered\n\ 14 - us_wall Wall clock time in uS\n\ 15 - us_user User space CPU time\n\ 16 - us_sys System CPU time\n\ "; static void die_usage(void) { fprintf(stderr, usage_txt, name, name); exit(-1); } static void die_error(char *msg) { if (errno) perror(msg); else fprintf(stderr, "%s\n", msg); exit(-1); } static void do_log(int level, char *msg) { if (level < warn) fprintf(stderr, " --> log: %s\n", msg); } static void sig_pipe(int value) { done = 1; } static void sig_time(int value) { done = 1; } /* * new_b * This function creates and initializes a 
single buffer. * * num The 'number' of this buffer * size The size of the data transfer from this buffer * offset The offset into the buffer of the first real byte * align The alignmnet (power of 2) for this buffer * icnt The number of iovec elements used to map the buffer */ static struct buff_ptr *new_b(int num, int size, int offset, int align, int icnt) { struct buff_ptr *bp; struct app_hdr *ah; struct iovec *ip; char *cp, msg[200]; int rc, i, j, alloc_size; unsigned char mask; alloc_size = size + offset; rc = posix_memalign((void *)&cp, align, alloc_size); if (rc) die_error("unable to allocate data buffer"); memset(cp, 0, alloc_size); bp = (struct buff_ptr *)malloc(sizeof(struct buff_ptr)); if (!bp) die_error("unable to allocate buffer descriptor"); memset(bp, 0, sizeof(struct buff_ptr)); if (icnt) { ip = (struct iovec *)calloc(icnt, sizeof(struct iovec)); if (!ip) die_error("unable to allocate iovec"); } else ip = NULL; /* buff_ptr describes a single send/receive buffer in user space */ bp->base = cp; bp->offset = offset + cp; bp->len = size; bp->iov = ip; bp->icnt = icnt; cp = bp->offset; /* Start of data buffer */ ah = (struct app_hdr *)cp; /* * Most of the buffer holds a pattern; but there is some unique stuff * at the beginning. */ ah->buf_num = htonl(num); ah->buf_len = htonl(size); ah->opt_cnt = htonl(opt_cnt); cp += sizeof(struct app_hdr); size -= sizeof(struct app_hdr); mask = (unsigned char)(num % 255); for (i=j=0; i= sizeof(base_pattern)) j = 0; } /* If we are not doing iovec[] IO, then buffer is done */ if (!ip) { sprintf(msg, "Buff %d at 0x%lX offset=0x%lX, len=%d", num, (unsigned long)bp->base, (unsigned long)bp->offset, bp->len); do_log(4, msg); return(bp); } sprintf(msg, "Buff %d at 0x%lX offset=0x%lX, len=%d, iov=0x%lX, cnt=%d", num, (unsigned long)bp->base, (unsigned long)bp->offset, bp->len, (unsigned long)bp->iov, bp->icnt); do_log(4, msg); /* * In an ideal world, we would build these buffers differently if * we were doing send than receive. 
The send side would scatter * the data around and use the iov to gather it, and the receive * size would create it linearly. Maybe next time. */ size = bp->len / icnt; for (i=0; iiov[i]; ip->iov_base = bp->offset + (i * size); ip->iov_len = size; } /* Fiddle the last one to make sure we cover all the data */ ip->iov_len += (bp->len - (size * bp->icnt)); for (i=0; iiov[i]; sprintf(msg, " iov[%d] 0x%lX %d", i, (unsigned long)ip->iov_base, (int)ip->iov_len); do_log(4, msg); } return(bp); } static uint64_t tvsub(struct timeval *after, struct timeval *before) { uint64_t sec, usec; sec = (uint64_t)(after->tv_sec - before->tv_sec); if (after->tv_sec < before->tv_sec) { sec--; usec = (uint64_t)(1000000 + after->tv_usec - before->tv_usec); } else usec = (uint64_t)(after->tv_usec - before->tv_usec); usec += 1000000 * sec; return(usec); } static char *outfmt(double b) { static char obuf[50]; char prefix; if (!opt_bytes) b *= 8; prefix = ' '; if (b < 1024.0) goto out; prefix = 'K'; b = b / 1024.0; if (b < 1024.0) goto out; prefix = 'M'; b = b / 1024.0; if (b < 1024.0) goto out; prefix = 'G'; b = b / 1024.0; out: if (opt_bytes) sprintf(obuf, "%.2f %cB", b, prefix); else sprintf(obuf, "%.2f %cbit", b, prefix); return(obuf); } static void summary(uint64_t calls, uint64_t buffs, uint64_t bytes) { uint64_t us_user, us_wall, us_sys; double realt; us_wall = tvsub(&out_time, &in_time); us_user = tvsub(&out_usage.ru_utime, &in_usage.ru_utime); us_sys = tvsub(&out_usage.ru_stime, &in_usage.ru_stime); realt = ((double)us_wall)/1000000; if (realt == 0.0) realt = 0.001; /* No division by zero here */ if (warn) { printf("%lu bytes in %.2f seconds = %s/sec\n", bytes, realt, outfmt((double)(bytes / realt))); printf("%lu I/O calls, usec/call = %.2f, calls/sec = %.2f\n", calls, 1000000.0 * realt/((double)calls), ((double)calls/realt)); printf("user: %lu sys: %lu total: %lu real: %lu\n", us_user, us_sys, (us_user + us_sys), us_wall); } /* Add an eye catching value */ lba += sprintf(lba, 
"%s/sec", outfmt(((double)bytes)/realt)); lba += sprintf(lba, "\", %lu, %lu, %lu, %lu, %lu, %lu", bytes, calls, buffs, us_wall, us_user, us_sys); printf("%s\n", label); } static void setup_time(void) { int rc; struct itimerval alrm_timer; if (num_sec > 0) { lba += sprintf(lba, "+N "); signal(SIGALRM, sig_time); memset(&alrm_timer, 0, sizeof(alrm_timer)); alrm_timer.it_interval.tv_sec = num_sec; alrm_timer.it_interval.tv_usec = 0; alrm_timer.it_value.tv_sec = num_sec; alrm_timer.it_value.tv_usec = 0; rc = setitimer(ITIMER_REAL, &alrm_timer, NULL); if (rc) die_error("unable to set timer"); } else lba += sprintf(lba, "+n "); /* Catch when the other side goes away */ signal(SIGPIPE, sig_pipe); } static void setup_io(void) { int i, rc, log_opt_cnt; char msg[200]; if (!opt_cnt) { log_opt_cnt = 0; opt_cnt = num_conc; } else log_opt_cnt = 1; /* Calculate the buffer size needed based on -l, -O, -A */ if ((buf_align / 2) * 2 != buf_align) die_error("-A (buffer alignment) must be a positive power of 2"); sprintf(msg, "buffer count %d, size %d, offset %d, send length %d", num_conc, buf_align, buf_off, buf_len); do_log(3, msg); lba += sprintf(lba, "%d, %d, %d, %d, %d, %d, %d, %d, \"", buf_len, buf_off, buf_align, num_conc, num_sec, num_cnt, opt_iovec, opt_cnt); lba += log_opt_cnt ? sprintf(lba, "+I ") : sprintf(lba, "-I "); lba += opt_ver ? sprintf(lba, "+v ") : sprintf(lba, "-v "); data = (struct buff_ptr **)calloc(opt_cnt, sizeof(struct buff_ptr *)); if (!data) die_error("Unable to allocate buffer pointer memory"); for (i=0; i= 0) { rc = setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &opt_sbuf, sizeof(opt_sbuf)); if (rc < 0) die_error("unable to set SO_SNDBUF"); } lba += (opt_sbuf >= 0) ? 
sprintf(lba, "+b ") : sprintf(lba, "-b "); if (opt_rbuf >= 0) { rc = setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &opt_rbuf, sizeof(opt_rbuf)); if (rc < 0) die_error("unable to set SO_RCVBUF"); } if (opt_rlow >= 0) { rc = setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &opt_rlow, sizeof(opt_rlow)); if (rc < 0) die_error("unable to set SO_RCVLOWAT"); } if (opt_dbg) { optval = 1; rc = setsockopt(fd, SOL_SOCKET, SO_DEBUG, &optval, sizeof(optval)); if (rc < 0) die_error("unable to set SO_DEBUG"); } lba += opt_dbg ? sprintf(lba, "+d ") : sprintf(lba, "-d "); if (opt_lingr) { struct linger linger; memset(&linger, 0, sizeof(linger)); linger.l_onoff = 1; /* Wait for all data to go */ linger.l_linger = 5; /* Wait for it all */ rc = setsockopt(fd, SOL_SOCKET, SO_LINGER, (char *)&linger, sizeof(linger)); if (rc < 0) die_error("unable to set SO_LINGER"); } lba += opt_lingr ? sprintf(lba, "+G ") : sprintf(lba, "-G "); if (opt_nodel) { optval = 1; rc = setsockopt(fd, SOL_TCP, TCP_NODELAY, &optval, sizeof(optval)); if (rc < 0) die_error("unable to set TCP_NODELAY"); } lba += opt_nodel ? 
sprintf(lba, "+D ") : sprintf(lba, "-D "); if (opt_iovec) { if (opt_iovec > UIO_MAXIOV) die_error("more than UIO_MAXIOV requested"); } memset(&s_in, 0, sizeof(s_in)); s_in.sin_port = htons(port); if (mode_rx) { rc = bind(fd, (struct sockaddr *)&s_in, sizeof(s_in)); if (rc < 0) die_error("unable to bind"); rc = listen(fd, 1); if (rc < 0) die_error("unable to listen"); i = sizeof(s_in); fd = accept(fd, (struct sockaddr *)&s_in, (socklen_t *)&i); if (fd < 0) die_error("unable to accept"); sprintf(msg, "Accepted connection from %s", inet_ntoa(s_in.sin_addr)); } else { if (atoi(target) > 0) s_in.sin_addr.s_addr = inet_addr(target); else { addr = gethostbyname(target); if (!addr) die_error("unable to resolve target host"); memcpy((char *)&s_in.sin_addr.s_addr, (char *)addr->h_addr, sizeof(s_in.sin_addr.s_addr)); } s_in.sin_family = AF_INET; rc = connect(fd, (struct sockaddr *)&s_in, sizeof(s_in)); if (rc < 0) die_error("unable to connect to target"); sprintf(msg, "Connected to %s", target); } do_log(1, msg); } static inline int do_norm(void) { int i, rc; struct iocb *ip; struct buff_ptr *bp; struct app_hdr *hp; for (i=0; i= opt_cnt) cur_data = 0; if (mode_rx) io_prep_pread(ip, fd, bp->offset, bp->len, 0); else io_prep_pwrite(ip, fd, bp->offset, bp->len, 0); ip->data = bp; if (opt_ver) { hp = (struct app_hdr *)bp; hp->ip = (uint64_t)ip; } } rc = io_submit(io_ctx, num_conc, iocbs); if (rc != num_conc) { if (rc > 0) { printf("Submitted %d, accepted %d\n", num_conc, rc); perror("io_submit"); die_error("not all normal io_submit elements accepted"); } else if (rc == 0) die_error("no normal io_submit elements accepted"); else die_error("error on normal io_submit"); } return(num_conc); } static inline int do_iov(void) { int i, rc; struct iocb *ip; struct buff_ptr *bp; struct app_hdr *hp; for (i=0; i= opt_cnt) cur_data = 0; if (mode_rx) io_prep_preadv(ip, fd, bp->iov, bp->icnt, 0); else io_prep_pwritev(ip, fd, bp->iov, bp->icnt, 0); ip->data = bp; if (opt_ver) { hp = (struct 
app_hdr *)bp; hp->ip = (uint64_t)ip; } } rc = io_submit(io_ctx, num_conc, iocbs); if (rc != num_conc) { if (rc > 0) die_error("not all normal io_submit elements accepted"); else if (rc == 0) die_error("no normal io_submit elements accepted"); else { errno = -rc; die_error("error on normal io_submit"); } } return(num_conc); } static inline void verify_rx(int i) { /* TODO: Check the data */ return; } int main(int argc, char *argv[]) { int i, rc; char *target; uint64_t cnt_calls, cnt_buffs, cnt_bytes, cnt_bytes_op; name = argv[0]; while (1) { /* 'B' must be listed here, otherwise case 'B' below is unreachable */ i = getopt(argc, argv, "drtvBRDSGb:l:N:n:p:A:O:a:x:L:I:w:"); if (i < 0) break; switch (i) { case 't': mode_rx=0; break; case 'r': mode_rx=1; break; case 'd': opt_dbg = 1; break; case 'D': opt_nodel = 1; break; case 'R': opt_reuse = 1; break; case 'v': opt_ver = 1; break; case 'B': opt_bytes = 1; break; case 'G': opt_lingr = 1; break; case 'S': domain = AF_INET_SDP; break; case 'p': port = atoi(optarg); break; case 'n': num_cnt = atoi(optarg); num_sec = 0; break; case 'N': num_sec = atoi(optarg); num_cnt = 0; break; case 'O': buf_off = atoi(optarg); break; case 'A': buf_align = atoi(optarg); break; case 'b': opt_sbuf = opt_rbuf = atoi(optarg); break; case 'L': opt_rlow = atoi(optarg); break; case 'a': num_conc = atoi(optarg); break; case 'I': opt_iovec = atoi(optarg); break; case 'x': opt_cnt = atoi(optarg); break; case 'w': warn = atoi(optarg); break; case 'l': buf_len = atoi(optarg); if (buf_len < sizeof(struct app_hdr)) buf_len = sizeof(struct app_hdr); break; default: die_usage(); } } memset(label, 0, sizeof(label)); if (mode_rx) { lba = label + sprintf(label, "RX, "); target = NULL; } else { lba = label + sprintf(label, "TX, "); if (optind == argc) die_usage(); target = argv[optind]; } setup_io(); setup_socket(target); /* Move some data */ cnt_buffs = 0; cnt_calls = 0; cnt_bytes = 0; setup_time(); gettimeofday(&in_time, NULL); getrusage(RUSAGE_SELF, &in_usage); while (!done) { cnt_buffs += (opt_iovec) ?
do_iov() : do_norm(); rc = io_getevents(io_ctx, 1, num_conc, events, NULL); if (rc != num_conc) { if (rc > 0) do_log(5, "did not get them all"); else if (rc == 0) die_error("No completions"); else die_error("Error reading completions"); } cnt_calls++; cnt_buffs += num_conc; cnt_bytes_op = 0; for (i=0; i < rc && 0 < (long)events[i].res; i++) { cnt_bytes_op += events[i].res; if (!opt_ver) continue; if (mode_rx) verify_rx(i); } if (cnt_bytes_op) cnt_bytes += cnt_bytes_op; else break; if (opt_ver) if (cnt_bytes_op != num_conc * buf_len) die_error("Data size mismatch"); if (num_cnt) if (cnt_calls >= num_cnt) break; } gettimeofday(&out_time, NULL); getrusage(RUSAGE_SELF, &out_usage); close(fd); /* There are really 2 system calls for every request */ summary(2*cnt_calls, cnt_buffs, cnt_bytes); return 0; } From monis at voltaire.com Wed Sep 19 09:41:57 2007 From: monis at voltaire.com (Moni Shoua) Date: Wed, 19 Sep 2007 19:41:57 +0300 Subject: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <18593.1190071438@death> Message-ID: <46F15155.4070708@voltaire.com> Roland, Jay, Thanks a lot for the comments. I'd like to summarize the points raised so far 1. Reduce the indentation in patch #4 (Roland) I will resend 2. Remove the "if (n->dev->flags & IFF_MASTER)" from patch #3 (Roland) I will resend 3. Consider making ipoib_slave_detach() net/core/dev.c (Roland, Jay) I think that this is a good idea. I can make the patch (and necessary changes to the other patches) assuming this is agreed by all. 4. Change header for patch #1 (Roland) I will resend 5. 
Use NETDEV_GOING_DOWN and not NETDEV_CHANGE + IFF_SLAVE_DETACH (Jay) The NETDEV_GOING_DOWN event is sent in the context of unregister_netdevice(). Since bonding's action on that event should be to unregister the bonding master, it is not possible to do so from there. Bonding needs to know about the slave detach earlier. 6. Call notifiers from unregister_netdev() See answer to 5. 7. Missing call to notifiers in ipoib_vlan_delete() (Roland) It seems like you're right. I will fix and resend. I think that if there are no other comments, I will submit the entire 11 patches again (with changes) to make it easier to merge into the kernel. Since most of the content in the patch series is in bonding I thought it would be right that Jay will push all the patches to the networking git. Is it OK with you Roland? From rdreier at cisco.com Wed Sep 19 09:44:02 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 09:44:02 -0700 Subject: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: <46F15155.4070708@voltaire.com> (Moni Shoua's message of "Wed, 19 Sep 2007 19:41:57 +0300") References: <11898132301664-git-send-email-fubar@us.ibm.com> <11898132322950-git-send-email-fubar@us.ibm.com> <1189813234208-git-send-email-fubar@us.ibm.com> <18593.1190071438@death> <46F15155.4070708@voltaire.com> Message-ID: > I think that if there are no other comments, I will submit the entire 11 patches > again (with changes) to make it easier to merge into the kernel. Since most of the > content in the patch series is in bonding I thought it would be right that Jay will > push all the patches to the networking git. Is it OK with you Roland? Yes, that's fine. - R. From mshefty at ichips.intel.com Wed Sep 19 09:52:20 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 19 Sep 2007 09:52:20 -0700 Subject: [ofa-general] [PATCH] RDMA/CMA: Implement rdma_resolve_ip retry enhancement.
In-Reply-To: <200709190022.l8J0MbWt024754@neteffect.com> References: <200709190022.l8J0MbWt024754@neteffect.com> Message-ID: <46F153C4.2070008@ichips.intel.com> > If an application is calling rdma_resolve_ip() and a status of -ENODATA is returned from addr_resolve_local/remote(), the timeout mechanism waits until the application's timeout occurs before rechecking the address resolution status; the application will wait until its full timeout occurs. This case is seen when the work thread call to process_req() is made before the ARP packet is processed. I don't understand the issue. process_req() is invoked whenever a network event occurs, which rechecks all pending requests. > This patch is in addition to Steve Wise's neigh_event_send patch to initiate neighbour discovery sent on 9/12/2007. This patch looks unrelated to Steve's patch. Can you clarify the relationship? - Sean From rdreier at cisco.com Wed Sep 19 09:53:32 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 09:53:32 -0700 Subject: [ofa-general] Re: [PATCHv2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists In-Reply-To: <20070919063421.GA6185@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 19 Sep 2007 08:34:21 +0200") References: <200709041047.32062.jackm@dev.mellanox.co.il> <20070919063421.GA6185@mellanox.co.il> Message-ID: OK, I added the patch below to my tree. I cleaned up Jack's patch a little and it seems to work for me; I hope I didn't break anything. commit 4a36e85ada9307b9f5d16df3856cdcfce1e9c5f0 Author: Jack Morgenstein Date: Wed Sep 19 09:52:25 2007 -0700 IB/mlx4: Fix data corruption triggered by wrong headroom marking order This is an addendum to commit 0e6e7416 ("IB/mlx4: Handle new FW requirement for send request prefetching"). We also need to handle prefetch marking properly for S/G segments, or else the HCA may end up processing S/G segments that are not fully written and end up sending the wrong data.
We write S/G segments in reverse order into the WQE, in order to guarantee that the first dword of all cachelines containing S/G segments is written last (overwriting the headroom invalidation pattern). The entire cacheline will thus contain valid data when the invalidation pattern is overwritten. Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mlx4/qp.c | 69 +++++++++++++++++++++++++++++++------- 1 files changed, 56 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 6c0ced2..f51c1fc 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1211,20 +1211,58 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg, dseg->qkey = cpu_to_be32(wr->wr.ud.remote_qkey); } -static __always_inline void set_data_seg(struct mlx4_wqe_data_seg *dseg, - struct ib_sge *sg) +static void set_mlx_icrc_seg(void *dseg) +{ + u32 *t = dseg; + struct mlx4_wqe_inline_seg *iseg = dseg; + + t[1] = 0; + + /* + * Need a barrier here before writing the byte_count field to + * make sure that all the data is visible before the + * byte_count field is set. Otherwise, if the segment begins + * a new cacheline, the HCA prefetcher could grab the 64-byte + * chunk and get a valid (!= * 0xffffffff) byte count but + * stale data, and end up sending the wrong data. + */ + wmb(); + + iseg->byte_count = cpu_to_be32((1 << 31) | 4); +} + +static void __set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ib_sge *sg) { dseg->byte_count = cpu_to_be32(sg->length); dseg->lkey = cpu_to_be32(sg->lkey); dseg->addr = cpu_to_be64(sg->addr); } +static void set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ib_sge *sg) +{ + dseg->lkey = cpu_to_be32(sg->lkey); + dseg->addr = cpu_to_be64(sg->addr); + + /* + * Need a barrier here before writing the byte_count field to + * make sure that all the data is visible before the + * byte_count field is set. 
Otherwise, if the segment begins + * a new cacheline, the HCA prefetcher could grab the 64-byte + * chunk and get a valid (!= * 0xffffffff) byte count but + * stale data, and end up sending the wrong data. + */ + wmb(); + + dseg->byte_count = cpu_to_be32(sg->length); +} + int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { struct mlx4_ib_qp *qp = to_mqp(ibqp); void *wqe; struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_data_seg *dseg; unsigned long flags; int nreq; int err = 0; @@ -1324,22 +1362,27 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, break; } - for (i = 0; i < wr->num_sge; ++i) { - set_data_seg(wqe, wr->sg_list + i); + /* + * Write data segments in reverse order, so as to + * overwrite cacheline stamp last within each + * cacheline. This avoids issues with WQE + * prefetching. + */ - wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; - } + dseg = wqe; + dseg += wr->num_sge - 1; + size += wr->num_sge * (sizeof (struct mlx4_wqe_data_seg) / 16); /* Add one more inline data segment for ICRC for MLX sends */ - if (qp->ibqp.qp_type == IB_QPT_SMI || qp->ibqp.qp_type == IB_QPT_GSI) { - ((struct mlx4_wqe_inline_seg *) wqe)->byte_count = - cpu_to_be32((1 << 31) | 4); - ((u32 *) wqe)[1] = 0; - wqe += sizeof (struct mlx4_wqe_data_seg); + if (unlikely(qp->ibqp.qp_type == IB_QPT_SMI || + qp->ibqp.qp_type == IB_QPT_GSI)) { + set_mlx_icrc_seg(dseg + 1); size += sizeof (struct mlx4_wqe_data_seg) / 16; } + for (i = wr->num_sge - 1; i >= 0; --i, --dseg) + set_data_seg(dseg, wr->sg_list + i); + ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ? 
MLX4_WQE_CTRL_FENCE : 0) | size; @@ -1429,7 +1472,7 @@ int mlx4_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr, scat = get_recv_wqe(qp, ind); for (i = 0; i < wr->num_sge; ++i) - set_data_seg(scat + i, wr->sg_list + i); + __set_data_seg(scat + i, wr->sg_list + i); if (i < qp->rq.max_gs) { scat[i].byte_count = 0; From rdreier at cisco.com Wed Sep 19 09:55:43 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 09:55:43 -0700 Subject: [ofa-general] Re: [PATCH v7] IB/mlx4: shrinking WQE In-Reply-To: <20070919153143.GF31061@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 19 Sep 2007 17:31:43 +0200") References: <20070919153143.GF31061@mellanox.co.il> Message-ID: > ConnectX supports shrinking wqe, such that a single WR can include > multiple units of wqe_shift. This way, WRs can differ in size, and > do not have to be a power of 2 in size, saving memory and speeding up > send WR posting. Given this added complexity: 6 files changed, 226 insertions(+), 39 deletions(-) and the unpleasantness of having if (BITS_PER_LONG == 64) various places, can you quantify the improvement this gives? Would it make more sense to do this for userspace first? - R. From mst at dev.mellanox.co.il Wed Sep 19 10:14:43 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 19 Sep 2007 19:14:43 +0200 Subject: [ofa-general] Re: [PATCH v7] IB/mlx4: shrinking WQE In-Reply-To: References: <20070919153143.GF31061@mellanox.co.il> Message-ID: <20070919171443.GH31061@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH v7] IB/mlx4: shrinking WQE > > > ConnectX supports shrinking wqe, such that a single WR can include > > multiple units of wqe_shift. This way, WRs can differ in size, and > > do not have to be a power of 2 in size, saving memory and speeding up > > send WR posting. 
> > Given this added complexity: > > 6 files changed, 226 insertions(+), 39 deletions(-) > > and the unpleasantness of having if (BITS_PER_LONG == 64) various > places, I don't think there's a way around that. BTW, the vmap trick is an improvement in itself; we can easily extend it to CQs, EQs, etc. > can you quantify the improvement this gives? This gets me from 960 to 1020 MByte/sec on ipoib/cm with netperf. SDP shows similar gains. > Would it make more sense to do this for userspace first? Given that we want it, what does a delay buy us? -- MST From rdreier at cisco.com Wed Sep 19 10:16:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 10:16:07 -0700 Subject: [ofa-general] Re: [PATCH 2/3] IB/umem: Add hugetlb flag to struct ib_umem In-Reply-To: <200709131815.29040.fenkes@de.ibm.com> (Joachim Fenkes's message of "Thu, 13 Sep 2007 18:15:28 +0200") References: <200709131814.13937.fenkes@de.ibm.com> <200709131815.29040.fenkes@de.ibm.com> Message-ID: This looks really nice to me... a very clean patch. I'll add this to 2.6.24 unless someone objects soon... From bramesh at vt.edu Wed Sep 19 10:47:21 2007 From: bramesh at vt.edu (Bharath Ramesh) Date: Wed, 19 Sep 2007 13:47:21 -0400 Subject: [ofa-general] IBV_WC_LOC_PROT_ERROR in receive In-Reply-To: <46F0BE48.50005@dev.mellanox.co.il> References: <20070918042202.GA8660@vt.edu> <46EF5F6E.3080708@dev.mellanox.co.il> <20070918180253.GA18113@vt.edu> <46F0BE48.50005@dev.mellanox.co.il> Message-ID: <20070919174721.GA23866@vt.edu> * Dotan Barak (dotanb at dev.mellanox.co.il) wrote: > Bharath Ramesh wrote: >> I checked for the following: >> 1) I haven't deregistered the MR. >> 2) I am using an RC QP >> 3) The message sizes are the same, 40 bytes. >> 4) I only have one PD for the entire application, i.e. both QP and MR >> belong to the same PD >> 5) The vendor error that I get in the WC is error code 52.
>> 6) I forgot to mention this in the earlier mail: the snippet for my send >> is as follows: >> > I believe that the problem is related to the incoming message size and the > attributes that were given > in the scatter entry (the specified buffer size is smaller than > the message size), or the > size of the MR is smaller than the size of the message. > > I suggest checking all of the values in the scatter entry in the RR > again > > If you wish to send me your source for me to review, you are welcome. > > > Dotan > I finally found the problem and fixed it. The MR was of a smaller size and I didn't notice the difference. Thanks for taking the time to help me with this error. Thanks, Bharath --- Bharath Ramesh http://people.cs.vt.edu/~bramesh From rdreier at cisco.com Wed Sep 19 11:00:10 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 11:00:10 -0700 Subject: [ofa-general] Re: [RFC] [PATCH 1/5 v3] ib/ipoib: specify Traffic Class with PR queries for QoS support In-Reply-To: <46ECF1B6.3020802@voltaire.com> (Or Gerlitz's message of "Sun, 16 Sep 2007 12:04:54 +0300") References: <000101c7f009$6472de50$3c98070a@amr.corp.intel.com> <46ECF1B6.3020802@voltaire.com> Message-ID: Thanks for the review Or...
applied to for-2.6.24 From rdreier at cisco.com Wed Sep 19 11:11:29 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 11:11:29 -0700 Subject: [ofa-general] [RFC] [PATCH 2/5 v2] ib/sa: add new QoS fields to path record In-Reply-To: <46ECF2D8.9000803@voltaire.com> (Or Gerlitz's message of "Sun, 16 Sep 2007 12:09:44 +0300") References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000701c7ef3b$d16562e0$3c98070a@amr.corp.intel.com> <46ECF2D8.9000803@voltaire.com> Message-ID: thanks guys, applied From rdreier at cisco.com Wed Sep 19 11:15:05 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 11:15:05 -0700 Subject: [ofa-general] [RFC] [PATCH 0/5 v2] rdma/cm: add ability to specifytype of service In-Reply-To: <46ECF217.900@voltaire.com> (Or Gerlitz's message of "Sun, 16 Sep 2007 12:06:31 +0300") References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000801c7ef3b$ee7dcfc0$3c98070a@amr.corp.intel.com> <46ECF217.900@voltaire.com> Message-ID: thanks guys, applied From rdreier at cisco.com Wed Sep 19 11:22:36 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 11:22:36 -0700 Subject: [ofa-general] [RFC] [PATCH 1/5 v2] ib/ipoib: specify Traffic Class with PR queries for QoS support In-Reply-To: <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> (Sean Hefty's message of "Tue, 4 Sep 2007 14:36:45 -0700") References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> Message-ID: thanks, applied. 
From rdreier at cisco.com Wed Sep 19 11:22:40 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 11:22:40 -0700 Subject: [ofa-general] [RFC] [PATCH 4/5 v2] rdma/ucm: export setting service type to user space In-Reply-To: <000901c7ef3c$1cef27a0$3c98070a@amr.corp.intel.com> (Sean Hefty's message of "Tue, 4 Sep 2007 14:39:46 -0700") References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000901c7ef3c$1cef27a0$3c98070a@amr.corp.intel.com> Message-ID: thanks, applied. From ardavis at ichips.intel.com Wed Sep 19 11:22:56 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 19 Sep 2007 11:22:56 -0700 Subject: [ofa-general] DAPL Package Build Error on PPC64 Arch In-Reply-To: <13995234.1189513707210.JavaMail.root@wombat.diezmil.com> References: <13995234.1189513707210.JavaMail.root@wombat.diezmil.com> Message-ID: <46F16900.40500@ichips.intel.com> snagai at jp.ibm.com wrote: > I am trying to build OFED with the DAPL package enabled, but the build process did not complete due to some errors. > > gcc -DHAVE_CONFIG_H -I. -I. -I. -I../libibverbs/include/infiniband -I../librdmacm/include -I../libibverbs/include -I../../dat/include -Wall -g -D_GNU_SOURCE -DOS_RELEASE=131078 -DOPENIB -DCQ_WAIT_OBJECT -I./dat/include/ -I./dapl/include/ -I./dapl/common -I./dapl/udapl/linux -I./dapl/openib_cma -m32 -g -O2 -L/usr/lib -MT dapl_udapl_libdaplcma_la-dapl_init.lo -MD -MP -MF .deps/dapl_udapl_libdaplcma_la-dapl_init.Tpo -c dapl/udapl/dapl_init.c -fPIC -DPIC -o .libs/dapl_udapl_libdaplcma_la-dapl_init.o > In file included from ./dapl/include/dapl.h:50, > from dapl/udapl/dapl_init.c:39: > ./dapl/udapl/linux/dapl_osd.h:53:2: error: #error UNDEFINED ARCH See bug: https://bugs.openfabrics.org/show_bug.cgi?id=48 I believe James was waiting for someone to contribute PPC64 patches.
-arlin From rdreier at cisco.com Wed Sep 19 11:23:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 11:23:58 -0700 Subject: [ofa-general] [RFC] [PATCH 5/5 v2] ib/srp: add QoS support through service ID In-Reply-To: <000a01c7ef3c$34a9d4d0$3c98070a@amr.corp.intel.com> (Sean Hefty's message of "Tue, 4 Sep 2007 14:40:26 -0700") References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000a01c7ef3c$34a9d4d0$3c98070a@amr.corp.intel.com> Message-ID: looks good, applied From rdreier at cisco.com Wed Sep 19 11:30:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 19 Sep 2007 11:30:59 -0700 Subject: [ofa-general] [PATCH] Message-ID: Do you see anything wrong with this patch? includes so the extra include of is not needed, and I don't see anything that seems like it would want (and my test builds on x86-64, ia64, powerpc, i386 and alpha all pass without the include). diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c index 9ea5b9a..a6f2303 100644 --- a/drivers/infiniband/ulp/iser/iser_initiator.c +++ b/drivers/infiniband/ulp/iser/iser_initiator.c @@ -34,8 +34,6 @@ #include #include #include -#include -#include #include #include #include diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c index 36cdf77..e05690e 100644 --- a/drivers/infiniband/ulp/iser/iser_memory.c +++ b/drivers/infiniband/ulp/iser/iser_memory.c @@ -36,8 +36,6 @@ #include #include #include -#include -#include #include #include "iscsi_iser.h" diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index d42ec01..654a4dc 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -32,7 +32,6 @@ * * $Id: iser_verbs.c 7051 2006-05-10 12:29:11Z ogerlitz $ */ -#include #include #include #include From rdreier at cisco.com Wed Sep 19 11:32:45 2007 From: rdreier at cisco.com (Roland Dreier) 
Date: Wed, 19 Sep 2007 11:32:45 -0700 Subject: [ofa-general] [PATCH] In-Reply-To: (Roland Dreier's message of "Wed, 19 Sep 2007 11:30:59 -0700") References: Message-ID: err, subject should have been "[PATCH] IB/iser: Remove unnecessary includes" From mshefty at ichips.intel.com Wed Sep 19 11:37:51 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 19 Sep 2007 11:37:51 -0700 Subject: [ofa-general] [PATCH] libibcm: add valgrind support to the libibcm In-Reply-To: <200709191450.23784.dotanb@dev.mellanox.co.il> References: <200709191450.23784.dotanb@dev.mellanox.co.il> Message-ID: <46F16C7F.5060600@ichips.intel.com> Thanks - applied. I also created a release 1.0.1 of libibcm and added it to my public html file and downloads directory. - Sean From sashak at voltaire.com Wed Sep 19 11:48:12 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 20:48:12 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c: potentially uninitialized vars usage fix Message-ID: <20070919184812.GS29384@sashak.voltaire.com> Fix usage of potentially uninitialized variables. 
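To illustrate one pattern this kind of fix addresses: a loop index that is first assigned inside the loop body, under an "if (k == 0)" branch, is something a compiler may flag as possibly used uninitialized, even though every iteration does assign it. A minimal sketch, assuming hypothetical names (walk_round_robin, start, num) that are not taken from OpenSM, of the hoisted form in which the index is set before the loop and advanced after each use:

```c
#include <assert.h>

/*
 * Hypothetical illustration, not OpenSM code: fill out[] with the
 * round-robin visiting order that begins at `start` and wraps modulo `num`.
 *
 * The index is assigned before the loop and advanced after each use, so it
 * is never read before being set. The alternative shape, assigning it
 * under `if (k == 0)` inside the loop and advancing it in the `else`
 * branch, visits the same indices but is harder for a compiler to prove
 * initialized.
 */
static void walk_round_robin(unsigned start, unsigned num, unsigned *out)
{
	unsigned i, k;

	assert(num > 0 && start < num);

	i = start;
	for (k = 0; k < num; k++) {
		out[k] = i;		/* use the current index */
		i = (i + 1) % num;	/* then advance, wrapping at num */
	}
}
```

For example, walk_round_robin(3, 5, order) fills order with 3, 4, 0, 1, 2.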
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_ftree.c | 10 ++++------ 1 files changed, 4 insertions(+), 6 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index d8ba368..948129c 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -1526,7 +1526,7 @@ static int __osm_ftree_fabric_mark_leaf_switches(IN ftree_fabric_t * p_ftree) static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) { ftree_sw_t *p_remote_sw; - ftree_sw_t *p_sw; + ftree_sw_t *p_sw = NULL; ftree_sw_t *p_next_sw; ftree_tuple_t new_tuple; uint32_t i; @@ -2082,13 +2082,11 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, /* foreach down-going port group (in indexing order) starting with the least loaded group */ + i = p_sw->down_port_groups_idx; for (k = 0; k < p_sw->down_port_groups_num; k++) { - if (k == 0) - i = p_sw->down_port_groups_idx; - else - i = (i + 1) % p_sw->down_port_groups_num; p_group = p_sw->down_port_groups[i]; + i = (i + 1) % p_sw->down_port_groups_num; /* Skip this port group unless it points to a switch */ if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) @@ -3413,7 +3411,7 @@ static void __osm_ftree_fabric_set_leaf_rank(IN ftree_fabric_t * p_ftree) { unsigned i; ftree_sw_t *p_sw; - ftree_hca_t *p_hca; + ftree_hca_t *p_hca = NULL; ftree_hca_t *p_next_hca; OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_set_leaf_rank); -- 1.5.3.rc2.29.gc4640f From sashak at voltaire.com Wed Sep 19 12:00:17 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 21:00:17 +0200 Subject: [ofa-general] [PATCH] opensm/osm_vl15intf.c: uninitialized var usage fix Message-ID: <20070919190017.GT29384@sashak.voltaire.com> Fix uninitialized variable usage. Also potentially fix the error flow case. 
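The error-flow part of this change can be sketched in isolation. In the sketch below, which uses hypothetical names (struct waiter, wait_once, drain) rather than the OpenSM or complib API, the wait status is checked inside the loop body, so it is only read after it has been assigned, and a failed wait leaves the loop instead of being retried:

```c
#include <assert.h>

enum wait_status { WAIT_OK = 0, WAIT_FAILED = 1 };

/* Hypothetical stand-in for an event source; not an OpenSM type. */
struct waiter {
	int calls;		/* how many times wait_once() was called */
	int fail_at;		/* call number that fails (0 = never) */
	int outstanding;	/* pending requests still "on the wire" */
};

/* Pretend to wait for one completion; fails on the fail_at-th call. */
static enum wait_status wait_once(struct waiter *w)
{
	w->calls++;
	if (w->calls == w->fail_at)
		return WAIT_FAILED;
	w->outstanding--;
	return WAIT_OK;
}

/*
 * Wait while at least `limit` requests are outstanding. Returns the
 * number of errors seen (0 or 1). `status` is only read right after it
 * has been assigned, and a failure breaks out of the loop.
 */
static int drain(struct waiter *w, int limit)
{
	enum wait_status status;
	int errors = 0;

	while (w->outstanding >= limit) {
		status = wait_once(w);
		if (status != WAIT_OK) {
			errors++;	/* stands in for the error log call */
			break;
		}
	}
	return errors;
}
```

When the loop condition is false on entry, drain() returns without ever reading status; with the check placed after an unbraced loop, that path would read a variable that was never written.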
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_vl15intf.c | 16 +++++++++------- 1 files changed, 9 insertions(+), 7 deletions(-) diff --git a/opensm/opensm/osm_vl15intf.c b/opensm/opensm/osm_vl15intf.c index fcfad4f..74e749f 100644 --- a/opensm/opensm/osm_vl15intf.c +++ b/opensm/opensm/osm_vl15intf.c @@ -170,15 +170,17 @@ static void __osm_vl15_poller(IN void *p_ptr) while ((p_vl->p_stats->qp0_mads_outstanding_on_wire >= (int32_t) p_vl->max_wire_smps) && - (p_vl->thread_state == OSM_THREAD_STATE_RUN)) + (p_vl->thread_state == OSM_THREAD_STATE_RUN)) { status = cl_event_wait_on(&p_vl->signal, EVENT_NO_TIMEOUT, TRUE); - - if (status != CL_SUCCESS) - osm_log(p_vl->p_log, OSM_LOG_ERROR, - "__osm_vl15_poller: ERR 3E02: " - "Event wait failed (%s)\n", - CL_STATUS_MSG(status)); + if (status != CL_SUCCESS) { + osm_log(p_vl->p_log, OSM_LOG_ERROR, + "__osm_vl15_poller: ERR 3E02: " + "Event wait failed (%s)\n", + CL_STATUS_MSG(status)); + break; + } + } } /* -- 1.5.3.rc2.29.gc4640f From sashak at voltaire.com Wed Sep 19 12:02:19 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 21:02:19 +0200 Subject: [ofa-general] [PATCH] opensm/osm_sa_mcmember_record.c: fix uninitilized proxy_join usage Message-ID: <20070919190219.GU29384@sashak.voltaire.com> This fixes uninitialized usage of the proxy_join variable.
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_sa_mcmember_record.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c index a890525..4d4adfb 100644 --- a/opensm/opensm/osm_sa_mcmember_record.c +++ b/opensm/opensm/osm_sa_mcmember_record.c @@ -1773,7 +1773,7 @@ __osm_sa_mcm_by_comp_mask_cb(IN cl_map_item_t * const p_map_item, uint8_t scope_state_mask = 0; cl_map_item_t *p_item; ib_gid_t port_gid; - boolean_t proxy_join; + boolean_t proxy_join = FALSE; OSM_LOG_ENTER(p_rcv->p_log, __osm_sa_mcm_by_comp_mask_cb); -- 1.5.3.rc2.29.gc4640f From sashak at voltaire.com Wed Sep 19 12:06:23 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 21:06:23 +0200 Subject: [ofa-general] [PATCH] opensm/osm_sa_mcmember_record.c: fix uninitilized proxy_join usage In-Reply-To: <20070919190219.GU29384@sashak.voltaire.com> References: <20070919190219.GU29384@sashak.voltaire.com> Message-ID: <20070919190623.GV29384@sashak.voltaire.com> Hi Hal, On 21:02 Wed 19 Sep , Sasha Khapyorsky wrote: > > This fixes uninitialized usage of proxy_join var. > > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_sa_mcmember_record.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c > index a890525..4d4adfb 100644 > --- a/opensm/opensm/osm_sa_mcmember_record.c > +++ b/opensm/opensm/osm_sa_mcmember_record.c > @@ -1773,7 +1773,7 @@ __osm_sa_mcm_by_comp_mask_cb(IN cl_map_item_t * const p_map_item, > uint8_t scope_state_mask = 0; > cl_map_item_t *p_item; > ib_gid_t port_gid; > - boolean_t proxy_join; > + boolean_t proxy_join = FALSE; Does this fix look right to you? Thanks.
Sasha > > OSM_LOG_ENTER(p_rcv->p_log, __osm_sa_mcm_by_comp_mask_cb); > > -- > 1.5.3.rc2.29.gc4640f > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Wed Sep 19 12:31:00 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 19 Sep 2007 21:31:00 +0200 Subject: [ofa-general] [PATCH] opensm/osm_sa_(multi)path_record: various fixes Message-ID: <20070919193100.GW29384@sashak.voltaire.com> Couple of similar fixes for osm_sa_path_record.c and osm_sa_multipath_record.c - mostly related to using yet not initialized variables. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_sa_multipath_record.c | 27 ++++++++++++--------------- opensm/opensm/osm_sa_path_record.c | 25 ++++++++++--------------- 2 files changed, 22 insertions(+), 30 deletions(-) diff --git a/opensm/opensm/osm_sa_multipath_record.c b/opensm/opensm/osm_sa_multipath_record.c index a94a943..efc6a07 100644 --- a/opensm/opensm/osm_sa_multipath_record.c +++ b/opensm/opensm/osm_sa_multipath_record.c @@ -226,7 +226,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, const osm_physp_t *p_physp; const osm_physp_t *p_src_physp; const osm_physp_t *p_dest_physp; - const osm_prtn_t *p_prtn; + const osm_prtn_t *p_prtn = NULL; const ib_port_info_t *p_pi; ib_slvl_table_t *p_slvl_tbl; ib_api_status_t status = IB_SUCCESS; @@ -494,10 +494,6 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, && (rate > p_qos_level->rate_limit)) rate = p_qos_level->rate_limit; - if (p_qos_level->pkt_life_set - && (pkt_life > p_qos_level->pkt_life)) - pkt_life = p_qos_level->pkt_life; - if (p_qos_level->sl_set) { required_sl = p_qos_level->sl; if (!(valid_sl_mask & (1 << required_sl))) { @@ -505,14 +501,6 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, goto Exit; 
} } - - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_mpr_rcv_get_path_parms: " - "MultiPath params with QoS constaraints: " - "min MTU = %u, min rate = %u, " - "packet lifetime = %u, sl = %u\n", - mtu, rate, pkt_life, required_sl); } /* @@ -608,7 +596,9 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, for loopback paths, packetLifeTime shall be zero. */ if (p_src_port == p_dest_port) pkt_life = 0; /* loopback */ - else if (!(p_qos_level && p_qos_level->pkt_life_set)) + else if (p_qos_level && p_qos_level->pkt_life_set) + pkt_life = p_qos_level->pkt_life; + else pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; /* we silently ignore cases where only the PktLife selector is defined */ @@ -783,13 +773,13 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, required_pkey & cl_ntoh16((uint16_t) ~ 0x8000)); if (!p_prtn) { + required_sl = OSM_DEFAULT_SL; /* this may be possible when pkey tables are created somehow in previous runs or things are going wrong here */ osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_mpr_rcv_get_path_parms: ERR 451A: " "No partition found for PKey 0x%04x - using default SL %d\n", cl_ntoh16(required_pkey), required_sl); - required_sl = OSM_DEFAULT_SL; } else required_sl = p_prtn->sl; @@ -825,6 +815,13 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, p_parms->sl = required_sl; p_parms->hops = hops; + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mpr_rcv_get_path_parms: MultiPath params:" + " mtu = %u, rate = %u, packet lifetime = %u," + " pkey = %u, sl = %u, hops = %u\n", mtu, rate, + pkt_life, cl_ntoh16(required_pkey), required_sl, hops); + Exit: OSM_LOG_EXIT(p_rcv->p_log); return (status); diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c index 5e06f75..3b183d9 100644 --- a/opensm/opensm/osm_sa_path_record.c +++ b/opensm/opensm/osm_sa_path_record.c @@ -487,7 +487,6 @@ 
__osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, p_pr, p_src_physp, p_dest_physp, comp_mask))) { - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) osm_log(p_rcv->p_log, OSM_LOG_DEBUG, "__osm_pr_rcv_get_path_parms: " @@ -504,10 +503,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, && (rate > p_qos_level->rate_limit)) rate = p_qos_level->rate_limit; - if (p_qos_level->pkt_life_set - && (pkt_life > p_qos_level->pkt_life)) - pkt_life = p_qos_level->pkt_life; - if (p_qos_level->sl_set) { sl = p_qos_level->sl; if (!(valid_sl_mask & (1 << sl))) { @@ -515,14 +510,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, goto Exit; } } - - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_pr_rcv_get_path_parms: " - "Path params with QoS constaraints: " - "min MTU = %u, min rate = %u, " - "packet lifetime = %u, sl = %u\n", - mtu, rate, pkt_life, sl); } /* @@ -533,7 +520,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, */ if (p_src_port == p_dest_port) pkt_life = 0; - else if (!(p_qos_level && p_qos_level->pkt_life_set)) + else if (p_qos_level && p_qos_level->pkt_life_set) + pkt_life = p_qos_level->pkt_life; + else pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; /* @@ -803,13 +792,13 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, * No specific SL in request or in QoS level - use partition SL */ if (!p_prtn) { + sl = OSM_DEFAULT_SL; /* this may be possible when pkey tables are created somehow in previous runs or things are going wrong here */ osm_log(p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_get_path_parms: ERR 1F1C: " "No partition found for PKey 0x%04x - using default SL %d\n", cl_ntoh16(pkey), sl); - sl = OSM_DEFAULT_SL; } else sl = p_prtn->sl; } else if (p_rcv->p_subn->opt.qos) { @@ -843,6 +832,12 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, p_parms->pkey = pkey; p_parms->sl = sl; + if 
(osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_pr_rcv_get_path_parms: Path params:" + " mtu = %u, rate = %u, packet lifetime = %u," + " pkey = %u, sl = %u\n", + mtu, rate, pkt_life, cl_ntoh16(pkey), sl); Exit: OSM_LOG_EXIT(p_rcv->p_log); return (status); -- 1.5.3.rc2.29.gc4640f From hrosenstock at xsigo.com Wed Sep 19 15:30:39 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 19 Sep 2007 15:30:39 -0700 Subject: [ofa-general] [PATCH] opensm/osm_sa_mcmember_record.c: fix uninitilized proxy_join usage In-Reply-To: <20070919190623.GV29384@sashak.voltaire.com> References: <20070919190219.GU29384@sashak.voltaire.com> <20070919190623.GV29384@sashak.voltaire.com> Message-ID: <1190241039.7075.71.camel@hrosenstock-ws.xsigo.com> Hi Sasha, On Wed, 2007-09-19 at 21:06 +0200, Sasha Khapyorsky wrote: > Hi Hal, > > On 21:02 Wed 19 Sep , Sasha Khapyorsky wrote: > > > > This fixes uninitilized usage of proxy_join var. > > > > Signed-off-by: Sasha Khapyorsky > > --- > > opensm/opensm/osm_sa_mcmember_record.c | 2 +- > > 1 files changed, 1 insertions(+), 1 deletions(-) > > > > diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c > > index a890525..4d4adfb 100644 > > --- a/opensm/opensm/osm_sa_mcmember_record.c > > +++ b/opensm/opensm/osm_sa_mcmember_record.c > > @@ -1773,7 +1773,7 @@ __osm_sa_mcm_by_comp_mask_cb(IN cl_map_item_t * const p_map_item, > > uint8_t scope_state_mask = 0; > > cl_map_item_t *p_item; > > ib_gid_t port_gid; > > - boolean_t proxy_join; > > + boolean_t proxy_join = FALSE; > > Does this fix look right for you? ProxyJoin is a computed component and meaningless in the request. It looks like proxy_join variable is used in the case of lack of trust or a specific port specified in the request (the else clause (max one record returned)). 
In the latter case (specific port specified), proxy_join gets initialized,
but in the former case (non-trusted request without a specific port) it
might not be initialized, so there might be an issue with that
combination. Setting this to non-proxy seems safer to me and is certainly
better than leaving it uninitialized, but I need to think more about this.

-- Hal

> Thanks.
>
> Sasha
>
> >
> > 	OSM_LOG_ENTER(p_rcv->p_log, __osm_sa_mcm_by_comp_mask_cb);
> >
> > --
> > 1.5.3.rc2.29.gc4640f

From jeff.c.becker at gmail.com Wed Sep 19 16:10:43 2007
From: jeff.c.becker at gmail.com (Jeff Becker)
Date: Wed, 19 Sep 2007 16:10:43 -0700
Subject: [ofa-general] libibmad question forward
Message-ID: <795c49870709191610j4330cb96i8ff8fef359bdcb6b@mail.gmail.com>

I am trying to use the libibmad library for initiating queries of Device
Management and other class types. While initializing, the madrpc_init()
call fails when I have IB_DEVICE_MGMT_CLASS included as a part of the
mgmt_classes parameter. This is because mgmt_class_vers() (which is called
by mad_register_port_client()/mad_register_client()) fails to return a
class version for the Device Management class.

I am able to make DM queries if mgmt_class_vers() is fixed, i.e. just
add a case to return the version for IB_DEVICE_MGMT_CLASS, e.g.
static int
mgmt_class_vers(int mgmt_class)
{
	if ((mgmt_class >= IB_VENDOR_RANGE1_START_CLASS &&
	     mgmt_class <= IB_VENDOR_RANGE1_END_CLASS) ||
	    (mgmt_class >= IB_VENDOR_RANGE2_START_CLASS &&
	     mgmt_class <= IB_VENDOR_RANGE2_END_CLASS))
		return 1;

	switch(mgmt_class) {
	case IB_SMI_CLASS:
	case IB_SMI_DIRECT_CLASS:
		return 1;
	case IB_SA_CLASS:
		return 2;
	case IB_PERFORMANCE_CLASS:
		return 1;
	// Change START
	case IB_DEVICE_MGMT_CLASS:
		return 1;
	// Change END
	}

	return 0;
}

I am wondering if this minor anomaly can be submitted as a bug to
broaden the usage of libibmad for DM queries.

Thanks for any help in advance.

Akshay Mathur
QLogic Corporation
780 Fifth Avenue, Suite 140
King of Prussia, PA 19406
Office: 610.233.4836
Fax: 610.233.4777

From hrosenstock at xsigo.com Wed Sep 19 16:27:54 2007
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Wed, 19 Sep 2007 16:27:54 -0700
Subject: [ofa-general] libibmad question forward
In-Reply-To: <795c49870709191610j4330cb96i8ff8fef359bdcb6b@mail.gmail.com>
References: <795c49870709191610j4330cb96i8ff8fef359bdcb6b@mail.gmail.com>
Message-ID: <1190244474.7075.74.camel@hrosenstock-ws.xsigo.com>

On Wed, 2007-09-19 at 16:10 -0700, Jeff Becker wrote:
> I am trying to use libibmad library for initiating queries of Device
> Management and other class types. While initializing, the
> madrpc_init() call fails when I have IB_DEVICE_MGMT_CLASS included as
> a part of mgmt_classes parameter. This is because mgmt_class_vers()
> (which is called by mad_register_port_client()/ mad_register_client())
> fails to return class version for Device Management Class.
>
> I am able to make DM queries if mgmt_class_vers() is fixed i.e. just
> add a case to return the version for IB_DEVICE_MGMT_CLASS. e.g.
>
> static int
> mgmt_class_vers(int mgmt_class)
>
> {
>
> 	if ((mgmt_class >= IB_VENDOR_RANGE1_START_CLASS &&
> 	     mgmt_class <= IB_VENDOR_RANGE1_END_CLASS) ||
> 	    (mgmt_class >= IB_VENDOR_RANGE2_START_CLASS &&
> 	     mgmt_class <= IB_VENDOR_RANGE2_END_CLASS))
> 		return 1;
>
> 	switch(mgmt_class) {
> 	case IB_SMI_CLASS:
> 	case IB_SMI_DIRECT_CLASS:
> 		return 1;
> 	case IB_SA_CLASS:
> 		return 2;
> 	case IB_PERFORMANCE_CLASS:
> 		return 1;
> 	// Change START
> 	case IB_DEVICE_MGMT_CLASS:
> 		return 1;
> 	// Change END
> 	}
>
> 	return 0;
>
> I am wondering if this minor anomaly can be submitted as a bug to
> broaden the usage of libibmad for DM queries.

Yes, DM class (and perhaps some other missing GS classes) should be
added there.

-- Hal

>
> Thanks for any help in advance.
>
> Akshay Mathur
> QLogic Corporation
> 780 Fifth Avenue, Suite 140
> King of Prussia, PA 19406
> Office: 610.233.4836
> Fax: 610.233.4777

From ggrundstrom at NetEffect.com Wed Sep 19 16:45:28 2007
From: ggrundstrom at NetEffect.com (Glenn Grundstrom)
Date: Wed, 19 Sep 2007 18:45:28 -0500
Subject: [ofa-general] [PATCH] RDMA/CMA: Implement rdma_resolve_ip retry enhancement.
In-Reply-To: <46F153C4.2070008@ichips.intel.com>
References: <200709190022.l8J0MbWt024754@neteffect.com> <46F153C4.2070008@ichips.intel.com>
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC076E400D@venom2>

> > If an application is calling rdma_resolve_ip() and a status
> of -ENODATA is returned from addr_resolve_local/remote(), the
> timeout mechanism waits until the application's timeout
> occurs before rechecking the address resolution status; the
> application will wait until its full timeout occurs.
This
> case is seen when the work thread call to process_req() is
> made before the arp packet is processed.
>
> I don't understand the issue. process_req() is invoked whenever a
> network event occurs, which rechecks all pending requests.

Yes, I see the netevent_callback(). I now agree that this patch is not
necessary. Roland, please disregard.

Glenn.

>
> > This patch is in addition to Steve Wise's neigh_event_send
> patch to initiate neighbour discovery sent on 9/12/2007.
>
> This patch looks unrelated to Steve's patch. Can you clarify the
> relationship?
>
> - Sean
>

From krkumar2 at in.ibm.com Wed Sep 19 22:10:33 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Thu, 20 Sep 2007 10:40:33 +0530
Subject: [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
In-Reply-To: <200709191523.48874.ossthema@de.ibm.com>
Message-ID:

Hi Jan-Bernd,

Jan-Bernd Themann wrote on 09/19/2007 06:53:48 PM:

> If I understood it right the problem you describe (quota update in
> __napi_schedule) can cause further problems when you choose the
> following numbers:
>
> CPU1: A. process 99 pkts
> CPU1: B. netif_rx_complete()
> CPU2: interrupt occurs, netif_rx_schedule is called, net_rx_action triggered:
> CPU2: C. set quota = 100 (__napi_schedule)
> CPU2: D. call poll(), process 1 pkt
> CPU2: D.2 call netif_rx_complete() (quota not exceeded)
> CPU2: E. net_rx_action: set quota=99
> CPU1: F. net_rx_action: set quota=99 - 99 = 0
> CPU1: G. modify list (list_move_tail) although netif_rx_complete has been called
>
> Step G would fail as the device is not in the list due
> to netif_rx_complete. This case can occur for all
> devices running on an SMP system where interrupts are not pinned.

I think list_move should be OK whether the device is on the list or not.
But it should not reach that code, since work (99) != weight (100). If
work == weight, then the driver would not have called complete, and
next/prev would not be set to POISON.
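
The work/weight argument above can be sketched in plain C. This is a
user-space toy model, not kernel code: toy_napi, toy_poll and
toy_would_requeue are hypothetical names invented for illustration, and
the on_list flag merely stands in for membership of the poll list. It
assumes only what the thread states, namely that a driver completes when
it handled less than its weight, and that the list move happens only when
work == weight.

```c
#include <stdbool.h>

/* Toy stand-in for a NAPI instance (hypothetical; not struct napi_struct). */
struct toy_napi {
	int weight;	/* per-poll budget, e.g. 100 */
	bool on_list;	/* stands in for membership of the per-cpu poll list */
};

/*
 * Toy driver poll routine: handles at most `weight` of the `pending`
 * packets. If it handled fewer than `weight`, it "completes" and leaves
 * the list, loosely mirroring a driver calling netif_rx_complete().
 */
static int toy_poll(struct toy_napi *n, int pending)
{
	int work = pending < n->weight ? pending : n->weight;

	if (work < n->weight)
		n->on_list = false;
	return work;
}

/*
 * Toy version of the core's decision after poll(): the instance is moved
 * to the tail of the list only when the whole weight was used up.
 * Returns true if a requeue (list move) would happen.
 */
static bool toy_would_requeue(struct toy_napi *n, int pending)
{
	int work = toy_poll(n, pending);

	return work == n->weight;
}
```

With weight 100, a poll that handles 99 packets completes and is not
requeued, while a poll that handles all 100 stays on the list and is
requeued, so in this model the list move never touches an instance that
has already completed.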
I like the clean changes made by Dave to fix this, and will test it today
(if I can get my crashed system to come up).

Thanks,

- KK

From kliteyn at mellanox.co.il Wed Sep 19 22:13:38 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 20 Sep 2007 07:13:38 +0200
Subject: [ofa-general] nightly osm_sim report 2007-09-20:normal completion
Message-ID:

OSM Simulation Regression Summary

[Generated mail - please do NOT reply]

OpenSM binary date = 2007-09-19
OpenSM git rev = Wed_Sep_19_06:00:55_2007 [f3dcc0c51832008bd01f811d921f92c5fdd427ae]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=520 Pass=520 Fail=0

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:

From davem at davemloft.net Wed Sep 19 22:12:24 2007
From: davem at davemloft.net (David Miller)
Date: Wed, 19 Sep 2007 22:12:24 -0700 (PDT)
Subject: [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
In-Reply-To:
References: <200709191523.48874.ossthema@de.ibm.com>
Message-ID: <20070919.221224.26966518.davem@davemloft.net>

From: Krishna Kumar2
Date: Thu, 20 Sep 2007 10:40:33 +0530

> I like the clean changes
made by Dave to fix this, and will test it
> today (if I can get my crashed system to come up).

I would very much appreciate this testing, as I'm rather sure we've
plugged up the most serious holes at this point.

From krkumar2 at in.ibm.com Wed Sep 19 22:18:15 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Thu, 20 Sep 2007 10:48:15 +0530
Subject: [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
In-Reply-To: <20070919.090557.24612742.davem@davemloft.net>
Message-ID:

Hi Dave,

David Miller wrote on 09/19/2007 09:35:57 PM:

> The NAPI_STATE_SCHED flag bit should provide all of the necessary
> synchronization.
>
> Only the setter of that bit should add the NAPI instance to the
> polling list.
>
> The polling loop runs atomically on the cpu where the NAPI instance
> got added to the per-cpu polling list. And therefore decisions to
> complete NAPI are serialized too.
>
> That serialized completion decision is also when the list deletion
> occurs.

About the "list deletion occurs", isn't the race I mentioned still
present? If done < budget, the driver calls netif_rx_complete (at which
point some other cpu can add this NAPI to its list). But the first cpu
might still perform further actions on the napi (for example, ipoib_poll()
calls request_notify_cq(priv->cq)) after the other cpu has started using
this napi. (net_rx_action's 'list_move', however, will not execute, since
work != weight.)

Thanks,

- KK

From krkumar2 at in.ibm.com Wed Sep 19 22:54:01 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Thu, 20 Sep 2007 11:24:01 +0530
Subject: [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
In-Reply-To: <20070919.221224.26966518.davem@davemloft.net>
Message-ID:

Ran 4/16/64 thread iperf on latest bits with this patch and no issues
after 30 mins. I used to consistently get the bug within 1-2 mins with
just 4 threads prior to this patch.
Tested-by: Krishna Kumar (if any value in that)

thanks,

- KK

David Miller wrote on 09/20/2007 10:42:24 AM:

> From: Krishna Kumar2
> Date: Thu, 20 Sep 2007 10:40:33 +0530
>
> > I like the clean changes made by Dave to fix this, and will test it
> > today (if I can get my crashed system to come up).
>
> I would very much appreciate this testing, as I'm rather sure we've
> plugged up the most serious holes at this point.

From jackm at dev.mellanox.co.il Wed Sep 19 23:37:35 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 20 Sep 2007 08:37:35 +0200
Subject: [ofa-general] Re: [PATCHv2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists
In-Reply-To:
References: <200709041047.32062.jackm@dev.mellanox.co.il> <20070919063421.GA6185@mellanox.co.il>
Message-ID: <200709200837.36249.jackm@dev.mellanox.co.il>

On Wednesday 19 September 2007 18:53, Roland Dreier wrote:
> OK, I added the patch below to my tree. I cleaned up Jack's patch a
> little and it seems to work for me; I hope I didn't break anything.
>

Looks fine -- the overall logic is the same, with a couple of
optimizations you added (e.g., no wmb on receive posts, size incremented
once instead of once per loop).

Thanks!

- Jack

From ogerlitz at voltaire.com Thu Sep 20 00:22:00 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 20 Sep 2007 09:22:00 +0200
Subject: [ofa-general] [RFC] [PATCH 1/5 v2] ib/ipoib: specify Traffic Class with PR queries for QoS support
In-Reply-To:
References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com>
Message-ID: <46F21F98.6090503@voltaire.com>

Roland Dreier wrote:
> thanks, applied.

Hi Roland,

You have sent a "thanks, applied" email for the ipoib QoS patch twice,
once for each of the two posts below, whereas only v3 should have been
applied (the rest of the series is v2; only the ipoib patch had a v3).
> [ofa-general] [RFC] [PATCH 1/5 v2] ib/ipoib: specify Traffic Class with PR queries for QoS support
> [RFC] [PATCH 1/5 v3] ib/ipoib: specify Traffic Class with PR queries for QoS support

Also, where did you apply it? Your git tree at kernel.org has not been
updated for five days...

Or.

From mst at dev.mellanox.co.il Thu Sep 20 00:47:12 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 20 Sep 2007 09:47:12 +0200
Subject: [ofa-general] [PATCH v8] IB/mlx4: shrinking WQE
Message-ID: <20070920074712.GB7141@mellanox.co.il>

ConnectX supports shrinking WQEs, such that a single WR can include
multiple units of wqe_shift. This way, WRs can differ in size, and
do not have to be a power of 2 in size, saving memory and speeding up
send WR posting. Unfortunately, if we do this, the wqe_index field in
the CQE can't be used to look up the WR ID anymore, so do this only if
selective signalling is off.

Further, on 32-bit platforms, we can't use vmap to make
the QP buffer virtually contiguous. Thus we have to use
constant-sized WRs to make sure a WR is always fully within
a single page-sized chunk.

Finally, we use a WR with the NOP opcode to avoid wrap-around in the
middle of a WR. We set the NoErrorCompletion (NEC) bit to avoid getting
completions with error for NOP WRs. Since NEC is only supported starting
with firmware 2.2.232, we use constant-sized WRs for older firmware.
And, since MLX QPs only support SEND, we use constant-sized WRs in this
case.

Signed-off-by: Michael S. Tsirkin

---

It turns out I just resent the old patch version instead of posting a
new one. Sorry about that, here's one that actually implements the
changes:

Changes since v7:
- avoid mis-detecting recv write with immediate completion as NOP
- increase min.
wqe_shift for RC QPs to 64 bytes, so that stamping (which is done each 64 bytes) invalidates all WQEs - disable WQE shrinking if FW version is < 2.2.232, otherwise we could get CQE with error for NOP, which might overflow the CQ diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 8bf44da..20ba988 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -331,6 +331,12 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP && + is_send)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +359,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if ((*cur_qp)->ibqp.srq) { diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 705ff2f..a72ecb9 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -115,6 +115,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int sq_spare_wqes; struct mlx4_ib_wq sq; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 158507d..c844498 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -30,6 +30,7 @@ * SOFTWARE. 
*/ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,88 @@ static void *get_send_wqe(struct mlx4_ib_qp *qp, int n) /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). + * + * When max WR is than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. */ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { - u32 *wqe = get_send_wqe(qp, n); + u32 *wqe; int i; + int s; + int ind; + void *buf; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift); + if (qp->sq_max_wqes_per_wr > 1) { + for (i = 0; i < s; i += 64) { + ind = (i >> qp->sq.wqe_shift) + n; + stamp = ind & qp->sq.wqe_cnt ? 
cpu_to_be32(0xffffffff) : + cpu_to_be32(0x7fffffff); + buf = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); + wqe = buf + (i & ((1 << qp->sq.wqe_shift) - 1)); + *wqe = stamp; + } + } else { + buf = get_send_wqe(qp, n); + for (i = 64; i < s; i += 64) { + wqe = buf + i; + *wqe = 0xffffffff; + } + } +} - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + stamp_send_wqe(qp, (n + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1), size); + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = sizeof(struct mlx4_wqe_ctrl_seg); + + if (qp->ibqp.qp_type == IB_QPT_UD) { + struct mlx4_wqe_datagram_seg *dgram = wqe + sizeof *ctrl; + struct mlx4_av *av = (struct mlx4_av *)dgram->av; + memset(dgram, 0, sizeof *dgram); + av->port_pd = cpu_to_be32((qp->port << 24) | to_mpd(qp->ibqp.pd)->pdn); + s += sizeof(struct mlx4_wqe_datagram_seg); + } + + /* Pad the remainder of the WQE with an inline data segment. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); + + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? 
cpu_to_be32(1 << 31) : 0); +} + +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -237,6 +310,8 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +327,69 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this wqe_index field in CQE + * can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. 
+ * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contigious. Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. + * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * We set NEC bit to avoid getting completions with error for NOP WRs. + * Since NEC is only supported starting with firmware 2.2.232, + * we use constant-sized WRs for older firmware. + * + * And, since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. + * + * We set WQE size to at least 64 bytes, this way stamping invalidates each WQE. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + if (dev->dev->caps.fw_ver >= MLX4_FW_VER_WQE_CTRL_NEC && + qp->sq_signal_bits && BITS_PER_LONG == 64 && + type != IB_QPT_SMI && type != IB_QPT_GSI) + qp->sq.wqe_shift = ilog2(64); + else + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. 
+ */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +401,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +440,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +536,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1030,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1238,13 +1364,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr, unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { @@ -1260,7 +1387,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? @@ -1371,16 +1498,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = (ind + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1); + ind += DIV_ROUND_UP(size * 16, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). 
*/ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size * 16); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1402,8 +1536,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size * 16); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c index f8d63d3..0fce74d 100644 --- a/drivers/net/mlx4/alloc.c +++ b/drivers/net/mlx4/alloc.c @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + kfree(pages); + if (!buf->u.direct.buf) + goto err_free; + } } return 0; @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf) dma_free_coherent(&dev->pdev->dev, size, buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index cfb78fb..2c6c768 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -129,6 +129,11 @@ enum { MLX4_STAT_RATE_OFFSET = 5 }; +static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) +{ + return (major << 32) | (minor << 16) | subminor; +} + struct mlx4_caps { u64 fw_ver; int num_ports; @@ -185,7 +190,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct 
mlx4_buf_list *page_list; } u; diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index 3968b94..09a2230 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -154,7 +154,11 @@ struct mlx4_qp_context { u32 reserved5[10]; }; +/* Which firmware version adds support for NEC (NoErrorCompletion) bit */ +#define MLX4_FW_VER_WQE_CTRL_NEC mlx4_fw_ver(2, 2, 232) + enum { + MLX4_WQE_CTRL_NEC = 1 << 29, MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, -- MST From mst at dev.mellanox.co.il Thu Sep 20 01:40:04 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 20 Sep 2007 10:40:04 +0200 Subject: [ofa-general] Re: [PATCH v8] IB/mlx4: shrinking WQE In-Reply-To: <20070920074712.GB7141@mellanox.co.il> References: <20070920074712.GB7141@mellanox.co.il> Message-ID: <20070920084004.GB10316@mellanox.co.il> > It turns out I just resent the old patch version instead of > posting a new one. Sorry about that, here's one that actually > implements the changes: > > Changes since v7: > - avoid mis-detecting recv write with immediate completion as NOP > - increase min. wqe_shift for RC QPs to 64 bytes, so that > stamping (which is done each 64 bytes) invalidates all WQEs > - disable WQE shrinking if FW version is < 2.2.232, otherwise > we could get CQE with error for NOP, which might overflow the CQ BTW, this is less code: drivers/infiniband/hw/mlx4/cq.c | 12 +- drivers/infiniband/hw/mlx4/mlx4_ib.h | 2 drivers/infiniband/hw/mlx4/qp.c | 206 +++++++++++++++++++++++++++++------ drivers/net/mlx4/alloc.c | 16 ++ include/linux/mlx4/device.h | 7 + include/linux/mlx4/qp.h | 4 6 files changed, 209 insertions(+), 38 deletions(-) Note that some 30 lines of the insertions are comments, and there's a sanity check in cq.c that we can get rid of (some 5 more lines). Less scary now, isn't it? 
-- MST From erezz at Voltaire.COM Thu Sep 20 01:44:18 2007 From: erezz at Voltaire.COM (Erez Zilber) Date: Thu, 20 Sep 2007 10:44:18 +0200 Subject: [ofa-general] [PATCH] In-Reply-To: References: Message-ID: <46F232E2.9010302@Voltaire.COM> Roland Dreier wrote: > err, subject should have been "[PATCH] IB/iser: Remove unnecessary includes" > It looks ok. If it complies on your machines, I'm ok with it. Erez From vlad at lists.openfabrics.org Thu Sep 20 02:53:13 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 20 Sep 2007 02:53:13 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070920-0200 daily build status Message-ID: <20070920095313.13609E6087D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed
on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070920-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From dotanb at dev.mellanox.co.il Thu Sep 20 04:37:17 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 20 Sep 2007 13:37:17 +0200 Subject: [ofa-general] A question about rdma_get_cm_event Message-ID: <46F25B6D.9000000@dev.mellanox.co.il> Hi Sean. (First of all, thanks for applying all of the valgrind patches...) When one calls rdma_get_cm_event, one gets an rdma_cm_event structure. This structure has 2 attributes which I want to discuss: * private_data * private_data_len It seems that when one side sends private data to the other, the private data is correct (I mean that the private_data attribute points to a memory buffer with the expected data) but private_data_len has a fixed size (depending on the ucma function which was called). 1) Is this the expected behavior? 2) Can you please add an entry to the man page of this function to clarify the expected content of those attributes? Thanks, Dotan From kliteyn at mellanox.co.il Thu Sep 20 04:38:11 2007 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 20 Sep 2007 13:38:11 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c: potentially uninitialized vars usage fix In-Reply-To: <20070919184812.GS29384@sashak.voltaire.com> References: <20070919184812.GS29384@sashak.voltaire.com> Message-ID: <46F25BA3.2050009@mellanox.co.il> Looks OK, thanks -- Yevgeny Sasha Khapyorsky wrote: > Fix usage of potentially uninitialized variables.
> > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_ucast_ftree.c | 10 ++++------ > 1 files changed, 4 insertions(+), 6 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index d8ba368..948129c 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -1526,7 +1526,7 @@ static int __osm_ftree_fabric_mark_leaf_switches(IN ftree_fabric_t * p_ftree) > static void __osm_ftree_fabric_make_indexing(IN ftree_fabric_t * p_ftree) > { > ftree_sw_t *p_remote_sw; > - ftree_sw_t *p_sw; > + ftree_sw_t *p_sw = NULL; > ftree_sw_t *p_next_sw; > ftree_tuple_t new_tuple; > uint32_t i; > @@ -2082,13 +2082,11 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, > > /* foreach down-going port group (in indexing order) > starting with the least loaded group */ > + i = p_sw->down_port_groups_idx; > for (k = 0; k < p_sw->down_port_groups_num; k++) { > - if (k == 0) > - i = p_sw->down_port_groups_idx; > - else > - i = (i + 1) % p_sw->down_port_groups_num; > > p_group = p_sw->down_port_groups[i]; > + i = (i + 1) % p_sw->down_port_groups_num; > > /* Skip this port group unless it points to a switch */ > if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) > @@ -3413,7 +3411,7 @@ static void __osm_ftree_fabric_set_leaf_rank(IN ftree_fabric_t * p_ftree) > { > unsigned i; > ftree_sw_t *p_sw; > - ftree_hca_t *p_hca; > + ftree_hca_t *p_hca = NULL; > ftree_hca_t *p_next_hca; > > OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_set_leaf_rank); > From kliteyn at mellanox.co.il Thu Sep 20 04:43:29 2007 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 20 Sep 2007 13:43:29 +0200 Subject: [ofa-general] [PATCH] opensm/osm_sa_(multi)path_record: various fixes In-Reply-To: <20070919193100.GW29384@sashak.voltaire.com> References: <20070919193100.GW29384@sashak.voltaire.com> Message-ID: <46F25CE1.8090206@mellanox.co.il> Looks fine, thanks. 
-- Yevgeny Sasha Khapyorsky wrote: > Couple of similar fixes for osm_sa_path_record.c and > osm_sa_multipath_record.c - mostly related to using yet not initialized > variables. > > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_sa_multipath_record.c | 27 ++++++++++++--------------- > opensm/opensm/osm_sa_path_record.c | 25 ++++++++++--------------- > 2 files changed, 22 insertions(+), 30 deletions(-) > > diff --git a/opensm/opensm/osm_sa_multipath_record.c b/opensm/opensm/osm_sa_multipath_record.c > index a94a943..efc6a07 100644 > --- a/opensm/opensm/osm_sa_multipath_record.c > +++ b/opensm/opensm/osm_sa_multipath_record.c > @@ -226,7 +226,7 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, > const osm_physp_t *p_physp; > const osm_physp_t *p_src_physp; > const osm_physp_t *p_dest_physp; > - const osm_prtn_t *p_prtn; > + const osm_prtn_t *p_prtn = NULL; > const ib_port_info_t *p_pi; > ib_slvl_table_t *p_slvl_tbl; > ib_api_status_t status = IB_SUCCESS; > @@ -494,10 +494,6 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, > && (rate > p_qos_level->rate_limit)) > rate = p_qos_level->rate_limit; > > - if (p_qos_level->pkt_life_set > - && (pkt_life > p_qos_level->pkt_life)) > - pkt_life = p_qos_level->pkt_life; > - > if (p_qos_level->sl_set) { > required_sl = p_qos_level->sl; > if (!(valid_sl_mask & (1 << required_sl))) { > @@ -505,14 +501,6 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, > goto Exit; > } > } > - > - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "__osm_mpr_rcv_get_path_parms: " > - "MultiPath params with QoS constaraints: " > - "min MTU = %u, min rate = %u, " > - "packet lifetime = %u, sl = %u\n", > - mtu, rate, pkt_life, required_sl); > } > > /* > @@ -608,7 +596,9 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, > for loopback paths, packetLifeTime shall be zero. 
*/ > if (p_src_port == p_dest_port) > pkt_life = 0; /* loopback */ > - else if (!(p_qos_level && p_qos_level->pkt_life_set)) > + else if (p_qos_level && p_qos_level->pkt_life_set) > + pkt_life = p_qos_level->pkt_life; > + else > pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > > /* we silently ignore cases where only the PktLife selector is defined */ > @@ -783,13 +773,13 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, > required_pkey & > cl_ntoh16((uint16_t) ~ 0x8000)); > if (!p_prtn) { > + required_sl = OSM_DEFAULT_SL; > /* this may be possible when pkey tables are created somehow in > previous runs or things are going wrong here */ > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "__osm_mpr_rcv_get_path_parms: ERR 451A: " > "No partition found for PKey 0x%04x - using default SL %d\n", > cl_ntoh16(required_pkey), required_sl); > - required_sl = OSM_DEFAULT_SL; > } else > required_sl = p_prtn->sl; > > @@ -825,6 +815,13 @@ __osm_mpr_rcv_get_path_parms(IN osm_mpr_rcv_t * const p_rcv, > p_parms->sl = required_sl; > p_parms->hops = hops; > > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_mpr_rcv_get_path_parms: MultiPath params:" > + " mtu = %u, rate = %u, packet lifetime = %u," > + " pkey = %u, sl = %u, hops = %u\n", mtu, rate, > + pkt_life, cl_ntoh16(required_pkey), required_sl, hops); > + > Exit: > OSM_LOG_EXIT(p_rcv->p_log); > return (status); > diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c > index 5e06f75..3b183d9 100644 > --- a/opensm/opensm/osm_sa_path_record.c > +++ b/opensm/opensm/osm_sa_path_record.c > @@ -487,7 +487,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > osm_qos_policy_get_qos_level_by_pr(p_rcv->p_subn->p_qos_policy, > p_pr, p_src_physp, p_dest_physp, > comp_mask))) { > - > if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_pr_rcv_get_path_parms: " > @@ -504,10 +503,6 @@ 
__osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > && (rate > p_qos_level->rate_limit)) > rate = p_qos_level->rate_limit; > > - if (p_qos_level->pkt_life_set > - && (pkt_life > p_qos_level->pkt_life)) > - pkt_life = p_qos_level->pkt_life; > - > if (p_qos_level->sl_set) { > sl = p_qos_level->sl; > if (!(valid_sl_mask & (1 << sl))) { > @@ -515,14 +510,6 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > goto Exit; > } > } > - > - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > - osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > - "__osm_pr_rcv_get_path_parms: " > - "Path params with QoS constaraints: " > - "min MTU = %u, min rate = %u, " > - "packet lifetime = %u, sl = %u\n", > - mtu, rate, pkt_life, sl); > } > > /* > @@ -533,7 +520,9 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > */ > if (p_src_port == p_dest_port) > pkt_life = 0; > - else if (!(p_qos_level && p_qos_level->pkt_life_set)) > + else if (p_qos_level && p_qos_level->pkt_life_set) > + pkt_life = p_qos_level->pkt_life; > + else > pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > > /* > @@ -803,13 +792,13 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > * No specific SL in request or in QoS level - use partition SL > */ > if (!p_prtn) { > + sl = OSM_DEFAULT_SL; > /* this may be possible when pkey tables are created somehow in > previous runs or things are going wrong here */ > osm_log(p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_get_path_parms: ERR 1F1C: " > "No partition found for PKey 0x%04x - using default SL %d\n", > cl_ntoh16(pkey), sl); > - sl = OSM_DEFAULT_SL; > } else > sl = p_prtn->sl; > } else if (p_rcv->p_subn->opt.qos) { > @@ -843,6 +832,12 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > p_parms->pkey = pkey; > p_parms->sl = sl; > > + if (osm_log_is_active(p_rcv->p_log, OSM_LOG_DEBUG)) > + osm_log(p_rcv->p_log, OSM_LOG_DEBUG, > + "__osm_pr_rcv_get_path_parms: Path params:" > + " mtu = %u, rate = %u, packet lifetime = %u," > + " pkey = %u, 
sl = %u\n", > + mtu, rate, pkt_life, cl_ntoh16(pkey), sl); > Exit: > OSM_LOG_EXIT(p_rcv->p_log); > return (status); > From tziporet at mellanox.co.il Thu Sep 20 05:41:03 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 20 Sep 2007 14:41:03 +0200 Subject: [ofa-general] RE: Delaying OFED 1.3 alpha release to next week In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563DF6@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563DF6@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901563E0E@mtlexch01.mtl.com> Hi All, Due to some last-minute submissions that have not yet been taken, and some problems with the install script, I am delaying the OFED 1.3 alpha release to next week. I also think we should agree on a new 1.3 schedule based on the changes in the alpha release. Another thing to consider is basing the kernel code on 2.6.24, which would reduce the number of patches we have. Tziporet From ogerlitz at voltaire.com Thu Sep 20 06:16:50 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 20 Sep 2007 15:16:50 +0200 (IST) Subject: [ofa-general] [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: References: Message-ID: changes from v1 - http://lists.openfabrics.org/pipermail/general/2007-September/040250.html - added module param to control the umcast bit in the device priv flags - changed the umcast bit name to IPOIB_FLAG_ADMIN_UMCAST_ALLOWED - the sysfs attribute now has values 0 and 1 instead of "allowed" and "disallowed" please review and consider for merge to 2.6.24 ----- The kernel IB stack allows (through the RDMA CM) user space multicast applications to interoperate with IP based apps optionally running on a different IP subnet. To support this inter-op for the case where the receiving party resides on the IB side, IGMP (reports/queries) must be handled; otherwise the local IP router would not forward multicast traffic towards the IB network.
This patch does a lookup in the database used for multicast reference counting and enhances IPoIB to ignore a multicast group which is already handled by user space, all under a per-device policy flag. That is, when the policy flag allows it, IPoIB will not join and attach its QP to a multicast group which has an entry in the database. For each IPoIB device, the /sys/class/net/$dev/umcast attribute controls the policy flag, where the default value follows the umcast_allowed module param (whose default value is zero). The flag can be read and set/unset through sysfs. Signed-off-by: Or Gerlitz Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-20 11:44:58.000000000 +0300 +++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-20 12:03:24.000000000 +0300 @@ -783,6 +783,7 @@ void ipoib_mcast_restart_task(struct wor struct ipoib_mcast *mcast, *tmcast; LIST_HEAD(remove_list); unsigned long flags; + struct ib_sa_mcmember_rec rec; ipoib_dbg_mcast(priv, "restarting multicast task\n"); @@ -816,6 +817,15 @@ void ipoib_mcast_restart_task(struct wor if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { struct ipoib_mcast *nmcast; + /* ignore group which is directly joined by user space */ + if (test_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags) && + !ib_sa_get_mcmember_rec(priv->ca, priv->port, &mgid, &rec)) + { + ipoib_dbg_mcast(priv, "ignoring multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + continue; + } + /* Not found or send-only group, let's add a new entry */ ipoib_dbg_mcast(priv, "adding multicast entry for mgid " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h
=================================================================== --- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-20 11:44:58.000000000 +0300 +++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-20 11:49:44.000000000 +0300 @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_ADMIN_UMCAST_ALLOWED = 11, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -364,6 +365,7 @@ static inline void ipoib_put_ah(struct i int ipoib_open(struct net_device *dev); int ipoib_add_pkey_attr(struct net_device *dev); +int ipoib_add_umcast_attr(struct net_device *dev); void ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-20 11:44:58.000000000 +0300 +++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-20 11:55:54.000000000 +0300 @@ -61,6 +61,10 @@ MODULE_PARM_DESC(send_queue_size, "Numbe module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444); MODULE_PARM_DESC(recv_queue_size, "Number of descriptors in receive queue"); +int ipoib_umcast_allowed = 0; +module_param_named(umcast_allowed, ipoib_umcast_allowed, int, 0444); +MODULE_PARM_DESC(umcast_allowed, "allow ignoring mulicast group which is already handled by user space"); + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -901,6 +905,9 @@ int ipoib_dev_init(struct net_device *de if (ipoib_ib_dev_init(dev, ca, port)) goto out_tx_ring_cleanup; + if (ipoib_umcast_allowed) + set_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags); + return 0; out_tx_ring_cleanup: @@ -1017,6 +1024,44 @@ static ssize_t show_pkey(struct device * } static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); +static ssize_t show_umcast(struct device *dev, + struct device_attribute *attr, char 
*buf) +{ + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); + + if (test_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags)) + return sprintf(buf, "1\n"); + else + return sprintf(buf, "0\n"); +} + +static ssize_t set_umcast(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); + + if (!strcmp(buf, "1\n")) { + set_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags); + ipoib_warn(priv, "ignoring multicast groups joined directly " + "by user space\n"); + return count; + } + + if (!strcmp(buf, "0\n")) { + clear_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags); + return count; + } + + return -EINVAL; +} +static DEVICE_ATTR(umcast, S_IWUSR | S_IRUGO, show_umcast, set_umcast); + +int ipoib_add_umcast_attr(struct net_device *dev) +{ + return device_create_file(&dev->dev, &dev_attr_umcast); +} + static ssize_t create_child(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) @@ -1134,6 +1179,8 @@ static struct net_device *ipoib_add_port goto sysfs_failed; if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; + if (ipoib_add_umcast_attr(priv->dev)) + goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_create_child)) goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_delete_child)) Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2007-09-20 11:44:58.000000000 +0300 +++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2007-09-20 11:45:46.000000000 +0300 @@ -119,6 +119,8 @@ int ipoib_vlan_add(struct net_device *pd goto sysfs_failed; if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; + if (ipoib_add_umcast_attr(priv->dev)) + goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_parent)) goto sysfs_failed; From monis at 
voltaire.com Thu Sep 20 06:33:06 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:33:06 +0300 Subject: [ofa-general] [PATCH V5 0/11] net/bonding: ADD IPoIB support for the bonding driver Message-ID: <46F27692.3070404@voltaire.com> This patch series is the fifth version (see the link to V4 below) of the suggested changes to the bonding driver so it would be able to support non-ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. Patches 1-10 were originally submitted in V4 and patch 11 is an addition by Jay. Jay, The bonding patches you acked remain unchanged while I guess I still need to get an official ack from Roland for the IPoIB patches. Is it OK with you to push the entire series to the networking tree? Roland has already agreed to do so. Major changes from the previous version: ---------------------------------------- 1. Style changes 2. IPoIB - notify slave detach on vlan delete 3. Add function to net/core for slave detach instead of having it only in ib/ipoib 4. IPoIB - handle ib device and bonding device the same way in neigh_cleanup function Links to earlier discussion: ---------------------------- 1. A discussion in netdev about bonding support for IPoIB. http://lists.openwall.net/netdev/2006/11/30/46 2. V4 series http://lists.openfabrics.org/pipermail/general/2007-August/039825.html From monis at voltaire.com Thu Sep 20 06:39:25 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:39:25 +0300 Subject: [ofa-general] [PATCH V5 1/11] net/core: add a netdev notification for slave detach In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F2780D.5090702@voltaire.com> A slave of a bonding master that wants to send a notification before going down should call netdev_slave_detach(). The handling of this notification will be done outside the context of unregister_netdevice(), which is sometimes necessary, as with an IPoIB slave, for example.
Signed-off-by: Moni Shoua --- include/linux/if.h | 1 + net/core/dev.c | 20 ++++++++++++++++++++ 2 files changed, 21 insertions(+) Index: net-2.6/net/core/dev.c =================================================================== --- net-2.6.orig/net/core/dev.c 2007-09-20 08:04:47.164051688 +0200 +++ net-2.6/net/core/dev.c 2007-09-20 09:20:21.493060579 +0200 @@ -2588,6 +2588,25 @@ int netdev_set_master(struct net_device return 0; } +/** + * netdev_slave_detach - notify that slave is about to detach from master + * @slave: slave device + * + * Raise a flag that slave is about to detach from master + * and notify the netdev chain. + * The caller must hold the rtnl_mutex. + */ + +int netdev_slave_detach(struct net_device *slave) +{ + int ret = 0; + if (slave->flags & IFF_SLAVE) { + slave->priv_flags |= IFF_SLAVE_DETACH; + ret = call_netdevice_notifiers(NETDEV_CHANGE, slave); + } + return ret; +} + static void __dev_set_promiscuity(struct net_device *dev, int inc) { unsigned short old_flags = dev->flags; @@ -4120,6 +4139,7 @@ EXPORT_SYMBOL(dev_set_mac_address); EXPORT_SYMBOL(free_netdev); EXPORT_SYMBOL(netdev_boot_setup_check); EXPORT_SYMBOL(netdev_set_master); +EXPORT_SYMBOL(netdev_slave_detach); EXPORT_SYMBOL(netdev_state_change); EXPORT_SYMBOL(netif_receive_skb); EXPORT_SYMBOL(netif_rx); Index: net-2.6/include/linux/if.h =================================================================== --- net-2.6.orig/include/linux/if.h 2007-09-20 08:04:47.164051688 +0200 +++ net-2.6/include/linux/if.h 2007-09-20 08:15:29.577729301 +0200 @@ -61,6 +61,7 @@ #define IFF_MASTER_ALB 0x10 /* bonding master, balance-alb. 
*/ #define IFF_BONDING 0x20 /* bonding master or slave */ #define IFF_SLAVE_NEEDARP 0x40 /* need ARPs for validation */ +#define IFF_SLAVE_DETACH 0x80 /* slave is about to unregister */ #define IF_GET_IFACE 0x0001 /* for querying only */ #define IF_GET_PROTO 0x0002 From monis at voltaire.com Thu Sep 20 06:40:28 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:40:28 +0300 Subject: [ofa-general] [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F2784C.9070806@voltaire.com> When the bonding device enslaves IPoIB devices it takes pointers to functions in the ib_ipoib module. This is fine as long as the ib_ipoib module remains loaded while the references to its functions exist. So, to help bonding do a cleanup in time, when the IPoIB net device is a slave of a bonding master, let the master know that the IPoIB device is about to unregister (but before calling unregister).
Signed-off-by: Moni Shoua --- drivers/infiniband/ulp/ipoib/ipoib.h | 7 +++++++ drivers/infiniband/ulp/ipoib/ipoib_main.c | 3 +++ drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 1 + 3 files changed, 11 insertions(+) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-20 08:35:34.000000000 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-20 14:20:16.495147879 +0200 @@ -48,6 +48,7 @@ #include #include +#include MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); @@ -921,6 +922,7 @@ void ipoib_dev_cleanup(struct net_device /* Delete any child interfaces first */ list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + ipoib_slave_detach(cpriv->dev); unregister_netdev(cpriv->dev); ipoib_dev_cleanup(cpriv->dev); free_netdev(cpriv->dev); @@ -1208,6 +1210,7 @@ static void ipoib_remove_one(struct ib_d ib_unregister_event_handler(&priv->event_handler); flush_scheduled_work(); + ipoib_slave_detach(priv->dev); unregister_netdev(priv->dev); ipoib_dev_cleanup(priv->dev); free_netdev(priv->dev); Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2007-09-20 09:26:11.000000000 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2007-09-20 09:27:20.182709679 +0200 @@ -157,6 +157,7 @@ int ipoib_vlan_delete(struct net_device mutex_lock(&ppriv->vlan_mutex); list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { if (priv->pkey == pkey) { + ipoib_slave_detach(priv->dev); unregister_netdev(priv->dev); ipoib_dev_cleanup(priv->dev); list_del(&priv->list); Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-20 
12:18:56.000000000 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-20 14:21:47.385972207 +0200 @@ -570,6 +570,13 @@ static inline void ipoib_cm_handle_rx_wc #endif +static inline void ipoib_slave_detach(struct net_device *dev) +{ + rtnl_lock(); + netdev_slave_detach(dev); + rtnl_unlock(); +} + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG void ipoib_create_debug_files(struct net_device *dev); void ipoib_delete_debug_files(struct net_device *dev); From monis at voltaire.com Thu Sep 20 06:41:37 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:41:37 +0300 Subject: [ofa-general] [PATCH V5 3/11] IB/ipoib: Bound the net device to the ipoib_neigh structure In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F27891.5010909@voltaire.com> IPoIB uses a two layer neighboring scheme, such that for each struct neighbour whose device is an ipoib one, there is a struct ipoib_neigh buddy which is created on demand at the tx flow by an ipoib_neigh_alloc(skb->dst->neighbour) call. When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device. On the tx flow the bonding code gets an skb such that skb->dev points to the master device; it changes this skb to point to the slave device and calls the slave hard_start_xmit function. Under this scheme, the ipoib_neigh_destructor assumption that for each struct neighbour it gets, n->dev is an ipoib device and hence netdev_priv(n->dev) can be cast to struct ipoib_dev_priv is buggy. To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one, when n->dev->flags has the IFF_MASTER bit set.
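The ownership problem described above can be shown with a small user-space sketch. All names here (fake_dev, fake_ipoib_neigh, fake_neighbour, cleanup_owner) are hypothetical stand-ins, not kernel types. The sketch follows the diff as posted in this version, where the cleanup path takes the device from the ipoib_neigh whenever one exists, and it assumes nothing about the real structure layouts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct fake_dev {
	const char *name;
	int is_ipoib;           /* 1 if the private area is an ipoib one */
};

/* Stand-in for struct ipoib_neigh, with the dev field the patch adds. */
struct fake_ipoib_neigh {
	struct fake_dev *dev;   /* the IPoIB device that allocated this entry */
};

/* Stand-in for struct neighbour: under bonding, dev is the bond master. */
struct fake_neighbour {
	struct fake_dev *dev;
	struct fake_ipoib_neigh *ipoib;
};

/*
 * Pick the device whose private data the cleanup code may safely use.
 * n->dev may be the bonding master, so it is never consulted here;
 * only the device recorded at allocation time is returned.
 * Returns NULL when no ipoib_neigh was ever attached.
 */
static struct fake_dev *cleanup_owner(const struct fake_neighbour *n)
{
	assert(n != NULL);
	if (!n->ipoib)
		return NULL;
	return n->ipoib->dev;
}
```

With a neighbour created on behalf of bond0 and an ipoib_neigh allocated by ib0, cleanup_owner() returns ib0, which is the device whose private area really is IPoIB state.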
Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/infiniband/ulp/ipoib/ipoib.h | 4 +++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++--------- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++- 3 files changed, 20 insertions(+), 11 deletions(-) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 17:09:26.534874404 +0200 @@ -328,6 +328,7 @@ struct ipoib_neigh { struct sk_buff_head queue; struct neighbour *neighbour; + struct net_device *dev; struct list_head list; }; @@ -344,7 +345,8 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh, + struct net_device *dev); void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:23:54.725744661 +0200 @@ -511,7 +511,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = ipoib_neigh_alloc(skb->dst->neighbour); + neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -830,6 +830,13 @@ static void ipoib_neigh_cleanup(struct n unsigned long flags; struct ipoib_ah *ah = NULL; + neigh = *to_ipoib_neigh(n); + if (neigh) { + priv = netdev_priv(neigh->dev); + ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n", + n->dev->name); + } else + return; 
ipoib_dbg(priv, "neigh_cleanup for %06x " IPOIB_GID_FMT "\n", IPOIB_QPN(n->ha), @@ -837,13 +844,10 @@ static void ipoib_neigh_cleanup(struct n spin_lock_irqsave(&priv->lock, flags); - neigh = *to_ipoib_neigh(n); - if (neigh) { - if (neigh->ah) - ah = neigh->ah; - list_del(&neigh->list); - ipoib_neigh_free(n->dev, neigh); - } + if (neigh->ah) + ah = neigh->ah; + list_del(&neigh->list); + ipoib_neigh_free(n->dev, neigh); spin_unlock_irqrestore(&priv->lock, flags); @@ -851,7 +855,8 @@ static void ipoib_neigh_cleanup(struct n ipoib_put_ah(ah); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour, + struct net_device *dev) { struct ipoib_neigh *neigh; @@ -860,6 +865,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(st return NULL; neigh->neighbour = neighbour; + neigh->dev = dev; *to_ipoib_neigh(neighbour) = neigh; skb_queue_head_init(&neigh->queue); ipoib_cm_set(neigh, NULL); Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:09:26.536874045 +0200 @@ -727,7 +727,8 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour, + skb->dev); if (neigh) { kref_get(&mcast->ah->ref); From monis at voltaire.com Thu Sep 20 06:42:42 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:42:42 +0300 Subject: [ofa-general] [PATCH V5 4/11] IB/ipoib: Verify address handle validity on send In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F278D2.6080703@voltaire.com> When the bonding device senses a carrier loss of its active 
slave it replaces that slave with a new one. In between the times when the carrier of an IPoIB device goes down and ipoib_neigh is destroyed, it is possible that the bonding driver will send a packet on a new slave that uses an old ipoib_neigh. This patch detects and prevents this from happening. Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:09:26.535874225 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:10:22.375853147 +0200 @@ -686,9 +686,10 @@ static int ipoib_start_xmit(struct sk_bu goto out; } } else if (neigh->ah) { - if (unlikely(memcmp(&neigh->dgid.raw, + if (unlikely((memcmp(&neigh->dgid.raw, skb->dst->neighbour->ha + 4, - sizeof(union ib_gid)))) { + sizeof(union ib_gid))) || + (neigh->dev != dev))) { spin_lock(&priv->lock); /* * It's safe to call ipoib_put_ah() inside From monis at voltaire.com Thu Sep 20 06:43:28 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:43:28 +0300 Subject: [ofa-general] [PATCH V5 5/11] net/bonding: Enable bonding to enslave non ARPHRD_ETHER In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F27900.1010004@voltaire.com> This patch changes some of the bond netdevice attributes and functions to be that of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. Basically it overrides those settings made by ether_setup(), which are netdevice **type** dependent and hence might not be appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves from dissimilar ether types, as was concluded over the v1 discussion.
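The type rule described above (the first slave decides the bond type, later slaves must match it) can be sketched in user space as follows. struct fake_bond and fake_check_slave_type() are hypothetical names used only for illustration; the real check sits inside bond_enslave() and also copies the header functions from the slave, which this sketch leaves out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* ARP hardware type values as defined in linux/if_arp.h. */
#define ARPHRD_ETHER      1
#define ARPHRD_INFINIBAND 32

/* Hypothetical, reduced stand-in for the bond master state. */
struct fake_bond {
	int type;       /* starts as ARPHRD_ETHER, as set by ether_setup() */
	int slave_cnt;  /* number of slaves already enslaved */
};

/*
 * Decide whether a slave of the given type may join the bond.
 * The first non-Ethernet slave makes the bond adopt its type; any
 * later slave of a different type is refused with -EINVAL.
 */
static int fake_check_slave_type(struct fake_bond *bond, int slave_type)
{
	assert(bond != NULL);
	if (bond->slave_cnt == 0) {
		if (slave_type != ARPHRD_ETHER)
			bond->type = slave_type;
	} else if (bond->type != slave_type) {
		return -EINVAL;
	}
	bond->slave_cnt++;
	return 0;
}
```

Starting from an empty Ethernet-typed bond, enslaving an InfiniBand device switches the bond to ARPHRD_INFINIBAND, after which an Ethernet slave is rejected and a second InfiniBand slave is accepted.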
IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3 bytes IB QP (Queue Pair) number and 16 bytes IB port GID (Global ID) of the port this IPoIB device is bounded to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (i have omitted here some details which are not important for the bonding RFC). Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 files changed, 39 insertions(+) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:08:59.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:13.424688411 +0300 @@ -1237,6 +1237,26 @@ static int bond_compute_features(struct return 0; } + +static void bond_setup_by_slave(struct net_device *bond_dev, + struct net_device *slave_dev) +{ + bond_dev->hard_header = slave_dev->hard_header; + bond_dev->rebuild_header = slave_dev->rebuild_header; + bond_dev->hard_header_cache = slave_dev->hard_header_cache; + bond_dev->header_cache_update = slave_dev->header_cache_update; + bond_dev->hard_header_parse = slave_dev->hard_header_parse; + + bond_dev->neigh_setup = slave_dev->neigh_setup; + + bond_dev->type = slave_dev->type; + bond_dev->hard_header_len = slave_dev->hard_header_len; + bond_dev->addr_len = slave_dev->addr_len; + + memcpy(bond_dev->broadcast, slave_dev->broadcast, + slave_dev->addr_len); +} + /* enslave device to bond device */ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) { @@ -1311,6 +1331,25 @@ int bond_enslave(struct net_device *bond goto err_undo_flags; } + /* set bonding device ether type by slave - bonding netdevices are + * created with ether_setup, so when the slave type is not ARPHRD_ETHER + * there is a need to override some of the type dependent attribs/funcs. 
+ * + * bond ether type mutual exclusion - don't allow slaves of dissimilar + * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond + */ + if (bond->slave_cnt == 0) { + if (slave_dev->type != ARPHRD_ETHER) + bond_setup_by_slave(bond_dev, slave_dev); + } else if (bond_dev->type != slave_dev->type) { + printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different " + "from other slaves (%d), can not enslave it.\n", + slave_dev->name, + slave_dev->type, bond_dev->type); + res = -EINVAL; + goto err_undo_flags; + } + if (slave_dev->set_mac_address == NULL) { printk(KERN_ERR DRV_NAME ": %s: Error: The slave device you specified does " From tziporet at dev.mellanox.co.il Thu Sep 20 06:46:44 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 20 Sep 2007 15:46:44 +0200 Subject: [ofa-general] Re: [ewg] Re: ofed-1.3 daily build package's content In-Reply-To: <200709181533.14764.hnguyen@linux.vnet.ibm.com> References: <200709171711.09316.hnguyen@linux.vnet.ibm.com> <20070918060952.GI24414@mellanox.co.il> <46EF9973.6020703@mellanox.co.il> <200709181533.14764.hnguyen@linux.vnet.ibm.com> Message-ID: <46F279C4.2010900@mellanox.co.il> Hoang-Nam Nguyen wrote: > Hi Tziporet! > > Hope the patch below helps. > Nam > > > Thanks - I submitted it and it solved the problem. Tziporet From monis at voltaire.com Thu Sep 20 06:45:40 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:45:40 +0300 Subject: [ofa-general] [PATCH V5 6/11] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address() In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F27984.4020401@voltaire.com> This patch allows for enslaving netdevices which do not support the set_mac_address() function. 
In that case the bond mac address is the one of the active slave, where remote peers are notified on the mac address (neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code). Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 87 +++++++++++++++++++++++++++------------- drivers/net/bonding/bonding.h | 1 2 files changed, 60 insertions(+), 28 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:54:13.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.971632881 +0300 @@ -1095,6 +1095,14 @@ void bond_change_active_slave(struct bon if (new_active) { bond_set_slave_active_flags(new_active); } + + /* when bonding does not set the slave MAC address, the bond MAC + * address is the one of the active slave. + */ + if (new_active && !bond->do_set_mac_addr) + memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, + new_active->dev->addr_len); + bond_send_gratuitous_arp(bond); } } @@ -1351,13 +1359,22 @@ int bond_enslave(struct net_device *bond } if (slave_dev->set_mac_address == NULL) { - printk(KERN_ERR DRV_NAME - ": %s: Error: The slave device you specified does " - "not support setting the MAC address. " - "Your kernel likely does not support slave " - "devices.\n", bond_dev->name); - res = -EOPNOTSUPP; - goto err_undo_flags; + if (bond->slave_cnt == 0) { + printk(KERN_WARNING DRV_NAME + ": %s: Warning: The first slave device you " + "specified does not support setting the MAC " + "address. This bond MAC address would be that " + "of the active slave.\n", bond_dev->name); + bond->do_set_mac_addr = 0; + } else if (bond->do_set_mac_addr) { + printk(KERN_ERR DRV_NAME + ": %s: Error: The slave device you specified " + "does not support setting the MAC addres,." + "but this bond uses this practice. 
\n" + , bond_dev->name); + res = -EOPNOTSUPP; + goto err_undo_flags; + } } new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL); @@ -1378,16 +1395,18 @@ int bond_enslave(struct net_device *bond */ memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); - /* - * Set slave to master's mac address. The application already - * set the master's mac address to that of the first slave - */ - memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); - addr.sa_family = slave_dev->type; - res = dev_set_mac_address(slave_dev, &addr); - if (res) { - dprintk("Error %d calling set_mac_address\n", res); - goto err_free; + if (bond->do_set_mac_addr) { + /* + * Set slave to master's mac address. The application already + * set the master's mac address to that of the first slave + */ + memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); + addr.sa_family = slave_dev->type; + res = dev_set_mac_address(slave_dev, &addr); + if (res) { + dprintk("Error %d calling set_mac_address\n", res); + goto err_free; + } } res = netdev_set_master(slave_dev, bond_dev); @@ -1612,9 +1631,11 @@ err_close: dev_close(slave_dev); err_restore_mac: - memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } err_free: kfree(new_slave); @@ -1792,10 +1813,12 @@ int bond_release(struct net_device *bond /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address */ - memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address */ + memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + 
dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB | IFF_SLAVE_INACTIVE | IFF_BONDING | @@ -1882,10 +1905,12 @@ static int bond_release_all(struct net_d /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address*/ - memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address*/ + memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB | IFF_SLAVE_INACTIVE); @@ -3922,6 +3947,9 @@ static int bond_set_mac_address(struct n dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None")); + if (!bond->do_set_mac_addr) + return -EOPNOTSUPP; + if (!is_valid_ether_addr(sa->sa_data)) { return -EADDRNOTAVAIL; } @@ -4312,6 +4340,9 @@ static int bond_init(struct net_device * bond_create_proc_entry(bond); #endif + /* set do_set_mac_addr to true on startup */ + bond->do_set_mac_addr = 1; + list_add_tail(&bond->bond_list, &bond_dev_list); return 0; Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:08:58.000000000 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-08-15 10:55:34.359354833 +0300 @@ -185,6 +185,7 @@ struct bonding { struct timer_list mii_timer; struct timer_list arp_timer; s8 kill_timers; + s8 do_set_mac_addr; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; From ogerlitz at voltaire.com Thu Sep 20 06:52:43 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 20 Sep 2007 15:52:43 +0200 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: 
<46F006F5.5090801@ichips.intel.com> References: <46F006F5.5090801@ichips.intel.com> Message-ID: <46F27B2B.9000707@voltaire.com> Sean Hefty wrote: > Yes - this is possible. Note that although the group reference count is > 2, joins are tracked in different lists: active_list or pending_list. > The second join doesn't move to the active_list until it's processed by > the callback thread, to synchronize against errors and leaves. I see, however, there is no second join here, it's a leave and join where the group refcount climbs to 2 since the join code increments it in its synchronous part which is executed before the thread handles the processing of the leave request. >> Following that the leave work-element causes the thread to just dec the >> reference count to 1 in release_group() and do nothing else, and the join >> work-element causes the thread to return the cached address-handle attributes >> to the consumer. So no sa query is being sent to the SA. > This sounds like the correct behavior. I am not sure this is what we want from the core design. Say the consumer has some flexibility in the join request (eg through future api change), such that they can join a group, leave it, then join again this group with different "attributes". Then if the join crosses the leave in a way that causes the core code not to issue sa leave/join queries, it's a bug from the perspective of the user. > Does the SA remove the node from the multicast group? If the HCA port > goes down, the multicast code will transition all existing multicast > groups to the error state. An error will be reported on active joins. > Pending joins will be processed normally after error handling has > completed. OK, on this specific host system there was no port down event! So the only event that the multicast and ipoib code got was port active.
This is why the patch I sent solves (hides) the problem, it causes the multicast code to transition the group into the error state, so the ipoib join that follows causes an sa join query to be actually sent. > I'm wondering if the problem isn't in ipoib. When an error occurs on a > multicast group, the group transitions into the error state, and the > user is called back to let them know that they need to rejoin the group. > Since ipoib responds directly to port events and not multicast callback > errors, is there a chance ipoib missed the error notification? I don't think there's a problem in ipoib, it just does not rely on multicast error notifications but rather on port events. Do you think it's less robust, and if yes, why? We decided to use port notifications also in a user space multicast app, since the multicast notification is delivered also on port error and as long as the port is down, we can't really join, so instead of implementing timer/retries, we join on port events (active, sm/my lid change, client re-register, etc). Or. From monis at voltaire.com Thu Sep 20 06:58:58 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:58:58 +0300 Subject: [ofa-general] [PATCH V5 7/11] net/bonding: Enable IP multicast for bonding IPoIB devices In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F27CA2.5050600@voltaire.com> Allow enslaving devices when the bonding device is not up. In the discussion held at the previous post this seemed to be the cleanest way to go, where it is not expected to cause instabilities. Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() has set the bonding device type to be ARPHRD_ETHER and address len to be ETH_ALEN, the net core code computes a wrong multicast link address.
This is b/c ip_eth_mc_map() is called where for multicast joins taking place after the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND) Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 5 +++-- drivers/net/bonding/bond_sysfs.c | 6 ++---- 2 files changed, 5 insertions(+), 6 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:55:48.431862446 +0300 @@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond /* bond must be initialized by bond_open() before enslaving */ if (!(bond_dev->flags & IFF_UP)) { - dprintk("Error, master_dev is not up\n"); - return -EPERM; + printk(KERN_WARNING DRV_NAME + " %s: master_dev is not up in bond_enslave\n", + bond_dev->name); } /* already enslaved */ Index: net-2.6/drivers/net/bonding/bond_sysfs.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:08:58.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:55:48.432862269 +0300 @@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru /* Quick sanity check -- is the bond interface up? */ if (!(bond->dev->flags & IFF_UP)) { - printk(KERN_ERR DRV_NAME - ": %s: Unable to update slaves because interface is down.\n", + printk(KERN_WARNING DRV_NAME + ": %s: doing slave updates when interface is down.\n", bond->dev->name); - ret = -EPERM; - goto out; } /* Note: We can't hold bond->lock here, as bond_create grabs it. 
*/ From monis at voltaire.com Thu Sep 20 07:00:59 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:00:59 +0200 Subject: [ofa-general] [PATCH V5 8/11] net/bonding: Handlle wrong assumptions that slave is always an Ethernet device In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F27D1B.90203@voltaire.com> bonding sometimes uses Ethernet constants (such as MTU and address length) which are not good when it enslaves non Ethernet devices (such as InfiniBand). Signed-off-by: Moni Shoua --- drivers/net/bonding/bond_main.c | 3 ++- drivers/net/bonding/bond_sysfs.c | 19 +++++++++++++------ drivers/net/bonding/bonding.h | 1 + 3 files changed, 16 insertions(+), 7 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:55:48.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-20 14:29:11.911298577 +0300 @@ -1224,7 +1224,8 @@ static int bond_compute_features(struct struct slave *slave; struct net_device *bond_dev = bond->dev; unsigned long features = bond_dev->features; - unsigned short max_hard_header_len = ETH_HLEN; + unsigned short max_hard_header_len = max((u16)ETH_HLEN, + bond_dev->hard_header_len); int i; features &= ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES); Index: net-2.6/drivers/net/bonding/bond_sysfs.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:55:48.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-08-15 12:14:41.152469089 +0300 @@ -164,9 +164,7 @@ static ssize_t bonding_store_bonds(struc printk(KERN_INFO DRV_NAME ": %s is being deleted...\n", bond->dev->name); - bond_deinit(bond->dev); - bond_destroy_sysfs_entry(bond); - unregister_netdevice(bond->dev); + bond_destroy(bond); rtnl_unlock(); goto out; } @@ -260,6 +258,7 @@ static ssize_t 
bonding_store_slaves(stru char command[IFNAMSIZ + 1] = { 0, }; char *ifname; int i, res, found, ret = count; + u32 original_mtu; struct slave *slave; struct net_device *dev = NULL; struct bonding *bond = to_bond(d); @@ -325,6 +324,7 @@ static ssize_t bonding_store_slaves(stru } /* Set the slave's MTU to match the bond */ + original_mtu = dev->mtu; if (dev->mtu != bond->dev->mtu) { if (dev->change_mtu) { res = dev->change_mtu(dev, @@ -339,6 +339,9 @@ static ssize_t bonding_store_slaves(stru } rtnl_lock(); res = bond_enslave(bond->dev, dev); + bond_for_each_slave(bond, slave, i) + if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) + slave->original_mtu = original_mtu; rtnl_unlock(); if (res) { ret = res; @@ -351,13 +354,17 @@ static ssize_t bonding_store_slaves(stru bond_for_each_slave(bond, slave, i) if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) { dev = slave->dev; + original_mtu = slave->original_mtu; break; } if (dev) { printk(KERN_INFO DRV_NAME ": %s: Removing slave %s\n", bond->dev->name, dev->name); rtnl_lock(); - res = bond_release(bond->dev, dev); + if (bond->setup_by_slave) + res = bond_release_and_destroy(bond->dev, dev); + else + res = bond_release(bond->dev, dev); rtnl_unlock(); if (res) { ret = res; @@ -365,9 +372,9 @@ static ssize_t bonding_store_slaves(stru } /* set the slave MTU to the default */ if (dev->change_mtu) { - dev->change_mtu(dev, 1500); + dev->change_mtu(dev, original_mtu); } else { - dev->mtu = 1500; + dev->mtu = original_mtu; } } else { Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:55:34.000000000 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-08-20 14:29:11.912298402 +0300 @@ -156,6 +156,7 @@ struct slave { s8 link; /* one of BOND_LINK_XXXX */ s8 state; /* one of BOND_STATE_XXXX */ u32 original_flags; + u32 original_mtu; u32 link_failure_count; u16 speed; u8 duplex; From monis at voltaire.com 
Thu Sep 20 07:02:51 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:02:51 +0200 Subject: [ofa-general] PATCH V5 9/11] net/bonding: Delay sending of gratuitous ARP to avoid failure In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F27D8B.1090808@voltaire.com> Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit in dev->state field is on. This improves the chances for the arp packet to be transmitted. Signed-off-by: Moni Shoua --- drivers/net/bonding/bond_main.c | 24 +++++++++++++++++++++--- drivers/net/bonding/bonding.h | 1 + 2 files changed, 22 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:56:33.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 11:04:37.221123652 +0300 @@ -1102,8 +1102,14 @@ void bond_change_active_slave(struct bon if (new_active && !bond->do_set_mac_addr) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); - - bond_send_gratuitous_arp(bond); + if (bond->curr_active_slave && + test_bit(__LINK_STATE_LINKWATCH_PENDING, + &bond->curr_active_slave->dev->state)) { + dprintk("delaying gratuitous arp on %s\n", + bond->curr_active_slave->dev->name); + bond->send_grat_arp = 1; + } else + bond_send_gratuitous_arp(bond); } } @@ -2083,6 +2089,17 @@ void bond_mii_monitor(struct net_device * program could monitor the link itself if needed. 
*/ + if (bond->send_grat_arp) { + if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, + &bond->curr_active_slave->dev->state)) + dprintk("Needs to send gratuitous arp but not yet\n"); + else { + dprintk("sending delayed gratuitous arp on on %s\n", + bond->curr_active_slave->dev->name); + bond_send_gratuitous_arp(bond); + bond->send_grat_arp = 0; + } + } read_lock(&bond->curr_slave_lock); oldcurrent = bond->curr_active_slave; read_unlock(&bond->curr_slave_lock); @@ -2484,7 +2501,7 @@ static void bond_send_gratuitous_arp(str if (bond->master_ip) { bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip, - bond->master_ip, 0); + bond->master_ip, 0); } list_for_each_entry(vlan, &bond->vlan_list, vlan_list) { @@ -4293,6 +4310,7 @@ static int bond_init(struct net_device * bond->current_arp_slave = NULL; bond->primary_slave = NULL; bond->dev = bond_dev; + bond->send_grat_arp = 0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:56:33.000000000 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-08-15 11:05:41.516451497 +0300 @@ -187,6 +187,7 @@ struct bonding { struct timer_list arp_timer; s8 kill_timers; s8 do_set_mac_addr; + s8 send_grat_arp; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; From monis at voltaire.com Thu Sep 20 07:04:03 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:04:03 +0200 Subject: [ofa-general] [PATCH V5 10/11] net/bonding: Destroy bonding master when last slave is gone In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F27DD3.4080806@voltaire.com> When bonding enslaves non Ethernet devices it takes pointers to functions in the module that owns the slaves. 
In this case it becomes unsafe to keep the bonding master registered after the last slave was unenslaved because we don't know if the pointers are still valid. Destroying the bond when slave_cnt is zero ensures that these functions are not used anymore. Signed-off-by: Moni Shoua --- drivers/net/bonding/bond_main.c | 45 +++++++++++++++++++++++++++++++++++++++- drivers/net/bonding/bonding.h | 3 ++ 2 files changed, 47 insertions(+), 1 deletion(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-20 14:43:17.123702132 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-20 14:43:17.850571535 +0300 @@ -1256,6 +1256,7 @@ static int bond_compute_features(struct static void bond_setup_by_slave(struct net_device *bond_dev, struct net_device *slave_dev) { + struct bonding *bond = bond_dev->priv; bond_dev->hard_header = slave_dev->hard_header; bond_dev->rebuild_header = slave_dev->rebuild_header; bond_dev->hard_header_cache = slave_dev->hard_header_cache; @@ -1270,6 +1271,7 @@ static void bond_setup_by_slave(struct n memcpy(bond_dev->broadcast, slave_dev->broadcast, slave_dev->addr_len); + bond->setup_by_slave = 1; } /* enslave device to bond device */ @@ -1838,6 +1840,35 @@ int bond_release(struct net_device *bond } /* +* Destroy a bonding device. +* Must be under rtnl_lock when this function is called. +*/ +void bond_destroy(struct bonding *bond) +{ + bond_deinit(bond->dev); + bond_destroy_sysfs_entry(bond); + unregister_netdevice(bond->dev); +} + +/* +* First release a slave and than destroy the bond if no more slaves iare left. +* Must be under rtnl_lock when this function is called. 
+*/ +int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev) +{ + struct bonding *bond = bond_dev->priv; + int ret; + + ret = bond_release(bond_dev, slave_dev); + if ((ret == 0) && (bond->slave_cnt == 0)) { + printk(KERN_INFO DRV_NAME " %s: destroying bond for.\n", + bond_dev->name); + bond_destroy(bond); + } + return ret; +} + +/* * This function releases all slaves. */ static int bond_release_all(struct net_device *bond_dev) @@ -3322,7 +3353,11 @@ static int bond_slave_netdev_event(unsig switch (event) { case NETDEV_UNREGISTER: if (bond_dev) { - bond_release(bond_dev, slave_dev); + dprintk("slave %s unregisters\n", slave_dev->name); + if (bond->setup_by_slave) + bond_release_and_destroy(bond_dev, slave_dev); + else + bond_release(bond_dev, slave_dev); } break; case NETDEV_CHANGE: @@ -3331,6 +3366,13 @@ static int bond_slave_netdev_event(unsig * sets up a hierarchical bond, then rmmod's * one of the slave bonding devices? */ + if (slave_dev->priv_flags & IFF_SLAVE_DETACH) { + dprintk("slave %s detaching\n", slave_dev->name); + if (bond->setup_by_slave) + bond_release_and_destroy(bond_dev, slave_dev); + else + bond_release(bond_dev, slave_dev); + } break; case NETDEV_DOWN: /* @@ -4311,6 +4353,7 @@ static int bond_init(struct net_device * bond->primary_slave = NULL; bond->dev = bond_dev; bond->send_grat_arp = 0; + bond->setup_by_slave = 0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-20 14:43:17.123702132 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-08-20 14:47:52.845180870 +0300 @@ -188,6 +188,7 @@ struct bonding { s8 kill_timers; s8 do_set_mac_addr; s8 send_grat_arp; + s8 setup_by_slave; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; @@ -295,6 +296,8 @@ static inline void 
bond_unset_master_alb struct vlan_entry *bond_next_vlan(struct bonding *bond, struct vlan_entry *curr); int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, struct net_device *slave_dev); int bond_create(char *name, struct bond_params *params, struct bonding **newbond); +void bond_destroy(struct bonding *bond); +int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev); void bond_deinit(struct net_device *bond_dev); int bond_create_sysfs(void); void bond_destroy_sysfs(void); From monis at voltaire.com Thu Sep 20 07:06:34 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:06:34 +0200 Subject: [ofa-general] [PATCH V5 5/11] net/bonding: Enable bonding to enslave non ARPHRD_ETHER In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F27E6A.5000101@voltaire.com> This patch changes some of the bond netdevice attributes and functions to be those of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. Basically it overrides those settings done by ether_setup(), which are netdevice **type** dependent and hence might not be appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves from dissimilar ether types, as was concluded over the v1 discussion. IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3-byte IB QP (Queue Pair) number and a 16-byte IB port GID (Global ID) of the port this IPoIB device is bound to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (I have omitted here some details which are not important for the bonding RFC). 
Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 files changed, 39 insertions(+) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:08:59.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:13.424688411 +0300 @@ -1237,6 +1237,26 @@ static int bond_compute_features(struct return 0; } + +static void bond_setup_by_slave(struct net_device *bond_dev, + struct net_device *slave_dev) +{ + bond_dev->hard_header = slave_dev->hard_header; + bond_dev->rebuild_header = slave_dev->rebuild_header; + bond_dev->hard_header_cache = slave_dev->hard_header_cache; + bond_dev->header_cache_update = slave_dev->header_cache_update; + bond_dev->hard_header_parse = slave_dev->hard_header_parse; + + bond_dev->neigh_setup = slave_dev->neigh_setup; + + bond_dev->type = slave_dev->type; + bond_dev->hard_header_len = slave_dev->hard_header_len; + bond_dev->addr_len = slave_dev->addr_len; + + memcpy(bond_dev->broadcast, slave_dev->broadcast, + slave_dev->addr_len); +} + /* enslave device to bond device */ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) { @@ -1311,6 +1331,25 @@ int bond_enslave(struct net_device *bond goto err_undo_flags; } + /* set bonding device ether type by slave - bonding netdevices are + * created with ether_setup, so when the slave type is not ARPHRD_ETHER + * there is a need to override some of the type dependent attribs/funcs. 
+ * + * bond ether type mutual exclusion - don't allow slaves of dissimilar + * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond + */ + if (bond->slave_cnt == 0) { + if (slave_dev->type != ARPHRD_ETHER) + bond_setup_by_slave(bond_dev, slave_dev); + } else if (bond_dev->type != slave_dev->type) { + printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different " + "from other slaves (%d), can not enslave it.\n", + slave_dev->name, + slave_dev->type, bond_dev->type); + res = -EINVAL; + goto err_undo_flags; + } + if (slave_dev->set_mac_address == NULL) { printk(KERN_ERR DRV_NAME ": %s: Error: The slave device you specified does " From monis at voltaire.com Thu Sep 20 07:07:22 2007 From: monis at voltaire.com (Moni Shoua) Date: Thu, 20 Sep 2007 16:07:22 +0200 Subject: [ofa-general] [PATCH 11/11] bonding: Optionally allow ethernet slaves to keep own MAC In-Reply-To: <46F27692.3070404@voltaire.com> References: <46F27692.3070404@voltaire.com> Message-ID: <46F27E9A.2010108@voltaire.com> Update the "don't change MAC of slaves" functionality added in previous changes to be a generic option, rather than something tied to IB devices, as it's occasionally useful for regular ethernet devices as well. Adds "fail_over_mac" option (which is automatically enabled for IB slaves), applicable only to active-backup mode. Includes documentation update. Updates bonding driver version to 3.2.0. 
Signed-off-by: Jay Vosburgh --- Documentation/networking/bonding.txt | 33 +++++++++++++++++++ drivers/net/bonding/bond_main.c | 57 +++++++++++++++++++++------------ drivers/net/bonding/bond_sysfs.c | 49 +++++++++++++++++++++++++++++ drivers/net/bonding/bonding.h | 6 ++-- 4 files changed, 121 insertions(+), 24 deletions(-) diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 1da5666..1134062 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -281,6 +281,39 @@ downdelay will be rounded down to the nearest multiple. The default value is 0. +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address (the traditional behavior), or, when + enabled, change the bond's MAC address when changing the + active interface (i.e., fail over the MAC address itself). + + Fail over MAC is useful for devices that cannot ever alter + their MAC address, or for devices that refuse incoming + broadcasts with their own source MAC (which interferes with + the ARP monitor). + + The down side of fail over MAC is that every device on the + network must be updated via gratuitous ARP, vs. just updating + a switch or set of switches (which often takes place for any + traffic, not just ARP traffic, if the switch snoops incoming + traffic to update its tables) for the traditional method. If + the gratuitous ARP is lost, communication may be disrupted. + + When fail over MAC is used in conjuction with the mii monitor, + devices which assert link up prior to being able to actually + transmit and receive are particularly susecptible to loss of + the gratuitous ARP, and an appropriate updelay setting may be + required. + + A value of 0 disables fail over MAC, and is the default. A + value of 1 enables fail over MAC. This option is enabled + automatically if the first slave added cannot change its MAC + address. 
This option may be modified via sysfs only when no + slaves are present in the bond. + + This option was added in bonding version 3.2.0. + lacp_rate Option specifying the rate in which we'll ask our link partner diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 77caca3..c01ff9d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -97,6 +97,7 @@ static char *xmit_hash_policy = NULL; static int arp_interval = BOND_LINK_ARP_INTERV; static char *arp_ip_target[BOND_MAX_ARP_TARGETS] = { NULL, }; static char *arp_validate = NULL; +static int fail_over_mac = 0; struct bond_params bonding_defaults; module_param(max_bonds, int, 0); @@ -130,6 +131,8 @@ module_param_array(arp_ip_target, charp, NULL, 0); MODULE_PARM_DESC(arp_ip_target, "arp targets in n.n.n.n form"); module_param(arp_validate, charp, 0); MODULE_PARM_DESC(arp_validate, "validate src/dst of ARP probes: none (default), active, backup or all"); +module_param(fail_over_mac, int, 0); +MODULE_PARM_DESC(fail_over_mac, "For active-backup, do not set all slaves to the same MAC. 0 of off (default), 1 for on."); /*----------------------------- Global variables ----------------------------*/ @@ -1099,7 +1102,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active) /* when bonding does not set the slave MAC address, the bond MAC * address is the one of the active slave. */ - if (new_active && !bond->do_set_mac_addr) + if (new_active && bond->params.fail_over_mac) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); if (bond->curr_active_slave && @@ -1371,16 +1374,16 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) if (slave_dev->set_mac_address == NULL) { if (bond->slave_cnt == 0) { printk(KERN_WARNING DRV_NAME - ": %s: Warning: The first slave device you " - "specified does not support setting the MAC " - "address. 
This bond MAC address would be that " - "of the active slave.\n", bond_dev->name); - bond->do_set_mac_addr = 0; - } else if (bond->do_set_mac_addr) { + ": %s: Warning: The first slave device " + "specified does not support setting the MAC " + "address. Enabling the fail_over_mac option.", + bond_dev->name); + bond->params.fail_over_mac = 1; + } else if (!bond->params.fail_over_mac) { printk(KERN_ERR DRV_NAME - ": %s: Error: The slave device you specified " - "does not support setting the MAC addres,." - "but this bond uses this practice. \n" + ": %s: Error: The slave device specified " + "does not support setting the MAC address, " + "but fail_over_mac is not enabled.\n" , bond_dev->name); res = -EOPNOTSUPP; goto err_undo_flags; @@ -1405,7 +1408,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) */ memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { /* * Set slave to master's mac address. 
The application already * set the master's mac address to that of the first slave @@ -1641,7 +1644,7 @@ err_close: dev_close(slave_dev); err_restore_mac: - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); addr.sa_family = slave_dev->type; dev_set_mac_address(slave_dev, &addr); @@ -1823,7 +1826,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev) /* close slave before restoring its mac address */ dev_close(slave_dev); - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { /* restore original ("permanent") mac address */ memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); addr.sa_family = slave_dev->type; @@ -1944,7 +1947,7 @@ static int bond_release_all(struct net_device *bond_dev) /* close slave before restoring its mac address */ dev_close(slave_dev); - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { /* restore original ("permanent") mac address*/ memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); addr.sa_family = slave_dev->type; @@ -3066,9 +3069,15 @@ static void bond_info_show_master(struct seq_file *seq) curr = bond->curr_active_slave; read_unlock(&bond->curr_slave_lock); - seq_printf(seq, "Bonding Mode: %s\n", + seq_printf(seq, "Bonding Mode: %s", bond_mode_name(bond->params.mode)); + if (bond->params.mode == BOND_MODE_ACTIVEBACKUP && + bond->params.fail_over_mac) + seq_printf(seq, " (fail_over_mac)"); + + seq_printf(seq, "\n"); + if (bond->params.mode == BOND_MODE_XOR || bond->params.mode == BOND_MODE_8023AD) { seq_printf(seq, "Transmit Hash Policy: %s (%d)\n", @@ -4008,8 +4017,12 @@ static int bond_set_mac_address(struct net_device *bond_dev, void *addr) dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None")); - if (!bond->do_set_mac_addr) - return -EOPNOTSUPP; + /* + * If fail_over_mac is enabled, do nothing and return success. + * Returning an error causes ifenslave to fail. 
+ */ + if (bond->params.fail_over_mac) + return 0; if (!is_valid_ether_addr(sa->sa_data)) { return -EADDRNOTAVAIL; @@ -4402,10 +4415,6 @@ static int bond_init(struct net_device *bond_dev, struct bond_params *params) #ifdef CONFIG_PROC_FS bond_create_proc_entry(bond); #endif - - /* set do_set_mac_addr to true on startup */ - bond->do_set_mac_addr = 1; - list_add_tail(&bond->bond_list, &bond_dev_list); return 0; @@ -4739,6 +4748,11 @@ static int bond_check_params(struct bond_params *params) primary = NULL; } + if (fail_over_mac && (bond_mode != BOND_MODE_ACTIVEBACKUP)) + printk(KERN_WARNING DRV_NAME + ": Warning: fail_over_mac only affects " + "active-backup mode.\n"); + /* fill params struct with the proper values */ params->mode = bond_mode; params->xmit_policy = xmit_hashtype; @@ -4750,6 +4764,7 @@ static int bond_check_params(struct bond_params *params) params->use_carrier = use_carrier; params->lacp_fast = lacp_fast; params->primary[0] = 0; + params->fail_over_mac = fail_over_mac; if (primary) { strncpy(params->primary, primary, IFNAMSIZ); diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 71db5d9..a907b68 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -567,6 +567,54 @@ static ssize_t bonding_store_arp_validate(struct device *d, static DEVICE_ATTR(arp_validate, S_IRUGO | S_IWUSR, bonding_show_arp_validate, bonding_store_arp_validate); /* + * Show and store fail_over_mac. User only allowed to change the + * value when there are no slaves. 
+ */ +static ssize_t bonding_show_fail_over_mac(struct device *d, struct device_attribute *attr, char *buf) +{ + struct bonding *bond = to_bond(d); + + return sprintf(buf, "%d\n", bond->params.fail_over_mac) + 1; +} + +static ssize_t bonding_store_fail_over_mac(struct device *d, struct device_attribute *attr, const char *buf, size_t count) +{ + int new_value; + int ret = count; + struct bonding *bond = to_bond(d); + + if (bond->slave_cnt != 0) { + printk(KERN_ERR DRV_NAME + ": %s: Can't alter fail_over_mac with slaves in bond.\n", + bond->dev->name); + ret = -EPERM; + goto out; + } + + if (sscanf(buf, "%d", &new_value) != 1) { + printk(KERN_ERR DRV_NAME + ": %s: no fail_over_mac value specified.\n", + bond->dev->name); + ret = -EINVAL; + goto out; + } + + if ((new_value == 0) || (new_value == 1)) { + bond->params.fail_over_mac = new_value; + printk(KERN_INFO DRV_NAME ": %s: Setting fail_over_mac to %d.\n", + bond->dev->name, new_value); + } else { + printk(KERN_INFO DRV_NAME + ": %s: Ignoring invalid fail_over_mac value %d.\n", + bond->dev->name, new_value); + } +out: + return ret; +} + +static DEVICE_ATTR(fail_over_mac, S_IRUGO | S_IWUSR, bonding_show_fail_over_mac, bonding_store_fail_over_mac); + +/* * Show and set the arp timer interval. There are two tricky bits * here. First, if ARP monitoring is activated, then we must disable * MII monitoring. 
Second, if the ARP timer isn't running, we must @@ -1390,6 +1438,7 @@ static DEVICE_ATTR(ad_partner_mac, S_IRUGO, bonding_show_ad_partner_mac, NULL); static struct attribute *per_bond_attrs[] = { &dev_attr_slaves.attr, &dev_attr_mode.attr, + &dev_attr_fail_over_mac.attr, &dev_attr_arp_validate.attr, &dev_attr_arp_interval.attr, &dev_attr_arp_ip_target.attr, diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index ed0f587..9d6153e 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -22,8 +22,8 @@ #include "bond_3ad.h" #include "bond_alb.h" -#define DRV_VERSION "3.1.3" -#define DRV_RELDATE "June 13, 2007" +#define DRV_VERSION "3.2.0" +#define DRV_RELDATE "September 13, 2007" #define DRV_NAME "bonding" #define DRV_DESCRIPTION "Ethernet Channel Bonding Driver" @@ -128,6 +128,7 @@ struct bond_params { int arp_interval; int arp_validate; int use_carrier; + int fail_over_mac; int updelay; int downdelay; int lacp_fast; @@ -186,7 +187,6 @@ struct bonding { struct timer_list mii_timer; struct timer_list arp_timer; s8 kill_timers; - s8 do_set_mac_addr; s8 send_grat_arp; s8 setup_by_slave; struct net_device_stats stats; -- 1.5.2-rc2.GIT From HNGUYEN at de.ibm.com Thu Sep 20 07:07:40 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Thu, 20 Sep 2007 16:07:40 +0200 Subject: [ofa-general] Re: [ewg] RE: Delaying OFED 1.3 alpha release to next week In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563E0E@mtlexch01.mtl.com> Message-ID: Hello Tziporet! > Due to some last minutes submissions that are not yet taken and some > problems with the > install script I delay the OFED 1.3 alpha release to next week. > > I also think we should agree on a new 1.3 schedule based on the changes > in the alpha release. We're testing and backporting ehca on various kernel versions and distros. We'll have our backport patches ready by Tue next week. 
> > Another thing to consider is base the kernel code on 2.6.24 and in this > way to reduce the amount of patches we have I would prefer this option, because we have at the moment about 15 patches in queue for 2.6.24. Thanks Nam From sashak at voltaire.com Thu Sep 20 09:16:15 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 20 Sep 2007 18:16:15 +0200 Subject: [ofa-general] libibmad question forward In-Reply-To: <1190244474.7075.74.camel@hrosenstock-ws.xsigo.com> References: <795c49870709191610j4330cb96i8ff8fef359bdcb6b@mail.gmail.com> <1190244474.7075.74.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070920161615.GA21834@sashak.voltaire.com> On 16:27 Wed 19 Sep , Hal Rosenstock wrote: > On Wed, 2007-09-19 at 16:10 -0700, Jeff Becker wrote: > > I am trying to use libibmad library for initiating queries of Device > > Management and other class types. While initializing, the > > madrpc_init() call fails when I have IB_DEVICE_MGMT_CLASS included as > > a part of mgmt_classes parameter. This is because mgmt_class_vers() > > (which is called by mad_register_port_client()/ mad_register_client()) > > fails to return class version for Device Management Class. > > > > I am able to make DM queries if mgmt_class_vers() is fixed i.e. just > > add a case to return the version for IB_DEVICE_MGMT_CLASS. e.g. 
> > > > static int > > mgmt_class_vers(int mgmt_class) > > > > { > > > > if ((mgmt_class >= IB_VENDOR_RANGE1_START_CLASS && > > mgmt_class <= IB_VENDOR_RANGE1_END_CLASS) || > > (mgmt_class >= IB_VENDOR_RANGE2_START_CLASS && > > mgmt_class <= IB_VENDOR_RANGE2_END_CLASS)) > > return 1; > > > > switch(mgmt_class) { > > case IB_SMI_CLASS: > > case IB_SMI_DIRECT_CLASS: > > return 1; > > case IB_SA_CLASS: > > return 2; > > case IB_PERFORMANCE_CLASS: > > return 1; > > // Change START > > case IB_DEVICE_MGMT_CLASS: > > return 1; > > // Change END > > } > > > > return 0; > > > > I am wondering if this minor anomaly can be submitted as a bug to > > broaden the usage of libibmad for DM queries. > > Yes, DM class (and perhaps some other missing GS classes) should be > added there. So, I'm going to apply this. Sasha From 46ad958b33c456672e2af711f36b494d398316bb Mon Sep 17 00:00:00 2001 From: Jeff Becker Date: Thu, 20 Sep 2007 17:48:55 +0200 Subject: [PATCH] libibmad: add support for IB_DEVICE_MGMT_CLASS From: Jeff Becker This adds IB_DEVICE_MGMT_CLASS to the list of classes for which a version is returned. 
Signed-off-by: Sasha Khapyorsky --- libibmad/src/register.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/libibmad/src/register.c b/libibmad/src/register.c index 3d1285a..d80fa14 100644 --- a/libibmad/src/register.c +++ b/libibmad/src/register.c @@ -95,6 +95,8 @@ mgmt_class_vers(int mgmt_class) return 2; case IB_PERFORMANCE_CLASS: return 1; + case IB_DEVICE_MGMT_CLASS: + return 1; } return 0; -- 1.5.3.1.91.gd3392 From rdreier at cisco.com Thu Sep 20 09:18:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Sep 2007 09:18:26 -0700 Subject: [ofa-general] [RFC] [PATCH 1/5 v2] ib/ipoib: specify Traffic Class with PR queries for QoS support In-Reply-To: <46F21F98.6090503@voltaire.com> (Or Gerlitz's message of "Thu, 20 Sep 2007 09:22:00 +0200") References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com> <000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> <46F21F98.6090503@voltaire.com> Message-ID: > You have sent a "thanks applied" email for the ipoib qos patch > twice that is on the below two posts, where you should have applied > only v3 (the rest of the series is v2, only for ipoib there was v3). Sorry... I actually applied the patch from Sean's git tree, so I hope I got the latest. > Also, where have you applied it? Your git tree at kernel.org was not > updated for five days... I just got out of the habit of pushing my local tree to kernel.org. It should be updated there. - R. 
From rdreier at cisco.com Thu Sep 20 09:20:57 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Sep 2007 09:20:57 -0700 Subject: [ofa-general] Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: <46F2784C.9070806@voltaire.com> (Moni Shoua's message of "Thu, 20 Sep 2007 16:40:28 +0300") References: <46F27692.3070404@voltaire.com> <46F2784C.9070806@voltaire.com> Message-ID: > + ipoib_slave_detach(cpriv->dev); > unregister_netdev(cpriv->dev); Maybe you already answered this before, but I'm still not clear why this notifier call can't just be added to the start of unregister_netdevice(), so we can avoid drivers needing to know anything about bonding internals? - R. From rdreier at cisco.com Thu Sep 20 09:29:12 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Sep 2007 09:29:12 -0700 Subject: [ofa-general] Re: [PATCH 3/3] IB/ehca: Make sure user pages are from hugetlb before using MR large pages In-Reply-To: <200709131816.21162.fenkes@de.ibm.com> (Joachim Fenkes's message of "Thu, 13 Sep 2007 18:16:20 +0200") References: <200709131814.13937.fenkes@de.ibm.com> <200709131816.21162.fenkes@de.ibm.com> Message-ID: thanks, applied this and the umem patch... From mshefty at ichips.intel.com Thu Sep 20 09:29:57 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Sep 2007 09:29:57 -0700 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <46F27B2B.9000707@voltaire.com> References: <46F006F5.5090801@ichips.intel.com> <46F27B2B.9000707@voltaire.com> Message-ID: <46F2A005.8000105@ichips.intel.com> > I see, however, there is no second join here, it's a leave and join where > the group refcount climbs to 2 since the join code increments it on its > synchronous part which is executed before the thread handles the > processing of the leave request. The refcount is only used to ensure that the group structure continues to exist. 
The code must be able to handle multiple users calling join/free at the same time, including a single user calling free before its previous call to join has completed. All MADs sent for the same multicast group must also be serialized to prevent join and leave requests for the same group from reaching the SA out of order. If you walk through ib_sa_free_multicast(), the group membership is decremented. A reference is held on the group because a work item has just been queued on the group for processing. We cannot remove this reference unless we avoid queuing the work item. And the work item is queued to ensure that the leave request to the SA is serialized with possible future join requests. > I am not sure this is what we want from the core design. Say the > consumer has some flexibility in the join request (eg through future api > change), such that they can join a group, leave it, then join again this > group with different "attributes". Then if the join crosses the leave in > a way that causes the core code not to issue sa leave/join queries, its > a bug from the perspective of the user. If the attributes from a subsequent join differ from an existing join, the subsequent join operation will fail. The only way I can think of to make this situation work is to add an asynchronous ib_sa_leave_multicast() routine that provides a callback after the leave completes, in addition to the existing free call. This could be a fairly difficult case to make work anyway, since it requires destroying the group at the SA before it can be re-created with the different attributes. It requires coordination across the group that's beyond the control of the local multicast module. (A single group creator could handle this fairly easily.) > OK, on this specific host system there was no port down event! so the > only event that the multicast and ipoib code got was port active. 
This > is why the patch I sent solves (hides) the problem, it causes the > multicast code to transition the group into the error state, so the > ipoib join that follows causes an sa join query to be actually sent. There were two port active events delivered back to back with no other events in between? If so, is this something that can or should occur? The patch itself looks fine to me; I'm trying to determine if there are other refcount problems in the multicast module. I'm not convinced that there are at this point. > I don't think there's a problem in ipoib, it just does not rely on > multicast error notifications but rather on port events. Do you think > it's less robust, and if yes, why? As long as the multicast module gets the event notification first, which I believe is the case, then I don't think there are any problems. - Sean From rdreier at cisco.com Thu Sep 20 09:31:27 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Sep 2007 09:31:27 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. In-Reply-To: <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com> (Sean Hefty's message of "Wed, 12 Sep 2007 11:13:08 -0700") References: <20070912100025.3190.89259.stgit@dell3.ogc.int> <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com> Message-ID: > Roland - can you please queue this up for 2.6.24? Done, thanks. From changquing.tang at hp.com Thu Sep 20 09:47:45 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 20 Sep 2007 16:47:45 -0000 Subject: [ofa-general] Can you clarify my 'RNR NAK timer' understanding ? Message-ID: <349DCDA352EACF42A0C49FA6DCEA84030256F745@G3W0634.americas.hpqcorp.net> If I set RNR NAK timer to the biggest value (00000, 655.36 ms), when the HCA receives a message and no receive WR is outstanding, it will wait 655.36 ms, then it sends a RNR NAK back. If I post a receive WR during this waiting time, then the message will be received, and RNR NAK won't be sent. Am I right ? 
What is the side effect if I set RNR NAK timer a big timer ? Thanks --CQ From mshefty at ichips.intel.com Thu Sep 20 10:16:12 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Sep 2007 10:16:12 -0700 Subject: [ofa-general] Re: A question about rdma_get_cm_event In-Reply-To: <46F25B6D.9000000@dev.mellanox.co.il> References: <46F25B6D.9000000@dev.mellanox.co.il> Message-ID: <46F2AADC.7040201@ichips.intel.com> > When one calls to rdma_get_cm_event, he gets a structure of the > rdma_cm_event. > > In this structure there are 2 attributes which i want to discuss about: > * private_data > * private_data_len > > It seems that when one side send to the other private data, the private > data is correct > (i mean that the attribute private data points to a memory buffer with > the expected data) > but the private_data_len has a fixed size (depend on the ucma function > which was called). > > 1) Is this is the expected behavior? Yes - there's no way for the receiving side of an IB CM message to know how many bytes of private data are valid in the REQ, REP, etc. > 2) can you please add entry to the man pages of this function to clarify > this expected > content of those attributes? I will update the man pages. Thanks. - Sean From todd.rimmer at qlogic.com Thu Sep 20 10:17:41 2007 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Thu, 20 Sep 2007 12:17:41 -0500 Subject: [ofa-general] Can you clarify my 'RNR NAK timer' understanding ? In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA84030256F745@G3W0634.americas.hpqcorp.net> Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE0611930E3EB@EPEXCH2.qlogic.org> > From: Tang, Changqing > Sent: Thursday, September 20, 2007 12:48 PM > To: Michael S. Tsirkin > Cc: general at lists.openfabrics.org > Subject: [ofa-general] Can you clarify my 'RNR NAK timer' understanding ? 
> > > If I set RNR NAK timer to the biggest value (00000, 655.38 ms), when the > HCA recevies > a message and no receive WR outstanding, it will wait 655.38 ms, then > it sends a RNR > NAK back. > > If I post a receive WR during this waiting time, then the message will > be received, and > RNR NAK won't be sent. > > Am I right ? What is the side effect if I set RNR NAK timer a big > timer ? No, the behavior is if an input message arrives at a QP without any receive Q entries, the receiver immediately sends an RNR NAK. The RNR NAK timeout is part of the RNR NAK packet. In response to the RNR NAK packet, the sender waits at least the given timeout before retrying the send (actual wait time could be more than the requested value). Hence supplying a large RNR NAK timeout means there will be a large penalty when you run out of receive buffers. In this case your application will "stall" for at least the RNR NAK timeout duration even if the receiver replenishes its queue shortly after the RNR NAK is sent. One alternative is a smaller RNR NAK timeout on the receiver and a large (or infinite) RNR Retry on the sender. This avoids the time delay penalty, however it wastes bandwidth in the fabric if the sender resends before the receiver has replenished its receiver queue. The best alternative, which is employed by many IB protocols such as SRP, SDP, etc is to track application level credits so the sender will not attempt to send into an empty receive Q. Obviously that requires changes to the application protocol being designed. The tradeoff of performance vs wasting bandwidth vs protocol complexity will depend on the specific application/problem you are trying to solve. 
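The credit scheme described above can be sketched in plain C. This is an illustrative model only, not code from SRP, SDP, or libibverbs: `struct credit_link`, `credit_try_send()`, `credit_replenish()` and `blind_send()` are hypothetical names, and both ends of the connection are collapsed into one struct so the bookkeeping is easy to follow. In a real protocol the credit grant would travel back to the sender inside a message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical application-level credit tracker (not a libibverbs type). */
struct credit_link {
    unsigned int sender_credits; /* receive buffers the sender may still consume */
    unsigned int posted_recvs;   /* receive WRs currently posted on the receiver */
    unsigned int rnr_naks;       /* sends that arrived at an empty receive queue */
};

/* Start with the receiver having posted initial_recvs buffers and
 * having advertised all of them to the sender as credits. */
void credit_link_init(struct credit_link *l, unsigned int initial_recvs)
{
    l->sender_credits = initial_recvs;
    l->posted_recvs = initial_recvs;
    l->rnr_naks = 0;
}

/* Credit-aware sender: send only while holding a credit.
 * Returns false when the caller must wait for a credit grant. */
bool credit_try_send(struct credit_link *l)
{
    if (l->sender_credits == 0)
        return false;
    /* Invariant of the scheme: a credit always maps to a posted receive. */
    assert(l->posted_recvs > 0);
    l->sender_credits--;
    l->posted_recvs--;
    return true;
}

/* Sender that ignores credits: an empty receive queue is counted as an
 * RNR NAK, which on a real fabric costs at least the advertised timeout. */
bool blind_send(struct credit_link *l)
{
    if (l->posted_recvs == 0) {
        l->rnr_naks++;
        return false;
    }
    l->posted_recvs--;
    if (l->sender_credits > 0)
        l->sender_credits--;
    return true;
}

/* Receiver reposts n buffers and grants the same number of credits. */
void credit_replenish(struct credit_link *l, unsigned int n)
{
    l->posted_recvs += n;
    l->sender_credits += n;
}
```

With two receives posted, the credit-aware sender stops after two sends and waits for a grant, so no RNR NAK is ever generated. A blind third send would find the queue empty, which is the case where the RNR NAK timeout penalty applies.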
Todd Rimmer From bboas at systemfabricworks.com Thu Sep 20 10:59:30 2007 From: bboas at systemfabricworks.com (Bill Boas) Date: Thu, 20 Sep 2007 10:59:30 -0700 Subject: [ofa-general] RE: A Question In-Reply-To: References: Message-ID: <005101c7fbaf$feb100d0$6401a8c0@YOURCB10AA3FFD> David, I think you will get an answer from the "general" mail list. Bill Boas VP, Business Development System Fabric Works 510-375-8840 bboas at systemfabricworks.com www.systemfabricworks.com _____ From: David Gonzalez Marquez [mailto:fokerman at gmail.com] Sent: Thursday, September 20, 2007 10:11 AM To: membership at openfabrics.org Subject: A Question Hello I'm working with Debian Linux, but I cannot find any support for InfiniBand. All the packages are made for RPM management. I'd like to know what type of support exists for Debian Linux and how I can obtain it. Thanks David Gonzalez Marquez From hrosenstock at xsigo.com Thu Sep 20 11:01:53 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 20 Sep 2007 11:01:53 -0700 Subject: [ofa-general] libibmad question forward In-Reply-To: <20070920161615.GA21834@sashak.voltaire.com> References: <795c49870709191610j4330cb96i8ff8fef359bdcb6b@mail.gmail.com> <1190244474.7075.74.camel@hrosenstock-ws.xsigo.com> <20070920161615.GA21834@sashak.voltaire.com> Message-ID: <1190311314.7075.102.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-09-20 at 18:16 +0200, Sasha Khapyorsky wrote: > On 16:27 Wed 19 Sep , Hal Rosenstock wrote: > > On Wed, 2007-09-19 at 16:10 -0700, Jeff Becker wrote: > > > I am trying to use libibmad library for initiating queries of Device > > > Management and other class types. While initializing, the > > > madrpc_init() call fails when I have IB_DEVICE_MGMT_CLASS included as > > > a part of mgmt_classes parameter.
This is because mgmt_class_vers() > > > (which is called by mad_register_port_client()/ mad_register_client()) > > > fails to return class version for Device Management Class. > > > > > > I am able to make DM queries if mgmt_class_vers() is fixed i.e. just > > > add a case to return the version for IB_DEVICE_MGMT_CLASS. e.g. > > > > > > static int > > > mgmt_class_vers(int mgmt_class) > > > > > > { > > > > > > if ((mgmt_class >= IB_VENDOR_RANGE1_START_CLASS && > > > mgmt_class <= IB_VENDOR_RANGE1_END_CLASS) || > > > (mgmt_class >= IB_VENDOR_RANGE2_START_CLASS && > > > mgmt_class <= IB_VENDOR_RANGE2_END_CLASS)) > > > return 1; > > > > > > switch(mgmt_class) { > > > case IB_SMI_CLASS: > > > case IB_SMI_DIRECT_CLASS: > > > return 1; > > > case IB_SA_CLASS: > > > return 2; > > > case IB_PERFORMANCE_CLASS: > > > return 1; > > > // Change START > > > case IB_DEVICE_MGMT_CLASS: > > > return 1; Actually, there is an annex which makes this class version 2 which is supposed to support backward compatibility for version 1. I'm not sure whether both are in use (as to how important the backward compatibility is with this). Maybe someone else can comment on this aspect. -- Hal > > > // Change END > > > } > > > > > > return 0; > > > > > > I am wondering if this minor anomaly can be submitted as a bug to > > > broaden the usage of libibmad its usage for DM queries. > > > > Yes, DM class (and perhaps some other missing GS classes) should be > > added there. > > So, I'm going to apply this. > > Sasha > > From 46ad958b33c456672e2af711f36b494d398316bb Mon Sep 17 00:00:00 2001 > From: Jeff Becker > Date: Thu, 20 Sep 2007 17:48:55 +0200 > Subject: [PATCH] libibmad: add support for IB_DEVICE_MGMT_CLASS > > From: Jeff Becker > > This adds IB_DEVICE_MGMT_CLASS to list of classes for which version is > returned. 
> > Signed-off-by: Sasha Khapyorsky > --- > libibmad/src/register.c | 2 ++ > 1 files changed, 2 insertions(+), 0 deletions(-) > > diff --git a/libibmad/src/register.c b/libibmad/src/register.c > index 3d1285a..d80fa14 100644 > --- a/libibmad/src/register.c > +++ b/libibmad/src/register.c > @@ -95,6 +95,8 @@ mgmt_class_vers(int mgmt_class) > return 2; > case IB_PERFORMANCE_CLASS: > return 1; > + case IB_DEVICE_MGMT_CLASS: > + return 1; > } > > return 0; From rdreier at cisco.com Thu Sep 20 11:03:06 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Sep 2007 11:03:06 -0700 Subject: [ofa-general] RE: A Question In-Reply-To: <005101c7fbaf$feb100d0$6401a8c0@YOURCB10AA3FFD> (Bill Boas's message of "Thu, 20 Sep 2007 10:59:30 -0700") References: <005101c7fbaf$feb100d0$6401a8c0@YOURCB10AA3FFD> Message-ID: > I'm woking with debian linux. But I cannot find any support for Infiniband. I maintain libibverbs and libmthca packages that are in the main Debian archive (ie "aptitude install libibverbs1 libmthca1" should get you those packages installed). The Debian kernel has all IB-related config options enabled AFAIK. I plan on packaging libmlx4 for Debian soon as well. Is there something else in particular that you're missing? - R. From rick.jones2 at hp.com Thu Sep 20 11:10:01 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 20 Sep 2007 11:10:01 -0700 Subject: [ofa-general] RE: A Question In-Reply-To: References: <005101c7fbaf$feb100d0$6401a8c0@YOURCB10AA3FFD> Message-ID: <46F2B779.80705@hp.com> Roland Dreier wrote: >>I'm woking with debian linux. But I cannot find any support for Infiniband. > > > I maintain libibverbs and libmthca packages that are in the main > Debian archive (ie "aptitude install libibverbs1 libmthca1" should get > you those packages installed). The Debian kernel has all IB-related > config options enabled AFAIK. I plan on packaging libmlx4 for Debian > soon as well. > > Is there something else in particular that you're missing? 
I'm guessing he is trying to grab OFED 1.X bits and install them like I was a few months ago. rick jones From davem at davemloft.net Thu Sep 20 11:12:28 2007 From: davem at davemloft.net (David Miller) Date: Thu, 20 Sep 2007 11:12:28 -0700 (PDT) Subject: [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule() In-Reply-To: References: <20070919.090557.24612742.davem@davemloft.net> Message-ID: <20070920.111228.76772435.davem@davemloft.net> From: Krishna Kumar2 Date: Thu, 20 Sep 2007 10:48:15 +0530 > About the "list deletion occurs", isn't the race I mentioned still present? > If done < budget, the driver does netif_rx_complete (at which time some > other cpu can add this NAPI to their list). But the first cpu might do some more > actions on the napi, like ipoib_poll() calls request_notify_cq(priv->cq), > when other cpu might have started using this napi. > > (net_rx_action's 'list_move' however will not execute since work != weight) It is the driver's responsibility to adhere to the fact that once netif_rx_complete() is called, the driver is explicitly relinquishing ownership of the NAPI state. It therefore must not access that NAPI state until it has successfully acquired the NAPI_STATE_SCHED bit atomically, via a sched or resched. From davem at davemloft.net Thu Sep 20 11:12:42 2007 From: davem at davemloft.net (David Miller) Date: Thu, 20 Sep 2007 11:12:42 -0700 (PDT) Subject: [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule() In-Reply-To: References: <20070919.221224.26966518.davem@davemloft.net> Message-ID: <20070920.111242.91444191.davem@davemloft.net> From: Krishna Kumar2 Date: Thu, 20 Sep 2007 11:24:01 +0530 > Ran 4/16/64 thread iperf on latest bits with this patch and no issues after > 30 mins. I used to > consistently get the bug within 1-2 mins with just 4 threads prior to this > patch. 
> > Tested-by: Krishna Kumar > (if any value in that) There is much value in that :-) Thanks a lot Krishna. From mshefty at ichips.intel.com Thu Sep 20 11:48:04 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Sep 2007 11:48:04 -0700 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: References: Message-ID: <46F2C064.9030404@ichips.intel.com> > We saw the bug on a uniprocessor system running the ipath driver, where the > consumer is ipoib and the group being the IPv4 broadcast. When we take down > the link of the switch port connected to the device across the cable, ipoib > rushes to leave the group and then join it. On this system the join "crosses > the leave" and the SA does not take into account the node when computing the > multicast routing of the group --> the node does not get the broadcast traffic. I've read back over this description a few times, and I still don't fully grok the problem. Can you clarify if the following sequence is what's happening? 1. The node has joined the multicast group. Meaning that the SA has routed multicast traffic to the node. 2. You take down the link of the switch port that connects the node. Is this done via a program? 3. The port is brought back online. This generates a PORT_ACTIVE event, but the previous event was also PORT_ACTIVE. 4. ipoib leaves the group. 5. ipoib re-joins the group. 6. The multicast module isn't aware that any errors have occurred on the multicast group, so simply completes the join request at step 5 without SA involvement. If I'm understanding this, somewhere in the above sequence the multicast routing to this node is lost. Either the SA removed the node from the group, or the switch lost its routing tables, or ...? I'm also trying to understand how the problem would apply to a different setup: node 1 <-> switch A <-> switch B <-> switch C <-> SA Suppose the same link down/up occurred between switch A and switch B.
What happens to the multicast members to the left of switch B? Will node 1 see a PORT_ACTIVE event in this case as well? - Sean From arlin.r.davis at intel.com Thu Sep 20 11:50:56 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 20 Sep 2007 11:50:56 -0700 Subject: [ofa-general] [PATCH] uDAPL 2.0 mods to co-exist with uDAPL 1.2 Message-ID: <000001c7fbb7$30cbad70$19b7020a@amr.corp.intel.com> James, Please review patches to allow coexistence of 2.0 and 1.2 libraries. I updated the dat.conf to provide configuration to both 1.2 and 2.0 providers. In addition, the development package (headers) is now targeted to include/dat2 instead of include/dat. A patch for 1.2 will follow shortly. Modifications to DAT 2.0 package to coexist with 1.2 libraries - cleanup CR-LF in dtestx - fix RPM specfile, 2.0.1 package - move devel to include/dat2 - change test examples to use new 2.0 provider names. Signed-off-by: Arlin Davis ardavis at ichips.intel.com diff --git a/Makefile.am b/Makefile.am index b3a0149..f473aaa 100755 --- a/Makefile.am +++ b/Makefile.am @@ -66,7 +66,7 @@ dat_udat_libdat_la_SOURCES = dat/udat/udat.c \ dat/common/dat_init.c \ dat/common/dat_dr.c \ dat/common/dat_sr.c - +# version-info current:revision:age dat_udat_libdat_la_LDFLAGS = -version-info 2:0:0 $(dat_version_script) -ldl # @@ -178,11 +178,12 @@ dapl_udapl_libdaplcma_la_SOURCES = dapl/udapl/dapl_init.c \ dapl/openib_cma/dapl_ib_cm.c \ dapl/openib_cma/dapl_ib_mem.c $(XPROGRAMS) +# version-info current:revision:age dapl_udapl_libdaplcma_la_LDFLAGS = -version-info 2:0:0 $(daplcma_version_script) \ -Wl,-init,dapl_init -Wl,-fini,dapl_fini \ -lpthread -libverbs -lrdmacm -libdatincludedir = $(includedir)/dat +libdatincludedir = $(includedir)/dat2 libdatinclude_HEADERS = dat/include/dat/dat.h \ dat/include/dat/dat_error.h \ @@ -244,7 +245,7 @@ EXTRA_DIST = dat/common/dat_dictionary.h \ dat/udat/libdat.map \ doc/dat.conf \ dapl/udapl/libdaplcma.map \ - libdat.spec.in \ + libdat2.spec.in \ $(man_MANS) \
test/dapltest/include/dapl_bpool.h \ test/dapltest/include/dapl_client_info.h \ @@ -274,7 +275,7 @@ EXTRA_DIST = dat/common/dat_dictionary.h \ test/dapltest/include/dapl_version.h \ test/dapltest/mdep/linux/dapl_mdep_user.h -dist-hook: libdat.spec - cp libdat.spec $(distdir) +dist-hook: libdat2.spec + cp libdat2.spec $(distdir) SUBDIRS = . test/dtest test/dapltest diff --git a/README b/README index 437c1f7..1fc55a2 100644 --- a/README +++ b/README @@ -17,16 +17,18 @@ Building debug version: ./configure --enable-debug make -Build example with OFED prefix (x86_64) ------------------------------------------ +Build example with OFED 1.2+ prefix (x86_64) +--------------------------------------------- ./autogen.sh -./configure --prefix /usr/local/ofed --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 CPPFLAGS="-I/usr/local/ofed/include" +./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 CPPFLAGS="-I/usr/include" make Installing: ---------- make install +Note: The development package installs DAT 2.0 include files under /usr/include/dat2 to co-exist with DAT 1.2 /usr/include/dat + NOTE: to link these libraries you must either use libtool and specify the full pathname of the library, or use the `-LLIBDIR' flag during linking and do at least one of the following: @@ -47,19 +49,32 @@ more information, such as the ld(1) and ld.so(8) manual pages. 
sample /etc/dat.conf # -# DAT 1.2 configuration file, sample OFED +# DAT 1.2 and 2.0 configuration file # # Each entry should have the following fields: # # \ # # -# For openib-cma provider you can specify as either: -# network address, network hostname, or netdev name and 0 for port +# For the uDAPL cma provder, specify as one of the following: +# network address, network hostname, or netdev name and 0 for port +# +# Simple (OpenIB-cma) default with netdev name provided first on list +# to enable use of same dat.conf version on all nodes # -# This example shows netdev name, enabling administrator to use same copy across cluster +# Add examples for multiple interfaces and IPoIB HA fail over, and bonding # -OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdapl-cma.so mv_dapl.1.2 "ib0 0" "" +OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" "" +OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" "" +OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0" "" +OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0" "" +OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0" "" +OpenIB-2-cma u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib0 0" "" +OpenIB-2-cma-1 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib1 0" "" +OpenIB-2-cma-2 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib2 0" "" +OpenIB-2-cma-3 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib3 0" "" +OpenIB-2-bond u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "bond0 0" "" + ============================= 3.0 Bugs/Known issues diff --git a/configure.in b/configure.in index 7608e64..4eda85f 100644 --- a/configure.in +++ b/configure.in @@ -1,11 +1,11 @@ dnl Process this file with autoconf to produce a configure script. 
AC_PREREQ(2.57) -AC_INIT(dapl, 2.0.0, openib-general at openib.org) +AC_INIT(dapl, 2.0.1, general at lists.openfabrics.org) AC_CONFIG_SRCDIR([dat/udat/udat.c]) AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) -AM_INIT_AUTOMAKE(dapl, 2.0.0) +AM_INIT_AUTOMAKE(dapl, 2.0.1) AM_PROG_LIBTOOL @@ -86,6 +86,6 @@ AC_CACHE_CHECK(Check for RHEL5 system, ac_cv_rhel5, fi) AM_CONDITIONAL(OS_RHEL5, test "$ac_cv_rhel5" = "yes") -AC_CONFIG_FILES([Makefile test/dtest/Makefile test/dapltest/Makefile libdat.spec]) +AC_CONFIG_FILES([Makefile test/dtest/Makefile test/dapltest/Makefile libdat2.spec]) AC_OUTPUT diff --git a/doc/dat.conf b/doc/dat.conf index 2651673..005f9ee 100755 --- a/doc/dat.conf +++ b/doc/dat.conf @@ -1,5 +1,5 @@ # -# DAT 2.0 configuration file +# DAT 1.2 and 2.0 configuration file # # Each entry should have the following fields: # @@ -9,10 +9,18 @@ # For the uDAPL cma provder, specify as one of the following: # network address, network hostname, or netdev name and 0 for port # -# Simple (OpenIB-cma) default configuration with netdev name provided first on list -# to enable use of same dat.conf version on all nodes. Assumes x86_64 installation. 
+# Simple (OpenIB-cma) default with netdev name provided first on list +# to enable use of same dat.conf version on all nodes # -OpenIB-cma u2.0 nonthreadsafe default /usr/lib64/libdaplcma.so mv_dapl.2.0 "ib0 0" "" -OpenIB-cma-1 u2.0 nonthreadsafe default /usr/lib64/libdaplcma.so mv_dapl.2.0 "ib0 0" "" -OpenIB-cma-2 u2.0 nonthreadsafe default /usr/lib64/libdaplcma.so mv_dapl.2.0 "ib0 0" "" -OpenIB-cma-3 u2.0 nonthreadsafe default /usr/lib64/libdaplcma.so mv_dapl.2.0 "ib0 0" "" +# Add examples for multiple interfaces and IPoIB HA fail over, and bonding +# +OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" "" +OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" "" +OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0" "" +OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0" "" +OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0" "" +OpenIB-2-cma u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib0 0" "" +OpenIB-2-cma-1 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib1 0" "" +OpenIB-2-cma-2 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib2 0" "" +OpenIB-2-cma-3 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib3 0" "" +OpenIB-2-bond u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "bond0 0" "" diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 07b40ec..ba12a58 100644 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -44,7 +44,7 @@ #include #ifndef DAPL_PROVIDER -#define DAPL_PROVIDER "OpenIB-cma" +#define DAPL_PROVIDER "OpenIB-2-cma" #endif #define MAX_POLLING_CNT 50000 diff --git a/test/dtest/dtestx.c b/test/dtest/dtestx.c index 153ce76..04a0d5d 100755 --- a/test/dtest/dtestx.c +++ b/test/dtest/dtestx.c @@ -30,785 +30,785 @@ * SOFTWARE. 
* * $Id: $ - */ -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "dat/udat.h" -#include "dat/dat_ib_extensions.h" - -#define _OK(status, str) \ -{ \ - const char *maj_msg, *min_msg; \ - if (status != DAT_SUCCESS) { \ - dat_strerror(status, &maj_msg, &min_msg); \ - fprintf(stderr, str " returned %s : %s\n", maj_msg, min_msg); \ - exit(1); \ - } \ -} - + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "dat/udat.h" +#include "dat/dat_ib_extensions.h" + +#define _OK(status, str) \ +{ \ + const char *maj_msg, *min_msg; \ + if (status != DAT_SUCCESS) { \ + dat_strerror(status, &maj_msg, &min_msg); \ + fprintf(stderr, str " returned %s : %s\n", maj_msg, min_msg); \ + exit(1); \ + } \ +} + #define DTO_TIMEOUT (1000*1000*5) #define CONN_TIMEOUT (1000*1000*10) -#define SERVER_TIMEOUT (1000*1000*120) -#define SERVER_CONN_QUAL 31111 -#define BUF_SIZE 256 -#define BUF_SIZE_ATOMIC 8 -#define REG_MEM_COUNT 10 -#define SND_RDMA_BUF_INDEX 0 -#define RCV_RDMA_BUF_INDEX 1 -#define SEND_BUF_INDEX 2 -#define RECV_BUF_INDEX 3 - -u_int64_t *atomic_buf; -DAT_LMR_HANDLE lmr_atomic; -DAT_LMR_CONTEXT lmr_atomic_context; -DAT_RMR_CONTEXT rmr_atomic_context; -DAT_VLEN reg_atomic_size; -DAT_VADDR reg_atomic_addr; -DAT_LMR_HANDLE lmr[ REG_MEM_COUNT ]; -DAT_LMR_CONTEXT lmr_context[ REG_MEM_COUNT ]; -DAT_RMR_TRIPLET rmr[ REG_MEM_COUNT ]; -DAT_RMR_CONTEXT rmr_context[ REG_MEM_COUNT ]; -DAT_VLEN reg_size[ REG_MEM_COUNT ]; -DAT_VADDR reg_addr[ REG_MEM_COUNT ]; -DAT_RMR_TRIPLET * buf[ REG_MEM_COUNT ]; -DAT_EP_HANDLE ep; -DAT_EVD_HANDLE async_evd = DAT_HANDLE_NULL; -DAT_IA_HANDLE ia = DAT_HANDLE_NULL; -DAT_PZ_HANDLE pz = DAT_HANDLE_NULL; -DAT_EVD_HANDLE cr_evd = DAT_HANDLE_NULL; -DAT_EVD_HANDLE con_evd = DAT_HANDLE_NULL; -DAT_EVD_HANDLE dto_evd = DAT_HANDLE_NULL; -DAT_PSP_HANDLE psp = DAT_HANDLE_NULL; -DAT_CR_HANDLE cr = DAT_HANDLE_NULL; -int server; - -char *usage = "-s | hostname (default == 
-s)\n"; - -void -send_msg( - void *data, - DAT_COUNT size, - DAT_LMR_CONTEXT context, - DAT_DTO_COOKIE cookie, - DAT_COMPLETION_FLAGS flags) -{ - DAT_LMR_TRIPLET iov; - DAT_EVENT event; - DAT_COUNT nmore; - DAT_RETURN status; - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = - &event.event_data.dto_completion_event_data; - - iov.lmr_context = context; - iov.virtual_address = (DAT_VADDR)(unsigned long)data; - iov.segment_length = (DAT_VLEN)size; - - status = dat_ep_post_send(ep, - 1, - &iov, - cookie, - flags); - _OK(status, "dat_ep_post_send"); - - if (! (flags & DAT_COMPLETION_SUPPRESS_FLAG)) { - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); - _OK(status, "dat_evd_wait after dat_ep_post_send"); - - if (event.event_number != DAT_DTO_COMPLETION_EVENT) { - printf("unexpected event waiting for post_send completion - 0x%x\n", event.event_number); - exit(1); - } - - _OK(dto_event->status, "event status for post_send"); - } -} - -int -connect_ep(char *hostname) -{ - DAT_SOCK_ADDR remote_addr; - DAT_EP_ATTR ep_attr; - DAT_RETURN status; - DAT_REGION_DESCRIPTION region; - DAT_EVENT event; - DAT_COUNT nmore; - DAT_LMR_TRIPLET iov; - DAT_RMR_TRIPLET r_iov; - DAT_DTO_COOKIE cookie; - int i; - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = - &event.event_data.dto_completion_event_data; - - status = dat_ia_open("OpenIB-cma", 8, &async_evd, &ia); - _OK(status, "dat_ia_open"); - - status = dat_pz_create(ia, &pz); - _OK(status, "dat_pz_create"); - - status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_CR_FLAG, &cr_evd ); - _OK(status, "dat_evd_create CR"); - status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_CONNECTION_FLAG, &con_evd ); - _OK(status, "dat_evd_create CR"); - status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_DTO_FLAG, &dto_evd ); - _OK(status, "dat_evd_create DTO"); - - memset(&ep_attr, 0, sizeof(ep_attr)); - ep_attr.service_type = DAT_SERVICE_TYPE_RC; - ep_attr.max_rdma_size = 0x10000; - ep_attr.qos = 0; - ep_attr.recv_completion_flags = 
0; - ep_attr.max_recv_dtos = 10; - ep_attr.max_request_dtos = 10; - ep_attr.max_recv_iov = 1; - ep_attr.max_request_iov = 1; - ep_attr.max_rdma_read_in = 4; - ep_attr.max_rdma_read_out = 4; - ep_attr.request_completion_flags = DAT_COMPLETION_DEFAULT_FLAG; - ep_attr.ep_transport_specific_count = 0; - ep_attr.ep_transport_specific = NULL; - ep_attr.ep_provider_specific_count = 0; - ep_attr.ep_provider_specific = NULL; - - status = dat_ep_create(ia, pz, dto_evd, dto_evd, con_evd, &ep_attr, &ep); - _OK(status, "dat_ep_create"); - - for (i = 0; i < REG_MEM_COUNT; i++) { - buf[ i ] = (DAT_RMR_TRIPLET*)malloc(BUF_SIZE); - region.for_va = buf[ i ]; - status = dat_lmr_create(ia, - DAT_MEM_TYPE_VIRTUAL, - region, - BUF_SIZE, - pz, - DAT_MEM_PRIV_ALL_FLAG|DAT_IB_MEM_PRIV_REMOTE_ATOMIC, +#define SERVER_TIMEOUT (1000*1000*120) +#define SERVER_CONN_QUAL 31111 +#define BUF_SIZE 256 +#define BUF_SIZE_ATOMIC 8 +#define REG_MEM_COUNT 10 +#define SND_RDMA_BUF_INDEX 0 +#define RCV_RDMA_BUF_INDEX 1 +#define SEND_BUF_INDEX 2 +#define RECV_BUF_INDEX 3 + +u_int64_t *atomic_buf; +DAT_LMR_HANDLE lmr_atomic; +DAT_LMR_CONTEXT lmr_atomic_context; +DAT_RMR_CONTEXT rmr_atomic_context; +DAT_VLEN reg_atomic_size; +DAT_VADDR reg_atomic_addr; +DAT_LMR_HANDLE lmr[ REG_MEM_COUNT ]; +DAT_LMR_CONTEXT lmr_context[ REG_MEM_COUNT ]; +DAT_RMR_TRIPLET rmr[ REG_MEM_COUNT ]; +DAT_RMR_CONTEXT rmr_context[ REG_MEM_COUNT ]; +DAT_VLEN reg_size[ REG_MEM_COUNT ]; +DAT_VADDR reg_addr[ REG_MEM_COUNT ]; +DAT_RMR_TRIPLET * buf[ REG_MEM_COUNT ]; +DAT_EP_HANDLE ep; +DAT_EVD_HANDLE async_evd = DAT_HANDLE_NULL; +DAT_IA_HANDLE ia = DAT_HANDLE_NULL; +DAT_PZ_HANDLE pz = DAT_HANDLE_NULL; +DAT_EVD_HANDLE cr_evd = DAT_HANDLE_NULL; +DAT_EVD_HANDLE con_evd = DAT_HANDLE_NULL; +DAT_EVD_HANDLE dto_evd = DAT_HANDLE_NULL; +DAT_PSP_HANDLE psp = DAT_HANDLE_NULL; +DAT_CR_HANDLE cr = DAT_HANDLE_NULL; +int server; + +char *usage = "-s | hostname (default == -s)\n"; + +void +send_msg( + void *data, + DAT_COUNT size, + DAT_LMR_CONTEXT context, 
+ DAT_DTO_COOKIE cookie, + DAT_COMPLETION_FLAGS flags) +{ + DAT_LMR_TRIPLET iov; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_RETURN status; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + + iov.lmr_context = context; + iov.virtual_address = (DAT_VADDR)(unsigned long)data; + iov.segment_length = (DAT_VLEN)size; + + status = dat_ep_post_send(ep, + 1, + &iov, + cookie, + flags); + _OK(status, "dat_ep_post_send"); + + if (! (flags & DAT_COMPLETION_SUPPRESS_FLAG)) { + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); + _OK(status, "dat_evd_wait after dat_ep_post_send"); + + if (event.event_number != DAT_DTO_COMPLETION_EVENT) { + printf("unexpected event waiting for post_send completion - 0x%x\n", event.event_number); + exit(1); + } + + _OK(dto_event->status, "event status for post_send"); + } +} + +int +connect_ep(char *hostname) +{ + DAT_SOCK_ADDR remote_addr; + DAT_EP_ATTR ep_attr; + DAT_RETURN status; + DAT_REGION_DESCRIPTION region; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET iov; + DAT_RMR_TRIPLET r_iov; + DAT_DTO_COOKIE cookie; + int i; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + + status = dat_ia_open("OpenIB-2-cma", 8, &async_evd, &ia); + _OK(status, "dat_ia_open"); + + status = dat_pz_create(ia, &pz); + _OK(status, "dat_pz_create"); + + status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_CR_FLAG, &cr_evd ); + _OK(status, "dat_evd_create CR"); + status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_CONNECTION_FLAG, &con_evd ); + _OK(status, "dat_evd_create CR"); + status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_DTO_FLAG, &dto_evd ); + _OK(status, "dat_evd_create DTO"); + + memset(&ep_attr, 0, sizeof(ep_attr)); + ep_attr.service_type = DAT_SERVICE_TYPE_RC; + ep_attr.max_rdma_size = 0x10000; + ep_attr.qos = 0; + ep_attr.recv_completion_flags = 0; + ep_attr.max_recv_dtos = 10; + ep_attr.max_request_dtos = 10; + 
ep_attr.max_recv_iov = 1; + ep_attr.max_request_iov = 1; + ep_attr.max_rdma_read_in = 4; + ep_attr.max_rdma_read_out = 4; + ep_attr.request_completion_flags = DAT_COMPLETION_DEFAULT_FLAG; + ep_attr.ep_transport_specific_count = 0; + ep_attr.ep_transport_specific = NULL; + ep_attr.ep_provider_specific_count = 0; + ep_attr.ep_provider_specific = NULL; + + status = dat_ep_create(ia, pz, dto_evd, dto_evd, con_evd, &ep_attr, &ep); + _OK(status, "dat_ep_create"); + + for (i = 0; i < REG_MEM_COUNT; i++) { + buf[ i ] = (DAT_RMR_TRIPLET*)malloc(BUF_SIZE); + region.for_va = buf[ i ]; + status = dat_lmr_create(ia, + DAT_MEM_TYPE_VIRTUAL, + region, + BUF_SIZE, + pz, + DAT_MEM_PRIV_ALL_FLAG|DAT_IB_MEM_PRIV_REMOTE_ATOMIC, DAT_VA_TYPE_VA, - &lmr[ i ], - &lmr_context[ i ], - &rmr_context[ i ], - &reg_size[ i ], - &reg_addr[ i ]); - _OK(status, "dat_lmr_create"); - } - - /* register atomic return buffer for original data */ - atomic_buf = (u_int64_t*)malloc(BUF_SIZE); - region.for_va = atomic_buf; - status = dat_lmr_create(ia, - DAT_MEM_TYPE_VIRTUAL, - region, - BUF_SIZE_ATOMIC, - pz, - DAT_MEM_PRIV_ALL_FLAG|DAT_IB_MEM_PRIV_REMOTE_ATOMIC, + &lmr[ i ], + &lmr_context[ i ], + &rmr_context[ i ], + &reg_size[ i ], + &reg_addr[ i ]); + _OK(status, "dat_lmr_create"); + } + + /* register atomic return buffer for original data */ + atomic_buf = (u_int64_t*)malloc(BUF_SIZE); + region.for_va = atomic_buf; + status = dat_lmr_create(ia, + DAT_MEM_TYPE_VIRTUAL, + region, + BUF_SIZE_ATOMIC, + pz, + DAT_MEM_PRIV_ALL_FLAG|DAT_IB_MEM_PRIV_REMOTE_ATOMIC, DAT_VA_TYPE_VA, - &lmr_atomic, - &lmr_atomic_context, - &rmr_atomic_context, - &reg_atomic_size, - &reg_atomic_addr); - _OK(status, "dat_lmr_create atomic"); - - for (i = RECV_BUF_INDEX; i < REG_MEM_COUNT; i++) { - cookie.as_64 = i; - iov.lmr_context = lmr_context[ i ]; - iov.virtual_address = (DAT_VADDR)(unsigned long)buf[ i ]; - iov.segment_length = BUF_SIZE; - - status = dat_ep_post_recv(ep, - 1, - &iov, - cookie, - DAT_COMPLETION_DEFAULT_FLAG); - _OK(status,
"dat_ep_post_recv"); - } - - /* setup receive buffer to initial string to be overwritten */ - strcpy((char*)buf[ RCV_RDMA_BUF_INDEX ], "blah, blah, blah\n"); - - if (server) { - - strcpy((char*)buf[ SND_RDMA_BUF_INDEX ], "server written data"); - - status = dat_psp_create(ia, - SERVER_CONN_QUAL, - cr_evd, - DAT_PSP_CONSUMER_FLAG, - &psp); - _OK(status, "dat_psp_create"); - - printf("Server waiting for connect request\n"); - status = dat_evd_wait(cr_evd, SERVER_TIMEOUT, 1, &event, &nmore); - _OK(status, "listen dat_evd_wait"); - - if (event.event_number != DAT_CONNECTION_REQUEST_EVENT) { - printf("unexpected event after dat_psp_create: 0x%x\n", event.event_number); - exit(1); - } - - if ((event.event_data.cr_arrival_event_data.conn_qual != SERVER_CONN_QUAL) || - (event.event_data.cr_arrival_event_data.sp_handle.psp_handle != psp)) { - - printf("wrong cr event data\n"); - exit(1); - } - - cr = event.event_data.cr_arrival_event_data.cr_handle; - status = dat_cr_accept(cr, ep, 0, (DAT_PVOID)0); - - } else { - struct addrinfo *target; - int rval; - - if (getaddrinfo (hostname, NULL, NULL, &target) != 0) { - printf("Error getting remote address.\n"); - exit(1); - } - - rval = ((struct sockaddr_in *)target->ai_addr)->sin_addr.s_addr; - printf ("Server Name: %s \n", hostname); - printf ("Server Net Address: %d.%d.%d.%d\n", - (rval >> 0) & 0xff, - (rval >> 8) & 0xff, - (rval >> 16) & 0xff, - (rval >> 24) & 0xff); - - remote_addr = *((DAT_IA_ADDRESS_PTR)target->ai_addr); - - strcpy((char*)buf[ SND_RDMA_BUF_INDEX ], "client written data"); - - status = dat_ep_connect(ep, - &remote_addr, - SERVER_CONN_QUAL, - CONN_TIMEOUT, - 0, - (DAT_PVOID)0, - 0, - DAT_CONNECT_DEFAULT_FLAG ); - _OK(status, "dat_psp_create"); - } - - printf("Client waiting for connect response\n"); - status = dat_evd_wait(con_evd, CONN_TIMEOUT, 1, &event, &nmore); - _OK(status, "connect dat_evd_wait"); - - if (event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED) { - printf("unexpected event after 
dat_ep_connect: 0x%x\n", event.event_number); - exit(1); - } - - printf("Connected!\n"); - - /* - * Setup our remote memory and tell the other side about it - */ - printf("Sending RMR data to remote\n"); - r_iov.rmr_context = rmr_context[ RCV_RDMA_BUF_INDEX ]; - r_iov.virtual_address = (DAT_VADDR)((unsigned long)buf[ RCV_RDMA_BUF_INDEX ]); - r_iov.segment_length = BUF_SIZE; - - *buf[ SEND_BUF_INDEX ] = r_iov; - - send_msg( buf[ SEND_BUF_INDEX ], - sizeof(DAT_RMR_TRIPLET), - lmr_context[ SEND_BUF_INDEX ], - cookie, - DAT_COMPLETION_SUPPRESS_FLAG); - - /* - * Wait for their RMR - */ - printf("Waiting for remote to send RMR data\n"); - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); - _OK(status, "dat_evd_wait after dat_ep_post_send"); - - if (event.event_number != DAT_DTO_COMPLETION_EVENT) { - printf("unexpected event waiting for RMR context - 0x%x\n", - event.event_number); - exit(1); - } - - _OK(dto_event->status, "event status for post_send"); - if ((dto_event->transfered_length != sizeof(DAT_RMR_TRIPLET)) || - (dto_event->user_cookie.as_64 != RECV_BUF_INDEX)) { - printf("unexpected event data for receive: len=%d cookie=%d expected %d/%d\n", - (int)dto_event->transfered_length, - (int)dto_event->user_cookie.as_64, - sizeof(DAT_RMR_TRIPLET), RECV_BUF_INDEX); - exit(1); - } - - r_iov = *buf[ RECV_BUF_INDEX ]; - - printf("Received RMR from remote: r_iov: ctx=%x,va=%p,len=%d\n", - r_iov.rmr_context, - (void*)(unsigned long)r_iov.virtual_address, - r_iov.segment_length); - - return(0); -} - -int -disconnect_ep() -{ - DAT_RETURN status; - int i; - DAT_EVENT event; - DAT_COUNT nmore; - - status = dat_ep_disconnect(ep, DAT_CLOSE_DEFAULT); - _OK(status, "dat_ep_disconnect"); - + &lmr_atomic, + &lmr_atomic_context, + &rmr_atomic_context, + &reg_atomic_size, + &reg_atomic_addr); + _OK(status, "dat_lmr_create atomic"); + + for (i = RECV_BUF_INDEX; i < REG_MEM_COUNT; i++) { + cookie.as_64 = i; + iov.lmr_context = lmr_context[ i ]; + iov.virtual_address =
(DAT_VADDR)(unsigned long)buf[ i ]; + iov.segment_length = BUF_SIZE; + + status = dat_ep_post_recv(ep, + 1, + &iov, + cookie, + DAT_COMPLETION_DEFAULT_FLAG); + _OK(status, "dat_ep_post_recv"); + } + + /* setup receive buffer to initial string to be overwritten */ + strcpy((char*)buf[ RCV_RDMA_BUF_INDEX ], "blah, blah, blah\n"); + + if (server) { + + strcpy((char*)buf[ SND_RDMA_BUF_INDEX ], "server written data"); + + status = dat_psp_create(ia, + SERVER_CONN_QUAL, + cr_evd, + DAT_PSP_CONSUMER_FLAG, + &psp); + _OK(status, "dat_psp_create"); + + printf("Server waiting for connect request\n"); + status = dat_evd_wait(cr_evd, SERVER_TIMEOUT, 1, &event, &nmore); + _OK(status, "listen dat_evd_wait"); + + if (event.event_number != DAT_CONNECTION_REQUEST_EVENT) { + printf("unexpected event after dat_psp_create: 0x%x\n", event.event_number); + exit(1); + } + + if ((event.event_data.cr_arrival_event_data.conn_qual != SERVER_CONN_QUAL) || + (event.event_data.cr_arrival_event_data.sp_handle.psp_handle != psp)) { + + printf("wrong cr event data\n"); + exit(1); + } + + cr = event.event_data.cr_arrival_event_data.cr_handle; + status = dat_cr_accept(cr, ep, 0, (DAT_PVOID)0); + + } else { + struct addrinfo *target; + int rval; + + if (getaddrinfo (hostname, NULL, NULL, &target) != 0) { + printf("Error getting remote address.\n"); + exit(1); + } + + rval = ((struct sockaddr_in *)target->ai_addr)->sin_addr.s_addr; + printf ("Server Name: %s \n", hostname); + printf ("Server Net Address: %d.%d.%d.%d\n", + (rval >> 0) & 0xff, + (rval >> 8) & 0xff, + (rval >> 16) & 0xff, + (rval >> 24) & 0xff); + + remote_addr = *((DAT_IA_ADDRESS_PTR)target->ai_addr); + + strcpy((char*)buf[ SND_RDMA_BUF_INDEX ], "client written data"); + + status = dat_ep_connect(ep, + &remote_addr, + SERVER_CONN_QUAL, + CONN_TIMEOUT, + 0, + (DAT_PVOID)0, + 0, + DAT_CONNECT_DEFAULT_FLAG ); + _OK(status, "dat_psp_create"); + } + + printf("Client waiting for connect response\n"); + status = dat_evd_wait(con_evd, 
CONN_TIMEOUT, 1, &event, &nmore); + _OK(status, "connect dat_evd_wait"); + + if (event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED) { + printf("unexpected event after dat_ep_connect: 0x%x\n", event.event_number); + exit(1); + } + + printf("Connected!\n"); + + /* + * Setup our remote memory and tell the other side about it + */ + printf("Sending RMR data to remote\n"); + r_iov.rmr_context = rmr_context[ RCV_RDMA_BUF_INDEX ]; + r_iov.virtual_address = (DAT_VADDR)((unsigned long)buf[ RCV_RDMA_BUF_INDEX ]); + r_iov.segment_length = BUF_SIZE; + + *buf[ SEND_BUF_INDEX ] = r_iov; + + send_msg( buf[ SEND_BUF_INDEX ], + sizeof(DAT_RMR_TRIPLET), + lmr_context[ SEND_BUF_INDEX ], + cookie, + DAT_COMPLETION_SUPPRESS_FLAG); + + /* + * Wait for their RMR + */ + printf("Waiting for remote to send RMR data\n"); + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); + _OK(status, "dat_evd_wait after dat_ep_post_send"); + + if (event.event_number != DAT_DTO_COMPLETION_EVENT) { + printf("unexpected event waiting for RMR context - 0x%x\n", + event.event_number); + exit(1); + } + + _OK(dto_event->status, "event status for post_send"); + if ((dto_event->transfered_length != sizeof(DAT_RMR_TRIPLET)) || + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX)) { + printf("unexpected event data for receive: len=%d cookie=%d expected %d/%d\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64, + sizeof(DAT_RMR_TRIPLET), RECV_BUF_INDEX); + exit(1); + } + + r_iov = *buf[ RECV_BUF_INDEX ]; + + printf("Received RMR from remote: r_iov: ctx=%x,va=%p,len=%d\n", + r_iov.rmr_context, + (void*)(unsigned long)r_iov.virtual_address, + r_iov.segment_length); + + return(0); +} + +int +disconnect_ep() +{ + DAT_RETURN status; + int i; + DAT_EVENT event; + DAT_COUNT nmore; + + status = dat_ep_disconnect(ep, DAT_CLOSE_DEFAULT); + _OK(status, "dat_ep_disconnect"); + status = dat_evd_wait(con_evd, DAT_TIMEOUT_INFINITE, 1, &event, &nmore); _OK(status, "dat_ep_disconnect"); - - if 
(server) { - status = dat_psp_free(psp); - _OK(status, "dat_psp_free"); - } - - for (i = 0; i < REG_MEM_COUNT; i++) { - status = dat_lmr_free(lmr[ i ]); - _OK(status, "dat_lmr_free"); - } - - status = dat_lmr_free(lmr_atomic); - _OK(status, "dat_lmr_free_atomic"); - - status = dat_ep_free(ep); - _OK(status, "dat_ep_free"); - - status = dat_evd_free(dto_evd); - _OK(status, "dat_evd_free DTO"); - status = dat_evd_free(con_evd); - _OK(status, "dat_evd_free CON"); - status = dat_evd_free(cr_evd); - _OK(status, "dat_evd_free CR"); - - status = dat_pz_free(pz); - _OK(status, "dat_pz_free"); - - status = dat_ia_close(ia, DAT_CLOSE_DEFAULT); - _OK(status, "dat_ia_close"); - - return(0); -} - -int -do_immediate() -{ - DAT_REGION_DESCRIPTION region; - DAT_EVENT event; - DAT_COUNT nmore; - DAT_LMR_TRIPLET iov; - DAT_RMR_TRIPLET r_iov; - DAT_DTO_COOKIE cookie; - DAT_RMR_CONTEXT their_context; - DAT_RETURN status; - DAT_UINT32 immed_data; - DAT_UINT32 immed_data_recv; - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = - &event.event_data.dto_completion_event_data; - DAT_IB_EXTENSION_EVENT_DATA *ext_event = - (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; - - printf("\nDoing RDMA WRITE IMMEDIATE DATA\n"); - - if (server) { - immed_data = 0x1111; - } else { - immed_data = 0x7777; - } - - cookie.as_64 = 0x5555; - - r_iov = *buf[ RECV_BUF_INDEX ]; - - iov.lmr_context = lmr_context[ SND_RDMA_BUF_INDEX ]; - iov.virtual_address = (DAT_VADDR)(unsigned long)buf[ SND_RDMA_BUF_INDEX ]; - iov.segment_length = BUF_SIZE; - - cookie.as_64 = 0x9999; - - status = dat_ib_post_rdma_write_immed(ep, // ep_handle - 1, // num_segments - &iov, // LMR - cookie, // user_cookie - &r_iov, // RMR - immed_data, - DAT_COMPLETION_DEFAULT_FLAG); - _OK(status, "dat_ib_post_rdma_write_immed"); - - /* - * Collect first event, write completion or the inbound recv with immed - */ - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); - _OK(status, "dat_evd_wait after dat_ib_post_rdma_write"); - 
if (event.event_number != DAT_IB_DTO_EVENT) - { - printf("unexpected event # waiting for WR-IMMED - 0x%x\n", - event.event_number); - exit(1); - } - - _OK(dto_event->status, "event status"); - if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED) - { - if ((dto_event->transfered_length != BUF_SIZE) || - (dto_event->user_cookie.as_64 != 0x9999)) - { - printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", - (int)dto_event->transfered_length, - (int)dto_event->user_cookie.as_64); - exit(1); - } - } - else if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED_DATA) - { - if ((dto_event->transfered_length != BUF_SIZE) || - (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) - { - printf("unexpected event data of immediate write: len=%d cookie=%d expected %d/%d\n", - (int)dto_event->transfered_length, - (int)dto_event->user_cookie.as_64, - sizeof(int), RECV_BUF_INDEX+1); - exit(1); - } - - /* get immediate data from event */ - immed_data_recv = ext_event->val.immed.data; - } - else - { - printf("unexpected extension type for event - 0x%x, 0x%x\n", - event.event_number, ext_event->type); - exit(1); - } - - - /* - * Collect second event, write completion or the inbound recv with immed - */ - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); - _OK(status, "dat_evd_wait after dat_ib_post_rdma_write"); - if (event.event_number != DAT_IB_DTO_EVENT) - { - printf("unexpected event # waiting for WR-IMMED - 0x%x\n", - event.event_number); - exit(1); - } - - _OK(dto_event->status, "event status"); - if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED) - { - if ((dto_event->transfered_length != BUF_SIZE) || - (dto_event->user_cookie.as_64 != 0x9999)) - { - printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", - (int)dto_event->transfered_length, - (int)dto_event->user_cookie.as_64); - exit(1); - } - } - else if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED_DATA) - { - if ((dto_event->transfered_length != BUF_SIZE) || - 
(dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) - { - printf("unexpected event data of immediate write: len=%d cookie=%d expected %d/%d\n", - (int)dto_event->transfered_length, - (int)dto_event->user_cookie.as_64, - sizeof(int), RECV_BUF_INDEX+1); - exit(1); - } - - /* get immediate data from event */ - immed_data_recv = ext_event->val.immed.data; - } - else - { - printf("unexpected extension type for event - 0x%x, 0x%x\n", - event.event_number, ext_event->type); - exit(1); - } - - if ((server) && (immed_data_recv != 0x7777)) - { - printf("ERROR: Server got unexpected immed_data_recv 0x%x/0x%x\n", - 0x7777, immed_data_recv); - exit(1); - } - else if ((!server) && (immed_data_recv != 0x1111)) - { - printf("ERROR: Client got unexpected immed_data_recv 0x%x/0x%x\n", - 0x1111, immed_data_recv); - exit(1); - } - - if (server) - printf("Server received immed_data=0x%x\n", immed_data_recv); - else - printf("Client received immed_data=0x%x\n", immed_data_recv); - - printf("rdma buffer %p contains: %s\n", - buf[ RCV_RDMA_BUF_INDEX ], buf[ RCV_RDMA_BUF_INDEX ]); - - printf("\n RDMA_WRITE_WITH_IMMEDIATE_DATA test - PASSED\n"); - return (0); -} - -int -do_cmp_swap() -{ - DAT_DTO_COOKIE cookie; - DAT_RETURN status; - DAT_EVENT event; - DAT_COUNT nmore; - DAT_LMR_TRIPLET l_iov; - DAT_RMR_TRIPLET r_iov; - volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = - &event.event_data.dto_completion_event_data; - DAT_IB_EXTENSION_EVENT_DATA *ext_event = - (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; - - printf("\nDoing CMP and SWAP\n"); - - r_iov = *buf[ RECV_BUF_INDEX ]; - - l_iov.lmr_context = lmr_atomic_context; - l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; - l_iov.segment_length = BUF_SIZE_ATOMIC; - - cookie.as_64 = 3333; - if (server) { - *target = 0x12345; - sleep(1); - /* server does not compare and should not swap */ - status = dat_ib_post_cmp_and_swap( ep, - 
(DAT_UINT64)0x654321, - (DAT_UINT64)0x6789A, - &l_iov, - cookie, - &r_iov, - DAT_COMPLETION_DEFAULT_FLAG); - } else { - *target = 0x54321; - sleep(1); - /* client does compare and should swap */ - status = dat_ib_post_cmp_and_swap( ep, - (DAT_UINT64)0x12345, - (DAT_UINT64)0x98765, - &l_iov, - cookie, - &r_iov, - DAT_COMPLETION_DEFAULT_FLAG); - } - _OK(status, "dat_ib_post_cmp_and_swap"); - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); - _OK(status, "dat_evd_wait for compare and swap"); - if (event.event_number != DAT_IB_DTO_EVENT) { - printf("unexpected event after post_cmp_and_swap: 0x%x\n", - event.event_number); - exit(1); - } - - _OK(dto_event->status, "event status for CMP and SWAP"); - if (ext_event->type != DAT_IB_CMP_AND_SWAP) { - printf("unexpected event data of cmp and swap : type=%d cookie=%d original 0x%llx\n", - (int)ext_event->type, - (int)dto_event->user_cookie.as_64, - *atomic_buf); - exit(1); - } - sleep(1); /* wait for other side to complete swap */ - if (server) { - printf("Server got original data = 0x%llx, expected 0x54321\n", *atomic_buf); - printf("Client final result (on server) = 0x%llx, expected 0x98765\n", *target); - - if (*atomic_buf != 0x54321 || *target != 0x98765) { - printf("ERROR: Server CMP_SWAP\n"); - exit(1); - } - } else { - printf("Client got original data = 0x%llx, expected 0x12345\n",*atomic_buf); - printf("Server final result (on client) = 0x%llx, expected 0x54321\n", *target); - - if (*atomic_buf != 0x12345 || *target != 0x54321) { - printf("ERROR: Client CMP_SWAP\n"); - exit(1); - } - } - printf("\n CMP_SWAP test - PASSED\n"); - return(0); -} - -int -do_fetch_add() -{ - DAT_DTO_COOKIE cookie; - DAT_RETURN status; - DAT_EVENT event; - DAT_COUNT nmore; - DAT_LMR_TRIPLET l_iov; - DAT_RMR_TRIPLET r_iov; - volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = - &event.event_data.dto_completion_event_data; - DAT_IB_EXTENSION_EVENT_DATA *ext_event = - 
(DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; - - printf("\nDoing FETCH and ADD\n"); - - r_iov = *buf[ RECV_BUF_INDEX ]; - - l_iov.lmr_context = lmr_atomic_context; - l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; - l_iov.segment_length = BUF_SIZE_ATOMIC; - - cookie.as_64 = 0x7777; - if (server) { - /* Wait for client to finish cmp_swap */ - while (*target != 0x98765) - sleep(1); - *target = 0x10; - sleep(1); - status = dat_ib_post_fetch_and_add( ep, - (DAT_UINT64)0x100, - &l_iov, - cookie, - &r_iov, - DAT_COMPLETION_DEFAULT_FLAG); - } else { - /* Wait for server, no swap so nothing to check */ - *target = 0x100; - sleep(1); - status = dat_ib_post_fetch_and_add( ep, - (DAT_UINT64)0x10, - &l_iov, - cookie, - &r_iov, - DAT_COMPLETION_DEFAULT_FLAG); - } - _OK(status, "dat_ib_post_fetch_and_add"); - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); - _OK(status, "dat_evd_wait for fetch and add"); - if (event.event_number != DAT_IB_DTO_EVENT) { - printf("unexpected event after post_fetch_and_add: 0x%x\n", event.event_number); - exit(1); - } - - _OK(dto_event->status, "event status for FETCH and ADD"); - if (ext_event->type != DAT_IB_FETCH_AND_ADD) { - printf("unexpected event data of fetch and add : type=%d cookie=%d original%d\n", - (int)ext_event->type, - (int)dto_event->user_cookie.as_64, - (int)*atomic_buf); - exit(1); - } - - if (server) { - printf("Client original data (on server) = 0x%llx, expected 0x100\n", *atomic_buf); - } else { - printf("Server original data (on client) = 0x%llx, expected 0x10\n", *atomic_buf); - } - - sleep(1); - - if (server) { - status = dat_ib_post_fetch_and_add( ep, - (DAT_UINT64)0x100, - &l_iov, - cookie, - &r_iov, - DAT_COMPLETION_DEFAULT_FLAG); - } else { - status = dat_ib_post_fetch_and_add( ep, - (DAT_UINT64)0x10, - &l_iov, - cookie, - &r_iov, - DAT_COMPLETION_DEFAULT_FLAG); - } - - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); - _OK(status, "dat_evd_wait for second 
fetch and add"); - if (event.event_number != DAT_IB_DTO_EVENT) { - printf("unexpected event after second post_fetch_and_add: 0x%x\n", event.event_number); - exit(1); - } - - _OK(dto_event->status, "event status for second FETCH and ADD"); - if (ext_event->type != DAT_IB_FETCH_AND_ADD) { - printf("unexpected event data of second fetch and add : type=%d cookie=%d original%d\n", - (int)ext_event->type, - (int)dto_event->user_cookie.as_64, - (long)atomic_buf); - exit(1); - } - - sleep(1); /* wait for other side to complete fetch_add */ - - if (server) { - printf("Server got original data = 0x%llx, expected 0x200\n", *atomic_buf); - printf("Client final result (on server) = 0x%llx, expected 0x30\n", *target); - - if (*atomic_buf != 0x200 || *target != 0x30) { - printf("ERROR: Server FETCH_ADD\n"); - exit(1); - } - } else { - printf("Server side original data = 0x%llx, expected 0x20\n", *atomic_buf); - printf("Server final result (on client) = 0x%llx, expected 0x300\n", *target); - - if (*atomic_buf != 0x20 || *target != 0x300) { - printf("ERROR: Server FETCH_ADD\n"); - exit(1); - } - } - printf("\n FETCH_ADD test - PASSED\n"); - return(0); -} - -int -main(int argc, char **argv) -{ - char *hostname; - - if (argc > 2) { - printf(usage); - exit(1); - } - - if ((argc == 1) || strcmp(argv[ 1 ], "-s") == 0) - { - server = 1; - } else { - server = 0; - hostname = argv[ 1 ]; - } - - - /* - * connect - */ - if (connect_ep(hostname)) { - exit(1); - } - if (do_immediate()) { - exit(1); - } - if (do_cmp_swap()) { - exit(1); - } - if (do_fetch_add()) { - exit(1); - } - return (disconnect_ep()); -} + + if (server) { + status = dat_psp_free(psp); + _OK(status, "dat_psp_free"); + } + + for (i = 0; i < REG_MEM_COUNT; i++) { + status = dat_lmr_free(lmr[ i ]); + _OK(status, "dat_lmr_free"); + } + + status = dat_lmr_free(lmr_atomic); + _OK(status, "dat_lmr_free_atomic"); + + status = dat_ep_free(ep); + _OK(status, "dat_ep_free"); + + status = dat_evd_free(dto_evd); + _OK(status, 
"dat_evd_free DTO"); + status = dat_evd_free(con_evd); + _OK(status, "dat_evd_free CON"); + status = dat_evd_free(cr_evd); + _OK(status, "dat_evd_free CR"); + + status = dat_pz_free(pz); + _OK(status, "dat_pz_free"); + + status = dat_ia_close(ia, DAT_CLOSE_DEFAULT); + _OK(status, "dat_ia_close"); + + return(0); +} + +int +do_immediate() +{ + DAT_REGION_DESCRIPTION region; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET iov; + DAT_RMR_TRIPLET r_iov; + DAT_DTO_COOKIE cookie; + DAT_RMR_CONTEXT their_context; + DAT_RETURN status; + DAT_UINT32 immed_data; + DAT_UINT32 immed_data_recv; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + DAT_IB_EXTENSION_EVENT_DATA *ext_event = + (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; + + printf("\nDoing RDMA WRITE IMMEDIATE DATA\n"); + + if (server) { + immed_data = 0x1111; + } else { + immed_data = 0x7777; + } + + cookie.as_64 = 0x5555; + + r_iov = *buf[ RECV_BUF_INDEX ]; + + iov.lmr_context = lmr_context[ SND_RDMA_BUF_INDEX ]; + iov.virtual_address = (DAT_VADDR)(unsigned long)buf[ SND_RDMA_BUF_INDEX ]; + iov.segment_length = BUF_SIZE; + + cookie.as_64 = 0x9999; + + status = dat_ib_post_rdma_write_immed(ep, // ep_handle + 1, // num_segments + &iov, // LMR + cookie, // user_cookie + &r_iov, // RMR + immed_data, + DAT_COMPLETION_DEFAULT_FLAG); + _OK(status, "dat_ib_post_rdma_write_immed"); + + /* + * Collect first event, write completion or the inbound recv with immed + */ + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); + _OK(status, "dat_evd_wait after dat_ib_post_rdma_write"); + if (event.event_number != DAT_IB_DTO_EVENT) + { + printf("unexpected event # waiting for WR-IMMED - 0x%x\n", + event.event_number); + exit(1); + } + + _OK(dto_event->status, "event status"); + if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != 0x9999)) + { + printf("unexpected event 
data for rdma_write_immed: len=%d cookie=0x%x\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64); + exit(1); + } + } + else if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED_DATA) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) + { + printf("unexpected event data of immediate write: len=%d cookie=%d expected %d/%d\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64, + sizeof(int), RECV_BUF_INDEX+1); + exit(1); + } + + /* get immediate data from event */ + immed_data_recv = ext_event->val.immed.data; + } + else + { + printf("unexpected extension type for event - 0x%x, 0x%x\n", + event.event_number, ext_event->type); + exit(1); + } + + + /* + * Collect second event, write completion or the inbound recv with immed + */ + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); + _OK(status, "dat_evd_wait after dat_ib_post_rdma_write"); + if (event.event_number != DAT_IB_DTO_EVENT) + { + printf("unexpected event # waiting for WR-IMMED - 0x%x\n", + event.event_number); + exit(1); + } + + _OK(dto_event->status, "event status"); + if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != 0x9999)) + { + printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64); + exit(1); + } + } + else if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED_DATA) + { + if ((dto_event->transfered_length != BUF_SIZE) || + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) + { + printf("unexpected event data of immediate write: len=%d cookie=%d expected %d/%d\n", + (int)dto_event->transfered_length, + (int)dto_event->user_cookie.as_64, + sizeof(int), RECV_BUF_INDEX+1); + exit(1); + } + + /* get immediate data from event */ + immed_data_recv = ext_event->val.immed.data; + } + else + { + printf("unexpected extension 
type for event - 0x%x, 0x%x\n", + event.event_number, ext_event->type); + exit(1); + } + + if ((server) && (immed_data_recv != 0x7777)) + { + printf("ERROR: Server got unexpected immed_data_recv 0x%x/0x%x\n", + 0x7777, immed_data_recv); + exit(1); + } + else if ((!server) && (immed_data_recv != 0x1111)) + { + printf("ERROR: Client got unexpected immed_data_recv 0x%x/0x%x\n", + 0x1111, immed_data_recv); + exit(1); + } + + if (server) + printf("Server received immed_data=0x%x\n", immed_data_recv); + else + printf("Client received immed_data=0x%x\n", immed_data_recv); + + printf("rdma buffer %p contains: %s\n", + buf[ RCV_RDMA_BUF_INDEX ], buf[ RCV_RDMA_BUF_INDEX ]); + + printf("\n RDMA_WRITE_WITH_IMMEDIATE_DATA test - PASSED\n"); + return (0); +} + +int +do_cmp_swap() +{ + DAT_DTO_COOKIE cookie; + DAT_RETURN status; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET l_iov; + DAT_RMR_TRIPLET r_iov; + volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + DAT_IB_EXTENSION_EVENT_DATA *ext_event = + (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; + + printf("\nDoing CMP and SWAP\n"); + + r_iov = *buf[ RECV_BUF_INDEX ]; + + l_iov.lmr_context = lmr_atomic_context; + l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; + l_iov.segment_length = BUF_SIZE_ATOMIC; + + cookie.as_64 = 3333; + if (server) { + *target = 0x12345; + sleep(1); + /* server does not compare and should not swap */ + status = dat_ib_post_cmp_and_swap( ep, + (DAT_UINT64)0x654321, + (DAT_UINT64)0x6789A, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } else { + *target = 0x54321; + sleep(1); + /* client does compare and should swap */ + status = dat_ib_post_cmp_and_swap( ep, + (DAT_UINT64)0x12345, + (DAT_UINT64)0x98765, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } + _OK(status, "dat_ib_post_cmp_and_swap"); + status = 
dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); + _OK(status, "dat_evd_wait for compare and swap"); + if (event.event_number != DAT_IB_DTO_EVENT) { + printf("unexpected event after post_cmp_and_swap: 0x%x\n", + event.event_number); + exit(1); + } + + _OK(dto_event->status, "event status for CMP and SWAP"); + if (ext_event->type != DAT_IB_CMP_AND_SWAP) { + printf("unexpected event data of cmp and swap : type=%d cookie=%d original 0x%llx\n", + (int)ext_event->type, + (int)dto_event->user_cookie.as_64, + *atomic_buf); + exit(1); + } + sleep(1); /* wait for other side to complete swap */ + if (server) { + printf("Server got original data = 0x%llx, expected 0x54321\n", *atomic_buf); + printf("Client final result (on server) = 0x%llx, expected 0x98765\n", *target); + + if (*atomic_buf != 0x54321 || *target != 0x98765) { + printf("ERROR: Server CMP_SWAP\n"); + exit(1); + } + } else { + printf("Client got original data = 0x%llx, expected 0x12345\n",*atomic_buf); + printf("Server final result (on client) = 0x%llx, expected 0x54321\n", *target); + + if (*atomic_buf != 0x12345 || *target != 0x54321) { + printf("ERROR: Client CMP_SWAP\n"); + exit(1); + } + } + printf("\n CMP_SWAP test - PASSED\n"); + return(0); +} + +int +do_fetch_add() +{ + DAT_DTO_COOKIE cookie; + DAT_RETURN status; + DAT_EVENT event; + DAT_COUNT nmore; + DAT_LMR_TRIPLET l_iov; + DAT_RMR_TRIPLET r_iov; + volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = + &event.event_data.dto_completion_event_data; + DAT_IB_EXTENSION_EVENT_DATA *ext_event = + (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; + + printf("\nDoing FETCH and ADD\n"); + + r_iov = *buf[ RECV_BUF_INDEX ]; + + l_iov.lmr_context = lmr_atomic_context; + l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; + l_iov.segment_length = BUF_SIZE_ATOMIC; + + cookie.as_64 = 0x7777; + if (server) { + /* Wait for client to finish cmp_swap */ + while (*target != 
0x98765) + sleep(1); + *target = 0x10; + sleep(1); + status = dat_ib_post_fetch_and_add( ep, + (DAT_UINT64)0x100, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } else { + /* Wait for server, no swap so nothing to check */ + *target = 0x100; + sleep(1); + status = dat_ib_post_fetch_and_add( ep, + (DAT_UINT64)0x10, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } + _OK(status, "dat_ib_post_fetch_and_add"); + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); + _OK(status, "dat_evd_wait for fetch and add"); + if (event.event_number != DAT_IB_DTO_EVENT) { + printf("unexpected event after post_fetch_and_add: 0x%x\n", event.event_number); + exit(1); + } + + _OK(dto_event->status, "event status for FETCH and ADD"); + if (ext_event->type != DAT_IB_FETCH_AND_ADD) { + printf("unexpected event data of fetch and add : type=%d cookie=%d original%d\n", + (int)ext_event->type, + (int)dto_event->user_cookie.as_64, + (int)*atomic_buf); + exit(1); + } + + if (server) { + printf("Client original data (on server) = 0x%llx, expected 0x100\n", *atomic_buf); + } else { + printf("Server original data (on client) = 0x%llx, expected 0x10\n", *atomic_buf); + } + + sleep(1); + + if (server) { + status = dat_ib_post_fetch_and_add( ep, + (DAT_UINT64)0x100, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } else { + status = dat_ib_post_fetch_and_add( ep, + (DAT_UINT64)0x10, + &l_iov, + cookie, + &r_iov, + DAT_COMPLETION_DEFAULT_FLAG); + } + + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); + _OK(status, "dat_evd_wait for second fetch and add"); + if (event.event_number != DAT_IB_DTO_EVENT) { + printf("unexpected event after second post_fetch_and_add: 0x%x\n", event.event_number); + exit(1); + } + + _OK(dto_event->status, "event status for second FETCH and ADD"); + if (ext_event->type != DAT_IB_FETCH_AND_ADD) { + printf("unexpected event data of second fetch and add : type=%d cookie=%d original%d\n", + 
(int)ext_event->type, + (int)dto_event->user_cookie.as_64, + (long)atomic_buf); + exit(1); + } + + sleep(1); /* wait for other side to complete fetch_add */ + + if (server) { + printf("Server got original data = 0x%llx, expected 0x200\n", *atomic_buf); + printf("Client final result (on server) = 0x%llx, expected 0x30\n", *target); + + if (*atomic_buf != 0x200 || *target != 0x30) { + printf("ERROR: Server FETCH_ADD\n"); + exit(1); + } + } else { + printf("Server side original data = 0x%llx, expected 0x20\n", *atomic_buf); + printf("Server final result (on client) = 0x%llx, expected 0x300\n", *target); + + if (*atomic_buf != 0x20 || *target != 0x300) { + printf("ERROR: Server FETCH_ADD\n"); + exit(1); + } + } + printf("\n FETCH_ADD test - PASSED\n"); + return(0); +} + +int +main(int argc, char **argv) +{ + char *hostname; + + if (argc > 2) { + printf(usage); + exit(1); + } + + if ((argc == 1) || strcmp(argv[ 1 ], "-s") == 0) + { + server = 1; + } else { + server = 0; + hostname = argv[ 1 ]; + } + + + /* + * connect + */ + if (connect_ep(hostname)) { + exit(1); + } + if (do_immediate()) { + exit(1); + } + if (do_cmp_swap()) { + exit(1); + } + if (do_fetch_add()) { + exit(1); + } + return (disconnect_ep()); +} From arlin.r.davis at intel.com Thu Sep 20 12:18:36 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 20 Sep 2007 12:18:36 -0700 Subject: [ofa-general] [PATCH] uDAPL 1.2 mods to coexist with uDAPL 2.0 Message-ID: <000501c7fbbb$0d084390$19b7020a@amr.corp.intel.com> James, Please review patches to allow coexistence of 2.0 and 1.2 libraries.
Modifications to DAT 1.2 package to coexist with 2.0 libraries - fix RPM specfile, configure.in, 1.2.2 package - update dat.conf Signed-off by: Arlin Davis ardavis at ichips.intel.com diff --git a/configure.in b/configure.in index e11fa73..3cb3d1b 100644 --- a/configure.in +++ b/configure.in @@ -1,11 +1,11 @@ dnl Process this file with autoconf to produce a configure script. AC_PREREQ(2.57) -AC_INIT(dapl, 1.2.1, openib-general at openib.org) +AC_INIT(dapl, 1.2.2, openib-general at openib.org) AC_CONFIG_SRCDIR([dat/udat/udat.c]) AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) -AM_INIT_AUTOMAKE(dapl, 1.2.1) +AM_INIT_AUTOMAKE(dapl, 1.2.2) AM_PROG_LIBTOOL diff --git a/doc/dat.conf b/doc/dat.conf index cb9ff00..005f9ee 100644 --- a/doc/dat.conf +++ b/doc/dat.conf @@ -1,5 +1,5 @@ # -# DAT 1.2 configuration file +# DAT 1.2 and 2.0 configuration file # # Each entry should have the following fields: # @@ -9,13 +9,18 @@ # For the uDAPL cma provder, specify as one of the following: # network address, network hostname, or netdev name and 0 for port # -# Simple (OpenIB-cma) default with netdev name provided first on list +# Simple (OpenIB-cma) default with netdev name provided first on list # to enable use of same dat.conf version on all nodes -# -# Add examples for multiple interfaces and IPoIB HA fail over, and bonding # -OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib0 0" "" -OpenIB-cma-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib1 0" "" -OpenIB-cma-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib2 0" "" -OpenIB-cma-3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib3 0" "" -OpenIB-bond u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "bond0 0" "" +# Add examples for multiple interfaces and IPoIB HA fail over, and bonding +# +OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" "" +OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" "" 
+OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0" "" +OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0" "" +OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0" "" +OpenIB-2-cma u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib0 0" "" +OpenIB-2-cma-1 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib1 0" "" +OpenIB-2-cma-2 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib2 0" "" +OpenIB-2-cma-3 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib3 0" "" +OpenIB-2-bond u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "bond0 0" "" diff --git a/libdat.spec.in b/libdat.spec.in index 7e81b97..15b8694 100644 --- a/libdat.spec.in +++ b/libdat.spec.in @@ -33,7 +33,7 @@ # $Id: $ %define ver 1.2 -%define RELEASE 1 +%define RELEASE 2 %define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} Summary: Userspace DAT and DAPL API. @@ -43,8 +43,8 @@ Release: %rel%{?dist} License: Dual GPL/BSD/CPL Group: System Environment/Libraries -BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) -Source: http://openfabrics.org/~ardavis/%{name}-%{version}-%{release}.tgz +BuildRoot: %{_tmppath}/%{name}-%{version}.%{release}-root-%(%{__id_u} -n) +Source: http://openfabrics.org/downloads/dapl/%{name}-%{version}.%{release}.tar.gz Url: http://openfabrics.org/ %description @@ -54,7 +54,7 @@ RDMA API that supports DAT 1.2 specification %package devel Summary: Development files for the libdat and libdapl libraries Group: System Environment/Libraries -Requires: %{name} = %{version}-%{release} +Requires: %{name} = %{version}.%{release} %description devel Static libraries and header files for the libdat and libdapl library. @@ -62,16 +62,15 @@ Static libraries and header files for the libdat and libdapl library. 
%package utils Summary: Test suites for uDAPL library Group: System Environment/Libraries -Requires: %{name} = %{version}-%{release} +Requires: %{name} = %{version}.%{release} %description utils Useful test suites to validate uDAPL library API's. %prep -%setup -q -n %{name} +%setup -q -n %{name}-%{version}.%{release} %build -./autogen.sh %configure make @@ -112,7 +111,10 @@ rm -rf $RPM_BUILD_ROOT %{_mandir}/man1/* %changelog -* Wed June 6 2007 Arlin Davis - 1.2.1 +* Wed Jun 6 2007 Arlin Davis - 1.2.2 +- OFED 1.3, DAT/DAPL Version 1.2, Release 2 + +* Wed Jun 6 2007 Arlin Davis - 1.2.1 - OFED 1.2, DAT/DAPL Version 1.2, Release 1 * Fri Oct 20 2006 Arlin Davis - 1.2.0 From mst at dev.mellanox.co.il Thu Sep 20 12:36:04 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 20 Sep 2007 21:36:04 +0200 Subject: [ofa-general] Re: [PATCHv2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists In-Reply-To: References: <200709041047.32062.jackm@dev.mellanox.co.il> <20070919063421.GA6185@mellanox.co.il> Message-ID: <20070920193604.GA31861@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCHv2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists > > OK, I added the patch below to my tree. I cleaned up Jack's patch a > little and it seems to work for me; I hope I didn't break anything. BTW, isn't it actually 2.6.23 material? This fixes data corruption ... -- MST From rdreier at cisco.com Thu Sep 20 13:12:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Sep 2007 13:12:54 -0700 Subject: [ofa-general] Re: [PATCHv2] IB/mlx4: Handle new FW requirement for send request prefetching, for WQE sg lists In-Reply-To: <20070920193604.GA31861@mellanox.co.il> (Michael S. 
Tsirkin's message of "Thu, 20 Sep 2007 21:36:04 +0200") References: <200709041047.32062.jackm@dev.mellanox.co.il> <20070919063421.GA6185@mellanox.co.il> <20070920193604.GA31861@mellanox.co.il> Message-ID: > BTW, isn't it actually 2.6.23 material? This fixes data corruption ... I don't know. No one told me how severe the impact is. Has anyone seen this make a difference outside of a synthetic stress test? - R. From sashak at voltaire.com Thu Sep 20 13:28:09 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 20 Sep 2007 22:28:09 +0200 Subject: [ofa-general] [ANNOUNCE] management tarballs release Message-ID: <20070920202809.GB21834@sashak.voltaire.com> Hi, There is a new release of the management (OpenSM and infiniband diagnostics) tarballs available in: http://www.openfabrics.org/downloads/management/ md5sum: 3d0bad9aa4cedc7f88a2cc0d7b7ec3ea dist/infiniband-diags-1.3.2.tar.gz b6f2274fb3bc949902eb2d501d3dc1cc dist/libibcommon-1.0.5.tar.gz 4a3b18c5e6eac4020cfe2bc095600e53 dist/libibmad-1.1.2.tar.gz 4a76a71f38fdc9ae4314233e9a46b10b dist/libibumad-1.1.3.tar.gz 7cb18c8ce4bd74d432e25f220c9ca32f dist/opensm-3.1.5.tar.gz Sasha From sashak at voltaire.com Thu Sep 20 13:29:36 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 20 Sep 2007 22:29:36 +0200 Subject: [ewg] Re: [ofa-general] RE: OFA website edits In-Reply-To: <46F0060E.1080505@ichips.intel.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com> Message-ID: <20070920202936.GC21834@sashak.voltaire.com> On 10:08 Tue 18 Sep , Arlin Davis wrote: > > Maintainers, > > Please move your packages and update your WEB_README. Currently we only have > rdmacm, dapl, cxgb3, and WinOF updated for this process. done for management. 
Sasha From ralph.campbell at qlogic.com Thu Sep 20 16:33:44 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 20 Sep 2007 16:33:44 -0700 Subject: [ofa-general] [PATCH] IB/core - possible bug in handling link down in ib_sa_join_multicast() Message-ID: <1190331224.20700.27.camel@brick.pathscale.com> I was looking at the code for multicast.c and noticed that ib_sa_join_multicast() calls queue_join() which puts the request at the front of the group->pending_list. If this is a second request, it seems like it would interfere with process_join_error() since group->last_join won't point to the member at the head of the pending_list. The sequence would thus be: 1. ib_sa_join_multicast() // puts member1 on head of pending_list and starts work thread 2. mcast_work_handler() // calls send_join() which sets group->last_join to member1 3. ib_sa_join_multicast() // puts member2 on head of pending_list 4. IB_EVENT_PORT_ERR event calls mcast_groups_lost() // sets group->state to MCAST_ERROR 5. join_handler() is called with error status 6. process_join_error() fails to process member1 since it doesn't match the first entry in the group->pending_list. 
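The ordering problem in steps 1-6 can be replayed outside the kernel with a toy singly linked list. This is a minimal sketch under stated assumptions, not the multicast.c code: `struct member`, `struct group`, `queue_front()`, `queue_back()` and `replay()` are hypothetical stand-ins, and the only thing modelled is whether the member recorded as last_join is still first in the pending list when the error path runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct member {
	int id;
	struct member *next;
};

struct group {
	struct member *pending_head;	/* models group->pending_list */
	struct member *last_join;	/* models group->last_join */
};

/* Models list_add(): the new entry goes to the front of the list. */
static void queue_front(struct group *g, struct member *m)
{
	assert(g != NULL && m != NULL);
	m->next = g->pending_head;
	g->pending_head = m;
}

/* Models list_add_tail(): the new entry goes to the back of the list. */
static void queue_back(struct group *g, struct member *m)
{
	struct member **pp;

	assert(g != NULL && m != NULL);
	pp = &g->pending_head;
	while (*pp)
		pp = &(*pp)->next;
	m->next = NULL;
	*pp = m;
}

/*
 * Returns 1 if the member whose join is in flight is still the first
 * pending entry, which is what the error path is described as expecting.
 */
static int error_path_finds_last_join(const struct group *g)
{
	return g->pending_head != NULL && g->pending_head == g->last_join;
}

/* Replays steps 1-3; use_tail selects the list_add_tail()-like behaviour. */
static int replay(int use_tail)
{
	struct member m1 = { 1, NULL }, m2 = { 2, NULL };
	struct group g = { NULL, NULL };

	if (use_tail)
		queue_back(&g, &m1);
	else
		queue_front(&g, &m1);

	g.last_join = g.pending_head;	/* step 2: join for member1 is sent */

	if (use_tail)
		queue_back(&g, &m2);
	else
		queue_front(&g, &m2);

	return error_path_finds_last_join(&g);
}
```

In this model replay(0), which queues at the front, returns 0 because member2 has displaced member1 at the head, while replay(1), which queues at the tail, returns 1.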
Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c index 15b4c4d..1bc1fe6 100644 --- a/drivers/infiniband/core/multicast.c +++ b/drivers/infiniband/core/multicast.c @@ -196,7 +196,7 @@ static void queue_join(struct mcast_member *member) unsigned long flags; spin_lock_irqsave(&group->lock, flags); - list_add(&member->list, &group->pending_list); + list_add_tail(&member->list, &group->pending_list); if (group->state == MCAST_IDLE) { group->state = MCAST_BUSY; atomic_inc(&group->refcount); From mshefty at ichips.intel.com Thu Sep 20 17:00:30 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Sep 2007 17:00:30 -0700 Subject: [ofa-general] [PATCH] IB/core - possible bug in handling link down in ib_sa_join_multicast() In-Reply-To: <1190331224.20700.27.camel@brick.pathscale.com> References: <1190331224.20700.27.camel@brick.pathscale.com> Message-ID: <46F3099E.7040008@ichips.intel.com> Ralph Campbell wrote: > I was looking at the code for multicast.c and noticed that > ib_sa_join_multicast() calls queue_join() which puts the > request at the front of the group->pending_list. If this > is a second request, it seems like it would interfere with > process_join_error() since group->last_join won't point > to the member at the head of the pending_list. The sequence > would thus be: Thanks. This does indeed appear to be a bug, which your patch should fix. However, to clarify, the problem is really: > 1. ib_sa_join_multicast() > // puts member1 on head of pending_list and starts work thread > 2. mcast_work_handler() > // calls send_join() which sets group->last_join to member1 > 3. ib_sa_join_multicast() > // puts member2 on head of pending_list > 4. IB_EVENT_PORT_ERR event calls mcast_groups_lost() > // sets group->state to MCAST_ERROR replace 4 above with: 4. Join operation fails with non-zero status. I.e. 
the problem is related to a failure response from the SA, perhaps due to an invalid setting for the multicast group, and not related to a port event. > 5. join_handler() is called with error status > 6. process_join_error() fails to process member1 since > it doesn't match the first entry in the group->pending_list. The impact is that the failed join request gets tossed. The request at the head of the pending_list now gets processed. After it completes, the original request (from step 1) ends up trying again. So, everything should eventually work out as expected. Roland, can we please queue this for 2.6.24? Would you like it resubmitted with an updated patch description? - Sean > Signed-off-by: Ralph Campbell > > diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c > index 15b4c4d..1bc1fe6 100644 > --- a/drivers/infiniband/core/multicast.c > +++ b/drivers/infiniband/core/multicast.c > @@ -196,7 +196,7 @@ static void queue_join(struct mcast_member *member) > unsigned long flags; > > spin_lock_irqsave(&group->lock, flags); > - list_add(&member->list, &group->pending_list); > + list_add_tail(&member->list, &group->pending_list); > if (group->state == MCAST_IDLE) { > group->state = MCAST_BUSY; > atomic_inc(&group->refcount); > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu Sep 20 18:44:21 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 20 Sep 2007 18:44:21 -0700 Subject: [ofa-general] [PATCH] IB/core - possible bug in handling link down in ib_sa_join_multicast() In-Reply-To: <46F3099E.7040008@ichips.intel.com> (Sean Hefty's message of "Thu, 20 Sep 2007 17:00:30 -0700") References: <1190331224.20700.27.camel@brick.pathscale.com> <46F3099E.7040008@ichips.intel.com> Message-ID: > Roland, 
can we please queue this for 2.6.24? Would you like it
> resubmitted with an updated patch description?

Yes, please, if the original description is wrong then please correct it.

From kliteyn at mellanox.co.il Thu Sep 20 22:17:37 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 21 Sep 2007 07:17:37 +0200
Subject: [ofa-general] nightly osm_sim report 2007-09-21:normal completion
Message-ID:

OSM Simulation Regression Summary
[Generated mail - please do NOT reply]

OpenSM binary date = 2007-09-20
OpenSM git rev = Thu_Sep_20_19:16:34_2007 [3ec8b607b6aebad15c314c59888ceea19b3180fe]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=520 Pass=520 Fail=0

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:

From keshetti85-student at yahoo.co.in Fri Sep 21 00:24:08 2007
From: keshetti85-student at yahoo.co.in (keshetti85-student at yahoo.co.in)
Date: Fri, 21 Sep 2007 12:54:08 +0530 (IST)
Subject: [ofa-general] ***SPAM*** [query] Multipath discovery in openSM
Message-ID: <634370.92702.qm@web8324.mail.in.yahoo.com>

What is the exact significance of the configurable option LMC in the opensm.conf file?
If there are multiple paths between two end nodes in a network and I set LMC > 0, does openSM itself identify those routes and update the switch forwarding tables, or is that the duty of some other consumer? And after multiple paths are configured between end nodes, how exactly are they used for path redundancy and load sharing? Again, is that the duty of openSM (or of any SM) or of the application?

PS: Please CC your valuable responses to my e-mail address.

regards,
Mahesh

From krkumar2 at in.ibm.com Fri Sep 21 01:16:55 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 21 Sep 2007 13:46:55 +0530
Subject: [ofa-general] [PATCH] Cleanup ipoib_poll() to use meaningful variable names
Message-ID: <20070921081655.13058.93140.sendpatchset@localhost.localdomain>

1. Clean up variable names in ipoib_poll().
2. "while loop" optimization in the poll handler, since net_rx_action guarantees that 'budget' is at least 1.
Signed-off-by: Krishna Kumar --- ipoib_ib.c | 20 +++++++------------- 1 files changed, 7 insertions(+), 13 deletions(-) diff -ruNp a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-21 13:16:41.000000000 +0530 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-21 13:20:42.000000000 +0530 @@ -285,18 +285,15 @@ int ipoib_poll(struct napi_struct *napi, { struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, napi); struct net_device *dev = priv->dev; - int done; - int t; - int n, i; - - done = 0; + int num_wc, max_wc; + int done = 0; poll_more: - while (done < budget) { - int max = (budget - done); + do { + int i; - t = min(IPOIB_NUM_WC, max); - n = ib_poll_cq(priv->cq, t, priv->ibwc); + max_wc = min(IPOIB_NUM_WC, budget - done); + num_wc = ib_poll_cq(priv->cq, max_wc, priv->ibwc); for (i = 0; i < n; i++) { struct ib_wc *wc = priv->ibwc + i; @@ -310,10 +307,7 @@ poll_more: } else ipoib_ib_handle_tx_wc(dev, wc); } - - if (n != t) - break; - } + } while (num_wc == max_wc && done < budget); if (done < budget) { netif_rx_complete(dev, napi); From krkumar2 at in.ibm.com Fri Sep 21 01:17:13 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 21 Sep 2007 13:47:13 +0530 Subject: [ofa-general] [PATCH] Minor optimizations in ipoib_poll In-Reply-To: <20070921081655.13058.93140.sendpatchset@localhost.localdomain> References: <20070921081655.13058.93140.sendpatchset@localhost.localdomain> Message-ID: <20070921081713.13058.61080.sendpatchset@localhost.localdomain> If the poll loop executes more than once (and it happens on my system with two flood pings): - no need to calculate "budget - done" on every iteration (but will require to do this once, when returning from fn) - check for one variable being non-zero instead of comparing two vars for every iteration. 
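The counting scheme described above (count down from the budget and test a single variable) can be modelled in user space. This is only a sketch under assumptions, not the driver code: `fake_poll_cq()` and `poll_model()` are hypothetical, `NUM_WC` is an arbitrary stand-in for IPOIB_NUM_WC, every completion is treated as a receive, and the re-arm and poll_more path is left out.

```c
#include <assert.h>

#define NUM_WC 4	/* arbitrary stand-in for IPOIB_NUM_WC */

/* Hypothetical completion source: hands out up to 'max' pending completions. */
static int fake_poll_cq(int *pending, int max)
{
	int n = *pending < max ? *pending : max;

	*pending -= n;
	return n;
}

/*
 * Models the loop shape from the description: 'remaining' starts at
 * 'budget' and is decremented per receive, so the loop condition and the
 * final check each test one variable. Returns the receives processed.
 */
static int poll_model(int *pending, int budget)
{
	int remaining = budget;
	int num_wc, max_wc;

	assert(budget >= 1);	/* the patch relies on budget being at least 1 */

	do {
		max_wc = remaining < NUM_WC ? remaining : NUM_WC;
		num_wc = fake_poll_cq(pending, max_wc);
		remaining -= num_wc;	/* all completions are receives here */
	} while (num_wc == max_wc && remaining);

	/* "budget - remaining" is computed once, on the way out. */
	return budget - remaining;
}
```

With 10 pending completions and a budget of 6, the model stops when the budget is used up and leaves 4 pending; with 3 pending and the same budget it stops on the first short poll.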
Signed-off-by: Krishna Kumar --- ipoib_ib.c | 17 +++++++++-------- 1 files changed, 9 insertions(+), 8 deletions(-) diff -ruNp a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-21 13:20:42.000000000 +0530 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-21 13:37:20.000000000 +0530 @@ -286,30 +286,30 @@ int ipoib_poll(struct napi_struct *napi, struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, napi); struct net_device *dev = priv->dev; int num_wc, max_wc; - int done = 0; + int remaining = budget; poll_more: do { int i; - max_wc = min(IPOIB_NUM_WC, budget - done); + max_wc = min(IPOIB_NUM_WC, remaining); num_wc = ib_poll_cq(priv->cq, max_wc, priv->ibwc); - for (i = 0; i < n; i++) { + for (i = 0; i < num_wc; i++) { struct ib_wc *wc = priv->ibwc + i; if (wc->wr_id & IPOIB_CM_OP_SRQ) { - ++done; + --remaining; ipoib_cm_handle_rx_wc(dev, wc); } else if (wc->wr_id & IPOIB_OP_RECV) { - ++done; + --remaining; ipoib_ib_handle_rx_wc(dev, wc); } else ipoib_ib_handle_tx_wc(dev, wc); } - } while (num_wc == max_wc && done < budget); + } while (num_wc == max_wc && remaining); - if (done < budget) { + if (remaining) { netif_rx_complete(dev, napi); if (unlikely(ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | @@ -318,7 +318,8 @@ poll_more: goto poll_more; } - return done; + /* return number of receives processed */ + return budget - remaining; } void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) From keshetti85-student at yahoo.co.in Fri Sep 21 01:25:46 2007 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Fri, 21 Sep 2007 13:55:46 +0530 Subject: [ofa-general] [query] Multi path discovery in openSM Message-ID: <829ded920709210125q3c4c89dak8b211267b6e31e55@mail.gmail.com> What is the exact significance of the configurable option LMC in the opensm.conf file? 
If there are multiple paths between two end nodes in a network and I set LMC > 0, does openSM itself identify those routes and update the switch forwarding tables, or is that the duty of some other consumer? And after multiple paths are configured between end nodes, how exactly are they used for path redundancy and load sharing? Again, is that the duty of openSM (or of any SM) or of the application?

PS: Please CC your valuable responses to my e-mail address.

regards,
Mahesh
From vlad at lists.openfabrics.org Fri Sep 21 02:55:58 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 21 Sep 2007 02:55:58 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070921-0200 daily build status Message-ID: <20070921095558.BB450E60896@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.22 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.17 Passed on x86_64 with 
linux-2.6.21.1 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070921-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From krkumar2 at in.ibm.com Fri Sep 21 02:42:47 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 21 Sep 2007 15:12:47 +0530 Subject: [ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000 In-Reply-To: <46EE4E7A.1000700@voltaire.com> Message-ID: Hi Or, Sorry about the delay, I ran into various bugs and then system froze for about 2 days. Or Gerlitz wrote on 09/17/2007 03:22:58 PM: > good, please test with rev5 and let us know. I tested with rev5 and this is what I found (different from what I said earlier about EHCA): Original code had almost zero retransmission (for a particular run), 1 for EHCA and 0 for MTHCA. 
With batching, both had high retransmissions: 73680 for EHCA and 70268 for MTHCA. It seems I was wrong when I said EHCA was having no issues. So far I have identical retransmission numbers for E1000 only. > transmission of 4K batched packets sounds like a real problem for the > receiver side, with 0.5K send/recv queue size, its 8 batches of 512 > packets each were for each RX there is completion (WC) to process, SKB > to alloc and post to the QP where for the TX there's only posting to the > QP, processes one (?) WC and free 512 SKBs. The receiver and sender both have 4K WR's. I had earlier changed batching so that IPoIB will send atmost 2 skbs even if more are present in the queue and send 2 more after the first two and so on. But that too gave high numbers for retransmissions. > If indeed the situation is so unsymmetrical, I am starting to think that > the CPU utilization at the sender side might be much higher with > batching then without batching, have you looked into that? Overall it is almost the same. I had used netperf (about 1 month back) and it gave almost same numbers. I haven't tried recently. Even in regular code, though batching is not done, qdisc_restart() does xmit in a tight loop. The only difference is that dev->queue_lock is DROPPED/GOT for each skb, and dev->tx_lock is held for shorter times. I avoid the former and have no control for the latter. > I am not with you. Looking on 2.6.22 and 2.6.23-rc5, for both their > ipoib-NAPI mechanism is implemented through the function ipoib_poll > being the polling api for the network stack etc, so what is the old API > and where does this difference exist? I meant the pre-Stephen-Hemminger converted NAPI. He had changed the old NAPI to newer one (where driver doesn't get *budget, etc). > You might want to try something lighter such as iperf udp test, where a > nice criteria would be to compare bandwidth AND packet loss between > no-batching and batching. 
As for the MTU, the default is indeed 2K > (2044) but its always to just know the facts, namely what was the mtu > during the test. OK, that is a good idea. I will try it over the weekend. > if you have user space libraries installed, load ib_uverbs and run the > command ibv_devinfo, you will see all the infiniband devices on your > system and for each its device id and firmware version. If not, you > should be looking on > > /sys/class/infiniband/$device/hca_type > and > /sys/class/infiniband/$device/fw_ver Both these files are not present, though ehca0 is present. For mthca, the values are : MT23108 & 3.5.0. Thanks, - KK From PHF at zurich.ibm.com Fri Sep 21 06:31:45 2007 From: PHF at zurich.ibm.com (Philip Frey1) Date: Fri, 21 Sep 2007 15:31:45 +0200 Subject: [ofa-general] OFED 1.2.5 & Ammasso 1100 Message-ID: Hello, I am trying to get an Ammasso 1100 card to work with OFED. So far I have installed a vanilla kernel with infiniband support and the Ammasso driver (module). When I boot it, lsmod shows iw_c2. So far so good but when I try to use the libibverbs from OFED 1.2.5.1, I get the following error on "# ib_rdma_bw": libibverbs: Fatal: couldn't read uverbs ABI version. 3194:main: No IB devices found I also tried to install the complete "Basic" set of tools from the OFED installer but after that I was no longer able to load the iw_c2 module. Can you point me to instructions on how to use the Ammasso 1100 card with the userspace verbs? Thank you very much, Philip Frey From swise at opengridcomputing.com Fri Sep 21 08:10:46 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 21 Sep 2007 10:10:46 -0500 Subject: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. 
In-Reply-To: <20070919105612.GA31158@2ka.mipt.ru> References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <20070914130941.GG18517@2ka.mipt.ru> <46EC00BE.3020801@opengridcomputing.com> <20070916142241.GA26848@2ka.mipt.ru> <46EE9C50.7070406@opengridcomputing.com> <20070919105612.GA31158@2ka.mipt.ru> Message-ID: <46F3DEF6.3010404@opengridcomputing.com> Evgeniy Polyakov wrote: > Hi Steve. > > On Mon, Sep 17, 2007 at 10:25:04AM -0500, Steve Wise (swise at opengridcomputing.com) wrote: >>> Does creating the whole new netdevice is a too big overhead, or is it >>> considered bad idea? >> I think its too big overhead, and pretty invasive on the low level cxgb3 >> driver. I think having a device in the 'ifconfig -a' after iw_cxgb3 is >> loaded and devices discovered would be a good thing for the admin. This >> is the angle Roland suggested. I'm just not sure how to implement it. >> >> But if someone could explain how I might create this full netdevice as a >> pseudo device on top of the real one, maybe I could implement it. >> >> Note that non TCP traffic still needs to utilize this interface for ND >> to work properly with the RDMA core. > > Just a though - what about allowing secondary addresses with the same > address as main one? I.e. change bit of the core code to allow creating > aliases with the same address as main device, so that you would be able > to create ':iw' alias during rdma device initialization? > The problem is that on rdma route/address resolution the rdma core CM uses the routing table to look up which local device to use. So what we need is separate ip subnets for rdma vs non rdma tcp. Also, to avoid the original issue of 4-tuple conflicts, the rdma device _must_ listen on specific local "rdma-only" ip addresses and thus they must be not the same address as that used for native host tcp traffic. Steve. 
From swise at opengridcomputing.com Fri Sep 21 08:29:46 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 21 Sep 2007 10:29:46 -0500 Subject: [ofa-general] Re: [ewg] RE: Delaying OFED 1.3 alpha release to next week In-Reply-To: References: Message-ID: <46F3E36A.7010605@opengridcomputing.com> Hoang-Nam Nguyen wrote: > Hello Tziporet! >> Due to some last minutes submissions that are not yet taken and some >> problems with the >> install script I delay the OFED 1.3 alpha release to next week. >> >> I also think we should agree on a new 1.3 schedule based on the changes >> in the alpha release. > We're testing and backporting ehca on various kernel versions and distros. > We'll have our backport patches ready by Tue next week. >> Another thing to consider is base the kernel code on 2.6.24 and in this >> way to reduce the amount of patches we have > I would prefer this option, because we have at the moment about 15 > patches in queue for 2.6.24. > I agree. I'd rather see ofed-1.3 on a 2.6.24 base and keep ofed-1.2.5 alive a little longer... Steve From swise at opengridcomputing.com Fri Sep 21 08:31:30 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 21 Sep 2007 10:31:30 -0500 Subject: [ofa-general] Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. In-Reply-To: References: <20070912100025.3190.89259.stgit@dell3.ogc.int> <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com> Message-ID: <46F3E3D2.70601@opengridcomputing.com> Michael, can you pull this patch into ofed-1.2.5 and ofed-1.3? Or would you want me to push it into my git tree for you to pull from? Thanks, Steve. Roland Dreier wrote: > > Roland - can you please queue this up for 2.6.24? > > Done, thanks. From guthridg at us.ibm.com Fri Sep 21 11:17:05 2007 From: guthridg at us.ibm.com (Scott Guthridge) Date: Fri, 21 Sep 2007 14:17:05 -0400 Subject: [ofa-general] IBV_WC_WR_FLUSH_ERR: first WQE only or all pending WQE's? 
Message-ID: I have an application that has just posted a few sends on a connected RC queue pair, when either the application itself modifies the QP state to error, or the remote side goes into error. The *first* of these posted send WQE's generates a CQE indicating IBV_WC_WR_FLUSH_ERR [or something like IBV_WC_REM_OP_ERR, in the remote case] as I would expect. But the remaining pending WQE's never seem to generate CQE's. [The ibv_post_send operation did not give local errors on these, BTW.] As a result, my app. hangs waiting for the pending operations to drain. IB architecture spec. sections 9.9.2.3 and 9.9.2.4 seem to suggest that all pending WQE's behind the failed request (error class B, I think) should generate CQE's with the FLUSH error. Questions: (1) Do I understand the spec correctly? Should WQE's posted subsequently to the one that is going to fail be generating FLUSH errors? (2) Has anyone seen this behavior before? Is it common? [I haven't tried switching hardware -- card I'm using *may* not be production level.] If it *is* common behavior, I may need to recode my app. to mark all outstanding requests as failed upon receiving the first error, and then ignore any subsequent errors, to be defensive about it -- this seems kludgy, though, and I'd rather not do that if I don't have to. Thanks, Scott From jlentini at netapp.com Fri Sep 21 12:40:06 2007 From: jlentini at netapp.com (James Lentini) Date: Fri, 21 Sep 2007 15:40:06 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] uDAPL 2.0 mods to co-exist with uDAPL 1.2 In-Reply-To: <000001c7fbb7$30cbad70$19b7020a@amr.corp.intel.com> References: <000001c7fbb7$30cbad70$19b7020a@amr.corp.intel.com> Message-ID: Comments below: On Thu, 20 Sep 2007, Arlin Davis wrote: > James, > > > > Please review patches to allow coexistence of 2.0 and 1.2 libraries. I updated the dat.conf to > provide configuration to both 1.2 and 2.0 providers. 
In addition, the development package (headers) > is now targeted to include/dat2 instead of include/dat. A patch for 1.2 will follow shortly. > > > > Modifications to DAT 2.0 package to coexist with 1.2 libraries > > - cleanup CR-LF in dtestx > > - fix RPM specfile, 2.0.1 package > > - move devel to include/dat2 > > - change test examples to use new 2.0 provider names. > > > > Signed-off by: Arlin Davis ardavis at ichips.intel.com > > > > diff --git a/Makefile.am b/Makefile.am > index b3a0149..f473aaa 100755 > --- a/Makefile.am > +++ b/Makefile.am > @@ -66,7 +66,7 @@ dat_udat_libdat_la_SOURCES = dat/udat/udat.c \ > dat/common/dat_init.c \ > dat/common/dat_dr.c \ > dat/common/dat_sr.c > - > +# version-info current:revision:age What does this comment do? > dat_udat_libdat_la_LDFLAGS = -version-info 2:0:0 $(dat_version_script) -ldl > > # > @@ -178,11 +178,12 @@ dapl_udapl_libdaplcma_la_SOURCES = dapl/udapl/dapl_init.c \ > dapl/openib_cma/dapl_ib_cm.c \ > dapl/openib_cma/dapl_ib_mem.c $(XPROGRAMS) > > +# version-info current:revision:age ditto > dapl_udapl_libdaplcma_la_LDFLAGS = -version-info 2:0:0 $(daplcma_version_script) \ > -Wl,-init,dapl_init -Wl,-fini,dapl_fini \ > -lpthread -libverbs -lrdmacm > > -libdatincludedir = $(includedir)/dat > +libdatincludedir = $(includedir)/dat2 > > libdatinclude_HEADERS = dat/include/dat/dat.h \ > dat/include/dat/dat_error.h \ > @@ -244,7 +245,7 @@ EXTRA_DIST = dat/common/dat_dictionary.h \ > dat/udat/libdat.map \ > doc/dat.conf \ > dapl/udapl/libdaplcma.map \ > - libdat.spec.in \ > + libdat2.spec.in \ > $(man_MANS) \ > test/dapltest/include/dapl_bpool.h \ > test/dapltest/include/dapl_client_info.h \ > @@ -274,7 +275,7 @@ EXTRA_DIST = dat/common/dat_dictionary.h \ > test/dapltest/include/dapl_version.h \ > test/dapltest/mdep/linux/dapl_mdep_user.h > > -dist-hook: libdat.spec > - cp libdat.spec $(distdir) > +dist-hook: libdat2.spec > + cp libdat2.spec $(distdir) > > SUBDIRS = . 
test/dtest test/dapltest > diff --git a/README b/README > index 437c1f7..1fc55a2 100644 > --- a/README > +++ b/README > @@ -17,16 +17,18 @@ Building debug version: > ./configure --enable-debug > make > > -Build example with OFED prefix (x86_64) > ------------------------------------------ > +Build example with OFED 1.2+ prefix (x86_64) > +--------------------------------------------- > ./autogen.sh > -./configure --prefix /usr/local/ofed --libdir /usr/local/ofed/lib64 LDFLAGS=-L/usr/local/ofed/lib64 > CPPFLAGS="-I/usr/local/ofed/include" > +./configure --prefix /usr --sysconf=/etc --libdir /usr/lib64 LDFLAGS=-L/usr/lib64 > CPPFLAGS="-I/usr/include" It looks like there was some line wrap here and in other places below. No big deal. > make > > Installing: > ---------- > make install > > +Note: The development package installs DAT 2.0 include files under /usr/include/dat2 to co-exist > with DAT 1.2 /usr/include/dat > + > NOTE: to link these libraries you must either use libtool and > specify the full pathname of the library, or use the `-LLIBDIR' > flag during linking and do at least one of the following: > @@ -47,19 +49,32 @@ more information, such as the ld(1) and ld.so(8) manual pages. > sample /etc/dat.conf > > # > -# DAT 1.2 configuration file, sample OFED > +# DAT 1.2 and 2.0 configuration file > # > # Each entry should have the following fields: > # > # \ > # > # > -# For openib-cma provider you can specify as either: > -# network address, network hostname, or netdev name and 0 for port > +# For the uDAPL cma provder, specify as one of the following: > +# network address, network hostname, or netdev name and 0 for port > +# > +# Simple (OpenIB-cma) default with netdev name provided first on list > +# to enable use of same dat.conf version on all nodes > # > -# This example shows netdev name, enabling administrator to use same copy across cluster > +# Add examples for multiple interfaces and IPoIB HA fail over, and bonding The previous line is TODO, right? 
I'd suggest annotating it with that text to make it clear to users. > # > -OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdapl-cma.so mv_dapl.1.2 "ib0 0" "" > +OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" "" > +OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" "" > +OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0" "" > +OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0" "" > +OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0" "" > +OpenIB-2-cma u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib0 0" "" > +OpenIB-2-cma-1 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib1 0" "" > +OpenIB-2-cma-2 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib2 0" "" > +OpenIB-2-cma-3 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib3 0" "" > +OpenIB-2-bond u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "bond0 0" "" > + > > ============================= > 3.0 Bugs/Known issues > diff --git a/configure.in b/configure.in > index 7608e64..4eda85f 100644 > --- a/configure.in > +++ b/configure.in > @@ -1,11 +1,11 @@ > dnl Process this file with autoconf to produce a configure script. 
> > AC_PREREQ(2.57) > -AC_INIT(dapl, 2.0.0, openib-general at openib.org) > +AC_INIT(dapl, 2.0.1, general at lists.openfabrics.org) > AC_CONFIG_SRCDIR([dat/udat/udat.c]) > AC_CONFIG_AUX_DIR(config) > AM_CONFIG_HEADER(config.h) > -AM_INIT_AUTOMAKE(dapl, 2.0.0) > +AM_INIT_AUTOMAKE(dapl, 2.0.1) > > AM_PROG_LIBTOOL > > @@ -86,6 +86,6 @@ AC_CACHE_CHECK(Check for RHEL5 system, ac_cv_rhel5, > fi) > AM_CONDITIONAL(OS_RHEL5, test "$ac_cv_rhel5" = "yes") > > -AC_CONFIG_FILES([Makefile test/dtest/Makefile test/dapltest/Makefile libdat.spec]) > +AC_CONFIG_FILES([Makefile test/dtest/Makefile test/dapltest/Makefile libdat2.spec]) > > AC_OUTPUT > diff --git a/doc/dat.conf b/doc/dat.conf > index 2651673..005f9ee 100755 > --- a/doc/dat.conf > +++ b/doc/dat.conf > @@ -1,5 +1,5 @@ > # > -# DAT 2.0 configuration file > +# DAT 1.2 and 2.0 configuration file > # > # Each entry should have the following fields: > # > @@ -9,10 +9,18 @@ > # For the uDAPL cma provder, specify as one of the following: > # network address, network hostname, or netdev name and 0 for port > # > -# Simple (OpenIB-cma) default configuration with netdev name provided first on list > -# to enable use of same dat.conf version on all nodes. Assumes x86_64 installation. > +# Simple (OpenIB-cma) default with netdev name provided first on list > +# to enable use of same dat.conf version on all nodes > # > -OpenIB-cma u2.0 nonthreadsafe default /usr/lib64/libdaplcma.so mv_dapl.2.0 "ib0 0" "" > -OpenIB-cma-1 u2.0 nonthreadsafe default /usr/lib64/libdaplcma.so mv_dapl.2.0 "ib0 0" "" > -OpenIB-cma-2 u2.0 nonthreadsafe default /usr/lib64/libdaplcma.so mv_dapl.2.0 "ib0 0" "" > -OpenIB-cma-3 u2.0 nonthreadsafe default /usr/lib64/libdaplcma.so mv_dapl.2.0 "ib0 0" "" > +# Add examples for multiple interfaces and IPoIB HA fail over, and bonding > +# Again, the previous line is a TODO? 
> +OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" "" > +OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" "" > +OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0" "" > +OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0" "" > +OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0" "" > +OpenIB-2-cma u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib0 0" "" > +OpenIB-2-cma-1 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib1 0" "" > +OpenIB-2-cma-2 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib2 0" "" > +OpenIB-2-cma-3 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib3 0" "" > +OpenIB-2-bond u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "bond0 0" "" > diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c > index 07b40ec..ba12a58 100644 > --- a/test/dtest/dtest.c > +++ b/test/dtest/dtest.c > @@ -44,7 +44,7 @@ > #include > > #ifndef DAPL_PROVIDER > -#define DAPL_PROVIDER "OpenIB-cma" > +#define DAPL_PROVIDER "OpenIB-2-cma" Should we update OpenIB to ofa? Obviously, this isn't necessary as part of this change. > #endif > > #define MAX_POLLING_CNT 50000 > diff --git a/test/dtest/dtestx.c b/test/dtest/dtestx.c > index 153ce76..04a0d5d 100755 > --- a/test/dtest/dtestx.c > +++ b/test/dtest/dtestx.c > @@ -30,785 +30,785 @@ > * SOFTWARE. > * > * $Id: $ > - */ > The formatting seems strange below here. There appears to be an extra space after each "-" line. Ignoring that, I'm in complete agreement with switching the examples over to use the 2.0 APIs. 
> -#include > > -#include > > -#include > > -#include > > -#include > > -#include > > -#include > > -#include > > -#include > > - > > -#include "dat/udat.h" > > -#include "dat/dat_ib_extensions.h" > > - > > -#define _OK(status, str) \ > > -{ \ > > - const char *maj_msg, *min_msg; \ > > - if (status != DAT_SUCCESS) { \ > > - dat_strerror(status, &maj_msg, &min_msg); \ > > - fprintf(stderr, str " returned %s : %s\n", maj_msg, min_msg); \ > > - exit(1); \ > > - } \ > > -} > > - > > + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "dat/udat.h" > +#include "dat/dat_ib_extensions.h" > + > +#define _OK(status, str) \ > +{ \ > + const char *maj_msg, *min_msg; \ > + if (status != DAT_SUCCESS) { \ > + dat_strerror(status, &maj_msg, &min_msg); \ > + fprintf(stderr, str " returned %s : %s\n", maj_msg, min_msg); \ > + exit(1); \ > + } \ > +} > + > #define DTO_TIMEOUT (1000*1000*5) > #define CONN_TIMEOUT (1000*1000*10) > -#define SERVER_TIMEOUT (1000*1000*120) > > -#define SERVER_CONN_QUAL 31111 > > -#define BUF_SIZE 256 > > -#define BUF_SIZE_ATOMIC 8 > > -#define REG_MEM_COUNT 10 > > -#define SND_RDMA_BUF_INDEX 0 > > -#define RCV_RDMA_BUF_INDEX 1 > > -#define SEND_BUF_INDEX 2 > > -#define RECV_BUF_INDEX 3 > > - > > -u_int64_t *atomic_buf; > > -DAT_LMR_HANDLE lmr_atomic; > > -DAT_LMR_CONTEXT lmr_atomic_context; > > -DAT_RMR_CONTEXT rmr_atomic_context; > > -DAT_VLEN reg_atomic_size; > > -DAT_VADDR reg_atomic_addr; > > -DAT_LMR_HANDLE lmr[ REG_MEM_COUNT ]; > > -DAT_LMR_CONTEXT lmr_context[ REG_MEM_COUNT ]; > > -DAT_RMR_TRIPLET rmr[ REG_MEM_COUNT ]; > > -DAT_RMR_CONTEXT rmr_context[ REG_MEM_COUNT ]; > > -DAT_VLEN reg_size[ REG_MEM_COUNT ]; > > -DAT_VADDR reg_addr[ REG_MEM_COUNT ]; > > -DAT_RMR_TRIPLET * buf[ REG_MEM_COUNT ]; > > -DAT_EP_HANDLE ep; > > -DAT_EVD_HANDLE async_evd = DAT_HANDLE_NULL; > > -DAT_IA_HANDLE ia = DAT_HANDLE_NULL; > > -DAT_PZ_HANDLE pz = DAT_HANDLE_NULL; > > -DAT_EVD_HANDLE cr_evd 
= DAT_HANDLE_NULL; > > -DAT_EVD_HANDLE con_evd = DAT_HANDLE_NULL; > > -DAT_EVD_HANDLE dto_evd = DAT_HANDLE_NULL; > > -DAT_PSP_HANDLE psp = DAT_HANDLE_NULL; > > -DAT_CR_HANDLE cr = DAT_HANDLE_NULL; > > -int server; > > - > > -char *usage = "-s | hostname (default == -s)\n"; > > - > > -void > > -send_msg( > > - void *data, > > - DAT_COUNT size, > > - DAT_LMR_CONTEXT context, > > - DAT_DTO_COOKIE cookie, > > - DAT_COMPLETION_FLAGS flags) > > -{ > > - DAT_LMR_TRIPLET iov; > > - DAT_EVENT event; > > - DAT_COUNT nmore; > > - DAT_RETURN status; > > - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > > - &event.event_data.dto_completion_event_data; > > - > > - iov.lmr_context = context; > > - iov.virtual_address = (DAT_VADDR)(unsigned long)data; > > - iov.segment_length = (DAT_VLEN)size; > > - > > - status = dat_ep_post_send(ep, > > - 1, > > - &iov, > > - cookie, > > - flags); > > - _OK(status, "dat_ep_post_send"); > > - > > - if (! (flags & DAT_COMPLETION_SUPPRESS_FLAG)) { > > - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > > - _OK(status, "dat_evd_wait after dat_ep_post_send"); > > - > > - if (event.event_number != DAT_DTO_COMPLETION_EVENT) { > > - printf("unexpected event waiting for post_send completion - 0x%x\n", > event.event_number); > > - exit(1); > > - } > > - > > - _OK(dto_event->status, "event status for post_send"); > > - } > > -} > > - > > -int > > -connect_ep(char *hostname) > > -{ > > - DAT_SOCK_ADDR remote_addr; > > - DAT_EP_ATTR ep_attr; > > - DAT_RETURN status; > > - DAT_REGION_DESCRIPTION region; > > - DAT_EVENT event; > > - DAT_COUNT nmore; > > - DAT_LMR_TRIPLET iov; > > - DAT_RMR_TRIPLET r_iov; > > - DAT_DTO_COOKIE cookie; > > - int i; > > - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > > - &event.event_data.dto_completion_event_data; > > - > > - status = dat_ia_open("OpenIB-cma", 8, &async_evd, &ia); > > - _OK(status, "dat_ia_open"); > > - > > - status = dat_pz_create(ia, &pz); > > - _OK(status, "dat_pz_create"); > > - > > - status = 
dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_CR_FLAG, &cr_evd ); > > - _OK(status, "dat_evd_create CR"); > > - status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_CONNECTION_FLAG, &con_evd ); > > - _OK(status, "dat_evd_create CR"); > > - status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_DTO_FLAG, &dto_evd ); > > - _OK(status, "dat_evd_create DTO"); > > - > > - memset(&ep_attr, 0, sizeof(ep_attr)); > > - ep_attr.service_type = DAT_SERVICE_TYPE_RC; > > - ep_attr.max_rdma_size = 0x10000; > > - ep_attr.qos = 0; > > - ep_attr.recv_completion_flags = 0; > > - ep_attr.max_recv_dtos = 10; > > - ep_attr.max_request_dtos = 10; > > - ep_attr.max_recv_iov = 1; > > - ep_attr.max_request_iov = 1; > > - ep_attr.max_rdma_read_in = 4; > > - ep_attr.max_rdma_read_out = 4; > > - ep_attr.request_completion_flags = DAT_COMPLETION_DEFAULT_FLAG; > > - ep_attr.ep_transport_specific_count = 0; > > - ep_attr.ep_transport_specific = NULL; > > - ep_attr.ep_provider_specific_count = 0; > > - ep_attr.ep_provider_specific = NULL; > > - > > - status = dat_ep_create(ia, pz, dto_evd, dto_evd, con_evd, &ep_attr, &ep); > > - _OK(status, "dat_ep_create"); > > - > > - for (i = 0; i < REG_MEM_COUNT; i++) { > > - buf[ i ] = (DAT_RMR_TRIPLET*)malloc(BUF_SIZE); > > - region.for_va = buf[ i ]; > > - status = dat_lmr_create(ia, > > - DAT_MEM_TYPE_VIRTUAL, > > - region, > > - BUF_SIZE, > > - pz, > > - DAT_MEM_PRIV_ALL_FLAG|DAT_IB_MEM_PRIV_REMOTE_ATOMIC, > > +#define SERVER_TIMEOUT (1000*1000*120) > +#define SERVER_CONN_QUAL 31111 > +#define BUF_SIZE 256 > +#define BUF_SIZE_ATOMIC 8 > +#define REG_MEM_COUNT 10 > +#define SND_RDMA_BUF_INDEX 0 > +#define RCV_RDMA_BUF_INDEX 1 > +#define SEND_BUF_INDEX 2 > +#define RECV_BUF_INDEX 3 > + > +u_int64_t *atomic_buf; > +DAT_LMR_HANDLE lmr_atomic; > +DAT_LMR_CONTEXT lmr_atomic_context; > +DAT_RMR_CONTEXT rmr_atomic_context; > +DAT_VLEN reg_atomic_size; > +DAT_VADDR reg_atomic_addr; > +DAT_LMR_HANDLE lmr[ REG_MEM_COUNT ]; > +DAT_LMR_CONTEXT lmr_context[ 
REG_MEM_COUNT ]; > +DAT_RMR_TRIPLET rmr[ REG_MEM_COUNT ]; > +DAT_RMR_CONTEXT rmr_context[ REG_MEM_COUNT ]; > +DAT_VLEN reg_size[ REG_MEM_COUNT ]; > +DAT_VADDR reg_addr[ REG_MEM_COUNT ]; > +DAT_RMR_TRIPLET * buf[ REG_MEM_COUNT ]; > +DAT_EP_HANDLE ep; > +DAT_EVD_HANDLE async_evd = DAT_HANDLE_NULL; > +DAT_IA_HANDLE ia = DAT_HANDLE_NULL; > +DAT_PZ_HANDLE pz = DAT_HANDLE_NULL; > +DAT_EVD_HANDLE cr_evd = DAT_HANDLE_NULL; > +DAT_EVD_HANDLE con_evd = DAT_HANDLE_NULL; > +DAT_EVD_HANDLE dto_evd = DAT_HANDLE_NULL; > +DAT_PSP_HANDLE psp = DAT_HANDLE_NULL; > +DAT_CR_HANDLE cr = DAT_HANDLE_NULL; > +int server; > + > +char *usage = "-s | hostname (default == -s)\n"; > + > +void > +send_msg( > + void *data, > + DAT_COUNT size, > + DAT_LMR_CONTEXT context, > + DAT_DTO_COOKIE cookie, > + DAT_COMPLETION_FLAGS flags) > +{ > + DAT_LMR_TRIPLET iov; > + DAT_EVENT event; > + DAT_COUNT nmore; > + DAT_RETURN status; > + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > + &event.event_data.dto_completion_event_data; > + > + iov.lmr_context = context; > + iov.virtual_address = (DAT_VADDR)(unsigned long)data; > + iov.segment_length = (DAT_VLEN)size; > + > + status = dat_ep_post_send(ep, > + 1, > + &iov, > + cookie, > + flags); > + _OK(status, "dat_ep_post_send"); > + > + if (! 
(flags & DAT_COMPLETION_SUPPRESS_FLAG)) { > + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > + _OK(status, "dat_evd_wait after dat_ep_post_send"); > + > + if (event.event_number != DAT_DTO_COMPLETION_EVENT) { > + printf("unexpected event waiting for post_send completion - 0x%x\n", > event.event_number); > + exit(1); > + } > + > + _OK(dto_event->status, "event status for post_send"); > + } > +} > + > +int > +connect_ep(char *hostname) > +{ > + DAT_SOCK_ADDR remote_addr; > + DAT_EP_ATTR ep_attr; > + DAT_RETURN status; > + DAT_REGION_DESCRIPTION region; > + DAT_EVENT event; > + DAT_COUNT nmore; > + DAT_LMR_TRIPLET iov; > + DAT_RMR_TRIPLET r_iov; > + DAT_DTO_COOKIE cookie; > + int i; > + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > + &event.event_data.dto_completion_event_data; > + > + status = dat_ia_open("OpenIB-2-cma", 8, &async_evd, &ia); > + _OK(status, "dat_ia_open"); > + > + status = dat_pz_create(ia, &pz); > + _OK(status, "dat_pz_create"); > + > + status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_CR_FLAG, &cr_evd ); > + _OK(status, "dat_evd_create CR"); > + status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_CONNECTION_FLAG, &con_evd ); > + _OK(status, "dat_evd_create CR"); > + status = dat_evd_create(ia, 10, DAT_HANDLE_NULL, DAT_EVD_DTO_FLAG, &dto_evd ); > + _OK(status, "dat_evd_create DTO"); > + > + memset(&ep_attr, 0, sizeof(ep_attr)); > + ep_attr.service_type = DAT_SERVICE_TYPE_RC; > + ep_attr.max_rdma_size = 0x10000; > + ep_attr.qos = 0; > + ep_attr.recv_completion_flags = 0; > + ep_attr.max_recv_dtos = 10; > + ep_attr.max_request_dtos = 10; > + ep_attr.max_recv_iov = 1; > + ep_attr.max_request_iov = 1; > + ep_attr.max_rdma_read_in = 4; > + ep_attr.max_rdma_read_out = 4; > + ep_attr.request_completion_flags = DAT_COMPLETION_DEFAULT_FLAG; > + ep_attr.ep_transport_specific_count = 0; > + ep_attr.ep_transport_specific = NULL; > + ep_attr.ep_provider_specific_count = 0; > + ep_attr.ep_provider_specific = NULL; > + > + status = 
dat_ep_create(ia, pz, dto_evd, dto_evd, con_evd, &ep_attr, &ep); > + _OK(status, "dat_ep_create"); > + > + for (i = 0; i < REG_MEM_COUNT; i++) { > + buf[ i ] = (DAT_RMR_TRIPLET*)malloc(BUF_SIZE); > + region.for_va = buf[ i ]; > + status = dat_lmr_create(ia, > + DAT_MEM_TYPE_VIRTUAL, > + region, > + BUF_SIZE, > + pz, > + DAT_MEM_PRIV_ALL_FLAG|DAT_IB_MEM_PRIV_REMOTE_ATOMIC, > DAT_VA_TYPE_VA, > - &lmr[ i ], > > - &lmr_context[ i ], > > - &rmr_context[ i ], > > - &reg_size[ i ], > > - &reg_addr[ i ]); > > - _OK(status, "dat_lmr_create"); > > - } > > - > > - /* register atomic return buffer for original data */ > > - atomic_buf = (u_int64_t*)malloc(BUF_SIZE); > > - region.for_va = atomic_buf; > > - status = dat_lmr_create(ia, > > - DAT_MEM_TYPE_VIRTUAL, > > - region, > > - BUF_SIZE_ATOMIC, > > - pz, > > - DAT_MEM_PRIV_ALL_FLAG|DAT_IB_MEM_PRIV_REMOTE_ATOMIC, > > + &lmr[ i ], > + &lmr_context[ i ], > + &rmr_context[ i ], > + &reg_size[ i ], > + &reg_addr[ i ]); > + _OK(status, "dat_lmr_create"); > + } > + > + /* register atomic return buffer for original data */ > + atomic_buf = (u_int64_t*)malloc(BUF_SIZE); > + region.for_va = atomic_buf; > + status = dat_lmr_create(ia, > + DAT_MEM_TYPE_VIRTUAL, > + region, > + BUF_SIZE_ATOMIC, > + pz, > + DAT_MEM_PRIV_ALL_FLAG|DAT_IB_MEM_PRIV_REMOTE_ATOMIC, > DAT_VA_TYPE_VA, > - &lmr_atomic, > > - &lmr_atomic_context, > > - &rmr_atomic_context, > > - &reg_atomic_size, > > - &reg_atomic_addr); > > - _OK(status, "dat_lmr_create atomic"); > > - > > - for (i = RECV_BUF_INDEX; i < REG_MEM_COUNT; i++) { > > - cookie.as_64 = i; > > - iov.lmr_context = lmr_context[ i ]; > > - iov.virtual_address = (DAT_VADDR)(unsigned long)buf[ i ]; > > - iov.segment_length = BUF_SIZE; > > - > > - status = dat_ep_post_recv(ep, > > - 1, > > - &iov, > > - cookie, > > - DAT_COMPLETION_DEFAULT_FLAG); > > - _OK(status, "dat_ep_post_recv"); > > - } > > - > > - /* setup receive buffer to initial string to be overwritten */ > > - strcpy((char*)buf[ RCV_RDMA_BUF_INDEX ], "blah, blah, 
blah\n"); > > - > > - if (server) { > > - > > - strcpy((char*)buf[ SND_RDMA_BUF_INDEX ], "server written data"); > > - > > - status = dat_psp_create(ia, > > - SERVER_CONN_QUAL, > > - cr_evd, > > - DAT_PSP_CONSUMER_FLAG, > > - &psp); > > - _OK(status, "dat_psp_create"); > > - > > - printf("Server waiting for connect request\n"); > > - status = dat_evd_wait(cr_evd, SERVER_TIMEOUT, 1, &event, &nmore); > > - _OK(status, "listen dat_evd_wait"); > > - > > - if (event.event_number != DAT_CONNECTION_REQUEST_EVENT) { > > - printf("unexpected event after dat_psp_create: 0x%x\n", event.event_number); > > - exit(1); > > - } > > - > > - if ((event.event_data.cr_arrival_event_data.conn_qual != SERVER_CONN_QUAL) || > > - (event.event_data.cr_arrival_event_data.sp_handle.psp_handle != psp)) { > > - > > - printf("wrong cr event data\n"); > > - exit(1); > > - } > > - > > - cr = event.event_data.cr_arrival_event_data.cr_handle; > > - status = dat_cr_accept(cr, ep, 0, (DAT_PVOID)0); > > - > > - } else { > > - struct addrinfo *target; > > - int rval; > > - > > - if (getaddrinfo (hostname, NULL, NULL, &target) != 0) { > > - printf("Error getting remote address.\n"); > > - exit(1); > > - } > > - > > - rval = ((struct sockaddr_in *)target->ai_addr)->sin_addr.s_addr; > > - printf ("Server Name: %s \n", hostname); > > - printf ("Server Net Address: %d.%d.%d.%d\n", > > - (rval >> 0) & 0xff, > > - (rval >> 8) & 0xff, > > - (rval >> 16) & 0xff, > > - (rval >> 24) & 0xff); > > - > > - remote_addr = *((DAT_IA_ADDRESS_PTR)target->ai_addr); > > - > > - strcpy((char*)buf[ SND_RDMA_BUF_INDEX ], "client written data"); > > - > > - status = dat_ep_connect(ep, > > - &remote_addr, > > - SERVER_CONN_QUAL, > > - CONN_TIMEOUT, > > - 0, > > - (DAT_PVOID)0, > > - 0, > > - DAT_CONNECT_DEFAULT_FLAG ); > > - _OK(status, "dat_psp_create"); > > - } > > - > > - printf("Client waiting for connect response\n"); > > - status = dat_evd_wait(con_evd, CONN_TIMEOUT, 1, &event, &nmore); > > - _OK(status, "connect 
dat_evd_wait"); > > - > > - if (event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED) { > > - printf("unexpected event after dat_ep_connect: 0x%x\n", event.event_number); > > - exit(1); > > - } > > - > > - printf("Connected!\n"); > > - > > - /* > > - * Setup our remote memory and tell the other side about it > > - */ > > - printf("Sending RMR data to remote\n"); > > - r_iov.rmr_context = rmr_context[ RCV_RDMA_BUF_INDEX ]; > > - r_iov.virtual_address = (DAT_VADDR)((unsigned long)buf[ RCV_RDMA_BUF_INDEX ]); > > - r_iov.segment_length = BUF_SIZE; > > - > > - *buf[ SEND_BUF_INDEX ] = r_iov; > > - > > - send_msg( buf[ SEND_BUF_INDEX ], > > - sizeof(DAT_RMR_TRIPLET), > > - lmr_context[ SEND_BUF_INDEX ], > > - cookie, > > - DAT_COMPLETION_SUPPRESS_FLAG); > > - > > - /* > > - * Wait for their RMR > > - */ > > - printf("Waiting for remote to send RMR data\n"); > > - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > > - _OK(status, "dat_evd_wait after dat_ep_post_send"); > > - > > - if (event.event_number != DAT_DTO_COMPLETION_EVENT) { > > - printf("unexpected event waiting for RMR context - 0x%x\n", > > - event.event_number); > > - exit(1); > > - } > > - > > - _OK(dto_event->status, "event status for post_send"); > > - if ((dto_event->transfered_length != sizeof(DAT_RMR_TRIPLET)) || > > - (dto_event->user_cookie.as_64 != RECV_BUF_INDEX)) { > > - printf("unexpected event data for receive: len=%d cookie=%d expected %d/%d\n", > > - (int)dto_event->transfered_length, > > - (int)dto_event->user_cookie.as_64, > > - sizeof(DAT_RMR_TRIPLET), RECV_BUF_INDEX); > > - exit(1); > > - } > > - > > - r_iov = *buf[ RECV_BUF_INDEX ]; > > - > > - printf("Received RMR from remote: r_iov: ctx=%x,va=%p,len=%d\n", > > - r_iov.rmr_context, > > - (void*)(unsigned long)r_iov.virtual_address, > > - r_iov.segment_length); > > - > > - return(0); > > -} > > - > > -int > > -disconnect_ep() > > -{ > > - DAT_RETURN status; > > - int i; > > - DAT_EVENT event; > > - DAT_COUNT nmore; > > - > 
> - status = dat_ep_disconnect(ep, DAT_CLOSE_DEFAULT); > > - _OK(status, "dat_ep_disconnect"); > > - > > + &lmr_atomic, > + &lmr_atomic_context, > + &rmr_atomic_context, > + &reg_atomic_size, > + &reg_atomic_addr); > + _OK(status, "dat_lmr_create atomic"); > + > + for (i = RECV_BUF_INDEX; i < REG_MEM_COUNT; i++) { > + cookie.as_64 = i; > + iov.lmr_context = lmr_context[ i ]; > + iov.virtual_address = (DAT_VADDR)(unsigned long)buf[ i ]; > + iov.segment_length = BUF_SIZE; > + > + status = dat_ep_post_recv(ep, > + 1, > + &iov, > + cookie, > + DAT_COMPLETION_DEFAULT_FLAG); > + _OK(status, "dat_ep_post_recv"); > + } > + > + /* setup receive buffer to initial string to be overwritten */ > + strcpy((char*)buf[ RCV_RDMA_BUF_INDEX ], "blah, blah, blah\n"); > + > + if (server) { > + > + strcpy((char*)buf[ SND_RDMA_BUF_INDEX ], "server written data"); > + > + status = dat_psp_create(ia, > + SERVER_CONN_QUAL, > + cr_evd, > + DAT_PSP_CONSUMER_FLAG, > + &psp); > + _OK(status, "dat_psp_create"); > + > + printf("Server waiting for connect request\n"); > + status = dat_evd_wait(cr_evd, SERVER_TIMEOUT, 1, &event, &nmore); > + _OK(status, "listen dat_evd_wait"); > + > + if (event.event_number != DAT_CONNECTION_REQUEST_EVENT) { > + printf("unexpected event after dat_psp_create: 0x%x\n", event.event_number); > + exit(1); > + } > + > + if ((event.event_data.cr_arrival_event_data.conn_qual != SERVER_CONN_QUAL) || > + (event.event_data.cr_arrival_event_data.sp_handle.psp_handle != psp)) { > + > + printf("wrong cr event data\n"); > + exit(1); > + } > + > + cr = event.event_data.cr_arrival_event_data.cr_handle; > + status = dat_cr_accept(cr, ep, 0, (DAT_PVOID)0); > + > + } else { > + struct addrinfo *target; > + int rval; > + > + if (getaddrinfo (hostname, NULL, NULL, &target) != 0) { > + printf("Error getting remote address.\n"); > + exit(1); > + } > + > + rval = ((struct sockaddr_in *)target->ai_addr)->sin_addr.s_addr; > + printf ("Server Name: %s \n", hostname); > + printf ("Server Net 
Address: %d.%d.%d.%d\n", > + (rval >> 0) & 0xff, > + (rval >> 8) & 0xff, > + (rval >> 16) & 0xff, > + (rval >> 24) & 0xff); > + > + remote_addr = *((DAT_IA_ADDRESS_PTR)target->ai_addr); > + > + strcpy((char*)buf[ SND_RDMA_BUF_INDEX ], "client written data"); > + > + status = dat_ep_connect(ep, > + &remote_addr, > + SERVER_CONN_QUAL, > + CONN_TIMEOUT, > + 0, > + (DAT_PVOID)0, > + 0, > + DAT_CONNECT_DEFAULT_FLAG ); > + _OK(status, "dat_psp_create"); > + } > + > + printf("Client waiting for connect response\n"); > + status = dat_evd_wait(con_evd, CONN_TIMEOUT, 1, &event, &nmore); > + _OK(status, "connect dat_evd_wait"); > + > + if (event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED) { > + printf("unexpected event after dat_ep_connect: 0x%x\n", event.event_number); > + exit(1); > + } > + > + printf("Connected!\n"); > + > + /* > + * Setup our remote memory and tell the other side about it > + */ > + printf("Sending RMR data to remote\n"); > + r_iov.rmr_context = rmr_context[ RCV_RDMA_BUF_INDEX ]; > + r_iov.virtual_address = (DAT_VADDR)((unsigned long)buf[ RCV_RDMA_BUF_INDEX ]); > + r_iov.segment_length = BUF_SIZE; > + > + *buf[ SEND_BUF_INDEX ] = r_iov; > + > + send_msg( buf[ SEND_BUF_INDEX ], > + sizeof(DAT_RMR_TRIPLET), > + lmr_context[ SEND_BUF_INDEX ], > + cookie, > + DAT_COMPLETION_SUPPRESS_FLAG); > + > + /* > + * Wait for their RMR > + */ > + printf("Waiting for remote to send RMR data\n"); > + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > + _OK(status, "dat_evd_wait after dat_ep_post_send"); > + > + if (event.event_number != DAT_DTO_COMPLETION_EVENT) { > + printf("unexpected event waiting for RMR context - 0x%x\n", > + event.event_number); > + exit(1); > + } > + > + _OK(dto_event->status, "event status for post_send"); > + if ((dto_event->transfered_length != sizeof(DAT_RMR_TRIPLET)) || > + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX)) { > + printf("unexpected event data for receive: len=%d cookie=%d expected %d/%d\n", > + 
(int)dto_event->transfered_length, > + (int)dto_event->user_cookie.as_64, > + sizeof(DAT_RMR_TRIPLET), RECV_BUF_INDEX); > + exit(1); > + } > + > + r_iov = *buf[ RECV_BUF_INDEX ]; > + > + printf("Received RMR from remote: r_iov: ctx=%x,va=%p,len=%d\n", > + r_iov.rmr_context, > + (void*)(unsigned long)r_iov.virtual_address, > + r_iov.segment_length); > + > + return(0); > +} > + > +int > +disconnect_ep() > +{ > + DAT_RETURN status; > + int i; > + DAT_EVENT event; > + DAT_COUNT nmore; > + > + status = dat_ep_disconnect(ep, DAT_CLOSE_DEFAULT); > + _OK(status, "dat_ep_disconnect"); > + > status = dat_evd_wait(con_evd, DAT_TIMEOUT_INFINITE, 1, &event, &nmore); > _OK(status, "dat_ep_disconnect"); > - > > - if (server) { > > - status = dat_psp_free(psp); > > - _OK(status, "dat_psp_free"); > > - } > > - > > - for (i = 0; i < REG_MEM_COUNT; i++) { > > - status = dat_lmr_free(lmr[ i ]); > > - _OK(status, "dat_lmr_free"); > > - } > > - > > - status = dat_lmr_free(lmr_atomic); > > - _OK(status, "dat_lmr_free_atomic"); > > - > > - status = dat_ep_free(ep); > > - _OK(status, "dat_ep_free"); > > - > > - status = dat_evd_free(dto_evd); > > - _OK(status, "dat_evd_free DTO"); > > - status = dat_evd_free(con_evd); > > - _OK(status, "dat_evd_free CON"); > > - status = dat_evd_free(cr_evd); > > - _OK(status, "dat_evd_free CR"); > > - > > - status = dat_pz_free(pz); > > - _OK(status, "dat_pz_free"); > > - > > - status = dat_ia_close(ia, DAT_CLOSE_DEFAULT); > > - _OK(status, "dat_ia_close"); > > - > > - return(0); > > -} > > - > > -int > > -do_immediate() > > -{ > > - DAT_REGION_DESCRIPTION region; > > - DAT_EVENT event; > > - DAT_COUNT nmore; > > - DAT_LMR_TRIPLET iov; > > - DAT_RMR_TRIPLET r_iov; > > - DAT_DTO_COOKIE cookie; > > - DAT_RMR_CONTEXT their_context; > > - DAT_RETURN status; > > - DAT_UINT32 immed_data; > > - DAT_UINT32 immed_data_recv; > > - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > > - &event.event_data.dto_completion_event_data; > > - DAT_IB_EXTENSION_EVENT_DATA 
*ext_event = > > - (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; > > - > > - printf("\nDoing RDMA WRITE IMMEDIATE DATA\n"); > > - > > - if (server) { > > - immed_data = 0x1111; > > - } else { > > - immed_data = 0x7777; > > - } > > - > > - cookie.as_64 = 0x5555; > > - > > - r_iov = *buf[ RECV_BUF_INDEX ]; > > - > > - iov.lmr_context = lmr_context[ SND_RDMA_BUF_INDEX ]; > > - iov.virtual_address = (DAT_VADDR)(unsigned long)buf[ SND_RDMA_BUF_INDEX ]; > > - iov.segment_length = BUF_SIZE; > > - > > - cookie.as_64 = 0x9999; > > - > > - status = dat_ib_post_rdma_write_immed(ep, // ep_handle > > - 1, // num_segments > > - &iov, // LMR > > - cookie, // user_cookie > > - &r_iov, // RMR > > - immed_data, > > - DAT_COMPLETION_DEFAULT_FLAG); > > - _OK(status, "dat_ib_post_rdma_write_immed"); > > - > > - /* > > - * Collect first event, write completion or the inbound recv with immed > > - */ > > - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > > - _OK(status, "dat_evd_wait after dat_ib_post_rdma_write"); > > - if (event.event_number != DAT_IB_DTO_EVENT) > > - { > > - printf("unexpected event # waiting for WR-IMMED - 0x%x\n", > > - event.event_number); > > - exit(1); > > - } > > - > > - _OK(dto_event->status, "event status"); > > - if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED) > > - { > > - if ((dto_event->transfered_length != BUF_SIZE) || > > - (dto_event->user_cookie.as_64 != 0x9999)) > > - { > > - printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", > > - (int)dto_event->transfered_length, > > - (int)dto_event->user_cookie.as_64); > > - exit(1); > > - } > > - } > > - else if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED_DATA) > > - { > > - if ((dto_event->transfered_length != BUF_SIZE) || > > - (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) > > - { > > - printf("unexpected event data of immediate write: len=%d cookie=%d expected %d/%d\n", > > - (int)dto_event->transfered_length, > > - 
(int)dto_event->user_cookie.as_64, > > - sizeof(int), RECV_BUF_INDEX+1); > > - exit(1); > > - } > > - > > - /* get immediate data from event */ > > - immed_data_recv = ext_event->val.immed.data; > > - } > > - else > > - { > > - printf("unexpected extension type for event - 0x%x, 0x%x\n", > > - event.event_number, ext_event->type); > > - exit(1); > > - } > > - > > - > > - /* > > - * Collect second event, write completion or the inbound recv with immed > > - */ > > - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > > - _OK(status, "dat_evd_wait after dat_ib_post_rdma_write"); > > - if (event.event_number != DAT_IB_DTO_EVENT) > > - { > > - printf("unexpected event # waiting for WR-IMMED - 0x%x\n", > > - event.event_number); > > - exit(1); > > - } > > - > > - _OK(dto_event->status, "event status"); > > - if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED) > > - { > > - if ((dto_event->transfered_length != BUF_SIZE) || > > - (dto_event->user_cookie.as_64 != 0x9999)) > > - { > > - printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", > > - (int)dto_event->transfered_length, > > - (int)dto_event->user_cookie.as_64); > > - exit(1); > > - } > > - } > > - else if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED_DATA) > > - { > > - if ((dto_event->transfered_length != BUF_SIZE) || > > - (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) > > - { > > - printf("unexpected event data of immediate write: len=%d cookie=%d expected %d/%d\n", > > - (int)dto_event->transfered_length, > > - (int)dto_event->user_cookie.as_64, > > - sizeof(int), RECV_BUF_INDEX+1); > > - exit(1); > > - } > > - > > - /* get immediate data from event */ > > - immed_data_recv = ext_event->val.immed.data; > > - } > > - else > > - { > > - printf("unexpected extension type for event - 0x%x, 0x%x\n", > > - event.event_number, ext_event->type); > > - exit(1); > > - } > > - > > - if ((server) && (immed_data_recv != 0x7777)) > > - { > > - printf("ERROR: Server got unexpected 
immed_data_recv 0x%x/0x%x\n", > > - 0x7777, immed_data_recv); > > - exit(1); > > - } > > - else if ((!server) && (immed_data_recv != 0x1111)) > > - { > > - printf("ERROR: Client got unexpected immed_data_recv 0x%x/0x%x\n", > > - 0x1111, immed_data_recv); > > - exit(1); > > - } > > - > > - if (server) > > - printf("Server received immed_data=0x%x\n", immed_data_recv); > > - else > > - printf("Client received immed_data=0x%x\n", immed_data_recv); > > - > > - printf("rdma buffer %p contains: %s\n", > > - buf[ RCV_RDMA_BUF_INDEX ], buf[ RCV_RDMA_BUF_INDEX ]); > > - > > - printf("\n RDMA_WRITE_WITH_IMMEDIATE_DATA test - PASSED\n"); > > - return (0); > > -} > > - > > -int > > -do_cmp_swap() > > -{ > > - DAT_DTO_COOKIE cookie; > > - DAT_RETURN status; > > - DAT_EVENT event; > > - DAT_COUNT nmore; > > - DAT_LMR_TRIPLET l_iov; > > - DAT_RMR_TRIPLET r_iov; > > - volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; > > - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > > - &event.event_data.dto_completion_event_data; > > - DAT_IB_EXTENSION_EVENT_DATA *ext_event = > > - (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; > > - > > - printf("\nDoing CMP and SWAP\n"); > > - > > - r_iov = *buf[ RECV_BUF_INDEX ]; > > - > > - l_iov.lmr_context = lmr_atomic_context; > > - l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; > > - l_iov.segment_length = BUF_SIZE_ATOMIC; > > - > > - cookie.as_64 = 3333; > > - if (server) { > > - *target = 0x12345; > > - sleep(1); > > - /* server does not compare and should not swap */ > > - status = dat_ib_post_cmp_and_swap( ep, > > - (DAT_UINT64)0x654321, > > - (DAT_UINT64)0x6789A, > > - &l_iov, > > - cookie, > > - &r_iov, > > - DAT_COMPLETION_DEFAULT_FLAG); > > - } else { > > - *target = 0x54321; > > - sleep(1); > > - /* client does compare and should swap */ > > - status = dat_ib_post_cmp_and_swap( ep, > > - (DAT_UINT64)0x12345, > > - (DAT_UINT64)0x98765, > > - &l_iov, > > - cookie, > > - &r_iov, > > - 
DAT_COMPLETION_DEFAULT_FLAG); > > - } > > - _OK(status, "dat_ib_post_cmp_and_swap"); > > - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > > - _OK(status, "dat_evd_wait for compare and swap"); > > - if (event.event_number != DAT_IB_DTO_EVENT) { > > - printf("unexpected event after post_cmp_and_swap: 0x%x\n", > > - event.event_number); > > - exit(1); > > - } > > - > > - _OK(dto_event->status, "event status for CMP and SWAP"); > > - if (ext_event->type != DAT_IB_CMP_AND_SWAP) { > > - printf("unexpected event data of cmp and swap : type=%d cookie=%d original 0x%llx\n", > > - (int)ext_event->type, > > - (int)dto_event->user_cookie.as_64, > > - *atomic_buf); > > - exit(1); > > - } > > - sleep(1); /* wait for other side to complete swap */ > > - if (server) { > > - printf("Server got original data = 0x%llx, expected 0x54321\n", *atomic_buf); > > - printf("Client final result (on server) = 0x%llx, expected 0x98765\n", *target); > > - > > - if (*atomic_buf != 0x54321 || *target != 0x98765) { > > - printf("ERROR: Server CMP_SWAP\n"); > > - exit(1); > > - } > > - } else { > > - printf("Client got original data = 0x%llx, expected 0x12345\n",*atomic_buf); > > - printf("Server final result (on client) = 0x%llx, expected 0x54321\n", *target); > > - > > - if (*atomic_buf != 0x12345 || *target != 0x54321) { > > - printf("ERROR: Client CMP_SWAP\n"); > > - exit(1); > > - } > > - } > > - printf("\n CMP_SWAP test - PASSED\n"); > > - return(0); > > -} > > - > > -int > > -do_fetch_add() > > -{ > > - DAT_DTO_COOKIE cookie; > > - DAT_RETURN status; > > - DAT_EVENT event; > > - DAT_COUNT nmore; > > - DAT_LMR_TRIPLET l_iov; > > - DAT_RMR_TRIPLET r_iov; > > - volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; > > - DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > > - &event.event_data.dto_completion_event_data; > > - DAT_IB_EXTENSION_EVENT_DATA *ext_event = > > - (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; > > - > > - printf("\nDoing FETCH 
and ADD\n"); > > - > > - r_iov = *buf[ RECV_BUF_INDEX ]; > > - > > - l_iov.lmr_context = lmr_atomic_context; > > - l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; > > - l_iov.segment_length = BUF_SIZE_ATOMIC; > > - > > - cookie.as_64 = 0x7777; > > - if (server) { > > - /* Wait for client to finish cmp_swap */ > > - while (*target != 0x98765) > > - sleep(1); > > - *target = 0x10; > > - sleep(1); > > - status = dat_ib_post_fetch_and_add( ep, > > - (DAT_UINT64)0x100, > > - &l_iov, > > - cookie, > > - &r_iov, > > - DAT_COMPLETION_DEFAULT_FLAG); > > - } else { > > - /* Wait for server, no swap so nothing to check */ > > - *target = 0x100; > > - sleep(1); > > - status = dat_ib_post_fetch_and_add( ep, > > - (DAT_UINT64)0x10, > > - &l_iov, > > - cookie, > > - &r_iov, > > - DAT_COMPLETION_DEFAULT_FLAG); > > - } > > - _OK(status, "dat_ib_post_fetch_and_add"); > > - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > > - _OK(status, "dat_evd_wait for fetch and add"); > > - if (event.event_number != DAT_IB_DTO_EVENT) { > > - printf("unexpected event after post_fetch_and_add: 0x%x\n", event.event_number); > > - exit(1); > > - } > > - > > - _OK(dto_event->status, "event status for FETCH and ADD"); > > - if (ext_event->type != DAT_IB_FETCH_AND_ADD) { > > - printf("unexpected event data of fetch and add : type=%d cookie=%d original%d\n", > > - (int)ext_event->type, > > - (int)dto_event->user_cookie.as_64, > > - (int)*atomic_buf); > > - exit(1); > > - } > > - > > - if (server) { > > - printf("Client original data (on server) = 0x%llx, expected 0x100\n", *atomic_buf); > > - } else { > > - printf("Server original data (on client) = 0x%llx, expected 0x10\n", *atomic_buf); > > - } > > - > > - sleep(1); > > - > > - if (server) { > > - status = dat_ib_post_fetch_and_add( ep, > > - (DAT_UINT64)0x100, > > - &l_iov, > > - cookie, > > - &r_iov, > > - DAT_COMPLETION_DEFAULT_FLAG); > > - } else { > > - status = dat_ib_post_fetch_and_add( ep, > > - (DAT_UINT64)0x10, 
> > - &l_iov, > > - cookie, > > - &r_iov, > > - DAT_COMPLETION_DEFAULT_FLAG); > > - } > > - > > - status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > > - _OK(status, "dat_evd_wait for second fetch and add"); > > - if (event.event_number != DAT_IB_DTO_EVENT) { > > - printf("unexpected event after second post_fetch_and_add: 0x%x\n", event.event_number); > > - exit(1); > > - } > > - > > - _OK(dto_event->status, "event status for second FETCH and ADD"); > > - if (ext_event->type != DAT_IB_FETCH_AND_ADD) { > > - printf("unexpected event data of second fetch and add : type=%d cookie=%d original%d\n", > > - (int)ext_event->type, > > - (int)dto_event->user_cookie.as_64, > > - (long)atomic_buf); > > - exit(1); > > - } > > - > > - sleep(1); /* wait for other side to complete fetch_add */ > > - > > - if (server) { > > - printf("Server got original data = 0x%llx, expected 0x200\n", *atomic_buf); > > - printf("Client final result (on server) = 0x%llx, expected 0x30\n", *target); > > - > > - if (*atomic_buf != 0x200 || *target != 0x30) { > > - printf("ERROR: Server FETCH_ADD\n"); > > - exit(1); > > - } > > - } else { > > - printf("Server side original data = 0x%llx, expected 0x20\n", *atomic_buf); > > - printf("Server final result (on client) = 0x%llx, expected 0x300\n", *target); > > - > > - if (*atomic_buf != 0x20 || *target != 0x300) { > > - printf("ERROR: Server FETCH_ADD\n"); > > - exit(1); > > - } > > - } > > - printf("\n FETCH_ADD test - PASSED\n"); > > - return(0); > > -} > > - > > -int > > -main(int argc, char **argv) > > -{ > > - char *hostname; > > - > > - if (argc > 2) { > > - printf(usage); > > - exit(1); > > - } > > - > > - if ((argc == 1) || strcmp(argv[ 1 ], "-s") == 0) > > - { > > - server = 1; > > - } else { > > - server = 0; > > - hostname = argv[ 1 ]; > > - } > > - > > - > > - /* > > - * connect > > - */ > > - if (connect_ep(hostname)) { > > - exit(1); > > - } > > - if (do_immediate()) { > > - exit(1); > > - } > > - if (do_cmp_swap()) { > > - 
exit(1); > > - } > > - if (do_fetch_add()) { > > - exit(1); > > - } > > - return (disconnect_ep()); > > -} > > + > + if (server) { > + status = dat_psp_free(psp); > + _OK(status, "dat_psp_free"); > + } > + > + for (i = 0; i < REG_MEM_COUNT; i++) { > + status = dat_lmr_free(lmr[ i ]); > + _OK(status, "dat_lmr_free"); > + } > + > + status = dat_lmr_free(lmr_atomic); > + _OK(status, "dat_lmr_free_atomic"); > + > + status = dat_ep_free(ep); > + _OK(status, "dat_ep_free"); > + > + status = dat_evd_free(dto_evd); > + _OK(status, "dat_evd_free DTO"); > + status = dat_evd_free(con_evd); > + _OK(status, "dat_evd_free CON"); > + status = dat_evd_free(cr_evd); > + _OK(status, "dat_evd_free CR"); > + > + status = dat_pz_free(pz); > + _OK(status, "dat_pz_free"); > + > + status = dat_ia_close(ia, DAT_CLOSE_DEFAULT); > + _OK(status, "dat_ia_close"); > + > + return(0); > +} > + > +int > +do_immediate() > +{ > + DAT_REGION_DESCRIPTION region; > + DAT_EVENT event; > + DAT_COUNT nmore; > + DAT_LMR_TRIPLET iov; > + DAT_RMR_TRIPLET r_iov; > + DAT_DTO_COOKIE cookie; > + DAT_RMR_CONTEXT their_context; > + DAT_RETURN status; > + DAT_UINT32 immed_data; > + DAT_UINT32 immed_data_recv; > + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > + &event.event_data.dto_completion_event_data; > + DAT_IB_EXTENSION_EVENT_DATA *ext_event = > + (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; > + > + printf("\nDoing RDMA WRITE IMMEDIATE DATA\n"); > + > + if (server) { > + immed_data = 0x1111; > + } else { > + immed_data = 0x7777; > + } > + > + cookie.as_64 = 0x5555; > + > + r_iov = *buf[ RECV_BUF_INDEX ]; > + > + iov.lmr_context = lmr_context[ SND_RDMA_BUF_INDEX ]; > + iov.virtual_address = (DAT_VADDR)(unsigned long)buf[ SND_RDMA_BUF_INDEX ]; > + iov.segment_length = BUF_SIZE; > + > + cookie.as_64 = 0x9999; > + > + status = dat_ib_post_rdma_write_immed(ep, // ep_handle > + 1, // num_segments > + &iov, // LMR > + cookie, // user_cookie > + &r_iov, // RMR > + immed_data, > + 
DAT_COMPLETION_DEFAULT_FLAG); > + _OK(status, "dat_ib_post_rdma_write_immed"); > + > + /* > + * Collect first event, write completion or the inbound recv with immed > + */ > + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > + _OK(status, "dat_evd_wait after dat_ib_post_rdma_write"); > + if (event.event_number != DAT_IB_DTO_EVENT) > + { > + printf("unexpected event # waiting for WR-IMMED - 0x%x\n", > + event.event_number); > + exit(1); > + } > + > + _OK(dto_event->status, "event status"); > + if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED) > + { > + if ((dto_event->transfered_length != BUF_SIZE) || > + (dto_event->user_cookie.as_64 != 0x9999)) > + { > + printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", > + (int)dto_event->transfered_length, > + (int)dto_event->user_cookie.as_64); > + exit(1); > + } > + } > + else if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED_DATA) > + { > + if ((dto_event->transfered_length != BUF_SIZE) || > + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) > + { > + printf("unexpected event data of immediate write: len=%d cookie=%d expected %d/%d\n", > + (int)dto_event->transfered_length, > + (int)dto_event->user_cookie.as_64, > + sizeof(int), RECV_BUF_INDEX+1); > + exit(1); > + } > + > + /* get immediate data from event */ > + immed_data_recv = ext_event->val.immed.data; > + } > + else > + { > + printf("unexpected extension type for event - 0x%x, 0x%x\n", > + event.event_number, ext_event->type); > + exit(1); > + } > + > + > + /* > + * Collect second event, write completion or the inbound recv with immed > + */ > + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > + _OK(status, "dat_evd_wait after dat_ib_post_rdma_write"); > + if (event.event_number != DAT_IB_DTO_EVENT) > + { > + printf("unexpected event # waiting for WR-IMMED - 0x%x\n", > + event.event_number); > + exit(1); > + } > + > + _OK(dto_event->status, "event status"); > + if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED) > + { 
> + if ((dto_event->transfered_length != BUF_SIZE) || > + (dto_event->user_cookie.as_64 != 0x9999)) > + { > + printf("unexpected event data for rdma_write_immed: len=%d cookie=0x%x\n", > + (int)dto_event->transfered_length, > + (int)dto_event->user_cookie.as_64); > + exit(1); > + } > + } > + else if (ext_event->type == DAT_IB_RDMA_WRITE_IMMED_DATA) > + { > + if ((dto_event->transfered_length != BUF_SIZE) || > + (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) > + { > + printf("unexpected event data of immediate write: len=%d cookie=%d expected %d/%d\n", > + (int)dto_event->transfered_length, > + (int)dto_event->user_cookie.as_64, > + sizeof(int), RECV_BUF_INDEX+1); > + exit(1); > + } > + > + /* get immediate data from event */ > + immed_data_recv = ext_event->val.immed.data; > + } > + else > + { > + printf("unexpected extension type for event - 0x%x, 0x%x\n", > + event.event_number, ext_event->type); > + exit(1); > + } > + > + if ((server) && (immed_data_recv != 0x7777)) > + { > + printf("ERROR: Server got unexpected immed_data_recv 0x%x/0x%x\n", > + 0x7777, immed_data_recv); > + exit(1); > + } > + else if ((!server) && (immed_data_recv != 0x1111)) > + { > + printf("ERROR: Client got unexpected immed_data_recv 0x%x/0x%x\n", > + 0x1111, immed_data_recv); > + exit(1); > + } > + > + if (server) > + printf("Server received immed_data=0x%x\n", immed_data_recv); > + else > + printf("Client received immed_data=0x%x\n", immed_data_recv); > + > + printf("rdma buffer %p contains: %s\n", > + buf[ RCV_RDMA_BUF_INDEX ], buf[ RCV_RDMA_BUF_INDEX ]); > + > + printf("\n RDMA_WRITE_WITH_IMMEDIATE_DATA test - PASSED\n"); > + return (0); > +} > + > +int > +do_cmp_swap() > +{ > + DAT_DTO_COOKIE cookie; > + DAT_RETURN status; > + DAT_EVENT event; > + DAT_COUNT nmore; > + DAT_LMR_TRIPLET l_iov; > + DAT_RMR_TRIPLET r_iov; > + volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; > + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > + 
&event.event_data.dto_completion_event_data; > + DAT_IB_EXTENSION_EVENT_DATA *ext_event = > + (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; > + > + printf("\nDoing CMP and SWAP\n"); > + > + r_iov = *buf[ RECV_BUF_INDEX ]; > + > + l_iov.lmr_context = lmr_atomic_context; > + l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; > + l_iov.segment_length = BUF_SIZE_ATOMIC; > + > + cookie.as_64 = 3333; > + if (server) { > + *target = 0x12345; > + sleep(1); > + /* server does not compare and should not swap */ > + status = dat_ib_post_cmp_and_swap( ep, > + (DAT_UINT64)0x654321, > + (DAT_UINT64)0x6789A, > + &l_iov, > + cookie, > + &r_iov, > + DAT_COMPLETION_DEFAULT_FLAG); > + } else { > + *target = 0x54321; > + sleep(1); > + /* client does compare and should swap */ > + status = dat_ib_post_cmp_and_swap( ep, > + (DAT_UINT64)0x12345, > + (DAT_UINT64)0x98765, > + &l_iov, > + cookie, > + &r_iov, > + DAT_COMPLETION_DEFAULT_FLAG); > + } > + _OK(status, "dat_ib_post_cmp_and_swap"); > + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > + _OK(status, "dat_evd_wait for compare and swap"); > + if (event.event_number != DAT_IB_DTO_EVENT) { > + printf("unexpected event after post_cmp_and_swap: 0x%x\n", > + event.event_number); > + exit(1); > + } > + > + _OK(dto_event->status, "event status for CMP and SWAP"); > + if (ext_event->type != DAT_IB_CMP_AND_SWAP) { > + printf("unexpected event data of cmp and swap : type=%d cookie=%d original 0x%llx\n", > + (int)ext_event->type, > + (int)dto_event->user_cookie.as_64, > + *atomic_buf); > + exit(1); > + } > + sleep(1); /* wait for other side to complete swap */ > + if (server) { > + printf("Server got original data = 0x%llx, expected 0x54321\n", *atomic_buf); > + printf("Client final result (on server) = 0x%llx, expected 0x98765\n", *target); > + > + if (*atomic_buf != 0x54321 || *target != 0x98765) { > + printf("ERROR: Server CMP_SWAP\n"); > + exit(1); > + } > + } else { > + printf("Client got 
original data = 0x%llx, expected 0x12345\n",*atomic_buf); > + printf("Server final result (on client) = 0x%llx, expected 0x54321\n", *target); > + > + if (*atomic_buf != 0x12345 || *target != 0x54321) { > + printf("ERROR: Client CMP_SWAP\n"); > + exit(1); > + } > + } > + printf("\n CMP_SWAP test - PASSED\n"); > + return(0); > +} > + > +int > +do_fetch_add() > +{ > + DAT_DTO_COOKIE cookie; > + DAT_RETURN status; > + DAT_EVENT event; > + DAT_COUNT nmore; > + DAT_LMR_TRIPLET l_iov; > + DAT_RMR_TRIPLET r_iov; > + volatile DAT_UINT64 *target = (DAT_UINT64*)buf[ RCV_RDMA_BUF_INDEX ]; > + DAT_DTO_COMPLETION_EVENT_DATA *dto_event = > + &event.event_data.dto_completion_event_data; > + DAT_IB_EXTENSION_EVENT_DATA *ext_event = > + (DAT_IB_EXTENSION_EVENT_DATA *)&event.event_extension_data[0]; > + > + printf("\nDoing FETCH and ADD\n"); > + > + r_iov = *buf[ RECV_BUF_INDEX ]; > + > + l_iov.lmr_context = lmr_atomic_context; > + l_iov.virtual_address = (DAT_VADDR)(unsigned long)atomic_buf; > + l_iov.segment_length = BUF_SIZE_ATOMIC; > + > + cookie.as_64 = 0x7777; > + if (server) { > + /* Wait for client to finish cmp_swap */ > + while (*target != 0x98765) > + sleep(1); > + *target = 0x10; > + sleep(1); > + status = dat_ib_post_fetch_and_add( ep, > + (DAT_UINT64)0x100, > + &l_iov, > + cookie, > + &r_iov, > + DAT_COMPLETION_DEFAULT_FLAG); > + } else { > + /* Wait for server, no swap so nothing to check */ > + *target = 0x100; > + sleep(1); > + status = dat_ib_post_fetch_and_add( ep, > + (DAT_UINT64)0x10, > + &l_iov, > + cookie, > + &r_iov, > + DAT_COMPLETION_DEFAULT_FLAG); > + } > + _OK(status, "dat_ib_post_fetch_and_add"); > + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > + _OK(status, "dat_evd_wait for fetch and add"); > + if (event.event_number != DAT_IB_DTO_EVENT) { > + printf("unexpected event after post_fetch_and_add: 0x%x\n", event.event_number); > + exit(1); > + } > + > + _OK(dto_event->status, "event status for FETCH and ADD"); > + if (ext_event->type 
!= DAT_IB_FETCH_AND_ADD) { > + printf("unexpected event data of fetch and add : type=%d cookie=%d original%d\n", > + (int)ext_event->type, > + (int)dto_event->user_cookie.as_64, > + (int)*atomic_buf); > + exit(1); > + } > + > + if (server) { > + printf("Client original data (on server) = 0x%llx, expected 0x100\n", *atomic_buf); > + } else { > + printf("Server original data (on client) = 0x%llx, expected 0x10\n", *atomic_buf); > + } > + > + sleep(1); > + > + if (server) { > + status = dat_ib_post_fetch_and_add( ep, > + (DAT_UINT64)0x100, > + &l_iov, > + cookie, > + &r_iov, > + DAT_COMPLETION_DEFAULT_FLAG); > + } else { > + status = dat_ib_post_fetch_and_add( ep, > + (DAT_UINT64)0x10, > + &l_iov, > + cookie, > + &r_iov, > + DAT_COMPLETION_DEFAULT_FLAG); > + } > + > + status = dat_evd_wait(dto_evd, DTO_TIMEOUT, 1, &event, &nmore); > + _OK(status, "dat_evd_wait for second fetch and add"); > + if (event.event_number != DAT_IB_DTO_EVENT) { > + printf("unexpected event after second post_fetch_and_add: 0x%x\n", event.event_number); > + exit(1); > + } > + > + _OK(dto_event->status, "event status for second FETCH and ADD"); > + if (ext_event->type != DAT_IB_FETCH_AND_ADD) { > + printf("unexpected event data of second fetch and add : type=%d cookie=%d original%d\n", > + (int)ext_event->type, > + (int)dto_event->user_cookie.as_64, > + (long)atomic_buf); > + exit(1); > + } > + > + sleep(1); /* wait for other side to complete fetch_add */ > + > + if (server) { > + printf("Server got original data = 0x%llx, expected 0x200\n", *atomic_buf); > + printf("Client final result (on server) = 0x%llx, expected 0x30\n", *target); > + > + if (*atomic_buf != 0x200 || *target != 0x30) { > + printf("ERROR: Server FETCH_ADD\n"); > + exit(1); > + } > + } else { > + printf("Server side original data = 0x%llx, expected 0x20\n", *atomic_buf); > + printf("Server final result (on client) = 0x%llx, expected 0x300\n", *target); > + > + if (*atomic_buf != 0x20 || *target != 0x300) { > + printf("ERROR: 
Server FETCH_ADD\n"); > + exit(1); > + } > + } > + printf("\n FETCH_ADD test - PASSED\n"); > + return(0); > +} > + > +int > +main(int argc, char **argv) > +{ > + char *hostname; > + > + if (argc > 2) { > + printf(usage); > + exit(1); > + } > + > + if ((argc == 1) || strcmp(argv[ 1 ], "-s") == 0) > + { > + server = 1; > + } else { > + server = 0; > + hostname = argv[ 1 ]; > + } > + > + > + /* > + * connect > + */ > + if (connect_ep(hostname)) { > + exit(1); > + } > + if (do_immediate()) { > + exit(1); > + } > + if (do_cmp_swap()) { > + exit(1); > + } > + if (do_fetch_add()) { > + exit(1); > + } > + return (disconnect_ep()); > +} > > > > > > > > > From jlentini at netapp.com Fri Sep 21 12:41:30 2007 From: jlentini at netapp.com (James Lentini) Date: Fri, 21 Sep 2007 15:41:30 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] uDAPL 1.2 mods to coexist with uDAPL 2.0 In-Reply-To: <000501c7fbbb$0d084390$19b7020a@amr.corp.intel.com> References: <000501c7fbbb$0d084390$19b7020a@amr.corp.intel.com> Message-ID: I agree with the goal of supporting both 1.2 and 2.0 implementations on the same system. Thanks for working on this. Comments below: On Thu, 20 Sep 2007, Arlin Davis wrote: > > James, > > Please review patches to allow coexistence of 2.0 and 1.2 libraries. > > Modifications to DAT 1.2 package to coexist with 2.0 libraries > - fix RPM specfile, configure.in, 1.2.2 package > - update dat.conf > > > Signed-off by: Arlin Davis ardavis at ichips.intel.com > > diff --git a/configure.in b/configure.in > index e11fa73..3cb3d1b 100644 > --- a/configure.in > +++ b/configure.in > @@ -1,11 +1,11 @@ > dnl Process this file with autoconf to produce a configure script. > > AC_PREREQ(2.57) > -AC_INIT(dapl, 1.2.1, openib-general at openib.org) > +AC_INIT(dapl, 1.2.2, openib-general at openib.org) How about general at lists.openfabrics.org? 
> AC_CONFIG_SRCDIR([dat/udat/udat.c])
> AC_CONFIG_AUX_DIR(config)
> AM_CONFIG_HEADER(config.h)
> -AM_INIT_AUTOMAKE(dapl, 1.2.1)
> +AM_INIT_AUTOMAKE(dapl, 1.2.2)
>
> AM_PROG_LIBTOOL
>
> diff --git a/doc/dat.conf b/doc/dat.conf
> index cb9ff00..005f9ee 100644
> --- a/doc/dat.conf
> +++ b/doc/dat.conf
> @@ -1,5 +1,5 @@
> #
> -# DAT 1.2 configuration file
> +# DAT 1.2 and 2.0 configuration file
> #
> # Each entry should have the following fields:
> #
> @@ -9,13 +9,18 @@
> # For the uDAPL cma provder, specify as one of the following:
> # network address, network hostname, or netdev name and 0 for port
> #
> -# Simple (OpenIB-cma) default with netdev name provided first on list
> +# Simple (OpenIB-cma) default with netdev name provided first on list
> # to enable use of same dat.conf version on all nodes
> -#
> -# Add examples for multiple interfaces and IPoIB HA fail over, and bonding
> #
> -OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib0 0" ""
> -OpenIB-cma-1 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib1 0" ""
> -OpenIB-cma-2 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib2 0" ""
> -OpenIB-cma-3 u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "ib3 0" ""
> -OpenIB-bond u1.2 nonthreadsafe default /usr/lib/libdaplcma.so dapl.1.2 "bond0 0" ""
> +# Add examples for multiple interfaces and IPoIB HA fail over, and bonding

As in the comments in my previous mail, this looks like a TODO to me.

> +#
> +OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
> +OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
> +OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0" ""
> +OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0" ""
> +OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0" ""
> +OpenIB-2-cma u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib0 0" ""
> +OpenIB-2-cma-1 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib1 0" ""
> +OpenIB-2-cma-2 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib2 0" ""
> +OpenIB-2-cma-3 u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib3 0" ""
> +OpenIB-2-bond u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "bond0 0" ""
> diff --git a/libdat.spec.in b/libdat.spec.in
> index 7e81b97..15b8694 100644
> --- a/libdat.spec.in
> +++ b/libdat.spec.in
> @@ -33,7 +33,7 @@
> # $Id: $
>
> %define ver 1.2
> -%define RELEASE 1
> +%define RELEASE 2
> %define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}
>
> Summary: Userspace DAT and DAPL API.
> @@ -43,8 +43,8 @@ Release: %rel%{?dist}
>
> License: Dual GPL/BSD/CPL
> Group: System Environment/Libraries
> -BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
> -Source: http://openfabrics.org/~ardavis/%{name}-%{version}-%{release}.tgz
> +BuildRoot: %{_tmppath}/%{name}-%{version}.%{release}-root-%(%{__id_u} -n)
> +Source: http://openfabrics.org/downloads/dapl/%{name}-%{version}.%{release}.tar.gz
> Url: http://openfabrics.org/
>
> %description
> @@ -54,7 +54,7 @@ RDMA API that supports DAT 1.2 specification
> %package devel
> Summary: Development files for the libdat and libdapl libraries
> Group: System Environment/Libraries
> -Requires: %{name} = %{version}-%{release}
> +Requires: %{name} = %{version}.%{release}
>
> %description devel
> Static libraries and header files for the libdat and libdapl library.
> @@ -62,16 +62,15 @@ Static libraries and header files for the libdat and libdapl library.
> %package utils
> Summary: Test suites for uDAPL library
> Group: System Environment/Libraries
> -Requires: %{name} = %{version}-%{release}
> +Requires: %{name} = %{version}.%{release}
>
> %description utils
> Useful test suites to validate uDAPL library API's.
>
> %prep
> -%setup -q -n %{name}
> +%setup -q -n %{name}-%{version}.%{release}
>
> %build
> -./autogen.sh
> %configure
> make
>
> @@ -112,7 +111,10 @@ rm -rf $RPM_BUILD_ROOT
> %{_mandir}/man1/*
>
> %changelog
> -* Wed June 6 2007 Arlin Davis - 1.2.1
> +* Wed Jun 6 2007 Arlin Davis - 1.2.2
> +- OFED 1.3, DAT/DAPL Version 1.2, Release 2
> +
> +* Wed Jun 6 2007 Arlin Davis - 1.2.1
> - OFED 1.2, DAT/DAPL Version 1.2, Release 1
>
> * Fri Oct 20 2006 Arlin Davis - 1.2.0
>

From rdreier at cisco.com Fri Sep 21 12:48:32 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 21 Sep 2007 12:48:32 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46F0060E.1080505@ichips.intel.com> (Arlin Davis's message of "Tue, 18 Sep 2007 10:08:30 -0700")
References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com>
Message-ID:

> Maintainers,
> Please move your packages and update your WEB_README. Currently we
> only have rdmacm, dapl, cxgb3, and WinOF updated for this process.

Has anyone defined what the contents of WEB_README should be?
Specifically, what information should it contain? How should it be
formatted? Is it plain text, or is HTML markup allowed, or... ?

Also, same questions about the README file.

Has any work been done to make the individual package download
directories more usable? eg if I go to
http://www.openfabrics.org/downloads/rdmacm/
then I just get a raw directory listing...
it would be nice to be able to include a "latest" link, SHA1 checksum,
link to git repository, etc.

Thanks,
Roland

From mshefty at ichips.intel.com Fri Sep 21 13:19:24 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 21 Sep 2007 13:19:24 -0700
Subject: [ofa-general] [PATCH] IB/core - possible bug in handling link down in ib_sa_join_multicast()
In-Reply-To:
References: <1190331224.20700.27.camel@brick.pathscale.com> <46F3099E.7040008@ichips.intel.com>
Message-ID: <46F4274C.9080108@ichips.intel.com>

> Yes, please, if the original description is wrong then please correct it.

I took Ralph's patch, updated the description, and pushed the patch out
to my git tree (on top of the previous patches that were there):

git://git.openfabrics.org/~shefty/rdma-dev.git for-roland

- Sean

From ardavis at ichips.intel.com Fri Sep 21 14:52:07 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Fri, 21 Sep 2007 14:52:07 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To:
References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com>
Message-ID: <46F43D07.1010902@ichips.intel.com>

Roland Dreier wrote:
> > Maintainers,
>
> > Please move your packages and update your WEB_README. Currently we
> > only have rdmacm, dapl, cxgb3, and WinOF updated for this process.
>
> Has anyone defined what the contents of WEB_README should be?
> Specifically, what information should it contain? How should it be
> formatted? Is it plain text, or is HTML markup allowed, or... ?

Plain text, short summary about the project and directory contents.

>
> Also, same questions about the README file.

More exhaustive text about the download files and packages. Could
include source repository location, build, or install information. It
is really up to the maintainer.

>
> Has any work been done to make the individual package download
> directories more usable? eg if I go to
> http://www.openfabrics.org/downloads/rdmacm/
> then I just get a raw directory listing... it would be nice to be able
> to include a "latest" link, SHA1 checksum, link to git repository,

We can always expand on the raw directory view. The immediate goal was
to move away from the outdated static links without generating a lot of
work for Jeff and the maintainers.

Jeff, can something be done to make this more usable per Roland's request?

-arlin

From rdreier at cisco.com Fri Sep 21 14:54:09 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 21 Sep 2007 14:54:09 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46F43D07.1010902@ichips.intel.com> (Arlin Davis's message of "Fri, 21 Sep 2007 14:52:07 -0700")
References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com> <46F43D07.1010902@ichips.intel.com>
Message-ID:

> Plain text, short summary about the project and directory contents.
>
> > Also, same questions about the README file.
>
> More exhaustive text about the download files and packages. Could
> include source repository location, build, or install information. It
> is really up to the maintainer.

Plain text is not ideal because it means that the README can't include
hyperlinks (eg to the repository, wiki pages, etc).

- R.

From ardavis at ichips.intel.com Fri Sep 21 15:44:33 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 21 Sep 2007 15:44:33 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com> <46F43D07.1010902@ichips.intel.com> Message-ID: <46F44951.6080401@ichips.intel.com> Roland Dreier wrote: > > Plain text, short summary about the project and directory contents. > > > > > Also, same questions about the README file. > > > > More exhaustive text about the download files and packages. Could > > include source repository location, build, or install information. It > > is really up to the maintainer. > > Plain text is not ideal because it means that the README can't include > hyperlinks (eg to the repository, wiki pages, etc). > I agree, but plain text keeps it simple for maintainers. If someone wants to come up with a template that does not require the maintainers to become webmasters, then I would be more then happy to give it a try. -arlin From rdreier at cisco.com Fri Sep 21 15:54:31 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 21 Sep 2007 15:54:31 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46F44951.6080401@ichips.intel.com> (Arlin Davis's message of "Fri, 21 Sep 2007 15:44:33 -0700") References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com> <46F43D07.1010902@ichips.intel.com> <46F44951.6080401@ichips.intel.com> Message-ID: > I agree, but plain text keeps it simple for maintainers. 
If someone > wants to come up with a template that does not require the maintainers > to become webmasters, then I would be more then happy to give it a try. Actually nothing is parsing or using README right now, correct? It's just a file that appears in the directory listing. So I could just stick a README.html in my directory instead and it should all work fine, right? - R. From ardavis at ichips.intel.com Fri Sep 21 16:18:04 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 21 Sep 2007 16:18:04 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com> <46F43D07.1010902@ichips.intel.com> <46F44951.6080401@ichips.intel.com> Message-ID: <46F4512C.4010505@ichips.intel.com> Roland Dreier wrote: > > I agree, but plain text keeps it simple for maintainers. If someone > > wants to come up with a template that does not require the maintainers > > to become webmasters, then I would be more then happy to give it a try. > > Actually nothing is parsing or using README right now, correct? It's > just a file that appears in the directory listing. > > So I could just stick a README.html in my directory instead and it > should all work fine, right? Yes, correct. WEB_README is the only file parsed. Maybe we can have Jeff optionally link directly to a project README.html instead of the raw directory if README.html exists. Jeff, can this be done? 
-arlin From akshay.mathur at qlogic.com Fri Sep 21 16:19:50 2007 From: akshay.mathur at qlogic.com (Akshay Mathur) Date: Fri, 21 Sep 2007 18:19:50 -0500 Subject: [ofa-general] libibmad question forward In-Reply-To: <1190311314.7075.102.camel@hrosenstock-ws.xsigo.com> Message-ID: <99863D2ED484D449811D97A4C44C9CBD4D9A77@EPEXCH2.qlogic.org> I was wondering if the patch to Device Mgmt class in mgmt_class_vers() was applied. Hal, Roland, Sean, The value return by mgmt_class_vers() is used for agent registration with umad and is not used for creating mad requests by mad_encode(). So, What is the implication of registering with mgmt_class_version of 2? Does the driver implements backwards compatibility? In my experiments, I was able to register when mgmt_class_vers() returned a version value of 2 and send / receive DM queries with class_version set to 1. Thanks Akshay Mathur QLogic Corporation 780 Fifth Avenue, Suite 140 King of Prussia, PA 19406 Office: 610.233.4836 Fax: 610.233.4777 -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock Sent: Thursday, September 20, 2007 2:02 PM To: Sasha Khapyorsky Cc: Jeff Becker; general Subject: Re: [ofa-general] libibmad question forward On Thu, 2007-09-20 at 18:16 +0200, Sasha Khapyorsky wrote: > On 16:27 Wed 19 Sep , Hal Rosenstock wrote: > > On Wed, 2007-09-19 at 16:10 -0700, Jeff Becker wrote: > > > I am trying to use libibmad library for initiating queries of Device > > > Management and other class types. While initializing, the > > > madrpc_init() call fails when I have IB_DEVICE_MGMT_CLASS included as > > > a part of mgmt_classes parameter. This is because mgmt_class_vers() > > > (which is called by mad_register_port_client()/ mad_register_client()) > > > fails to return class version for Device Management Class. > > > > > > I am able to make DM queries if mgmt_class_vers() is fixed i.e. 
just > > > add a case to return the version for IB_DEVICE_MGMT_CLASS. e.g. > > > > > > static int > > > mgmt_class_vers(int mgmt_class) > > > > > > { > > > > > > if ((mgmt_class >= IB_VENDOR_RANGE1_START_CLASS && > > > mgmt_class <= IB_VENDOR_RANGE1_END_CLASS) || > > > (mgmt_class >= IB_VENDOR_RANGE2_START_CLASS && > > > mgmt_class <= IB_VENDOR_RANGE2_END_CLASS)) > > > return 1; > > > > > > switch(mgmt_class) { > > > case IB_SMI_CLASS: > > > case IB_SMI_DIRECT_CLASS: > > > return 1; > > > case IB_SA_CLASS: > > > return 2; > > > case IB_PERFORMANCE_CLASS: > > > return 1; > > > // Change START > > > case IB_DEVICE_MGMT_CLASS: > > > return 1; Actually, there is an annex which makes this class version 2 which is supposed to support backward compatibility for version 1. I'm not sure whether both are in use (as to how important the backward compatibility is with this). Maybe someone else can comment on this aspect. -- Hal > > > // Change END > > > } > > > > > > return 0; > > > > > > I am wondering if this minor anomaly can be submitted as a bug to > > > broaden the usage of libibmad its usage for DM queries. > > > > Yes, DM class (and perhaps some other missing GS classes) should be > > added there. > > So, I'm going to apply this. > > Sasha > > From 46ad958b33c456672e2af711f36b494d398316bb Mon Sep 17 00:00:00 2001 > From: Jeff Becker > Date: Thu, 20 Sep 2007 17:48:55 +0200 > Subject: [PATCH] libibmad: add support for IB_DEVICE_MGMT_CLASS > > From: Jeff Becker > > This adds IB_DEVICE_MGMT_CLASS to list of classes for which version is > returned. 
> > Signed-off-by: Sasha Khapyorsky > --- > libibmad/src/register.c | 2 ++ > 1 files changed, 2 insertions(+), 0 deletions(-) > > diff --git a/libibmad/src/register.c b/libibmad/src/register.c > index 3d1285a..d80fa14 100644 > --- a/libibmad/src/register.c > +++ b/libibmad/src/register.c > @@ -95,6 +95,8 @@ mgmt_class_vers(int mgmt_class) > return 2; > case IB_PERFORMANCE_CLASS: > return 1; > + case IB_DEVICE_MGMT_CLASS: > + return 1; > } > > return 0; _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Fri Sep 21 16:21:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 21 Sep 2007 16:21:49 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46F4512C.4010505@ichips.intel.com> (Arlin Davis's message of "Fri, 21 Sep 2007 16:18:04 -0700") References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <795c49870708081047v2d5598b6q6c7ef5a91063897c@mail.gmail.com> <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com> <46F43D07.1010902@ichips.intel.com> <46F44951.6080401@ichips.intel.com> <46F4512C.4010505@ichips.intel.com> Message-ID: > Yes, correct. WEB_README is the only file parsed. Maybe we can have > Jeff optionally link directly to a project README.html instead of the > raw directory if README.html exists. Jeff, can this be done? Actually WEB_README is not parsed, just copied line-by-line into the main generated page (I just looked at the php source). So HTML markup would work for WEB_README too. And there's no link to the README generated by anything except the raw directory listing anyway. So it doesn't matter if you call your file README, README.html, or WHATEVER.html, it will still show up in the directory listing the same way. 
You could even have both README.txt and README.html in the directory. - R. From becker at nas.nasa.gov Fri Sep 21 17:02:30 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Fri, 21 Sep 2007 17:02:30 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: References: <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com> <46F43D07.1010902@ichips.intel.com> <46F44951.6080401@ichips.intel.com> <46F4512C.4010505@ichips.intel.com> Message-ID: <795c49870709211702k1294cd79y5b7c987b04958adf@mail.gmail.com> I'm OK with these suggestions. Please let me know what you would like implemented. Thanks. -jeff On 9/21/07, Roland Dreier wrote: > > Yes, correct. WEB_README is the only file parsed. Maybe we can have > > Jeff optionally link directly to a project README.html instead of the > > raw directory if README.html exists. Jeff, can this be done? > > Actually WEB_README is not parsed, just copied line-by-line into the > main generated page (I just looked at the php source). So HTML markup > would work for WEB_README too. > > And there's no link to the README generated by anything except the raw > directory listing anyway. So it doesn't matter if you call your file > README, README.html, or WHATEVER.html, it will still show up in the > directory listing the same way. You could even have both README.txt > and README.html in the directory. > > - R. 
From kliteyn at mellanox.co.il Fri Sep 21 22:07:38 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 22 Sep 2007 07:07:38 +0200 Subject: [ofa-general] nightly osm_sim report 2007-09-22:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-21 OpenSM git rev = Thu_Sep_20_21:41:18_2007 [cb9d01f98c9a68098d4db47bf160295cb521b367] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=519 Fail=1 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo 12 Pkey IS3-128.topo Failures: 1 Pkey IS3-128.topo From vlad at lists.openfabrics.org Sat Sep 22 02:54:16 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 22 Sep 2007 02:54:16 -0700 (PDT) Subject:
[ofa-general] ofa_1_3_kernel 20070922-0200 daily build status Message-ID: <20070922095417.63A93E60882@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with 
linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 
Log: /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070922-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hal.rosenstock at gmail.com Sat Sep 22 06:18:35 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 22 Sep 2007 09:18:35 -0400 Subject: [ofa-general] libibmad question forward In-Reply-To: <99863D2ED484D449811D97A4C44C9CBD4D9A77@EPEXCH2.qlogic.org> References: <1190311314.7075.102.camel@hrosenstock-ws.xsigo.com> <99863D2ED484D449811D97A4C44C9CBD4D9A77@EPEXCH2.qlogic.org> Message-ID: On 9/21/07, Akshay Mathur wrote: > I was wondering if the patch to Device Mgmt class in mgmt_class_vers() > was applied. Sasha did this the other day and sent email on the list.
> Hal, Roland, Sean, > > The value return by mgmt_class_vers() is used for agent registration > with umad and is not used for creating mad requests by mad_encode(). > So, What is the implication of registering with mgmt_class_version of 2? It affects what class version for a given class is received by the MAD layer in the kernel. > Does the driver implements backwards compatibility? Driver isn't responsible for handling class specific class versions. > In my experiments, I > was able to register when mgmt_class_vers() returned a version value of > 2 and send / receive DM queries with class_version set to 1. You can send anything. I'm surprised that you were able to receive a DM version 1 when registering with version 2 though. -- Hal > Thanks > Akshay Mathur > QLogic Corporation > 780 Fifth Avenue, Suite 140 > King of Prussia, PA 19406 > Office: 610.233.4836 > Fax: 610.233.4777 > > > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal > Rosenstock > Sent: Thursday, September 20, 2007 2:02 PM > To: Sasha Khapyorsky > Cc: Jeff Becker; general > Subject: Re: [ofa-general] libibmad question forward > > On Thu, 2007-09-20 at 18:16 +0200, Sasha Khapyorsky wrote: > > On 16:27 Wed 19 Sep , Hal Rosenstock wrote: > > > On Wed, 2007-09-19 at 16:10 -0700, Jeff Becker wrote: > > > > I am trying to use libibmad library for initiating queries of > Device > > > > Management and other class types. While initializing, the > > > > madrpc_init() call fails when I have IB_DEVICE_MGMT_CLASS included > as > > > > a part of mgmt_classes parameter. This is because > mgmt_class_vers() > > > > (which is called by mad_register_port_client()/ > mad_register_client()) > > > > fails to return class version for Device Management Class. > > > > > > > > I am able to make DM queries if mgmt_class_vers() is fixed i.e. > just > > > > add a case to return the version for IB_DEVICE_MGMT_CLASS. e.g. 
> > > > > > > > static int > > > > mgmt_class_vers(int mgmt_class) > > > > > > > > { > > > > > > > > if ((mgmt_class >= IB_VENDOR_RANGE1_START_CLASS && > > > > mgmt_class <= IB_VENDOR_RANGE1_END_CLASS) || > > > > (mgmt_class >= IB_VENDOR_RANGE2_START_CLASS && > > > > mgmt_class <= IB_VENDOR_RANGE2_END_CLASS)) > > > > return 1; > > > > > > > > switch(mgmt_class) { > > > > case IB_SMI_CLASS: > > > > case IB_SMI_DIRECT_CLASS: > > > > return 1; > > > > case IB_SA_CLASS: > > > > return 2; > > > > case IB_PERFORMANCE_CLASS: > > > > return 1; > > > > // Change START > > > > case IB_DEVICE_MGMT_CLASS: > > > > return 1; > > Actually, there is an annex which makes this class version 2 which is > supposed to support backward compatibility for version 1. I'm not sure > whether both are in use (as to how important the backward compatibility > is with this). Maybe someone else can comment on this aspect. > > -- Hal > > > > > // Change END > > > > } > > > > > > > > return 0; > > > > > > > > I am wondering if this minor anomaly can be submitted as a bug to > > > > broaden the usage of libibmad its usage for DM queries. > > > > > > Yes, DM class (and perhaps some other missing GS classes) should be > > > added there. > > > > So, I'm going to apply this. > > > > Sasha > > > > From 46ad958b33c456672e2af711f36b494d398316bb Mon Sep 17 00:00:00 2001 > > From: Jeff Becker > > Date: Thu, 20 Sep 2007 17:48:55 +0200 > > Subject: [PATCH] libibmad: add support for IB_DEVICE_MGMT_CLASS > > > > From: Jeff Becker > > > > This adds IB_DEVICE_MGMT_CLASS to list of classes for which version is > > returned. 
> > > > Signed-off-by: Sasha Khapyorsky > > --- > > libibmad/src/register.c | 2 ++ > > 1 files changed, 2 insertions(+), 0 deletions(-) > > > > diff --git a/libibmad/src/register.c b/libibmad/src/register.c > > index 3d1285a..d80fa14 100644 > > --- a/libibmad/src/register.c > > +++ b/libibmad/src/register.c > > @@ -95,6 +95,8 @@ mgmt_class_vers(int mgmt_class) > > return 2; > > case IB_PERFORMANCE_CLASS: > > return 1; > > + case IB_DEVICE_MGMT_CLASS: > > + return 1; > > } > > > > return 0; From swise at opengridcomputing.com Sat Sep 22 13:56:06 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 22 Sep 2007 15:56:06 -0500 Subject: [ofa-general] Re: A question about rdma_get_cm_event In-Reply-To: <46F2AADC.7040201@ichips.intel.com> References: <46F25B6D.9000000@dev.mellanox.co.il> <46F2AADC.7040201@ichips.intel.com> Message-ID: <46F58166.1070204@opengridcomputing.com> Sean Hefty wrote: >> When one calls to rdma_get_cm_event, he gets a structure of the >> rdma_cm_event. >> >> In this structure there are 2 attributes which i want to discuss about: >> * private_data >> * private_data_len >> >> It seems that when one side send to the other private data, the >> private data is correct >> (i mean that the attribute private data points to a memory buffer with >> the expected data) >> but the private_data_len has a fixed size (depend on the ucma function >> which was called). >> >> 1) Is this is the expected behavior?
> > Yes - there's no way for the receiving side of an IB CM message to know > how many bytes of private data are valid in the REQ, REP, etc. > >> 2) can you please add entry to the man pages of this function to >> clarify this expected >> content of those attributes? > > I will update the man pages. Thanks. > Note that the private data length _is_ correct for iwarp. So the man pages should mention that this is an IB-only issue maybe? And maybe indicate that transport-independent applications should not rely on the length... Steve.
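[Editorial sketch, not part of the original thread.] The rdma_get_cm_event discussion says that on IB the reported private_data_len is a fixed size rather than the number of valid bytes, so transport-independent applications should not rely on it. One possible workaround is for the application to carry its own length inside the private data. The following is a minimal sketch under that assumption. The one-byte header layout and the names app_pd_pack and app_pd_unpack are hypothetical and are not part of librdmacm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical application-level framing: the first byte of the CM
 * private data holds the number of valid payload bytes that follow.
 * This is an assumption for illustration, not a librdmacm convention. */
#define APP_PD_HDR_LEN 1

/* Pack payload into buf as [len][payload...].
 * Returns the total number of bytes used, or -1 if buf is too small. */
static int app_pd_pack(uint8_t *buf, size_t cap,
                       const void *payload, uint8_t len)
{
    assert(buf != NULL && payload != NULL);
    if (cap < (size_t)len + APP_PD_HDR_LEN)
        return -1;
    buf[0] = len;
    memcpy(buf + APP_PD_HDR_LEN, payload, len);
    return (int)len + APP_PD_HDR_LEN;
}

/* reported_len is whatever the event reports; on IB it may be the fixed
 * maximum rather than the valid length, so it is only used as an upper
 * bound. Returns a pointer to the payload and stores its length in
 * *out_len, or returns NULL if the header is inconsistent. */
static const uint8_t *app_pd_unpack(const uint8_t *buf, size_t reported_len,
                                    uint8_t *out_len)
{
    assert(out_len != NULL);
    if (buf == NULL || reported_len < APP_PD_HDR_LEN)
        return NULL;
    if ((size_t)buf[0] + APP_PD_HDR_LEN > reported_len)
        return NULL;
    *out_len = buf[0];
    return buf + APP_PD_HDR_LEN;
}
```

In use, the sender would pass the packed buffer as its connection private data, and the receiver would call app_pd_unpack on the event's private_data with private_data_len as the bound. The same code would then behave the same way whether or not the transport reports an exact length.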
From kliteyn at mellanox.co.il Sat Sep 22 22:17:53 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 23 Sep 2007 07:17:53 +0200 Subject: [ofa-general] nightly osm_sim report 2007-09-23:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-22 OpenSM git rev = Thu_Sep_20_21:41:18_2007 [cb9d01f98c9a68098d4db47bf160295cb521b367] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From dotanb at dev.mellanox.co.il Sat Sep 22 23:05:04 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 23 Sep 2007 08:05:04 +0200 Subject: [ofa-general] IBV_WC_WR_FLUSH_ERR: first WQE only or all pending WQE's? In-Reply-To: References: Message-ID: <46F60210.5010102@dev.mellanox.co.il> Hi. Scott Guthridge wrote: > (1) Do I understand the spec correctly? Should WQE's posted subsequently > to the one that is going to fail be generating FLUSH errors? > Yes, you are. 
When the QP state is being changed to the Error (because of a real error or by the user) all of the outstanding WRs should be flushed with error for this QP. > (2) Has anyone seen this behavior before? Is it common? [I haven't tried > switching hardware -- card I'm using *may* not be production level.] If it > *is* common behavior, I may need to recode my app. to mark all outstanding > requests as failed upon receiving the first error, and then ignore any > subsequent errors, to be defensive about it -- this seems kludgy, though, > and I'd rather not do that if I don't have to. > You didn't specify which HW you are using, but maybe you should ask for a FW/SW upgrade for the HCA that you are using, this may solve your problem. Dotan From ogerlitz at voltaire.com Sun Sep 23 00:36:59 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 23 Sep 2007 09:36:59 +0200 Subject: [ofa-general] [RFC] [PATCH 1/5 v2] ib/ipoib: specify Traffic Class with PR queries for QoS support In-Reply-To: References: <000501c7ef3b$5e600b10$3c98070a@amr.corp.intel.com><000601c7ef3b$b0dfe2c0$3c98070a@amr.corp.intel.com> <46F21F98.6090503@voltaire.com> Message-ID: <46F6179B.4010505@voltaire.com> Roland Dreier wrote: > > > You have sent a "thanks applied" email for the the ipoib qos patch > > twice that is on the below two posts, where you should have applied > > only v3 (the rest of the series is v2, only for ipoib there was v3). > > Sorry... I actually applied the patch from Sean's git tree, so I hope > I got the latest. yes, its the latest. Or. 
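The defensive handling Scott describes in the IBV_WC_WR_FLUSH_ERR thread above (fail everything outstanding on the first error, then ignore later errors) can be sketched as follows. This is a standalone illustration under stated assumptions: the demo_* types and status values are hypothetical stand-ins, not the libibverbs definitions, and a real application would drive this from its completion polling loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for completion statuses (not the verbs enum). */
enum demo_wc_status { DEMO_WC_SUCCESS, DEMO_WC_ERROR, DEMO_WC_FLUSH_ERR };

struct demo_qp_tracker {
    unsigned outstanding; /* posted WRs with no completion seen yet */
    unsigned completed;   /* WRs that completed successfully */
    unsigned failed;      /* WRs reported to the application as failed */
    bool in_error;        /* set once the first error completion is seen */
};

/* Account for one newly posted work request. */
static void demo_post(struct demo_qp_tracker *t)
{
    t->outstanding++;
}

/* Account for one completion. */
static void demo_handle_completion(struct demo_qp_tracker *t,
                                   enum demo_wc_status st)
{
    if (t->in_error) {
        /* Everything was already failed on the first error; any flush
         * completions that still arrive carry no new information. */
        return;
    }
    if (st == DEMO_WC_SUCCESS) {
        t->outstanding--;
        t->completed++;
        return;
    }
    /* First error: fail every request that is still outstanding. */
    t->in_error = true;
    t->failed += t->outstanding;
    t->outstanding = 0;
}
```

With this bookkeeping the application ends up with the same counts whether or not the HCA delivers a flush completion for each remaining WQE.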
From monisonlists at gmail.com Sun Sep 23 00:55:34 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Sun, 23 Sep 2007 09:55:34 +0200 Subject: [ofa-general] Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: References: <46F27692.3070404@voltaire.com> <46F2784C.9070806@voltaire.com> Message-ID: <46F61BF6.3000203@gmail.com> Roland Dreier wrote: > > + ipoib_slave_detach(cpriv->dev); > > unregister_netdev(cpriv->dev); > > Maybe you already answered this before, but I'm still not clear why > this notifier call can't just be added to the start of > unregister_netdevice(), so we can avoid having driver needing to know > anything about bonding internals? > > - R. The action in bonding to a detach of slave is to unregister the master (see patch 10). This can't be done from the context of unregister_netdevice itself (it is protected by rtnl_lock). That's why I had to notify the detach before unregister begins. From ogerlitz at voltaire.com Sun Sep 23 01:40:30 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 23 Sep 2007 10:40:30 +0200 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <46F2C064.9030404@ichips.intel.com> References: <46F2C064.9030404@ichips.intel.com> Message-ID: <46F6267E.7090407@voltaire.com> Sean Hefty wrote: > I've read back over this description a few times, and I still don't > fully grok the problem. Can you clarify if the following sequence is > what's happening? > 1. The node has joined the multicast group. Meaning that the SA has > routed multicast traffic to the node. > 2. You take down the link of the switch port that connects the node. Is > this done via a program? > 3. The port is brought back online. This generates a PORT_ACTIVE event, > but the previous event was also PORT_ACTIVE. > 4. ipoib leaves the group. > 5. ipoib re-joins the group. > 6. 
The multicast module isn't aware that any errors have occurred on the > multicast group, so simply completes the join request at step 5 without > SA involvement. > If I'm understanding this, somewhere in the above sequence the multicast > routing to this node is lost. Either the SA removed the node from the > group, or the switch lost its routing tables, or ...? Indeed am taking the switch link down via a program. Now, is this case there was --no-- previous event, when the port was brought back online there was PORT_ACTIVE event (its a driver issue which we look at). However, from the view point of the SA there was "GID out" event, so the HCA port was dropped out from the multicast group and the multicast routing (spanning tree, MFTs configuration etc) was computed without this port being included. This is the ipoib logging of what happens from its perspective (I have added the event number to the "port state change event" print): > ib0: Port state change event 9 > ib0: Flushing ib0 > ib0: flushing > ib0: downing ib_dev > ib0: stopping multicast thread > ib0: flushing multicast list > ib0: leaving MGID ff12:401b:ffff:0000:0000:0000:0000:0001 > ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:0000:0001 > ib0: leaving MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: starting multicast thread > ib0: restarting multicast task > ib0: stopping multicast thread > ib0: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0000:0001 > ib0: starting multicast thread > ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0) > ib0: Created ah c504b7a0 > ib0: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV c504b7a0, LID 0xc000, SL 0 > ib0: joining MGID ff12:401b:ffff:0000:0000:0000:0000:0001 > ib0: join completion for ff12:401b:ffff:0000:0000:0000:0000:0001 (status 0) > ib0: Created ah c504ba20 > ib0: MGID 
ff12:401b:ffff:0000:0000:0000:0000:0001 AV c504ba20, LID 0xc001, SL 0 > ib0: successfully joined all multicast groups > I'm also trying to understand how the problem would apply to a different > setup: > > node 1 <-> switch A <-> switch B <-> switch C <-> SA > > Suppose the same link down/up occurred between switch A and switch B. > What happens to the multicast members to the left of switch B? Will > node 1 see a PORT_ACTIVE event in this case as well? The members of a multicast group are only HCA ports. Indeed, join/leave requests of members cause the SA to trigger the SM to recompute the multicast routing. However, there are more causes, such as a port going down anywhere in the fabric: if it's an HCA port, it would be dropped from all the groups it is a member of, and if it's a switch port, all the affected unicast AND multicast routing must be recomputed by the SM. The host would only see port up/down events as a result of changes in the link state of the local port or of the port which is connected to it through the cable. Or. From mst at dev.mellanox.co.il Sun Sep 23 01:50:52 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 23 Sep 2007 10:50:52 +0200 Subject: [ofa-general] Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. In-Reply-To: <46F3E3D2.70601@opengridcomputing.com> References: <20070912100025.3190.89259.stgit@dell3.ogc.int> <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com> <46F3E3D2.70601@opengridcomputing.com> Message-ID: <20070923085052.GC24557@mellanox.co.il> Yes, please push this into your git tree (and please verify that cross-build to all OS-es passes). Further, please do it this way: add the patch in ofed-1.2.5 and then merge 1.2.5 into 1.3. Quoting Steve Wise : Subject: Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. Michael, can you pull this patch into ofed-1.2.5 and ofed-1.3? Or would you want me to push it into my git tree for you to pull from? Thanks, Steve.
Roland Dreier wrote: > > Roland - can you please queue this up for 2.6.24? > > Done, thanks. -- MST From vlad at lists.openfabrics.org Sun Sep 23 03:07:29 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 23 Sep 2007 03:07:29 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070923-0200 daily build status Message-ID: <20070923100729.54ADFE60854@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.20 Passed on ia64 with
linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: 
*** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type 
argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070923-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From jackm at 
dev.mellanox.co.il Sun Sep 23 05:57:48 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 23 Sep 2007 14:57:48 +0200 Subject: [ofa-general] XRC patch set Message-ID: <200709231457.48606.jackm@dev.mellanox.co.il> Roland, Have you had a chance to review the XRC patch set I posted on September 18 (given below)? Any feedback would be appreciated. Please note that this implementation of XRC is for userspace QPs/SRQs only. Today, I'm working on the Kernel space XRC implementation. - Jack # [ofa-general] [PATCH 0 of 5] XRC implementation patches (libibverbs, libmlx4, core, mlx4) Jack Morgenstein # [ofa-general] [PATCH 1 of 5] libibverbs: XRC implementation Jack Morgenstein # [ofa-general] [PATCH 2 of 5] libmlx4: XRC implementation Jack Morgenstein # [ofa-general] [PATCH 3 of 5] core: XRC implementation for fd = -1 when opening an xrc domain Jack Morgenstein # [ofa-general] [PATCH 4 of 5] core: XRC implementation -- add support for working with file descriptors Jack Morgenstein # [ofa-general] [PATCH 5 of 5] mlx4: XRC implementation Jack Morgenstein ------------------------------------------------------- From mst at dev.mellanox.co.il Sun Sep 23 06:36:08 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 23 Sep 2007 15:36:08 +0200 Subject: [ofa-general] Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch In-Reply-To: <20070911032054.GA21811@mellanox.co.il> References: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> <20070911032054.GA21811@mellanox.co.il> Message-ID: <20070923133608.GA11619@mellanox.co.il> > Quoting Michael S. 
Tsirkin : > Subject: Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch > > > Quoting Sean Hefty : > > Subject: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch > > > > Roland, please pull from: > > > > git://git.openfabrics.org/~shefty/rdma-dev.git for-roland > > > > This will pick up QoS and CM scalability changes that I would like to get > > into 2.6.24 (and OFED 1.3). > > Sean, where can I pull changes for ofed 1.3 from? > The changes should go into kernel_patches/fixes for OFED. Any update? I see ~shefty/ofed_1_2.git but no 1.3 code. Please note that I can not pull for-roland branch into OFED 1.3. -- MST From tziporet at dev.mellanox.co.il Sun Sep 23 07:04:27 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 23 Sep 2007 16:04:27 +0200 Subject: [ofa-general] Re: [PATCH v7] IB/mlx4: shrinking WQE In-Reply-To: References: <20070919153143.GF31061@mellanox.co.il> Message-ID: <46F6726B.20404@mellanox.co.il> Roland Dreier wrote: > Given this added complexity: > > 6 files changed, 226 insertions(+), 39 deletions(-) > > and the unpleasantness of having if (BITS_PER_LONG == 64) various > places, can you quantify the improvement this gives? > > Would it make more sense to do this for userspace first? > > It's actually more important in the kernel, since in user space MPI coalescing solves most of the BW problems of small messages. In kernel modules this change improves the BW of small messages too. And although it touches several files, all of them are in the low-level driver, so it has no impact on the stability of the core or ULPs.
Tziporet From rdreier at cisco.com Sun Sep 23 09:34:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 23 Sep 2007 09:34:46 -0700 Subject: [ofa-general] Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: <46F61BF6.3000203@gmail.com> (Moni Shoua's message of "Sun, 23 Sep 2007 09:55:34 +0200") References: <46F27692.3070404@voltaire.com> <46F2784C.9070806@voltaire.com> <46F61BF6.3000203@gmail.com> Message-ID: > The action in bonding to a detach of slave is to unregister the master (see patch 10). > This can't be done from the context of unregister_netdevice itself (it is protected by rtnl_lock). I'm confused. Your patch has: > + ipoib_slave_detach(cpriv->dev); > unregister_netdev(cpriv->dev); And ipoib_slave_detach() is: > +static inline void ipoib_slave_detach(struct net_device *dev) > +{ > + rtnl_lock(); > + netdev_slave_detach(dev); > + rtnl_unlock(); > +} so you are calling netdev_slave_detach() with the rtnl lock held. Why can't you make the same call from the start of unregister_netdevice()? Anyway, if the rtnl lock is a problem, can you just add the call to netdev_slave_detach() to unregister_netdev() before it takes the rtnl lock? - R. From rdreier at cisco.com Sun Sep 23 09:35:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 23 Sep 2007 09:35:19 -0700 Subject: [ofa-general] XRC patch set In-Reply-To: <200709231457.48606.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 23 Sep 2007 14:57:48 +0200") References: <200709231457.48606.jackm@dev.mellanox.co.il> Message-ID: > Have you had a chance to review the XRC patch set I posted on September 18 (given below)? Not really, I am still backed up on other things that we want to get into 2.6.24. - R. 
From swise at opengridcomputing.com Sun Sep 23 10:21:27 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 23 Sep 2007 12:21:27 -0500 Subject: [ofa-general] Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes In-Reply-To: <20070916091024.GF30150@mellanox.co.il> References: <46E94B36.70406@opengridcomputing.com> <20070916091024.GF30150@mellanox.co.il> Message-ID: <46F6A097.2040402@opengridcomputing.com> Michael, I don't see these in the ofed_1_2/linux-2.6.git repos? Ditto for the 1.3 repos... Michael S. Tsirkin wrote: > Done. I'll push soon. > > Quoting Steve Wise : > Subject: [GIT PULL ofed_1_2_c] cxgb3 bug fixes > > Vlad (Michael/Tziporet in Vlad's absence), > > Please integrate the following cxgb3 bug fixes into ofed-1.2.5. All of > these patches are either in 2.6.23 or merged into Jeff Garzik's upstream > branch of netdev-2.6 and will go into 2.6.24. Chelsio recommends we > update ofed-1.2.5 and ofed-1.3 will all of these fixes. > > I'll send another email with the ofed-1.3 changes as they will be > slightly different. 
> > Please pull the ofed_1_2_c changes from: > > git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c > > The patch files added to kernel_patches/fixes include: > >> swise at dell3:~/git/ofed-1.2.5> stg series >> + 0029-cxgb3-engine-microcode-load >> + 0030-cxgb3-MAC-workaround-update >> + 0031-cxgb3-Update-rx-coalescing-length >> + 0032-cxgb3-SGE-doorbell-overflow-warning >> + 0033-cxgb3-use-immediate-data-for-offload-Tx >> + 0034-cxgb3-Expose-HW-memory-page-info >> + 0035-cxgb3-tighten-checks-on-TID-values >> + 0036-cxgb3-Fatal-error-update >> + 0037-cxgb3-log-adapter-serial-number >> + 0038-cxgb3-Update-internal-memory-management >> + 0039-cxgb3-update-firmware-version >> + 0040-cxgb3-log-and-clear-PEX-errors >> + 0041-cxgb3-remove-false-positive-in-xgmac-workaround >> + 0042-cxgb3-Set-the-CQ_ERR-bit-in-CQ-contexts >> + 0043-cxgb3-CQ-context-operations-time-out-too-soon >> + 0044-cxgb3-Add-T3C-rev >> + 0045-cxgb3-Update-engine-microcode-version >>> 0046-cxgb3-driver-version > > Steve. > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > From hadi at cyberus.ca Sun Sep 23 10:53:07 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 23 Sep 2007 13:53:07 -0400 Subject: [ofa-general] [PATCHES] TX batching In-Reply-To: <1189988958.4230.55.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> Message-ID: <1190569987.4256.52.camel@localhost> I had plenty of time this weekend so i have been doing a _lot_ of testing. My next emails will send a set of patches: Patch 1: Introduces explicit tx locking Patch 2: Introduces batching interface Patch 3: Core uses batching interface Patch 4: get rid of dev->gso_skb Testing ------- Each of these patches has been performance tested and the results are in the logs on a per-patch basis. 
My system under test hardware is a 2xdual core opteron with a couple of tg3s. My test tool generates udp traffic of different sizes for up to 60 seconds per run or a total of 30M packets. I have 4 threads each running on a specific CPU which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's udp discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming udp packets to discard port - this allows me to make sure that the receiver is not the bottleneck in the testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each packet size run is repeated 10 times to ensure that there are no transients. The average of all 10 runs is then computed and collected. I have not run testing on patch #4 because i had to let the machine go, but will have some access to it tomorrow early morning where i can run some tests. Comments -------- I am trying to kill ->hard_batch_xmit() but it would be tricky to do without it for LLTX drivers. Anything i try will require a few extra checks. OTOH, I could kill LLTX for the drivers i am using that are LLTX and then drop that interface or I could say "no support for LLTX". I am in a dilemma. Dave please let me know if this meets your desires to allow devices which are SG and able to compute CSUM benefit just in case i misunderstood. Herbert, if you can look at at least patch 4 i will appreciate it. More patches to follow - i didn't want to overload people by dumping too many patches. Most of these patches below are ready to go; some need some testing and others need a little porting from an earlier kernel: - tg3 driver (tested and works well, but don't want to send - tun driver - pktgen - netiron driver - e1000 driver - ethtool interface - There is at least one other driver promised to me I am also going to update the two documents i posted earlier. Hopefully i can do that today.
cheers, jamal From hadi at cyberus.ca Sun Sep 23 10:56:45 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 23 Sep 2007 13:56:45 -0400 Subject: [ofa-general] [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <1190569987.4256.52.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> Message-ID: <1190570205.4256.56.camel@localhost> I have submitted this before; but here it is again. Against net-2.6.24 from yesterday for this and all following patches. cheers, jamal -------------- next part -------------- [NET_SCHED] explict hold dev tx lock For N cpus, with full throttle traffic on all N CPUs, funneling traffic to the same ethernet device, the devices queue lock is contended by all N CPUs constantly. The TX lock is only contended by a max of 2 CPUS. In the current mode of qdisc operation, when all N CPUs contend for the dequeue region and one of them (after all the work) entering dequeue region, we may endup aborting the path if we are unable to get the tx lock and go back to contend for the queue lock. As N goes up, this gets worse. The changes in this patch result in a small increase in performance with a 4CPU (2xdual-core) with no irq binding. My tests are UDP based and keep all 4CPUs busy all the time for the period of the test. Both e1000 and tg3 showed similar behavior. I expect higher gains with more CPUs. Summary below with different UDP packets and the resulting pps seen. Note at around 200Bytes, the two dont seem that much different and we are approaching wire speed (with plenty of CPU available; eg at 512B, the app is sitting at 80% idle on both cases). 
+----------+--------+--------+--------+--------+--------+
| pktsize  | 64B    | 128B   | 256B   | 512B   | 1024B  |
+----------+--------+--------+--------+--------+--------+
| Original | 467482 | 463061 | 388267 | 216308 | 114704 |
| txlock   | 468922 | 464060 | 388298 | 216316 | 114709 |
+----------+--------+--------+--------+--------+--------+
Signed-off-by: Jamal Hadi Salim --- commit b0e36991c5850dfe930f80ee508b08fdcabc18d1 tree b1787bba26f80a325298f89d1ec882cc5ab524ae parent 42765047105fdd496976bc1784d22eec1cd9b9aa author Jamal Hadi Salim Sun, 23 Sep 2007 09:09:17 -0400 committer Jamal Hadi Salim Sun, 23 Sep 2007 09:09:17 -0400 net/sched/sch_generic.c | 19 ++----------------- 1 files changed, 2 insertions(+), 17 deletions(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index e970e8e..95ae119 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -134,34 +134,19 @@ static inline int qdisc_restart(struct net_device *dev) { struct Qdisc *q = dev->qdisc; struct sk_buff *skb; - unsigned lockless; int ret; /* Dequeue packet */ if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) return 0; - /* - * When the driver has LLTX set, it does its own locking in - * start_xmit. These checks are worth it because even uncongested - * locks can be quite expensive. The driver can do a trylock, as - * is being done here; in case of lock contention it should return - * NETDEV_TX_LOCKED and the packet will be requeued.
- */ - lockless = (dev->features & NETIF_F_LLTX); - - if (!lockless && !netif_tx_trylock(dev)) { - /* Another CPU grabbed the driver tx lock */ - return handle_dev_cpu_collision(skb, dev, q); - } /* And release queue */ spin_unlock(&dev->queue_lock); + HARD_TX_LOCK(dev, smp_processor_id()); ret = dev_hard_start_xmit(skb, dev); - - if (!lockless) - netif_tx_unlock(dev); + HARD_TX_UNLOCK(dev); spin_lock(&dev->queue_lock); q = dev->qdisc; From hadi at cyberus.ca Sun Sep 23 10:58:37 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 23 Sep 2007 13:58:37 -0400 Subject: [ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface In-Reply-To: <1190570205.4256.56.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> Message-ID: <1190570317.4256.59.camel@localhost> This patch introduces the netdevice interface for batching. cheers, jamal -------------- next part -------------- [NET_BATCH] Introduce batching interface This patch introduces the netdevice interface for batching. A typical driver dev->hard_start_xmit() has 4 parts: a) packet formating (example vlan, mss, descriptor counting etc) b) chip specific formatting c) enqueueing the packet on a DMA ring d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interupts etc [For code cleanliness/readability sake, regardless of this work, one should break the dev->hard_start_xmit() into those 4 functions anyways]. 
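The four parts (a) to (d) listed above can be sketched as separate functions. This is a user-space illustration under stated assumptions, not kernel code: every demo_* name, the ring layout and the packet struct are hypothetical. It also shows one plausible reason for the split: once (d) is its own function, it can be issued once for several queued packets instead of once per packet.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_RING_SIZE 8

/* Hypothetical packet and ring types, for illustration only. */
struct demo_pkt { int len; int formatted; int chip_ready; };
struct demo_ring {
    struct demo_pkt *slots[DEMO_RING_SIZE];
    unsigned head;  /* next free slot */
    unsigned kicks; /* number of simulated IO "doorbell" operations */
};

/* (a) generic packet formatting */
static int demo_prep(struct demo_pkt *p)
{
    if (p->len <= 0)
        return -1;
    p->formatted = 1;
    return 0;
}

/* (b) chip-specific formatting */
static void demo_chip_format(struct demo_pkt *p)
{
    p->chip_ready = 1;
}

/* (c) enqueue the packet on the (simulated) DMA ring */
static int demo_enqueue(struct demo_ring *r, struct demo_pkt *p)
{
    if (r->head == DEMO_RING_SIZE)
        return -1;
    r->slots[r->head++] = p;
    return 0;
}

/* (d) IO operation that tells the hardware to start transmitting */
static void demo_kick(struct demo_ring *r)
{
    r->kicks++;
}

/* Unbatched transmit: all four steps for every packet. */
static int demo_xmit_one(struct demo_ring *r, struct demo_pkt *p)
{
    if (demo_prep(p))
        return -1;
    demo_chip_format(p);
    if (demo_enqueue(r, p))
        return -1;
    demo_kick(r);
    return 0;
}

/* Batched transmit: steps (a)-(c) per packet, step (d) once at the end.
 * Returns the number of packets queued. */
static int demo_xmit_batch(struct demo_ring *r, struct demo_pkt **pkts, int n)
{
    int queued = 0;

    for (int i = 0; i < n; i++) {
        if (demo_prep(pkts[i]))
            break;
        demo_chip_format(pkts[i]);
        if (demo_enqueue(r, pkts[i]))
            break;
        queued++;
    }
    if (queued > 0)
        demo_kick(r);
    return queued;
}
```

Sending three packets through demo_xmit_batch() costs one simulated IO operation, where three calls to demo_xmit_one() would cost three.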
With the api introduced in this patch, a driver which has all 4 parts and needing to support batching is advised to split its dev->hard_start_xmit() in the following manner: 1)use its dev->hard_prep_xmit() method to achieve #a 2)use its dev->hard_end_xmit() method to achieve #d 3)#b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this) Note: There are drivers which may need not support any of the two methods (example the tun driver i patched) so the two methods are optional. The core will first do the packet formatting by invoking your supplied dev->hard_prep_xmit() method. It will then pass you the packet via your dev->hard_start_xmit() method and lastly will invoke your dev->hard_end_xmit() when it completes passing you all the packets queued for you. dev->hard_prep_xmit() is invoked without holding any tx lock but the rest are under TX_LOCK(). LLTX present a challenge in that we have to introduce a deviation from the norm and introduce the ->hard_batch_xmit() method. An LLTX driver presents us with ->hard_batch_xmit() to which we pass it a list of packets in a dev->blist skb queue. It is then the responsibility of the ->hard_batch_xmit() to exercise steps #b and #c for all packets and #d when the batching is complete. Step #a is already done for you by the time you get the packets in dev->blist. And last xmit_win variable is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we dont requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us when it invokes netif_wake_queue how much space it has for descriptors by setting this variable. Some decisions i had to make: - every driver will have a xmit_win variable and the core will set it to 1 which means the behavior of non-batching drivers stays the same. 
- the batch list, blist, is no longer a pointer; wastes a little extra memmory i plan to recoup by killing gso_skb in later patches. Theres a lot of history and reasoning of why batching in a document i am writting which i may submit as a patch. Thomas Graf (who doesnt know this probably) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented in SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and David Miller, have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and netiron folks have been outstanding in their help. Signed-off-by: Jamal Hadi Salim --- commit ab4b07ef2e4069c115c9c1707d86ae2344a5ded5 tree 994b42b03bbfcc09ac8b7670c53c12e0b2a71dc7 parent b0e36991c5850dfe930f80ee508b08fdcabc18d1 author Jamal Hadi Salim Sun, 23 Sep 2007 10:30:32 -0400 committer Jamal Hadi Salim Sun, 23 Sep 2007 10:30:32 -0400 include/linux/netdevice.h | 17 +++++++ net/core/dev.c | 106 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 123 insertions(+), 0 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index cf89ce6..443cded 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -453,6 +453,7 @@ struct net_device #define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */ #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ #define NETIF_F_LRO 32768 /* large receive offload */ +#define NETIF_F_BTX 65536 /* Capable of batch tx */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -578,6 +579,15 @@ struct net_device void *priv; /* pointer to private 
data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + /* hard_batch_xmit is needed for LLTX, kill it when those + * disappear or better kill it now and dont support LLTX + */ + int (*hard_batch_xmit) (struct net_device *dev); + int (*hard_prep_xmit) (struct sk_buff *skb, + struct net_device *dev); + void (*hard_end_xmit) (struct net_device *dev); + int xmit_win; + /* These may be needed for future network-power-down code. */ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -592,6 +602,7 @@ struct net_device /* delayed register/unregister */ struct list_head todo_list; + struct sk_buff_head blist; /* device index hash chain */ struct hlist_node index_hlist; @@ -1022,6 +1033,12 @@ extern int dev_set_mac_address(struct net_device *, struct sockaddr *); extern int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev); +extern int dev_batch_xmit(struct net_device *dev); +extern int prepare_gso_skb(struct sk_buff *skb, + struct net_device *dev, + struct sk_buff_head *skbs); +extern int xmit_prepare_skb(struct sk_buff *skb, + struct net_device *dev); extern int netdev_budget; diff --git a/net/core/dev.c b/net/core/dev.c index 91c31e6..25d01fd 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1531,6 +1531,110 @@ static int dev_gso_segment(struct sk_buff *skb) return 0; } +int prepare_gso_skb(struct sk_buff *skb, struct net_device *dev, + struct sk_buff_head *skbs) +{ + int tdq = 0; + do { + struct sk_buff *nskb = skb->next; + + skb->next = nskb->next; + nskb->next = NULL; + + if (dev->hard_prep_xmit) { + /* note: skb->cb is set in hard_prep_xmit(), + * it should not be trampled somewhere + * between here and the driver picking it + * The VLAN code wrongly assumes it owns it + * so the driver needs to be careful; for + * good handling look at tg3 driver .. + */ + int ret = dev->hard_prep_xmit(nskb, dev); + if (ret != NETDEV_TX_OK) + continue; + } + /* Driver likes this packet .. 
*/ + tdq++; + __skb_queue_tail(skbs, nskb); + } while (skb->next); + skb->destructor = DEV_GSO_CB(skb)->destructor; + kfree_skb(skb); + + return tdq; +} + +int xmit_prepare_skb(struct sk_buff *skb, struct net_device *dev) +{ + struct sk_buff_head *skbs = &dev->blist; + + if (netif_needs_gso(dev, skb)) { + if (unlikely(dev_gso_segment(skb))) { + kfree_skb(skb); + return 0; + } + if (skb->next) + return prepare_gso_skb(skb, dev, skbs); + } + + if (dev->hard_prep_xmit) { + int ret = dev->hard_prep_xmit(skb, dev); + if (ret != NETDEV_TX_OK) + return 0; + } + __skb_queue_tail(skbs, skb); + return 1; +} + +int dev_batch_xmit(struct net_device *dev) +{ + struct sk_buff_head *skbs = &dev->blist; + int rc = NETDEV_TX_OK; + struct sk_buff *skb; + int orig_w = dev->xmit_win; + int orig_pkts = skb_queue_len(skbs); + + if (dev->hard_batch_xmit) { /* only for LLTX devices */ + rc = dev->hard_batch_xmit(dev); + } else { + while ((skb = __skb_dequeue(skbs)) != NULL) { + if (!list_empty(&ptype_all)) + dev_queue_xmit_nit(skb, dev); + rc = dev->hard_start_xmit(skb, dev); + if (unlikely(rc)) + break; + /* + * XXX: multiqueue may need closer srutiny.. + */ + if (unlikely(netif_queue_stopped(dev) || + netif_subqueue_stopped(dev, skb->queue_mapping))) { + rc = NETDEV_TX_BUSY; + break; + } + } + } + + /* driver is likely buggy and lied to us on how much + * space it had. Damn you driver .. 
+ */ + if (unlikely(skb_queue_len(skbs))) { + printk(KERN_WARNING "Likely bug %s %s (%d) " + "left %d/%d window now %d, orig %d\n", + dev->name, rc?"busy":"locked", + netif_queue_stopped(dev), + skb_queue_len(skbs), + orig_pkts, + dev->xmit_win, + orig_w); + rc = NETDEV_TX_BUSY; + } + + if (orig_pkts > skb_queue_len(skbs)) + if (dev->hard_end_xmit) + dev->hard_end_xmit(dev); + + return rc; +} + int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev) { if (likely(!skb->next)) { @@ -3565,6 +3669,8 @@ int register_netdevice(struct net_device *dev) } } + dev->xmit_win = 1; + skb_queue_head_init(&dev->blist); /* * nil rebuild_header routine, * that should be never called and used as just bug trap. From hadi at cyberus.ca Sun Sep 23 11:00:09 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 23 Sep 2007 14:00:09 -0400 Subject: [ofa-general] [PATCH 3/4][NET_BATCH] net core use batching In-Reply-To: <1190570317.4256.59.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> Message-ID: <1190570409.4256.62.camel@localhost> This patch adds the usage of batching within the core. cheers, jamal -------------- next part -------------- [NET_BATCH] net core use batching This patch adds the usage of batching within the core. 
The same test methodology used in introducing txlock is used, with the following results on different kernels:

         +--------+--------+--------+--------+--------+
         |  64B   |  128B  |  256B  |  512B  | 1024B  |
         +--------+--------+--------+--------+--------+
Original | 467482 | 463061 | 388267 | 216308 | 114704 |
txlock   | 468922 | 464060 | 388298 | 216316 | 114709 |
tg3nobtx | 468012 | 464079 | 388293 | 216314 | 114704 |
tg3btxdr | 480794 | 475102 | 388298 | 216316 | 114705 |
tg3btxco | 481059 | 475423 | 388285 | 216308 | 114706 |
         +--------+--------+--------+--------+--------+

The first two rows, "Original" and "txlock", were introduced in an earlier patch and demonstrate a slight increase in performance with txlock. "tg3nobtx" shows the tg3 driver with no changes to support batching. The purpose of this test is to demonstrate the effect of introducing the core changes to a driver that doesn't support them. Although this patch brings down performance slightly compared to txlock for such netdevices, it is still better compared to just the original kernel. "tg3btxdr" demonstrates the effect of using ->hard_batch_xmit() with the tg3 driver. "tg3btxco" demonstrates the effect of letting the core do all the work. As can be seen, the last two are not very different in performance. The difference is that ->hard_batch_xmit() introduces a new method, which is intrusive.

I have #if-0ed some of the old functions so the patch is more readable.
Signed-off-by: Jamal Hadi Salim --- commit e26705f6ef7db034df7af3f4fccd7cd40b8e46e0 tree b99c469497a0145ca5c0651dc4229ce17da5b31c parent 6b8e2f76f86c35a6b2cee3698c633d20495ae0c0 author Jamal Hadi Salim Sun, 23 Sep 2007 11:35:25 -0400 committer Jamal Hadi Salim Sun, 23 Sep 2007 11:35:25 -0400 net/sched/sch_generic.c | 127 +++++++++++++++++++++++++++++++++++++++++++---- 1 files changed, 115 insertions(+), 12 deletions(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 95ae119..86a3f9d 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q) return q->q.qlen; } +#if 0 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { @@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, return ret; } +#endif + +static inline int handle_dev_cpu_collision(struct net_device *dev) +{ + if (unlikely(dev->xmit_lock_owner == smp_processor_id())) { + if (net_ratelimit()) + printk(KERN_WARNING + "Dead loop on netdevice %s, fix it urgently!\n", + dev->name); + return 1; + } + __get_cpu_var(netdev_rx_stat).cpu_collision++; + return 0; +} + +static inline int +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + + struct sk_buff *skb; + + while ((skb = __skb_dequeue(skbs)) != NULL) + q->ops->requeue(skb, q); + + netif_schedule(dev); + return 0; +} + +static inline int +xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + int ret = handle_dev_cpu_collision(dev); + + if (ret) { + if (!skb_queue_empty(skbs)) + skb_queue_purge(skbs); + return qdisc_qlen(q); + } + + return dev_requeue_skbs(skbs, dev, q); +} + +static int xmit_count_skbs(struct sk_buff *skb) +{ + int count = 0; + for (; skb; skb = skb->next) { + count += skb_shinfo(skb)->nr_frags; + count += 1; + } + return count; +} + +static int xmit_get_pkts(struct net_device *dev, + struct Qdisc *q, + 
struct sk_buff_head *pktlist) +{ + struct sk_buff *skb; + int count = dev->xmit_win; + + if (count && dev->gso_skb) { + skb = dev->gso_skb; + dev->gso_skb = NULL; + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + while (count > 0) { + skb = q->dequeue(q); + if (!skb) + break; + + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + return skb_queue_len(pktlist); +} + +static int xmit_prepare_pkts(struct net_device *dev, + struct sk_buff_head *tlist) +{ + struct sk_buff *skb; + struct sk_buff_head *flist = &dev->blist; + + while ((skb = __skb_dequeue(tlist)) != NULL) + xmit_prepare_skb(skb, dev); + + return skb_queue_len(flist); +} /* * NOTE: Called under dev->queue_lock with locally disabled BH. @@ -130,22 +222,27 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, * >0 - queue is not empty. * */ -static inline int qdisc_restart(struct net_device *dev) + +static inline int qdisc_restart(struct net_device *dev, + struct sk_buff_head *tpktlist) { struct Qdisc *q = dev->qdisc; - struct sk_buff *skb; - int ret; + int ret = 0; - /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) - return 0; + ret = xmit_get_pkts(dev, q, tpktlist); + if (!ret) + return 0; - /* And release queue */ + /* We got em packets */ spin_unlock(&dev->queue_lock); + /* prepare to embark */ + xmit_prepare_pkts(dev, tpktlist); + + /* bye packets ....*/ HARD_TX_LOCK(dev, smp_processor_id()); - ret = dev_hard_start_xmit(skb, dev); + ret = dev_batch_xmit(dev); HARD_TX_UNLOCK(dev); spin_lock(&dev->queue_lock); @@ -158,8 +255,8 @@ static inline int qdisc_restart(struct net_device *dev) break; case NETDEV_TX_LOCKED: - /* Driver try lock failed */ - ret = handle_dev_cpu_collision(skb, dev, q); + /* Driver lock failed */ + ret = xmit_islocked(&dev->blist, dev, q); break; default: @@ -168,7 +265,7 @@ static inline int qdisc_restart(struct net_device *dev) printk(KERN_WARNING "BUG %s code %d qlen %d\n", dev->name, ret, 
q->q.qlen); - ret = dev_requeue_skb(skb, dev, q); + ret = dev_requeue_skbs(&dev->blist, dev, q); break; } @@ -177,8 +274,11 @@ static inline int qdisc_restart(struct net_device *dev) void __qdisc_run(struct net_device *dev) { + struct sk_buff_head tpktlist; + skb_queue_head_init(&tpktlist); + do { - if (!qdisc_restart(dev)) + if (!qdisc_restart(dev, &tpktlist)) break; } while (!netif_queue_stopped(dev)); @@ -564,6 +664,9 @@ void dev_deactivate(struct net_device *dev) skb = dev->gso_skb; dev->gso_skb = NULL; + if (!skb_queue_empty(&dev->blist)) + skb_queue_purge(&dev->blist); + dev->xmit_win = 1; spin_unlock_bh(&dev->queue_lock); kfree_skb(skb); From hadi at cyberus.ca Sun Sep 23 11:02:01 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 23 Sep 2007 14:02:01 -0400 Subject: [ofa-general] [PATCH 4/4][NET_SCHED] kill dev->gso_skb In-Reply-To: <1190570409.4256.62.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> Message-ID: <1190570521.4256.65.camel@localhost> This patch removes dev->gso_skb as it is no longer necessary with batching code. cheers, jamal -------------- next part -------------- [NET_SCHED] kill dev->gso_skb The batching code does what gso used to batch at the drivers. There is no more need for gso_skb. 
If for whatever reason the requeueing is a bad idea we are going to leave packets in dev->blist (and still not need dev->gso_skb) Signed-off-by: Jamal Hadi Salim --- commit c6d2d61a73e1df5daaa294876f62454413fcb0af tree 1d7bf650096a922a6b6a4e7d6810f83320eb94dd parent e26705f6ef7db034df7af3f4fccd7cd40b8e46e0 author Jamal Hadi Salim Sun, 23 Sep 2007 12:25:10 -0400 committer Jamal Hadi Salim Sun, 23 Sep 2007 12:25:10 -0400 include/linux/netdevice.h | 3 --- net/sched/sch_generic.c | 12 ------------ 2 files changed, 0 insertions(+), 15 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 443cded..7811729 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -560,9 +560,6 @@ struct net_device struct list_head qdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ - /* Partially transmitted GSO packet. */ - struct sk_buff *gso_skb; - /* ingress path synchronizer */ spinlock_t ingress_lock; struct Qdisc *qdisc_ingress; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 86a3f9d..b4e1607 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev, struct sk_buff *skb; int count = dev->xmit_win; - if (count && dev->gso_skb) { - skb = dev->gso_skb; - dev->gso_skb = NULL; - count -= xmit_count_skbs(skb); - __skb_queue_tail(pktlist, skb); - } - while (count > 0) { skb = q->dequeue(q); if (!skb) @@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev) void dev_deactivate(struct net_device *dev) { struct Qdisc *qdisc; - struct sk_buff *skb; spin_lock_bh(&dev->queue_lock); qdisc = dev->qdisc; @@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev) qdisc_reset(qdisc); - skb = dev->gso_skb; - dev->gso_skb = NULL; if (!skb_queue_empty(&dev->blist)) skb_queue_purge(&dev->blist); dev->xmit_win = 1; spin_unlock_bh(&dev->queue_lock); - kfree_skb(skb); - dev_watchdog_down(dev); /* Wait for outstanding 
dev_queue_xmit calls. */ From jeff at garzik.org Sun Sep 23 11:19:04 2007 From: jeff at garzik.org (Jeff Garzik) Date: Sun, 23 Sep 2007 14:19:04 -0400 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <1190569987.4256.52.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> Message-ID: <46F6AE18.7080708@garzik.org> jamal wrote: > More patches to follow - i didnt want to overload people by dumping > too many patches. Most of these patches below are ready to go; some are > need some testing and others need a little porting from an earlier > kernel: > - tg3 driver (tested and works well, but dont want to send > - tun driver > - pktgen > - netiron driver > - e1000 driver You should post at least a couple driver patches to see how its used on Real Hardware(tm)... :) The batching idea has always seemed like a no-brainer to me, so I'm very interested to see how this turns out. Jeff From mst at dev.mellanox.co.il Sun Sep 23 11:22:23 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 23 Sep 2007 20:22:23 +0200 Subject: [ofa-general] Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes In-Reply-To: <46F6A097.2040402@opengridcomputing.com> References: <46E94B36.70406@opengridcomputing.com> <20070916091024.GF30150@mellanox.co.il> <46F6A097.2040402@opengridcomputing.com> Message-ID: <20070923182223.GB12425@mellanox.co.il> You don't? I do. http://www.openfabrics.org/git/?p=ofed_1_2/linux-2.6.git;a=summary has ofed_1_2_c http://www.openfabrics.org/git/?p=ofed_1_3/linux-2.6.git;a=summary has ofed_kernel Quoting Steve Wise : Subject: Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes Michael, I don't see these in the ofed_1_2/linux-2.6.git repos? Ditto for the 1.3 repos... Michael S. Tsirkin wrote: >Done. I'll push soon. 
> >Quoting Steve Wise : >Subject: [GIT PULL ofed_1_2_c] cxgb3 bug fixes > >Vlad (Michael/Tziporet in Vlad's absence), > >Please integrate the following cxgb3 bug fixes into ofed-1.2.5. All of >these patches are either in 2.6.23 or merged into Jeff Garzik's upstream >branch of netdev-2.6 and will go into 2.6.24. Chelsio recommends we >update ofed-1.2.5 and ofed-1.3 will all of these fixes. > >I'll send another email with the ofed-1.3 changes as they will be >slightly different. > >Please pull the ofed_1_2_c changes from: > >git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c > >The patch files added to kernel_patches/fixes include: > >>swise at dell3:~/git/ofed-1.2.5> stg series >>+ 0029-cxgb3-engine-microcode-load >>+ 0030-cxgb3-MAC-workaround-update >>+ 0031-cxgb3-Update-rx-coalescing-length >>+ 0032-cxgb3-SGE-doorbell-overflow-warning >>+ 0033-cxgb3-use-immediate-data-for-offload-Tx >>+ 0034-cxgb3-Expose-HW-memory-page-info >>+ 0035-cxgb3-tighten-checks-on-TID-values >>+ 0036-cxgb3-Fatal-error-update >>+ 0037-cxgb3-log-adapter-serial-number >>+ 0038-cxgb3-Update-internal-memory-management >>+ 0039-cxgb3-update-firmware-version >>+ 0040-cxgb3-log-and-clear-PEX-errors >>+ 0041-cxgb3-remove-false-positive-in-xgmac-workaround >>+ 0042-cxgb3-Set-the-CQ_ERR-bit-in-CQ-contexts >>+ 0043-cxgb3-CQ-context-operations-time-out-too-soon >>+ 0044-cxgb3-Add-T3C-rev >>+ 0045-cxgb3-Update-engine-microcode-version >>>0046-cxgb3-driver-version > >Steve. 
>_______________________________________________ >ewg mailing list >ewg at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -- MST From swise at opengridcomputing.com Sun Sep 23 11:42:39 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 23 Sep 2007 13:42:39 -0500 Subject: [ofa-general] Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes In-Reply-To: <20070923182223.GB12425@mellanox.co.il> References: <46E94B36.70406@opengridcomputing.com> <20070916091024.GF30150@mellanox.co.il> <46F6A097.2040402@opengridcomputing.com> <20070923182223.GB12425@mellanox.co.il> Message-ID: <46F6B39F.3080902@opengridcomputing.com> I don't see any of my commits... Here is the shortlog from my public ofed_1_2 repos. These commits aren't in the ofed-1.2.5 repos nor merged into ofed-1.3: swise at hosting:~/scm$ GIT_DIR=ofed_1_2.git git log ofed_1_2_c db330ef9aa21b5ab905a2b7cc58ebe4d2f85844a..|GIT_DIR=ofed_1_2.git git shortlog Steve Wise (18): cxgb3: Add patch 47330077650a25d417155848516b2cba97999602. cxgb3: Add commit 3d5e7fe7e6b505d4c48e1722edc37dd788c36d60 cxgb3: Add commit 98db31aa99dda0e32116b6df1bdf9a97531f73fd cxgb3: Add commit ea04cdf4eaccec57dd57ccb752eddde60acd000b cxgb3: Add commit 69a7ed553015fc33fc3b96b2df9b71d398648bea cxgb3: Add commit 22c7401b9d87421f4edb624c70eb1e0c9a876be0 cxgb3: Add commit 2c61ac81dd3b127a8539d086b5cb5c9f5fccd9ec cxgb3: Add commit 142760115c9bf9887a2c97bc685dd1806e9ac91d cxgb3: Add commit b76839e242219562c167b229fde20a9be91b3875 cxgb3: Add commit 89531c0df41d1bc73983ae86277bbea446d034bb cxgb3: added 0039-cxgb3-update-firmware-version cxgb3: added 0040-cxgb3-log-and-clear-PEX-errors cxgb3: added 0041-cxgb3-remove-false-positive-in-xgmac-workaround cxgb3: added 0042-cxgb3-Set-the-CQ_ERR-bit-in-CQ-contexts cxgb3: added 0043-cxgb3-CQ-context-operations-time-out-too-soon cxgb3: added 0044-cxgb3-Add-T3C-rev cxgb3: added 0045-cxgb3-Update-engine-microcode-version cxgb3: add 0046-cxgb3-driver-version patch. Michael S. 
Tsirkin wrote: > You don't? > I do. > > http://www.openfabrics.org/git/?p=ofed_1_2/linux-2.6.git;a=summary > > has ofed_1_2_c > > http://www.openfabrics.org/git/?p=ofed_1_3/linux-2.6.git;a=summary > > has ofed_kernel > > > Quoting Steve Wise : > Subject: Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes > > Michael, > > I don't see these in the ofed_1_2/linux-2.6.git repos? Ditto for the > 1.3 repos... > > > > > Michael S. Tsirkin wrote: >> Done. I'll push soon. >> >> Quoting Steve Wise : >> Subject: [GIT PULL ofed_1_2_c] cxgb3 bug fixes >> >> Vlad (Michael/Tziporet in Vlad's absence), >> >> Please integrate the following cxgb3 bug fixes into ofed-1.2.5. All of >> these patches are either in 2.6.23 or merged into Jeff Garzik's upstream >> branch of netdev-2.6 and will go into 2.6.24. Chelsio recommends we >> update ofed-1.2.5 and ofed-1.3 will all of these fixes. >> >> I'll send another email with the ofed-1.3 changes as they will be >> slightly different. >> >> Please pull the ofed_1_2_c changes from: >> >> git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c >> >> The patch files added to kernel_patches/fixes include: >> >>> swise at dell3:~/git/ofed-1.2.5> stg series >>> + 0029-cxgb3-engine-microcode-load >>> + 0030-cxgb3-MAC-workaround-update >>> + 0031-cxgb3-Update-rx-coalescing-length >>> + 0032-cxgb3-SGE-doorbell-overflow-warning >>> + 0033-cxgb3-use-immediate-data-for-offload-Tx >>> + 0034-cxgb3-Expose-HW-memory-page-info >>> + 0035-cxgb3-tighten-checks-on-TID-values >>> + 0036-cxgb3-Fatal-error-update >>> + 0037-cxgb3-log-adapter-serial-number >>> + 0038-cxgb3-Update-internal-memory-management >>> + 0039-cxgb3-update-firmware-version >>> + 0040-cxgb3-log-and-clear-PEX-errors >>> + 0041-cxgb3-remove-false-positive-in-xgmac-workaround >>> + 0042-cxgb3-Set-the-CQ_ERR-bit-in-CQ-contexts >>> + 0043-cxgb3-CQ-context-operations-time-out-too-soon >>> + 0044-cxgb3-Add-T3C-rev >>> + 0045-cxgb3-Update-engine-microcode-version >>>> 0046-cxgb3-driver-version >> Steve. 
>> _______________________________________________ >> ewg mailing list >> ewg at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg >> > From mst at dev.mellanox.co.il Sun Sep 23 11:44:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 23 Sep 2007 20:44:33 +0200 Subject: [ofa-general] Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes In-Reply-To: <46F6B39F.3080902@opengridcomputing.com> References: <46E94B36.70406@opengridcomputing.com> <20070916091024.GF30150@mellanox.co.il> <46F6A097.2040402@opengridcomputing.com> <20070923182223.GB12425@mellanox.co.il> <46F6B39F.3080902@opengridcomputing.com> Message-ID: <20070923184432.GA18109@mellanox.co.il> > Quoting Steve Wise : > Subject: Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes > > I don't see any of my commits... > > > Here is the shortlog from my public ofed_1_2 repos. These commits > aren't in the ofed-1.2.5 repos nor merged into ofed-1.3: You are right. I pulled your trees but forgot to push out to ofed directories. Doing it now. -- MST From sashak at voltaire.com Sun Sep 23 11:59:15 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 23 Sep 2007 20:59:15 +0200 Subject: [ofa-general] [PATCH] opensm: osm_version.h is generated by ./configure Message-ID: <20070923185915.GB2131@sashak.voltaire.com> include/opensm/osm_version.h file will be generated (from osm_version.h.in template) by ./configure. In this way we only keep OpenSM version string in one place (in configure.in). 
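As a rough illustration of what a consumer sees once ./configure has produced the header, consider the sketch below. It is not part of the patch: the toy_* helpers are hypothetical and are not OpenSM functions, and the fallback value assumes that VERSION in configure.in is 3.1.5, the string the hard-coded header carried. The second helper shows a cheap sanity check that the "@VERSION@" placeholder from the .in template was actually substituted.

```c
#include <assert.h>
#include <string.h>

/*
 * Stand-in for the generated include/opensm/osm_version.h.
 * Assumption: configure substituted @VERSION@ with 3.1.5.
 */
#ifndef OSM_VERSION
#define OSM_VERSION "OpenSM 3.1.5"
#endif

/* Hypothetical helper, not an OpenSM function: return the version string. */
const char *toy_osm_version(void)
{
	return OSM_VERSION;
}

/*
 * Hypothetical helper: returns 1 if the template placeholder was replaced,
 * 0 if a literal "@VERSION@" leaked through (header not generated).
 */
int toy_version_substituted(const char *v)
{
	return strstr(v, "@VERSION@") == NULL;
}
```

A build that accidentally includes the raw osm_version.h.in contents would make toy_version_substituted() return 0, which is an easy thing to assert in a smoke test.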
Signed-off-by: Sasha Khapyorsky --- opensm/configure.in | 2 +- opensm/include/Makefile.am | 1 - opensm/include/opensm/osm_version.h | 60 -------------------------------- opensm/include/opensm/osm_version.h.in | 60 ++++++++++++++++++++++++++++++++ 4 files changed, 61 insertions(+), 62 deletions(-) delete mode 100644 opensm/include/opensm/osm_version.h create mode 100644 opensm/include/opensm/osm_version.h.in diff --git a/opensm/configure.in b/opensm/configure.in index cc8cf14..d120c05 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -87,4 +87,4 @@ OPENIB_APP_OSMV_CHECK_LIB CFLAGS=$ac_env_CFLAGS_value dnl Create the following Makefiles -AC_OUTPUT([Makefile include/Makefile complib/Makefile libvendor/Makefile opensm/Makefile osmeventplugin/Makefile osmtest/Makefile opensm.spec]) +AC_OUTPUT([include/opensm/osm_version.h Makefile include/Makefile complib/Makefile libvendor/Makefile opensm/Makefile osmeventplugin/Makefile osmtest/Makefile opensm.spec]) diff --git a/opensm/include/Makefile.am b/opensm/include/Makefile.am index ab67446..b2d01fa 100644 --- a/opensm/include/Makefile.am +++ b/opensm/include/Makefile.am @@ -4,7 +4,6 @@ SUBDIRS = . nobase_pkginclude_HEADERS = iba/ib_types.h iba/ib_cm_types.h EXTRA_DIST = \ - $(srcdir)/opensm/osm_version.h \ $(srcdir)/opensm/osm_sa_path_record.h \ $(srcdir)/opensm/osm_lid_mgr.h \ $(srcdir)/opensm/osm_vl_arb_rcv.h \ diff --git a/opensm/include/opensm/osm_version.h b/opensm/include/opensm/osm_version.h deleted file mode 100644 index 39d5696..0000000 --- a/opensm/include/opensm/osm_version.h +++ /dev/null @@ -1,60 +0,0 @@ -/* - * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -#ifndef _OSM_VERSION_H_ -#define _OSM_VERSION_H_ - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS -/****s* OpenSM: Base/OSM_VERSION -* NAME -* OSM_VERSION -* -* DESCRIPTION -* The version string for OpenSM -* -* SYNOPSIS -*/ -#define OSM_VERSION "OpenSM 3.1.5" -/********/ -END_C_DECLS -#endif /* _OSM_VERSION_H_ */ diff --git a/opensm/include/opensm/osm_version.h.in b/opensm/include/opensm/osm_version.h.in new file mode 100644 index 0000000..f5661d0 --- /dev/null +++ b/opensm/include/opensm/osm_version.h.in @@ -0,0 +1,60 @@ +/* + * Copyright (c) 2004-2007 Voltaire, Inc. 
All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#ifndef _OSM_VERSION_H_ +#define _OSM_VERSION_H_ + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS +/****s* OpenSM: Base/OSM_VERSION +* NAME +* OSM_VERSION +* +* DESCRIPTION +* The version string for OpenSM +* +* SYNOPSIS +*/ +#define OSM_VERSION "OpenSM @VERSION@" +/********/ +END_C_DECLS +#endif /* _OSM_VERSION_H_ */ -- 1.5.3.1.91.gd3392 From sashak at voltaire.com Sun Sep 23 12:15:56 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 23 Sep 2007 21:15:56 +0200 Subject: [ofa-general] [query] Multi path discovery in openSM In-Reply-To: <829ded920709210125q3c4c89dak8b211267b6e31e55@mail.gmail.com> References: <829ded920709210125q3c4c89dak8b211267b6e31e55@mail.gmail.com> Message-ID: <20070923191556.GC2131@sashak.voltaire.com> Hi Manesh, On 13:55 Fri 21 Sep , Keshetti Mahesh wrote: > What is the exact significance of the configurable option LMC > in the opensm.conf file? Multiple LIDs will be assigned for single end node when LMC is > 0. > If there are multiple paths between two end nodes in a network and > I set the LMC > 0 then whether the openSM itself identifies those > routes and updates the switch forwarding tables or is it the duty of some > other consumer ?? OpenSM. > And after configuring multiple paths between end nodes, how exactly they > are used for path redundancy and load sharing.
> Again is it the duty of the openSM (in case any SM) or the application? It is up to application (not OpenSM) how to use it. Basically by using different end node LIDs you are able to utilize different paths. Sasha From hadi at cyberus.ca Sun Sep 23 12:11:52 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 23 Sep 2007 15:11:52 -0400 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <46F6AE18.7080708@garzik.org> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <46F6AE18.7080708@garzik.org> Message-ID: <1190574713.5030.4.camel@localhost> On Sun, 2007-23-09 at 14:19 -0400, Jeff Garzik wrote: > > You should post at least a couple driver patches to see how its used on > Real Hardware(tm)... :) This is the tg3 patch i used for the testing - against whats in Daves net-2.6.24 tree. Patch may be a bit hard to read. For an example of an LLTX version look at the e1000 in the older git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git If the intel folks will accept the patch i'd really like to kill the e1000 LLTX interface. The tg3 in that tree used the old style batch_xmit() interface. cheers, jamal -------------- next part -------------- A non-text attachment was scrubbed... Name: tg3.p Type: text/x-patch Size: 16359 bytes Desc: not available URL: From mst at dev.mellanox.co.il Sun Sep 23 12:11:36 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 23 Sep 2007 21:11:36 +0200 Subject: [ofa-general] Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes In-Reply-To: <46F6A097.2040402@opengridcomputing.com> References: <46E94B36.70406@opengridcomputing.com> <20070916091024.GF30150@mellanox.co.il> <46F6A097.2040402@opengridcomputing.com> Message-ID: <20070923191136.GB18109@mellanox.co.il> > Quoting Steve Wise : > Subject: Re: [GIT PULL ofed_1_2_c] cxgb3 bug fixes > > Michael, > > I don't see these in the ofed_1_2/linux-2.6.git repos? Ditto for the > 1.3 repos... Should be fixed now. -- MST From kliteyn at mellanox.co.il Sun Sep 23 12:13:52 2007 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 23 Sep 2007 21:13:52 +0200 Subject: [ofa-general] [PATCH] opensm: osm_version.h is generated by ./configure In-Reply-To: <20070923185915.GB2131@sashak.voltaire.com> References: <20070923185915.GB2131@sashak.voltaire.com> Message-ID: <46F6BAF0.4070509@mellanox.co.il> Good idea. Thanks. -- Yevgeny Sasha Khapyorsky wrote: > include/opensm/osm_version.h file will be generated (from > osm_version.h.in template) by ./configure. In this way we only keep > OpenSM version string in one place (in configure.in). 
> > Signed-off-by: Sasha Khapyorsky > --- > opensm/configure.in | 2 +- > opensm/include/Makefile.am | 1 - > opensm/include/opensm/osm_version.h | 60 -------------------------------- > opensm/include/opensm/osm_version.h.in | 60 ++++++++++++++++++++++++++++++++ > 4 files changed, 61 insertions(+), 62 deletions(-) > delete mode 100644 opensm/include/opensm/osm_version.h > create mode 100644 opensm/include/opensm/osm_version.h.in > > diff --git a/opensm/configure.in b/opensm/configure.in > index cc8cf14..d120c05 100644 > --- a/opensm/configure.in > +++ b/opensm/configure.in > @@ -87,4 +87,4 @@ OPENIB_APP_OSMV_CHECK_LIB > CFLAGS=$ac_env_CFLAGS_value > > dnl Create the following Makefiles > -AC_OUTPUT([Makefile include/Makefile complib/Makefile libvendor/Makefile opensm/Makefile osmeventplugin/Makefile osmtest/Makefile opensm.spec]) > +AC_OUTPUT([include/opensm/osm_version.h Makefile include/Makefile complib/Makefile libvendor/Makefile opensm/Makefile osmeventplugin/Makefile osmtest/Makefile opensm.spec]) > diff --git a/opensm/include/Makefile.am b/opensm/include/Makefile.am > index ab67446..b2d01fa 100644 > --- a/opensm/include/Makefile.am > +++ b/opensm/include/Makefile.am > @@ -4,7 +4,6 @@ SUBDIRS = . > nobase_pkginclude_HEADERS = iba/ib_types.h iba/ib_cm_types.h > > EXTRA_DIST = \ > - $(srcdir)/opensm/osm_version.h \ > $(srcdir)/opensm/osm_sa_path_record.h \ > $(srcdir)/opensm/osm_lid_mgr.h \ > $(srcdir)/opensm/osm_vl_arb_rcv.h \ > diff --git a/opensm/include/opensm/osm_version.h b/opensm/include/opensm/osm_version.h > deleted file mode 100644 > index 39d5696..0000000 > --- a/opensm/include/opensm/osm_version.h > +++ /dev/null > @@ -1,60 +0,0 @@ > -/* > - * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > - * > - * This software is available to you under a choice of one of two > - * licenses. 
You may choose to be licensed under the terms of the GNU > - * General Public License (GPL) Version 2, available from the file > - * COPYING in the main directory of this source tree, or the > - * OpenIB.org BSD license below: > - * > - * Redistribution and use in source and binary forms, with or > - * without modification, are permitted provided that the following > - * conditions are met: > - * > - * - Redistributions of source code must retain the above > - * copyright notice, this list of conditions and the following > - * disclaimer. > - * > - * - Redistributions in binary form must reproduce the above > - * copyright notice, this list of conditions and the following > - * disclaimer in the documentation and/or other materials > - * provided with the distribution. > - * > - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > - * SOFTWARE. 
> - * > - */ > - > -#ifndef _OSM_VERSION_H_ > -#define _OSM_VERSION_H_ > - > -#ifdef __cplusplus > -# define BEGIN_C_DECLS extern "C" { > -# define END_C_DECLS } > -#else /* !__cplusplus */ > -# define BEGIN_C_DECLS > -# define END_C_DECLS > -#endif /* __cplusplus */ > - > -BEGIN_C_DECLS > -/****s* OpenSM: Base/OSM_VERSION > -* NAME > -* OSM_VERSION > -* > -* DESCRIPTION > -* The version string for OpenSM > -* > -* SYNOPSIS > -*/ > -#define OSM_VERSION "OpenSM 3.1.5" > -/********/ > -END_C_DECLS > -#endif /* _OSM_VERSION_H_ */ > diff --git a/opensm/include/opensm/osm_version.h.in b/opensm/include/opensm/osm_version.h.in > new file mode 100644 > index 0000000..f5661d0 > --- /dev/null > +++ b/opensm/include/opensm/osm_version.h.in > @@ -0,0 +1,60 @@ > +/* > + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. 
> + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +#ifndef _OSM_VERSION_H_ > +#define _OSM_VERSION_H_ > + > +#ifdef __cplusplus > +# define BEGIN_C_DECLS extern "C" { > +# define END_C_DECLS } > +#else /* !__cplusplus */ > +# define BEGIN_C_DECLS > +# define END_C_DECLS > +#endif /* __cplusplus */ > + > +BEGIN_C_DECLS > +/****s* OpenSM: Base/OSM_VERSION > +* NAME > +* OSM_VERSION > +* > +* DESCRIPTION > +* The version string for OpenSM > +* > +* SYNOPSIS > +*/ > +#define OSM_VERSION "OpenSM @VERSION@" > +/********/ > +END_C_DECLS > +#endif /* _OSM_VERSION_H_ */ > From auke-jan.h.kok at intel.com Sun Sep 23 12:36:57 2007 From: auke-jan.h.kok at intel.com (Kok, Auke) Date: Sun, 23 Sep 2007 12:36:57 -0700 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <1190574713.5030.4.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <46F6AE18.7080708@garzik.org> <1190574713.5030.4.camel@localhost> Message-ID: <46F6C059.6000600@intel.com> jamal wrote: > On Sun, 2007-23-09 at 14:19 -0400, Jeff Garzik wrote: > >> You should post at least a couple driver patches to see how its used on >> Real Hardware(tm)... :) > > This is the tg3 patch i used for the testing - against whats in Daves > net-2.6.24 tree. Patch may be a bit hard to read. 
> For an example of an LLTX version look at the e1000 in the older git > tree at: > git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git > > If the intel folks will accept the patch i'd really like to kill > the e1000 LLTX interface. > The tg3 in that tree used the old style batch_xmit() interface. please be reminded that we're going to strip down e1000 and most of the features should go into e1000e, which has much less hardware workarounds. I'm still reluctant to putting in new stuff in e1000 - I really want to chop it down first ;) AUke From rdreier at cisco.com Sun Sep 23 13:06:05 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 23 Sep 2007 13:06:05 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get one fix for a data corruption bug in 2.6.23-rc7: Jack Morgenstein (1): IB/mlx4: Fix data corruption triggered by wrong headroom marking order drivers/infiniband/hw/mlx4/qp.c | 62 ++++++++++++++++++++++++++++++-------- 1 files changed, 49 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index ba0428d..85c51bd 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1211,12 +1211,42 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg, dseg->qkey = cpu_to_be32(wr->wr.ud.remote_qkey); } -static void set_data_seg(struct mlx4_wqe_data_seg *dseg, - struct ib_sge *sg) +static void set_mlx_icrc_seg(void *dseg) +{ + u32 *t = dseg; + struct mlx4_wqe_inline_seg *iseg = dseg; + + t[1] = 0; + + /* + * Need a barrier here before writing the byte_count field to + * make sure that all the data is visible before the + * byte_count field is set. 
Otherwise, if the segment begins + * a new cacheline, the HCA prefetcher could grab the 64-byte + * chunk and get a valid (!= * 0xffffffff) byte count but + * stale data, and end up sending the wrong data. + */ + wmb(); + + iseg->byte_count = cpu_to_be32((1 << 31) | 4); +} + +static void set_data_seg(struct mlx4_wqe_data_seg *dseg, struct ib_sge *sg) { - dseg->byte_count = cpu_to_be32(sg->length); dseg->lkey = cpu_to_be32(sg->lkey); dseg->addr = cpu_to_be64(sg->addr); + + /* + * Need a barrier here before writing the byte_count field to + * make sure that all the data is visible before the + * byte_count field is set. Otherwise, if the segment begins + * a new cacheline, the HCA prefetcher could grab the 64-byte + * chunk and get a valid (!= * 0xffffffff) byte count but + * stale data, and end up sending the wrong data. + */ + wmb(); + + dseg->byte_count = cpu_to_be32(sg->length); } int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, @@ -1225,6 +1255,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct mlx4_ib_qp *qp = to_mqp(ibqp); void *wqe; struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_data_seg *dseg; unsigned long flags; int nreq; int err = 0; @@ -1324,22 +1355,27 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, break; } - for (i = 0; i < wr->num_sge; ++i) { - set_data_seg(wqe, wr->sg_list + i); + /* + * Write data segments in reverse order, so as to + * overwrite cacheline stamp last within each + * cacheline. This avoids issues with WQE + * prefetching. 
+ */ - wqe += sizeof (struct mlx4_wqe_data_seg); - size += sizeof (struct mlx4_wqe_data_seg) / 16; - } + dseg = wqe; + dseg += wr->num_sge - 1; + size += wr->num_sge * (sizeof (struct mlx4_wqe_data_seg) / 16); /* Add one more inline data segment for ICRC for MLX sends */ - if (qp->ibqp.qp_type == IB_QPT_SMI || qp->ibqp.qp_type == IB_QPT_GSI) { - ((struct mlx4_wqe_inline_seg *) wqe)->byte_count = - cpu_to_be32((1 << 31) | 4); - ((u32 *) wqe)[1] = 0; - wqe += sizeof (struct mlx4_wqe_data_seg); + if (unlikely(qp->ibqp.qp_type == IB_QPT_SMI || + qp->ibqp.qp_type == IB_QPT_GSI)) { + set_mlx_icrc_seg(dseg + 1); size += sizeof (struct mlx4_wqe_data_seg) / 16; } + for (i = wr->num_sge - 1; i >= 0; --i, --dseg) + set_data_seg(dseg, wr->sg_list + i); + ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ? MLX4_WQE_CTRL_FENCE : 0) | size; From swise at opengridcomputing.com Sun Sep 23 13:29:24 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 23 Sep 2007 15:29:24 -0500 Subject: [ofa-general] Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. In-Reply-To: <20070923085052.GC24557@mellanox.co.il> References: <20070912100025.3190.89259.stgit@dell3.ogc.int> <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com> <46F3E3D2.70601@opengridcomputing.com> <20070923085052.GC24557@mellanox.co.il> Message-ID: <46F6CCA4.1010607@opengridcomputing.com> Michael S. Tsirkin wrote: > Yes, please push this into your git tree (and please verify that > cross-build to all OS-es passes). > done! git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c > Further, please do it this way: add the patch in ofed-1.2.5 > and then merge 1.2.5 into 1.3. > done! git://git.openfabrics.org/~swise/ofed-1.3 ofed_kernel Steve. From swise at opengridcomputing.com Sun Sep 23 13:33:35 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 23 Sep 2007 15:33:35 -0500 Subject: [ofa-general] [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. 
In-Reply-To: <46E99586.90905@ichips.intel.com> References: <20070913191617.30937.95960.stgit@dell3.ogc.int> <46E99586.90905@ichips.intel.com> Message-ID: <46F6CD9F.9090606@opengridcomputing.com> Sean Hefty wrote: >> The iWARP driver must translate all listens on address 0.0.0.0 to the >> set of rdma-only ip addresses for the device in question. This prevents >> incoming connect requests to the TCP ipaddresses from going up the >> rdma stack. > > I've only given this a high level review at this point, and while the > patch looks okay on first pass, is there a way to move some of this > functionality to either the rdma_cm or iw_cm? I don't like the idea of > every iwarp driver having to implement address/listen list maintenance. > I may have some ideas after re-examining it. > Note: some rnic drivers might want to support this differently. So maybe we don't want this in the iwcm yet until we see that more iwarp drivers need exactly the same functionality. >> Implementation Details: > > There are a couple of areas that I made a note to look at in more detail > (because I didn't understand everything that was happening), but I did > have one minor nit - most uses of list_del_init can just be list_del. > fixed. From swise at opengridcomputing.com Sun Sep 23 13:36:49 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 23 Sep 2007 15:36:49 -0500 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. Message-ID: <20070923203649.8324.64524.stgit@dell3.ogc.int> iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. Version 3: - don't use list_del_init() where list_del() is sufficient. Version 2: - added a per-device mutex for the address and listening endpoints lists. - wait for all replies if sending multiple passive_open requests to rnic. - log warning if no addresses are available when a listen is issued. 
- tested --- Design: The sysadmin creates "for iwarp use only" alias interfaces of the form "devname:iw*" where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with "iw". The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver. EG: ifconfig eth0 192.168.70.123 up ifconfig eth0:iw1 192.168.71.123 up ifconfig eth0:iw2 192.168.72.123 up In the above example, 192.168.70/24 is for TCP traffic, while 192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use. The rdma-only interface must be on its own IP subnet. This allows routing all rdma traffic onto this interface. The iWARP driver must translate all listens on address 0.0.0.0 to the set of rdma-only ip addresses for the device in question. This prevents incoming connect requests to the TCP ipaddresses from going up the rdma stack. Implementation Details: - The iw_cxgb3 driver registers for inetaddr events via register_inetaddr_notifier(). This allows tracking the iwarp-only addresses/subnets as they get added and deleted. The iwarp driver maintains a list of the current iwarp-only addresses. - The iw_cxgb3 driver builds the list of iwarp-only addresses for its devices at module insert time. This is needed because the inetaddr notifier callbacks don't "replay" address-add events when someone registers. So the driver must build the initial list at module load time. - When a listen is done on address 0.0.0.0, then the iw_cxgb3 driver must translate that into a set of listens on the iwarp-only addresses. This is implemented by maintaining a list of stid/addr entries per listening endpoint. - When a new iwarp-only address is added or removed, the iw_cxgb3 driver must traverse the set of listening endpoints and update them accordingly. This allows an application to bind to 0.0.0.0 prior to the iwarp-only interfaces being configured. 
It also allows changing the iwarp-only set of addresses and getting the expected behavior for apps already bound to 0.0.0.0. This is done by maintaining a list of listening endpoints off the device struct. - The address list, the listening endpoint list, and each list of stid/addrs in use per listening endpoint are all protected via a mutex per iw_cxgb3 device. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch.c | 125 ++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch.h | 11 + drivers/infiniband/hw/cxgb3/iwch_cm.c | 259 +++++++++++++++++++++++++++------ drivers/infiniband/hw/cxgb3/iwch_cm.h | 15 ++ 4 files changed, 360 insertions(+), 50 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c index 0315c9d..d81d46e 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.c +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -63,6 +63,123 @@ struct cxgb3_client t3c_client = { static LIST_HEAD(dev_list); static DEFINE_MUTEX(dev_mutex); +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) +{ + struct iwch_addrlist *addr; + + addr = kmalloc(sizeof *addr, GFP_KERNEL); + if (!addr) { + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", + __FUNCTION__); + return; + } + addr->ifa = ifa; + mutex_lock(&rnicp->mutex); + list_add_tail(&addr->entry, &rnicp->addrlist); + mutex_unlock(&rnicp->mutex); +} + +static void remove_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) +{ + struct iwch_addrlist *addr, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { + if (addr->ifa == ifa) { + list_del(&addr->entry); + kfree(addr); + goto out; + } + } +out: + mutex_unlock(&rnicp->mutex); +} + +static int netdev_is_ours(struct iwch_dev *rnicp, struct net_device *netdev) +{ + int i; + + for (i = 0; i < rnicp->rdev.port_info.nports; i++) + if (netdev == rnicp->rdev.port_info.lldevs[i]) + return 1; + return 0; +} + +static inline int is_iwarp_label(char *label) +{ + char *colon; + + colon 
= strchr(label, ':'); + if (colon && !strncmp(colon+1, "iw", 2)) + return 1; + return 0; +} + +static int nb_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct in_ifaddr *ifa = ctx; + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); + + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); + + switch (event) { + case NETDEV_UP: + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && + is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x added\n", + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); + insert_ifa(rnicp, ifa); + iwch_listeners_add_addr(rnicp, ifa->ifa_address); + } + break; + case NETDEV_DOWN: + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && + is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x deleted\n", + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); + iwch_listeners_del_addr(rnicp, ifa->ifa_address); + remove_ifa(rnicp, ifa); + } + break; + default: + break; + } + return 0; +} + +static void delete_addrlist(struct iwch_dev *rnicp) +{ + struct iwch_addrlist *addr, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { + list_del(&addr->entry); + kfree(addr); + } + mutex_unlock(&rnicp->mutex); +} + +static void populate_addrlist(struct iwch_dev *rnicp) +{ + int i; + struct in_device *indev; + + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); + if (!indev) + continue; + for_ifa(indev) + if (is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x added\n", + __FUNCTION__, ifa->ifa_label, + ifa->ifa_address); + insert_ifa(rnicp, ifa); + } + endfor_ifa(indev); + } +} + static void rnic_init(struct iwch_dev *rnicp) { PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r idr_init(&rnicp->qpidr); idr_init(&rnicp->mmidr); spin_lock_init(&rnicp->lock); + INIT_LIST_HEAD(&rnicp->addrlist); + 
INIT_LIST_HEAD(&rnicp->listen_eps); + mutex_init(&rnicp->mutex); + rnicp->nb.notifier_call = nb_callback; + populate_addrlist(rnicp); + register_inetaddr_notifier(&rnicp->nb); rnicp->attr.vendor_id = 0x168; rnicp->attr.vendor_part_id = 7; @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev mutex_lock(&dev_mutex); list_for_each_entry_safe(dev, tmp, &dev_list, entry) { if (dev->rdev.t3cdev_p == tdev) { + unregister_inetaddr_notifier(&dev->nb); + delete_addrlist(dev); list_del(&dev->entry); iwch_unregister_device(dev); cxio_rdev_close(&dev->rdev); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h index caf4e60..7fa0a47 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.h +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -36,6 +36,8 @@ #include #include #include #include +#include +#include #include @@ -101,6 +103,11 @@ struct iwch_rnic_attributes { u32 cq_overflow_detection; }; +struct iwch_addrlist { + struct list_head entry; + struct in_ifaddr *ifa; +}; + struct iwch_dev { struct ib_device ibdev; struct cxio_rdev rdev; @@ -111,6 +118,10 @@ struct iwch_dev { struct idr mmidr; spinlock_t lock; struct list_head entry; + struct notifier_block nb; + struct list_head addrlist; + struct list_head listen_eps; + struct mutex mutex; }; static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 1cdfcd4..afc8a48 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1127,23 +1127,149 @@ static int act_open_rpl(struct t3cdev *t return CPL_RET_BUF_DONE; } -static int listen_start(struct iwch_listen_ep *ep) +static int wait_for_reply(struct iwch_ep_common *epc) +{ + PDBG("%s ep %p waiting\n", __FUNCTION__, epc); + wait_event(epc->waitq, epc->rpl_done); + PDBG("%s ep %p done waiting err %d\n", __FUNCTION__, epc, epc->rpl_err); + return epc->rpl_err; +} + +static struct iwch_listen_entry 
*alloc_listener(struct iwch_listen_ep *ep, + __be32 addr) +{ + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + struct iwch_listen_entry *le; + + le = kmalloc(sizeof *le, GFP_KERNEL); + if (!le) { + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", + __FUNCTION__); + return NULL; + } + le->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, + &t3c_client, ep); + if (le->stid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", + __FUNCTION__); + kfree(le); + return NULL; + } + le->addr = addr; + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); + return le; +} + +static void dealloc_listener(struct iwch_listen_ep *ep, + struct iwch_listen_entry *le) +{ + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); + cxgb3_free_stid(ep->com.tdev, le->stid); + kfree(le); +} + +static void dealloc_listener_list(struct iwch_listen_ep *ep) +{ + struct iwch_listen_entry *le, *tmp; + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + + mutex_lock(&h->mutex); + list_for_each_entry_safe(le, tmp, &ep->listeners, entry) { + list_del(&le->entry); + dealloc_listener(ep, le); + } + mutex_unlock(&h->mutex); +} + +static int alloc_listener_list(struct iwch_listen_ep *ep) +{ + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + struct iwch_addrlist *addr; + struct iwch_listen_entry *le; + int err = 0; + int added=0; + mutex_lock(&h->mutex); + list_for_each_entry(addr, &h->addrlist, entry) { + if (ep->com.local_addr.sin_addr.s_addr == 0 || + ep->com.local_addr.sin_addr.s_addr == + addr->ifa->ifa_address) { + le = alloc_listener(ep, addr->ifa->ifa_address); + if (!le) + break; + list_add_tail(&le->entry, &ep->listeners); + added++; + } + } + mutex_unlock(&h->mutex); + if (ep->com.local_addr.sin_addr.s_addr != 0 && !added) + err = -EADDRNOTAVAIL; + if (!err && !added) + printk(KERN_WARNING MOD + "No RDMA interface found for device %s\n", 
+ pci_name(h->rdev.rnic_info.pdev)); + return err; +} + +static int listen_stop_one(struct iwch_listen_ep *ep, unsigned int stid) { struct sk_buff *skb; - struct cpl_pass_open_req *req; + struct cpl_close_listserv_req *req; + + PDBG("%s stid %u\n", __FUNCTION__, stid); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->cpu_idx = 0; + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, stid)); + skb->priority = 1; + ep->com.rpl_err = 0; + ep->com.rpl_done = 0; + cxgb3_ofld_send(ep->com.tdev, skb); + return wait_for_reply(&ep->com); +} + +static int listen_stop(struct iwch_listen_ep *ep) +{ + struct iwch_listen_entry *le; + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + int err = 0; PDBG("%s ep %p\n", __FUNCTION__, ep); + mutex_lock(&h->mutex); + list_for_each_entry(le, &ep->listeners, entry) { + err = listen_stop_one(ep, le->stid); + if (err) + break; + } + mutex_unlock(&h->mutex); + return err; +} + +static int listen_start_one(struct iwch_listen_ep *ep, unsigned int stid, + __be32 addr, __be16 port) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, stid, ntohl(addr), + ntohs(port)); skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); if (!skb) { - printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); return -ENOMEM; } req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); - req->local_port = ep->com.local_addr.sin_port; - req->local_ip = ep->com.local_addr.sin_addr.s_addr; + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, stid)); + 
req->local_port = port; + req->local_ip = addr; req->peer_port = 0; req->peer_ip = 0; req->peer_netmask = 0; @@ -1152,8 +1278,32 @@ static int listen_start(struct iwch_list req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); skb->priority = 1; + ep->com.rpl_err = 0; + ep->com.rpl_done = 0; cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return wait_for_reply(&ep->com); +} + +static int listen_start(struct iwch_listen_ep *ep) +{ + struct iwch_listen_entry *le; + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + int err = 0; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + mutex_lock(&h->mutex); + list_for_each_entry(le, &ep->listeners, entry) { + err = listen_start_one(ep, le->stid, le->addr, + ep->com.local_addr.sin_port); + if (err) + goto fail; + } + mutex_unlock(&h->mutex); + return err; +fail: + mutex_unlock(&h->mutex); + listen_stop(ep); + return err; } static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) @@ -1170,39 +1320,59 @@ static int pass_open_rpl(struct t3cdev * return CPL_RET_BUF_DONE; } -static int listen_stop(struct iwch_listen_ep *ep) -{ - struct sk_buff *skb; - struct cpl_close_listserv_req *req; - - PDBG("%s ep %p\n", __FUNCTION__, ep); - skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); - if (!skb) { - printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); - return -ENOMEM; - } - req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); - req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); - req->cpu_idx = 0; - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); - skb->priority = 1; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; -} - static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) { struct iwch_listen_ep *ep = ctx; struct cpl_close_listserv_rpl *rpl = cplhdr(skb); - PDBG("%s ep %p\n", __FUNCTION__, ep); + PDBG("%s ep %p stid %u\n", __FUNCTION__, ep, GET_TID(rpl)); + ep->com.rpl_err = status2errno(rpl->status); ep->com.rpl_done = 1; 
wake_up(&ep->com.waitq); return CPL_RET_BUF_DONE; } +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr) +{ + struct iwch_listen_ep *listen_ep; + struct iwch_listen_entry *le; + + mutex_lock(&rnicp->mutex); + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { + if (listen_ep->com.local_addr.sin_addr.s_addr) + continue; + le = alloc_listener(listen_ep, addr); + if (le) { + list_add_tail(&le->entry, &listen_ep->listeners); + listen_start_one(listen_ep, le->stid, addr, + listen_ep->com.local_addr.sin_port); + } + } + mutex_unlock(&rnicp->mutex); +} + +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr) +{ + struct iwch_listen_ep *listen_ep; + struct iwch_listen_entry *le, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { + if (listen_ep->com.local_addr.sin_addr.s_addr) + continue; + list_for_each_entry_safe(le, tmp, &listen_ep->listeners, + entry) + if (le->addr == addr) { + listen_stop_one(listen_ep, le->stid); + list_del(&le->entry); + dealloc_listener(listen_ep, le); + } + } + mutex_unlock(&rnicp->mutex); +} + static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) { struct cpl_pass_accept_rpl *rpl; @@ -1767,8 +1937,7 @@ int iwch_accept_cr(struct iw_cm_id *cm_i goto err; /* wait for wr_ack */ - wait_event(ep->com.waitq, ep->com.rpl_done); - err = ep->com.rpl_err; + err = wait_for_reply(&ep->com); if (err) goto err; @@ -1887,31 +2056,23 @@ int iwch_create_listen(struct iw_cm_id * ep->com.cm_id = cm_id; ep->backlog = backlog; ep->com.local_addr = cm_id->local_addr; + INIT_LIST_HEAD(&ep->listeners); - /* - * Allocate a server TID. 
- */ - ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); - if (ep->stid == -1) { - printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); - err = -ENOMEM; + err = alloc_listener_list(ep); + if (err) goto fail2; - } state_set(&ep->com, LISTEN); err = listen_start(ep); - if (err) - goto fail3; - /* wait for pass_open_rpl */ - wait_event(ep->com.waitq, ep->com.rpl_done); - err = ep->com.rpl_err; if (!err) { cm_id->provider_data = ep; + mutex_lock(&h->mutex); + list_add_tail(&ep->entry, &h->listen_eps); + mutex_unlock(&h->mutex); goto out; } -fail3: - cxgb3_free_stid(ep->com.tdev, ep->stid); + dealloc_listener_list(ep); fail2: cm_id->rem_ref(cm_id); put_ep(&ep->com); @@ -1923,18 +2084,20 @@ out: int iwch_destroy_listen(struct iw_cm_id *cm_id) { int err; + struct iwch_dev *h = to_iwch_dev(cm_id->device); struct iwch_listen_ep *ep = to_listen_ep(cm_id); PDBG("%s ep %p\n", __FUNCTION__, ep); might_sleep(); + mutex_lock(&h->mutex); + list_del(&ep->entry); + mutex_unlock(&h->mutex); state_set(&ep->com, DEAD); ep->com.rpl_done = 0; ep->com.rpl_err = 0; err = listen_stop(ep); - wait_event(ep->com.waitq, ep->com.rpl_done); - cxgb3_free_stid(ep->com.tdev, ep->stid); - err = ep->com.rpl_err; + dealloc_listener_list(ep); cm_id->rem_ref(cm_id); put_ep(&ep->com); return err; diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h index 6107e7c..23e5a22 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -162,10 +162,19 @@ struct iwch_ep_common { int rpl_err; }; -struct iwch_listen_ep { - struct iwch_ep_common com; +struct iwch_listen_entry { + struct list_head entry; unsigned int stid; + __be32 addr; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; /* Must be first entry! 
*/ + struct list_head entry; + struct list_head listeners; int backlog; + int listen_count; + int listen_rpls; }; struct iwch_ep { @@ -222,6 +231,8 @@ int iwch_resume_tid(struct iwch_ep *ep); void __free_ep(struct kref *kref); void iwch_rearp(struct iwch_ep *ep); int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, struct l2t_entry *l2t); +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr); +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr); int __init iwch_cm_init(void); void __exit iwch_cm_term(void); From swise at opengridcomputing.com Sun Sep 23 14:11:48 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 23 Sep 2007 16:11:48 -0500 Subject: [ofa-general] [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 Message-ID: <46F6D694.6050407@opengridcomputing.com> Please pull the latest from my libcxgb3 git repos to update the ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 and git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 Thanks, Steve.
From hadi at cyberus.ca Sun Sep 23 14:20:48 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 23 Sep 2007 17:20:48 -0400 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <46F6C059.6000600@intel.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <46F6AE18.7080708@garzik.org> <1190574713.5030.4.camel@localhost> <46F6C059.6000600@intel.com> Message-ID: <1190582448.4240.2.camel@localhost> On Sun, 2007-23-09 at 12:36 -0700, Kok, Auke wrote: > please be reminded that we're going to strip down e1000 and most of the features > should go into e1000e, which has much less hardware workarounds. I'm still > reluctant to putting in new stuff in e1000 - I really want to chop it down first ;) sure - the question then is, will you take those changes if i use e1000e? theres a few cleanups that have nothing to do with batching; take a look at the modified e1000 on the git tree. 
cheers, jamal From kliteyn at mellanox.co.il Sun Sep 23 22:12:39 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 24 Sep 2007 07:12:39 +0200 Subject: [ofa-general] nightly osm_sim report 2007-09-24:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-23 OpenSM git rev = Thu_Sep_20_21:41:18_2007 [cb9d01f98c9a68098d4db47bf160295cb521b367] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From keshetti85-student at yahoo.co.in Sun Sep 23 23:53:33 2007 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Mon, 24 Sep 2007 12:23:33 +0530 Subject: [ofa-general] Re: [query] Multi path discovery in openSM In-Reply-To: <829ded920709232341oddcf151p50d873c55d4a1724@mail.gmail.com> References: <829ded920709210125q3c4c89dak8b211267b6e31e55@mail.gmail.com> <829ded920709232341oddcf151p50d873c55d4a1724@mail.gmail.com> Message-ID: <829ded920709232353r7e4bf11ai7e261da61de15485@mail.gmail.com> > > If there are multiple paths between two end nodes in a network and > 
> I set the LMC > 0 then whether the openSM itself identifies those > > routes and updates the switch forwarding tables or is it the duty of some > > other consumer ?? > > OpenSM. I am using min-hop algorithm with openSM. Now in this case, if there are multiple paths (some are not min-hop paths) will the openSM(LMC > 0) configure those paths? regards, Mahesh From auke-jan.h.kok at intel.com Mon Sep 24 00:00:23 2007 From: auke-jan.h.kok at intel.com (Kok, Auke) Date: Mon, 24 Sep 2007 00:00:23 -0700 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <1190582448.4240.2.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <46F6AE18.7080708@garzik.org> <1190574713.5030.4.camel@localhost> <46F6C059.6000600@intel.com> <1190582448.4240.2.camel@localhost> Message-ID: <46F76087.8030109@intel.com> jamal wrote: > On Sun, 2007-23-09 at 12:36 -0700, Kok, Auke wrote: > >> please be reminded that we're going to strip down e1000 and most of the features >> should go into e1000e, which has much less hardware workarounds. I'm still >> reluctant to putting in new stuff in e1000 - I really want to chop it down first ;) > > sure - the question then is, will you take those changes if i use > e1000e? theres a few cleanups that have nothing to do with batching; > take a look at the modified e1000 on the git tree. that's bad to begin with :) - please send those separately so I can fasttrack them into e1000e and e1000 where applicable. But yes, I'm very inclined to merge more features into e1000e than e1000. I intend to put multiqueue support into e1000e, as *all* of the hardware that it will support has multiple queues. Putting in any other performance feature like tx batching would absolutely be interesting. 
Auke From eitan at mellanox.co.il Mon Sep 24 00:05:01 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 24 Sep 2007 09:05:01 +0200 Subject: [ofa-general] Re: [query] Multi path discovery in openSM In-Reply-To: <829ded920709232353r7e4bf11ai7e261da61de15485@mail.gmail.com> References: <829ded920709210125q3c4c89dak8b211267b6e31e55@mail.gmail.com><829ded920709232341oddcf151p50d873c55d4a1724@mail.gmail.com> <829ded920709232353r7e4bf11ai7e261da61de15485@mail.gmail.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C9024AA8FC@mtlexch01.mtl.com> OpenSM will always use min-hop paths (no matter what routing algorithm except maybe for LASH). If you use the default algorithms OpenSM will tend to spread traffic such that if you have used LMC=1 (2 LIDs per port) The two paths going to LID0 and LID1 will go through different systems or if not possible through different nodes. EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Keshetti Mahesh > Sent: Monday, September 24, 2007 8:54 AM > To: openIB > Subject: [ofa-general] Re: [query] Multi path discovery in openSM > > > > If there are multiple paths between two end nodes in a > network and I > > > set the LMC > 0 then whether the openSM itself identifies those > > > routes and updates the switch forwarding tables or is it > the duty of > > > some other consumer ?? > > > > OpenSM. > > I am using min-hop algorithm with openSM. > Now in this case, if there are multiple paths (some are not > min-hop paths) will the openSM(LMC > 0) configure those paths? 
> > regards, > Mahesh > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From keshetti85-student at yahoo.co.in Mon Sep 24 00:23:05 2007 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Mon, 24 Sep 2007 12:53:05 +0530 Subject: Fw: [ofa-general] Re: [query] Multi path discovery in openSM In-Reply-To: <479359.26315.qm@web8315.mail.in.yahoo.com> References: <479359.26315.qm@web8315.mail.in.yahoo.com> Message-ID: <829ded920709240023v1282341cq4e14ce29f19fba1b@mail.gmail.com> > OpenSM will always use min-hop paths (no matter what routing algorithm > except maybe for LASH). > If you use the default algorithms OpenSM will tend to spread traffic > such that if you have used LMC=1 (2 LIDs per port) > The two paths going to LID0 and LID1 will go through different systems > or if not possible through different nodes. > Using the same example you have mentioned, what happens if LMC=1 and there are 2 paths (say P1, P2 and P2 is costlier than P1) between two nodes (say N1, N2). Will the openSM still configure two different paths for LID0 and LID1? -Mahesh > EZ From eitan at mellanox.co.il Mon Sep 24 00:50:56 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 24 Sep 2007 09:50:56 +0200 Subject: Fw: [ofa-general] Re: [query] Multi path discovery in openSM In-Reply-To: <829ded920709240023v1282341cq4e14ce29f19fba1b@mail.gmail.com> References: <479359.26315.qm@web8315.mail.in.yahoo.com> <829ded920709240023v1282341cq4e14ce29f19fba1b@mail.gmail.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C9024AA95A@mtlexch01.mtl.com> > > OpenSM will always use min-hop paths (no matter what > routing algorithm > > except maybe for LASH). > > If you use the default algorithms OpenSM will tend to > spread traffic > > such that if you have used LMC=1 (2 LIDs per port) The two > paths going > > to LID0 and LID1 will go through different systems or if > not possible > > through different nodes. > > > > Using the same example you have mentioned, what happens if > LMC=1 and there are 2 paths (say P1, P2 and P2 is costlier > than P1) between two nodes (say N1, N2). > Will the openSM still configure two different paths for LID0 and LID1? I am not sure I follow "costlier" but if you mean that P1 is 3 hops and P2 is 4 hops then P2 "does not exist" from OpenSM standpoint.
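[Editorial sketch, not part of the original thread.] The point made in the reply above, that a 4-hop path "does not exist" when a 3-hop path is available, can be illustrated with a toy model. This is a minimal sketch under stated assumptions and is not OpenSM code: `struct fabric`, `link_nodes`, `lids_per_port`, `min_hops_to` and `min_hop_next_hops` are hypothetical names invented for this example, the fabric is a plain adjacency matrix, and a breadth-first search stands in for whatever hop-count computation the subnet manager really performs. The only facts taken from the thread are that LMC=1 gives 2 LIDs per port and that only min-hop paths are candidates for routing.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 16
#define NO_PATH   255   /* marker: node not reachable */

/* Toy fabric (hypothetical): adj[a][b] != 0 means a link joins a and b. */
struct fabric {
    int num_nodes;
    unsigned char adj[MAX_NODES][MAX_NODES];
};

/* Add a bidirectional link between nodes a and b. */
static void link_nodes(struct fabric *f, int a, int b)
{
    f->adj[a][b] = 1;
    f->adj[b][a] = 1;
}

/* A port with a given LMC owns 2^LMC consecutive LIDs (LMC=1 -> 2 LIDs). */
static unsigned int lids_per_port(unsigned int lmc)
{
    return 1u << lmc;
}

/* Breadth-first search from dst: hops[n] = minimum hop count from n to dst. */
static void min_hops_to(const struct fabric *f, int dst,
                        unsigned char hops[MAX_NODES])
{
    int queue[MAX_NODES];
    int head = 0, tail = 0;

    assert(f->num_nodes <= MAX_NODES);
    assert(dst >= 0 && dst < f->num_nodes);

    memset(hops, NO_PATH, MAX_NODES);
    hops[dst] = 0;
    queue[tail++] = dst;

    while (head < tail) {
        int cur = queue[head++];
        for (int n = 0; n < f->num_nodes; n++) {
            if (f->adj[cur][n] && hops[n] == NO_PATH) {
                hops[n] = hops[cur] + 1;
                queue[tail++] = n;   /* each node is queued at most once */
            }
        }
    }
}

/*
 * Count the neighbours of src that lie on some min-hop path to dst.
 * Only these are candidates when spreading the 2^LMC LIDs of dst;
 * a neighbour that leads only to a longer path is never counted.
 */
static int min_hop_next_hops(const struct fabric *f, int src, int dst)
{
    unsigned char hops[MAX_NODES];
    int count = 0;

    min_hops_to(f, dst, hops);
    if (src == dst || hops[src] == NO_PATH)
        return 0;

    for (int n = 0; n < f->num_nodes; n++)
        if (f->adj[src][n] && hops[n] + 1 == hops[src])
            count++;
    return count;
}
```

With N1 as node 0 and N2 as node 5, a 3-hop path P1 (0-1-2-5) and a 4-hop path P2 (0-3-4-6-5), `min_hop_next_hops(&f, 0, 5)` counts only the first hop of P1. In this toy model both LIDs of N2 under LMC=1 would therefore have to share P1, which matches the answer given in the thread.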
> > -Mahesh > > > EZ > From mst at dev.mellanox.co.il Mon Sep 24 01:57:48 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 24 Sep 2007 10:57:48 +0200 Subject: [ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 In-Reply-To: <46F6D694.6050407@opengridcomputing.com> References: <46F6D694.6050407@opengridcomputing.com> Message-ID: <20070924085748.GD23796@mellanox.co.il> > Quoting Steve Wise : > Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 > > Please pull the latest from my libcxgb3 git repos to update the > ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version > 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. > > git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 This looks wrong. 1.2.X releases are done from ofed_1_2 branch. 1.2.5 is just a tag. What do you want me to do? > and > > git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 OK for that one.
-- MST From vlad at lists.openfabrics.org Mon Sep 24 02:49:58 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 24 Sep 2007 02:49:58 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070924-0200 daily build status Message-ID: <20070924094958.391D2E6082C@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.15 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.15' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.12 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.12' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.12 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.12' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.14 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.14' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.13 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.13' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.14 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.14' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.12 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.)
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.12_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.12' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.)
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.15 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.15' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.13 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.13' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.13 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.13_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.13' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.15 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.15_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.15' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.14 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.14_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.14' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.17_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.22 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.22_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.22_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.22' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.19_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.16.21-0.8-default Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.c:187: error: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.16.21-0.8-default_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.16.21-0.8-default' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.21.1 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.21.1_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.21.1' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.18-8.el5 Log: /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.c:187: error: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070924-0200_linux-2.6.18-8.el5_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18-8.el5' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From mst at dev.mellanox.co.il Mon Sep 24 04:47:13 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 24 Sep 2007 13:47:13 +0200 Subject: [ofa-general] Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch In-Reply-To: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> References: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> Message-ID: <20070924114713.GB32619@mellanox.co.il> > Quoting Sean Hefty : > Subject: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch > > Roland, please pull from: > > git://git.openfabrics.org/~shefty/rdma-dev.git for-roland > > This will pick up QoS and CM scalability changes that I would like to get > into 2.6.24 (and OFED 1.3). I used git-format-patch to extract patches from this tree and add them to ofed 1.3 kernel tree. 
-- MST From eli at mellanox.co.il Mon Sep 24 05:35:00 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:35:00 +0200 Subject: [ofa-general] ipoib patches - resend subset Message-ID: <1190637300.4947.54.camel@mtls03> Hi Roland, as per your request for a smaller number of changes, I resend this subset of the previous series. From eli at mellanox.co.il Mon Sep 24 05:35:55 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:35:55 +0200 Subject: [ofa-general] [PATCH 1/11] IB/ipoib: high dma support Message-ID: <1190637355.4947.56.camel@mtls03> Add high dma support to ipoib This patch assumes all IB devices support 64 bit DMA. Signed-off-by: Eli Cohen --- Index: linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.23-rc1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-15 20:50:16.000000000 +0300 +++ linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-15 20:50:27.000000000 +0300 @@ -1079,6 +1079,8 @@ static struct net_device *ipoib_add_port SET_NETDEV_DEV(priv->dev, hca->dma_device); + priv->dev->features |= NETIF_F_HIGHDMA; + result = ib_query_pkey(hca, port, 0, &priv->pkey); if (result) { printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", From eli at mellanox.co.il Mon Sep 24 05:36:51 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:36:51 +0200 Subject: [ofa-general] [PATCH 2/11] IB/ipoib: support for sending gather skbs Message-ID: <1190637411.4947.58.camel@mtls03> From: Michael S. Tsirkin Subject: IB/ipoib: support for sending gather skbs This patch, by itself, does nothing - this prepares the ground for hardware checksum support patches. NETIF_F_SG can't be actually set without enabling hardware checksum support, so this is done by the follow-up patches. Signed-off-by: Michael S. 
Tsirkin --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 11:20:24.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:09:21.000000000 +0200 @@ -122,9 +122,61 @@ struct ipoib_rx_buf { struct ipoib_tx_buf { struct sk_buff *skb; - u64 mapping; + u64 mapping[MAX_SKB_FRAGS + 1]; }; +static inline int ipoib_dma_map_tx(struct ib_device *ca, + struct ipoib_tx_buf *tx_req) +{ + struct sk_buff *skb = tx_req->skb; + u64 *mapping = tx_req->mapping; + int frags; + int i; + + mapping[0] = ib_dma_map_single(ca, skb->data, skb_headlen(skb), + DMA_TO_DEVICE); + if (unlikely(ib_dma_mapping_error(ca, mapping[0]))) + return -EIO; + + frags = skb_shinfo(skb)->nr_frags; + for (i = 0; i < frags; ++i) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + mapping[i + 1] = ib_dma_map_page(ca, frag->page, + frag->page_offset, frag->size, + DMA_TO_DEVICE); + if (unlikely(ib_dma_mapping_error(ca, mapping[i + 1]))) + goto partial_error; + } + return 0; + +partial_error: + ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + + for (; i > 0; --i) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1]; + ib_dma_unmap_page(ca, mapping[i], frag->size, DMA_TO_DEVICE); + } + return -EIO; +} + +static inline void ipoib_dma_unmap_tx(struct ib_device *ca, + struct ipoib_tx_buf *tx_req) +{ + struct sk_buff *skb = tx_req->skb; + u64 *mapping = tx_req->mapping; + int frags; + int i; + + ib_dma_unmap_single(ca, mapping[0], skb_headlen(skb), DMA_TO_DEVICE); + + frags = skb_shinfo(skb)->nr_frags; + for (i = 0; i < frags; ++i) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + ib_dma_unmap_page(ca, mapping[i + 1], frag->size, + DMA_TO_DEVICE); + } +} + struct ib_cm_id; struct ipoib_cm_data { @@ -269,7 +321,7 @@ struct ipoib_dev_priv { struct ipoib_tx_buf *tx_ring; unsigned tx_head; unsigned 
tx_tail; - struct ib_sge tx_sge; + struct ib_sge tx_sge[MAX_SKB_FRAGS + 1]; struct ib_send_wr tx_wr; struct ib_wc ibwc[IPOIB_NUM_WC]; Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 11:20:24.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 12:23:26.000000000 +0200 @@ -491,15 +491,22 @@ repost: static inline int post_send(struct ipoib_dev_priv *priv, struct ipoib_cm_tx *tx, unsigned int wr_id, - u64 addr, int len) + u64 *mapping, int headlen, + skb_frag_t *frags, + int nr_frags) + { struct ib_send_wr *bad_wr; + int i; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; - - priv->tx_wr.wr_id = wr_id; - + priv->tx_sge[0].addr = mapping[0]; + priv->tx_sge[0].length = headlen; + for (i = 0; i < nr_frags; ++i) { + priv->tx_sge[i + 1].addr = mapping[i + 1]; + priv->tx_sge[i + 1].length = frags[i].size; + } + priv->tx_wr.num_sge = nr_frags + 1; + priv->tx_wr.wr_id = wr_id; return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr); } @@ -507,7 +514,6 @@ void ipoib_cm_send(struct net_device *de { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_tx_buf *tx_req; - u64 addr; if (unlikely(skb->len > tx->mtu)) { ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", @@ -530,20 +536,19 @@ void ipoib_cm_send(struct net_device *de */ tx_req = &tx->tx_ring[tx->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; - addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE); - if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { + if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); return; } - tx_req->mapping = addr; - if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), - addr, skb->len))) { + tx_req->mapping, skb_headlen(skb), + skb_shinfo(skb)->frags, + 
skb_shinfo(skb)->nr_frags))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; - ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(skb); } else { dev->trans_start = jiffies; @@ -577,7 +582,7 @@ static void ipoib_cm_handle_tx_wc(struct tx_req = &tx->tx_ring[wr_id]; - ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); /* FIXME: is this right? Shouldn't we only increment on success? */ ++priv->stats.tx_packets; @@ -814,7 +819,7 @@ static struct ib_qp *ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; - attr.cap.max_send_sge = 1; + attr.cap.max_send_sge = dev->features & NETIF_F_SG ? MAX_SKB_FRAGS + 1 : 1; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -981,8 +986,7 @@ static void ipoib_cm_tx_destroy(struct i if (p->tx_ring) { while ((int) p->tx_tail - (int) p->tx_head < 0) { tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; - ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, - DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(tx_req->skb); ++p->tx_tail; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-24 11:20:24.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-24 11:57:02.000000000 +0200 @@ -257,8 +257,7 @@ static void ipoib_ib_handle_tx_wc(struct tx_req = &priv->tx_ring[wr_id]; - ib_dma_unmap_single(priv->ca, tx_req->mapping, - tx_req->skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); ++priv->stats.tx_packets; priv->stats.tx_bytes += tx_req->skb->len; @@ -343,16 +342,23 @@ void ipoib_ib_completion(struct ib_cq *c static inline int post_send(struct 
ipoib_dev_priv *priv, unsigned int wr_id, struct ib_ah *address, u32 qpn, - u64 addr, int len) + u64 *mapping, int headlen, + skb_frag_t *frags, + int nr_frags) { struct ib_send_wr *bad_wr; + int i; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; - - priv->tx_wr.wr_id = wr_id; - priv->tx_wr.wr.ud.remote_qpn = qpn; - priv->tx_wr.wr.ud.ah = address; + priv->tx_sge[0].addr = mapping[0]; + priv->tx_sge[0].length = headlen; + for (i = 0; i < nr_frags; ++i) { + priv->tx_sge[i + 1].addr = mapping[i + 1]; + priv->tx_sge[i + 1].length = frags[i].size; + } + priv->tx_wr.num_sge = nr_frags + 1; + priv->tx_wr.wr_id = wr_id; + priv->tx_wr.wr.ud.remote_qpn = qpn; + priv->tx_wr.wr.ud.ah = address; return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr); } @@ -362,7 +368,6 @@ void ipoib_send(struct net_device *dev, { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_tx_buf *tx_req; - u64 addr; if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", @@ -385,20 +390,19 @@ void ipoib_send(struct net_device *dev, */ tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; - addr = ib_dma_map_single(priv->ca, skb->data, skb->len, - DMA_TO_DEVICE); - if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { + if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); return; } - tx_req->mapping = addr; if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), - address->ah, qpn, addr, skb->len))) { + address->ah, qpn, + tx_req->mapping, skb_headlen(skb), + skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; - ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(skb); } else { dev->trans_start = jiffies; @@ -604,10 +608,7 @@ int ipoib_ib_dev_stop(struct net_device while ((int) 
priv->tx_tail - (int) priv->tx_head < 0) { tx_req = &priv->tx_ring[priv->tx_tail & (ipoib_sendq_size - 1)]; - ib_dma_unmap_single(priv->ca, - tx_req->mapping, - tx_req->skb->len, - DMA_TO_DEVICE); + ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(tx_req->skb); ++priv->tx_tail; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-24 11:20:24.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-24 12:24:02.000000000 +0200 @@ -149,14 +149,14 @@ int ipoib_transport_dev_init(struct net_ .cap = { .max_send_wr = ipoib_sendq_size, .max_recv_wr = ipoib_recvq_size, - .max_send_sge = 1, + .max_send_sge = dev->features & NETIF_F_SG ? MAX_SKB_FRAGS + 1 : 1, .max_recv_sge = 1 }, .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_UD }; - int ret, size; + int i, ret, size; priv->pd = ib_alloc_pd(priv->ca); if (IS_ERR(priv->pd)) { @@ -197,11 +197,11 @@ int ipoib_transport_dev_init(struct net_ priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; - priv->tx_sge.lkey = priv->mr->lkey; + for (i = 0; i < MAX_SKB_FRAGS + 1; ++i) + priv->tx_sge[i].lkey = priv->mr->lkey; priv->tx_wr.opcode = IB_WR_SEND; - priv->tx_wr.sg_list = &priv->tx_sge; - priv->tx_wr.num_sge = 1; + priv->tx_wr.sg_list = priv->tx_sge; priv->tx_wr.send_flags = IB_SEND_SIGNALED; return 0; From eli at mellanox.co.il Mon Sep 24 05:37:31 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:37:31 +0200 Subject: [ofa-general] [PATCH 3/11] ib_core: add checksum offload support Message-ID: <1190637451.4947.60.camel@mtls03> Add checksum offload support to the core Signed-off-by: Eli Cohen --- A device that publishes IB_DEVICE_IP_CSUM actually supports calculating checksum on transmit and provides indication whether the checksum is OK on receive. 
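As a rough illustration of the contract described above, the sketch below shows how a consumer might act on the new capability bit. The flag values are copied from the ib_verbs.h hunk in this patch; the two helper functions and the plain `int csum_ok` argument are hypothetical stand-ins written for this note and are not part of the series. It is a userspace sketch under those assumptions, not the ipoib implementation.

```c
/* Flag values copied from the ib_verbs.h hunk of this patch. */
enum ib_device_cap_flags {
	IB_DEVICE_IP_CSUM = (1 << 18),
};

enum ib_send_flags {
	IB_SEND_SIGNALED     = (1 << 1),
	IB_SEND_IP_CSUM      = (1 << 4),
	IB_SEND_UDP_TCP_CSUM = (1 << 5),
};

/*
 * Hypothetical helper (not in the patch): choose send flags for a work
 * request from the capability flags the device reported. Checksum
 * offload is requested only when the device publishes IB_DEVICE_IP_CSUM.
 */
int pick_send_flags(int device_cap_flags)
{
	int flags = IB_SEND_SIGNALED;

	if (device_cap_flags & IB_DEVICE_IP_CSUM)
		flags |= IB_SEND_IP_CSUM | IB_SEND_UDP_TCP_CSUM;
	return flags;
}

/*
 * Hypothetical helper (not in the patch): decide whether the receive
 * path may skip software checksum verification. csum_ok stands in for
 * the new ib_wc.csum_ok field; it is trusted only if the device
 * advertises the capability.
 */
int rx_can_skip_sw_csum(int device_cap_flags, int csum_ok)
{
	return (device_cap_flags & IB_DEVICE_IP_CSUM) && csum_ok;
}
```

A device without the capability bit would get plain IB_SEND_SIGNALED from the first helper, and the second helper would always fall back to software verification for it.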
Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/rdma/ib_verbs.h 2007-09-24 13:24:22.000000000 +0200 +++ ofa_1_3_dev_kernel/include/rdma/ib_verbs.h 2007-09-24 13:24:40.000000000 +0200 @@ -95,7 +95,8 @@ enum ib_device_cap_flags { IB_DEVICE_N_NOTIFY_CQ = (1<<14), IB_DEVICE_ZERO_STAG = (1<<15), IB_DEVICE_SEND_W_INV = (1<<16), - IB_DEVICE_MEM_WINDOW = (1<<17) + IB_DEVICE_MEM_WINDOW = (1<<17), + IB_DEVICE_IP_CSUM = (1<<18), }; enum ib_atomic_cap { @@ -431,6 +432,7 @@ struct ib_wc { u8 sl; u8 dlid_path_bits; u8 port_num; /* valid only for DR SMPs on switches */ + int csum_ok; }; enum ib_cq_notify_flags { @@ -615,7 +617,9 @@ enum ib_send_flags { IB_SEND_FENCE = 1, IB_SEND_SIGNALED = (1<<1), IB_SEND_SOLICITED = (1<<2), - IB_SEND_INLINE = (1<<3) + IB_SEND_INLINE = (1<<3), + IB_SEND_IP_CSUM = (1<<4), + IB_SEND_UDP_TCP_CSUM = (1<<5) }; struct ib_sge { From eli at mellanox.co.il Mon Sep 24 05:38:42 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:38:42 +0200 Subject: [ofa-general] [PATCH 5/11]: mlx4_ib: add checksum offload support Message-ID: <1190637522.4947.64.camel@mtls03> Add checksum offload support to mlx4 Signed-off-by: Ali Ayub Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/include/linux/mlx4/cq.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/cq.h 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/include/linux/mlx4/cq.h 2007-09-24 12:36:46.000000000 +0200 @@ -45,11 +45,11 @@ struct mlx4_cqe { u8 sl; u8 reserved1; __be16 rlid; - u32 reserved2; + __be32 ipoib_status; __be32 byte_cnt; __be16 wqe_index; __be16 checksum; - u8 reserved3[3]; + u8 reserved2[3]; u8 owner_sr_opcode; }; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/cq.c 
2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 12:38:29.000000000 +0200 @@ -439,6 +439,8 @@ static int mlx4_ib_poll_one(struct mlx4_ wc->wc_flags |= be32_to_cpu(cqe->g_mlpath_rqpn) & 0x80000000 ? IB_WC_GRH : 0; wc->pkey_index = be32_to_cpu(cqe->immed_rss_invalid) >> 16; + wc->csum_ok = be32_to_cpu(cqe->ipoib_status) & 0x10000000 && + be16_to_cpu(cqe->checksum) == 0xffff; } return 0; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c 2007-09-24 12:36:46.000000000 +0200 @@ -100,6 +100,8 @@ static int mlx4_ib_query_device(struct i props->device_cap_flags |= IB_DEVICE_AUTO_PATH_MIG; if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_UD_AV_PORT) props->device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE; + if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + props->device_cap_flags |= IB_DEVICE_IP_CSUM; props->vendor_id = be32_to_cpup((__be32 *) (out_mad->data + 36)) & 0xffffff; @@ -626,6 +628,9 @@ static void *mlx4_ib_add(struct mlx4_dev ibdev->ib_dev.unmap_fmr = mlx4_ib_unmap_fmr; ibdev->ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; + if (ibdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + ibdev->ib_dev.flags |= IB_DEVICE_IP_CSUM; + if (init_node_data(ibdev)) goto err_map; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-24 12:36:46.000000000 +0200 @@ -1433,6 +1433,10 @@ int mlx4_ib_post_send(struct ib_qp *ibqp cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE) : 0) | (wr->send_flags & IB_SEND_SOLICITED ? 
cpu_to_be32(MLX4_WQE_CTRL_SOLICITED) : 0) | + ((wr->send_flags & IB_SEND_IP_CSUM) ? + cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM) : 0) | + ((wr->send_flags & IB_SEND_UDP_TCP_CSUM) ? + cpu_to_be32(MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) | qp->sq_signal_bits; if (wr->opcode == IB_WR_SEND_WITH_IMM || Index: ofa_1_3_dev_kernel/include/linux/mlx4/qp.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/qp.h 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/include/linux/mlx4/qp.h 2007-09-24 12:36:46.000000000 +0200 @@ -162,6 +162,8 @@ enum { MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, + MLX4_WQE_CTRL_IP_CSUM = 1 << 4, + MLX4_WQE_CTRL_TCP_UDP_CSUM = 1 << 5, }; struct mlx4_wqe_ctrl_seg { Index: ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/fw.c 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/fw.c 2007-09-24 12:36:46.000000000 +0200 @@ -741,6 +741,9 @@ int mlx4_INIT_HCA(struct mlx4_dev *dev, MLX4_PUT(inbox, (u8) (PAGE_SHIFT - 12), INIT_HCA_UAR_PAGE_SZ_OFFSET); MLX4_PUT(inbox, param->log_uar_sz, INIT_HCA_LOG_UAR_SZ_OFFSET); + if (dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 3); + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_INIT_HCA, 1000); if (err) From eli at mellanox.co.il Mon Sep 24 05:39:11 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:39:11 +0200 Subject: [ofa-general] [PATCH 6/11] IB/ipoib: add checksum offload support Message-ID: <1190637551.4947.66.camel@mtls03> Add checksum offload support to ipoib Signed-off-by: Eli Cohen Signed-off-by: Ali Ayub --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h
=================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:09:21.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:49:00.000000000 +0200 @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_RX_CSUM = 11, IPOIB_MAX_BACKOFF_SECONDS = 16, Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 12:23:26.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 13:05:21.000000000 +0200 @@ -1258,6 +1258,13 @@ static ssize_t set_mode(struct device *d set_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); ipoib_warn(priv, "enabling connected mode " "will cause multicast packet drops\n"); + + /* clear ipv6 flag too */ + dev->features &= ~NETIF_F_IP_CSUM; + + priv->tx_wr.send_flags &= + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + ipoib_flush_paths(dev); return count; } @@ -1266,6 +1273,10 @@ static ssize_t set_mode(struct device *d clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); dev->mtu = min(priv->mcast_mtu, dev->mtu); ipoib_flush_paths(dev); + + if (priv->ca->flags & IB_DEVICE_IP_CSUM) + dev->features |= NETIF_F_IP_CSUM; /* ipv6 too */ + return count; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-24 11:57:02.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-24 13:03:27.000000000 +0200 @@ -37,6 +37,7 @@ #include #include +#include #include @@ -231,6 +232,16 @@ static void ipoib_ib_handle_rx_wc(struct skb->dev = dev; /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; + + /* check rx csum */ + 
if (test_bit(IPOIB_FLAG_RX_CSUM, &priv->flags) && likely(wc->csum_ok)) { + /* Note: this is a specific requirement for Mellanox + HW but since it is the only HW currently supporting + checksum offload I put it here */ + if ((((struct iphdr *)(skb->data))->ihl) == 5) + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + netif_receive_skb(skb); repost: @@ -396,6 +407,15 @@ void ipoib_send(struct net_device *dev, return; } + if (priv->ca->flags & IB_DEVICE_IP_CSUM && + skb->ip_summed == CHECKSUM_PARTIAL) + priv->tx_wr.send_flags |= + IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM; + else + priv->tx_wr.send_flags &= + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + + if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, tx_req->mapping, skb_headlen(skb), Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 12:23:00.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 13:04:52.000000000 +0200 @@ -1109,6 +1109,29 @@ int ipoib_add_pkey_attr(struct net_devic return device_create_file(&dev->dev, &dev_attr_pkey); } +static void set_tx_csum(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags)) + return; + + if (!(priv->ca->flags & IB_DEVICE_IP_CSUM)) + return; + + dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; /* turn on ipv6 too */ +} + +static void set_rx_csum(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!(priv->ca->flags & IB_DEVICE_IP_CSUM)) + return; + + set_bit(IPOIB_FLAG_RX_CSUM, &priv->flags); +} + static struct net_device *ipoib_add_port(const char *format, struct ib_device *hca, u8 port) { @@ -1165,6 +1188,9 @@ static struct net_device *ipoib_add_port goto event_failed; } + set_tx_csum(priv->dev); + set_rx_csum(priv->dev); + result = 
register_netdev(priv->dev); if (result) { printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", From eli at mellanox.co.il Mon Sep 24 05:39:42 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:39:42 +0200 Subject: [ofa-general] [PATCH 7/11] IB/ipoib: Add ethtool support Message-ID: <1190637582.4947.68.camel@mtls03> Add ethtool support to ipoib Signed-off-by: Eli Cohen --- This one is actually the foundation with no real context. I think we can add here all the logic for deciding whether to allow using a certain feature, e.g. checksum offload, scatter/gather etc., and decide on all the dependencies. Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/Makefile =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/Makefile 2007-09-24 11:19:04.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/Makefile 2007-09-24 13:07:43.000000000 +0200 @@ -4,7 +4,8 @@ ib_ipoib-y := ipoib_main.o \ ipoib_ib.o \ ipoib_multicast.o \ ipoib_verbs.o \ - ipoib_vlan.o + ipoib_vlan.o \ + ipoib_etool.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM) += ipoib_cm.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c 2007-09-24 13:07:43.000000000 +0200 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2007 Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_etool.c $ + */ + +#include +#include +#include + +#include "ipoib.h" + +static void ipoib_get_drvinfo(struct net_device *netdev, + struct ethtool_drvinfo *drvinfo) +{ + strncpy(drvinfo->driver, "ipoib", sizeof(drvinfo->driver) - 1); +} + +static const struct ethtool_ops ipoib_ethtool_ops = { + .get_drvinfo = ipoib_get_drvinfo, + .get_tso = ethtool_op_get_tso, +}; + +void ipoib_set_ethtool_ops(struct net_device *dev) +{ + SET_ETHTOOL_OPS(dev, &ipoib_ethtool_ops); +} Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:49:00.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 13:07:43.000000000 +0200 @@ -485,6 +485,8 @@ void ipoib_pkey_poll(struct work_struct int ipoib_pkey_dev_delay_open(struct net_device *dev); void ipoib_drain_cq(struct net_device *dev); +void ipoib_set_ethtool_ops(struct net_device *dev); + #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_FLAGS_RC 0x80 Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 13:04:52.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 13:07:43.000000000 +0200 @@ -992,6 +992,7 @@ static void ipoib_setup(struct net_devic dev->neigh_setup = ipoib_neigh_setup_dev; dev->poll = ipoib_poll; dev->weight = 100; + ipoib_set_ethtool_ops(dev); dev->watchdog_timeo = HZ; From eli at mellanox.co.il Mon Sep 24 05:40:12 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:40:12 +0200 Subject: [ofa-general] [PATCH 8/11]: Add support for modifying CQ params Message-ID: <1190637612.4947.70.camel@mtls03> Add support for modifying CQ parameters for controlling event generation moderation. 
This allows control of the rate of event (interrupt) generation by specifying a minimum number of CQEs or a minimum period of time required to generate an event. Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/rdma/ib_verbs.h 2007-09-24 12:33:41.000000000 +0200 +++ ofa_1_3_dev_kernel/include/rdma/ib_verbs.h 2007-09-24 13:07:59.000000000 +0200 @@ -967,6 +967,8 @@ struct ib_device { int comp_vector, struct ib_ucontext *context, struct ib_udata *udata); + int (*modify_cq)(struct ib_cq *cq, u16 cq_count, + u16 cq_period); int (*destroy_cq)(struct ib_cq *cq); int (*resize_cq)(struct ib_cq *cq, int cqe, struct ib_udata *udata); @@ -1372,6 +1374,16 @@ struct ib_cq *ib_create_cq(struct ib_dev int ib_resize_cq(struct ib_cq *cq, int cqe); /** + * ib_modify_cq - Modifies moderation params of the CQ + * @cq: The CQ to modify. + * @cq_count: number of CQEs that will trigger an event + * @cq_period: max period of time before triggering an event + * + * Users can examine the cq structure to determine the actual CQ size. + */ +int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period); + +/** * ib_destroy_cq - Destroys the specified CQ. * @cq: The CQ to destroy. */ Index: ofa_1_3_dev_kernel/drivers/infiniband/core/verbs.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/core/verbs.c 2007-09-24 11:19:03.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/core/verbs.c 2007-09-24 13:07:59.000000000 +0200 @@ -628,6 +628,13 @@ struct ib_cq *ib_create_cq(struct ib_dev } EXPORT_SYMBOL(ib_create_cq); +int ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period) +{ + return cq->device->modify_cq ?
+ cq->device->modify_cq(cq, cq_count, cq_period) : -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_cq); + int ib_destroy_cq(struct ib_cq *cq) { if (atomic_read(&cq->usecnt)) From eli at mellanox.co.il Mon Sep 24 05:40:39 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:40:39 +0200 Subject: [ofa-general] [PATCH 9/11] mlx4_ib: add support for modifying CQ parameters Message-ID: <1190637639.4947.72.camel@mtls03> Add support for modifying CQ parameters. Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2007-09-24 12:36:46.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/main.c 2007-09-24 13:08:55.000000000 +0200 @@ -613,6 +613,7 @@ static void *mlx4_ib_add(struct mlx4_dev ibdev->ib_dev.post_send = mlx4_ib_post_send; ibdev->ib_dev.post_recv = mlx4_ib_post_recv; ibdev->ib_dev.create_cq = mlx4_ib_create_cq; + ibdev->ib_dev.modify_cq = mlx4_ib_modify_cq; ibdev->ib_dev.destroy_cq = mlx4_ib_destroy_cq; ibdev->ib_dev.poll_cq = mlx4_ib_poll_cq; ibdev->ib_dev.req_notify_cq = mlx4_ib_arm_cq; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 12:38:29.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/cq.c 2007-09-24 13:08:55.000000000 +0200 @@ -91,6 +91,25 @@ static struct mlx4_cqe *next_cqe_sw(stru return get_sw_cqe(cq, cq->mcq.cons_index); } +int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period) +{ + struct mlx4_ib_cq *mcq = to_mcq(cq); + struct mlx4_ib_dev *dev = to_mdev(cq->device); + struct mlx4_cq_context *context; + int err; + + context = kzalloc(sizeof *context, GFP_KERNEL); + if (!context) + return -ENOMEM; + + context->cq_period =
cpu_to_be16(cq_period); + context->cq_max_count = cpu_to_be16(cq_count); + err = mlx4_cq_modify(dev->dev, &mcq->mcq, context, 1); + + kfree(context); + return err; +} + struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata) Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-24 11:19:03.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-09-24 13:08:55.000000000 +0200 @@ -249,6 +249,7 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct struct ib_udata *udata); int mlx4_ib_dereg_mr(struct ib_mr *mr); +int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period); struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata); Index: ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/cq.c 2007-09-24 11:19:03.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c 2007-09-24 13:08:55.000000000 +0200 @@ -38,33 +38,11 @@ #include #include +#include #include "mlx4.h" #include "icm.h" -struct mlx4_cq_context { - __be32 flags; - u16 reserved1[3]; - __be16 page_offset; - __be32 logsize_usrpage; - u8 reserved2; - u8 cq_period; - u8 reserved3; - u8 cq_max_count; - u8 reserved4[3]; - u8 comp_eqn; - u8 log_page_size; - u8 reserved5[2]; - u8 mtt_base_addr_h; - __be32 mtt_base_addr_l; - __be32 last_notified_index; - __be32 solicit_producer_index; - __be32 consumer_index; - __be32 producer_index; - u32 reserved6[2]; - __be64 db_rec_addr; -}; - #define MLX4_CQ_STATUS_OK ( 0 << 28) #define MLX4_CQ_STATUS_OVERFLOW ( 9 << 28) #define MLX4_CQ_STATUS_WRITE_FAIL (10 << 28) @@ -121,6 +99,13 @@ static int mlx4_SW2HW_CQ(struct mlx4_dev MLX4_CMD_TIME_CLASS_A); 
} +static int mlx4_MODIFY_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, + int cq_num, u32 opmod) +{ + return mlx4_cmd(dev, mailbox->dma, cq_num, opmod, MLX4_CMD_MODIFY_CQ, + MLX4_CMD_TIME_CLASS_A); +} + static int mlx4_HW2SW_CQ(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox, int cq_num) { @@ -206,6 +191,24 @@ err_out: } EXPORT_SYMBOL_GPL(mlx4_cq_alloc); +int mlx4_cq_modify(struct mlx4_dev *dev, struct mlx4_cq *cq, + struct mlx4_cq_context *context, int modify) +{ + struct mlx4_cmd_mailbox *mailbox; + int err; + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + memcpy(mailbox->buf, context, sizeof *context); + err = mlx4_MODIFY_CQ(dev, mailbox, cq->cqn, modify); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} +EXPORT_SYMBOL_GPL(mlx4_cq_modify); + void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) { struct mlx4_priv *priv = mlx4_priv(dev); Index: ofa_1_3_dev_kernel/include/linux/mlx4/cq.h =================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/cq.h 2007-09-24 12:36:46.000000000 +0200 +++ ofa_1_3_dev_kernel/include/linux/mlx4/cq.h 2007-09-24 13:08:55.000000000 +0200 @@ -38,6 +38,27 @@ #include #include +struct mlx4_cq_context { + __be32 flags; + u16 reserved1[3]; + __be16 page_offset; + __be32 logsize_usrpage; + u16 cq_period; + u16 cq_max_count; + u8 reserved4[3]; + u8 comp_eqn; + u8 log_page_size; + u8 reserved5[2]; + u8 mtt_base_addr_h; + __be32 mtt_base_addr_l; + __be32 last_notified_index; + __be32 solicit_producer_index; + __be32 consumer_index; + __be32 producer_index; + u32 reserved6[2]; + __be64 db_rec_addr; +}; + struct mlx4_cqe { __be32 my_qpn; __be32 immed_rss_invalid; @@ -120,4 +141,8 @@ enum { MLX4_CQ_DB_REQ_NOT = 2 << 24 }; + +int mlx4_cq_modify(struct mlx4_dev *dev, struct mlx4_cq *cq, + struct mlx4_cq_context *context, int resize); + #endif /* MLX4_CQ_H */ Index: ofa_1_3_dev_kernel/include/linux/mlx4/cmd.h 
=================================================================== --- ofa_1_3_dev_kernel.orig/include/linux/mlx4/cmd.h 2007-09-24 11:19:03.000000000 +0200 +++ ofa_1_3_dev_kernel/include/linux/mlx4/cmd.h 2007-09-24 13:08:55.000000000 +0200 @@ -81,7 +81,7 @@ enum { MLX4_CMD_SW2HW_CQ = 0x16, MLX4_CMD_HW2SW_CQ = 0x17, MLX4_CMD_QUERY_CQ = 0x18, - MLX4_CMD_RESIZE_CQ = 0x2c, + MLX4_CMD_MODIFY_CQ = 0x2c, /* SRQ commands */ MLX4_CMD_SW2HW_SRQ = 0x35, From eli at mellanox.co.il Mon Sep 24 05:41:24 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:41:24 +0200 Subject: [ofa-general] [PATCH 10/11]: IB/ipoib modify cq params Message-ID: <1190637684.4947.74.camel@mtls03> Implement support for modifying IPoIB CQ moderation params. This can be used to tune, at run time, the parameters controlling the event (interrupt) generation rate and thus reduce the overhead incurred by handling interrupts, resulting in better throughput. Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 13:07:43.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 13:12:21.000000000 +0200 @@ -270,6 +270,13 @@ struct ipoib_cm_dev_priv { struct ib_recv_wr rx_wr; }; +struct ipoib_ethtool_st { + u16 rx_coalesce_usecs; + u16 tx_coalesce_usecs; + u16 rx_max_coalesced_frames; + u16 tx_max_coalesced_frames; +}; + /* * Device private locking: tx_lock protects members used in TX fast * path (and we use LLTX so upper layers don't do extra locking).
@@ -346,6 +353,7 @@ struct ipoib_dev_priv { struct dentry *mcg_dentry; struct dentry *path_dentry; #endif + struct ipoib_ethtool_st etool; }; struct ipoib_ah { Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_etool.c 2007-09-24 13:07:43.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_etool.c 2007-09-24 13:09:26.000000000 +0200 @@ -44,9 +44,49 @@ static void ipoib_get_drvinfo(struct net strncpy(drvinfo->driver, "ipoib", sizeof(drvinfo->driver) - 1); } +static int ipoib_get_coalesce(struct net_device *dev, + struct ethtool_coalesce *coal) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + coal->rx_coalesce_usecs = priv->etool.rx_coalesce_usecs; + coal->tx_coalesce_usecs = priv->etool.tx_coalesce_usecs; + coal->rx_max_coalesced_frames = priv->etool.rx_max_coalesced_frames; + coal->tx_max_coalesced_frames = priv->etool.tx_max_coalesced_frames; + + return 0; +} + +static int ipoib_set_coalesce(struct net_device *dev, + struct ethtool_coalesce *coal) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + if (coal->rx_coalesce_usecs > 0xffff || + coal->tx_coalesce_usecs > 0xffff || + coal->rx_max_coalesced_frames > 0xffff || + coal->tx_max_coalesced_frames > 0xffff) + return -EINVAL; + + ret = ib_modify_cq(priv->cq, coal->rx_max_coalesced_frames, + coal->rx_coalesce_usecs); + if (ret) + return ret; + + priv->etool.rx_coalesce_usecs = coal->rx_coalesce_usecs; + priv->etool.tx_coalesce_usecs = coal->tx_coalesce_usecs; + priv->etool.rx_max_coalesced_frames = coal->rx_max_coalesced_frames; + priv->etool.tx_max_coalesced_frames = coal->tx_max_coalesced_frames; + + return 0; +} + static const struct ethtool_ops ipoib_ethtool_ops = { .get_drvinfo = ipoib_get_drvinfo, .get_tso = ethtool_op_get_tso, + .get_coalesce = ipoib_get_coalesce, + .set_coalesce = ipoib_set_coalesce, }; void
ipoib_set_ethtool_ops(struct net_device *dev) From eli at mellanox.co.il Mon Sep 24 05:42:07 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:42:07 +0200 Subject: [ofa-general] [PATCH 11/11]: mlx4_core use fixed CQ moderation paramters Message-ID: <1190637727.4947.76.camel@mtls03> From: Michael S. Tsirkin Subject: IB/ipoib: support for sending gather skbs Enable interrupt coalescing for CQs in mlx4. Signed-off-by: Michael S. Tsirkin --- Index: ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/cq.c 2007-09-24 13:08:55.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c 2007-09-24 13:12:42.000000000 +0200 @@ -43,6 +43,14 @@ #include "mlx4.h" #include "icm.h" +static int cq_max_count = 16; +static int cq_period = 10; + +module_param(cq_max_count, int, 0444); +MODULE_PARM_DESC(cq_max_count, "number of CQEs to generate event"); +module_param(cq_period, int, 0444); +MODULE_PARM_DESC(cq_period, "time in usec for CQ event generation"); + #define MLX4_CQ_STATUS_OK ( 0 << 28) #define MLX4_CQ_STATUS_OVERFLOW ( 9 << 28) #define MLX4_CQ_STATUS_WRITE_FAIL (10 << 28) From eli at mellanox.co.il Mon Sep 24 05:46:43 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 24 Sep 2007 14:46:43 +0200 Subject: [ofa-general] [PATCH 4/11] ib_mthca: add checksum offload support Message-ID: <1190638003.4947.79.camel@mtls03> Add checksum offload support in mthca Signed-off-by: Eli Cohen --- resending - adding the openfabrics list Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-09-24 12:34:59.000000000 +0200 @@ -1377,6 +1377,9 @@ int mthca_INIT_HCA(struct mthca_dev *dev MTHCA_PUT(inbox, param->uarc_base, 
INIT_HCA_UAR_CTX_BASE_OFFSET); } + if (dev->device_cap_flags & IB_DEVICE_IP_CSUM) + *(inbox + INIT_HCA_FLAGS2_OFFSET / 4) |= cpu_to_be32(7 << 3); + err = mthca_cmd(dev, mailbox->dma, 0, 0, CMD_INIT_HCA, HZ, status); mthca_free_mailbox(dev, mailbox); Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cmd.h 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cmd.h 2007-09-24 12:34:59.000000000 +0200 @@ -103,6 +103,7 @@ enum { DEV_LIM_FLAG_RAW_IPV6 = 1 << 4, DEV_LIM_FLAG_RAW_ETHER = 1 << 5, DEV_LIM_FLAG_SRQ = 1 << 6, + DEV_LIM_FLAG_IPOIB_CSUM = 1 << 7, DEV_LIM_FLAG_BAD_PKEY_CNTR = 1 << 8, DEV_LIM_FLAG_BAD_QKEY_CNTR = 1 << 9, DEV_LIM_FLAG_MW = 1 << 16, Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_cq.c 2007-09-24 12:36:06.000000000 +0200 @@ -119,7 +119,8 @@ struct mthca_cqe { __be32 my_qpn; __be32 my_ee; __be32 rqpn; - __be16 sl_g_mlpath; + u8 sl_ipok; + u8 g_mlpath; __be16 rlid; __be32 imm_etype_pkey_eec; __be32 byte_cnt; @@ -498,6 +499,7 @@ static inline int mthca_poll_one(struct int is_send; int free_cqe = 1; int err = 0; + u16 checksum; cqe = next_cqe_sw(cq); if (!cqe) @@ -639,12 +641,14 @@ static inline int mthca_poll_one(struct break; } entry->slid = be16_to_cpu(cqe->rlid); - entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->sl = cqe->sl_ipok >> 4; entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; - entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->dlid_path_bits = cqe->g_mlpath & 0x7f; entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; - entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
- IB_WC_GRH : 0; + entry->wc_flags |= cqe->g_mlpath & 0x80 ? IB_WC_GRH : 0; + checksum = (be32_to_cpu(cqe->rqpn) >> 24) | + ((be32_to_cpu(cqe->my_ee) >> 16) & 0xff00); + entry->csum_ok = (cqe->sl_ipok & 1 && checksum == 0xffff); } entry->status = IB_WC_SUCCESS; Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_main.c 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_main.c 2007-09-24 12:34:59.000000000 +0200 @@ -289,6 +289,10 @@ static int mthca_dev_lim(struct mthca_de if (dev_lim->flags & DEV_LIM_FLAG_SRQ) mdev->mthca_flags |= MTHCA_FLAG_SRQ; + if (mthca_is_memfree(mdev)) + if (dev_lim->flags & DEV_LIM_FLAG_IPOIB_CSUM) + mdev->device_cap_flags |= IB_DEVICE_IP_CSUM; + return 0; } @@ -1125,6 +1129,8 @@ static int __mthca_init_one(struct pci_d if (err) goto err_cmd; + mdev->ib_dev.flags = mdev->device_cap_flags; + if (mdev->fw_ver < mthca_hca_table[hca_type].latest_fw) { mthca_warn(mdev, "HCA FW version %d.%d.%03d is old (%d.%d.%03d is current).\n", (int) (mdev->fw_ver >> 32), (int) (mdev->fw_ver >> 16) & 0xffff, Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_qp.c 2007-09-24 12:34:59.000000000 +0200 @@ -2024,6 +2024,10 @@ int mthca_arbel_post_send(struct ib_qp * cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | ((wr->send_flags & IB_SEND_SOLICITED) ? cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + ((wr->send_flags & IB_SEND_IP_CSUM) ? + cpu_to_be32(MTHCA_NEXT_IP_CSUM) : 0) | + ((wr->send_flags & IB_SEND_UDP_TCP_CSUM) ? 
+ cpu_to_be32(MTHCA_NEXT_TCP_UDP_CSUM) : 0) | cpu_to_be32(1); if (wr->opcode == IB_WR_SEND_WITH_IMM || wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) Index: ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_wqe.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mthca/mthca_wqe.h 2007-09-24 11:19:08.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mthca/mthca_wqe.h 2007-09-24 12:34:59.000000000 +0200 @@ -38,14 +38,15 @@ #include enum { - MTHCA_NEXT_DBD = 1 << 7, - MTHCA_NEXT_FENCE = 1 << 6, - MTHCA_NEXT_CQ_UPDATE = 1 << 3, - MTHCA_NEXT_EVENT_GEN = 1 << 2, - MTHCA_NEXT_SOLICIT = 1 << 1, - - MTHCA_MLX_VL15 = 1 << 17, - MTHCA_MLX_SLR = 1 << 16 + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + MTHCA_NEXT_IP_CSUM = 1 << 4, + MTHCA_NEXT_TCP_UDP_CSUM = 1 << 5, + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 }; enum { From sashak at voltaire.com Mon Sep 24 05:58:34 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 24 Sep 2007 14:58:34 +0200 Subject: [ofa-general] Re: [query] Multi path discovery in openSM In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9024AA8FC@mtlexch01.mtl.com> References: <829ded920709232353r7e4bf11ai7e261da61de15485@mail.gmail.com> <6C2C79E72C305246B504CBA17B5500C9024AA8FC@mtlexch01.mtl.com> Message-ID: <20070924125834.GE2131@sashak.voltaire.com> On 09:05 Mon 24 Sep , Eitan Zahavi wrote: > OpenSM will always use min-hop paths (no matter what routing algorithm I would clarify here - For LMC > 0 OpenSM will choose different paths between _discovered_ shortest paths. For min-hop algorithm those shortest paths are real min-hops paths. For Up/Down it is min-hop paths which satisfies Up/Down constraint. > except maybe for LASH). For LASH too (LASH is abbreviation of "LAyered SHortest paths"). 
There, different layers (VLs in the case of IB) are used for credit loop resolution. However, the current LASH implementation does not support LMC > 0. Sasha > If you use the default algorithms OpenSM will tend to spread traffic > such that if you have used LMC=1 (2 LIDs per port) > The two paths going to LID0 and LID1 will go through different systems > or if not possible through different nodes. > > EZ > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > > Keshetti Mahesh > > Sent: Monday, September 24, 2007 8:54 AM > > To: openIB > > Subject: [ofa-general] Re: [query] Multi path discovery in openSM > > > > > > If there are multiple paths between two end nodes in a > > network and I > > > > set the LMC > 0 then whether the openSM itself identifies those > > > > routes and updates the switch forwarding tables or is it > > the duty of > > > > some other consumer ?? > > > > > > OpenSM. > > > > I am using min-hop algorithm with openSM. > > Now in this case, if there are multiple paths (some are not > > min-hop paths) will the openSM(LMC > 0) configure those paths?
> > > > regards, > > Mahesh From mst at dev.mellanox.co.il Mon Sep 24 05:50:04 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 24 Sep 2007 14:50:04 +0200 Subject: [ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 In-Reply-To: <20070924085748.GD23796@mellanox.co.il> References: <46F6D694.6050407@opengridcomputing.com> <20070924085748.GD23796@mellanox.co.il> Message-ID: <20070924125004.GA11953@mellanox.co.il> > Quoting Michael S. Tsirkin : > Subject: Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 > > > Quoting Steve Wise : > > Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 > > > > Please pull the latest from my libcxgb3 git repos to update the > > ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version > > 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. > > > > git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 > > This looks wrong. 1.2.X releases are done from ofed_1_2 branch. > 1.2.5 is just a tag. What do you want me to do? I figured it out. done. -- MST From mst at dev.mellanox.co.il Mon Sep 24 06:01:40 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 24 Sep 2007 15:01:40 +0200 Subject: [ofa-general] [PATCHv3] IB/ipoib: HW checksum support In-Reply-To: <20070904091133.GA23437@mellanox.co.il> References: <20070830130852.GF2532@mellanox.co.il> <20070904091133.GA23437@mellanox.co.il> Message-ID: <20070924130139.GB11953@mellanox.co.il> Add module option hw_csum: when set, IPoIB will report HW CSUM and S/G support, and rely on hardware end-to-end transport checksum (ICRC) instead of software-level protocol checksums. Forwarding such packets outside the IB subnet would increase the risk of data corruption, so it is safest not to set hw_csum flag on gateways. To reduce the chance of this routing triggering data corruption by mistake, on RX we set skb checksum field to CHECKSUM_UNNECESSARY - this way if such a packet ends up outside the IB network, it is detected as malformed and dropped. To enable interoperability with IEEE IPoIB, checksum for outgoing packets is calculated in software unless the remote advertises hw_csum capability by setting a bit in hardware address flag. Signed-off-by: Michael S. Tsirkin --- This patch has to be applied on top of [PATCH 2/11] IB/ipoib: support for sending gather skbs. Updates since v2: Enable interoperability with IEEE IPoIB. Split out S/G support to a separate patch. Updates since v1: fixed thinko in setting header flags. When applied on top of previously posted mlx4 patches, and with hw_csum enabled on both ends, this patch speeds up single-stream netperf bandwidth on connectx DDR from 1000 to 1250 MBytes/sec. 
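The send-side and receive-side rules described above can be summarized in a small user-space sketch. This is only an illustration under stated assumptions, not the kernel code from the patch: every name below (decide_tx, rx_checksum_unnecessary, HDR_F_HWCSUM, ADDR_F_HWCSUM, hwcsum_self_check) is hypothetical, the capability bit is assumed to live in byte 0 of the peer hardware address as the description suggests, and wire byte order of the header flag is ignored.

```c
/* User-space sketch of the hw_csum TX/RX decisions; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_F_HWCSUM  0x1   /* per-packet header flag (byte order ignored here) */
#define ADDR_F_HWCSUM 0x01  /* capability bit, assumed in byte 0 of the hw address */

enum tx_action {
    TX_PLAIN,        /* offload not in use for this packet: flags stay 0 */
    TX_RELY_ON_ICRC, /* peer advertises hw_csum: set header flag, skip sw checksum */
    TX_SW_CHECKSUM   /* peer does not advertise it: checksum in software */
};

struct tx_decision {
    enum tx_action action;
    uint16_t header_flags;
};

/* local_hw_csum: the local hw_csum option is set.
 * csum_pending:  the stack left the checksum for the device to fill in.
 * daddr:         peer hardware address, or NULL if unknown. */
struct tx_decision decide_tx(bool local_hw_csum, bool csum_pending,
                             const uint8_t *daddr)
{
    struct tx_decision d = { TX_PLAIN, 0 };

    if (!local_hw_csum || !csum_pending)
        return d;
    if (daddr != NULL && (daddr[0] & ADDR_F_HWCSUM)) {
        d.action = TX_RELY_ON_ICRC;
        d.header_flags = HDR_F_HWCSUM;
        return d;
    }
    d.action = TX_SW_CHECKSUM; /* interoperate with peers lacking hw_csum */
    return d;
}

/* RX side: the checksum is treated as already verified only if the sender
 * set the header flag. */
bool rx_checksum_unnecessary(uint16_t header_flags)
{
    return (header_flags & HDR_F_HWCSUM) != 0;
}

/* Small usage example covering the three TX outcomes and the RX rule. */
void hwcsum_self_check(void)
{
    uint8_t peer_with[20] = { ADDR_F_HWCSUM };
    uint8_t peer_without[20] = { 0 };

    assert(decide_tx(false, true, peer_with).action == TX_PLAIN);
    assert(decide_tx(true, true, peer_with).action == TX_RELY_ON_ICRC);
    assert(decide_tx(true, true, peer_without).action == TX_SW_CHECKSUM);
    assert(decide_tx(true, true, NULL).action == TX_SW_CHECKSUM);
    assert(rx_checksum_unnecessary(HDR_F_HWCSUM));
}
```

In this sketch the header flag is always written, including on the software-checksum path, which is a deliberate simplification chosen for the illustration.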
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 285c143..485f979 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_HW_CSUM = 11, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -104,9 +105,11 @@ enum { /* structs */ +#define IPOIB_HEADER_F_HWCSUM 0x1 + struct ipoib_header { __be16 proto; - u16 reserved; + __be16 flags; }; struct ipoib_pseudoheader { @@ -430,6 +478,8 @@ void ipoib_pkey_poll(struct work_struct *work); int ipoib_pkey_dev_delay_open(struct net_device *dev); void ipoib_drain_cq(struct net_device *dev); +#define IPOIB_FLAGS_HWCSUM 0x01 + #ifdef CONFIG_INFINIBAND_IPOIB_CM #define IPOIB_FLAGS_RC 0x80 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 08b4676..a308e92 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -407,6 +407,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; int frags; + struct ipoib_header *header; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); @@ -469,7 +470,10 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); - skb->protocol = ((struct ipoib_header *) skb->data)->proto; + header = (struct ipoib_header *)skb->data; + skb->protocol = header->proto; + if (header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb->ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 1094488..59b1735 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -170,6 +170,7 @@ static void 
ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV; struct sk_buff *skb; + struct ipoib_header *header; u64 addr; ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n", @@ -220,7 +221,10 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put(skb, wc->byte_len); skb_pull(skb, IB_GRH_BYTES); - skb->protocol = ((struct ipoib_header *) skb->data)->proto; + header = (struct ipoib_header *)skb->data; + skb->protocol = header->proto; + if (header->flags & cpu_to_be16(IPOIB_HEADER_F_HWCSUM)) + skb->ip_summed = CHECKSUM_UNNECESSARY; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 894b1dc..74d10e6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -55,11 +55,14 @@ MODULE_LICENSE("Dual BSD/GPL"); int ipoib_sendq_size __read_mostly = IPOIB_TX_RING_SIZE; int ipoib_recvq_size __read_mostly = IPOIB_RX_RING_SIZE; +static int ipoib_hw_csum __read_mostly = 0; module_param_named(send_queue_size, ipoib_sendq_size, int, 0444); MODULE_PARM_DESC(send_queue_size, "Number of descriptors in send queue"); module_param_named(recv_queue_size, ipoib_recvq_size, int, 0444); MODULE_PARM_DESC(recv_queue_size, "Number of descriptors in receive queue"); +module_param_named(hw_csum, ipoib_hw_csum, int, 0444); +MODULE_PARM_DESC(hw_csum, "Rely on hardware end-to-end checksum (ICRC) if > 0"); #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -778,11 +781,18 @@ static int ipoib_hard_header(struct sk_buff *skb, void *daddr, void *saddr, unsigned len) { struct ipoib_header *header; + struct ipoib_dev_priv *priv = netdev_priv(dev); header = (struct ipoib_header *) skb_push(skb, sizeof *header); header->proto = htons(type); - header->reserved = 0; + if (!test_bit(IPOIB_FLAG_HW_CSUM, 
&priv->flags) || + skb->ip_summed != CHECKSUM_PARTIAL) + header->flags = 0; + else if (daddr && *((char *)daddr) & IPOIB_FLAGS_HWCSUM) + header->flags = cpu_to_be16(IPOIB_HEADER_F_HWCSUM); + else + skb_checksum_help(skb); /* * If we don't have a neighbour structure, stuff the @@ -901,6 +911,9 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) if (ipoib_ib_dev_init(dev, ca, port)) goto out_tx_ring_cleanup; + if (ipoib_hw_csum) + dev->dev_addr[0] |= IPOIB_FLAGS_HWCSUM; + return 0; out_tx_ring_cleanup: @@ -964,6 +977,10 @@ static void ipoib_setup(struct net_device *dev) dev->type = ARPHRD_INFINIBAND; dev->tx_queue_len = ipoib_sendq_size * 2; dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + if (ipoib_hw_csum) { + dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM; + set_bit(IPOIB_FLAG_HW_CSUM, &priv->flags); + } /* MTU will be reset when mcast join happens */ dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; -- MST From tziporet at dev.mellanox.co.il Mon Sep 24 06:03:06 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 24 Sep 2007 15:03:06 +0200 Subject: [ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 In-Reply-To: <20070924125004.GA11953@mellanox.co.il> References: <46F6D694.6050407@opengridcomputing.com> <20070924085748.GD23796@mellanox.co.il> <20070924125004.GA11953@mellanox.co.il> Message-ID: <46F7B58A.4020603@mellanox.co.il> Michael S. Tsirkin wrote: >> Quoting Michael S. Tsirkin : >> Subject: Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 >> >> > I figured it out. done. 
> > And I did a new build of OFED 1.2.5 daily (look at http://www.openfabrics.org/builds/connectx/latest.txt) Tziporet From monisonlists at gmail.com Mon Sep 24 06:56:46 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Mon, 24 Sep 2007 15:56:46 +0200 Subject: [ofa-general] Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister In-Reply-To: References: <46F27692.3070404@voltaire.com> <46F2784C.9070806@voltaire.com> <46F61BF6.3000203@gmail.com> Message-ID: <46F7C21E.1000204@gmail.com> Roland Dreier wrote: > > The action in bonding to a detach of slave is to unregister the master (see patch 10). > > This can't be done from the context of unregister_netdevice itself (it is protected by rtnl_lock). > > I'm confused. Your patch has: > > > + ipoib_slave_detach(cpriv->dev); > > unregister_netdev(cpriv->dev); > > And ipoib_slave_detach() is: > > > +static inline void ipoib_slave_detach(struct net_device *dev) > > +{ > > + rtnl_lock(); > > + netdev_slave_detach(dev); > > + rtnl_unlock(); > > +} > > so you are calling netdev_slave_detach() with the rtnl lock held. > Why can't you make the same call from the start of unregister_netdevice()? > > Anyway, if the rtnl lock is a problem, can you just add the call to > netdev_slave_detach() to unregister_netdev() before it takes the rtnl lock? > > - R. > Your comment made me do a little rethinking. In bonding, the device is released by calling unregister_netdevice(), which doesn't take the rtnl_lock (unlike unregister_netdev(), which does). I guess this confused me into thinking that this was not possible. So, I guess I could put the detach notification in unregister_netdev() and the reaction to the notification in the bonding driver would not block. However, I looked one more time at the code of unregister_netdevice() and found out that nothing prevents calling unregister_netdevice() again when the NETDEV_GOING_DOWN notification is sent. I tried that and it works.
I have a new set of patches without sending a slave detach and I will send it soon. Thanks for the comment Roland. It makes this patch simpler. I'd also like to give a credit to Jay for the idea of using NETDEV_GOING_DOWN notification instead of NETDEV_CHANGE+IFF_SLAVE_DETACH. He suggested it a while ago but I wrongly thought that it wouldn't work. From swise at opengridcomputing.com Mon Sep 24 06:58:59 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 24 Sep 2007 08:58:59 -0500 Subject: [ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 In-Reply-To: <20070924085748.GD23796@mellanox.co.il> References: <46F6D694.6050407@opengridcomputing.com> <20070924085748.GD23796@mellanox.co.il> Message-ID: <46F7C2A3.3060103@opengridcomputing.com> Michael S. Tsirkin wrote: >> Quoting Steve Wise : >> Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 >> >> Please pull the latest from my libcxgb3 git repos to update the >> ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version >> 1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. >> >> git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 > Go look at http://www.openfabrics.org/git/?p=ofed_1_2_5/libcxgb3.git;a=summary It has a ofed_1_2_5 branch. I believe Vlad setup the build scripts to handle this. Yes? > This looks wrong. 1.2.X releases are done from ofed_1_2 branch. > 1.2.5 is just a tag. What do you want me to do? > >> and >> >> git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 > > OK for that one. > > From mst at dev.mellanox.co.il Mon Sep 24 07:07:34 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 24 Sep 2007 16:07:34 +0200 Subject: [ofa-general] Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 In-Reply-To: <46F7C2A3.3060103@opengridcomputing.com> References: <46F6D694.6050407@opengridcomputing.com> <20070924085748.GD23796@mellanox.co.il> <46F7C2A3.3060103@opengridcomputing.com> Message-ID: <20070924140733.GC11953@mellanox.co.il> > Quoting Steve Wise : > Subject: Re: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 > > > > Michael S. Tsirkin wrote: > >>Quoting Steve Wise : > >>Subject: [GIT PULL] ofed-1.2.5 / ofed-1.3 - new libcxgb3 release v1.0.2 > >> > >>Please pull the latest from my libcxgb3 git repos to update the > >>ofed-1.2.5 and ofed-1.3 libcxgb3 release. This will update to version > >>1.0.2 of libcxgb3 which fixes a doorbell issue on big-endian platforms. > >> > >>git://git.openfabrics.org/~swise/libcxgb3 ofed_1_2_5 > > > > Go look at > > http://www.openfabrics.org/git/?p=ofed_1_2_5/libcxgb3.git;a=summary > > It has a ofed_1_2_5 branch. I believe Vlad setup the build scripts to > handle this. > > Yes? > > >This looks wrong. 1.2.X releases are done from ofed_1_2 branch. > >1.2.5 is just a tag. What do you want me to do? > > > >>and > >> > >>git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3 > > > >OK for that one. > > > > It's OK, done for both. -- MST From tziporet at dev.mellanox.co.il Mon Sep 24 07:41:33 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 24 Sep 2007 16:41:33 +0200 Subject: [ofa-general] Re: [ewg] OFED teleconference today In-Reply-To: References: Message-ID: <46F7CC9D.70009@mellanox.co.il> Jeff Squyres wrote: > Friendly reminder: the OFED teleconference is several hours from now > (Monday, September 24, 2007). > > Noon US eastern / 9am US Pacific / -=>6pm Israel<=- > 1. Monday, Sep 24, code 210062024 (***TODAY***) > Agenda: 1. 
Agree on the new OFED 1.3 schedule: * Feature freeze - Sep 25 * Alpha release - Oct 1 * Beta release - Oct 17 (may change according to 2.6.24 rc1 availability) * RC1 - Oct 24 * RC2 - Nov 7 * RC3 - Nov 20 * RC4 - Dec 4 * GA release - Dec 18 2. Agree to move to kernel base 2.6.24 Start with what we have now (2.6.23) and move to 2.6.24 when RC1 is available. This will eliminate many patches and, with the new timeline, seems more appropriate. Please send any other agenda items you have. Tziporet From alicia.acero at ciemat.es Mon Sep 24 07:44:01 2007 From: alicia.acero at ciemat.es (Acero Fernandez Alicia) Date: Mon, 24 Sep 2007 16:44:01 +0200 Subject: [ofa-general] ofed-1.2.5/ofed-1.2.5.1 Message-ID: <50C74E87FB16FB4F9356E175CA15423E02D6D9B9@STR.ciemat.es> Hi, I am going to install the OFED software on our cluster, but in the download section there are two different versions, 1.2.5 and 1.2.5.1. Could anyone tell me what the differences between them are? I would also like to know whether 1.2.5.1 is a stable version. Thank you in advance. Regards Alicia Acero
From monis at voltaire.com Mon Sep 24 08:27:44 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:27:44 +0200 Subject: [ofa-general] [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver Message-ID: <46F7D770.4090500@voltaire.com> This patch series is the sixth version (see below link to V5) of the suggested changes to the bonding driver so it would be able to support non ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. Patches 1-8 were originally submitted in V5 and patch 9 is an addition by Jay. Major changes from the previous version: ---------------------------------------- 1. Remove the patches to net/core. Bonding will use the NETDEV_GOING_DOWN notification instead of NETDEV_CHANGE+IFF_SLAVE_DETACH. This reduces the amount of patches from 11 to 9. Links to earlier discussion: ---------------------------- 1. A discussion in netdev about bonding support for IPoIB. http://lists.openwall.net/netdev/2006/11/30/46 2. V5 series http://lists.openfabrics.org/pipermail/general/2007-September/040996.html From monisonlists at gmail.com Mon Sep 24 08:29:55 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:29:55 +0200 Subject: [ofa-general] [PATCH V6 1/9] IB/ipoib: Bound the net device to the ipoib_neigh structue In-Reply-To: <46F7D770.4090500@voltaire.com> References: <46F7D770.4090500@voltaire.com> Message-ID: <46F7D7F3.1070708@gmail.com> IPoIB uses a two layer neighboring scheme, such that for each struct neighbour whose device is an ipoib one, there is a struct ipoib_neigh buddy which is created on demand at the tx flow by an ipoib_neigh_alloc(skb->dst->neighbour) call. When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device.
On the tx flow the bonding code gets an skb such that skb->dev points to the master device, it changes this skb to point on the slave device and calls the slave hard_start_xmit function. Under this scheme, ipoib_neigh_destructor assumption that for each struct neighbour it gets, n->dev is an ipoib device and hence netdev_priv(n->dev) can be casted to struct ipoib_dev_priv is buggy. To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one, when n->dev->flags has the IFF_MASTER bit set. Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/infiniband/ulp/ipoib/ipoib.h | 4 +++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++--------- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++- 3 files changed, 20 insertions(+), 11 deletions(-) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 17:09:26.534874404 +0200 @@ -328,6 +328,7 @@ struct ipoib_neigh { struct sk_buff_head queue; struct neighbour *neighbour; + struct net_device *dev; struct list_head list; }; @@ -344,7 +345,8 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh, + struct net_device *dev); void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:23:54.725744661 +0200 @@ -511,7 +511,7 @@ static void 
neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = ipoib_neigh_alloc(skb->dst->neighbour); + neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -830,6 +830,13 @@ static void ipoib_neigh_cleanup(struct n unsigned long flags; struct ipoib_ah *ah = NULL; + neigh = *to_ipoib_neigh(n); + if (neigh) { + priv = netdev_priv(neigh->dev); + ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n", + n->dev->name); + } else + return; ipoib_dbg(priv, "neigh_cleanup for %06x " IPOIB_GID_FMT "\n", IPOIB_QPN(n->ha), @@ -837,13 +844,10 @@ static void ipoib_neigh_cleanup(struct n spin_lock_irqsave(&priv->lock, flags); - neigh = *to_ipoib_neigh(n); - if (neigh) { - if (neigh->ah) - ah = neigh->ah; - list_del(&neigh->list); - ipoib_neigh_free(n->dev, neigh); - } + if (neigh->ah) + ah = neigh->ah; + list_del(&neigh->list); + ipoib_neigh_free(n->dev, neigh); spin_unlock_irqrestore(&priv->lock, flags); @@ -851,7 +855,8 @@ static void ipoib_neigh_cleanup(struct n ipoib_put_ah(ah); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour, + struct net_device *dev) { struct ipoib_neigh *neigh; @@ -860,6 +865,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(st return NULL; neigh->neighbour = neighbour; + neigh->dev = dev; *to_ipoib_neigh(neighbour) = neigh; skb_queue_head_init(&neigh->queue); ipoib_cm_set(neigh, NULL); Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:08:53.245849217 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-18 17:09:26.536874045 +0200 @@ -727,7 +727,8 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = 
ipoib_neigh_alloc(skb->dst->neighbour); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour, + skb->dev); if (neigh) { kref_get(&mcast->ah->ref); From monis at voltaire.com Mon Sep 24 08:30:56 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:30:56 +0200 Subject: [ofa-general] [PATCH V6 2/9] IB/ipoib: Verify address handle validity on send In-Reply-To: <46F7D770.4090500@voltaire.com> References: <46F7D770.4090500@voltaire.com> Message-ID: <46F7D830.3060809@voltaire.com> When the bonding device senses a carrier loss of its active slave it replaces that slave with a new one. Between the time the carrier of an IPoIB device goes down and the time its ipoib_neigh is destroyed, it is possible that the bonding driver will send a packet on a new slave that uses an old ipoib_neigh. This patch detects this case and prevents it from happening. Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:09:26.535874225 +0200 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-18 17:10:22.375853147 +0200 @@ -686,9 +686,10 @@ static int ipoib_start_xmit(struct sk_bu goto out; } } else if (neigh->ah) { - if (unlikely(memcmp(&neigh->dgid.raw, + if (unlikely((memcmp(&neigh->dgid.raw, skb->dst->neighbour->ha + 4, - sizeof(union ib_gid)))) { + sizeof(union ib_gid))) || + (neigh->dev != dev))) { spin_lock(&priv->lock); /* * It's safe to call ipoib_put_ah() inside From monis at voltaire.com Mon Sep 24 08:32:09 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:32:09 +0200 Subject: [ofa-general] [PATCH V6 3/9] net/bonding: Enable bonding to enslave non ARPHRD_ETHER In-Reply-To: <46F7D770.4090500@voltaire.com> References:
<46F7D770.4090500@voltaire.com> Message-ID: <46F7D879.7050603@voltaire.com> This patch changes some of the bond netdevice attributes and functions to be that of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. Basically it overrides those setting done by ether_setup(), which are netdevice **type** dependent and hence might be not appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves from dissimilar ether types, as was concluded over the v1 discussion. IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3 bytes IB QP (Queue Pair) number and 16 bytes IB port GID (Global ID) of the port this IPoIB device is bounded to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (i have omitted here some details which are not important for the bonding RFC). Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 files changed, 39 insertions(+) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:08:59.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:13.424688411 +0300 @@ -1237,6 +1237,26 @@ static int bond_compute_features(struct return 0; } + +static void bond_setup_by_slave(struct net_device *bond_dev, + struct net_device *slave_dev) +{ + bond_dev->hard_header = slave_dev->hard_header; + bond_dev->rebuild_header = slave_dev->rebuild_header; + bond_dev->hard_header_cache = slave_dev->hard_header_cache; + bond_dev->header_cache_update = slave_dev->header_cache_update; + bond_dev->hard_header_parse = slave_dev->hard_header_parse; + + bond_dev->neigh_setup = slave_dev->neigh_setup; + + bond_dev->type = slave_dev->type; + bond_dev->hard_header_len = slave_dev->hard_header_len; + bond_dev->addr_len = slave_dev->addr_len; + 
+ memcpy(bond_dev->broadcast, slave_dev->broadcast, + slave_dev->addr_len); +} + /* enslave device to bond device */ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) { @@ -1311,6 +1331,25 @@ int bond_enslave(struct net_device *bond goto err_undo_flags; } + /* set bonding device ether type by slave - bonding netdevices are + * created with ether_setup, so when the slave type is not ARPHRD_ETHER + * there is a need to override some of the type dependent attribs/funcs. + * + * bond ether type mutual exclusion - don't allow slaves of dissimilar + * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond + */ + if (bond->slave_cnt == 0) { + if (slave_dev->type != ARPHRD_ETHER) + bond_setup_by_slave(bond_dev, slave_dev); + } else if (bond_dev->type != slave_dev->type) { + printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different " + "from other slaves (%d), can not enslave it.\n", + slave_dev->name, + slave_dev->type, bond_dev->type); + res = -EINVAL; + goto err_undo_flags; + } + if (slave_dev->set_mac_address == NULL) { printk(KERN_ERR DRV_NAME ": %s: Error: The slave device you specified does " From monis at voltaire.com Mon Sep 24 08:36:12 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:36:12 +0200 Subject: [ofa-general] [PATCH V6 4/9] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address() In-Reply-To: <46F7D770.4090500@voltaire.com> References: <46F7D770.4090500@voltaire.com> Message-ID: <46F7D96C.1020503@voltaire.com> This patch allows for enslaving netdevices which do not support the set_mac_address() function. In that case the bond mac address is the one of the active slave, where remote peers are notified on the mac address (neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code). 
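The fail-over behaviour described above (the bond takes over the active slave's address when slaves cannot have their MAC set) can be sketched in user space as follows. This is a minimal illustration, not the bonding driver code: the types and names (sketch_dev, sketch_bond, sketch_change_active) are hypothetical, and the Gratuitous ARP that the driver sends afterwards is only noted in a comment.

```c
/* User-space sketch of "bond MAC follows the active slave"; names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_ADDR_LEN 32 /* large enough for a 20-byte IPoIB address */

struct sketch_dev {
    unsigned char addr[SKETCH_MAX_ADDR_LEN];
    size_t addr_len;
};

struct sketch_bond {
    struct sketch_dev dev;     /* the bond (master) device itself */
    struct sketch_dev *active; /* currently active slave, or NULL */
    bool do_set_mac_addr;      /* true: slaves are given the bond's address */
};

/* Switch the active slave. When the bond does not set slave addresses,
 * the bond adopts the address of the new active slave instead. */
void sketch_change_active(struct sketch_bond *bond, struct sketch_dev *new_active)
{
    bond->active = new_active;
    if (new_active != NULL && !bond->do_set_mac_addr) {
        assert(new_active->addr_len <= SKETCH_MAX_ADDR_LEN);
        memcpy(bond->dev.addr, new_active->addr, new_active->addr_len);
    }
    /* The real driver would now send a Gratuitous ARP so that remote
     * peers learn the new address; that step is omitted in this sketch. */
}
```

With do_set_mac_addr cleared, each call to sketch_change_active() leaves the bond carrying the address of whichever slave was activated last; with it set, the bond address is left untouched.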
Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 87 +++++++++++++++++++++++++++------------- drivers/net/bonding/bonding.h | 1 2 files changed, 60 insertions(+), 28 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:54:13.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.971632881 +0300 @@ -1095,6 +1095,14 @@ void bond_change_active_slave(struct bon if (new_active) { bond_set_slave_active_flags(new_active); } + + /* when bonding does not set the slave MAC address, the bond MAC + * address is the one of the active slave. + */ + if (new_active && !bond->do_set_mac_addr) + memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, + new_active->dev->addr_len); + bond_send_gratuitous_arp(bond); } } @@ -1351,13 +1359,22 @@ int bond_enslave(struct net_device *bond } if (slave_dev->set_mac_address == NULL) { - printk(KERN_ERR DRV_NAME - ": %s: Error: The slave device you specified does " - "not support setting the MAC address. " - "Your kernel likely does not support slave " - "devices.\n", bond_dev->name); - res = -EOPNOTSUPP; - goto err_undo_flags; + if (bond->slave_cnt == 0) { + printk(KERN_WARNING DRV_NAME + ": %s: Warning: The first slave device you " + "specified does not support setting the MAC " + "address. This bond MAC address would be that " + "of the active slave.\n", bond_dev->name); + bond->do_set_mac_addr = 0; + } else if (bond->do_set_mac_addr) { + printk(KERN_ERR DRV_NAME + ": %s: Error: The slave device you specified " + "does not support setting the MAC addres,." + "but this bond uses this practice. 
\n" + , bond_dev->name); + res = -EOPNOTSUPP; + goto err_undo_flags; + } } new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL); @@ -1378,16 +1395,18 @@ int bond_enslave(struct net_device *bond */ memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); - /* - * Set slave to master's mac address. The application already - * set the master's mac address to that of the first slave - */ - memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); - addr.sa_family = slave_dev->type; - res = dev_set_mac_address(slave_dev, &addr); - if (res) { - dprintk("Error %d calling set_mac_address\n", res); - goto err_free; + if (bond->do_set_mac_addr) { + /* + * Set slave to master's mac address. The application already + * set the master's mac address to that of the first slave + */ + memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); + addr.sa_family = slave_dev->type; + res = dev_set_mac_address(slave_dev, &addr); + if (res) { + dprintk("Error %d calling set_mac_address\n", res); + goto err_free; + } } res = netdev_set_master(slave_dev, bond_dev); @@ -1612,9 +1631,11 @@ err_close: dev_close(slave_dev); err_restore_mac: - memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } err_free: kfree(new_slave); @@ -1792,10 +1813,12 @@ int bond_release(struct net_device *bond /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address */ - memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address */ + memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + 
dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB | IFF_SLAVE_INACTIVE | IFF_BONDING | @@ -1882,10 +1905,12 @@ static int bond_release_all(struct net_d /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address*/ - memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address*/ + memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB | IFF_SLAVE_INACTIVE); @@ -3922,6 +3947,9 @@ static int bond_set_mac_address(struct n dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None")); + if (!bond->do_set_mac_addr) + return -EOPNOTSUPP; + if (!is_valid_ether_addr(sa->sa_data)) { return -EADDRNOTAVAIL; } @@ -4312,6 +4340,9 @@ static int bond_init(struct net_device * bond_create_proc_entry(bond); #endif + /* set do_set_mac_addr to true on startup */ + bond->do_set_mac_addr = 1; + list_add_tail(&bond->bond_list, &bond_dev_list); return 0; Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:08:58.000000000 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-08-15 10:55:34.359354833 +0300 @@ -185,6 +185,7 @@ struct bonding { struct timer_list mii_timer; struct timer_list arp_timer; s8 kill_timers; + s8 do_set_mac_addr; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; From monis at voltaire.com Mon Sep 24 08:37:00 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:37:00 +0200 Subject: [ofa-general] [PATCH V6 5/9] net/bonding: Enable IP multicast for bonding IPoIB devices In-Reply-To: 
<46F7D770.4090500@voltaire.com> References: <46F7D770.4090500@voltaire.com> Message-ID: <46F7D99C.3030602@voltaire.com> Allow to enslave devices when the bonding device is not up. Over the discussion held at the previous post this seemed to be the most clean way to go, where it is not expected to cause instabilities. Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() have set the bonding device type to be ARPHRD_ETHER and address len to be ETHER_ALEN, the net core code computes a wrong multicast link address. This is b/c ip_eth_mc_map() is called where for multicast joins taking place after the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND) Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 5 +++-- drivers/net/bonding/bond_sysfs.c | 6 ++---- 2 files changed, 5 insertions(+), 6 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:55:48.431862446 +0300 @@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond /* bond must be initialized by bond_open() before enslaving */ if (!(bond_dev->flags & IFF_UP)) { - dprintk("Error, master_dev is not up\n"); - return -EPERM; + printk(KERN_WARNING DRV_NAME + " %s: master_dev is not up in bond_enslave\n", + bond_dev->name); } /* already enslaved */ Index: net-2.6/drivers/net/bonding/bond_sysfs.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:08:58.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:55:48.432862269 +0300 @@ -266,11 +266,9 @@ static 
ssize_t bonding_store_slaves(stru /* Quick sanity check -- is the bond interface up? */ if (!(bond->dev->flags & IFF_UP)) { - printk(KERN_ERR DRV_NAME - ": %s: Unable to update slaves because interface is down.\n", + printk(KERN_WARNING DRV_NAME + ": %s: doing slave updates when interface is down.\n", bond->dev->name); - ret = -EPERM; - goto out; } /* Note: We can't hold bond->lock here, as bond_create grabs it. */ From monisonlists at gmail.com Mon Sep 24 08:40:57 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:40:57 +0200 Subject: [ofa-general] [PATCH V6 6/9] net/bonding: Handlle wrong assumptions that slave is always an Ethernet device In-Reply-To: <46F7D770.4090500@voltaire.com> References: <46F7D770.4090500@voltaire.com> Message-ID: <46F7DA89.1050403@gmail.com> bonding sometimes uses Ethernet constants (such as MTU and address length) which are not good when it enslaves non Ethernet devices (such as InfiniBand). Signed-off-by: Moni Shoua --- drivers/net/bonding/bond_main.c | 3 ++- drivers/net/bonding/bond_sysfs.c | 10 ++++++++-- drivers/net/bonding/bonding.h | 1 + 3 files changed, 11 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-09-24 12:52:33.000000000 +0200 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-09-24 12:57:33.411459811 +0200 @@ -1224,7 +1224,8 @@ static int bond_compute_features(struct struct slave *slave; struct net_device *bond_dev = bond->dev; unsigned long features = bond_dev->features; - unsigned short max_hard_header_len = ETH_HLEN; + unsigned short max_hard_header_len = max((u16)ETH_HLEN, + bond_dev->hard_header_len); int i; features &= ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES); Index: net-2.6/drivers/net/bonding/bond_sysfs.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-09-24 
12:55:09.000000000 +0200 +++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-09-24 13:00:23.752680721 +0200 @@ -260,6 +260,7 @@ static ssize_t bonding_store_slaves(stru char command[IFNAMSIZ + 1] = { 0, }; char *ifname; int i, res, found, ret = count; + u32 original_mtu; struct slave *slave; struct net_device *dev = NULL; struct bonding *bond = to_bond(d); @@ -325,6 +326,7 @@ static ssize_t bonding_store_slaves(stru } /* Set the slave's MTU to match the bond */ + original_mtu = dev->mtu; if (dev->mtu != bond->dev->mtu) { if (dev->change_mtu) { res = dev->change_mtu(dev, @@ -339,6 +341,9 @@ static ssize_t bonding_store_slaves(stru } rtnl_lock(); res = bond_enslave(bond->dev, dev); + bond_for_each_slave(bond, slave, i) + if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) + slave->original_mtu = original_mtu; rtnl_unlock(); if (res) { ret = res; @@ -351,6 +356,7 @@ static ssize_t bonding_store_slaves(stru bond_for_each_slave(bond, slave, i) if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) { dev = slave->dev; + original_mtu = slave->original_mtu; break; } if (dev) { @@ -365,9 +371,9 @@ static ssize_t bonding_store_slaves(stru } /* set the slave MTU to the default */ if (dev->change_mtu) { - dev->change_mtu(dev, 1500); + dev->change_mtu(dev, original_mtu); } else { - dev->mtu = 1500; + dev->mtu = original_mtu; } } else { Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-09-24 12:55:09.000000000 +0200 +++ net-2.6/drivers/net/bonding/bonding.h 2007-09-24 12:57:33.412459636 +0200 @@ -156,6 +156,7 @@ struct slave { s8 link; /* one of BOND_LINK_XXXX */ s8 state; /* one of BOND_STATE_XXXX */ u32 original_flags; + u32 original_mtu; u32 link_failure_count; u16 speed; u8 duplex; From monis at voltaire.com Mon Sep 24 08:46:12 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:46:12 +0200 Subject: [ofa-general] PATCH V6 7/9] net/bonding: 
Delay sending of gratuitous ARP to avoid failure In-Reply-To: <46F7D770.4090500@voltaire.com> References: <46F7D770.4090500@voltaire.com> Message-ID: <46F7DBC4.2090307@voltaire.com> Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit in dev->state field is on. This improves the chances for the arp packet to be transmitted. Signed-off-by: Moni Shoua --- drivers/net/bonding/bond_main.c | 24 +++++++++++++++++++++--- drivers/net/bonding/bonding.h | 1 + 2 files changed, 22 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:56:33.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 11:04:37.221123652 +0300 @@ -1102,8 +1102,14 @@ void bond_change_active_slave(struct bon if (new_active && !bond->do_set_mac_addr) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); - - bond_send_gratuitous_arp(bond); + if (bond->curr_active_slave && + test_bit(__LINK_STATE_LINKWATCH_PENDING, + &bond->curr_active_slave->dev->state)) { + dprintk("delaying gratuitous arp on %s\n", + bond->curr_active_slave->dev->name); + bond->send_grat_arp = 1; + } else + bond_send_gratuitous_arp(bond); } } @@ -2083,6 +2089,17 @@ void bond_mii_monitor(struct net_device * program could monitor the link itself if needed. 
*/ + if (bond->send_grat_arp) { + if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, + &bond->curr_active_slave->dev->state)) + dprintk("Needs to send gratuitous arp but not yet\n"); + else { + dprintk("sending delayed gratuitous arp on %s\n", + bond->curr_active_slave->dev->name); + bond_send_gratuitous_arp(bond); + bond->send_grat_arp = 0; + } + } read_lock(&bond->curr_slave_lock); oldcurrent = bond->curr_active_slave; read_unlock(&bond->curr_slave_lock); @@ -2484,7 +2501,7 @@ static void bond_send_gratuitous_arp(str if (bond->master_ip) { bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip, - bond->master_ip, 0); + bond->master_ip, 0); } list_for_each_entry(vlan, &bond->vlan_list, vlan_list) { @@ -4293,6 +4310,7 @@ static int bond_init(struct net_device * bond->current_arp_slave = NULL; bond->primary_slave = NULL; bond->dev = bond_dev; + bond->send_grat_arp = 0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:56:33.000000000 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-08-15 11:05:41.516451497 +0300 @@ -187,6 +187,7 @@ struct bonding { struct timer_list arp_timer; s8 kill_timers; s8 do_set_mac_addr; + s8 send_grat_arp; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; From monis at voltaire.com Mon Sep 24 08:47:42 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:47:42 +0200 Subject: [ofa-general] [PATCH V6 8/9] net/bonding: Destroy bonding master when last slave is gone In-Reply-To: <46F7D770.4090500@voltaire.com> References: <46F7D770.4090500@voltaire.com> Message-ID: <46F7DC1E.6060209@voltaire.com> When bonding enslaves non Ethernet devices it takes pointers to functions in the module that owns the slaves. 
In this case it becomes unsafe to keep the bonding master registered after the last slave is released, because we don't know if the pointers are still valid. Destroying the bond when slave_cnt is zero ensures that these functions are not used anymore. Signed-off-by: Moni Shoua --- drivers/net/bonding/bond_main.c | 37 +++++++++++++++++++++++++++++++++++++ drivers/net/bonding/bond_sysfs.c | 9 +++++---- drivers/net/bonding/bonding.h | 3 +++ 3 files changed, 45 insertions(+), 4 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-09-24 14:01:24.055441842 +0200 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-09-24 14:05:05.658979207 +0200 @@ -1256,6 +1256,7 @@ static int bond_compute_features(struct static void bond_setup_by_slave(struct net_device *bond_dev, struct net_device *slave_dev) { + struct bonding *bond = bond_dev->priv; bond_dev->hard_header = slave_dev->hard_header; bond_dev->rebuild_header = slave_dev->rebuild_header; bond_dev->hard_header_cache = slave_dev->hard_header_cache; @@ -1270,6 +1271,7 @@ static void bond_setup_by_slave(struct n memcpy(bond_dev->broadcast, slave_dev->broadcast, slave_dev->addr_len); + bond->setup_by_slave = 1; } /* enslave device to bond device */ @@ -1838,6 +1840,35 @@ int bond_release(struct net_device *bond } /* +* Destroy a bonding device. +* Must be under rtnl_lock when this function is called. +*/ +void bond_destroy(struct bonding *bond) +{ + bond_deinit(bond->dev); + bond_destroy_sysfs_entry(bond); + unregister_netdevice(bond->dev); +} + +/* +* First release a slave and then destroy the bond if no more slaves are left. +* Must be under rtnl_lock when this function is called. 
+*/ +int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev) +{ + struct bonding *bond = bond_dev->priv; + int ret; + + ret = bond_release(bond_dev, slave_dev); + if ((ret == 0) && (bond->slave_cnt == 0)) { + printk(KERN_INFO DRV_NAME " %s: destroying bond %s.\n", + bond_dev->name, bond_dev->name); + bond_destroy(bond); + } + return ret; +} + +/* * This function releases all slaves. */ static int bond_release_all(struct net_device *bond_dev) @@ -3337,6 +3368,11 @@ static int bond_slave_netdev_event(unsig * ... Or is it this? */ break; + case NETDEV_GOING_DOWN: + dprintk("slave %s is going down\n", slave_dev->name); + if (bond->setup_by_slave) + bond_release_and_destroy(bond_dev, slave_dev); + break; case NETDEV_CHANGEMTU: /* * TODO: Should slaves be allowed to @@ -4311,6 +4347,7 @@ static int bond_init(struct net_device * bond->primary_slave = NULL; bond->dev = bond_dev; bond->send_grat_arp = 0; + bond->setup_by_slave = 0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-09-24 14:01:24.055441842 +0200 +++ net-2.6/drivers/net/bonding/bonding.h 2007-09-24 14:01:24.627340013 +0200 @@ -188,6 +188,7 @@ struct bonding { s8 kill_timers; s8 do_set_mac_addr; s8 send_grat_arp; + s8 setup_by_slave; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; @@ -295,6 +296,8 @@ static inline void bond_unset_master_alb struct vlan_entry *bond_next_vlan(struct bonding *bond, struct vlan_entry *curr); int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, struct net_device *slave_dev); int bond_create(char *name, struct bond_params *params, struct bonding **newbond); +void bond_destroy(struct bonding *bond); +int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev); void bond_deinit(struct net_device 
*bond_dev); int bond_create_sysfs(void); void bond_destroy_sysfs(void); Index: net-2.6/drivers/net/bonding/bond_sysfs.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-09-24 14:01:23.523536550 +0200 +++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-09-24 14:01:24.628339835 +0200 @@ -164,9 +164,7 @@ static ssize_t bonding_store_bonds(struc printk(KERN_INFO DRV_NAME ": %s is being deleted...\n", bond->dev->name); - bond_deinit(bond->dev); - bond_destroy_sysfs_entry(bond); - unregister_netdevice(bond->dev); + bond_destroy(bond); rtnl_unlock(); goto out; } @@ -363,7 +361,10 @@ static ssize_t bonding_store_slaves(stru printk(KERN_INFO DRV_NAME ": %s: Removing slave %s\n", bond->dev->name, dev->name); rtnl_lock(); - res = bond_release(bond->dev, dev); + if (bond->setup_by_slave) + res = bond_release_and_destroy(bond->dev, dev); + else + res = bond_release(bond->dev, dev); rtnl_unlock(); if (res) { ret = res; From monis at voltaire.com Mon Sep 24 08:49:18 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 24 Sep 2007 17:49:18 +0200 Subject: [ofa-general] [PATCH 9/9] bonding: Optionally allow ethernet slaves to keep own MAC In-Reply-To: <46F7D770.4090500@voltaire.com> References: <46F7D770.4090500@voltaire.com> Message-ID: <46F7DC7E.7090509@voltaire.com> Update the "don't change MAC of slaves" functionality added in previous changes to be a generic option, rather than something tied to IB devices, as it's occasionally useful for regular ethernet devices as well. Adds "fail_over_mac" option (which is automatically enabled for IB slaves), applicable only to active-backup mode. Includes documentation update. Updates bonding driver version to 3.2.0. 
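As a rough illustration of the behaviour this description summarizes (it is not part of the patch), the enslave-time decision can be modelled in userspace C. The struct `bond_model` and the function `enslave_mac_check` are hypothetical names invented for this sketch; only the rule itself is taken from the patch: a first slave that cannot change its MAC switches fail_over_mac on, and a later such slave is refused with -EOPNOTSUPP unless the option is already on.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the kernel structures. */
struct bond_model {
    int slave_cnt;      /* number of slaves already in the bond */
    int fail_over_mac;  /* 0 = off (default), 1 = on */
};

/*
 * Returns 0 if the slave may be added, -EOPNOTSUPP otherwise.
 * can_set_mac is false for a device that has no way to change
 * its MAC address (e.g. an IPoIB device).
 */
int enslave_mac_check(struct bond_model *bond, bool can_set_mac)
{
    if (!can_set_mac) {
        if (bond->slave_cnt == 0)
            bond->fail_over_mac = 1;  /* first slave: enable automatically */
        else if (!bond->fail_over_mac)
            return -EOPNOTSUPP;       /* later slave, option off: refuse */
    }
    bond->slave_cnt++;
    return 0;
}
```

Calling `enslave_mac_check()` on an empty bond with `can_set_mac == false` turns the option on and succeeds; the same call on a bond that already has a slave and has the option off is rejected.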
Signed-off-by: Jay Vosburgh --- Documentation/networking/bonding.txt | 33 +++++++++++++++++++ drivers/net/bonding/bond_main.c | 57 +++++++++++++++++++++------------ drivers/net/bonding/bond_sysfs.c | 49 +++++++++++++++++++++++++++++ drivers/net/bonding/bonding.h | 6 ++-- 4 files changed, 121 insertions(+), 24 deletions(-) diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 1da5666..1134062 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -281,6 +281,39 @@ downdelay will be rounded down to the nearest multiple. The default value is 0. +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address (the traditional behavior), or, when + enabled, change the bond's MAC address when changing the + active interface (i.e., fail over the MAC address itself). + + Fail over MAC is useful for devices that cannot ever alter + their MAC address, or for devices that refuse incoming + broadcasts with their own source MAC (which interferes with + the ARP monitor). + + The down side of fail over MAC is that every device on the + network must be updated via gratuitous ARP, vs. just updating + a switch or set of switches (which often takes place for any + traffic, not just ARP traffic, if the switch snoops incoming + traffic to update its tables) for the traditional method. If + the gratuitous ARP is lost, communication may be disrupted. + + When fail over MAC is used in conjunction with the mii monitor, + devices which assert link up prior to being able to actually + transmit and receive are particularly susceptible to loss of + the gratuitous ARP, and an appropriate updelay setting may be + required. + + A value of 0 disables fail over MAC, and is the default. A + value of 1 enables fail over MAC. This option is enabled + automatically if the first slave added cannot change its MAC + address. 
This option may be modified via sysfs only when no + slaves are present in the bond. + + This option was added in bonding version 3.2.0. + lacp_rate Option specifying the rate in which we'll ask our link partner diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 77caca3..c01ff9d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -97,6 +97,7 @@ static char *xmit_hash_policy = NULL; static int arp_interval = BOND_LINK_ARP_INTERV; static char *arp_ip_target[BOND_MAX_ARP_TARGETS] = { NULL, }; static char *arp_validate = NULL; +static int fail_over_mac = 0; struct bond_params bonding_defaults; module_param(max_bonds, int, 0); @@ -130,6 +131,8 @@ module_param_array(arp_ip_target, charp, NULL, 0); MODULE_PARM_DESC(arp_ip_target, "arp targets in n.n.n.n form"); module_param(arp_validate, charp, 0); MODULE_PARM_DESC(arp_validate, "validate src/dst of ARP probes: none (default), active, backup or all"); +module_param(fail_over_mac, int, 0); +MODULE_PARM_DESC(fail_over_mac, "For active-backup, do not set all slaves to the same MAC. 0 for off (default), 1 for on."); /*----------------------------- Global variables ----------------------------*/ @@ -1099,7 +1102,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active) /* when bonding does not set the slave MAC address, the bond MAC * address is the one of the active slave. 
*/ - if (new_active && !bond->do_set_mac_addr) + if (new_active && bond->params.fail_over_mac) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); if (bond->curr_active_slave && @@ -1371,16 +1374,16 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) if (slave_dev->set_mac_address == NULL) { if (bond->slave_cnt == 0) { printk(KERN_WARNING DRV_NAME - ": %s: Warning: The first slave device you " - "specified does not support setting the MAC " - "address. 
This bond MAC address would be that " - "of the active slave.\n", bond_dev->name); - bond->do_set_mac_addr = 0; - } else if (bond->do_set_mac_addr) { + ": %s: Warning: The first slave device " + "specified does not support setting the MAC " + "address. Enabling the fail_over_mac option.", + bond_dev->name); + bond->params.fail_over_mac = 1; + } else if (!bond->params.fail_over_mac) { printk(KERN_ERR DRV_NAME - ": %s: Error: The slave device you specified " - "does not support setting the MAC addres,." - "but this bond uses this practice. \n" + ": %s: Error: The slave device specified " + "does not support setting the MAC address, " + "but fail_over_mac is not enabled.\n" , bond_dev->name); res = -EOPNOTSUPP; goto err_undo_flags; @@ -1405,7 +1408,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) */ memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { /* * Set slave to master's mac address. 
The application already * set the master's mac address to that of the first slave @@ -1641,7 +1644,7 @@ err_close: dev_close(slave_dev); err_restore_mac: - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); addr.sa_family = slave_dev->type; dev_set_mac_address(slave_dev, &addr); @@ -1823,7 +1826,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev) /* close slave before restoring its mac address */ dev_close(slave_dev); - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { /* restore original ("permanent") mac address */ memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); addr.sa_family = slave_dev->type; @@ -1944,7 +1947,7 @@ static int bond_release_all(struct net_device *bond_dev) /* close slave before restoring its mac address */ dev_close(slave_dev); - if (bond->do_set_mac_addr) { + if (!bond->params.fail_over_mac) { /* restore original ("permanent") mac address*/ memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); addr.sa_family = slave_dev->type; @@ -3066,9 +3069,15 @@ static void bond_info_show_master(struct seq_file *seq) curr = bond->curr_active_slave; read_unlock(&bond->curr_slave_lock); - seq_printf(seq, "Bonding Mode: %s\n", + seq_printf(seq, "Bonding Mode: %s", bond_mode_name(bond->params.mode)); + if (bond->params.mode == BOND_MODE_ACTIVEBACKUP && + bond->params.fail_over_mac) + seq_printf(seq, " (fail_over_mac)"); + + seq_printf(seq, "\n"); + if (bond->params.mode == BOND_MODE_XOR || bond->params.mode == BOND_MODE_8023AD) { seq_printf(seq, "Transmit Hash Policy: %s (%d)\n", @@ -4008,8 +4017,12 @@ static int bond_set_mac_address(struct net_device *bond_dev, void *addr) dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None")); - if (!bond->do_set_mac_addr) - return -EOPNOTSUPP; + /* + * If fail_over_mac is enabled, do nothing and return success. + * Returning an error causes ifenslave to fail. 
+ */ + if (bond->params.fail_over_mac) + return 0; if (!is_valid_ether_addr(sa->sa_data)) { return -EADDRNOTAVAIL; @@ -4402,10 +4415,6 @@ static int bond_init(struct net_device *bond_dev, struct bond_params *params) #ifdef CONFIG_PROC_FS bond_create_proc_entry(bond); #endif - - /* set do_set_mac_addr to true on startup */ - bond->do_set_mac_addr = 1; - list_add_tail(&bond->bond_list, &bond_dev_list); return 0; @@ -4739,6 +4748,11 @@ static int bond_check_params(struct bond_params *params) primary = NULL; } + if (fail_over_mac && (bond_mode != BOND_MODE_ACTIVEBACKUP)) + printk(KERN_WARNING DRV_NAME + ": Warning: fail_over_mac only affects " + "active-backup mode.\n"); + /* fill params struct with the proper values */ params->mode = bond_mode; params->xmit_policy = xmit_hashtype; @@ -4750,6 +4764,7 @@ static int bond_check_params(struct bond_params *params) params->use_carrier = use_carrier; params->lacp_fast = lacp_fast; params->primary[0] = 0; + params->fail_over_mac = fail_over_mac; if (primary) { strncpy(params->primary, primary, IFNAMSIZ); diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 71db5d9..a907b68 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -567,6 +567,54 @@ static ssize_t bonding_store_arp_validate(struct device *d, static DEVICE_ATTR(arp_validate, S_IRUGO | S_IWUSR, bonding_show_arp_validate, bonding_store_arp_validate); /* + * Show and store fail_over_mac. User only allowed to change the + * value when there are no slaves. 
+ */ +static ssize_t bonding_show_fail_over_mac(struct device *d, struct device_attribute *attr, char *buf) +{ + struct bonding *bond = to_bond(d); + + return sprintf(buf, "%d\n", bond->params.fail_over_mac) + 1; +} + +static ssize_t bonding_store_fail_over_mac(struct device *d, struct device_attribute *attr, const char *buf, size_t count) +{ + int new_value; + int ret = count; + struct bonding *bond = to_bond(d); + + if (bond->slave_cnt != 0) { + printk(KERN_ERR DRV_NAME + ": %s: Can't alter fail_over_mac with slaves in bond.\n", + bond->dev->name); + ret = -EPERM; + goto out; + } + + if (sscanf(buf, "%d", &new_value) != 1) { + printk(KERN_ERR DRV_NAME + ": %s: no fail_over_mac value specified.\n", + bond->dev->name); + ret = -EINVAL; + goto out; + } + + if ((new_value == 0) || (new_value == 1)) { + bond->params.fail_over_mac = new_value; + printk(KERN_INFO DRV_NAME ": %s: Setting fail_over_mac to %d.\n", + bond->dev->name, new_value); + } else { + printk(KERN_INFO DRV_NAME + ": %s: Ignoring invalid fail_over_mac value %d.\n", + bond->dev->name, new_value); + } +out: + return ret; +} + +static DEVICE_ATTR(fail_over_mac, S_IRUGO | S_IWUSR, bonding_show_fail_over_mac, bonding_store_fail_over_mac); + +/* * Show and set the arp timer interval. There are two tricky bits * here. First, if ARP monitoring is activated, then we must disable * MII monitoring. 
Second, if the ARP timer isn't running, we must @@ -1390,6 +1438,7 @@ static DEVICE_ATTR(ad_partner_mac, S_IRUGO, bonding_show_ad_partner_mac, NULL); static struct attribute *per_bond_attrs[] = { &dev_attr_slaves.attr, &dev_attr_mode.attr, + &dev_attr_fail_over_mac.attr, &dev_attr_arp_validate.attr, &dev_attr_arp_interval.attr, &dev_attr_arp_ip_target.attr, diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index ed0f587..9d6153e 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -22,8 +22,8 @@ #include "bond_3ad.h" #include "bond_alb.h" -#define DRV_VERSION "3.1.3" -#define DRV_RELDATE "June 13, 2007" +#define DRV_VERSION "3.2.0" +#define DRV_RELDATE "September 13, 2007" #define DRV_NAME "bonding" #define DRV_DESCRIPTION "Ethernet Channel Bonding Driver" @@ -128,6 +128,7 @@ struct bond_params { int arp_interval; int arp_validate; int use_carrier; + int fail_over_mac; int updelay; int downdelay; int lacp_fast; @@ -186,7 +187,6 @@ struct bonding { struct timer_list mii_timer; struct timer_list arp_timer; s8 kill_timers; - s8 do_set_mac_addr; s8 send_grat_arp; s8 setup_by_slave; struct net_device_stats stats; -- 1.5.2-rc2.GIT From swise at opengridcomputing.com Mon Sep 24 09:04:48 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 24 Sep 2007 11:04:48 -0500 Subject: [ofa-general] Re: [ewg] OFED teleconference today In-Reply-To: <46F7CC9D.70009@mellanox.co.il> References: <46F7CC9D.70009@mellanox.co.il> Message-ID: <46F7E020.3000902@opengridcomputing.com> I cannot make the meeting today. I vote for 2.6.24 base. There is still the outstanding iwarp port space issue that will need to be pulled into ofed-1.3 when it finalizes. But its a bug fix really, so not a new feature I guess. Tziporet Koren wrote: > Jeff Squyres wrote: >> Friendly reminder: the OFED teleconference is several hours from now >> (Monday, September 24, 2007). >> >> Noon US eastern / 9am US Pacific / -=>6pm Israel<=- >> 1. 
Monday, Sep 24, code 210062024 (***TODAY***) >> > Agenda: > 1. Agree on the new OFED 1.3 schedule: > > * Feature freeze - Sep 25 > * Alpha release - Oct 1 > * Beta release - Oct 17 (may change according to 2.6.24 rc1 > availability) > * RC1 - Oct 24 > * RC2 - Nov 7 > * RC3 - Nov 20 > * RC4 - Dec 4 > * GA release - Dec 18 > > 2. Agree to move to kernel base 2.6.24 > Start with what we have now (2.6.23) and move to 2.6.24 when RC1 is > available. > This will reduce many patches and with the new timeline seems more > appropriate. > > Please send if you have any other agenda items > > Tziporet > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From shemminger at linux-foundation.org Mon Sep 24 09:04:37 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Mon, 24 Sep 2007 09:04:37 -0700 Subject: [ofa-general] Re: [PATCH V6 5/9] net/bonding: Enable IP multicast for bonding IPoIB devices In-Reply-To: <46F7D99C.3030602@voltaire.com> References: <46F7D770.4090500@voltaire.com> <46F7D99C.3030602@voltaire.com> Message-ID: <20070924090437.0406e147@freepuppy.rosehill> On Mon, 24 Sep 2007 17:37:00 +0200 Moni Shoua wrote: > Allow to enslave devices when the bonding device is not up. Over the discussion > held at the previous post this seemed to be the most clean way to go, where it > is not expected to cause instabilities. > > Normally, the bonding driver is UP before any enslavement takes place. > Once a netdevice is UP, the network stack acts to have it join some multicast groups > (eg the all-hosts 224.0.0.1). Now, since ether_setup() have set the bonding device > type to be ARPHRD_ETHER and address len to be ETHER_ALEN, the net core code > computes a wrong multicast link address. 
This is b/c ip_eth_mc_map() is called > where for multicast joins taking place after the enslavement another ip_xxx_mc_map() > is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND) > > Signed-off-by: Moni Shoua > Signed-off-by: Or Gerlitz > --- > drivers/net/bonding/bond_main.c | 5 +++-- > drivers/net/bonding/bond_sysfs.c | 6 ++---- > 2 files changed, 5 insertions(+), 6 deletions(-) > > Index: net-2.6/drivers/net/bonding/bond_main.c > =================================================================== > --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.000000000 +0300 > +++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:55:48.431862446 +0300 > @@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond > > /* bond must be initialized by bond_open() before enslaving */ > if (!(bond_dev->flags & IFF_UP)) { > - dprintk("Error, master_dev is not up\n"); > - return -EPERM; > + printk(KERN_WARNING DRV_NAME > + " %s: master_dev is not up in bond_enslave\n", > + bond_dev->name); > } > > /* already enslaved */ > Index: net-2.6/drivers/net/bonding/bond_sysfs.c > =================================================================== > --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:08:58.000000000 +0300 > +++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:55:48.432862269 +0300 > @@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru > > /* Quick sanity check -- is the bond interface up? */ > if (!(bond->dev->flags & IFF_UP)) { > - printk(KERN_ERR DRV_NAME > - ": %s: Unable to update slaves because interface is down.\n", > + printk(KERN_WARNING DRV_NAME > + ": %s: doing slave updates when interface is down.\n", > bond->dev->name); > - ret = -EPERM; > - goto out; > } > Please get rid of the warning. Make bonding work correctly and allow enslave/remove of device when bonding is down. 
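For readers following the multicast discussion quoted above, here is a minimal userspace sketch of the standard IPv4-to-Ethernet multicast mapping (01:00:5e followed by the low 23 bits of the group address). The function name `eth_mc_map_sketch` is hypothetical, and unlike the kernel's ip_eth_mc_map() it takes the group address in host byte order. The result is a 6-byte address, while an IPoIB link-layer address is 20 bytes long, which is why a mapping chosen while the bond still looks like an Ethernet device cannot be right for an InfiniBand slave.

```c
#include <stdint.h>

/*
 * Userspace sketch of the IPv4-to-Ethernet multicast mapping.
 * group: IPv4 multicast address in host byte order (assumption of
 *        this sketch; the kernel helper works on network byte order).
 * mac:   receives the 6-byte Ethernet multicast address.
 */
void eth_mc_map_sketch(uint32_t group, uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (group >> 16) & 0x7f;  /* only 23 bits of the group survive */
    mac[4] = (group >> 8) & 0xff;
    mac[5] = group & 0xff;
}
```

With this mapping the all-hosts group 224.0.0.1 (0xE0000001) becomes 01:00:5e:00:00:01.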
From mshefty at ichips.intel.com Mon Sep 24 09:28:08 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Sep 2007 09:28:08 -0700 Subject: [ofa-general] OFED 1.3 In-Reply-To: <20070923133608.GA11619@mellanox.co.il> References: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> <20070911032054.GA21811@mellanox.co.il> <20070923133608.GA11619@mellanox.co.il> Message-ID: <46F7E598.60805@ichips.intel.com> > Any update? I see ~shefty/ofed_1_2.git but no 1.3 code. > Please note that I can not pull for-roland branch into OFED 1.3. The only kernel change outside of 2.6.24 that is needed is the local SA patch from OFED 1.2.5. I still need to push new releases of userspace. 
I needed to wait to see what was going into 2.6.24 first. - Sean From mshefty at ichips.intel.com Mon Sep 24 09:44:30 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Sep 2007 09:44:30 -0700 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <46F6267E.7090407@voltaire.com> References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> Message-ID: <46F7E96E.4060302@ichips.intel.com> > Now, is this case there was --no-- previous event, when the port was > brought back online there was PORT_ACTIVE event (its a driver issue > which we look at). However, from the view point of the SA there was "GID > out" event, so the HCA port was dropped out from the multicast group and > the multicast routing (spanning tree, MFTs configuration etc) was > computed without this port being included. This is the ipoib logging of > what happens from its perspective (I have added the event number to the > "port state change event" print): Do you know why there wasn't some sort of port down event? >> node 1 <-> switch A <-> switch B <-> switch C <-> SA > > The host would only see port up/down events as of changes in the link > state in the local port or in the port which is connected to it through > the cable. So, if you brought the link down/up between switches A & B, node 1 wouldn't receive any events, but it would be removed from the multicast group? 
- Sean From mshefty at ichips.intel.com Mon Sep 24 09:47:47 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Sep 2007 09:47:47 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <795c49870709211702k1294cd79y5b7c987b04958adf@mail.gmail.com> References: <795c49870709041614p719177das4715bf372b47c099@mail.gmail.com> <46F0060E.1080505@ichips.intel.com> <46F43D07.1010902@ichips.intel.com> <46F44951.6080401@ichips.intel.com> <46F4512C.4010505@ichips.intel.com> <795c49870709211702k1294cd79y5b7c987b04958adf@mail.gmail.com> Message-ID: <46F7EA33.5050706@ichips.intel.com> Jeff Becker wrote: > I'm OK with these suggestions. Please let me know what you would like > implemented. Thanks. I tried changing my WEB_README, and the updates didn't show up on the download page. How often should the page be updated? - Sean From sean.hefty at intel.com Mon Sep 24 10:17:15 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 24 Sep 2007 10:17:15 -0700 Subject: [ofa-general] RE: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch In-Reply-To: <20070924114713.GB32619@mellanox.co.il> References: <000101c7f3fe$fc52b530$1ccc180a@amr.corp.intel.com> <20070924114713.GB32619@mellanox.co.il> Message-ID: <000001c7fece$c0fc8080$ff0da8c0@amr.corp.intel.com> >I used git-format-patch to extract patches from this tree >and add them to ofed 1.3 kernel tree. Thanks From richard.frank at oracle.com Mon Sep 24 11:11:49 2007 From: richard.frank at oracle.com (rick) Date: Mon, 24 Sep 2007 14:11:49 -0400 Subject: [ofa-general] rdma_cm connect / disconnect / reject race....resulting in crash.... Message-ID: <46F7FDE5.9070305@oracle.com> Sean, per our discussion here's the problem description from Olaf... " We start to shut down the connection, and call rdma_destroy_qp on our cm_id. We haven't executed rdma_destroy_id yet. Now apparently a "connect reject" message comes in from the other host, and cma_ib_handler() is called with an event of IB_CM_REJ_RECEIVED.
It calls cma_modify_qp_err, which for some odd reason tries to modify the exact same QP we just destroyed. The crash looks like this: RDS/IB: connection request while the connection exist: 11.0.0.18, disconnecting and reconnecting ic f7ccb800 ic->i_cm_id f7cb2a00 rdma_destroy_qp(f7cb2a00) Unable to handle kernel NULL pointer dereference at virtual address 000000f8 .... EIP is at ib_modify_qp+0x5/0xe [ib_core] .... Stack: 00000000 f7cb2a00 f8ac36af 00000006 00000000 1a0f4680 f6742e7c c011cc85 c495ede0 f671ce30 c495ede0 c495ede0 00000086 c495ede0 c011d1a3 f671ce30 f671ce30 00000002 c4966de0 00000002 00000000 c495ede0 00000001 00000001 Call Trace: [] cma_modify_qp_err+0x22/0x2d [rdma_cm] [...] [] cma_disable_remove+0x35/0x3b [rdma_cm] [] cma_ib_handler+0xe6/0x158 [rdma_cm] [] cm_process_work+0x4a/0x80 [ib_cm] [] cm_rej_handler+0xd3/0x114 [ib_cm] It dies trying to dereference qp->device->modify_qp because qp->device is NULL. If you check the stack, you'll see the exact same cm_id that we just called rdma_destroy_qp() on (note that the printk("rdma_destroy_qp") that appears above comes *after* the call itself, so by the time this is printed, the QP is dead already. That's easy, I thought. Obviously, rdma_destroy_qp just forgets to clear cm_id->qp after destroying the queue pair: void rdma_destroy_qp(struct rdma_cm_id *id) { ib_destroy_qp(id->qp); + id->qp = NULL; } But that didn't really fix it. So either there's something else going on which I don't grok yet, or this is just another case of bad locking. 
" From becker at nas.nasa.gov Mon Sep 24 11:37:18 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Mon, 24 Sep 2007 11:37:18 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46F7EA33.5050706@ichips.intel.com> References: <46F43D07.1010902@ichips.intel.com> <46F44951.6080401@ichips.intel.com> <46F4512C.4010505@ichips.intel.com> <795c49870709211702k1294cd79y5b7c987b04958adf@mail.gmail.com> <46F7EA33.5050706@ichips.intel.com> Message-ID: <795c49870709241137g27b82df6ueba445ae4a3fdb6f@mail.gmail.com> Hi Sean. I just talked to Jeff Scott about this, as he had announced the new downloads page. It turns out that the new page does not use my php page that automatically updates, but rather took a "snapshot" of the page state. That's why your update doesn't show up. He said he would try to fix this. -jeff On 9/24/07, Sean Hefty wrote: > Jeff Becker wrote: > > I'm OK with these suggestions. Please let me know what you would like > > implemented. Thanks. > > I tried changing my WEB_README, and the updates didn't show up on the > download page. How often should be the page be updated? > > - Sean > From peter.p.waskiewicz.jr at intel.com Mon Sep 24 12:12:49 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Mon, 24 Sep 2007 12:12:49 -0700 Subject: [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <1190570205.4256.56.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> Message-ID: > I have submitted this before; but here it is again. > Against net-2.6.24 from yesterday for this and all following patches. > > > cheers, > jamal Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. 
I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another? I apologize for not having working patches right now, but I am working on them slowly as I have some blips of spare time. Thanks, -PJ Waskiewicz From ggrundstrom at NetEffect.com Mon Sep 24 12:29:23 2007 From: ggrundstrom at NetEffect.com (Glenn Grundstrom) Date: Mon, 24 Sep 2007 14:29:23 -0500 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfacesto avoid 4-tuple conflicts. In-Reply-To: <20070923203649.8324.64524.stgit@dell3.ogc.int> References: <20070923203649.8324.64524.stgit@dell3.ogc.int> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC076E481D@venom2> I'm sure I had seen a previous email in this thread that suggested using a userspace library to open a socket in the shared port space. It seems that suggestion was dropped without reason. Does anyone know why? Thanks, Glenn. -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Steve Wise Sent: Sunday, September 23, 2007 3:37 PM To: rdreier at cisco.com; sean.hefty at intel.com Cc: netdev at vger.kernel.org; linux-kernel at vger.kernel.org; general at lists.openfabrics.org Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfacesto avoid 4-tuple conflicts. iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. Version 3: - don't use list_del_init() where list_del() is sufficient. Version 2: - added a per-device mutex for the address and listening endpoints lists. - wait for all replies if sending multiple passive_open requests to rnic. 
- log warning if no addresses are available when a listen is issued. - tested --- Design: The sysadmin creates "for iwarp use only" alias interfaces of the form "devname:iw*" where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with "iw". The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver. EG: ifconfig eth0 192.168.70.123 up ifconfig eth0:iw1 192.168.71.123 up ifconfig eth0:iw2 192.168.72.123 up In the above example, 192.168.70/24 is for TCP traffic, while 192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use. The rdma-only interface must be on its own IP subnet. This allows routing all rdma traffic onto this interface. The iWARP driver must translate all listens on address 0.0.0.0 to the set of rdma-only ip addresses for the device in question. This prevents incoming connect requests to the TCP ipaddresses from going up the rdma stack. Implementation Details: - The iw_cxgb3 driver registers for inetaddr events via register_inetaddr_notifier(). This allows tracking the iwarp-only addresses/subnets as they get added and deleted. The iwarp driver maintains a list of the current iwarp-only addresses. - The iw_cxgb3 driver builds the list of iwarp-only addresses for its devices at module insert time. This is needed because the inetaddr notifier callbacks don't "replay" address-add events when someone registers. So the driver must build the initial list at module load time. - When a listen is done on address 0.0.0.0, then the iw_cxgb3 driver must translate that into a set of listens on the iwarp-only addresses. This is implemented by maintaining a list of stid/addr entries per listening endpoint. - When a new iwarp-only address is added or removed, the iw_cxgb3 driver must traverse the set of listening endpoints and update them accordingly. This allows an application to bind to 0.0.0.0 prior to the iwarp-only interfaces being configured. 
It also allows changing the iwarp-only set of addresses and getting the expected behavior for apps already bound to 0.0.0.0. This is done by maintaining a list of listening endpoints off the device struct. - The address list, the listening endpoint list, and each list of stid/addrs in use per listening endpoint are all protected via a mutex per iw_cxgb3 device. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch.c | 125 ++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch.h | 11 + drivers/infiniband/hw/cxgb3/iwch_cm.c | 259 +++++++++++++++++++++++++++------ drivers/infiniband/hw/cxgb3/iwch_cm.h | 15 ++ 4 files changed, 360 insertions(+), 50 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c index 0315c9d..d81d46e 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.c +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -63,6 +63,123 @@ struct cxgb3_client t3c_client = { static LIST_HEAD(dev_list); static DEFINE_MUTEX(dev_mutex); +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) +{ + struct iwch_addrlist *addr; + + addr = kmalloc(sizeof *addr, GFP_KERNEL); + if (!addr) { + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", + __FUNCTION__); + return; + } + addr->ifa = ifa; + mutex_lock(&rnicp->mutex); + list_add_tail(&addr->entry, &rnicp->addrlist); + mutex_unlock(&rnicp->mutex); +} + +static void remove_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) +{ + struct iwch_addrlist *addr, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { + if (addr->ifa == ifa) { + list_del(&addr->entry); + kfree(addr); + goto out; + } + } +out: + mutex_unlock(&rnicp->mutex); +} + +static int netdev_is_ours(struct iwch_dev *rnicp, struct net_device *netdev) +{ + int i; + + for (i = 0; i < rnicp->rdev.port_info.nports; i++) + if (netdev == rnicp->rdev.port_info.lldevs[i]) + return 1; + return 0; +} + +static inline int is_iwarp_label(char *label) +{ + char *colon; + + colon 
= strchr(label, ':'); + if (colon && !strncmp(colon+1, "iw", 2)) + return 1; + return 0; +} + +static int nb_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct in_ifaddr *ifa = ctx; + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); + + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); + + switch (event) { + case NETDEV_UP: + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && + is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x added\n", + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); + insert_ifa(rnicp, ifa); + iwch_listeners_add_addr(rnicp, ifa->ifa_address); + } + break; + case NETDEV_DOWN: + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && + is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x deleted\n", + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); + iwch_listeners_del_addr(rnicp, ifa->ifa_address); + remove_ifa(rnicp, ifa); + } + break; + default: + break; + } + return 0; +} + +static void delete_addrlist(struct iwch_dev *rnicp) +{ + struct iwch_addrlist *addr, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { + list_del(&addr->entry); + kfree(addr); + } + mutex_unlock(&rnicp->mutex); +} + +static void populate_addrlist(struct iwch_dev *rnicp) +{ + int i; + struct in_device *indev; + + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); + if (!indev) + continue; + for_ifa(indev) + if (is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x added\n", + __FUNCTION__, ifa->ifa_label, + ifa->ifa_address); + insert_ifa(rnicp, ifa); + } + endfor_ifa(indev); + } +} + static void rnic_init(struct iwch_dev *rnicp) { PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r idr_init(&rnicp->qpidr); idr_init(&rnicp->mmidr); spin_lock_init(&rnicp->lock); + INIT_LIST_HEAD(&rnicp->addrlist); + 
INIT_LIST_HEAD(&rnicp->listen_eps); + mutex_init(&rnicp->mutex); + rnicp->nb.notifier_call = nb_callback; + populate_addrlist(rnicp); + register_inetaddr_notifier(&rnicp->nb); rnicp->attr.vendor_id = 0x168; rnicp->attr.vendor_part_id = 7; @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev mutex_lock(&dev_mutex); list_for_each_entry_safe(dev, tmp, &dev_list, entry) { if (dev->rdev.t3cdev_p == tdev) { + unregister_inetaddr_notifier(&dev->nb); + delete_addrlist(dev); list_del(&dev->entry); iwch_unregister_device(dev); cxio_rdev_close(&dev->rdev); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h index caf4e60..7fa0a47 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.h +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -36,6 +36,8 @@ #include #include #include #include +#include +#include #include @@ -101,6 +103,11 @@ struct iwch_rnic_attributes { u32 cq_overflow_detection; }; +struct iwch_addrlist { + struct list_head entry; + struct in_ifaddr *ifa; +}; + struct iwch_dev { struct ib_device ibdev; struct cxio_rdev rdev; @@ -111,6 +118,10 @@ struct iwch_dev { struct idr mmidr; spinlock_t lock; struct list_head entry; + struct notifier_block nb; + struct list_head addrlist; + struct list_head listen_eps; + struct mutex mutex; }; static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 1cdfcd4..afc8a48 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1127,23 +1127,149 @@ static int act_open_rpl(struct t3cdev *t return CPL_RET_BUF_DONE; } -static int listen_start(struct iwch_listen_ep *ep) +static int wait_for_reply(struct iwch_ep_common *epc) +{ + PDBG("%s ep %p waiting\n", __FUNCTION__, epc); + wait_event(epc->waitq, epc->rpl_done); + PDBG("%s ep %p done waiting err %d\n", __FUNCTION__, epc, epc->rpl_err); + return epc->rpl_err; +} + +static struct iwch_listen_entry 
*alloc_listener(struct iwch_listen_ep *ep, + __be32 addr) +{ + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + struct iwch_listen_entry *le; + + le = kmalloc(sizeof *le, GFP_KERNEL); + if (!le) { + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", + __FUNCTION__); + return NULL; + } + le->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, + &t3c_client, ep); + if (le->stid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", + __FUNCTION__); + kfree(le); + return NULL; + } + le->addr = addr; + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); + return le; +} + +static void dealloc_listener(struct iwch_listen_ep *ep, + struct iwch_listen_entry *le) +{ + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); + cxgb3_free_stid(ep->com.tdev, le->stid); + kfree(le); +} + +static void dealloc_listener_list(struct iwch_listen_ep *ep) +{ + struct iwch_listen_entry *le, *tmp; + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + + mutex_lock(&h->mutex); + list_for_each_entry_safe(le, tmp, &ep->listeners, entry) { + list_del(&le->entry); + dealloc_listener(ep, le); + } + mutex_unlock(&h->mutex); +} + +static int alloc_listener_list(struct iwch_listen_ep *ep) +{ + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + struct iwch_addrlist *addr; + struct iwch_listen_entry *le; + int err = 0; + int added=0; + mutex_lock(&h->mutex); + list_for_each_entry(addr, &h->addrlist, entry) { + if (ep->com.local_addr.sin_addr.s_addr == 0 || + ep->com.local_addr.sin_addr.s_addr == + addr->ifa->ifa_address) { + le = alloc_listener(ep, addr->ifa->ifa_address); + if (!le) + break; + list_add_tail(&le->entry, &ep->listeners); + added++; + } + } + mutex_unlock(&h->mutex); + if (ep->com.local_addr.sin_addr.s_addr != 0 && !added) + err = -EADDRNOTAVAIL; + if (!err && !added) + printk(KERN_WARNING MOD + "No RDMA interface found for device %s\n", 
+ pci_name(h->rdev.rnic_info.pdev)); + return err; +} + +static int listen_stop_one(struct iwch_listen_ep *ep, unsigned int stid) { struct sk_buff *skb; - struct cpl_pass_open_req *req; + struct cpl_close_listserv_req *req; + + PDBG("%s stid %u\n", __FUNCTION__, stid); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->cpu_idx = 0; + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, stid)); + skb->priority = 1; + ep->com.rpl_err = 0; + ep->com.rpl_done = 0; + cxgb3_ofld_send(ep->com.tdev, skb); + return wait_for_reply(&ep->com); +} + +static int listen_stop(struct iwch_listen_ep *ep) +{ + struct iwch_listen_entry *le; + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + int err = 0; PDBG("%s ep %p\n", __FUNCTION__, ep); + mutex_lock(&h->mutex); + list_for_each_entry(le, &ep->listeners, entry) { + err = listen_stop_one(ep, le->stid); + if (err) + break; + } + mutex_unlock(&h->mutex); + return err; +} + +static int listen_start_one(struct iwch_listen_ep *ep, unsigned int stid, + __be32 addr, __be16 port) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, stid, ntohl(addr), + ntohs(port)); skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); if (!skb) { - printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); return -ENOMEM; } req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); - req->local_port = ep->com.local_addr.sin_port; - req->local_ip = ep->com.local_addr.sin_addr.s_addr; + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, stid)); + 
req->local_port = port; + req->local_ip = addr; req->peer_port = 0; req->peer_ip = 0; req->peer_netmask = 0; @@ -1152,8 +1278,32 @@ static int listen_start(struct iwch_list req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); skb->priority = 1; + ep->com.rpl_err = 0; + ep->com.rpl_done = 0; cxgb3_ofld_send(ep->com.tdev, skb); - return 0; + return wait_for_reply(&ep->com); +} + +static int listen_start(struct iwch_listen_ep *ep) +{ + struct iwch_listen_entry *le; + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); + int err = 0; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + mutex_lock(&h->mutex); + list_for_each_entry(le, &ep->listeners, entry) { + err = listen_start_one(ep, le->stid, le->addr, + ep->com.local_addr.sin_port); + if (err) + goto fail; + } + mutex_unlock(&h->mutex); + return err; +fail: + mutex_unlock(&h->mutex); + listen_stop(ep); + return err; } static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) @@ -1170,39 +1320,59 @@ static int pass_open_rpl(struct t3cdev * return CPL_RET_BUF_DONE; } -static int listen_stop(struct iwch_listen_ep *ep) -{ - struct sk_buff *skb; - struct cpl_close_listserv_req *req; - - PDBG("%s ep %p\n", __FUNCTION__, ep); - skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); - if (!skb) { - printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); - return -ENOMEM; - } - req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); - req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); - req->cpu_idx = 0; - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); - skb->priority = 1; - cxgb3_ofld_send(ep->com.tdev, skb); - return 0; -} - static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) { struct iwch_listen_ep *ep = ctx; struct cpl_close_listserv_rpl *rpl = cplhdr(skb); - PDBG("%s ep %p\n", __FUNCTION__, ep); + PDBG("%s ep %p stid %u\n", __FUNCTION__, ep, GET_TID(rpl)); + ep->com.rpl_err = status2errno(rpl->status); ep->com.rpl_done = 1; 
wake_up(&ep->com.waitq); return CPL_RET_BUF_DONE; } +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr) +{ + struct iwch_listen_ep *listen_ep; + struct iwch_listen_entry *le; + + mutex_lock(&rnicp->mutex); + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { + if (listen_ep->com.local_addr.sin_addr.s_addr) + continue; + le = alloc_listener(listen_ep, addr); + if (le) { + list_add_tail(&le->entry, &listen_ep->listeners); + listen_start_one(listen_ep, le->stid, addr, + listen_ep->com.local_addr.sin_port); + } + } + mutex_unlock(&rnicp->mutex); +} + +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr) +{ + struct iwch_listen_ep *listen_ep; + struct iwch_listen_entry *le, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { + if (listen_ep->com.local_addr.sin_addr.s_addr) + continue; + list_for_each_entry_safe(le, tmp, &listen_ep->listeners, + entry) + if (le->addr == addr) { + listen_stop_one(listen_ep, le->stid); + list_del(&le->entry); + dealloc_listener(listen_ep, le); + } + } + mutex_unlock(&rnicp->mutex); +} + static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) { struct cpl_pass_accept_rpl *rpl; @@ -1767,8 +1937,7 @@ int iwch_accept_cr(struct iw_cm_id *cm_i goto err; /* wait for wr_ack */ - wait_event(ep->com.waitq, ep->com.rpl_done); - err = ep->com.rpl_err; + err = wait_for_reply(&ep->com); if (err) goto err; @@ -1887,31 +2056,23 @@ int iwch_create_listen(struct iw_cm_id * ep->com.cm_id = cm_id; ep->backlog = backlog; ep->com.local_addr = cm_id->local_addr; + INIT_LIST_HEAD(&ep->listeners); - /* - * Allocate a server TID. 
- */ - ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); - if (ep->stid == -1) { - printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); - err = -ENOMEM; + err = alloc_listener_list(ep); + if (err) goto fail2; - } state_set(&ep->com, LISTEN); err = listen_start(ep); - if (err) - goto fail3; - /* wait for pass_open_rpl */ - wait_event(ep->com.waitq, ep->com.rpl_done); - err = ep->com.rpl_err; if (!err) { cm_id->provider_data = ep; + mutex_lock(&h->mutex); + list_add_tail(&ep->entry, &h->listen_eps); + mutex_unlock(&h->mutex); goto out; } -fail3: - cxgb3_free_stid(ep->com.tdev, ep->stid); + dealloc_listener_list(ep); fail2: cm_id->rem_ref(cm_id); put_ep(&ep->com); @@ -1923,18 +2084,20 @@ out: int iwch_destroy_listen(struct iw_cm_id *cm_id) { int err; + struct iwch_dev *h = to_iwch_dev(cm_id->device); struct iwch_listen_ep *ep = to_listen_ep(cm_id); PDBG("%s ep %p\n", __FUNCTION__, ep); might_sleep(); + mutex_lock(&h->mutex); + list_del(&ep->entry); + mutex_unlock(&h->mutex); state_set(&ep->com, DEAD); ep->com.rpl_done = 0; ep->com.rpl_err = 0; err = listen_stop(ep); - wait_event(ep->com.waitq, ep->com.rpl_done); - cxgb3_free_stid(ep->com.tdev, ep->stid); - err = ep->com.rpl_err; + dealloc_listener_list(ep); cm_id->rem_ref(cm_id); put_ep(&ep->com); return err; diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h index 6107e7c..23e5a22 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -162,10 +162,19 @@ struct iwch_ep_common { int rpl_err; }; -struct iwch_listen_ep { - struct iwch_ep_common com; +struct iwch_listen_entry { + struct list_head entry; unsigned int stid; + __be32 addr; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; /* Must be first entry! 
*/ + struct list_head entry; + struct list_head listeners; int backlog; + int listen_count; + int listen_rpls; }; struct iwch_ep { @@ -222,6 +231,8 @@ int iwch_resume_tid(struct iwch_ep *ep); void __free_ep(struct kref *kref); void iwch_rearp(struct iwch_ep *ep); int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, struct l2t_entry *l2t); +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr); +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr); int __init iwch_cm_init(void); void __exit iwch_cm_term(void); _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Mon Sep 24 12:32:31 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 24 Sep 2007 12:32:31 -0700 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfacesto avoid 4-tuple conflicts. In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC076E481D@venom2> (Glenn Grundstrom's message of "Mon, 24 Sep 2007 14:29:23 -0500") References: <20070923203649.8324.64524.stgit@dell3.ogc.int> <5E701717F2B2ED4EA60F87C8AA57B7CC076E481D@venom2> Message-ID: > I'm sure I had seen a previous email in this thread that suggested using > a userspace library to open a socket > in the shared port space. It seems that suggestion was dropped without > reason. Does anyone know why? Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA, iSER, etc). Does the neteffect NIC have the same issue as cxgb3 here? What are your thoughts on how to handle this? - R. 
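[Editor's sketch] The "iwarp-only" label rule described in the iw_cxgb3 patch above (an alias such as eth0:iw1 qualifies when "iw" follows the ':' directly) can be tried out in userspace. The following is only a minimal sketch that mirrors is_iwarp_label() from the patch; the const qualifier and the standalone form are assumptions made for illustration and are not part of the driver.

```c
#include <string.h>

/*
 * Userspace copy of the label test from the patch: an alias label such as
 * "eth0:iw1" counts as iWARP-only when the text right after the first ':'
 * starts with "iw". A label without ':' never matches.
 */
static int is_iwarp_label(const char *label)
{
	const char *colon = strchr(label, ':');

	return colon != NULL && strncmp(colon + 1, "iw", 2) == 0;
}
```

With this rule, "eth0:iw1" and "eth0:iw2" from the ifconfig example are treated as RDMA-only, while plain "eth0" and an alias like "eth0:1" are left to the host TCP stack.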
From sean.hefty at intel.com Mon Sep 24 14:07:28 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 24 Sep 2007 14:07:28 -0700 Subject: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses In-Reply-To: <46F7FDE5.9070305@oracle.com> References: <46F7FDE5.9070305@oracle.com> Message-ID: <000401c7feee$ea073180$ff0da8c0@amr.corp.intel.com> If a user allocates a QP on an rdma_cm_id, the rdma_cm will automatically transition the QP through its states (RTR, RTS, error, etc.) While the QP state transitions are occurring, the QP itself must remain valid. Provide locking around the QP pointer to prevent its destruction while accessing the pointer. This fixes an issue reported by Olaf Kirch from Oracle that resulted in a system crash: "An incoming connection arrives and we decide to tear down the nascent connection. The remote ends decides to do the same. We start to shut down the connection, and call rdma_destroy_qp on our cm_id. ... Now apparently a 'connect reject' message comes in from the other host, and cma_ib_handler() is called with an event of IB_CM_REJ_RECEIVED. It calls cma_modify_qp_err, which for some odd reason tries to modify the exact same QP we just destroyed." Signed-off-by: Sean Hefty --- Rick, can you please test this patch and let me know if it fixes your problem? 
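[Editor's sketch] The locking idea in the description above (destroy the QP and clear the pointer under a mutex, and have every state transition re-check the pointer under the same mutex) can be illustrated outside the kernel. This is a rough userspace sketch using pthreads, not the rdma_cm code itself; struct fake_qp, struct conn and the conn_* helpers are hypothetical names invented for the illustration.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_qp. */
struct fake_qp {
	int state;
};

/* Hypothetical stand-in for struct rdma_id_private. */
struct conn {
	pthread_mutex_t qp_mutex;	/* guards the qp pointer and its use */
	struct fake_qp *qp;
};

static void conn_init(struct conn *c)
{
	pthread_mutex_init(&c->qp_mutex, NULL);
	c->qp = malloc(sizeof(*c->qp));
	if (c->qp)
		c->qp->state = 0;
}

/* Like rdma_destroy_qp() in the patch: free and clear under the mutex. */
static void conn_destroy_qp(struct conn *c)
{
	pthread_mutex_lock(&c->qp_mutex);
	free(c->qp);
	c->qp = NULL;
	pthread_mutex_unlock(&c->qp_mutex);
}

/*
 * Like cma_modify_qp_err() in the patch: the NULL check and the access
 * happen under the same mutex, so a concurrent destroy cannot free the
 * QP in between. Returns 0 and does nothing once the QP is gone.
 */
static int conn_modify_qp(struct conn *c, int new_state)
{
	pthread_mutex_lock(&c->qp_mutex);
	if (c->qp)
		c->qp->state = new_state;
	pthread_mutex_unlock(&c->qp_mutex);
	return 0;
}
```

A late event handler that calls conn_modify_qp() after conn_destroy_qp() then sees a NULL pointer and returns quietly, which is the behaviour the patch aims for when IB_CM_REJ_RECEIVED arrives after rdma_destroy_qp().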
drivers/infiniband/core/cma.c | 90 +++++++++++++++++++++++++++-------------- 1 files changed, 60 insertions(+), 30 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 9ffb998..c6a6dba 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -120,6 +120,8 @@ struct rdma_id_private { enum cma_state state; spinlock_t lock; + struct mutex qp_mutex; + struct completion comp; atomic_t refcount; wait_queue_head_t wait_remove; @@ -387,6 +389,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, id_priv->id.event_handler = event_handler; id_priv->id.ps = ps; spin_lock_init(&id_priv->lock); + mutex_init(&id_priv->qp_mutex); init_completion(&id_priv->comp); atomic_set(&id_priv->refcount, 1); init_waitqueue_head(&id_priv->wait_remove); @@ -472,61 +475,86 @@ EXPORT_SYMBOL(rdma_create_qp); void rdma_destroy_qp(struct rdma_cm_id *id) { - ib_destroy_qp(id->qp); + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, id); + mutex_lock(&id_priv->qp_mutex); + ib_destroy_qp(id_priv->id.qp); + id_priv->id.qp = NULL; + mutex_unlock(&id_priv->qp_mutex); } EXPORT_SYMBOL(rdma_destroy_qp); -static int cma_modify_qp_rtr(struct rdma_cm_id *id) +static int cma_modify_qp_rtr(struct rdma_id_private *id_priv) { struct ib_qp_attr qp_attr; int qp_attr_mask, ret; - if (!id->qp) - return 0; + mutex_lock(&id_priv->qp_mutex); + if (!id_priv->id.qp) { + ret = 0; + goto out; + } /* Need to update QP attributes from default values. 
*/ qp_attr.qp_state = IB_QPS_INIT; - ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); if (ret) - return ret; + goto out; - ret = ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); if (ret) - return ret; + goto out; qp_attr.qp_state = IB_QPS_RTR; - ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); if (ret) - return ret; + goto out; - return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); +out: + mutex_unlock(&id_priv->qp_mutex); + return ret; } -static int cma_modify_qp_rts(struct rdma_cm_id *id) +static int cma_modify_qp_rts(struct rdma_id_private *id_priv) { struct ib_qp_attr qp_attr; int qp_attr_mask, ret; - if (!id->qp) - return 0; + mutex_lock(&id_priv->qp_mutex); + if (!id_priv->id.qp) { + ret = 0; + goto out; + } qp_attr.qp_state = IB_QPS_RTS; - ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); if (ret) - return ret; + goto out; - return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); +out: + mutex_unlock(&id_priv->qp_mutex); + return ret; } -static int cma_modify_qp_err(struct rdma_cm_id *id) +static int cma_modify_qp_err(struct rdma_id_private *id_priv) { struct ib_qp_attr qp_attr; + int ret; - if (!id->qp) - return 0; + mutex_lock(&id_priv->qp_mutex); + if (!id_priv->id.qp) { + ret = 0; + goto out; + } qp_attr.qp_state = IB_QPS_ERR; - return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, IB_QP_STATE); +out: + mutex_unlock(&id_priv->qp_mutex); + return ret; } static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv, @@ -855,11 +883,11 @@ static int cma_rep_recv(struct rdma_id_private *id_priv) { int ret; - ret = 
cma_modify_qp_rtr(&id_priv->id); + ret = cma_modify_qp_rtr(id_priv); if (ret) goto reject; - ret = cma_modify_qp_rts(&id_priv->id); + ret = cma_modify_qp_rts(id_priv); if (ret) goto reject; @@ -869,7 +897,7 @@ static int cma_rep_recv(struct rdma_id_private *id_priv) return 0; reject: - cma_modify_qp_err(&id_priv->id); + cma_modify_qp_err(id_priv); ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; @@ -945,7 +973,7 @@ static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) /* ignore event */ goto out; case IB_CM_REJ_RECEIVED: - cma_modify_qp_err(&id_priv->id); + cma_modify_qp_err(id_priv); event.status = ib_event->param.rej_rcvd.reason; event.event = RDMA_CM_EVENT_REJECTED; event.param.conn.private_data = ib_event->private_data; @@ -2236,7 +2264,7 @@ static int cma_connect_iw(struct rdma_id_private *id_priv, sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr; cm_id->remote_addr = *sin; - ret = cma_modify_qp_rtr(&id_priv->id); + ret = cma_modify_qp_rtr(id_priv); if (ret) goto out; @@ -2303,7 +2331,7 @@ static int cma_accept_ib(struct rdma_id_private *id_priv, int qp_attr_mask, ret; if (id_priv->id.qp) { - ret = cma_modify_qp_rtr(&id_priv->id); + ret = cma_modify_qp_rtr(id_priv); if (ret) goto out; @@ -2342,7 +2370,7 @@ static int cma_accept_iw(struct rdma_id_private *id_priv, struct iw_cm_conn_param iw_param; int ret; - ret = cma_modify_qp_rtr(&id_priv->id); + ret = cma_modify_qp_rtr(id_priv); if (ret) return ret; @@ -2414,7 +2442,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) return 0; reject: - cma_modify_qp_err(id); + cma_modify_qp_err(id_priv); rdma_reject(id, NULL, 0); return ret; } @@ -2484,7 +2512,7 @@ int rdma_disconnect(struct rdma_cm_id *id) switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: - ret = cma_modify_qp_err(id); + ret = cma_modify_qp_err(id_priv); if (ret) goto out; /* Initiate or respond to a disconnect. 
*/ @@ -2515,9 +2543,11 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast) cma_disable_remove(id_priv, CMA_ADDR_RESOLVED)) return 0; + mutex_lock(&id_priv->qp_mutex); if (!status && id_priv->id.qp) status = ib_attach_mcast(id_priv->id.qp, &multicast->rec.mgid, multicast->rec.mlid); + mutex_unlock(&id_priv->qp_mutex); memset(&event, 0, sizeof event); event.status = status; From ggrundstrom at NetEffect.com Mon Sep 24 14:30:17 2007 From: ggrundstrom at NetEffect.com (Glenn Grundstrom) Date: Mon, 24 Sep 2007 16:30:17 -0500 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfacesto avoid 4-tuple conflicts. In-Reply-To: References: <20070923203649.8324.64524.stgit@dell3.ogc.int><5E701717F2B2ED4EA60F87C8AA57B7CC076E481D@venom2> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC076E489A@venom2> > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Monday, September 24, 2007 2:33 PM > To: Glenn Grundstrom > Cc: Steve Wise; sean.hefty at intel.com; general at lists.openfabrics.org > Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support > "iwarp-only" interfacesto avoid 4-tuple conflicts. > > > I'm sure I had seen a previous email in this thread that > suggested using > > a userspace library to open a socket > > in the shared port space. It seems that suggestion was > dropped without > > reason. Does anyone know why? > > Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA, > iSER, etc). The kernel apps could open a Linux tcp socket and create an RDMA socket connection. Both calls are standard Linux kernel architected routines. Doesn't NFSoRDMA already open a TCP socket and another for RDMA traffic (ports 2049 & 2050 if I remember correctly)? I currently don't know if iSER, RDS, etc. already do the same thing, but if they don't, they probably could very easily. > > Does the neteffect NIC have the same issue as cxgb3 here? What are > your thoughts on how to handle this? 
Yes, the NetEffect RNIC will have the same issue as Chelsio.  And all
future RNICs which support a unified tcp address with Linux will as
well.

Steve has put a lot of thought and energy into the problem, but
I don't think users & admins will be very happy with us in the long run.

In summary, short of having the rdma_cm share kernel port space, I'd
like to see the equivalent in userspace and have the kernel apps handle
the issue in a similar way as described above.  There are a few
technical issues to work through (like passing the userspace IP address
to the kernel), but I think we can solve that just like other information
that gets passed from user into the IB/RDMA kernel modules.

Glenn.

>
> - R.
>

From mshefty at ichips.intel.com Mon Sep 24 14:56:06 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 24 Sep 2007 14:56:06 -0700
Subject: [ofa-general] Re: A question about rdma_get_cm_event
In-Reply-To: <46F58166.1070204@opengridcomputing.com>
References: <46F25B6D.9000000@dev.mellanox.co.il>
 <46F2AADC.7040201@ichips.intel.com>
 <46F58166.1070204@opengridcomputing.com>
Message-ID: <46F83276.2040102@ichips.intel.com>

> Note that the private data length _is_ correct for iwarp.  So the man
> pages should mention that this is an IB-only issue maybe?  And maybe
> indicate that transport-independent applications should not rely on the
> length...

I modified the man pages to describe private_data_len as:

Specifies the size of the user-controlled data buffer.  Note that
the actual amount of data transferred to the remote side is transport
dependent and may be larger than that requested.

These changes have been pushed into my git tree.
- Sean From kliteyn at dev.mellanox.co.il Mon Sep 24 15:23:39 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 25 Sep 2007 00:23:39 +0200 Subject: [ofa-general] [PATCH] osm/osm_sa_path_record: trivial cosmetic chage Message-ID: <46F838EB.10704@dev.mellanox.co.il> Trivial fix in osm_sa_path_record.c Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_sa_path_record.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c index 3b183d9..ce75ec8 100644 --- a/opensm/opensm/osm_sa_path_record.c +++ b/opensm/opensm/osm_sa_path_record.c @@ -723,7 +723,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, if (pkey) { p_prtn = (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, - pkey & cl_ntoh16((uint16_t) ~ + pkey & cl_hton16((uint16_t) ~ 0x8000)); if (p_prtn == (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) -- 1.5.1.4 From tom at opengridcomputing.com Mon Sep 24 15:25:51 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Sep 2007 17:25:51 -0500 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfacesto avoid 4-tuple conflicts. In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC076E489A@venom2> References: <20070923203649.8324.64524.stgit@dell3.ogc.int> <5E701717F2B2ED4EA60F87C8AA57B7CC076E481D@venom2> <5E701717F2B2ED4EA60F87C8AA57B7CC076E489A@venom2> Message-ID: <1190672751.24606.56.camel@trinity.ogc.int> On Mon, 2007-09-24 at 16:30 -0500, Glenn Grundstrom wrote: > > > -----Original Message----- > > From: Roland Dreier [mailto:rdreier at cisco.com] > > Sent: Monday, September 24, 2007 2:33 PM > > To: Glenn Grundstrom > > Cc: Steve Wise; sean.hefty at intel.com; general at lists.openfabrics.org > > Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support > > "iwarp-only" interfacesto avoid 4-tuple conflicts. 
> >
> > > I'm sure I had seen a previous email in this thread that suggested
> > > using a userspace library to open a socket
> > > in the shared port space.  It seems that suggestion was dropped
> > > without reason.  Does anyone know why?
> >
> > Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA,
> > iSER, etc).
>
> The kernel apps could open a Linux tcp socket and create an RDMA
> socket connection.  Both calls are standard Linux kernel architected
> routines.

This approach was NAK'd by David Miller and others...

> Doesn't NFSoRDMA already open a TCP socket and another for
> RDMA traffic (ports 2049 & 2050 if I remember correctly)?

The NFS RDMA transport driver does not open a socket for the RDMA
connection. It uses a different port in order to allow both TCP and RDMA
mounts to the same filer.

> I currently
> don't know if iSER, RDS, etc. already do the same thing, but if they
> don't, they probably could very easily.
>

Woe be to those who do so...

> >
> > Does the neteffect NIC have the same issue as cxgb3 here?  What are
> > your thoughts on how to handle this?
>
> Yes, the NetEffect RNIC will have the same issue as Chelsio.  And all
> future RNICs which support a unified tcp address with Linux will as
> well.
>
> Steve has put a lot of thought and energy into the problem, but
> I don't think users & admins will be very happy with us in the long run.
>

Agreed.

> In summary, short of having the rdma_cm share kernel port space, I'd
> like to see the equivalent in userspace and have the kernel apps handle
> the issue in a similar way as described above.  There are a few
> technical
> issues to work through (like passing the userspace IP address to the
> kernel),

This just moves the socket creation to code that is outside the purview
of the kernel maintainers. The exchanging of the 4-tuple created with
the kernel module, however, is back in the kernel and in the
maintainer's control and responsibility.
In my view anything like this will be viewed as an attempt to sneak code into the kernel that the maintainer has already vehemently rejected. This will make people angry and damage the cooperative working relationship that we are trying to build. > but I think we can solve that just like other information that > gets passed from user into the IB/RDMA kernel modules. > Sharing the IP 4-tuple space cooperatively with the core in any fashion has been nak'd. Without this cooperation, the options we've been able to come up with are administrative/policy based approaches. Any ideas you have along these lines are welcome. Tom > Glenn. > > > > > - R. > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From kliteyn at dev.mellanox.co.il Mon Sep 24 15:30:00 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 25 Sep 2007 00:30:00 +0200 Subject: [ofa-general] [PATCH] osm: QoS parser - adding pkey in port groups Message-ID: <46F83A68.4040004@dev.mellanox.co.il> Adding option to specify partitions for port groups in QoS policy file using pkeys in addition to partition names. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_qos_parser.y | 138 +++++++++++++++++++++++++++++++--------- 1 files changed, 108 insertions(+), 30 deletions(-) diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index 3c54205..e0faaaf 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -107,11 +107,20 @@ static void __parser_add_port_to_port_map( cl_qmap_t * p_map, osm_physp_t * p_physp); -static void __parser_add_range_to_port_map( +static void __parser_add_guid_range_to_port_map( cl_qmap_t * p_map, uint64_t ** range_arr, unsigned range_len); +static void __parser_add_pkey_range_to_port_map( + cl_qmap_t * p_map, + uint64_t ** range_arr, + unsigned range_len); + +static void __parser_add_partition_list_to_port_map( + cl_qmap_t * p_map, + cl_list_t * p_list); + static void __parser_add_map_to_port_map( cl_qmap_t * p_dmap, cl_map_t * p_smap); @@ -237,6 +246,8 @@ qos_policy_entry: port_groups_section * ... * port-name: vs1/HCA-1/P1 * ... + * pkey: 0x00FF-0x0FFF + * ... * partition: Part1 * ... 
* node-type: ROUTER,CA,SWITCH,SELF,ALL @@ -278,6 +289,7 @@ port_group_entry: port_group_name | port_group_use | port_group_port_guid | port_group_port_name + | port_group_pkey | port_group_partition | port_group_node_type ; @@ -526,6 +538,7 @@ qos_match_rule_entry: qos_match_rule_use * port_group_use * port_group_port_guid * port_group_port_name + * port_group_pkey * port_group_partition * port_group_node_type */ @@ -625,9 +638,10 @@ port_group_port_guid: port_group_port_guid_start list_of_ranges { &range_arr, &range_len ); - __parser_add_range_to_port_map(&p_current_port_group->port_map, - range_arr, - range_len); + __parser_add_guid_range_to_port_map( + &p_current_port_group->port_map, + range_arr, + range_len); } } ; @@ -637,33 +651,36 @@ port_group_port_guid_start: TK_PORT_GUID { } ; -port_group_partition: port_group_partition_start string_list { - /* 'partition' in 'port-group' - any num of instances */ - cl_list_iterator_t list_iterator; - char * tmp_str; - osm_prtn_t * p_prtn; - - /* extract all the ports from the partition - to the port map of this port group */ - list_iterator = cl_list_head(&tmp_parser_struct.str_list); - while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) +port_group_pkey: port_group_pkey_start list_of_ranges { + /* 'pkey' in 'port-group' - any num of instances */ + /* list of pkey ranges */ + if (cl_list_count(&tmp_parser_struct.num_pair_list)) { - tmp_str = (char*)cl_list_obj(list_iterator); - if (tmp_str) - { - p_prtn = osm_prtn_find_by_name(p_qos_policy->p_subn, tmp_str); - if (p_prtn) - { - __parser_add_map_to_port_map(&p_current_port_group->port_map, - &p_prtn->part_guid_tbl); - __parser_add_map_to_port_map(&p_current_port_group->port_map, - &p_prtn->full_guid_tbl); - } - free(tmp_str); - } - list_iterator = cl_list_next(list_iterator); + uint64_t ** range_arr; + unsigned range_len; + + __rangelist2rangearr( &tmp_parser_struct.num_pair_list, + &range_arr, + &range_len ); + + __parser_add_pkey_range_to_port_map( + 
&p_current_port_group->port_map, + range_arr, + range_len); } - cl_list_remove_all(&tmp_parser_struct.str_list); + } + ; + +port_group_pkey_start: TK_PKEY { + RESET_BUFFER; + } + ; + +port_group_partition: port_group_partition_start string_list { + /* 'partition' in 'port-group' - any num of instances */ + __parser_add_partition_list_to_port_map( + &p_current_port_group->port_map, + &tmp_parser_struct.str_list); } ; @@ -2226,7 +2243,7 @@ static void __parser_add_port_to_port_map( /*************************************************** ***************************************************/ -static void __parser_add_range_to_port_map( +static void __parser_add_guid_range_to_port_map( cl_qmap_t * p_map, uint64_t ** range_arr, unsigned range_len) @@ -2253,6 +2270,67 @@ static void __parser_add_range_to_port_map( /*************************************************** ***************************************************/ +static void __parser_add_pkey_range_to_port_map( + cl_qmap_t * p_map, + uint64_t ** range_arr, + unsigned range_len) +{ + unsigned i; + uint64_t pkey_64; + ib_net16_t pkey; + osm_prtn_t * p_prtn; + + if (!range_arr || !range_len) + return; + + for (i = 0; i < range_len; i++) { + for (pkey_64 = range_arr[i][0]; pkey_64 <= range_arr[i][1]; pkey_64++) { + pkey = cl_hton16((uint16_t)(pkey_64 & 0x7fff)); + p_prtn = (osm_prtn_t *) + cl_qmap_get(&p_qos_policy->p_subn->prtn_pkey_tbl, pkey); + if (p_prtn != (osm_prtn_t *)cl_qmap_end( + &p_qos_policy->p_subn->prtn_pkey_tbl)) { + __parser_add_map_to_port_map(p_map, &p_prtn->part_guid_tbl); + __parser_add_map_to_port_map(p_map, &p_prtn->full_guid_tbl); + } + } + free(range_arr[i]); + } + free(range_arr); +} + +/*************************************************** + ***************************************************/ + +static void __parser_add_partition_list_to_port_map( + cl_qmap_t * p_map, + cl_list_t * p_list) +{ + cl_list_iterator_t list_iterator; + char * tmp_str; + osm_prtn_t * p_prtn; + + /* extract all the ports 
from the partition + to the port map of this port group */ + list_iterator = cl_list_head(p_list); + while(list_iterator != cl_list_end(p_list)) { + tmp_str = (char*)cl_list_obj(list_iterator); + if (tmp_str) { + p_prtn = osm_prtn_find_by_name(p_qos_policy->p_subn, tmp_str); + if (p_prtn) { + __parser_add_map_to_port_map(p_map, &p_prtn->part_guid_tbl); + __parser_add_map_to_port_map(p_map, &p_prtn->full_guid_tbl); + } + free(tmp_str); + } + list_iterator = cl_list_next(list_iterator); + } + cl_list_remove_all(p_list); +} + +/*************************************************** + ***************************************************/ + static void __parser_add_map_to_port_map( cl_qmap_t * p_dmap, cl_map_t * p_smap) -- 1.5.1.4 From hadi at cyberus.ca Mon Sep 24 15:38:25 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 24 Sep 2007 18:38:25 -0400 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <46F76087.8030109@intel.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <46F6AE18.7080708@garzik.org> <1190574713.5030.4.camel@localhost> <46F6C059.6000600@intel.com> <1190582448.4240.2.camel@localhost> <46F76087.8030109@intel.com> Message-ID: <1190673505.4264.11.camel@localhost> On Mon, 2007-24-09 at 00:00 -0700, Kok, Auke wrote: > that's bad to begin with :) - please send those separately so I can fasttrack them > into e1000e and e1000 where applicable. Ive been CCing you ;-> Most of the changes are readability and reusability with the batching. > But yes, I'm very inclined to merge more features into e1000e than e1000. I intend > to put multiqueue support into e1000e, as *all* of the hardware that it will > support has multiple queues. Putting in any other performance feature like tx > batching would absolutely be interesting. 
I looked at the e1000e and it is very close to e1000 so i should be able
to move the changes easily. Most importantly, can i kill LLTX?
For tx batching, we have to wait to see how Dave wants to move forward;
i will have the patches but it is not something you need to push until
we see where that is going.

cheers,
jamal

From hadi at cyberus.ca Mon Sep 24 15:51:38 2007
From: hadi at cyberus.ca (jamal)
Date: Mon, 24 Sep 2007 18:51:38 -0400
Subject: [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
In-Reply-To:
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
 <20070916.161748.48388692.davem@davemloft.net>
 <1189988958.4230.55.camel@localhost>
 <1190569987.4256.52.camel@localhost>
 <1190570205.4256.56.camel@localhost>
Message-ID: <1190674298.4264.24.camel@localhost>

On Mon, 2007-24-09 at 12:12 -0700, Waskiewicz Jr, Peter P wrote:
> Hi Jamal,
> I've been (slowly) working on resurrecting the original design
> of my multiqueue patches to address this exact issue of the queue_lock
> being a hot item.  I added a queue_lock to each queue in the subqueue
> struct, and in the enqueue and dequeue, just lock that queue instead of
> the global device queue_lock.  The only two issues to overcome are the
> QDISC_RUNNING state flag, since that also serializes entry into the
> qdisc_restart() function, and the qdisc statistics maintenance, which
> needs to be serialized.  Do you think this work along with your patch
> will benefit from one another?

The one thing that seems obvious is to use dev->hard_prep_xmit() in the
patches i posted to select the xmit ring. You should be able to figure
out the txmit ring without holding any lock.
I lost track of how/where things went since the last discussion; so i
need to wrap my mind around it to make sensible suggestions - I know
the core patches are in the kernel but haven't paid attention to details
and if you look at my second patch you'd see a comment in
dev_batch_xmit() which says i need to scrutinize multiqueue more.

cheers,
jamal

From auke-jan.h.kok at intel.com Mon Sep 24 15:52:58 2007
From: auke-jan.h.kok at intel.com (Kok, Auke)
Date: Mon, 24 Sep 2007 15:52:58 -0700
Subject: [ofa-general] Re: [PATCHES] TX batching
In-Reply-To: <1190673505.4264.11.camel@localhost>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
 <20070916.161748.48388692.davem@davemloft.net>
 <1189988958.4230.55.camel@localhost>
 <1190569987.4256.52.camel@localhost>
 <46F6AE18.7080708@garzik.org>
 <1190574713.5030.4.camel@localhost>
 <46F6C059.6000600@intel.com>
 <1190582448.4240.2.camel@localhost>
 <46F76087.8030109@intel.com>
 <1190673505.4264.11.camel@localhost>
Message-ID: <46F83FCA.9000406@intel.com>

jamal wrote:
> On Mon, 2007-24-09 at 00:00 -0700, Kok, Auke wrote:
>
>> that's bad to begin with :) - please send those separately so I can fasttrack them
>> into e1000e and e1000 where applicable.
>
> I've been CCing you ;-> Most of the changes are readability and
> reusability with the batching.
>
>> But yes, I'm very inclined to merge more features into e1000e than e1000. I intend
>> to put multiqueue support into e1000e, as *all* of the hardware that it will
>> support has multiple queues. Putting in any other performance feature like tx
>> batching would absolutely be interesting.
>
> I looked at the e1000e and it is very close to e1000 so i should be able
> to move the changes easily. Most importantly, can i kill LLTX?
> For tx batching, we have to wait to see how Dave wants to move forward;
> i will have the patches but it is not something you need to push until
> we see where that is going.
hmm, I thought I already removed that, but now I see some remnants from
that. By all means, please send a separate patch for that!

Auke

From hadi at cyberus.ca Mon Sep 24 15:54:19 2007
From: hadi at cyberus.ca (jamal)
Date: Mon, 24 Sep 2007 18:54:19 -0400
Subject: [ofa-general] [DOC] Net batching driver howto
In-Reply-To: <1190574713.5030.4.camel@localhost>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
 <20070916.161748.48388692.davem@davemloft.net>
 <1189988958.4230.55.camel@localhost>
 <1190569987.4256.52.camel@localhost>
 <46F6AE18.7080708@garzik.org>
 <1190574713.5030.4.camel@localhost>
Message-ID: <1190674459.4264.28.camel@localhost>

I have updated the driver howto to match the patches i posted yesterday.
attached.

cheers,
jamal
-------------- next part --------------
Here's the beginning of a howto for driver authors.

The intended audience for this howto is people already familiar
with netdevices.

1.0 Netdevice Pre-requisites
------------------------------

For hardware based netdevices, you must have at least hardware that
is capable of doing DMA with many descriptors; i.e. having hardware
with a queue length of 3 (as in some fscked ethernet hardware) is
not very useful in this case.

2.0 What is new in the driver API
-----------------------------------

There are 3 new methods and one new variable introduced. These are:
1)dev->hard_prep_xmit()
2)dev->hard_end_xmit()
3)dev->hard_batch_xmit()
4)dev->xmit_win

2.1 Using Core driver changes
-----------------------------

To provide context, let's look at a typical driver abstraction
for dev->hard_start_xmit().
It has 4 parts:
a) packet formatting (example vlan, mss, descriptor counting etc)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew
on, tx completion interrupts etc

[For code cleanliness/readability sake, regardless of this work,
one should break the dev->hard_start_xmit() into those 4 functions
anyways].

A driver which has all 4 parts and needs to support batching is
advised to split its dev->hard_start_xmit() in the following manner:
1)use its dev->hard_prep_xmit() method to achieve #a
2)use its dev->hard_end_xmit() method to achieve #d
3)#b and #c can stay in ->hard_start_xmit() (or whichever way you
want to do this)
Note: There are drivers which may not need to support either of the two
methods (example the tun driver i patched) so the two methods are
essentially optional.

2.1.1 Theory of operation
--------------------------

The core will first do the packet formatting by invoking your
supplied dev->hard_prep_xmit() method. It will then pass you the packets
via your dev->hard_start_xmit() method, for as many packets as you
have advertised (via dev->xmit_win) you can consume. Lastly it will
invoke your dev->hard_end_xmit() when it completes passing you all the
packets queued for you.

2.1.1.1 Locking rules
---------------------

dev->hard_prep_xmit() is invoked without holding any
tx lock but the rest are under TX_LOCK(). So you have to ensure that
whatever you put in dev->hard_prep_xmit() doesn't require locking.

2.1.1.2 The slippery LLTX
-------------------------

LLTX drivers present a challenge in that we have to introduce a deviation
from the norm and require the ->hard_batch_xmit() method. An LLTX
driver presents us with ->hard_batch_xmit() to which we pass a list
of packets in a dev->blist skb queue. It is then the responsibility
of the ->hard_batch_xmit() to exercise steps #b and #c for all packets
passed in the dev->blist.
Step #a and #d are done by the core should you register presence of
dev->hard_prep_xmit() and dev->hard_end_xmit() in your setup.

2.1.1.3 xmit_win
----------------

The dev->xmit_win variable is set by the driver to tell us how
much space it has in its rings/queues. dev->xmit_win is introduced to
ensure that when we pass the driver a list of packets it will swallow
all of them - which is useful because we don't requeue to the qdisc (and
avoids burning unnecessary cpu cycles or introducing any strange
re-ordering). The driver tells us, whenever it invokes netif_wake_queue,
how much space it has for descriptors by setting this variable.

3.0 Driver Essentials
---------------------

The typical driver tx state machine is:

----
-1-> +Core sends packets
     +--> Driver puts packet onto hardware queue
     +    if hardware queue is full, netif_stop_queue(dev)
     +
-2-> +core stops sending because of netif_stop_queue(dev)
..
.. time passes ...
..
-3-> +---> driver has transmitted packets, opens up tx path by invoking
     +     netif_wake_queue(dev)
-1-> +Cycle repeats and core sends more packets (step 1).
----

3.1 Driver pre-requisite
--------------------------

This is _a very important_ requirement in making batching useful.
The pre-requisite for batching changes is that the driver should
provide a low threshold to open up the tx path.
Drivers such as tg3 and e1000 already do this.
Before you invoke netif_wake_queue(dev) you check if there is a
threshold of space reached to insert new packets.

Here's an example of how i added it to the tun driver. Observe the
setting of dev->xmit_win

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
    u32 t = skb_queue_len(&tun->readq);
    if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
        tun->dev->xmit_win = tun->dev->tx_queue_len;
        netif_wake_queue(tun->dev);
    }
---

Here's how the batching e1000 driver does it:

--
    if (unlikely(cleaned && netif_carrier_ok(netdev) &&
        E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

        if (netif_queue_stopped(netdev)) {
            int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
            netdev->xmit_win = rspace;
            netif_wake_queue(netdev);
        }
    }
---

The tg3 code (with no batching changes) looks like:

-----
    if (netif_queue_stopped(tp->dev) &&
        (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
        netif_wake_queue(tp->dev);
---

3.2 Driver Setup
-----------------

a) On initialization (before netdev registration)
 i) set NETIF_F_BTX in dev->features
    i.e. dev->features |= NETIF_F_BTX
    This makes the core do proper initialization.
 ii) set dev->xmit_win to something reasonable like
    maybe half the tx DMA ring size etc.

b) create proper pointers to the new methods described above if
you need them.

3.3 Annotation on the different methods
----------------------------------------

This section shows examples and offers suggestions on how the different
methods and variable could be used.

3.3.1 The dev->hard_prep_xmit() method
---------------------------------------

Use this method to only do pre-processing of the skb passed.
If in the current dev->hard_start_xmit() you are pre-processing
packets before holding any locks (eg formatting them to be put in
any descriptor etc), that pre-processing belongs here.
Look at e1000_prep_queue_frame() for an example.
You may use the skb->cb to store any state that you need to know
of later when batching.
PS: I have found when discussing with Michael Chan and Matt Carlson
that skb->cb[0] (8bytes of it) is used by the VLAN code to pass VLAN
info to the driver. I think this is a violation of the usage of the cb
scratch pad.
To work around this, you could use skb->cb[8] or do what the broadcom
tg3 batching driver does which is to glean the vlan info first then
re-use the skb->cb.

3.3.2 dev->hard_start_xmit()
----------------------------

Here's an example of a tx routine that is similar to the one i added
to the current tun driver. bxmit suffix is kept so that you can turn
off batching if needed via and call already existing interface.

----
  static int xxx_net_bxmit(struct net_device *dev)
  {
  ....
  ....
        enqueue onto hardware ring

        if (hardware ring full) {
                netif_stop_queue(dev);
                dev->xmit_win = 1;
        }

   .......
   ..
   .
  }
------

All return codes like NETDEV_TX_OK etc still apply.

3.3.3 The LLTX batching method, dev->hard_batch_xmit()
------------------------------------------------------

Here's an example of a batch tx routine that is similar to the one i
added to the older tun driver. Essentially this is what you'd do if you
wanted to support LLTX.

----
  static int xxx_net_bxmit(struct net_device *dev)
  {
  ....
  ....
        while (skb_queue_len(dev->blist)) {
                dequeue from dev->blist
                enqueue onto hardware ring
                if hardware ring full break
        }

        if (hardware ring full) {
                netif_stop_queue(dev);
                dev->xmit_win = 1;
        }

   .......
   ..
   .
  }
------

All return codes like NETDEV_TX_OK etc still apply.

3.3.4 The tx complete, dev->hard_end_xmit()
-------------------------------------------------

In this method, if there are any IO operations that apply to a
set of packets such as kicking DMA, setting of interrupt thresholds etc,
leave them to the end and apply them once if you have successfully
enqueued. This provides a mechanism for saving a lot of cpu cycles since
IO is cycle expensive.
For an example of this, look at the e1000 driver's e1000_kick_DMA()
function.

3.3.5 setting the dev->xmit_win
-----------------------------

As mentioned earlier this variable provides hints on how much
data to send from the core to the driver. Here are the obvious ways:
a)on doing a netif_stop, set it to 1.
  By default all drivers have this value set to 1 to emulate old
  behavior where a driver only receives one packet at a time.
b)on netif_wake_queue set it to the max available space. You have
  to be careful if your hardware does scatter-gather since the core
  will pass you scatter-gatherable skbs and so you want to at least
  leave enough space for the maximum allowed. Look at the tg3 and
  e1000 to see how this is implemented.

The variable is important because it avoids the core sending
any more than what the driver can handle, therefore avoiding
any need to muck with packet scheduling mechanisms.

Appendix 1: History
-------------------
June 11/2007: Initial revision
June 11/2007: Fixed typo on e1000 netif_wake description ..
Aug  08/2007: Added info on VLAN and the skb->cb[] danger ..
Sep  24/2007: Revised and cleaned up

From peter.p.waskiewicz.jr at intel.com Mon Sep 24 15:57:33 2007
From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P)
Date: Mon, 24 Sep 2007 15:57:33 -0700
Subject: [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
In-Reply-To: <1190674298.4264.24.camel@localhost>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
 <20070916.161748.48388692.davem@davemloft.net>
 <1189988958.4230.55.camel@localhost>
 <1190569987.4256.52.camel@localhost>
 <1190570205.4256.56.camel@localhost>
 <1190674298.4264.24.camel@localhost>
Message-ID:

> The one thing that seems obvious is to use
> dev->hard_prep_xmit() in the patches i posted to select the
> xmit ring. You should be able to figure out the txmit ring
> without holding any lock.

I've looked at that as a candidate to use.  The lock for enqueue would
be needed when actually placing the skb into the appropriate software
queue for the qdisc, so it'd be quick.
> I lost track of how/where things went since the last > discussion; so i need to wrap my mind around it to make > sensisble suggestions - I know the core patches are in the > kernel but havent paid attention to details and if you look > at my second patch youd see a comment in > dev_batch_xmit() which says i need to scrutinize multiqueue more. No worries. I'll try to get things together on my end and provide some patches to add a per-queue lock. In the meantime, I'll take a much closer look at the batching code, since I've stopped looking at the patches in-depth about a month ago. :-( Thanks, -PJ Waskiewicz From changquing.tang at hp.com Mon Sep 24 15:59:05 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Mon, 24 Sep 2007 22:59:05 -0000 Subject: [ofa-general] Atomic operation question. Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403025E4008@G3W0634.americas.hpqcorp.net> HI, I have a question for atmoic operation. If incoming atomic operations are from both ports of that HCA, can it work correctly ? Thanks. --CQ From hadi at cyberus.ca Mon Sep 24 16:38:19 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 24 Sep 2007 19:38:19 -0400 Subject: [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190674298.4264.24.camel@localhost> Message-ID: <1190677099.4264.37.camel@localhost> On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: > I've looked at that as a candidate to use. The lock for enqueue would > be needed when actually placing the skb into the appropriate software > queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. 
The dequeue is interesting: Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of txlocks per, one per ring? How is the policy to define the qdisc queues locked/mapped to tx rings? cheers, jamal From peter.p.waskiewicz.jr at intel.com Mon Sep 24 16:47:06 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Mon, 24 Sep 2007 16:47:06 -0700 Subject: [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <1190677099.4264.37.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> Message-ID: > On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: > > > I've looked at that as a candidate to use. The lock for > enqueue would > > be needed when actually placing the skb into the > appropriate software > > queue for the qdisc, so it'd be quick. > > The enqueue is easy to comprehend. The single device queue > lock should suffice. The dequeue is interesting: We should make sure we're symmetric with the locking on enqueue to dequeue. If we use the single device queue lock on enqueue, then dequeue will also need to check that lock in addition to the individual queue lock. The details of this are more trivial than the actual dequeue to make it efficient though. > Maybe you can point me to some doc or describe to me the > dequeue aspect; are you planning to have an array of txlocks > per, one per ring? > How is the policy to define the qdisc queues locked/mapped to > tx rings? The dequeue locking would be pushed into the qdisc itself. This is how I had it originally, and it did make the code more complex, but it was successful at breaking the heavily-contended queue_lock apart. 
I have a subqueue structure right now in netdev, which only has queue_state (for netif_{start|stop}_subqueue). This state is checked in sch_prio right now in the dequeue for both prio and rr. My approach is to add a queue_lock in that struct, so each queue allocated by the driver would have a lock per queue. Then in dequeue, that lock would be taken when the skb is about to be dequeued. The skb->queue_mapping field also maps directly to the queue index itself, so it can be unlocked easily outside of the context of the dequeue function. The policy would be to use a spin_trylock() in dequeue, so that dequeue can still do work if enqueue or another dequeue is busy. And the allocation of qdisc queues to device queues is assumed to be one-to-one (that's how the qdisc behaves now). I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. Thanks, -PJ Waskiewicz From ardavis at ichips.intel.com Mon Sep 24 16:53:52 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 24 Sep 2007 16:53:52 -0700 Subject: [ofa-general] Re: [PATCH] uDAPL 2.0 mods to co-exist with uDAPL 1.2 In-Reply-To: References: <000001c7fbb7$30cbad70$19b7020a@amr.corp.intel.com> Message-ID: <46F84E10.9040705@ichips.intel.com> James Lentini wrote: > Comments below: >> - >> +# version-info current:revision:age > > What does this comment do? just a comment regarding revisioning. > >> # >> -# This example shows netdev name, enabling administrator to use same copy across cluster >> +# Add examples for multiple interfaces and IPoIB HA fail over, and bonding > > The previous line is TODO, right? I'd suggest annotating it with that > text to make it clear to users. ok >> >> --- a/test/dtest/dtest.c >> +++ b/test/dtest/dtest.c >> @@ -44,7 +44,7 @@ >> #include >> >> #ifndef DAPL_PROVIDER >> -#define DAPL_PROVIDER "OpenIB-cma" >> +#define DAPL_PROVIDER "OpenIB-2-cma" > > Should we update OpenIB to ofa? 
Obviously, this isn't necessary as > part of this change I didn't want to change the 1.2 names for compatibility reasons but for 2.0 we could move to ofa names for both libraries and provider names. For example, libdaplcma.so becomes libdaplofa.so, OpenIB-cma becomes ofa. For example dat.conf 2.0 entries would look like this: ofa u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "ib0 0" "" ofa-1 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "ib1 0" "" ofa-2 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "ib2 0" "" ofa-3 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "ib3 0" "" ofa-bond u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "bond0 0" "" Is that what you had in mind? -arlin From shemminger at linux-foundation.org Mon Sep 24 17:14:11 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Mon, 24 Sep 2007 17:14:11 -0700 Subject: [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> Message-ID: <20070924171411.36494656@freepuppy.rosehill> On Mon, 24 Sep 2007 16:47:06 -0700 "Waskiewicz Jr, Peter P" wrote: > > On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: > > > > > I've looked at that as a candidate to use. The lock for > > enqueue would > > > be needed when actually placing the skb into the > > appropriate software > > > queue for the qdisc, so it'd be quick. > > > > The enqueue is easy to comprehend. The single device queue > > lock should suffice. The dequeue is interesting: > > We should make sure we're symmetric with the locking on enqueue to > dequeue. 
If we use the single device queue lock on enqueue, then > dequeue will also need to check that lock in addition to the individual > queue lock. The details of this are more trivial than the actual > dequeue to make it efficient though. > > > Maybe you can point me to some doc or describe to me the > > dequeue aspect; are you planning to have an array of txlocks > > per, one per ring? > > How is the policy to define the qdisc queues locked/mapped to > > tx rings? > > The dequeue locking would be pushed into the qdisc itself. This is how > I had it originally, and it did make the code more complex, but it was > successful at breaking the heavily-contended queue_lock apart. I have a > subqueue structure right now in netdev, which only has queue_state (for > netif_{start|stop}_subqueue). This state is checked in sch_prio right > now in the dequeue for both prio and rr. My approach is to add a > queue_lock in that struct, so each queue allocated by the driver would > have a lock per queue. Then in dequeue, that lock would be taken when > the skb is about to be dequeued. > > The skb->queue_mapping field also maps directly to the queue index > itself, so it can be unlocked easily outside of the context of the > dequeue function. The policy would be to use a spin_trylock() in > dequeue, so that dequeue can still do work if enqueue or another dequeue > is busy. And the allocation of qdisc queues to device queues is assumed > to be one-to-one (that's how the qdisc behaves now). > > I really just need to put my nose to the grindstone and get the patches > together and to the list...stay tuned. > > Thanks, > -PJ Waskiewicz > - Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model. 
-- Stephen Hemminger From jeff at garzik.org Mon Sep 24 17:15:55 2007 From: jeff at garzik.org (Jeff Garzik) Date: Mon, 24 Sep 2007 20:15:55 -0400 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <1190574713.5030.4.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <46F6AE18.7080708@garzik.org> <1190574713.5030.4.camel@localhost> Message-ID: <46F8533B.9010405@garzik.org> jamal wrote: > If the intel folks will accept the patch i'd really like to kill > the e1000 LLTX interface. If I understood DaveM correctly, it is sounding like we want to deprecate all of use LLTX on "real" hardware? If so, several such projects might be considered, as well as possibly simplifying TX batching work perhaps. Also, WRT e1000 specifically, I was hoping to minimize changes, and focus people on e1000e. e1000e replaces (deprecates) large portions of e1000, namely the support for the PCI Express modern chips. When e1000e has proven itself in the field, we can potentially look at several e1000 simplifications, during the large scale code removal that becomes possible. 
Jeff From peter.p.waskiewicz.jr at intel.com Mon Sep 24 17:31:46 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Mon, 24 Sep 2007 17:31:46 -0700 Subject: [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <20070924171411.36494656@freepuppy.rosehill> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com><20070916.161748.48388692.davem@davemloft.net><1189988958.4230.55.camel@localhost><1190569987.4256.52.camel@localhost><1190570205.4256.56.camel@localhost><1190674298.4264.24.camel@localhost><1190677099.4264.37.camel@localhost> <20070924171411.36494656@freepuppy.rosehill> Message-ID: > > I really just need to put my nose to the grindstone and get the > > patches together and to the list...stay tuned. > > > > Thanks, > > -PJ Waskiewicz > > - > > > Since we are redoing this, is there any way to make the whole > TX path more lockless? The existing model seems to be more > of a monitor than a real locking model. That seems quite reasonable. I will certainly see what I can do. Thanks Stephen, -PJ Waskiewicz From dotanb at dev.mellanox.co.il Mon Sep 24 23:11:14 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 25 Sep 2007 08:11:14 +0200 Subject: [ofa-general] Atomic operation question. In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403025E4008@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA8403025E4008@G3W0634.americas.hpqcorp.net> Message-ID: <46F8A682.8020307@dev.mellanox.co.il> Hi. Tang, Changqing wrote: > HI, I have a question for atmoic operation. If incoming atomic > operations are from > both ports of that HCA, can it work correctly ? > Yes, it should (if the HCA supports atomic operations). 
Dotan From tziporet at dev.mellanox.co.il Tue Sep 25 00:16:53 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 25 Sep 2007 09:16:53 +0200 Subject: [ofa-general] ofed-1.2.5/ofed-1.2.5.1 In-Reply-To: <50C74E87FB16FB4F9356E175CA15423E02D6D9B9@STR.ciemat.es> References: <50C74E87FB16FB4F9356E175CA15423E02D6D9B9@STR.ciemat.es> Message-ID: <46F8B5E5.8070105@mellanox.co.il> Acero Fernandez Alicia wrote: > Hi, > > I am going to install OFED software in our cluster, but in the > download section there are two different versions 1.2.5 and 1.2.5.1. > Could anyone tell me what are the differences between both of them?and > I would like to know if the 1.2.5.1 is an stable version, as well. > > 1.2.5.1 has 2 fixes in the build.sh script for PPC64 systems. Beside this the code is the same, so if you are not using PPC64 you can use 1.2.5 Tziporet From monisonlists at gmail.com Tue Sep 25 02:13:47 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Tue, 25 Sep 2007 11:13:47 +0200 Subject: [ofa-general] Re: [PATCH V6 5/9] net/bonding: Enable IP multicast for bonding IPoIB devices In-Reply-To: <20070924090437.0406e147@freepuppy.rosehill> References: <46F7D770.4090500@voltaire.com> <46F7D99C.3030602@voltaire.com> <20070924090437.0406e147@freepuppy.rosehill> Message-ID: <46F8D14B.4050602@gmail.com> > > Please get rid of the warning. Make bonding work correctly and allow enslave/remove > of device when bonding is down. > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Hi, I prefer to postpone it till I submit another version of the patches or till after the patches are merged. Anyway, I've added this to the TODO list. 
thanks MoniS From ogerlitz at voltaire.com Tue Sep 25 02:57:03 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 11:57:03 +0200 Subject: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses In-Reply-To: <000401c7feee$ea073180$ff0da8c0@amr.corp.intel.com> References: <46F7FDE5.9070305@oracle.com> <000401c7feee$ea073180$ff0da8c0@amr.corp.intel.com> Message-ID: <46F8DB6F.8050901@voltaire.com> Sean Hefty wrote: > If a user allocates a QP on an rdma_cm_id, the rdma_cm will automatically > transition the QP through its states (RTR, RTS, error, etc.) While the > QP state transitions are occurring, the QP itself must remain valid. > Provide locking around the QP pointer to prevent its destruction while > accessing the pointer. > > This fixes an issue reported by Olaf Kirch from Oracle that resulted in > a system crash: > > "An incoming connection arrives and we decide to tear down the nascent > connection. The remote ends decides to do the same. We start to shut > down the connection, and call rdma_destroy_qp on our cm_id. ... Now > apparently a 'connect reject' message comes in from the other host, > and cma_ib_handler() is called with an event of IB_CM_REJ_RECEIVED. > It calls cma_modify_qp_err, which for some odd reason tries to modify > the exact same QP we just destroyed." Hi Sean, Rick, In iscsi/iser, the approach we took wrt to destruction of a pair (ID and QP are created/destroyed through and state-managed by the rdma-cm) is: A) call rdma_disconnect to make sure the QP was transitioned to error B) get the completions/flushes assoc. with all the WR posted to the QP C) make sure a disconnected event was received call rdma_destroy_qp only when B && C hold. What is your take on this approach? Or. 
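The A/B/C ordering described above can be sketched as a small piece of bookkeeping. This is only an illustration under stated assumptions, not the iser code: the struct and every conn_* helper below are hypothetical names, and the real rdma_disconnect() and rdma_destroy_qp() calls are mentioned in comments only. The QP is released only once the disconnect has been issued, every posted WR has come back as a completion or flush, and the disconnected event has arrived.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection teardown state (one ID + QP pair). */
struct conn_teardown {
	bool disconnect_called;   /* A: rdma_disconnect() was issued */
	unsigned int posted_wrs;  /* B: WRs posted and not yet completed/flushed */
	bool disconnected_event;  /* C: disconnected event was received */
	bool qp_destroyed;        /* QP already released */
};

/* A: the caller would invoke rdma_disconnect() on the ID at this point. */
static void conn_start_teardown(struct conn_teardown *c)
{
	c->disconnect_called = true;
}

/* B: account for one completion or flush of a posted WR. */
static void conn_wr_flushed(struct conn_teardown *c)
{
	if (c->posted_wrs > 0)
		c->posted_wrs--;
}

/* C: record that the disconnected event was delivered. */
static void conn_disconnected_event(struct conn_teardown *c)
{
	c->disconnected_event = true;
}

/* The QP may go away only when A was done and both B and C hold. */
static bool conn_can_destroy_qp(const struct conn_teardown *c)
{
	return c->disconnect_called &&
	       c->posted_wrs == 0 &&
	       c->disconnected_event;
}

/*
 * Returns true exactly once, at the point where real code would
 * call rdma_destroy_qp(); returns false while teardown is incomplete
 * or when the QP has already been destroyed.
 */
static bool conn_try_destroy_qp(struct conn_teardown *c)
{
	if (c->qp_destroyed || !conn_can_destroy_qp(c))
		return false;
	c->qp_destroyed = true;
	return true;
}
```

A caller would run conn_try_destroy_qp() from both the completion handler and the CM event handler, so whichever of B or C happens last triggers the destroy. A real implementation would also need locking around this state, which the sketch leaves out.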
From ogerlitz at voltaire.com Tue Sep 25 03:22:24 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 12:22:24 +0200 Subject: [ofa-general] [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <1190637355.4947.56.camel@mtls03> References: <1190637355.4947.56.camel@mtls03> Message-ID: <46F8E160.5060004@voltaire.com> Eli Cohen wrote: > Add high dma support to ipoib > > This patch assumes all IB devices support 64 bit DMA. On some architectures DMA addresses are 32 bit, so I am not sure to follow your comment. This capability states that the network device can dma to high memory. Or.
> > Signed-off-by: Eli Cohen > > --- > > Index: linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c > =================================================================== > --- linux-2.6.23-rc1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-15 20:50:16.000000000 +0300 > +++ linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-15 20:50:27.000000000 +0300 > @@ -1079,6 +1079,8 @@ static struct net_device *ipoib_add_port > > SET_NETDEV_DEV(priv->dev, hca->dma_device); > > + priv->dev->features |= NETIF_F_HIGHDMA; From eli at mellanox.co.il Tue Sep 25 03:24:25 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 25 Sep 2007 12:24:25 +0200 Subject: [ofa-general] [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <46F8E160.5060004@voltaire.com> References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> Message-ID: <1190715865.4947.118.camel@mtls03> On Tue, 2007-09-25 at 12:22 +0200, Or Gerlitz wrote: > Eli Cohen wrote: > > Add high dma support to ipoib > > > > This patch assumes all IB devices support 64 bit DMA. > > On some architectures DMA addresses are 32 bit, so I am not sure to > follow your comment. This capability states that the network device can > dma to high memory. > > Or. > I believe it means that *if* the kernel hands buffers whose addresses exceed 32 bits then the device can handle them. From ogerlitz at voltaire.com Tue Sep 25 03:15:23 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 12:15:23 +0200 Subject: [ofa-general] [PATCH 3/11] ib_core: add checksum offload support In-Reply-To: <1190637451.4947.60.camel@mtls03> References: <1190637451.4947.60.camel@mtls03> Message-ID: <46F8DFBB.1000800@voltaire.com> Eli Cohen wrote: > Add checksum offload support to the core > A device that publishes IB_DEVICE_IP_CSUM actually supports > calculating checksum on transmit and provides indication whether > the checksum is OK on receive. 
Hi Eli, From the discussion over the "IB/ipoib: S/G and HW checksum support" thread, I understand that Linux actually never offloads the IP checksum calculation to the HW but rather only the TCP and UDP checksum. I find it more clear if the device capability (same for the send flag) name would follow one of these: > #define NETIF_F_IP_CSUM 2 /* Can checksum TCP/UDP over IPv4. */ > #define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */ > #define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */ > #define NETIF_F_IPV6_CSUM 16 /* Can checksum TCP/UDP over IPV6 */ Or. > Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h > enum ib_cq_notify_flags { > @@ -615,7 +617,9 @@ enum ib_send_flags { > IB_SEND_FENCE = 1, > IB_SEND_SIGNALED = (1<<1), > IB_SEND_SOLICITED = (1<<2), > - IB_SEND_INLINE = (1<<3) > + IB_SEND_INLINE = (1<<3), > + IB_SEND_IP_CSUM = (1<<4), there's no point for the HW to compute the IP csum, asking this is a pure waste, since the stack always does it > + IB_SEND_UDP_TCP_CSUM = (1<<5) > }; From eli at mellanox.co.il Tue Sep 25 03:30:52 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 25 Sep 2007 12:30:52 +0200 Subject: [ofa-general] [PATCH 3/11] ib_core: add checksum offload support In-Reply-To: <46F8DFBB.1000800@voltaire.com> References: <1190637451.4947.60.camel@mtls03> <46F8DFBB.1000800@voltaire.com> Message-ID: <1190716252.4947.125.camel@mtls03> On Tue, 2007-09-25 at 12:15 +0200, Or Gerlitz wrote: > Eli Cohen wrote: > > Add checksum offload support to the core > > A device that publishes IB_DEVICE_IP_CSUM actually supports > > calculating checksum on transmit and provides indication whether > > the checksum is OK on receive. > > Hi Eli, > > From the discussion over the "IB/ipoib: S/G and HW checksum support" > thread, I understand that Linux actually never offloads the IP checksum > calculation to the HW but rather only the TCP and UDP checksum. 
> > I find it more clear if the device capability (same for the send flag) > name would follow one of these: > > #define NETIF_F_IP_CSUM 2 /* Can checksum TCP/UDP over IPv4. */ > > #define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */ > > #define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */ > > #define NETIF_F_IPV6_CSUM 16 /* Can checksum TCP/UDP over IPV6 */ > > Or. I am not sure that defining all kinds of capabilities to the device are useful. For example if a device defines only IP checksum than there is no useful. > > > Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h > > > enum ib_cq_notify_flags { > > @@ -615,7 +617,9 @@ enum ib_send_flags { > > IB_SEND_FENCE = 1, > > IB_SEND_SIGNALED = (1<<1), > > IB_SEND_SOLICITED = (1<<2), > > - IB_SEND_INLINE = (1<<3) > > + IB_SEND_INLINE = (1<<3), > > + IB_SEND_IP_CSUM = (1<<4), > there's no point for the HW to compute the IP csum, asking this is a > pure waste, since the stack always does it > > + IB_SEND_UDP_TCP_CSUM = (1<<5) > > }; > I am not sure this is always true but anyway it is not a waste since computing the checksum does put more "work" on the hardware. From ogerlitz at voltaire.com Tue Sep 25 03:33:48 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 12:33:48 +0200 Subject: [ofa-general] [PATCH 6/11] IB/ipoib: add checksum offload support In-Reply-To: <1190637551.4947.66.camel@mtls03> References: <1190637551.4947.66.camel@mtls03> Message-ID: <46F8E40C.3030203@voltaire.com> Eli Cohen wrote: > Add checksum offload support to ipoib Can you clarify the relation between this patch to "[PATCHv3] IB/ipoib: HW checksum support" patch posted later by Michael? 
for example, I see that you patch makes IPoIB to publish the NETIF_F_IP_CSUM capability and Michael's one publishes NETIF_F_HW_CSUM, etc > Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h > =================================================================== > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:09:21.000000000 +0200 > +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:49:00.000000000 +0200 > @@ -86,6 +86,7 @@ enum { > IPOIB_MCAST_STARTED = 8, > IPOIB_FLAG_NETIF_STOPPED = 9, > IPOIB_FLAG_ADMIN_CM = 10, > + IPOIB_FLAG_RX_CSUM = 11, > > IPOIB_MAX_BACKOFF_SECONDS = 16, > > Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c > =================================================================== > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 12:23:26.000000000 +0200 > +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 13:05:21.000000000 +0200 > @@ -1258,6 +1258,13 @@ static ssize_t set_mode(struct device *d > set_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); > ipoib_warn(priv, "enabling connected mode " > "will cause multicast packet drops\n"); > + > + /* clear ipv6 flag too */ > + dev->features &= ~NETIF_F_IP_CSUM; > + > + priv->tx_wr.send_flags &= > + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); > + > ipoib_flush_paths(dev); > return count; > } > @@ -1266,6 +1273,10 @@ static ssize_t set_mode(struct device *d > clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); > dev->mtu = min(priv->mcast_mtu, dev->mtu); > ipoib_flush_paths(dev); > + > + if (priv->ca->flags & IB_DEVICE_IP_CSUM) > + dev->features |= NETIF_F_IP_CSUM; /* ipv6 too */ didn't you want to use NETIF_F_HW_CSUM here? 
> Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c > =================================================================== > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 12:23:00.000000000 +0200 > +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 13:04:52.000000000 +0200 > @@ -1109,6 +1109,29 @@ int ipoib_add_pkey_attr(struct net_devic > return device_create_file(&dev->dev, &dev_attr_pkey); > } > > +static void set_tx_csum(struct net_device *dev) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + > + if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags)) > + return; > + > + if (!(priv->ca->flags & IB_DEVICE_IP_CSUM)) > + return; > + > + dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; /* turn on ipv6 too */ can you explain why this line belongs specifically to set_tx_csum() ? > +} > + > +static void set_rx_csum(struct net_device *dev) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + > + if (!(priv->ca->flags & IB_DEVICE_IP_CSUM)) > + return; > + > + set_bit(IPOIB_FLAG_RX_CSUM, &priv->flags); > +} > + > static struct net_device *ipoib_add_port(const char *format, > struct ib_device *hca, u8 port) > { > @@ -1165,6 +1188,9 @@ static struct net_device *ipoib_add_port > goto event_failed; > } > > + set_tx_csum(priv->dev); > + set_rx_csum(priv->dev); From ogerlitz at voltaire.com Tue Sep 25 03:40:10 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 12:40:10 +0200 Subject: [ofa-general] [PATCH 10/11]: IB/ipoib modify cq params In-Reply-To: <1190637684.4947.74.camel@mtls03> References: <1190637684.4947.74.camel@mtls03> Message-ID: <46F8E58A.30107@voltaire.com> Eli Cohen wrote: > Implement support for modifying IPOIB CQ moderation params > > This can be used to tune at run time the paramters controlling > the event (interrupt) generation rate and thus reduce the overhead > incurred by hadling interrupts resulting in better throughput. 
> Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h > =================================================================== > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 13:07:43.000000000 +0200 > +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 13:12:21.000000000 +0200 > @@ -270,6 +270,13 @@ struct ipoib_cm_dev_priv { > struct ib_recv_wr rx_wr; > }; > > +struct ipoib_ethtool_st { > + u16 rx_coalesce_usecs; > + u16 tx_coalesce_usecs; > + u16 rx_max_coalesced_frames; > + u16 tx_max_coalesced_frames; > +}; As IPoIB uses one CQ per device, why you use the tx_ and rx_ prefixes in the structure name (and later propagated into the documentation, mindset of users etc etc). Its confusing, please change it to be Or. From ogerlitz at voltaire.com Tue Sep 25 03:41:06 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 12:41:06 +0200 Subject: [ofa-general] [PATCH 11/11]: mlx4_core use fixed CQ moderation paramters In-Reply-To: <1190637727.4947.76.camel@mtls03> References: <1190637727.4947.76.camel@mtls03> Message-ID: <46F8E5C2.4000700@voltaire.com> Eli Cohen wrote: > From: Michael S. Tsirkin > Subject: IB/ipoib: support for sending gather skbs > > Enable interrupt coalescing for CQs in mlx4. > > Signed-off-by: Michael S. 
Tsirkin > > --- > > Index: ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c > =================================================================== > --- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/cq.c 2007-09-24 13:08:55.000000000 +0200 > +++ ofa_1_3_dev_kernel/drivers/net/mlx4/cq.c 2007-09-24 13:12:42.000000000 +0200 > @@ -43,6 +43,14 @@ > #include "mlx4.h" > #include "icm.h" > > +static int cq_max_count = 16; > +static int cq_period = 10; > + > +module_param(cq_max_count, int, 0444); > +MODULE_PARM_DESC(cq_max_count, "number of CQEs to generate event"); > +module_param(cq_period, int, 0444); > +MODULE_PARM_DESC(cq_period, "time in usec for CQ event generation"); I failed to find where these two module param are used anywhere along this patch set, please clarify. Or. From ogerlitz at voltaire.com Tue Sep 25 03:46:43 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 12:46:43 +0200 Subject: [ofa-general] [PATCH 10/11]: IB/ipoib modify cq params In-Reply-To: <1190637684.4947.74.camel@mtls03> References: <1190637684.4947.74.camel@mtls03> Message-ID: <46F8E713.5060009@voltaire.com> Eli Cohen wrote: > Implement support for modifying IPOIB CQ moderation params > > This can be used to tune at run time the paramters controlling > the event (interrupt) generation rate and thus reduce the overhead > incurred by hadling interrupts resulting in better throughput. I think we have to carefully think if/how does this feature goes hand in hand with NAPI. Since NAPI is not optional, with this feature the network stack tries to do its best to reduce interrupts with the NAPI logic, and on top of that the HW is instructed to apply this logic before issuing an interrupt. Does the need here suggests that NAPI can be improved? if yes how? maybe for some infiniband devices interrupt moderation for itself would be better so NAPI should be disabled? 
To suggest this for merge, I think you would need to share the list with the IPoIB results you had with NAPI vs with NAPI AND interrupt moderation. Or. From ogerlitz at voltaire.com Tue Sep 25 03:57:55 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 12:57:55 +0200 Subject: [ofa-general] [PATCH 3/11] ib_core: add checksum offload support In-Reply-To: <1190716252.4947.125.camel@mtls03> References: <1190637451.4947.60.camel@mtls03> <46F8DFBB.1000800@voltaire.com> <1190716252.4947.125.camel@mtls03> Message-ID: <46F8E9B3.80302@voltaire.com> Eli Cohen wrote: > On Tue, 2007-09-25 at 12:15 +0200, Or Gerlitz wrote: >> Eli Cohen wrote: >>> A device that publishes IB_DEVICE_IP_CSUM actually supports >>> calculating checksum on transmit and provides indication whether >> I find it more clear if the device capability (same for the send flag) >> name would follow one of these: >>> #define NETIF_F_IP_CSUM 2 /* Can checksum TCP/UDP over IPv4. */ >>> #define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */ >>> #define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */ >>> #define NETIF_F_IPV6_CSUM 16 /* Can checksum TCP/UDP over IPV6 */ > I am not sure that defining all kinds of capabilities to the device are > useful. For example if a device defines only IP checksum than there is > no useful. I did not say that you need to add four capabilities, you can add one that fits the connectX feature eg IB_DEVICE_HW_CSUM and document that if the device supports this, it is capable to compute TCP and UDP checksum for both IPv4 and IPv6 packets, etc. Later if some new HW will be capable to offload only IPv4, they will add a new capability etc. 
>>> Index: ofa_1_3_dev_kernel/include/rdma/ib_verbs.h >>> enum ib_cq_notify_flags { >>> @@ -615,7 +617,9 @@ enum ib_send_flags { >>> IB_SEND_FENCE = 1, >>> IB_SEND_SIGNALED = (1<<1), >>> IB_SEND_SOLICITED = (1<<2), >>> - IB_SEND_INLINE = (1<<3) >>> + IB_SEND_INLINE = (1<<3), >>> + IB_SEND_IP_CSUM = (1<<4), >> there's no point for the HW to compute the IP csum, asking this is a >> pure waste, since the stack always does it >>> + IB_SEND_UDP_TCP_CSUM = (1<<5) >>> }; > I am not sure this is always true but anyway it is not a waste since > computing the checksum does put more "work" on the hardware. Again, per the discussion over the thread it --is-- true, also, it creates confusion while reading the code (why to ask the HW to do something which is always done by SW?), also this does not put more work on the private case of the connectX HW which relies on the IB ICRC, second HW might go and actually compute the IP header csum. Or. From vlad at lists.openfabrics.org Tue Sep 25 03:59:13 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 25 Sep 2007 03:59:13 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070925-0200 daily build status Message-ID: <20070925105916.85305E6083E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed 
on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.14 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.14' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.15 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.15' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.12 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.12' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.13 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.13' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.17_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.14 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.14' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.14 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_powerpc_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.14_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.14' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.15 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.15' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.13 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_powerpc_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.13' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.12 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.12' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.15 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_powerpc_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.15_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.15' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.19_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.13 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.13_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.13' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.12 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_powerpc_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.12_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.12' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.22 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.22_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.22_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.22_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.22' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.21.1 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.c:187: warning: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.21.1_ia64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.21.1_ia64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.21.1' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.18-8.el5 Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free': /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.c:187: error: implicit declaration of function 'vunmap' make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4/alloc.o] Error 1 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18-8.el5_ppc64_check/drivers/net/mlx4] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.18-8.el5_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.18-8.el5' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ia64 with linux-2.6.16.21-0.8-default Log: /home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.c:162: error: for each function it appears in.) 
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.c:162: warning: assignment makes pointer from integer without a cast
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.c: In function 'mlx4_buf_free':
/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.c:187: error: implicit declaration of function 'vunmap'
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4/alloc.o] Error 1
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16.21-0.8-default_ia64_check/drivers/net/mlx4] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070925-0200_linux-2.6.16.21-0.8-default_ia64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ia64/linux-2.6.16.21-0.8-default'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From monis at voltaire.com Tue Sep 25 04:01:58 2007
From: monis at voltaire.com (Moni Shoua)
Date: Tue, 25 Sep 2007 13:01:58 +0200
Subject: [ofa-general] [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
In-Reply-To: <46F7D770.4090500@voltaire.com>
References: <46F7D770.4090500@voltaire.com>
Message-ID: <46F8EAA6.3040409@voltaire.com>

Jay,

I think that all comments on the patches were discussed and handled. If you agree, can you please push them to the networking tree so they will be merged into 2.6.24? This includes the IPoIB patches (agreed with Roland).

Note that there are *no* patches to net/core (like in V5).
thanks MoniS From ogerlitz at voltaire.com Tue Sep 25 04:06:46 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 13:06:46 +0200 Subject: [ofa-general] [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <1190715865.4947.118.camel@mtls03> References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> Message-ID: <46F8EBC6.40100@voltaire.com> Eli Cohen wrote: > On Tue, 2007-09-25 at 12:22 +0200, Or Gerlitz wrote: >> Eli Cohen wrote: >>> Add high dma support to ipoib >>> This patch assumes all IB devices support 64 bit DMA. >> On some architectures DMA addresses are 32 bit, so I am not sure to >> follow your comment. This capability states that the network device can >> dma to high memory. > I believe it means that *if* the kernel hands buffers whose addresses > exceed 32 bits then the device can handle them. High memory is well documented in books and elsewhere. I just want to say that the change-log comment is confusing and unrelated. What you want to say is that this patch assumes that for all IB devices, ib_dma_map_single and ib_dma_map_page support high memory, which is not the case, see below. Ralph? Or. 
> static u64 ipath_dma_map_page(struct ib_device *dev, > struct page *page, > unsigned long offset, > size_t size, > enum dma_data_direction direction) > { > u64 addr; > > BUG_ON(!valid_dma_direction(direction)); > > if (offset + size > PAGE_SIZE) { > addr = BAD_DMA_ADDRESS; > goto done; > } > > addr = (u64) page_address(page); > if (addr) > addr += offset; > /* TODO: handle highmem pages */ > > done: > return addr; > } From eli at mellanox.co.il Tue Sep 25 04:55:04 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 25 Sep 2007 13:55:04 +0200 Subject: [ofa-general] [PATCH 6/11] IB/ipoib: add checksum offload support In-Reply-To: <46F8E40C.3030203@voltaire.com> References: <1190637551.4947.66.camel@mtls03> <46F8E40C.3030203@voltaire.com> Message-ID: <1190721304.4947.134.camel@mtls03> On Tue, 2007-09-25 at 12:33 +0200, Or Gerlitz wrote: > Eli Cohen wrote: > > Add checksum offload support to ipoib > > Can you clarify the relation between this patch to "[PATCHv3] IB/ipoib: > HW checksum support" patch posted later by Michael? for example, I see > that you patch makes IPoIB to publish the NETIF_F_IP_CSUM capability and > Michael's one publishes NETIF_F_HW_CSUM, etc These two patches are not related. Michael's patch relies on InfiniBand's ICRC to not require checksum generation/validation while my patch relies on HW's capability to insert checksum on outgoing packets and calculate checksum of incoming packets. 
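For reference, the value that TCP/UDP checksum offload hardware inserts or verifies is the standard Internet checksum. A minimal sketch in C, assuming the RFC 1071 algorithm (16-bit one's-complement sum, inverted); csum_sketch is a hypothetical name used only for illustration and is not part of the patch or of any kernel API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * RFC 1071 style Internet checksum over a byte buffer: add the data as
 * big-endian 16-bit words in one's-complement arithmetic, then invert.
 * Illustrative sketch only; csum_sketch is a hypothetical helper.
 */
static uint16_t csum_sketch(const uint8_t *data, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	assert(data != NULL || len == 0);

	/* Sum all complete 16-bit words. */
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];

	/* An odd trailing byte is treated as the high byte of a padded word. */
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

A receiver can validate a packet by running the same sum over the data with the transmitted checksum appended; a result of zero means the checksum matched.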
> > > Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h > > =================================================================== > > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:09:21.000000000 +0200 > > +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:49:00.000000000 +0200 > > @@ -86,6 +86,7 @@ enum { > > IPOIB_MCAST_STARTED = 8, > > IPOIB_FLAG_NETIF_STOPPED = 9, > > IPOIB_FLAG_ADMIN_CM = 10, > > + IPOIB_FLAG_RX_CSUM = 11, > > > > IPOIB_MAX_BACKOFF_SECONDS = 16, > > > > Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c > > =================================================================== > > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 12:23:26.000000000 +0200 > > +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 13:05:21.000000000 +0200 > > @@ -1258,6 +1258,13 @@ static ssize_t set_mode(struct device *d > > set_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); > > ipoib_warn(priv, "enabling connected mode " > > "will cause multicast packet drops\n"); > > + > > + /* clear ipv6 flag too */ > > + dev->features &= ~NETIF_F_IP_CSUM; > > + > > + priv->tx_wr.send_flags &= > > + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); > > + > > ipoib_flush_paths(dev); > > return count; > > } > > @@ -1266,6 +1273,10 @@ static ssize_t set_mode(struct device *d > > clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); > > dev->mtu = min(priv->mcast_mtu, dev->mtu); > > ipoib_flush_paths(dev); > > + > > + if (priv->ca->flags & IB_DEVICE_IP_CSUM) > > + dev->features |= NETIF_F_IP_CSUM; /* ipv6 too */ > > didn't you want to use NETIF_F_HW_CSUM here? No, our hw does ip/udp/tcp checksum only. 
> > > Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c > > =================================================================== > > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 12:23:00.000000000 +0200 > > +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 13:04:52.000000000 +0200 > > @@ -1109,6 +1109,29 @@ int ipoib_add_pkey_attr(struct net_devic > > return device_create_file(&dev->dev, &dev_attr_pkey); > > } > > > > +static void set_tx_csum(struct net_device *dev) > > +{ > > + struct ipoib_dev_priv *priv = netdev_priv(dev); > > + > > + if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags)) > > + return; > > + > > + if (!(priv->ca->flags & IB_DEVICE_IP_CSUM)) > > + return; > > + > > + dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; /* turn on ipv6 too */ > can you explain why this line belongs specifically to set_tx_csum() ? > > > +} > > + > > +static void set_rx_csum(struct net_device *dev) > > +{ > > + struct ipoib_dev_priv *priv = netdev_priv(dev); > > + > > + if (!(priv->ca->flags & IB_DEVICE_IP_CSUM)) > > + return; > > + > > + set_bit(IPOIB_FLAG_RX_CSUM, &priv->flags); > > +} > > + > > static struct net_device *ipoib_add_port(const char *format, > > struct ib_device *hca, u8 port) > > { > > @@ -1165,6 +1188,9 @@ static struct net_device *ipoib_add_port > > goto event_failed; > > } > > > > + set_tx_csum(priv->dev); > > + set_rx_csum(priv->dev); > From ogerlitz at voltaire.com Tue Sep 25 05:18:51 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 14:18:51 +0200 Subject: [ofa-general] [PATCH 6/11] IB/ipoib: add checksum offload support In-Reply-To: <1190721304.4947.134.camel@mtls03> References: <1190637551.4947.66.camel@mtls03> <46F8E40C.3030203@voltaire.com> <1190721304.4947.134.camel@mtls03> Message-ID: <46F8FCAB.4010002@voltaire.com> Eli Cohen wrote: > On Tue, 2007-09-25 at 12:33 +0200, Or Gerlitz wrote: >> Eli Cohen wrote: >>> Add checksum offload support to ipoib >> 
Can you clarify the relation between this patch to "[PATCHv3] IB/ipoib: >> HW checksum support" patch posted later by Michael? for example, I see >> that you patch makes IPoIB to publish the NETIF_F_IP_CSUM capability and >> Michael's one publishes NETIF_F_HW_CSUM, etc > > These two patches are not related. Michael's patch relies on infinband's > icrc to not require checksum generation/validation while my patch relies > on HW's capability to insert checksum on outgoing packets and calculate > checksum of incoming packtes. I am not with you. Does the connectX actually computes and inserts tcp/udp checksum on outgoing packets? through the discussion at the other thread I understood that the answer is no. If it does not insert a checksum, over which HW will your patch be operative, a future one? If it does insert checksum, why is Michael's patch needed at all? see more comments below, >>> Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h >>> =================================================================== >>> --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:09:21.000000000 +0200 >>> +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-24 12:49:00.000000000 +0200 >>> @@ -86,6 +86,7 @@ enum { >>> IPOIB_MCAST_STARTED = 8, >>> IPOIB_FLAG_NETIF_STOPPED = 9, >>> IPOIB_FLAG_ADMIN_CM = 10, >>> + IPOIB_FLAG_RX_CSUM = 11, >>> >>> IPOIB_MAX_BACKOFF_SECONDS = 16, >>> >>> Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c >>> =================================================================== >>> --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 12:23:26.000000000 +0200 >>> +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-24 13:05:21.000000000 +0200 >>> @@ -1258,6 +1258,13 @@ static ssize_t set_mode(struct device *d >>> @@ -1266,6 +1273,10 @@ static ssize_t set_mode(struct device *d >>> + if (priv->ca->flags & IB_DEVICE_IP_CSUM) >>> + dev->features |= 
NETIF_F_IP_CSUM; /* ipv6 too */ >> didn't you want to use NETIF_F_HW_CSUM here? > No, our hw does ip/udp/tcp checksum only. If its only for IPv4 then NETIF_F_IP_CSUM is fine, if you support also IPv6 then you want to OR also NETIF_F_IPV6_CSUM, correct? >>> Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c >>> =================================================================== >>> --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 12:23:00.000000000 +0200 >>> +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-24 13:04:52.000000000 +0200 >>> @@ -1109,6 +1109,29 @@ int ipoib_add_pkey_attr(struct net_devic >>> return device_create_file(&dev->dev, &dev_attr_pkey); >>> } >>> >>> +static void set_tx_csum(struct net_device *dev) >>> +{ >>> + struct ipoib_dev_priv *priv = netdev_priv(dev); >>> + >>> + if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags)) >>> + return; >>> + >>> + if (!(priv->ca->flags & IB_DEVICE_IP_CSUM)) >>> + return; >>> + >>> + dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; /* turn on ipv6 too */ >> can you explain why this line belongs specifically to set_tx_csum() ? ??? Or. From mst at dev.mellanox.co.il Tue Sep 25 05:34:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Sep 2007 14:34:25 +0200 Subject: [ofa-general] Re: [PATCH 6/11] IB/ipoib: add checksum offload support In-Reply-To: <46F8FCAB.4010002@voltaire.com> References: <1190637551.4947.66.camel@mtls03> <46F8E40C.3030203@voltaire.com> <1190721304.4947.134.camel@mtls03> <46F8FCAB.4010002@voltaire.com> Message-ID: <20070925123425.GA20844@mellanox.co.il> > I am not with you. Does the connectX actually computes and inserts > tcp/udp checksum on outgoing packets? ConnectX can only compute checksum for TCP and UDP protocols for datagram packets. That's what Eli's patches enable. 
OTOH, my patch also makes it possible to save on checksumming for this case, but additionally, for all of - IPoIB connected mode - HCAs besides ConnectX - protocols besides TCP/UDP Unfortunately, this approach - only speeds up communication if patch has been applied at both communicating nodes - only speeds up communication within IPoIB subnet In all other cases my patch provides no speedup So the 2 patches are complementary. -- MST From ogerlitz at voltaire.com Tue Sep 25 06:00:12 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 15:00:12 +0200 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <46F7E96E.4060302@ichips.intel.com> References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> <46F7E96E.4060302@ichips.intel.com> Message-ID: <46F9065C.3090907@voltaire.com> Sean Hefty wrote: >>> node 1 <-> switch A <-> switch B <-> switch C <-> SA >> The host would only see port up/down events as of changes in the link >> state in the local port or in the port which is connected to it through >> the cable. > So, if you brought the link down/up between switches A & B, node 1 > wouldn't receive any events, but it would be removed from the multicast > group? good catch! Indeed, when the link between switches A and B goes down, per the view point of the SM, the whole sub-fabric across A is lost and hence the node is dropped from all the multicast groups it is joined to. However, from the view point of the node, no port down is experienced. When the A-B link goes up, the SM discovers all nodes across A and probes their ports, though this process a port active event --might-- be generated by the HCA FW, but I am not sure its mandatory. Since the only trigger for ipoib to rejoin to multicast groups is delivery of event by the hw driver, namely one of: port down/up, lid change, sm lid change, client re-register. I think we might have a hole here if none of these events is generated. 
Please note that through this discovery, at least one mad is sent from the SM to the node. If we enforce the SM to set the re-register bit --each-- time it discovers a node, then the bug is solved. I will test this scheme and let you know what I get (with the voltaire SM and mthca driver). Eitan, Michael - any insight on the matter? Or. From hadi at cyberus.ca Tue Sep 25 06:08:54 2007 From: hadi at cyberus.ca (jamal) Date: Tue, 25 Sep 2007 09:08:54 -0400 Subject: [ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> Message-ID: <1190725734.4264.97.camel@localhost> On Mon, 2007-24-09 at 16:47 -0700, Waskiewicz Jr, Peter P wrote: > We should make sure we're symmetric with the locking on enqueue to > dequeue. If we use the single device queue lock on enqueue, then > dequeue will also need to check that lock in addition to the individual > queue lock. The details of this are more trivial than the actual > dequeue to make it efficient though. It would be interesting to observe the performance implications. > The dequeue locking would be pushed into the qdisc itself. This is how > I had it originally, and it did make the code more complex, but it was > successful at breaking the heavily-contended queue_lock apart. I have a > subqueue structure right now in netdev, which only has queue_state (for > netif_{start|stop}_subqueue). This state is checked in sch_prio right > now in the dequeue for both prio and rr. My approach is to add a > queue_lock in that struct, so each queue allocated by the driver would > have a lock per queue. Then in dequeue, that lock would be taken when > the skb is about to be dequeued. 
More locks imply degraded performance. If only one processor can enter that region, presumably after acquiring the outer lock, why this secondary lock per queue? > The skb->queue_mapping field also maps directly to the queue index > itself, so it can be unlocked easily outside of the context of the > dequeue function. The policy would be to use a spin_trylock() in > dequeue, so that dequeue can still do work if enqueue or another dequeue > is busy. So there could be a parallel cpu dequeueing at the same time? Wouldn't this have implications depending on the scheduling algorithm used? If for example i was doing priority queueing i would want to make sure the highest priority is being dequeued first AND by all means goes out first to the driver; i don't want a parallel cpu dequeueing a lower prio packet at the same time. > And the allocation of qdisc queues to device queues is assumed > to be one-to-one (that's how the qdisc behaves now). Ok, that brings back the discussion we had; my thinking was something like dev->hard_prep_xmit() would select the ring and i think you statically already map the ring to a qdisc queue. So i don't think dev->hard_prep_xmit() is useful to you. In any case, there is nothing the batching patches do that interferes with or prevents you from going the path you intend to. Instead of dequeueing one packet, you dequeue several and instead of sending to the driver one packet, you send several. And using the xmit_win, you should never ever have to requeue. 
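The batching idea in the paragraph above can be sketched roughly as follows. This is a userspace illustration in C, not code from the batching patches: queue_sketch, enqueue_sketch and dequeue_batch_sketch are hypothetical names, and xmit_win is assumed to be the number of packets the driver has said it can accept right now.

```c
#include <assert.h>
#include <stddef.h>

#define QLEN_SKETCH 16

/* Hypothetical FIFO standing in for a qdisc; it holds packet ids only. */
struct queue_sketch {
	int pkts[QLEN_SKETCH];
	size_t head;   /* index of the next packet to dequeue */
	size_t count;  /* number of packets currently queued */
};

/* Append one packet; returns -1 if the queue is full. */
static int enqueue_sketch(struct queue_sketch *q, int pkt)
{
	if (q->count == QLEN_SKETCH)
		return -1;
	q->pkts[(q->head + q->count) % QLEN_SKETCH] = pkt;
	q->count++;
	return 0;
}

/*
 * Dequeue at most xmit_win packets into batch[] and return how many were
 * taken. Because the batch never exceeds the window the driver advertised,
 * the driver can accept all of it and nothing has to be requeued.
 */
static size_t dequeue_batch_sketch(struct queue_sketch *q, size_t xmit_win,
				   int *batch)
{
	size_t n = 0;

	assert(q != NULL && batch != NULL);

	while (n < xmit_win && q->count > 0) {
		batch[n++] = q->pkts[q->head];
		q->head = (q->head + 1) % QLEN_SKETCH;
		q->count--;
	}
	return n;
}
```

With five packets queued and a window of three, one call hands three packets to the driver in order and leaves two queued for the next call.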
cheers, jamal From ogerlitz at voltaire.com Tue Sep 25 06:08:59 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 15:08:59 +0200 Subject: [ofa-general] Re: [PATCH 6/11] IB/ipoib: add checksum offload support In-Reply-To: <20070925123425.GA20844@mellanox.co.il> References: <1190637551.4947.66.camel@mtls03> <46F8E40C.3030203@voltaire.com> <1190721304.4947.134.camel@mtls03> <46F8FCAB.4010002@voltaire.com> <20070925123425.GA20844@mellanox.co.il> Message-ID: <46F9086B.5020804@voltaire.com> Michael S. Tsirkin wrote: >> I am not with you. Does the connectX actually computes and inserts >> tcp/udp checksum on outgoing packets? > ConnectX can only compute checksum for TCP and UDP protocols > for datagram packets. That's what Eli's patches enable. So for datagram mode (ie UD QP) the connectX HW does support computing the TCP and UDP checksum and inserting them to the on-the-wire-packet?! Or. From hadi at cyberus.ca Tue Sep 25 06:15:38 2007 From: hadi at cyberus.ca (jamal) Date: Tue, 25 Sep 2007 09:15:38 -0400 Subject: [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <20070924171411.36494656@freepuppy.rosehill> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> <20070924171411.36494656@freepuppy.rosehill> Message-ID: <1190726138.4264.105.camel@localhost> On Mon, 2007-24-09 at 17:14 -0700, Stephen Hemminger wrote: > Since we are redoing this, > is there any way to make the whole TX path > more lockless? The existing model seems to be more of a monitor than > a real locking model. What do you mean it is "more of a monitor"? On the challenge of making it lockless: Just about every NAPI driver combines the tx pruning with rx polling. 
If you are dealing with tx resources on receive thread as well as tx thread, _you need_ locking. The only other way we can avoid it is to separate the rx path interrupts from ones on tx related resources; the last NAPI driver that did that was tulip; i think the e1000 for a short period in its life did the same as well. But that has been frowned on and people have evolved away from it. cheers, jamal From hrosenstock at xsigo.com Tue Sep 25 06:20:39 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 25 Sep 2007 06:20:39 -0700 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <46F9065C.3090907@voltaire.com> References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> <46F7E96E.4060302@ichips.intel.com> <46F9065C.3090907@voltaire.com> Message-ID: <1190726439.7075.405.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-09-25 at 15:00 +0200, Or Gerlitz wrote: > Sean Hefty wrote: > >>> node 1 <-> switch A <-> switch B <-> switch C <-> SA > > >> The host would only see port up/down events as of changes in the link > >> state in the local port or in the port which is connected to it through > >> the cable. > > > So, if you brought the link down/up between switches A & B, node 1 > > wouldn't receive any events, but it would be removed from the multicast > > group? > > good catch! > > Indeed, when the link between switches A and B goes down, per the view > point of the SM, the whole sub-fabric across A is lost and hence the > node is dropped from all the multicast groups it is joined to. No, it is not (dropped from all multicast groups it is joined to). It may be removed from the multicast forwarding tables if there is no route available but it is still a member of the group. > However, from the view point of the node, no port down is experienced. 
> > When the A-B link goes up, the SM discovers all nodes across A and > probes their ports, though this process a port active event --might-- be > generated by the HCA FW, but I am not sure its mandatory. > > Since the only trigger for ipoib to rejoin to multicast groups is > delivery of event by the hw driver, namely one of: port down/up, lid > change, sm lid change, client re-register. I think we might have a hole > here if none of these events is generated. It doesn't need to rejoin for this case. See above explanation. -- Hal > Please note that through this discovery, at least one mad is sent from > the SM to the node. If we enforce the SM to set the re-register bit > --each-- time it discovers a node, then the bug is solved. > > I will test this scheme and let you know what I get (with the voltaire > SM and mthca driver). > > Eitan, Michael - any insight on the matter? > > Or. > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ogerlitz at voltaire.com Tue Sep 25 06:25:19 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 15:25:19 +0200 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <1190726439.7075.405.camel@hrosenstock-ws.xsigo.com> References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> <46F7E96E.4060302@ichips.intel.com> <46F9065C.3090907@voltaire.com> <1190726439.7075.405.camel@hrosenstock-ws.xsigo.com> Message-ID: <46F90C3F.1030701@voltaire.com> Hal Rosenstock wrote: > On Tue, 2007-09-25 at 15:00 +0200, Or Gerlitz wrote: >> Sean Hefty wrote: >>>>> node 1 <-> switch A <-> switch B <-> switch C <-> SA >>>> The host would only see port up/down events as of changes in the link >>>> state in the local port or in the port which is connected to it through 
>>>> the cable. >>> So, if you brought the link down/up between switches A & B, node 1 >>> wouldn't receive any events, but it would be removed from the multicast >>> group? >> good catch! >> >> Indeed, when the link between switches A and B goes down, per the view >> point of the SM, the whole sub-fabric across A is lost and hence the >> node is dropped from all the multicast groups it is joined to. > > No, it is not (dropped from all multicast groups it is joined to). It > may be removed from the multicast forwarding tables if there is no route > available but it is still a member of the group. Hi Hal, So the node (port) is a member of a multicast group for which routing is not configured but when the port is discovered again, the SM runs the multicast routing engine (for all groups? for all groups for which discovered ports are member of?) again and configures the routing, nice. I will test it, thanks for the explanation. Or. >> However, from the view point of the node, no port down is experienced. >> >> When the A-B link goes up, the SM discovers all nodes across A and >> probes their ports, though this process a port active event --might-- be >> generated by the HCA FW, but I am not sure its mandatory. >> >> Since the only trigger for ipoib to rejoin to multicast groups is >> delivery of event by the hw driver, namely one of: port down/up, lid >> change, sm lid change, client re-register. I think we might have a hole >> here if none of these events is generated. > > It doesn't need to rejoin for this case. See above explanation. From mst at dev.mellanox.co.il Tue Sep 25 06:40:43 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 25 Sep 2007 15:40:43 +0200 Subject: [ofa-general] Re: [PATCH 6/11] IB/ipoib: add checksum offload support In-Reply-To: <46F9086B.5020804@voltaire.com> References: <1190637551.4947.66.camel@mtls03> <46F8E40C.3030203@voltaire.com> <1190721304.4947.134.camel@mtls03> <46F8FCAB.4010002@voltaire.com> <20070925123425.GA20844@mellanox.co.il> <46F9086B.5020804@voltaire.com> Message-ID: <20070925134043.GC20844@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [PATCH 6/11] IB/ipoib: add checksum offload support > > Michael S. Tsirkin wrote: > >>I am not with you. Does the connectX actually computes and inserts > >>tcp/udp checksum on outgoing packets? > > >ConnectX can only compute checksum for TCP and UDP protocols > >for datagram packets. That's what Eli's patches enable. > > So for datagram mode (ie UD QP) the connectX HW does support computing > the TCP and UDP checksum and inserting them to the on-the-wire-packet?! Yes. -- MST From ogerlitz at voltaire.com Tue Sep 25 06:55:09 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 25 Sep 2007 15:55:09 +0200 Subject: [ofa-general] Re: [PATCH 6/11] IB/ipoib: add checksum offload support In-Reply-To: <20070925134043.GC20844@mellanox.co.il> References: <1190637551.4947.66.camel@mtls03> <46F8E40C.3030203@voltaire.com> <1190721304.4947.134.camel@mtls03> <46F8FCAB.4010002@voltaire.com> <20070925123425.GA20844@mellanox.co.il> <46F9086B.5020804@voltaire.com> <20070925134043.GC20844@mellanox.co.il> Message-ID: <46F9133D.8020507@voltaire.com> Michael S. Tsirkin wrote: >>>> I am not with you. Does the connectX actually computes and inserts >>>> tcp/udp checksum on outgoing packets? >>> ConnectX can only compute checksum for TCP and UDP protocols >>> for datagram packets. That's what Eli's patches enable. >> So for datagram mode (ie UD QP) the connectX HW does support computing >> the TCP and UDP checksum and inserting them to the on-the-wire-packet?! > Yes. cool. 
I must say that over the thread that followed your "[PATCHv2] IB/ipoib: S/G and HW checksum support" posting, I did not understand this is the case, nor I suspect some or all of the other participants. Have you stated that clearly anywhere on that thread? Or. From mst at dev.mellanox.co.il Tue Sep 25 06:58:28 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Sep 2007 15:58:28 +0200 Subject: [ofa-general] Re: [PATCH 6/11] IB/ipoib: add checksum offload support In-Reply-To: <46F9133D.8020507@voltaire.com> References: <1190637551.4947.66.camel@mtls03> <46F8E40C.3030203@voltaire.com> <1190721304.4947.134.camel@mtls03> <46F8FCAB.4010002@voltaire.com> <20070925123425.GA20844@mellanox.co.il> <46F9086B.5020804@voltaire.com> <20070925134043.GC20844@mellanox.co.il> <46F9133D.8020507@voltaire.com> Message-ID: <20070925135828.GF20844@mellanox.co.il> > Have you stated that clearly anywhere on that thread? Yes. -- MST From hnguyen at linux.vnet.ibm.com Tue Sep 25 07:09:29 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Tue, 25 Sep 2007 16:09:29 +0200 Subject: [ofa-general] Please pull libehca.git/libehca ofed_1_3 branch Message-ID: <200709251609.29808.hnguyen@linux.vnet.ibm.com> Hi Michael and Vlad! Please pull from git://git.openfabrics.org/~hnguyen/libehca.git branch ofed_1_3 to get the fixes below. Thanks Nam commit cc3534dfb6c48415036d8c239ae6d235e3027675 Author: Hoang-Nam Nguyen Date: Tue Sep 25 15:42:38 2007 +0200 fix unused and undefined var fw_handle Signed-off-by: Hoang-Nam Nguyen commit 2abc7a5c3c2b3d872068e41040d8b2b1115c749c Author: Hoang-Nam Nguyen Date: Tue Sep 25 15:40:06 2007 +0200 declare extern vars (kpage_size/shift) in one place Signed-off-by: Hoang-Nam Nguyen From mst at dev.mellanox.co.il Tue Sep 25 07:12:56 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 25 Sep 2007 16:12:56 +0200 Subject: [ofa-general] Re: Please pull libehca.git/libehca ofed_1_3 branch In-Reply-To: <200709251609.29808.hnguyen@linux.vnet.ibm.com> References: <200709251609.29808.hnguyen@linux.vnet.ibm.com> Message-ID: <20070925141256.GH20844@mellanox.co.il> > Quoting Hoang-Nam Nguyen : > Subject: Please pull libehca.git/libehca ofed_1_3 branch > > Hi Michael and Vlad! > Please pull from git://git.openfabrics.org/~hnguyen/libehca.git > branch ofed_1_3 to get the fixes below. done -- MST From mst at dev.mellanox.co.il Tue Sep 25 07:20:45 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Sep 2007 16:20:45 +0200 Subject: [ofa-general] Re: [PATCH v8] IB/mlx4: shrinking WQE In-Reply-To: <20070920074712.GB7141@mellanox.co.il> References: <20070920074712.GB7141@mellanox.co.il> Message-ID: <20070925142045.GI20844@mellanox.co.il> Quoting Michael S. Tsirkin : Subject: [PATCH v8] IB/mlx4: shrinking WQE ConnectX supports shrinking wqe, such that a single WR can include multiple units of wqe_shift. This way, WRs can differ in size, and do not have to be a power of 2 in size, saving memory and speeding up send WR posting. Unfortunately, if we do this wqe_index field in CQE can't be used to look up the WR ID anymore, so do this only if selective signalling is off. Further, on 32-bit platforms, we can't use vmap to make the QP buffer virtually contigious. Thus we have to use constant-sized WRs to make sure a WR is always fully within a single page-sized chunk. Finally, we use WR with NOP opcode to avoid wrap-around in the middle of WR. We set NoErrorCompletion bit to avoid getting completions with error for NOP WRs. Since NEC is only supported starting with firmware 2.2.232, we use constant-sized WRs for older firmware. And, since MLX QPs only support SEND, we use constant-sized WRs in this case. Signed-off-by: Michael S. 
Tsirkin --- Changes since v8: - fix thinko in stamping code: owner bit value should be invalid Changes since v7: - avoid mis-detecting recv write with immediate completion as NOP - increase min. wqe_shift for RC QPs to 64 bytes, so that stamping (which is done each 64 bytes) invalidates all WQEs - disable WQE shrinking if FW version is < 2.2.232, otherwise we could get CQE with error for NOP, which might overflow the CQ diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 8bf44da..20ba988 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -331,6 +331,12 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP && + is_send)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +359,10 @@ static int mlx4_ib_poll_one(struct mlx4_ib_cq *cq, if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if ((*cur_qp)->ibqp.srq) { diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 705ff2f..a72ecb9 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -115,6 +115,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int sq_spare_wqes; struct mlx4_ib_wq sq; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 158507d..c844498 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ 
b/drivers/infiniband/hw/mlx4/qp.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp) static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,88 @@ static void *get_send_wqe(struct mlx4_ib_qp *qp, int n) /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). + * + * When max WR is than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. */ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { - u32 *wqe = get_send_wqe(qp, n); + u32 *wqe; int i; + int s; + int ind; + void *buf; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift); + if (qp->sq_max_wqes_per_wr > 1) { + for (i = 0; i < s; i += 64) { + ind = (i >> qp->sq.wqe_shift) + n; + stamp = ind & qp->sq.wqe_cnt ? 
cpu_to_be32(0x7fffffff) : + cpu_to_be32(0xffffffff); + buf = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); + wqe = buf + (i & ((1 << qp->sq.wqe_shift) - 1)); + *wqe = stamp; + } + } else { + buf = get_send_wqe(qp, n); + for (i = 64; i < s; i += 64) { + wqe = buf + i; + *wqe = 0xffffffff; + } + } +} - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + stamp_send_wqe(qp, (n + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1), size); + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = sizeof(struct mlx4_wqe_ctrl_seg); + + if (qp->ibqp.qp_type == IB_QPT_UD) { + struct mlx4_wqe_datagram_seg *dgram = wqe + sizeof *ctrl; + struct mlx4_av *av = (struct mlx4_av *)dgram->av; + memset(dgram, 0, sizeof *dgram); + av->port_pd = cpu_to_be32((qp->port << 24) | to_mpd(qp->ibqp.pd)->pdn); + s += sizeof(struct mlx4_wqe_datagram_seg); + } + + /* Pad the remainder of the WQE with an inline data segment. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); + + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? 
cpu_to_be32(1 << 31) : 0); +} + +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -237,6 +310,8 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +327,69 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this wqe_index field in CQE + * can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. 
+ * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contiguous. Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. + * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * We set NEC bit to avoid getting completions with error for NOP WRs. + * Since NEC is only supported starting with firmware 2.2.232, + * we use constant-sized WRs for older firmware. + * + * And, since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. + * + * We set WQE size to at least 64 bytes, this way stamping invalidates each WQE. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + if (dev->dev->caps.fw_ver >= MLX4_FW_VER_WQE_CTRL_NEC && + qp->sq_signal_bits && BITS_PER_LONG == 64 && + type != IB_QPT_SMI && type != IB_QPT_GSI) + qp->sq.wqe_shift = ilog2(64); + else + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. 
+ */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +401,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +440,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +536,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1030,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1238,13 +1364,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr, unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { @@ -1260,7 +1387,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? @@ -1371,16 +1498,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = (ind + qp->sq_spare_wqes) & (qp->sq.wqe_cnt - 1); + ind += DIV_ROUND_UP(size * 16, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). 
*/ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size * 16); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1402,8 +1536,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size * 16); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); diff --git a/drivers/net/mlx4/alloc.c b/drivers/net/mlx4/alloc.c index f8d63d3..0fce74d 100644 --- a/drivers/net/mlx4/alloc.c +++ b/drivers/net/mlx4/alloc.c @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + kfree(pages); + if (!buf->u.direct.buf) + goto err_free; + } } return 0; @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf) dma_free_coherent(&dev->pdev->dev, size, buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index cfb78fb..2c6c768 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -129,6 +129,11 @@ enum { MLX4_STAT_RATE_OFFSET = 5 }; +static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) +{ + return (major << 32) | (minor << 16) | subminor; +} + struct mlx4_caps { u64 fw_ver; int num_ports; @@ -185,7 +190,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct 
mlx4_buf_list *page_list; } u; diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h index 3968b94..09a2230 100644 --- a/include/linux/mlx4/qp.h +++ b/include/linux/mlx4/qp.h @@ -154,7 +154,11 @@ struct mlx4_qp_context { u32 reserved5[10]; }; +/* Which firmware version adds support for NEC (NoErrorCompletion) bit */ +#define MLX4_FW_VER_WQE_CTRL_NEC mlx4_fw_ver(2, 2, 232) + enum { + MLX4_WQE_CTRL_NEC = 1 << 29, MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, -- MST From changquing.tang at hp.com Tue Sep 25 07:23:51 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Tue, 25 Sep 2007 14:23:51 -0000 Subject: [ofa-general] Atomic operation question. In-Reply-To: <46F8A682.8020307@dev.mellanox.co.il> References: <349DCDA352EACF42A0C49FA6DCEA8403025E4008@G3W0634.americas.hpqcorp.net> <46F8A682.8020307@dev.mellanox.co.il> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403025E43E8@G3W0634.americas.hpqcorp.net> Even if the two ports are on different subnet ? According to you, atomic operation is on per card basis, not on per port basis, right ? --CQ > -----Original Message----- > From: Dotan Barak [mailto:dotanb at dev.mellanox.co.il] > Sent: Tuesday, September 25, 2007 1:11 AM > To: Tang, Changqing > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] Atomic operation question. > > Hi. > > Tang, Changqing wrote: > > HI, I have a question for atmoic operation. If incoming atomic > > operations are from both ports of that HCA, can it work correctly ? > > > Yes, it should (if the HCA supports atomic operations). > > Dotan > From dotanb at dev.mellanox.co.il Tue Sep 25 07:33:07 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 25 Sep 2007 16:33:07 +0200 Subject: [ofa-general] Atomic operation question. 
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403025E43E8@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA8403025E4008@G3W0634.americas.hpqcorp.net> <46F8A682.8020307@dev.mellanox.co.il> <349DCDA352EACF42A0C49FA6DCEA8403025E43E8@G3W0634.americas.hpqcorp.net> Message-ID: <46F91C23.7060504@dev.mellanox.co.il> Tang, Changqing wrote: > Even if the two ports are on different subnet ? > > According to you, atomic operation is on per card basis, not on per port > basis, right ? > > > --CQ > Yes, even if the ports are in different subnets. The atomicity is enforced in the HCA on a specific memory location, and it doesn't matter from which QP (or from which IB port) the operation was received. Dotan From eli at mellanox.co.il Tue Sep 25 07:31:18 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 25 Sep 2007 16:31:18 +0200 Subject: [ofa-general] [PATCH 10/11]: IB/ipoib modify cq params In-Reply-To: <46F8E713.5060009@voltaire.com> References: <1190637684.4947.74.camel@mtls03> <46F8E713.5060009@voltaire.com> Message-ID: <1190730678.4947.150.camel@mtls03> > I think we have to carefully think if/how does this feature goes hand in > hand with NAPI. Since NAPI is not optional, with this feature the > network stack tries to do its best to reduce interrupts with the NAPI > logic, and on top of that the HW is instructed to apply this logic > before issuing an interrupt. My experience shows that interrupt moderation helps make better use of NAPI. Without it, chances are higher that the device generates an interrupt request, a fast CPU polls the CQE and returns after calling netif_rx_complete (which will cause the next CQE to generate another interrupt request). When using moderation, it is more probable that the CQ will already contain several CQEs, which lets NAPI contribute even more to coalescing. > > Does the need here suggests that NAPI can be improved? if yes how? 
maybe > for some infiniband devices interrupt moderation for itself would be > better so NAPI should be disabled? I think you can diminish the effect of NAPI by setting the weight parameter to a lower value. Not all devices support interrupt moderation. > > To suggest this for merge, I think you would need to share the list with > the IPoIB results you had with NAPI vs with NAPI AND interrupt moderation. I don't have orderly records of the effect but I can verify that using moderation does improve performance. Anyone can experiment with it. You can use ethtool to set different values. > > Or. > From eli at mellanox.co.il Tue Sep 25 07:35:36 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 25 Sep 2007 16:35:36 +0200 Subject: [ofa-general] [PATCH 3/11] ib_core: add checksum offload support In-Reply-To: <46F8E9B3.80302@voltaire.com> References: <1190637451.4947.60.camel@mtls03> <46F8DFBB.1000800@voltaire.com> <1190716252.4947.125.camel@mtls03> <46F8E9B3.80302@voltaire.com> Message-ID: <1190730936.4947.154.camel@mtls03> On Tue, 2007-09-25 at 12:57 +0200, Or Gerlitz wrote: > I did not say that you need to add four capabilities, you can add one > that fits the connectX feature eg IB_DEVICE_HW_CSUM and document that if > the device supports this, it is capable to compute TCP and UDP checksum > for both IPv4 and IPv6 packets, etc. Later if some new HW will be > capable to offload only IPv4, they will add a new capability etc. > OK, I will add a comment. > Again, per the discussion over the thread it --is-- true, also, it > creates confusion while reading the code (why to ask the HW to do > something which is always done by SW?), also this does not put more work > on the private case of the connectX HW which relies on the IB ICRC, > second HW might go and actually compute the IP header csum. > Mellanox HW does not rely in this case on IB ICRC. The HW really computes checksum on outgoing packets. 
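For readers unfamiliar with what is being offloaded in this thread: the TCP and UDP checksums are the 16-bit ones'-complement Internet checksum (RFC 1071). The sketch below is a plain user-space C version of that sum, not driver or firmware code. The function name inet_checksum is made up for illustration, and the pseudo-header that TCP and UDP add to the sum is left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: 16-bit ones'-complement Internet checksum (RFC 1071)
 * over a byte buffer, treating the data as big-endian 16-bit words.
 * Returns the checksum in host byte order. The 32-bit accumulator is
 * sufficient for packet-sized buffers.
 */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	/* Add up the 16-bit words. */
	while (len > 1) {
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}

	/* An odd trailing byte is padded with a zero byte on the right. */
	if (len == 1)
		sum += (uint32_t)data[0] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

A device advertising a capability such as the proposed IB_DEVICE_HW_CSUM would do this work in hardware, so the stack can skip the per-byte loop on the CPU.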
From eli at mellanox.co.il Tue Sep 25 07:41:09 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 25 Sep 2007 16:41:09 +0200 Subject: [ofa-general] [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <46F8EBC6.40100@voltaire.com> References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> Message-ID: <1190731269.4947.158.camel@mtls03> On Tue, 2007-09-25 at 13:06 +0200, Or Gerlitz wrote: > Eli Cohen wrote: > > On Tue, 2007-09-25 at 12:22 +0200, Or Gerlitz wrote: > >> Eli Cohen wrote: > > >>> Add high dma support to ipoib > >>> This patch assumes all IB devices support 64 bit DMA. > > >> On some architectures DMA addresses are 32 bit, so I am not sure to > >> follow your comment. This capability states that the network device can > >> dma to high memory. > > > I believe it means that *if* the kernel hands buffers whose addresses > > exceed 32 bits then the device can handle them. > > High-memory is well documents in books and elsewhere. I just want to say > that the change-log comment is confusing and unrelated. > > What you want to say is that this patch assumes that for all IB devices, > ib_dma_map_single and ib_dma_map_page supports high memory, which is not > the case, see below. > > Ralph? > > Or. > > > static u64 ipath_dma_map_page(struct ib_device *dev, > > struct page *page, > > unsigned long offset, > > size_t size, > > enum dma_data_direction direction) > > { > > u64 addr; > > > > BUG_ON(!valid_dma_direction(direction)); > > > > if (offset + size > PAGE_SIZE) { > > addr = BAD_DMA_ADDRESS; > > goto done; > > } > > > > addr = (u64) page_address(page); > > if (addr) > > addr += offset; > > /* TODO: handle highmem pages */ > > > > done: > > return addr; > > } > I got the impression that all supported IB devices support dma-ing to/from memory > 4GB. Perhaps other vendors can comment. 
From tziporet at mellanox.co.il Tue Sep 25 07:46:09 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 25 Sep 2007 16:46:09 +0200 Subject: [ofa-general] OFED Sep 24 meeting summary on OFED 1.3 development status Message-ID: <6C2C79E72C305246B504CBA17B5500C901563E71@mtlexch01.mtl.com> Sep 24 OFED meeting summary: ============================ OFED 1.3 related decisions: --------------------------- 1. Agreed on the new OFED 1.3 schedule: * Feature freeze - Oct 3 * Alpha release - Oct 8 * Beta release - Oct 17 (may change according to 2.6.24 rc1 availability) * RC1 - Oct 24 * RC2 - Nov 7 * RC3 - Nov 20 * RC4 - Dec 4 * GA release - Dec 18 2. Agree to move to kernel base 2.6.24 Start with what we have now (2.6.23) and move to 2.6.24 when RC1 is available. This will reduce many patches and with the new timeline seems more appropriate. 3. We wish to reduce the amount of compilation warnings in the backport patches. Betsy from Qlogic will drive this. OFA conference after CS07: -------------------------- * We think we need at least half a day for EWG discussions on future plans. * Need to decide on agenda items in the next meetings Tziporet From mst at dev.mellanox.co.il Tue Sep 25 07:53:15 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 25 Sep 2007 16:53:15 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <1190731269.4947.158.camel@mtls03> References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> Message-ID: <20070925145314.GA4987@mellanox.co.il> > > > static u64 ipath_dma_map_page(struct ib_device *dev, > > > struct page *page, > > > unsigned long offset, > > > size_t size, > > > enum dma_data_direction direction) > > > { > > > u64 addr; > > > > > > BUG_ON(!valid_dma_direction(direction)); > > > > > > if (offset + size > PAGE_SIZE) { > > > addr = BAD_DMA_ADDRESS; > > > goto done; > > > } > > > > > > addr = (u64) page_address(page); > > > if (addr) > > > addr += offset; > > > /* TODO: handle highmem pages */ > > > > > > done: > > > return addr; > > > } > > > > I got the impression that all supported IB devices support dma-ing > to/from memory > 4GB. Perhaps other vendors can comment. I think it's true. The only reason ipath doesn't support this at the moment, is because the maintainer doesn't seem to care about supporting 32 bit systems. -- MST From tziporet at dev.mellanox.co.il Tue Sep 25 08:00:29 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 25 Sep 2007 17:00:29 +0200 Subject: [ofa-general] OFED Sep 24 meeting summary on OFED 1.3 development status In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563E71@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563E71@mtlexch01.mtl.com> Message-ID: <46F9228D.7080608@mellanox.co.il> Note: OFED 1.3 plans and schedule are updated on: https://wiki.openfabrics.org/tiki-index.php?page=OFED+1.3+release+plan+and+features Tziporet Koren wrote: > Sep 24 OFED meeting summary: > ============================ > OFED 1.3 related decisions: > --------------------------- > 1. 
Agreed on the new OFED 1.3 schedule: > * Feature freeze - Oct 3 > * Alpha release - Oct 8 > * Beta release - Oct 17 (may change according to 2.6.24 rc1 > availability) > * RC1 - Oct 24 > * RC2 - Nov 7 > * RC3 - Nov 20 > * RC4 - Dec 4 > * GA release - Dec 18 > > 2. Agree to move to kernel base 2.6.24 > Start with what we have now (2.6.23) and move to 2.6.24 when RC1 is > available. > This will reduce many patches and with the new timeline seems more > appropriate. > > 3. We wish to reduce the amount of compilation warnings in the backport > patches. > Betsy from Qlogic will drive this. > > OFA conference after CS07: > -------------------------- > * We think we need at least half a day for EWG discussions on future > plans. > * Need to decide on agenda items in the next meetings > > > Tziporet > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From fubar at us.ibm.com Tue Sep 25 08:24:29 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Tue, 25 Sep 2007 08:24:29 -0700 Subject: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <46F7D770.4090500@voltaire.com> References: <46F7D770.4090500@voltaire.com> Message-ID: <10376.1190733869@death> ACK patches 3 - 9. Roland, are you comfortable with the IB changes in patches 1 and 2? Jeff, when Roland acks patches 1 and 2, please apply all 9. -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com Moni Shoua wrote: >This patch series is the sixth version (see below link to V5) of the >suggested changes to the bonding driver so it would be able to support >non ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. > >Patches 1-8 were originally submitted in V5 and patch 9 is an addition by Jay. 
> > >Major changes from the previous version: >---------------------------------------- > >1. Remove the patches to net/core. Bonding will use the NETDEV_GOING_DOWN notification > instead of NETDEV_CHANGE+IFF_SLAVE_DETACH. This reduces the amount of patches from 11 > to 9. > >Links to earlier discussion: >---------------------------- > >1. A discussion in netdev about bonding support for IPoIB. >http://lists.openwall.net/netdev/2006/11/30/46 > >2. V5 series >http://lists.openfabrics.org/pipermail/general/2007-September/040996.html > >- >To unsubscribe from this list: send the line "unsubscribe netdev" in >the body of a message to majordomo at vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html From shemminger at linux-foundation.org Tue Sep 25 08:24:57 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Tue, 25 Sep 2007 08:24:57 -0700 Subject: [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <1190726138.4264.105.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> <20070924171411.36494656@freepuppy.rosehill> <1190726138.4264.105.camel@localhost> Message-ID: <20070925082457.6fec30d6@freepuppy.rosehill> On Tue, 25 Sep 2007 09:15:38 -0400 jamal wrote: > On Mon, 2007-24-09 at 17:14 -0700, Stephen Hemminger wrote: > > > Since we are redoing this, > > is there any way to make the whole TX path > > more lockless? The existing model seems to be more of a monitor than > > a real locking model. > http://en.wikipedia.org/wiki/Monitor_(synchronization) > What do you mean it is "more of a monitor"? 
The transmit code path is locked as a code region, rather than just object locking on the transmit queue or other fine grained object. This leads to moderately long lock hold times when multiple qdisc's and classification is being done. > > On the challenge of making it lockless: > About every NAPI driver combines the tx prunning with rx polling. If you > are dealing with tx resources on receive thread as well as tx thread, > _you need_ locking. The only other way we can do avoid it is to separate > the rx path interupts from ones on tx related resources; the last NAPI > driver that did that was tulip; i think the e1000 for a short period in > its life did the same as well. But that has been frowned on and people > have evolved away from it. If we went to finer grain locking it would also mean changes to all network devices using the new locking model. My assumption is that we would use something like the features flag to do the transition for backward compatibility. Take this as a purely "what if" or "it would be nice if" kind of suggestion not a requirement or some grand plan. -- Stephen Hemminger From rdreier at cisco.com Tue Sep 25 08:39:23 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 25 Sep 2007 08:39:23 -0700 Subject: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <10376.1190733869@death> (Jay Vosburgh's message of "Tue, 25 Sep 2007 08:24:29 -0700") References: <46F7D770.4090500@voltaire.com> <10376.1190733869@death> Message-ID: > Roland, are you comfortable with the IB changes in patches 1 and 2? Yes, they look fine to me. Feel free to apply, with Acked-by: Roland Dreier - R. 
From mshefty at ichips.intel.com Tue Sep 25 09:46:21 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Sep 2007 09:46:21 -0700 Subject: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses In-Reply-To: <46F8DB6F.8050901@voltaire.com> References: <46F7FDE5.9070305@oracle.com> <000401c7feee$ea073180$ff0da8c0@amr.corp.intel.com> <46F8DB6F.8050901@voltaire.com> Message-ID: <46F93B5D.8000606@ichips.intel.com> > In iscsi/iser, the approach we took wrt to destruction of a pair > (ID and QP are created/destroyed through and state-managed by the > rdma-cm) is: > > A) call rdma_disconnect to make sure the QP was transitioned to error > > B) get the completions/flushes assoc. with all the WR posted to the QP > C) make sure a disconnected event was received > > call rdma_destroy_qp only when B && C hold. > > What is your take on this approach? That should work fine and avoid any races. It was just my intent to support calling rdma_destroy_id() at any time in order to cancel any outstanding operation. I didn't consider that this also meant supporting rdma_destroy_qp() being called at any time. This patch should protect against QP destruction, as long as the user does not call rdma_destroy_qp() while calling another API, such as rdma_accept(). 
- Sean From rdreier at cisco.com Tue Sep 25 09:55:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 25 Sep 2007 09:55:19 -0700 Subject: [ofa-general] [PATCH 10/11]: IB/ipoib modify cq params In-Reply-To: <46F8E713.5060009@voltaire.com> (Or Gerlitz's message of "Tue, 25 Sep 2007 12:46:43 +0200") References: <1190637684.4947.74.camel@mtls03> <46F8E713.5060009@voltaire.com> Message-ID: > I think we have to carefully think if/how does this feature goes hand > in hand with NAPI. Since NAPI is not optional, with this feature the > network stack tries to do its best to reduce interrupts with the NAPI > logic, and on top of that the HW is instructed to apply this logic > before issuing an interrupt. Actually if you read the history of NAPI on netdev, you will see that it is highly recommended to use interrupt mitigation in conjunction with NAPI. So the idea here is good; I just have some questions about some of the details. - R. From sashak at voltaire.com Tue Sep 25 10:32:45 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 25 Sep 2007 19:32:45 +0200 Subject: [ofa-general] Re: [PATCH] osm/osm_sa_path_record: trivial cosmetic chage In-Reply-To: <46F838EB.10704@dev.mellanox.co.il> References: <46F838EB.10704@dev.mellanox.co.il> Message-ID: <20070925173245.GX29384@sashak.voltaire.com> Hi Yevgeny, On 00:23 Tue 25 Sep , Yevgeny Kliteynik wrote: > Trivial fix in osm_sa_path_record.c > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/osm_sa_path_record.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c > index 3b183d9..ce75ec8 100644 > --- a/opensm/opensm/osm_sa_path_record.c > +++ b/opensm/opensm/osm_sa_path_record.c > @@ -723,7 +723,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > if (pkey) { Is this "if (pkey)" needed? Following the flow above, I found that pkey should always be initialized. Right? 
Sasha > p_prtn = > (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, > - pkey & cl_ntoh16((uint16_t) ~ > + pkey & cl_hton16((uint16_t) ~ > 0x8000)); > if (p_prtn == > (osm_prtn_t *) cl_qmap_end(&p_rcv->p_subn->prtn_pkey_tbl)) > -- > 1.5.1.4 > From sashak at voltaire.com Tue Sep 25 10:33:22 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 25 Sep 2007 19:33:22 +0200 Subject: [ofa-general] Re: [PATCH] osm/osm_sa_path_record: trivial cosmetic chage In-Reply-To: <46F838EB.10704@dev.mellanox.co.il> References: <46F838EB.10704@dev.mellanox.co.il> Message-ID: <20070925173322.GY29384@sashak.voltaire.com> On 00:23 Tue 25 Sep , Yevgeny Kliteynik wrote: > Trivial fix in osm_sa_path_record.c > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From ralph.campbell at qlogic.com Tue Sep 25 10:33:29 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 25 Sep 2007 10:33:29 -0700 Subject: [ofa-general] [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <1190731269.4947.158.camel@mtls03> References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> Message-ID: <1190741609.20700.101.camel@brick.pathscale.com> On Tue, 2007-09-25 at 16:41 +0200, Eli Cohen wrote: > On Tue, 2007-09-25 at 13:06 +0200, Or Gerlitz wrote: > > Eli Cohen wrote: > > > On Tue, 2007-09-25 at 12:22 +0200, Or Gerlitz wrote: > > >> Eli Cohen wrote: > > > > >>> Add high dma support to ipoib > > >>> This patch assumes all IB devices support 64 bit DMA. > > > > >> On some architectures DMA addresses are 32 bit, so I am not sure to > > >> follow your comment. This capability states that the network device can > > >> dma to high memory. > > > > > I believe it means that *if* the kernel hands buffers whose addresses > > > exceed 32 bits then the device can handle them. > > > > High-memory is well documents in books and elsewhere. 
I just want to say > > that the change-log comment is confusing and unrelated. > > > > What you want to say is that this patch assumes that for all IB devices, > > ib_dma_map_single and ib_dma_map_page supports high memory, which is not > > the case, see below. > > > > Ralph? > > > > Or. Correct. ib_ipath doesn't support high memory and it would be inefficient to do so. > > > static u64 ipath_dma_map_page(struct ib_device *dev, > > > struct page *page, > > > unsigned long offset, > > > size_t size, > > > enum dma_data_direction direction) > > > { > > > u64 addr; > > > > > > BUG_ON(!valid_dma_direction(direction)); > > > > > > if (offset + size > PAGE_SIZE) { > > > addr = BAD_DMA_ADDRESS; > > > goto done; > > > } > > > > > > addr = (u64) page_address(page); > > > if (addr) > > > addr += offset; > > > /* TODO: handle highmem pages */ > > > > > > done: > > > return addr; > > > } > > > > I got the impression that all supported IB devices support dma-ing > to/from memory > 4GB. Perhaps other vendors can comment. The QLogic HCAs don't support DMA to or from the physical memory for the verbs Lkey/Rkey memory regions. The whole reason I added the ib_dma_*() functions was so to avoid ib_ipoib, etc. from calling dma_*() directly and passing a physical address as the offset in the posted work requests. What happens instead, is that ib_dma_*() returns a kernel virtual address which is passed in the work request and the driver copies the data to/from the HW as needed. So, in order to support HIGHMEM, I would need to change the ipath_dma_*() functions to call kmap()/kunmap() for HIGHMEM pages. I'm sure there would be all kinds of performance and coding issues around doing this. 
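Ralph's point above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the ib_ipath code: struct model_page, model_virt_map_page(), MODEL_PAGE_SIZE and MODEL_BAD_DMA_ADDRESS are hypothetical names invented for the illustration. It assumes a driver whose mapping hook hands back a kernel virtual address rather than a bus address, so a page without a permanent kernel mapping (a HIGHMEM page) cannot be mapped unless something like kmap()/kunmap() is added.

```c
/* User-space model only: none of these names are kernel or ib_ipath symbols. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE       4096u
#define MODEL_BAD_DMA_ADDRESS ((uint64_t)0)

/*
 * Hypothetical stand-in for struct page. A lowmem page has a permanent
 * kernel virtual address; a highmem page is modelled with kvaddr == NULL.
 */
struct model_page {
	void *kvaddr;
};

/*
 * Loosely follows the shape of the ipath_dma_map_page() quoted above:
 * reject ranges that cross the page boundary, then return the page's
 * virtual address plus the offset. A highmem page has no address to
 * return, so the mapping fails.
 */
uint64_t model_virt_map_page(const struct model_page *page,
			     size_t offset, size_t size)
{
	assert(page != NULL);

	/* Written to avoid overflow in offset + size. */
	if (offset > MODEL_PAGE_SIZE || size > MODEL_PAGE_SIZE - offset)
		return MODEL_BAD_DMA_ADDRESS;

	/* Highmem: a real driver would need a temporary mapping here. */
	if (page->kvaddr == NULL)
		return MODEL_BAD_DMA_ADDRESS;

	return (uint64_t)(uintptr_t)page->kvaddr + offset;
}
```

In this model a caller that passes a highmem page simply gets MODEL_BAD_DMA_ADDRESS back, which is the situation that a blanket "all IB devices support high DMA" assumption in ipoib would run into on such a driver.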
From sashak at voltaire.com Tue Sep 25 10:43:25 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 25 Sep 2007 19:43:25 +0200 Subject: [ofa-general] Re: [PATCH] osm: QoS parser - adding pkey in port groups In-Reply-To: <46F83A68.4040004@dev.mellanox.co.il> References: <46F83A68.4040004@dev.mellanox.co.il> Message-ID: <20070925174325.GA29384@sashak.voltaire.com> On 00:30 Tue 25 Sep , Yevgeny Kliteynik wrote: > Adding option to specify partitions for port groups in QoS > policy file using pkeys in addition to partition names. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From sashak at voltaire.com Tue Sep 25 10:44:47 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 25 Sep 2007 19:44:47 +0200 Subject: [ofa-general] Re: [PATCH] osm/osm_sa_path_record: trivial cosmetic chage In-Reply-To: <46F838EB.10704@dev.mellanox.co.il> References: <46F838EB.10704@dev.mellanox.co.il> Message-ID: <20070925174447.GB29384@sashak.voltaire.com> On 00:23 Tue 25 Sep , Yevgeny Kliteynik wrote: > Trivial fix in osm_sa_path_record.c > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/osm_sa_path_record.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c > index 3b183d9..ce75ec8 100644 > --- a/opensm/opensm/osm_sa_path_record.c > +++ b/opensm/opensm/osm_sa_path_record.c > @@ -723,7 +723,7 @@ __osm_pr_rcv_get_path_parms(IN osm_pr_rcv_t * const p_rcv, > if (pkey) { > p_prtn = > (osm_prtn_t *) cl_qmap_get(&p_rcv->p_subn->prtn_pkey_tbl, > - pkey & cl_ntoh16((uint16_t) ~ > + pkey & cl_hton16((uint16_t) ~ Also guess the same fix is relevant for MPR, right? 
Sasha From sean.hefty at intel.com Tue Sep 25 11:01:32 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 25 Sep 2007 11:01:32 -0700 Subject: [ofa-general] [PATCH-2.6.24 1/2] [RFC] ib/mad: report number of times a mad was retried Message-ID: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com> To allow ULPs to tune timeout values and capture retry statistics, report the number of times that a mad send operation was retried. For RMPP mads, report the total number of times that any portion (send window) of the send operation was retried. Signed-off-by: Sean Hefty --- drivers/infiniband/core/mad.c | 9 +++++++-- drivers/infiniband/core/mad_priv.h | 3 ++- drivers/infiniband/core/mad_rmpp.c | 2 +- include/rdma/ib_mad.h | 4 +++- 4 files changed, 13 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..91e62c3 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1100,7 +1100,9 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf, mad_send_wr->tid = ((struct ib_mad_hdr *) send_buf->mad)->tid; /* Timeout will be updated after send completes */ mad_send_wr->timeout = msecs_to_jiffies(send_buf->timeout_ms); - mad_send_wr->retries = send_buf->retries; + mad_send_wr->max_retries = send_buf->retries; + mad_send_wr->retries_left = send_buf->retries; + send_buf->retries = 0; /* Reference for work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); mad_send_wr->status = IB_WC_SUCCESS; @@ -2445,9 +2447,12 @@ static int retry_send(struct ib_mad_send_wr_private *mad_send_wr) { int ret; - if (!mad_send_wr->retries--) + if (!mad_send_wr->retries_left) return -ETIMEDOUT; + mad_send_wr->retries_left--; + mad_send_wr->send_buf.retries++; + mad_send_wr->timeout = msecs_to_jiffies(mad_send_wr->send_buf.timeout_ms); if (mad_send_wr->mad_agent_priv->agent.rmpp_version) { diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h 
index 9be5cc0..8b75010 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -131,7 +131,8 @@ struct ib_mad_send_wr_private { struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; __be64 tid; unsigned long timeout; - int retries; + int max_retries; + int retries_left; int retry; int refcount; enum ib_wc_status status; diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c index d43bc62..a5e2a31 100644 --- a/drivers/infiniband/core/mad_rmpp.c +++ b/drivers/infiniband/core/mad_rmpp.c @@ -684,7 +684,7 @@ static void process_rmpp_ack(struct ib_mad_agent_private *agent, if (seg_num > mad_send_wr->last_ack) { adjust_last_ack(mad_send_wr, seg_num); - mad_send_wr->retries = mad_send_wr->send_buf.retries; + mad_send_wr->retries_left = mad_send_wr->max_retries; } mad_send_wr->newwin = newwin; if (mad_send_wr->last_ack == mad_send_wr->send_buf.seg_count) { diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h index 8ec3799..7228c05 100644 --- a/include/rdma/ib_mad.h +++ b/include/rdma/ib_mad.h @@ -230,7 +230,9 @@ struct ib_class_port_info * @seg_count: The number of RMPP segments allocated for this send. * @seg_size: Size of each RMPP segment. * @timeout_ms: Time to wait for a response. - * @retries: Number of times to retry a request for a response. + * @retries: Number of times to retry a request for a response. For MADs + * using RMPP, this applies per window. On completion, returns the number + * of retries needed to complete the transfer. * * Users are responsible for initializing the MAD buffer itself, with the * exception of any RMPP header. 
Additional segment buffer space allocated From sean.hefty at intel.com Tue Sep 25 11:05:14 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 25 Sep 2007 11:05:14 -0700 Subject: [ofa-general] [PATCH-2.6.24 2/2] [RFC] ib/cm: add basic performance counters In-Reply-To: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com> References: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com> Message-ID: <000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com> Add performance/debug counters to track sent/received messages, retries, and duplicates. Counters are tracked per CM message type, per port. The counters are always enabled, so intrusive state tracking is not done. Signed-off-by: Sean Hefty --- This exports the CM counters through debugfs. The implementation of the counters changed to use a 2D array, but the type of counters are the same as in the previous version of this patch. drivers/infiniband/core/cm.c | 206 ++++++++++++++++++++++++++++++++++++++++-- 1 files changed, 194 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 2e39236..481b9e7 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2006 Intel Corporation. All rights reserved. + * Copyright (c) 2004-2007 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. 
@@ -37,6 +37,7 @@ #include #include +#include #include #include #include @@ -78,19 +79,60 @@ static struct ib_cm { struct workqueue_struct *wq; } cm; +/* Counter indexes ordered by attribute ID */ +enum { + CM_REQ_COUNTER, + CM_MRA_COUNTER, + CM_REJ_COUNTER, + CM_REP_COUNTER, + CM_RTU_COUNTER, + CM_DREQ_COUNTER, + CM_DREP_COUNTER, + CM_SIDR_REQ_COUNTER, + CM_SIDR_REP_COUNTER, + CM_LAP_COUNTER, + CM_APR_COUNTER, + CM_ATTR_COUNT, + CM_ATTR_ID_OFFSET = 0x0010, +}; + +static char const attr_names[CM_ATTR_COUNT][sizeof("SIDR_REQ")] = { + "REQ", "MRA", "REJ", "REP", "RTU", "DREQ", "DREP", + "SIDR_REQ", "SIDR_REP", "LAP", "APR" +}; + +enum { + CM_XMIT, + CM_XMIT_RETRIES, + CM_RECV, + CM_RECV_DUPLICATES, + CM_COUNTERS +}; + +static char const counter_names[CM_COUNTERS][sizeof("cm_rx_duplicates")] = { + "cm_tx_msgs", "cm_tx_retries", + "cm_rx_msgs", "cm_rx_duplicates" +}; + struct cm_port { struct cm_device *cm_dev; struct ib_mad_agent *mad_agent; + struct dentry *port_dir; u8 port_num; + atomic_long_t counters[CM_COUNTERS][CM_ATTR_COUNT]; + struct dentry *counter_file[CM_COUNTERS]; }; struct cm_device { struct list_head list; struct ib_device *device; + struct dentry *dev_dir; u8 ack_delay; struct cm_port port[0]; }; +static struct dentry *cm_dir; + struct cm_av { struct cm_port *port; union ib_gid dgid; @@ -1270,6 +1312,9 @@ static void cm_dup_req_handler(struct cm_work *work, struct ib_mad_send_buf *msg = NULL; int ret; + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_REQ_COUNTER]); + /* Quick state check to discard duplicate REQs. 
*/ if (cm_id_priv->id.state == IB_CM_REQ_RCVD) return; @@ -1616,6 +1661,8 @@ static void cm_dup_rep_handler(struct cm_work *work) if (!cm_id_priv) return; + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_REP_COUNTER]); ret = cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg); if (ret) goto deref; @@ -1781,6 +1828,8 @@ static int cm_rtu_handler(struct cm_work *work) if (cm_id_priv->id.state != IB_CM_REP_SENT && cm_id_priv->id.state != IB_CM_MRA_REP_RCVD) { spin_unlock_irq(&cm_id_priv->lock); + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_RTU_COUNTER]); goto out; } cm_id_priv->id.state = IB_CM_ESTABLISHED; @@ -1958,6 +2007,8 @@ static int cm_dreq_handler(struct cm_work *work) cm_id_priv = cm_acquire_id(dreq_msg->remote_comm_id, dreq_msg->local_comm_id); if (!cm_id_priv) { + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_DREQ_COUNTER]); cm_issue_drep(work->port, work->mad_recv_wc); return -EINVAL; } @@ -1977,6 +2028,8 @@ static int cm_dreq_handler(struct cm_work *work) case IB_CM_MRA_REP_RCVD: break; case IB_CM_TIMEWAIT: + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_DREQ_COUNTER]); if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg)) goto unlock; @@ -1988,6 +2041,10 @@ static int cm_dreq_handler(struct cm_work *work) if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); goto deref; + case IB_CM_DREQ_RCVD: + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_DREQ_COUNTER]); + goto unlock; default: goto unlock; } @@ -2339,10 +2396,19 @@ static int cm_mra_handler(struct cm_work *work) if (cm_mra_get_msg_mraed(mra_msg) != CM_MSG_RESPONSE_OTHER || cm_id_priv->id.lap_state != IB_CM_LAP_SENT || ib_modify_mad(cm_id_priv->av.port->mad_agent, - cm_id_priv->msg, timeout)) + cm_id_priv->msg, timeout)) { + if (cm_id_priv->id.lap_state == IB_CM_MRA_LAP_RCVD) + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_MRA_COUNTER]); goto out; + } cm_id_priv->id.lap_state = 
IB_CM_MRA_LAP_RCVD; break; + case IB_CM_MRA_REQ_RCVD: + case IB_CM_MRA_REP_RCVD: + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_MRA_COUNTER]); + /* fall through */ default: goto out; } @@ -2502,6 +2568,8 @@ static int cm_lap_handler(struct cm_work *work) case IB_CM_LAP_IDLE: break; case IB_CM_MRA_LAP_SENT: + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_LAP_COUNTER]); if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg)) goto unlock; @@ -2515,6 +2583,10 @@ static int cm_lap_handler(struct cm_work *work) if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); goto deref; + case IB_CM_LAP_RCVD: + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_LAP_COUNTER]); + goto unlock; default: goto unlock; } @@ -2796,6 +2868,8 @@ static int cm_sidr_req_handler(struct cm_work *work) cur_cm_id_priv = cm_insert_remote_sidr(cm_id_priv); if (cur_cm_id_priv) { spin_unlock_irq(&cm.lock); + atomic_long_inc(&work->port->counters + [CM_RECV_DUPLICATES][CM_SIDR_REQ_COUNTER]); goto out; /* Duplicate message. */ } cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD; @@ -2990,6 +3064,25 @@ static void cm_send_handler(struct ib_mad_agent *mad_agent, struct ib_mad_send_wc *mad_send_wc) { struct ib_mad_send_buf *msg = mad_send_wc->send_buf; + struct cm_port *port; + u16 attr_index; + + port = mad_agent->context; + attr_index = be16_to_cpu(((struct ib_mad_hdr *) + msg->mad)->attr_id) - CM_ATTR_ID_OFFSET; + + /* + * If the send was in response to a received message (context[0] is not + * set to a cm_id), and is not a REJ, then it is a send that was + * manually retried. 
+ */ + if (!msg->context[0] && (attr_index != CM_REJ_COUNTER)) + msg->retries = 1; + + atomic_long_add(1 + msg->retries, &port->counters[CM_XMIT][attr_index]); + if (msg->retries) + atomic_long_add(msg->retries, + &port->counters[CM_XMIT_RETRIES][attr_index]); switch (mad_send_wc->status) { case IB_WC_SUCCESS: @@ -3148,8 +3241,10 @@ EXPORT_SYMBOL(ib_cm_notify); static void cm_recv_handler(struct ib_mad_agent *mad_agent, struct ib_mad_recv_wc *mad_recv_wc) { + struct cm_port *port = mad_agent->context; struct cm_work *work; enum ib_cm_event_type event; + u16 attr_id; int paths = 0; switch (mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) { @@ -3194,6 +3289,9 @@ static void cm_recv_handler(struct ib_mad_agent *mad_agent, return; } + attr_id = be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.attr_id); + atomic_long_inc(&port->counters[CM_RECV][attr_id - CM_ATTR_ID_OFFSET]); + work = kmalloc(sizeof *work + sizeof(struct ib_sa_path_rec) * paths, GFP_KERNEL); if (!work) { @@ -3204,7 +3302,7 @@ static void cm_recv_handler(struct ib_mad_agent *mad_agent, INIT_DELAYED_WORK(&work->work, cm_work_handler); work->cm_event.event = event; work->mad_recv_wc = mad_recv_wc; - work->port = (struct cm_port *)mad_agent->context; + work->port = port; queue_delayed_work(cm.wq, &work->work, 0); } @@ -3379,6 +3477,65 @@ static void cm_get_ack_delay(struct cm_device *cm_dev) cm_dev->ack_delay = attr.local_ca_ack_delay; } +static ssize_t cm_read_counter(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + atomic_long_t *counter; + u64 value; + + if (*pos >= CM_ATTR_COUNT) + return 0; + + counter = filp->f_dentry->d_inode->i_private; + value = (u64) atomic_long_read(&counter[*pos]); + + return snprintf(buf, count, "%s %lld\n", attr_names[(*pos)++], value); +} + +static const struct file_operations cm_file_ops = { + .owner = THIS_MODULE, + .read = cm_read_counter +}; + +static int cm_create_port_fs(struct cm_port *port) +{ + char port_name[4]; + int i; + + sprintf(port_name, "%d", 
port->port_num); + port->port_dir = debugfs_create_dir(port_name, port->cm_dev->dev_dir); + if (!port->port_dir) + return -ENOMEM; + + for (i = 0; i < CM_COUNTERS; i++) { + port->counter_file[i] = debugfs_create_file(counter_names[i], + S_IFREG | S_IRUGO, + port->port_dir, + &port->counters[i], + &cm_file_ops); + if (!port->counter_file[i]) + goto error; + } + return 0; + +error: + while (i--) + debugfs_remove(port->counter_file[i]); + + debugfs_remove(port->port_dir); + return -ENOMEM; +} + +static void cm_remove_port_fs(struct cm_port *port) +{ + int i; + + for (i = 0; i < CM_COUNTERS; i++) + debugfs_remove(port->counter_file[i]); + + debugfs_remove(port->port_dir); +} + static void cm_add_one(struct ib_device *device) { struct cm_device *cm_dev; @@ -3397,11 +3554,15 @@ static void cm_add_one(struct ib_device *device) if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; - cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * + cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) return; + cm_dev->dev_dir = debugfs_create_dir(device->name, cm_dir); + if (!cm_dev->dev_dir) + goto error1; + cm_dev->device = device; cm_get_ack_delay(cm_dev); @@ -3410,6 +3571,11 @@ static void cm_add_one(struct ib_device *device) port = &cm_dev->port[i-1]; port->cm_dev = cm_dev; port->port_num = i; + + ret = cm_create_port_fs(port); + if (ret) + goto error2; + port->mad_agent = ib_register_mad_agent(device, i, IB_QPT_GSI, ®_req, @@ -3418,11 +3584,11 @@ static void cm_add_one(struct ib_device *device) cm_recv_handler, port); if (IS_ERR(port->mad_agent)) - goto error1; + goto error3; ret = ib_modify_port(device, i, 0, &port_modify); if (ret) - goto error2; + goto error4; } ib_set_client_data(device, &cm_client, cm_dev); @@ -3431,16 +3597,21 @@ static void cm_add_one(struct ib_device *device) write_unlock_irqrestore(&cm.device_lock, flags); return; -error2: +error4: ib_unregister_mad_agent(port->mad_agent); -error1: 
+error3: + cm_remove_port_fs(port); +error2: port_modify.set_port_cap_mask = 0; port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; while (--i) { port = &cm_dev->port[i-1]; ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); + cm_remove_port_fs(port); } + debugfs_remove(cm_dev->dev_dir); +error1: kfree(cm_dev); } @@ -3466,7 +3637,9 @@ static void cm_remove_one(struct ib_device *device) port = &cm_dev->port[i-1]; ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); + cm_remove_port_fs(port); } + debugfs_remove(cm_dev->dev_dir); kfree(cm_dev); } @@ -3488,17 +3661,25 @@ static int __init ib_cm_init(void) idr_pre_get(&cm.local_id_table, GFP_KERNEL); INIT_LIST_HEAD(&cm.timewait_list); - cm.wq = create_workqueue("ib_cm"); - if (!cm.wq) + cm_dir = debugfs_create_dir("infiniband_cm", NULL); + if (!cm_dir) return -ENOMEM; + cm.wq = create_workqueue("ib_cm"); + if (!cm.wq) { + ret = -ENOMEM; + goto error1; + } + ret = ib_register_client(&cm_client); if (ret) - goto error; + goto error2; return 0; -error: +error2: destroy_workqueue(cm.wq); +error1: + debugfs_remove(cm_dir); return ret; } @@ -3519,6 +3700,7 @@ static void __exit ib_cm_cleanup(void) } ib_unregister_client(&cm_client); + debugfs_remove(cm_dir); idr_destroy(&cm.local_id_table); } From ardavis at ichips.intel.com Tue Sep 25 11:31:24 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 25 Sep 2007 11:31:24 -0700 Subject: [ofa-general] [PATCH-2.6.24 2/2] [RFC] ib/cm: add basic performance counters In-Reply-To: <000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com> References: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com> <000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com> Message-ID: <46F953FC.50101@ichips.intel.com> Sean Hefty wrote: > Add performance/debug counters to track sent/received messages, retries, > and duplicates. Counters are tracked per CM message type, per port. 
> > The counters are always enabled, so intrusive state tracking is not done. > > Signed-off-by: Sean Hefty > --- > This exports the CM counters through debugfs. The implementation of > the counters changed to use a 2D array, but the type of counters are > the same as in the previous version of this patch. > Thanks for adding counters. They will be extremely helpful with our large scale cluster support. However, exporting them via debugfs will be useless in our production environments that do not have debugfs support built in the kernel. Can we expose these with the same mechanism as other IB/iWARP modules? Thanks, -arlin From mst at dev.mellanox.co.il Tue Sep 25 11:32:43 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 25 Sep 2007 20:32:43 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <1190741609.20700.101.camel@brick.pathscale.com> References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> Message-ID: <20070925183243.GC9670@mellanox.co.il> > Quoting Ralph Campbell : > Subject: Re: [PATCH 1/11] IB/ipoib: high dma support > > On Tue, 2007-09-25 at 16:41 +0200, Eli Cohen wrote: > > On Tue, 2007-09-25 at 13:06 +0200, Or Gerlitz wrote: > > > Eli Cohen wrote: > > > > On Tue, 2007-09-25 at 12:22 +0200, Or Gerlitz wrote: > > > >> Eli Cohen wrote: > > > > > > >>> Add high dma support to ipoib > > > >>> This patch assumes all IB devices support 64 bit DMA. > > > > > > >> On some architectures DMA addresses are 32 bit, so I am not sure to > > > >> follow your comment. This capability states that the network device can > > > >> dma to high memory. > > > > > > > I believe it means that *if* the kernel hands buffers whose addresses > > > > exceed 32 bits then the device can handle them. > > > > > > High-memory is well documents in books and elsewhere. 
I just want to say > > > that the change-log comment is confusing and unrelated. > > > > > > What you want to say is that this patch assumes that for all IB devices, > > > ib_dma_map_single and ib_dma_map_page supports high memory, which is not > > > the case, see below. > > > > > > Ralph? > > > > > > Or. > > Correct. ib_ipath doesn't support high memory and it would be > inefficient to do so. > > > > > static u64 ipath_dma_map_page(struct ib_device *dev, > > > > struct page *page, > > > > unsigned long offset, > > > > size_t size, > > > > enum dma_data_direction direction) > > > > { > > > > u64 addr; > > > > > > > > BUG_ON(!valid_dma_direction(direction)); > > > > > > > > if (offset + size > PAGE_SIZE) { > > > > addr = BAD_DMA_ADDRESS; > > > > goto done; > > > > } > > > > > > > > addr = (u64) page_address(page); > > > > if (addr) > > > > addr += offset; > > > > /* TODO: handle highmem pages */ > > > > > > > > done: > > > > return addr; > > > > } > > > > > > > I got the impression that all supported IB devices support dma-ing > > to/from memory > 4GB. Perhaps other vendors can comment. > > The QLogic HCAs don't support DMA to or from the physical memory > for the verbs Lkey/Rkey memory regions. The whole reason I added > the ib_dma_*() functions was so to avoid ib_ipoib, etc. from > calling dma_*() directly and passing a physical address as the > offset in the posted work requests. > What happens instead, is that ib_dma_*() returns a kernel virtual > address which is passed in the work request and the driver copies > the data to/from the HW as needed. > So, in order to support HIGHMEM, I would need to change the > ipath_dma_*() functions to call kmap()/kunmap() for HIGHMEM pages. > I'm sure there would be all kinds of performance and coding issues > around doing this. So, we need some kind of HIGHDMA capability flag? 
-- MST From becker at nas.nasa.gov Tue Sep 25 12:04:54 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Tue, 25 Sep 2007 12:04:54 -0700 Subject: [ofa-general] ibnetdiscover topology output Message-ID: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> Is there a script available to convert this to a topology file usable by IBMgtSim? Thanks. -jeff From hrosenstock at xsigo.com Tue Sep 25 12:06:20 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 25 Sep 2007 12:06:20 -0700 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> Message-ID: <1190747180.7075.465.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-09-25 at 12:04 -0700, Jeff Becker wrote: > Is there a script available to convert this to a topology file usable > by IBMgtSim? Not that I'm aware of but this format is usable by ibsim (another IB management simulator). -- Hal > Thanks. > > -jeff From mshefty at ichips.intel.com Tue Sep 25 12:09:19 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Sep 2007 12:09:19 -0700 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <46F90C3F.1030701@voltaire.com> References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> <46F7E96E.4060302@ichips.intel.com> <46F9065C.3090907@voltaire.com> <1190726439.7075.405.camel@hrosenstock-ws.xsigo.com> <46F90C3F.1030701@voltaire.com> Message-ID: <46F95CDF.3070406@ichips.intel.com> >>>>>> node 1 <-> switch A <-> switch B <-> switch C <-> SA {snip} >> No, it is not (dropped from all multicast groups it is joined to). It >> may be removed from the multicast forwarding tables if there is no route >> available but it is still a member of the group. 
> > So the node (port) is a member of a multicast group for which routing is > not configured but when the port is discovered again, the SM runs the > multicast routing engine (for all groups? for all groups for which > discovered ports are member of?) again and configures the routing, nice. I'm getting lost regarding the problem. If I understand correctly, multicast will work fine if the link between switches A & B are brought down/up. But if the link is brought down/up between switch A and node 1 (similar to the first reported issue) multicast fails? - Sean From sashak at voltaire.com Tue Sep 25 12:21:20 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 25 Sep 2007 21:21:20 +0200 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <1190726439.7075.405.camel@hrosenstock-ws.xsigo.com> References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> <46F7E96E.4060302@ichips.intel.com> <46F9065C.3090907@voltaire.com> <1190726439.7075.405.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070925192120.GD29384@sashak.voltaire.com> On 06:20 Tue 25 Sep , Hal Rosenstock wrote: > On Tue, 2007-09-25 at 15:00 +0200, Or Gerlitz wrote: > > Sean Hefty wrote: > > >>> node 1 <-> switch A <-> switch B <-> switch C <-> SA > > > > >> The host would only see port up/down events as of changes in the link > > >> state in the local port or in the port which is connected to it through > > >> the cable. > > > > > So, if you brought the link down/up between switches A & B, node 1 > > > wouldn't receive any events, but it would be removed from the multicast > > > group? > > > > good catch! > > > > Indeed, when the link between switches A and B goes down, per the view > > point of the SM, the whole sub-fabric across A is lost and hence the > > node is dropped from all the multicast groups it is joined to. > > No, it is not (dropped from all multicast groups it is joined to). 
It > may be removed from the multicast forwarding tables if there is no route > available but it is still a member of the group. I cannot see it. With normal flow OpenSM will get trap on switch ports disconnection, this will trigger heavy sweep and whole A sub-fabrics will be dropped right after discovery phase (including multicast groups - it is in __osm_drop_mgr_remove_port()). > > > However, from the view point of the node, no port down is experienced. > > > > When the A-B link goes up, the SM discovers all nodes across A and > > probes their ports, though this process a port active event --might-- be > > generated by the HCA FW, but I am not sure its mandatory. > > > > Since the only trigger for ipoib to rejoin to multicast groups is > > delivery of event by the hw driver, namely one of: port down/up, lid > > change, sm lid change, client re-register. I think we might have a hole > > here if none of these events is generated. OpenSM will request client reregistration for all ports in A sub-fabric when it will be connected back and discovered again. Sasha > > It doesn't need to rejoin for this case. See above explanation. > > -- Hal > > > Please note that through this discovery, at least one mad is sent from > > the SM to the node. If we enforce the SM to set the re-register bit > > --each-- time it discovers a node, then the bug is solved. > > > > I will test this scheme and let you know what I get (with the voltaire > > SM and mthca driver). > > > > Eitan, Michael - any insight on the matter? > > > > Or. 
From sashak at voltaire.com Tue Sep 25 12:25:01 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 25 Sep 2007 21:25:01 +0200 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <1190747180.7075.465.camel@hrosenstock-ws.xsigo.com> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <1190747180.7075.465.camel@hrosenstock-ws.xsigo.com> Message-ID: <20070925192501.GE29384@sashak.voltaire.com> On 12:06 Tue 25 Sep , Hal Rosenstock wrote: > On Tue, 2007-09-25 at 12:04 -0700, Jeff Becker wrote: > > Is there a script available to convert this to a topology file usable > > by IBMgtSim? > > Not that I'm aware of but this format is usable by ibsim (another IB > management simulator). Which also does not require rebuilding of libibumad based programs.
Sasha From hrosenstock at xsigo.com Tue Sep 25 12:21:31 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 25 Sep 2007 12:21:31 -0700 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <20070925192120.GD29384@sashak.voltaire.com> References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> <46F7E96E.4060302@ichips.intel.com> <46F9065C.3090907@voltaire.com> <1190726439.7075.405.camel@hrosenstock-ws.xsigo.com> <20070925192120.GD29384@sashak.voltaire.com> Message-ID: <1190748091.7075.476.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-09-25 at 21:21 +0200, Sasha Khapyorsky wrote: > On 06:20 Tue 25 Sep , Hal Rosenstock wrote: > > On Tue, 2007-09-25 at 15:00 +0200, Or Gerlitz wrote: > > > Sean Hefty wrote: > > > >>> node 1 <-> switch A <-> switch B <-> switch C <-> SA > > > > > > >> The host would only see port up/down events as of changes in the link > > > >> state in the local port or in the port which is connected to it through > > > >> the cable. > > > > > > > So, if you brought the link down/up between switches A & B, node 1 > > > > wouldn't receive any events, but it would be removed from the multicast > > > > group? > > > > > > good catch! > > > > > > Indeed, when the link between switches A and B goes down, per the view > > > point of the SM, the whole sub-fabric across A is lost and hence the > > > node is dropped from all the multicast groups it is joined to. > > > > No, it is not (dropped from all multicast groups it is joined to). It > > may be removed from the multicast forwarding tables if there is no route > > available but it is still a member of the group. > > I cannot see it. With normal flow OpenSM will get trap on switch ports > disconnection, this will trigger heavy sweep and whole A sub-fabrics > will be dropped right after discovery phase (including multicast groups > - it is in __osm_drop_mgr_remove_port()). I was talking "theory"/spec rather than OpenSM. 
There are a number of ways to handle this. > > > However, from the view point of the node, no port down is experienced. > > > > > > When the A-B link goes up, the SM discovers all nodes across A and > > > probes their ports, though this process a port active event --might-- be > > > generated by the HCA FW, but I am not sure its mandatory. > > > > > > Since the only trigger for ipoib to rejoin to multicast groups is > > > delivery of event by the hw driver, namely one of: port down/up, lid > > > change, sm lid change, client re-register. I think we might have a hole > > > here if none of these events is generated. > > OpenSM will request client reregistration for all ports in A sub-fabric > when it will be connected back and discovered again. Other SMs may be capable of dealing with this with less "drastic" measures than client reregistration. -- Hal > Sasha > > > > > It doesn't need to rejoin for this case. See above explanation. > > > > -- Hal > > > > > Please note that through this discovery, at least one mad is sent from > > > the SM to the node. If we enforce the SM to set the re-register bit > > > --each-- time it discovers a node, then the bug is solved. > > > > > > I will test this scheme and let you know what I get (with the voltaire > > > SM and mthca driver). > > > > > > Eitan, Michael - any insight on the matter? > > > > > > Or. 
From ralph.campbell at qlogic.com Tue Sep 25 12:27:13 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 25 Sep 2007 12:27:13 -0700 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <20070925183243.GC9670@mellanox.co.il> References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> Message-ID: <1190748433.20700.124.camel@brick.pathscale.com> On Tue, 2007-09-25 at 20:32 +0200, Michael S. Tsirkin wrote: > > Quoting Ralph Campbell : > > Subject: Re: [PATCH 1/11] IB/ipoib: high dma support ... > > > I got the impression that all supported IB devices support dma-ing > > > to/from memory > 4GB. Perhaps other vendors can comment. > > > > The QLogic HCAs don't support DMA to or from the physical memory > > for the verbs Lkey/Rkey memory regions. The whole reason I added > > the ib_dma_*() functions was so to avoid ib_ipoib, etc.
from > > calling dma_*() directly and passing a physical address as the > > offset in the posted work requests. > > What happens instead, is that ib_dma_*() returns a kernel virtual > > address which is passed in the work request and the driver copies > > the data to/from the HW as needed. > > So, in order to support HIGHMEM, I would need to change the > > ipath_dma_*() functions to call kmap()/kunmap() for HIGHMEM pages. > > I'm sure there would be all kinds of performance and coding issues > > around doing this. > > So, we need some kind of HIGHDMA capability flag? Yes, A HIGHDMA capability flag would be useful. I think that setting NETIF_F_HIGHDMA for ib_ipoib is a NOP for the systems QLogic supports (i.e., only 64-bit kernels) but that could change. Better to plan ahead. From randy.dunlap at oracle.com Tue Sep 25 13:16:20 2007 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Tue, 25 Sep 2007 13:16:20 -0700 Subject: [ofa-general] Re: [DOC] Net batching driver howto In-Reply-To: <1190674459.4264.28.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <46F6AE18.7080708@garzik.org> <1190574713.5030.4.camel@localhost> <1190674459.4264.28.camel@localhost> Message-ID: <20070925131620.617794ae.randy.dunlap@oracle.com> On Mon, 24 Sep 2007 18:54:19 -0400 jamal wrote: > I have updated the driver howto to match the patches i posted yesterday. > attached. Thanks for sending this. This is an early draft, right? I'll fix some typos etc. in it (patch attached) and add some whitespace. Please see RD: in the patch for more questions/comments. IMO it needs some changes to eliminate words like "we", "you", and "your" (words that personify code). Those words are OK when talking about yourself. --- ~Randy Phaedrus says that Quality is about caring. 
-------------- next part -------------- A non-text attachment was scrubbed... Name: batch-howto.patch Type: text/x-patch Size: 11579 bytes Desc: not available URL: From becker at nas.nasa.gov Tue Sep 25 14:41:58 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Tue, 25 Sep 2007 14:41:58 -0700 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <20070925192501.GE29384@sashak.voltaire.com> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <1190747180.7075.465.camel@hrosenstock-ws.xsigo.com> <20070925192501.GE29384@sashak.voltaire.com> Message-ID: <795c49870709251441y4ee77d43wd91ea62c62e832d2@mail.gmail.com> Thanks for the pointer. One problem I can see is that ibsim builds against OFED 1.2.5, but we are running OFED 1.2. Since I'm mainly interested in simulating different routing tests, does opensm from 1.2.5 differ much in it's routing strategy from 1.2? Thanks again. -jeff On 9/25/07, Sasha Khapyorsky wrote: > On 12:06 Tue 25 Sep , Hal Rosenstock wrote: > > On Tue, 2007-09-25 at 12:04 -0700, Jeff Becker wrote: > > > Is there a script available to convert this to a topology file usable > > > by IBMgtSim? > > > > Not that I'm aware of but this format is usable by ibsim (another IB > > management simulator). > > Which also does not require rebuilding of libibumad based programs. 
> > Sasha > From hadi at cyberus.ca Tue Sep 25 15:14:51 2007 From: hadi at cyberus.ca (jamal) Date: Tue, 25 Sep 2007 18:14:51 -0400 Subject: [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <20070925082457.6fec30d6@freepuppy.rosehill> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> <20070924171411.36494656@freepuppy.rosehill> <1190726138.4264.105.camel@localhost> <20070925082457.6fec30d6@freepuppy.rosehill> Message-ID: <1190758491.4244.15.camel@localhost> On Tue, 2007-25-09 at 08:24 -0700, Stephen Hemminger wrote: > The transmit code path is locked as a code region, rather than just object locking > on the transmit queue or other fine grained object. This leads to moderately long > lock hold times when multiple qdisc's and classification is being done. It will be pretty tricky to optimize that path given the dependencies between the queues, classifiers, and actions in enqueues; schedulers in dequeues as well as their config/queries from user space which could happen concurrently on all "N" CPUs. The txlock optimization i added in patch1 intends to let go of the queue lock when we enter the dequeue region sooner to reduce the contention. A further optimization i made was to reduce the time it takes to hold the tx lock at the driver by moving gunk that doesn't need lock-holding into the new method dev->hard_end_xmit() (refer to patch #2) > If we went to finer grain locking it would also mean changes to all network > devices using the new locking model. My assumption is that we would use > something like the features flag to do the transition for backward compatibility.
> Take this as a purely "what if" or "it would be nice if" kind of suggestion > not a requirement or some grand plan. Ok, hopefully someone would demonstrate how to achieve it; seems a hard thing to achieve. cheers, jamal From hadi at cyberus.ca Tue Sep 25 15:28:26 2007 From: hadi at cyberus.ca (jamal) Date: Tue, 25 Sep 2007 18:28:26 -0400 Subject: [ofa-general] Re: [DOC] Net batching driver howto In-Reply-To: <20070925131620.617794ae.randy.dunlap@oracle.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <46F6AE18.7080708@garzik.org> <1190574713.5030.4.camel@localhost> <1190674459.4264.28.camel@localhost> <20070925131620.617794ae.randy.dunlap@oracle.com> Message-ID: <1190759306.4244.30.camel@localhost> On Tue, 2007-25-09 at 13:16 -0700, Randy Dunlap wrote: > On Mon, 24 Sep 2007 18:54:19 -0400 jamal wrote: > > > I have updated the driver howto to match the patches i posted yesterday. > > attached. > > Thanks for sending this. Thank you for reading it Randy. > This is an early draft, right? Its a third revision - but you could call it early. When it is done, i will probably put a pointer to it in some patch. > I'll fix some typos etc. in it (patch attached) and add some whitespace. > Please see RD: in the patch for more questions/comments. Thanks, will do and changes will show up in the next update. > IMO it needs some changes to eliminate words like "we", "you", > and "your" (words that personify code). Those words are OK > when talking about yourself. The narrative intent is supposed to be i (or someone doing the description) sitting there with a pen and paper and maybe a laptop and walking through the details with someone who needs to understand those details. If you think it is important to make it formal, then by all means be my guest. Again, thanks for taking the time. 
cheers, jamal From hadi at cyberus.ca Tue Sep 25 15:43:38 2007 From: hadi at cyberus.ca (jamal) Date: Tue, 25 Sep 2007 18:43:38 -0400 Subject: [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <1190758491.4244.15.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> <20070924171411.36494656@freepuppy.rosehill> <1190726138.4264.105.camel@localhost> <20070925082457.6fec30d6@freepuppy.rosehill> <1190758491.4244.15.camel@localhost> Message-ID: <1190760218.4244.35.camel@localhost> On Tue, 2007-25-09 at 18:15 -0400, jamal wrote: > A further optimization i made was to reduce the time it takes to hold > the tx lock at the driver by moving gunk that doesnt need lock-holding > into the new method dev->hard_end_xmit() (refer to patch #2) Sorry, that should have read dev->hard_prep_xmit() cheers, jamal From Nathan.Dauchy at noaa.gov Tue Sep 25 15:49:55 2007 From: Nathan.Dauchy at noaa.gov (Nathan Dauchy) Date: Tue, 25 Sep 2007 16:49:55 -0600 Subject: [ofa-general] SDP memory allocation policy problem? In-Reply-To: References: Message-ID: <46F99093.7000907@noaa.gov> Is there anyone among the OFED development team that is looking into this issue? I believe that it is causing nodes to hang at our site. We are running ofed-1.2 and the 2.6.9-55.ELsmp kernel. Workarounds or even untested patches would be appreciated. Thanks! -Nathan Ken Phillips wrote: > Greetings, > > Teammates here report the following: > > Problem > > The method SDP uses to allocate socket buffers may cause the > node to hang under memory pressure. 
>
> Details
>
> Each kernel level socket has an allocation flag to specify the
> memory allocation policy for socket buffers, the default is GFP_ATOMIC
> (or GFP_KERNEL for SDP). If the caller creates a socket with the
> policy set to GFP_NOFS or GFP_NOIO this should be the allocation
> policy used by the SDP layer.
>
> The problem we are seeing is that if a node is under load, and
> a memory allocation fails (say in sock_sendmsg()), the kernel will
> use the allocation policy to decide how to proceed with the allocation.
> If GFP_KERNEL is specified, then the kernel may attempt to free pages
> through the iSCSI block device that is making the socket call, which
> would result in a deadlock. Use of GFP_NOIO should prevent the kernel
> from using the IO backend to free memory resources.
>
> here is a sample stack trace from Alt-Sysrq during one of these
> lockups,
>
>> tx_worker D ffffff0014d14000 0 10195 1 10196 10194
>> (L-TLB)
>> 00000100707e98d8 0000000000000046 0000000000000004 0000000000000212
>> 0000000000000212 ffffffffa018ccae 0000000000000246 0000000000000246
>> 000001007873c7f0 0000000000000320
>> Call Trace:{:ib_mthca:mthca_poll_cq+2258}
>> {schedule_timeout+224}
>> {lock_sock+152}
>> {autoremove_wake_function+0}
>> {:ib_sdp:sdp_poll_cq+58}
>> {autoremove_wake_function+0}
>> {release_sock+16}
>> {:ib_sdp:sdp_sendmsg+33}
>> {sock_sendmsg+271}
>> {:ib_sdp:sdp_post_sends+619}
>> {release_sock+16}
>> {:ib_sdp:sdp_sendmsg+2222}
>> {autoremove_wake_function+0}
>> {:rs_iscsi:iscsi_sock_msg+1265}
>> {:rs_iscsi:iscsi_sock_msg+1261}
>> {recalc_task_prio+337}
>> {:rs_iscsi:scsi_command_i+5283}
>> {thread_return+0}
>> {thread_return+88}
>> {del_timer+107}
>> {del_singleshot_timer_sync+9}
>> {schedule_timeout+375}
>> {:rs_iscsi:tx_worker_proc_i+6819}
>> {child_rip+8}
>> {:rs_iscsi:tx_worker_proc_i+0}
>> {child_rip+0}
>>
>>
>
> We still don't know the scope of changes to be made, but we think,
> at minimum that some of the memory allocation in SDP should be changed,
> for example.
>
> diff -Naur old/drivers/infiniband/ulp/sdp/sdp_bcopy.c
> new/drivers/infiniband/ulp/sdp/sdp_bcopy.c
> --- old/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-06-21
> 10:38:47.000000000 -0400
> +++ new/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-08-31
> 12:25:58.000000000 -0400
> @@ -224,13 +224,27 @@
>
>  	/* Now, allocate and repost recv */
>  	/* TODO: allocate from cache */
> +
> +#if (PROPOSED_SDP_FIX == 1)
> +	skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE,
> +		(ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) :
> +		ssk->isk.sk.sk_allocation);
> +#else
>  	skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE,
>  		GFP_KERNEL);
> +#endif
>  	/* FIXME */
>  	BUG_ON(!skb);
>  	h = (struct sdp_bsdh *)skb->head;
>  	for (i = 0; i < ssk->recv_frags; ++i) {
> +#if (PROPOSED_SDP_FIX == 1)
> +		page = alloc_pages((ssk->isk.sk.sk_allocation == 0)
> +			? (GFP_HIGHUSER) :
> +			(ssk->isk.sk.sk_allocation | (__GFP_HIGHMEM)),
> +			0);
> +#else
>  		page = alloc_pages(GFP_HIGHUSER, 0);
> +#endif
>  		BUG_ON(!page);
>  		frag = &skb_shinfo(skb)->frags[i];
>  		frag->page = page;
> @@ -406,10 +420,18 @@
>  	    ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) {
>  		struct sdp_chrecvbuf *resp_size;
>  		ssk->recv_request = 0;
> +#if (PROPOSED_SDP_FIX == 1)
> +		skb = sk_stream_alloc_skb(&ssk->isk.sk,
> +			sizeof(struct sdp_bsdh) +
> +			sizeof(*resp_size),
> +			(ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) :
> +			ssk->isk.sk.sk_allocation);
> +#else
>  		skb = sk_stream_alloc_skb(&ssk->isk.sk,
>  			sizeof(struct sdp_bsdh) +
>  			sizeof(*resp_size),
>  			GFP_KERNEL);
> +#endif
>  		/* FIXME */
>  		BUG_ON(!skb);
>  		resp_size = (struct sdp_chrecvbuf *)skb_put(skb, sizeof *resp_size);
> @@ -431,10 +453,18 @@
>  	    ssk->tx_head > ssk->sent_request_head + SDP_RESIZE_WAIT &&
>  	    ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) {
>  		struct sdp_chrecvbuf *req_size;
> +#if (PROPOSED_SDP_FIX == 1)
> +		skb = sk_stream_alloc_skb(&ssk->isk.sk,
> +			sizeof(struct sdp_bsdh) +
> +			sizeof(*req_size),
> +			(ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) :
> +			ssk->isk.sk.sk_allocation);
> +#else
>  		skb = sk_stream_alloc_skb(&ssk->isk.sk,
>  			sizeof(struct sdp_bsdh) +
>  			sizeof(*req_size),
>  			GFP_KERNEL);
> +#endif
>  		/* FIXME */
>  		BUG_ON(!skb);
>  		ssk->sent_request = SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE;
> @@ -463,9 +493,16 @@
>  	    (TCPF_FIN_WAIT1 | TCPF_LAST_ACK)) &&
>  	    !ssk->isk.sk.sk_send_head &&
>  	    ssk->bufs) {
> +#if (PROPOSED_SDP_FIX == 1)
> +		skb = sk_stream_alloc_skb(&ssk->isk.sk,
> +			sizeof(struct sdp_bsdh),
> +			(ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) :
> +			ssk->isk.sk.sk_allocation);
> +#else
>  		skb = sk_stream_alloc_skb(&ssk->isk.sk,
>  			sizeof(struct sdp_bsdh),
>  			GFP_KERNEL);
> +#endif
>  		/* FIXME */
>  		BUG_ON(!skb);
>  		sdp_post_send(ssk, skb, SDP_MID_DISCONN);
> diff -Naur old/drivers/infiniband/ulp/sdp/sdp.h
> new/drivers/infiniband/ulp/sdp/sdp.h
> --- old/drivers/infiniband/ulp/sdp/sdp.h 2007-06-21 10:38:47.000000000 -0400
> +++ new/drivers/infiniband/ulp/sdp/sdp.h 2007-08-31 12:25:45.000000000 -0400
> @@ -7,6 +7,8 @@
>  #include /* For urgent data flags */
>  #include
>
> +#define PROPOSED_SDP_FIX 1
> +
>  #define sdp_printk(level, sk, format, arg...) \
>  	printk(level "sdp_sock(%d:%d): " format, \
>  	       (sk) ?
inet_sk(sk)->num : -1, \ > > > > > --------------------- > Best Regards > K Phillips > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Tue Sep 25 16:34:21 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 26 Sep 2007 01:34:21 +0200 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <795c49870709251441y4ee77d43wd91ea62c62e832d2@mail.gmail.com> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <1190747180.7075.465.camel@hrosenstock-ws.xsigo.com> <20070925192501.GE29384@sashak.voltaire.com> <795c49870709251441y4ee77d43wd91ea62c62e832d2@mail.gmail.com> Message-ID: <20070925233421.GA19757@sashak.voltaire.com> On 14:41 Tue 25 Sep , Jeff Becker wrote: > Thanks for the pointer. One problem I can see is that ibsim builds > against OFED 1.2.5, but we are running OFED 1.2. I would expect that ibsim should work against any recent libibmad - doesn't matter OFED 1.2, 1.2.5 or master. Did you have any problems with 1.2? > Since I'm mainly > interested in simulating different routing tests, does opensm from > 1.2.5 differ much in it's routing strategy from 1.2? It is the same. Sasha > Thanks again. > > -jeff > > On 9/25/07, Sasha Khapyorsky wrote: > > On 12:06 Tue 25 Sep , Hal Rosenstock wrote: > > > On Tue, 2007-09-25 at 12:04 -0700, Jeff Becker wrote: > > > > Is there a script available to convert this to a topology file usable > > > > by IBMgtSim? > > > > > > Not that I'm aware of but this format is usable by ibsim (another IB > > > management simulator). > > > > Which also does not require rebuilding of libibumad based programs. 
> > > > Sasha > > From becker at nas.nasa.gov Tue Sep 25 16:30:48 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Tue, 25 Sep 2007 16:30:48 -0700 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <20070925233421.GA19757@sashak.voltaire.com> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <1190747180.7075.465.camel@hrosenstock-ws.xsigo.com> <20070925192501.GE29384@sashak.voltaire.com> <795c49870709251441y4ee77d43wd91ea62c62e832d2@mail.gmail.com> <20070925233421.GA19757@sashak.voltaire.com> Message-ID: <795c49870709251630q5edef1fcx50235d65cbae6e9d@mail.gmail.com> Hi Sasha. Thanks for the info. I did have the following problem when building against the 1.2 libibmad: cc -Wall -g -fpic -I. -I../include -I/home/becker//include -c -o sim_mad.o sim_mad.c sim_mad.c: In function 'encode_trap144': sim_mad.c:1261: error: 'IB_NOTICE_DATA_144_LID_F' undeclared (first use in this function) sim_mad.c:1261: error: (Each undeclared identifier is reported only once sim_mad.c:1261: error: for each function it appears in.) sim_mad.c:1262: error: 'IB_NOTICE_DATA_144_CAPMASK_F' undeclared (first use in this function) make[1]: *** [sim_mad.o] Error 1 make[1]: Leaving directory `/home/becker/ibrouting/ibsim/ibsim' -jeff On 9/25/07, Sasha Khapyorsky wrote: > On 14:41 Tue 25 Sep , Jeff Becker wrote: > > Thanks for the pointer. One problem I can see is that ibsim builds > > against OFED 1.2.5, but we are running OFED 1.2. > > I would expect that ibsim should work against any recent libibmad - > doesn't matter OFED 1.2, 1.2.5 or master. Did you have any problems with > 1.2? > > > Since I'm mainly > > interested in simulating different routing tests, does opensm from > > 1.2.5 differ much in it's routing strategy from 1.2? > > It is the same. > > Sasha > > > Thanks again. 
> >
> > -jeff
> >
> > On 9/25/07, Sasha Khapyorsky wrote:
> > > On 12:06 Tue 25 Sep , Hal Rosenstock wrote:
> > > > On Tue, 2007-09-25 at 12:04 -0700, Jeff Becker wrote:
> > > > > Is there a script available to convert this to a topology file usable
> > > > > by IBMgtSim?
> > > >
> > > > Not that I'm aware of but this format is usable by ibsim (another IB
> > > > management simulator).
> > >
> > > Which also does not require rebuilding of libibumad based programs.
> > >
> > > Sasha
> > >
>

From kliteyn at mellanox.co.il Tue Sep 25 22:22:02 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 26 Sep 2007 07:22:02 +0200
Subject: [ofa-general] nightly osm_sim report 2007-09-26:normal completion
Message-ID:

OSM Simulation Regression Summary

[Generated mail - please do NOT reply]

OpenSM binary date = 2007-09-25
OpenSM git rev = Tue_Sep_25_00:30:00_2007 [2c547953885809a8026e20af7809be08b42c3865]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=520 Pass=520 Fail=0

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:

From keshetti.mahesh at gmail.com Wed Sep 26 02:24:02 2007
From: keshetti.mahesh at gmail.com
(Keshetti Mahesh) Date: Wed, 26 Sep 2007 14:54:02 +0530 Subject: Fw: [ofa-general] Re: [query] Multi path discovery in openSM In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9024AAA1F@mtlexch01.mtl.com> References: <479359.26315.qm@web8315.mail.in.yahoo.com> <829ded920709240023v1282341cq4e14ce29f19fba1b@mail.gmail.com> <6C2C79E72C305246B504CBA17B5500C9024AA95A@mtlexch01.mtl.com> <829ded920709240128s3fde49f6pe49c05f4300261af@mail.gmail.com> <6C2C79E72C305246B504CBA17B5500C9024AAA1F@mtlexch01.mtl.com> Message-ID: <829ded920709260224o151da169g78d4ff89c18e07f6@mail.gmail.com> > Sharing paths is orthogonal to the min-hop requirement. > The min-hop requirement is a common way to avoid routing loops. > All algorithms I know use it. > Even with that requirement there are regularly multiple paths from A to B > available, both for fat-tree and for mesh/tori topologies. Thanks for clarifying it. Now I can see that openSM supports four different algorithms (Min-hop being the default). Does openSM decide the routing policy on its own, depending on the physical network topology, or does one have to configure openSM's routing algorithm before starting it? Is there any document describing which algorithm should be used when? And is there a document describing the current openSM routing algorithms in detail? regards, Mahesh.
From vlad at lists.openfabrics.org Wed Sep 26 02:55:15 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 26 Sep 2007 02:55:15 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070926-0200 daily build status Message-ID: <20070926095516.39D41E6087C@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with 
linux-2.6.21.1 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070926-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ----------------------------------------------------------------------------------
From mst at dev.mellanox.co.il Wed Sep 26 04:30:01 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 26 Sep 2007 13:30:01 +0200 Subject: [ofa-general] ofed 1.3 kernel tree updated to 2.6.23-rc8 Message-ID: <20070926113001.GC2778@mellanox.co.il> Hello! I have updated the OFED 1.3 kernel tree at git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel to upstream 2.6.23-rc8. I have resolved minor conflicts in libiscsi backports for RHEL4, and everything seems to build fine now. iSER maintainers, please verify that I did the right thing. -- MST From jimmott at austin.rr.com Wed Sep 26 04:39:03 2007 From: jimmott at austin.rr.com (Jim Mott) Date: Wed, 26 Sep 2007 06:39:03 -0500 Subject: [ofa-general] SDP memory allocation policy problem? In-Reply-To: <46F99093.7000907@noaa.gov> References: <46F99093.7000907@noaa.gov> Message-ID: <000301c80031$d6ff9250$84feb6f0$@rr.com> This would be on my plate. I was travelling and have not gotten a chance to test the fix. On inspection, I see no problems with this approach and do not expect to see any testing issues. If you want to rework the patch to remove the PROPOSED_SDP_FIX and submit it, I will test it today. Otherwise, I will do the patch and testing by tomorrow. Sorry for taking so long.
Jim -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Nathan Dauchy Sent: Tuesday, September 25, 2007 5:50 PM To: general at lists.openfabrics.org Subject: Re: [ofa-general] SDP memory allocation policy problem? Is there anyone among the OFED development team that is looking into this issue? I believe that it is causing nodes to hang at our site. We are running ofed-1.2 and the 2.6.9-55.ELsmp kernel. Workarounds or even untested patches would be appreciated. Thanks! -Nathan Ken Phillips wrote: > Greetings, > > Teammates here report the following: > > Problem > > The method SDP uses to allocate socket buffers may cause the > node to hang under memory pressure. > > Details > > Each kernel level socket has an allocation flag to specify the > memory allocation policy for socket buffers, the default is GFP_ATOMIC > (or GFP_KERNEL for SDP). If the caller creates a socket with the > policy set to GFP_NOFS or GFP_NOIO this should be the allocation > policy used by the SDP layer. > > The problem we are seeing is that if a node is under load, and > a memory allocation fails (say in sock_sendmsg()), the kernel will > use the allocation policy to decide how to proceed with the allocation. > If GFP_KERNEL is specified, then the kernel may attempt to free pages > through the iSCSI block device that is making the socket call, which > would result in a deadlock. Use of GFP_NOIO should prevent the kernel > from using the IO backend to free memory resources. 
> > here is a sample stack trace from Alt-Sysrq during one of these > lockups, > >> tx_worker D ffffff0014d14000 0 10195 1 10196 10194 >> (L-TLB) >> 00000100707e98d8 0000000000000046 0000000000000004 0000000000000212 >> 0000000000000212 ffffffffa018ccae 0000000000000246 0000000000000246 >> 000001007873c7f0 0000000000000320 >> Call Trace:{:ib_mthca:mthca_poll_cq+2258} >> {schedule_timeout+224} >> {lock_sock+152} >> {autoremove_wake_function+0} >> {:ib_sdp:sdp_poll_cq+58} >> {autoremove_wake_function+0} >> {release_sock+16} >> {:ib_sdp:sdp_sendmsg+33} >> {sock_sendmsg+271} >> {:ib_sdp:sdp_post_sends+619} >> {release_sock+16} >> {:ib_sdp:sdp_sendmsg+2222} >> {autoremove_wake_function+0} >> {:rs_iscsi:iscsi_sock_msg+1265} >> {:rs_iscsi:iscsi_sock_msg+1261} >> {recalc_task_prio+337} >> {:rs_iscsi:scsi_command_i+5283} >> {thread_return+0} >> {thread_return+88} >> {del_timer+107} >> {del_singleshot_timer_sync+9} >> {schedule_timeout+375} >> {:rs_iscsi:tx_worker_proc_i+6819} >> {child_rip+8} >> {:rs_iscsi:tx_worker_proc_i+0} >> {child_rip+0} >> >> > > We still don't know the scope of changes to be made, but we think, > at minimum that some of the memory allocation in SDP should be changed, > for example. > > diff -Naur old/drivers/infiniband/ulp/sdp/sdp_bcopy.c > new/drivers/infiniband/ulp/sdp/sdp_bcopy.c > --- old/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-06-21 > 10:38:47.000000000 -0400 > +++ new/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-08-31 > 12:25:58.000000000 -0400 > @@ -224,13 +224,27 @@ > > /* Now, allocate and repost recv */ > /* TODO: allocate from cache */ > + > +#if (PROPOSED_SDP_FIX == 1) > + skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, > + (ssk->isk.sk.sk_allocation == 0) ? 
(GFP_KERNEL) : > + ssk->isk.sk.sk_allocation); > +#else > skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, > GFP_KERNEL); > +#endif > /* FIXME */ > BUG_ON(!skb); > h = (struct sdp_bsdh *)skb->head; > for (i = 0; i < ssk->recv_frags; ++i) { > +#if (PROPOSED_SDP_FIX == 1) > + page = alloc_pages((ssk->isk.sk.sk_allocation == 0) > + ? (GFP_HIGHUSER) : > + (ssk->isk.sk.sk_allocation | (__GFP_HIGHMEM)), > + 0); > +#else > page = alloc_pages(GFP_HIGHUSER, 0); > +#endif > BUG_ON(!page); > frag = &skb_shinfo(skb)->frags[i]; > frag->page = page; > @@ -406,10 +420,18 @@ > ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) { > struct sdp_chrecvbuf *resp_size; > ssk->recv_request = 0; > +#if (PROPOSED_SDP_FIX == 1) > + skb = sk_stream_alloc_skb(&ssk->isk.sk, > + sizeof(struct sdp_bsdh) + > + sizeof(*resp_size), > + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : > + ssk->isk.sk.sk_allocation); > +#else > skb = sk_stream_alloc_skb(&ssk->isk.sk, > sizeof(struct sdp_bsdh) + > sizeof(*resp_size), > GFP_KERNEL); > +#endif > /* FIXME */ > BUG_ON(!skb); > resp_size = (struct sdp_chrecvbuf *)skb_put(skb, sizeof *resp_size); > @@ -431,10 +453,18 @@ > ssk->tx_head > ssk->sent_request_head + SDP_RESIZE_WAIT && > ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) { > struct sdp_chrecvbuf *req_size; > +#if (PROPOSED_SDP_FIX == 1) > + skb = sk_stream_alloc_skb(&ssk->isk.sk, > + sizeof(struct sdp_bsdh) + > + sizeof(*req_size), > + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : > + ssk->isk.sk.sk_allocation); > +#else > skb = sk_stream_alloc_skb(&ssk->isk.sk, > sizeof(struct sdp_bsdh) + > sizeof(*req_size), > GFP_KERNEL); > +#endif > /* FIXME */ > BUG_ON(!skb); > ssk->sent_request = SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE; > @@ -463,9 +493,16 @@ > (TCPF_FIN_WAIT1 | TCPF_LAST_ACK)) && > !ssk->isk.sk.sk_send_head && > ssk->bufs) { > +#if (PROPOSED_SDP_FIX == 1) > + skb = sk_stream_alloc_skb(&ssk->isk.sk, > + sizeof(struct sdp_bsdh), > + (ssk->isk.sk.sk_allocation == 0) ? 
(GFP_KERNEL) : > + ssk->isk.sk.sk_allocation); > +#else > skb = sk_stream_alloc_skb(&ssk->isk.sk, > sizeof(struct sdp_bsdh), > GFP_KERNEL); > +#endif > /* FIXME */ > BUG_ON(!skb); > sdp_post_send(ssk, skb, SDP_MID_DISCONN); > diff -Naur old/drivers/infiniband/ulp/sdp/sdp.h > new/drivers/infiniband/ulp/sdp/sdp.h > --- old/drivers/infiniband/ulp/sdp/sdp.h 2007-06-21 10:38:47.000000000 -0400 > +++ new/drivers/infiniband/ulp/sdp/sdp.h 2007-08-31 12:25:45.000000000 -0400 > @@ -7,6 +7,8 @@ > #include /* For urgent data flags */ > #include > > +#define PROPOSED_SDP_FIX 1 > + > #define sdp_printk(level, sk, format, arg...) \ > printk(level "sdp_sock(%d:%d): " format, \ > (sk) ? inet_sk(sk)->num : -1, \ > > > > > --------------------- > Best Regards > K Phillips > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From rdreier at cisco.com Wed Sep 26 10:51:24 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 26 Sep 2007 10:51:24 -0700 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <20070925183243.GC9670@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 25 Sep 2007 20:32:43 +0200") References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> Message-ID: > > So, in order to support HIGHMEM, I would need to change the > > ipath_dma_*() functions to call kmap()/kunmap() for HIGHMEM pages. > > I'm sure there would be all kinds of performance and coding issues > > around doing this. > > So, we need some kind of HIGHDMA capability flag? I don't think so. An RDMA adapter that can't handle highmem pages would be kind of pointless: you wouldn't be able to handle userspace memory regions, for one thing. 
So if ipath ever tries to handle 32-bit kernels then I think handling highmem will be part of it. Actually, maybe something like this probably makes sense for IPoIB while we're at it: diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 08b4676..dbc845f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -129,7 +129,7 @@ static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int } for (i = 0; i < frags; i++) { - struct page *page = alloc_page(GFP_ATOMIC); + struct page *page = alloc_page(GFP_ATOMIC | GFP_HIGHUSER); if (!page) goto partial_error; From rdreier at cisco.com Wed Sep 26 10:53:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 26 Sep 2007 10:53:26 -0700 Subject: [ofa-general] Re: [PATCH 11/11]: mlx4_core use fixed CQ moderation paramters In-Reply-To: <1190637727.4947.76.camel@mtls03> (Eli Cohen's message of "Mon, 24 Sep 2007 14:42:07 +0200") References: <1190637727.4947.76.camel@mtls03> Message-ID: > +static int cq_max_count = 16; > +static int cq_period = 10; > + > +module_param(cq_max_count, int, 0444); > +MODULE_PARM_DESC(cq_max_count, "number of CQEs to generate event"); > +module_param(cq_period, int, 0444); > +MODULE_PARM_DESC(cq_period, "time in usec for CQ event generation"); I assume this is just a leftover from some earlier approach? These module parameters are just ignored now, so the patch seems kind of pointless. Anyway I think the approach of having one global setting for all CQs is not a good one -- it seems likely that for example IPoIB and SDP would want different settings, not to mention userspace applications. - R. 
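The design point in the review above (one global setting versus a setting per CQ) can be sketched roughly as follows. This is only an illustration in plain userspace C, not code from mlx4 or the IB core: the names `cq_moderation`, `demo_cq`, `demo_cq_init` and `demo_cq_set_moderation` are hypothetical, and the default values simply reuse the 16 and 10 from the quoted module parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CQ moderation settings; not an mlx4 or ib_core structure. */
struct cq_moderation {
	unsigned int max_count;   /* completions to coalesce before an event */
	unsigned int period_usec; /* longest time to hold back an event */
};

/* Driver-wide defaults, playing the role of the quoted module parameters. */
static const struct cq_moderation default_moderation = { 16, 10 };

/* Hypothetical stand-in for a completion queue owned by one consumer. */
struct demo_cq {
	const char *owner;
	struct cq_moderation moder;
};

/* Every CQ starts out with the driver-wide defaults. */
static void demo_cq_init(struct demo_cq *cq, const char *owner)
{
	assert(cq != NULL);
	cq->owner = owner;
	cq->moder = default_moderation;
}

/* A consumer overrides the settings for its own CQ only. */
static void demo_cq_set_moderation(struct demo_cq *cq,
				   unsigned int max_count,
				   unsigned int period_usec)
{
	assert(cq != NULL);
	cq->moder.max_count = max_count;
	cq->moder.period_usec = period_usec;
}
```

With this shape, a latency-sensitive consumer could call `demo_cq_set_moderation(&cq, 1, 0)` on its own CQ while every other CQ keeps the defaults, which is the kind of per-consumer choice a single global setting cannot express.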
From rdreier at cisco.com Wed Sep 26 10:55:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 26 Sep 2007 10:55:08 -0700 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: (Roland Dreier's message of "Wed, 26 Sep 2007 10:51:24 -0700") References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> Message-ID: > + struct page *page = alloc_page(GFP_ATOMIC | GFP_HIGHUSER); actually: + struct page *page = alloc_page(GFP_ATOMIC | __GFP_HIGHMEM); - R. From rdreier at cisco.com Wed Sep 26 11:01:18 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 26 Sep 2007 11:01:18 -0700 Subject: [ofa-general] [PATCH] IB/core - possible bug in handling link down in ib_sa_join_multicast() In-Reply-To: <46F4274C.9080108@ichips.intel.com> (Sean Hefty's message of "Fri, 21 Sep 2007 13:19:24 -0700") References: <1190331224.20700.27.camel@brick.pathscale.com> <46F3099E.7040008@ichips.intel.com> <46F4274C.9080108@ichips.intel.com> Message-ID: > git://git.openfabrics.org/~shefty/rdma-dev.git for-roland Thanks, I queued Ralph's patch for 2.6.24 From mshefty at ichips.intel.com Wed Sep 26 11:42:47 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Sep 2007 11:42:47 -0700 Subject: [ofa-general] [PATCH-2.6.24 2/2] [RFC] ib/cm: add basic performance counters In-Reply-To: <46F953FC.50101@ichips.intel.com> References: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com> <000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com> <46F953FC.50101@ichips.intel.com> Message-ID: <46FAA827.90504@ichips.intel.com> > Thanks for adding counters. They will be extremely helpful with our > large scale cluster support. However, exporting them via debugfs will be > useless in our production environments that do not have debugfs support > built in the kernel. 
Can we expose these with the same mechanism as > other IB/iWARP modules? The userspace IB CM is exported through /sys/class/infiniband_cm. Does anyone object to placing the counters under this same directory? (I'm not sure how to do this yet, but I can look into it.) - Sean From swise at opengridcomputing.com Wed Sep 26 12:02:18 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 26 Sep 2007 14:02:18 -0500 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. In-Reply-To: <20070923203649.8324.64524.stgit@dell3.ogc.int> References: <20070923203649.8324.64524.stgit@dell3.ogc.int> Message-ID: <46FAACBA.7020102@opengridcomputing.com> Roland/Sean, What do you all think? Steve. Steve Wise wrote: > iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. > > Version 3: > > - don't use list_del_init() where list_del() is sufficient. > > Version 2: > > - added a per-device mutex for the address and listening endpoints lists. > > - wait for all replies if sending multiple passive_open requests to rnic. > > - log warning if no addresses are available when a listen is issued. > > - tested > > --- > > Design: > > The sysadmin creates "for iwarp use only" alias interfaces of the form > "devname:iw*" where devname is the native interface name (eg eth0) for the > iwarp netdev device. The alias label can be anything starting with "iw". > The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver. > > EG: > ifconfig eth0 192.168.70.123 up > ifconfig eth0:iw1 192.168.71.123 up > ifconfig eth0:iw2 192.168.72.123 up > > In the above example, 192.168.70/24 is for TCP traffic, while > 192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use. > > The rdma-only interface must be on its own IP subnet. This allows routing > all rdma traffic onto this interface. > > The iWARP driver must translate all listens on address 0.0.0.0 to the > set of rdma-only ip addresses for the device in question. 
This prevents > incoming connect requests to the TCP ipaddresses from going up the > rdma stack. > > Implementation Details: > > - The iw_cxgb3 driver registers for inetaddr events via > register_inetaddr_notifier(). This allows tracking the iwarp-only > addresses/subnets as they get added and deleted. The iwarp driver > maintains a list of the current iwarp-only addresses. > > - The iw_cxgb3 driver builds the list of iwarp-only addresses for its > devices at module insert time. This is needed because the inetaddr > notifier callbacks don't "replay" address-add events when someone > registers. So the driver must build the initial list at module load time. > > - When a listen is done on address 0.0.0.0, then the iw_cxgb3 driver > must translate that into a set of listens on the iwarp-only addresses. > This is implemented by maintaining a list of stid/addr entries per > listening endpoint. > > - When a new iwarp-only address is added or removed, the iw_cxgb3 driver > must traverse the set of listening endpoints and update them accordingly. > This allows an application to bind to 0.0.0.0 prior to the iwarp-only > interfaces being configured. It also allows changing the iwarp-only set > of addresses and getting the expected behavior for apps already bound > to 0.0.0.0. This is done by maintaining a list of listening endpoints > off the device struct. > > - The address list, the listening endpoint list, and each list of > stid/addrs in use per listening endpoint are all protected via a mutex > per iw_cxgb3 device. 
> > Signed-off-by: Steve Wise > --- > > drivers/infiniband/hw/cxgb3/iwch.c | 125 ++++++++++++++++ > drivers/infiniband/hw/cxgb3/iwch.h | 11 + > drivers/infiniband/hw/cxgb3/iwch_cm.c | 259 +++++++++++++++++++++++++++------ > drivers/infiniband/hw/cxgb3/iwch_cm.h | 15 ++ > 4 files changed, 360 insertions(+), 50 deletions(-) > > diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c > index 0315c9d..d81d46e 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch.c > +++ b/drivers/infiniband/hw/cxgb3/iwch.c > @@ -63,6 +63,123 @@ struct cxgb3_client t3c_client = { > static LIST_HEAD(dev_list); > static DEFINE_MUTEX(dev_mutex); > > +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) > +{ > + struct iwch_addrlist *addr; > + > + addr = kmalloc(sizeof *addr, GFP_KERNEL); > + if (!addr) { > + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", > + __FUNCTION__); > + return; > + } > + addr->ifa = ifa; > + mutex_lock(&rnicp->mutex); > + list_add_tail(&addr->entry, &rnicp->addrlist); > + mutex_unlock(&rnicp->mutex); > +} > + > +static void remove_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) > +{ > + struct iwch_addrlist *addr, *tmp; > + > + mutex_lock(&rnicp->mutex); > + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { > + if (addr->ifa == ifa) { > + list_del(&addr->entry); > + kfree(addr); > + goto out; > + } > + } > +out: > + mutex_unlock(&rnicp->mutex); > +} > + > +static int netdev_is_ours(struct iwch_dev *rnicp, struct net_device *netdev) > +{ > + int i; > + > + for (i = 0; i < rnicp->rdev.port_info.nports; i++) > + if (netdev == rnicp->rdev.port_info.lldevs[i]) > + return 1; > + return 0; > +} > + > +static inline int is_iwarp_label(char *label) > +{ > + char *colon; > + > + colon = strchr(label, ':'); > + if (colon && !strncmp(colon+1, "iw", 2)) > + return 1; > + return 0; > +} > + > +static int nb_callback(struct notifier_block *self, unsigned long event, > + void *ctx) > +{ > + struct in_ifaddr *ifa = ctx; 
> + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); > + > + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); > + > + switch (event) { > + case NETDEV_UP: > + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && > + is_iwarp_label(ifa->ifa_label)) { > + PDBG("%s label %s addr 0x%x added\n", > + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); > + insert_ifa(rnicp, ifa); > + iwch_listeners_add_addr(rnicp, ifa->ifa_address); > + } > + break; > + case NETDEV_DOWN: > + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && > + is_iwarp_label(ifa->ifa_label)) { > + PDBG("%s label %s addr 0x%x deleted\n", > + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); > + iwch_listeners_del_addr(rnicp, ifa->ifa_address); > + remove_ifa(rnicp, ifa); > + } > + break; > + default: > + break; > + } > + return 0; > +} > + > +static void delete_addrlist(struct iwch_dev *rnicp) > +{ > + struct iwch_addrlist *addr, *tmp; > + > + mutex_lock(&rnicp->mutex); > + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { > + list_del(&addr->entry); > + kfree(addr); > + } > + mutex_unlock(&rnicp->mutex); > +} > + > +static void populate_addrlist(struct iwch_dev *rnicp) > +{ > + int i; > + struct in_device *indev; > + > + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { > + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); > + if (!indev) > + continue; > + for_ifa(indev) > + if (is_iwarp_label(ifa->ifa_label)) { > + PDBG("%s label %s addr 0x%x added\n", > + __FUNCTION__, ifa->ifa_label, > + ifa->ifa_address); > + insert_ifa(rnicp, ifa); > + } > + endfor_ifa(indev); > + } > +} > + > static void rnic_init(struct iwch_dev *rnicp) > { > PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); > @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r > idr_init(&rnicp->qpidr); > idr_init(&rnicp->mmidr); > spin_lock_init(&rnicp->lock); > + INIT_LIST_HEAD(&rnicp->addrlist); > + INIT_LIST_HEAD(&rnicp->listen_eps); > + mutex_init(&rnicp->mutex); > + rnicp->nb.notifier_call = 
nb_callback; > + populate_addrlist(rnicp); > + register_inetaddr_notifier(&rnicp->nb); > > rnicp->attr.vendor_id = 0x168; > rnicp->attr.vendor_part_id = 7; > @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev > mutex_lock(&dev_mutex); > list_for_each_entry_safe(dev, tmp, &dev_list, entry) { > if (dev->rdev.t3cdev_p == tdev) { > + unregister_inetaddr_notifier(&dev->nb); > + delete_addrlist(dev); > list_del(&dev->entry); > iwch_unregister_device(dev); > cxio_rdev_close(&dev->rdev); > diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h > index caf4e60..7fa0a47 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch.h > +++ b/drivers/infiniband/hw/cxgb3/iwch.h > @@ -36,6 +36,8 @@ #include > #include > #include > #include > +#include > +#include > > #include > > @@ -101,6 +103,11 @@ struct iwch_rnic_attributes { > u32 cq_overflow_detection; > }; > > +struct iwch_addrlist { > + struct list_head entry; > + struct in_ifaddr *ifa; > +}; > + > struct iwch_dev { > struct ib_device ibdev; > struct cxio_rdev rdev; > @@ -111,6 +118,10 @@ struct iwch_dev { > struct idr mmidr; > spinlock_t lock; > struct list_head entry; > + struct notifier_block nb; > + struct list_head addrlist; > + struct list_head listen_eps; > + struct mutex mutex; > }; > > static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) > diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c > index 1cdfcd4..afc8a48 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c > +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c > @@ -1127,23 +1127,149 @@ static int act_open_rpl(struct t3cdev *t > return CPL_RET_BUF_DONE; > } > > -static int listen_start(struct iwch_listen_ep *ep) > +static int wait_for_reply(struct iwch_ep_common *epc) > +{ > + PDBG("%s ep %p waiting\n", __FUNCTION__, epc); > + wait_event(epc->waitq, epc->rpl_done); > + PDBG("%s ep %p done waiting err %d\n", __FUNCTION__, epc, epc->rpl_err); > + return epc->rpl_err; > +} > + > 
+static struct iwch_listen_entry *alloc_listener(struct iwch_listen_ep *ep, > + __be32 addr) > +{ > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + struct iwch_listen_entry *le; > + > + le = kmalloc(sizeof *le, GFP_KERNEL); > + if (!le) { > + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", > + __FUNCTION__); > + return NULL; > + } > + le->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, > + &t3c_client, ep); > + if (le->stid == -1) { > + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", > + __FUNCTION__); > + kfree(le); > + return NULL; > + } > + le->addr = addr; > + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, > + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); > + return le; > +} > + > +static void dealloc_listener(struct iwch_listen_ep *ep, > + struct iwch_listen_entry *le) > +{ > + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, > + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); > + cxgb3_free_stid(ep->com.tdev, le->stid); > + kfree(le); > +} > + > +static void dealloc_listener_list(struct iwch_listen_ep *ep) > +{ > + struct iwch_listen_entry *le, *tmp; > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + > + mutex_lock(&h->mutex); > + list_for_each_entry_safe(le, tmp, &ep->listeners, entry) { > + list_del(&le->entry); > + dealloc_listener(ep, le); > + } > + mutex_unlock(&h->mutex); > +} > + > +static int alloc_listener_list(struct iwch_listen_ep *ep) > +{ > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + struct iwch_addrlist *addr; > + struct iwch_listen_entry *le; > + int err = 0; > + int added=0; > + mutex_lock(&h->mutex); > + list_for_each_entry(addr, &h->addrlist, entry) { > + if (ep->com.local_addr.sin_addr.s_addr == 0 || > + ep->com.local_addr.sin_addr.s_addr == > + addr->ifa->ifa_address) { > + le = alloc_listener(ep, addr->ifa->ifa_address); > + if (!le) > + break; > + list_add_tail(&le->entry, &ep->listeners); > + added++; > + } > + } > + mutex_unlock(&h->mutex); > + if 
(ep->com.local_addr.sin_addr.s_addr != 0 && !added) > + err = -EADDRNOTAVAIL; > + if (!err && !added) > + printk(KERN_WARNING MOD > + "No RDMA interface found for device %s\n", > + pci_name(h->rdev.rnic_info.pdev)); > + return err; > +} > + > +static int listen_stop_one(struct iwch_listen_ep *ep, unsigned int stid) > { > struct sk_buff *skb; > - struct cpl_pass_open_req *req; > + struct cpl_close_listserv_req *req; > + > + PDBG("%s stid %u\n", __FUNCTION__, stid); > + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > + if (!skb) { > + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); > + return -ENOMEM; > + } > + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); > + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > + req->cpu_idx = 0; > + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, stid)); > + skb->priority = 1; > + ep->com.rpl_err = 0; > + ep->com.rpl_done = 0; > + cxgb3_ofld_send(ep->com.tdev, skb); > + return wait_for_reply(&ep->com); > +} > + > +static int listen_stop(struct iwch_listen_ep *ep) > +{ > + struct iwch_listen_entry *le; > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + int err = 0; > > PDBG("%s ep %p\n", __FUNCTION__, ep); > + mutex_lock(&h->mutex); > + list_for_each_entry(le, &ep->listeners, entry) { > + err = listen_stop_one(ep, le->stid); > + if (err) > + break; > + } > + mutex_unlock(&h->mutex); > + return err; > +} > + > +static int listen_start_one(struct iwch_listen_ep *ep, unsigned int stid, > + __be32 addr, __be16 port) > +{ > + struct sk_buff *skb; > + struct cpl_pass_open_req *req; > + > + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, stid, ntohl(addr), > + ntohs(port)); > skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > if (!skb) { > - printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); > + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); > return -ENOMEM; > } > > req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); > 
req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); > - req->local_port = ep->com.local_addr.sin_port; > - req->local_ip = ep->com.local_addr.sin_addr.s_addr; > + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, stid)); > + req->local_port = port; > + req->local_ip = addr; > req->peer_port = 0; > req->peer_ip = 0; > req->peer_netmask = 0; > @@ -1152,8 +1278,32 @@ static int listen_start(struct iwch_list > req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); > > skb->priority = 1; > + ep->com.rpl_err = 0; > + ep->com.rpl_done = 0; > cxgb3_ofld_send(ep->com.tdev, skb); > - return 0; > + return wait_for_reply(&ep->com); > +} > + > +static int listen_start(struct iwch_listen_ep *ep) > +{ > + struct iwch_listen_entry *le; > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + int err = 0; > + > + PDBG("%s ep %p\n", __FUNCTION__, ep); > + mutex_lock(&h->mutex); > + list_for_each_entry(le, &ep->listeners, entry) { > + err = listen_start_one(ep, le->stid, le->addr, > + ep->com.local_addr.sin_port); > + if (err) > + goto fail; > + } > + mutex_unlock(&h->mutex); > + return err; > +fail: > + mutex_unlock(&h->mutex); > + listen_stop(ep); > + return err; > } > > static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) > @@ -1170,39 +1320,59 @@ static int pass_open_rpl(struct t3cdev * > return CPL_RET_BUF_DONE; > } > > -static int listen_stop(struct iwch_listen_ep *ep) > -{ > - struct sk_buff *skb; > - struct cpl_close_listserv_req *req; > - > - PDBG("%s ep %p\n", __FUNCTION__, ep); > - skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > - if (!skb) { > - printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); > - return -ENOMEM; > - } > - req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); > - req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > - req->cpu_idx = 0; > - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); 
> - skb->priority = 1; > - cxgb3_ofld_send(ep->com.tdev, skb); > - return 0; > -} > - > static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, > void *ctx) > { > struct iwch_listen_ep *ep = ctx; > struct cpl_close_listserv_rpl *rpl = cplhdr(skb); > > - PDBG("%s ep %p\n", __FUNCTION__, ep); > + PDBG("%s ep %p stid %u\n", __FUNCTION__, ep, GET_TID(rpl)); > + > ep->com.rpl_err = status2errno(rpl->status); > ep->com.rpl_done = 1; > wake_up(&ep->com.waitq); > return CPL_RET_BUF_DONE; > } > > +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr) > +{ > + struct iwch_listen_ep *listen_ep; > + struct iwch_listen_entry *le; > + > + mutex_lock(&rnicp->mutex); > + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { > + if (listen_ep->com.local_addr.sin_addr.s_addr) > + continue; > + le = alloc_listener(listen_ep, addr); > + if (le) { > + list_add_tail(&le->entry, &listen_ep->listeners); > + listen_start_one(listen_ep, le->stid, addr, > + listen_ep->com.local_addr.sin_port); > + } > + } > + mutex_unlock(&rnicp->mutex); > +} > + > +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr) > +{ > + struct iwch_listen_ep *listen_ep; > + struct iwch_listen_entry *le, *tmp; > + > + mutex_lock(&rnicp->mutex); > + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { > + if (listen_ep->com.local_addr.sin_addr.s_addr) > + continue; > + list_for_each_entry_safe(le, tmp, &listen_ep->listeners, > + entry) > + if (le->addr == addr) { > + listen_stop_one(listen_ep, le->stid); > + list_del(&le->entry); > + dealloc_listener(listen_ep, le); > + } > + } > + mutex_unlock(&rnicp->mutex); > +} > + > static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) > { > struct cpl_pass_accept_rpl *rpl; > @@ -1767,8 +1937,7 @@ int iwch_accept_cr(struct iw_cm_id *cm_i > goto err; > > /* wait for wr_ack */ > - wait_event(ep->com.waitq, ep->com.rpl_done); > - err = ep->com.rpl_err; > + err = wait_for_reply(&ep->com); > if (err) > goto 
err; > > @@ -1887,31 +2056,23 @@ int iwch_create_listen(struct iw_cm_id * > ep->com.cm_id = cm_id; > ep->backlog = backlog; > ep->com.local_addr = cm_id->local_addr; > + INIT_LIST_HEAD(&ep->listeners); > > - /* > - * Allocate a server TID. > - */ > - ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); > - if (ep->stid == -1) { > - printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); > - err = -ENOMEM; > + err = alloc_listener_list(ep); > + if (err) > goto fail2; > - } > > state_set(&ep->com, LISTEN); > err = listen_start(ep); > - if (err) > - goto fail3; > > - /* wait for pass_open_rpl */ > - wait_event(ep->com.waitq, ep->com.rpl_done); > - err = ep->com.rpl_err; > if (!err) { > cm_id->provider_data = ep; > + mutex_lock(&h->mutex); > + list_add_tail(&ep->entry, &h->listen_eps); > + mutex_unlock(&h->mutex); > goto out; > } > -fail3: > - cxgb3_free_stid(ep->com.tdev, ep->stid); > + dealloc_listener_list(ep); > fail2: > cm_id->rem_ref(cm_id); > put_ep(&ep->com); > @@ -1923,18 +2084,20 @@ out: > int iwch_destroy_listen(struct iw_cm_id *cm_id) > { > int err; > + struct iwch_dev *h = to_iwch_dev(cm_id->device); > struct iwch_listen_ep *ep = to_listen_ep(cm_id); > > PDBG("%s ep %p\n", __FUNCTION__, ep); > > might_sleep(); > + mutex_lock(&h->mutex); > + list_del(&ep->entry); > + mutex_unlock(&h->mutex); > state_set(&ep->com, DEAD); > ep->com.rpl_done = 0; > ep->com.rpl_err = 0; > err = listen_stop(ep); > - wait_event(ep->com.waitq, ep->com.rpl_done); > - cxgb3_free_stid(ep->com.tdev, ep->stid); > - err = ep->com.rpl_err; > + dealloc_listener_list(ep); > cm_id->rem_ref(cm_id); > put_ep(&ep->com); > return err; > diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h > index 6107e7c..23e5a22 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h > +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h > @@ -162,10 +162,19 @@ struct iwch_ep_common { > int rpl_err; > }; > > -struct iwch_listen_ep { > - struct iwch_ep_common com; > 
+struct iwch_listen_entry { > + struct list_head entry; > unsigned int stid; > + __be32 addr; > +}; > + > +struct iwch_listen_ep { > + struct iwch_ep_common com; /* Must be first entry! */ > + struct list_head entry; > + struct list_head listeners; > int backlog; > + int listen_count; > + int listen_rpls; > }; > > struct iwch_ep { > @@ -222,6 +231,8 @@ int iwch_resume_tid(struct iwch_ep *ep); > void __free_ep(struct kref *kref); > void iwch_rearp(struct iwch_ep *ep); > int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, struct l2t_entry *l2t); > +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr); > +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr); > > int __init iwch_cm_init(void); > void __exit iwch_cm_term(void); > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jimmott at austin.rr.com Wed Sep 26 12:06:25 2007 From: jimmott at austin.rr.com (Jim Mott) Date: Wed, 26 Sep 2007 14:06:25 -0500 Subject: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device In-Reply-To: <000301c80031$d6ff9250$84feb6f0$@rr.com> References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> Message-ID: <000701c80070$560d0490$02270db0$@rr.com> This is a two part bug report. One is a conceptual problem that may just be a problem of understanding on my part. The other is what I believe to be a bug in the mlx4 driver. 1) ib_create_qp() fails with max_sge If you use ib_query_device() to return the device specific attribute max_sge, it seems reasonable to expect you can create a QP with max_send_sge=max_sge. The problem is that this often fails. The reason is that depending on the QP type (RC, UD, etc.) 
and how the QP will be used (send, RDMA, atomic, etc.), there can be extra segments required in the WQE that eat up SGE entries. So while some send WQEs might have max_sge SGEs available, many will not. Normally the difference between max_sge and the actual maximum value allowed (and checked) for max_send_sge is 1 or 2.

This issue may need API extensions to definitively resolve. In the short term, it would be very nice if the max_sge reported by ib_query_device() were always a value that ib_create_qp() could use. Think of it as the minimum max_send_sge value that will work for all QP types.

2) mlx4 setting of max send SGEs

The recent patch to support shrinking WQEs introduces a behavior that creates a big difference between the number of send SGEs mlx4 supports (checked against 61, should be 59 or 60) and the number reported by ib_query_device() (32, to equal the receive side max_rq_sg value).

The patch that follows will allow an MLX4 to support the number of send SGEs returned by ib_query_device(), and in fact quite a few more. It probably breaks shrinking WQEs and thus should not be applied directly.

Note that if ib_query_device() returned max_sge adjusted for the raddr and atomic segments, this fix would not be needed. MLX4 would still support more SGEs in hardware than can be used through the API, but that is a different problem.

--- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:27:47.000000000 -0500
+++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:36:40.000000000 -0500
@@ -370,7 +370,7 @@ static int set_kernel_sq_size(struct mlx
 	qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s));
 
 	for (;;) {
-		if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)
+		if (s > dev->dev->caps.max_sq_desc_sz)
 			return -EINVAL;
 
 		qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift);

From rdreier at cisco.com Wed Sep 26 12:54:56 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 26 Sep 2007 12:54:56 -0700
Subject: [ofa-general] Re: [PATCH] mlx4: display misc device information via sysfs under /sys/class/infiniband/mlx4_x, for ibstat and ibv_devinfo
In-Reply-To: <200709180914.18560.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 18 Sep 2007 09:14:18 +0200")
References: <200709180914.18560.jackm@dev.mellanox.co.il>
Message-ID:

Thanks, applied with the cleanup suggested by MST.

BTW when applying I had to edit the patch, because of:

> MLX4_MGM_ENTRY_SIZE = 0x100,

this context didn't match the upstream kernel (I see 0x40 there). Is there some reason you have a bigger size in your tree? If so should we make the change upstream too?

Also I deobfuscated as below: diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index be3c6fc..9e590e1 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -524,8 +524,8 @@ static int __devinit mlx4_init_hca(struct mlx4_dev *dev) } priv->eq_table.inta_pin = adapter.inta_pin; - priv->dev.rev_id = adapter.revision_id; - memcpy(priv->dev.board_id, adapter.board_id, sizeof priv->dev.board_id); + dev->rev_id = adapter.revision_id; + memcpy(dev->board_id, adapter.board_id, sizeof dev->board_id); return 0; From rdreier at cisco.com Wed Sep 26 13:02:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 26 Sep 2007 13:02:44 -0700 Subject: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device In-Reply-To: <000701c80070$560d0490$02270db0$@rr.com> (Jim Mott's message of "Wed, 26 Sep 2007 14:06:25 -0500") References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> Message-ID: > 1) ib_create_qp() fails with max_sge > If you use ib_query_device() to return the device specific > attribute max_sge, it seems reasonable to expect you can create > a QP with max_send_sge=max_sge. The problem is that this often > fails. > > The reason is that depending on the QP type (RC, UD, etc.) and > how the QP will be used (send, RDMA, atomic, etc.), there can be > extra segments required in the WQE that eat up SGE entries. So > while some send WQE might have max_sge available SGEs, many will > not. > This issue may need API extensions to definitively resolve. In > the short term, it would be very nice if max_sge reported by > ib_query_device() could always return a value that ib_create_qp() > could use. Think of it as the minimum max_send_sge value that > will work for all QP types. The intention is that any attempt to create a QP with the maximum number of S/G entries as reported by query device should succeed. 
However, as you note there may be issues that make this fail, but I would consider them as bugs to be fixed. You mention API extensions to handle this -- do you have any concrete ideas? In the past we've talked a little about this, but I don't think anyone has suggested any changes that would help matters while still keeping the API no more complex than it already is. > The recent patch to support shrinking WQEs introduces a > behavior that creates a big difference between the mlx4 > supported send SGEs (checked against 61, should be 59 or 60, > and reported in ib_query_device as 32 to equal receive side > max_rq_sg value). I'm not sure I understand this. What's the new behavior? Are you trying to take advantage of the fact that using non-power-of-2 size send WQEs would let you have a send queue with more than 32 S/G entries? I think doing that actually would require a change in the API to allow different values for max_sge_rq and max_sge_sq to be reported from ib_query_device(). Which in turn would break the userspace ABI, etc, etc. and leaves me wondering if it's really worth it. (BTW I hate the "shrinking WQE" terminology for this, although obviously you weren't the one to introduce it) - R. From rdreier at cisco.com Wed Sep 26 13:23:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 26 Sep 2007 13:23:58 -0700 Subject: [ofa-general] Re: [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: (Or Gerlitz's message of "Thu, 20 Sep 2007 15:16:50 +0200 (IST)") References: Message-ID: > To support this inter-op for the case where the receiving party resides at > the IB side, there is a need to handle IGMP (reports/queries) else the local > IP router would not forward multicast traffic towards the IB network. > > This patch does a lookup on the database used for multicast reference counting and > enhances IPoIB to ignore mulicast group which is already handled by user space, all > this under a per device policy flag. 
That is when the policy flag allows it, IPoIB > will not join and attach its QP to a multicast group which has an entry on the database. I don't really follow this explanation. OK, I see in the first paragraph that you want to handle IGMP. How does the second paragraph follow? Why does IGMP mean the kernel IPoIB interface should avoid joining certain multicast groups? (Which groups?) > > + /* ignore group which is directly joined by user space */ > + if (test_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags) && > + !ib_sa_get_mcmember_rec(priv->ca, priv->port, &mgid, &rec)) I don't follow this. Why does ib_sa_get_mcmember_rec() returning 0 imply that userspace has already joined the multicast group? > +module_param_named(umcast_allowed, ipoib_umcast_allowed, int, 0444); Not sure I understand why you added the module parameter... > +static DEVICE_ATTR(umcast, S_IWUSR | S_IRUGO, show_umcast, set_umcast); The set_umcast attribute is writable by root anyway so why are there two ways of setting this? > + if (!strcmp(buf, "1\n")) { I don't think this is the most robust way of parsing things. for example it will break in a very confusing way if someone uses "echo -n" Could you use simple_strtoul() or something like that to handle leading/trailing whitespace etc? - R. From rdreier at cisco.com Wed Sep 26 13:24:41 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 26 Sep 2007 13:24:41 -0700 Subject: [ofa-general] [PATCH-2.6.24 2/2] [RFC] ib/cm: add basic performance counters In-Reply-To: <46FAA827.90504@ichips.intel.com> (Sean Hefty's message of "Wed, 26 Sep 2007 11:42:47 -0700") References: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com> <000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com> <46F953FC.50101@ichips.intel.com> <46FAA827.90504@ichips.intel.com> Message-ID: > The userspace IB CM is exported through /sys/class/infiniband_cm. > Does anyone object to placing the counters under this same directory? 
I think that would be fine, with the usual one-value-per-file sysfs rules. From jlentini at netapp.com Wed Sep 26 14:34:44 2007 From: jlentini at netapp.com (James Lentini) Date: Wed, 26 Sep 2007 17:34:44 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] uDAPL 2.0 mods to co-exist with uDAPL 1.2 In-Reply-To: <46F84E10.9040705@ichips.intel.com> References: <000001c7fbb7$30cbad70$19b7020a@amr.corp.intel.com> <46F84E10.9040705@ichips.intel.com> Message-ID: On Mon, 24 Sep 2007, Arlin Davis wrote: > > > --- a/test/dtest/dtest.c > > > +++ b/test/dtest/dtest.c > > > @@ -44,7 +44,7 @@ > > > #include > > > #ifndef DAPL_PROVIDER > > > -#define DAPL_PROVIDER "OpenIB-cma" > > > +#define DAPL_PROVIDER "OpenIB-2-cma" > > > > Should we update OpenIB to ofa? Obviously, this isn't necessary as part of > > this change > > I didn't want to change the 1.2 names for compatibility reasons but for 2.0 we > could move to ofa names for both libraries and provider names. For example, > libdaplcma.so becomes libdaplofa.so, OpenIB-cma becomes ofa. > > For example dat.conf 2.0 entries would look like this: > > ofa u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "ib0 0" "" > ofa-1 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "ib1 0" "" > ofa-2 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "ib2 0" "" > ofa-3 u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "ib3 0" "" > ofa-bond u2.0 nonthreadsafe default libdaplofa.so dapl.2.0 "bond0 0" "" > > Is that what you had in mind? Yes. 
From jimmott at austin.rr.com Wed Sep 26 14:58:53 2007
From: jimmott at austin.rr.com (Jim Mott)
Date: Wed, 26 Sep 2007 16:58:53 -0500
Subject: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
In-Reply-To:
References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com>
Message-ID: <001001c80088$6e04f4f0$4a0eded0$@rr.com>

This problem comes about because ib_query_device() has only one field (max_sge) to return all types of SGE maximums. This value must work for receive WQEs, send WQEs, and all the permutations of QP type and hardware.

A minimal API change that could help would be to add two new fields to the ib_device_attr structure returned by ib_query_device:
  - delta_sge_sg
  - delta_sge_rd

The behavior would be that in all cases using max_sge for the send or receive SGE count in create_qp would always succeed. That means the current value the drivers return there would have to be reduced to fix this bug. All existing code would continue to run.

If an application wanted to better use hardware that supports asymmetric SGE counts, it could add the appropriate delta_sge_xx value to max_sge and get a more useful value.

It looks like there is some movement in this direction already with the fields:
  - max_sge_rd (nes, amso1100, ehca, cxgb3 only)
  - max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only)

If we do add any new fields to deal with this problem, we should probably make sure all the drivers support them. I guess that portable applications check max_sge_rd and max_srq_sge for zero and use max_sge if they are?

To fully solve the problem and let applications make optimal use of hardware, we probably need a new function that takes the create_qp parameters along with a list of OPCODEs to be used (or excluded?) on this QP and returns the actual send and receive SGE maximums.

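A minimal sketch of how an application might consume the two proposed delta fields, together with the "use max_sge when the specific field is zero" fallback that the message only guesses at. Nothing here is an existing verbs API: struct sge_caps and the names delta_sge_send and delta_sge_recv are hypothetical stand-ins chosen for illustration, since the message does not spell out the exact meaning of its two field names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical device attributes. delta_sge_send and delta_sge_recv are
 * illustrative stand-ins for the two proposed fields, not a real API.
 */
struct sge_caps {
	unsigned int max_sge;        /* assumed always safe for send and receive */
	unsigned int delta_sge_send; /* extra send SGEs the hardware could take */
	unsigned int delta_sge_recv; /* extra receive SGEs the hardware could take */
	unsigned int max_srq_sge;    /* 0 when the driver does not set it */
};

/* Largest send SGE count an application that knows about the delta may ask for. */
unsigned int send_sge_limit(const struct sge_caps *c)
{
	assert(c != NULL);
	return c->max_sge + c->delta_sge_send;
}

/* Largest receive SGE count, computed the same way. */
unsigned int recv_sge_limit(const struct sge_caps *c)
{
	assert(c != NULL);
	return c->max_sge + c->delta_sge_recv;
}

/* Fallback guessed at in the message: a zero field means "use max_sge". */
unsigned int srq_sge_limit(const struct sge_caps *c)
{
	assert(c != NULL);
	return c->max_srq_sge ? c->max_srq_sge : c->max_sge;
}
```

An application unaware of the new fields would keep using max_sge directly, which is what keeps the proposal backward compatible at the source level.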
================================

The issue with the "shrinking WQE" (sorry) is best shown by example. The MLX4 supports a send WQE that is 1008 bytes long unless you are doing RDMA_READ when you can only use 512 byte send WQEs. A receive WQE can be 512 bytes maximum.

Ignore the non-power-of-2 size stuff and just assume that all WQEs are fixed size power-of-2 with maximums of 1024 or 512. This is 63 or 32 segments. One segment for ctrl means that we get max_sge_rq of 31 and a matrix for max_sge_sq:

  RDMA_READ : 30 (raddr)
  RDMA_WRITE: 61 (raddr)
  SEND-RC   : 62
  SEND-UD   : 59 (AV, AV, dest)

The problem with:

  if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)

is that since max_sq_desc is 1008 instead of 1024 we are forced to use wqe_shift of 9 instead of 10. That means that even though the hardware supports an RC send with 62 SGEs, the most we can actually ask for is 31.

================================

All this brings us back to the original bug. ib_query_device() returned max_sge=32, so we use it in max_send_sge when we create a QP.

In mlx4/qp.c, we verify max_send_sge <= max_sq_sg (62; 1008-16) in a sanity check at entry to set_kernel_sq_size(). This passes.

Then we calculate the size of the WQE based on the QP type:

  cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg) + send_wqe_overhead(type);

The send_wqe_overhead(RC) function returns 3 segments:
  - ctrl + atomic + raddr

So we get a WQE size of 560 bytes (32 SGEs + 3 overhead segments) and this fails the power-of-2 test because 1024 is greater than 1008.

Sorry for all the words.

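The arithmetic in the message above can be checked with a small stand-alone sketch. It assumes 16-byte segments (which follows from 560 bytes for 35 segments) and the 1008-byte send descriptor limit quoted in the message. The helper names are made up for this illustration and are not the mlx4 driver's functions; stride_check_ok() only mimics the existing test on the rounded-up size, and raw_check_ok() mimics the test from the partial patch.

```c
#include <assert.h>

#define SEG_BYTES        16u  /* one WQE segment: ctrl, raddr, atomic or data */
#define MAX_SQ_DESC_SZ 1008u  /* send descriptor limit quoted in the message */

/* Smallest power of two >= n, in the spirit of roundup_pow_of_two(). */
unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* Bytes needed for a send WQE with `sge` data segments plus overhead segments. */
unsigned int send_wqe_bytes(unsigned int sge, unsigned int overhead_segs)
{
	return (sge + overhead_segs) * SEG_BYTES;
}

/* Existing style of check: compare the power-of-2 stride with the limit. */
int stride_check_ok(unsigned int wqe_bytes)
{
	return roundup_pow2(wqe_bytes) <= MAX_SQ_DESC_SZ;
}

/* Check from the partial patch: compare the raw WQE size with the limit. */
int raw_check_ok(unsigned int wqe_bytes)
{
	return wqe_bytes <= MAX_SQ_DESC_SZ;
}
```

With 32 SGEs and the 3 RC overhead segments, send_wqe_bytes(32, 3) gives 560; the stride check rounds that up to 1024 and rejects it, while the raw check accepts it.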
From rdreier at cisco.com Wed Sep 26 15:31:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 26 Sep 2007 15:31:52 -0700 Subject: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device In-Reply-To: <001001c80088$6e04f4f0$4a0eded0$@rr.com> (Jim Mott's message of "Wed, 26 Sep 2007 16:58:53 -0500") References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> Message-ID: > A minimal API change that could help would be to add two new fields > to ib_device_attr structure returned by ib_query_device: > - delta_sge_sg > - delta_sge_rd Hmm, a cute idea but I'm still left wondering if it's worth the ABI breakage etc just to give a few more S/G entries in some situations. > The behavior would be that in all cases using max_sge for send or > receive SGE count in create_qp would always succeed. That means > the current value the drivers return there would have to be reduced > to fix this bug. All existing codes would continue to run. Actually are there any drivers other than patched mlx4 where max_sge doesn't always work? I agree we do want to get this right, but I thought we had fixed all such bugs. 
(And we should make sure that any "shrinking WQE" patch for mlx4 doesn't introduce new bugs) (BTW I see a different bug in unpatched mlx4, namely that it might report a too-big number of S/G entries allowed for the SQ) > It looks like there is some movement in this direction already > with the fields: > - max_sge_rd (nes, amso1100, ehca, cxgb3 only) This field is obsolete, since we don't handle RD and almost certainly never will. I'm not sure why anyone is setting a value. > - max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only) Any devices that handle SRQ should set this. I think cxgb3 does not support SRQ. - R. From kilian at stanford.edu Wed Sep 26 16:14:09 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Wed, 26 Sep 2007 16:14:09 -0700 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <795c49870709251630q5edef1fcx50235d65cbae6e9d@mail.gmail.com> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <20070925233421.GA19757@sashak.voltaire.com> <795c49870709251630q5edef1fcx50235d65cbae6e9d@mail.gmail.com> Message-ID: <200709261614.09499.kilian@stanford.edu> On Tuesday 25 September 2007 04:30:48 pm Jeff Becker wrote: > Hi Sasha. Thanks for the info. I did have the following problem when > building against the 1.2 libibmad: > > cc -Wall -g -fpic -I. -I../include -I/home/becker//include -c -o > sim_mad.o sim_mad.c > sim_mad.c: In function 'encode_trap144': > sim_mad.c:1261: error: 'IB_NOTICE_DATA_144_LID_F' undeclared (first > use in this function) > sim_mad.c:1261: error: (Each undeclared identifier is reported only > once sim_mad.c:1261: error: for each function it appears in.) 
> sim_mad.c:1262: error: 'IB_NOTICE_DATA_144_CAPMASK_F' undeclared > (first use in this function) > make[1]: *** [sim_mad.o] Error 1 > make[1]: Leaving directory `/home/becker/ibrouting/ibsim/ibsim' And indeed those have been introduced by this patch in 1.2.5: http://lists.openfabrics.org/pipermail/general/2007-June/036912.html Cheers, -- Kilian From mshefty at ichips.intel.com Wed Sep 26 16:56:52 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Sep 2007 16:56:52 -0700 Subject: [ofa-general] [PATCH-2.6.24 2/2] [RFC] ib/cm: add basic performance counters In-Reply-To: References: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com> <000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com> <46F953FC.50101@ichips.intel.com> <46FAA827.90504@ichips.intel.com> Message-ID: <46FAF1C4.1090109@ichips.intel.com> > > The userspace IB CM is exported through /sys/class/infiniband_cm. > > Does anyone object to placing the counters under this same directory? > > I think that would be fine, with the usual one-value-per-file sysfs rules. Er, I could use some help here. Is there a preferred way to share /sys/class/infiniband_cm between the ib_cm and ib_user_cm modules? Currently, ib_user_cm registers the infiniband_cm class and registers devices (ucm0, ucm1, ...) on that class. It ends up making use of the infiniband_cm class 'release' callback for this. I want to make sure that I'm not overlooking some simple way of maintaining this while letting the ib_cm module stick statistics under it. - Sean From jimmott at austin.rr.com Wed Sep 26 17:47:39 2007 From: jimmott at austin.rr.com (Jim Mott) Date: Wed, 26 Sep 2007 19:47:39 -0500 Subject: [ofa-general] SDP memory allocation policy problem? In-Reply-To: <46F99093.7000907@noaa.gov> References: <46F99093.7000907@noaa.gov> Message-ID: <001401c800a0$01ea5180$05bef480$@rr.com> I have reworked your patch slightly and run my simple unit tests on it. No correctness problems detected in latency or bandwidth paths. 
No performance regressions either. If your proposed patch worked for you, then this one ought to work too. Could you please give it a go and let me know? Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-09-26 13:27:43.000000000 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-09-26 17:52:12.000000000 -0500 @@ -221,16 +221,26 @@ static void sdp_post_recv(struct sdp_soc skb_frag_t *frag; struct sdp_bsdh *h; int id = ssk->rx_head; + unsigned int gfp_page; /* Now, allocate and repost recv */ /* TODO: allocate from cache */ - skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, - GFP_KERNEL); + + if (unlikely(ssk->isk.sk.sk_allocation)) { + skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, + ssk->isk.sk.sk_allocation); + gfp_page = ssk->isk.sk.sk_allocation | __GFP_HIGHMEM; + } else { + skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, + GFP_KERNEL); + gfp_page = GFP_HIGHUSER; + } + /* FIXME */ BUG_ON(!skb); h = (struct sdp_bsdh *)skb->head; for (i = 0; i < ssk->recv_frags; ++i) { - page = alloc_pages(GFP_HIGHUSER, 0); + page = alloc_pages(gfp_page, 0); BUG_ON(!page); frag = &skb_shinfo(skb)->frags[i]; frag->page = page; @@ -404,6 +414,7 @@ void sdp_post_sends(struct sdp_sock *ssk /* TODO: nonagle? 
*/ struct sk_buff *skb; int c; + int gfp_page; if (unlikely(!ssk->id)) { if (ssk->isk.sk.sk_send_head) { @@ -415,6 +426,11 @@ void sdp_post_sends(struct sdp_sock *ssk return; } + if (unlikely(ssk->isk.sk.sk_allocation)) + gfp_page = ssk->isk.sk.sk_allocation; + else + gfp_page = GFP_KERNEL; + if (ssk->recv_request && ssk->rx_tail >= ssk->recv_request_head && ssk->bufs >= SDP_MIN_BUFS && @@ -424,7 +440,7 @@ void sdp_post_sends(struct sdp_sock *ssk skb = sk_stream_alloc_skb(&ssk->isk.sk, sizeof(struct sdp_bsdh) + sizeof(*resp_size), - GFP_KERNEL); + gfp_page); /* FIXME */ BUG_ON(!skb); resp_size = (struct sdp_chrecvbuf *)skb_put(skb, sizeof *resp_size); @@ -449,7 +465,7 @@ void sdp_post_sends(struct sdp_sock *ssk skb = sk_stream_alloc_skb(&ssk->isk.sk, sizeof(struct sdp_bsdh) + sizeof(*req_size), - GFP_KERNEL); + gfp_page); /* FIXME */ BUG_ON(!skb); ssk->sent_request = SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE; @@ -480,7 +496,7 @@ void sdp_post_sends(struct sdp_sock *ssk ssk->bufs) { skb = sk_stream_alloc_skb(&ssk->isk.sk, sizeof(struct sdp_bsdh), - GFP_KERNEL); + gfp_page); /* FIXME */ BUG_ON(!skb); sdp_post_send(ssk, skb, SDP_MID_DISCONN); -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Nathan Dauchy Sent: Tuesday, September 25, 2007 5:50 PM To: general at lists.openfabrics.org Subject: Re: [ofa-general] SDP memory allocation policy problem? Is there anyone among the OFED development team that is looking into this issue? I believe that it is causing nodes to hang at our site. We are running ofed-1.2 and the 2.6.9-55.ELsmp kernel. Workarounds or even untested patches would be appreciated. Thanks! -Nathan Ken Phillips wrote: > Greetings, > > Teammates here report the following: > > Problem > > The method SDP uses to allocate socket buffers may cause the > node to hang under memory pressure. 
> > Details > > Each kernel level socket has an allocation flag to specify the > memory allocation policy for socket buffers, the default is GFP_ATOMIC > (or GFP_KERNEL for SDP). If the caller creates a socket with the > policy set to GFP_NOFS or GFP_NOIO this should be the allocation > policy used by the SDP layer. > > The problem we are seeing is that if a node is under load, and > a memory allocation fails (say in sock_sendmsg()), the kernel will > use the allocation policy to decide how to proceed with the allocation. > If GFP_KERNEL is specified, then the kernel may attempt to free pages > through the iSCSI block device that is making the socket call, which > would result in a deadlock. Use of GFP_NOIO should prevent the kernel > from using the IO backend to free memory resources. > > here is a sample stack trace from Alt-Sysrq during one of these > lockups, > >> tx_worker D ffffff0014d14000 0 10195 1 10196 10194 >> (L-TLB) >> 00000100707e98d8 0000000000000046 0000000000000004 0000000000000212 >> 0000000000000212 ffffffffa018ccae 0000000000000246 0000000000000246 >> 000001007873c7f0 0000000000000320 >> Call Trace:{:ib_mthca:mthca_poll_cq+2258} >> {schedule_timeout+224} >> {lock_sock+152} >> {autoremove_wake_function+0} >> {:ib_sdp:sdp_poll_cq+58} >> {autoremove_wake_function+0} >> {release_sock+16} >> {:ib_sdp:sdp_sendmsg+33} >> {sock_sendmsg+271} >> {:ib_sdp:sdp_post_sends+619} >> {release_sock+16} >> {:ib_sdp:sdp_sendmsg+2222} >> {autoremove_wake_function+0} >> {:rs_iscsi:iscsi_sock_msg+1265} >> {:rs_iscsi:iscsi_sock_msg+1261} >> {recalc_task_prio+337} >> {:rs_iscsi:scsi_command_i+5283} >> {thread_return+0} >> {thread_return+88} >> {del_timer+107} >> {del_singleshot_timer_sync+9} >> {schedule_timeout+375} >> {:rs_iscsi:tx_worker_proc_i+6819} >> {child_rip+8} >> {:rs_iscsi:tx_worker_proc_i+0} >> {child_rip+0} >> >> > > We still don't know the scope of changes to be made, but we think, > at minimum that some of the memory allocation in SDP should be changed, > 
for example. > > diff -Naur old/drivers/infiniband/ulp/sdp/sdp_bcopy.c > new/drivers/infiniband/ulp/sdp/sdp_bcopy.c > --- old/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-06-21 > 10:38:47.000000000 -0400 > +++ new/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-08-31 > 12:25:58.000000000 -0400 > @@ -224,13 +224,27 @@ > > /* Now, allocate and repost recv */ > /* TODO: allocate from cache */ > + > +#if (PROPOSED_SDP_FIX == 1) > + skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, > + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : > + ssk->isk.sk.sk_allocation); > +#else > skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, > GFP_KERNEL); > +#endif > /* FIXME */ > BUG_ON(!skb); > h = (struct sdp_bsdh *)skb->head; > for (i = 0; i < ssk->recv_frags; ++i) { > +#if (PROPOSED_SDP_FIX == 1) > + page = alloc_pages((ssk->isk.sk.sk_allocation == 0) > + ? (GFP_HIGHUSER) : > + (ssk->isk.sk.sk_allocation | (__GFP_HIGHMEM)), > + 0); > +#else > page = alloc_pages(GFP_HIGHUSER, 0); > +#endif > BUG_ON(!page); > frag = &skb_shinfo(skb)->frags[i]; > frag->page = page; > @@ -406,10 +420,18 @@ > ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) { > struct sdp_chrecvbuf *resp_size; > ssk->recv_request = 0; > +#if (PROPOSED_SDP_FIX == 1) > + skb = sk_stream_alloc_skb(&ssk->isk.sk, > + sizeof(struct sdp_bsdh) + > + sizeof(*resp_size), > + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : > + ssk->isk.sk.sk_allocation); > +#else > skb = sk_stream_alloc_skb(&ssk->isk.sk, > sizeof(struct sdp_bsdh) + > sizeof(*resp_size), > GFP_KERNEL); > +#endif > /* FIXME */ > BUG_ON(!skb); > resp_size = (struct sdp_chrecvbuf *)skb_put(skb, sizeof *resp_size); > @@ -431,10 +453,18 @@ > ssk->tx_head > ssk->sent_request_head + SDP_RESIZE_WAIT && > ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) { > struct sdp_chrecvbuf *req_size; > +#if (PROPOSED_SDP_FIX == 1) > + skb = sk_stream_alloc_skb(&ssk->isk.sk, > + sizeof(struct sdp_bsdh) + > + sizeof(*req_size), > + (ssk->isk.sk.sk_allocation == 0) ? 
(GFP_KERNEL) : > + ssk->isk.sk.sk_allocation); > +#else > skb = sk_stream_alloc_skb(&ssk->isk.sk, > sizeof(struct sdp_bsdh) + > sizeof(*req_size), > GFP_KERNEL); > +#endif > /* FIXME */ > BUG_ON(!skb); > ssk->sent_request = SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE; > @@ -463,9 +493,16 @@ > (TCPF_FIN_WAIT1 | TCPF_LAST_ACK)) && > !ssk->isk.sk.sk_send_head && > ssk->bufs) { > +#if (PROPOSED_SDP_FIX == 1) > + skb = sk_stream_alloc_skb(&ssk->isk.sk, > + sizeof(struct sdp_bsdh), > + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : > + ssk->isk.sk.sk_allocation); > +#else > skb = sk_stream_alloc_skb(&ssk->isk.sk, > sizeof(struct sdp_bsdh), > GFP_KERNEL); > +#endif > /* FIXME */ > BUG_ON(!skb); > sdp_post_send(ssk, skb, SDP_MID_DISCONN); > diff -Naur old/drivers/infiniband/ulp/sdp/sdp.h > new/drivers/infiniband/ulp/sdp/sdp.h > --- old/drivers/infiniband/ulp/sdp/sdp.h 2007-06-21 10:38:47.000000000 -0400 > +++ new/drivers/infiniband/ulp/sdp/sdp.h 2007-08-31 12:25:45.000000000 -0400 > @@ -7,6 +7,8 @@ > #include /* For urgent data flags */ > #include > > +#define PROPOSED_SDP_FIX 1 > + > #define sdp_printk(level, sk, format, arg...) \ > printk(level "sdp_sock(%d:%d): " format, \ > (sk) ? 
inet_sk(sk)->num : -1, \ > > > > > --------------------- > Best Regards > K Phillips From jimmott at austin.rr.com Wed Sep 26 18:41:44 2007 From: jimmott at austin.rr.com (Jim Mott) Date: Wed, 26 Sep 2007 20:41:44 -0500 Subject: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device In-Reply-To: References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> Message-ID: <001501c800a7$8fd5efc0$af81cf40$@rr.com> The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user space test. ibv_query_device(MT25204) returns max_sge=30 - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works I only have the two types of adapters to test with. 
-----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Wednesday, September 26, 2007 5:32 PM To: Jim Mott Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device > A minimal API change that could help would be to add two new fields > to ib_device_attr structure returned by ib_query_device: > - delta_sge_sg > - delta_sge_rd Hmm, a cute idea but I'm still left wondering if it's worth the ABI breakage etc just to give a few more S/G entries in some situations. > The behavior would be that in all cases using max_sge for send or > receive SGE count in create_qp would always succeed. That means > the current value the drivers return there would have to be reduced > to fix this bug. All existing codes would continue to run. Actually are there any drivers other than patched mlx4 where max_sge doesn't always work? I agree we do want to get this right, but I thought we had fixed all such bugs. (And we should make sure that any "shrinking WQE" patch for mlx4 doesn't introduce new bugs) (BTW I see a different bug in unpatched mlx4, namely that it might report a too-big number of S/G entries allowed for the SQ) > It looks like there is some movement in this direction already > with the fields: > - max_sge_rd (nes, amso1100, ehca, cxgb3 only) This field is obsolete, since we don't handle RD and almost certainly never will. I'm not sure why anyone is setting a value. > - max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only) Any devices that handle SRQ should set this. I think cxgb3 does not support SRQ. - R. 
From rdreier at cisco.com Wed Sep 26 18:56:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 26 Sep 2007 18:56:47 -0700 Subject: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device In-Reply-To: <001501c800a7$8fd5efc0$af81cf40$@rr.com> (Jim Mott's message of "Wed, 26 Sep 2007 20:41:44 -0500") References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> <001501c800a7$8fd5efc0$af81cf40$@rr.com> Message-ID: > The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user space test. > > ibv_query_device(MT25204) returns max_sge=30 > - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails > - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works Which transport type? - R. From tom at opengridcomputing.com Wed Sep 26 19:06:45 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 26 Sep 2007 21:06:45 -0500 Subject: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device In-Reply-To: <001501c800a7$8fd5efc0$af81cf40$@rr.com> References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> <001501c800a7$8fd5efc0$af81cf40$@rr.com> Message-ID: <1190858805.16774.90.camel@trinity.ogc.int> FWIW, I have code in my apps that retries QP creation with reduced values when the allocation with max fails. There was also an earlier e-mail thread on this exact same issue, but the "solution" bantered about was to use special values in the qp_attr structure ala QP_MAX_SEND_SGE (-1?). The provider would recognize this value and allocate the max for that attribute that would succeed given the current resource situation. 
The qp_attr structure would then be updated by the provider with the values given. This approach extends, but doesn't break the API, allows existing apps to work as usual, and avoids the retry logic that I've added to my apps. Just a thought, Tom On Wed, 2007-09-26 at 20:41 -0500, Jim Mott wrote: > The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user space test. > > ibv_query_device(MT25204) returns max_sge=30 > - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails > - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works > > I only have the two types of adapters to test with. > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Wednesday, September 26, 2007 5:32 PM > To: Jim Mott > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device > > > A minimal API change that could help would be to add two new fields > > to ib_device_attr structure returned by ib_query_device: > > - delta_sge_sg > > - delta_sge_rd > > Hmm, a cute idea but I'm still left wondering if it's worth the ABI > breakage etc just to give a few more S/G entries in some situations. > > > The behavior would be that in all cases using max_sge for send or > > receive SGE count in create_qp would always succeed. That means > > the current value the drivers return there would have to be reduced > > to fix this bug. All existing codes would continue to run. > > Actually are there any drivers other than patched mlx4 where max_sge > doesn't always work? I agree we do want to get this right, but I > thought we had fixed all such bugs. 
(And we should make sure that any > "shrinking WQE" patch for mlx4 doesn't introduce new bugs) > > (BTW I see a different bug in unpatched mlx4, namely that it might > report a too-big number of S/G entries allowed for the SQ) > > > It looks like there is some movement in this direction already > > with the fields: > > - max_sge_rd (nes, amso1100, ehca, cxgb3 only) > > This field is obsolete, since we don't handle RD and almost certainly > never will. I'm not sure why anyone is setting a value. > > > - max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only) > > Any devices that handle SRQ should set this. I think cxgb3 does not > support SRQ. > > - R. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jimmott at austin.rr.com Wed Sep 26 20:05:17 2007 From: jimmott at austin.rr.com (Jim Mott) Date: Wed, 26 Sep 2007 22:05:17 -0500 Subject: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device In-Reply-To: References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> <001501c800a7$8fd5efc0$af81cf40$@rr.com> Message-ID: <001601c800b3$3e032a80$ba097f80$@rr.com> IBV_QPT_RC -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Wednesday, September 26, 2007 8:57 PM To: Jim Mott Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device > The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user space test. 
> > ibv_query_device(MT25204) returns max_sge=30 > - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails > - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works Which transport type? - R. From kliteyn at mellanox.co.il Wed Sep 26 22:08:45 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 27 Sep 2007 07:08:45 +0200 Subject: [ofa-general] nightly osm_sim report 2007-09-27:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-26 OpenSM git rev = Tue_Sep_25_00:30:00_2007 [2c547953885809a8026e20af7809be08b42c3865] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From adit.262 at gmail.com Wed Sep 26 22:59:35 2007 From: adit.262 at gmail.com (Adit Ranadive) Date: Thu, 27 Sep 2007 01:59:35 -0400 Subject: [ofa-general] OFA Kernel for XenIB Message-ID: Hi, I have been working with the xen-smartio source tree from the xensource site and wanted to whether this kernel is a different implementation for XenIB. 
Also, can this kernel be used in place of the xen0 kernel? Does anyone have pointers on how the kernel can be used? There doesn't seem to be any README on the install process. Thanks, Adit -- Adit Ranadive MS CS Candidate Georgia Institute of Technology, Atlanta, GA From keshetti85-student at yahoo.co.in Thu Sep 27 00:06:37 2007 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Thu, 27 Sep 2007 12:36:37 +0530 Subject: [ofa-general] [query] openSM routing algorithms Message-ID: <829ded920709270006s32c06325p381bfa12f80dd11f@mail.gmail.com> In the latest OpenSM release, I see that it supports four different routing algorithms (Min-hop being the default). I would like to know in detail how these algorithms work and when to use each one. Can anyone help by pointing me to documents that describe them? regards, Mahesh From contato at clickmkt.com Thu Sep 27 00:44:23 2007 From: contato at clickmkt.com (Triz Jóias) Date: Thu, 27 Sep 2007 07:44:23 GMT Subject: [ofa-general] =?iso-8859-1?q?Oportunidade_de_neg=F3cio?= Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: convite_triz_especial.jpg Type: image/jpeg Size: 154722 bytes Desc: not available URL: From desta.danby at vibysko.dk Thu Sep 27 02:24:11 2007 From: desta.danby at vibysko.dk (Kelley Culver) Date: Thu, 27 Sep 2007 10:24:11 +0100 Subject: [ofa-general] Being young and inexperienced Message-ID: <978259186.07320458564594@vibysko.dk> -------------- next part -------------- A non-text attachment was scrubbed...
Name: img20.gif Type: image/gif Size: 4994 bytes Desc: not available URL: From vlad at lists.openfabrics.org Thu Sep 27 02:55:34 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 27 Sep 2007 02:55:34 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070927-0200 daily build status Message-ID: <20070927095534.C6981E608F9@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.14 Passed 
on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: 
/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070927-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From dev_hyd2001 at yahoo.com Thu Sep 27 03:37:52 2007 From: dev_hyd2001 at yahoo.com (Dev) Date: Thu, 27 Sep 2007 03:37:52 -0700 (PDT) Subject: [ofa-general] ***SPAM*** uDAPL thread safety Message-ID: <605833.19627.qm@web53704.mail.re2.yahoo.com> HI, Is the uDAPL provider in OFED 1.2 thread safe ? the dat.conf by default has an entry nonthreadsafe and the spec says for some of the routines thread safety depends on the provider. cheers /Dev --------------------------------- Check out the hottest 2008 models today at Yahoo! Autos. 
From or.gerlitz at gmail.com Thu Sep 27 03:38:43 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Thu, 27 Sep 2007 12:38:43 +0200 Subject: [ofa-general] Re: [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: References: Message-ID: <15ddcffd0709270338y33976845va9f36e4050044d86@mail.gmail.com> On 9/26/07, Roland Dreier wrote: > > To support this inter-op for the case where the receiving party resides at > > the IB side, there is a need to handle IGMP (reports/queries) else the local > > IP router would not forward multicast traffic towards the IB network. > > > > This patch does a lookup on the database used for multicast reference counting and > > enhances IPoIB to ignore a multicast group which is already handled by user space, all > > this under a per device policy flag. That is when the policy flag allows it, IPoIB > > will not join and attach its QP to a multicast group which has an entry on the database. > > I don't really follow this explanation. OK, I see in the first > paragraph that you want to handle IGMP. How does the second paragraph > follow? Why does IGMP mean the kernel IPoIB interface should avoid > joining certain multicast groups? (Which groups?) The user space app first joins the multicast group through the rdma-cm (by calling rdma_join_multicast etc.) and then lets the kernel IGMP state machine know that it has to join / respond to queries for this group. This second step is achieved when the app issues a SOL_IP / IP_ADD_MEMBERSHIP setsockopt call. Since this setsockopt has two impacts, A) IGMP etc. and B) IPoIB set_multicast_list is called, the patch prevents IPoIB from joining / attaching to this group, since the app actually attaches its own UD QP to the group. So my change log comment wasn't detailed enough to make it clear that this is the design, sorry.
> > + /* ignore group which is directly joined by user space */ > > + if (test_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags) && > > + !ib_sa_get_mcmember_rec(priv->ca, priv->port, &mgid, &rec)) > > I don't follow this. Why does ib_sa_get_mcmember_rec() returning 0 > imply that userspace has already joined the multicast group? Since both the rdma-cm and ipoib are consumers of the core multicast management code (core/multicast.c which is linked into ib_sa.ko), and the app (through the rdma-cm) --first-- inserts a record into the database and only then issues the setsockopt call, if ipoib has a hit on a group it was told to join, this group must be offloaded by the rdma-cm consumer. > > +module_param_named(umcast_allowed, ipoib_umcast_allowed, int, 0444); > > Not sure I understand why you added the module parameter... The per device flag is initialized by the module param value at ipoib_dev_init() > > +static DEVICE_ATTR(umcast, S_IWUSR | S_IRUGO, show_umcast, set_umcast); > > The set_umcast attribute is writable by root anyway so why are there > two ways of setting this? I am not sure I fully follow your comment. I just wanted to make the sysfs /sys/class/net/$dev/umcast entry writable and I actually did copy-paste from the set_mode code... > > + if (!strcmp(buf, "1\n")) { > > I don't think this is the most robust way of parsing things. For > example it will break in a very confusing way if someone uses "echo -n" > Could you use simple_strtoul() or something like that to handle > leading/trailing whitespace etc? Sure, I will fix it. Or.
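The review point above, that comparing the buffer against "1\n" breaks when the value is written with "echo -n", can be illustrated outside the kernel. This is a user-space sketch only: it uses the standard C strtoul() as an analogue of the kernel's simple_strtoul(), and the helper name parse_bool_flag() is hypothetical, not part of the IPoIB patch.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a sysfs-style 0/1 flag, tolerating leading and trailing
 * whitespace (including a missing or present final newline).
 * Returns 0 or 1 on success, -1 on malformed input. */
static int parse_bool_flag(const char *buf)
{
    char *end;
    unsigned long val;

    assert(buf != NULL);

    /* Skip leading whitespace ourselves so that signs can be rejected. */
    while (isspace((unsigned char)*buf))
        buf++;
    if (!isdigit((unsigned char)*buf))
        return -1; /* empty input, a sign, or a non-digit */

    errno = 0;
    val = strtoul(buf, &end, 10);
    if (errno != 0)
        return -1; /* value out of range for unsigned long */

    /* Allow trailing whitespace, but nothing else after the number. */
    while (isspace((unsigned char)*end))
        end++;
    if (*end != '\0' || val > 1)
        return -1;

    return (int)val;
}
```

Both "1\n" (plain echo) and "1" (echo -n) parse to 1 with this approach, while inputs such as "2" or "1x" are rejected instead of being silently treated as 0.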
From hrosenstock at xsigo.com Thu Sep 27 04:26:13 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 27 Sep 2007 04:26:13 -0700 Subject: [ofa-general] [query] openSM routing algorithms In-Reply-To: <829ded920709270006s32c06325p381bfa12f80dd11f@mail.gmail.com> References: <829ded920709270006s32c06325p381bfa12f80dd11f@mail.gmail.com> Message-ID: <1190892373.7075.673.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-09-27 at 12:36 +0530, Keshetti Mahesh wrote: > In the latest openSM release, I could see it supports four different > algorithms(Min-hop algorithm being the default). I want to know in detail > how these algorithms work and which one to use to when. Can anyone of > you help me by giving references to some documents describing the same. The descriptions of and references to (papers on) the routing algorithms are in the OpenSM man page. -- Hal > > regards, > Mahesh From swise at opengridcomputing.com Thu Sep 27 08:16:40 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 27 Sep 2007 10:16:40 -0500 Subject: [ofa-general] Re: [ewg] Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. In-Reply-To: <46F6CCA4.1010607@opengridcomputing.com> References: <20070912100025.3190.89259.stgit@dell3.ogc.int> <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com> <46F3E3D2.70601@opengridcomputing.com> <20070923085052.GC24557@mellanox.co.il> <46F6CCA4.1010607@opengridcomputing.com> Message-ID: <46FBC958.4090209@opengridcomputing.com> Michael, Have you pulled this in yet? I want to close out the bug I have open... Thanks, Steve. Steve Wise wrote: > > > Michael S.
Tsirkin wrote: >> Yes, please push this into your git tree (and please verify that >> cross-build to all OS-es passes). >> > > done! > > git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2_c > >> Further, please do it this way: add the patch in ofed-1.2.5 >> and then merge 1.2.5 into 1.3. >> > > done! > > git://git.openfabrics.org/~swise/ofed-1.3 ofed_kernel > > > Steve. > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From tanzi.dalzell at urschall.com Thu Sep 27 11:00:28 2007 From: tanzi.dalzell at urschall.com (Enrique Reilly) Date: Thu, 27 Sep 2007 11:00:28 -0700 Subject: [ofa-general] Are you strong man? Message-ID: <01c80130$4975bd10$e38f5289@tanzi.dalzell> -------------- next part -------------- A non-text attachment was scrubbed... Name: img20.gif Type: image/gif Size: 4937 bytes Desc: not available URL: From phillips.ken at gmail.com Thu Sep 27 11:27:34 2007 From: phillips.ken at gmail.com (Ken Phillips) Date: Thu, 27 Sep 2007 14:27:34 -0400 Subject: [ofa-general] SDP memory allocation policy problem? In-Reply-To: <001401c800a0$01ea5180$05bef480$@rr.com> References: <46F99093.7000907@noaa.gov> <001401c800a0$01ea5180$05bef480$@rr.com> Message-ID: Thanks for your help. We'll setup to get this tested under pressure. We'll keep you posted. Regards KP On 9/26/07, Jim Mott wrote: > I have reworked your patch slightly and run my simple unit tests on it. No correctness problems detected in latency or bandwidth > paths. No performance regressions either. > > If your proposed patch worked for you, then this one ought to work too. Could you please give it a go and let me know? 
> > Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c > =================================================================== > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-09-26 13:27:43.000000000 -0500 > +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-09-26 17:52:12.000000000 -0500 > @@ -221,16 +221,26 @@ static void sdp_post_recv(struct sdp_soc > skb_frag_t *frag; > struct sdp_bsdh *h; > int id = ssk->rx_head; > + unsigned int gfp_page; > > /* Now, allocate and repost recv */ > /* TODO: allocate from cache */ > - skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, > - GFP_KERNEL); > + > + if (unlikely(ssk->isk.sk.sk_allocation)) { > + skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, > + ssk->isk.sk.sk_allocation); > + gfp_page = ssk->isk.sk.sk_allocation | __GFP_HIGHMEM; > + } else { > + skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, > + GFP_KERNEL); > + gfp_page = GFP_HIGHUSER; > + } > + > /* FIXME */ > BUG_ON(!skb); > h = (struct sdp_bsdh *)skb->head; > for (i = 0; i < ssk->recv_frags; ++i) { > - page = alloc_pages(GFP_HIGHUSER, 0); > + page = alloc_pages(gfp_page, 0); > BUG_ON(!page); > frag = &skb_shinfo(skb)->frags[i]; > frag->page = page; > @@ -404,6 +414,7 @@ void sdp_post_sends(struct sdp_sock *ssk > /* TODO: nonagle? 
*/ > struct sk_buff *skb; > int c; > + int gfp_page; > > if (unlikely(!ssk->id)) { > if (ssk->isk.sk.sk_send_head) { > @@ -415,6 +426,11 @@ void sdp_post_sends(struct sdp_sock *ssk > return; > } > > + if (unlikely(ssk->isk.sk.sk_allocation)) > + gfp_page = ssk->isk.sk.sk_allocation; > + else > + gfp_page = GFP_KERNEL; > + > if (ssk->recv_request && > ssk->rx_tail >= ssk->recv_request_head && > ssk->bufs >= SDP_MIN_BUFS && > @@ -424,7 +440,7 @@ void sdp_post_sends(struct sdp_sock *ssk > skb = sk_stream_alloc_skb(&ssk->isk.sk, > sizeof(struct sdp_bsdh) + > sizeof(*resp_size), > - GFP_KERNEL); > + gfp_page); > /* FIXME */ > BUG_ON(!skb); > resp_size = (struct sdp_chrecvbuf *)skb_put(skb, sizeof *resp_size); > @@ -449,7 +465,7 @@ void sdp_post_sends(struct sdp_sock *ssk > skb = sk_stream_alloc_skb(&ssk->isk.sk, > sizeof(struct sdp_bsdh) + > sizeof(*req_size), > - GFP_KERNEL); > + gfp_page); > /* FIXME */ > BUG_ON(!skb); > ssk->sent_request = SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE; > @@ -480,7 +496,7 @@ void sdp_post_sends(struct sdp_sock *ssk > ssk->bufs) { > skb = sk_stream_alloc_skb(&ssk->isk.sk, > sizeof(struct sdp_bsdh), > - GFP_KERNEL); > + gfp_page); > /* FIXME */ > BUG_ON(!skb); > sdp_post_send(ssk, skb, SDP_MID_DISCONN); > > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Nathan Dauchy > Sent: Tuesday, September 25, 2007 5:50 PM > To: general at lists.openfabrics.org > Subject: Re: [ofa-general] SDP memory allocation policy problem? > > Is there anyone among the OFED development team that is looking into > this issue? I believe that it is causing nodes to hang at our site. We > are running ofed-1.2 and the 2.6.9-55.ELsmp kernel. > > Workarounds or even untested patches would be appreciated. > > Thanks! 
> > -Nathan > > > Ken Phillips wrote: > > Greetings, > > > > Teammates here report the following: > > > > Problem > > > > The method SDP uses to allocate socket buffers may cause the > > node to hang under memory pressure. > > > > Details > > > > Each kernel level socket has an allocation flag to specify the > > memory allocation policy for socket buffers, the default is GFP_ATOMIC > > (or GFP_KERNEL for SDP). If the caller creates a socket with the > > policy set to GFP_NOFS or GFP_NOIO this should be the allocation > > policy used by the SDP layer. > > > > The problem we are seeing is that if a node is under load, and > > a memory allocation fails (say in sock_sendmsg()), the kernel will > > use the allocation policy to decide how to proceed with the allocation. > > If GFP_KERNEL is specified, then the kernel may attempt to free pages > > through the iSCSI block device that is making the socket call, which > > would result in a deadlock. Use of GFP_NOIO should prevent the kernel > > from using the IO backend to free memory resources. 
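The policy described in the Details paragraph above can be sketched in isolation: honour the allocation policy set on the socket when there is one, and fall back to the GFP_KERNEL default otherwise. The flag values below are mock constants chosen for illustration, not the kernel's real GFP_* definitions, and the helpers pick_skb_gfp() and may_start_io() are hypothetical names; the actual change is the one shown in the patches in this thread.

```c
#include <assert.h>

/* Mock flag bits for illustration only (assumption): the real GFP_*
 * values are defined by the kernel and vary between versions. */
enum {
    MOCK_GFP_WAIT   = 0x10, /* allocation may sleep */
    MOCK_GFP_IO     = 0x40, /* allocation may start block I/O */
    MOCK_GFP_FS     = 0x80, /* allocation may call into the filesystem */
    MOCK_GFP_NOIO   = MOCK_GFP_WAIT,
    MOCK_GFP_NOFS   = MOCK_GFP_WAIT | MOCK_GFP_IO,
    MOCK_GFP_KERNEL = MOCK_GFP_WAIT | MOCK_GFP_IO | MOCK_GFP_FS
};

/* Choose the flags for a socket buffer allocation: use the socket's own
 * policy if one was set, otherwise the permissive default. */
static unsigned int pick_skb_gfp(unsigned int sk_allocation)
{
    assert((sk_allocation & ~(unsigned int)MOCK_GFP_KERNEL) == 0);
    return sk_allocation ? sk_allocation : MOCK_GFP_KERNEL;
}

/* Nonzero if an allocation with these flags is allowed to reclaim memory
 * by issuing block I/O, which is the path that can deadlock when the
 * caller is itself the block device's transport. */
static int may_start_io(unsigned int gfp)
{
    return (gfp & MOCK_GFP_IO) != 0;
}
```

Under these mock values, a socket created with the NOIO policy yields flags for which may_start_io() is false, while a socket with no policy set keeps the default behaviour, so existing callers are unaffected.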
> > > > here is a sample stack trace from Alt-Sysrq during one of these > > lockups, > > > >> tx_worker D ffffff0014d14000 0 10195 1 10196 10194 > >> (L-TLB) > >> 00000100707e98d8 0000000000000046 0000000000000004 0000000000000212 > >> 0000000000000212 ffffffffa018ccae 0000000000000246 0000000000000246 > >> 000001007873c7f0 0000000000000320 > >> Call Trace:{:ib_mthca:mthca_poll_cq+2258} > >> {schedule_timeout+224} > >> {lock_sock+152} > >> {autoremove_wake_function+0} > >> {:ib_sdp:sdp_poll_cq+58} > >> {autoremove_wake_function+0} > >> {release_sock+16} > >> {:ib_sdp:sdp_sendmsg+33} > >> {sock_sendmsg+271} > >> {:ib_sdp:sdp_post_sends+619} > >> {release_sock+16} > >> {:ib_sdp:sdp_sendmsg+2222} > >> {autoremove_wake_function+0} > >> {:rs_iscsi:iscsi_sock_msg+1265} > >> {:rs_iscsi:iscsi_sock_msg+1261} > >> {recalc_task_prio+337} > >> {:rs_iscsi:scsi_command_i+5283} > >> {thread_return+0} > >> {thread_return+88} > >> {del_timer+107} > >> {del_singleshot_timer_sync+9} > >> {schedule_timeout+375} > >> {:rs_iscsi:tx_worker_proc_i+6819} > >> {child_rip+8} > >> {:rs_iscsi:tx_worker_proc_i+0} > >> {child_rip+0} > >> > >> > > > > We still don't know the scope of changes to be made, but we think, > > at minimum that some of the memory allocation in SDP should be changed, > > for example. > > > > diff -Naur old/drivers/infiniband/ulp/sdp/sdp_bcopy.c > > new/drivers/infiniband/ulp/sdp/sdp_bcopy.c > > --- old/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-06-21 > > 10:38:47.000000000 -0400 > > +++ new/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-08-31 > > 12:25:58.000000000 -0400 > > @@ -224,13 +224,27 @@ > > > > /* Now, allocate and repost recv */ > > /* TODO: allocate from cache */ > > + > > +#if (PROPOSED_SDP_FIX == 1) > > + skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, > > + (ssk->isk.sk.sk_allocation == 0) ? 
(GFP_KERNEL) : > > + ssk->isk.sk.sk_allocation); > > +#else > > skb = sk_stream_alloc_skb(&ssk->isk.sk, SDP_HEAD_SIZE, > > GFP_KERNEL); > > +#endif > > /* FIXME */ > > BUG_ON(!skb); > > h = (struct sdp_bsdh *)skb->head; > > for (i = 0; i < ssk->recv_frags; ++i) { > > +#if (PROPOSED_SDP_FIX == 1) > > + page = alloc_pages((ssk->isk.sk.sk_allocation == 0) > > + ? (GFP_HIGHUSER) : > > + (ssk->isk.sk.sk_allocation | (__GFP_HIGHMEM)), > > + 0); > > +#else > > page = alloc_pages(GFP_HIGHUSER, 0); > > +#endif > > BUG_ON(!page); > > frag = &skb_shinfo(skb)->frags[i]; > > frag->page = page; > > @@ -406,10 +420,18 @@ > > ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) { > > struct sdp_chrecvbuf *resp_size; > > ssk->recv_request = 0; > > +#if (PROPOSED_SDP_FIX == 1) > > + skb = sk_stream_alloc_skb(&ssk->isk.sk, > > + sizeof(struct sdp_bsdh) + > > + sizeof(*resp_size), > > + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : > > + ssk->isk.sk.sk_allocation); > > +#else > > skb = sk_stream_alloc_skb(&ssk->isk.sk, > > sizeof(struct sdp_bsdh) + > > sizeof(*resp_size), > > GFP_KERNEL); > > +#endif > > /* FIXME */ > > BUG_ON(!skb); > > resp_size = (struct sdp_chrecvbuf *)skb_put(skb, sizeof *resp_size); > > @@ -431,10 +453,18 @@ > > ssk->tx_head > ssk->sent_request_head + SDP_RESIZE_WAIT && > > ssk->tx_head - ssk->tx_tail < SDP_TX_SIZE) { > > struct sdp_chrecvbuf *req_size; > > +#if (PROPOSED_SDP_FIX == 1) > > + skb = sk_stream_alloc_skb(&ssk->isk.sk, > > + sizeof(struct sdp_bsdh) + > > + sizeof(*req_size), > > + (ssk->isk.sk.sk_allocation == 0) ? 
(GFP_KERNEL) : > > + ssk->isk.sk.sk_allocation); > > +#else > > skb = sk_stream_alloc_skb(&ssk->isk.sk, > > sizeof(struct sdp_bsdh) + > > sizeof(*req_size), > > GFP_KERNEL); > > +#endif > > /* FIXME */ > > BUG_ON(!skb); > > ssk->sent_request = SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE; > > @@ -463,9 +493,16 @@ > > (TCPF_FIN_WAIT1 | TCPF_LAST_ACK)) && > > !ssk->isk.sk.sk_send_head && > > ssk->bufs) { > > +#if (PROPOSED_SDP_FIX == 1) > > + skb = sk_stream_alloc_skb(&ssk->isk.sk, > > + sizeof(struct sdp_bsdh), > > + (ssk->isk.sk.sk_allocation == 0) ? (GFP_KERNEL) : > > + ssk->isk.sk.sk_allocation); > > +#else > > skb = sk_stream_alloc_skb(&ssk->isk.sk, > > sizeof(struct sdp_bsdh), > > GFP_KERNEL); > > +#endif > > /* FIXME */ > > BUG_ON(!skb); > > sdp_post_send(ssk, skb, SDP_MID_DISCONN); > > diff -Naur old/drivers/infiniband/ulp/sdp/sdp.h > > new/drivers/infiniband/ulp/sdp/sdp.h > > --- old/drivers/infiniband/ulp/sdp/sdp.h 2007-06-21 10:38:47.000000000 -0400 > > +++ new/drivers/infiniband/ulp/sdp/sdp.h 2007-08-31 12:25:45.000000000 -0400 > > @@ -7,6 +7,8 @@ > > #include /* For urgent data flags */ > > #include > > > > +#define PROPOSED_SDP_FIX 1 > > + > > #define sdp_printk(level, sk, format, arg...) \ > > printk(level "sdp_sock(%d:%d): " format, \ > > (sk) ? 
inet_sk(sk)->num : -1, \ > > > > > > > > > > --------------------- > > Best Regards > > K Phillips > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Thu Sep 27 11:38:39 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 27 Sep 2007 11:38:39 -0700 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. In-Reply-To: <20070923203649.8324.64524.stgit@dell3.ogc.int> References: <20070923203649.8324.64524.stgit@dell3.ogc.int> Message-ID: <46FBF8AF.9040700@ichips.intel.com> > The sysadmin creates "for iwarp use only" alias interfaces of the form > "devname:iw*" where devname is the native interface name (eg eth0) for the > iwarp netdev device. The alias label can be anything starting with "iw". > The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver. I'm still not sure about this, but haven't come up with anything better myself. And if there's a good chance of other rnic's needing the same support, I'd rather see the common code separated out, even if just encapsulated within this module for easy re-use. As for the code, I have a couple of questions about whether deadlock and a race condition are possible, plus a few minor comments. 
> +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) > +{ > + struct iwch_addrlist *addr; > + > + addr = kmalloc(sizeof *addr, GFP_KERNEL); > + if (!addr) { > + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", > + __FUNCTION__); > + return; > + } > + addr->ifa = ifa; > + mutex_lock(&rnicp->mutex); > + list_add_tail(&addr->entry, &rnicp->addrlist); > + mutex_unlock(&rnicp->mutex); > +} Should this return success/failure? > +static int nb_callback(struct notifier_block *self, unsigned long event, > + void *ctx) > +{ > + struct in_ifaddr *ifa = ctx; > + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); > + > + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); > + > + switch (event) { > + case NETDEV_UP: > + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && > + is_iwarp_label(ifa->ifa_label)) { > + PDBG("%s label %s addr 0x%x added\n", > + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); > + insert_ifa(rnicp, ifa); > + iwch_listeners_add_addr(rnicp, ifa->ifa_address); If insert_ifa() fails, what will iwch_listeners_add_addr() do? (I'm not easily seeing the relationship between the address list and the listen list at this point.) 
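To make the two questions above concrete, here is a minimal userspace sketch, not the driver code: insert_ifa() reports failure, and the NETDEV_UP path only adds listeners once the address has been recorded. The types `addr_node` and `dev_state`, the `fail_alloc` test hook, and `handle_addr_up()` are hypothetical stand-ins; malloc() stands in for kmalloc() and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct iwch_addrlist. */
struct addr_node {
	struct addr_node *next;
	uint32_t addr;
};

/* Hypothetical stand-in for the relevant part of struct iwch_dev. */
struct dev_state {
	struct addr_node *addrs;  /* tracked iwarp-only addresses */
	int listeners_added;      /* counts simulated listener additions */
	int fail_alloc;           /* test hook: force an allocation failure */
};

/* Record an address and report failure instead of only logging it. */
static int insert_ifa(struct dev_state *dev, uint32_t addr)
{
	struct addr_node *node;

	assert(dev != NULL);
	node = dev->fail_alloc ? NULL : malloc(sizeof *node);
	if (!node)
		return -ENOMEM;
	node->addr = addr;
	node->next = dev->addrs;
	dev->addrs = node;
	return 0;
}

/* NETDEV_UP path: only start listeners for addresses that are tracked. */
static int handle_addr_up(struct dev_state *dev, uint32_t addr)
{
	int err = insert_ifa(dev, addr);

	if (err)
		return err;
	dev->listeners_added++;   /* stands in for iwch_listeners_add_addr() */
	return 0;
}
```

With this shape the address list and the listen list cannot disagree after an allocation failure, at the cost of silently having no listener on that address until the next event.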
> + } > + break; > + case NETDEV_DOWN: > + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && > + is_iwarp_label(ifa->ifa_label)) { > + PDBG("%s label %s addr 0x%x deleted\n", > + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); > + iwch_listeners_del_addr(rnicp, ifa->ifa_address); > + remove_ifa(rnicp, ifa); > + } > + break; > + default: > + break; > + } > + return 0; > +} > + > +static void delete_addrlist(struct iwch_dev *rnicp) > +{ > + struct iwch_addrlist *addr, *tmp; > + > + mutex_lock(&rnicp->mutex); > + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { > + list_del(&addr->entry); > + kfree(addr); > + } > + mutex_unlock(&rnicp->mutex); > +} > + > +static void populate_addrlist(struct iwch_dev *rnicp) > +{ > + int i; > + struct in_device *indev; > + > + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { > + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); > + if (!indev) > + continue; > + for_ifa(indev) > + if (is_iwarp_label(ifa->ifa_label)) { > + PDBG("%s label %s addr 0x%x added\n", > + __FUNCTION__, ifa->ifa_label, > + ifa->ifa_address); > + insert_ifa(rnicp, ifa); > + } > + endfor_ifa(indev); > + } > +} > + > static void rnic_init(struct iwch_dev *rnicp) > { > PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); > @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r > idr_init(&rnicp->qpidr); > idr_init(&rnicp->mmidr); > spin_lock_init(&rnicp->lock); > + INIT_LIST_HEAD(&rnicp->addrlist); > + INIT_LIST_HEAD(&rnicp->listen_eps); > + mutex_init(&rnicp->mutex); > + rnicp->nb.notifier_call = nb_callback; > + populate_addrlist(rnicp); > + register_inetaddr_notifier(&rnicp->nb); > > rnicp->attr.vendor_id = 0x168; > rnicp->attr.vendor_part_id = 7; > @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev > mutex_lock(&dev_mutex); > list_for_each_entry_safe(dev, tmp, &dev_list, entry) { > if (dev->rdev.t3cdev_p == tdev) { > + unregister_inetaddr_notifier(&dev->nb); > + delete_addrlist(dev); > list_del(&dev->entry); > 
iwch_unregister_device(dev); > cxio_rdev_close(&dev->rdev); > diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h > index caf4e60..7fa0a47 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch.h > +++ b/drivers/infiniband/hw/cxgb3/iwch.h > @@ -36,6 +36,8 @@ #include > #include > #include > #include > +#include > +#include > > #include > > @@ -101,6 +103,11 @@ struct iwch_rnic_attributes { > u32 cq_overflow_detection; > }; > > +struct iwch_addrlist { > + struct list_head entry; > + struct in_ifaddr *ifa; > +}; > + > struct iwch_dev { > struct ib_device ibdev; > struct cxio_rdev rdev; > @@ -111,6 +118,10 @@ struct iwch_dev { > struct idr mmidr; > spinlock_t lock; > struct list_head entry; > + struct notifier_block nb; > + struct list_head addrlist; > + struct list_head listen_eps; The behavior of the listen lists is similar to what's done in the rdma_cm: Wildcard listens are stored in a listen_any_list. When new devices are added, associated listens are added to each device. This is similar, except we're dealing with devices and addresses. I'm wondering if we shouldn't mimic the same behavior and track listens in iwch_addrlist directly. (I don't see anything wrong with this approach though.) What happens if an address changes between iwarp only and non-iwarp? How are listens on specific addresses handled from an rdma_cm level? Does the rdma_cm map the address to the device, call the iw_cm to listen, which in turn calls the device listen function? The device then checks that the address has been marked as iwarp only? (I'm being too lazy to trace this through the code, but if you don't know off the top of your head, I will do that.) 
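For illustration only, tracking listens in the address entries directly might look roughly like the userspace sketch below. The struct layouts and the helpers `addr_add_listener()` and `addr_drop_listeners()` are hypothetical, not taken from the patch or from the rdma_cm; malloc()/free() stand in for the kernel allocators and locking is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: one hardware listener (server TID) bound to an address. */
struct listener {
	struct listener *next;
	unsigned int stid;
};

/* Hypothetical variant of iwch_addrlist that owns its listeners. */
struct addr_entry {
	struct addr_entry *next;
	uint32_t addr;
	struct listener *listeners;
};

/* Attach a listener to the address it is bound to. Returns 0 or -1. */
static int addr_add_listener(struct addr_entry *ae, unsigned int stid)
{
	struct listener *l;

	assert(ae != NULL);
	l = malloc(sizeof *l);
	if (!l)
		return -1;
	l->stid = stid;
	l->next = ae->listeners;
	ae->listeners = l;
	return 0;
}

/* Address went away: drop every listener on it and return the count. */
static int addr_drop_listeners(struct addr_entry *ae)
{
	int dropped = 0;

	while (ae->listeners) {
		struct listener *l = ae->listeners;

		ae->listeners = l->next;
		free(l);
		dropped++;
	}
	return dropped;
}
```

The point of this layout is that a NETDEV_DOWN event only has to walk one address entry, rather than every listening endpoint on the device.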
> + struct mutex mutex; > }; > > static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) > diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c > index 1cdfcd4..afc8a48 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c > +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c > @@ -1127,23 +1127,149 @@ static int act_open_rpl(struct t3cdev *t > return CPL_RET_BUF_DONE; > } > > -static int listen_start(struct iwch_listen_ep *ep) > +static int wait_for_reply(struct iwch_ep_common *epc) > +{ > + PDBG("%s ep %p waiting\n", __FUNCTION__, epc); > + wait_event(epc->waitq, epc->rpl_done); > + PDBG("%s ep %p done waiting err %d\n", __FUNCTION__, epc, epc->rpl_err); > + return epc->rpl_err; > +} What thread is being blocked here, and what sets the event? > + > +static struct iwch_listen_entry *alloc_listener(struct iwch_listen_ep *ep, > + __be32 addr) > +{ > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + struct iwch_listen_entry *le; > + > + le = kmalloc(sizeof *le, GFP_KERNEL); > + if (!le) { > + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", > + __FUNCTION__); > + return NULL; > + } > + le->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, > + &t3c_client, ep); > + if (le->stid == -1) { > + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", > + __FUNCTION__); > + kfree(le); > + return NULL; > + } > + le->addr = addr; > + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, > + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); > + return le; > +} > + > +static void dealloc_listener(struct iwch_listen_ep *ep, > + struct iwch_listen_entry *le) > +{ > + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, > + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); > + cxgb3_free_stid(ep->com.tdev, le->stid); > + kfree(le); > +} > + > +static void dealloc_listener_list(struct iwch_listen_ep *ep) > +{ > + struct iwch_listen_entry *le, *tmp; > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + > + 
mutex_lock(&h->mutex); > + list_for_each_entry_safe(le, tmp, &ep->listeners, entry) { > + list_del(&le->entry); > + dealloc_listener(ep, le); > + } > + mutex_unlock(&h->mutex); > +} > + > +static int alloc_listener_list(struct iwch_listen_ep *ep) > +{ > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); nit: in place of 'h' in several places, how about 'rnicp', which is also used? > + struct iwch_addrlist *addr; > + struct iwch_listen_entry *le; > + int err = 0; > + int added=0; > + mutex_lock(&h->mutex); > + list_for_each_entry(addr, &h->addrlist, entry) { > + if (ep->com.local_addr.sin_addr.s_addr == 0 || > + ep->com.local_addr.sin_addr.s_addr == > + addr->ifa->ifa_address) { > + le = alloc_listener(ep, addr->ifa->ifa_address); > + if (!le) > + break; > + list_add_tail(&le->entry, &ep->listeners); > + added++; > + } > + } > + mutex_unlock(&h->mutex); > + if (ep->com.local_addr.sin_addr.s_addr != 0 && !added) > + err = -EADDRNOTAVAIL; > + if (!err && !added) > + printk(KERN_WARNING MOD > + "No RDMA interface found for device %s\n", > + pci_name(h->rdev.rnic_info.pdev)); > + return err; > +} Adding some white space would improve readability. 
> + > +static int listen_stop_one(struct iwch_listen_ep *ep, unsigned int stid) > { > struct sk_buff *skb; > - struct cpl_pass_open_req *req; > + struct cpl_close_listserv_req *req; > + > + PDBG("%s stid %u\n", __FUNCTION__, stid); > + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > + if (!skb) { > + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); > + return -ENOMEM; > + } > + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); > + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > + req->cpu_idx = 0; > + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, stid)); > + skb->priority = 1; > + ep->com.rpl_err = 0; > + ep->com.rpl_done = 0; > + cxgb3_ofld_send(ep->com.tdev, skb); > + return wait_for_reply(&ep->com); > +} > + > +static int listen_stop(struct iwch_listen_ep *ep) > +{ > + struct iwch_listen_entry *le; > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + int err = 0; > > PDBG("%s ep %p\n", __FUNCTION__, ep); > + mutex_lock(&h->mutex); > + list_for_each_entry(le, &ep->listeners, entry) { > + err = listen_stop_one(ep, le->stid); This ends up blocking while holding a mutex, which looks like deadlock potential. 
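One common way around sleeping with a lock held, sketched here in userspace C with pthreads rather than kernel mutexes: copy the ids out under the lock, drop the lock, then issue the blocking calls. Everything in this block (`listen_state`, `stop_log`, `listen_stop_all()`, the callback type, the fixed `MAX_LISTENERS` bound) is a hypothetical illustration, and it assumes the caller can tolerate the list changing after the snapshot is taken.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_LISTENERS 16

/* Hypothetical stand-in for a listening endpoint and its stids. */
struct listen_state {
	pthread_mutex_t mutex;
	unsigned int stids[MAX_LISTENERS];
	size_t count;
};

/* Hypothetical blocking operation: send a request, wait for the reply. */
typedef int (*stop_one_fn)(unsigned int stid, void *ctx);

static int listen_stop_all(struct listen_state *ls, stop_one_fn stop_one,
			   void *ctx)
{
	unsigned int snapshot[MAX_LISTENERS];
	size_t i, n;
	int err = 0;

	/* Copy the ids while holding the lock; nothing sleeps in here. */
	pthread_mutex_lock(&ls->mutex);
	n = ls->count;
	assert(n <= MAX_LISTENERS);
	for (i = 0; i < n; i++)
		snapshot[i] = ls->stids[i];
	pthread_mutex_unlock(&ls->mutex);

	/* Block on each reply with the lock released; stop at first error. */
	for (i = 0; i < n && !err; i++)
		err = stop_one(snapshot[i], ctx);
	return err;
}

/* Example callback: logs each stid and can be told to fail on one. */
struct stop_log {
	unsigned int seen[MAX_LISTENERS];
	size_t n;
	unsigned int fail_on;  /* stid that reports an error, 0 for none */
};

static int stop_one_logged(unsigned int stid, void *ctx)
{
	struct stop_log *log = ctx;

	log->seen[log->n++] = stid;
	return stid == log->fail_on ? -1 : 0;
}
```

The trade-off is that an entry removed by another thread after the snapshot may be stopped twice, so the real code would need reference counting or a per-entry state flag to make that safe.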
> + if (err) > + break; > + } > + mutex_unlock(&h->mutex); > + return err; > +} > + > +static int listen_start_one(struct iwch_listen_ep *ep, unsigned int stid, > + __be32 addr, __be16 port) > +{ > + struct sk_buff *skb; > + struct cpl_pass_open_req *req; > + > + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, stid, ntohl(addr), > + ntohs(port)); > skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > if (!skb) { > - printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); > + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); > return -ENOMEM; > } > > req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); > req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); > - req->local_port = ep->com.local_addr.sin_port; > - req->local_ip = ep->com.local_addr.sin_addr.s_addr; > + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, stid)); > + req->local_port = port; > + req->local_ip = addr; > req->peer_port = 0; > req->peer_ip = 0; > req->peer_netmask = 0; > @@ -1152,8 +1278,32 @@ static int listen_start(struct iwch_list > req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); > > skb->priority = 1; > + ep->com.rpl_err = 0; > + ep->com.rpl_done = 0; > cxgb3_ofld_send(ep->com.tdev, skb); > - return 0; > + return wait_for_reply(&ep->com); > +} > + > +static int listen_start(struct iwch_listen_ep *ep) > +{ > + struct iwch_listen_entry *le; > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > + int err = 0; > + > + PDBG("%s ep %p\n", __FUNCTION__, ep); > + mutex_lock(&h->mutex); > + list_for_each_entry(le, &ep->listeners, entry) { > + err = listen_start_one(ep, le->stid, le->addr, > + ep->com.local_addr.sin_port); Similar to above - blocking while holding a mutex. There are a couple of other places where this also occurs. 
> + if (err) > + goto fail; > + } > + mutex_unlock(&h->mutex); > + return err; > +fail: > + mutex_unlock(&h->mutex); > + listen_stop(ep); > + return err; > } > > static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) > @@ -1170,39 +1320,59 @@ static int pass_open_rpl(struct t3cdev * > return CPL_RET_BUF_DONE; > } > > -static int listen_stop(struct iwch_listen_ep *ep) > -{ > - struct sk_buff *skb; > - struct cpl_close_listserv_req *req; > - > - PDBG("%s ep %p\n", __FUNCTION__, ep); > - skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > - if (!skb) { > - printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); > - return -ENOMEM; > - } > - req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); > - req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > - req->cpu_idx = 0; > - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); > - skb->priority = 1; > - cxgb3_ofld_send(ep->com.tdev, skb); > - return 0; > -} > - > static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, > void *ctx) > { > struct iwch_listen_ep *ep = ctx; > struct cpl_close_listserv_rpl *rpl = cplhdr(skb); > > - PDBG("%s ep %p\n", __FUNCTION__, ep); > + PDBG("%s ep %p stid %u\n", __FUNCTION__, ep, GET_TID(rpl)); > + > ep->com.rpl_err = status2errno(rpl->status); > ep->com.rpl_done = 1; > wake_up(&ep->com.waitq); > return CPL_RET_BUF_DONE; > } > > +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr) > +{ > + struct iwch_listen_ep *listen_ep; > + struct iwch_listen_entry *le; > + > + mutex_lock(&rnicp->mutex); > + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { > + if (listen_ep->com.local_addr.sin_addr.s_addr) > + continue; > + le = alloc_listener(listen_ep, addr); > + if (le) { > + list_add_tail(&le->entry, &listen_ep->listeners); > + listen_start_one(listen_ep, le->stid, addr, > + listen_ep->com.local_addr.sin_port); > + } > + } > + mutex_unlock(&rnicp->mutex); > +} > + > +void 
iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr) > +{ > + struct iwch_listen_ep *listen_ep; > + struct iwch_listen_entry *le, *tmp; > + > + mutex_lock(&rnicp->mutex); > + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { > + if (listen_ep->com.local_addr.sin_addr.s_addr) > + continue; > + list_for_each_entry_safe(le, tmp, &listen_ep->listeners, > + entry) > + if (le->addr == addr) { > + listen_stop_one(listen_ep, le->stid); > + list_del(&le->entry); > + dealloc_listener(listen_ep, le); > + } > + } > + mutex_unlock(&rnicp->mutex); > +} > + > static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) > { > struct cpl_pass_accept_rpl *rpl; > @@ -1767,8 +1937,7 @@ int iwch_accept_cr(struct iw_cm_id *cm_i > goto err; > > /* wait for wr_ack */ > - wait_event(ep->com.waitq, ep->com.rpl_done); > - err = ep->com.rpl_err; > + err = wait_for_reply(&ep->com); > if (err) > goto err; > > @@ -1887,31 +2056,23 @@ int iwch_create_listen(struct iw_cm_id * > ep->com.cm_id = cm_id; > ep->backlog = backlog; > ep->com.local_addr = cm_id->local_addr; > + INIT_LIST_HEAD(&ep->listeners); > > - /* > - * Allocate a server TID. > - */ > - ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); > - if (ep->stid == -1) { > - printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); > - err = -ENOMEM; > + err = alloc_listener_list(ep); > + if (err) > goto fail2; > - } > > state_set(&ep->com, LISTEN); > err = listen_start(ep); > - if (err) > - goto fail3; > > - /* wait for pass_open_rpl */ > - wait_event(ep->com.waitq, ep->com.rpl_done); > - err = ep->com.rpl_err; > if (!err) { > cm_id->provider_data = ep; > + mutex_lock(&h->mutex); > + list_add_tail(&ep->entry, &h->listen_eps); > + mutex_unlock(&h->mutex); Is there a race between listen_start() being called and inserting the ep into the list? Could anything try to find the ep on the list after listen_start returns? 
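If the window in question is an address event arriving after listen_start() returns but before the ep is on listen_eps, one possible ordering is to publish the ep first and unpublish it on failure. The userspace sketch below shows only that ordering; `rnic`, `listen_ep`, `create_listen()` and the `start_fn` callbacks are hypothetical, and it assumes the event handlers can cope with an ep that is on the list but not yet fully started.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for iwch_listen_ep and iwch_dev. */
struct listen_ep {
	struct listen_ep *next;
	int listening;
};

struct rnic {
	pthread_mutex_t mutex;
	struct listen_ep *listen_eps;
};

/* Hypothetical start operation; may block, so it runs without the lock. */
typedef int (*start_fn)(struct listen_ep *ep);

static void publish_ep(struct rnic *dev, struct listen_ep *ep)
{
	pthread_mutex_lock(&dev->mutex);
	ep->next = dev->listen_eps;
	dev->listen_eps = ep;
	pthread_mutex_unlock(&dev->mutex);
}

static void unpublish_ep(struct rnic *dev, struct listen_ep *ep)
{
	struct listen_ep **pp;

	pthread_mutex_lock(&dev->mutex);
	for (pp = &dev->listen_eps; *pp; pp = &(*pp)->next) {
		if (*pp == ep) {
			*pp = ep->next;
			break;
		}
	}
	pthread_mutex_unlock(&dev->mutex);
}

/* Publish before starting, so no address event can miss the endpoint. */
static int create_listen(struct rnic *dev, struct listen_ep *ep,
			 start_fn start)
{
	int err;

	assert(dev && ep && start);
	publish_ep(dev, ep);
	err = start(ep);
	if (err) {
		unpublish_ep(dev, ep);
		return err;
	}
	ep->listening = 1;
	return 0;
}

/* Example start callbacks for exercising both paths. */
static int start_ok(struct listen_ep *ep) { (void)ep; return 0; }
static int start_fail(struct listen_ep *ep) { (void)ep; return -1; }
```

This trades one window for another: an event handler may now see an ep whose listeners are still being started, so it only helps if that case is handled.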
> goto out; > } > -fail3: > - cxgb3_free_stid(ep->com.tdev, ep->stid); > + dealloc_listener_list(ep); > fail2: > cm_id->rem_ref(cm_id); > put_ep(&ep->com); > @@ -1923,18 +2084,20 @@ out: > int iwch_destroy_listen(struct iw_cm_id *cm_id) > { > int err; > + struct iwch_dev *h = to_iwch_dev(cm_id->device); > struct iwch_listen_ep *ep = to_listen_ep(cm_id); > > PDBG("%s ep %p\n", __FUNCTION__, ep); > > might_sleep(); > + mutex_lock(&h->mutex); > + list_del(&ep->entry); > + mutex_unlock(&h->mutex); > state_set(&ep->com, DEAD); > ep->com.rpl_done = 0; > ep->com.rpl_err = 0; > err = listen_stop(ep); > - wait_event(ep->com.waitq, ep->com.rpl_done); > - cxgb3_free_stid(ep->com.tdev, ep->stid); > - err = ep->com.rpl_err; > + dealloc_listener_list(ep); > cm_id->rem_ref(cm_id); > put_ep(&ep->com); > return err; > diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h > index 6107e7c..23e5a22 100644 > --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h > +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h > @@ -162,10 +162,19 @@ struct iwch_ep_common { > int rpl_err; > }; > > -struct iwch_listen_ep { > - struct iwch_ep_common com; > +struct iwch_listen_entry { > + struct list_head entry; > unsigned int stid; > + __be32 addr; > +}; > + > +struct iwch_listen_ep { > + struct iwch_ep_common com; /* Must be first entry! */ > + struct list_head entry; > + struct list_head listeners; > int backlog; > + int listen_count; I didn't notice where this was used. > + int listen_rpls; or this. 
> }; > > struct iwch_ep { > @@ -222,6 +231,8 @@ int iwch_resume_tid(struct iwch_ep *ep); > void __free_ep(struct kref *kref); > void iwch_rearp(struct iwch_ep *ep); > int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, struct l2t_entry *l2t); > +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr); > +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr); > > int __init iwch_cm_init(void); > void __exit iwch_cm_term(void); > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From Arkady.Kanevsky at netapp.com Thu Sep 27 11:56:17 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 27 Sep 2007 14:56:17 -0400 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. In-Reply-To: <46FBF8AF.9040700@ichips.intel.com> References: <20070923203649.8324.64524.stgit@dell3.ogc.int> <46FBF8AF.9040700@ichips.intel.com> Message-ID: Sean, What is the model for how a client connects, say for iSCSI, when client and server both support iWARP and 10GbE or 1GbE, and would like to set up the "most" performant "connection" for the ULP? Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, September 27, 2007 2:39 PM > To: Steve Wise > Cc: netdev at vger.kernel.org; rdreier at cisco.com; > general at lists.openfabrics.org; linux-kernel at vger.kernel.org > Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support > "iwarp-only" interfaces to avoid 4-tuple conflicts. 
> > > The sysadmin creates "for iwarp use only" alias interfaces > of the form > > "devname:iw*" where devname is the native interface name > (eg eth0) for > > the iwarp netdev device. The alias label can be anything > starting with "iw". > > The "iw" immediately after the ':' is the key used by the > iw_cxgb3 driver. > > I'm still not sure about this, but haven't come up with > anything better myself. And if there's a good chance of > other rnic's needing the same support, I'd rather see the > common code separated out, even if just encapsulated within > this module for easy re-use. > > As for the code, I have a couple of questions about whether > deadlock and a race condition are possible, plus a few minor comments. > > > +static void insert_ifa(struct iwch_dev *rnicp, struct > in_ifaddr *ifa) > > +{ > > + struct iwch_addrlist *addr; > > + > > + addr = kmalloc(sizeof *addr, GFP_KERNEL); > > + if (!addr) { > > + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", > > + __FUNCTION__); > > + return; > > + } > > + addr->ifa = ifa; > > + mutex_lock(&rnicp->mutex); > > + list_add_tail(&addr->entry, &rnicp->addrlist); > > + mutex_unlock(&rnicp->mutex); > > +} > > Should this return success/failure? > > > +static int nb_callback(struct notifier_block *self, > unsigned long event, > > + void *ctx) > > +{ > > + struct in_ifaddr *ifa = ctx; > > + struct iwch_dev *rnicp = container_of(self, struct > iwch_dev, nb); > > + > > + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); > > + > > + switch (event) { > > + case NETDEV_UP: > > + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && > > + is_iwarp_label(ifa->ifa_label)) { > > + PDBG("%s label %s addr 0x%x added\n", > > + __FUNCTION__, ifa->ifa_label, > ifa->ifa_address); > > + insert_ifa(rnicp, ifa); > > + iwch_listeners_add_addr(rnicp, > ifa->ifa_address); > > If insert_ifa() fails, what will iwch_listeners_add_addr() > do? 
(I'm not easily seeing the relationship between the > address list and the listen list at this point.) > > > + } > > + break; > > + case NETDEV_DOWN: > > + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && > > + is_iwarp_label(ifa->ifa_label)) { > > + PDBG("%s label %s addr 0x%x deleted\n", > > + __FUNCTION__, ifa->ifa_label, > ifa->ifa_address); > > + iwch_listeners_del_addr(rnicp, > ifa->ifa_address); > > + remove_ifa(rnicp, ifa); > > + } > > + break; > > + default: > > + break; > > + } > > + return 0; > > +} > > + > > +static void delete_addrlist(struct iwch_dev *rnicp) { > > + struct iwch_addrlist *addr, *tmp; > > + > > + mutex_lock(&rnicp->mutex); > > + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { > > + list_del(&addr->entry); > > + kfree(addr); > > + } > > + mutex_unlock(&rnicp->mutex); > > +} > > + > > +static void populate_addrlist(struct iwch_dev *rnicp) { > > + int i; > > + struct in_device *indev; > > + > > + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { > > + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); > > + if (!indev) > > + continue; > > + for_ifa(indev) > > + if (is_iwarp_label(ifa->ifa_label)) { > > + PDBG("%s label %s addr 0x%x added\n", > > + __FUNCTION__, ifa->ifa_label, > > + ifa->ifa_address); > > + insert_ifa(rnicp, ifa); > > + } > > + endfor_ifa(indev); > > + } > > +} > > + > > static void rnic_init(struct iwch_dev *rnicp) { > > PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); @@ > -70,6 +187,12 @@ > > static void rnic_init(struct iwch_dev *r > > idr_init(&rnicp->qpidr); > > idr_init(&rnicp->mmidr); > > spin_lock_init(&rnicp->lock); > > + INIT_LIST_HEAD(&rnicp->addrlist); > > + INIT_LIST_HEAD(&rnicp->listen_eps); > > + mutex_init(&rnicp->mutex); > > + rnicp->nb.notifier_call = nb_callback; > > + populate_addrlist(rnicp); > > + register_inetaddr_notifier(&rnicp->nb); > > > > rnicp->attr.vendor_id = 0x168; > > rnicp->attr.vendor_part_id = 7; > > @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev > > 
mutex_lock(&dev_mutex); > > list_for_each_entry_safe(dev, tmp, &dev_list, entry) { > > if (dev->rdev.t3cdev_p == tdev) { > > + unregister_inetaddr_notifier(&dev->nb); > > + delete_addrlist(dev); > > list_del(&dev->entry); > > iwch_unregister_device(dev); > > cxio_rdev_close(&dev->rdev); > > diff --git a/drivers/infiniband/hw/cxgb3/iwch.h > > b/drivers/infiniband/hw/cxgb3/iwch.h > > index caf4e60..7fa0a47 100644 > > --- a/drivers/infiniband/hw/cxgb3/iwch.h > > +++ b/drivers/infiniband/hw/cxgb3/iwch.h > > @@ -36,6 +36,8 @@ #include #include > > > #include #include > > +#include > > +#include > > > > #include > > > > @@ -101,6 +103,11 @@ struct iwch_rnic_attributes { > > u32 cq_overflow_detection; > > }; > > > > +struct iwch_addrlist { > > + struct list_head entry; > > + struct in_ifaddr *ifa; > > +}; > > + > > struct iwch_dev { > > struct ib_device ibdev; > > struct cxio_rdev rdev; > > @@ -111,6 +118,10 @@ struct iwch_dev { > > struct idr mmidr; > > spinlock_t lock; > > struct list_head entry; > > + struct notifier_block nb; > > + struct list_head addrlist; > > + struct list_head listen_eps; > > The behavior of the listen lists is similar to what's done in the > rdma_cm: Wildcard listens are stored in a listen_any_list. > When new devices are added, associated listens are added to > each device. This is similar, except we're dealing with > devices and addresses. I'm wondering if we shouldn't mimic > the same behavior and track listens in iwch_addrlist > directly. (I don't see anything wrong with this approach > though.) > > What happens if an address changes between iwarp only and non-iwarp? > > How are listens on specific addresses handled from an rdma_cm level? > Does the rdma_cm map the address to the device, call the > iw_cm to listen, which in turn calls the device listen > function? The device then checks that the address has been > marked as iwarp only? 
(I'm being too lazy to trace this > through the code, but if you don't know off the top of your > head, I will do that.) > > > + struct mutex mutex; > > }; > > > > static inline struct iwch_dev *to_iwch_dev(struct > ib_device *ibdev) > > diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c > > b/drivers/infiniband/hw/cxgb3/iwch_cm.c > > index 1cdfcd4..afc8a48 100644 > > --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c > > +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c > > @@ -1127,23 +1127,149 @@ static int act_open_rpl(struct t3cdev *t > > return CPL_RET_BUF_DONE; > > } > > > > -static int listen_start(struct iwch_listen_ep *ep) > > +static int wait_for_reply(struct iwch_ep_common *epc) { > > + PDBG("%s ep %p waiting\n", __FUNCTION__, epc); > > + wait_event(epc->waitq, epc->rpl_done); > > + PDBG("%s ep %p done waiting err %d\n", __FUNCTION__, > epc, epc->rpl_err); > > + return epc->rpl_err; > > +} > > What thread is being blocked here, and what sets the event? > > > + > > +static struct iwch_listen_entry *alloc_listener(struct > iwch_listen_ep *ep, > > + __be32 addr) > > +{ > > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > > + struct iwch_listen_entry *le; > > + > > + le = kmalloc(sizeof *le, GFP_KERNEL); > > + if (!le) { > > + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", > > + __FUNCTION__); > > + return NULL; > > + } > > + le->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, > > + &t3c_client, ep); > > + if (le->stid == -1) { > > + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", > > + __FUNCTION__); > > + kfree(le); > > + return NULL; > > + } > > + le->addr = addr; > > + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, > > + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); > > + return le; > > +} > > + > > +static void dealloc_listener(struct iwch_listen_ep *ep, > > + struct iwch_listen_entry *le) { > > + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, > > + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); > > + 
cxgb3_free_stid(ep->com.tdev, le->stid); > > + kfree(le); > > +} > > + > > +static void dealloc_listener_list(struct iwch_listen_ep *ep) { > > + struct iwch_listen_entry *le, *tmp; > > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > > + > > + mutex_lock(&h->mutex); > > + list_for_each_entry_safe(le, tmp, &ep->listeners, entry) { > > + list_del(&le->entry); > > + dealloc_listener(ep, le); > > + } > > + mutex_unlock(&h->mutex); > > +} > > + > > +static int alloc_listener_list(struct iwch_listen_ep *ep) { > > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > > nit: in place of 'h' in several places, how about 'rnicp', > which is also used? > > > + struct iwch_addrlist *addr; > > + struct iwch_listen_entry *le; > > + int err = 0; > > + int added=0; > > + mutex_lock(&h->mutex); > > + list_for_each_entry(addr, &h->addrlist, entry) { > > + if (ep->com.local_addr.sin_addr.s_addr == 0 || > > + ep->com.local_addr.sin_addr.s_addr == > > + addr->ifa->ifa_address) { > > + le = alloc_listener(ep, addr->ifa->ifa_address); > > + if (!le) > > + break; > > + list_add_tail(&le->entry, &ep->listeners); > > + added++; > > + } > > + } > > + mutex_unlock(&h->mutex); > > + if (ep->com.local_addr.sin_addr.s_addr != 0 && !added) > > + err = -EADDRNOTAVAIL; > > + if (!err && !added) > > + printk(KERN_WARNING MOD > > + "No RDMA interface found for device %s\n", > > + pci_name(h->rdev.rnic_info.pdev)); > > + return err; > > +} > > Adding some white space would improve readability. 
> > > + > > +static int listen_stop_one(struct iwch_listen_ep *ep, > unsigned int > > +stid) > > { > > struct sk_buff *skb; > > - struct cpl_pass_open_req *req; > > + struct cpl_close_listserv_req *req; > > + > > + PDBG("%s stid %u\n", __FUNCTION__, stid); > > + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > > + if (!skb) { > > + printk(KERN_ERR MOD "%s - failed to alloc > skb\n", __FUNCTION__); > > + return -ENOMEM; > > + } > > + req = (struct cpl_close_listserv_req *) skb_put(skb, > sizeof(*req)); > > + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > > + req->cpu_idx = 0; > > + OPCODE_TID(req) = > htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, stid)); > > + skb->priority = 1; > > + ep->com.rpl_err = 0; > > + ep->com.rpl_done = 0; > > + cxgb3_ofld_send(ep->com.tdev, skb); > > + return wait_for_reply(&ep->com); > > +} > > + > > +static int listen_stop(struct iwch_listen_ep *ep) { > > + struct iwch_listen_entry *le; > > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > > + int err = 0; > > > > PDBG("%s ep %p\n", __FUNCTION__, ep); > > + mutex_lock(&h->mutex); > > + list_for_each_entry(le, &ep->listeners, entry) { > > + err = listen_stop_one(ep, le->stid); > > This ends up blocking while holding a mutex, which looks like > deadlock potential. 
> > > + if (err) > > + break; > > + } > > + mutex_unlock(&h->mutex); > > + return err; > > +} > > + > > +static int listen_start_one(struct iwch_listen_ep *ep, > unsigned int stid, > > + __be32 addr, __be16 port) > > +{ > > + struct sk_buff *skb; > > + struct cpl_pass_open_req *req; > > + > > + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, > stid, ntohl(addr), > > + ntohs(port)); > > skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > > if (!skb) { > > - printk(KERN_ERR MOD "t3c_listen_start failed to > alloc skb!\n"); > > + printk(KERN_ERR MOD "%s - failed to alloc > skb\n", __FUNCTION__); > > return -ENOMEM; > > } > > > > req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); > > req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > > - OPCODE_TID(req) = > htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); > > - req->local_port = ep->com.local_addr.sin_port; > > - req->local_ip = ep->com.local_addr.sin_addr.s_addr; > > + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, stid)); > > + req->local_port = port; > > + req->local_ip = addr; > > req->peer_port = 0; > > req->peer_ip = 0; > > req->peer_netmask = 0; > > @@ -1152,8 +1278,32 @@ static int listen_start(struct iwch_list > > req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); > > > > skb->priority = 1; > > + ep->com.rpl_err = 0; > > + ep->com.rpl_done = 0; > > cxgb3_ofld_send(ep->com.tdev, skb); > > - return 0; > > + return wait_for_reply(&ep->com); > > +} > > + > > +static int listen_start(struct iwch_listen_ep *ep) { > > + struct iwch_listen_entry *le; > > + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > > + int err = 0; > > + > > + PDBG("%s ep %p\n", __FUNCTION__, ep); > > + mutex_lock(&h->mutex); > > + list_for_each_entry(le, &ep->listeners, entry) { > > + err = listen_start_one(ep, le->stid, le->addr, > > + ep->com.local_addr.sin_port); > > Similar to above - blocking while holding a mutex. There are > a couple of other places where this also occurs. 
> > > + if (err) > > + goto fail; > > + } > > + mutex_unlock(&h->mutex); > > + return err; > > +fail: > > + mutex_unlock(&h->mutex); > > + listen_stop(ep); > > + return err; > > } > > > > static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, > > void *ctx) @@ -1170,39 +1320,59 @@ static int > pass_open_rpl(struct t3cdev * > > return CPL_RET_BUF_DONE; > > } > > > > -static int listen_stop(struct iwch_listen_ep *ep) -{ > > - struct sk_buff *skb; > > - struct cpl_close_listserv_req *req; > > - > > - PDBG("%s ep %p\n", __FUNCTION__, ep); > > - skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > > - if (!skb) { > > - printk(KERN_ERR MOD "%s - failed to alloc > skb\n", __FUNCTION__); > > - return -ENOMEM; > > - } > > - req = (struct cpl_close_listserv_req *) skb_put(skb, > sizeof(*req)); > > - req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > > - req->cpu_idx = 0; > > - OPCODE_TID(req) = > htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); > > - skb->priority = 1; > > - cxgb3_ofld_send(ep->com.tdev, skb); > > - return 0; > > -} > > - > > static int close_listsrv_rpl(struct t3cdev *tdev, struct > sk_buff *skb, > > void *ctx) > > { > > struct iwch_listen_ep *ep = ctx; > > struct cpl_close_listserv_rpl *rpl = cplhdr(skb); > > > > - PDBG("%s ep %p\n", __FUNCTION__, ep); > > + PDBG("%s ep %p stid %u\n", __FUNCTION__, ep, GET_TID(rpl)); > > + > > ep->com.rpl_err = status2errno(rpl->status); > > ep->com.rpl_done = 1; > > wake_up(&ep->com.waitq); > > return CPL_RET_BUF_DONE; > > } > > > > +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr) { > > + struct iwch_listen_ep *listen_ep; > > + struct iwch_listen_entry *le; > > + > > + mutex_lock(&rnicp->mutex); > > + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { > > + if (listen_ep->com.local_addr.sin_addr.s_addr) > > + continue; > > + le = alloc_listener(listen_ep, addr); > > + if (le) { > > + list_add_tail(&le->entry, > &listen_ep->listeners); > > + listen_start_one(listen_ep, le->stid, 
addr, > > + > listen_ep->com.local_addr.sin_port); > > + } > > + } > > + mutex_unlock(&rnicp->mutex); > > +} > > + > > +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr) { > > + struct iwch_listen_ep *listen_ep; > > + struct iwch_listen_entry *le, *tmp; > > + > > + mutex_lock(&rnicp->mutex); > > + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { > > + if (listen_ep->com.local_addr.sin_addr.s_addr) > > + continue; > > + list_for_each_entry_safe(le, tmp, &listen_ep->listeners, > > + entry) > > + if (le->addr == addr) { > > + listen_stop_one(listen_ep, le->stid); > > + list_del(&le->entry); > > + dealloc_listener(listen_ep, le); > > + } > > + } > > + mutex_unlock(&rnicp->mutex); > > +} > > + > > static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct > > sk_buff *skb) { > > struct cpl_pass_accept_rpl *rpl; > > @@ -1767,8 +1937,7 @@ int iwch_accept_cr(struct iw_cm_id *cm_i > > goto err; > > > > /* wait for wr_ack */ > > - wait_event(ep->com.waitq, ep->com.rpl_done); > > - err = ep->com.rpl_err; > > + err = wait_for_reply(&ep->com); > > if (err) > > goto err; > > > > @@ -1887,31 +2056,23 @@ int iwch_create_listen(struct iw_cm_id * > > ep->com.cm_id = cm_id; > > ep->backlog = backlog; > > ep->com.local_addr = cm_id->local_addr; > > + INIT_LIST_HEAD(&ep->listeners); > > > > - /* > > - * Allocate a server TID. 
> > - */ > > - ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); > > - if (ep->stid == -1) { > > - printk(KERN_ERR MOD "%s - cannot alloc > atid.\n", __FUNCTION__); > > - err = -ENOMEM; > > + err = alloc_listener_list(ep); > > + if (err) > > goto fail2; > > - } > > > > state_set(&ep->com, LISTEN); > > err = listen_start(ep); > > - if (err) > > - goto fail3; > > > > - /* wait for pass_open_rpl */ > > - wait_event(ep->com.waitq, ep->com.rpl_done); > > - err = ep->com.rpl_err; > > if (!err) { > > cm_id->provider_data = ep; > > + mutex_lock(&h->mutex); > > + list_add_tail(&ep->entry, &h->listen_eps); > > + mutex_unlock(&h->mutex); > > Is there a race between listen_start() being called and > inserting the ep into the list? Could anything try to find > the ep on the list after listen_start returns? > > > goto out; > > } > > -fail3: > > - cxgb3_free_stid(ep->com.tdev, ep->stid); > > + dealloc_listener_list(ep); > > fail2: > > cm_id->rem_ref(cm_id); > > put_ep(&ep->com); > > @@ -1923,18 +2084,20 @@ out: > > int iwch_destroy_listen(struct iw_cm_id *cm_id) { > > int err; > > + struct iwch_dev *h = to_iwch_dev(cm_id->device); > > struct iwch_listen_ep *ep = to_listen_ep(cm_id); > > > > PDBG("%s ep %p\n", __FUNCTION__, ep); > > > > might_sleep(); > > + mutex_lock(&h->mutex); > > + list_del(&ep->entry); > > + mutex_unlock(&h->mutex); > > state_set(&ep->com, DEAD); > > ep->com.rpl_done = 0; > > ep->com.rpl_err = 0; > > err = listen_stop(ep); > > - wait_event(ep->com.waitq, ep->com.rpl_done); > > - cxgb3_free_stid(ep->com.tdev, ep->stid); > > - err = ep->com.rpl_err; > > + dealloc_listener_list(ep); > > cm_id->rem_ref(cm_id); > > put_ep(&ep->com); > > return err; > > diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h > > b/drivers/infiniband/hw/cxgb3/iwch_cm.h > > index 6107e7c..23e5a22 100644 > > --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h > > +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h > > @@ -162,10 +162,19 @@ struct iwch_ep_common { > > int rpl_err; > > }; > > 
> > -struct iwch_listen_ep { > > - struct iwch_ep_common com; > > +struct iwch_listen_entry { > > + struct list_head entry; > > unsigned int stid; > > + __be32 addr; > > +}; > > + > > +struct iwch_listen_ep { > > + struct iwch_ep_common com; /* Must be first entry! */ > > + struct list_head entry; > > + struct list_head listeners; > > int backlog; > > + int listen_count; > > I didn't notice where this was used. > > > + int listen_rpls; > > or this. > > > }; > > > > struct iwch_ep { > > @@ -222,6 +231,8 @@ int iwch_resume_tid(struct iwch_ep *ep); void > > __free_ep(struct kref *kref); void iwch_rearp(struct > iwch_ep *ep); > > int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct > > dst_entry *new, struct l2t_entry *l2t); > > +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr); > > +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr); > > > > int __init iwch_cm_init(void); > > void __exit iwch_cm_term(void); > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Thu Sep 27 12:11:49 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Sep 2007 12:11:49 -0700 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only"interfacesto avoid 4-tuple conflicts. 
In-Reply-To:
References: <20070923203649.8324.64524.stgit@dell3.ogc.int><46FBF8AF.9040700@ichips.intel.com>
Message-ID: <000101c8013a$41b374f0$a7cc180a@amr.corp.intel.com>

>What is the model on how client connects, say for iSCSI,
>when client and server both support, iWARP and 10GbE or 1GbE,
>and would like to setup "most" performant "connection" for ULP?

For the "most" performance connection, the ULP would use IB, and all these problems go away. :)

This proposal is for each iwarp interface to have its own IP address. Clients would need an iwarp usable address of the server and would connect using rdma_connect(). If that call (or rdma_resolve_addr/route) fails, the client could try connecting using sockets, aoi, or some other interface. I don't see that Steve's proposal changes anything from the client's perspective.

- Sean

From swise at opengridcomputing.com Thu Sep 27 12:25:44 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 27 Sep 2007 14:25:44 -0500
Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
In-Reply-To: <46FBF8AF.9040700@ichips.intel.com>
References: <20070923203649.8324.64524.stgit@dell3.ogc.int> <46FBF8AF.9040700@ichips.intel.com>
Message-ID: <46FC03B8.1030106@opengridcomputing.com>

Sean Hefty wrote:
>> The sysadmin creates "for iwarp use only" alias interfaces of the form
>> "devname:iw*" where devname is the native interface name (eg eth0) for
>> the iwarp netdev device. The alias label can be anything starting with "iw".
>> The "iw" immediately after the ':' is the key used by the iw_cxgb3
>> driver.
>
> I'm still not sure about this, but haven't come up with anything better
> myself. And if there's a good chance of other rnic's needing the same
> support, I'd rather see the common code separated out, even if just
> encapsulated within this module for easy re-use.
> > As for the code, I have a couple of questions about whether deadlock and > a race condition are possible, plus a few minor comments. > Thanks for reviewing! Responses are in-line below. >> +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) >> +{ >> + struct iwch_addrlist *addr; >> + >> + addr = kmalloc(sizeof *addr, GFP_KERNEL); >> + if (!addr) { >> + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", >> + __FUNCTION__); >> + return; >> + } >> + addr->ifa = ifa; >> + mutex_lock(&rnicp->mutex); >> + list_add_tail(&addr->entry, &rnicp->addrlist); >> + mutex_unlock(&rnicp->mutex); >> +} > > Should this return success/failure? > I think so. See below... >> +static int nb_callback(struct notifier_block *self, unsigned long event, >> + void *ctx) >> +{ >> + struct in_ifaddr *ifa = ctx; >> + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); >> + >> + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); >> + >> + switch (event) { >> + case NETDEV_UP: >> + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && >> + is_iwarp_label(ifa->ifa_label)) { >> + PDBG("%s label %s addr 0x%x added\n", >> + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); >> + insert_ifa(rnicp, ifa); >> + iwch_listeners_add_addr(rnicp, ifa->ifa_address); > > If insert_ifa() fails, what will iwch_listeners_add_addr() do? (I'm not > easily seeing the relationship between the address list and the listen > list at this point.) > I guess insert_ifa() needs to return success/failure. Then if we failed to add the ifa to the list we won't update the listeners. The relationship is this: - when a listen is done on addr 0.0.0.0, the code walks the list of addresses to do specific listens on each address. - when an address is added or deleted, then the list of current listeners is walked and updated accordingly. 
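The relationship between the address list and the listen list can be sketched in plain userspace C. This is only an illustration under stated assumptions: `addr_list`, `listen_ep`, `expand_listen()`, `ep_add_addr()` and `ep_del_addr()` are hypothetical stand-ins, not the iw_cxgb3 structures or functions, and fixed-size arrays replace the kernel lists and the mutex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ADDRS 8

/* Hypothetical stand-in for the per-device list of iwarp-only addresses. */
struct addr_list {
	uint32_t addrs[MAX_ADDRS];
	size_t count;
};

/* Hypothetical stand-in for a listening endpoint and its per-address listens. */
struct listen_ep {
	uint32_t local_addr;		/* 0 means a wildcard (0.0.0.0) listen */
	uint32_t listeners[MAX_ADDRS];
	size_t nlisteners;
};

/*
 * Listen time: walk the address list and add one specific listen per
 * matching address.  Returns the number of listens added, or -1 when a
 * specific address was requested but is not on the list (the patch
 * returns -EADDRNOTAVAIL in that case).
 */
int expand_listen(struct listen_ep *ep, const struct addr_list *al)
{
	size_t i;

	assert(al->count <= MAX_ADDRS);
	ep->nlisteners = 0;
	for (i = 0; i < al->count; i++)
		if (ep->local_addr == 0 || ep->local_addr == al->addrs[i])
			ep->listeners[ep->nlisteners++] = al->addrs[i];

	if (ep->local_addr != 0 && ep->nlisteners == 0)
		return -1;
	return (int)ep->nlisteners;
}

/* Address added: only wildcard endpoints gain a listen.  Returns 1 if added. */
int ep_add_addr(struct listen_ep *ep, uint32_t addr)
{
	if (ep->local_addr != 0 || ep->nlisteners == MAX_ADDRS)
		return 0;
	ep->listeners[ep->nlisteners++] = addr;
	return 1;
}

/* Address deleted: wildcard endpoints drop the matching listen(s). */
int ep_del_addr(struct listen_ep *ep, uint32_t addr)
{
	size_t i, kept = 0;
	int removed = 0;

	if (ep->local_addr != 0)
		return 0;
	for (i = 0; i < ep->nlisteners; i++) {
		if (ep->listeners[i] == addr)
			removed++;
		else
			ep->listeners[kept++] = ep->listeners[i];
	}
	ep->nlisteners = kept;
	return removed;
}
```

In the patch itself the same two walks happen under the device mutex, and each added or removed entry also sends a request to the adapter; the sketch leaves both out to keep the list logic visible.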
>> + } >> + break; >> + case NETDEV_DOWN: >> + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && >> + is_iwarp_label(ifa->ifa_label)) { >> + PDBG("%s label %s addr 0x%x deleted\n", >> + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); >> + iwch_listeners_del_addr(rnicp, ifa->ifa_address); >> + remove_ifa(rnicp, ifa); >> + } >> + break; >> + default: >> + break; >> + } >> + return 0; >> +} >> + >> +static void delete_addrlist(struct iwch_dev *rnicp) >> +{ >> + struct iwch_addrlist *addr, *tmp; >> + >> + mutex_lock(&rnicp->mutex); >> + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { >> + list_del(&addr->entry); >> + kfree(addr); >> + } >> + mutex_unlock(&rnicp->mutex); >> +} >> + >> +static void populate_addrlist(struct iwch_dev *rnicp) >> +{ >> + int i; >> + struct in_device *indev; >> + >> + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { >> + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); >> + if (!indev) >> + continue; >> + for_ifa(indev) >> + if (is_iwarp_label(ifa->ifa_label)) { >> + PDBG("%s label %s addr 0x%x added\n", >> + __FUNCTION__, ifa->ifa_label, >> + ifa->ifa_address); >> + insert_ifa(rnicp, ifa); >> + } >> + endfor_ifa(indev); >> + } >> +} >> + >> static void rnic_init(struct iwch_dev *rnicp) >> { >> PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); >> @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r >> idr_init(&rnicp->qpidr); >> idr_init(&rnicp->mmidr); >> spin_lock_init(&rnicp->lock); >> + INIT_LIST_HEAD(&rnicp->addrlist); >> + INIT_LIST_HEAD(&rnicp->listen_eps); >> + mutex_init(&rnicp->mutex); >> + rnicp->nb.notifier_call = nb_callback; >> + populate_addrlist(rnicp); >> + register_inetaddr_notifier(&rnicp->nb); >> >> rnicp->attr.vendor_id = 0x168; >> rnicp->attr.vendor_part_id = 7; >> @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev >> mutex_lock(&dev_mutex); >> list_for_each_entry_safe(dev, tmp, &dev_list, entry) { >> if (dev->rdev.t3cdev_p == tdev) { >> + unregister_inetaddr_notifier(&dev->nb); >> + 
delete_addrlist(dev); >> list_del(&dev->entry); >> iwch_unregister_device(dev); >> cxio_rdev_close(&dev->rdev); >> diff --git a/drivers/infiniband/hw/cxgb3/iwch.h >> b/drivers/infiniband/hw/cxgb3/iwch.h >> index caf4e60..7fa0a47 100644 >> --- a/drivers/infiniband/hw/cxgb3/iwch.h >> +++ b/drivers/infiniband/hw/cxgb3/iwch.h >> @@ -36,6 +36,8 @@ #include >> #include >> #include >> #include >> +#include >> +#include >> >> #include >> >> @@ -101,6 +103,11 @@ struct iwch_rnic_attributes { >> u32 cq_overflow_detection; >> }; >> >> +struct iwch_addrlist { >> + struct list_head entry; >> + struct in_ifaddr *ifa; >> +}; >> + >> struct iwch_dev { >> struct ib_device ibdev; >> struct cxio_rdev rdev; >> @@ -111,6 +118,10 @@ struct iwch_dev { >> struct idr mmidr; >> spinlock_t lock; >> struct list_head entry; >> + struct notifier_block nb; >> + struct list_head addrlist; >> + struct list_head listen_eps; > > The behavior of the listen lists is similar to what's done in the > rdma_cm: Wildcard listens are stored in a listen_any_list. When new > devices are added, associated listens are added to each device. This is > similar, except we're dealing with devices and addresses. I'm wondering > if we shouldn't mimic the same behavior and track listens in > iwch_addrlist directly. (I don't see anything wrong with this approach > though.) > > What happens if an address changes between iwarp only and non-iwarp? > That results in a NETDEV_DOWN event indicating the iwarp only address is getting deleted. All the affected listening endpoints are updated to stop listening on that address. A NETDEV_UP event would happen when the ipaddress is switched over to the TCP interface, but our callback function ignores this since the interface name is not ethX:iw. > How are listens on specific addresses handled from an rdma_cm level? > Does the rdma_cm map the address to the device, call the iw_cm to > listen, which in turn calls the device listen function? Yes. 
> The device then
> checks that the address has been marked as iwarp only? (I'm being too
> lazy to trace this through the code, but if you don't know off the top
> of your head, I will do that.)

Actually, I don't enforce this. If the app explicitly binds/listens to a non-iwarp address, then the code happily allows it. I could fail this case though. That would be best I guess.

>
>> +	struct mutex mutex;
>>  };
>>
>>  static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev)
>> diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c
>> b/drivers/infiniband/hw/cxgb3/iwch_cm.c
>> index 1cdfcd4..afc8a48 100644
>> --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
>> +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
>> @@ -1127,23 +1127,149 @@ static int act_open_rpl(struct t3cdev *t
>>  	return CPL_RET_BUF_DONE;
>>  }
>>
>> -static int listen_start(struct iwch_listen_ep *ep)
>> +static int wait_for_reply(struct iwch_ep_common *epc)
>> +{
>> +	PDBG("%s ep %p waiting\n", __FUNCTION__, epc);
>> +	wait_event(epc->waitq, epc->rpl_done);
>> +	PDBG("%s ep %p done waiting err %d\n", __FUNCTION__, epc,
>> epc->rpl_err);
>> +	return epc->rpl_err;
>> +}
>
> What thread is being blocked here, and what sets the event?
>

The thread calling rdma_listen() gets blocked here until the rnic posts a response to the listen request. Ditto for rdma_destroy_id() on a listening endpoint. The event is set and the wakeup done in pass_open_rpl() and close_listsrv_rpl().
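For readers less familiar with the wait_event()/wake_up() pairing, here is a rough userspace analogue of the request/reply handshake described above, using a pthread mutex and condition variable. It is a sketch only: `ep_common`, `ep_common_init()` and `post_reply()` are hypothetical names, and the kernel code uses a wait queue rather than a condition variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the reply fields of iwch_ep_common. */
struct ep_common {
	pthread_mutex_t lock;
	pthread_cond_t waitq;
	int rpl_done;		/* set once the reply has arrived */
	int rpl_err;		/* status carried by the reply */
};

void ep_common_init(struct ep_common *epc)
{
	assert(epc != NULL);
	pthread_mutex_init(&epc->lock, NULL);
	pthread_cond_init(&epc->waitq, NULL);
	epc->rpl_done = 0;
	epc->rpl_err = 0;
}

/* Requester side: block until a reply has been posted, then return its status. */
int wait_for_reply(struct ep_common *epc)
{
	int err;

	pthread_mutex_lock(&epc->lock);
	while (!epc->rpl_done)
		pthread_cond_wait(&epc->waitq, &epc->lock);
	err = epc->rpl_err;
	pthread_mutex_unlock(&epc->lock);
	return err;
}

/* Reply-handler side: record the status, mark the reply done, wake any waiter. */
void post_reply(struct ep_common *epc, int err)
{
	pthread_mutex_lock(&epc->lock);
	epc->rpl_err = err;
	epc->rpl_done = 1;
	pthread_cond_broadcast(&epc->waitq);
	pthread_mutex_unlock(&epc->lock);
}
```

The requester calls wait_for_reply() after sending its request; the handler that plays the role of pass_open_rpl() or close_listsrv_rpl() calls post_reply() from another thread. Because the waiter rechecks rpl_done in a loop, a reply that arrives before the wait starts is not lost.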
>> + >> +static struct iwch_listen_entry *alloc_listener(struct iwch_listen_ep >> *ep, >> + __be32 addr) >> +{ >> + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); >> + struct iwch_listen_entry *le; >> + >> + le = kmalloc(sizeof *le, GFP_KERNEL); >> + if (!le) { >> + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", >> + __FUNCTION__); >> + return NULL; >> + } >> + le->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, >> + &t3c_client, ep); >> + if (le->stid == -1) { >> + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", >> + __FUNCTION__); >> + kfree(le); >> + return NULL; >> + } >> + le->addr = addr; >> + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, >> + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); >> + return le; >> +} >> + >> +static void dealloc_listener(struct iwch_listen_ep *ep, >> + struct iwch_listen_entry *le) >> +{ >> + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, le->stid, >> + ntohl(le->addr), ntohs(ep->com.local_addr.sin_port)); >> + cxgb3_free_stid(ep->com.tdev, le->stid); >> + kfree(le); >> +} >> + >> +static void dealloc_listener_list(struct iwch_listen_ep *ep) >> +{ >> + struct iwch_listen_entry *le, *tmp; >> + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); >> + >> + mutex_lock(&h->mutex); >> + list_for_each_entry_safe(le, tmp, &ep->listeners, entry) { >> + list_del(&le->entry); >> + dealloc_listener(ep, le); >> + } >> + mutex_unlock(&h->mutex); >> +} >> + >> +static int alloc_listener_list(struct iwch_listen_ep *ep) >> +{ >> + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); > > nit: in place of 'h' in several places, how about 'rnicp', which is also > used? 
> >> + struct iwch_addrlist *addr; >> + struct iwch_listen_entry *le; >> + int err = 0; >> + int added=0; >> + mutex_lock(&h->mutex); >> + list_for_each_entry(addr, &h->addrlist, entry) { >> + if (ep->com.local_addr.sin_addr.s_addr == 0 || >> + ep->com.local_addr.sin_addr.s_addr == >> + addr->ifa->ifa_address) { >> + le = alloc_listener(ep, addr->ifa->ifa_address); >> + if (!le) >> + break; >> + list_add_tail(&le->entry, &ep->listeners); >> + added++; >> + } >> + } >> + mutex_unlock(&h->mutex); >> + if (ep->com.local_addr.sin_addr.s_addr != 0 && !added) >> + err = -EADDRNOTAVAIL; >> + if (!err && !added) >> + printk(KERN_WARNING MOD >> + "No RDMA interface found for device %s\n", >> + pci_name(h->rdev.rnic_info.pdev)); >> + return err; >> +} > > Adding some white space would improve readability. > >> + >> +static int listen_stop_one(struct iwch_listen_ep *ep, unsigned int >> stid) >> { >> struct sk_buff *skb; >> - struct cpl_pass_open_req *req; >> + struct cpl_close_listserv_req *req; >> + >> + PDBG("%s stid %u\n", __FUNCTION__, stid); >> + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); >> + if (!skb) { >> + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); >> + return -ENOMEM; >> + } >> + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); >> + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); >> + req->cpu_idx = 0; >> + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, stid)); >> + skb->priority = 1; >> + ep->com.rpl_err = 0; >> + ep->com.rpl_done = 0; >> + cxgb3_ofld_send(ep->com.tdev, skb); >> + return wait_for_reply(&ep->com); >> +} >> + >> +static int listen_stop(struct iwch_listen_ep *ep) >> +{ >> + struct iwch_listen_entry *le; >> + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); >> + int err = 0; >> >> PDBG("%s ep %p\n", __FUNCTION__, ep); >> + mutex_lock(&h->mutex); >> + list_for_each_entry(le, &ep->listeners, entry) { >> + err = listen_stop_one(ep, le->stid); > > This ends up blocking while holding a 
mutex, which looks like deadlock > potential. > I don't think there are any deadlocks. I don't know how to avoid blocking while holding the mutex. But its ok, I think. >> + if (err) >> + break; >> + } >> + mutex_unlock(&h->mutex); >> + return err; >> +} >> + >> +static int listen_start_one(struct iwch_listen_ep *ep, unsigned int >> stid, >> + __be32 addr, __be16 port) >> +{ >> + struct sk_buff *skb; >> + struct cpl_pass_open_req *req; >> + >> + PDBG("%s stid %u addr %x port %x\n", __FUNCTION__, stid, >> ntohl(addr), >> + ntohs(port)); >> skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); >> if (!skb) { >> - printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); >> + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); >> return -ENOMEM; >> } >> >> req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); >> req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); >> - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); >> - req->local_port = ep->com.local_addr.sin_port; >> - req->local_ip = ep->com.local_addr.sin_addr.s_addr; >> + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, stid)); >> + req->local_port = port; >> + req->local_ip = addr; >> req->peer_port = 0; >> req->peer_ip = 0; >> req->peer_netmask = 0; >> @@ -1152,8 +1278,32 @@ static int listen_start(struct iwch_list >> req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); >> >> skb->priority = 1; >> + ep->com.rpl_err = 0; >> + ep->com.rpl_done = 0; >> cxgb3_ofld_send(ep->com.tdev, skb); >> - return 0; >> + return wait_for_reply(&ep->com); >> +} >> + >> +static int listen_start(struct iwch_listen_ep *ep) >> +{ >> + struct iwch_listen_entry *le; >> + struct iwch_dev *h = to_iwch_dev(ep->com.cm_id->device); >> + int err = 0; >> + >> + PDBG("%s ep %p\n", __FUNCTION__, ep); >> + mutex_lock(&h->mutex); >> + list_for_each_entry(le, &ep->listeners, entry) { >> + err = listen_start_one(ep, le->stid, le->addr, >> + ep->com.local_addr.sin_port); > > Similar to above - 
blocking while holding a mutex. There are a couple > of other places where this also occurs. It is ok to block while holding a mutex, yes? > >> + if (err) >> + goto fail; >> + } >> + mutex_unlock(&h->mutex); >> + return err; >> +fail: >> + mutex_unlock(&h->mutex); >> + listen_stop(ep); >> + return err; >> } >> >> static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, >> void *ctx) >> @@ -1170,39 +1320,59 @@ static int pass_open_rpl(struct t3cdev * >> return CPL_RET_BUF_DONE; >> } >> >> -static int listen_stop(struct iwch_listen_ep *ep) >> -{ >> - struct sk_buff *skb; >> - struct cpl_close_listserv_req *req; >> - >> - PDBG("%s ep %p\n", __FUNCTION__, ep); >> - skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); >> - if (!skb) { >> - printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); >> - return -ENOMEM; >> - } >> - req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); >> - req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); >> - req->cpu_idx = 0; >> - OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, >> ep->stid)); >> - skb->priority = 1; >> - cxgb3_ofld_send(ep->com.tdev, skb); >> - return 0; >> -} >> - >> static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, >> void *ctx) >> { >> struct iwch_listen_ep *ep = ctx; >> struct cpl_close_listserv_rpl *rpl = cplhdr(skb); >> >> - PDBG("%s ep %p\n", __FUNCTION__, ep); >> + PDBG("%s ep %p stid %u\n", __FUNCTION__, ep, GET_TID(rpl)); >> + >> ep->com.rpl_err = status2errno(rpl->status); >> ep->com.rpl_done = 1; >> wake_up(&ep->com.waitq); >> return CPL_RET_BUF_DONE; >> } >> >> +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr) >> +{ >> + struct iwch_listen_ep *listen_ep; >> + struct iwch_listen_entry *le; >> + >> + mutex_lock(&rnicp->mutex); >> + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { >> + if (listen_ep->com.local_addr.sin_addr.s_addr) >> + continue; >> + le = alloc_listener(listen_ep, addr); >> + if (le) { >> + 
list_add_tail(&le->entry, &listen_ep->listeners); >> + listen_start_one(listen_ep, le->stid, addr, >> + listen_ep->com.local_addr.sin_port); >> + } >> + } >> + mutex_unlock(&rnicp->mutex); >> +} >> + >> +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr) >> +{ >> + struct iwch_listen_ep *listen_ep; >> + struct iwch_listen_entry *le, *tmp; >> + >> + mutex_lock(&rnicp->mutex); >> + list_for_each_entry(listen_ep, &rnicp->listen_eps, entry) { >> + if (listen_ep->com.local_addr.sin_addr.s_addr) >> + continue; >> + list_for_each_entry_safe(le, tmp, &listen_ep->listeners, >> + entry) >> + if (le->addr == addr) { >> + listen_stop_one(listen_ep, le->stid); >> + list_del(&le->entry); >> + dealloc_listener(listen_ep, le); >> + } >> + } >> + mutex_unlock(&rnicp->mutex); >> +} >> + >> static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct >> sk_buff *skb) >> { >> struct cpl_pass_accept_rpl *rpl; >> @@ -1767,8 +1937,7 @@ int iwch_accept_cr(struct iw_cm_id *cm_i >> goto err; >> >> /* wait for wr_ack */ >> - wait_event(ep->com.waitq, ep->com.rpl_done); >> - err = ep->com.rpl_err; >> + err = wait_for_reply(&ep->com); >> if (err) >> goto err; >> >> @@ -1887,31 +2056,23 @@ int iwch_create_listen(struct iw_cm_id * >> ep->com.cm_id = cm_id; >> ep->backlog = backlog; >> ep->com.local_addr = cm_id->local_addr; >> + INIT_LIST_HEAD(&ep->listeners); >> >> - /* >> - * Allocate a server TID. 
>> -	 */
>> -	ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep);
>> -	if (ep->stid == -1) {
>> -		printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__);
>> -		err = -ENOMEM;
>> +	err = alloc_listener_list(ep);
>> +	if (err)
>>  		goto fail2;
>> -	}
>>
>>  	state_set(&ep->com, LISTEN);
>>  	err = listen_start(ep);
>> -	if (err)
>> -		goto fail3;
>>
>> -	/* wait for pass_open_rpl */
>> -	wait_event(ep->com.waitq, ep->com.rpl_done);
>> -	err = ep->com.rpl_err;
>>  	if (!err) {
>>  		cm_id->provider_data = ep;
>> +		mutex_lock(&h->mutex);
>> +		list_add_tail(&ep->entry, &h->listen_eps);
>> +		mutex_unlock(&h->mutex);
>
> Is there a race between listen_start() being called and inserting the ep
> into the list? Could anything try to find the ep on the list after
> listen_start returns?
>

I guess if the iwarp address was removed after listen_start() returns and before we add the ep to the list, then we would not stop the listen for this address. Perhaps I need to hold the mutex around the listen_start() -and- the insert...
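One way to picture the "hold the mutex around both" idea is the userspace sketch below. It rests on several assumptions: `dev`, `listen_ep`, `create_listen()`, `del_addr()` and `listen_start_locked()` are hypothetical names, each endpoint listens on a single address, and a pthread mutex stands in for the kernel mutex. Since listen_start() in the patch takes the device mutex itself and kernel mutexes are not recursive, the real code would need a variant that expects the caller to already hold the lock; listen_start_locked() models that.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EPS 4

/* Hypothetical endpoint: one address, one listen. */
struct listen_ep {
	uint32_t addr;
	int listening;
};

/* Hypothetical device: the mutex protects the list of listening endpoints. */
struct dev {
	pthread_mutex_t mutex;
	struct listen_ep *listen_eps[MAX_EPS];
	size_t neps;
};

void dev_init(struct dev *d)
{
	pthread_mutex_init(&d->mutex, NULL);
	d->neps = 0;
}

/* Caller must hold d->mutex. */
static int listen_start_locked(struct listen_ep *ep, uint32_t addr)
{
	ep->addr = addr;
	ep->listening = 1;
	return 0;
}

/*
 * Start the listen and publish the endpoint in one critical section, so
 * an address removal either runs before the listen exists or finds the
 * endpoint on the list.
 */
int create_listen(struct dev *d, struct listen_ep *ep, uint32_t addr)
{
	int err;

	assert(d != NULL && ep != NULL);
	pthread_mutex_lock(&d->mutex);
	if (d->neps == MAX_EPS) {
		err = -1;
	} else {
		err = listen_start_locked(ep, addr);
		if (!err)
			d->listen_eps[d->neps++] = ep;
	}
	pthread_mutex_unlock(&d->mutex);
	return err;
}

/* Address removal: stop every listen on that address, under the same mutex. */
int del_addr(struct dev *d, uint32_t addr)
{
	size_t i;
	int stopped = 0;

	pthread_mutex_lock(&d->mutex);
	for (i = 0; i < d->neps; i++) {
		struct listen_ep *ep = d->listen_eps[i];

		if (ep->listening && ep->addr == addr) {
			ep->listening = 0;
			stopped++;
		}
	}
	pthread_mutex_unlock(&d->mutex);
	return stopped;
}
```

The trade-off is the one already raised in the review: the wait for the adapter's reply now happens with the mutex held, which is fine only as long as the reply handler never needs that mutex.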
>> goto out; >> } >> -fail3: >> - cxgb3_free_stid(ep->com.tdev, ep->stid); >> + dealloc_listener_list(ep); >> fail2: >> cm_id->rem_ref(cm_id); >> put_ep(&ep->com); >> @@ -1923,18 +2084,20 @@ out: >> int iwch_destroy_listen(struct iw_cm_id *cm_id) >> { >> int err; >> + struct iwch_dev *h = to_iwch_dev(cm_id->device); >> struct iwch_listen_ep *ep = to_listen_ep(cm_id); >> >> PDBG("%s ep %p\n", __FUNCTION__, ep); >> >> might_sleep(); >> + mutex_lock(&h->mutex); >> + list_del(&ep->entry); >> + mutex_unlock(&h->mutex); >> state_set(&ep->com, DEAD); >> ep->com.rpl_done = 0; >> ep->com.rpl_err = 0; >> err = listen_stop(ep); >> - wait_event(ep->com.waitq, ep->com.rpl_done); >> - cxgb3_free_stid(ep->com.tdev, ep->stid); >> - err = ep->com.rpl_err; >> + dealloc_listener_list(ep); >> cm_id->rem_ref(cm_id); >> put_ep(&ep->com); >> return err; >> diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h >> b/drivers/infiniband/hw/cxgb3/iwch_cm.h >> index 6107e7c..23e5a22 100644 >> --- a/drivers/infiniband/hw/cxgb3/iwch_cm.h >> +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h >> @@ -162,10 +162,19 @@ struct iwch_ep_common { >> int rpl_err; >> }; >> >> -struct iwch_listen_ep { >> - struct iwch_ep_common com; >> +struct iwch_listen_entry { >> + struct list_head entry; >> unsigned int stid; >> + __be32 addr; >> +}; >> + >> +struct iwch_listen_ep { >> + struct iwch_ep_common com; /* Must be first entry! */ >> + struct list_head entry; >> + struct list_head listeners; >> int backlog; >> + int listen_count; > > I didn't notice where this was used. > >> + int listen_rpls; > > or this. > Yea, I think this is dead code. I'll remove these. 
>>  };
>>
>>  struct iwch_ep {
>> @@ -222,6 +231,8 @@ int iwch_resume_tid(struct iwch_ep *ep);
>>  void __free_ep(struct kref *kref);
>>  void iwch_rearp(struct iwch_ep *ep);
>>  int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct
>> dst_entry *new, struct l2t_entry *l2t);
>> +void iwch_listeners_add_addr(struct iwch_dev *rnicp, __be32 addr);
>> +void iwch_listeners_del_addr(struct iwch_dev *rnicp, __be32 addr);
>>
>>  int __init iwch_cm_init(void);
>>  void __exit iwch_cm_term(void);

From sean.hefty at intel.com Thu Sep 27 13:14:59 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 27 Sep 2007 13:14:59 -0700
Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
In-Reply-To: <46FC03B8.1030106@opengridcomputing.com>
References: <20070923203649.8324.64524.stgit@dell3.ogc.int> <46FBF8AF.9040700@ichips.intel.com> <46FC03B8.1030106@opengridcomputing.com>
Message-ID: <000001c80143$15019140$82c8180a@amr.corp.intel.com>

>It is ok to block while holding a mutex, yes?

It's okay, I just didn't try to trace through the code to see if it ever tries to acquire the same mutex in the thread that needs to signal the event.

- Sean

From Arkady.Kanevsky at netapp.com Thu Sep 27 13:19:51 2007
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 27 Sep 2007 16:19:51 -0400
Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support"iwarp-only"interfacesto avoid 4-tuple conflicts.
In-Reply-To: <000101c8013a$41b374f0$a7cc180a@amr.corp.intel.com>
References: <20070923203649.8324.64524.stgit@dell3.ogc.int><46FBF8AF.9040700@ichips.intel.com> <000101c8013a$41b374f0$a7cc180a@amr.corp.intel.com>
Message-ID:

Sean,
IB aside, it looks like a ULP which is capable of being both RDMA aware and RDMA unaware, like iSER and iSCSI, NFS-RDMA and NFS, SDP and sockets, will be treated as two separate ULPs. Each has its own IP address, since there is a different IP address for the iWARP port and the "regular" Ethernet port. So it falls on the users of ULPs to "handle" it via DNS or some other services. Is this "acceptable" to users? I doubt it.

Recall that ULPs are going in opposite directions by having a different port number for RDMA aware and RDMA unaware versions of the ULP. This way, the ULP "connection manager" handles RDMA-ness under the covers, while users plug in an IP address for a server to connect to.

Thanks,

Arkady Kanevsky                 email: arkady at netapp.com
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300

> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com]
> Sent: Thursday, September 27, 2007 3:12 PM
> To: Kanevsky, Arkady; Sean Hefty; Steve Wise
> Cc: netdev at vger.kernel.org; rdreier at cisco.com;
> linux-kernel at vger.kernel.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] [PATCH v3] iw_cxgb3:
> Support"iwarp-only"interfacesto avoid 4-tuple conflicts.
>
> >What is the model on how client connects, say for iSCSI, when client
> >and server both support, iWARP and 10GbE or 1GbE, and would like to
> >setup "most" performant "connection" for ULP?
>
> For the "most" performance connection, the ULP would use IB,
> and all these problems go away. :)
>
> This proposal is for each iwarp interface to have its own IP
> address. Clients would need an iwarp usable address of the
> server and would connect using rdma_connect(). If that call
> (or rdma_resolve_addr/route) fails, the client could try
> connecting using sockets, aoi, or some other interface. I
> don't see that Steve's proposal changes anything from the
> client's perspective.
>
> - Sean

From zulfiimani at gmail.com Thu Sep 27 13:22:11 2007
From: zulfiimani at gmail.com (Zulfi Imani)
Date: Thu, 27 Sep 2007 15:22:11 -0500
Subject: [ofa-general] Problem running SDP apps using OFED 1.2
Message-ID: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com>

Hi,

I installed the OFED 1.2 stack and am trying to run a simple socket server and client over the SDP stack. The InfiniBand hardware is QLogic.

First I set the environment variables:

export LD_PRELOAD=/root/zulfi/iband/INSTALL/lib64/libsdp.so
export LIBSDP_CONFIG_FILE=/home/zulfi/libsdp.conf

The SDP config file has:

use sdp server * *:*
use sdp client * *:*

Then I started the socket server, ran 'sdpnetstat -San', and found that it listed the SDP port on which the server was listening.

On the client machine I did the same: exported the variables and set up the SDP config file. On running the client with './client port# server_machine' it gave me a "network not reachable" error.

I tried to get some information about the error on the net but could not find any. I then checked the /proc//maps file and found that libsdp.so was being loaded.

Also:

/root > lsmod | grep sdp
ib_sdp 120224 3

Does QLogic support SDP applications? Or am I missing something in the SDP config file, or do I need to make changes to my code?

Any information on this will be a big help.

Thanks,
Zulfi

From tom at opengridcomputing.com Thu Sep 27 13:37:13 2007
From: tom at opengridcomputing.com (Tom Tucker)
Date: Thu, 27 Sep 2007 15:37:13 -0500
Subject: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
In-Reply-To: <000701c80070$560d0490$02270db0$@rr.com>
References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com>
Message-ID: <1190925433.10604.25.camel@trinity.ogc.int>

On Wed, 2007-09-26 at 14:06 -0500, Jim Mott wrote:
> This is a two part bug report. One is a conceptual problem that may just be a problem of understanding on my part. The other is
> what I believe to be a bug in the mlx4 driver. mthca has the same issue.
>
> 1) ib_create_qp() fails with max_sge
> If you use ib_query_device() to return the device specific
> attribute max_sge, it seems reasonable to expect you can create
> a QP with max_send_sge=max_sge. The problem is that this often
> fails.
>
> The reason is that depending on the QP type (RC, UD, etc.) and
> how the QP will be used (send, RDMA, atomic, etc.), there can be
> extra segments required in the WQE that eat up SGE entries. So
> while some send WQE might have max_sge available SGEs, many will
> not.
>
> Normally the difference between max_sge and the actual maximum
> value allowed (and checked) for max_send_sge is 1 or 2.
>
> This issue may need API extensions to definitively resolve. In
> the short term, it would be very nice if max_sge reported by
> ib_query_device() could always return a value that ib_create_qp()
> could use. Think of it as the minimum max_send_sge value that
> will work for all QP types.
>
>
> 2) mlx4 setting of max send SQEs
> The recent patch to support shrinking WQEs introduces a
> behavior that creates a big difference between the mlx4
> supported send SGEs (checked against 61, should be 59 or 60,
> and reported in ib_query_device as 32 to equal receive side
> max_rq_sg value).
> > The patch that follows will allow an MLX4 to support the > number of send SGEs returned by ib_query_devce, and in fact > quite a few more. It probably breaks shrinking WQEs and thus > should not be applied directly. > > Note that if ib_query_device() returned max_sge adjusted > for the raddr and atomic segments, this fix would not be > needed. MLX4 would still support more SGEs in hardware than > can be used through the API, but that is a different problem. > > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:27:47.000000000 -0500 > +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:36:40.000000000 -0500 > @@ -370,7 +370,7 @@ static int set_kernel_sq_size(struct mlx > qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); > > for (;;) { > - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) > + if (s > dev->dev->caps.max_sq_desc_sz) > return -EINVAL; > > qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at dev.mellanox.co.il Thu Sep 27 13:39:14 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 27 Sep 2007 22:39:14 +0200 Subject: [ofa-general] Re: [PATCH 11/11]: mlx4_core use fixed CQ moderation paramters In-Reply-To: References: <1190637727.4947.76.camel@mtls03> Message-ID: <20070927203914.GD2778@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH 11/11]: mlx4_core use fixed CQ moderation paramters > > > +static int cq_max_count = 16; > > +static int cq_period = 10; > > + > > +module_param(cq_max_count, int, 0444); > > +MODULE_PARM_DESC(cq_max_count, "number of CQEs to generate event"); > > +module_param(cq_period, int, 0444); > > +MODULE_PARM_DESC(cq_period, "time in usec for CQ event generation"); > > I assume this is just a leftover from some earlier approach? These > module parameters are just ignored now, so the patch seems kind of > pointless. These should go into create CQ inbox. I'll recheck. > Anyway I think the approach of having one global setting for all CQs > is not a good one -- it seems likely that for example IPoIB and SDP > would want different settings, not to mention userspace applications. I agree. But what should be the default setting? Consider also that there's currently no userspace API to control event coalescing. So global setting to control the defaults might still make sense. No? -- MST From mst at dev.mellanox.co.il Thu Sep 27 13:55:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Sep 2007 22:55:41 +0200 Subject: [ofa-general] Re: [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device In-Reply-To: <000701c80070$560d0490$02270db0$@rr.com> References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> Message-ID: <20070927205541.GF2778@mellanox.co.il> > Quoting Jim Mott : > Subject: [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device > > This is a two part bug report. 
One is a conceptual problem that may just be a problem of understanding on my part. The other is > what I believe to be a bug in the mlx4 driver. > > 1) ib_create_qp() fails with max_sge > If you use ib_query_device() to return the device specific > attribute max_sge, it seems reasonable to expect you can create > a QP with max_send_sge=max_sge. The problem is that this often > fails. > > The reason is that depending on the QP type (RC, UD, etc.) and > how the QP will be used (send, RDMA, atomic, etc.), there can be > extra segments required in the WQE that eat up SGE entries. So > while some send WQE might have max_sge available SGEs, many will > not. > > Normally the difference between max_sge and the actual maximum > value allowed (and checked) for max_send_sge is 1 or 2. > > This issue may need API extensions to definitively resolve. In > the short term, it would be very nice if max_sge reported by > ib_query_device() could always return a value that ib_create_qp() > could use. Think of it as the minimum max_send_sge value that > will work for all QP types. > > > 2) mlx4 setting of max send SQEs > The recent patch to support shrinking WQEs introduces a > behavior that creates a big difference between the mlx4 > supported send SGEs (checked against 61, should be 59 or 60, > and reported in ib_query_device as 32 to equal receive side > max_rq_sg value). > > The patch that follows will allow an MLX4 to support the > number of send SGEs returned by ib_query_devce, and in fact > quite a few more. It probably breaks shrinking WQEs and thus > should not be applied directly. > > Note that if ib_query_device() returned max_sge adjusted > for the raddr and atomic segments, this fix would not be > needed. MLX4 would still support more SGEs in hardware than > can be used through the API, but that is a different problem. 
> > --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:27:47.000000000 -0500 > +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:36:40.000000000 -0500 > @@ -370,7 +370,7 @@ static int set_kernel_sq_size(struct mlx > qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); > > for (;;) { > - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) > + if (s > dev->dev->caps.max_sq_desc_sz) > return -EINVAL; > > qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); Good idea, but that patch needs more work: max_send_sge returned to user should be made smaller to avoid corrupting the WQE. -- MST From mst at dev.mellanox.co.il Thu Sep 27 13:59:50 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Sep 2007 22:59:50 +0200 Subject: [ofa-general] Re: [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device In-Reply-To: References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> Message-ID: <20070927205950.GG2778@mellanox.co.il> > BTW I hate the "shrinking WQE" terminology for this, although > obviously you weren't the one to introduce it) We are making WQEs smaller so "shrinking", and that's how hardware guys seem to call the feature. But it doesn't really matter: the only place the word is used is in the commit log. -- MST From mst at dev.mellanox.co.il Thu Sep 27 14:08:47 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 27 Sep 2007 23:08:47 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> Message-ID: <20070927210847.GH2778@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support > > > + struct page *page = alloc_page(GFP_ATOMIC | GFP_HIGHUSER); > > actually: > > + struct page *page = alloc_page(GFP_ATOMIC | __GFP_HIGHMEM); Isn't this likely to hurt performance on 32 bit systems? -- MST From rdreier at cisco.com Thu Sep 27 14:09:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 27 Sep 2007 14:09:47 -0700 Subject: [ofa-general] Re: [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: <15ddcffd0709270338y33976845va9f36e4050044d86@mail.gmail.com> (Or Gerlitz's message of "Thu, 27 Sep 2007 12:38:43 +0200") References: <15ddcffd0709270338y33976845va9f36e4050044d86@mail.gmail.com> Message-ID: > Since both the rdma-cm and ipoib are consumers of the core mutlicast > management code (core/multicast.c which is linked into ib_sa.ko), and the > app (through the rdma-cm) --first-- inserts a record into the database and > only then issues the setsockopt call, if ipoib has a hit on a group it was > told to join, this group must be offloaded by the rdma-cm consumer. I'm not sure I understand why that follows. Couldn't there be some other kernel or userspace entity that caused the record to be added? > The per device flag is initialized by the module param value at > ipoib_dev_init() I still don't really get why there's a module parameter to set the initial value of a flag that only root can change anyway. 
Why not just set the flag through sysfs after loading ipoib rather than having a module parameter to do the same thing?

 - R.

From rdreier at cisco.com Thu Sep 27 14:10:56 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 27 Sep 2007 14:10:56 -0700
Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support
In-Reply-To: <20070927210847.GH2778@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 27 Sep 2007 23:08:47 +0200")
References: <1190637355.4947.56.camel@mtls03> <46F8E160.5060004@voltaire.com> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il>
Message-ID: 

 > > + struct page *page = alloc_page(GFP_ATOMIC | __GFP_HIGHMEM);
 >
 > Isn't this likely to hurt performance on 32 bit systems?

Yeah, I guess the kernel would need to kmap the data in most cases anyway. So there's not much point in trying to use high memory.

 - R.

From jimmott at austin.rr.com Thu Sep 27 14:11:47 2007
From: jimmott at austin.rr.com (Jim Mott)
Date: Thu, 27 Sep 2007 16:11:47 -0500
Subject: [ofa-general] Problem running SDP apps using OFED 1.2
In-Reply-To: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com>
References: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com>
Message-ID: <000901c8014b$0435f880$0ca1e980$@rr.com>

Were you able to connect IPoIB between the nodes? Are you sure opensm was running? I am ashamed to admit that occasionally I forget to start opensm and wonder why SDP does not connect.

From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Zulfi Imani
Sent: Thursday, September 27, 2007 3:22 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] Problem running SDP apps using OFED 1.2

Hi,

I installed the OFED1.2 stack and am trying to run a simple socket server and client over the SDP stack.
The Infiniband hardware is QLogic. First I set the ENV vars export LD_PRELOAD=/root/zulfi/iband/INSTALL/lib64/libsdp.so export LIBSDP_CONFIG_FILE=/home/zulfi/libsdp.conf The SDP config file has: use sdp server * *:* use sdp client * *:* Then started the socket server and did a 'sdpnetstat -San' and found that it listed the SDP port on which the server was listening. On the client machine too I did the same; exported the variables, setup the SDP config file and on running the client './client port# server_machine' it gave me a "network not reachable" error. I tried to get some information about the error on the net but could not find any. I then checked the /proc//maps file and found that libsdp.so was being loaded. also: /root > lsmod | grep sdp ib_sdp 120224 3 Does QLogic support SDP applications ? Or am I missing something in the SDP config file or do I need to make changes to my code ? Any information on this will be a big help. Thanks, Zulfi -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Thu Sep 27 14:46:45 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 27 Sep 2007 14:46:45 -0700 Subject: [ofa-general] Re: send max_sge lower than reported by ib_query_device In-Reply-To: <001501c800a7$8fd5efc0$af81cf40$@rr.com> (Jim Mott's message of "Wed, 26 Sep 2007 20:41:44 -0500") References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> <001501c800a7$8fd5efc0$af81cf40$@rr.com> Message-ID: > The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user space test. Thanks. The patch below seems to fix this for me. I guess I'll queue this for 2.6.24. I'm also including the test program I wrote to verify this; mlx4 and mthca seem OK on my system now. 
diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 60de6f9..0c22cf0 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -45,6 +45,7 @@
 #include "mthca_cmd.h"
 #include "mthca_profile.h"
 #include "mthca_memfree.h"
+#include "mthca_wqe.h"
 
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver");
@@ -205,7 +206,20 @@ static int mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim)
 	mdev->limits.gid_table_len = dev_lim->max_gids;
 	mdev->limits.pkey_table_len = dev_lim->max_pkeys;
 	mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay;
-	mdev->limits.max_sg = dev_lim->max_sg;
+	/*
+	 * Reduce max_sg to a value so that all possible send requests
+	 * will fit into max_desc_sz; send requests will need a next
+	 * segment plus possibly another extra segment, and the UD
+	 * segment is the biggest extra segment.
+	 */
+	mdev->limits.max_sg =
+		min_t(int, dev_lim->max_sg,
+		      (dev_lim->max_desc_sz -
+		       (sizeof (struct mthca_next_seg) +
+			(mthca_is_memfree(mdev) ?
+			 sizeof (struct mthca_arbel_ud_seg) :
+			 sizeof (struct mthca_tavor_ud_seg)))) /
+		      sizeof (struct mthca_data_seg));
 	mdev->limits.max_wqes = dev_lim->max_qp_sz;
 	mdev->limits.max_qp_init_rdma = dev_lim->max_requester_per_qp;
 	mdev->limits.reserved_qps = dev_lim->reserved_qps;

---

Here's the test program:

#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(int argc, char *argv[])
{
	struct ibv_device **dev_list;
	struct ibv_device_attr dev_attr;
	struct ibv_context *context;
	struct ibv_pd *pd;
	struct ibv_cq *cq;
	struct ibv_qp_init_attr qp_attr;
	int t;
	static const struct {
		enum ibv_qp_type type;
		char *name;
	} type_tab[] = {
		{ IBV_QPT_RC, "RC" },
		{ IBV_QPT_UC, "UC" },
		{ IBV_QPT_UD, "UD" },
	};

	dev_list = ibv_get_device_list(NULL);
	if (!dev_list) {
		printf("No IB devices found\n");
		return 1;
	}

	for (; *dev_list; ++dev_list) {
		printf("%s:\n", ibv_get_device_name(*dev_list));

		context = ibv_open_device(*dev_list);
		if (!context) {
			printf(" ibv_open_device failed\n");
			continue;
		}

		if (ibv_query_device(context, &dev_attr)) {
			printf(" ibv_query_device failed\n");
			continue;
		}

		cq = ibv_create_cq(context, 1, NULL, NULL, 0);
		if (!cq) {
			printf(" ibv_create_cq failed\n");
			continue;
		}

		pd = ibv_alloc_pd(context);
		if (!pd) {
			printf(" ibv_alloc_pd failed\n");
			continue;
		}

		/* Try to create each QP type with the advertised max_sge. */
		for (t = 0; t < sizeof type_tab / sizeof type_tab[0]; ++t) {
			memset(&qp_attr, 0, sizeof qp_attr);
			qp_attr.send_cq = cq;
			qp_attr.recv_cq = cq;
			qp_attr.cap.max_send_wr = 1;
			qp_attr.cap.max_recv_wr = 1;
			qp_attr.cap.max_send_sge = dev_attr.max_sge;
			qp_attr.cap.max_recv_sge = dev_attr.max_sge;
			qp_attr.qp_type = type_tab[t].type;

			printf(" %s: SGE %d ", type_tab[t].name, dev_attr.max_sge);
			if (ibv_create_qp(pd, &qp_attr))
				printf("ok\n");
			else
				printf("FAILED\n");
		}
	}

	return 0;
}

From johann.george at qlogic.com Thu Sep 27 15:13:28 2007
From: johann.george at qlogic.com (Johann George)
Date: Thu, 27 Sep 2007 15:13:28 -0700
Subject: [ofa-general] Save the date: OFA Developer's Summit: November 15-16 in Nevada
Message-ID: 
<20070927221328.GA16000@cuprite.pathscale.com> We hope you will plan on attending the OpenFabrics Developer's Summit being held November 15-16, 2007 at the Boomtown Hotel in Verdi, Nevada. It will begin at 1pm on Thursday, November 15th and run until the early evening. Friday's session will begin at 8am and end at noon. Last year, this turned out to be a good forum to work through issues that required collaboration. If you have items that ought to be on the agenda, please email them to me. We will have a proposed agenda shortly. This event takes place at the tail end of SC07. The Boomtown hotel is about a twenty minute drive from the Reno-Sparks convention center where SC07 is being held. Rooms are available if needed at the Boomtown hotel starting at $70/night. Thanks for your participation. Johann From zulfiimani at gmail.com Thu Sep 27 15:28:06 2007 From: zulfiimani at gmail.com (Zulfi Imani) Date: Thu, 27 Sep 2007 17:28:06 -0500 Subject: [ofa-general] Problem running SDP apps using OFED 1.2 In-Reply-To: <000901c8014b$0435f880$0ca1e980$@rr.com> References: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com> <000901c8014b$0435f880$0ca1e980$@rr.com> Message-ID: <7778a2950709271528y4775d691ta3943848d7dba9e1@mail.gmail.com> I have not tried over IPoIB, but opensm is running /home/zulfi > sminfo sminfo: sm lid 1 sm guid 0x11750000ffdaf4, activity count 16220 priority 0 state 3 SMINFO_MASTER I also tried a few iband utilities and they all work fine. Not able to run any socket apps over SDP. Thanks Zulfi On 9/27/07, Jim Mott wrote: > > Were you able to connect IPoIB between the nodes? Are you sure opensm > was running? I am ashamed to admit that occasionally I forget to start > opensm and wonder why SDP does not connect. 
> > > > *From:* general-bounces at lists.openfabrics.org [mailto: > general-bounces at lists.openfabrics.org] *On Behalf Of *Zulfi Imani > *Sent:* Thursday, September 27, 2007 3:22 PM > *To:* general at lists.openfabrics.org > *Subject:* [ofa-general] Problem running SDP apps using OFED 1.2 > > > > Hi, > > I installed the OFED1.2 stack and am trying to run a simple socket server > and client over the SDP stack. The Infiniband hardware is QLogic. > > First I set the ENV vars > export LD_PRELOAD=/root/zulfi/iband/INSTALL/lib64/libsdp.so > > export LIBSDP_CONFIG_FILE=/home/zulfi/libsdp.conf > > > The SDP config file has: > *use sdp server * *:* > use sdp client * *:* > * > Then started the socket server and did a 'sdpnetstat -San' and found that > it listed the SDP port on which the server was listening. > > On the client machine too I did the same; exported the variables, setup > the SDP config file and on running the client './client port# > server_machine' it gave me a "network not reachable" error. > > I tried to get some information about the error on the net but could not > find any. > > I then checked the /proc//maps file and found that libsdp.so was > being loaded. > also: > /root > lsmod | grep sdp > ib_sdp 120224 3 > > Does QLogic support SDP applications ? Or am I missing something in the > SDP config file or do I need to make changes to my code ? > > Any information on this will be a big help. > > Thanks, > Zulfi > > > -- Regs, Zulfi -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Thu Sep 27 15:44:52 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 28 Sep 2007 00:44:52 +0200 Subject: [ofa-general] Re: send max_sge lower than reported by ib_query_device In-Reply-To: References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> <001501c800a7$8fd5efc0$af81cf40$@rr.com> Message-ID: <20070927224452.GI2778@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: send max_sge lower than reported by ib_query_device > > > The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user space test. > > Thanks. The patch below seems to fix this for me. I guess I'll queue > this for 2.6.24. I'm not sure this is a good approach: the fact that user attempts to use the max value from query device indicates that he really wants to get as large a value as possible. So lowering this value in query means we are wasting performance for such an app. -- MST From rdreier at cisco.com Thu Sep 27 15:56:43 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 27 Sep 2007 15:56:43 -0700 Subject: [ofa-general] Re: send max_sge lower than reported by ib_query_device In-Reply-To: <20070927224452.GI2778@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 28 Sep 2007 00:44:52 +0200") References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> <001501c800a7$8fd5efc0$af81cf40$@rr.com> <20070927224452.GI2778@mellanox.co.il> Message-ID: Michael> I'm not sure this is a good approach: the fact that user Michael> attempts to use the max value from query device indicates Michael> that he really wants to get as large a value as Michael> possible. So lowering this value in query means we are Michael> wasting performance for such an app. Right now we report a value of 30 and then give an error if the consumer tries to use that value to actually create a QP. 
That's a clear bug to me. How do you suggest we resolve this bug? - R. From sean.hefty at intel.com Thu Sep 27 17:02:04 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Sep 2007 17:02:04 -0700 Subject: [ofa-general] Re: send max_sge lower than reported byib_query_device In-Reply-To: References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com><000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> <001501c800a7$8fd5efc0$af81cf40$@rr.com> <20070927224452.GI2778@mellanox.co.il> Message-ID: <000001c80162$cdb71650$38c8180a@amr.corp.intel.com> >Right now we report a value of 30 and then give an error if the >consumer tries to use that value to actually create a QP. That's a >clear bug to me. How do you suggest we resolve this bug? I like the idea of this call returning a value that's usable for any QP, with Jim's idea of providing a new call of returning maximum attributes based on QP attributes. - Sean From rdreier at cisco.com Thu Sep 27 17:38:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 27 Sep 2007 17:38:59 -0700 Subject: [ofa-general] Re: send max_sge lower than reported byib_query_device In-Reply-To: <000001c80162$cdb71650$38c8180a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 27 Sep 2007 17:02:04 -0700") References: <46F99093.7000907@noaa.gov> <000301c80031$d6ff9250$84feb6f0$@rr.com> <000701c80070$560d0490$02270db0$@rr.com> <001001c80088$6e04f4f0$4a0eded0$@rr.com> <001501c800a7$8fd5efc0$af81cf40$@rr.com> <20070927224452.GI2778@mellanox.co.il> <000001c80162$cdb71650$38c8180a@amr.corp.intel.com> Message-ID: > I like the idea of this call returning a value that's usable for any QP, with > Jim's idea of providing a new call of returning maximum attributes based on QP > attributes. OK, so fixing ib_query_device() for mthca to report a value usable for all QPs (as my patch does) is a step in this direction. - R. 
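The arithmetic behind the mthca change discussed in this thread can be sketched in isolation. The following is a minimal standalone illustration under stated assumptions, not the driver code: the helper name usable_max_sge and its parameters are hypothetical, and any byte sizes passed to it are placeholders rather than real mthca segment sizes. It shows only the idea from the patch comment: subtract the next segment and the largest extra segment from the maximum descriptor size, divide by the data segment size, and never report more than the hardware limit.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of mthca or libibverbs): return the
 * largest scatter/gather count that still fits in one send descriptor
 * once the fixed overhead segments are accounted for.
 *
 *   hw_max_sg     - SGE limit reported by the hardware
 *   max_desc_sz   - maximum send descriptor (WQE) size in bytes
 *   next_seg_sz   - size of the "next" segment every send WQE carries
 *   extra_seg_sz  - size of the biggest optional extra segment
 *   data_seg_sz   - size of one data (SGE) segment
 */
static int usable_max_sge(int hw_max_sg, size_t max_desc_sz,
                          size_t next_seg_sz, size_t extra_seg_sz,
                          size_t data_seg_sz)
{
    size_t overhead = next_seg_sz + extra_seg_sz;
    size_t fit;

    /* Nothing fits if the overhead alone fills the descriptor. */
    if (data_seg_sz == 0 || max_desc_sz <= overhead)
        return 0;

    /* Number of data segments that fit in the remaining space. */
    fit = (max_desc_sz - overhead) / data_seg_sz;

    /* Never advertise more than the hardware limit. */
    return fit < (size_t)hw_max_sg ? (int)fit : hw_max_sg;
}
```

With assumed sizes of a 512-byte descriptor, a 16-byte next segment, a 48-byte extra segment and 16-byte data segments, usable_max_sge(32, 512, 16, 48, 16) evaluates to 28, so a device advertising 32 would be reported as 28 and a QP created with that value would fit for every QP type.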
From ggrundstrom at NetEffect.com Thu Sep 27 17:42:01 2007 From: ggrundstrom at NetEffect.com (Glenn Grundstrom) Date: Thu, 27 Sep 2007 19:42:01 -0500 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only"interfacesto avoid 4-tuple conflicts. In-Reply-To: <1190672751.24606.56.camel@trinity.ogc.int> References: <20070923203649.8324.64524.stgit@dell3.ogc.int> <5E701717F2B2ED4EA60F87C8AA57B7CC076E481D@venom2> <5E701717F2B2ED4EA60F87C8AA57B7CC076E489A@venom2> <1190672751.24606.56.camel@trinity.ogc.int> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC0775792A@venom2> > > > > I'm sure I had seen a previous email in this thread that > > > suggested using > > > > a userspace library to open a socket > > > > in the shared port space. It seems that suggestion was > > > dropped without > > > > reason. Does anyone know why? > > > > > > Yes, because it doesn't handle in-kernel uses (eg NFS/RDMA, > > > iSER, etc). > > > > The kernel apps could open a Linux tcp socket and create an RDMA > > socket connection. Both calls are standard Linux kernel architected > > routines. > > This approach was NAK'd by David Miller and others... > > > Doesn't NFSoRDMA already open a TCP socket and another for > > RDMA traffic (ports 2049 & 2050 if I remember correctly)? > > The NFS RDMA transport driver does not open a socket for the RDMA > connection. It uses a different port in order to allow both > TCP and RDMA > mounts to the same filer. > > > I currently > > don't know if iSER, RDS, etc. already do the same thing, but if they > > don't, they probably could very easily. > > > > Woe be to those who do so... > > > > > > > Does the neteffect NIC have the same issue as cxgb3 here? > What are > > > your thoughts on how to handle this? > > > > Yes, the NetEffect RNIC will have the same issue as > Chelsio. And all > > Future RNIC's which support a unified tcp address with Linux will as > > well. 
> > > > Steve has put a lot of thought and energy into the problem, but > > I don't think users & admins will be very happy with us in > the long run. > > > > Agreed. > > > In summary, short of having the rdma_cm share kernel port space, I'd > > like to see the equivalent in userspace and have the kernel > apps handle > > the issue in a similar way as described above. There are a few > > technical > > issues to work through (like passing the userspace IP address to the > > kernel), > > This just moves the socket creation to code that is outside > the purview > the kernel maintainers. The exchanging of the 4-tuple created with the > kernel module, however, is back in the kernel and in the maintainer's > control and responsibility. In my view anything like this > will be viewed > as an attempt to sneak code into the kernel that the maintainer has > already vehemently rejected. This will make people angry and > damage the > cooperative working relationship that we are trying to build. > > > but I think we can solve that just like other information that > > gets passed from user into the IB/RDMA kernel modules. > > > > > Sharing the IP 4-tuple space cooperatively with the core in > any fashion > has been nak'd. Without this cooperation, the options we've > been able to > come up with are administrative/policy based approaches. > > Any ideas you have along these lines are welcome. I am aware of the pending nak's and certainly don't want to sneak anything by anyone. Since we all agree that user & admins won't like the current approach I'm trying to come up with alternatives. Arkady has raised some good points regarding iSCSI and I would hope a similar solution could be used for iWARP. Glenn. > > Tom > > > Glenn. > > > > > > > > - R. 
> > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > >

From kliteyn at mellanox.co.il Thu Sep 27 22:13:16 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 28 Sep 2007 07:13:16 +0200
Subject: [ofa-general] nightly osm_sim report 2007-09-28:normal completion
Message-ID: 

OSM Simulation Regression Summary

[Generated mail - please do NOT reply]

OpenSM binary date = 2007-09-27
OpenSM git rev = Tue_Sep_25_00:30:00_2007 [2c547953885809a8026e20af7809be08b42c3865]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=520 Pass=520 Fail=0

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:

From cbasu at rediffmail.com Fri Sep 28 02:08:19 2007
From: cbasu at rediffmail.com (Chandan Basu)
Date: 28 Sep 2007 09:08:19 -0000
Subject: [ofa-general] verbs.h
Message-ID: <20070928090819.32723.qmail@f5mail16.rediffmail.com>

Hi,

What is the procedure for calling ibv_post_send and ibv_post_recv?
Is there any documentation available? Thanks   -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at lists.openfabrics.org Fri Sep 28 02:55:57 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 28 Sep 2007 02:55:57 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070928-0200 daily build status Message-ID: <20070928095557.87DD5E6088A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on ppc64 with 
linux-2.6.14 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: 
/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070928-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From eli at dev.mellanox.co.il Fri Sep 28 06:43:22 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Fri, 28 Sep 2007 15:43:22 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: References: <1190637355.4947.56.camel@mtls03> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> Message-ID: <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> So after 
all we do need a flag? On 9/27/07, Roland Dreier wrote: > > > > + struct page *page = alloc_page(GFP_ATOMIC | __GFP_HIGHMEM); > > > > Isn't this likely to hurt performance on 32 bit systems? > > Yeah, I guess the kernel would need to kmap the data in most cases > anyway. So there's not much point in trying to use high memory. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rdreier at cisco.com Fri Sep 28 08:06:40 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Sep 2007 08:06:40 -0700 Subject: [ofa-general] verbs.h In-Reply-To: <20070928090819.32723.qmail@f5mail16.rediffmail.com> (Chandan Basu's message of "28 Sep 2007 09:08:19 -0000") References: <20070928090819.32723.qmail@f5mail16.rediffmail.com> Message-ID: > What is the procedure of calling ibv_post_send and ibv_post_recv. Is there any documentation available? Not sure what you mean -- you just call them like any other function. There are descriptions of the parameters etc in comments in verbs.h, plus man pages, plus examples in the examples/ directory of the libibverbs source package. - R. From rdreier at cisco.com Fri Sep 28 08:07:29 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Sep 2007 08:07:29 -0700 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> (Eli Cohen's message of "Fri, 28 Sep 2007 15:43:22 +0200") References: <1190637355.4947.56.camel@mtls03> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> Message-ID: > So after all we do need a flag? No, after thinking about it I don't think there's any reason to use __GFP_HIGHMEM... it would use less kernel memory on highmem systems but I don't think it really helps in the end. - R. From hnguyen at linux.vnet.ibm.com Fri Sep 28 08:14:07 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 28 Sep 2007 17:14:07 +0200 Subject: [ofa-general] [PATCH 0/3] IB/ehca: various bug fixes Message-ID: <200709281714.07632.hnguyen@linux.vnet.ibm.com> Hi Roland! This patch set contains "small" bug fixes for ehca. 
They are: [1/3] fix mem leak of firmware ctrl block when creating SRQ [2/3] 64-bit alignment for qp response block for user space [3/3] return SRQ attr max_sge in ehca_query_srq() They should apply and build cleanly against your git branch for-2.6.24. Thanks Nam From hnguyen at linux.vnet.ibm.com Fri Sep 28 08:16:27 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 28 Sep 2007 17:16:27 +0200 Subject: [ofa-general] [PATCH 1/3] IB/ehca: Fix mem leak of firmware ctrlblock In-Reply-To: <200709281714.07632.hnguyen@linux.vnet.ibm.com> References: <200709281714.07632.hnguyen@linux.vnet.ibm.com> Message-ID: <200709281716.27474.hnguyen@linux.vnet.ibm.com> From 9506aada669ee539e80cc78205582eb6f213a1c3 Mon Sep 17 00:00:00 2001 From: Hoang-Nam Nguyen Date: Fri, 28 Sep 2007 09:41:01 +0200 Subject: [PATCH] IB/ehca: Fix mem leak of firmware ctrlblock when creating SRQ Signed-off-by: Hoang-Nam Nguyen --- drivers/infiniband/hw/ehca/ehca_qp.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index b10c7df..2591651 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -890,6 +890,8 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, goto create_srq2; } + ehca_free_fw_ctrlblock(mqpcb); + return &my_qp->ib_srq; create_srq2: -- 1.5.2 From hnguyen at linux.vnet.ibm.com Fri Sep 28 08:18:47 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 28 Sep 2007 17:18:47 +0200 Subject: [ofa-general] [PATCH 2/3] IB/ehca: Adjust 64-bit alignment of qp response block for user space In-Reply-To: <200709281714.07632.hnguyen@linux.vnet.ibm.com> References: <200709281714.07632.hnguyen@linux.vnet.ibm.com> Message-ID: <200709281718.47587.hnguyen@linux.vnet.ibm.com> From 5768677f792b6162cd23ab64278c7228ea1a9a8a Mon Sep 17 00:00:00 2001 From: Hoang-Nam Nguyen Date: Fri, 28 Sep 2007 09:42:05 +0200 Subject: [PATCH] IB/ehca: Adjust 64-bit 
alignment of qp response block for user space Signed-off-by: Hoang-Nam Nguyen --- drivers/infiniband/hw/ehca/ehca_classes.h | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index d670696..0f7a55d 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -351,6 +351,7 @@ struct ehca_create_qp_resp { /* qp_num assigned by ehca: sqp0/1 may have got different numbers */ u32 real_qp_num; u32 fw_handle_ofs; + u32 dummy; struct ipzu_queue_resp ipz_squeue; struct ipzu_queue_resp ipz_rqueue; }; -- 1.5.2 From hnguyen at linux.vnet.ibm.com Fri Sep 28 08:20:05 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 28 Sep 2007 17:20:05 +0200 Subject: [ofa-general] [PATCH 3/3] IB/ehca: Return srq_attr->max_sge in ehca_query_srq() In-Reply-To: <200709281714.07632.hnguyen@linux.vnet.ibm.com> References: <200709281714.07632.hnguyen@linux.vnet.ibm.com> Message-ID: <200709281720.05419.hnguyen@linux.vnet.ibm.com> From aa488c3cd5a036bbec73d92c96be7aff5030274c Mon Sep 17 00:00:00 2001 From: Joachim Fenkes Date: Fri, 28 Sep 2007 15:40:04 +0200 Subject: [PATCH] IB/ehca: Return srq_attr->max_sge in ehca_query_srq() Totally forgot this. 
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 2591651..e2bd62b 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -1753,6 +1753,7 @@ int ehca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr) } srq_attr->max_wr = qpcb->max_nr_outst_recv_wr - 1; + srq_attr->max_sge = qpcb->actual_nr_sges_in_rq_wqe; srq_attr->srq_limit = EHCA_BMASK_GET( MQPCB_CURR_SRQ_LIMIT, qpcb->curr_srq_limit); -- 1.5.2 From ardavis at ichips.intel.com Fri Sep 28 10:17:27 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 28 Sep 2007 10:17:27 -0700 Subject: [ofa-general] ***SPAM*** uDAPL thread safety In-Reply-To: <605833.19627.qm@web53704.mail.re2.yahoo.com> References: <605833.19627.qm@web53704.mail.re2.yahoo.com> Message-ID: <46FD3727.5090404@ichips.intel.com> Dev wrote: > HI, > Is the uDAPL provider in OFED 1.2 thread safe ? the dat.conf by default > has an entry nonthreadsafe and the spec says for some of the routines > thread safety depends on the provider. > The underlying OFA provider (openib_cma) and stack (rdma_cma,verbs) are all thread safe but according to udat_config.h the reference implementation (uDAT,uDAPL common code) is not. James, can you speak to state of uDAT/uDAPL common code? Is this comment still true? -arlin From swise at opengridcomputing.com Fri Sep 28 12:46:55 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 28 Sep 2007 14:46:55 -0500 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support"iwarp-only"interfacesto avoid 4-tuple conflicts. 
In-Reply-To: References: <20070923203649.8324.64524.stgit@dell3.ogc.int><46FBF8AF.9040700@ichips.intel.com> <000101c8013a$41b374f0$a7cc180a@amr.corp.intel.com> Message-ID: <46FD5A2F.7010409@opengridcomputing.com> Kanevsky, Arkady wrote: > Sean, > IB aside, > it looks like an ULP which is capable of being both RDMA aware and RDMA > not-aware, > like iSER and iSCSI, NFS-RDMA and NFS, SDP and sockets, > will be treated as two separete ULPs. > Each has its own IP address, since there is a different IP address for > iWARP > port and "regular" Ethernet port. So it falls on the users of ULPs to > "handle" it > via DNS or some other services. > Is this "acceptable" to users? I doubt it. > > Recall that ULPs are going in opposite directions by having a different > port number for RDMA aware and RDMA unaware versions of the ULP. > This way, ULP "connection manager" handles RDMA-ness under the covers, > while users plug an IP address for a server to connect to. > Thanks, Arkady, I'm confused about how this proposed design changes the behavior of the ULPs that run on TCP and iWARP. I don't see much difference from the point of view of the ULPs. The NFS-RDMA server, for example, will not need to change since it binds to address 0.0.0.0 which will translate into a bind/listen on the specific iwarp address for each iwarp device on the rdma side, and address 0.0.0.0 for the TCP side. Am I missing your point? The real pain, IMO, with this solution is that it FORCES the admins to use 2 subnets when 1 is sufficient if the net maintainers would unify the port space... Steve. > > Arkady Kanevsky email: arkady at netapp.com > Network Appliance Inc. phone: 781-768-5395 > 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 > Waltham, MA 02451 central phone: 781-768-5300 > > >> -----Original Message----- >> From: Sean Hefty [mailto:sean.hefty at intel.com] >> Sent: Thursday, September 27, 2007 3:12 PM >> To: Kanevsky, Arkady; Sean Hefty; Steve Wise >> Cc: netdev at vger.kernel.org; rdreier at cisco.com; >> linux-kernel at vger.kernel.org; general at lists.openfabrics.org >> Subject: RE: [ofa-general] [PATCH v3] iw_cxgb3: >> Support"iwarp-only"interfacesto avoid 4-tuple conflicts. >> >>> What is the model on how client connects, say for iSCSI, when client >>> and server both support, iWARP and 10GbE or 1GbE, and would like to >>> setup "most" performant "connection" for ULP? >> For the "most" performance connection, the ULP would use IB, >> and all these problems go away. :) >> >> This proposal is for each iwarp interface to have its own IP >> address. Clients would need an iwarp usable address of the >> server and would connect using rdma_connect(). If that call >> (or rdma_resolve_addr/route) fails, the client could try >> connecting using sockets, aoi, or some other interface. I >> don't see that Steve's proposal changes anything from the >> client's perspective. >> >> - Sean >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> From dev_hyd2001 at yahoo.com Fri Sep 28 13:31:06 2007 From: dev_hyd2001 at yahoo.com (Dev) Date: Fri, 28 Sep 2007 13:31:06 -0700 (PDT) Subject: [ofa-general] ***SPAM*** uDAPL thread safety In-Reply-To: <46FD3727.5090404@ichips.intel.com> Message-ID: <911716.18098.qm@web53711.mail.re2.yahoo.com> Hi Arlin, Please correct me if I'm wrong ! 
Does that mean that the OFED uDAPL implementation is thread safe for those routines which the spec describes as thread safe but non threadsafe for those routines which the spec states as "provider dependent"? cheers /Dev Arlin Davis wrote: Dev wrote: > HI, > Is the uDAPL provider in OFED 1.2 thread safe ? the dat.conf by default > has an entry nonthreadsafe and the spec says for some of the routines > thread safety depends on the provider. > The underlying OFA provider (openib_cma) and stack (rdma_cma,verbs) are all thread safe but according to udat_config.h the reference implementation (uDAT,uDAPL common code) is not. James, can you speak to state of uDAT/uDAPL common code? Is this comment still true? -arlin --------------------------------- Fussy? Opinionated? Impossible to please? Perfect. Join Yahoo!'s user panel and lay it on us. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Arkady.Kanevsky at netapp.com Fri Sep 28 13:36:18 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 28 Sep 2007 16:36:18 -0400 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support"iwarp-only"interfacesto avoid 4-tuple conflicts. In-Reply-To: <46FD5A2F.7010409@opengridcomputing.com> References: <20070923203649.8324.64524.stgit@dell3.ogc.int><46FBF8AF.9040700@ichips.intel.com> <000101c8013a$41b374f0$a7cc180a@amr.corp.intel.com> <46FD5A2F.7010409@opengridcomputing.com> Message-ID: Exactly, it forces the burden on administrator. And one will be forced to try one mount for iWARP and it does not work issue another one TCP or UDP if it fails. Yack! And server will need to listen on different IP address and simple * will not work since it will need to listen in two different domains. Had we run this proposal by administrators? Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Friday, September 28, 2007 3:47 PM > To: Kanevsky, Arkady > Cc: Sean Hefty; Sean Hefty; netdev at vger.kernel.org; > rdreier at cisco.com; linux-kernel at vger.kernel.org; > general at lists.openfabrics.org > Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: > Support"iwarp-only"interfacesto avoid 4-tuple conflicts. > > > > Kanevsky, Arkady wrote: > > Sean, > > IB aside, > > it looks like an ULP which is capable of being both RDMA aware and > > RDMA not-aware, like iSER and iSCSI, NFS-RDMA and NFS, SDP and > > sockets, will be treated as two separete ULPs. > > Each has its own IP address, since there is a different IP > address for > > iWARP port and "regular" Ethernet port. So it falls on the users of > > ULPs to "handle" it via DNS or some other services. > > Is this "acceptable" to users? I doubt it. > > > > Recall that ULPs are going in opposite directions by having a > > different port number for RDMA aware and RDMA unaware > versions of the ULP. > > This way, ULP "connection manager" handles RDMA-ness under > the covers, > > while users plug an IP address for a server to connect to. > > Thanks, > > Arkady, I'm confused about how this proposed design changes > the behavior of the ULPs that run on TCP and iWARP. I don't > see much difference from the point of view of the ULPs. > > The NFS-RDMA server, for example, will not need to change > since it binds to address 0.0.0.0 which will translate into a > bind/listen on the specific iwarp address for each iwarp > device on the rdma side, and address 0.0.0.0 for the TCP side. > > Am I missing your point? > > The real pain, IMO, with this solution is that it FORCES the > admins to use 2 subnets when 1 is sufficient if the net > maintainers would unify the port space... > > Steve. 
> > > > > > > Arkady Kanevsky email: arkady at netapp.com > > Network Appliance Inc. phone: 781-768-5395 > > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > > Waltham, MA 02451 central phone: 781-768-5300 > > > > > >> -----Original Message----- > >> From: Sean Hefty [mailto:sean.hefty at intel.com] > >> Sent: Thursday, September 27, 2007 3:12 PM > >> To: Kanevsky, Arkady; Sean Hefty; Steve Wise > >> Cc: netdev at vger.kernel.org; rdreier at cisco.com; > >> linux-kernel at vger.kernel.org; general at lists.openfabrics.org > >> Subject: RE: [ofa-general] [PATCH v3] iw_cxgb3: > >> Support"iwarp-only"interfacesto avoid 4-tuple conflicts. > >> > >>> What is the model on how client connects, say for iSCSI, > when client > >>> and server both support, iWARP and 10GbE or 1GbE, and > would like to > >>> setup "most" performant "connection" for ULP? > >> For the "most" performance connection, the ULP would use > IB, and all > >> these problems go away. :) > >> > >> This proposal is for each iwarp interface to have its own > IP address. > >> Clients would need an iwarp usable address of the server and would > >> connect using rdma_connect(). If that call (or > >> rdma_resolve_addr/route) fails, the client could try > connecting using > >> sockets, aoi, or some other interface. I don't see that Steve's > >> proposal changes anything from the client's perspective. 
> >> > >> - Sean > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit > >> http://openib.org/mailman/listinfo/openib-general > >> > From ardavis at ichips.intel.com Fri Sep 28 13:42:01 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 28 Sep 2007 13:42:01 -0700 Subject: [ofa-general] rdma_read scale-up (8 ppn) issue with iMPI IMB alltoallv, MT25208 SDR Message-ID: <46FD6719.2010008@ichips.intel.com> We are running into IBV_WC_RETRY_EXC_ERR errors with large rdma_reads using iMPI and IMB alltoallv. Problem always occurs between processes on the same node. Loopback issue? Has anyone else run into rdma_read issues like this? Here are details: 2 node Clovertown X5355 servers (8 cores each), RHEL4u4, iMPI 3.0. retry_count is set to 7 [ardavis at compute-0-14 src]$ ibv_devinfo hca_id: mthca0 fw_ver: 4.8.200 node_guid: 0002:c902:0000:4fa8 sys_image_guid: 0002:c902:0000:4fa8 vendor_id: 0x02c9 vendor_part_id: 25208 hw_ver: 0xA0 board_id: MT_00A0000001 phys_port_cnt: 2 [ardavis at compute-0-14 src]$ mpiexec -perhost 8 -n 8 -env DAPL_DBG_TYPE 0x83 -env I_MPI_DEBUG 0 -env I_MPI_DEVICE rdma ./IMB-MPI1 alltoallv -npmin 16 #--------------------------------------------------- # Intel (R) MPI Benchmark Suite V3.0, MPI-1 part #--------------------------------------------------- # Date : Fri Sep 28 12:26:05 2007 # Machine : x86_64 # System : Linux # Release : 2.6.9-42.ELsmp # Version : #1 SMP Wed Jul 12 23:32:02 EDT 2006 # MPI Version : 2.0 # MPI Thread Environment: MPI_THREAD_SINGLE # # Minimum message length in bytes: 0 # Maximum message length in bytes: 4194304 # # MPI_Datatype : MPI_BYTE # MPI_Datatype for reductions : MPI_FLOAT # MPI_Op : MPI_SUM # # # List of Benchmarks to run: # Alltoallv #---------------------------------------------------------------- # Benchmarking Alltoallv # #processes = 8 
#---------------------------------------------------------------- #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] 0 1000 0.66 0.70 0.68 1 1000 483.97 484.49 484.32 2 1000 483.35 483.41 483.37 4 1000 484.29 484.41 484.39 8 1000 483.86 484.01 483.97 16 1000 479.72 479.87 479.82 32 1000 483.95 484.07 484.00 64 1000 482.13 482.27 482.22 128 1000 485.00 485.13 485.09 256 1000 485.93 486.06 486.00 512 1000 487.68 487.78 487.72 1024 1000 487.82 487.98 487.94 2048 1000 497.09 497.27 497.21 4096 1000 510.79 510.95 510.86 8192 1000 506.51 506.64 506.59 16384 1000 642.15 642.26 642.21 32768 1000 1816.55 1816.80 1816.67 65536 640 2926.42 2926.65 2926.51 131072 320 5214.20 5215.18 5214.64 262144 160 10018.31 10021.30 10020.22 524288 80 19554.79 19581.09 19573.01 1048576 40 43291.05 43342.45 43323.24 2097152 20 109898.01 110455.85 110361.47 DTO completion ERROR: 12: op 0x2 DTO completion ERROR: 12: op 0x2 (ep disconnected) [0][rdma_iba.c:193] Intel MPI fatal error: DTO operation completed with error. status=0x1. cookie=0x0 DTO completion ERROR: 5: op 0x2 [7][rdma_iba.c:193] Intel MPI fatal error: DTO operation completed with error. status=0x8. cookie=0x4 Thanks, -arlin From jlentini at netapp.com Fri Sep 28 14:12:50 2007 From: jlentini at netapp.com (James Lentini) Date: Fri, 28 Sep 2007 17:12:50 -0400 (EDT) Subject: [ofa-general] ***SPAM*** uDAPL thread safety In-Reply-To: <911716.18098.qm@web53711.mail.re2.yahoo.com> References: <911716.18098.qm@web53711.mail.re2.yahoo.com> Message-ID: That is correct. The full definition is given in the uDAPL spec. starting on page 47. On Fri, 28 Sep 2007, Dev wrote: > Hi Arlin, > > Please correct me if I'm wrong ! Does that mean that the OFED uDAPL implementation is thread safe for those routines which the spec describes as thread safe but non threadsafe for those routines which the spec states as "provider dependent"? > > cheers > > /Dev > > > Arlin Davis wrote: Dev wrote: > > HI, > > Is the uDAPL provider in OFED 1.2 thread safe ? 
the dat.conf by default > > has an entry nonthreadsafe and the spec says for some of the routines > > thread safety depends on the provider. > > > > The underlying OFA provider (openib_cma) and stack (rdma_cma,verbs) are > all thread safe but according to udat_config.h the reference > implementation (uDAT,uDAPL common code) is not. > > James, can you speak to state of uDAT/uDAPL common code? Is this comment > still true? > > -arlin > > > > --------------------------------- > Fussy? Opinionated? Impossible to please? Perfect. Join Yahoo!'s user panel and lay it on us. From swise at opengridcomputing.com Fri Sep 28 14:27:06 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 28 Sep 2007 16:27:06 -0500 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support"iwarp-only"interfacesto avoid 4-tuple conflicts. In-Reply-To: References: <20070923203649.8324.64524.stgit@dell3.ogc.int><46FBF8AF.9040700@ichips.intel.com> <000101c8013a$41b374f0$a7cc180a@amr.corp.intel.com> <46FD5A2F.7010409@opengridcomputing.com> Message-ID: <46FD71AA.8070901@opengridcomputing.com> Kanevsky, Arkady wrote: > Exactly, > it forces the burden on administrator. > And one will be forced to try one mount for iWARP and it does not > work issue another one TCP or UDP if it fails. > Yack! > I see your point. I have no defense. My hands have been tied on fixing this properly... > And server will need to listen on different IP address and simple > * will not work since it will need to listen in two different domains. > No, the server will listen on 0.0.0.0:2049 for TCP, and 0.0.0.0:2050 for rdma. The rdma subsystem will translate 0.0.0.0:2050 into listens on specific iwarp ip addresses on every iwarp device... > Had we run this proposal by administrators? There has been no other solution proposed that Dave Miller and Jeff Garzik won't NAK... Steve. 
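As a rough illustration of the arrangement Steve describes above (a wildcard listen on 0.0.0.0:2049 for TCP next to a separate listen on 0.0.0.0:2050 for rdma), here is a minimal C sketch of the sockets half only. The helper name listen_any_tcp is hypothetical and not taken from any posted patch. The rdma half is described only in a comment, on the assumption that it would go through librdmacm; it is left out so the sketch builds without that library.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): open a TCP listener bound to
 * the wildcard address 0.0.0.0 on the given port.
 * Returns the listening fd, or -1 on failure.
 */
int listen_any_tcp(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* 0.0.0.0 */
    addr.sin_port = htons(port);              /* e.g. 2049 for the TCP side */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Assumption, not compiled here: the rdma side would make the analogous
 * calls through librdmacm (rdma_create_event_channel, rdma_create_id,
 * rdma_bind_addr with the same wildcard sockaddr but port 2050, then
 * rdma_listen). Under the proposal in this thread, the rdma subsystem
 * would expand that wildcard into listens on each iwarp-only address.
 */
```

A server following this model would call listen_any_tcp(2049) and then perform the rdma listen separately, which is the "listen twice" pattern; the application code does not name any specific iwarp address.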
From mshefty at ichips.intel.com Fri Sep 28 14:34:56 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 28 Sep 2007 14:34:56 -0700 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support"iwarp-only"interfacesto avoid 4-tuple conflicts. In-Reply-To: References: <20070923203649.8324.64524.stgit@dell3.ogc.int><46FBF8AF.9040700@ichips.intel.com> <000101c8013a$41b374f0$a7cc180a@amr.corp.intel.com> <46FD5A2F.7010409@opengridcomputing.com> Message-ID: <46FD7380.6050107@ichips.intel.com> Kanevsky, Arkady wrote: > Exactly, > it forces the burden on administrator. > And one will be forced to try one mount for iWARP and it does not > work issue another one TCP or UDP if it fails. > Yack! > > And server will need to listen on different IP address and simple > * will not work since it will need to listen in two different domains. The server already has to call listen twice. Once for the rdma_cm and once for sockets. Similarly on the client side, connect must be made over rdma_cm or sockets. I really don't see any impact on the application for this approach. We just end up separating the port space based on networking addresses, rather than keeping the problem at the transport level. If you have an alternate approach that will be accepted upstream, feel free to post it. - Sean From rdreier at cisco.com Fri Sep 28 15:08:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Sep 2007 15:08:33 -0700 Subject: [ofa-general] Re: [PATCH 3/3] IB/ehca: Return srq_attr->max_sge in ehca_query_srq() In-Reply-To: <200709281720.05419.hnguyen@linux.vnet.ibm.com> (Hoang-Nam Nguyen's message of "Fri, 28 Sep 2007 17:20:05 +0200") References: <200709281714.07632.hnguyen@linux.vnet.ibm.com> <200709281720.05419.hnguyen@linux.vnet.ibm.com> Message-ID: I applied all three, although I had to fix up corrupted patches by hand (it seems your mailer encoded everything as quoted-printable and added all sorts of =2D junk) - R. 
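For anyone puzzled by the "=2D junk" mentioned above: in quoted-printable, a byte can be written as '=' followed by two hex digits, so "=2D" stands for '-' (0x2D), and a '=' at the end of a line marks a soft line break. That is why diff lines starting with '-' are easy to mangle. The following is a minimal decoding sketch in C; hex_value and qp_decode are hypothetical helpers written for illustration and are not part of git or of any mailer.

```c
#include <stddef.h>

/* Hypothetical helper: numeric value of one hex digit, or -1 if invalid. */
int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper: decode quoted-printable text from src into dst.
 * dst must have room for strlen(src) + 1 bytes, since decoding never
 * grows the text. Returns the number of bytes written, not counting
 * the terminating NUL.
 */
size_t qp_decode(const char *src, char *dst)
{
    size_t n = 0;

    while (*src) {
        if (src[0] == '=') {
            /* Soft line break: "=\n" or "=\r\n" is dropped entirely. */
            if (src[1] == '\n') {
                src += 2;
                continue;
            }
            if (src[1] == '\r' && src[2] == '\n') {
                src += 3;
                continue;
            }
            /* Escaped byte: '=' followed by two hex digits. */
            int hi = hex_value(src[1]);
            int lo = (hi >= 0) ? hex_value(src[2]) : -1;
            if (hi >= 0 && lo >= 0) {
                dst[n++] = (char)(hi * 16 + lo);
                src += 3;
                continue;
            }
        }
        /* Anything else, including a stray '=', is copied as-is. */
        dst[n++] = *src++;
    }
    dst[n] = '\0';
    return n;
}
```

Running a mangled diff header such as "=2D=2D- a/drivers/infiniband/hw/ehca/ehca_qp.c" through qp_decode restores the leading "---" that patch tools expect.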
From rdreier at cisco.com Fri Sep 28 15:18:01 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Sep 2007 15:18:01 -0700 Subject: [ofa-general] [PATCH] IPoIB: Convert to netdevice internal stats Message-ID: Use the stats member of struct netdevice in IPoIB, so we can save memory by deleting the stats member of struct ipoib_dev_priv, and save code by deleting ipoib_get_stats(). Signed-off-by: Roland Dreier --- Dave, can you queue this in net-2.6.24 please? I would ordinarily merge IPoIB changes but since this depends on the netdevice internal stats change it becomes a cross-tree dependency if I try to do that. And I'd like to get it queued in git now before the merge window. Thanks... drivers/infiniband/ulp/ipoib/ipoib.h | 2 -- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 20 ++++++++++---------- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 18 +++++++++--------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 22 +++++++--------------- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 10 +++++----- 5 files changed, 31 insertions(+), 41 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 3a6ef14..1e627ee 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -300,8 +300,6 @@ struct ipoib_dev_priv { struct ib_event_handler event_handler; - struct net_device_stats stats; - struct net_device *parent; struct list_head child_intfs; struct list_head list; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 08b4676..1afd93c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -430,7 +430,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) ipoib_dbg(priv, "cm recv error " "(status=%d, wrid=%d vend_err %x)\n", wc->status, wr_id, wc->vendor_err); - ++priv->stats.rx_dropped; + ++dev->stats.rx_dropped; goto repost; } @@ -457,7 +457,7 @@ void 
ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) * this packet and reuse the old buffer. */ ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); - ++priv->stats.rx_dropped; + ++dev->stats.rx_dropped; goto repost; } @@ -474,8 +474,8 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_pull(skb, IPOIB_ENCAP_LEN); dev->last_rx = jiffies; - ++priv->stats.rx_packets; - priv->stats.rx_bytes += skb->len; + ++dev->stats.rx_packets; + dev->stats.rx_bytes += skb->len; skb->dev = dev; /* XXX get correct PACKET_ type here */ @@ -512,8 +512,8 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ if (unlikely(skb->len > tx->mtu)) { ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", skb->len, tx->mtu); - ++priv->stats.tx_dropped; - ++priv->stats.tx_errors; + ++dev->stats.tx_dropped; + ++dev->stats.tx_errors; ipoib_cm_skb_too_long(dev, skb, tx->mtu - IPOIB_ENCAP_LEN); return; } @@ -532,7 +532,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ tx_req->skb = skb; addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE); if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { - ++priv->stats.tx_errors; + ++dev->stats.tx_errors; dev_kfree_skb_any(skb); return; } @@ -542,7 +542,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), addr, skb->len))) { ipoib_warn(priv, "post_send failed\n"); - ++priv->stats.tx_errors; + ++dev->stats.tx_errors; ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); dev_kfree_skb_any(skb); } else { @@ -580,8 +580,8 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE); /* FIXME: is this right? Shouldn't we only increment on success? 
*/ - ++priv->stats.tx_packets; - priv->stats.tx_bytes += tx_req->skb->len; + ++dev->stats.tx_packets; + dev->stats.tx_bytes += tx_req->skb->len; dev_kfree_skb_any(tx_req->skb); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index b664b98..1a77e79 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -208,7 +208,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) * this packet and reuse the old buffer. */ if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) { - ++priv->stats.rx_dropped; + ++dev->stats.rx_dropped; goto repost; } @@ -225,8 +225,8 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_pull(skb, IPOIB_ENCAP_LEN); dev->last_rx = jiffies; - ++priv->stats.rx_packets; - priv->stats.rx_bytes += skb->len; + ++dev->stats.rx_packets; + dev->stats.rx_bytes += skb->len; skb->dev = dev; /* XXX get correct PACKET_ type here */ @@ -260,8 +260,8 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE); - ++priv->stats.tx_packets; - priv->stats.tx_bytes += tx_req->skb->len; + ++dev->stats.tx_packets; + dev->stats.tx_bytes += tx_req->skb->len; dev_kfree_skb_any(tx_req->skb); @@ -362,8 +362,8 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN); - ++priv->stats.tx_dropped; - ++priv->stats.tx_errors; + ++dev->stats.tx_dropped; + ++dev->stats.tx_errors; ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu); return; } @@ -383,7 +383,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE); if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { - ++priv->stats.tx_errors; + 
++dev->stats.tx_errors; dev_kfree_skb_any(skb); return; } @@ -392,7 +392,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, addr, skb->len))) { ipoib_warn(priv, "post_send failed\n"); - ++priv->stats.tx_errors; + ++dev->stats.tx_errors; ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); dev_kfree_skb_any(skb); } else { diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index d8754dd..e079cca 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -519,7 +519,7 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); return; } @@ -584,7 +584,7 @@ err_list: err_path: ipoib_neigh_free(dev, neigh); err_drop: - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); spin_unlock(&priv->lock); @@ -633,7 +633,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, } else __path_add(dev, path); } else { - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); } @@ -652,7 +652,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, skb_push(skb, sizeof *phdr); __skb_queue_tail(&path->queue, skb); } else { - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); } @@ -720,7 +720,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) __skb_queue_tail(&neigh->queue, skb); spin_unlock(&priv->lock); } else { - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); } } else { @@ -746,7 +746,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) IPOIB_QPN(phdr->hwaddr), IPOIB_GID_RAW_ARG(phdr->hwaddr + 4)); dev_kfree_skb_any(skb); - 
++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; goto out; } @@ -760,13 +760,6 @@ out: return NETDEV_TX_OK; } -static struct net_device_stats *ipoib_get_stats(struct net_device *dev) -{ - struct ipoib_dev_priv *priv = netdev_priv(dev); - - return &priv->stats; -} - static void ipoib_timeout(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -867,7 +860,7 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh) struct sk_buff *skb; *to_ipoib_neigh(neigh->neighbour) = NULL; while ((skb = __skb_dequeue(&neigh->queue))) { - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); } if (ipoib_cm_get(neigh)) @@ -950,7 +943,6 @@ static void ipoib_setup(struct net_device *dev) dev->stop = ipoib_stop; dev->change_mtu = ipoib_change_mtu; dev->hard_start_xmit = ipoib_start_xmit; - dev->get_stats = ipoib_get_stats; dev->tx_timeout = ipoib_timeout; dev->hard_header = ipoib_hard_header; dev->set_multicast_list = ipoib_set_mcast_list; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 94a5709..8b92ea8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -103,7 +103,7 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast) } spin_lock_irqsave(&priv->tx_lock, flags); - priv->stats.tx_dropped += tx_dropped; + dev->stats.tx_dropped += tx_dropped; spin_unlock_irqrestore(&priv->tx_lock, flags); kfree(mcast); @@ -298,7 +298,7 @@ ipoib_mcast_sendonly_join_complete(int status, /* Flush out any queued packets */ spin_lock_irq(&priv->tx_lock); while (!skb_queue_empty(&mcast->pkt_queue)) { - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); } spin_unlock_irq(&priv->tx_lock); @@ -653,7 +653,7 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb) if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast || 
!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); goto unlock; } @@ -668,7 +668,7 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb) if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); goto out; } @@ -683,7 +683,7 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb) if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); else { - ++priv->stats.tx_dropped; + ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); } From davem at davemloft.net Fri Sep 28 15:34:25 2007 From: davem at davemloft.net (David Miller) Date: Fri, 28 Sep 2007 15:34:25 -0700 (PDT) Subject: [ofa-general] Re: [PATCH] IPoIB: Convert to netdevice internal stats In-Reply-To: References: Message-ID: <20070928.153425.43869012.davem@davemloft.net> From: Roland Dreier Date: Fri, 28 Sep 2007 15:18:01 -0700 > Use the stats member of struct netdevice in IPoIB, so we can save > memory by deleting the stats member of struct ipoib_dev_priv, and save > code by deleting ipoib_get_stats(). > > Signed-off-by: Roland Dreier Applied to net-2.6.24, thanks. How is that ibm_emac NAPI conversion coming along? :-) From rdreier at cisco.com Fri Sep 28 15:36:45 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Sep 2007 15:36:45 -0700 Subject: [ofa-general] Re: [PATCH] IPoIB: Convert to netdevice internal stats In-Reply-To: <20070928.153425.43869012.davem@davemloft.net> (David Miller's message of "Fri, 28 Sep 2007 15:34:25 -0700 (PDT)") References: <20070928.153425.43869012.davem@davemloft.net> Message-ID: > How is that ibm_emac NAPI conversion coming along? :-) Sorry, trying to reduce my backlog first, but it is still on my list of things to work on :) - R. 
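The idea behind the IPoIB stats patch in this thread can be sketched outside the kernel as follows. This is a simplified userspace model, not the driver code: the two structs are cut-down stand-ins for the kernel types with only the fields needed here, and account_rx and default_get_stats are hypothetical helpers. The assumption, taken from the patch description, is that the counters live in the net_device itself and that the networking core can report them without a driver-supplied get_stats hook.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel's struct net_device_stats. */
struct net_device_stats {
    unsigned long rx_packets;
    unsigned long rx_bytes;
    unsigned long rx_dropped;
};

/* Cut-down stand-in for struct net_device: the stats are embedded here,
 * so the driver's private struct no longer needs its own copy. */
struct net_device {
    struct net_device_stats stats;
};

/* Hypothetical stand-in for what the core is assumed to do by default
 * when a driver does not set its own get_stats callback. */
static struct net_device_stats *default_get_stats(struct net_device *dev)
{
    assert(dev != NULL);
    return &dev->stats;
}

/* Hypothetical helper mirroring the receive-path accounting in the diff:
 * a failed receive bumps rx_dropped, a good one bumps packets and bytes. */
static void account_rx(struct net_device *dev, size_t len, int ok)
{
    assert(dev != NULL);
    if (!ok) {
        ++dev->stats.rx_dropped;
        return;
    }
    ++dev->stats.rx_packets;
    dev->stats.rx_bytes += len;
}
```

Every "++priv->stats.x" in the diff becomes "++dev->stats.x" in the same way, which is why the patch is almost entirely mechanical apart from removing ipoib_get_stats().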
From rdreier at cisco.com Fri Sep 28 15:45:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 28 Sep 2007 15:45:59 -0700 Subject: [ofa-general] [PATCH][RFC] P_Key support for umad In-Reply-To: <1190034864.6272.86.camel@hrosenstock-ws.xsigo.com> (Hal Rosenstock's message of "Mon, 17 Sep 2007 06:14:24 -0700") References: <46EACC6B.5060702@ichips.intel.com> <1190034864.6272.86.camel@hrosenstock-ws.xsigo.com> Message-ID: thanks for the review, I added this to my for-2.6.24 tree. I'll try to finish the libibumad changes soon... - R. From ardavis at ichips.intel.com Fri Sep 28 17:17:57 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 28 Sep 2007 17:17:57 -0700 Subject: [ofa-general] Re: [PATCH] uDAPL 1.2 mods to coexist with uDAPL 2.0 In-Reply-To: References: <000501c7fbbb$0d084390$19b7020a@amr.corp.intel.com> Message-ID: <46FD99B5.8020009@ichips.intel.com> James Lentini wrote: > I agree with the goal of supporting both 1.2 and 2.0 implementations > on the same system. Thanks for working on this. > > Comments below: > comments incorporated and committed. 
From kliteyn at mellanox.co.il Fri Sep 28 22:05:10 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 29 Sep 2007 07:05:10 +0200 Subject: [ofa-general] nightly osm_sim report 2007-09-29:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-28 OpenSM git rev = Tue_Sep_25_00:30:00_2007 [2c547953885809a8026e20af7809be08b42c3865] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures:
From hnguyen at linux.vnet.ibm.com Sat Sep 29 00:25:36 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Sat, 29 Sep 2007 09:25:36 +0200 Subject: [ofa-general] Re: [PATCH 3/3] IB/ehca: Return srq_attr->max_sge in ehca_query_srq() In-Reply-To: References: <200709281714.07632.hnguyen@linux.vnet.ibm.com> <200709281720.05419.hnguyen@linux.vnet.ibm.com> Message-ID: <200709290925.36533.hnguyen@linux.vnet.ibm.com> On Saturday 29 September 2007 00:08, Roland Dreier wrote: > I applied all three, although I had to fix up corrupted patches by > hand (it seems your mailer encoded everything as quoted-printable and > added all sorts of =2D junk) Sorry, will check/repair my kmail setup. Thanks for your help! Nam
From vlad at lists.openfabrics.org Sat Sep 29 02:56:14 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 29 Sep 2007 02:56:14 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070929-0200 daily build status Message-ID: <20070929095614.6C501E6085B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.19 
Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.22 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070929-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hnguyen at linux.vnet.ibm.com Sat Sep 29 04:44:48 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Sat, 29 Sep 2007 13:44:48 +0200 Subject: [ofa-general] Please pull libehca.git/libehca ofed_1_3 branch Message-ID: <200709291344.48191.hnguyen@linux.vnet.ibm.com> Hi Michael! Please pull from git://git.openfabrics.org/~hnguyen/libehca.git branch ofed_1_3 to get the fixes below for alpha release. 
Thanks Nam 06 refactorize xxxx_abiver5() that are required to support older kernels add create_srq_abiver5() for older kernels 05 update queue's, qp's and cq's response block for older kernels (abi_version 5) according to struct changes in ehca 04 fix mapping firmware handle's bug 03 copy queue offset from kernel's response block in queue struct 02 handle abi_version properly, especially for older kernels like rhel4.5 01 update abi_version according to kernel space 2.6.24-rc1
From kliteyn at mellanox.co.il Sat Sep 29 22:15:05 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 30 Sep 2007 07:15:05 +0200 Subject: [ofa-general] nightly osm_sim report 2007-09-30:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-09-29 OpenSM git rev = Tue_Sep_25_00:30:00_2007 [2c547953885809a8026e20af7809be08b42c3865] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From anton at samba.org Sat Sep 29 22:37:26 2007 From: anton at samba.org (Anton Blanchard) Date: Sun, 30 Sep 2007 00:37:26 -0500 Subject: [ofa-general] eHCA issues Message-ID: <20070930053726.GA28619@kryten> Hi, Ive been trying to get the DAPL tests running on eHCA. One issue was traced back to a negative value in ->max_cqe. Looking at all the device attributes max_pd and max_ah are negative too.
props->max_cqe = min_t(int, rblock->max_cqe, INT_MAX); Can this really be negative or should we be doing: props->max_cqe = min_t(unsigned int, rblock->max_cqe, INT_MAX); Anton hca_id: ehca0 vendor_id: 0x5076 vendor_part_id: 0 hw_ver: 0x1000003 phys_port_cnt: 1 max_mr_size: 0xb70000000 page_size_cap: 0x0 max_qp: 16239 max_qp_wr: 32768 device_cap_flags: 0x00005800 max_sge: 252 max_sge_rd: 0 max_cq: 16380 max_cqe: -64 max_mr: 61382 max_pd: -1 max_qp_rd_atom: 0 max_ee_rd_atom: 0 max_res_rd_atom: 0 max_qp_init_rd_atom: 0 max_ee_init_rd_atom: 0 atomic_cap: ATOMIC_NONE (0) max_ee: 0 max_rdd: 0 max_mw: 61382 max_raw_ipv6_qp: 0 max_raw_ethy_qp: 0 max_mcast_grp: 32 max_mcast_qp_attach: 8 max_total_mcast_qp_attach: 256 max_ah: -1 max_fmr: 61382 max_map_per_fmr: 0 max_srq: 0 max_pkeys: 16 local_ca_ack_delay: 0 From dotanb at dev.mellanox.co.il Sat Sep 29 23:39:47 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 30 Sep 2007 08:39:47 +0200 Subject: [ofa-general] Problem running SDP apps using OFED 1.2 In-Reply-To: <7778a2950709271528y4775d691ta3943848d7dba9e1@mail.gmail.com> References: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com> <000901c8014b$0435f880$0ca1e980$@rr.com> <7778a2950709271528y4775d691ta3943848d7dba9e1@mail.gmail.com> Message-ID: <46FF44B3.4010805@dev.mellanox.co.il> Does a simple "ping" between the nodes is working? (this way you can be sure that IPoIB is working and SDP should work) Dotan Zulfi Imani wrote: > I have not tried over IPoIB, but opensm is running > > /home/zulfi > sminfo > sminfo: sm lid 1 sm guid 0x11750000ffdaf4, activity count 16220 > priority 0 state 3 SMINFO_MASTER > > I also tried a few iband utilities and they all work fine. Not able to > run any socket apps over SDP. > > Thanks > Zulfi > > On 9/27/07, *Jim Mott* > wrote: > > Were you able to connect IPoIB between the nodes? Are you sure > opensm was running? I am ashamed to admit that occasionally I > forget to start opensm and wonder why SDP does not connect. 
> > > > *From:* general-bounces at lists.openfabrics.org > [mailto: > general-bounces at lists.openfabrics.org > ] *On Behalf Of > *Zulfi Imani > *Sent:* Thursday, September 27, 2007 3:22 PM > *To:* general at lists.openfabrics.org > > *Subject:* [ofa-general] Problem running SDP apps using OFED 1.2 > > > > Hi, > > I installed the OFED1.2 stack and am trying to run a simple socket > server and client over the SDP stack. The Infiniband hardware is > QLogic. > > First I set the ENV vars > export LD_PRELOAD=/root/zulfi/iband/INSTALL/lib64/libsdp.so > > export LIBSDP_CONFIG_FILE=/home/zulfi/libsdp.conf > > > The SDP config file has: > *use sdp server * *:* > use sdp client * *:* > * > Then started the socket server and did a 'sdpnetstat -San' and > found that it listed the SDP port on which the server was listening. > > On the client machine too I did the same; exported the variables, > setup the SDP config file and on running the client './client > port# server_machine' it gave me a "network not reachable" error. > > I tried to get some information about the error on the net but > could not find any. > > I then checked the /proc//maps file and found that libsdp.so > was being loaded. > also: > /root > lsmod | grep sdp > ib_sdp 120224 3 > > Does QLogic support SDP applications ? Or am I missing something > in the SDP config file or do I need to make changes to my code ? > > Any information on this will be a big help. 
> > Thanks, > Zulfi > > > > > > > -- > Regs, > Zulfi > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ogerlitz at voltaire.com Sun Sep 30 02:13:18 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 30 Sep 2007 11:13:18 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: References: <1190637355.4947.56.camel@mtls03> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> Message-ID: <46FF68AE.6090605@voltaire.com> Roland Dreier wrote: > > So after all we do need a flag? > > No, after thinking about it I don't think there's any reason to use > __GFP_HIGHMEM... it would use less kernel memory on highmem systems > but I don't think it really helps in the end. I see that none of the drivers for NICs which set the NETIF_F_HIGHDMA bit use some _HIGHMEM GFP directive for their skb allocation, so maybe the two flags need not be correlated. So the NETIF_F_HIGHDMA just comes to say that if from some reason an skb which was not allocated by the driver points to buffer in high-memory, its fine. Anyway, I am fine with the patch once the change-log comment is changed AND the ipath issue is resolved with agreement of the parties (Roland, Ralph), if no device capability flag nor implementing high-mem support in the ipath driver is needed, let it be. Or. 
From eli at dev.mellanox.co.il Sun Sep 30 02:24:42 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 30 Sep 2007 11:24:42 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <46FF68AE.6090605@voltaire.com> References: <1190637355.4947.56.camel@mtls03> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> <46FF68AE.6090605@voltaire.com> Message-ID: <1191144282.16035.7.camel@mtls03> On Sun, 2007-09-30 at 11:13 +0200, Or Gerlitz wrote: > Roland Dreier wrote: > > > So after all we do need a flag? > > > > No, after thinking about it I don't think there's any reason to use > > __GFP_HIGHMEM... it would use less kernel memory on highmem systems > > but I don't think it really helps in the end. > > I see that none of the drivers for NICs which set the NETIF_F_HIGHDMA > bit use some _HIGHMEM GFP directive for their skb allocation, so maybe > the two flags need not be correlated. So the NETIF_F_HIGHDMA just comes > to say that if from some reason an skb which was not allocated by the > driver points to buffer in high-memory, its fine. > > Anyway, I am fine with the patch once the change-log comment is changed > AND the ipath issue is resolved with agreement of the parties (Roland, > Ralph), if no device capability flag nor implementing high-mem support > in the ipath driver is needed, let it be. > > Or. > If I understand correctly, there were two kinds of flags discussed in this thread. One of them is using __GFP_HIGHMEM for allocating buffers to be utilized in the receive flow of ipoib cm which it was agreed that will hurt performance. 
The second one is using a capability flag which devices use to declare that they support high dma - this one is not required too but implies that ipath will not work on 32 bit platforms (e.g. i386 with PAE). Saying that, Or, can you clarify how would you like the change-long comment to be changed? From vlad at lists.openfabrics.org Sun Sep 30 02:56:46 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 30 Sep 2007 02:56:46 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20070930-0200 daily build status Message-ID: <20070930095647.038D4E6085B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.17 
Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] 
Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 
2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20070930-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From ogerlitz at voltaire.com Sun Sep 30 03:46:35 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 30 Sep 2007 12:46:35 +0200 Subject: [ofa-general] Re: [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: References: <15ddcffd0709270338y33976845va9f36e4050044d86@mail.gmail.com> Message-ID: <46FF7E8B.7010307@voltaire.com> Roland Dreier wrote: > > Since both the rdma-cm and ipoib are consumers of the 
core mutlicast > > management code (core/multicast.c which is linked into ib_sa.ko), and the > > app (through the rdma-cm) --first-- inserts a record into the database and > > only then issues the setsockopt call, if ipoib has a hit on a group it was > > told to join, this group must be offloaded by the rdma-cm consumer. > > I'm not sure I understand why that follows. Couldn't there be some > other kernel or userspace entity that caused the record to be added? Actually, no: The multicast database is managed by AND the bunch of attributes associated with sa multicast join request. So if two consumers attempt to join with the same tuple, and the attributes present in the struct ib_sa_mcmember_rec provided to ib_sa_join_multicast() are different, the second join fails with -EINVAL status, see the comp_rec() function in core/multicast.c. The rdma-cm first calls ib_sa_get_mcmember_rec() with the partition broadcast mgid such it for IPOIB ID's it can use the same attributes provided by ipoib for the mgid to join to (for non IPOIB ID's it uses a different qkey and mgid, see cma_join_ib_multicast() and cma_set_mgid() in core/cma.c). So indeed the assumption in the patch is that mgids which translate to legal IP multicast addresses are inserted into the database either by ipoib or rdma-cm consumers who use IPOIB_PS for their ID's. > > The per device flag is initialized by the module param value at > > ipoib_dev_init() > > I still don't really get why there's a module parameter to set the > initial value of a flag that only root can change anyway. Why not > just the flag through sysfs after loading ipoib rather than having a > module parameter to do the same thing? A module param enables adding a options ib_ipoib umcast_allowed=1 line to /etc/modprobe.conf to make this setting persistent across module unload/load (eg reboots) and be applied to all the devices created by ipoib. A sysfs entry has to be explicitly written following each device creation. 
The reason for the sysfs entry existence is to allow reading the current value and do runtime changes where in environment that the admin wants user-mode multicast to be allowed as the default they will use the module param, does this make sense? The reason for the set_ entry to be allowed for writing to root only is since we don't want non privilege user to change the settings. Or. From mst at dev.mellanox.co.il Sun Sep 30 04:23:47 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 30 Sep 2007 13:23:47 +0200 Subject: [ofa-general] off list for a while, email address change Message-ID: <20070930112347.GB3780@mellanox.co.il> Please note that my email address is changing. You can contact me at my new address m dot s dot tsirkin at gmail dot com (address mangled to confuse spambots, replace "dot" with . and "at" with @ to get the actual mail address) Near term, I might not have time for openfabrics related issues, and might not monitor openfabrics lists. Please copy me directly if my attention is required. 
Here is a list of people at Mellanox you might want to contact: Oren Kladnitsky orenk at dev.mellanox.co.il - for firmware, imgen and mstflint Eli Cohen eli at dev.mellanox.co.il - for IPoIB, mlx4 and mthca Jim Mott jim at mellanox.com - for SDP Jack Morgenstein jackm at dev.mellanox.co.il - for core, mlx4, mthca, libmlx4, libmthca Vlad Sokolovsky vlad at mellanox.co.il - for OFED kernel, backports and build Tziporet Koren tziporet at dev.mellanox.co.il - for OFED release, perftest Sagi Rotem sagir at mellanox.co.il - for perftest Take care, -- MST From ogerlitz at voltaire.com Sun Sep 30 05:01:45 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 30 Sep 2007 14:01:45 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <1191144282.16035.7.camel@mtls03> References: <1190637355.4947.56.camel@mtls03> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> <46FF68AE.6090605@voltaire.com> <1191144282.16035.7.camel@mtls03> Message-ID: <46FF9029.7010707@voltaire.com> Eli Cohen wrote: > On Sun, 2007-09-30 at 11:13 +0200, Or Gerlitz wrote: > If I understand correctly, there were two kinds of flags discussed in > this thread. One of them is using __GFP_HIGHMEM for allocating buffers > to be utilized in the receive flow of ipoib cm which it was agreed that > will hurt performance. > The second one is using a capability flag which devices use to declare > that they support high dma - this one is not required too but implies > that ipath will not work on 32 bit platforms (e.g. i386 with PAE). What I would like to see agreement on is if we don't require this flag from the rdma device, how a user is being prevent from running into troubles when using ipoib/ipath/32bit/high-mem. 
If such config is impossible, that's fine. > Saying that, Or, can you clarify how would you like the change-long > comment to be changed? something like: This patch assumes all IB devices support DMA to high-mem pages Or. From mst at dev.mellanox.co.il Sun Sep 30 05:06:15 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 30 Sep 2007 14:06:15 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <46FF9029.7010707@voltaire.com> References: <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> <46FF68AE.6090605@voltaire.com> <1191144282.16035.7.camel@mtls03> <46FF9029.7010707@voltaire.com> Message-ID: <20070930120615.GA882@mellanox.co.il> > What I would like to see agreement on is if we don't require this flag > from the rdma device, how a user is being prevent from running into > troubles when using ipoib/ipath/32bit/high-mem. If such config is > impossible, that's fine. AFAIK, such config is currently impossible. -- MST From eli at dev.mellanox.co.il Sun Sep 30 06:38:14 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 30 Sep 2007 15:38:14 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <46FF9029.7010707@voltaire.com> References: <1190637355.4947.56.camel@mtls03> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> <46FF68AE.6090605@voltaire.com> <1191144282.16035.7.camel@mtls03> <46FF9029.7010707@voltaire.com> Message-ID: <1191159494.16668.4.camel@mtls03> On Sun, 2007-09-30 at 14:01 +0200, Or Gerlitz wrote: > > Saying that, Or, can you clarify how would you like the change-long > > comment to be changed? 
> > something like:
> >
> > This patch assumes all IB devices support DMA to high-mem pages
> >
> > Or.

Here is what we already have in the ofa git tree.

--quote
Add high dma support to ipoib

This patch assumes all IB devices support 64 bit DMA.
--end quote

Do you think it is not clear enough?

From ogerlitz at voltaire.com Sun Sep 30 07:18:19 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 30 Sep 2007 16:18:19 +0200
Subject: [ofa-general] multiple threads posting to the same QP
Message-ID: <46FFB02B.8040307@voltaire.com>

Roland,

Looking at the libibverbs sources and man pages, I can't figure out
whether posting to the same QP from multiple threads is supported (or
whether it should be supported by the low-level libraries):

> /**
>  * ibv_post_send - Post a list of work requests to a send queue.
>  */
> static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
>                                 struct ibv_send_wr **bad_wr)
> {
>         return qp->context->ops.post_send(qp, wr, bad_wr);
> }
>
> /**
>  * ibv_post_recv - Post a list of work requests to a receive queue.
>  */
> static inline int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr,
>                                 struct ibv_recv_wr **bad_wr)
> {
>         return qp->context->ops.post_recv(qp, wr, bad_wr);
> }

I don't see any locking here, nor any comment on whether the low-level
library should lock or the app must lock when multiple threads are
posting, etc. My understanding is that posting sends from thread X and
recvs from thread Y is supported, but I can't really tell about
multiple threads posting sends. Let's sync on this and then document it.

Dotan, is there any mention of a multiple-thread scheme in the
libibverbs/librdmacm man pages?

Or.
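[Editorial sketch] The question above is whether callers must serialize posts themselves. As a rough illustration only, here is what application-side serialization would look like if a provider library did not lock internally. This does not use libibverbs: `toy_qp`, `toy_post_send`, `locked_post_send` and `run_posters` are hypothetical names invented for this sketch, and it says nothing about what any real provider does.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a queue pair; NOT a libibverbs type. */
struct toy_qp {
	pthread_mutex_t lock;   /* application-side lock for the send queue */
	unsigned long   posted; /* number of work requests "posted" so far */
};

/* Hypothetical unlocked post: a plain read-modify-write that would race
 * if two threads called it on the same toy_qp at the same time. */
static int toy_post_send(struct toy_qp *qp)
{
	qp->posted = qp->posted + 1;
	return 0;
}

/* Serialize all posters in the application. */
static int locked_post_send(struct toy_qp *qp)
{
	int rc;

	pthread_mutex_lock(&qp->lock);
	rc = toy_post_send(qp);
	pthread_mutex_unlock(&qp->lock);
	return rc;
}

#define NUM_THREADS      4
#define POSTS_PER_THREAD 10000

static void *poster(void *arg)
{
	struct toy_qp *qp = arg;

	for (int i = 0; i < POSTS_PER_THREAD; i++) {
		int rc = locked_post_send(qp);
		assert(rc == 0);
	}
	return NULL;
}

/* Run NUM_THREADS posters against one toy_qp; return the final count. */
static unsigned long run_posters(void)
{
	struct toy_qp qp = { .posted = 0 };
	pthread_t tid[NUM_THREADS];

	pthread_mutex_init(&qp.lock, NULL);
	for (int i = 0; i < NUM_THREADS; i++)
		pthread_create(&tid[i], NULL, poster, &qp);
	for (int i = 0; i < NUM_THREADS; i++)
		pthread_join(tid[i], NULL);
	pthread_mutex_destroy(&qp.lock);
	return qp.posted;
}
```

With the mutex in place every post is counted. If the low-level library already takes its own lock around posting, a wrapper like this is redundant and only adds overhead, which is why the thread asks for the behaviour to be documented.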
From ogerlitz at voltaire.com Sun Sep 30 07:19:46 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 30 Sep 2007 16:19:46 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <1191159494.16668.4.camel@mtls03> References: <1190637355.4947.56.camel@mtls03> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> <46FF68AE.6090605@voltaire.com> <1191144282.16035.7.camel@mtls03> <46FF9029.7010707@voltaire.com> <1191159494.16668.4.camel@mtls03> Message-ID: <46FFB082.7010402@voltaire.com> Eli Cohen wrote: > Here is what we already have in the ofa git tree. > --quote > Add high dma support to ipoib > > This patch assumes all IB devices support 64 bit DMA. > --end quote no (in the tenth time) "...support 64 bit DMA" is not the issue handled by this patch but rather high-memory support. Or. From eli at dev.mellanox.co.il Sun Sep 30 07:48:52 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 30 Sep 2007 16:48:52 +0200 Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support In-Reply-To: <46FFB082.7010402@voltaire.com> References: <1190637355.4947.56.camel@mtls03> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> <46FF68AE.6090605@voltaire.com> <1191144282.16035.7.camel@mtls03> <46FF9029.7010707@voltaire.com> <1191159494.16668.4.camel@mtls03> <46FFB082.7010402@voltaire.com> Message-ID: <1191163732.16668.15.camel@mtls03> > no (in the tenth time) "...support 64 bit DMA" is not the issue handled > by this patch but rather high-memory support. > > Or. 
> Look again at the original patch (attached) - it's all about adding HIGH_DMA support and not high memory support. -------------- next part -------------- An embedded message was scrubbed... From: Eli Cohen Subject: [ofa-general] [PATCH 1/11] IB/ipoib: high dma support Date: Mon, 24 Sep 2007 14:35:55 +0200 Size: 4059 URL: From tziporet at dev.mellanox.co.il Sun Sep 30 08:48:10 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 30 Sep 2007 17:48:10 +0200 Subject: [ofa-general] Re: [ewg] Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. In-Reply-To: <46FBC958.4090209@opengridcomputing.com> References: <20070912100025.3190.89259.stgit@dell3.ogc.int> <000101c7f568$9275b520$ff0da8c0@amr.corp.intel.com> <46F3E3D2.70601@opengridcomputing.com> <20070923085052.GC24557@mellanox.co.il> <46F6CCA4.1010607@opengridcomputing.com> <46FBC958.4090209@opengridcomputing.com> Message-ID: <46FFC53A.7000902@mellanox.co.il> Steve Wise wrote: > Michael, > > Have you pulled this in yet? I want to close out the bug I have open... > > This was done by Michael Tziporet From dotanb at dev.mellanox.co.il Sun Sep 30 08:54:20 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 30 Sep 2007 17:54:20 +0200 Subject: [ofa-general] Re: multiple threads posting to the same QP In-Reply-To: <46FFB02B.8040307@voltaire.com> References: <46FFB02B.8040307@voltaire.com> Message-ID: <46FFC6AC.5010605@dev.mellanox.co.il> Or Gerlitz wrote: > Dotan, is there any mentioning of multiple thread scheme in the > libibverbs/librdmacm man pages? As much as i know, libibverbs is a fully thread safe library. I checked the code of the mthca and i noticed a spin lock before posting (SR or RR), so everything should be o.k. if you post from different threads in parallel . I didn't add any "thread" text in the man pages yet. If you think that it is required, i will add it somewhere (i can add a new man file verbs.h.3 that will specify this). 
Dotan

From tziporet at dev.mellanox.co.il Sun Sep 30 08:59:05 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 30 Sep 2007 17:59:05 +0200
Subject: [ofa-general] I am on vacation this week
Message-ID: <46FFC7C9.7090807@mellanox.co.il>

Will be back on Sunday 7-Oct

Tziporet

From hadi at cyberus.ca Sun Sep 30 11:50:05 2007
From: hadi at cyberus.ca (jamal)
Date: Sun, 30 Sep 2007 14:50:05 -0400
Subject: [ofa-general] [PATCHES] TX batching
In-Reply-To: <1190569987.4256.52.camel@localhost>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost>
Message-ID: <1191178205.6165.25.camel@localhost>

Latest net-2.6.24 breaks the patches i posted last week; so this is
an update to resolve that. If you are receiving these emails and are
finding them overloading, please give me a shout and i will remove your
name.

Please provide feedback on the code and/or architecture.
Last time i posted them i received none. They are now updated to
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

I have decided i will kill ->hard_batch_xmit() and not support any
more LLTX drivers. This is the last of the patches that will have
->hard_batch_xmit() as i am supporting an e1000 that is LLTX.

Dave, please let me know if this meets your desire to allow devices
which are SG and able to compute CSUM to benefit, just in case i
misunderstood.
Herbert, if you can look at at least patch 3 i will appreciate it
(since it kills dev->gso_skb that you introduced).

More patches to follow later if i get some feedback - i didn't want to
overload people by dumping too many patches. Most of these patches
mentioned below are ready to go; some need some re-testing and others
need a little porting from an earlier kernel:
- tg3 driver (tested and works well, but don't want to send
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver-howto i wrote that was posted on netdev last week
as well as one that describes the architectural decisions made.

Each of these patches has been performance tested (last with DaveM's
tree from last weekend) and the results are in the logs on a per-patch
basis.
My system under test hardware is a 2xdual core opteron with a couple of
tg3s. I have not re-run the tests with this morning's tree but i suspect
not much difference.
My test tool generates udp traffic of different sizes for up to 60
seconds per run or a total of 30M packets. I have 4 threads each running
on a specific CPU which keep all the CPUs as busy as they can sending
packets targeted at a directly connected box's udp discard port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule
which counts and drops all incoming udp packets to the discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each
packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.
cheers,
jamal

From hadi at cyberus.ca Sun Sep 30 11:51:24 2007
From: hadi at cyberus.ca (jamal)
Date: Sun, 30 Sep 2007 14:51:24 -0400
Subject: [ofa-general] [PATCH 1/4] [NET_BATCH] Introduce batching interface
In-Reply-To: <1190570317.4256.59.camel@localhost>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost>
Message-ID: <1191178284.6165.27.camel@localhost>

This patch introduces the netdevice interface for batching.

cheers,
jamal
-------------- next part --------------
[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

A typical driver dev->hard_start_xmit() has 4 parts:
a) packet formatting (example vlan, mss, descriptor counting etc)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on,
tx completion interrupts etc

[For code cleanliness/readability sake, regardless of this work,
one should break the dev->hard_start_xmit() into those 4 functions
anyways].

With the api introduced in this patch, a driver which has all
4 parts and needs to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this)
Note: There are drivers which may not need to support either of the two
methods (example the tun driver i patched) so the two methods are
optional.

The core will first do the packet formatting by invoking your supplied
dev->hard_prep_xmit() method. It will then pass you the packet
via your dev->hard_start_xmit() method and lastly will invoke your
dev->hard_end_xmit() when it completes passing you all the packets
queued for you. dev->hard_prep_xmit() is invoked without holding any
tx lock but the rest are under TX_LOCK().

LLTX presents a challenge in that we have to introduce a deviation
from the norm and introduce the ->hard_batch_xmit() method. An LLTX
driver presents us with ->hard_batch_xmit() to which we pass it a list
of packets in a dev->blist skb queue. It is then the responsibility
of the ->hard_batch_xmit() to exercise steps #b and #c for all packets
and #d when the batching is complete. Step #a is already done for you
by the time you get the packets in dev->blist.
Lastly, the xmit_win variable is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we don't requeue to the qdisc (and avoids burning
unnecessary cpu cycles or introducing any strange re-ordering). The driver
tells us when it invokes netif_wake_queue how much space it has for
descriptors by setting this variable.

Some decisions i had to make:
- every driver will have a xmit_win variable and the core will set it
to 1 which means the behavior of non-batching drivers stays the same.
- the batch list, blist, is no longer a pointer; wastes a little extra
memory i plan to recoup by killing gso_skb in later patches.

There's a lot of history and reasoning on why batching in a document
i am writing which i may submit as a patch.
Thomas Graf (who doesn't know this probably) gave me the impetus to
start looking at this back in 2004 when he invited me to the linux
conference he was organizing. Parts of what i presented in SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and
David Miller, have contributed in one or more of {bug fixes, enhancements,
testing, lively discussion}. The Broadcom and netiron folks have been
outstanding in their help.

Signed-off-by: Jamal Hadi Salim

---
commit 624a0bfeb971c9aa58496c7372df01f0ed750def
tree c1c0ee53453392866a5241631a7502ce6569b2cc
parent 260dbcc4b0195897c539c5ff79d95afdddeb3378
author Jamal Hadi Salim Sun, 30 Sep 2007 14:23:31 -0400
committer Jamal Hadi Salim Sun, 30 Sep 2007 14:23:31 -0400

 include/linux/netdevice.h |   17 +++++++
 net/core/dev.c            |  106 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..df1fb61 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -595,6 +596,15 @@ struct net_device
 	void			*priv;	/* pointer to private data */
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	/* hard_batch_xmit is needed for LLTX, kill it when those
+	 * disappear or better kill it now and dont support LLTX
+	*/
+	int			(*hard_batch_xmit) (struct net_device *dev);
+	int			(*hard_prep_xmit) (struct sk_buff *skb,
+						   struct net_device *dev);
+	void			(*hard_end_xmit) (struct net_device *dev);
+	int			xmit_win;
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx */
 
@@ -609,6 +619,7 @@ struct net_device
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
+	struct sk_buff_head	blist;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
 
@@ -1044,6 +1055,12 @@ extern int		dev_set_mac_address(struct net_device *,
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_batch_xmit(struct net_device *dev);
+extern int		prepare_gso_skb(struct sk_buff *skb,
+					struct net_device *dev,
+					struct sk_buff_head *skbs);
+extern int		xmit_prepare_skb(struct sk_buff *skb,
+					 struct net_device *dev);
 
 extern int		netdev_budget;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 833f060..f82aff7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1517,6 +1517,110 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
+int prepare_gso_skb(struct sk_buff *skb, struct net_device *dev,
+		    struct sk_buff_head *skbs)
+{
+	int tdq = 0;
+	do {
+		struct sk_buff *nskb = skb->next;
+
+		skb->next = nskb->next;
+		nskb->next = NULL;
+
+		if (dev->hard_prep_xmit) {
+			/* note: skb->cb is set in hard_prep_xmit(),
+			 * it should not be trampled somewhere
+			 * between here and the driver picking it
+			 * The VLAN code wrongly assumes it owns it
+			 * so the driver needs to be careful; for
+			 * good handling look at tg3 driver ..
+			 */
+			int ret = dev->hard_prep_xmit(nskb, dev);
+			if (ret != NETDEV_TX_OK)
+				continue;
+		}
+		/* Driver likes this packet .. */
+		tdq++;
+		__skb_queue_tail(skbs, nskb);
+	} while (skb->next);
+	skb->destructor = DEV_GSO_CB(skb)->destructor;
+	kfree_skb(skb);
+
+	return tdq;
+}
+
+int xmit_prepare_skb(struct sk_buff *skb, struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+		if (skb->next)
+			return prepare_gso_skb(skb, dev, skbs);
+	}
+
+	if (dev->hard_prep_xmit) {
+		int ret = dev->hard_prep_xmit(skb, dev);
+		if (ret != NETDEV_TX_OK)
+			return 0;
+	}
+	__skb_queue_tail(skbs, skb);
+	return 1;
+}
+
+int dev_batch_xmit(struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+	int rc = NETDEV_TX_OK;
+	struct sk_buff *skb;
+	int orig_w = dev->xmit_win;
+	int orig_pkts = skb_queue_len(skbs);
+
+	if (dev->hard_batch_xmit) { /* only for LLTX devices */
+		rc = dev->hard_batch_xmit(dev);
+	} else {
+		while ((skb = __skb_dequeue(skbs)) != NULL) {
+			if (!list_empty(&ptype_all))
+				dev_queue_xmit_nit(skb, dev);
+			rc = dev->hard_start_xmit(skb, dev);
+			if (unlikely(rc))
+				break;
+			/*
+			 * XXX: multiqueue may need closer srutiny..
+			 */
+			if (unlikely(netif_queue_stopped(dev) ||
+			     netif_subqueue_stopped(dev, skb->queue_mapping))) {
+				rc = NETDEV_TX_BUSY;
+				break;
+			}
+		}
+	}
+
+	/* driver is likely buggy and lied to us on how much
+	 * space it had. Damn you driver ..
+	 */
+	if (unlikely(skb_queue_len(skbs))) {
+		printk(KERN_WARNING "Likely bug %s %s (%d) "
+				"left %d/%d window now %d, orig %d\n",
+			dev->name, rc?"busy":"locked",
+			netif_queue_stopped(dev),
+			skb_queue_len(skbs),
+			orig_pkts,
+			dev->xmit_win,
+			orig_w);
+		rc = NETDEV_TX_BUSY;
+	}
+
+	if (orig_pkts > skb_queue_len(skbs))
+		if (dev->hard_end_xmit)
+			dev->hard_end_xmit(dev);
+
+	return rc;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(!skb->next)) {
@@ -3551,6 +3655,8 @@ int register_netdevice(struct net_device *dev)
 		}
 	}
 
+	dev->xmit_win = 1;
+	skb_queue_head_init(&dev->blist);
 	ret = netdev_register_kobject(dev);
 	if (ret)
 		goto err_uninit;

From hadi at cyberus.ca Sun Sep 30 11:52:26 2007
From: hadi at cyberus.ca (jamal)
Date: Sun, 30 Sep 2007 14:52:26 -0400
Subject: [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <1190570409.4256.62.camel@localhost>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost>
Message-ID: <1191178346.6165.29.camel@localhost>

This patch adds the usage of batching within the core.

cheers,
jamal
-------------- next part --------------
[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
The same test methodology used in introducing txlock is used, with the following results on different kernels:

        +------------+--------------+-------------+------------+--------+
        |    64B     |     128B     |    256B     |    512B    | 1024B  |
        +------------+--------------+-------------+------------+--------+
Original|   467482   |    463061    |   388267    |   216308   | 114704 |
txlock  |   468922   |    464060    |   388298    |   216316   | 114709 |
tg3nobtx|   468012   |    464079    |   388293    |   216314   | 114704 |
tg3btxdr|   480794   |    475102    |   388298    |   216316   | 114705 |
tg3btxco|   481059   |    475423    |   388285    |   216308   | 114706 |
        +------------+--------------+-------------+------------+--------+

The first two rows, "Original" and "txlock", were introduced in an earlier patch and demonstrate a slight increase in performance with txlock.
"tg3nobtx" shows the tg3 driver with no changes to support batching. The purpose of this test is to demonstrate the effect of introducing the core changes to a driver that doesn't support them. Although this patch brings down performance slightly compared to txlock for such netdevices, it is still better compared to just the original kernel.
"tg3btxdr" demonstrates the effect of using ->hard_batch_xmit() with the tg3 driver. "tg3btxco" demonstrates the effect of letting the core do all the work. As can be seen, the last two are not very different in performance. The difference is that ->hard_batch_xmit() introduces a new method, which is intrusive.

I have #if-0ed some of the old functions so the patch is more readable.
Signed-off-by: Jamal Hadi Salim --- commit 9b4a8fb190278d388c0a622fb5529d184ac8c7dc tree 053e8dda02b5d26fe7cc778823306a8a526df513 parent 624a0bfeb971c9aa58496c7372df01f0ed750def author Jamal Hadi Salim Sun, 30 Sep 2007 14:38:11 -0400 committer Jamal Hadi Salim Sun, 30 Sep 2007 14:38:11 -0400 net/sched/sch_generic.c | 127 +++++++++++++++++++++++++++++++++++++++++++---- 1 files changed, 115 insertions(+), 12 deletions(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 95ae119..86a3f9d 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q) return q->q.qlen; } +#if 0 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { @@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, return ret; } +#endif + +static inline int handle_dev_cpu_collision(struct net_device *dev) +{ + if (unlikely(dev->xmit_lock_owner == smp_processor_id())) { + if (net_ratelimit()) + printk(KERN_WARNING + "Dead loop on netdevice %s, fix it urgently!\n", + dev->name); + return 1; + } + __get_cpu_var(netdev_rx_stat).cpu_collision++; + return 0; +} + +static inline int +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + + struct sk_buff *skb; + + while ((skb = __skb_dequeue(skbs)) != NULL) + q->ops->requeue(skb, q); + + netif_schedule(dev); + return 0; +} + +static inline int +xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + int ret = handle_dev_cpu_collision(dev); + + if (ret) { + if (!skb_queue_empty(skbs)) + skb_queue_purge(skbs); + return qdisc_qlen(q); + } + + return dev_requeue_skbs(skbs, dev, q); +} + +static int xmit_count_skbs(struct sk_buff *skb) +{ + int count = 0; + for (; skb; skb = skb->next) { + count += skb_shinfo(skb)->nr_frags; + count += 1; + } + return count; +} + +static int xmit_get_pkts(struct net_device *dev, + struct Qdisc *q, + 
struct sk_buff_head *pktlist) +{ + struct sk_buff *skb; + int count = dev->xmit_win; + + if (count && dev->gso_skb) { + skb = dev->gso_skb; + dev->gso_skb = NULL; + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + while (count > 0) { + skb = q->dequeue(q); + if (!skb) + break; + + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + return skb_queue_len(pktlist); +} + +static int xmit_prepare_pkts(struct net_device *dev, + struct sk_buff_head *tlist) +{ + struct sk_buff *skb; + struct sk_buff_head *flist = &dev->blist; + + while ((skb = __skb_dequeue(tlist)) != NULL) + xmit_prepare_skb(skb, dev); + + return skb_queue_len(flist); +} /* * NOTE: Called under dev->queue_lock with locally disabled BH. @@ -130,22 +222,27 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, * >0 - queue is not empty. * */ -static inline int qdisc_restart(struct net_device *dev) + +static inline int qdisc_restart(struct net_device *dev, + struct sk_buff_head *tpktlist) { struct Qdisc *q = dev->qdisc; - struct sk_buff *skb; - int ret; + int ret = 0; - /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) - return 0; + ret = xmit_get_pkts(dev, q, tpktlist); + if (!ret) + return 0; - /* And release queue */ + /* We got em packets */ spin_unlock(&dev->queue_lock); + /* prepare to embark */ + xmit_prepare_pkts(dev, tpktlist); + + /* bye packets ....*/ HARD_TX_LOCK(dev, smp_processor_id()); - ret = dev_hard_start_xmit(skb, dev); + ret = dev_batch_xmit(dev); HARD_TX_UNLOCK(dev); spin_lock(&dev->queue_lock); @@ -158,8 +255,8 @@ static inline int qdisc_restart(struct net_device *dev) break; case NETDEV_TX_LOCKED: - /* Driver try lock failed */ - ret = handle_dev_cpu_collision(skb, dev, q); + /* Driver lock failed */ + ret = xmit_islocked(&dev->blist, dev, q); break; default: @@ -168,7 +265,7 @@ static inline int qdisc_restart(struct net_device *dev) printk(KERN_WARNING "BUG %s code %d qlen %d\n", dev->name, ret, 
q->q.qlen); - ret = dev_requeue_skb(skb, dev, q); + ret = dev_requeue_skbs(&dev->blist, dev, q); break; } @@ -177,8 +274,11 @@ static inline int qdisc_restart(struct net_device *dev) void __qdisc_run(struct net_device *dev) { + struct sk_buff_head tpktlist; + skb_queue_head_init(&tpktlist); + do { - if (!qdisc_restart(dev)) + if (!qdisc_restart(dev, &tpktlist)) break; } while (!netif_queue_stopped(dev)); @@ -564,6 +664,9 @@ void dev_deactivate(struct net_device *dev) skb = dev->gso_skb; dev->gso_skb = NULL; + if (!skb_queue_empty(&dev->blist)) + skb_queue_purge(&dev->blist); + dev->xmit_win = 1; spin_unlock_bh(&dev->queue_lock); kfree_skb(skb); From hadi at cyberus.ca Sun Sep 30 11:53:50 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 30 Sep 2007 14:53:50 -0400 Subject: [ofa-general] [PATCH 3/3][NET_SCHED] kill dev->gso_skb In-Reply-To: <1190570521.4256.65.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> <1190570521.4256.65.camel@localhost> Message-ID: <1191178430.6165.31.camel@localhost> This patch removes dev->gso_skb as it is no longer necessary with batching code. cheers, jamal -------------- next part -------------- [NET_SCHED] kill dev->gso_skb The batching code does what gso used to batch at the drivers. There is no more need for gso_skb. 
If for whatever reason the requeueing is a bad idea we are going to leave packets in dev->blist (and still not need dev->gso_skb) Signed-off-by: Jamal Hadi Salim --- commit c2916c550d228472ddcdd676c2689fa6c8ecfcc0 tree 5beaf197fd08a038d83501f405017f48712d0318 parent 9b4a8fb190278d388c0a622fb5529d184ac8c7dc author Jamal Hadi Salim Sun, 30 Sep 2007 14:38:58 -0400 committer Jamal Hadi Salim Sun, 30 Sep 2007 14:38:58 -0400 include/linux/netdevice.h | 3 --- net/sched/sch_generic.c | 12 ------------ 2 files changed, 0 insertions(+), 15 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index df1fb61..cea400a 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -577,9 +577,6 @@ struct net_device struct list_head qdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ - /* Partially transmitted GSO packet. */ - struct sk_buff *gso_skb; - /* ingress path synchronizer */ spinlock_t ingress_lock; struct Qdisc *qdisc_ingress; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 86a3f9d..b4e1607 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev, struct sk_buff *skb; int count = dev->xmit_win; - if (count && dev->gso_skb) { - skb = dev->gso_skb; - dev->gso_skb = NULL; - count -= xmit_count_skbs(skb); - __skb_queue_tail(pktlist, skb); - } - while (count > 0) { skb = q->dequeue(q); if (!skb) @@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev) void dev_deactivate(struct net_device *dev) { struct Qdisc *qdisc; - struct sk_buff *skb; spin_lock_bh(&dev->queue_lock); qdisc = dev->qdisc; @@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev) qdisc_reset(qdisc); - skb = dev->gso_skb; - dev->gso_skb = NULL; if (!skb_queue_empty(&dev->blist)) skb_queue_purge(&dev->blist); dev->xmit_win = 1; spin_unlock_bh(&dev->queue_lock); - kfree_skb(skb); - dev_watchdog_down(dev); /* Wait for outstanding 
dev_queue_xmit calls. */ From hadi at cyberus.ca Sun Sep 30 11:54:58 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 30 Sep 2007 14:54:58 -0400 Subject: [ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface In-Reply-To: <1191178284.6165.27.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1191178284.6165.27.camel@localhost> Message-ID: <1191178498.6165.33.camel@localhost> Fixed subject - should be 1/3 not 1/4 On Sun, 2007-30-09 at 14:51 -0400, jamal wrote: > This patch introduces the netdevice interface for batching. > > cheers, > jamal > > From hadi at cyberus.ca Sun Sep 30 12:19:28 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 30 Sep 2007 15:19:28 -0400 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <1191178205.6165.25.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1191178205.6165.25.camel@localhost> Message-ID: <1191179969.6165.46.camel@localhost> And heres a patch that provides a sample of the usage for batching with tg3. Requires patch "[TG3]Some cleanups" i posted earlier. cheers, jamal -------------- next part -------------- A non-text attachment was scrubbed... 
Name: tg3.potoc
Type: text/x-patch
Size: 5252 bytes
Desc: not available
URL:

From billfink at mindspring.com Sun Sep 30 21:11:35 2007
From: billfink at mindspring.com (Bill Fink)
Date: Mon, 1 Oct 2007 00:11:35 -0400
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <1191178346.6165.29.camel@localhost>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com>
 <20070916.161748.48388692.davem@davemloft.net>
 <1189988958.4230.55.camel@localhost>
 <1190569987.4256.52.camel@localhost>
 <1190570205.4256.56.camel@localhost>
 <1190570317.4256.59.camel@localhost>
 <1190570409.4256.62.camel@localhost>
 <1191178346.6165.29.camel@localhost>
Message-ID: <20071001001135.75d2b984.billfink@mindspring.com>

On Sun, 30 Sep 2007, jamal wrote:

> This patch adds the usage of batching within the core.
>
> cheers,
> jamal
> [sep30-p2of3 text/plain (6.8KB)]
> [NET_BATCH] net core use batching
>
> This patch adds the usage of batching within the core.
> The same test methodology used in introducing txlock is used, with
> the following results on different kernels:
>
>         +------------+--------------+-------------+------------+--------+
>         |    64B     |     128B     |    256B     |    512B    | 1024B  |
>         +------------+--------------+-------------+------------+--------+
> Original|   467482   |    463061    |   388267    |   216308   | 114704 |
> txlock  |   468922   |    464060    |   388298    |   216316   | 114709 |
> tg3nobtx|   468012   |    464079    |   388293    |   216314   | 114704 |
> tg3btxdr|   480794   |    475102    |   388298    |   216316   | 114705 |
> tg3btxco|   481059   |    475423    |   388285    |   216308   | 114706 |
>         +------------+--------------+-------------+------------+--------+
>
> The first two rows, "Original" and "txlock", were introduced in an earlier
> patch and demonstrate a slight increase in performance with txlock.
> "tg3nobtx" shows the tg3 driver with no changes to support batching.
> The purpose of this test is to demonstrate the effect of introducing
> the core changes to a driver that doesn't support them.
> Although this patch brings down performance slightly compared to txlock
> for such netdevices, it is still better compared to just the original
> kernel.
> "tg3btxdr" demonstrates the effect of using ->hard_batch_xmit() with tg3
> driver. "tg3btxco" demonstrates the effect of letting the core do all the
> work. As can be seen the last two are not very different in performance.
> The difference is ->hard_batch_xmit() introduces a new method which
> is intrusive.

Have you done performance comparisons for the case of using 9000-byte jumbo frames?

-Bill

From pradeeps at linux.vnet.ibm.com Sun Sep 30 21:45:04 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Sun, 30 Sep 2007 21:45:04 -0700
Subject: [ofa-general] srp_sg_tablesize related question
Message-ID: <47007B50.60102@linux.vnet.ibm.com>

I do not see a max value for srp_sg_tablesize. I see an earlier patch limiting it to 128, but that is not the case in the recent kernels.

So, what limits the size of an IU? Does it depend on the target port limiting it with an SRP_CRED_REQ?
Pradeep

From kliteyn at mellanox.co.il Sun Sep 30 22:29:13 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 1 Oct 2007 07:29:13 +0200
Subject: [ofa-general] nightly osm_sim report 2007-10-01:normal completion
Message-ID:

OSM Simulation Regression Summary

[Generated mail - please do NOT reply]

OpenSM binary date = 2007-09-30
OpenSM git rev = Tue_Sep_25_00:30:00_2007 [2c547953885809a8026e20af7809be08b42c3865]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=520 Pass=519 Fail=1

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo
12 LidMgr IS3-128.topo

Failures:
1 LidMgr IS3-128.topo