From ogerlitz at voltaire.com Mon Oct 1 00:00:37 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 01 Oct 2007 09:00:37 +0200
Subject: [ofa-general] Re: multiple threads posting to the same QP
In-Reply-To: <46FFC6AC.5010605@dev.mellanox.co.il>
References: <46FFB02B.8040307@voltaire.com> <46FFC6AC.5010605@dev.mellanox.co.il>
Message-ID: <47009B15.8010506@voltaire.com>

Dotan Barak wrote:
> Or Gerlitz wrote:
>> Dotan, is there any mentioning of multiple thread scheme in the
>> libibverbs/librdmacm man pages?

> As much as i know, libibverbs is a fully thread safe library.
> I checked the code of the mthca and i noticed a spin lock before posting
> (SR or RR), so everything should be o.k. if you post from different threads in
> parallel .

So the locking should be provided by the low-level per device library?
if this is the case, I fail to see this documented anywhere.

Also do we actually want locking in the fast posting path? for example
is it legal to call send(2) on the same socket fd from two threads?

> I didn't add any "thread" text in the man pages yet.
> If you think that it is required, i will add it somewhere (i can add a
> new man file verbs.h.3 that will specify this).

I suggest we first resolve it with Roland and then see what/how to
document this.

btw, I think it would be very helpful for users if there would be a
general man page to libibverbs, similar to the rdma_cm(7) page provided
by the librdmacm-util package

Or.
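[Editorial sketch, not part of the original thread.] The exchange above asks whether serialization of posting belongs in the low-level provider library and what it costs in the fast path. As a rough illustration only, the C fragment below shows the general pattern of a per-queue spin lock taken around the update of a send-queue tail, in the spirit of the lock Dotan reports seeing in mthca. Every name here (toy_sq, toy_sq_init, toy_post_send, TOY_SQ_DEPTH) is hypothetical; none of it is libibverbs or mthca code, and the lock is a plain C11 atomic_flag chosen for portability of the sketch, not the lock type a real provider uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SQ_DEPTH 64u   /* hypothetical queue depth */

/* Hypothetical send queue: a ring of work-request ids guarded by a spin lock. */
struct toy_sq {
    atomic_flag lock;                 /* stands in for a provider's per-QP lock */
    uint32_t    head;                 /* next slot the "hardware" would consume */
    uint32_t    tail;                 /* next free slot for a poster */
    uint64_t    wr_id[TOY_SQ_DEPTH];
};

void toy_sq_init(struct toy_sq *sq)
{
    assert(sq != NULL);
    atomic_flag_clear(&sq->lock);
    sq->head = 0;
    sq->tail = 0;
}

/* Post one work request. Returns 0 on success, -1 if the queue is full. */
int toy_post_send(struct toy_sq *sq, uint64_t wr_id)
{
    int ret = -1;

    assert(sq != NULL);

    /* Spin until this caller owns the lock. */
    while (atomic_flag_test_and_set_explicit(&sq->lock, memory_order_acquire))
        ;

    /* Critical section: claim a slot and advance the tail. */
    if (sq->tail - sq->head < TOY_SQ_DEPTH) {
        sq->wr_id[sq->tail % TOY_SQ_DEPTH] = wr_id;
        sq->tail++;
        ret = 0;
    }

    atomic_flag_clear_explicit(&sq->lock, memory_order_release);
    return ret;
}
```

With the lock inside toy_post_send, two threads may call it on the same queue without claiming the same slot. The price is one atomic read-modify-write per post even when only a single thread ever posts, which is the fast-path cost Or is asking about.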
From ogerlitz at voltaire.com Mon Oct 1 00:05:45 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 01 Oct 2007 09:05:45 +0200
Subject: [ofa-general] Re: [PATCH 1/11] IB/ipoib: high dma support
In-Reply-To: <1191163732.16668.15.camel@mtls03>
References: <1190637355.4947.56.camel@mtls03> <1190715865.4947.118.camel@mtls03> <46F8EBC6.40100@voltaire.com> <1190731269.4947.158.camel@mtls03> <1190741609.20700.101.camel@brick.pathscale.com> <20070925183243.GC9670@mellanox.co.il> <20070927210847.GH2778@mellanox.co.il> <4e6a6b3c0709280643g2ca7cf42t8ef9b7941400583f@mail.gmail.com> <46FF68AE.6090605@voltaire.com> <1191144282.16035.7.camel@mtls03> <46FF9029.7010707@voltaire.com> <1191159494.16668.4.camel@mtls03> <46FFB082.7010402@voltaire.com> <1191163732.16668.15.camel@mtls03>
Message-ID: <47009C49.7030001@voltaire.com>

Eli Cohen wrote:
>> no (in the tenth time) "...support 64 bit DMA" is not the issue handled
>> by this patch but rather high-memory support.

> Look again at the original patch (attached) - it's all about adding
> HIGH_DMA support and not high memory support.

talking about looking, include/linux/netdevice.h states:

======================================================================
#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */
======================================================================

I am just insisting on correct documentation no-less but no-more.
Your patch is all about DMA-ing to highmem and nothing about 64 bit DMA

Or.

> Add high dma support to ipoib
>
> This patch assumes all IB devices support 64 bit DMA.
>
> Signed-off-by: Eli Cohen
>
> ---
>
> Index: linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c
> ===================================================================
> --- linux-2.6.23-rc1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-15 20:50:16.000000000 +0300
> +++ linux-2.6.23-rc1/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-15 20:50:27.000000000 +0300
> @@ -1079,6 +1079,8 @@ static struct net_device *ipoib_add_port
>
> 	SET_NETDEV_DEV(priv->dev, hca->dma_device);
>
> +	priv->dev->features |= NETIF_F_HIGHDMA;
> +
> 	result = ib_query_pkey(hca, port, 0, &priv->pkey);
> 	if (result) {
> 		printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n",

From dotanb at dev.mellanox.co.il Mon Oct 1 00:14:26 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Mon, 01 Oct 2007 09:14:26 +0200
Subject: [ofa-general] Re: multiple threads posting to the same QP
In-Reply-To: <47009B15.8010506@voltaire.com>
References: <46FFB02B.8040307@voltaire.com> <46FFC6AC.5010605@dev.mellanox.co.il> <47009B15.8010506@voltaire.com>
Message-ID: <47009E52.4070404@dev.mellanox.co.il>

Or Gerlitz wrote:
> Dotan Barak wrote:
>> Or Gerlitz wrote:
>>> Dotan, is there any mentioning of multiple thread scheme in the
>>> libibverbs/librdmacm man pages?
>
>> As much as i know, libibverbs is a fully thread safe library.
>> I checked the code of the mthca and i noticed a spin lock before
>> posting (SR or RR), so everything should be o.k. if you post from
>> different threads in parallel .
>
> So the locking should be provided by the low-level per device library?
> if this is the case, I fail to see this documented anywhere.
>
> Also do we actually want locking in the fast posting path? for example
> is it legal to call send(2) on the same socket fd from two threads?
>
>> I didn't add any "thread" text in the man pages yet.
>> If you think that it is required, i will add it somewhere (i can add
>> a new man file verbs.h.3 that will specify this).
>
> I suggest we first resolve it with Roland and then see what/how to
> document this.
let's wait for Roland response on it.
>
> btw, I think it would be very helpful for users if there would be a
> general man page to libibverbs, similar to the rdma_cm(7) page
> provided by the librdmacm-util package
I will add it to my todo list for OFED 1.3

Dotan

From jackm at dev.mellanox.co.il Mon Oct 1 00:48:40 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 1 Oct 2007 09:48:40 +0200
Subject: [ofa-general] Re: [PATCH] mlx4: display misc device information via sysfs under /sys/class/infiniband/mlx4_x, for ibstat and ibv_devinfo
In-Reply-To:
References: <200709180914.18560.jackm@dev.mellanox.co.il>
Message-ID: <200710010948.40453.jackm@dev.mellanox.co.il>

On Wednesday 26 September 2007 21:54, Roland Dreier wrote:
> I had to edit the patch, because of:
>
>  >      MLX4_MGM_ENTRY_SIZE     =  0x100,
>
> this context didn't match the upstream kernel (I see 0x40 there).  Is
> there some reason you have a bigger size in your tree?  If so should
> we make the change upstream too?

Yes, the change should go upstream. With an MGM entry size of 64, each
multicast group can support only 8 QPs. Increasing the entry size to 256
enables support of 56 QPs per multicast group (8 QPs per multicast group
was not enough for some users).

- Jack

From vlad at lists.openfabrics.org Mon Oct 1 02:56:50 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 1 Oct 2007 02:56:50 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20071001-0200 daily build status
Message-ID: <20071001095650.3C4FEE6094A@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with
linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-8.el5 Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071001-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' 
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From ogerlitz at voltaire.com Mon Oct 1 03:24:38 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 01 Oct 2007 12:24:38 +0200
Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management
In-Reply-To: <46F9065C.3090907@voltaire.com>
References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> <46F7E96E.4060302@ichips.intel.com> <46F9065C.3090907@voltaire.com>
Message-ID: <4700CAE6.6040801@voltaire.com>

Or Gerlitz wrote:
> Sean Hefty wrote:
>>>> node 1 <-> switch A <-> switch B <-> switch C <-> SA

>>> The host would only see port up/down events as of changes in the link
>>> state in the local port or in the port which is connected to it through
>>> the cable.

>> So, if you brought the link down/up between switches A & B, node 1
>> wouldn't receive any events, but it would be removed from the multicast
>> group?

> However, from the view point of the node, no port down is experienced.

I have tested the node <-> switch A <-> switch B (Voltaire SM/SA running
here) scheme, and the possible problem you pointed out does not happen:

First, when the A-B link goes down, the node is removed from the multicast
group at the SA database. No event is experienced by the node.

Second, when the link is brought back online, the SM discovers and
configures the port. This causes a bunch (six!) of events to be generated,
IPoIB joins the multicast group (the broadcast group in this case) and we
are done. This join actually goes out to the SA from the multicast core
code, and the node is listed in the SA database for the group.

The node system I was using has:
OFED 1.2 / mthca / device 25208 (Arbel memfull) / firmware 4.8.200

I understand that for this device the events are generated by the firmware
and not by some filtering code that captures MADs passed through the
process_mad() verb (Sean, am I correct?).
I don't know exactly which events were generated by the firmware, since the IPoIB event handler does not print the event number, however, I am sure that PORT_ACTIVE and PKEY_CHANGE (for this one there is a different print which you can see below) were among them, and there is some chance that CLIENT_REREGISTER is not one of them. Note that port active event for itself is not enough for all this to work, since the multicast code does not flush the entries and hence a join that follows will be possibly replied with cached attributes (as discussed earlier on this thread) and no query would be sent to the SA. At the bottom line, with this device/firmware the problem does not happen, but there's a possible hole here if the IB spec does not require the SM to set the client re-register bit each time it discovers a node. Below is the ipoib log after I have reconnected the cable (when I removed it no event was generated and local read of the port info, eg through ibv_devinfo, reported the port is UP...) Or. > ib0: Port state change event > ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status -102) > ib0: Flushing ib0 > ib0.f1f1: Not flushing - IPOIB_FLAG_INITIALIZED not set. > ib0: flushing > ib0: downing ib_dev > ib0: stopping multicast thread > ib0: flushing multicast list > ib0: leaving MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: starting multicast thread > ib0: restarting multicast task > ib0: stopping multicast thread > ib0: starting multicast thread > ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: Port state change event > ib0: Flushing ib0 > ib0.f1f1: Not flushing - IPOIB_FLAG_INITIALIZED not set. 
> ib0: flushing > ib0: downing ib_dev > ib0: stopping multicast thread > ib0: flushing multicast list > ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: starting multicast thread > ib0: restarting multicast task > ib0: stopping multicast thread > ib0: starting multicast thread > ib0: Port state change event > ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: Flushing ib0 > ib0.f1f1: Not flushing - IPOIB_FLAG_INITIALIZED not set. > ib0: flushing > ib0: downing ib_dev > ib0: stopping multicast thread > ib0: flushing multicast list > ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: starting multicast thread > ib0: restarting multicast task > ib0: stopping multicast thread > ib0: starting multicast thread > ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: Port state change event > ib0: Flushing ib0 > ib0.f1f1: Not flushing - IPOIB_FLAG_INITIALIZED not set. > ib0: flushing > ib0: downing ib_dev > ib0: stopping multicast thread > ib0: flushing multicast list > ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: starting multicast thread > ib0: restarting multicast task > ib0: stopping multicast thread > ib0: starting multicast thread > ib0: Port state change event > ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: Flushing ib0 > ib0.f1f1: Not flushing - IPOIB_FLAG_INITIALIZED not set. > ib0: flushing > ib0: downing ib_dev > ib0: stopping multicast thread > ib0: flushing multicast list > ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: starting multicast thread > ib0: restarting multicast task > ib0: stopping multicast thread > ib0: starting multicast thread > ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: pkey change event on port:1 > ib0: Flushing ib0 and restarting it's QP > ib0.f1f1: Not flushing - IPOIB_FLAG_INITIALIZED not set. > ib0: Not flushing - pkey index not changed. 
> ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status -110) > ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -110 > ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff > ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0) > ib0: Created ah 000001001f447640 > ib0: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV 000001001f447640, LID 0xc000, SL 0 > ib0: successfully joined all multicast groups From kaber at trash.net Mon Oct 1 03:42:28 2007 From: kaber at trash.net (Patrick McHardy) Date: Mon, 01 Oct 2007 12:42:28 +0200 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191178346.6165.29.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> <1191178346.6165.29.camel@localhost> Message-ID: <4700CF14.2010809@trash.net> jamal wrote: > +static inline int > +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev, > + struct Qdisc *q) > +{ > + > + struct sk_buff *skb; > + > + while ((skb = __skb_dequeue(skbs)) != NULL) > + q->ops->requeue(skb, q); ->requeue queues at the head, so this looks like it would reverse the order of the skbs. From Arkady.Kanevsky at netapp.com Mon Oct 1 05:34:42 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 1 Oct 2007 08:34:42 -0400 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support"iwarp-only"interfacestoavoid 4-tuple conflicts. 
In-Reply-To: <46FD7380.6050107@ichips.intel.com>
References: <20070923203649.8324.64524.stgit@dell3.ogc.int><46FBF8AF.9040700@ichips.intel.com><000101c8013a$41b374f0$a7cc180a@amr.corp.intel.com><46FD5A2F.7010409@opengridcomputing.com> <46FD7380.6050107@ichips.intel.com>
Message-ID:

Sean,
Not so simple.
How does the client application know where to connect?
Does this proposal force applications to choose the "right" network?
Currently, MPA or the ULP, and not applications, handle it.
Why would we want to change that?

Sean, I may be beating a dead horse, but I recall that one of the main
selling points of RDMA is that it gives a magical boost to performance
with no changes to applications. Just plug it in and voila, performance
goes up and CPU utilization for the network stack goes down. Win-Win.

Thanks,

Arkady Kanevsky                 email: arkady at netapp.com
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300

> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Friday, September 28, 2007 5:35 PM
> To: Kanevsky, Arkady
> Cc: netdev at vger.kernel.org; rdreier at cisco.com;
> linux-kernel at vger.kernel.org; general at lists.openfabrics.org
> Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3:
> Support"iwarp-only"interfacestoavoid 4-tuple conflicts.
>
> Kanevsky, Arkady wrote:
> > Exactly,
> > it forces the burden on administrator.
> > And one will be forced to try one mount for iWARP and it does not work
> > issue another one TCP or UDP if it fails.
> > Yack!
> >
> > And server will need to listen on different IP address and simple
> > * will not work since it will need to listen in two different domains.
>
> The server already has to call listen twice. Once for the
> rdma_cm and once for sockets. Similarly on the client side,
> connect must be made over rdma_cm or sockets. I really don't
> see any impact on the application for this approach.
> > We just end up separating the port space based on networking > addresses, rather than keeping the problem at the transport > level. If you have an alternate approach that will be > accepted upstream, feel free to post it. > > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From erezz at Voltaire.COM Mon Oct 1 06:02:37 2007 From: erezz at Voltaire.COM (Erez Zilber) Date: Mon, 01 Oct 2007 15:02:37 +0200 Subject: [ofa-general] Re: ofed 1.3 kernel tree updated to 2.6.23-rc8 In-Reply-To: <20070926113001.GC2778@mellanox.co.il> References: <20070926113001.GC2778@mellanox.co.il> Message-ID: <4700EFED.2040105@Voltaire.COM> Michael S. Tsirkin wrote: > > Hello! > I have updated the OFED 1.3 kernel tree at > git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel > to upstream 2.6.23-rc8. > > I have resolved minor conflicts in libiscsi backports for RHEL4, > and everything seems to build fine now. iSER maintainers, please > verify that I did the right thing. > > -- > MST > Those fixes are problematic because you didn't merge the patches that I sent 2 weeks ago: http://lists.openfabrics.org/pipermail/ewg/2007-September/004576.html I suggest that you merge them first, and then we can see if there are any conflicts. 
Thanks, Erez From hadi at cyberus.ca Mon Oct 1 06:21:40 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 01 Oct 2007 09:21:40 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <4700CF14.2010809@trash.net> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> <1191178346.6165.29.camel@localhost> <4700CF14.2010809@trash.net> Message-ID: <1191244900.4378.3.camel@localhost> On Mon, 2007-01-10 at 12:42 +0200, Patrick McHardy wrote: > jamal wrote: > > + while ((skb = __skb_dequeue(skbs)) != NULL) > > + q->ops->requeue(skb, q); > > > ->requeue queues at the head, so this looks like it would reverse > the order of the skbs. Excellent catch! thanks; i will fix. As a side note: Any batching driver should _never_ have to requeue; if it does it is buggy. And the non-batching ones if they ever requeue will be a single packet, so not much reordering. Thanks again Patrick. 
cheers, jamal From hadi at cyberus.ca Mon Oct 1 06:30:40 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 01 Oct 2007 09:30:40 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071001001135.75d2b984.billfink@mindspring.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> <1191178346.6165.29.camel@localhost> <20071001001135.75d2b984.billfink@mindspring.com> Message-ID: <1191245440.4378.12.camel@localhost> On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote: > Have you done performance comparisons for the case of using 9000-byte > jumbo frames? I havent, but will try if any of the gige cards i have support it. As a side note: I have not seen any useful gains or losses as the packet size approaches even 1500B MTU. For example, post about 256B neither the batching nor the non-batching give much difference in either throughput or cpu use. Below 256B, theres a noticeable gain for batching. Note, in the cases of my tests all 4 CPUs are in full-throttle UDP and so the occupancy of both the qdisc queue(s) and ethernet ring is constantly high. For example at 512B, the app is 80% idle on all 4 CPUs and we are hitting in the range of wire speed. We are at 90% idle at 1024B. This is the case with or without batching. So my suspicion is that with that trend a 9000B packet will just follow the same pattern. cheers, jamal From mhagen at iol.unh.edu Mon Oct 1 06:44:13 2007 From: mhagen at iol.unh.edu (Mikkel Hagen) Date: Mon, 01 Oct 2007 09:44:13 -0400 Subject: [ofa-general] OFA-IWG Interop Event - Last Day to Register (10/2/07)!!! 
In-Reply-To: <46E5841B.9020507@iol.unh.edu> References: <46E5841B.9020507@iol.unh.edu> Message-ID: <4700F9AD.3090105@iol.unh.edu> The University of New Hampshire InterOperability Lab and Open Fabrics Alliance Interoperability Working Group would like to extend an invitation to all members to attend the upcoming Interoperability Event hosted at UNH-IOL facility. We will be performing the interoperability test plan developed within the OFA-IWG and granting logos to all qualified participants shortly after the event. All required information can be found at the following link regarding logistics, registration, test plan, etc: http://www.iol.unh.edu/services/testing/ofa/events/index.php Please download the Quick Start Guide (QSG) for all information and then feel free to forward any further questions to myself (mhagen at iol.unh.edu) or iwg at list.openfabrics.org. Thanks! Mikkel Hagen Project Assistant - Fibre Channel/SAS/SATA Consortiums Research and Development Engineer - iWARP Consortium FC/SAS/SATA:1-603-862-0701 iWARP:1-603-862-5083 Fax:1-603-862-4181 UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 Mikkel Hagen wrote: > The University of New Hampshire InterOperability Lab and Open Fabrics > Alliance Interoperability Working Group would like to extend an > invitation to all members to attend the upcoming Interoperability > Event hosted at UNH-IOL facility. We will be performing the > interoperability test plan developed within the OFA-IWG and granting > logos to all qualified participants shortly after the event. All > required information can be found at the following link regarding > logistics, registration, test plan, etc: > > http://www.iol.unh.edu/services/testing/ofa/events/index.php > > Please download the Quick Start Guide (QSG) for all information and > then feel free to forward any further questions to myself > (mhagen at iol.unh.edu) or interop-wg at list.openfabrics.org. Thanks! 
>

From sashak at voltaire.com Mon Oct 1 08:16:30 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 1 Oct 2007 17:16:30 +0200
Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management
In-Reply-To: <4700CAE6.6040801@voltaire.com>
References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> <46F7E96E.4060302@ichips.intel.com> <46F9065C.3090907@voltaire.com> <4700CAE6.6040801@voltaire.com>
Message-ID: <20071001151630.GC28627@sashak.voltaire.com>

On 12:24 Mon 01 Oct , Or Gerlitz wrote:
>
> At the bottom line, with this device/firmware the problem does not happen,
> but there's a possible hole here if the IB spec does not require the SM to
> set the client re-register bit each time it discovers a node.

BTW according to IB spec client reregistration support is optional for
IB port (indicated by bit 25 of PortInfo:CapabilityMask).

Sasha

From changquing.tang at hp.com Mon Oct 1 08:20:19 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Mon, 1 Oct 2007 15:20:19 -0000
Subject: [ofa-general] How to use some ibv_wc fields ?
Message-ID: <349DCDA352EACF42A0C49FA6DCEA84030267C99F@G3W0634.americas.hpqcorp.net>

Can anyone tell how to use field 'qp_num' and 'src_qp' ?
I understand that qp_num is the destination QP's number, while src_qp is
the source QP's number, right ?

Another question, how is qp number assigned to a QP ? is it a random number, or
a number from 1 and then increasing with QPs ? one process basis or on node basis ?

I hope to use qp_num to hash the QP pointer, then in SRQ mode, I know who sent
the message.

Thanks.

--CQ

struct ibv_wc {
	uint64_t		wr_id;
	enum ibv_wc_status	status;
	enum ibv_wc_opcode	opcode;
	uint32_t		vendor_err;
	uint32_t		byte_len;
	uint32_t		imm_data;	/* in network byte order */
	uint32_t		qp_num;
	uint32_t		src_qp;
	enum ibv_wc_flags	wc_flags;
	uint16_t		pkey_index;
	uint16_t		slid;
	uint8_t			sl;
	uint8_t			dlid_path_bits;
};

From todd.rimmer at qlogic.com Mon Oct 1 08:36:37 2007
From: todd.rimmer at qlogic.com (Todd Rimmer)
Date: Mon, 1 Oct 2007 10:36:37 -0500
Subject: [ofa-general] How to use some ibv_wc fields ?
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA84030267C99F@G3W0634.americas.hpqcorp.net>
Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE0611930E8EB@EPEXCH2.qlogic.org>

> From: Tang, Changqing
> Sent: Monday, October 01, 2007 11:20 AM
> To: general at lists.openfabrics.org
> Subject: [ofa-general] How to use some ibv_wc fields ?
>
>
> Another question, how is qp number assigned to a QP ? is it a random
> number, or
> a number from 1 and then increasing with QPs ? one process basis or on
> node basis ?

The assignment depends on the specifics of the HCA driver.
QP 0 and QP 1 are well known and defined by the spec for management.
The remaining 2^24 QPs are up to the hardware and driver. A given QP
number is unique per HCA. As QPs are destroyed and created, QP numbers
may be reused, however at a given point in time, only a single QP in the
HCA will exist with the given number.
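
[Editorial sketch, not part of the original thread.] The lookup CQ describes (hash qp_num to find the per-QP state when completions arrive through a shared receive queue) could look roughly like the table below. It is a minimal sketch under stated assumptions: the names qp_table, qp_table_insert, qp_table_lookup and qp_table_remove are hypothetical and not part of libibverbs, the table size is arbitrary, and the code deliberately avoids any verbs calls so it stands alone. Note that the ibv_poll_cq man page describes wc.qp_num as the local QP number of the completed work request and wc.src_qp as the remote QP number, valid only for UD QPs, so for connected QPs on an SRQ the sender is identified indirectly through the local QP. Because QP numbers may be reused after a QP is destroyed, the sketch includes a remove operation that should be called before destroying the QP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizing: 1024 slots, must stay a power of two. */
#define QP_TABLE_BITS 10u
#define QP_TABLE_SIZE (1u << QP_TABLE_BITS)

enum slot_state { SLOT_EMPTY = 0, SLOT_USED, SLOT_DEAD };

struct qp_slot {
	uint32_t qp_num;        /* key: local QP number (24 significant bits) */
	void *ctx;              /* value: application state for that QP */
	enum slot_state state;  /* SLOT_DEAD marks a removed entry (tombstone) */
};

/* A zero-initialized table is empty because SLOT_EMPTY == 0. */
struct qp_table {
	struct qp_slot slot[QP_TABLE_SIZE];
};

/* Multiplicative hash; keeps the top QP_TABLE_BITS bits of the product. */
static uint32_t qp_hash(uint32_t qp_num)
{
	return (qp_num * 2654435761u) >> (32u - QP_TABLE_BITS);
}

/* Call after creating a QP. Returns 0 on success, -1 if the table is full. */
static int qp_table_insert(struct qp_table *t, uint32_t qp_num, void *ctx)
{
	uint32_t start = qp_hash(qp_num);
	struct qp_slot *free_slot = NULL;

	for (uint32_t n = 0; n < QP_TABLE_SIZE; n++) {
		struct qp_slot *s = &t->slot[(start + n) & (QP_TABLE_SIZE - 1u)];

		if (s->state == SLOT_USED) {
			if (s->qp_num == qp_num) {  /* key already present: update */
				s->ctx = ctx;
				return 0;
			}
		} else {
			if (!free_slot)
				free_slot = s;      /* remember first reusable slot */
			if (s->state == SLOT_EMPTY)
				break;              /* key cannot be further along */
		}
	}
	if (!free_slot)
		return -1;
	free_slot->qp_num = qp_num;
	free_slot->ctx = ctx;
	free_slot->state = SLOT_USED;
	return 0;
}

/* Call with wc.qp_num from a polled completion. Returns NULL if unknown. */
static void *qp_table_lookup(const struct qp_table *t, uint32_t qp_num)
{
	uint32_t start = qp_hash(qp_num);

	for (uint32_t n = 0; n < QP_TABLE_SIZE; n++) {
		const struct qp_slot *s = &t->slot[(start + n) & (QP_TABLE_SIZE - 1u)];

		if (s->state == SLOT_EMPTY)
			return NULL;
		if (s->state == SLOT_USED && s->qp_num == qp_num)
			return s->ctx;
	}
	return NULL;
}

/* Call before destroying a QP, since its number may be handed out again. */
static int qp_table_remove(struct qp_table *t, uint32_t qp_num)
{
	uint32_t start = qp_hash(qp_num);

	for (uint32_t n = 0; n < QP_TABLE_SIZE; n++) {
		struct qp_slot *s = &t->slot[(start + n) & (QP_TABLE_SIZE - 1u)];

		if (s->state == SLOT_EMPTY)
			return -1;
		if (s->state == SLOT_USED && s->qp_num == qp_num) {
			s->state = SLOT_DEAD;
			s->ctx = NULL;
			return 0;
		}
	}
	return -1;
}
```

In an application the key would presumably come from the qp_num field of the struct ibv_qp returned at creation time, and the stored pointer could be the same value passed as qp_context. One table per HCA is enough, since a QP number is unique per HCA at any given time.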

Todd Rimmer
Chief Architect
QLogic System Interconnect Group
Voice: 610-233-4852 Fax: 610-233-4777
Todd.Rimmer at QLogic.com www.QLogic.com

From anton at samba.org Mon Oct 1 08:36:20 2007
From: anton at samba.org (Anton Blanchard)
Date: Mon, 1 Oct 2007 10:36:20 -0500
Subject: [ofa-general] [PATCH] fix some ehca limits
In-Reply-To: <20070930053726.GA28619@kryten>
References: <20070930053726.GA28619@kryten>
Message-ID: <20071001153620.GA31830@kryten>

I had trouble getting DAPL to work on eHCA, this turned out to be a
negative value in max_cqe. max_pd and max_ah are currently negative too,
fix them up at the same time.

Before:

	max_cqe:		-64
	max_pd:			-1
	max_ah:			-1

After:

	max_cqe:		2147483647
	max_pd:			2147483647
	max_ah:			2147483647

Signed-off-by: Anton Blanchard
---

diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index fc19ef9..4df1e2b 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -87,11 +88,11 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 	props->max_sge         = min_t(int, rblock->max_sge, INT_MAX);
 	props->max_sge_rd      = min_t(int, rblock->max_sge_rd, INT_MAX);
 	props->max_cq          = min_t(int, rblock->max_cq, INT_MAX);
-	props->max_cqe         = min_t(int, rblock->max_cqe, INT_MAX);
+	props->max_cqe         = min_t(unsigned int, rblock->max_cqe, INT_MAX);
 	props->max_mr          = min_t(int, rblock->max_mr, INT_MAX);
 	props->max_mw          = min_t(int, rblock->max_mw, INT_MAX);
-	props->max_pd          = min_t(int, rblock->max_pd, INT_MAX);
-	props->max_ah          = min_t(int, rblock->max_ah, INT_MAX);
+	props->max_pd          = min_t(unsigned int, rblock->max_pd, INT_MAX);
+	props->max_ah          = min_t(unsigned int, rblock->max_ah, INT_MAX);
 	props->max_fmr         = min_t(int, rblock->max_mr, INT_MAX);
 	props->max_srq         = 0;
 	props->max_srq_wr      = 0;

From changquing.tang at hp.com Mon Oct 1 09:14:50 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Mon, 1 Oct 2007 16:14:50 -0000
Subject: [ofa-general] How to use some ibv_wc fields ?
In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE0611930E8EB@EPEXCH2.qlogic.org>
References: <349DCDA352EACF42A0C49FA6DCEA84030267C99F@G3W0634.americas.hpqcorp.net> <4FB1BCCAE6CAED44A1DC005B1DE0611930E8EB@EPEXCH2.qlogic.org>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA84030267CACF@G3W0634.americas.hpqcorp.net>

OK, thanks. qp_num is something like process PID.

I really wish 'struct ibv_wc' had qp_context field(void *) application given when
creating QP.

--CQ

> -----Original Message-----
> From: Todd Rimmer [mailto:todd.rimmer at qlogic.com]
> Sent: Monday, October 01, 2007 10:37 AM
> To: Tang, Changqing; general at lists.openfabrics.org
> Subject: RE: [ofa-general] How to use some ibv_wc fields ?
>
>
>
> > From: Tang, Changqing
> > Sent: Monday, October 01, 2007 11:20 AM
> > To: general at lists.openfabrics.org
> > Subject: [ofa-general] How to use some ibv_wc fields ?
> > > > > > Another question, how is qp number assigned to a QP ? is it > a random > > number, or a number from 1 and then increasing with QPs ? > one process > > basis or on node basis ? > > > The assignment depends on the specifics of the HCA driver. > QP 0 and QP 1 are well known and defined by the spec for management. > The remaining 2^24 QPs are up to the hardware and driver. A > given QP number is unique per HCA. As QPs are destroyed and > created, QP numbers may be reused, however at a given point > in time, only a single QP in the HCA will exist with the given number. > > Todd Rimmer > Chief Architect > QLogic System Interconnect Group > Voice: 610-233-4852 Fax: 610-233-4777 > Todd.Rimmer at QLogic.com www.QLogic.com > > > From mshefty at ichips.intel.com Mon Oct 1 09:17:29 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 01 Oct 2007 09:17:29 -0700 Subject: [ofa-general] [BUG report / PATCH] fix race in the core multicast management In-Reply-To: <20071001151630.GC28627@sashak.voltaire.com> References: <46F2C064.9030404@ichips.intel.com> <46F6267E.7090407@voltaire.com> <46F7E96E.4060302@ichips.intel.com> <46F9065C.3090907@voltaire.com> <4700CAE6.6040801@voltaire.com> <20071001151630.GC28627@sashak.voltaire.com> Message-ID: <47011D99.4010309@ichips.intel.com> >> At the bottom line, with this device/firmware the problem does not happen, >> but there's a possible hole here if the IB spec does not require the SM to >> set the client re-register bit each time it discovers a node. > > BTW according to IB spec client reregistration support is optional for > IB port (indicated by bit 25 of PortInfo:CapabilityMask). The multicast code transitions all local multicast groups into an error state on any of these events: port error, LID change, SM change, or client reregister. IPoIB responds to these events, plus port active and pkey change. 
From section 7.2 (figure 50) and section 11.6.3.4, we should get a port error event before a port active event, except in the case of link active defer. I didn't think it was necessary to transition all multicast groups into the error state for link active defer. But, do we need to? Pkey changes were not handled, to avoid failing unaffected multicast groups. However, to be safe, we could see what the change was and generate errors on the affected multicast groups. Does anyone see any other holes in the multicast group handling? - Sean From sashak at voltaire.com Mon Oct 1 09:55:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 1 Oct 2007 18:55:51 +0200 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <200709261614.09499.kilian@stanford.edu> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <20070925233421.GA19757@sashak.voltaire.com> <795c49870709251630q5edef1fcx50235d65cbae6e9d@mail.gmail.com> <200709261614.09499.kilian@stanford.edu> Message-ID: <20071001165550.GE28627@sashak.voltaire.com> On 16:14 Wed 26 Sep , Kilian CAVALOTTI wrote: > On Tuesday 25 September 2007 04:30:48 pm Jeff Becker wrote: > > Hi Sasha. Thanks for the info. I did have the following problem when > > building against the 1.2 libibmad: > > > > cc -Wall -g -fpic -I. -I../include -I/home/becker//include -c -o > > sim_mad.o sim_mad.c > > sim_mad.c: In function 'encode_trap144': > > sim_mad.c:1261: error: 'IB_NOTICE_DATA_144_LID_F' undeclared (first > > use in this function) > > sim_mad.c:1261: error: (Each undeclared identifier is reported only > > once sim_mad.c:1261: error: for each function it appears in.) 
> > sim_mad.c:1262: error: 'IB_NOTICE_DATA_144_CAPMASK_F' undeclared > > (first use in this function) > > make[1]: *** [sim_mad.o] Error 1 > > make[1]: Leaving directory `/home/becker/ibrouting/ibsim/ibsim' > > And indeed those have been introduced by this patch in 1.2.5: > http://lists.openfabrics.org/pipermail/general/2007-June/036912.html As far as I remember this patch was for master originally and was not part of OFED-1.2 or 1.2.5. Was it? Sasha From sashak at voltaire.com Mon Oct 1 10:01:29 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 1 Oct 2007 19:01:29 +0200 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <795c49870709251630q5edef1fcx50235d65cbae6e9d@mail.gmail.com> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <1190747180.7075.465.camel@hrosenstock-ws.xsigo.com> <20070925192501.GE29384@sashak.voltaire.com> <795c49870709251441y4ee77d43wd91ea62c62e832d2@mail.gmail.com> <20070925233421.GA19757@sashak.voltaire.com> <795c49870709251630q5edef1fcx50235d65cbae6e9d@mail.gmail.com> Message-ID: <20071001170129.GF28627@sashak.voltaire.com> On 16:30 Tue 25 Sep , Jeff Becker wrote: > Hi Sasha. Thanks for the info. I did have the following problem when > building against the 1.2 libibmad: > > cc -Wall -g -fpic -I. -I../include -I/home/becker//include -c -o > sim_mad.o sim_mad.c > sim_mad.c: In function 'encode_trap144': > sim_mad.c:1261: error: 'IB_NOTICE_DATA_144_LID_F' undeclared (first > use in this function) > sim_mad.c:1261: error: (Each undeclared identifier is reported only once > sim_mad.c:1261: error: for each function it appears in.) 
> sim_mad.c:1262: error: 'IB_NOTICE_DATA_144_CAPMASK_F' undeclared
> (first use in this function)
> make[1]: *** [sim_mad.o] Error 1
> make[1]: Leaving directory `/home/becker/ibrouting/ibsim/ibsim'

I guess you can revert this patch in order to build ibsim against
OFED-1.2:

commit b10a01708fb620b7bc4bad17ff51c1b82bda7968
Author: Sasha Khapyorsky
Date:   Wed Jun 6 02:59:07 2007 +0300

    ibsim: trap144 encoder

    Signed-off-by: Sasha Khapyorsky

diff --git a/ibsim/sim.h b/ibsim/sim.h
index 5a95d12..538e7d7 100644
--- a/ibsim/sim.h
+++ b/ibsim/sim.h
@@ -83,6 +83,7 @@ enum NODE_TYPES {
 
 enum TRAP_TYPE_ID {
 	TRAP_128,
+	TRAP_144,
 	TRAP_NUM_LAST
 };
 
diff --git a/ibsim/sim_mad.c b/ibsim/sim_mad.c
index 680d0e4..970d56e 100644
--- a/ibsim/sim_mad.c
+++ b/ibsim/sim_mad.c
@@ -60,6 +60,7 @@ static Smpfn do_nodeinfo, do_nodedesc, do_switchinfo, do_portinfo,
     do_pkeytbl, do_sl2vl, do_vlarb, do_guidinfo, do_nothing;
 
 static EncodeTrapfn encode_trap128;
+static EncodeTrapfn encode_trap144;
 
 Smpfn *attrs[IB_PERFORMANCE_CLASS + 1][0xff] = {
 	[IB_SMI_CLASS] {[IB_ATTR_NODE_DESC] do_nodedesc,
@@ -89,6 +90,7 @@ Smpfn *attrs[IB_PERFORMANCE_CLASS + 1][0xff] = {
 
 EncodeTrapfn *encodetrap[] = {
 	[TRAP_128] encode_trap128,
+	[TRAP_144] encode_trap144,
 	[TRAP_NUM_LAST] 0,
 
@@ -1241,6 +1243,28 @@ static int encode_trap128(Port * port, char *data)
 	return 0;
 }
 
+static int encode_trap144(Port * port, char *data)
+{
+	if (!port->lid || !port->smlid) {
+		VERB("switch trap 144 for lid %d with smlid %d",
+		     port->lid, port->smlid);
+		return -1;
+	}
+
+	mad_set_field(data, 0, IB_NOTICE_IS_GENERIC_F, 1);
+	mad_set_field(data, 0, IB_NOTICE_TYPE_F, 4);	// Informational
+	mad_set_field(data, 0, IB_NOTICE_PRODUCER_F, port->node->type);
+	mad_set_field(data, 0, IB_NOTICE_TRAP_NUMBER_F, 144);
+	mad_set_field(data, 0, IB_NOTICE_ISSUER_LID_F, port->lid);
+	mad_set_field(data, 0, IB_NOTICE_TOGGLE_F, 0);
+	mad_set_field(data, 0, IB_NOTICE_COUNT_F, 0);
+	mad_set_field(data, 0, IB_NOTICE_DATA_144_LID_F, port->lid);
+	mad_set_field(data, 0, IB_NOTICE_DATA_144_CAPMASK_F,
+		      mad_get_field(port->portinfo, 0, IB_PORT_CAPMASK_F));
+
+	return 0;
+}
+
 static int encode_trap_header(char *buf)
 {
 	mad_set_field(buf, 0, IB_MAD_CLASSVER_F, 0x1);	// Class

Sasha

From sashak at voltaire.com Mon Oct 1 10:16:36 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 1 Oct 2007 19:16:36 +0200
Subject: Fw: [ofa-general] Re: [query] Multi path discovery in openSM
In-Reply-To: <829ded920709260224o151da169g78d4ff89c18e07f6@mail.gmail.com>
References: <479359.26315.qm@web8315.mail.in.yahoo.com> <829ded920709240023v1282341cq4e14ce29f19fba1b@mail.gmail.com> <6C2C79E72C305246B504CBA17B5500C9024AA95A@mtlexch01.mtl.com> <829ded920709240128s3fde49f6pe49c05f4300261af@mail.gmail.com> <6C2C79E72C305246B504CBA17B5500C9024AAA1F@mtlexch01.mtl.com> <829ded920709260224o151da169g78d4ff89c18e07f6@mail.gmail.com>
Message-ID: <20071001171636.GG28627@sashak.voltaire.com>

On 14:54 Wed 26 Sep , Keshetti Mahesh wrote:
> > The sharing paths is orthogonal to the min-hop requirement.
> > The min-hop requirements is a common way to avoid routing loops.
> > All algorithms I know are using it.
> > Even with that requirement there regularly multiple paths from A to B
> > available both for fat-tree or mesh/tori topologies.
>
> Thanks for clarifying it. Now, I can see that openSM supports four different
> algorithms(Min-hop being the default).
Depending on the physical network
> topology whether the openSM decides the routing policy on its own or one
> has to configure openSM's routing algorithm before starting it.

Desired routing algorithm should be specified with -R option (refer
OpenSM man page for more usage details).

> Is there any
> document describing which algorithm should be used when? And is there
> document describing the current openSM routing algorithms in detail ?

Some basic explanations are in OpenSM man page.

Sasha

From kilian at stanford.edu Mon Oct 1 10:29:50 2007
From: kilian at stanford.edu (Kilian CAVALOTTI)
Date: Mon, 1 Oct 2007 10:29:50 -0700
Subject: [ofa-general] ibnetdiscover topology output
In-Reply-To: <20071001165550.GE28627@sashak.voltaire.com>
References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <200709261614.09499.kilian@stanford.edu> <20071001165550.GE28627@sashak.voltaire.com>
Message-ID: <200710011029.50753.kilian@stanford.edu>

Hi Sasha,

On Monday 01 October 2007 09:55:51 am Sasha Khapyorsky wrote:
> > And indeed those have been introduced by this patch in 1.2.5:
> > http://lists.openfabrics.org/pipermail/general/2007-June/036912.htm
> >l
>
> As far as I remember this patch was for master originally and was not
> part of OFED-1.2 or 1.2.5. Was it?
I'm not sure if it is part of 1.2.5, but the fact is that it's not part of 1.2, and ibsim explicitly refers IB_NOTICE_DATA_144_LID_F and IB_NOTICE_DATA_144_CAPMASK_F. So my point was simply that compiling ibsim against OFED 1.2 fails. Cheers, -- Kilian From hrosenstock at xsigo.com Mon Oct 1 10:53:49 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 01 Oct 2007 10:53:49 -0700 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <200710011029.50753.kilian@stanford.edu> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <200709261614.09499.kilian@stanford.edu> <20071001165550.GE28627@sashak.voltaire.com> <200710011029.50753.kilian@stanford.edu> Message-ID: <1191261229.1998.355.camel@hrosenstock-ws.xsigo.com> On Mon, 2007-10-01 at 10:29 -0700, Kilian CAVALOTTI wrote: > Hi Sasha, > > On Monday 01 October 2007 09:55:51 am Sasha Khapyorsky wrote: > > > And indeed those have been introduced by this patch in 1.2.5: > > > http://lists.openfabrics.org/pipermail/general/2007-June/036912.htm > > >l > > > > As far as I remember this patch was for master originally and was not > > part of OFED-1.2 or 1.2.5. Was it? > > I'm not sure if it is part of 1.2.5, It's not (part of 1.2.5). Hopefully, it'll be part of OFED 1.3. > but the fact is that it's not part of 1.2, That's exactly the issue and there is no ofed_1_2 branch of ibsim. -- Hal > and ibsim explicitly refers IB_NOTICE_DATA_144_LID_F and > IB_NOTICE_DATA_144_CAPMASK_F. So my point was simply that compiling > ibsim against OFED 1.2 fails. 
> Cheers, From sashak at voltaire.com Mon Oct 1 11:13:31 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 1 Oct 2007 20:13:31 +0200 Subject: [ofa-general] ibnetdiscover topology output In-Reply-To: <1191261229.1998.355.camel@hrosenstock-ws.xsigo.com> References: <795c49870709251204h41cc1edek9df9f8bffca455f2@mail.gmail.com> <200709261614.09499.kilian@stanford.edu> <20071001165550.GE28627@sashak.voltaire.com> <200710011029.50753.kilian@stanford.edu> <1191261229.1998.355.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071001181331.GH28627@sashak.voltaire.com> On 10:53 Mon 01 Oct , Hal Rosenstock wrote: > On Mon, 2007-10-01 at 10:29 -0700, Kilian CAVALOTTI wrote: > > Hi Sasha, > > > > On Monday 01 October 2007 09:55:51 am Sasha Khapyorsky wrote: > > > > And indeed those have been introduced by this patch in 1.2.5: > > > > http://lists.openfabrics.org/pipermail/general/2007-June/036912.htm > > > >l > > > > > > As far as I remember this patch was for master originally and was not > > > part of OFED-1.2 or 1.2.5. Was it? > > > > I'm not sure if it is part of 1.2.5, > > It's not (part of 1.2.5). Hopefully, it'll be part of OFED 1.3. 
It is already :)

Sasha

From sashak at voltaire.com Mon Oct 1 13:23:59 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 1 Oct 2007 22:23:59 +0200
Subject: [ofa-general] [PATCH] management/Makefile: remove obsolete .orig Makefile stuff
Message-ID: <20071001202359.GI28627@sashak.voltaire.com>

Remove obsolete (and broken) Makefile.orig files

Signed-off-by: Sasha Khapyorsky
---
 Makefile                  |    9 -------
 libibcommon/Makefile.orig |   19 ---------------
 libibmad/Makefile.orig    |   19 ---------------
 libibumad/Makefile.orig   |   19 ---------------
 make.inc.orig             |   38 -------------------------------
 make.rules                |   54 ---------------------------------------------
 6 files changed, 0 insertions(+), 158 deletions(-)
 delete mode 100644 libibcommon/Makefile.orig
 delete mode 100644 libibmad/Makefile.orig
 delete mode 100644 libibumad/Makefile.orig
 delete mode 100644 make.inc.orig
 delete mode 100644 make.rules

diff --git a/Makefile b/Makefile
index 01466a2..957b1fe 100644
--- a/Makefile
+++ b/Makefile
@@ -11,15 +11,6 @@ all: BUILD_TARG=all
 all: libs_install subdirs
 	@echo Make all done
 
-.PHONY : svnclean
-svnclean:
-	svn st --no-ignore | awk '/^[\?I]/{print $$2}' | xargs rm -rf
-
-.PHONY : origmake
-origmake:
-	for i in `find . -name Makefile.orig`; do cp $$i `echo $$i | sed 's/.orig$$//'`; done
-
-.PHONY : orig
 automake:
 	@for i in $(LIBS); do \
 		if [ -x $$i/autogen.sh ]; then \
diff --git a/libibcommon/Makefile.orig b/libibcommon/Makefile.orig
deleted file mode 100644
index 0e7f013..0000000
--- a/libibcommon/Makefile.orig
+++ /dev/null
@@ -1,19 +0,0 @@
-OPENIB_ROOT=..
-include $(OPENIB_ROOT)/make.inc
-
-SRCS:=$(wildcard src/*.c)
-LIB_OBJS:=$(SRCS:.c=.lo)
-LIB_HDRS=$(wildcard include/infiniband/*.h)
-
-PUBLIC_HEADERS=include/infiniband/common.h
-
-#LIB_STATIC_TARGET=libibcommon.a
-LIB_SO_TARGET=libibcommon.la
-
-EXTRA_CLEAN=
-
-all: .depend $(LIB_SO_TARGET) #$(LIB_STATIC_TARGET)
-
-install: lib_install public_headers_install
-
-include $(OPENIB_ROOT)/make.rules
diff --git a/libibmad/Makefile.orig b/libibmad/Makefile.orig
deleted file mode 100644
index f9fe218..0000000
--- a/libibmad/Makefile.orig
+++ /dev/null
@@ -1,19 +0,0 @@
-OPENIB_ROOT=..
-include $(OPENIB_ROOT)/make.inc
-
-SRCS=$(wildcard src/*.c)
-LIB_OBJS=$(SRCS:.c=.lo)
-LIB_HDRS=$(wildcard include/infiniband/*.h)
-
-PUBLIC_HEADERS=include/infiniband/mad.h
-
-#LIB_STATIC_TARGET=libibmad.a
-LIB_SO_TARGET=libibmad.la
-
-EXTRA_CLEAN=
-
-all: .depend $(LIB_SO_TARGET) #$(LIB_STATIC_TARGET)
-
-install: lib_install public_headers_install
-
-include $(OPENIB_ROOT)/make.rules
diff --git a/libibumad/Makefile.orig b/libibumad/Makefile.orig
deleted file mode 100644
index 8ce1481..0000000
--- a/libibumad/Makefile.orig
+++ /dev/null
@@ -1,19 +0,0 @@
-OPENIB_ROOT=..
-include $(OPENIB_ROOT)/make.inc
-
-SRCS=$(wildcard src/*.c)
-LIB_OBJS=$(SRCS:.c=.lo)
-LIB_HDRS=$(wildcard include/infiniband/*.h) ../libibcommon/include/infiniband/common.h
-
-PUBLIC_HEADERS=include/infiniband/umad.h
-
-#LIB_STATIC_TARGET=libibumad.a
-LIB_SO_TARGET=libibumad.la
-
-EXTRA_CLEAN=
-
-all: $(LIB_SO_TARGET)
-
-install: lib_install public_headers_install
-
-include $(OPENIB_ROOT)/make.rules
diff --git a/make.inc.orig b/make.inc.orig
deleted file mode 100644
index 7d6e203..0000000
--- a/make.inc.orig
+++ /dev/null
@@ -1,38 +0,0 @@
-#################################
-# Openib usermode common make file variables
-#
-
-# OPENIB_ROOT: root of OPENIB usermode management src tree - should be set by any makefile
-#OPENIB_ROOT=.
-
-# OPENIB_USR_INC: common usermode includes
-OPENIB_USR_INC=/usr/local/include/infiniband
-
-# OPENIB_USR_INSTALL: root directory for target installation directories
-OPENIB_USR_INSTALL=/usr/local/ib
-
-# OPENIB_USR_BIN: binaries install target
-OPENIB_USR_BIN=$(OPENIB_USR_INSTALL)/bin
-
-# OPENIB_USR_LIB: libraries install target
-OPENIB_USR_LIB=$(OPENIB_USR_INSTALL)/lib
-
-BUILD_VERS:=$(shell svn info Makefile | awk '/^Revision/{print $$0;exit 0}' 2> /dev/null)
-
-# Common CFLAGS
-CFLAGS+= -I$(OPENIB_ROOT)/libibcommon/include/infiniband \
-	-I$(OPENIB_ROOT)/libibmad/include/infiniband \
-	-I$(OPENIB_ROOT)/libibumad/include/infiniband
-CFLAGS+= -Wall -ggdb
-LDFLAGS=-L$(OPENIB_USR_LIB)
-ifneq ($(BUILD_VERS),)
-  CFLAGS+= -D__BUILD_VERSION_TAG__="$(BUILD_VERS)"
-endif
-
-LD_SO_FLAGS=-g -rpath $(OPENIB_USR_LIB) -lm
-LD_STATIC_FLAGS=-g -O
-
-# common application
-CC=gcc
-LD=gcc
-INSTALL=install
diff --git a/make.rules b/make.rules
deleted file mode 100644
index 02c1b1d..0000000
--- a/make.rules
+++ /dev/null
@@ -1,54 +0,0 @@
-ifeq ($(shell sh -c "test -f .depend && echo yes"),yes)
-include .depend
-endif
-
-_C_SRCS:=$(BIN_OBJS:.o=.c)$(LIB_OBJS:.lo=.c)
-export _C_SRCS
-
-$(BIN_TARGET): .depend $(BIN_OBJS) $(BIN_HDRS) $(LIB_HDRS)
-	libtool --mode=link $(LD) $(LDFLAGS) $(BIN_OBJS) $(BIN_LIBS) -o $@
-
-$(LIB_STATIC_TARGET): .depend $(LIB_OBJS) $(LIB_HDRS)
-	libtool --mode=link $(LD) $(LD_STATIC_FLAGS) $(LIB_OBJS) -o $@
-
-$(LIB_SO_TARGET): .depend $(LIB_OBJS) $(LIB_HDRS)
-	libtool --mode=link $(LD) $(LD_SO_FLAGS) $(LIB_OBJS) -o $@
-
-lib_install: $(LIB_SO_TARGET) $(LIB_STATIC_TARGET)
-	$(INSTALL) -d $(OPENIB_USR_LIB)
-	libtool --quiet --mode=install $(INSTALL) $(LIB_SO_TARGET) $(LIB_STATIC_TARGET) $(OPENIB_USR_LIB)
-	libtool --quiet --mode=finish $(OPENIB_USR_LIB)
-
-public_headers_install:
-	$(INSTALL) -d $(OPENIB_USR_INC)
-	libtool --quiet --mode=install $(INSTALL) $(PUBLIC_HEADERS) $(OPENIB_USR_INC)
-
-bin_install: $(BIN_TARGET)
-	$(INSTALL) -d $(OPENIB_USR_BIN)
-
libtool --mode=install $(INSTALL) $(BIN_TARGET) $(OPENIB_USR_BIN) - -script_install: - $(INSTALL) -d $(OPENIB_USR_BIN) - $(INSTALL) $(SCRIPT_TARGET) $(OPENIB_USR_BIN) - -clean: - libtool --mode=clean rm -f *.a *.o *.lo src/*.o src/*.lo $(BIN_TARGET) $(LIB_SO_TARGET) $(EXTRA_CLEAN) .depend - -.depend: - @touch .depend - @if ! [ "$$_C_SRCS" = "" ]; then\ - gcc -M $(CFLAGS) $(_C_SRCS) > .depend;\ - fi - @echo `pwd`/.depend file created - -rmdep: - rm -f .depend - -depend: .depend - -%.o: %.c - libtool --mode=compile $(CC) -O -c $(CFLAGS) $< -o $@ - -%.lo: %.c - libtool --mode=compile $(CC) -O -c $(CFLAGS) $< -o $@ - -- 1.5.3.rc2.29.gc4640f From sashak at voltaire.com Mon Oct 1 13:26:53 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 1 Oct 2007 22:26:53 +0200 Subject: [ofa-general] [PATCH] management/Makefile: simplify top level Makefile In-Reply-To: <20071001202359.GI28627@sashak.voltaire.com> References: <20071001202359.GI28627@sashak.voltaire.com> Message-ID: <20071001202653.GJ28627@sashak.voltaire.com> Simplified management top level Makefile. 
Signed-off-by: Sasha Khapyorsky --- Makefile | 84 +++++++++++++------------------------------------------------ 1 files changed, 18 insertions(+), 66 deletions(-) diff --git a/Makefile b/Makefile index 957b1fe..e6d5383 100644 --- a/Makefile +++ b/Makefile @@ -1,70 +1,22 @@ -#LIBS:=$(wildcard lib*) -LIBS:=libibcommon libibumad libibmad -OSM:=opensm -OSMLIBS:=complib libvendor -DIAG:=infiniband-diags +SUBDIRS:= libibcommon libibumad libibmad opensm infiniband-diags -SUBDIRS=$(OSM) $(DIAG) - -all: BUILD_TARG=all -all: libs_install subdirs +allall: all install @echo Make all done -automake: - @for i in $(LIBS); do \ - if [ -x $$i/autogen.sh ]; then \ - if !(cd $$i; ./autogen.sh && ./configure && make && make install); then exit 1; fi \ - fi \ - done - @for i in $(OSMLIBS); do \ - if [ -x $(OSM)/$$i/autogen.sh ]; then \ - if !(cd $(OSM)/$$i; ./autogen.sh && ./configure && make && make install); then exit 1; fi \ - fi \ - done - @for i in $(DIAG) $(OSM)/opensm; do \ - if [ -x $$i/autogen.sh ]; then \ - if !(cd $$i; ./autogen.sh && ./configure); then exit 1; fi \ - fi \ - done - @for i in $(DIAG) $(OSM)/opensm; do \ - if [ -x $$i/autogen.sh ]; then \ - if !(cd $$i; make && make install); then exit 1; fi \ - fi \ - done - -install: BUILD_TARG=install -install: subdirs - @echo Install done - -clean: SUBDIRS=$(LIBS) $(DIAG) $(OSM) -clean: BUILD_TARG=clean -clean: subdirs - @rm -f build_tag - @echo Clean done - -rmdep: - find $(SUBDIRS) -name ".depend" | xargs rm -f - -depend: SUBDIRS=$(LIBS) $(DIAG) $(OSM) -depend: BUILD_TARG=depend -depend: rmdep subdirs - @echo Depend done - -.PHONY : subdirs -subdirs: - @for i in $(SUBDIRS); do \ - test -x $$i/configure || ( cd $$i && ./autogen.sh || exit 1 ); \ - test -e $$i/Makefile || ( cd $$i && ./configure || exit 1 ); \ - ( cd $$i && make ) || exit 1; \ - done - -.PHONY : libs_install -libs_install: - @for i in $(LIBS); do \ - test -x $$i/configure || ( cd $$i && ./autogen.sh || exit 1 ); \ - test -e $$i/Makefile || ( cd $$i && 
./configure || exit 1 ); \ - ( cd $$i && make && make install ) || exit 1; \ - done - -export BUILD_TARG +config: + $(foreach dir, $(SUBDIRS), \ + if [ ! -z "$(force)" -o ! -x $(dir)/configure ] ; then \ + ( cd $(dir) && ./autogen.sh && ./configure ) \ + || exit 1 ; \ + elif [ ! -e $(dir)/Makefile ] ; then \ + ( cd $(dir) && ./configure ) \ + || exit 1 ; \ + fi ; ) + +automake: force=1 +automake: config + +all: config +all install clean: + $(foreach dir, $(SUBDIRS), $(MAKE) -C $(dir) $@ && ) echo $@ done -- 1.5.3.rc2.29.gc4640f From rdreier at cisco.com Mon Oct 1 13:40:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 01 Oct 2007 13:40:53 -0700 Subject: [ofa-general] multiple threads posting to the same QP In-Reply-To: <46FFB02B.8040307@voltaire.com> (Or Gerlitz's message of "Sun, 30 Sep 2007 16:18:19 +0200") References: <46FFB02B.8040307@voltaire.com> Message-ID: > Looking on libibverbs sources and man pages, I can't figure out if > posting to the same QP by multiple threads is supported (or if it > should be supported by the low 
level libraries): The low-level driver libraries are expected to be fully thread-safe. Basically the same rules as in Documentation/infiniband/core-locking.txt for the kernel level. However I agree that this expectation is not written down anywhere. I guess libibverbs needs some documentation for low-level driver authors. Not sure where to put it (I don't think a man page is really appropriate, is it?). - R. From rdreier at cisco.com Mon Oct 1 13:41:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 01 Oct 2007 13:41:44 -0700 Subject: [ofa-general] srp_sg_tablesize related question In-Reply-To: <47007B50.60102@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Sun, 30 Sep 2007 21:45:04 -0700") References: <47007B50.60102@linux.vnet.ibm.com> Message-ID: > I do not see a max value for srp_sg_tablesize. I see an earlier patch limiting it to > 128, but that is not the case in the recent kernels. So, what limits the size of > an IU? Does it depend on the target port limiting it with an SRP_CRED_REQ? There's no validation of the value for srp_sg_tablesize, so it is probably possible to mess things up by picking too big a value. - R. From rdreier at cisco.com Mon Oct 1 13:46:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 01 Oct 2007 13:46:07 -0700 Subject: [ofa-general] Re: multiple threads posting to the same QP In-Reply-To: <47009B15.8010506@voltaire.com> (Or Gerlitz's message of "Mon, 01 Oct 2007 09:00:37 +0200") References: <46FFB02B.8040307@voltaire.com> <46FFC6AC.5010605@dev.mellanox.co.il> <47009B15.8010506@voltaire.com> Message-ID: > So the locking should be provided by the low-level per device library? Yes. > if this is the case, I fail to see this documented anywhere. I don't believe anyone has actually written it down anywhere. > Also do we actually want locking in the fast posting path? for example > is it legal to call send(2) on the same socket fd from two threads? 
Yes, I think every call must be fully thread safe, for a few reasons. First, if we try to make some calls not thread safe then we will undoubtedly have application authors creating races and reporting strange bugs. Second, pushing the locking to the low-level driver actually allows smarter locking to be used -- cf. the slightly tricky way that mthca/mlx4 lock CQs during QP destroy to avoid taking the QP table lock during poll CQ operations. I guess it would be possible to compile a special driver library with all pthread calls stubbed out, for use in single-threaded applications, but I'm not convinced it's worth it. (And BTW, yes, it is possible to call send(2) in any racy way you want on the same FD, and the kernel's internal state will not get messed up) - R. From rdreier at cisco.com Mon Oct 1 13:47:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 01 Oct 2007 13:47:39 -0700 Subject: [ofa-general] Re: [PATCH] mlx4: display misc device information via sysfs under /sys/class/infiniband/mlx4_x, for ibstat and ibv_devinfo In-Reply-To: <200710010948.40453.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 1 Oct 2007 09:48:40 +0200") References: <200709180914.18560.jackm@dev.mellanox.co.il> <200710010948.40453.jackm@dev.mellanox.co.il> Message-ID: > Yes, the change should go upstream. With an MGM entry size of 64, each multicast group > can support only 8 QPs. Increasing the entry size to 256 enables support of 56 QPs per > multicast group (8 QPs per multicast group was not enough for some users). OK, send a patch I guess. But is there some reason why the same limit in mthca wasn't a problem for years now? - R. 
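[ Editorial sketch, not part of the original mail: the 8 and 56 QP figures quoted above can be checked against the expression that appears in the mlx4.h patch posted in this thread, MLX4_QP_PER_MGM = 4 * (MLX4_MGM_ENTRY_SIZE / 16 - 2). The helper name qp_per_mgm() below is hypothetical and exists only for illustration; it is not a driver function, and the precondition in the assert is an assumption about which sizes make sense for that expression. ]

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the mlx4 driver): evaluates the
 * expression used for MLX4_QP_PER_MGM in the mlx4.h patch,
 *     4 * (MLX4_MGM_ENTRY_SIZE / 16 - 2)
 * for a given MGM entry size in bytes.
 */
static int qp_per_mgm(int entry_size)
{
	/* Assumed precondition: a multiple of 16 bytes, at least 32. */
	assert(entry_size >= 32 && entry_size % 16 == 0);

	return 4 * (entry_size / 16 - 2);
}
```

[ With the old entry size, qp_per_mgm(0x40) evaluates to 4 * (4 - 2) = 8; with the proposed size, qp_per_mgm(0x100) evaluates to 4 * (16 - 2) = 56, matching the numbers in the message above. ]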
From rdreier at cisco.com Mon Oct 1 13:50:56 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 01 Oct 2007 13:50:56 -0700 Subject: [ofa-general] Re: [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: <46FF7E8B.7010307@voltaire.com> (Or Gerlitz's message of "Sun, 30 Sep 2007 12:46:35 +0200") References: <15ddcffd0709270338y33976845va9f36e4050044d86@mail.gmail.com> <46FF7E8B.7010307@voltaire.com> Message-ID: > So indeed the assumption in the patch is that mgids which translate to > legal IP multicast addresses are inserted into the database either by > ipoib or rdma-cm consumers who use IPOIB_PS for their ID's. I guess since it's configurable, it's OK. But I think that you miss the fact that there might be other consumers of ib_sa creating multicast groups, and there might be other rdma_cm consumers using IPOIB_PS also. > A module param enables adding a > > options ib_ipoib umcast_allowed=1 > > line to /etc/modprobe.conf to make this setting persistent across > module unload/load (eg reboots) and be applied to all the devices > created by ipoib. A sysfs entry has to be explicitly written following > each device creation. The umcast setting could be made persistent with a script that runs at ipoib interface hotplug too. In fact surely OFED must have this infrastructure for setting connected mode, mtu, etc. I really want to push back as much as possible on creating new module parameters since we have too many as it is. - R. 
From rdreier at cisco.com Mon Oct 1 13:53:16 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 01 Oct 2007 13:53:16 -0700 Subject: [ofa-general] [PATCH] fix some ehca limits In-Reply-To: <20071001153620.GA31830@kryten> (Anton Blanchard's message of "Mon, 1 Oct 2007 10:36:20 -0500") References: <20070930053726.GA28619@kryten> <20071001153620.GA31830@kryten> Message-ID: > props->max_sge = min_t(int, rblock->max_sge, INT_MAX); > props->max_sge_rd = min_t(int, rblock->max_sge_rd, INT_MAX); > props->max_cq = min_t(int, rblock->max_cq, INT_MAX); > - props->max_cqe = min_t(int, rblock->max_cqe, INT_MAX); > + props->max_cqe = min_t(unsigned int, rblock->max_cqe, INT_MAX); > props->max_mr = min_t(int, rblock->max_mr, INT_MAX); > props->max_mw = min_t(int, rblock->max_mw, INT_MAX); > - props->max_pd = min_t(int, rblock->max_pd, INT_MAX); > - props->max_ah = min_t(int, rblock->max_ah, INT_MAX); > + props->max_pd = min_t(unsigned int, rblock->max_pd, INT_MAX); > + props->max_ah = min_t(unsigned int, rblock->max_ah, INT_MAX); > props->max_fmr = min_t(int, rblock->max_mr, INT_MAX); Seems like all these min_t(int, ..., INT_MAX) values are equally buggy, right? You're just fixing the two that happened to trigger but I think they should all be cleaned up now that we noticed them. - R. From pradeeps at linux.vnet.ibm.com Mon Oct 1 14:09:58 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Mon, 01 Oct 2007 14:09:58 -0700 Subject: [ofa-general] srp_sg_tablesize related question In-Reply-To: References: <47007B50.60102@linux.vnet.ibm.com> Message-ID: <47016226.2020400@linux.vnet.ibm.com> Roland Dreier wrote: > > I do not see a max value for srp_sg_tablesize. I see an earlier patch limiting it to > > 128, but that is not the case in the recent kernels. So, what limits the size of > > an IU? Does it depend on the target port limiting it with an SRP_CRED_REQ? 
> > There's no validation of the value for srp_sg_tablesize, so it is > probably possible to mess things up by picking too big a value. > > - R. > Thanks for the information. If that is indeed the case, then why was the check removed? I see a patch that limited it to 128; not sure if it was ever applied. What could one use as a max value which would not mess things up, or is that target dependent? Pradeep From rdreier at cisco.com Mon Oct 1 14:19:01 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 01 Oct 2007 14:19:01 -0700 Subject: [ofa-general] srp_sg_tablesize related question In-Reply-To: <47016226.2020400@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Mon, 01 Oct 2007 14:09:58 -0700") References: <47007B50.60102@linux.vnet.ibm.com> <47016226.2020400@linux.vnet.ibm.com> Message-ID: > Thanks for the information. If that is indeed the case, then why was the check removed? > I see a patch that limited it to 128; not sure if it was ever applied. I don't think there was ever any check that was removed. > What could one use as a max value which would not mess things up, or is that target > dependent? What are you trying to do? The default should be fine for most uses. - R. From pradeeps at linux.vnet.ibm.com Mon Oct 1 14:45:35 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Mon, 01 Oct 2007 14:45:35 -0700 Subject: [ofa-general] srp_sg_tablesize related question In-Reply-To: References: <47007B50.60102@linux.vnet.ibm.com> <47016226.2020400@linux.vnet.ibm.com> Message-ID: <47016A7F.6000309@linux.vnet.ibm.com> Roland Dreier wrote: > > Thanks for the information. If that is indeed the case, then why was the check removed? > > I see a patch that limited it to 128; not sure if it was ever applied. > > I don't think there was ever any check that was removed. 
Maybe this patch was never applied - http://lists.openfabrics.org/pipermail/general/2006-May/021775.html > > > What could one use as a max value which would not mess things up, or is that target > > dependent? > > What are you trying to do? The default should be fine for most uses. Some large system users have reported seeing better throughputs with values >128. I was wondering how high one could go without running into any issues. Pradeep 
From rdreier at cisco.com Mon Oct 1 20:49:32 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 01 Oct 2007 20:49:32 -0700 Subject: [ofa-general] [PATCH][RFC] IB/umad: Fix bit ordering and 32-on-64 problems on big endian systems Message-ID: [ I just added this to my for-2.6.24 branch... it should only change behavior for big-endian 32-on-64 systems, so I think this is the fix that causes the least breakage. Anyway, as usual, comments appreciated. ] The declaration of struct ib_user_mad_reg_req.method_mask[] exported to userspace was an array of __u32, but the kernel internally treated it as a bitmap made up of longs. This makes a difference for 64-bit big-endian kernels, where numbering the bits in an array of __u32 gives: |31.....0|63....32|95....64|127...96| while numbering the bits in an array of longs gives: |63..............0|127............64| 64-bit userspace can handle this by just treating method_mask[] as an array of longs, but 32-bit userspace is really stuck: the meaning of the bits in method_mask[] depends on whether the kernel is 32-bit or 64-bit, and there's no sane way for userspace to know that. Fix this by updating to make it clear that method_mask[] is an array of longs, and using a compat_ioctl method to convert to an array of 64-bit longs to handle the 32-on-64 problem. This fixes the interface description to match existing behavior (so working binaries continue to work) in almost all situations, and gives consistent semantics in the case of 32-bit userspace that can run on either a 32-bit or 64-bit kernel, so that the same binary can work for both 32-on-32 and 32-on-64 systems. 
Signed-off-by: Roland Dreier --- drivers/infiniband/core/user_mad.c | 49 +++++++++++++++++++++++++++++------ include/rdma/ib_user_mad.h | 22 +++++++++++++++- 2 files changed, 61 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index aee2913..b53eac4 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -44,6 +44,7 @@ #include #include #include +#include #include #include @@ -607,7 +608,8 @@ static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wa return mask; } -static int ib_umad_reg_agent(struct ib_umad_file *file, unsigned long arg) +static int ib_umad_reg_agent(struct ib_umad_file *file, void __user *arg, + int compat_method_mask) { struct ib_user_mad_reg_req ureq; struct ib_mad_reg_req req; @@ -622,7 +624,7 @@ static int ib_umad_reg_agent(struct ib_umad_file *file, unsigned long arg) goto out; } - if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + if (copy_from_user(&ureq, arg, sizeof ureq)) { ret = -EFAULT; goto out; } @@ -643,8 +645,18 @@ found: if (ureq.mgmt_class) { req.mgmt_class = ureq.mgmt_class; req.mgmt_class_version = ureq.mgmt_class_version; - memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); - memcpy(req.oui, ureq.oui, sizeof req.oui); + memcpy(req.oui, ureq.oui, sizeof req.oui); + + if (compat_method_mask) { + u32 *umm = (u32 *) ureq.method_mask; + int i; + + for (i = 0; i < BITS_TO_LONGS(IB_MGMT_MAX_METHODS); ++i) + req.method_mask[i] = + umm[i * 2] | ((u64) umm[i * 2 + 1] << 32); + } else + memcpy(req.method_mask, ureq.method_mask, + sizeof req.method_mask); } agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, @@ -682,13 +694,13 @@ out: return ret; } -static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +static int ib_umad_unreg_agent(struct ib_umad_file *file, u32 __user *arg) { struct ib_mad_agent *agent = NULL; u32 id; int ret = 0; - if 
(get_user(id, (u32 __user *) arg)) + if (get_user(id, arg)) return -EFAULT; down_write(&file->port->mutex); @@ -729,9 +741,9 @@ static long ib_umad_ioctl(struct file *filp, unsigned int cmd, { switch (cmd) { case IB_USER_MAD_REGISTER_AGENT: - return ib_umad_reg_agent(filp->private_data, arg); + return ib_umad_reg_agent(filp->private_data, (void __user *) arg, 0); case IB_USER_MAD_UNREGISTER_AGENT: - return ib_umad_unreg_agent(filp->private_data, arg); + return ib_umad_unreg_agent(filp->private_data, (__u32 __user *) arg); case IB_USER_MAD_ENABLE_PKEY: return ib_umad_enable_pkey(filp->private_data); default: @@ -739,6 +751,23 @@ static long ib_umad_ioctl(struct file *filp, unsigned int cmd, } } +#ifdef CONFIG_COMPAT +static long ib_umad_compat_ioctl(struct file *filp, unsigned int cmd, + unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, compat_ptr(arg), 1); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, compat_ptr(arg)); + case IB_USER_MAD_ENABLE_PKEY: + return ib_umad_enable_pkey(filp->private_data); + default: + return -ENOIOCTLCMD; + } +} +#endif + static int ib_umad_open(struct inode *inode, struct file *filp) { struct ib_umad_port *port; @@ -826,7 +855,9 @@ static const struct file_operations umad_fops = { .write = ib_umad_write, .poll = ib_umad_poll, .unlocked_ioctl = ib_umad_ioctl, - .compat_ioctl = ib_umad_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ib_umad_compat_ioctl, +#endif .open = ib_umad_open, .release = ib_umad_close }; diff --git a/include/rdma/ib_user_mad.h b/include/rdma/ib_user_mad.h index 2a32043..29d2c72 100644 --- a/include/rdma/ib_user_mad.h +++ b/include/rdma/ib_user_mad.h @@ -147,6 +147,26 @@ struct ib_user_mad { __u64 data[0]; }; +/* + * Earlier versions of this interface definition declared the + * method_mask[] member as an array of __u32 but treated it as a + * bitmap made up of longs in the kernel. 
This ambiguity meant that + * 32-bit big-endian applications that can run on both 32-bit and + * 64-bit kernels had no consistent ABI to rely on, and 64-bit + * big-endian applications that treated method_mask as being made up + * of 32-bit words would have their bitmap misinterpreted. + * + * To clear up this confusion, we change the declaration of + * method_mask[] to use unsigned long and handle the conversion from + * 32-bit userspace to 64-bit kernel for big-endian systems in the + * compat_ioctl method. Unfortunately, to keep the structure layout + * the same, we need the method_mask[] array to be aligned only to 4 + * bytes even when long is 64 bits, which forces us into this ugly + * typedef. + */ +typedef unsigned long __attribute__((aligned(4))) packed_ulong; +#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * sizeof (long))) + /** * ib_user_mad_reg_req - MAD registration request * @id - Set by the kernel; used to identify agent in future requests. @@ -165,7 +185,7 @@ struct ib_user_mad { */ struct ib_user_mad_reg_req { __u32 id; - __u32 method_mask[4]; + packed_ulong method_mask[IB_USER_MAD_LONGS_PER_METHOD_MASK]; __u8 qpn; __u8 mgmt_class; __u8 mgmt_class_version; -- 1.5.3.2 From billfink at mindspring.com Mon Oct 1 21:25:02 2007 From: billfink at mindspring.com (Bill Fink) Date: Tue, 2 Oct 2007 00:25:02 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191245440.4378.12.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> <1191178346.6165.29.camel@localhost> <20071001001135.75d2b984.billfink@mindspring.com> <1191245440.4378.12.camel@localhost> Message-ID: <20071002002502.fe0f2bb3.billfink@mindspring.com> On Mon, 01 Oct 2007, jamal 
wrote: > On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote: > > > Have you done performance comparisons for the case of using 9000-byte > > jumbo frames? > > I havent, but will try if any of the gige cards i have support it. > > As a side note: I have not seen any useful gains or losses as the packet > size approaches even 1500B MTU. For example, post about 256B neither the > batching nor the non-batching give much difference in either throughput > or cpu use. Below 256B, theres a noticeable gain for batching. > Note, in the cases of my tests all 4 CPUs are in full-throttle UDP and > so the occupancy of both the qdisc queue(s) and ethernet ring is > constantly high. For example at 512B, the app is 80% idle on all 4 CPUs > and we are hitting in the range of wire speed. We are at 90% idle at > 1024B. This is the case with or without batching. So my suspicion is > that with that trend a 9000B packet will just follow the same pattern. One reason I ask, is that on an earlier set of alternative batching xmit patches by Krishna Kumar, his performance testing showed a 30 % performance hit for TCP for a single process and a size of 4 KB, and a performance hit of 5 % for a single process and a size of 16 KB (a size of 8 KB wasn't tested). Unfortunately I was too busy at the time to inquire further about it, but it would be a major potential concern for me in my 10-GigE network testing with 9000-byte jumbo frames. Of course the single process and 4 KB or larger size was the only case that showed a significant performance hit in Krishna Kumar's latest reported test results, so it might be acceptable to just have a switch to disable the batching feature for that specific usage scenario. So it would be useful to know if your xmit batching changes would have similar issues. 
Also for your xmit batching changes, I think it would be good to see performance comparisons for TCP and IP forwarding in addition to your UDP pktgen tests, including various packet sizes up to and including 9000-byte jumbo frames. -Bill 
From kliteyn at mellanox.co.il Mon Oct 1 22:15:06 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 2 Oct 2007 07:15:06 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-02:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-01 OpenSM git rev = Mon_Oct_1_19:42:25_2007 [a2b54cfdc1f6b8f2877d3d93f232b474913526be] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From ogerlitz at voltaire.com Tue Oct 2 00:16:42 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 02 Oct 2007 09:16:42 +0200 Subject: [ofa-general] multiple threads posting to the same QP In-Reply-To: References: <46FFB02B.8040307@voltaire.com> Message-ID: <4701F05A.8010706@voltaire.com> Roland Dreier wrote: > > Looking on libibverbs sources and man pages, I can't figure out if > > posting to the same QP by multiple threads is supported (or if it > > should be supported by the low level libraries): > > The low-level driver 
libraries are expected to be fully thread-safe. > Basically the same rules as in Documentation/infiniband/core-locking.txt > for the kernel level. However I agree that this expectation is not > written down anywhere. I guess libibverbs needs some documentation > for low-level driver authors. Not sure where to put it (I don't think > a man page is really appropriate, is it?). I believe libibverbs/threading need to be documented both for app developers and for low-level driver libraries developers. A man page provided by libibverbs-devel surely fits the first case, as for the second case, libibverbs-devel can provide a document under /usr/share/doc as done by other packages (eg zlib-devel). The man pages for app writers should provide a general description, address the thread safety properties of libibverbs, point to the other man pages etc. As I have pointed to Dotan, librdmacm-devel provides a nice man page rdma_cm(7) which can be used as an example. Sean - I guess it would be nice to update also rdma_cm(7) to document the thread related assumptions etc of librdmacm... Or. From jackm at dev.mellanox.co.il Tue Oct 2 00:38:32 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 2 Oct 2007 09:38:32 +0200 Subject: [ofa-general] Re: [PATCH] mlx4: display misc device information via sysfs under /sys/class/infiniband/mlx4_x, for ibstat and ibv_devinfo In-Reply-To: References: <200709180914.18560.jackm@dev.mellanox.co.il> <200710010948.40453.jackm@dev.mellanox.co.il> Message-ID: <200710020938.32755.jackm@dev.mellanox.co.il> On Monday 01 October 2007 22:47, Roland Dreier wrote: > > Yes, the change should go upstream. With an MGM entry size of 64, each multicast group > > can support only 8 QPs. Increasing the entry size to 256 enables support of 56 QPs per > > multicast group (8 QPs per multicast group was not enough for some users). > > OK, send a patch I guess. But is there some reason why the same limit > in mthca wasn't a problem for years now? 
Reason is new 16-core nodes (4 socket quad-core). Want a QP per node to
join a specific multicast group.

I'm sending a patch in a separate post.

- Jack

> - R.
>

From jackm at dev.mellanox.co.il Tue Oct 2 00:40:13 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 2 Oct 2007 09:40:13 +0200
Subject: [ofa-general] [PATCH] mlx4: increase permissible number of QPs per multicast group to 56
Message-ID: <200710020940.13862.jackm@dev.mellanox.co.il>

Increase QPs per multicast group to 56 (needed to support 16-core nodes).

Signed-off-by: Jack Morgenstein

Index: connectx_kernel/drivers/net/mlx4/mlx4.h
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/mlx4.h 2007-07-08 16:31:10.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/mlx4.h 2007-07-08 16:31:47.000000000 +0300
@@ -59,7 +59,7 @@
 };
 
 enum {
-	MLX4_MGM_ENTRY_SIZE = 0x40,
+	MLX4_MGM_ENTRY_SIZE = 0x100,
 	MLX4_QP_PER_MGM = 4 * (MLX4_MGM_ENTRY_SIZE / 16 - 2),
 	MLX4_MTT_ENTRY_PER_SEG = 8
 };

From ogerlitz at voltaire.com Tue Oct 2 00:43:50 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 02 Oct 2007 09:43:50 +0200
Subject: [ofa-general] Re: multiple threads posting to the same QP
In-Reply-To:
References: <46FFB02B.8040307@voltaire.com> <46FFC6AC.5010605@dev.mellanox.co.il> <47009B15.8010506@voltaire.com>
Message-ID: <4701F6B6.9080006@voltaire.com>

Roland Dreier wrote:
> > Also do we actually want locking in the fast posting path? for example
> > is it legal to call send(2) on the same socket fd from two threads?
>
> Yes, I think every call must be fully thread safe, for a few reasons.
> First, if we try to make some calls not thread safe then we will
> undoubtedly have application authors creating races and reporting
> strange bugs. Second, pushing the locking to the low-level driver
> actually allows smarter locking to be used -- cf the slightly tricky
> way that mthca/mlx4 lock CQs during QP destroy to avoid taking the QP
> table lock during poll CQ operations.
>
> I guess it would be possible to compile a special driver library with
> all pthread calls stubbed out, for use in single-threaded
> applications, but I'm not convinced it's worth it.

Looking at the mthca and mlx4 low-level libraries I realize that you
use pthread_spin_lock/unlock for thread safety.

Can you spare a few words on why spinning is used rather than sleeping
(e.g. pthread_mutex_lock/unlock) - is it because you assume that:

A) if the lock is not contended - both calls have the same efficiency
B) if the lock is contended - it would be so for a --short-- time and
hence spinning is more efficient than sleeping (no context switch etc.)

Assuming that the locking scheme of the libraries does not introduce any
notable overhead for single thread runs, I agree there's no need to
provide single threaded instances as well.

> (And BTW, yes, it is possible to call send(2) in any racy way you want
> on the same FD, and the kernel's internal state will not get messed
> up)

Is it documented anywhere? Other than the kernel state, what happens
if send(2) is called from two threads on a datagram socket? Will the
two datagrams be accepted on the remote side uncorrupted? As for a
stream socket, since the order of the bits in the stream is not defined
on the sender side, "corruption" would surely be experienced by the
remote side.

Or.

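[Editor's sketch, not part of the original thread.] On the datagram half of the question above: a datagram socket keeps message boundaries, so each send(2) call yields one datagram, and concurrent callers can change the order of datagrams but should not mix bytes inside one. The following is a minimal sketch of how one might check that, assuming an AF_UNIX socketpair stands in for a real network socket. The names sender, datagram_is_intact and count_corrupted_datagrams are made up for illustration, and nothing here speaks for libibverbs or for stream sockets.

```c
/*
 * Sketch only: two threads call send(2) on one SOCK_DGRAM socket while the
 * calling thread checks that every received datagram is intact.
 * Assumption: an AF_UNIX socketpair stands in for a real network socket.
 * All function and struct names below are hypothetical.
 */
#include <assert.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define MSG_LEN   64   /* bytes per datagram */
#define MSGS_EACH 50   /* datagrams sent by each thread */

struct sender_arg {
    int  fd;    /* socket shared by both sender threads */
    char fill;  /* byte value this thread puts in every datagram */
};

static void *sender(void *p)
{
    struct sender_arg *a = p;
    char buf[MSG_LEN];

    memset(buf, a->fill, sizeof(buf));
    for (int i = 0; i < MSGS_EACH; i++)
        if (send(a->fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
            break;  /* give up; the receiver will hit its timeout */
    return NULL;
}

/* A datagram is intact if it has the full length and one single fill byte. */
static int datagram_is_intact(const char *buf, ssize_t len)
{
    if (len != MSG_LEN || (buf[0] != 'A' && buf[0] != 'B'))
        return 0;
    for (ssize_t i = 1; i < len; i++)
        if (buf[i] != buf[0])
            return 0;
    return 1;
}

/* Returns the number of damaged or missing datagrams, or -1 on setup failure. */
int count_corrupted_datagrams(void)
{
    int sv[2];
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };
    struct sender_arg args[2];
    pthread_t tid[2];
    int started = 0, bad = 0;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;
    /* Do not block forever if a sender stops early. */
    setsockopt(sv[1], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    for (int i = 0; i < 2; i++) {
        args[i].fd = sv[0];
        args[i].fill = (char)('A' + i);
        if (pthread_create(&tid[started], NULL, sender, &args[i]) == 0)
            started++;
    }

    /* Receive while the senders run, so they never stall on a full queue. */
    for (int n = 0; n < started * MSGS_EACH; n++) {
        char buf[MSG_LEN + 1];  /* one spare byte exposes oversized datagrams */
        ssize_t len = recv(sv[1], buf, sizeof(buf), 0);

        if (len < 0) {          /* timeout or error: count the rest as missing */
            bad += started * MSGS_EACH - n;
            break;
        }
        if (!datagram_is_intact(buf, len))
            bad++;
    }

    for (int i = 0; i < started; i++) {
        int rc = pthread_join(tid[i], NULL);
        assert(rc == 0);
        (void)rc;
    }
    close(sv[0]);
    close(sv[1]);
    return started == 2 ? bad : -1;
}
```

Calling count_corrupted_datagrams() from a small main() (built with -pthread) is one way to try this. It only exercises the datagram case on a local socket, so the POSIX text for send() remains the place to look for the formal guarantees.
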
From ogerlitz at voltaire.com Tue Oct 2 00:58:16 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 02 Oct 2007 09:58:16 +0200
Subject: [ofa-general] Re: [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps
In-Reply-To:
References: <15ddcffd0709270338y33976845va9f36e4050044d86@mail.gmail.com> <46FF7E8B.7010307@voltaire.com>
Message-ID: <4701FA18.6040107@voltaire.com>

Roland Dreier wrote:
> > So indeed the assumption in the patch is that mgids which translate to
> > legal IP multicast addresses are inserted into the database either by
> > ipoib or rdma-cm consumers who use IPOIB_PS for their ID's.
>
> I guess since it's configurable, it's OK. But I think that you miss
> the fact that there might be other consumers of ib_sa creating
> multicast groups, and there might be other rdma_cm consumers using
> IPOIB_PS also.

I understand that there may be other ib_sa consumers that use multicast
and other rdma_cm consumers that use IPOIB_PS. However, the point I was
trying to make is that if there are consumers that join the --same--
multicast groups as ipoib, they can actually prevent ipoib from joining
these groups if they provide different attributes for the group. This is
why Sean made the rdma_cm use the --same-- attributes as ipoib does for
IPOIB_PS joins. For other joins the rdma-cm makes sure that the MGID is
different (rdma-cm signature instead of the ipv4/v6 one) and uses a
different qkey.

Anyway, I understand that we are more or less on the same page regarding
this point, correct?

Or.

From ogerlitz at voltaire.com Tue Oct 2 01:07:22 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 02 Oct 2007 10:07:22 +0200
Subject: [ofa-general] Re: [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps
In-Reply-To:
References: <15ddcffd0709270338y33976845va9f36e4050044d86@mail.gmail.com> <46FF7E8B.7010307@voltaire.com>
Message-ID: <4701FC3A.4010207@voltaire.com>

Roland Dreier wrote:
>> options ib_ipoib umcast_allowed=1
>> line to /etc/modprobe.conf to make this setting persistent across
>> module unload/load (eg reboots) and be applied to all the devices
>> created by ipoib. A sysfs entry has to be explicitly written following
>> each device creation.
> The umcast setting could be made persistent with a script that runs at
> ipoib interface hotplug too. In fact surely OFED must have this
> infrastructure for setting connected mode, mtu, etc.
> I really want to push back as much as possible on creating new module
> parameters since we have too many as it is.

I understand this desire... I just need a little clarification from you
re hotplug.

First, as for OFED, looking at the openibd service script (excerpts
below) installed by OFED 1.3, I see that mode and mtu are set
"manually", that is, the user provides the mode and mtu params for the
script and the script uses sysfs to configure the device. This does not
address devices created after the service has started, nor does it seem
a very elegant way to do so.

I understand that you think hotplug is the correct way to go, but it's
not PCI hotplug as used for the low level hw drivers (mthca, mlx4,
etc). What rules should one set for hotplug to act when a new
interface is created (eg the default interfaces created by ipoib for
each or child interface for a pkey created by the user)?

Assuming hotplug can be used to configure allowing umcast, I will remove
the module param from the patch.

Or.

> set_ipoib_cm()
> {
>     local i=$1
>     shift
>
>     if [ ! -e /sys/class/net/${i}/mode ]; then
>         echo "Failed to configure IPoIB connected mode for ${i}"
>         return 1
>     fi
>
>     echo connected > /sys/class/net/${i}/mode
>     /sbin/ifconfig ${i} mtu ${IPOIB_MTU}
> }

....

> bring_up()
> {
>     local i=$1
>     shift
>
>     case $DISTRIB in
>         RedHat|Rocks)
>             if [ $IS_FEDORA -eq 0 ]; then
>                 /sbin/ifup ${i}
>             else
>                 . ${NETWORK_CONF_DIR}/ifcfg-${i}
>                 if [ ! -z ${IPADDR} ] && [ ! -z ${NETMASK} ] && [ ! -z ${BROADCAST} ]; then
>                     /sbin/ifconfig ${i} ${IPADDR} netmask ${NETMASK} broadcast ${BROADCAST} > /dev/null 2>&1
>                 else
>                     /sbin/ifup ${i}
>                 fi
>             fi
>         ;;
>         SuSE)
>             if [ "$KPREFIX" == "26" ]; then
>                 ifconfig ${i} up > /dev/null 2>&1
>             fi
>             # Workaround for ifup issue: two devices with the same IP address
>             . ${NETWORK_CONF_DIR}/ifcfg-${i}
>             if [ ! -z ${IPADDR} ] && [ ! -z ${NETMASK} ] && [ ! -z ${BROADCAST} ]; then
>                 /sbin/ifconfig ${i} ${IPADDR} netmask ${NETMASK} broadcast ${BROADCAST} > /dev/null 2>&1
>             else
>                 /sbin/ifup ${i}
>             fi
>             # /sbin/ifup ${i} > /dev/null 2>&1
>         ;;
>         *)
>             /sbin/ifup ${i}
>         ;;
>     esac
>
>     if [ "X${SET_IPOIB_CM}" == "Xyes" ]; then
>         set_ipoib_cm ${i}
>     fi
>
>     return $?
> }

From vlad at lists.openfabrics.org Tue Oct 2 02:55:36 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 2 Oct 2007 02:55:36 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20071002-0200 daily build status
Message-ID: <20071002095536.8450DE60894@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.22
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on powerpc with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.19_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.18_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.16
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.16_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.17
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071002-0200_linux-2.6.17_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From hadi at cyberus.ca Tue Oct 2 06:20:38 2007
From: hadi at cyberus.ca (jamal)
Date: Tue, 02 Oct 2007 09:20:38 -0400
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <20071002002502.fe0f2bb3.billfink@mindspring.com>
References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> <1191178346.6165.29.camel@localhost> <20071001001135.75d2b984.billfink@mindspring.com> <1191245440.4378.12.camel@localhost> <20071002002502.fe0f2bb3.billfink@mindspring.com>
Message-ID: <1191331238.4353.59.camel@localhost>

On Tue, 2007-02-10 at 00:25 -0400, Bill Fink wrote:

> One reason I ask, is that on an earlier set of alternative batching
> xmit patches by Krishna Kumar, his performance testing showed a 30 %
> performance hit for TCP for a single process and a size of 4 KB, and
> a performance hit of 5 % for a single process and a size of 16 KB
> (a size of 8 KB wasn't tested). Unfortunately I was too busy at the
> time to inquire further about it, but it would be a major potential
> concern for me in my 10-GigE network testing with 9000-byte jumbo
> frames. Of course the single process and 4 KB or larger size was
> the only case that showed a significant performance hit in Krishna
> Kumar's latest reported test results, so it might be acceptable to
> just have a switch to disable the batching feature for that specific
> usage scenario. So it would be useful to know if your xmit batching
> changes would have similar issues.

There were many times while testing that i noticed inconsistencies and
in each case when i analysed[1], i found it to be due to some variable
other than batching which needed some resolving, always via some
parametrization or other. I suspect what KK posted is in the same class.
To give you an example, with UDP, batching was giving worse results at
around 256B compared to 64B or 512B; investigating i found that the
receiver just wasn't able to keep up and the udp layer dropped a lot of
packets so both iperf and netperf reported bad numbers. Fixing the
receiver ended up with consistency coming back. On why 256B was the one
that overwhelmed the receiver more than 64B (which sent more pps)? On
some limited investigation, it seemed to me to be the effect of the
choice of the tg3 driver's default tx mitigation parameters as well as
tx ring size; which is something i plan to revisit (but neutralizing it
helps me focus on just batching). In the end i dropped both netperf and
iperf for similar reasons and wrote my own app. What i am trying to
achieve is to demonstrate whether batching is a GoodThing. In
experimentation like this, it is extremely valuable to reduce the
variables. Batching may expose other orthogonal issues - those need to
be resolved or fixed as they are found. I hope that sounds sensible.

Back to the >=9K packet size you raise above:
I don't have a 10Gige card so i am theorizing. Given that there's an
observed benefit to batching for a saturated link with "smaller" packets
(in my results "small" is anything below 256B which maps to about
380Kpps; anything above that seems to approach wire speed and the link
is the bottleneck); then i theorize that 10Gige with 9K jumbo frames, if
already achieving wire rate, should continue to do so. And sizes below
that will see improvements if they were not already hitting wire rate.
So i would say that with 10G NICS, there will be more observed
improvements with batching with apps that do bulk transfers (assuming
those apps are not seeing wire speed already). Note that this hasn't
been quite the case even with TSO given the bottlenecks in the Linux
receivers that J Heffner put nicely in a response to some results you
posted - but that exposes an issue with Linux receivers rather than TSO.

> Also for your xmit batching changes, I think it would be good to see
> performance comparisons for TCP and IP forwarding in addition to your
> UDP pktgen tests,

That is not pktgen - it is a udp app running in process context
utilizing all 4 CPUs to send traffic. pktgen bypasses the stack entirely
and has its own merits in proving that batching in fact is a GoodThing
even if it is just for traffic generation ;->

> including various packet sizes up to and including
> 9000-byte jumbo frames.

I will do TCP and forwarding tests in the near future.

cheers,
jamal

[1] On average i spend 10x more time performance testing and analysing
results than writing code.

From monisonlists at gmail.com Tue Oct 2 08:49:54 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Tue, 02 Oct 2007 17:49:54 +0200
Subject: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
In-Reply-To: <10376.1190733869@death>
References: <46F7D770.4090500@voltaire.com> <10376.1190733869@death>
Message-ID: <470268A2.7080102@gmail.com>

Jay Vosburgh wrote:
> ACK patches 3 - 9.
>
> Roland, are you comfortable with the IB changes in patches 1 and 2?
>
> Jeff, when Roland acks patches 1 and 2, please apply all 9.
>
> -J

Hi Jeff,
Roland acked the IPoIB patches. If you haven't done so already can you please apply them?
I'm not sure when 2.6.24 is going to open and I'm afraid to miss it.

thanks

From jgarzik at pobox.com Tue Oct 2 09:52:14 2007
From: jgarzik at pobox.com (Jeff Garzik)
Date: Tue, 02 Oct 2007 12:52:14 -0400
Subject: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
In-Reply-To: <470268A2.7080102@gmail.com>
References: <46F7D770.4090500@voltaire.com> <10376.1190733869@death> <470268A2.7080102@gmail.com>
Message-ID: <4702773E.4090201@pobox.com>

Moni Shoua wrote:
> Jay Vosburgh wrote:
>> ACK patches 3 - 9.
>>
>> Roland, are you comfortable with the IB changes in patches 1 and 2?
>>
>> Jeff, when Roland acks patches 1 and 2, please apply all 9.
>>
>> -J
>
> Hi Jeff,
> Roland acked the IPoIB patches. If you haven't done so already can you please apply them?
> I'm not sure when 2.6.24 is going to open and I'm afraid to miss it.

hrm, I don't see them in my inbox for some reason. can someone bounce
them to me? or give me a git tree to pull from?

Jeff

From rdreier at cisco.com Tue Oct 2 10:53:04 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 02 Oct 2007 10:53:04 -0700
Subject: [ofa-general] Re: multiple threads posting to the same QP
In-Reply-To: <4701F6B6.9080006@voltaire.com> (Or Gerlitz's message of "Tue, 02 Oct 2007 09:43:50 +0200")
References: <46FFB02B.8040307@voltaire.com> <46FFC6AC.5010605@dev.mellanox.co.il> <47009B15.8010506@voltaire.com> <4701F6B6.9080006@voltaire.com>
Message-ID:

> Looking on the mthca and mlx4 low-level libraries I realize that you
> use pthread_spin_lock/unlock for thread safeness.
>
> Can you spare few words on why spinning is used rather than sleeping
> (eg pthread_mutex_lock/unlock) - is it since you assume that:
>
> A) if the lock is not contended - both calls have the same efficiency
> B) if the lock is contended - it would be such for --short-- time and
> hence spinning is more efficient than sleeping (no context-switch etc)

Pretty much, although the main reason is really that pthread spinlocks
are actually measurably faster than mutexes in the uncontended case.

> > (And BTW, yes, it is possible to call send(2) in any racy way you want
> > on the same FD, and the kernel's internal state will not get messed
> > up)
>
> is it documented anywhere? other than the kernel state, what happens
> if send(2) is called from two threads on a datagram socket? will the
> two datagrams be accepted in the remote side uncorrupted? as for
> stream socket, since the bits order in the stream is not defined in
> the sender
> side, "corruption" would surely be experienced by the remote side.

I don't know what the documentation has written explicitly. I guess you
could read the POSIX or SUS standards to see.

 - R.

From rdreier at cisco.com Tue Oct 2 10:53:20 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 02 Oct 2007 10:53:20 -0700
Subject: [ofa-general] [PATCH] mlx4: increase permissible number of QPs per multicast group to 56
In-Reply-To: <200710020940.13862.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 2 Oct 2007 09:40:13 +0200")
References: <200710020940.13862.jackm@dev.mellanox.co.il>
Message-ID:

Do we want a similar change for mthca?

 - R.

From rdreier at cisco.com Tue Oct 2 10:54:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 02 Oct 2007 10:54:46 -0700
Subject: [ofa-general] [PATCH-2.6.24 2/2] [RFC] ib/cm: add basic performance counters
In-Reply-To: <46FAF1C4.1090109@ichips.intel.com> (Sean Hefty's message of "Wed, 26 Sep 2007 16:56:52 -0700")
References: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com> <000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com> <46F953FC.50101@ichips.intel.com> <46FAA827.90504@ichips.intel.com> <46FAF1C4.1090109@ichips.intel.com>
Message-ID:

> Er, I could use some help here. Is there a preferred way to share
> /sys/class/infiniband_cm between the ib_cm and ib_user_cm modules?
>
> Currently, ib_user_cm registers the infiniband_cm class and registers
> devices (ucm0, ucm1, ...) on that class. It ends up making use of the
> infiniband_cm class 'release' callback for this. I want to make sure
> that I'm not overlooking some simple way of maintaining this while
> letting the ib_cm module stick statistics under it.

I guess you need to move the creation of the class into the ib_cm
module (since ib_cm can be loaded without ib_user_cm), and then export
some methods for ib_user_cm to share it. Dunno whether it's worth the
trouble.

 - R.

From sean.hefty at intel.com Tue Oct 2 10:58:32 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 2 Oct 2007 10:58:32 -0700
Subject: [ofa-general] [PATCH-2.6.24 2/2] [RFC] ib/cm: addbasic performance counters
In-Reply-To:
References: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com><000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com><46F953FC.50101@ichips.intel.com> <46FAA827.90504@ichips.intel.com> <46FAF1C4.1090109@ichips.intel.com>
Message-ID: <000001c8051d$d8f7fb60$ff0da8c0@amr.corp.intel.com>

>I guess you need to move the creation of the class into the ib_cm
>module (since ib_cm can be loaded without ib_user_cm), and then export
>some methods for ib_user_cm to share it.

Thanks - this is what I ended up doing. I just wasn't sure how
acceptable this approach would be.

- Sean

From fubar at us.ibm.com Tue Oct 2 11:10:19 2007
From: fubar at us.ibm.com (Jay Vosburgh)
Date: Tue, 02 Oct 2007 11:10:19 -0700
Subject: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
In-Reply-To: <4702773E.4090201@pobox.com>
References: <46F7D770.4090500@voltaire.com> <10376.1190733869@death> <470268A2.7080102@gmail.com> <4702773E.4090201@pobox.com>
Message-ID: <23084.1191348619@death>

Jeff Garzik wrote:

>Moni Shoua wrote:
>> Jay Vosburgh wrote:
>>> ACK patches 3 - 9.
>>>
>>> Roland, are you comfortable with the IB changes in patches 1 and 2?
>>>
>>> Jeff, when Roland acks patches 1 and 2, please apply all 9.
>>>
>>> -J
>>
>> Hi Jeff,
>> Roland acked the IPoIB patches. If you haven't done so already can you please apply them?
>> I'm not sure when 2.6.24 is going to open and I'm afraid to miss it.
>
>hrm, I don't see them in my inbox for some reason. can someone bounce
>them to me? or give me a git tree to pull from?

Moni, can you repost the patch series to Jeff, and put the appropriate
"Acked-by" lines in for myself (patches 3 - 8) and Roland (patches 1
and 2)? You can probably leave off the netdev and openfabrics lists,
but cc me.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com

From sean.hefty at intel.com Tue Oct 2 11:15:40 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 2 Oct 2007 11:15:40 -0700
Subject: [ofa-general] librdmacm 1.0.3 release
Message-ID: <000501c80520$3d81aca0$ff0da8c0@amr.corp.intel.com>

librdmacm 1.0.3 release is now available on the OFA download page (and
my git tree). This version will work with the 2.6.24 kernel code for
QoS support. Please pull this into OFED 1.3.

There is also a libibcm 1.0.1 release available that should also go
into OFED 1.3.

Thanks,
Sean

From rdreier at cisco.com Tue Oct 2 11:26:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 02 Oct 2007 11:26:53 -0700
Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To: <000001c7f6f7$074584e0$9c98070a@amr.corp.intel.com> (Sean Hefty's message of "Fri, 14 Sep 2007 10:45:23 -0700")
References: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com> <000001c7f6f7$074584e0$9c98070a@amr.corp.intel.com>
Message-ID:

> >OK -- just to make sure I'm understanding what you're saying: have you
> >confirmed that your proposed [CM MRA] patches actually fix the issue?
>
> Not directly. I cannot easily test kernel patches on our larger, production
> clusters. We've seen the issue with specific applications on 512 and 1024
> cores, but I've only been able to test the patch on a 48-core cluster. I have
> verified that it successfully increases the timeout to where it *should* work,
> but cannot absolutely confirm that it will fix the problem. I'm unlikely to
> know that until the production clusters move to an OFED release (1.3?)
> containing this patch.

Umm... this is a difficult situation for me to merge the changes then.
We're changing the CM retry behavior blind here. How do we know that
the MRA changes don't make the scalability issue worse?

 - R.

From sean.hefty at intel.com Tue Oct 2 11:50:04 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 2 Oct 2007 11:50:04 -0700
Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To:
References: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com><000001c7f6f7$074584e0$9c98070a@amr.corp.intel.com>
Message-ID: <000601c80525$0b661f30$ff0da8c0@amr.corp.intel.com>

>Umm... this is a difficult situation for me to merge the changes then.
>We're changing the CM retry behavior blind here. How do we know that
>the MRA changes don't make the scalability issue worse?

What's currently upstream doesn't work for Intel MPI on our larger
clusters. The connection requests time out on the active side before
the passive side can respond. The OFED release works because it
provides a kernel patch to make the timeout a module parameter. I'm
trying to avoid adding a module parameter, and the MRA is designed for
this situation.

I tested this by simulating a slow passive side responder, and it
worked as expected for those tests. Using an MRA does add another MAD
to the CM exchange, which is why it is sent only after seeing a
duplicate request.

Alternatively, we can take the OFED module parameter patch.

- Sean

From adit.262 at gmail.com Tue Oct 2 13:12:27 2007
From: adit.262 at gmail.com (Adit Ranadive)
Date: Tue, 2 Oct 2007 16:12:27 -0400
Subject: [ofa-general] IB Packet receive timings
Message-ID:

Hello,

I had a question on whether it is possible to get the exact timing at
which a packet arrived in the receive queue on the HCA? Does the packet
have a timestamp which the HCA modifies when it arrives? If so, how can
one retrieve this?
Thanks, Adit -- Adit Ranadive MS CS Candidate Georgia Institute of Technology, Atlanta, GA From ardavis at ichips.intel.com Tue Oct 2 16:04:03 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 02 Oct 2007 16:04:03 -0700 Subject: [ofa-general] Re: [PATCH] uDAPL fix DT_Mdep_GetTime In-Reply-To: <20071002003607.GX29287@kryten> References: <46FBF0CE.40409@ichips.intel.com> <20070927201205.GA29287@kryten> <46FC1498.1090708@ichips.intel.com> <20070929144808.GO29287@kryten> <20070929155029.GP29287@kryten> <20071001165422.GT29287@kryten> <20071001221805.GV29287@kryten> <20071002002020.GW29287@kryten> <20071002003607.GX29287@kryten> Message-ID: <4702CE63.7070700@ichips.intel.com> Anton Blanchard wrote: > Hi, > >> The userspace DT_Mdep_GetTime() function is supposed to output the time in >> milliseconds. It uses the times() syscall to do this, and uses the >> CLOCKS_PER_SEC define to scale it. On PowerPC this is hardwired to >> 1000000 which leads to bogus values. > > Note also that times() is pretty low resolution, Id suggest using > gettimeofday(). What do you think? Thanks, committed for 1.2 and 2.0. > > -- > > Change DT_Mdep_GetTime to use gettimeofday() which has more resolution > than times(). 
> > Signed-off-by: Anton Blanchard > --- > > diff --git a/test/dapltest/mdep/linux/dapl_mdep_user.c b/test/dapltest/mdep/linux/dapl_mdep_user.c > index 1e2d44b..c5738e1 100644 > --- a/test/dapltest/mdep/linux/dapl_mdep_user.c > +++ b/test/dapltest/mdep/linux/dapl_mdep_user.c > @@ -176,9 +176,9 @@ DT_Mdep_GetCpuStat ( > unsigned long > DT_Mdep_GetTime (void) > { > - struct tms ts; > - clock_t t = times (&ts); > - return (unsigned long) ((DAT_UINT64) t * 1000 / CLOCKS_PER_SEC); > + struct timeval tv; > + gettimeofday(&tv, NULL); > + return tv.tv_sec * 1000 + tv.tv_usec / 1000; > } > > double > From ardavis at ichips.intel.com Tue Oct 2 16:06:53 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 02 Oct 2007 16:06:53 -0700 Subject: [ofa-general] Re: [PATCH] uDAPL PPC fixes for dapl_osd.h In-Reply-To: <20070927201205.GA29287@kryten> References: <46FBF0CE.40409@ichips.intel.com> <20070927201205.GA29287@kryten> Message-ID: <4702CF0D.1010605@ichips.intel.com> Anton Blanchard wrote: > Hi Arlin, > > I have a patch to allow PowerPC to compile dapl as both 32 and 64bit. > This is for OFED1.2, if there are issues I can rebase to mainline. > > Thanks, committed for 1.2 and 2.0. > -- > > Fix dapl to compile as both 32bit and 64bit on PowerPC. Instead of using > the kernel atomic routines, code them explicitely like x86 does. 
> > Signed-off-by: Anton Blanchard > --- > > --- ./dapl/udapl/linux/dapl_osd.h.orig 2007-06-23 18:24:09.000000000 -0500 > +++ ./dapl/udapl/linux/dapl_osd.h 2007-06-23 18:38:01.000000000 -0500 > @@ -49,7 +49,7 @@ > #error UNDEFINED OS TYPE > #endif /* __linux__ */ > > -#if !defined (__i386__) && !defined (__ia64__) && !defined(__x86_64__) && !defined(__PPC64__) > +#if !defined (__i386__) && !defined (__ia64__) && !defined(__x86_64__) && !defined(__PPC__) && !defined(__PPC64__) > #error UNDEFINED ARCH > #endif > > @@ -78,12 +78,9 @@ > #include > #include > > -#if !defined(REDHAT_EL5) && (defined(__ia64__) || defined(__PPC64__)) > +#if !defined(REDHAT_EL5) && (defined(__ia64__)) > #include > #endif > -#if defined(__PPC64__) > -#include > -#endif > > /* Useful debug definitions */ > #ifndef STATIC > @@ -163,8 +160,17 @@ > #else > IA64_FETCHADD(old_value,v,1,4); > #endif > -#elif defined(__PPC64__) > - atomic_inc((atomic_t *) v); > +#elif defined(__PPC__) || defined(__PPC64__) > + int tmp; > + > + __asm__ __volatile__( > + "1: lwarx %0,0,%2\n\ > + addic %0,%0,1\n\ > + stwcx. %0,0,%2\n\ > + bne- 1b" > + : "=&r" (tmp), "+m" (v) > + : "r" (&v) > + : "cc"); > #else /* !__ia64__ */ > __asm__ __volatile__ ( > "lock;" "incl %0" > @@ -193,9 +199,17 @@ > #else > IA64_FETCHADD(old_value,v,-1,4); > #endif > -#elif defined (__PPC64__) > - atomic_dec((atomic_t *)v); > +#elif defined (__PPC__) || defined(__PPC64__) > + int tmp; > > + __asm__ __volatile__( > + "1: lwarx %0,0,%2\n\ > + addic %0,%0,-1\n\ > + stwcx. 
%0,0,%2\n\ > + bne- 1b" > + : "=&r" (tmp), "+m" (v) > + : "r" (&v) > + : "cc"); > #else /* !__ia64__ */ > __asm__ __volatile__ ( > "lock;" "decl %0" > @@ -240,7 +254,7 @@ > #else > current_value = ia64_cmpxchg(acq,v,match_value,new_value,4); > #endif /* __ia64__ */ > -#elif defined(__PPC64__) > +#elif defined(__PPC__) || defined(__PPC64__) > __asm__ __volatile__ ( > " lwsync\n\ > 1: lwarx %0,0,%2 # __cmpxchg_u32\n\ > --- ./test/dapltest/mdep/linux/dapl_mdep_user.h.orig 2007-06-23 18:26:45.000000000 -0500 > +++ ./test/dapltest/mdep/linux/dapl_mdep_user.h 2007-06-23 18:43:24.000000000 -0500 > @@ -124,7 +124,7 @@ > __asm__ __volatile__ ("mov %0=ar.itc" : "=r"(ret)); > return ret; > #else > -#if defined(__PPC64__) > +#if defined(__PPC__) || defined(__PPC64__) > unsigned int tbl, tbu0, tbu1; > do { > __asm__ __volatile__ ("mftbu %0" : "=r"(tbu0)); > --- ./test/dapltest/mdep/linux/dapl_mdep_user.c.orig 2007-06-24 02:56:12.000000000 -0500 > +++ ./test/dapltest/mdep/linux/dapl_mdep_user.c 2007-06-24 02:56:27.000000000 -0500 > @@ -186,7 +186,7 @@ > void ) > { > #define DT_CPU_MHZ_BUFFER_SIZE 128 > -#if defined (__PPC64__) > +#if defined (__PPC__) || defined (__PPC64__) > #define DT_CPU_MHZ_MHZ "clock" > #else > #define DT_CPU_MHZ_MHZ "cpu MHz" >
From kliteyn at mellanox.co.il Tue Oct 2 22:19:27 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 3 Oct 2007 07:19:27 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-03:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-02 OpenSM git rev = Mon_Oct_1_22:26:04_2007 [fc6f3f7cf82131748e4d6c22ecceb601e0883901] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From billfink at mindspring.com Tue Oct 2 22:29:29 2007 From: billfink at mindspring.com (Bill Fink) Date: Wed, 3 Oct 2007 01:29:29 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191331238.4353.59.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost>
<1191178346.6165.29.camel@localhost> <20071001001135.75d2b984.billfink@mindspring.com> <1191245440.4378.12.camel@localhost> <20071002002502.fe0f2bb3.billfink@mindspring.com> <1191331238.4353.59.camel@localhost> Message-ID: <20071003012929.d28f7cd8.billfink@mindspring.com> On Tue, 02 Oct 2007, jamal wrote: > On Tue, 2007-02-10 at 00:25 -0400, Bill Fink wrote: > > > One reason I ask, is that on an earlier set of alternative batching > > xmit patches by Krishna Kumar, his performance testing showed a 30 % > > performance hit for TCP for a single process and a size of 4 KB, and > > a performance hit of 5 % for a single process and a size of 16 KB > > (a size of 8 KB wasn't tested). Unfortunately I was too busy at the > > time to inquire further about it, but it would be a major potential > > concern for me in my 10-GigE network testing with 9000-byte jumbo > > frames. Of course the single process and 4 KB or larger size was > > the only case that showed a significant performance hit in Krishna > > Kumar's latest reported test results, so it might be acceptable to > > just have a switch to disable the batching feature for that specific > > usage scenario. So it would be useful to know if your xmit batching > > changes would have similar issues. > > There were many times while testing that i noticed inconsistencies and > in each case when i analysed[1], i found it to be due to some variable > other than batching which needed some resolving, always via some > parametrization or other. I suspect what KK posted is in the same class. > To give you an example, with UDP, batching was giving worse results at > around 256B compared to 64B or 512B; investigating i found that the > receiver just wasnt able to keep up and the udp layer dropped a lot of > packets so both iperf and netperf reported bad numbers. Fixing the > receiver ended up with consistency coming back. On why 256B was the one > that overwhelmed the receiver more than 64B(which sent more pps)? 
On > some limited investigation, it seemed to me to be the effect of the > choice of the tg3 driver's default tx mitigation parameters as well tx > ring size; which is something i plan to revisit (but neutralizing it > helps me focus on just batching). In the end i dropped both netperf and > iperf for similar reasons and wrote my own app. What i am trying to > achieve is demonstrate if batching is a GoodThing. In experimentation > like this, it is extremely valuable to reduce the variables. Batching > may expose other orthogonal issues - those need to be resolved or fixed > as they are found. I hope that sounds sensible. It does sound sensible. My own decidedly non-expert speculation was that the big 30 % performance hit right at 4 KB may be related to memory allocation issues or having to split the skb across multiple 4 KB pages. And perhaps it only affected the single process case because with multiple processes lock contention may be a bigger issue and the xmit batching changes would presumably help with that. I am admittedly a novice when it comes to the detailed internals of TCP/skb processing, although I have been slowly slogging my way through parts of the TCP kernel code to try and get a better understanding, so I don't know if these thoughts have any merit. BTW does anyone know of a good book they would recommend that has substantial coverage of the Linux kernel TCP code, that's fairly up-to-date and gives both an overall view of the code and packet flow as well as details on individual functions and algorithms, and hopefully covers basic issues like locking and synchronization, concurrency of different parts of the stack, and memory allocation. I have several books already on Linux kernel and networking internals, but they seem to only cover the IP (and perhaps UDP) portions of the network stack, and none have more than a cursory reference to TCP. 
The most useful documentation on the Linux TCP stack that I have found thus far is some of Dave Miller's excellent web pages and a few other web references, but overall it seems fairly skimpy for such an important part of the Linux network code. > Back to the >=9K packet size you raise above: > I dont have a 10Gige card so iam theorizing. Given that theres an > observed benefit to batching for a saturated link with "smaller" packets > (in my results "small" is anything below 256B which maps to about > 380Kpps anything above that seems to approach wire speed and the link is > the bottleneck); then i theorize that 10Gige with 9K jumbo frames if > already achieving wire rate, should continue to do so. And sizes below > that will see improvements if they were not already hitting wire rate. > So i would say that with 10G NICS, there will be more observed > improvements with batching with apps that do bulk transfers (assuming > those apps are not seeing wire speed already). Note that this hasnt been > quiet the case even with TSO given the bottlenecks in the Linux > receivers that J Heffner put nicely in a response to some results you > posted - but that exposes an issue with Linux receivers rather than TSO. It would be good to see some empirical evidence that there aren't any unforeseen gotchas for larger packet sizes, that at least the same level of performance can be obtained with no greater CPU utilization. > > Also for your xmit batching changes, I think it would be good to see > > performance comparisons for TCP and IP forwarding in addition to your > > UDP pktgen tests, > > That is not pktgen - it is a udp app running in process context > utilizing all 4CPUs to send traffic. pktgen bypasses the stack entirely > and has its own merits in proving that batching infact is a GoodThing > even if it is just for traffic generation ;-> > > > including various packet sizes up to and including > > 9000-byte jumbo frames. 
> > I will do TCP and forwarding tests in the near future. Looking forward to it. > cheers, > jamal > > [1] On average i spend 10x more time performance testing and analysing > results than writting code. As you have written previously, and I heartily agree with, this is a very good practice for developing performance enhancement patches. -Thanks -Bill From dotanb at dev.mellanox.co.il Wed Oct 3 00:44:26 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 03 Oct 2007 09:44:26 +0200 Subject: [ofa-general] IB Packet receive timings In-Reply-To: References: Message-ID: <4703485A.8080905@dev.mellanox.co.il> Adit Ranadive wrote: > Hello, > > I had a question on whether it is possible to get the exact timing at > which a packet arrived in the recieve queue on the HCA? > Does the packet have a timestamp which the HCA modifies when it > arrives? If so, how can one retrieve this?
> I think that getting this info is impossible. For example: a message of 2 GB was received (with several packet retransmission). Which timestamps will be created: for all of the packets or only for the first/last one? where will they be written in? For this message a single completion may be created (if this was a SEND operation and not an RDMA). Maybe HW vendors can supply this info using special registers/commands, but there isn't any standard way to get this info. The closest thing that you can do is to query the IB port counters for received packets and received data (for example, the utility perfquery give you the values of those counters). Dotan From tziporet at dev.mellanox.co.il Wed Oct 3 01:42:24 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 03 Oct 2007 10:42:24 +0200 Subject: [ofa-general] [PATCH] mlx4: increase permissible number of QPs per multicast group to 56 In-Reply-To: References: <200710020940.13862.jackm@dev.mellanox.co.il> Message-ID: <470355F0.3030301@mellanox.co.il> Roland Dreier wrote: > Do we want a similar change for mthca? 
> > > Yes Tziporet From vlad at lists.openfabrics.org Wed Oct 3 02:54:11 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 3 Oct 2007 02:54:11 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071003-0200 daily build status Message-ID: <20071003095411.D2D28E60876@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 
Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of 
'->' /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071003-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ----------------------------------------------------------------------------------
From hadi at cyberus.ca Wed Oct 3 06:42:34 2007 From: hadi at cyberus.ca (jamal) Date: Wed, 03 Oct 2007 09:42:34 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071003012929.d28f7cd8.billfink@mindspring.com> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> <1191178346.6165.29.camel@localhost> <20071001001135.75d2b984.billfink@mindspring.com> <1191245440.4378.12.camel@localhost> <20071002002502.fe0f2bb3.billfink@mindspring.com> <1191331238.4353.59.camel@localhost> <20071003012929.d28f7cd8.billfink@mindspring.com> Message-ID: <1191418954.4357.49.camel@localhost> On Wed, 2007-03-10 at 01:29 -0400, Bill Fink wrote: > It does sound sensible. My own decidedly non-expert speculation > was that the big 30 % performance hit right at 4 KB may be related > to memory allocation issues or having to split the skb across > multiple 4 KB pages. plausible. But i also worry it could be 10 other things; example, could it be the driver used? I noted in my udp test the oddity that turned out to be tx coal parameter related. In any case, I will attempt to run those tests later. > And perhaps it only affected the single > process case because with multiple processes lock contention may > be a bigger issue and the xmit batching changes would presumably > help with that.
I am admittedly a novice when it comes to the > detailed internals of TCP/skb processing, although I have been > slowly slogging my way through parts of the TCP kernel code to > try and get a better understanding, so I don't know if these > thoughts have any merit. You do bring up issues that need to be looked into and i will run those tests. Note, the effectiveness of batching becomes evident as the number of flows grows. Actually, scratch that: It becomes evident if you can keep the tx path busyed out to which multiple users running contribute. If i can have a user per CPU with lots of traffic to send, i can create that condition. It's a little boring in the scenario where the bottleneck is the wire but it needs to be checked. > BTW does anyone know of a good book they would recommend that has > substantial coverage of the Linux kernel TCP code, that's fairly > up-to-date and gives both an overall view of the code and packet > flow as well as details on individual functions and algorithms, > and hopefully covers basic issues like locking and synchronization, > concurrency of different parts of the stack, and memory allocation. > I have several books already on Linux kernel and networking internals, > but they seem to only cover the IP (and perhaps UDP) portions of the > network stack, and none have more than a cursory reference to TCP. > The most useful documentation on the Linux TCP stack that I have > found thus far is some of Dave Miller's excellent web pages and > a few other web references, but overall it seems fairly skimpy > for such an important part of the Linux network code. Reading books or magazines may end up busying you out with some small gains of knowledge at the end. They tend to be outdated fast. My advice is if you start with a focus on one thing, watch the patches that fly around on that area and learn that way. Read the code to further understand things then ask questions when its not clear. Other folks may have different views. 
The other way to do it is pick yourself some task to either add or improve something and get your hands dirty that way. > It would be good to see some empirical evidence that there aren't > any unforeseen gotchas for larger packet sizes, that at least the > same level of performance can be obtained with no greater CPU > utilization. Reasonable - I will try with 9K after i move over to the new tree from Dave and make sure nothing else broke in the previous tests. And when all looks good, i will move to TCP. > > [1] On average i spend 10x more time performance testing and analysing > > results than writting code. > > As you have written previously, and I heartily agree with, this is a > very good practice for developing performance enhancement patches. To give you a perspective, the results i posted were each run 10 iterations per packet size per kernel. Each run is 60 seconds long. I think i am past that stage for resolving or fixing anything for UDP or pktgen, but i need to keep checking for any new regressions when Dave updates his tree. Now multiply that by 5 packet sizes (I am going to add 2 more) and multiply that by 3-4 kernels. Then add the time it takes to sift through the data and collect it then analyze it and go back to the drawing table when something doesnt look right. Essentially, it needs a weekend ;-> cheers, jamal From kanoj at netxen.com Wed Oct 3 10:09:16 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Wed, 03 Oct 2007 10:09:16 -0700 Subject: [ofa-general] [PATCH] Fix racy deadlock in cma Message-ID: <4703CCBC.3060703@netxen.com> Hello, Here's a possible (aka easily reproducible) deadlock scenario involving cma.c's global mutex "lock" while destroying a listener. Assume provider has done a IW_CM_EVENT_CONNECT_REQUEST upcall on behalf of listener, thus iwcm.c:cm_event_handler() will cause refcount to be bumped and iw_cm_wq to be scheduled to execute cm_work_handler(). 
cma.c:rdma_destroy_id() is invoked on the listener causing invocation of the call chain cma_cancel_operation():cma_cancel_listens():cma_destroy_listen():iw_destroy_cm_id() with the global "lock" held; iw_destroy_cm_id() will do wait_for_completion(), waiting for the listener refcount to get to 0. When iw_cm_wq gets to run, it executes cm_work_handler():process_event():cm_conn_req_handler():iw_conn_req_handler(), which tries to get the global "lock" (held as described previously) and goes to sleep. The deadlock is because iw_cm_wq needs to execute cm_work_handler():iwcm_deref_id() for things to make forward progress. Notice that cm_conn_req_handler() tries to exit early if listener destruct has started (by checking IW_CM_STATE_LISTEN). iw_conn_req_handler() does similar checks on CMA_LISTEN. But there is a race window with the destruct path, such that the upcall path waits for the mutex which the destruct path acquires. Appended patch fixes the problem. Thanks. Kanoj --- drivers/infiniband/core/cma.c 2006-12-13 17:14:23.000000000 -0800 +++ /tmp/cma.c 2007-10-03 00:48:32.000000000 -0700 @@ -624,6 +624,7 @@ cma_exch(id_priv, CMA_DESTROYING); if (id_priv->cma_dev) { + mutex_unlock(&lock); switch (rdma_node_get_transport(id_priv->id.device->node_type)) { case RDMA_TRANSPORT_IB: if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) @@ -636,6 +637,7 @@ default: break; } + mutex_lock(&lock); cma_detach_from_dev(id_priv); } list_del(&id_priv->listen_list); From pw at osc.edu Wed Oct 3 10:42:52 2007 From: pw at osc.edu (Pete Wyckoff) Date: Wed, 3 Oct 2007 13:42:52 -0400 Subject: [ofa-general] iSER data corruption issues Message-ID: <20071003174252.GA28637@osc.edu> How does the requester (in IB speak) know that an RDMA Write operation has completed on the responder? We have a software iSER target, available at git.osc.edu/tgt or browse at http://git.osc.edu/?p=tgt.git . 
Using the existing in-kernel iSER initiator code, very rarely data corruption occurs, in that the received data from SCSI read operations does not match what was expected. Sometimes it appears as if random kernel memory has been scribbled on by an errant RDMA write from the target. My current working theory is that the RDMA write has not completed by the time the initiator looks at its incoming data buffer.

Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write work requests are used. After everything is connected up, a SCSI read sequence looks like:

    initiator: register pages with FMR, write test pattern
    initiator: Send request to target
    target: Recv request
    target: RDMA Write response to initiator
    target: Wait for CQ entry for local RDMA Write completion
    target: Send response to initiator
    initiator: Recv response, access buffer

On very rare occasions, this buffer will have the test pattern, not the data that the target just sent.

Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel. One site with fast disks can see similar corruption with 2.6.23-rc6, however. Target is pure userspace. Initiator is in kernel and is poked by "lmdd" (like normal dd) through an iSCSI block device (/dev/sdb).

The IB spec seems to indicate that the contents of the RDMA Write buffer should be stable after completion of a subsequent send message (o9-20). In fact, the "Wait for CQ entry" step on the target should be unnecessary, no?

Could there be some caching issues that the initiator is missing? I've added print[fk]s to the initiator and target to verify that the sequence of events is truly as above, and that the virtual addresses are as expected on both sides.

Any suggestions or advice would help. Thanks,

-- Pete

P.S. Here are some debugging printfs not in the git.
Userspace code does 200 read()s of length 8000, but complains about the result somewhere in the 14th read, from 112000 to 120000, and exits early. Expected pattern is a series of 400000 4-byte words, incrementing by 4, starting from 0. So 0x00000000, 0x00000004, ..., 0x001869fc:

    % lmdd of=internal ipat=1 if=/dev/sdb bs=8000 count=200 mismatch=10
    off=112000 want=1c000 got=3b3b3b3b

Initiator generates a series of SCSI operations, as driven by readahead and the block queue scheduler. You can see that it starts reading 4 pages, then 1 page, then 23 pages, then 1 page and so on, in order. These sizes and offsets vary from run to run. Each line here is printed after the SCSI read response has been received. It prints the first word in the buffer, and you can see the test pattern where data should be:

    tag 02 va 36061000 len 4000 word0 00000000 ref 1
    tag 03 va 36065000 len 1000 word0 00004000 ref 1
    tag 04 va 36066000 len 17000 word0 00005000 ref 1
    tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1
    tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1
    tag 07 va 7bdc2000 len 20000 word0 0003c000 ref 1

The userspace target code prints a line when it starts the RDMA write, then a line when the RDMA write completes locally, then a line when it sends the response. The tags are what the initiator assigned to each request.
The target thinks it is sending a 4096-byte buffer that has 0x1c000 as its first word, but the initiator did not see it:

    tag 02 va 36061000 len 4000 word0 00000000 rdmaw
    tag 02 rdmaw completion
    tag 02 resp
    tag 03 va 36065000 len 1000 word0 00004000 rdmaw
    tag 03 rdmaw completion
    tag 03 resp
    tag 04 va 36066000 len 17000 word0 00005000 rdmaw
    tag 04 rdmaw completion
    tag 04 resp
    tag 05 va 7b6bc000 len 1000 word0 0001c000 rdmaw
    tag 05 rdmaw completion
    tag 05 resp
    tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw
    tag 06 rdmaw completion
    tag 07 va 7bdc2000 len 20000 word0 0003c000 rdmaw
    tag 07 rdmaw completion
    tag 06 resp
    tag 07 resp

From john.benninghoff at intel.com Wed Oct 3 10:44:11 2007 From: john.benninghoff at intel.com (Benninghoff, John) Date: Wed, 3 Oct 2007 10:44:11 -0700 Subject: [ofa-general] RH4.5 and OFED 1.2.5 build problem with ib-bonding In-Reply-To: <000101c80520$0e02b550$ff0da8c0@amr.corp.intel.com> References: <2E020D3DD4A80647AE77E1692F6E97D930461A@FMSMSX420> <2E020D3DD4A80647AE77E1692F6E97D930470B@FMSMSX420> <000101c80520$0e02b550$ff0da8c0@amr.corp.intel.com> Message-ID: <2E020D3DD4A80647AE77E1692F6E97D934FE16@FMSMSX420>
I'm building OFED 1.2.5 using the build.sh script. All went fine except ib-bonding.
OFED release notes indicate my RH release is supported:

    - RedHat EL4 up5: 2.6.9-55.ELsmp

    [root at logon OFED-1.2.5]# uname -a
    Linux logon 2.6.9-55.0.2.ELlargesmp #1 SMP Tue Jun 12 18:09:16 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
    [root at logon OFED-1.2.5]# cat /etc/*release
    Red Hat Enterprise Linux AS release 4 (Nahant Update 5)

Errors clip from the build log:

    + cd linux/drivers/net/bonding/
    ++ pwd
    + make -C /lib/modules/2.6.9-55.0.2.ELlargesmp/build modules M=/var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding
    make: Entering directory `/usr/src/kernels/2.6.9-55.0.2.EL-largesmp-x86_64'
    CC [M] /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.o
    In file included from /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bond_main.c:78:
    /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function `bond_set_slave_inactive_flags':
    /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:263: error: `IFF_SLAVE_NEEDARP' undeclared (first use in this function)
    /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:263: error: (Each undeclared identifier is reported only once
    /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:263: error: for each function it appears in.)
    /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h: In function `bond_set_slave_active_flags':
    /var/tmp/OFEDRPM/BUILD/ib-bonding-0.9.0/linux/drivers/net/bonding/bonding.h:269: error: `IFF_SLAVE_NEEDARP' undeclared (first use in this function)
From sweitzen at cisco.com Wed Oct 3 10:47:36 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 3 Oct 2007 10:47:36 -0700 Subject: [ofa-general] RH4.5 and OFED 1.2.5 build problem with ib-bonding In-Reply-To: <2E020D3DD4A80647AE77E1692F6E97D934FE16@FMSMSX420> References: <2E020D3DD4A80647AE77E1692F6E97D930461A@FMSMSX420><2E020D3DD4A80647AE77E1692F6E97D930470B@FMSMSX420><000101c80520$0e02b550$ff0da8c0@amr.corp.intel.com> <2E020D3DD4A80647AE77E1692F6E97D934FE16@FMSMSX420> Message-ID: You are hitting https://bugs.openfabrics.org/show_bug.cgi?id=651, which was present in 1.2 and 1.2.5. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems
From rdreier at cisco.com Wed Oct 3 10:53:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 03 Oct 2007 10:53:58 -0700 Subject: [ofa-general] [PATCH] Fix racy deadlock in cma In-Reply-To: <4703CCBC.3060703@netxen.com> (Kanoj Sarcar's message of "Wed, 03 Oct 2007 10:09:16 -0700") References: <4703CCBC.3060703@netxen.com> Message-ID:
I'll leave it to Sean and others who know the cma locking better than I do to comment on the patch in detail, but a few notes:

 - your patch is completely whitespace mangled so it would have to be
   applied by hand. please look into configuring your mail client so
   that it can send patches without corrupting them.

 - patches should be generated so they apply with 'patch -p1', so
   rather than what you have:

   > --- drivers/infiniband/core/cma.c 2006-12-13 17:14:23.000000000 -0800
   > +++ /tmp/cma.c 2007-10-03 00:48:32.000000000 -0700

   the paths should be more like

     --- a/drivers/infiniband/core/cma.c
     +++ b/drivers/infiniband/core/cma.c

 - I wonder which tree you generated your patch against: it seems to
   be modifying cma_destroy_listen(), but you have the header:

   > @@ -624,6 +624,7 @@

   and cma_destroy_listen() is nowhere near line 624 in my tree. (And
   it would be nice to use the '-p' option of diff to put the function
   name there for easier reviewing)

 - Your comment doesn't make it clear to me that dropping and
   reacquiring the lock is safe; can you explain why nothing else
   could come along while the lock is dropped and mess things up?

   It seems rdma_destroy_id() has the same pattern, but it's not clear
   to me in the code:

        mutex_lock(&lock);
        if (id_priv->cma_dev) {
                mutex_unlock(&lock);
                // why can't the device be hot-unplugged here??
                switch (rdma_node_get_transport(id->device->node_type)) {

   what guarantees that the device does not disappear before it is
   dereferenced in the switch statement. This would be a separate bug
   but we probably shouldn't introduce another instance of it
   (assuming I'm correct).

 - R.
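For readers following this thread, the deadlock shape Kanoj describes can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions, not the cma.c or iwcm.c code: the names `struct model`, `run_work_handler` and `destroy_listener` are hypothetical, the global mutex is modeled as a boolean, and `wait_for_completion()` is modeled as a check of whether the listener refcount has reached zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not kernel code. */
struct model {
    bool lock_held;   /* stands in for cma.c's global mutex "lock" */
    int  refcount;    /* listener reference taken for the pending upcall */
};

/* Models the queued work item: it needs the lock before it can handle
 * the event and drop its reference. Returns true if it made progress. */
static bool run_work_handler(struct model *m)
{
    if (m->lock_held)
        return false;        /* the real handler would sleep on the mutex */
    m->lock_held = true;     /* connect-request handler takes the lock */
    m->lock_held = false;    /* and releases it when done */
    m->refcount--;           /* deref after the event has been processed */
    return true;
}

/* Models the destroy path. Returns true if the wait for refcount == 0
 * can complete, false if it would block forever. */
static bool destroy_listener(struct model *m, bool drop_lock_first)
{
    m->lock_held = true;             /* destroy path runs with the lock held */
    if (drop_lock_first)
        m->lock_held = false;        /* the unlock added by the posted patch */
    run_work_handler(m);             /* the work queue gets a chance to run */
    if (m->refcount != 0)
        return false;                /* real code blocks here, lock still held */
    if (drop_lock_first)
        m->lock_held = true;         /* the patch retakes the lock after the wait */
    m->lock_held = false;            /* destroy path finishes and unlocks */
    return true;
}
```

With `drop_lock_first` false the handler never gets the lock, so the refcount stays at 1 and the destroy path cannot finish. The model says nothing about whether dropping the lock is otherwise safe, which is the question raised in the review above.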
From tom at opengridcomputing.com Wed Oct 3 11:02:22 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 03 Oct 2007 13:02:22 -0500 Subject: [ofa-general] iSER data corruption issues In-Reply-To: <20071003174252.GA28637@osc.edu> References: <20071003174252.GA28637@osc.edu> Message-ID: <1191434542.1022.3.camel@trinity.ogc.int> On Wed, 2007-10-03 at 13:42 -0400, Pete Wyckoff wrote: > How does the requester (in IB speak) know that an RDMA Write > operation has completed on the responder? > > We have a software iSER target, available at git.osc.edu/tgt or > browse at http://git.osc.edu/?p=tgt.git . Using the existing > in-kernel iSER initiator code, very rarely data corruption occurs, > in that the received data from SCSI read operations does not match > what was expected. Sometimes it appears as if random kernel memory > has been scribbled on by an errant RDMA write from the target. My > current working theory that the RDMA write has not completed by the > time the initiator looks at its incoming data buffer. > > Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write > work requests are used. After everything is connected up, a SCSI > read sequence looks like: > > initiator: register pages with FMR, write test pattern > initiator: Send request to target > target: Recv request > target: RDMA Write response to initiator > target: Wait for CQ entry for local RDMA Write completion Pete: I don't think this should be necessary... > target: Send response to initiator ...as long as the send is posted on the same SQ as the write. > initiator: Recv response, access buffer > > On very rare occasions, this buffer will have the test pattern, not > the data that the target just sent. > > Machines are opteron, fedora 7 up-to-date with its openfab libs, > kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or > 2.6.18-rhel5 on initiator. For some reason, it is much easier to > produce with the rhel5 kernel. 
One site with fast disks can see > similar corruption with 2.6.23-rc6, however. Target is pure > userspace. Initiator is in kernel and is poked by "lmdd" (like > normal dd) through an iSCSI block device (/dev/sdb). > > The IB spec seems to indicate that the contents of the RDMA Write > buffer should be stable after completion of a subsequent send > message (o9-20). In fact, the "Wait for CQ entry" step on the > target should be unnecessary, no? I think so too. > > Could there be some caching issues that the initiator is missing? > I've added print[fk]s to the initiator and target to verify that the > sequence of events is truly as above, and that the virtual addresses > are as expected on both sides. > > Any suggestions or advice would help. Thanks, > If your theory is correct, the data should eventually show up. Does it? Does your code check for errors on dma_map_single/page? > -- Pete > > > P.S. Here are some debugging printfs not in the git. > > Userspace code does 200 read()s of length 8000, but complains about > the result somewhere in the 14th read, from 112000 to 120000, and > exits early. Expected pattern is a series of 400000 4-byte words, > incrementing by 4, starting from 0. So 0x00000000, 0x00000004, ..., > 0x001869fc: > > % lmdd of=internal ipat=1 if=/dev/sdb bs=8000 count=200 mismatch=10 > off=112000 want=1c000 got=3b3b3b3b > > Initiator generates a series of SCSI operations, as driven by > readahead and the block queue scheduler. You can see that it starts > reading 4 pages, then 1 page, then 23 pages, then 1 page and so on, > in order. These sizes and offsets vary from run to run. Each line > here is printed after the SCSI read response has been received. 
It > prints the first word in the buffer, and you can see the test > pattern where data should be: > > tag 02 va 36061000 len 4000 word0 00000000 ref 1 > tag 03 va 36065000 len 1000 word0 00004000 ref 1 > tag 04 va 36066000 len 17000 word0 00005000 ref 1 > tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1 Is it interesting that the bad word occurs on the first page of the new map? > tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1 > tag 07 va 7bdc2000 len 20000 word0 0003c000 ref 1 > > The userspace target code prints a line when it starts the RDMA > write, then a line when the RDMA write completes locally, then a > line when it sends the repsponse. The tags are what the initiator > assigned to each request. The target thinks it is sending a > 4096-byte buffer that has 0x1c000 as its first word, but the > initiator did not see it: > > tag 02 va 36061000 len 4000 word0 00000000 rdmaw > tag 02 rdmaw completion > tag 02 resp > tag 03 va 36065000 len 1000 word0 00004000 rdmaw > tag 03 rdmaw completion > tag 03 resp > tag 04 va 36066000 len 17000 word0 00005000 rdmaw > tag 04 rdmaw completion > tag 04 resp > tag 05 va 7b6bc000 len 1000 word0 0001c000 rdmaw > tag 05 rdmaw completion > tag 05 resp > tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw > tag 06 rdmaw completion > tag 07 va 7bdc2000 len 20000 word0 0003c000 rdmaw > tag 07 rdmaw completion > tag 06 resp > tag 07 resp > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From xma at us.ibm.com Wed Oct 3 11:43:40 2007 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 3 Oct 2007 11:43:40 -0700 Subject: [ofa-general] Re: InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: Message-ID: Roland Dreier wrote on 09/17/2007 02:47:42 PM: > > > IPoIB CM handles this properly by gathering together single pages in > > > skbs' 
fragment lists. > > > Then can we reuse IPoIB CM code here? > > Yes, if possible, refactoring things so that the rx skb allocation > code becomes common between CM and non-CM would definitely make sense. IPoIB-CM rx skb allocation is not generic to be used by UD, it allocates more buffers than needed if mtu is not 64K, and doesn't query the real max_num_sg from the device. I am thinking to have a generic skb allocation in IPoIB based on matrix of (ipoib-mtu-size, page-size, max_num_sg, head-size). Thanks Shirley From kanoj at netxen.com Wed Oct 3 11:44:25 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Wed, 03 Oct 2007 11:44:25 -0700 Subject: [ofa-general] [PATCH] Fix racy deadlock in cma In-Reply-To: References: <4703CCBC.3060703@netxen.com> Message-ID: <4703E309.6060103@netxen.com> Roland Dreier wrote: >I'll leave it to Sean and others who know the cma locking better than >I do to comment on the patch in detail, but a few notes: > > - your patch is completely whitespace mangled so it would have to be > applied by hand. please look into configuring your mail client so > that it can send patches without corrupting them. > > - patches should be generated so they apply with 'patch -p1', so > rather than what you have: > > > --- drivers/infiniband/core/cma.c 2006-12-13 17:14:23.000000000 -0800 > > +++ /tmp/cma.c 2007-10-03 00:48:32.000000000 -0700 > > the paths should be more like > > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > > Sorry, its been a while since I posted a patch. I am attaching a copy of it (not sure if patch attachments are ok), generated with "diff -Naurp". > - I wonder which tree you generated your patch against: it seems to > be modifying cma_destroy_listen(), but you have the header: > > > @@ -624,6 +624,7 @@ > > and cma_destroy_listen() is nowhere near line 624 in my tree. (And > it would be nice to use the '-p' option of diff to put the function > name there for easier reviewing) > > This was against 2.6.20. 
> - Your comment doesn't make it clear to me that dropping and > reacquiring the lock is safe; can you explain why nothing else > could come along while the lock is dropped and mess things up? > > It seems rdma_destroy_id() has the same pattern, but it's not clear > to me in the code: > > mutex_lock(&lock); > if (id_priv->cma_dev) { > mutex_unlock(&lock); > // why can't the device be hot-unplugged here?? > switch (rdma_node_get_transport(id->device->node_type)) { > > what guarantees that the device does not disappear before it is > dereferenced in the switch statement. This would be a separate bug > but we probably shouldn't introduce another instance of it > (assuming I'm correct). > > - R. > > > Yes, rdma_destroy_id() has the same thing, where the lock is dropped; that was the inspiration for this fix too. I believe that by setting the cmid state to CMA_DESTROYING and still keeping it on the device's list, it should be ok to drop the lock and have a racing cma_process_remove() silently ignore this cmid. Also notice that the destroying thread has to do a cma_detach_from_dev() to dec the refcount on the device before the device structure can be freed up. By no means do I understand all the intricacies of the cma code, hopefully Sean/others will review and comment. Thanks. Kanoj -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: pp URL: From mshefty at ichips.intel.com Wed Oct 3 11:47:33 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 03 Oct 2007 11:47:33 -0700 Subject: [ofa-general] [PATCH] Fix racy deadlock in cma In-Reply-To: References: <4703CCBC.3060703@netxen.com> Message-ID: <4703E3C5.4020404@ichips.intel.com> > - Your comment doesn't make it clear to me that dropping and > reacquiring the lock is safe; can you explain why nothing else > could come along while the lock is dropped and mess things up? 
I need to study this part in more detail, but I don't think we can safely release the lock without introducing a race in at least cma_listen_on_all(). > It seems rdma_destroy_id() has the same pattern, but it's not clear > to me in the code: > > mutex_lock(&lock); > if (id_priv->cma_dev) { > mutex_unlock(&lock); > // why can't the device be hot-unplugged here?? The state of the id has been set to destroying, which will cause the device removal code to ignore the id. Even if device removal occurs before the id state has been set, this should be safe. A hot-plug event reports the device removal, but waits for the user to destroy the id. The device is only removed from the id by this function, further down. The locking here is, in part, to prevent attaching a device to the id from a callback while it's being destroyed. See addr_handler(). - Sean From kanoj at netxen.com Wed Oct 3 12:14:39 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Wed, 03 Oct 2007 12:14:39 -0700 Subject: [ofa-general] [PATCH] Fix racy deadlock in cma In-Reply-To: <4703E3C5.4020404@ichips.intel.com> References: <4703CCBC.3060703@netxen.com> <4703E3C5.4020404@ichips.intel.com> Message-ID: <4703EA1F.8060307@netxen.com> Sean Hefty wrote: >> - Your comment doesn't make it clear to me that dropping and >> reacquiring the lock is safe; can you explain why nothing else >> could come along while the lock is dropped and mess things up? > > > I need to study this part in more detail, but I don't think we can > safely release the lock without introducing a race in at least > cma_listen_on_all(). > Yes, if you see a race, please point it out, that would help me understand this code too. I will just point out that in the call chain rdma_destroy_id():cma_cancel_operation():cma_cancel_listens(), the cmid is taken off the listen_any_list first, before the provider calls are made. Thanks. 
Kanoj >> It seems rdma_destroy_id() has the same pattern, but it's not clear >> to me in the code: >> >> mutex_lock(&lock); >> if (id_priv->cma_dev) { >> mutex_unlock(&lock); >> // why can't the device be hot-unplugged here?? > > > The state of the id has been set to destroying, which will cause the > device removal code to ignore the id. Even if device removal occurs > before the id state has been set, this should be safe. A hot-plug > event reports the device removal, but waits for the user to destroy > the id. The device is only removed from the id by this function, > further down. > > The locking here is, in part, to prevent attaching a device to the id > from a callback while it's being destroyed. See addr_handler(). > > - Sean > From rdreier at cisco.com Wed Oct 3 12:20:41 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 03 Oct 2007 12:20:41 -0700 Subject: [ofa-general] [PATCH] Fix racy deadlock in cma In-Reply-To: <4703E3C5.4020404@ichips.intel.com> (Sean Hefty's message of "Wed, 03 Oct 2007 11:47:33 -0700") References: <4703CCBC.3060703@netxen.com> <4703E3C5.4020404@ichips.intel.com> Message-ID: > > It seems rdma_destroy_id() has the same pattern, but it's not clear > > to me in the code: > > mutex_lock(&lock); > > if (id_priv->cma_dev) { > > mutex_unlock(&lock); > > // why can't the device be hot-unplugged here?? > > The state of the id has been set to destroying, which will cause the > device removal code to ignore the id. Even if device removal occurs > before the id state has been set, this should be safe. A hot-plug > event reports the device removal, but waits for the user to destroy > the id. The device is only removed from the id by this function, > further down. Got it -- you still have a cma-internal reference to the device, so the hot-unplug won't complete, even though you drop the lock. OK, looks fine to me. - R. 
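The argument Sean and Roland settle on here (the id is marked as destroying, and it still holds a reference on the device, so a racing hot-unplug can neither act on the id nor finish) can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the cma.c implementation: `struct device`, `struct id`, `device_remove` and `id_detach` are hypothetical names, and the locking itself is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code. */
enum id_state { ID_ACTIVE, ID_DESTROYING };

struct device { int refcount; bool freed; };
struct id     { enum id_state state; struct device *dev; };

/* Removal path: ignores ids that are already being destroyed, and can
 * only complete once no id holds a reference on the device. */
static bool device_remove(struct device *d, struct id *id, int *notified)
{
    if (id->dev == d && id->state != ID_DESTROYING)
        (*notified)++;       /* would report the removal to this id's user */
    if (d->refcount > 0)
        return false;        /* removal has to wait for the last reference */
    d->freed = true;
    return true;
}

/* Destroy path, final step: drop the id's reference on the device. */
static void id_detach(struct id *id)
{
    id->dev->refcount--;
    id->dev = NULL;
}
```

In this model the destroy path can set `ID_DESTROYING`, run without the lock, and still dereference `id->dev`, because `device_remove()` cannot free the device until `id_detach()` has run.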
From rdreier at cisco.com Wed Oct 3 12:22:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 03 Oct 2007 12:22:54 -0700 Subject: [ofa-general] [PATCH] Fix racy deadlock in cma In-Reply-To: <4703E309.6060103@netxen.com> (Kanoj Sarcar's message of "Wed, 03 Oct 2007 11:44:25 -0700") References: <4703CCBC.3060703@netxen.com> <4703E309.6060103@netxen.com> Message-ID: > This was against 2.6.20. 2.6.20 was released in February, so that code is 8 months old now. It's much better to base patches against a more current tree. Even 2.6.22 is a little crusty now -- a current 2.6.23-rc9 tree or even my for-2.6.24 git branch would be better. - R. From ardavis at ichips.intel.com Wed Oct 3 12:28:18 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 03 Oct 2007 12:28:18 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <795c49870709241137g27b82df6ueba445ae4a3fdb6f@mail.gmail.com> References: <46F43D07.1010902@ichips.intel.com> <46F44951.6080401@ichips.intel.com> <46F4512C.4010505@ichips.intel.com> <795c49870709211702k1294cd79y5b7c987b04958adf@mail.gmail.com> <46F7EA33.5050706@ichips.intel.com> <795c49870709241137g27b82df6ueba445ae4a3fdb6f@mail.gmail.com> Message-ID: <4703ED52.7090907@ichips.intel.com> Jeff Becker wrote: > Hi Sean. I just talked to Jeff Scott about this, as he had announced > the new downloads page. It turns out that the new page does not use my > php page that automatically updates, but rather took a "snapshot" of > the page state. That's why your update doesn't show up. He said he > would try to fix this. > When can we get this fixed? From ardavis at ichips.intel.com Wed Oct 3 12:45:58 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 03 Oct 2007 12:45:58 -0700 Subject: [ofa-general] [ANNOUCE] dapl-1.2.2-1 and dapl-2.0.1-1 release Message-ID: <4703F176.6040907@ichips.intel.com> All, There are new releases for DAPL 1.2 and 2.0 available on the OFA download page and in my git tree. 
They are built to co-exist and support existing 1.2 applications while providing a development environment for the new 2.0 API's, including IB extensions. md5sum: 381642e81a9e8a8ed48258b8066c3434 dapl-1.2.2-1.tar.gz md5sum: 804e7669130772cc90dbb101170025e6 dapl-2.0.1-1.tar.gz Vlad, please pull both these releases into OFED 1.3, using the configure options from the package specfiles, and install the following packages: dapl-1.2.2-1 dapl-2.0.1-1 dapl-utils-2.0.1-1 dapl-devel-2.0.1-1 dapl-debuginfo-2.0.1-1 See http://www.openfabrics.org/downloads/dapl/README for more details. -arlin From kanoj at netxen.com Wed Oct 3 12:47:05 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Wed, 03 Oct 2007 12:47:05 -0700 Subject: [ofa-general] [PATCH] Fix racy deadlock in cma In-Reply-To: References: <4703CCBC.3060703@netxen.com> <4703E309.6060103@netxen.com> Message-ID: <4703F1B9.40902@netxen.com> Roland Dreier wrote: > > This was against 2.6.20. > >2.6.20 was released in February, so that code is 8 months old now. >It's much better to base patches against a more current tree. Even >2.6.22 is a little crusty now -- a current 2.6.23-rc9 tree or even >my for-2.6.24 git branch would be better. > > - R. > > > Agreed. I did check 2.6.22 had the same problem code though, before I sent the patch. Kanoj From pw at osc.edu Wed Oct 3 13:15:47 2007 From: pw at osc.edu (Pete Wyckoff) Date: Wed, 3 Oct 2007 16:15:47 -0400 Subject: [ofa-general] iSER data corruption issues In-Reply-To: <1191434542.1022.3.camel@trinity.ogc.int> References: <20071003174252.GA28637@osc.edu> <1191434542.1022.3.camel@trinity.ogc.int> Message-ID: <20071003201547.GB10013@osc.edu> tom at opengridcomputing.com wrote on Wed, 03 Oct 2007 13:02 -0500: > On Wed, 2007-10-03 at 13:42 -0400, Pete Wyckoff wrote: > > My > > current working theory that the RDMA write has not completed by the > > time the initiator looks at its incoming data buffer. [..] > If your theory is correct, the data should eventually show up. Does it? 
Good point. It does not eventually show up. I added 5 1-second busy loop delays, checking to see if the values ever change. They don't. > Does your code check for errors on dma_map_single/page? This is drivers/infiniband/ulp/iser/iser_verbs.c, in iser_reg_page_vec, as called from iser_reg_rdma_mem. It uses ib_fmr_pool_map_phys, and would complain if it saw an error. These are page cache pages, and the FMR calls seem to take physical pages, but never map them into DMA addresses. Should be no mapping required for opteron and arbel, though. I could be misunderstanding something here. I don't see any major differences between this old 2.6.18-rhel5 and 2.6.23-rc6, except for a call to dma_sync_single() in mthca_arbel_map_phys_fmr(), which I'm guessing is a noop on this platform (swiotlb). Unfortunately 2.6.23-rc6 does not break at my site. At the other site with fast disks, adding any sort of kernel debugging apparently causes the problem to go away. Frustrating. > > tag 02 va 36061000 len 4000 word0 00000000 ref 1 > > tag 03 va 36065000 len 1000 word0 00004000 ref 1 > > tag 04 va 36066000 len 17000 word0 00005000 ref 1 > > tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1 > > Is it interesting that the bad word occurs on the first page of the new > map? One would think so, but it is not always the first page. Sometimes, less often, it is the first word of a page in the middle of a map. I'll keep digging. Thanks, -- Pete From Thomas.Talpey at netapp.com Wed Oct 3 13:48:38 2007 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 03 Oct 2007 16:48:38 -0400 Subject: [ofa-general] iSER data corruption issues In-Reply-To: <20071003174252.GA28637@osc.edu> References: <20071003174252.GA28637@osc.edu> Message-ID: At 01:42 PM 10/3/2007, Pete Wyckoff wrote: >Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write >work requests are used.
After everything is connected up, a SCSI >read sequence looks like: > > initiator: register pages with FMR, write test pattern > initiator: Send request to target > target: Recv request > target: RDMA Write response to initiator > target: Wait for CQ entry for local RDMA Write completion > target: Send response to initiator > initiator: Recv response, access buffer >... >The IB spec seems to indicate that the contents of the RDMA Write >buffer should be stable after completion of a subsequent send >message (o9-20). In fact, the "Wait for CQ entry" step on the >target should be unnecessary, no? Not only unnecessary, on some hardware it may even be meaningless. A local completion means only that the hardware has accepted the RDMA Write, not that it has been sent - and certainly not that it has been received and placed in remote memory. I would look into the dma_sync behavior on the receiver. Especially on an Opteron, it's critical to synchronize the iommu and cachelines to the right memory locations. Since the FMR code hides some of this, it may be a challenge to trace. Can you try another memory registration strategy? NFS/RDMA can do that, for example. Tom. From rajouri.jammu at gmail.com Wed Oct 3 13:54:11 2007 From: rajouri.jammu at gmail.com (Rajouri Jammu) Date: Wed, 3 Oct 2007 13:54:11 -0700 Subject: [ofa-general] ib_create_cq in OFED 1.2.5 Vs 1.2 Message-ID: <3307cdf90710031354ka25df1dhefcea03408372a97@mail.gmail.com> The ib_create_cq() api changed from 1.2 to 1.2.5. We have a custom driver that runs on OFED kernel modules. Is there a way to find out at compile time the ABI version for the kernel verbs? From meier3 at llnl.gov Wed Oct 3 14:30:22 2007 From: meier3 at llnl.gov (Timothy A. Meier) Date: Wed, 03 Oct 2007 14:30:22 -0700 Subject: [ofa-general] [PATCH] opensm: osm_console.h replaced string literals with macro definitions Message-ID: <470409EE.8010905@llnl.gov> Sasha - another small patch. 
I think I fixed the line wrap issue, but have also attached the patch just in case. From f1ea67d05410373c90441962e1f3005aa6212b05 Mon Sep 17 00:00:00 2001 From: Tim Meier Date: Wed, 3 Oct 2007 14:05:03 -0700 Subject: [PATCH] opensm: osm_console.h replaced string literals with macro definitions Several string constants are used to define and control the behavior of the OSM Console. This patch formalizes those constants, and uses them in a consistent manner. Signed-off-by: Tim Meier --- opensm/include/opensm/osm_console.h | 8 +++++++- opensm/opensm/main.c | 14 +++++++------- opensm/opensm/osm_console.c | 8 ++++---- opensm/opensm/osm_subnet.c | 6 +++--- 4 files changed, 21 insertions(+), 15 deletions(-) diff --git a/opensm/include/opensm/osm_console.h b/opensm/include/opensm/osm_console.h index ceba3cc..33e41e7 100644 --- a/opensm/include/opensm/osm_console.h +++ b/opensm/include/opensm/osm_console.h @@ -38,9 +38,15 @@ #include #include +#define OSM_DISABLE_CONSOLE "off" +#define OSM_LOCAL_CONSOLE "local" +#define OSM_REMOTE_CONSOLE "socket" +#define OSM_LOOPBACK_CONSOLE "loopback" +#define OSM_CONSOLE_NAME "OSM Console" + #define OSM_COMMAND_LINE_LEN 120 #define OSM_COMMAND_PROMPT "$ " -#define OSM_DEFAULT_CONSOLE "off" +#define OSM_DEFAULT_CONSOLE OSM_DISABLE_CONSOLE #define OSM_DEFAULT_CONSOLE_PORT 10000 #define OSM_DAEMON_NAME "opensm" diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 0005531..0250551 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -733,11 +733,11 @@ int main(int argc, char *argv[]) /* * OpenSM interactive console */ - if (strcmp(optarg, "off") == 0 - || strcmp(optarg, "local") == 0 + if (strcmp(optarg, OSM_DISABLE_CONSOLE) == 0 + || strcmp(optarg, OSM_LOCAL_CONSOLE) == 0 #ifdef ENABLE_OSM_CONSOLE_SOCKET - || strcmp(optarg, "socket") == 0 - || strcmp(optarg, "loopback") == 0 + || strcmp(optarg, OSM_REMOTE_CONSOLE) == 0 + || strcmp(optarg, OSM_LOOPBACK_CONSOLE) == 0 #endif ) opt.console = optarg; @@ -1040,10 +1040,10 @@ 
int main(int argc, char *argv[]) Sit here forever */ while (!osm_exit_flag) { - if (strcmp(opt.console, "local") == 0 + if (strcmp(opt.console, OSM_LOCAL_CONSOLE) == 0 #ifdef ENABLE_OSM_CONSOLE_SOCKET - || strcmp(opt.console, "socket") == 0 - || strcmp(opt.console, "loopback") == 0 + || strcmp(opt.console, OSM_REMOTE_CONSOLE) == 0 + || strcmp(opt.console, OSM_LOOPBACK_CONSOLE) == 0 #endif ) osm_console(&osm); diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index c2816d5..c6e02ab 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -927,7 +927,7 @@ void osm_console_init(osm_subn_opt_t * opt, osm_opensm_t * p_osm) { p_osm->console.socket = -1; /* set up the file descriptors for the console */ - if (strcmp(opt->console, "local") == 0) { + if (strcmp(opt->console, OSM_LOCAL_CONSOLE) == 0) { p_osm->console.in = stdin; p_osm->console.out = stdout; p_osm->console.in_fd = fileno(stdin); @@ -935,8 +935,8 @@ void osm_console_init(osm_subn_opt_t * opt, osm_opensm_t * p_osm) osm_console_prompt(p_osm->console.out); #ifdef ENABLE_OSM_CONSOLE_SOCKET - } else if (strcmp(opt->console, "socket") == 0 - || strcmp(opt->console, "loopback") == 0) { + } else if (strcmp(opt->console, OSM_REMOTE_CONSOLE) == 0 + || strcmp(opt->console, OSM_LOOPBACK_CONSOLE) == 0) { struct sockaddr_in sin; int optval = 1; @@ -951,7 +951,7 @@ void osm_console_init(osm_subn_opt_t * opt, osm_opensm_t * p_osm) &optval, sizeof(optval)); sin.sin_family = AF_INET; sin.sin_port = htons(opt->console_port); - if (strcmp(opt->console, "socket") == 0) + if (strcmp(opt->console, OSM_REMOTE_CONSOLE) == 0) sin.sin_addr.s_addr = htonl(INADDR_ANY); else sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 8475936..829c82b 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -936,10 +936,10 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts) p_opts->force_link_speed = 
IB_PORT_LINK_SPEED_ENABLED_MASK; } - if (strcmp(p_opts->console, "off") - && strcmp(p_opts->console, "local") + if (strcmp(p_opts->console, OSM_DISABLE_CONSOLE) + && strcmp(p_opts->console, OSM_LOCAL_CONSOLE) #ifdef ENABLE_OSM_CONSOLE_SOCKET - && strcmp(p_opts->console, "socket") + && strcmp(p_opts->console, OSM_REMOTE_CONSOLE) #endif ) { sprintf(buff, " Invalid Cached Option Value:console = %s" -- 1.5.1.4 -- Timothy A. Meier Computer Scientist ICCD/High Performance Computing 925.422.3341 meier3 at llnl.gov -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0001-opensm-osm_console.h-replaced-string-literals-with.patch URL: From rdreier at cisco.com Wed Oct 3 15:01:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 03 Oct 2007 15:01:44 -0700 Subject: [ofa-general] iSER data corruption issues In-Reply-To: <20071003174252.GA28637@osc.edu> (Pete Wyckoff's message of "Wed, 3 Oct 2007 13:42:52 -0400") References: <20071003174252.GA28637@osc.edu> Message-ID: > Machines are opteron, fedora 7 up-to-date with its openfab libs, > kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or > 2.6.18-rhel5 on initiator. For some reason, it is much easier to > produce with the rhel5 kernel. There was a bug in mthca that caused data corruption with FMRs on Sinai (1-port PCIe) HCAs. It was fixed in commit 608d8268 ("IB/mthca: Fix data corruption after FMR unmap on Sinai") which went in shortly before 2.6.21 was released. I don't know if the RHEL5 2.6.18 kernel has this fix or not -- but if you still see the problem on 2.6.22 and later kernels then this isn't the fix anyway. - R. 
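The check Pete describes in this thread (fill the read buffer with a test pattern before sending the request, then look at the first word of each page once the response arrives) can be sketched in plain C. This is only a minimal user-space illustration, not the iSER initiator code: the helper names fill_pattern() and first_stale_page() are hypothetical, the 0x1000 page size is taken from the "len 1000" entries in the tag dump, and 0x3b3b3b3b is assumed to be the fill value because it is the word reported as bad.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES   0x1000u      /* 4 KiB pages, as in the "len 1000" tag entries */
#define TEST_PATTERN 0x3b3b3b3bu  /* assumed fill value, from the "word0 3b3b3b3b" line */

/* Hypothetical helper: fill the buffer with the test pattern before the read is issued. */
static void fill_pattern(void *buf, size_t len)
{
    memset(buf, 0x3b, len);
}

/*
 * Hypothetical helper: return the index of the first page whose first
 * 32-bit word still holds the test pattern, or -1 if every page start
 * has been overwritten.
 */
static long first_stale_page(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t off;

    for (off = 0; off + sizeof(uint32_t) <= len; off += PAGE_BYTES) {
        uint32_t w;

        memcpy(&w, p + off, sizeof(w));  /* avoids unaligned access */
        if (w == TEST_PATTERN)
            return (long)(off / PAGE_BYTES);
    }
    return -1;
}
```

A page whose real data happens to begin with 0x3b3b3b3b would be flagged as well, so a hit is only a hint that the RDMA Write did not land there, not proof.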
From rdreier at cisco.com Wed Oct 3 15:04:09 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 03 Oct 2007 15:04:09 -0700 Subject: [ofa-general] iSER data corruption issues In-Reply-To: (Thomas Talpey's message of "Wed, 03 Oct 2007 16:48:38 -0400") References: <20071003174252.GA28637@osc.edu> Message-ID: > I would look into the dma_sync behavior on the receiver. Especially > on an Opteron, it's critical to synchronize the iommu and cachelines > to the right memory locations. Since the FMR code hides some of > this, it may be a challenge to trace. Can you try another memory > registration strategy? NFS/RDMA can do that, for example. I think this is a red herring. Every IB HCA does 64-bit DMA, which means it bypasses all the Opteron iommu/swiotlb stuff. Also FMR doesn't hide any DMA mapping stuff; it is completely up to the consumer to handle all the DMA mapping, because FMRs operate completely at the level of bus (HCA DMA) addresses. - R. From jriotto at cisco.com Wed Oct 3 16:20:19 2007 From: jriotto at cisco.com (Jamie Riotto (jriotto)) Date: Wed, 3 Oct 2007 16:20:19 -0700 Subject: [ofa-general] Resignation from the OFA Message-ID: <944AD9DA9232E346ADF590C41BFFEC4104E3CA94@xmb-sjc-232.amer.cisco.com> Dear OFA Members, It has been my express pleasure to have served on the Board of the OFA, and in particular to have had the opportunity to act as Chairman for the Enterprise Working Group (EWG) for the last year or so. I believe the OFA to be a shining example of how an open source community can come together, overcome competitive tendencies and forge a truly lasting software effort that clearly and directly benefits the user and development community it supports. I am particularly proud of the job the EWG has done in furthering the stability of the OFA release process as embodied by the OFED releases. It is clear that these efforts have directly contributed to the widespread adoption of OFA technology with scientific, commercial and enterprise customers. 
However, it is time in my life for a change, and as such I have resigned from my role at Cisco, effective Oct 12. I will be taking a year or so off to pursue some personal interests. Therefore, I must also resign from the Board of the OFA and the Chair of the EWG. In my place, Cisco officially nominates Gopal Hegde as my replacement for the Board seat as well as EWG chair. Gopal has just joined Cisco as my replacement and will head up all the InfiniBand and RDMA efforts for Cisco. Gopal recently joined Cisco from Adaptec where he was running engineering for their RAID product line, and prior to that Gopal spent several years at Intel driving Intel's I/O Architecture and Strategy for Server Platforms. In that role, Gopal drove Intel's DCE, FCoE and I/O virtualization strategies. As such he is uniquely qualified to drive Cisco's combined IB and Ethernet RDMA strategy. Gopal's Contact Info: Gopal Hegde 408-853-7058 gohegde at cisco.com Good luck to you all, and keep up the good work. For future reference, I can be reached at jamie.riotto at gmail.com Cheers - jamie Jamie Riotto Sr. Director Engineering Server Virtualization Business Unit (SVBU) Cisco Systems 408-853-7813 jriotto at cisco.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Thomas.Talpey at netapp.com Wed Oct 3 18:59:31 2007 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 03 Oct 2007 21:59:31 -0400 Subject: [ofa-general] iSER data corruption issues In-Reply-To: References: <20071003174252.GA28637@osc.edu> Message-ID: At 06:04 PM 10/3/2007, Roland Dreier wrote: > > I would look into the dma_sync behavior on the receiver. Especially > > on an Opteron, it's critical to synchronize the iommu and cachelines > > to the right memory locations. Since the FMR code hides some of > > this, it may be a challenge to trace. Can you try another memory > > registration strategy? NFS/RDMA can do that, for example. > >I think this is a red herring. 
Every IB HCA does 64-bit DMA, which >means it bypasses all the Opteron iommu/swiotlb stuff. > >Also FMR doesn't hide any DMA mapping stuff; it is completely up to >the consumer to handle all the DMA mapping, because FMRs operate >completely at the level of bus (HCA DMA) addresses. Fair enough, but the FMR *pools* still worry me, because they manage internal registrations and defer their manipulation. Depending on lots of things beyond the consumer's control, they sometimes don't even close the handles advertised to the RDMA peer. Bypassing the pools and going directly to the FMRs themselves avoids this (which is what NFS/RDMA does), but iSER and SRP both use the pool API, don't they? So, what else sends an RDMA write into the weeds? Short of writing to the wrong address, it sure sounds like a dma consistency thing to me. The connection wasn't lost, so it's not an error. Tom. From rdreier at cisco.com Wed Oct 3 20:09:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 03 Oct 2007 20:09:54 -0700 Subject: [ofa-general] iSER data corruption issues In-Reply-To: (Thomas Talpey's message of "Wed, 03 Oct 2007 21:59:31 -0400") References: <20071003174252.GA28637@osc.edu> Message-ID: > Fair enough, but the FMR *pools* still worry me, because they manage > internal registrations and defer their manipulation. Depending on lots > of things beyond the consumer's control, they sometimes don't even > close the handles advertised to the RDMA peer. The FMR pool stuff (especially with caching turned off, as the iSER initiator uses the API) isn't really doing anything particularly fancy. It just keeps a list of FMRs that are available to remap, and batches up the unregistration. It is true that an R_Key may remain valid after an FMR is unmapped, but that's the whole point of FMRs: if you don't batch up the real flushing to amortize the cost, they're no better than regular MRs really. > So, what else sends an RDMA write into the weeds? 
Short of writing > to the wrong address, it sure sounds like a dma consistency thing to > me. The connection wasn't lost, so it's not an error. I don't have that feeling. x86 systems are really pretty strongly consistent with respect to DMA when you're not using any of the GART/IOMMU stuff, so I think it's more likely that either the wrong address is being given to the HCA somehow, or the mthca FMR implementation is making the HCA write to the wrong address. Especially since the correct data never shows up even after a long time, it seems that the data must just be going to the wrong place. Given that there was an FMR bug with 1-port Mellanox HCAs that caused iSER corruption, I would like to make sure that the same thing isn't hitting here as well. Reproducing on 2.6.22 or 2.6.23-rcX (which have the bug fixed) would rule that out, as would seeing the bug on anything but a 1-port Mellanox HCA. - R. From kliteyn at mellanox.co.il Wed Oct 3 22:13:45 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 4 Oct 2007 07:13:45 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-04:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-03 OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=519 Fail=1 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 
FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo 12 LidMgr IS3-128.topo Failures: 1 LidMgr IS3-128.topo From vlad at lists.openfabrics.org Thu Oct 4 02:55:36 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 4 Oct 2007 02:55:36 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071004-0200 daily build status Message-ID: <20071004095536.8046AE608A2@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ia64 
with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.22 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory 
`/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071004-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From bulten at netmarkpatent.com Thu Oct 4 03:44:25 2007 From: bulten at netmarkpatent.com (NETMARK PATENT) Date: Thu, 4 Oct 2007 13:44:25 +0300 Subject: [ofa-general] ***SPAM*** =?windows-1254?q?T=FCrk_Patent_Enstit=FCs=FC_ve_Mo=F0o?= =?windows-1254?q?listan_Fikri_M=FClkiyet_Ofisi_Dayan=FD=FEmas=FD?= Message-ID: 
<3842-22007104410442557@ugur> Cooperation between the Turkish Patent Institute and the Intellectual Property Office of Mongolia. Technical cooperation talks between the Turkish Patent Institute (TPE) and the Intellectual Property Office of Mongolia (IPOM) were held on 21-23 August 2007 in Ulaanbaatar, Mongolia. An action plan was drawn up that provides for training IPOM staff in geographical indications and information activities, and for Mongolia's participation in training that may be organized under the Organisation of the Islamic Conference project co-chaired by TPE and under the Economic Cooperation Organization. Source: TPE. The Best Devices of 2006 as Rated by Time. According to Time's ratings, the eight best devices of 2006 are the Logitech VX, Sanyo HDI Digital Media, Apple Macbook Pro, Nintendo DS Lite, Logitech Wireless DJ, Nike + ipod sport kid, Garmin Street Pilot c550 and Palm Treo 700W. Source: time.com. Istanbul Metropolitan Municipality Has Produced an Electronic Map of the City! The guide, which Istanbul residents can reach at www.sehirrehberi.ibb.gov.tr, covers many frequently sought topics such as tourist and historical sites, address information, on-duty pharmacies and road conditions. It also includes satellite and aerial photographs of Istanbul. Through this virtual guide, Istanbul residents will also be able to follow the latest developments, such as traffic accidents, fires, floods, closed roads and road narrowing, as they happen. If you do not wish to receive these bulletins, please send a blank e-mail to bulten at netmarkpatent.com. Unless you make such a request, you will continue to receive our bulletins regularly. NETMARK PATENT T:0212 220 31 20 F:0212 220 74 21 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Thomas.Talpey at netapp.com Thu Oct 4 04:55:28 2007 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 04 Oct 2007 07:55:28 -0400 Subject: [ofa-general] iSER data corruption issues In-Reply-To: References: <20071003174252.GA28637@osc.edu> Message-ID: At 11:09 PM 10/3/2007, Roland Dreier wrote: >... It just keeps a list of FMRs that are available to remap, and >batches up the unregistration. It is true that an R_Key may remain >valid after an FMR is unmapped, but that's the whole point of FMRs: if >you don't batch up the real flushing to amortize the cost, they're no >better than regular MRs really. This is an aside, but in my experience the FMR is actually a win even if it's invalidated after each use. In testing with NFS/RDMA, I believe that direct FMR manipulation via ib_map_phys_fmr()/ib_unmap_fmr() was worth somewhere on the order of 35% over straight ib_reg_phys_mr()/ib_dereg_mr(). I can only assume this was because the TPT-entry setup (ib_alloc_fmr()) is avoided on a per-I/O basis. As for the pools not invalidating the R_key/handle: speaking as a storage provider, we take data integrity darn seriously. It's my opinion that a dynamic registration scheme that doesn't include per-I/O protection is pretty much not the point of dynamic registration. In many environments however, the performance tradeoff is important - this is why I prefer an all-physical scheme to FMRs, even though it requires additional RDMA ops to handle the resulting extra scatter/gather. Additionally, FMRs don't provide byte-range protection granularity, and they're not supported by iWARP hardware (plus they're buggy as heck on early Tavors, etc). So I didn't make them a default. Tom. From Thomas.Talpey at netapp.com Thu Oct 4 05:05:05 2007 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 04 Oct 2007 08:05:05 -0400 Subject: [ofa-general] Fwd: [NFS] What's slated for inclusion in 2.6.24-rc1 from the NFS client git tree... 
Message-ID: I'm happy to forward this note from Trond that the NFS/RDMA kernel client will be available in the mainline after the 2.6.24-rcX process begins, among other NFS improvements of course. There is a new nfs-aware mount command required to actually invoke an NFS/RDMA mount. The easiest way to accomplish this is to fetch the latest nfs-utils from and to invoke the mount.nfs binary directly. I'll forward more details later. Tom. > ---------- Forwarded Message ---------- >From: Trond Myklebust >To: linux-kernel at vger.kernel.org, Andrew Morton >Date: Wed, 03 Oct 2007 19:41:16 -0400 >Cc: nfsv4 at linux-nfs.org, nfs at lists.sourceforge.net >Subject: [NFS] What's slated for inclusion in 2.6.24-rc1 from the NFS client > git tree... >List-Id: "Discussion of NFS under Linux development, interoperability, > and testing." >List-Archive: >List-Post: >List-Help: >List-Subscribe: , > >Sender: nfs-bounces at lists.sourceforge.net > >Aside from the usual updates from Chuck for NFS-over-IPv6 (still >incomplete) and a number of bugfixes for the text-based mount code, the >main news in the NFS tree is the merging of support for the NFS/RDMA >client code from Tom Talpey and the NetApp New England (NANE) team. > >We also have the 64-bit inode support from RedHat/Peter Staubach. > >There is also the addition of a nfs_vm_page_mkwrite() method in order to >clean up the mmap() write code. >Finally, I've been working on a number of updates for the attribute >revalidation, having pulled apart most of the dentry and attribute >revalidation into separate variables. A number of fixes that address >existing bugs fell out of that review, which should hopefully result in >more efficient dcache behaviour... 
> >The NFS client git tree can be found at > > git://git.linux-nfs.org/pub/linux/nfs-2.6.git > >or on gitweb at > > http://linux-nfs.org/cgi-bin/gitweb.cgi?p=nfs-2.6.git;a=summary > >Finally, a full set of patches may be found on > > http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc9/ > >Cheers > Trond > >------------------- > >Adrian Bunk (1): > [2.6 patch] net/sunrpc/rpcb_clnt.c: make struct rpcb_program static > >Christoph Hellwig (1): > [NFS] [PATCH] nfs: tiny makefile cleanup > >Chuck Lever (41): > SUNRPC: Fix a signed v. unsigned comparison in rpcbind's XDR routines > SUNRPC: Fix a signed v. unsigned comparison in net/sunrpc/xprtsock.c > SUNRPC: Use standard macros for printing IP addresses > SUNRPC: Free address buffers in a loop > SUNRPC: Add hex-formatted address support to rpc_peeraddr2str() > SUNRPC: Rename xs_format_peer_addresses > SUNRPC: add a function to format IPv6 addresses > SUNRPC: add support for IPv6 to the kernel's rpcbind client > SUNRPC: Introduce support for setting the port number in IPv6 addresses > SUNRPC: Rename xs_bind() to prepare for IPv6-specific bind method > SUNRPC: create an IPv6-savvy mechanism for binding to a reserved port > SUNRPC: Refactor a part of socket connect logic into a helper function > SUNRPC: Rename IPv4 connect workers > SUNRPC: create connect workers for IPv6 > SUNRPC: Add IPv6 address support to net/sunrpc/xprtsock.c > SUNRPC: Add a helper for extracting the address using the correct type > SUNRPC: Split xs_reclassify_socket into an IPv4 and IPv6 version > SUNRPC: Add support for formatted universal addresses > SUNRPC: Fix generation of universal addresses for > SUNRPC: Only one dprintk is needed during client creation > SUNRPC: fix a signed v. 
unsigned comparison nit in rpc_bind_new_program > SUNRPC: Use correct argument type in memcpy() > SUNRPC: Make sure server name is reasonable before trying to print it > SUNRPC: Clean up in rpc_show_tasks > SUNRPC: Make rpcb_decode_getaddr more picky about universal addresses > SUNRPC: Retry bad rpcbind replies > SUNRPC: Add a new error code for retry waiting for another binder > SUNRPC: Split another new rpcbind retry error code from EACCES > SUNRPC: RPC bind failures should be permanent for NULL requests > NFS: Kernel mount client should use async bind > NFS: Add new 'mountaddr=' mount option > NFS: Convert printk's to dprintk's in fs/nfs/nfs?xdr.c > LOCKD: Convert printk's to dprintk's in lockd XDR routines > NFSD: Convert printk's to dprintk's in NFSD's nfs4xdr > NFS: Verify server address before invoking in-kernel mount client > NFS: Show "nointr" mount option > SUNRPC: Fix bytes-per-op accounting for RPC over UDP > NFS: Don't call nfs_renew_times() in nfs_dentry_iput() > NFS: Eliminate nfs_renew_times() > NFS: Eliminate nfs_refresh_verifier() > SUNRPC: Use correct type in buffer length calculations > >Fabio Olive Leite (1): > Re: [NFS] [PATCH] Attribute timeout handling and wrapping u32 jiffies > >J. Bruce Fields (2): > nfs: add server port to rpc_pipe info file > SUNRPC: Fix default hostname created in rpc_create() > >James Lentini (1): > [NFS] [PATCH] NFS: initialize default port in kernel mount client > >Jeff Layton (1): > [NFS] [PATCH] NFS: show addr=ipaddr in /proc/mounts rather than > >Jesper Juhl (1): > [23/37] Clean up duplicate includes in > >Peter Staubach (1): > 64 bit ino support for NFS client > >Trond Myklebust (56): > NFS: Add the helper nfs_vm_page_mkwrite > NFS: Clean up write code... > NFS: Clean up nfs_writepages() > VFS: Remove writeback_control->fs_private > NFS: Clean up NFS writeback flush code > NFS: Writeback optimisation > NFS: Fall back to synchronous writes when a background write errors... 
> SUNRPC: Convert rpc_pipefs to use the generic filesystem >notification hooks > NFSv4: Fix a bug in nfs4_validate_mount_data() > NFS: Add a helper to extract the nfs_open_context from a struct file > NFS: Replace file->private_data with calls to nfs_file_open_context() > NFSv4: Simplify _nfs4_do_access() > NFSv4: Make NFSv4 ACCESS calls return attributes too... > NFS: Fix over-conservative attribute invalidation in nfs_update_inode() > NFS: nfs_post_op_update_inode() should call nfs_refresh_inode() > NFS: fix nfs_verify_change_attribute > NFS: Fix dcache revalidation bugs > NFS: nfs_wcc_update_inode: directory caches are always invalidated > NFS: Don't force a dcache revalidation if nfs_wcc_update_inode succeeds > NFSv4: Don't use ctime/mtime for determining when to invalidate >the caches > NFS: Don't use readdirplus data if the page cache is invalid > NFS: Fix atime revalidation in readdir() > NFS: Fix atime revalidation in read() > NFS: Fix the ESTALE "revalidation" in _nfs_revalidate_inode() > NFS: Remove bogus check of cache_change_attribute in nfs_update_inode > NFS: Fake up 'wcc' attributes to prevent cache invalidation after write > NFS: Fix the sign of the return value of nfs_save_change_attribute() > NFS: Fix nfs_verify_change_attribute() > NFS: Ensure nfs_instantiate() invalidates the parent dir on error > NFS: nfs_instantiate() should set the dentry verifier > NFS: Don't hash the negative dentry when optimising for an O_EXCL open > NFS: Fix a bug in nfs_open_revalidate() > NFS: Don't set cache_change_attribute in nfs_revalidate_mapping > NFS: Don't revalidate dentries on directory size or ctime changes > NFS: nfs_post_op_update_inode don't update cache_change_attribute > NFS: nfs_mark_for_revalidate don't update cache_change_attribute > NFS: don't cache the verifer across ->lookup() calls > NFS: Remove bogus nfs_mark_for_revalidate() in nfs_lookup > NFS: NFS_CACHEINV() should not test for nfs_caches_unstable() > NFS: Remove NFS_I(inode)->data_updates > 
NFS: Remove nfs_begin_data_update/nfs_end_data_update > NFS: Reset nfsi->last_updated only if the attribute changed > NFS: Optimise nfs_lookup_revalidate() > NFSv4: Don't revalidate the directory in nfs_atomic_lookup() > NFSv4: Use NFSv2/v3 rules for negative dentries in nfs_open_revalidate > NFSv4: Fix nfs_atomic_open() to set the verifier on negative dentries too > NFSv3: Always use directory post-op attributes in nfs3_proc_lookup > NFS: Remove the redundant nfs_reval_fsid() > NFS: Don't zap the readdir caches upon error > NFS: Be strict about dentry revalidation when doing exclusive create > NFS: Ensure that nfs_link() returns a hashed dentry > NFS: Simplify filehandle revalidation > NFS: Get rid of some obsolete macros > SUNRPC: Fix buggy UDP transmission > SUNRPC: Don't call xprt_release() if call_allocate fails > SUNRPC: Don't call xprt_release in call refresh > >\"Talpey, Thomas\ (20): > SUNRPC: move per-transport rpcbind netid's > SUNRPC: export per-transport rpcbind netid's > NFS: move nfs_parsed_mount_data structure definition > NFS: use in-kernel mount argument structure for nfsv[23] mounts > NFS: use in-kernel mount argument structure for nfsv4 mounts > SUNRPC: mark bulk read/write data in xdrbuf > SUNRPC: add EXPORT_SYMBOL_GPL for generic transport functions > SUNRPC: Provide a new API for registering transport implementations > SUNRPC: Finish API to load RPC transport implementations dynamically > SUNRPC: rename the rpc_xprtsock_create structure > SUNRPC: rearrange RPC sockets definitions > NFS/SUNRPC: support transport protocol naming > NFS/SUNRPC: use transport protocol naming > NFS - print accurate transport protocol > RPCRDMA: Kconfig and header file with rpcrdma protocol definitions > NFS: support RDMA mounts > RPCRDMA: rpc rdma transport switch > RPCRDMA: rpc rdma protocol implementation > RPCRDMA: rpc rdma verbs interface implementation > SUNRPC: Add RDMA dependency to SUNRPC_XPRT_RDMA > > > > 
>_______________________________________________ >NFS maillist - NFS at lists.sourceforge.net >https://lists.sourceforge.net/lists/listinfo/nfs > ---------- End of Forwarded Message ---------- From pw at osc.edu Thu Oct 4 09:14:07 2007 From: pw at osc.edu (Pete Wyckoff) Date: Thu, 4 Oct 2007 12:14:07 -0400 Subject: [ofa-general] iSER data corruption issues In-Reply-To: References: <20071003174252.GA28637@osc.edu> Message-ID: <20071004161407.GB15045@osc.edu> rdreier at cisco.com wrote on Wed, 03 Oct 2007 15:01 -0700: > > Machines are opteron, fedora 7 up-to-date with its openfab libs, > > kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or > > 2.6.18-rhel5 on initiator. For some reason, it is much easier to > > produce with the rhel5 kernel. > > There was a bug in mthca that caused data corruption with FMRs on > Sinai (1-port PCIe) HCAs. It was fixed in commit 608d8268 ("IB/mthca: > Fix data corruption after FMR unmap on Sinai") which went in shortly > before 2.6.21 was released. I don't know if the RHEL5 2.6.18 kernel > has this fix or not -- but if you still see the problem on 2.6.22 and > later kernels then this isn't the fix anyway. This is definitely it. Same test setup runs for an hour with this patch, but fails in tens of seconds without it. Thanks for pointing it out. This rhel5 kernel is 2.6.18-8.1.6. Perhaps there are newer ones about that have this critical patch included. I'm going to add a Big Fat Warning on the iser distribution about pre-2.6.21 kernels. It also crashes if the iSER connection drops in a certain easy-to-reproduce way, another reason to avoid it.
Regarding the "larger" test I talked about that fails even on modern kernels, I'm still not able to reproduce that on my setup. I ran it literally all night with a hacked target that calculated the return buffer rather than accessing the disk. For now I'm calling that a separate bug and will investigate it further. Thanks to Tom and Tom for helping debug this. -- Pete From pw at osc.edu Thu Oct 4 09:18:24 2007 From: pw at osc.edu (Pete Wyckoff) Date: Thu, 4 Oct 2007 12:18:24 -0400 Subject: [ofa-general] iSER data corruption issues In-Reply-To: References: <20071003174252.GA28637@osc.edu> Message-ID: <20071004161824.GC15045@osc.edu> Thomas.Talpey at netapp.com wrote on Thu, 04 Oct 2007 07:55 -0400: > This is an aside, but in my experience the FMR is actually a win even if > it's invalidated after each use. In testing with NFS/RDMA, I believe that > direct FMR manipulation via ib_map_phys_mr()/ib_unmap_fmr() was worth > somewhere on the order of 35% over straight ib_reg_phys_mr()/ib_dereg_mr(). > I can only assume this was because the TPT-entry setup (ib_alloc_fmr()) > is avoided on a per-I/O basis. > > As for the pools not invalidating the R_key/handle. Speaking as a storage > provider, we take data integrity darn seriously. It's my opinion that a > dynamic registration scheme that doesn't include per-I/O protection is > pretty much not the point of dynamic registration. In many environments > however, the performance tradeoff is important - this is why I prefer an > all-physical scheme to FMRs, even though it requires additional RDMA ops > to handle the resulting extra scatter/gather. Ack. Unfortunately in the iSER case, we are limited to a single virtual address per command. Page size fragmentation may destroy performance, even with heavy pipelining. -- Pete > Additionally, FMRs don't provide byte-range protection granularity, > and they're not supported by iWARP hardware (plus they're buggy > as heck on early Tavors, etc). So I didn't make them a default. 
From tom at opengridcomputing.com Thu Oct 4 10:44:57 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 04 Oct 2007 12:44:57 -0500 Subject: [ofa-general] iSER data corruption issues In-Reply-To: <20071004161407.GB15045@osc.edu> References: <20071003174252.GA28637@osc.edu> <20071004161407.GB15045@osc.edu> Message-ID: <1191519897.411.84.camel@trinity.ogc.int> On Thu, 2007-10-04 at 12:14 -0400, Pete Wyckoff wrote: > rdreier at cisco.com wrote on Wed, 03 Oct 2007 15:01 -0700: > > > Machines are opteron, fedora 7 up-to-date with its openfab libs, > > > kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or > > > 2.6.18-rhel5 on initiator. For some reason, it is much easier to > > > produce with the rhel5 kernel. > > > > There was a bug in mthca that caused data corruption with FMRs on > > Sinai (1-port PCIe) HCAs. It was fixed in commit 608d8268 ("IB/mthca: > > Fix data corruption after FMR unmap on Sinai") which went in shortly > > before 2.6.21 was released. I don't know if the RHEL5 2.6.18 kernel > > has this fix or not -- but if you still see the problem on 2.6.22 and > > later kernels then this isn't the fix anyway. > > This is definitely it. Same test setup runs for an hour with this > patch, but fails in tens of seconds without it. Thanks for pointing > it out. > > This rhel5 kernel is 2.6.18-8.1.6. Perhaps there are newer ones > about that have this critical patch included. I'm going to add a > Big Fat Warning on the iser distribution about pre-2.6.21 kernels. > It also crashes if the iSER connection drops in a certain > easy-to-reproduce way, another reason to avoid it. > > Regarding the "larger" test I talked about that fails even on modern > kernels, I'm still not able to reproduce that on my setup. I ran it > literally all night with a hacked target that calculated the return > buffer rather than accessing the disk. For now I'm calling that a > separate bug and will investigate it further. > > Thanks to Tom and Tom for helping debug this. 
> Thanks to Roland who actually knew what it was ... ;-) > -- Pete > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From rdreier at cisco.com Thu Oct 4 11:27:06 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 04 Oct 2007 11:27:06 -0700 Subject: [ofa-general] Re: [PATCH 2 of 3 for-2.6.24] mlx4: always fill MTTs from CPU In-Reply-To: <20070801092853.GD29259@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 1 Aug 2007 12:28:53 +0300") References: <20070801092853.GD29259@mellanox.co.il> Message-ID: > + /* Reserved mtt entries must be aligned up to a cacheline boundary, > + * since the FW will write to them, while the driver writes to all > + * other mtt entries.
(Note that the variable dev->caps.mtt_entry_sz > + * below is really the mtt segment size, not the raw entry size) > + */ > + num_mtt_res_bytes = ((dev->caps.reserved_mtts * > + (dev->caps.mtt_entry_sz / MLX4_MTT_ENTRY_PER_SEG) > + + L1_CACHE_BYTES - 1) / > + L1_CACHE_BYTES) * L1_CACHE_BYTES; Shouldn't this be dma_get_cache_alignment() instead of L1_CACHE_BYTES (which would match what mthca does)? - R. From changquing.tang at hp.com Thu Oct 4 12:21:18 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 4 Oct 2007 19:21:18 -0000 Subject: [ofa-general] Issues to scale to 64K ranks. Message-ID: <349DCDA352EACF42A0C49FA6DCEA840302725728@G3W0634.americas.hpqcorp.net> When talking about running 64K processes, I noticed that on ConnectX with 2.2 firmware, ibv_devinfo shows only max_qp: 65472, which means we cannot create 65536 QPs on this HCA. Is this max_qp limit per process or per HCA? How can this number be increased? Is any hardware/firmware change needed? Thanks.
From rdreier at cisco.com Thu Oct 4 14:26:02 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 04 Oct 2007 14:26:02 -0700 Subject: [ofa-general] Re: [PATCH 2 of 3 for-2.6.24] mlx4: always fill MTTs from CPU In-Reply-To: <20070801092853.GD29259@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 1 Aug 2007 12:28:53 +0300") References: <20070801092853.GD29259@mellanox.co.il> Message-ID: > + num_mtt_res_bytes = ((dev->caps.reserved_mtts * > + (dev->caps.mtt_entry_sz / MLX4_MTT_ENTRY_PER_SEG) > + + L1_CACHE_BYTES - 1) / > + L1_CACHE_BYTES) * L1_CACHE_BYTES; > err = mlx4_init_icm_table(dev, &priv->mr_table.mtt_table, > init_hca->mtt_base, > dev->caps.mtt_entry_sz, > dev->caps.num_mtt_segs, > - dev->caps.reserved_mtts, 1, 0); > + num_mtt_res_bytes / dev->caps.mtt_entry_sz, > + 1, 0); This is a little off I think because it may be that mtt_entry_sz might be bigger than L1_CACHE_BYTES so the number of reserved objects might not be big enough.
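
For readers following the arithmetic, here is a minimal user-space sketch of the rounding concern raised above. It is not the mlx4 code: the function names are hypothetical, the sizes used in the note below are made up for illustration, and DIV_ROUND_UP is redefined locally using the same idiom as the kernel macro. It only shows why dividing a cache-line-rounded byte count by a larger object size with truncation can reserve too few objects.

```c
#include <stddef.h>

/* Local copy of the usual round-up division idiom, so the sketch builds
 * outside the kernel. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: round nbytes up to the next multiple of align. */
size_t round_up_bytes(size_t nbytes, size_t align)
{
	return DIV_ROUND_UP(nbytes, align) * align;
}

/* Hypothetical helper: object count obtained with a truncating division,
 * the shape of the expression in the quoted hunk. */
size_t reserved_objs_truncating(size_t res_bytes, size_t align, size_t obj_sz)
{
	return round_up_bytes(res_bytes, align) / obj_sz;
}

/* Hypothetical helper: object count obtained by rounding the division up,
 * so the reserved objects always cover the whole reserved byte range. */
size_t reserved_objs_rounded_up(size_t res_bytes, size_t align, size_t obj_sz)
{
	return DIV_ROUND_UP(round_up_bytes(res_bytes, align), obj_sz);
}
```

With illustrative values of 72 reserved bytes, a 32-byte cache line and 64-byte objects, the byte count rounds up to 96; the truncating version then reserves 1 object (64 bytes, short of 96) while the rounded-up version reserves 2. When the alignment is at least as large as the object size and a multiple of it, both versions agree.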
Also the current meaning of reserved_mtts is really wrong (it leads to the driver reserving too much stuff) -- we should convert it to segments, like this (I'll put this in my queue before this patch and fix up the expression above):

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 07c2847..ed7e8d7 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -149,7 +149,8 @@ static int __devinit mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev
 	dev->caps.max_cqes = dev_cap->max_cq_sz - 1;
 	dev->caps.reserved_cqs = dev_cap->reserved_cqs;
 	dev->caps.reserved_eqs = dev_cap->reserved_eqs;
-	dev->caps.reserved_mtts = dev_cap->reserved_mtts;
+	dev->caps.reserved_mtts = DIV_ROUND_UP(dev_cap->reserved_mtts,
+					       MLX4_MTT_ENTRY_PER_SEG);
 	dev->caps.reserved_mrws = dev_cap->reserved_mrws;
 	dev->caps.reserved_uars = dev_cap->reserved_uars;
 	dev->caps.reserved_pds = dev_cap->reserved_pds;

From troy at scl.ameslab.gov Thu Oct 4 15:23:22 2007 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Thu, 04 Oct 2007 17:23:22 -0500 Subject: [ofa-general] Setting lowest-common denominator ipoib multicast rate? Message-ID: <470567DA.6010502@scl.ameslab.gov> How can I set the ipoib broadcast multicast group rate such that even a 1X SDR connected machine can still join the group? I tried the following in /etc/osm-partitions.conf based on a mail list posting from awhile ago, but it doesn't seem to work..

Default=0x7fff,ipoib,rate=1:ALL=full;

And then what does it take on the ipoib client side to pick up the new partition parameters and such? reloading ipoib, or a full restart?

From weiny2 at llnl.gov Thu Oct 4 15:30:10 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 4 Oct 2007 15:30:10 -0700 Subject: [ofa-general] Question about mthca_alloc_memfree and mthca_alloc_db Message-ID: <20071004153010.41d5892a.weiny2@llnl.gov> Roland, We hit a bug in the RHEL4 kernel which was fixed in your latest tree. The bug was in mthca_alloc_memfree.
When comparing your code to the current RH kernel, we wondered why you would not return the error code from mthca_alloc_db rather than -ENOMEM as demonstrated in the patch below? Thoughts? Ira

>From f8c47490fd039efcf74f6470b34e2351fb302455 Mon Sep 17 00:00:00 2001
From: Ira K. Weiny
Date: Thu, 4 Oct 2007 15:16:45 -0700
Subject: [PATCH] Return the error code mthca_alloc_db rather than mask its code with ENOMEM.

Signed-off-by: Ira K. Weiny
---
 drivers/infiniband/hw/mthca/mthca_qp.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index df01b20..c1f7e14 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -1144,13 +1144,13 @@ static int mthca_alloc_memfree(struct mthca_dev *dev,
 		qp->rq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_RQ,
 						 qp->qpn, &qp->rq.db);
 		if (qp->rq.db_index < 0)
-			return -ENOMEM;
+			return (qp->rq.db_index);
 
 		qp->sq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SQ,
 						 qp->qpn, &qp->sq.db);
 		if (qp->sq.db_index < 0) {
 			mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index);
-			return -ENOMEM;
+			return (qp->sq.db_index);
 		}
 	}
 
-- 
1.5.1

From rdreier at cisco.com Thu Oct 4 15:35:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 04 Oct 2007 15:35:13 -0700 Subject: [ofa-general] Re: Question about mthca_alloc_memfree and mthca_alloc_db In-Reply-To: <20071004153010.41d5892a.weiny2@llnl.gov> (Ira Weiny's message of "Thu, 4 Oct 2007 15:30:10 -0700") References: <20071004153010.41d5892a.weiny2@llnl.gov> Message-ID: > We hit a bug in the RHEL4 kernel which was fixed in your latest tree. The bug > was in mthca_alloc_memfree. When comparing your code to the current RH kernel, > we wondered why you would not return the error code from mthca_alloc_db rather > than -ENOMEM as demonstrated in the patch below? Does Red Hat know about the bug so they can fix it in an update?
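
As an aside on the patch quoted above, the two return styles can be compared in a small user-space sketch. Everything here is a hypothetical stand-in rather than mthca code: alloc_db_stub only imitates an allocator that returns a non-negative index on success or a negative errno on failure, and the two setup functions mirror the "mask with -ENOMEM" and "pass the code through" variants. Whether the real allocator can fail for more than one reason is an assumption of the sketch.

```c
#include <errno.h>

/* Hypothetical stand-in for an allocator in the style of mthca_alloc_db():
 * returns a non-negative index on success, or -fail_with when asked to
 * fail with a given errno value. */
int alloc_db_stub(int fail_with)
{
	return fail_with ? -fail_with : 7; /* 7 is an arbitrary valid index */
}

/* Variant that masks every failure as -ENOMEM. */
int setup_masking(int fail_with)
{
	int index = alloc_db_stub(fail_with);

	if (index < 0)
		return -ENOMEM;
	return 0;
}

/* Variant that hands the allocator's own error code to the caller. */
int setup_propagating(int fail_with)
{
	int index = alloc_db_stub(fail_with);

	if (index < 0)
		return index;
	return 0;
}
```

Both variants behave the same on success and when the allocator itself reports ENOMEM; they differ only when the underlying failure is something else, in which case the propagating variant lets a caller that logs the return code see the original reason.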
Anyway, I don't think the return value matters much -- I think when I wrote the code, I just figured that the allocation failed so it makes sense to return ENOMEM rather than whatever internal reason caused the allocation to fail. Does it make any practical difference one way or another? - R. From rajouri.jammu at gmail.com Thu Oct 4 15:45:09 2007 From: rajouri.jammu at gmail.com (Rajouri Jammu) Date: Thu, 4 Oct 2007 15:45:09 -0700 Subject: [ofa-general] Re: ib_create_cq in OFED 1.2.5 Vs 1.2 In-Reply-To: <3307cdf90710031354ka25df1dhefcea03408372a97@mail.gmail.com> References: <3307cdf90710031354ka25df1dhefcea03408372a97@mail.gmail.com> Message-ID: <3307cdf90710041545q6875c302x8f3207ca9b402933@mail.gmail.com> Any ideas? thanks much in advance. On 10/3/07, Rajouri Jammu wrote: > The ib_create_cq() api changed from 1.2 to 1.2.5. > > We have a custom driver that runs on OFED kernel modules. > > Is there a way to find out at compile time the ABI version for the kernel verbs? > From hrosenstock at xsigo.com Thu Oct 4 15:46:11 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 04 Oct 2007 15:46:11 -0700 Subject: [ofa-general] Setting lowest-common denominator ipoib multicast rate? In-Reply-To: <470567DA.6010502@scl.ameslab.gov> References: <470567DA.6010502@scl.ameslab.gov> Message-ID: <1191537971.1998.937.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-10-04 at 17:23 -0500, Troy Benjegerdes wrote: > How can I set the ipoib broadcast multicast group rate such that even a > 1X SDR connected machine can still join the group? > > I tried the following in /etc/osm-partitions.conf What version of OpenSM ? That's the default for an OFED 1.2 based version of OpenSM. Default is /etc/ofa/opensm-partitions.conf for a master/OFED 1.3 version. > based on a mail list > posting from awhile ago, but it doesn't seem to work.. > > Default=0x7fff,ipoib,rate=1:ALL=full; rate=2 > And then what does it take on the ipoib client side to pick up the new > partition parameters and such? 
reloading ipoib, or a full restart? Restarting opensm after making the configuration change to the partitions file should be sufficient. -- Hal > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Thu Oct 4 16:04:11 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 04 Oct 2007 16:04:11 -0700 Subject: [ofa-general] Re: [PATCH 3 of 3 for-2.6.24] mlx4: implement FMRs In-Reply-To: <20070801092905.GE29259@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 1 Aug 2007 12:29:05 +0300") References: <20070801092905.GE29259@mellanox.co.il> Message-ID: > +#define MLX4_MTT_FLAG_PRESENT 1 Am I missing something? Hasn't mlx4 already defined this since forever? - R. From weiny2 at llnl.gov Thu Oct 4 16:06:43 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 4 Oct 2007 16:06:43 -0700 Subject: [ofa-general] Re: Question about mthca_alloc_memfree and mthca_alloc_db In-Reply-To: References: <20071004153010.41d5892a.weiny2@llnl.gov> Message-ID: <20071004160643.3a329f04.weiny2@llnl.gov> On Thu, 04 Oct 2007 15:35:13 -0700 Roland Dreier wrote: > > We hit a bug in the RHEL4 kernel which was fixed in your latest tree. The bug > > was in mthca_alloc_memfree. When comparing your code to the current RH kernel, > > we wondered why you would not return the error code from mthca_alloc_db rather > > than -ENOMEM as demonstrated in the patch below? > > Does Red Hat know about the bug so they can fix it in an update? Yes I emailed Doug and our contractor here with a patch which uses the return values from mthca_alloc_db. > > Anyway, I don't think the return value matters much -- I think when I > wrote the code, I just figured that the allocation failed so it makes > sense to return ENOMEM rather than whatever internal reason caused the > allocation to fail. 
Does it make any practical difference one way or > another? > Only because a ULP could print the return code and one could get a better idea of what the error was. (Lustre does this.) Since mthca_alloc_db returns EINVAL as well as ENOMEM it seems wasteful to ignore that. Thanks, Ira From troy at scl.ameslab.gov Thu Oct 4 16:22:57 2007 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Thu, 04 Oct 2007 18:22:57 -0500 Subject: [ofa-general] Setting lowest-common denominator ipoib multicast rate? In-Reply-To: <1191537971.1998.937.camel@hrosenstock-ws.xsigo.com> References: <470567DA.6010502@scl.ameslab.gov> <1191537971.1998.937.camel@hrosenstock-ws.xsigo.com> Message-ID: <470575D1.4000400@scl.ameslab.gov> Hal Rosenstock wrote: > On Thu, 2007-10-04 at 17:23 -0500, Troy Benjegerdes wrote: > >> How can I set the ipoib broadcast multicast group rate such that even a >> 1X SDR connected machine can still join the group? >> >> I tried the following in /etc/osm-partitions.conf >> > > What version of OpenSM ? That's the default for an OFED 1.2 based > version of OpenSM. Default is /etc/ofa/opensm-partitions.conf for a > master/OFED 1.3 version. > > This is an ofed 1.2 version >> based on a mail list >> posting from awhile ago, but it doesn't seem to work.. >> >> Default=0x7fff,ipoib,rate=1:ALL=full; >> > > rate=2 > > Can we get something added to the opensm man page about what the different rate= options mean? I couldn't find anything documenting what these rates map to. >> And then what does it take on the ipoib client side to pick up the new >> partition parameters and such? reloading ipoib, or a full restart? >> > > Restarting opensm after making the configuration change to the > partitions file should be sufficient. 
> > -- Hal > > >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> From rdreier at cisco.com Thu Oct 4 16:28:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 04 Oct 2007 16:28:52 -0700 Subject: [ofa-general] Setting lowest-common denominator ipoib multicast rate? In-Reply-To: <470575D1.4000400@scl.ameslab.gov> (Troy Benjegerdes's message of "Thu, 04 Oct 2007 18:22:57 -0500") References: <470567DA.6010502@scl.ameslab.gov> <1191537971.1998.937.camel@hrosenstock-ws.xsigo.com> <470575D1.4000400@scl.ameslab.gov> Message-ID: > Can we get something added to the opensm man page about what the > different rate= options mean? I couldn't find anything documenting > what these rates map to. FWIW the rates are defined in the IB spec, and you can look at enum ib_rate in the kernel sources to see all the values. From hrosenstock at xsigo.com Thu Oct 4 20:23:04 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 04 Oct 2007 20:23:04 -0700 Subject: [ofa-general] Setting lowest-common denominator ipoib multicast rate? In-Reply-To: <470575D1.4000400@scl.ameslab.gov> References: <470567DA.6010502@scl.ameslab.gov> <1191537971.1998.937.camel@hrosenstock-ws.xsigo.com> <470575D1.4000400@scl.ameslab.gov> Message-ID: <1191554584.1998.956.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-10-04 at 18:22 -0500, Troy Benjegerdes wrote: > Can we get something added to the opensm man page about what the > different rate= options mean? I couldn't find anything documenting what > these rates map to. The opensm man page says: "Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048)." in the PARTITION CONFIGURATION section.
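
Since the question of what the encoded values map to comes up in this thread, here is a small lookup sketch. Two points come straight from the messages above: rate=2 is the value suggested for a 1X SDR (2.5 Gb/s) member, and mtu=4 means 2048 bytes. The remaining rate codes are listed as recalled from the kernel's enum ib_rate and should be treated as an assumption to check against the IBTA specification; the function names are hypothetical.

```c
#include <stddef.h>

/* One encoded rate value and its speed in tenths of Gb/s (25 = 2.5 Gb/s). */
struct rate_entry {
	int code;
	int tenths_gbps;
};

/* Codes as recalled from the kernel's enum ib_rate; verify against the
 * IBTA specification before relying on anything other than code 2. */
static const struct rate_entry rate_table[] = {
	{ 2, 25 },    /* 2.5 Gb/s, i.e. 1X SDR */
	{ 3, 100 },   /* 10 Gb/s */
	{ 4, 300 },   /* 30 Gb/s */
	{ 5, 50 },    /* 5 Gb/s */
	{ 6, 200 },   /* 20 Gb/s */
	{ 7, 400 },   /* 40 Gb/s */
	{ 8, 600 },   /* 60 Gb/s */
	{ 9, 800 },   /* 80 Gb/s */
	{ 10, 1200 }, /* 120 Gb/s */
};

/* Hypothetical helper: returns the rate in tenths of Gb/s, or -1 when the
 * code is not in the table. */
int rate_code_to_tenths_gbps(int code)
{
	size_t i;

	for (i = 0; i < sizeof(rate_table) / sizeof(rate_table[0]); i++)
		if (rate_table[i].code == code)
			return rate_table[i].tenths_gbps;
	return -1;
}

/* Hypothetical helper: MTU codes 1..5 stand for 256..4096 bytes, so
 * mtu=4 gives 2048 as in the man page example; -1 for other codes. */
int mtu_code_to_bytes(int code)
{
	if (code < 1 || code > 5)
		return -1;
	return 128 << code;
}
```

Under this table a code of 1 has no entry, which would be consistent with rate=1 not behaving as hoped in the partition file, though the thread itself only states that rate=2 is the value to use.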
From kliteyn at mellanox.co.il Thu Oct 4 22:09:56 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 5 Oct 2007 07:09:56 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-05:normal completion Message-ID:

OSM Simulation Regression Summary
[Generated mail - please do NOT reply]

OpenSM binary date = 2007-10-04
OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=520 Pass=520 Fail=0

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:

From vlad at lists.openfabrics.org Fri Oct 5 02:53:56 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 5 Oct 2007 02:53:56 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20071005-0200 daily build status
Message-ID: <20071005095357.01E1EE60887@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with
linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:
Build failed on powerpc with linux-2.6.19
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.19_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.18
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.18_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.16
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type
argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.16_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.17
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->'
/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->'
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071005-0200_linux-2.6.17_powerpc_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From sean.hefty at intel.com Fri Oct 5 10:19:51 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 5 Oct 2007 10:19:51 -0700
Subject: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses
In-Reply-To: <000401c7feee$ea073180$ff0da8c0@amr.corp.intel.com>
References: <46F7FDE5.9070305@oracle.com> <000401c7feee$ea073180$ff0da8c0@amr.corp.intel.com>
Message-ID: <000401c80773$f07c2380$3c98070a@amr.corp.intel.com>

Rick, have you had a chance to test out this patch?

- Sean

>If a user allocates a QP on an rdma_cm_id, the rdma_cm will automatically
>transition the QP through its states (RTR, RTS, error, etc.) While the
>QP state transitions are occurring, the QP itself must remain valid.
>Provide locking around the QP pointer to prevent its destruction while >accessing the pointer. > >This fixes an issue reported by Olaf Kirch from Oracle that resulted in >a system crash: > >"An incoming connection arrives and we decide to tear down the nascent > connection. The remote ends decides to do the same. We start to shut > down the connection, and call rdma_destroy_qp on our cm_id. ... Now > apparently a 'connect reject' message comes in from the other host, > and cma_ib_handler() is called with an event of IB_CM_REJ_RECEIVED. > It calls cma_modify_qp_err, which for some odd reason tries to modify > the exact same QP we just destroyed." > >Signed-off-by: Sean Hefty >--- >Rick, can you please test this patch and let me know if it fixes your problem? > > drivers/infiniband/core/cma.c | 90 +++++++++++++++++++++++++++-------------- > 1 files changed, 60 insertions(+), 30 deletions(-) > >diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c >index 9ffb998..c6a6dba 100644 >--- a/drivers/infiniband/core/cma.c >+++ b/drivers/infiniband/core/cma.c >@@ -120,6 +120,8 @@ struct rdma_id_private { > > enum cma_state state; > spinlock_t lock; >+ struct mutex qp_mutex; >+ > struct completion comp; > atomic_t refcount; > wait_queue_head_t wait_remove; >@@ -387,6 +389,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler >event_handler, > id_priv->id.event_handler = event_handler; > id_priv->id.ps = ps; > spin_lock_init(&id_priv->lock); >+ mutex_init(&id_priv->qp_mutex); > init_completion(&id_priv->comp); > atomic_set(&id_priv->refcount, 1); > init_waitqueue_head(&id_priv->wait_remove); >@@ -472,61 +475,86 @@ EXPORT_SYMBOL(rdma_create_qp); > > void rdma_destroy_qp(struct rdma_cm_id *id) > { >- ib_destroy_qp(id->qp); >+ struct rdma_id_private *id_priv; >+ >+ id_priv = container_of(id, struct rdma_id_private, id); >+ mutex_lock(&id_priv->qp_mutex); >+ ib_destroy_qp(id_priv->id.qp); >+ id_priv->id.qp = NULL; >+ mutex_unlock(&id_priv->qp_mutex); > } 
> EXPORT_SYMBOL(rdma_destroy_qp); > >-static int cma_modify_qp_rtr(struct rdma_cm_id *id) >+static int cma_modify_qp_rtr(struct rdma_id_private *id_priv) > { > struct ib_qp_attr qp_attr; > int qp_attr_mask, ret; > >- if (!id->qp) >- return 0; >+ mutex_lock(&id_priv->qp_mutex); >+ if (!id_priv->id.qp) { >+ ret = 0; >+ goto out; >+ } > > /* Need to update QP attributes from default values. */ > qp_attr.qp_state = IB_QPS_INIT; >- ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); >+ ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); > if (ret) >- return ret; >+ goto out; > >- ret = ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); >+ ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); > if (ret) >- return ret; >+ goto out; > > qp_attr.qp_state = IB_QPS_RTR; >- ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); >+ ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); > if (ret) >- return ret; >+ goto out; > >- return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); >+ ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); >+out: >+ mutex_unlock(&id_priv->qp_mutex); >+ return ret; > } > >-static int cma_modify_qp_rts(struct rdma_cm_id *id) >+static int cma_modify_qp_rts(struct rdma_id_private *id_priv) > { > struct ib_qp_attr qp_attr; > int qp_attr_mask, ret; > >- if (!id->qp) >- return 0; >+ mutex_lock(&id_priv->qp_mutex); >+ if (!id_priv->id.qp) { >+ ret = 0; >+ goto out; >+ } > > qp_attr.qp_state = IB_QPS_RTS; >- ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); >+ ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); > if (ret) >- return ret; >+ goto out; > >- return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); >+ ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); >+out: >+ mutex_unlock(&id_priv->qp_mutex); >+ return ret; > } > >-static int cma_modify_qp_err(struct rdma_cm_id *id) >+static int cma_modify_qp_err(struct rdma_id_private *id_priv) > { > struct ib_qp_attr qp_attr; >+ int ret; > >- if (!id->qp) >- 
return 0; >+ mutex_lock(&id_priv->qp_mutex); >+ if (!id_priv->id.qp) { >+ ret = 0; >+ goto out; >+ } > > qp_attr.qp_state = IB_QPS_ERR; >- return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); >+ ret = ib_modify_qp(id_priv->id.qp, &qp_attr, IB_QP_STATE); >+out: >+ mutex_unlock(&id_priv->qp_mutex); >+ return ret; > } > > static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv, >@@ -855,11 +883,11 @@ static int cma_rep_recv(struct rdma_id_private *id_priv) > { > int ret; > >- ret = cma_modify_qp_rtr(&id_priv->id); >+ ret = cma_modify_qp_rtr(id_priv); > if (ret) > goto reject; > >- ret = cma_modify_qp_rts(&id_priv->id); >+ ret = cma_modify_qp_rts(id_priv); > if (ret) > goto reject; > >@@ -869,7 +897,7 @@ static int cma_rep_recv(struct rdma_id_private *id_priv) > > return 0; > reject: >- cma_modify_qp_err(&id_priv->id); >+ cma_modify_qp_err(id_priv); > ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, > NULL, 0, NULL, 0); > return ret; >@@ -945,7 +973,7 @@ static int cma_ib_handler(struct ib_cm_id *cm_id, struct >ib_cm_event *ib_event) > /* ignore event */ > goto out; > case IB_CM_REJ_RECEIVED: >- cma_modify_qp_err(&id_priv->id); >+ cma_modify_qp_err(id_priv); > event.status = ib_event->param.rej_rcvd.reason; > event.event = RDMA_CM_EVENT_REJECTED; > event.param.conn.private_data = ib_event->private_data; >@@ -2236,7 +2264,7 @@ static int cma_connect_iw(struct rdma_id_private >*id_priv, > sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr; > cm_id->remote_addr = *sin; > >- ret = cma_modify_qp_rtr(&id_priv->id); >+ ret = cma_modify_qp_rtr(id_priv); > if (ret) > goto out; > >@@ -2303,7 +2331,7 @@ static int cma_accept_ib(struct rdma_id_private *id_priv, > int qp_attr_mask, ret; > > if (id_priv->id.qp) { >- ret = cma_modify_qp_rtr(&id_priv->id); >+ ret = cma_modify_qp_rtr(id_priv); > if (ret) > goto out; > >@@ -2342,7 +2370,7 @@ static int cma_accept_iw(struct rdma_id_private *id_priv, > struct iw_cm_conn_param iw_param; > int ret; > >- ret = 
cma_modify_qp_rtr(&id_priv->id); >+ ret = cma_modify_qp_rtr(id_priv); > if (ret) > return ret; > >@@ -2414,7 +2442,7 @@ int rdma_accept(struct rdma_cm_id *id, struct >rdma_conn_param *conn_param) > > return 0; > reject: >- cma_modify_qp_err(id); >+ cma_modify_qp_err(id_priv); > rdma_reject(id, NULL, 0); > return ret; > } >@@ -2484,7 +2512,7 @@ int rdma_disconnect(struct rdma_cm_id *id) > > switch (rdma_node_get_transport(id->device->node_type)) { > case RDMA_TRANSPORT_IB: >- ret = cma_modify_qp_err(id); >+ ret = cma_modify_qp_err(id_priv); > if (ret) > goto out; > /* Initiate or respond to a disconnect. */ >@@ -2515,9 +2543,11 @@ static int cma_ib_mc_handler(int status, struct >ib_sa_multicast *multicast) > cma_disable_remove(id_priv, CMA_ADDR_RESOLVED)) > return 0; > >+ mutex_lock(&id_priv->qp_mutex); > if (!status && id_priv->id.qp) > status = ib_attach_mcast(id_priv->id.qp, &multicast->rec.mgid, > multicast->rec.mlid); >+ mutex_unlock(&id_priv->qp_mutex); > > memset(&event, 0, sizeof event); > event.status = status; > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From dotanb at dev.mellanox.co.il Fri Oct 5 10:51:20 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Fri, 05 Oct 2007 19:51:20 +0200 Subject: [ofa-general] Issues to scale to 64K ranks. In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840302725728@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA840302725728@G3W0634.americas.hpqcorp.net> Message-ID: <47067998.3090402@dev.mellanox.co.il> Hi. Tang, Changqing wrote: > When talking to run 64K processes, I noticed that, on connectX with 2.2 > firmware, > ibv_devinfo only shows: max_qp: 65472, that means we can not create > 65536 QPs on > this HCA, Is this max_qp on per process basis, or per HCA basis ? 
>
The number of QPs (and of any other resource) is on a per-HCA basis.

The HCA itself supports many more QPs (and more elements of any other
resource), but the driver limits the number of QPs in order to consume
less memory.

The mthca low-level driver supports changing the number of resources
through module parameters; this needs to be done in the ConnectX
low-level driver as well.

> How to increase this number ? any hardware/firmware change needed ?
>
>
Until those module parameters are added, the only way to do it is to
hack the low-level driver.

Dotan

From troy at scl.ameslab.gov Fri Oct 5 11:54:11 2007
From: troy at scl.ameslab.gov (Troy Benjegerdes)
Date: Fri, 05 Oct 2007 13:54:11 -0500
Subject: [ofa-general] Setting lowest-common denominator ipoib multicast rate?
In-Reply-To: <1191554584.1998.956.camel@hrosenstock-ws.xsigo.com>
References: <470567DA.6010502@scl.ameslab.gov> <1191537971.1998.937.camel@hrosenstock-ws.xsigo.com> <470575D1.4000400@scl.ameslab.gov> <1191554584.1998.956.camel@hrosenstock-ws.xsigo.com>
Message-ID: <47068853.2090206@scl.ameslab.gov>

Hal Rosenstock wrote:
> On Thu, 2007-10-04 at 18:22 -0500, Troy Benjegerdes wrote:
>
>> Can we get something added to the opensm man page about what the
>> different rate= options mean? I couldn't find anything documenting what
>> these rates map to.
>>
>
> The opensm man page says:
>
> "Note that values for rate, mtu, and scope should be specified as
> defined in the IBTA specification (for example, mtu=4 for 2048)."
>
> in the PARTITION CONFIGURATION section.
>
>
I think it would help usability a lot to put the PARTITION CONFIGURATION
section in a separate 'opensm-partitions.conf' man page with the values
for rate, mtu and scope listed directly.
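
For readers who want the rate= and mtu= values spelled out, here is a
minimal sketch in C. It is an illustration only: the function names
ib_rate_code_to_mbps and ib_mtu_code_to_bytes are hypothetical and belong
to no library. The rate codes follow enum ib_rate in the kernel's
ib_verbs.h as I understand it from that era, and the MTU codes follow the
IBTA encoding that the man page quote above refers to (mtu=4 for 2048).
Please check both against the IB spec before relying on them. Scope
values are not covered here.

```c
#include <stdio.h>

/*
 * Map an IBTA-encoded static rate (the value given to rate=) to Mb/s.
 * Codes are assumed to match enum ib_rate in the kernel; note that they
 * are not in order of speed. Returns 0 for codes this sketch does not know.
 */
int ib_rate_code_to_mbps(int code)
{
	switch (code) {
	case 2:  return 2500;    /* 2.5 Gb/s */
	case 3:  return 10000;   /* 10 Gb/s  */
	case 4:  return 30000;   /* 30 Gb/s  */
	case 5:  return 5000;    /* 5 Gb/s   */
	case 6:  return 20000;   /* 20 Gb/s  */
	case 7:  return 40000;   /* 40 Gb/s  */
	case 8:  return 60000;   /* 60 Gb/s  */
	case 9:  return 80000;   /* 80 Gb/s  */
	case 10: return 120000;  /* 120 Gb/s */
	default: return 0;
	}
}

/*
 * Map an IBTA-encoded MTU (the value given to mtu=) to bytes:
 * 1=256, 2=512, 3=1024, 4=2048, 5=4096. Returns 0 for other codes.
 */
int ib_mtu_code_to_bytes(int code)
{
	if (code < 1 || code > 5)
		return 0;
	return 128 << code;
}
```

Under these assumptions, rate=3 would be 10 Gb/s and mtu=4 would be 2048
bytes, which matches the example in the man page text quoted above.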

From zulfiimani at gmail.com Fri Oct 5 13:46:00 2007
From: zulfiimani at gmail.com (Zulfi Imani)
Date: Fri, 5 Oct 2007 14:46:00 -0600
Subject: [ofa-general] OFED libibverbs API
Message-ID: <7778a2950710051346g3ba805cejb6145564fb9478e3@mail.gmail.com>

Hi all,

I wanted to find out where I can get the libibverbs API specification from.
I checked the openfabrics.org website but could not find anything
immediately.

Thanks
Zulfi
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sean.hefty at intel.com Fri Oct 5 14:31:00 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 5 Oct 2007 14:31:00 -0700
Subject: [ofa-general] [PATCH-2.6.24 2/2 v2] [RFC] ib/cm: add basic performance counters
In-Reply-To: <000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com>
References: <000001c7ff9e$1b764580$3c98070a@amr.corp.intel.com> <000101c7ff9e$9f35fe60$3c98070a@amr.corp.intel.com>
Message-ID: <000601c80797$064720c0$3c98070a@amr.corp.intel.com>

Add performance/debug counters to track sent/received messages, retries,
and duplicates. Counters are tracked per CM message type, per port.

The counters are always enabled, so intrusive state tracking is not done.

Counters are exported as:

/sys/class/infiniband_cm/device/port/counter_description/cm_attribute

for example:

/sys/class/infiniband_cm/mthca0/1/cm_tx_msgs/req
/sys/class/infiniband_cm/mthca0/1/cm_tx_retries/rep

Signed-off-by: Sean Hefty
---
>From v1: This moves the counters from debugfs to sysfs. Everything works
fine for me, but I'm not entirely sure if I'm using the kobject stuff in
the best way.

This still depends on the ib_mad changes to export the number of retries.
There were no changes to that patch, so I'm not re-sending it at this time.

Is there still the possibility of getting this into OFED 1.3? Did feature
freeze occur on 10/3, or was it pushed out because of vacations?
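
As an illustration of how the sysfs layout described above could be
consumed from user space, here is a minimal sketch in C. It assumes the
path scheme given in the description
(/sys/class/infiniband_cm/device/port/counter_description/cm_attribute)
and that each file holds one decimal value followed by a newline. The
helper names cm_counter_path and cm_read_counter are hypothetical and are
not part of this patch or of any library.

```c
#include <stdio.h>

/*
 * Build the sysfs path for one CM counter, for example
 * /sys/class/infiniband_cm/mthca0/1/cm_tx_msgs/req.
 * Returns the snprintf() result: the number of characters the full path
 * needs (excluding the terminating NUL), or a negative value on error.
 */
int cm_counter_path(char *buf, size_t len, const char *device,
		    int port, const char *group, const char *attr)
{
	return snprintf(buf, len, "/sys/class/infiniband_cm/%s/%d/%s/%s",
			device, port, group, attr);
}

/*
 * Read a single counter value from the given sysfs file.
 * Returns 0 on success, or -1 if the file cannot be opened or does not
 * start with a decimal number.
 */
int cm_read_counter(const char *path, long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%ld", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would build the path for, say, device "mthca0", port 1, group
"cm_tx_retries" and attribute "rep", then pass it to cm_read_counter().
On a system without this patch applied the file does not exist and the
read simply returns -1.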
drivers/infiniband/core/cm.c | 294 +++++++++++++++++++++++++++++++++++++++-- drivers/infiniband/core/ucm.c | 37 ++--- 2 files changed, 296 insertions(+), 35 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 2e39236..790149e 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2006 Intel Corporation. All rights reserved. + * Copyright (c) 2004-2007 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. @@ -37,12 +37,14 @@ #include #include +#include #include #include #include #include #include #include +#include #include #include @@ -78,15 +80,92 @@ static struct ib_cm { struct workqueue_struct *wq; } cm; +/* Counter indexes ordered by attribute ID */ +enum { + CM_REQ_COUNTER, + CM_MRA_COUNTER, + CM_REJ_COUNTER, + CM_REP_COUNTER, + CM_RTU_COUNTER, + CM_DREQ_COUNTER, + CM_DREP_COUNTER, + CM_SIDR_REQ_COUNTER, + CM_SIDR_REP_COUNTER, + CM_LAP_COUNTER, + CM_APR_COUNTER, + CM_ATTR_COUNT, + CM_ATTR_ID_OFFSET = 0x0010, +}; + +enum { + CM_XMIT, + CM_XMIT_RETRIES, + CM_RECV, + CM_RECV_DUPLICATES, + CM_COUNTER_GROUPS +}; + +static char const counter_group_names[CM_COUNTER_GROUPS] + [sizeof("cm_rx_duplicates")] = { + "cm_tx_msgs", "cm_tx_retries", + "cm_rx_msgs", "cm_rx_duplicates" +}; + +struct cm_counter_group { + struct kobject obj; + atomic_long_t counter[CM_ATTR_COUNT]; +}; + +struct cm_counter_attribute { + struct attribute attr; + int index; +}; + +#define CM_COUNTER_ATTR(_name, _index) \ +struct cm_counter_attribute cm_##_name##_counter_attr = { \ + .attr = {.name = __stringify(_name), .mode = 0444, .owner = THIS_MODULE}, \ + .index = _index \ +} + +static CM_COUNTER_ATTR(req, CM_REQ_COUNTER); +static CM_COUNTER_ATTR(mra, CM_MRA_COUNTER); +static CM_COUNTER_ATTR(rej, CM_REJ_COUNTER); 
+static CM_COUNTER_ATTR(rep, CM_REP_COUNTER); +static CM_COUNTER_ATTR(rtu, CM_RTU_COUNTER); +static CM_COUNTER_ATTR(dreq, CM_DREQ_COUNTER); +static CM_COUNTER_ATTR(drep, CM_DREP_COUNTER); +static CM_COUNTER_ATTR(sidr_req, CM_SIDR_REQ_COUNTER); +static CM_COUNTER_ATTR(sidr_rep, CM_SIDR_REP_COUNTER); +static CM_COUNTER_ATTR(lap, CM_LAP_COUNTER); +static CM_COUNTER_ATTR(apr, CM_APR_COUNTER); + +static struct attribute *cm_counter_default_attrs[] = { + &cm_req_counter_attr.attr, + &cm_mra_counter_attr.attr, + &cm_rej_counter_attr.attr, + &cm_rep_counter_attr.attr, + &cm_rtu_counter_attr.attr, + &cm_dreq_counter_attr.attr, + &cm_drep_counter_attr.attr, + &cm_sidr_req_counter_attr.attr, + &cm_sidr_rep_counter_attr.attr, + &cm_lap_counter_attr.attr, + &cm_apr_counter_attr.attr, + NULL +}; + struct cm_port { struct cm_device *cm_dev; struct ib_mad_agent *mad_agent; + struct kobject port_obj; u8 port_num; + struct cm_counter_group counter_group[CM_COUNTER_GROUPS]; }; struct cm_device { struct list_head list; struct ib_device *device; + struct kobject dev_obj; u8 ack_delay; struct cm_port port[0]; }; @@ -1270,6 +1349,9 @@ static void cm_dup_req_handler(struct cm_work *work, struct ib_mad_send_buf *msg = NULL; int ret; + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. + counter[CM_REQ_COUNTER]); + /* Quick state check to discard duplicate REQs. */ if (cm_id_priv->id.state == IB_CM_REQ_RCVD) return; @@ -1616,6 +1698,8 @@ static void cm_dup_rep_handler(struct cm_work *work) if (!cm_id_priv) return; + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. + counter[CM_REP_COUNTER]); ret = cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg); if (ret) goto deref; @@ -1781,6 +1865,8 @@ static int cm_rtu_handler(struct cm_work *work) if (cm_id_priv->id.state != IB_CM_REP_SENT && cm_id_priv->id.state != IB_CM_MRA_REP_RCVD) { spin_unlock_irq(&cm_id_priv->lock); + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. 
+ counter[CM_RTU_COUNTER]); goto out; } cm_id_priv->id.state = IB_CM_ESTABLISHED; @@ -1958,6 +2044,8 @@ static int cm_dreq_handler(struct cm_work *work) cm_id_priv = cm_acquire_id(dreq_msg->remote_comm_id, dreq_msg->local_comm_id); if (!cm_id_priv) { + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. + counter[CM_DREQ_COUNTER]); cm_issue_drep(work->port, work->mad_recv_wc); return -EINVAL; } @@ -1977,6 +2065,8 @@ static int cm_dreq_handler(struct cm_work *work) case IB_CM_MRA_REP_RCVD: break; case IB_CM_TIMEWAIT: + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. + counter[CM_DREQ_COUNTER]); if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg)) goto unlock; @@ -1988,6 +2078,10 @@ static int cm_dreq_handler(struct cm_work *work) if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); goto deref; + case IB_CM_DREQ_RCVD: + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. + counter[CM_DREQ_COUNTER]); + goto unlock; default: goto unlock; } @@ -2339,10 +2433,20 @@ static int cm_mra_handler(struct cm_work *work) if (cm_mra_get_msg_mraed(mra_msg) != CM_MSG_RESPONSE_OTHER || cm_id_priv->id.lap_state != IB_CM_LAP_SENT || ib_modify_mad(cm_id_priv->av.port->mad_agent, - cm_id_priv->msg, timeout)) + cm_id_priv->msg, timeout)) { + if (cm_id_priv->id.lap_state == IB_CM_MRA_LAP_RCVD) + atomic_long_inc(&work->port-> + counter_group[CM_RECV_DUPLICATES]. + counter[CM_MRA_COUNTER]); goto out; + } cm_id_priv->id.lap_state = IB_CM_MRA_LAP_RCVD; break; + case IB_CM_MRA_REQ_RCVD: + case IB_CM_MRA_REP_RCVD: + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. + counter[CM_MRA_COUNTER]); + /* fall through */ default: goto out; } @@ -2502,6 +2606,8 @@ static int cm_lap_handler(struct cm_work *work) case IB_CM_LAP_IDLE: break; case IB_CM_MRA_LAP_SENT: + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. 
+ counter[CM_LAP_COUNTER]); if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg)) goto unlock; @@ -2515,6 +2621,10 @@ static int cm_lap_handler(struct cm_work *work) if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); goto deref; + case IB_CM_LAP_RCVD: + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. + counter[CM_LAP_COUNTER]); + goto unlock; default: goto unlock; } @@ -2796,6 +2906,8 @@ static int cm_sidr_req_handler(struct cm_work *work) cur_cm_id_priv = cm_insert_remote_sidr(cm_id_priv); if (cur_cm_id_priv) { spin_unlock_irq(&cm.lock); + atomic_long_inc(&work->port->counter_group[CM_RECV_DUPLICATES]. + counter[CM_SIDR_REQ_COUNTER]); goto out; /* Duplicate message. */ } cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD; @@ -2990,6 +3102,27 @@ static void cm_send_handler(struct ib_mad_agent *mad_agent, struct ib_mad_send_wc *mad_send_wc) { struct ib_mad_send_buf *msg = mad_send_wc->send_buf; + struct cm_port *port; + u16 attr_index; + + port = mad_agent->context; + attr_index = be16_to_cpu(((struct ib_mad_hdr *) + msg->mad)->attr_id) - CM_ATTR_ID_OFFSET; + + /* + * If the send was in response to a received message (context[0] is not + * set to a cm_id), and is not a REJ, then it is a send that was + * manually retried. + */ + if (!msg->context[0] && (attr_index != CM_REJ_COUNTER)) + msg->retries = 1; + + atomic_long_add(1 + msg->retries, + &port->counter_group[CM_XMIT].counter[attr_index]); + if (msg->retries) + atomic_long_add(msg->retries, + &port->counter_group[CM_XMIT_RETRIES]. 
+ counter[attr_index]); switch (mad_send_wc->status) { case IB_WC_SUCCESS: @@ -3148,8 +3281,10 @@ EXPORT_SYMBOL(ib_cm_notify); static void cm_recv_handler(struct ib_mad_agent *mad_agent, struct ib_mad_recv_wc *mad_recv_wc) { + struct cm_port *port = mad_agent->context; struct cm_work *work; enum ib_cm_event_type event; + u16 attr_id; int paths = 0; switch (mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) { @@ -3194,6 +3329,10 @@ static void cm_recv_handler(struct ib_mad_agent *mad_agent, return; } + attr_id = be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.attr_id); + atomic_long_inc(&port->counter_group[CM_RECV]. + counter[attr_id - CM_ATTR_ID_OFFSET]); + work = kmalloc(sizeof *work + sizeof(struct ib_sa_path_rec) * paths, GFP_KERNEL); if (!work) { @@ -3204,7 +3343,7 @@ static void cm_recv_handler(struct ib_mad_agent *mad_agent, INIT_DELAYED_WORK(&work->work, cm_work_handler); work->cm_event.event = event; work->mad_recv_wc = mad_recv_wc; - work->port = (struct cm_port *)mad_agent->context; + work->port = port; queue_delayed_work(cm.wq, &work->work, 0); } @@ -3379,6 +3518,110 @@ static void cm_get_ack_delay(struct cm_device *cm_dev) cm_dev->ack_delay = attr.local_ca_ack_delay; } +static ssize_t cm_show_counter(struct kobject *obj, struct attribute *attr, + char *buf) +{ + struct cm_counter_group *group; + struct cm_counter_attribute *cm_attr; + + group = container_of(obj, struct cm_counter_group, obj); + cm_attr = container_of(attr, struct cm_counter_attribute, attr); + + return sprintf(buf, "%ld\n", + atomic_long_read(&group->counter[cm_attr->index])); +} + +static struct sysfs_ops cm_counter_ops = { + .show = cm_show_counter +}; + +static struct kobj_type cm_counter_obj_type = { + .sysfs_ops = &cm_counter_ops, + .default_attrs = cm_counter_default_attrs +}; + +static void cm_release_dev_obj(struct kobject *obj) +{ + struct cm_device *cm_dev; + + cm_dev = container_of(obj, struct cm_device, dev_obj); + kfree(cm_dev); +} + +static struct kobj_type cm_dev_obj_type = { + 
.release = cm_release_dev_obj +}; + +static struct class cm_class = { + .name = "infiniband_cm", +}; +EXPORT_SYMBOL(cm_class); + +static int cm_add_fs_obj(struct kobject *obj, struct kobject *parent, + struct kobj_type *type, const char *name) +{ + int ret; + + ret = kobject_set_name(obj, "%s", name); + if (ret) + return ret; + + obj->ktype = type; + obj->parent = kobject_get(parent); + if (!obj->parent) + return -EBUSY; + + ret = kobject_register(obj); + if (ret) + kobject_put(parent); + + return ret; +} + +static void cm_remove_fs_obj(struct kobject *obj) +{ + kobject_put(obj->parent); + kobject_unregister(obj); +} + +static int cm_create_port_fs(struct cm_port *port) +{ + char port_name[8]; + int i, ret; + + snprintf(port_name, sizeof port_name, "%d", port->port_num); + ret = cm_add_fs_obj(&port->port_obj, &port->cm_dev->dev_obj, + NULL, port_name); + if (ret) + return ret; + + for (i = 0; i < CM_COUNTER_GROUPS; i++) { + ret = cm_add_fs_obj(&port->counter_group[i].obj, &port->port_obj, + &cm_counter_obj_type, counter_group_names[i]); + if (ret) + goto error; + } + + return 0; + +error: + while (i--) + cm_remove_fs_obj(&port->counter_group[i].obj); + cm_remove_fs_obj(&port->port_obj); + return ret; + +} + +static void cm_remove_port_fs(struct cm_port *port) +{ + int i; + + for (i = 0; i < CM_COUNTER_GROUPS; i++) + cm_remove_fs_obj(&port->counter_group[i].obj); + + cm_remove_fs_obj(&port->port_obj); +} + static void cm_add_one(struct ib_device *device) { struct cm_device *cm_dev; @@ -3397,7 +3640,7 @@ static void cm_add_one(struct ib_device *device) if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) return; - cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * + cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) return; @@ -3405,11 +3648,23 @@ static void cm_add_one(struct ib_device *device) cm_dev->device = device; cm_get_ack_delay(cm_dev); + ret = cm_add_fs_obj(&cm_dev->dev_obj, 
&cm_class.subsys.kobj, + &cm_dev_obj_type, device->name); + if (ret) { + kfree(cm_dev); + return; + } + set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); for (i = 1; i <= device->phys_port_cnt; i++) { port = &cm_dev->port[i-1]; port->cm_dev = cm_dev; port->port_num = i; + + ret = cm_create_port_fs(port); + if (ret) + goto error1; + port->mad_agent = ib_register_mad_agent(device, i, IB_QPT_GSI, &reg_req, @@ -3418,11 +3673,11 @@ static void cm_add_one(struct ib_device *device) cm_recv_handler, port); if (IS_ERR(port->mad_agent)) - goto error1; + goto error2; ret = ib_modify_port(device, i, 0, &port_modify); if (ret) - goto error2; + goto error3; } ib_set_client_data(device, &cm_client, cm_dev); @@ -3431,8 +3686,10 @@ static void cm_add_one(struct ib_device *device) write_unlock_irqrestore(&cm.device_lock, flags); return; -error2: +error3: ib_unregister_mad_agent(port->mad_agent); +error2: + cm_remove_port_fs(port); error1: port_modify.set_port_cap_mask = 0; port_modify.clr_port_cap_mask = IB_PORT_CM_SUP; @@ -3440,8 +3697,9 @@ error1: port = &cm_dev->port[i-1]; ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); + cm_remove_port_fs(port); } - kfree(cm_dev); + cm_remove_fs_obj(&cm_dev->dev_obj); } static void cm_remove_one(struct ib_device *device) @@ -3466,8 +3724,9 @@ static void cm_remove_one(struct ib_device *device) port = &cm_dev->port[i-1]; ib_modify_port(device, port->port_num, 0, &port_modify); ib_unregister_mad_agent(port->mad_agent); + cm_remove_port_fs(port); } - kfree(cm_dev); + cm_remove_fs_obj(&cm_dev->dev_obj); } static int __init ib_cm_init(void) @@ -3488,17 +3747,25 @@ static int __init ib_cm_init(void) idr_pre_get(&cm.local_id_table, GFP_KERNEL); INIT_LIST_HEAD(&cm.timewait_list); - cm.wq = create_workqueue("ib_cm"); - if (!cm.wq) + ret = class_register(&cm_class); + if (ret) return -ENOMEM; + cm.wq = create_workqueue("ib_cm"); + if (!cm.wq) { + ret = -ENOMEM; + goto error1; + } + ret =
ib_register_client(&cm_client); if (ret) - goto error; + goto error2; return 0; -error: +error2: destroy_workqueue(cm.wq); +error1: + class_unregister(&cm_class); return ret; } @@ -3519,6 +3786,7 @@ static void __exit ib_cm_cleanup(void) } ib_unregister_client(&cm_client); + class_unregister(&cm_class); idr_destroy(&cm.local_id_table); } diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index 424983f..4291ab4 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -106,6 +106,9 @@ enum { IB_UCM_MAX_DEVICES = 32 }; +/* ib_cm and ib_user_cm modules share /sys/class/infiniband_cm */ +extern struct class cm_class; + #define IB_UCM_BASE_DEV MKDEV(IB_UCM_MAJOR, IB_UCM_BASE_MINOR) static void ib_ucm_add_one(struct ib_device *device); @@ -1199,7 +1202,7 @@ static int ib_ucm_close(struct inode *inode, struct file *filp) return 0; } -static void ib_ucm_release_class_dev(struct class_device *class_dev) +static void ucm_release_class_dev(struct class_device *class_dev) { struct ib_ucm_device *dev; @@ -1217,11 +1220,6 @@ static const struct file_operations ucm_fops = { .poll = ib_ucm_poll, }; -static struct class ucm_class = { - .name = "infiniband_cm", - .release = ib_ucm_release_class_dev -}; - static ssize_t show_ibdev(struct class_device *class_dev, char *buf) { struct ib_ucm_device *dev; @@ -1257,9 +1255,10 @@ static void ib_ucm_add_one(struct ib_device *device) if (cdev_add(&ucm_dev->dev, IB_UCM_BASE_DEV + ucm_dev->devnum, 1)) goto err; - ucm_dev->class_dev.class = &ucm_class; + ucm_dev->class_dev.class = &cm_class; ucm_dev->class_dev.dev = device->dma_device; ucm_dev->class_dev.devt = ucm_dev->dev.dev; + ucm_dev->class_dev.release = ucm_release_class_dev; snprintf(ucm_dev->class_dev.class_id, BUS_ID_SIZE, "ucm%d", ucm_dev->devnum); if (class_device_register(&ucm_dev->class_dev)) @@ -1306,40 +1305,34 @@ static int __init ib_ucm_init(void) "infiniband_cm"); if (ret) { printk(KERN_ERR "ucm: couldn't register device 
number\n"); - goto err; + goto error1; } - ret = class_register(&ucm_class); - if (ret) { - printk(KERN_ERR "ucm: couldn't create class infiniband_cm\n"); - goto err_chrdev; - } - - ret = class_create_file(&ucm_class, &class_attr_abi_version); + ret = class_create_file(&cm_class, &class_attr_abi_version); if (ret) { printk(KERN_ERR "ucm: couldn't create abi_version attribute\n"); - goto err_class; + goto error2; } ret = ib_register_client(&ucm_client); if (ret) { printk(KERN_ERR "ucm: couldn't register client\n"); - goto err_class; + goto error3; } return 0; -err_class: - class_unregister(&ucm_class); -err_chrdev: +error3: + class_remove_file(&cm_class, &class_attr_abi_version); +error2: unregister_chrdev_region(IB_UCM_BASE_DEV, IB_UCM_MAX_DEVICES); -err: +error1: return ret; } static void __exit ib_ucm_cleanup(void) { ib_unregister_client(&ucm_client); - class_unregister(&ucm_class); + class_remove_file(&cm_class, &class_attr_abi_version); unregister_chrdev_region(IB_UCM_BASE_DEV, IB_UCM_MAX_DEVICES); idr_destroy(&ctx_id_table); } From swise at opengridcomputing.com Fri Oct 5 14:53:02 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 05 Oct 2007 16:53:02 -0500 Subject: [ofa-general] OFED libibverbs API In-Reply-To: <7778a2950710051346g3ba805cejb6145564fb9478e3@mail.gmail.com> References: <7778a2950710051346g3ba805cejb6145564fb9478e3@mail.gmail.com> Message-ID: <4706B23E.8050709@opengridcomputing.com> OFA Admins: It would be nice to put the man pages on-line... If we installed the man pages, then used man2html or something we could point folks at that for on-line docs... Zulfi, if you build/install ofed-1.2.5, you can then get man pages for the verbs and rdmacm APIs. Also there are header files and examples that get build/installed. Steve. Zulfi Imani wrote: > Hi all, > > I wanted to find out where I can get the libibverbs API specification > from. I checked the openfabrics.org website but > could not find anything immediately. 
> > Thanks > Zulfi > > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From zulfiimani at gmail.com Fri Oct 5 15:01:10 2007 From: zulfiimani at gmail.com (Zulfi Imani) Date: Fri, 5 Oct 2007 16:01:10 -0600 Subject: [ofa-general] Problem running SDP apps using OFED 1.2 In-Reply-To: <46FF44B3.4010805@dev.mellanox.co.il> References: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com> <000901c8014b$0435f880$0ca1e980$@rr.com> <7778a2950709271528y4775d691ta3943848d7dba9e1@mail.gmail.com> <46FF44B3.4010805@dev.mellanox.co.il> Message-ID: <7778a2950710051501w27d5ecct35b95406c0d90808@mail.gmail.com> Hi Dotan, ifconfig shows up ib0 Link encap:InfiniBand HWaddr 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr: 140.221.37.32 Bcast: 140.221.37.255 Mask: 255.255.255.0 inet6 addr: fe80::211:7500:ff:d802/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:1 errors:0 dropped:5 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:68 (68.0 b) does this mean that the apps would now use IPoIB ? How do i tell when IPoIB is working and when it isnt ? Because I assume when it isnt it would default to Ethernet ? It will be great if I can get this cleared. Thanks Zulfi On 9/30/07, Dotan Barak < dotanb at dev.mellanox.co.il > wrote: > > Does a simple "ping" between the nodes is working? 
> (this way you can be sure that IPoIB is working and SDP should work) > > Dotan > > > Zulfi Imani wrote: > > I have not tried over IPoIB, but opensm is running > > > > /home/zulfi > sminfo > > sminfo: sm lid 1 sm guid 0x11750000ffdaf4, activity count 16220 > > priority 0 state 3 SMINFO_MASTER > > > > I also tried a few iband utilities and they all work fine. Not able to > > run any socket apps over SDP. > > > > Thanks > > Zulfi > > > > On 9/27/07, *Jim Mott* < jimmott at austin.rr.com > > > wrote: > > > > Were you able to connect IPoIB between the nodes? Are you sure > > opensm was running? I am ashamed to admit that occasionally I > > forget to start opensm and wonder why SDP does not connect. > > > > > > > > *From:* general-bounces at lists.openfabrics.org > > [mailto: > > general-bounces at lists.openfabrics.org > > ] *On Behalf Of > > *Zulfi Imani > > *Sent:* Thursday, September 27, 2007 3:22 PM > > *To:* general at lists.openfabrics.org > > > > *Subject:* [ofa-general] Problem running SDP apps using OFED 1.2 > > > > > > > > Hi, > > > > I installed the OFED1.2 stack and am trying to run a simple socket > > server and client over the SDP stack. The Infiniband hardware is > > QLogic. > > > > First I set the ENV vars > > export LD_PRELOAD=/root/zulfi/iband/INSTALL/lib64/libsdp.so > > > > export LIBSDP_CONFIG_FILE=/home/zulfi/libsdp.conf > > > > > > The SDP config file has: > > *use sdp server * *:* > > use sdp client * *:* > > * > > Then started the socket server and did a 'sdpnetstat -San' and > > found that it listed the SDP port on which the server was listening. > > > > > On the client machine too I did the same; exported the variables, > > setup the SDP config file and on running the client './client > > port# server_machine' it gave me a "network not reachable" error. > > > > I tried to get some information about the error on the net but > > could not find any. > > > > I then checked the /proc//maps file and found that libsdp.so > > was being loaded. 
> > also: > > /root > lsmod | grep sdp > > ib_sdp 120224 3 > > > > Does QLogic support SDP applications ? Or am I missing something > > in the SDP config file or do I need to make changes to my code ? > > > > Any information on this will be a big help. > > > > Thanks, > > Zulfi > > > > > > > > > > > > > > -- > > Regs, > > Zulfi > > ------------------------------------------------------------------------ > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -- Regs, Zulfi -------------- next part -------------- An HTML attachment was scrubbed... URL: From sweitzen at cisco.com Fri Oct 5 15:08:20 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 5 Oct 2007 15:08:20 -0700 Subject: [ofa-general] Problem running SDP apps using OFED 1.2 In-Reply-To: <7778a2950710051501w27d5ecct35b95406c0d90808@mail.gmail.com> References: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com><000901c8014b$0435f880$0ca1e980$@rr.com><7778a2950709271528y4775d691ta3943848d7dba9e1@mail.gmail.com><46FF44B3.4010805@dev.mellanox.co.il> <7778a2950710051501w27d5ecct35b95406c0d90808@mail.gmail.com> Message-ID: Can you ping between the two nodes using the IPoIB IP address? 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Zulfi Imani Sent: Friday, October 05, 2007 3:01 PM To: dotanb at dev.mellanox.co.il Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Problem running SDP apps using OFED 1.2 Hi Dotan, ifconfig shows up ib0 Link encap:InfiniBand HWaddr 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr: 140.221.37.32 Bcast: 140.221.37.255 Mask:255.255.255.0 inet6 addr: fe80::211:7500:ff:d802/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:1 errors:0 dropped:5 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:68 (68.0 b) does this mean that the apps would now use IPoIB ? How do i tell when IPoIB is working and when it isnt ? Because I assume when it isnt it would default to Ethernet ? It will be great if I can get this cleared. Thanks Zulfi On 9/30/07, Dotan Barak < dotanb at dev.mellanox.co.il > wrote: Does a simple "ping" between the nodes is working? (this way you can be sure that IPoIB is working and SDP should work) Dotan Zulfi Imani wrote: > I have not tried over IPoIB, but opensm is running > > /home/zulfi > sminfo > sminfo: sm lid 1 sm guid 0x11750000ffdaf4, activity count 16220 > priority 0 state 3 SMINFO_MASTER > > I also tried a few iband utilities and they all work fine. Not able to > run any socket apps over SDP. > > Thanks > Zulfi > > On 9/27/07, *Jim Mott* < jimmott at austin.rr.com > > wrote: > > Were you able to connect IPoIB between the nodes? Are you sure > opensm was running? I am ashamed to admit that occasionally I > forget to start opensm and wonder why SDP does not connect. 
> > > > *From:* general-bounces at lists.openfabrics.org > [mailto: > general-bounces at lists.openfabrics.org > ] *On Behalf Of > *Zulfi Imani > *Sent:* Thursday, September 27, 2007 3:22 PM > *To:* general at lists.openfabrics.org > > > *Subject:* [ofa-general] Problem running SDP apps using OFED 1.2 > > > > Hi, > > I installed the OFED1.2 stack and am trying to run a simple socket > server and client over the SDP stack. The Infiniband hardware is > QLogic. > > First I set the ENV vars > export LD_PRELOAD=/root/zulfi/iband/INSTALL/lib64/libsdp.so > > export LIBSDP_CONFIG_FILE=/home/zulfi/libsdp.conf > > > The SDP config file has: > *use sdp server * *:* > use sdp client * *:* > * > Then started the socket server and did a 'sdpnetstat -San' and > found that it listed the SDP port on which the server was listening. > > On the client machine too I did the same; exported the variables, > setup the SDP config file and on running the client './client > port# server_machine' it gave me a "network not reachable" error. > > I tried to get some information about the error on the net but > could not find any. > > I then checked the /proc//maps file and found that libsdp.so > was being loaded. > also: > /root > lsmod | grep sdp > ib_sdp 120224 3 > > Does QLogic support SDP applications ? Or am I missing something > in the SDP config file or do I need to make changes to my code ? > > Any information on this will be a big help. > > Thanks, > Zulfi > > > > > > > -- > Regs, > Zulfi > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Regs, Zulfi -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From zulfiimani at gmail.com Fri Oct 5 15:18:56 2007 From: zulfiimani at gmail.com (Zulfi Imani) Date: Fri, 5 Oct 2007 16:18:56 -0600 Subject: [ofa-general] OFED libibverbs API In-Reply-To: <4706B23E.8050709@opengridcomputing.com> References: <7778a2950710051346g3ba805cejb6145564fb9478e3@mail.gmail.com> <4706B23E.8050709@opengridcomputing.com> Message-ID: <7778a2950710051518w2095564ehf43dbf367a9def8f@mail.gmail.com> Thanks Steve. Just a couple of questions. I have installed the OFED1.2 stack. You said I would find example programs. Under my installation dir > ls bin include lib lib64 mpi sbin src I do not see any subdir for example programs ? Also where can I find simple programs like file transfer using RDMA and libibverbs ? Does the "verbs.h" in the $INSTALL/include/infiniband represent the libverbs API ? I am sorry but I am just starting to program on Infiniband and am a little lost. Thanks for the help. Zulfi On 10/5/07, Steve Wise wrote: > > OFA Admins: > > It would be nice to put the man pages on-line... > > If we installed the man pages, then used man2html or something we could > point folks at that for on-line docs... > > Zulfi, if you build/install ofed-1.2.5, you can then get man pages for > the verbs and rdmacm APIs. Also there are header files and examples > that get build/installed. > > > Steve. > > > Zulfi Imani wrote: > > Hi all, > > > > I wanted to find out where I can get the libibverbs API specification > > from. I checked the openfabrics.org website but > > could not find anything immediately. 
> > > > Thanks > > Zulfi > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -- Regs, Zulfi -------------- next part -------------- An HTML attachment was scrubbed... URL: From zulfiimani at gmail.com Fri Oct 5 15:26:07 2007 From: zulfiimani at gmail.com (Zulfi Imani) Date: Fri, 5 Oct 2007 16:26:07 -0600 Subject: [ofa-general] Problem running SDP apps using OFED 1.2 In-Reply-To: References: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com> <000901c8014b$0435f880$0ca1e980$@rr.com> <7778a2950709271528y4775d691ta3943848d7dba9e1@mail.gmail.com> <46FF44B3.4010805@dev.mellanox.co.il> <7778a2950710051501w27d5ecct35b95406c0d90808@mail.gmail.com> Message-ID: <7778a2950710051526uf72bee6y30d482adcf41bff3@mail.gmail.com> For machine#1 my IPoIB interface is ib0 Link encap:InfiniBand HWaddr 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:140.221.37.32 Bcast:140.221.37.255 Mask:255.255.255.0 inet6 addr: fe80::211:7500:ff:d802/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:1 errors:0 dropped:5 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:68 (68.0 b) For machine#2 my IPoIB interface is ib0 Link encap:InfiniBand HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet6 addr: fe80::211:7500:ff:d7f2/64 Scope:Link UP BROADCAST MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:1 errors:0 dropped:21 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:68 (68.0 b) Can you please tell me what the IPoIB IP addresses are for these two machines ? 
Also I do not know the IPv4 address for ib0 of machine#2 is not showing up ? On 10/5/07, Scott Weitzenkamp (sweitzen) wrote: > > Can you ping between the two nodes using the IPoIB IP address? > > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > ------------------------------ > *From:* general-bounces at lists.openfabrics.org [mailto: > general-bounces at lists.openfabrics.org] *On Behalf Of *Zulfi Imani > *Sent:* Friday, October 05, 2007 3:01 PM > *To:* dotanb at dev.mellanox.co.il > *Cc:* general at lists.openfabrics.org > *Subject:* Re: [ofa-general] Problem running SDP apps using OFED 1.2 > > Hi Dotan, > > ifconfig shows up > > ib0 Link encap:InfiniBand HWaddr > 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet addr: 140.221.37.32 Bcast: 140.221.37.255 Mask: > 255.255.255.0 > inet6 addr: fe80::211:7500:ff:d802/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:1 errors:0 dropped:5 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:0 (0.0 b) TX bytes:68 (68.0 b) > > does this mean that the apps would now use IPoIB ? How do i tell when > IPoIB is working and when it isnt ? Because I assume when it isnt it would > default to Ethernet ? > > It will be great if I can get this cleared. > > Thanks > Zulfi > > On 9/30/07, Dotan Barak < dotanb at dev.mellanox.co.il > wrote: > > > > Does a simple "ping" between the nodes is working? > > (this way you can be sure that IPoIB is working and SDP should work) > > > > Dotan > > > > > > Zulfi Imani wrote: > > > I have not tried over IPoIB, but opensm is running > > > > > > /home/zulfi > sminfo > > > sminfo: sm lid 1 sm guid 0x11750000ffdaf4, activity count 16220 > > > priority 0 state 3 SMINFO_MASTER > > > > > > I also tried a few iband utilities and they all work fine. Not able to > > > run any socket apps over SDP. 
> > > > > > Thanks > > > Zulfi > > > > > > On 9/27/07, *Jim Mott* < jimmott at austin.rr.com > > > > wrote: > > > > > > Were you able to connect IPoIB between the nodes? Are you sure > > > opensm was running? I am ashamed to admit that occasionally I > > > forget to start opensm and wonder why SDP does not connect. > > > > > > > > > > > > *From:* general-bounces at lists.openfabrics.org > > > [mailto: > > > general-bounces at lists.openfabrics.org > > > ] *On Behalf Of > > > *Zulfi Imani > > > *Sent:* Thursday, September 27, 2007 3:22 PM > > > *To:* general at lists.openfabrics.org > > > > > > *Subject:* [ofa-general] Problem running SDP apps using OFED 1.2 > > > > > > > > > > > > Hi, > > > > > > I installed the OFED1.2 stack and am trying to run a simple socket > > > server and client over the SDP stack. The Infiniband hardware is > > > QLogic. > > > > > > First I set the ENV vars > > > export LD_PRELOAD=/root/zulfi/iband/INSTALL/lib64/libsdp.so > > > > > > export LIBSDP_CONFIG_FILE=/home/zulfi/libsdp.conf > > > > > > > > > The SDP config file has: > > > *use sdp server * *:* > > > use sdp client * *:* > > > * > > > Then started the socket server and did a 'sdpnetstat -San' and > > > found that it listed the SDP port on which the server was > > listening. > > > > > > On the client machine too I did the same; exported the variables, > > > setup the SDP config file and on running the client './client > > > port# server_machine' it gave me a "network not reachable" error. > > > > > > I tried to get some information about the error on the net but > > > could not find any. > > > > > > I then checked the /proc//maps file and found that libsdp.so > > > was being loaded. > > > also: > > > /root > lsmod | grep sdp > > > ib_sdp 120224 3 > > > > > > Does QLogic support SDP applications ? Or am I missing something > > > in the SDP config file or do I need to make changes to my code ? > > > > > > Any information on this will be a big help. 
> > > > > > Thanks, > > > Zulfi > > > > > > > > > > > > > > > > > > > > > -- > > > Regs, > > > Zulfi > > > > > ------------------------------------------------------------------------ > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > > -- > Regs, > Zulfi > > -- Regs, Zulfi -------------- next part -------------- An HTML attachment was scrubbed... URL: From akepner at sgi.com Fri Oct 5 15:36:19 2007 From: akepner at sgi.com (akepner at sgi.com) Date: Fri, 5 Oct 2007 15:36:19 -0700 Subject: [ofa-general] mpi failures on large ia64/ofed/IB clusters Message-ID: <20071005223619.GI20278@sgi.com> On "large" IB-connected ia64 clusters, I (and some customers) are seeing failures in MPI programs. This is commoner the bigger the cluster nodes are, but I've seen it with as few as 32P/node. I'm using "Mellanox Technologies MT23108 InfiniHost (rev a1)" HCAs, with firmware version 3.5.0 (but this has been seen with several firmware revisions) and OFED-1.2. For example, with 2-128P systems connected via a single IB port, using this simple MPI program: int main(int argc, char **argv) { MPI_Init(&argc, &argv); MPI_Barrier(MPI_COMM_WORLD); MPI_Finalize(); return 0; } and running it with something like: # mpirun machine1, machine2 128 a.out I see failures on >1% of runs. On one run we got this in syslog (ib_mthca's debug_level set to 1): 15:34:34 ib_mthca 0012:01:00.0: Command 21 completed with status 09 15:35:34 ib_mthca 0012:01:00.0: HW2SW_MPT failed (-16) .... (status 0x9==MTHCA_CMD_STAT_BAD_RES_STATE => problem with mpi?) or on another run: 13:57:15 ib_mthca 0005:01:00.0: Command 1a completed with status 01 13:57:15 ib_mthca 0005:01:00.0: modify QP 1->2 returnedstatus 01. .... (status 0x1==MTHCA_CMD_STAT_INTERNAL_ERR => ???) 
These are just the first debug messages logged (rebooting between each run), there are lots more, of almost every flavor. Anyone else seen anything like this? Got any suggestions for debugging? Should I be looking at MPI, or would you suspect a driver or h/w problem? Any other info I could provide that'd help to narrow things down? Thanks for any pointers. -- Arthur From rdreier at cisco.com Fri Oct 5 15:46:21 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 05 Oct 2007 15:46:21 -0700 Subject: [ofa-general] mpi failures on large ia64/ofed/IB clusters In-Reply-To: <20071005223619.GI20278@sgi.com> (akepner@sgi.com's message of "Fri, 5 Oct 2007 15:36:19 -0700") References: <20071005223619.GI20278@sgi.com> Message-ID: > On one run we got this in syslog (ib_mthca's debug_level set to 1): > > 15:34:34 ib_mthca 0012:01:00.0: Command 21 completed with status 09 > 15:35:34 ib_mthca 0012:01:00.0: HW2SW_MPT failed (-16) > .... > (status 0x9==MTHCA_CMD_STAT_BAD_RES_STATE => problem with mpi?) > > or on another run: > > 13:57:15 ib_mthca 0005:01:00.0: Command 1a completed with status 01 > 13:57:15 ib_mthca 0005:01:00.0: modify QP 1->2 returnedstatus 01. > .... > (status 0x1==MTHCA_CMD_STAT_INTERNAL_ERR => ???) > > These are just the first debug messages logged (rebooting between > each run), there are lots more, of almost every flavor. > > Anyone else seen anything like this? Got any suggestions for debugging? > Should I be looking at MPI, or would you suspect a driver or h/w > problem? Any other info I could provide that'd help to narrow things > down? Almost certainly this is a driver and/or firmware bug. MPI and userspace in general shouldn't be able to do anything that would cause this type of error. Given the semi-random nature of the error messages and the fact that having nodes with lots of CPUs means FW commands are being submitted in parallel, I have to suspect a race somewhere, possibly in firmware but possibly in the driver. 
You could try adding dev->cmd.max_cmds = 1; to the beginning of mthca_cmd_use_events() as a hack, and see if you still see problems. I don't really see anything racy in the FW command stuff, but it's possible that there's something like an mmiowb() missing somewhere (I have a hard time spotting that type of race for some reason). - R. From rdreier at cisco.com Fri Oct 5 15:51:21 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 05 Oct 2007 15:51:21 -0700 Subject: [ofa-general] mpi failures on large ia64/ofed/IB clusters In-Reply-To: (Roland Dreier's message of "Fri, 05 Oct 2007 15:46:21 -0700") References: <20071005223619.GI20278@sgi.com> Message-ID: > I don't really see anything racy in the FW command stuff, but it's > possible that there's something like an mmiowb() missing somewhere (I > have a hard time spotting that type of race for some reason). Another possibility (independent of the hack I suggested before) would be to add an mmiowb() before the mutex_unlock() in mthca_cmd_post(). I actually have a good feeling about this theory.... - R. From rdreier at cisco.com Fri Oct 5 15:56:55 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 05 Oct 2007 15:56:55 -0700 Subject: [ofa-general] Re: [PATCH 3 of 3 for-2.6.24] mlx4: implement FMRs In-Reply-To: <20070801092905.GE29259@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 1 Aug 2007 12:29:05 +0300") References: <20070801092905.GE29259@mellanox.co.il> Message-ID: Thanks, I applied cleaned-up versions of all three patches for 2.6.24. One thing I changed was to just pass an error back to the caller rather than doing BUG_ON() anywhere. It's very unfriendly to the user to crash the whole machine just because of a driver bug -- much better to try and continue so that the user sees the error and can report it. 
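The choice described in the message above (hand an error back to the caller instead of halting on a violated invariant) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `demo_map_pages` and `DEMO_MAX_PAGES` are hypothetical names and are not taken from the mlx4 FMR patches.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical limit, chosen only for this illustration. */
#define DEMO_MAX_PAGES 8

/*
 * Hypothetical helper: copy a page list into a mapping table.
 *
 * A BUG_ON(npages > DEMO_MAX_PAGES) at this point would stop the whole
 * machine because of a caller bug. Returning -EINVAL instead lets the
 * caller see the failure, log it, and carry on.
 */
int demo_map_pages(const unsigned long *pages, size_t npages,
		   unsigned long *table)
{
	size_t i;

	if (!pages || !table)
		return -EINVAL;
	if (npages > DEMO_MAX_PAGES)
		return -EINVAL;

	for (i = 0; i < npages; i++)
		table[i] = pages[i];

	return 0;
}
```

A caller would check the return value and propagate or report it, which is what makes the failure visible to the user rather than fatal.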
From or.gerlitz at gmail.com Fri Oct 5 15:58:44 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Sat, 6 Oct 2007 00:58:44 +0200 Subject: [ofa-general] Problem running SDP apps using OFED 1.2 In-Reply-To: <7778a2950710051526uf72bee6y30d482adcf41bff3@mail.gmail.com> References: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com> <000901c8014b$0435f880$0ca1e980$@rr.com> <7778a2950709271528y4775d691ta3943848d7dba9e1@mail.gmail.com> <46FF44B3.4010805@dev.mellanox.co.il> <7778a2950710051501w27d5ecct35b95406c0d90808@mail.gmail.com> <7778a2950710051526uf72bee6y30d482adcf41bff3@mail.gmail.com> Message-ID: <15ddcffd0710051558l23d3d495s926fb56e7a2c5d91@mail.gmail.com> On 10/6/07, Zulfi Imani wrote: > > For machine#1 my IPoIB interface is > > ib0 Link encap:InfiniBand HWaddr > 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet addr: 140.221.37.32 Bcast:140.221.37.255 Mask: > 255.255.255.0 > inet6 addr: fe80::211:7500:ff:d802/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:1 errors:0 dropped:5 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:0 (0.0 b) TX bytes:68 (68.0 b) > > For machine#2 my IPoIB interface is > > ib0 Link encap:InfiniBand HWaddr > 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet6 addr: fe80::211:7500:ff:d7f2/64 Scope:Link > UP BROADCAST MULTICAST MTU:65520 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:1 errors:0 dropped:21 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:0 (0.0 b) TX bytes:68 (68.0 b) > > Can you please tell me what the IPoIB IP addresses are for these two > machines ? Also I do not know the IPv4 address for ib0 of machine#2 is not > showing up ? ib0 on machine#2 is not running, but it seems that your bigger problem is lack of some essential background on TCP/IP operation, where this list is not the best place to gain it. Or. 
From rdreier at cisco.com Fri Oct 5 15:59:45 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 05 Oct 2007 15:59:45 -0700
Subject: [ofa-general] Re: [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps
In-Reply-To: <4701FC3A.4010207@voltaire.com> (Or Gerlitz's message of "Tue, 02 Oct 2007 10:07:22 +0200")
References: <15ddcffd0709270338y33976845va9f36e4050044d86@mail.gmail.com> <46FF7E8B.7010307@voltaire.com> <4701FC3A.4010207@voltaire.com>
Message-ID:

> I understand this desire... just need a little clarification from you
> re hotplug. First, as for OFED, looking on the openibd service script
> (excerpts below) installed by OFED 1.3 I see that mode and mtu are set
> "manually", that is the user sets/provides the mode and mtu params for
> the script and the script uses sysfs to configure the device. This
> does not address devices created after the service has started nor
> seem a very elegant way to do so.

I don't know that much about OFED or Red Hat-like distros. But in
Debian/Ubuntu, I know that I can use the /etc/network/interfaces file
to specify arbitrary commands to run when an interface appears. eg
from the interfaces(5) man page:

    pre-up command
        Run command before bringing the interface up. If this
        command fails then ifup aborts, refraining from marking
        the interface as configured, prints an error message,
        and exits with status 0. This behavior may change in
        the future.

Seems like the same thing should be possible for other distros without
much trouble.

 - R.
From or.gerlitz at gmail.com Fri Oct 5 16:06:38 2007
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Sat, 6 Oct 2007 01:06:38 +0200
Subject: [ofa-general] Re: [PATCH RFC v2] IB/ipoib: enable IGMP for userpsace multicast IB apps
In-Reply-To:
References: <15ddcffd0709270338y33976845va9f36e4050044d86@mail.gmail.com> <46FF7E8B.7010307@voltaire.com> <4701FC3A.4010207@voltaire.com>
Message-ID: <15ddcffd0710051606s11f480dbyf4882c257d6c0afa@mail.gmail.com>

On 10/6/07, Roland Dreier wrote:
>
> I don't know that much about OFED or Red Hat-like distros. But in
> Debian/Ubuntu, I know that I can use the /etc/network/interfaces file
> to specify arbitrary commands to run when an interface appears. eg
> from the interfaces(5) man page:
>
>     pre-up command
>         Run command before bringing the interface up. If this
>         command fails then ifup aborts, refraining from marking
>         the interface as configured, prints an error message,
>         and exits with status 0. This behavior may change in
>         the future.
>
> Seems like the same thing should be possible for other distros without
> much trouble.

OK, AFAIK under both Red Hat and SLES there is a way to install pre-up and
post-down hooks for the iftools. If this is what you were referring to by
"hot-plug", then we are on the same page, thanks.

Or.

From rdreier at cisco.com Fri Oct 5 16:10:59 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 05 Oct 2007 16:10:59 -0700
Subject: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
In-Reply-To: <000601c80525$0b661f30$ff0da8c0@amr.corp.intel.com> (Sean Hefty's message of "Tue, 2 Oct 2007 11:50:04 -0700")
References: <000401c7f632$c993e8e0$65cc180a@amr.corp.intel.com> <000001c7f6f7$074584e0$9c98070a@amr.corp.intel.com> <000601c80525$0b661f30$ff0da8c0@amr.corp.intel.com>
Message-ID:

> I tested this by simulating a slow passive side responder, and it worked as
> expected for those tests.
Using an MRA does add another MAD to the CM exchange, > which is why it is sent only after seeing a duplicate request. Alternatively, > we can take the OFED module parameter patch. What the heck, I added this for 2.6.24. If it doesn't work out we can back it out. - R. From rdreier at cisco.com Fri Oct 5 16:12:24 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 05 Oct 2007 16:12:24 -0700 Subject: [ofa-general] [PATCH] mlx4: increase permissible number of QPs per multicast group to 56 In-Reply-To: <470355F0.3030301@mellanox.co.il> (Tziporet Koren's message of "Wed, 03 Oct 2007 10:42:24 +0200") References: <200710020940.13862.jackm@dev.mellanox.co.il> <470355F0.3030301@mellanox.co.il> Message-ID: Thanks, I just applied Jack's patch and also this: commit adeeb48f21a36693fed11b318bce132571ed3679 Author: Roland Dreier Date: Fri Oct 5 16:03:44 2007 -0700 IB/mthca: Increase max number of QPs per multicast group to 56 Increase the number of QPs allowed per multicast group from 8 to 56. This allows for one QP per core on 16-core systems, which are now quite common, and allows some space for future growth. This is basically the same patch that Jack Morgenstein just supplied for mlx4. 
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index 9bae3cc..15aa32e 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -83,7 +83,7 @@ enum { MTHCA_QP_CONTEXT_SIZE = 0x200, MTHCA_RDB_ENTRY_SIZE = 0x20, MTHCA_AV_SIZE = 0x20, - MTHCA_MGM_ENTRY_SIZE = 0x40, + MTHCA_MGM_ENTRY_SIZE = 0x100, /* Arbel FW gives us these, but we need them for Tavor */ MTHCA_MPT_ENTRY_SIZE = 0x40, From rdreier at cisco.com Fri Oct 5 16:18:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 05 Oct 2007 16:18:42 -0700 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 Message-ID: Since 2.6.23 still isn't out, and I've managed to reduce my patch review backlog a bit, it's probably a good idea to give another update about what I have queued for 2.6.24 already and what I hope to get to before the merge window opens. Core: - My user_mad P_Key index support patch. Merged this, although I still owe Sasha a patch to update libraries to use this. - A fix to the user_mad 32-bit big-endian userspace 64/32 problem with the method_mask when registering agents. Merged. - Sean's QoS changes. Merged. - Sean's IB CM MRA interface changes. I merged these -- what the heck, if it breaks we can back them out. ULPs: - Pradeep's IPoIB CM support for devices that don't have SRQs. Sean started reviewing but I didn't see any updated patches. - Moni's IPoIB bonding support. Seems like we found a clean set of changes, and these will go in via another (Jeff Garzik's?) tree. - Rolf's IPoIB MGID scope changes. No review progress here. - Eli and Michael's IPoIB stateless offload (checksum offload, LSO, LRO, etc). Not much review progress here; I'll try to chip away at the series and see what we can get into 2.6.24. - Or's IPoIB/userspace multicast coexistence stuff. I think we've converged on this; I'll merge this once a final version of the patch appears. 
HW specific: - I already merged patches to enable MSI-X by default for mthca and mlx4. I hope there aren't too many systems that get hosed if a MSI-X interrupt is generated. - Jack and Michael's mlx4 FMR support. Merged. I guess the fix for running in Xen domU may need to wait for 2.6.25, but I'll see what I can do. - ehca patch queue. Merged everything I think. - Steve's mthca router mode support. No one looked at it, seems like it's at risk of missing the window. - Arthur's mthca doorbell alignment fixes. I still need to check various approaches; I'll definitely merge something for 2.6.24. - Michael's mlx4 WQE shrinking patch. May miss the window and go for 2.6.25, I'll see if I can get to it. Here are a few topics that I believe will not be ready in time for the 2.6.24 window and will need to wait for 2.6.25: - Multiple CQ event vector support. I haven't seen any discussions about how ULPs or userspace apps should decide which vector to use, and hence no progress has been made since we deferred this during the 2.6.23 merge window. - XRC. Given the length of the backlog above and the fact that a first draft of this code has not been posted yet, I don't see any way that we could have something this major ready in time. 
BOILERPLATE
===========

I keep patches in a git tree, available from

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git

There are several branches of interest in this tree:

    for-2.6.23 - changes queued for merging into the current kernel release
    for-2.6.24 - changes queued for the next merge window
    for-linus  - changes I have asked Linus to pull upstream
    for-mm     - pulled by Andrew for inclusion in -mm

I frequently rewrite history and rebase my tree, so the best way to
track it is to keep a clone of Linus's tree around and then pull a
fresh copy of my tree with

    git clone --reference /path/to/linus/tree \
        git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git

If you would like me to merge a patch, please send it to me as soon as
it is ready. Do NOT wait for the merge window to open; if your change
is not strictly a fix and you send it to me after the merge window
opens, then it will likely have to wait for the next merge window.

Please let me know if your patch is a fix that should go into the
current release or if it can wait for the next merge window; if it is
a fix, please describe the severity of the issue you are fixing, so I
can make a good judgement about which release it should go into.

Including a good changelog entry that explains what you are changing,
why you are changing it, and how your change accomplishes your goal
will greatly increase the chance of your patch being merged promptly.
Getting an independent review and a Reviewed-by: line also helps a
lot. The files Documentation/SubmittingPatches and
Documentation/SubmitChecklist in the kernel source tree also have a lot
of good advice that makes it easier for me to handle your patches.

DETAILS ======= Here is the complete list of patches I have in my for-2.6.24 branch: Ali Ayoub (1): IB/sa: Error handling thinko fix Anton Blanchard (3): IB/fmr_pool: Clean up some error messages in fmr_pool.c IB/ehca: Make output clearer by removing some debug messages IB/ehca: Export module parameters in sysfs Dotan Barak (1): mlx4_core: Use enum value GO_BIT_TIMEOUT_MSECS Eli Cohen (2): IPoIB: Fix typo to end statement with ';' instead of ',' IPoIB: Fix error path memory leak Hoang-Nam Nguyen (4): IB/ehca: Use remap_4k_pfn() to map firmware contexts to user space IB/ehca: Fix large page HW cap defines IB/ehca: Fix mem leak of firmware ctrlblock in ehca_create_srq() IB/ehca: Adjust 64-bit alignment of create QP response for userspace Jack Morgenstein (5): IB/mlx4: Display misc device information under /sys/class/infiniband/ mlx4_core: Support ICM tables in coherent memory mlx4_core: Write MTTs from CPU instead with of WRITE_MTT FW command IB/mlx4: Implement FMRs mlx4_core: Increase max number of QPs per multicast group to 56 Joachim Fenkes (11): IB/ehca: Refactor hvcall tracing IB/ehca: Print return codes as signed decimal integers IB/ehca: ehca_gen_warn() should always print IB/ehca: Add check for max #SGE to create_qp() IB/ehca: Path migration support IB/ehca: Serialize MR alloc and MR free hvCalls IB/ehca: Replace get_paca()->paca_index by the more portable raw_smp_processor_id() IB/ehca: Bump version number and change its format IB/umem: Add hugetlb flag to struct ib_umem IB/ehca: Only use MR large pages for hugetlb regions IB/ehca: Return srq_attr->max_sge in ehca_query_srq() Michael S. 
Tsirkin (2): mlx4_core: Enable MSI-X by default IB/mthca: Enable MSI-X by default Peter Oruba (1): IB/mthca: Use PCI-X/PCI-Express read control interfaces Ralph Campbell (1): IB/core: Fix handling of multicast response failures Roland Dreier (14): IPoIB: Make sure no receives are handled when stopping device IB: find_first_zero_bit() takes unsigned pointer mlx4_core: Don't free special QPs in QP number bitmap IB/mlx4: Use __set_data_seg() in mlx4_ib_post_recv() IB/ehca: Include from ehca_classes.h IB/mlx4: Fix up SRQ limit_watermark endianness IB/iser: Remove unnecessary includes mlx4_core: Change capability decoding: SRC->XRC IB/umad: Add P_Key index support IB/umad: Fix bit ordering and 32-on-64 problems on big endian systems IB/uverbs: Make ib_uverbs_release_event_file() static mlx4_core: Reserve the correct number of MTT segments mlx4_core: Fix meaning of dev->caps.reserved_mtts IB/mthca: Increase max number of QPs per multicast group to 56 Satyam Sharma (1): IB/ehca: Misc cpuinit section annotations and #ifdef cleanups Sean Hefty (7): IPoIB: Specify Traffic Class with path record queries for QoS support IB/sa: Add new QoS fields to path record RDMA/cma: Add ability to specify type of service RDMA/ucma: Allow user space to set service type IB/srp: Add QoS support through service ID IB/cm: Modify interface to send MRAs in response to duplicate messages RDMA/cma: Queue IB CM MRAs to avoid unnecessary remote retries Stefan Roscher (2): IB/ehca: Small QP userspace support IB/ehca: Support more than 4k QPs for userspace and kernelspace Steve Wise (2): RDMA/cxgb3: Make the iw_cxgb3 module parameters writable RDMA/cma: Use neigh_event_send() to start neighbour discovery From zulfiimani at gmail.com Fri Oct 5 16:25:22 2007 From: zulfiimani at gmail.com (Zulfi Imani) Date: Fri, 5 Oct 2007 17:25:22 -0600 Subject: [ofa-general] Problem running SDP apps using OFED 1.2 In-Reply-To: References: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com> 
<000901c8014b$0435f880$0ca1e980$@rr.com> <7778a2950709271528y4775d691ta3943848d7dba9e1@mail.gmail.com> <46FF44B3.4010805@dev.mellanox.co.il> <7778a2950710051501w27d5ecct35b95406c0d90808@mail.gmail.com> Message-ID: <7778a2950710051625k36e9bae6mb1a62aa1c6f0df82@mail.gmail.com> I restarted openibd and now my interfaces are up. mach#1 ib0 Link encap:InfiniBand HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:140.221.37.46 Bcast:140.221.37.255 Mask:255.255.255.0 inet6 addr: fe80::211:7500:ff:d7f2/64 Scope:Link mach#2 ib0 Link encap:InfiniBand HWaddr 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:140.221.37.32 Bcast:140.221.37.255 Mask:255.255.255.0 inet6 addr: fe80::211:7500:ff:d802/64 Scope:Link I am able to "ping 140.221.37.46" from machine#2. mach#2 > ping 140.221.37.46 PING 140.221.37.46 (140.221.37.46) 56(84) bytes of data. 64 bytes from 140.221.37.46: icmp_seq=1 ttl=64 time=2.49 ms 64 bytes from 140.221.37.46: icmp_seq=2 ttl=64 time=0.106 ms 64 bytes from 140.221.37.46: icmp_seq=3 ttl=64 time=0.098 ms --- 140.221.37.46 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2002ms rtt min/avg/max/mdev = 0.098/0.900/2.498/1.130 ms But sockets over SDP still gives me the same "Network Unreachable error". --- 140.221.37.46 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2002ms rtt min/avg/max/mdev = 0.098/0.900/2.498/1.130 ms On 10/5/07, Scott Weitzenkamp (sweitzen) wrote: > > Can you ping between the two nodes using the IPoIB IP address? 
> > Scott Weitzenkamp > SQA and Release Manager > Server Virtualization Business Unit > Cisco Systems > > > ------------------------------ > *From:* general-bounces at lists.openfabrics.org [mailto: > general-bounces at lists.openfabrics.org] *On Behalf Of *Zulfi Imani > *Sent:* Friday, October 05, 2007 3:01 PM > *To:* dotanb at dev.mellanox.co.il > *Cc:* general at lists.openfabrics.org > *Subject:* Re: [ofa-general] Problem running SDP apps using OFED 1.2 > > Hi Dotan, > > ifconfig shows up > > ib0 Link encap:InfiniBand HWaddr > 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet addr: 140.221.37.32 Bcast: 140.221.37.255 Mask: > 255.255.255.0 > inet6 addr: fe80::211:7500:ff:d802/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:1 errors:0 dropped:5 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:0 (0.0 b) TX bytes:68 (68.0 b) > > does this mean that the apps would now use IPoIB ? How do i tell when > IPoIB is working and when it isnt ? Because I assume when it isnt it would > default to Ethernet ? > > It will be great if I can get this cleared. > > Thanks > Zulfi > > On 9/30/07, Dotan Barak < dotanb at dev.mellanox.co.il > wrote: > > > > Does a simple "ping" between the nodes is working? > > (this way you can be sure that IPoIB is working and SDP should work) > > > > Dotan > > > > > > Zulfi Imani wrote: > > > I have not tried over IPoIB, but opensm is running > > > > > > /home/zulfi > sminfo > > > sminfo: sm lid 1 sm guid 0x11750000ffdaf4, activity count 16220 > > > priority 0 state 3 SMINFO_MASTER > > > > > > I also tried a few iband utilities and they all work fine. Not able to > > > run any socket apps over SDP. > > > > > > Thanks > > > Zulfi > > > > > > On 9/27/07, *Jim Mott* < jimmott at austin.rr.com > > > > wrote: > > > > > > Were you able to connect IPoIB between the nodes? Are you sure > > > opensm was running? 
I am ashamed to admit that occasionally I > > > forget to start opensm and wonder why SDP does not connect. > > > > > > > > > > > > *From:* general-bounces at lists.openfabrics.org > > > [mailto: > > > general-bounces at lists.openfabrics.org > > > ] *On Behalf Of > > > *Zulfi Imani > > > *Sent:* Thursday, September 27, 2007 3:22 PM > > > *To:* general at lists.openfabrics.org > > > > > > *Subject:* [ofa-general] Problem running SDP apps using OFED 1.2 > > > > > > > > > > > > Hi, > > > > > > I installed the OFED1.2 stack and am trying to run a simple socket > > > server and client over the SDP stack. The Infiniband hardware is > > > QLogic. > > > > > > First I set the ENV vars > > > export LD_PRELOAD=/root/zulfi/iband/INSTALL/lib64/libsdp.so > > > > > > export LIBSDP_CONFIG_FILE=/home/zulfi/libsdp.conf > > > > > > > > > The SDP config file has: > > > *use sdp server * *:* > > > use sdp client * *:* > > > * > > > Then started the socket server and did a 'sdpnetstat -San' and > > > found that it listed the SDP port on which the server was > > listening. > > > > > > On the client machine too I did the same; exported the variables, > > > setup the SDP config file and on running the client './client > > > port# server_machine' it gave me a "network not reachable" error. > > > > > > I tried to get some information about the error on the net but > > > could not find any. > > > > > > I then checked the /proc//maps file and found that libsdp.so > > > was being loaded. > > > also: > > > /root > lsmod | grep sdp > > > ib_sdp 120224 3 > > > > > > Does QLogic support SDP applications ? Or am I missing something > > > in the SDP config file or do I need to make changes to my code ? > > > > > > Any information on this will be a big help. 
> > > > > > Thanks, > > > Zulfi > > > > > > > > > > > > > > > > > > > > > -- > > > Regs, > > > Zulfi > > > > > ------------------------------------------------------------------------ > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > > -- > Regs, > Zulfi > > -- Regs, Zulfi -------------- next part -------------- An HTML attachment was scrubbed... URL: From akepner at sgi.com Fri Oct 5 17:22:23 2007 From: akepner at sgi.com (akepner at sgi.com) Date: Fri, 5 Oct 2007 17:22:23 -0700 Subject: [ofa-general] mpi failures on large ia64/ofed/IB clusters In-Reply-To: References: <20071005223619.GI20278@sgi.com> Message-ID: <20071006002223.GK20278@sgi.com> On Fri, Oct 05, 2007 at 03:51:21PM -0700, Roland Dreier wrote: > Another possibility (independent of the hack I suggested before) would > be to add an mmiowb() before the mutex_unlock() in mthca_cmd_post(). > > I actually have a good feeling about this theory.... > Genius! I have completed over 275 runs with the patch below, so we can be very confident that this has fixed things. Roland, should I submit a proper patch, or do you want to take care of this? (And thanks alot, too!) 

diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c	2007-06-21 07:38:47.000000000 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c	2007-10-05 16:04:38.926857822 -0700
@@ -288,7 +288,7 @@ static int mthca_cmd_post(struct mthca_d
 	else
 		err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier,
 					 op_modifier, op, token, event);
-
+	mmiowb();
 	mutex_unlock(&dev->cmd.hcr_mutex);
 	return err;
 }

-- 
Arthur

From sweitzen at cisco.com Fri Oct 5 18:37:05 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Fri, 5 Oct 2007 18:37:05 -0700
Subject: [ofa-general] Problem running SDP apps using OFED 1.2
In-Reply-To: <7778a2950710051625k36e9bae6mb1a62aa1c6f0df82@mail.gmail.com>
References: <7778a2950709271322t4e1ccac7u3e864a9beb058e54@mail.gmail.com> <000901c8014b$0435f880$0ca1e980$@rr.com> <7778a2950709271528y4775d691ta3943848d7dba9e1@mail.gmail.com> <46FF44B3.4010805@dev.mellanox.co.il> <7778a2950710051501w27d5ecct35b95406c0d90808@mail.gmail.com> <7778a2950710051625k36e9bae6mb1a62aa1c6f0df82@mail.gmail.com>
Message-ID:

Does "lsmod | grep sdp" report SDP is loaded on both machines?

I would then use strace with the client to watch the socket system calls
happening, to make sure the client is trying to use SDP.

Scott

________________________________
From: Zulfi Imani [mailto:zulfiimani at gmail.com]
Sent: Friday, October 05, 2007 4:25 PM
To: Scott Weitzenkamp (sweitzen)
Cc: dotanb at dev.mellanox.co.il; general at lists.openfabrics.org
Subject: Re: [ofa-general] Problem running SDP apps using OFED 1.2

I restarted openibd and now my interfaces are up.

From kliteyn at mellanox.co.il Fri Oct 5 22:18:40 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 6 Oct 2007 07:18:40 +0200
Subject: [ofa-general] nightly osm_sim report 2007-10-06:normal completion
Message-ID:

OSM Simulation Regression Summary

[Generated mail - please do NOT reply]

OpenSM binary date = 2007-10-05
OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=520 Pass=520 Fail=0

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:

From vlad at lists.openfabrics.org Sat Oct 6 02:55:57 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 6 Oct 2007 02:55:57 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20071006-0200 daily build status
Message-ID: <20071006095558.2EB31E608A5@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod
--with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed 
on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** 
[/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071006-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From swise at opengridcomputing.com Sat Oct 6 07:13:47 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 06 Oct 2007 09:13:47 -0500 Subject: [ofa-general] OFED libibverbs API In-Reply-To: <7778a2950710051518w2095564ehf43dbf367a9def8f@mail.gmail.com> References: <7778a2950710051346g3ba805cejb6145564fb9478e3@mail.gmail.com> <4706B23E.8050709@opengridcomputing.com> <7778a2950710051518w2095564ehf43dbf367a9def8f@mail.gmail.com> Message-ID: <4707981B.4030203@opengridcomputing.com> I guess to get the src for the example programs, you need to install the ofed user src rpm. It will be in the SRPMS dir of the ofed distro tree. Once you install that, you'll get a tarball of the user space src stuff. On RH it will be in /usr/src/redhat/SOURCES. Untar that and you can find examples: Verbs: ofa_user-1.2.5/src/userspace/libibverbs/examples RDMA CMA: ofa_user-1.2.5/src/userspace/librdmacm/examples Steve. Zulfi Imani wrote: > Thanks Steve. > > Just a couple of questions. I have installed the OFED1.2 stack. You said > I would find example programs. Under my installation dir > > > ls > bin include lib lib64 mpi sbin src > > I do not see any subdir for example programs ? 
> Also where can I find simple programs like file transfer using RDMA and > libibverbs ? > Does the "verbs.h" in the $INSTALL/include/infiniband represent the > libverbs API ? > > I am sorry but I am just starting to program on Infiniband and am a > little lost. > > Thanks for the help. > Zulfi > > > On 10/5/07, *Steve Wise* > wrote: > > OFA Admins: > > It would be nice to put the man pages on-line... > > If we installed the man pages, then used man2html or something we could > point folks at that for on-line docs... > > Zulfi, if you build/install ofed-1.2.5, you can then get man pages for > the verbs and rdmacm APIs. Also there are header files and examples > that get build/installed. > > > Steve. > > > Zulfi Imani wrote: > > Hi all, > > > > I wanted to find out where I can get the libibverbs API specification > > from. I checked the openfabrics.org > website but > > could not find anything immediately. > > > > Thanks > > Zulfi > > > > > > > ------------------------------------------------------------------------ > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > > -- > Regs, > Zulfi
From rdreier at cisco.com Sat Oct 6 13:40:10 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 06 Oct 2007 13:40:10 -0700 Subject: [ofa-general] mpi failures on large ia64/ofed/IB clusters In-Reply-To: <20071006002223.GK20278@sgi.com> (akepner@sgi.com's message of "Fri, 5 Oct 2007 17:22:23 -0700") References: <20071005223619.GI20278@sgi.com> <20071006002223.GK20278@sgi.com> Message-ID: > Roland, should I submit a proper patch, or do you want > to take care of this? (And thanks alot, too!) Thanks for testing... I can take care of this -- I just added the patches below to my tree (since as far as I can see, mlx4 would be susceptible to the same bug): commit 66547550601a706e2b958ea351b34d8dee066b18 Author: Roland Dreier Date: Sat Oct 6 13:35:24 2007 -0700 IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up Firmware commands are sent to the HCA by writing multiple words to a command register block. Access to this block of registers is serialized with a mutex.
However, on large SGI systems, problems were seen with multiple CPUs issuing FW commands at the same time, because the writes to the register block may be reordered within the system interconnect and reach the HCA in a different order than they were issued (even with the mutex). Fix this by adding an mmiowb() before dropping the mutex. Tested-by: Arthur Kepner Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index acc9589..6966f94 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -290,6 +290,12 @@ static int mthca_cmd_post(struct mthca_dev *dev, err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier, op_modifier, op, token, event); + /* + * Make sure that our HCR writes don't get mixed in with + * writes from another CPU starting a FW command. + */ + mmiowb(); + mutex_unlock(&dev->cmd.hcr_mutex); return err; } commit 8c2348735c721eed6f08343eed851bfbec6e5a9a Author: Roland Dreier Date: Sat Oct 6 13:39:38 2007 -0700 mlx4_core: Use mmiowb() to avoid firmware commands getting jumbled up Firmware commands are sent to the HCA by writing multiple words to a command register block. Access to this block of registers is serialized with a mutex. However, on large SGI systems writes to the register block may be reordered within the system interconnect and reach the HCA in a different order than they were issued (even with the mutex). Fix this by adding an mmiowb() before dropping the mutex. This bug was observed with real workloads with the similar FW command code in the mthca driver, and adding the mmiowb() as in commit 66547550 ("IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up") was confirmed to fix the problems, so we should add the same fix to mlx4. 
Signed-off-by: Roland Dreier diff --git a/drivers/net/mlx4/cmd.c b/drivers/net/mlx4/cmd.c index b540820..db49051 100644 --- a/drivers/net/mlx4/cmd.c +++ b/drivers/net/mlx4/cmd.c @@ -184,6 +184,13 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param, (event ? (1 << HCR_E_BIT) : 0) | (op_modifier << HCR_OPMOD_SHIFT) | op), hcr + 6); + + /* + * Make sure that our HCR writes don't get mixed in with + * writes from another CPU starting a FW command. + */ + mmiowb(); + cmd->toggle = cmd->toggle ^ 1; ret = 0; From pradeeps at linux.vnet.ibm.com Sat Oct 6 19:02:32 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Sat, 06 Oct 2007 19:02:32 -0700 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: Message-ID: <47083E38.2050005@linux.vnet.ibm.com> > > ULPs: > > - Pradeep's IPoIB CM support for devices that don't have SRQs. Sean > started reviewing but I didn't see any updated patches. > Roland, I submitted an updated patch incorporating some of Sean's comments within a day or two. Rest of comments pertained to restructuring the code and adding some additional module parameters. This would require more discussions since some of these had been already discussed previously. We had decided upon this code structure after a lot of discussions and incorporating these would be undoing some of that. We can discuss and revisit the comments after the merge. 
Pradeep From kliteyn at mellanox.co.il Sat Oct 6 22:15:37 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 7 Oct 2007 07:15:37 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-07:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-06 OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From dotanb at dev.mellanox.co.il Sat Oct 6 23:05:35 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 07 Oct 2007 08:05:35 +0200 Subject: [ofa-general] OFED libibverbs API In-Reply-To: <7778a2950710051518w2095564ehf43dbf367a9def8f@mail.gmail.com> References: <7778a2950710051346g3ba805cejb6145564fb9478e3@mail.gmail.com> <4706B23E.8050709@opengridcomputing.com> <7778a2950710051518w2095564ehf43dbf367a9def8f@mail.gmail.com> Message-ID: <4708772F.7040700@dev.mellanox.co.il> Hi. Yes, the file "verbs.h" is the API for the libibverbs. 
In the folder "$INSTALL/bin" you can find the (compiled) examples ibv_*. Dotan Zulfi Imani wrote: > Thanks Steve. > > Just a couple of questions. I have installed the OFED1.2 stack. You > said I would find example programs. Under my installation dir > > > ls > bin include lib lib64 mpi sbin src > > I do not see any subdir for example programs ? > Also where can I find simple programs like file transfer using RDMA > and libibverbs ? > Does the "verbs.h" in the $INSTALL/include/infiniband represent the > libverbs API ? > > I am sorry but I am just starting to program on Infiniband and am a > little lost. > > Thanks for the help. > Zulfi > > > On 10/5/07, *Steve Wise* > wrote: > > OFA Admins: > > It would be nice to put the man pages on-line... > > If we installed the man pages, then used man2html or something we > could > point folks at that for on-line docs... > > Zulfi, if you build/install ofed-1.2.5, you can then get man pages for > the verbs and rdmacm APIs. Also there are header files and examples > that get build/installed. > > > Steve. > > > Zulfi Imani wrote: > > Hi all, > > > > I wanted to find out where I can get the libibverbs API > specification > > from. I checked the openfabrics.org > website but > > could not find anything immediately. 
> > > > Thanks > > Zulfi > > > > > > > ------------------------------------------------------------------------ > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > > -- > Regs, > Zulfi > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From dotanb at dev.mellanox.co.il Sun Oct 7 00:30:48 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 7 Oct 2007 09:30:48 +0200 Subject: [ofa-general] [PATCH] ipoib_cm: Changed the way QP is being created in ipoib_cm_create_tx_qp Message-ID: <200710070930.48454.dotanb@dev.mellanox.co.il> Changed the way QP is being created in ipoib_cm_create_tx_qp (to be consistent with ipoib_cm_create_rx_qp) Signed-off-by: Dotan Barak --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 076a0bb..2a4269e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -813,14 +813,15 @@ static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ib_cq *cq) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_qp_init_attr attr = {}; - attr.recv_cq = priv->cq; - attr.srq = priv->cm.srq; - attr.cap.max_send_wr = ipoib_sendq_size; - attr.cap.max_send_sge = 1; - attr.sq_sig_type = IB_SIGNAL_ALL_WR; - attr.qp_type = IB_QPT_RC; - attr.send_cq = cq; + struct ib_qp_init_attr attr = { + .send_cq = cq, + .recv_cq = priv->cq, + .srq = priv->cm.srq, + 
.cap.max_send_wr = ipoib_sendq_size, + .cap.max_send_sge = 1, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_RC, + }; return ib_create_qp(priv->pd, &attr); } From eli at dev.mellanox.co.il Sun Oct 7 01:36:24 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 07 Oct 2007 10:36:24 +0200 Subject: [ofa-general] Re: [PATCH] ipoib_cm: Changed the way QP is being created in ipoib_cm_create_tx_qp In-Reply-To: <200710070930.48454.dotanb@dev.mellanox.co.il> References: <200710070930.48454.dotanb@dev.mellanox.co.il> Message-ID: <1191746184.6176.4.camel@mtls03> I would add an empty line after the initialization to delimit variable declarations from statements. Otherwise looks good to me. On Sun, 2007-10-07 at 09:30 +0200, Dotan Barak wrote: > Changed the way QP is being created in ipoib_cm_create_tx_qp > (to be consistent with ipoib_cm_create_rx_qp) > > Signed-off-by: Dotan Barak > > --- > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > index 076a0bb..2a4269e 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > @@ -813,14 +813,15 @@ static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even > static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ib_cq *cq) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > - struct ib_qp_init_attr attr = {}; > - attr.recv_cq = priv->cq; > - attr.srq = priv->cm.srq; > - attr.cap.max_send_wr = ipoib_sendq_size; > - attr.cap.max_send_sge = 1; > - attr.sq_sig_type = IB_SIGNAL_ALL_WR; > - attr.qp_type = IB_QPT_RC; > - attr.send_cq = cq; > + struct ib_qp_init_attr attr = { > + .send_cq = cq, > + .recv_cq = priv->cq, > + .srq = priv->cm.srq, > + .cap.max_send_wr = ipoib_sendq_size, > + .cap.max_send_sge = 1, > + .sq_sig_type = IB_SIGNAL_ALL_WR, > + .qp_type = IB_QPT_RC, > + }; > return ib_create_qp(priv->pd, &attr); > } > From vlad at lists.openfabrics.org Sun Oct 7 02:55:36 2007 From: 
vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 7 Oct 2007 02:55:36 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071007-0200 daily build status Message-ID: <20071007095536.30A58E608A1@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 
with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From jsquyres at cisco.com Sun Oct 7 03:37:50 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Sun, 7 Oct 2007 12:37:50 +0200 Subject: [ofa-general] Suggestions for nightly build e-mail In-Reply-To: <20071007095536.30A58E608A1@openfabrics.org> References: <20071007095536.30A58E608A1@openfabrics.org> Message-ID: <9F0DDB2E-F786-466A-A59F-614F59A222E9@cisco.com> I have two minor suggestions for the nightly automated e-mail: 1. The main thing that people care about is if/when failures occur -- this should be the focus of the e-mail. If there are failures, change the subject and/or have a BIG BOLD NOTICE at the top of the e- mail. 
The idea is to get people's attention (without requiring them to scroll down in the mail) when there are failures. 2. Please send the mail to general at lists.openfabrics.org (not openib- general at openib.org). The mailing list name changed a long, long time ago. :-) Just my $0.02... On Oct 7, 2007, at 11:55 AM, Vladimir Sokolovsky (Mellanox) wrote: > This email was generated automatically, please do not reply > > > git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git > git_branch: ofed_kernel > > Common build parameters: --with-ipoib-mod --with-sdp-mod --with- > srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod > --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds- > mod --with-cxgb3-mod --with-nes-mod > > Passed: > Passed on i686 with 2.6.15-23-server > Passed on i686 with linux-2.6.22 > Passed on i686 with linux-2.6.18 > Passed on i686 with linux-2.6.21.1 > Passed on i686 with linux-2.6.17 > Passed on i686 with linux-2.6.16 > Passed on i686 with linux-2.6.19 > Passed on i686 with linux-2.6.15 > Passed on i686 with linux-2.6.13 > Passed on i686 with linux-2.6.14 > Passed on i686 with linux-2.6.12 > Passed on x86_64 with linux-2.6.12 > Passed on x86_64 with linux-2.6.20 > Passed on powerpc with linux-2.6.13 > Passed on x86_64 with linux-2.6.16 > Passed on ppc64 with linux-2.6.16 > Passed on ppc64 with linux-2.6.18 > Passed on x86_64 with linux-2.6.18 > Passed on powerpc with linux-2.6.12 > Passed on ia64 with linux-2.6.14 > Passed on ia64 with linux-2.6.18 > Passed on powerpc with linux-2.6.15 > Passed on ia64 with linux-2.6.15 > Passed on ia64 with linux-2.6.13 > Passed on ppc64 with linux-2.6.15 > Passed on ia64 with linux-2.6.12 > Passed on ppc64 with linux-2.6.12 > Passed on ia64 with linux-2.6.17 > Passed on ia64 with linux-2.6.16 > Passed on ia64 with linux-2.6.19 > Passed on x86_64 with linux-2.6.17 > Passed on powerpc with linux-2.6.14 > Passed on x86_64 with linux-2.6.19 > Passed on ppc64 with linux-2.6.14 > Passed on 
ppc64 with linux-2.6.19 > Passed on ppc64 with linux-2.6.13 > Passed on x86_64 with linux-2.6.14 > Passed on x86_64 with linux-2.6.22 > Passed on x86_64 with linux-2.6.13 > Passed on x86_64 with linux-2.6.21.1 > Passed on ppc64 with linux-2.6.17 > Passed on x86_64 with linux-2.6.16.43-0.3-smp > Passed on x86_64 with linux-2.6.15 > Passed on x86_64 with linux-2.6.16.21-0.8-smp > Passed on ppc64 with linux-2.6.18-8.el5 > Passed on ia64 with linux-2.6.21.1 > Passed on x86_64 with linux-2.6.9-42.ELsmp > Passed on ia64 with linux-2.6.22 > Passed on x86_64 with linux-2.6.9-22.ELsmp > Passed on ia64 with linux-2.6.16.21-0.8-default > Passed on x86_64 with linux-2.6.18-8.el5 > Passed on x86_64 with linux-2.6.9-55.ELsmp > Passed on x86_64 with linux-2.6.18-1.2798.fc6 > Passed on x86_64 with linux-2.6.9-34.ELsmp > > Failed: > Build failed on powerpc with linux-2.6.18 > Log: > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of > '->' > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of > '->' > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of > '->' > make[4]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.o] Error 1 > make[3]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/ > infiniband/hw/ehca] Error 2 > make[2]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check/drivers/ > infiniband] Error 2 > make[1]: *** [_module_/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.18_powerpc_check] Error 2 > make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/ > linux-2.6.18' > make: *** [kernel] Error 2 > 
---------------------------------------------------------------------- > ------------ > Build failed on powerpc with linux-2.6.19 > Log: > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of > '->' > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of > '->' > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of > '->' > make[4]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.o] Error 1 > make[3]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/ > infiniband/hw/ehca] Error 2 > make[2]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check/drivers/ > infiniband] Error 2 > make[1]: *** [_module_/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.19_powerpc_check] Error 2 > make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/ > linux-2.6.19' > make: *** [kernel] Error 2 > ---------------------------------------------------------------------- > ------------ > Build failed on powerpc with linux-2.6.17 > Log: > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of > '->' > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of > '->' > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of > '->' > make[4]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.o] Error 1 > 
make[3]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/ > infiniband/hw/ehca] Error 2 > make[2]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check/drivers/ > infiniband] Error 2 > make[1]: *** [_module_/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.17_powerpc_check] Error 2 > make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/ > linux-2.6.17' > make: *** [kernel] Error 2 > ---------------------------------------------------------------------- > ------------ > Build failed on powerpc with linux-2.6.16 > Log: > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of > '->' > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of > '->' > /home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of > '->' > make[4]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/ > infiniband/hw/ehca/ehca_main.o] Error 1 > make[3]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/ > infiniband/hw/ehca] Error 2 > make[2]: *** [/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check/drivers/ > infiniband] Error 2 > make[1]: *** [_module_/home/vlad/tmp/ > ofa_1_3_kernel-20071007-0200_linux-2.6.16_powerpc_check] Error 2 > make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/ > linux-2.6.16' > make: *** [kernel] Error 2 > ---------------------------------------------------------------------- > ------------ > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit 
http://openib.org/mailman/listinfo/ > openib-general -- Jeff Squyres Cisco Systems From dledford at redhat.com Sun Oct 7 07:34:40 2007 From: dledford at redhat.com (Doug Ledford) Date: Sun, 07 Oct 2007 14:34:40 +0000 Subject: [ofa-general] librdmacm feature request Message-ID: <1191767680.19888.310.camel@firewall.xsintricity.com> I've been trying to write some code using librdmacm and I've run across a few shortcomings in the library. 1) When you listen for connections, the event includes a new cm_id struct attached to the listen event channel. Attempts to change this channel make the cm_id unusable (rdma_create_qp fails). This is suboptimal in situations where you want the listen channel to produce listen events only. A function such as rdma_modify_channel(cm_id, new_channel); would work to solve this. 2) When you create a new cm_id with the intent of connecting to another machine, it is again desirable to get your events related to the establishment of the connection in a separate channel from those events related to already established connections (amongst other things, if you are sharing a channel with a different thread that is responsible for tearing down connections on error, then which thread gets the ADDR_RESOLVED or ROUTE_RESOLVED events is up in the air...to make sure it gets delivered properly, I currently have the connecting thread pthread_mutex_lock the connection construct, set connection->cm_waiting = 1, then issue the rdma_resolve_route, then pthread_mutex_lock again so it deadlocks, and then other thread gets the event, checks connection->cm_waiting == 1, and if true it places the event pointer in connection->event, clears connection->cm_waiting, then pthread_mutex_unlock's the connection...how gross is that). So, using a separate event channel up until the connection is established, then calling rdma_modify_channel() would also solve this problem. 
3) The man pages on rdma_connect() and rdma_accept() aren't really clear on the role of the connection parameters struct that gets passed in. Specifically, it doesn't say whether or not the initiator_depth and responder_resources in the parm struct present in the listen event are what the other side set, or if they are already swapped to indicate the minimum/maximum that we can set on our side of the connection. Also, the initial message pointer is not detailed. When we call rdma_accept/rdma_reject, does our parm struct need to have that same pointer? Do we need to free that mem? Can we supply a new initial message and not leak the memory associated with the incoming initial message? Otherwise I haven't had any problems so far. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From swise at opengridcomputing.com Sun Oct 7 09:19:20 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 07 Oct 2007 11:19:20 -0500 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: Message-ID: <47090708.6060604@opengridcomputing.com> No mention about the iwarp port space issue? Here is the status of the current proposed patch: - needs another round of changes based on Sean's feedback - Arkady raised issues about the pain this puts on admins - it forces services like nfs-rdma, which already separates the nfs-rdma server by port number, to needlessly use a separate subnet for the rdma service. I'm at a loss as to how to proceed. Any ideas? Steve. 
Roland Dreier wrote: > Since 2.6.23 still isn't out, and I've managed to reduce my patch > review backlog a bit, it's probably a good idea to give another update > about what I have queued for 2.6.24 already and what I hope to get to > before the merge window opens. > > Core: > > - My user_mad P_Key index support patch. Merged this, although I > still owe Sasha a patch to update libraries to use this. > > - A fix to the user_mad 32-bit big-endian userspace 64/32 problem > with the method_mask when registering agents. Merged. > > - Sean's QoS changes. Merged. > > - Sean's IB CM MRA interface changes. I merged these -- what the > heck, if it breaks we can back them out. > > ULPs: > > - Pradeep's IPoIB CM support for devices that don't have SRQs. Sean > started reviewing but I didn't see any updated patches. > > - Moni's IPoIB bonding support. Seems like we found a clean set of > changes, and these will go in via another (Jeff Garzik's?) tree. > > - Rolf's IPoIB MGID scope changes. No review progress here. > > - Eli and Michael's IPoIB stateless offload (checksum offload, LSO, > LRO, etc). Not much review progress here; I'll try to chip away at > the series and see what we can get into 2.6.24. > > - Or's IPoIB/userspace multicast coexistence stuff. I think we've > converged on this; I'll merge this once a final version of the > patch appears. > > HW specific: > > - I already merged patches to enable MSI-X by default for mthca and > mlx4. I hope there aren't too many systems that get hosed if a > MSI-X interrupt is generated. > > - Jack and Michael's mlx4 FMR support. Merged. I guess the fix for > running in Xen domU may need to wait for 2.6.25, but I'll see what > I can do. > > - ehca patch queue. Merged everything I think. > > - Steve's mthca router mode support. No one looked at it, seems like > it's at risk of missing the window. > > - Arthur's mthca doorbell alignment fixes. I still need to check > various approaches; I'll definitely merge something for 2.6.24. 
> > - Michael's mlx4 WQE shrinking patch. May miss the window and go for > 2.6.25, I'll see if I can get to it. > > Here are a few topics that I believe will not be ready in time for the > 2.6.24 window and will need to wait for 2.6.25: > > - Multiple CQ event vector support. I haven't seen any discussions > about how ULPs or userspace apps should decide which vector to use, > and hence no progress has been made since we deferred this during > the 2.6.23 merge window. > > - XRC. Given the length of the backlog above and the fact that a > first draft of this code has not been posted yet, I don't see any > way that we could have something this major ready in time. > > BOILERPLATE > =========== > > I keep patches in a git tree, available from > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git > > There are several branches of interest in this tree: > > for-2.6.23 - changes queued for merging into the current kernel release > for-2.6.24 - changes queued for the next merge window > for-linus - changes I have asked Linus to pull upstream > for-mm - pulled by Andrew for inclusion in -mm > > I frequently rewrite history and rebase my tree, so the best way to > track it is to keep a clone of Linus's tree around and then pull a > fresh copy of my tree with > > git clone --reference /path/to/linus/tree \ > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git > > If you would like me to merge a patch, please send it to me as soon as > it is ready. Do NOT wait for the merge window to open; if your change > is not strictly a fix and you send it to me after the merge window > opens, then it will likely have to wait for the next merge window. > Please let me know if your patch is a fix that should go into the > current release or if it can wait for the next merge window; if it is > a fix, please describe the severity of the issue your are fixing, so I > can make a good judgement about which release it should go into. 
> > Including a good changelog entry that explains what you are changing, > why you are changing it, and how your change accomplishes your goal > will greatly increase the chance of your patch being merged promptly. > Getting an independent review and a Reviewed-by: line also helps a lot. > > The files Documentation/SubmittingPatches and Documentation/SubmitChecklist > in kernel source tree also have a lot of good advice that makes it > easier for me to handle your patches. > > DETAILS > ======= > > Here is the complete list of patches I have in my for-2.6.24 branch: > > Ali Ayoub (1): > IB/sa: Error handling thinko fix > > Anton Blanchard (3): > IB/fmr_pool: Clean up some error messages in fmr_pool.c > IB/ehca: Make output clearer by removing some debug messages > IB/ehca: Export module parameters in sysfs > > Dotan Barak (1): > mlx4_core: Use enum value GO_BIT_TIMEOUT_MSECS > > Eli Cohen (2): > IPoIB: Fix typo to end statement with ';' instead of ',' > IPoIB: Fix error path memory leak > > Hoang-Nam Nguyen (4): > IB/ehca: Use remap_4k_pfn() to map firmware contexts to user space > IB/ehca: Fix large page HW cap defines > IB/ehca: Fix mem leak of firmware ctrlblock in ehca_create_srq() > IB/ehca: Adjust 64-bit alignment of create QP response for userspace > > Jack Morgenstein (5): > IB/mlx4: Display misc device information under /sys/class/infiniband/ > mlx4_core: Support ICM tables in coherent memory > mlx4_core: Write MTTs from CPU instead with of WRITE_MTT FW command > IB/mlx4: Implement FMRs > mlx4_core: Increase max number of QPs per multicast group to 56 > > Joachim Fenkes (11): > IB/ehca: Refactor hvcall tracing > IB/ehca: Print return codes as signed decimal integers > IB/ehca: ehca_gen_warn() should always print > IB/ehca: Add check for max #SGE to create_qp() > IB/ehca: Path migration support > IB/ehca: Serialize MR alloc and MR free hvCalls > IB/ehca: Replace get_paca()->paca_index by the more portable raw_smp_processor_id() > IB/ehca: Bump version 
number and change its format > IB/umem: Add hugetlb flag to struct ib_umem > IB/ehca: Only use MR large pages for hugetlb regions > IB/ehca: Return srq_attr->max_sge in ehca_query_srq() > > Michael S. Tsirkin (2): > mlx4_core: Enable MSI-X by default > IB/mthca: Enable MSI-X by default > > Peter Oruba (1): > IB/mthca: Use PCI-X/PCI-Express read control interfaces > > Ralph Campbell (1): > IB/core: Fix handling of multicast response failures > > Roland Dreier (14): > IPoIB: Make sure no receives are handled when stopping device > IB: find_first_zero_bit() takes unsigned pointer > mlx4_core: Don't free special QPs in QP number bitmap > IB/mlx4: Use __set_data_seg() in mlx4_ib_post_recv() > IB/ehca: Include from ehca_classes.h > IB/mlx4: Fix up SRQ limit_watermark endianness > IB/iser: Remove unnecessary includes > mlx4_core: Change capability decoding: SRC->XRC > IB/umad: Add P_Key index support > IB/umad: Fix bit ordering and 32-on-64 problems on big endian systems > IB/uverbs: Make ib_uverbs_release_event_file() static > mlx4_core: Reserve the correct number of MTT segments > mlx4_core: Fix meaning of dev->caps.reserved_mtts > IB/mthca: Increase max number of QPs per multicast group to 56 > > Satyam Sharma (1): > IB/ehca: Misc cpuinit section annotations and #ifdef cleanups > > Sean Hefty (7): > IPoIB: Specify Traffic Class with path record queries for QoS support > IB/sa: Add new QoS fields to path record > RDMA/cma: Add ability to specify type of service > RDMA/ucma: Allow user space to set service type > IB/srp: Add QoS support through service ID > IB/cm: Modify interface to send MRAs in response to duplicate messages > RDMA/cma: Queue IB CM MRAs to avoid unnecessary remote retries > > Stefan Roscher (2): > IB/ehca: Small QP userspace support > IB/ehca: Support more than 4k QPs for userspace and kernelspace > > Steve Wise (2): > RDMA/cxgb3: Make the iw_cxgb3 module parameters writable > RDMA/cma: Use neigh_event_send() to start neighbour discovery > 
From hadi at cyberus.ca Sun Oct 7 11:34:53 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 07 Oct 2007 14:34:53 -0400 Subject: [ofa-general] [PATCHES] TX batching In-Reply-To: <1190569987.4256.52.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> Message-ID: <1191782093.4394.60.camel@localhost>

Please provide feedback on the code and/or architecture. Last time i posted them i received little. They are now updated to work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

What has changed since i posted last:

1) Fix a bug eyeballed by Patrick McHardy on requeue reordering.
2) Killed ->hard_batch_xmit()
3) I am going one step back and making this set of patches even simpler so i can make it easier to review. I am therefore killing dev->hard_prep_xmit() and focussing just on batching. I plan to re-introduce dev->hard_prep_xmit() but from now on i will make that a separate effort (it seems to be creating confusion in relation to the general work).

Dave, please let me know if this meets your desires to allow devices which are SG and able to compute CSUM to benefit, just in case i misunderstood.
Herbert, if you can look at at least patch 3 i will appreciate it (since it kills dev->gso_skb that you introduced).

UPCOMING PATCHES
---------------

As before: More patches to follow later if i get some feedback - i didn't want to overload people by dumping too many patches.
Most of the patches mentioned below are ready to go; some need some re-testing and others need a little porting from an earlier kernel:

- tg3 driver
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver-howto i wrote that was posted on netdev last week, as well as one that describes the architectural decisions made.

PERFORMANCE TESTING
--------------------

I started testing yesterday, but these tests take a long time so i will post results probably at the end of the day sometime and may stop running more tests and just compare batch vs non-batch results. I have optimized the kernel-config so i expect my overall performance numbers to look better than the last test results i posted for both batch and non-batch.

My system under test hardware is still a 2xdual core opteron with a couple of tg3s. A test tool generates udp traffic of different sizes for up to 60 seconds per run or a total of 30M packets. I have 4 threads each running on a specific CPU which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's udp discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming udp packets to discard port - this allows me to make sure that the receiver is not the bottleneck in the testing.

Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. Each packet size run is repeated 10 times to ensure that there are no transients. The average of all 10 runs is then computed and collected.

I do plan also to run forwarding and TCP tests in the future when the dust settles.
cheers,
jamal

From hadi at cyberus.ca Sun Oct 7 11:36:23 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 07 Oct 2007 14:36:23 -0400 Subject: [ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface In-Reply-To: <1190570317.4256.59.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> Message-ID: <1191782183.4394.62.camel@localhost>

This patch introduces the netdevice interface for batching.

cheers,
jamal

-------------- next part --------------
[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

BACKGROUND
---------

A driver dev->hard_start_xmit() has 4 typical parts:
a) packet formatting (example vlan, mss, descriptor counting etc)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interrupts, set last tx time, etc

[For code cleanliness/readability sake, regardless of this work, one should break the dev->hard_start_xmit() into those 4 functions anyways].

INTRODUCING API
---------------

With the api introduced in this patch, a driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) Remove #d from dev->hard_start_xmit() and put it in dev->hard_end_xmit() method.
2) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this)
3) #a is deferred to future work to reduce confusion (since it holds on its own).

Note: There are drivers which may not need to support either of the two approaches (example the tun driver i patched) so the methods are optional.

xmit_win variable is set by the driver to tell the core how much space it has to take on new skbs.
It is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us when it invokes netif_wake_queue how much space it has for descriptors by setting this variable. Refer to the driver howto for more details.

THEORY OF OPERATION
-------------------

1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented and GSO packets are accounted for as well.
2. Core grabs TX_LOCK
3. Core loop for all skbs: invokes driver dev->hard_start_xmit()
4. Core invokes driver dev->hard_end_xmit()

ACKNOWLEDGEMENT AND SOME HISTORY
--------------------------------

There's a lot of history and reasoning of "why batching" in a document i am writing which i may submit as a patch. Thomas Graf (who doesn't know this probably) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented in SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me.

Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller, Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and neterion folks have been outstanding in their help.
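[Illustration only, not part of the patch: the four steps under THEORY OF OPERATION can be sketched as a small userspace model. Every name here (toy_dev, toy_hard_start_xmit, toy_hard_end_xmit, toy_batch_xmit, RING_SIZE) is hypothetical, and locking and qdisc handling are left out. The sketch only shows the shape of the split: enqueue each packet without IO, then do the IO kick once per batch, never taking more than the advertised xmit_win.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are the kernel's. */
#define RING_SIZE 8

struct toy_dev {
    int ring[RING_SIZE]; /* stand-in for a DMA ring holding packet ids */
    int ring_used;       /* descriptors enqueued so far */
    int xmit_win;        /* room the "driver" advertises to the core */
    int kicks;           /* number of IO kicks issued to the "hardware" */
};

/* Parts b) and c): format and enqueue one packet, no IO here. */
static int toy_hard_start_xmit(struct toy_dev *dev, int pkt)
{
    if (dev->ring_used >= RING_SIZE)
        return -1; /* ring full: the driver over-advertised its window */
    dev->ring[dev->ring_used++] = pkt;
    dev->xmit_win = RING_SIZE - dev->ring_used;
    return 0;
}

/* Part d): a single IO kick covering everything enqueued so far. */
static void toy_hard_end_xmit(struct toy_dev *dev)
{
    dev->kicks++;
}

/*
 * Core side: take at most xmit_win packets, hand each one to the
 * driver, then kick once if anything was enqueued.
 * Returns the number of packets handed to the driver.
 */
static int toy_batch_xmit(struct toy_dev *dev, const int *pkts, int npkts)
{
    int budget;
    int sent = 0;

    assert(dev != NULL && pkts != NULL);
    budget = dev->xmit_win < npkts ? dev->xmit_win : npkts;

    while (sent < budget) {
        if (toy_hard_start_xmit(dev, pkts[sent]) != 0)
            break;
        sent++;
    }
    if (sent > 0)
        toy_hard_end_xmit(dev);
    return sent;
}
```

[In this model a caller would initialize xmit_win to RING_SIZE and call toy_batch_xmit() with an array of packet ids; packets beyond the window are simply not taken, which mirrors the stated goal of never handing the driver more than it said it could swallow.]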
Signed-off-by: Jamal Hadi Salim --- commit 0a0762e2c615a980af284e86d9729d233e1bf7f4 tree c27fec824a9e75ffbb791647bdb595c082a54990 parent 190674ff1fe0b7bddf038c2bfddf45b9c6418e2a author Jamal Hadi Salim Sun, 07 Oct 2007 08:51:10 -0400 committer Jamal Hadi Salim Sun, 07 Oct 2007 08:51:10 -0400 include/linux/netdevice.h | 11 ++++++ net/core/dev.c | 83 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 94 insertions(+), 0 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 91cd3f3..b31df5c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -467,6 +467,7 @@ struct net_device #define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */ #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ #define NETIF_F_LRO 32768 /* large receive offload */ +#define NETIF_F_BTX 65536 /* Capable of batch tx */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -595,6 +596,9 @@ struct net_device void *priv; /* pointer to private data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + void (*hard_end_xmit) (struct net_device *dev); + int xmit_win; + /* These may be needed for future network-power-down code. 
*/ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -609,6 +613,7 @@ struct net_device /* delayed register/unregister */ struct list_head todo_list; + struct sk_buff_head blist; /* device index hash chain */ struct hlist_node index_hlist; @@ -1044,6 +1049,12 @@ extern int dev_set_mac_address(struct net_device *, struct sockaddr *); extern int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev); +extern int dev_batch_xmit(struct net_device *dev); +extern int prepare_gso_skb(struct sk_buff *skb, + struct net_device *dev, + struct sk_buff_head *skbs); +extern int xmit_prepare_skb(struct sk_buff *skb, + struct net_device *dev); extern int netdev_budget; diff --git a/net/core/dev.c b/net/core/dev.c index d998646..04df3fb 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1517,6 +1517,87 @@ static int dev_gso_segment(struct sk_buff *skb) return 0; } +int prepare_gso_skb(struct sk_buff *skb, struct net_device *dev, + struct sk_buff_head *skbs) +{ + int tdq = 0; + do { + struct sk_buff *nskb = skb->next; + + skb->next = nskb->next; + nskb->next = NULL; + + /* Driver likes this packet .. 
*/ + tdq++; + __skb_queue_tail(skbs, nskb); + } while (skb->next); + skb->destructor = DEV_GSO_CB(skb)->destructor; + kfree_skb(skb); + + return tdq; +} + +int xmit_prepare_skb(struct sk_buff *skb, struct net_device *dev) +{ + struct sk_buff_head *skbs = &dev->blist; + + if (netif_needs_gso(dev, skb)) { + if (unlikely(dev_gso_segment(skb))) { + kfree_skb(skb); + return 0; + } + if (skb->next) + return prepare_gso_skb(skb, dev, skbs); + } + + __skb_queue_tail(skbs, skb); + return 1; +} + +int dev_batch_xmit(struct net_device *dev) +{ + struct sk_buff_head *skbs = &dev->blist; + int rc = NETDEV_TX_OK; + struct sk_buff *skb; + int orig_w = dev->xmit_win; + int orig_pkts = skb_queue_len(skbs); + + while ((skb = __skb_dequeue(skbs)) != NULL) { + if (!list_empty(&ptype_all)) + dev_queue_xmit_nit(skb, dev); + rc = dev->hard_start_xmit(skb, dev); + if (unlikely(rc)) + break; + /* * XXX: multiqueue may need closer srutiny.. */ + if (unlikely(netif_queue_stopped(dev) || + netif_subqueue_stopped(dev, skb->queue_mapping))) { + rc = NETDEV_TX_BUSY; + break; + } + } + + /* driver is likely buggy and lied to us on how much + * space it had. Damn you driver .. 
+ */ + if (unlikely(skb_queue_len(skbs))) { + printk(KERN_WARNING "Likely bug %s %s (%d) " + "left %d/%d window now %d, orig %d\n", + dev->name, rc?"busy":"locked", + netif_queue_stopped(dev), + skb_queue_len(skbs), + orig_pkts, + dev->xmit_win, + orig_w); + rc = NETDEV_TX_BUSY; + } + + if (orig_pkts > skb_queue_len(skbs)) + if (dev->hard_end_xmit) + dev->hard_end_xmit(dev); + + return rc; +} + int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev) { if (likely(!skb->next)) { @@ -3553,6 +3634,8 @@ int register_netdevice(struct net_device *dev) } } + dev->xmit_win = 1; + skb_queue_head_init(&dev->blist); ret = netdev_register_kobject(dev); if (ret) goto err_uninit; From hadi at cyberus.ca Sun Oct 7 11:38:09 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 07 Oct 2007 14:38:09 -0400 Subject: [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1190570409.4256.62.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> Message-ID: <1191782289.4394.64.camel@localhost> This patch adds the usage of batching within the core. cheers, jamal -------------- next part -------------- [NET_BATCH] net core use batching This patch adds the usage of batching within the core. Performance results demonstrating improvement are provided separately. I have #if-0ed some of the old functions so the patch is more readable. A future patch will remove all if-0ed content. Patrick McHardy eyeballed a bug that will cause re-ordering in case of a requeue. 
Signed-off-by: Jamal Hadi Salim --- commit cd602aa5f84fcef6359852cd99c95863eeb91015 tree f31d2dde4f138ff6789682163624bc0f8541aa77 parent 0a0762e2c615a980af284e86d9729d233e1bf7f4 author Jamal Hadi Salim Sun, 07 Oct 2007 09:13:04 -0400 committer Jamal Hadi Salim Sun, 07 Oct 2007 09:13:04 -0400 net/sched/sch_generic.c | 132 +++++++++++++++++++++++++++++++++++++++++++---- 1 files changed, 120 insertions(+), 12 deletions(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 95ae119..80ac56b 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q) return q->q.qlen; } +#if 0 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { @@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, return ret; } +#endif + +static inline int handle_dev_cpu_collision(struct net_device *dev) +{ + if (unlikely(dev->xmit_lock_owner == smp_processor_id())) { + if (net_ratelimit()) + printk(KERN_WARNING + "Dead loop on netdevice %s, fix it urgently!\n", + dev->name); + return 1; + } + __get_cpu_var(netdev_rx_stat).cpu_collision++; + return 0; +} + +static inline int +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + + struct sk_buff *skb; + + while ((skb = __skb_dequeue_tail(skbs)) != NULL) + q->ops->requeue(skb, q); + + netif_schedule(dev); + return 0; +} + +static inline int +xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + int ret = handle_dev_cpu_collision(dev); + + if (ret) { + if (!skb_queue_empty(skbs)) + skb_queue_purge(skbs); + return qdisc_qlen(q); + } + + return dev_requeue_skbs(skbs, dev, q); +} + +static int xmit_count_skbs(struct sk_buff *skb) +{ + int count = 0; + for (; skb; skb = skb->next) { + count += skb_shinfo(skb)->nr_frags; + count += 1; + } + return count; +} + +static int xmit_get_pkts(struct net_device *dev, + struct Qdisc *q, 
+ struct sk_buff_head *pktlist) +{ + struct sk_buff *skb; + int count = dev->xmit_win; + + if (count && dev->gso_skb) { + skb = dev->gso_skb; + dev->gso_skb = NULL; + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + while (count > 0) { + skb = q->dequeue(q); + if (!skb) + break; + + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + return skb_queue_len(pktlist); +} + +static int xmit_prepare_pkts(struct net_device *dev, + struct sk_buff_head *tlist) +{ + struct sk_buff *skb; + struct sk_buff_head *flist = &dev->blist; + + while ((skb = __skb_dequeue(tlist)) != NULL) + xmit_prepare_skb(skb, dev); + + return skb_queue_len(flist); +} /* * NOTE: Called under dev->queue_lock with locally disabled BH. @@ -130,22 +222,32 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, * >0 - queue is not empty. * */ -static inline int qdisc_restart(struct net_device *dev) + +static inline int qdisc_restart(struct net_device *dev, + struct sk_buff_head *tpktlist) { struct Qdisc *q = dev->qdisc; - struct sk_buff *skb; - int ret; + int ret = 0; - /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) - return 0; + /* use of tpktlist reduces the amount of time we sit + * holding the queue_lock + */ + ret = xmit_get_pkts(dev, q, tpktlist); + if (!ret) + return 0; - /* And release queue */ + /* We got em packets */ spin_unlock(&dev->queue_lock); + /* prepare to embark, no locks held moves packets + * to dev->blist + * */ + xmit_prepare_pkts(dev, tpktlist); + + /* bye packets ....*/ HARD_TX_LOCK(dev, smp_processor_id()); - ret = dev_hard_start_xmit(skb, dev); + ret = dev_batch_xmit(dev); HARD_TX_UNLOCK(dev); spin_lock(&dev->queue_lock); @@ -158,8 +260,8 @@ static inline int qdisc_restart(struct net_device *dev) break; case NETDEV_TX_LOCKED: - /* Driver try lock failed */ - ret = handle_dev_cpu_collision(skb, dev, q); + /* Driver lock failed */ + ret = xmit_islocked(&dev->blist, dev, q); break; default: @@ 
-168,7 +270,7 @@ static inline int qdisc_restart(struct net_device *dev) printk(KERN_WARNING "BUG %s code %d qlen %d\n", dev->name, ret, q->q.qlen); - ret = dev_requeue_skb(skb, dev, q); + ret = dev_requeue_skbs(&dev->blist, dev, q); break; } @@ -177,8 +279,11 @@ static inline int qdisc_restart(struct net_device *dev) void __qdisc_run(struct net_device *dev) { + struct sk_buff_head tpktlist; + skb_queue_head_init(&tpktlist); + do { - if (!qdisc_restart(dev)) + if (!qdisc_restart(dev, &tpktlist)) break; } while (!netif_queue_stopped(dev)); @@ -564,6 +669,9 @@ void dev_deactivate(struct net_device *dev) skb = dev->gso_skb; dev->gso_skb = NULL; + if (!skb_queue_empty(&dev->blist)) + skb_queue_purge(&dev->blist); + dev->xmit_win = 1; spin_unlock_bh(&dev->queue_lock); kfree_skb(skb); From hadi at cyberus.ca Sun Oct 7 11:39:49 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 07 Oct 2007 14:39:49 -0400 Subject: [ofa-general] [PATCH 3/3][NET_BATCH] kill dev->gso_skb In-Reply-To: <1190570521.4256.65.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1190570205.4256.56.camel@localhost> <1190570317.4256.59.camel@localhost> <1190570409.4256.62.camel@localhost> <1190570521.4256.65.camel@localhost> Message-ID: <1191782389.4394.66.camel@localhost> This patch removes dev->gso_skb as it is no longer necessary with batching code. cheers, jamal -------------- next part -------------- [NET_BATCH] kill dev->gso_skb The batching code does what gso used to batch at the drivers. There is no more need for gso_skb. 
If for whatever reason the requeueing is a bad idea we are going to leave packets in dev->blist (and still not need dev->gso_skb) Signed-off-by: Jamal Hadi Salim --- commit 7ebf50f0f43edd4897b88601b4133612fc36af61 tree 5d942ecebc14de6254ab3c812d542d524e148e92 parent cd602aa5f84fcef6359852cd99c95863eeb91015 author Jamal Hadi Salim Sun, 07 Oct 2007 09:30:19 -0400 committer Jamal Hadi Salim Sun, 07 Oct 2007 09:30:19 -0400 include/linux/netdevice.h | 3 --- net/sched/sch_generic.c | 12 ------------ 2 files changed, 0 insertions(+), 15 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b31df5c..4ddc6eb 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -577,9 +577,6 @@ struct net_device struct list_head qdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ - /* Partially transmitted GSO packet. */ - struct sk_buff *gso_skb; - /* ingress path synchronizer */ spinlock_t ingress_lock; struct Qdisc *qdisc_ingress; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 80ac56b..772e7fe 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev, struct sk_buff *skb; int count = dev->xmit_win; - if (count && dev->gso_skb) { - skb = dev->gso_skb; - dev->gso_skb = NULL; - count -= xmit_count_skbs(skb); - __skb_queue_tail(pktlist, skb); - } - while (count > 0) { skb = q->dequeue(q); if (!skb) @@ -659,7 +652,6 @@ void dev_activate(struct net_device *dev) void dev_deactivate(struct net_device *dev) { struct Qdisc *qdisc; - struct sk_buff *skb; spin_lock_bh(&dev->queue_lock); qdisc = dev->qdisc; @@ -667,15 +659,11 @@ void dev_deactivate(struct net_device *dev) qdisc_reset(qdisc); - skb = dev->gso_skb; - dev->gso_skb = NULL; if (!skb_queue_empty(&dev->blist)) skb_queue_purge(&dev->blist); dev->xmit_win = 1; spin_unlock_bh(&dev->queue_lock); - kfree_skb(skb); - dev_watchdog_down(dev); /* Wait for outstanding 
dev_queue_xmit calls. */ From tziporet at dev.mellanox.co.il Sun Oct 7 14:32:42 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 07 Oct 2007 23:32:42 +0200 Subject: [ofa-general] mpi failures on large ia64/ofed/IB clusters In-Reply-To: References: <20071005223619.GI20278@sgi.com> <20071006002223.GK20278@sgi.com> Message-ID: <4709507A.8090308@mellanox.co.il> Roland Dreier wrote: > Thanks for testing... I can take care of this -- I just added the > patches below to my tree (since as far as I can see, mlx4 would be > susceptible to the same bug): > > > Roland - is this for 2.6.23 or 24? Tziporet From tziporet at dev.mellanox.co.il Sun Oct 7 14:37:14 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 07 Oct 2007 23:37:14 +0200 Subject: [ofa-general] OFED libibverbs API In-Reply-To: <4706B23E.8050709@opengridcomputing.com> References: <7778a2950710051346g3ba805cejb6145564fb9478e3@mail.gmail.com> <4706B23E.8050709@opengridcomputing.com> Message-ID: <4709518A.20707@mellanox.co.il> Steve Wise wrote: > OFA Admins: > > It would be nice to put the man pages on-line... > > If we installed the man pages, then used man2html or something we > could point folks at that for on-line docs... > > Zulfi, if you build/install ofed-1.2.5, you can then get man pages for > the verbs and rdmacm APIs. Also there are header files and examples > that get build/installed. > > Jeff, Can you take care for this? Thanks, Tziporet From hadi at cyberus.ca Sun Oct 7 14:49:03 2007 From: hadi at cyberus.ca (jamal) Date: Sun, 07 Oct 2007 17:49:03 -0400 Subject: [ofa-general] NET_BATCH: some results Message-ID: <1191793743.4352.13.camel@localhost> It seems prettier to just draw graphs and since this one is a small file; here it is attached. The graph demos a patched net-2.6.24 vs a plain net-2.6.24 kernel with a udp app that sends on 4 CPUs as fast as the lower layers would allow it. Refer to my earlier description of the test setup etc. 
As i noted earlier on, for this hardware at about 200B or so, we approach wire speed, so the app is mostly idle above that as the link becomes the bottleneck; example it is > 85% idle on 512B and > 90% idle on 1024B. This is so for either batch or non-batch. So the differentiation is really in the smaller sized packets. Enjoy! cheers, jamal -------------- next part -------------- A non-text attachment was scrubbed... Name: batch-pps.pdf Type: application/pdf Size: 12238 bytes Desc: not available URL: From krkumar2 at in.ibm.com Sun Oct 7 22:03:27 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 8 Oct 2007 10:33:27 +0530 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191244900.4378.3.camel@localhost> Message-ID: > jamal wrote: > > > > + while ((skb = __skb_dequeue(skbs)) != NULL) > > > + q->ops->requeue(skb, q); > > > > > > ->requeue queues at the head, so this looks like it would reverse > > the order of the skbs. > > Excellent catch! thanks; i will fix. > > As a side note: Any batching driver should _never_ have to requeue; if > it does it is buggy. And the non-batching ones if they ever requeue will > be a single packet, so not much reordering. On the contrary, batching LLTX drivers (if that is not ruled out) will very often requeue resulting in heavy reordering. Fix looks good though. 
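To make the reordering concern above concrete, here is a minimal userspace sketch in C. It is not the kernel's sk_buff_head API; the toy_queue type and all toy_* and requeue_* names are hypothetical and exist only for illustration. It shows that draining a batch head-first into a requeue-at-head operation reverses the packets, and that draining it tail-first (one possible fix, assumed here rather than taken from the patch) keeps the original order.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QCAP 16

/* Toy queue of non-negative packet ids; index 0 is the head.
 * This only models ordering, it is not struct sk_buff_head. */
struct toy_queue {
    int pkts[TOY_QCAP];
    size_t len;
};

/* Insert at the head, which is what ->requeue does for a qdisc. */
static void toy_requeue_head(struct toy_queue *q, int pkt)
{
    assert(q->len < TOY_QCAP);
    for (size_t i = q->len; i > 0; i--)
        q->pkts[i] = q->pkts[i - 1];
    q->pkts[0] = pkt;
    q->len++;
}

/* Remove from the head; returns -1 when the queue is empty. */
static int toy_dequeue_head(struct toy_queue *q)
{
    if (q->len == 0)
        return -1;
    int pkt = q->pkts[0];
    for (size_t i = 1; i < q->len; i++)
        q->pkts[i - 1] = q->pkts[i];
    q->len--;
    return pkt;
}

/* Remove from the tail; returns -1 when the queue is empty. */
static int toy_dequeue_tail(struct toy_queue *q)
{
    if (q->len == 0)
        return -1;
    return q->pkts[--q->len];
}

/* Pattern from the quoted loop: head-first drain, so the batch
 * ends up reversed at the front of q. */
static void requeue_head_first(struct toy_queue *batch, struct toy_queue *q)
{
    int pkt;
    while ((pkt = toy_dequeue_head(batch)) != -1)
        toy_requeue_head(q, pkt);
}

/* Assumed fix: tail-first drain, so the batch keeps its order. */
static void requeue_tail_first(struct toy_queue *batch, struct toy_queue *q)
{
    int pkt;
    while ((pkt = toy_dequeue_tail(batch)) != -1)
        toy_requeue_head(q, pkt);
}
```

With a batch of 1, 2, 3 and a queue already holding 4, 5, the head-first variate leaves the queue as 3, 2, 1, 4, 5, while the tail-first variant leaves it as 1, 2, 3, 4, 5.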
- KK From kliteyn at mellanox.co.il Sun Oct 7 22:06:46 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 8 Oct 2007 07:06:46 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-08:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-07 OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From davem at davemloft.net Sun Oct 7 21:51:24 2007 From: davem at davemloft.net (David Miller) Date: Sun, 07 Oct 2007 21:51:24 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <1190677099.4264.37.camel@localhost> References: <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> Message-ID: <20071007.215124.85709188.davem@davemloft.net> From: jamal Date: Mon, 24 Sep 2007 19:38:19 -0400 > How is the policy to define the qdisc queues locked/mapped to tx rings? 
For these high performance 10Gbit cards it's a load balancing function, really, as all of the transmit queues go out to the same physical port so you could: 1) Load balance on CPU number. 2) Load balance on "flow" 3) Load balance on destination MAC etc. etc. etc. It's something that really sits logically between the qdisc and the card, not something that is a qdisc thing. In some ways it's similar to bonding, but using anything similar to bonding's infrastructure (stacking devices) is way overkill for this. And then we have the virtualization network devices where the queue selection has to be made precisely, in order for the packet to reach the proper destination, rather than a performance improvement. It is also a situation where the TX queue selection is something to be made between qdisc activity and hitting the device. I think we will initially have to live with taking the centralized qdisc lock for the device, get in and out of that as fast as possible, then only take the TX queue lock of the queue selected. After we get things that far we can try to find some clever lockless algorithm for handling the qdisc to get rid of that hot spot. These queue selection schemes want a common piece of generic code. A set of load balancing algorithms, a "select TX queue by MAC with a default fallback on no match" for virtualization, and interfaces for both drivers and userspace to change the queue selection scheme. From jackm at dev.mellanox.co.il Sun Oct 7 23:38:12 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 8 Oct 2007 08:38:12 +0200 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: Message-ID: <200710080838.12778.jackm@dev.mellanox.co.il> On Saturday 06 October 2007 01:18, Roland Dreier wrote: >  - XRC.  Given the length of the backlog above and the fact that a >    first draft of this code has not been posted yet, I don't see any >    way that we could have something this major ready in time. 
> I posted the first draft patch set to the OpenFabrics list on September 18: [ofa-general] [PATCH 0 of 5] XRC implementation patches (libibverbs, libmlx4, core, mlx4) Jack Morgenstein [ofa-general] [PATCH 1 of 5] libibverbs: XRC implementation Jack Morgenstein [ofa-general] [PATCH 2 of 5] libmlx4: XRC implementation Jack Morgenstein [ofa-general] [PATCH 3 of 5] core: XRC implementation for fd = -1 when opening an xrc domain Jack Morgenstein [ofa-general] [PATCH 4 of 5] core: XRC implementation -- add support for working with file descriptors Jack Morgenstein [ofa-general] [PATCH 5 of 5] mlx4: XRC implementation Jack Morgenstein The above patch set implements XRC for userspace verbs layer and below. The Kernel-space verbs implementation (1 more patch) is finished, but as yet untested. - Jack From kliteyn at dev.mellanox.co.il Mon Oct 8 01:07:55 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 08 Oct 2007 10:07:55 +0200 Subject: [ofa-general] OpenSM prints guids twice Message-ID: <4709E55B.8070901@dev.mellanox.co.il> Hi Sasha, I noticed the following problem a while ago - when the whole duplicated guids and re-reading files mails were running, but never had a chance to dig deeper. Anyway, sometimes OpenSM 'sees' the same HCA ports twice. For instance, when I run "opensm -V" on a two-port HCA with a switch in between, OpenSM prints the following: ======================================================================================================= Vendor : Ty : # : Sta : LID : LMC : MTU : LWA : LSA : Port GUID : Neighbor Port (Port #) Flextronics : SW : 00 : : 0003 : 0 : : : : 000b8cffff002037 : Flextronics : SW : 01 : DWN : : : ??? : ??? : ??? : 000b8cffff002037 : Flextronics : SW : 02 : DWN : : : ??? : ??? : ??? : 000b8cffff002037 : Flextronics : SW : 03 : DWN : : : ??? : ??? : ??? : 000b8cffff002037 : Flextronics : SW : 04 : DWN : : : ??? : ??? : ??? 
: 000b8cffff002037 : Flextronics : SW : 05 : ACT : : : 2048 : 4x : 2.5 : 000b8cffff002037 : 0002c902000017a2 (02) Flextronics : SW : 06 : DWN : : : ??? : ??? : ??? : 000b8cffff002037 : Flextronics : SW : 07 : DWN : : : ??? : ??? : ??? : 000b8cffff002037 : Flextronics : SW : 08 : ACT : : : 2048 : 4x : 2.5 : 000b8cffff002037 : 0002c902000017a1 (01) ------------------------------------------------------------------------------------------------------ Mellanox : CA : 01 : ACT : 0001 : 0 : 2048 : 4x : 2.5 * 0002c902000017a1 * 000b8cffff002037 (08) Mellanox : CA : 02 : ACT : 0002 : 0 : 2048 : 4x : 2.5 : 0002c902000017a2 : 000b8cffff002037 (05) ------------------------------------------------------------------------------------------------------ Mellanox : CA : 01 : ACT : 0001 : 0 : 2048 : 4x : 2.5 * 0002c902000017a1 * 000b8cffff002037 (08) Mellanox : CA : 02 : ACT : 0002 : 0 : 2048 : 4x : 2.5 : 0002c902000017a2 : 000b8cffff002037 (05) ------------------------------------------------------------------------------------------------------ The easiest way to reproduce it is to run it on a single HCA with two ports connected directly. It doesn't really disturbs anything (except for one of my tests that counts some records), but anyway, any idea what's causing it? 
thanks -- Yevgeny From ogerlitz at voltaire.com Mon Oct 8 01:13:00 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 8 Oct 2007 10:13:00 +0200 (IST) Subject: [ofa-general] [PATCH v3 for 2.6.24] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: References: Message-ID: changes from v2 - http://lists.openfabrics.org/pipermail/general/2007-September/040995.html - removed the module param, as hot-plug mechanisms etc can serve for persistent setting - replaced strcmp usage with simple_strtoul in the sysfs callback that sets the value changes from v1 - http://lists.openfabrics.org/pipermail/general/2007-September/040250.html - added module param to control the umcast bit in the device priv flags - changed the umcast bit name to IPOIB_FLAG_ADMIN_UMCAST_ALLOWED - the sysfs attribute has now values 0 and 1 instead of "allowed" and "disallowed" please review and consider for merge to 2.6.24 ----- The kernel IB stack allows (through the RDMA CM) user space multicast applications to interoperate with IP based apps optionally running at a different IP subnet. To support this inter-op for the case where the receiving party resides at the IB side, there is a need to handle IGMP (reports/queries) else the local IP router would not forward multicast traffic towards the IB network. This patch does a lookup on the database used for multicast reference counting and enhances IPoIB to ignore multicast group which is already handled by user space, all this under a per device policy flag. That is when the policy flag allows it, IPoIB will not join and attach its QP to a multicast group which has an entry on the database. For each IPoIB device, the /sys/class/net/$dev/umcast attribute controls the policy flag where the default value is being off (zero). The flag can be read and set/unset through sysfs. 
Signed-off-by: Or Gerlitz Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-10-07 14:34:14.000000000 +0200 +++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-10-07 14:41:22.000000000 +0200 @@ -783,6 +783,7 @@ void ipoib_mcast_restart_task(struct wor struct ipoib_mcast *mcast, *tmcast; LIST_HEAD(remove_list); unsigned long flags; + struct ib_sa_mcmember_rec rec; ipoib_dbg_mcast(priv, "restarting multicast task\n"); @@ -816,6 +817,15 @@ void ipoib_mcast_restart_task(struct wor if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { struct ipoib_mcast *nmcast; + /* ignore group which is directly joined by user space */ + if (test_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags) && + !ib_sa_get_mcmember_rec(priv->ca, priv->port, &mgid, &rec)) + { + ipoib_dbg_mcast(priv, "ignoring multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + continue; + } + /* Not found or send-only group, let's add a new entry */ ipoib_dbg_mcast(priv, "adding multicast entry for mgid " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-07 14:34:14.000000000 +0200 +++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-07 14:41:22.000000000 +0200 @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_ADMIN_UMCAST_ALLOWED = 11, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -364,6 +365,7 @@ static inline void ipoib_put_ah(struct i int ipoib_open(struct net_device *dev); int ipoib_add_pkey_attr(struct net_device *dev); +int ipoib_add_umcast_attr(struct net_device *dev); void ipoib_send(struct net_device *dev, struct sk_buff *skb, 
struct ipoib_ah *address, u32 qpn); Index: linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-07 14:34:14.000000000 +0200 +++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-07 15:01:07.000000000 +0200 @@ -1017,6 +1017,45 @@ static ssize_t show_pkey(struct device * } static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); +static ssize_t show_umcast(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); + + if (test_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags)) + return sprintf(buf, "1\n"); + else + return sprintf(buf, "0\n"); +} + +static ssize_t set_umcast(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); + unsigned long umcast_val = simple_strtoul(buf, NULL, 0); + + if (umcast_val > 0) { + set_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags); + ipoib_warn(priv, "ignoring multicast groups joined directly " + "by user space\n"); + return count; + } + + if (!umcast_val) { + clear_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags); + return count; + } + + return -EINVAL; +} +static DEVICE_ATTR(umcast, S_IWUSR | S_IRUGO, show_umcast, set_umcast); + +int ipoib_add_umcast_attr(struct net_device *dev) +{ + return device_create_file(&dev->dev, &dev_attr_umcast); +} + static ssize_t create_child(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) @@ -1134,6 +1173,8 @@ static struct net_device *ipoib_add_port goto sysfs_failed; if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; + if (ipoib_add_umcast_attr(priv->dev)) + goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_create_child)) goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_delete_child)) Index: 
linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- linux-2.6.23-rc5.orig/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2007-10-07 14:34:14.000000000 +0200 +++ linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2007-10-07 14:41:22.000000000 +0200 @@ -119,6 +119,8 @@ int ipoib_vlan_add(struct net_device *pd goto sysfs_failed; if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; + if (ipoib_add_umcast_attr(priv->dev)) + goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_parent)) goto sysfs_failed; From krkumar2 at in.ibm.com Mon Oct 8 02:59:02 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 8 Oct 2007 15:29:02 +0530 Subject: [ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface In-Reply-To: <1191782183.4394.62.camel@localhost> Message-ID: Hi Jamal, If you don't mind, I am trying to run your approach vs mine to get some results for comparison. For starters, I am having issues with iperf when using your infrastructure code with my IPoIB driver - about 100MB is sent and then everything stops for some reason. The changes in the IPoIB driver that I made to support batching is to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and increase xmit_win on every xmit completion. Is there anything else that is required from the driver? thanks, - KK J Hadi Salim wrote on 10/08/2007 12:06:23 AM: > This patch introduces the netdevice interface for batching. 
> > cheers, > jamal > > > [attachment "oct07-p1of3" deleted by Krishna Kumar2/India/IBM] From vlad at lists.openfabrics.org Mon Oct 8 03:18:35 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 8 Oct 2007 03:18:35 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071008-0200 daily build status Message-ID: <20071008101835.B3018E60843@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.13 Passed on ppc64 with 
linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: 
/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 Log: /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071008-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From ogerlitz at voltaire.com Mon Oct 8 03:31:19 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 08 Oct 2007 12:31:19 +0200 Subject: [ofa-general] [PATCH 13 of 17]: add LRO support In-Reply-To: <1189526095.13053.123.camel@mtls03> References: <1189526095.13053.123.camel@mtls03> Message-ID: <470A06F7.3090602@voltaire.com> Eli Cohen wrote: > Add Large Receive Offload support to IPOIB > > Reduce overhead incurred by handling many small packets > by aggregating SKBs related to the same stream and passing > them up. 
This patch is based on the work done for MTNIC > by Liran Liss Hi Eli, Back in April, a user with the configuration A --- 10g --- B --- IB --- C, where node B acts as an IP router with one 10g interface and one IB interface, reported a severe bandwidth problem. It was traced to the 10g driver's LRO mechanism, which does not work under a forwarding scheme; see the email/thread http://lists.openfabrics.org/pipermail/general/2007-April/035322.html My question is, does the suggested LRO code need to be disabled when the node does forwarding? Indeed you have removed the LSO and LRO patches from the stateless offload patch set posted to the upstream kernel, but they do exist in OFED 1.3. Tziporet - I am quite worried about distributing, with OFED 1.3, ipoib changes (namely LSO and LRO support) which were never reviewed by the community and which, as I understand, are not planned for review towards 2.6.24. What is your thinking on the matter? Or. > -----Original Message----- > From: general-bounces at lists.openfabrics.org On Behalf Of David Miller > Sent: Saturday, April 28, 2007 2:40 AM > To: rick.jones2 at hp.com > Cc: lawver1 at llnl.gov; netdev at vger.kernel.org; mst at dev.mellanox.co.il; general at lists.openfabrics.org > Subject: Re: [ofa-general] Re: IPoIB forwarding > > From: Rick Jones > Date: Fri, 27 Apr 2007 16:37:49 -0700 > >> Large Receive Offload (LRO) is enabled by default. This will >> interfere with forwarding TCP traffic. If you plan to forward TCP >> traffic (using the host with the Myri10GE NIC as a router or bridge), >> you must disable LRO. To disable LRO, load the myri10ge driver with >> myri10ge_lro set to 0: > > LRO should be disabled by default if the driver does this. This is a major and unacceptable bug. > > Thanks for pointing this out Rick. 
From eli at mellanox.co.il Mon Oct 8 05:31:01 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 08 Oct 2007 14:31:01 +0200 Subject: [ofa-general] [PATCH 13 of 17]: add LRO support In-Reply-To: <470A06F7.3090602@voltaire.com> References: <1189526095.13053.123.camel@mtls03> <470A06F7.3090602@voltaire.com> Message-ID: <1191846661.7337.4.camel@mtls03> > Hi Eli, > > Back on April a user having the configuration > > A --- 10g --- B --- IB --- C > > where node B acts as an IP router having one 10g interface > and one IB interface reported on a sever bandwidth problem > which was resolved to be related to the 10g driver have LRO mechanism > which is not operative under forwarding scheme, see the email/thread > http://lists.openfabrics.org/pipermail/general/2007-April/035322.html > > My question is, does the suggested LRO code need to be disabled when the > node does forwarding? 
> I did not test such a setup with a host operating as a router between ipoib and Ethernet networks. Once I do this I will evaluate if there is a problem and possibly add facilities to disable LRO (probably via ethtool). > Indeed you have removed the LSO, LRO patches from the stateless offload > patch set posting to the upstream kernel, but they do exist in OFED 1.3 > > Tziporet - I am quite worried from distributing with OFED 1.3 ipoib > changes (namely LSO and LRO support) which were never reviewed by the > community and that I understand are not planned for review towards 2.6.24. > > What is your thinking on the matter? > > Or. > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org On Behalf Of David Miller > > Sent: Saturday, April 28, 2007 2:40 AM > > To: rick.jones2 at hp.com > > Cc: lawver1 at llnl.gov; netdev at vger.kernel.org; mst at dev.mellanox.co.il; general at lists.openfabrics.org > > Subject: Re: [ofa-general] Re: IPoIB forwarding > > > > From: Rick Jones > > Date: Fri, 27 Apr 2007 16:37:49 -0700 > > > >> Large Receive Offload (LRO) is enabled by default. This will > >> interfere with forwarding TCP traffic. If you plan to forward TCP > >> traffic (using the host with the Myri10GE NIC as a router or bridge), > >> you must disable LRO. To disable LRO, load the myri10ge driver with > >> myri10ge_lro set to 0: > > > > LRO should be disabled by default if the driver does this. This is a major and unacceptable bug. > > > > Thanks for pointing this out Rick. 
> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eli at dev.mellanox.co.il Mon Oct 8 05:48:23 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 08 Oct 2007 14:48:23 +0200 Subject: [ofa-general] [PATCH 13 of 17]: add LRO support In-Reply-To: <1191846661.7337.4.camel@mtls03> References: <1189526095.13053.123.camel@mtls03> <470A06F7.3090602@voltaire.com> <1191846661.7337.4.camel@mtls03> Message-ID: <1191847703.7337.10.camel@mtls03> > > > > My question is, does the suggested LRO code need to be disabled when the > > node does forwarding? > > > I did not test such a setup with a host operating as a router between > ipoib and Ethernet networks. Once I do this I will evaluate if there is > a problem and possibly add facilities to disable LRO (probably via > ethtool). > Thinking about this I probably have to add the means to disable LRO for those hosts which so IP forwarding. I will send a modified patch. From ogerlitz at voltaire.com Mon Oct 8 05:51:23 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 08 Oct 2007 14:51:23 +0200 Subject: [ofa-general] [PATCH 13 of 17]: add LRO support In-Reply-To: <1191846661.7337.4.camel@mtls03> References: <1189526095.13053.123.camel@mtls03> <470A06F7.3090602@voltaire.com> <1191846661.7337.4.camel@mtls03> Message-ID: <470A27CB.9040403@voltaire.com> Eli Cohen wrote: >> My question is, does the suggested LRO code need to be disabled when the >> node does forwarding? > I did not test such a setup with a host operating as a router between > ipoib and Ethernet networks. Once I do this I will evaluate if there is > a problem and possibly add facilities to disable LRO (probably via > ethtool). Since you have posted the patch, I am asking you if it has any negative influence on packet forwarding. 
I am not asking you to test it or whether you tested it with forwarding. Or. From ogerlitz at voltaire.com Mon Oct 8 05:52:09 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 08 Oct 2007 14:52:09 +0200 Subject: [ofa-general] [PATCH 13 of 17]: add LRO support In-Reply-To: <1191847703.7337.10.camel@mtls03> References: <1189526095.13053.123.camel@mtls03> <470A06F7.3090602@voltaire.com> <1191846661.7337.4.camel@mtls03> <1191847703.7337.10.camel@mtls03> Message-ID: <470A27F9.3060605@voltaire.com> Eli Cohen wrote: >>> My question is, does the suggested LRO code need to be disabled when the >>> node does forwarding? > Thinking about this I probably have to add the means to disable LRO for > those hosts which so IP forwarding. I will send a modified patch. why, can you explain what is the problem with doing LRO with forwarding? Or. From johnpol at 2ka.mipt.ru Mon Oct 8 05:51:25 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Mon, 8 Oct 2007 16:51:25 +0400 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <1191782093.4394.60.camel@localhost> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1191782093.4394.60.camel@localhost> Message-ID: <20071008125125.GA31456@2ka.mipt.ru> Hi Jamal. On Sun, Oct 07, 2007 at 02:34:53PM -0400, jamal (hadi at cyberus.ca) wrote: > > Please provide feedback on the code and/or architecture. > Last time i posted them i received little. They are now updated to > work with the latest net-2.6.24 from a few hours ago. > > Patch 1: Introduces batching interface > Patch 2: Core uses batching interface > Patch 3: get rid of dev->gso_skb it looks like you and Krishna use the same requeueing methods - get one from qdisk, queue it into blist, get next from qdisk, queue it, eventually start transmit, where you dequeue it one-by-one and send (or prepare and commit). 
This is not the 100% optimal approach, but if you proved it does not hurt usual network processing, it is ok. Number of comments dusted to very small - that's a sign, but I'm a bit lost - did you and Krishna create the competing approaches, or they can co-exist together, in the former case I doubt you can push, until all problematic places are resolved, in the latter case, this is probably ready. -- Evgeniy Polyakov From hadi at cyberus.ca Mon Oct 8 06:17:24 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 09:17:24 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: References: Message-ID: <1191849444.4352.29.camel@localhost> On Mon, 2007-08-10 at 10:33 +0530, Krishna Kumar2 wrote: > > As a side note: Any batching driver should _never_ have to requeue; if > > it does it is buggy. And the non-batching ones if they ever requeue will > > be a single packet, so not much reordering. > > On the contrary, batching LLTX drivers (if that is not ruled out) will very > often requeue resulting in heavy reordering. Fix looks good though. Two things: one, LLTX is deprecated (I think i saw a patch which says no more new drivers should do LLTX) and i plan if nobody else does to kill LLTX in e1000 RSN. So for that reason i removed all code that existed to support LLTX. two, there should _never_ be any requeueing even if LLTX in the previous patches when i supported them; if there is, it is a bug. This is because we dont send more than what the driver asked for via xmit_win. So if it asked for more than it can handle, that is a bug. If its available space changes while we are sending to it, that too is a bug. 
cheers, jamal From hadi at cyberus.ca Mon Oct 8 06:34:50 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 09:34:50 -0400 Subject: [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <20071007.215124.85709188.davem@davemloft.net> References: <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> <20071007.215124.85709188.davem@davemloft.net> Message-ID: <1191850490.4352.41.camel@localhost> On Sun, 2007-07-10 at 21:51 -0700, David Miller wrote: > For these high performance 10Gbit cards it's a load balancing > function, really, as all of the transmit queues go out to the same > physical port so you could: > > 1) Load balance on CPU number. > 2) Load balance on "flow" > 3) Load balance on destination MAC > > etc. etc. etc. The brain-block i am having is the parallelization aspect of it. Whatever scheme it is - it needs to ensure the scheduler works as expected. For example, if it was a strict prio scheduler i would expect that whatever goes out is always high priority first and never ever allow a low prio packet out at any time theres something high prio needing to go out. If i have the two priorities running on two cpus, then i cant guarantee that effect. IOW, i see the scheduler/qdisc level as not being split across parallel cpus. Do i make any sense? The rest of my understanding hinges on the above, so let me stop here. cheers, jamal From hadi at cyberus.ca Mon Oct 8 06:49:14 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 09:49:14 -0400 Subject: [ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface In-Reply-To: References: Message-ID: <1191851354.4352.57.camel@localhost> On Mon, 2007-08-10 at 15:29 +0530, Krishna Kumar2 wrote: > Hi Jamal, > > If you don't mind, I am trying to run your approach vs mine to get some > results for comparison. Please provide an analysis when you get the results. IOW, explain why one vs the other get different results. 
> For starters, I am having issues with iperf when using your infrastructure > code with > my IPoIB driver - about 100MB is sent and then everything stops for some > reason. I havent tested with iperf in a while. Can you post the netstat on both sides when the driver stops? It does sound like a driver issue to me. > The changes in the IPoIB driver that I made to support batching is to set > BTX, set > xmit_win, and dynamically reduce xmit_win on every xmit > and increase xmit_win on every xmit completion. From driver howto: --- This variable should be set during xmit path shutdown(netif_stop), wakeup(netif_wake) and ->hard_end_xmit(). In the case of the first one the value is set to 1 and in the other two it is set to whatever the driver deems to be available space on the ring. ---- > Is there anything else that is required from the > driver? Your driver needs to also support wake thresholding. I will post the driver howto later today. cheers, jamal From hadi at cyberus.ca Mon Oct 8 07:05:20 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 10:05:20 -0400 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <20071008125125.GA31456@2ka.mipt.ru> References: <20070914090058.17589.80352.sendpatchset@K50wks273871wss.in.ibm.com> <20070916.161748.48388692.davem@davemloft.net> <1189988958.4230.55.camel@localhost> <1190569987.4256.52.camel@localhost> <1191782093.4394.60.camel@localhost> <20071008125125.GA31456@2ka.mipt.ru> Message-ID: <1191852320.4352.73.camel@localhost> On Mon, 2007-08-10 at 16:51 +0400, Evgeniy Polyakov wrote: > it looks like you and Krishna use the same requeueing methods - get one > from qdisk, queue it into blist, get next from qdisk, queue it, > eventually start transmit, where you dequeue it one-by-one and send (or > prepare and commit). This is not the 100% optimal approach, but if you > proved it does not hurt usual network processing, it is ok. There are probably other bottlenecks that hide the need to optimize further.
> Number of comments dusted to very small - that's a sign, but I'm a bit > lost - did you and Krishna create the competing approaches, or they can > co-exist together, in the former case I doubt you can push, until all > problematic places are resolved, in the latter case, this is probably > ready. Thanks. I would like to make one more cleanup and get rid of the temporary pkt list in qdisc restart; now that i have defered the skb pre-format interface it is unnecessary. I have a day off today, so i will make changes, re-run tests and post again. I dont see something from Krishna's approach that i can take and reuse. This maybe because my old approaches have evolved from the same path. There is a long list but as a sample: i used to do a lot more work while holding the queue lock which i have now moved post queue lock; i dont have any speacial interfaces/tricks just for batching, i provide hints to the core of how much the driver can take etc etc. I have offered Krishna co-authorship if he makes the IPOIB driver to work on my patches, that offer still stands if he chooses to take it. cheers, jamal From eli at mellanox.co.il Mon Oct 8 07:16:48 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 08 Oct 2007 16:16:48 +0200 Subject: [ofa-general] [PATCH 13 of 17]: add LRO support In-Reply-To: <470A27CB.9040403@voltaire.com> References: <1189526095.13053.123.camel@mtls03> <470A06F7.3090602@voltaire.com> <1191846661.7337.4.camel@mtls03> <470A27CB.9040403@voltaire.com> Message-ID: <1191853008.7337.16.camel@mtls03> > Since you have posted the patch, I am asking you if it has any negative > influence on packet forwarding. > > I am not asking you to test it or whether you tested it with forwarding. > The answer is yes since I do not recalculate TCP checksum as I aggregate the SKBs so the kernel might forward the TCP segment as multiple IP packets but with wrong TCP checksum (which is that of the first aggregated packet) but not of the overall aggregated segment. 
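
The checksum problem described above can be illustrated with a small userspace sketch. It assumes the plain RFC 1071 ones' complement sum over payload bytes only; the real TCP checksum also covers the pseudo-header and the TCP header, and nothing here is taken from the posted IPoIB LRO patch. The names inet_csum() and lro_csum_is_stale() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration, not code from the LRO patch.
 * RFC 1071 style checksum: 16-bit ones' complement sum of the data,
 * then the ones' complement of that sum.
 */
static uint16_t inet_csum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Add the data as big-endian 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    /* An odd trailing byte is padded with zero on the right. */
    if (len)
        sum += (uint32_t)data[0] << 8;
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Returns 1 if the checksum computed over the first segment only
 * (first_len bytes) differs from the checksum over the whole
 * aggregated buffer (agg_len bytes), i.e. if keeping the first
 * packet's checksum would leave a stale value on the aggregate.
 */
static int lro_csum_is_stale(const uint8_t *agg, size_t first_len,
                             size_t agg_len)
{
    return inet_csum(agg, first_len) != inet_csum(agg, agg_len);
}
```

With the 8 byte sample from RFC 1071 (00 01 f2 03 f4 f5 f6 f7), the checksum over the full buffer is 0x220d while the checksum over the first 4 bytes alone is 0x0dfb, so a forwarded aggregate that keeps the first packet's value would fail verification at the receiver.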
From jeff at garzik.org Mon Oct 8 07:22:28 2007 From: jeff at garzik.org (Jeff Garzik) Date: Mon, 08 Oct 2007 10:22:28 -0400 Subject: [ofa-general] parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) In-Reply-To: <1191850490.4352.41.camel@localhost> References: <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> <20071007.215124.85709188.davem@davemloft.net> <1191850490.4352.41.camel@localhost> Message-ID: <470A3D24.3050803@garzik.org> jamal wrote: > On Sun, 2007-07-10 at 21:51 -0700, David Miller wrote: > >> For these high performance 10Gbit cards it's a load balancing >> function, really, as all of the transmit queues go out to the same >> physical port so you could: >> >> 1) Load balance on CPU number. >> 2) Load balance on "flow" >> 3) Load balance on destination MAC >> >> etc. etc. etc. > > The brain-block i am having is the parallelization aspect of it. > Whatever scheme it is - it needs to ensure the scheduler works as > expected. For example, if it was a strict prio scheduler i would expect > that whatever goes out is always high priority first and never ever > allow a low prio packet out at any time theres something high prio > needing to go out. If i have the two priorities running on two cpus, > then i cant guarantee that effect. Any chance the NIC hardware could provide that guarantee? 8139cp, for example, has two TX DMA rings, with hardcoded characteristics: one is a high prio q, and one a low prio q. The logic is pretty simple: empty the high prio q first (potentially starving low prio q, in worst case). In terms of overall parallelization, both for TX as well as RX, my gut feeling is that we want to move towards an MSI-X, multi-core friendly model where packets are LIKELY to be sent and received by the same set of [cpus | cores | packages | nodes] that the [userland] processes dealing with the data. 
There are already some primitive NUMA bits in skbuff allocation, but with modern MSI-X and RX/TX flow hashing we could do a whole lot more, along the lines of better CPU scheduling decisions, directing flows to clusters of cpus, and generally doing a better job of maximizing cache efficiency in a modern multi-thread environment. IMO the current model where each NIC's TX completion and RX processes are both locked to the same CPU is outmoded in a multi-core world with modern NICs. :) But I readily admit general ignorance about the kernel process scheduling stuff, so my only idea about a starting point was to see how far to go with the concept of "skb affinity" -- a mask in sk_buff that is a hint about which cpu(s) on which the NIC should attempt to send and receive packets. When going through bonding or netfilter, it is trivial to 'or' together affinity masks. All the various layers of net stack should attempt to honor the skb affinity, where feasible (requires interaction with CFS scheduler?). Or maybe skb affinity is a dumb idea. I wanted to get people thinking on the bigger picture. Parallelization starts at the user process. Jeff From rdreier at cisco.com Mon Oct 8 07:48:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 08 Oct 2007 07:48:26 -0700 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <200710080838.12778.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 8 Oct 2007 08:38:12 +0200") References: <200710080838.12778.jackm@dev.mellanox.co.il> Message-ID: > >  - XRC.  Given the length of the backlog above and the fact that a > >    first draft of this code has not been posted yet, I don't see any > >    way that we could have something this major ready in time. > > > I posted the first draft patch set to the OpenFabrics list on September 18: Sorry, I didn't update that text. But my backlog is still too big to get XRC into 2.6.24. - R. 
From tziporet at mellanox.co.il Mon Oct 8 08:28:25 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 8 Oct 2007 17:28:25 +0200 Subject: [ofa-general] Reminder: OFED meeting today at 9am PST Message-ID: <6C2C79E72C305246B504CBA17B5500C901563EFB@mtlexch01.mtl.com> Hi, This is to remind you that we have the OFED teleconference today at 9am PST Agenda: * OFED 1.3 status toward alpha release this week * Items for OFED developers summit after SC07 Tziporet -------------- next part -------------- An HTML attachment was scrubbed... URL: From hadi at cyberus.ca Mon Oct 8 08:18:29 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 11:18:29 -0400 Subject: [ofa-general] Re: parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock) In-Reply-To: <470A3D24.3050803@garzik.org> References: <1190674298.4264.24.camel@localhost> <1190677099.4264.37.camel@localhost> <20071007.215124.85709188.davem@davemloft.net> <1191850490.4352.41.camel@localhost> <470A3D24.3050803@garzik.org> Message-ID: <1191856709.4352.124.camel@localhost> On Mon, 2007-08-10 at 10:22 -0400, Jeff Garzik wrote: > Any chance the NIC hardware could provide that guarantee? If you can get the scheduling/dequeuing to run on one CPU (as we do today) it should work; alternatively you can totaly bypass the qdisc subystem and go direct to the hardware for devices that are capable and that would work but would require huge changes. My fear is there's a mini-scheduler pieces running on multi cpus which is what i understood as being described. > 8139cp, for example, has two TX DMA rings, with hardcoded > characteristics: one is a high prio q, and one a low prio q. The logic > is pretty simple: empty the high prio q first (potentially starving > low prio q, in worst case). 
sounds like strict prio scheduling to me which says "if low prio starves so be it" > In terms of overall parallelization, both for TX as well as RX, my gut > feeling is that we want to move towards an MSI-X, multi-core friendly > model where packets are LIKELY to be sent and received by the same set > of [cpus | cores | packages | nodes] that the [userland] processes > dealing with the data. Does putting things in the same core help? But overall i agree with your views. > There are already some primitive NUMA bits in skbuff allocation, but > with modern MSI-X and RX/TX flow hashing we could do a whole lot more, > along the lines of better CPU scheduling decisions, directing flows to > clusters of cpus, and generally doing a better job of maximizing cache > efficiency in a modern multi-thread environment. I think i see the receive with a lot of clarity, i am still foggy on the txmit path mostly because of the qos/scheduling issues. > IMO the current model where each NIC's TX completion and RX processes > are both locked to the same CPU is outmoded in a multi-core world with > modern NICs. :) Infact even with status quo theres a case that can be made to not bind to interupts. In my recent experience with batching, due to the nature of my test app, if i let the interupts float across multiple cpus i benefit. My app runs/binds a thread per CPU and so benefits from having more juice to send more packets per unit of time - something i wouldnt get if i was always running on one cpu. But when i do this i found that just because i have bound a thread to cpu3 doesnt mean that thread will always run on cpu3. If netif_wakeup happens on cpu1, scheduler will put the thread on cpu1 if it is to be run. It made sense to do that, it just took me a while to digest. 
> But I readily admit general ignorance about the kernel process > scheduling stuff, so my only idea about a starting point was to see how > far to go with the concept of "skb affinity" -- a mask in sk_buff that > is a hint about which cpu(s) on which the NIC should attempt to send and > receive packets. When going through bonding or netfilter, it is trivial > to 'or' together affinity masks. All the various layers of net stack > should attempt to honor the skb affinity, where feasible (requires > interaction with CFS scheduler?). There would be cache benefits if you can free the packet on the same cpu it was allocated; so the idea of skb affinity is useful in the minimal in that sense if you can pull it. Assuming hardware is capable, even if you just tagged it on xmit to say which cpu it was sent out on, and made sure thats where it is freed, that would be a good start. Note: The majority of the packet processing overhead is _still_ the memory subsystem latency; in my tests with batched pktgen improving the xmit subsystem meant the overhead on allocing and freeing the packets went to something > 80%. So something along the lines of parallelizing based on a split of alloc free of sksb IMO on more cpus than where xmit/receive run would see more performance improvements. > Or maybe skb affinity is a dumb idea. I wanted to get people thinking > on the bigger picture. Parallelization starts at the user process. cheers, jamal From mshefty at ichips.intel.com Mon Oct 8 10:04:45 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 08 Oct 2007 10:04:45 -0700 Subject: [ofa-general] librdmacm feature request In-Reply-To: <1191767680.19888.310.camel@firewall.xsintricity.com> References: <1191767680.19888.310.camel@firewall.xsintricity.com> Message-ID: <470A632D.1050001@ichips.intel.com> > 1) When you listen for connections, the event includes a new cm_id > struct attached to the listen event channel. 
Attempts to change this > channel make the cm_id unusable (rdma_create_qp fails). This is > suboptimal in situations where you want the listen channel to produce > listen events only. A function such as rdma_modify_channel(cm_id, > new_channel); would work to solve this. > > 2) When you create a new cm_id with the intent of connecting to another > machine, it is again desirable to get your events related to the > establishment of the connection in a separate channel from those events > related to already established connections (amongst other things, if you > are sharing a channel with a different thread that is responsible for > tearing down connections on error, then which thread gets the > ADDR_RESOLVED or ROUTE_RESOLVED events is up in the air...to make sure > it gets delivered properly, I currently have the connecting thread > pthread_mutex_lock the connection construct, set connection->cm_waiting > = 1, then issue the rdma_resolve_route, then pthread_mutex_lock again so > it deadlocks, and then other thread gets the event, checks > connection->cm_waiting == 1, and if true it places the event pointer in > connection->event, clears connection->cm_waiting, then > pthread_mutex_unlock's the connection...how gross is that). So, using a > separate event channel up until the connection is established, then > calling rdma_modify_channel() would also solve this problem. Thanks for the feedback. I'll give this some thought and see how difficult it is to add an rdma_modify_channel() routine. > 3) The man pages on rdma_connect() and rdma_accept() aren't really > clear on the role of the connection parameters struct that gets passed > in. Specifically, it doesn't say whether or not the initiator_depth and > responder_resources in the parm struct present in the listen event are > what the other side set, or if they are already swapped to indicate the > minimum/maximum that we can set on our side of the connection. Also, > the initial message pointer is not detailed. 
When we call > rdma_accept/rdma_reject, does our parm struct need to have that same > pointer? Do we need to free that mem? Can we supply a new initial > message and not leak the memory associated with the incoming initial > message? I'll update the man pages to answer your questions. - Sean From swise at opengridcomputing.com Mon Oct 8 11:03:08 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 08 Oct 2007 13:03:08 -0500 Subject: [ofa-general] [PATCH v3] iw_cxgb3: Support"iwarp-only"interfacesto avoid 4-tuple conflicts. In-Reply-To: References: <20070923203649.8324.64524.stgit@dell3.ogc.int><46FBF8AF.9040700@ichips.intel.com> <000101c8013a$41b374f0$a7cc180a@amr.corp.intel.com> Message-ID: <470A70DC.8000005@opengridcomputing.com> Kanevsky, Arkady wrote: > Sean, > IB aside, > it looks like an ULP which is capable of being both RDMA aware and RDMA > not-aware, > like iSER and iSCSI, NFS-RDMA and NFS, SDP and sockets, > will be treated as two separete ULPs. > Each has its own IP address, since there is a different IP address for > iWARP > port and "regular" Ethernet port. So it falls on the users of ULPs to > "handle" it > via DNS or some other services. > Is this "acceptable" to users? I doubt it. > > Recall that ULPs are going in opposite directions by having a different > port number for RDMA aware and RDMA unaware versions of the ULP. > This way, ULP "connection manager" handles RDMA-ness under the covers, > while users plug an IP address for a server to connect to. > Thanks, NOTE: iSCSI/iSER over iWARP won't work with the current Linux RDMA/Verbs anyway due to the requirement that the login connection be migrated into RDMA mode. That's a separate issue. Currently there is not even a way to setup an RDMA connection in streaming mode, then allow streaming mode I/O, then transitioning the connection in to RDMA mode. None of that is implemented. Also, iSCSI/ISER does _not_ use different ports for streaming mode vs data-mover/rdma modes. 
It is negotiated and assumes the same 4tuple. But, if we assume that reasonable services should use different ports for tcp vs rdma connections for the same service, then maybe all thats needed is a way to choose ephemeral ports without colliding with the TCP stack. Like maybe segmenting the ephemeral port space for TCP and RDMA ranges? This could be done without impacting the core networking code I think. This would still require a mvapich2 change to have the stack choose a port instead of randomly trying ports until one is available. This angle doesn't solve everything either, but it avoids 2 separate subnets... Steve. > > Arkady Kanevsky email: arkady at netapp.com > Network Appliance Inc. phone: 781-768-5395 > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > Waltham, MA 02451 central phone: 781-768-5300 > > >> -----Original Message----- >> From: Sean Hefty [mailto:sean.hefty at intel.com] >> Sent: Thursday, September 27, 2007 3:12 PM >> To: Kanevsky, Arkady; Sean Hefty; Steve Wise >> Cc: netdev at vger.kernel.org; rdreier at cisco.com; >> linux-kernel at vger.kernel.org; general at lists.openfabrics.org >> Subject: RE: [ofa-general] [PATCH v3] iw_cxgb3: >> Support"iwarp-only"interfacesto avoid 4-tuple conflicts. >> >>> What is the model on how client connects, say for iSCSI, when client >>> and server both support, iWARP and 10GbE or 1GbE, and would like to >>> setup "most" performant "connection" for ULP? >> For the "most" performance connection, the ULP would use IB, >> and all these problems go away. :) >> >> This proposal is for each iwarp interface to have its own IP >> address. Clients would need an iwarp usable address of the >> server and would connect using rdma_connect(). If that call >> (or rdma_resolve_addr/route) fails, the client could try >> connecting using sockets, aoi, or some other interface. I >> don't see that Steve's proposal changes anything from the >> client's perspective. 
>> >> - Sean >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> From hadi at cyberus.ca Mon Oct 8 11:10:09 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 14:10:09 -0400 Subject: [ofa-general] [DOC][NET_BATCH] Driver howto Message-ID: <1191867009.4335.21.camel@localhost> This is an updated driver howto for batching that works with patches from yesterday and the revised ones i am going to post. cheers, jamal -------------- next part -------------- Here's the beginning of a howto for driver authors. The intended audience for this howto is people already familiar with netdevices. 1.0 Netdevice Prerequisites ------------------------------ For hardware-based netdevices, you must have at least hardware that is capable of doing DMA with many descriptors; i.e., having hardware with a queue length of 3 (as in some fscked ethernet hardware) is not very useful in this case. 2.0 What is new in the driver API ----------------------------------- There is 1 new method and one new variable introduced that the driver author needs to be aware of. These are: 1) dev->hard_end_xmit() 2) dev->xmit_win 2.1 Using Core driver changes ----------------------------- To provide context, let's look at a typical driver abstraction for dev->hard_start_xmit(). It has 4 parts: a) packet formatting (example: vlan, mss, descriptor counting, etc.) b) chip-specific formatting c) enqueueing the packet on a DMA ring d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interrupts, etc. [For code cleanliness/readability sake, regardless of this work, one should break the dev->hard_start_xmit() into those 4 functional blocks anyways]. 
A driver which has all 4 parts and needs to support batching is
advised to split its dev->hard_start_xmit() in the following manner:

1) use its dev->hard_end_xmit() method to achieve #d
2) use dev->xmit_win to tell the core how much space you have.

#b and #c can stay in ->hard_start_xmit() (or whichever way you
want to do this)

Section 3 shows more details on the suggested usage.

2.1.1 Theory of operation
--------------------------

1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented
   and GSO packets are accounted for as well.
2. Core grabs device's TX_LOCK
3. Core loop for all skbs:
     ->invokes driver dev->hard_start_xmit()
4. Core invokes driver dev->hard_end_xmit() if packets xmitted

2.1.1.1 The slippery LLTX
-------------------------

Since these types of drivers are being phased out and they require extra
code they will not be supported anymore. So as of Oct 07 the code that
supports them has been removed.

2.1.1.2 xmit_win
----------------

dev->xmit_win variable is set by the driver to tell us how
much space it has in its rings/queues. This detail is then
used to figure out how many packets are retrieved from the qdisc
queues (in order to send to the driver).
dev->xmit_win is introduced to ensure that when we pass the driver a
list of packets it will swallow all of them -- which is useful because
we don't requeue to the qdisc (and avoids burning unnecessary CPU cycles
or introducing any strange re-ordering).
Essentially the driver signals us how much space it has for descriptors
by setting this variable.

2.1.1.2.1 Setting xmit_win
--------------------------

This variable should be set during xmit path shutdown(netif_stop),
wakeup(netif_wake) and ->hard_end_xmit(). In the case of the first
one the value is set to 1 and in the other two it is set to whatever
the driver deems to be available space on the ring.
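
The theory of operation and the xmit_win rules above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: struct sim_dev and every sim_* function are hypothetical stand-ins, not the kernel API or the code from the patches, and the qdisc is reduced to a simple backlog counter.

```c
/*
 * Hypothetical userspace model of the batching flow; NOT kernel code.
 * All names (sim_dev, sim_*) are invented for illustration.
 */
struct sim_dev {
    int ring_size;   /* total descriptors on the tx ring */
    int ring_used;   /* descriptors currently occupied */
    int xmit_win;    /* hint to the core: how many packets it may pass */
    int stopped;     /* models netif_stop_queue()/netif_wake_queue() */
    int doorbells;   /* counts expensive IO operations (e.g. DMA kicks) */
};

static int sim_ring_free(const struct sim_dev *dev)
{
    return dev->ring_size - dev->ring_used;
}

/* Models ->hard_start_xmit(): enqueue one packet, no IO yet. */
static int sim_hard_start_xmit(struct sim_dev *dev)
{
    if (sim_ring_free(dev) == 0)
        return -1;            /* would force a requeue: a bug per the howto */
    dev->ring_used++;
    if (sim_ring_free(dev) == 0) {
        dev->stopped = 1;     /* xmit path shutdown ... */
        dev->xmit_win = 1;    /* ... sets the window to 1 */
    }
    return 0;
}

/* Models ->hard_end_xmit(): one IO operation for the whole batch. */
static void sim_hard_end_xmit(struct sim_dev *dev)
{
    dev->doorbells++;
    if (!dev->stopped)
        dev->xmit_win = sim_ring_free(dev);
}

/*
 * Models the core: take at most xmit_win packets from the backlog,
 * hand each to hard_start_xmit(), then call hard_end_xmit() once.
 * Returns the number of packets sent.
 */
static int sim_core_xmit(struct sim_dev *dev, int *qdisc_backlog)
{
    int sent = 0;
    int budget;

    if (dev->stopped)
        return 0;
    budget = dev->xmit_win;
    while (sent < budget && *qdisc_backlog > 0) {
        if (sim_hard_start_xmit(dev) != 0)
            break;
        (*qdisc_backlog)--;
        sent++;
    }
    if (sent > 0)
        sim_hard_end_xmit(dev);
    return sent;
}

/*
 * Models tx completion: free `done` descriptors and reopen the tx
 * path only once a low threshold of free space has been reached.
 */
static void sim_tx_complete(struct sim_dev *dev, int done, int wake_thresh)
{
    dev->ring_used -= done;
    if (dev->ring_used < 0)
        dev->ring_used = 0;
    if (dev->stopped && sim_ring_free(dev) >= wake_thresh) {
        dev->xmit_win = sim_ring_free(dev);
        dev->stopped = 0;
    }
}
```

In this model, because the core never passes more than xmit_win packets and the driver keeps xmit_win at or below its free ring space, sim_hard_start_xmit() never has to refuse a packet, which corresponds to the "no requeue" property the howto relies on. The doorbell counter grows once per batch rather than once per packet.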
3.0 Driver Essentials
---------------------

The typical driver tx state machine is:

----
-1-> +Core sends packets
     +--> Driver puts packet onto hardware queue
     +    if hardware queue is full, netif_stop_queue(dev)
     +
-2-> +core stops sending because of netif_stop_queue(dev)
..
.. time passes ...
..
-3-> +---> driver has transmitted packets, opens up tx path by
           invoking netif_wake_queue(dev)
-1-> +Cycle repeats and core sends more packets (step 1).
----

3.1 Driver prerequisite
--------------------------

This is _a very important_ requirement in making batching useful.
The prerequisite for batching changes is that the driver should
provide a low threshold to open up the tx path.
Drivers such as tg3 and e1000 already do this.
Before you invoke netif_wake_queue(dev) you check if there is a
threshold of space reached to insert new packets.

Here's an example of how I added it to tun driver. Observe the
setting of dev->xmit_win.

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
    u32 t = skb_queue_len(&tun->readq);
    if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
        tun->dev->xmit_win = tun->dev->tx_queue_len;
        netif_wake_queue(tun->dev);
    }
---

Here's how the batching e1000 driver does it:

--
    if (unlikely(cleaned && netif_carrier_ok(netdev) &&
                 E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

        if (netif_queue_stopped(netdev)) {
            int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
            netdev->xmit_win = rspace;
            netif_wake_queue(netdev);
        }
    }
---

The tg3 code (with no batching changes) looks like:

-----
    if (netif_queue_stopped(tp->dev) &&
        (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
        netif_wake_queue(tp->dev);
---

3.2 Driver Setup
-----------------

*) On initialization (before netdev registration)
 1) set NETIF_F_BTX in dev->features
    i.e., dev->features |= NETIF_F_BTX
    This makes the core do proper initialization.

 2) set dev->xmit_win to something reasonable like
    maybe half the tx DMA ring size etc.

 3) create proper pointer to the ->hard_end_xmit() method.

3.3 Annotation on the different methods
----------------------------------------

This section shows examples and offers suggestions on how the different
methods and variable could be used.

3.3.1 dev->hard_start_xmit()
----------------------------

Here's an example of tx routine that is similar to the one I added
to the current tun driver. bxmit suffix is kept so that you can turn
off batching if needed via an ethtool interface and call already
existing interface.

----
static int xxx_net_bxmit(struct net_device *dev)
{
....
....
    enqueue onto hardware ring

    if (hardware ring full) {
        netif_stop_queue(dev);
        dev->xmit_win = 1;
    }

    .......
    ..
    .
}
------

All return codes like NETDEV_TX_OK etc. still apply.

3.3.2 The tx complete, dev->hard_end_xmit()
-------------------------------------------------

In this method, if there are any IO operations that apply to a
set of packets such as kicking DMA, setting of interrupt thresholds etc.,
leave them to the end and apply them once if you have successfully enqueued.
This provides a mechanism for saving a lot of CPU cycles since IO
is cycle expensive.

Here is a simplified tg3 dev->hard_end_xmit():

----
void tg3_complete_xmit(struct net_device *dev)
{
    /* Packets are ready, update Tx producer idx local and on card. */
    tw32_tx_mbox((MAILBOX_SNDHOST_PROD_IDX_0 + TG3_64BIT_REG_LOW), entry);

    if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
        netif_stop_queue(dev);
        dev->xmit_win = 1;
        if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp)) {
            tg3_set_win(tp);
            netif_wake_queue(dev);
        }
    } else {
        tg3_set_win(tp);
    }

    mmiowb();
    dev->trans_start = jiffies;
}
-------

3.3.3 setting the dev->xmit_win
---------------------------------

As mentioned earlier this variable provides hints on how much
data to send from the core to the driver. Here are the obvious ways:

a) on doing a netif_stop, set it to 1.
   By default all drivers have this value set to 1 to emulate old
   behavior where a driver only receives one packet at a time.
b) on netif_wake_queue set it to the max available space. You have
   to be careful if your hardware does scatter-gather since the core
   will pass you scatter-gatherable skbs and so you want to at least
   leave enough space for the maximum allowed. Look at the tg3 and
   e1000 to see how this is implemented.

The variable is important because it avoids the core sending
any more than what the driver can handle, therefore avoiding
any need to muck with packet scheduling mechanisms.

Appendix 1: History
-------------------
June 11/2007: Initial revision
June 11/2007: Fixed typo on e1000 netif_wake description ..
Aug 08/2007: Added info on VLAN and the skb->cb[] danger ..
Sep 24/2007: Revised and cleaned up
Sep 25/2007: Cleanups from Randy Dunlap
Oct 08/2007: Removed references to LLTX and packet formatting

From hadi at cyberus.ca Mon Oct 8 11:21:21 2007
From: hadi at cyberus.ca (jamal)
Date: Mon, 08 Oct 2007 14:21:21 -0400
Subject: [ofa-general] [PATCHES] TX batching rev2
Message-ID: <1191867681.4335.28.camel@localhost>

Please provide feedback on the code and/or architecture.
Last time i posted them i received little. They are now updated to
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

What has changed since i posted last:
Killed the temporary packet list that is passed to qdisc restart.

Dave, please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM benefit, just in case i
misunderstood.
Herbert, if you can look at at least patch 3 i will appreciate it
(since it kills dev->gso_skb that you introduced).

UPCOMING PATCHES
---------------
As before:
More patches to follow later if i get some feedback - i didn't want to
overload people by dumping too many patches.
Most of these patches mentioned below are ready to go; some need some
re-testing and others need a little porting from an earlier kernel:
- tg3 driver
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver-howto i wrote that was posted on netdev last week
as well as one that describes the architectural decisions made.

PERFORMANCE TESTING
--------------------

System under test hardware is still a 2xdual core opteron with a
couple of tg3s.
A test tool generates udp traffic of different sizes for up to 60
seconds per run or a total of 30M packets. I have 4 threads each
running on a specific CPU which keep all the CPUs as busy as they can
sending packets targeted at a directly connected box's udp discard
port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule
which counts and drops all incoming udp packets to discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}.
Each packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

I do plan also to run forwarding and TCP tests in the future when the
dust settles.

cheers,
jamal

From hadi at cyberus.ca Mon Oct 8 11:24:45 2007
From: hadi at cyberus.ca (jamal)
Date: Mon, 08 Oct 2007 14:24:45 -0400
Subject: [ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface
Message-ID: <1191867885.4335.30.camel@localhost>

This patch introduces the netdevice interface for batching.

cheers,
jamal

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 01-introduce-batching-interface.patch Type: text/x-patch Size: 6320 bytes Desc: not available URL: From hadi at cyberus.ca Mon Oct 8 11:26:50 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 14:26:50 -0400 Subject: [ofa-general] [PATCH 2/3][NET_BATCH] net core use batching Message-ID: <1191868010.4335.33.camel@localhost> This patch adds the usage of batching within the core. cheers, jamal -------------- next part -------------- A non-text attachment was scrubbed... Name: 02-net-core-use-batching.patch Type: text/x-patch Size: 4790 bytes Desc: not available URL: From hadi at cyberus.ca Mon Oct 8 11:27:44 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 14:27:44 -0400 Subject: [ofa-general] [PATCH 3/3][NET_BATCH] kill dev->gso_skb Message-ID: <1191868064.4335.35.camel@localhost> This patch removes dev->gso_skb as it is no longer necessary with batching code. cheers, jamal -------------- next part -------------- A non-text attachment was scrubbed... Name: 03-kill-dev-gso-skb.patch Type: text/x-patch Size: 2277 bytes Desc: not available URL: From peter.p.waskiewicz.jr at intel.com Mon Oct 8 12:46:23 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Mon, 8 Oct 2007 12:46:23 -0700 Subject: [ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191868010.4335.33.camel@localhost> References: <1191868010.4335.33.camel@localhost> Message-ID: > -----Original Message----- > From: J Hadi Salim [mailto:j.hadi123 at gmail.com] On Behalf Of jamal > Sent: Monday, October 08, 2007 11:27 AM > To: David Miller > Cc: krkumar2 at in.ibm.com; johnpol at 2ka.mipt.ru; > herbert at gondor.apana.org.au; kaber at trash.net; > shemminger at linux-foundation.org; jagana at us.ibm.com; > Robert.Olsson at data.slu.se; rick.jones2 at hp.com; > xma at us.ibm.com; gaagaan at gmail.com; netdev at vger.kernel.org; > rdreier at cisco.com; Waskiewicz Jr, Peter P; > mcarlson at broadcom.com; jeff at garzik.org; mchan at 
broadcom.com; > general at lists.openfabrics.org; kumarkr at linux.ibm.com; > tgraf at suug.ch; randy.dunlap at oracle.com; sri at us.ibm.com > Subject: [PATCH 2/3][NET_BATCH] net core use batching > > This patch adds the usage of batching within the core. > > cheers, > jamal Hey Jamal, I still have concerns how this will work with Tx multiqueue. The way the batching code looks right now, you will probably send a batch of skb's from multiple bands from PRIO or RR to the driver. For non-Tx multiqueue drivers, this is fine. For Tx multiqueue drivers, this isn't fine, since the Tx ring is selected by the value of skb->queue_mapping (set by the qdisc on {prio|rr}_classify()). If the whole batch comes in with different queue_mappings, this could prove to be an interesting issue. Now I see in the driver HOWTO you recently sent that the driver will be expected to loop over the list and call it's ->hard_start_xmit() for each skb. I think that should be fine for multiqueue, I just wanted to see if you had any thoughts on how it should work, any performance issues you can see (I can't think of any). Since the batching feature and Tx multiqueue are very new features, I'd like to make sure we can think of any possible issues with them coexisting before they are both mainline. Looking ahead for multiqueue, I'm still working on the per-queue lock implementation for multiqueue, which I know will not work with batching as it's designed today. I'm still not sure how to handle this, because it really would require the batch you send to have the same queue_mapping in each skb, so you're grabbing the correct queue_lock. Or, we could have the core grab all the queue locks for each skb->queue_mapping represented in the batch. That would block another batch though if it had any of those queues in it's next batch before the first one completed. Thoughts? 
Thanks Jamal,
-PJ Waskiewicz

From hadi at cyberus.ca Mon Oct 8 12:53:24 2007
From: hadi at cyberus.ca (jamal)
Date: Mon, 08 Oct 2007 15:53:24 -0400
Subject: [ofa-general] [NET_BATCH] Some perf results
Message-ID: <1191873204.4373.12.camel@localhost>

I've attached a small pdf with results. This adds on top of results I
posted yesterday (although i didn't see them reflected on netdev).

1) "batch-ntlst" is the patches posted today that remove the temporary
list in qdisc restart and is derived from this AM net-2.6.24
2) "batch-kern" is result of batching patches posted yesterday that had
the temporary list and is based on net-2.6.24 from yesterday AM
3) "net-2.6.2" is yesterday's AM net-2.6.24 with no changes.

So #1 is not a completely fair comparison with #2 and #3. However,
looking at the logs, the changes that have gone in are unrelated to the
areas i have touched, so i don't expect any effect.

Overall, removing the temporary list from qdisc_restart provides
a small improvement noticeable only at the smaller packet sizes.

In any case, enjoy.

cheers,
jamal

-------------- next part --------------
A non-text attachment was scrubbed...
Name: batch-res2.pdf
Type: application/pdf
Size: 12277 bytes
Desc: not available
URL:

From tziporet at dev.mellanox.co.il Mon Oct 8 13:04:06 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 08 Oct 2007 22:04:06 +0200
Subject: [ofa-general] OFED October 8 meeting summary on OFED 1.3 alpha readiness
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563EFB@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563EFB@mtlexch01.mtl.com>
Message-ID: <470A8D36.7050407@mellanox.co.il>

OFED October 8 meeting summary on OFED 1.3 alpha readiness

Meeting summary:
============
1.
Alpha release is planned for this week (Wed or Thursday)
    * Vlad is working to integrate all new patches/changes that were
      posted in the last week
    * The alpha release will not fully support ppc64 (some user level
      apps will not be available)
    * Need to make sure we take the correct uCMA library from Sean's
      git tree
2. Requests for the beta release:
    * Two uDAPL libraries - 1.2 and 2.0
    * A different RPM package for iSCSI (should be provided by
      Voltaire - Erez)
    * Add qperf test - Johann will work with Vlad to add it
    * RHEL 5.1 - Woody will try to generate the backport patches
    * Add the patches that fix compilation warnings - to be done
      immediately after the alpha release
    * SPEC files should be owned by each package maintainer
3. We discussed some ideas for talks in the developer's summit.
   The following ideas were raised: sa caching (Intel), QoS support
   (Sean), Extended RC (MPI team)

From swise at opengridcomputing.com Mon Oct 8 13:30:27 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 08 Oct 2007 15:30:27 -0500
Subject: [ofa-general] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3
Message-ID: <470A9363.4010007@opengridcomputing.com>

Vlad/Tziporet,

Can you please pull version 1.0.3 of libcxgb3 for inclusion in
ofed-1.2.5 and ofed-1.3? It contains a bug fix for older kernels like
RHEL4U4.

You can use the master branch for both releases:

git://git.openfabrics.org/~swise/libcxgb3.git master

Also, please update the spec file you're using to reflect the release
(1.0.3). The spec file in the libcxgb3 git tree should be correct.

Thanks,

Steve.
From hadi at cyberus.ca Mon Oct 8 13:48:50 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 16:48:50 -0400 Subject: [ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: References: <1191868010.4335.33.camel@localhost> Message-ID: <1191876530.4373.58.camel@localhost> On Mon, 2007-08-10 at 12:46 -0700, Waskiewicz Jr, Peter P wrote: > I still have concerns how this will work with Tx multiqueue. > The way the batching code looks right now, you will probably send a > batch of skb's from multiple bands from PRIO or RR to the driver. For > non-Tx multiqueue drivers, this is fine. For Tx multiqueue drivers, > this isn't fine, since the Tx ring is selected by the value of > skb->queue_mapping (set by the qdisc on {prio|rr}_classify()). If the > whole batch comes in with different queue_mappings, this could prove to > be an interesting issue. true, that needs some resolution. Heres a hand-waving thought: Assuming all packets of a specific map end up in the same qdiscn queue, it seems feasible to ask the qdisc scheduler to give us enough packages (ive seen people use that terms to refer to packets) for each hardware ring's available space. With the patches i posted, i do that via dev->xmit_win that assumes only one view of the driver; essentially a single ring. If that is doable, then it is up to the driver to say "i have space for 5 in ring[0], 10 in ring[1] 0 in ring[2]" based on what scheduling scheme the driver implements - the dev->blist can stay the same. Its a handwave, so there may be issues there and there could be better ways to handle this. Note: The other issue that needs resolving that i raised earlier was in regards to multiqueue running on multiple cpus servicing different rings concurently. > Now I see in the driver HOWTO you recently sent that the driver > will be expected to loop over the list and call it's ->hard_start_xmit() > for each skb. 

It's the core that does that, not the driver; the driver continues to
use ->hard_start_xmit() (albeit a modified one). The idea is not to have
many new interfaces.

> I think that should be fine for multiqueue, I just wanted
> to see if you had any thoughts on how it should work, any performance
> issues you can see (I can't think of any). Since the batching feature
> and Tx multiqueue are very new features, I'd like to make sure we can
> think of any possible issues with them coexisting before they are both
> mainline.

Isn't multiqueue mainline already?

> Looking ahead for multiqueue, I'm still working on the per-queue
> lock implementation for multiqueue, which I know will not work with
> batching as it's designed today.

The point behind batching is to reduce the cost of the locks by
amortizing across the locks. Even better, if one can, get rid of
locks. Remind me, why do you need the per-queuemap lock? And is it
needed from the enqueuing side too? Maybe let's start there to help me
understand things?

> I'm still not sure how to handle this,
> because it really would require the batch you send to have the same
> queue_mapping in each skb, so you're grabbing the correct queue_lock.

Sure, that is doable if the driver can set a per queue_mapping xmit_win
and the qdisc can be taught to say "give me packets for queue_mapping X"

> Or, we could have the core grab all the queue locks for each
> skb->queue_mapping represented in the batch. That would block another
> batch though if it had any of those queues in it's next batch before the
> first one completed. Thoughts?

I am not understanding the desire to have locks on a per-queuemap basis.
I think the single queuelock we have today should suffice. If the intent
is to have concurrent cpus running to each hardware ring, then this is
what i questioned earlier whether it was the right thing to do (very top
of email where i mention it as "other issue").

cheers, jamal From davem at davemloft.net Mon Oct 8 14:05:22 2007 From: davem at davemloft.net (David Miller) Date: Mon, 08 Oct 2007 14:05:22 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock In-Reply-To: <1191850490.4352.41.camel@localhost> References: <1190677099.4264.37.camel@localhost> <20071007.215124.85709188.davem@davemloft.net> <1191850490.4352.41.camel@localhost> Message-ID: <20071008.140522.57183793.davem@davemloft.net> From: jamal Date: Mon, 08 Oct 2007 09:34:50 -0400 > The brain-block i am having is the parallelization aspect of it. > Whatever scheme it is - it needs to ensure the scheduler works as > expected. For example, if it was a strict prio scheduler i would expect > that whatever goes out is always high priority first and never ever > allow a low prio packet out at any time theres something high prio > needing to go out. If i have the two priorities running on two cpus, > then i cant guarantee that effect. > IOW, i see the scheduler/qdisc level as not being split across parallel > cpus. Do i make any sense? Picture it like N tubes you stick packets into, and the tubes are processed using DRR. So packets within a tube won't be reordered, but reordering amongst tubes is definitely possible. 
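
A minimal user-space sketch of the "N tubes processed using DRR" picture, for readers who want to see the ordering property concretely. Everything here is an assumption made for illustration: the names (struct drr, drr_enqueue, drr_drain), the fixed quantum and the small fixed-size FIFOs are hypothetical and are not kernel interfaces. It only shows that packets inside one tube leave in arrival order while ordering across tubes is not preserved.

```c
#include <stddef.h>

#define NTUBES   3
#define TUBE_CAP 8

/* seq records global arrival order, len is the packet size in bytes */
struct pkt { int tube; int seq; int len; };

struct tube {
	struct pkt q[TUBE_CAP];
	size_t head, tail;	/* FIFO indices; head == tail means empty */
	int deficit;		/* bytes this tube may still send */
};

struct drr {
	struct tube tubes[NTUBES];
	int quantum;		/* bytes credited per round; must be > 0 */
};

void drr_init(struct drr *d, int quantum)
{
	for (int i = 0; i < NTUBES; i++) {
		d->tubes[i].head = d->tubes[i].tail = 0;
		d->tubes[i].deficit = 0;
	}
	d->quantum = quantum;
}

/* Append a packet to one tube; returns -1 if that tube is full. */
int drr_enqueue(struct drr *d, int tube, int seq, int len)
{
	struct tube *t = &d->tubes[tube];

	if (t->tail == TUBE_CAP)
		return -1;
	t->q[t->tail++] = (struct pkt){ tube, seq, len };
	return 0;
}

/* Drain all tubes round-robin, writing packets to out[] in transmit
 * order. Returns the number of packets written (at most max). */
size_t drr_drain(struct drr *d, struct pkt *out, size_t max)
{
	size_t n = 0;
	int busy = 1;

	while (busy) {
		busy = 0;
		for (int i = 0; i < NTUBES; i++) {
			struct tube *t = &d->tubes[i];

			if (t->head == t->tail) {
				t->deficit = 0;	/* idle tubes keep no credit */
				continue;
			}
			t->deficit += d->quantum;
			/* send from the head only: FIFO order inside a tube */
			while (t->head < t->tail &&
			       t->q[t->head].len <= t->deficit) {
				if (n == max)
					return n;
				t->deficit -= t->q[t->head].len;
				out[n++] = t->q[t->head++];
			}
			if (t->head < t->tail)
				busy = 1;
			else
				t->deficit = 0;
		}
	}
	return n;
}
```

With a quantum of 500 bytes, two 1500-byte packets queued first on tube 0 and two 100-byte packets queued afterwards on tube 1, the drain order is the two tube 1 packets followed by the two tube 0 packets: each tube keeps its own order, but the later arrivals overtake the earlier ones across tubes.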
From davem at davemloft.net Mon Oct 8 14:11:54 2007 From: davem at davemloft.net (David Miller) Date: Mon, 08 Oct 2007 14:11:54 -0700 (PDT) Subject: [ofa-general] Re: parallel networking In-Reply-To: <470A3D24.3050803@garzik.org> References: <20071007.215124.85709188.davem@davemloft.net> <1191850490.4352.41.camel@localhost> <470A3D24.3050803@garzik.org> Message-ID: <20071008.141154.107706003.davem@davemloft.net> From: Jeff Garzik Date: Mon, 08 Oct 2007 10:22:28 -0400 > In terms of overall parallelization, both for TX as well as RX, my gut > feeling is that we want to move towards an MSI-X, multi-core friendly > model where packets are LIKELY to be sent and received by the same set > of [cpus | cores | packages | nodes] that the [userland] processes > dealing with the data. The problem is that the packet schedulers want global guarantees on packet ordering, not flow centric ones. That is the issue Jamal is concerned about. The more I think about it, the more inevitable it seems that we really might need multiple qdiscs, one for each TX queue, to pull this full parallelization off. But the semantics of that don't smell so nice either. If the user attaches a new qdisc to "ethN", does it go to all the TX queues, or what? All of the traffic shaping technology deals with the device as a unary object. It doesn't fit to multi-queue at all. From davem at davemloft.net Mon Oct 8 14:26:26 2007 From: davem at davemloft.net (David Miller) Date: Mon, 08 Oct 2007 14:26:26 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191876530.4373.58.camel@localhost> References: <1191868010.4335.33.camel@localhost> <1191876530.4373.58.camel@localhost> Message-ID: <20071008.142626.26988698.davem@davemloft.net> From: jamal Date: Mon, 08 Oct 2007 16:48:50 -0400 > On Mon, 2007-08-10 at 12:46 -0700, Waskiewicz Jr, Peter P wrote: > > > I still have concerns how this will work with Tx multiqueue. 
> > The way the batching code looks right now, you will probably send a > > batch of skb's from multiple bands from PRIO or RR to the driver. For > > non-Tx multiqueue drivers, this is fine. For Tx multiqueue drivers, > > this isn't fine, since the Tx ring is selected by the value of > > skb->queue_mapping (set by the qdisc on {prio|rr}_classify()). If the > > whole batch comes in with different queue_mappings, this could prove to > > be an interesting issue. > > true, that needs some resolution. Heres a hand-waving thought: > Assuming all packets of a specific map end up in the same qdiscn queue, > it seems feasible to ask the qdisc scheduler to give us enough packages > (ive seen people use that terms to refer to packets) for each hardware > ring's available space. With the patches i posted, i do that via > dev->xmit_win that assumes only one view of the driver; essentially a > single ring. > If that is doable, then it is up to the driver to say > "i have space for 5 in ring[0], 10 in ring[1] 0 in ring[2]" based on > what scheduling scheme the driver implements - the dev->blist can stay > the same. Its a handwave, so there may be issues there and there could > be better ways to handle this. Add xmit_win to struct net_device_subqueue, problem solved. From swise at opengridcomputing.com Mon Oct 8 14:54:49 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 08 Oct 2007 16:54:49 -0500 Subject: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. In-Reply-To: <20070809.145534.102938208.davem@davemloft.net> References: <46B883B5.8040702@opengridcomputing.com> <46BB61D0.4090101@opengridcomputing.com> <46BB89C0.4040303@ichips.intel.com> <20070809.145534.102938208.davem@davemloft.net> Message-ID: <470AA729.2050009@opengridcomputing.com> David Miller wrote: > From: Sean Hefty > Date: Thu, 09 Aug 2007 14:40:16 -0700 > >> Steve Wise wrote: >>> Any more comments? 
>> Does anyone have ideas on how to reserve the port space without using a >> struct socket? > > How about we just remove the RDMA stack altogether? I am not at all > kidding. If you guys can't stay in your sand box and need to cause > problems for the normal network stack, it's unacceptable. We were > told all along the if RDMA went into the tree none of this kind of > stuff would be an issue. > > These are exactly the kinds of problems for which people like myself > were dreading. These subsystems have no buisness using the TCP port > space of the Linux software stack, absolutely none. > > After TCP port reservation, what's next? It seems an at least > bi-monthly event that the RDMA folks need to put their fingers > into something else in the normal networking stack. No more. > > I will NACK any patch that opens up sockets to eat up ports or > anything stupid like that. Hey Dave, The hack to use a socket and bind it to claim the port was just for demostrating the idea. The correct solution, IMO, is to enhance the core low level 4-tuple allocation services to be more generic (eg: not be tied to a struct sock). Then the host tcp stack and the host rdma stack can allocate TCP/iWARP ports/4tuples from this common exported service and share the port space. This allocation service could also be used by other deep adapters like iscsi adapters if needed. Will you NAK such a solution if I go implement it and submit for review? The dual ip subnet solution really sux, and I'm trying one more time to see if you will entertain the common port space solution, if done correctly. Thanks, Steve. 
From hadi at cyberus.ca Mon Oct 8 15:30:18 2007
From: hadi at cyberus.ca (jamal)
Date: Mon, 08 Oct 2007 18:30:18 -0400
Subject: [ofa-general] Re: parallel networking
In-Reply-To: <20071008.141154.107706003.davem@davemloft.net>
References: <20071007.215124.85709188.davem@davemloft.net> <1191850490.4352.41.camel@localhost> <470A3D24.3050803@garzik.org> <20071008.141154.107706003.davem@davemloft.net>
Message-ID: <1191882618.4373.99.camel@localhost>

On Mon, 2007-08-10 at 14:11 -0700, David Miller wrote:

> The problem is that the packet schedulers want global guarantees
> on packet ordering, not flow centric ones.
>
> That is the issue Jamal is concerned about.

indeed, thank you for giving it better wording.

> The more I think about it, the more inevitable it seems that we really
> might need multiple qdiscs, one for each TX queue, to pull this full
> parallelization off.
>
> But the semantics of that don't smell so nice either. If the user
> attaches a new qdisc to "ethN", does it go to all the TX queues, or
> what?
>
> All of the traffic shaping technology deals with the device as a unary
> object. It doesn't fit to multi-queue at all.

If you let only one CPU at a time access the "xmit path" you solve all
the reordering. If you want to be more fine grained you make the
serialization point as low as possible in the stack - perhaps in the
driver.
But I think even what we have today with only one cpu entering the
dequeue/scheduler region, _for starters_, is not bad actually ;-> What i
am finding (and i can tell you i have been trying hard;->) is that a
sufficiently fast cpu doesn't sit in the dequeue area for "too long" (and
batching reduces the time spent further). Very quickly there are no more
packets for it to dequeue from the qdisc or the driver is stopped and it
has to get out of there. If you don't have any interrupt tied to a
specific cpu then you can have many cpus enter and leave that region all
the time.

cheers, jamal From peter.p.waskiewicz.jr at intel.com Mon Oct 8 15:33:42 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Mon, 8 Oct 2007 15:33:42 -0700 Subject: [ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191876530.4373.58.camel@localhost> References: <1191868010.4335.33.camel@localhost> <1191876530.4373.58.camel@localhost> Message-ID: > true, that needs some resolution. Heres a hand-waving thought: > Assuming all packets of a specific map end up in the same > qdiscn queue, it seems feasible to ask the qdisc scheduler to > give us enough packages (ive seen people use that terms to > refer to packets) for each hardware ring's available space. > With the patches i posted, i do that via > dev->xmit_win that assumes only one view of the driver; essentially a > single ring. > If that is doable, then it is up to the driver to say "i have > space for 5 in ring[0], 10 in ring[1] 0 in ring[2]" based on > what scheduling scheme the driver implements - the dev->blist > can stay the same. Its a handwave, so there may be issues > there and there could be better ways to handle this. > > Note: The other issue that needs resolving that i raised > earlier was in regards to multiqueue running on multiple cpus > servicing different rings concurently. I can see the qdisc being modified to send batches per queue_mapping. This shouldn't be too difficult, and if we had the xmit_win per queue (in the subqueue struct like Dave pointed out). Addressing your note/issue with different rings being services concurrently: I'd like to remove the QDISC_RUNNING bit from the global device; with Tx multiqueue, this bit should be set on each queue (if at all), allowing multiple Tx rings to be loaded simultaneously. The biggest issue today with the multiqueue implementation is the global queue_lock. 
I see it being a hot source of contention in my testing; my setup is a 8-core machine (dual quad-core procs) with a 10GbE NIC, using 8 Tx and 8 Rx queues. On transmit, when loading all 8 queues, the enqueue/dequeue are hitting that lock quite a bit for the whole device. I really think that the queue_lock should join the queue_state, so the device no longer manages the top-level state (since we're operating per-queue instead of per-device). > It's the core that does that, not the driver; the driver > continues to use ->hard_start_xmit() (albeit modified one). > The idea is not to have many new interfaces. I'll look closer at this, since I think I confused myself. > Isnt multiqueue mainline already? Well, it's in 2.6.23-rc*. I imagine it won't see much action though until 2.6.24, since people will be porting drivers during that time. Plus having the native Rx multiqueue w/NAPI code in 2.6.24 makes sense to have Tx multiqueue at that time. > The point behind batching is to reduce the cost of the locks > by amortizing across the locks. Even better if one can, they > should get rid of locks. Remind me, why do you need the > per-queuemap lock? And is it needed from the enqueuing side > too? Maybe lets start there to help me understand things? The multiqueue implementation today enforces the number of qdisc bands (RR or PRIO) to be equal to the number of Tx rings your hardware/driver is supporting. Therefore, the queue_lock and queue_state in the kernel directly relate to the qdisc band management. If the queue stops from the driver, then the qdisc won't try to dequeue from the band. What I'm working on is to move the lock there too, so I can lock the queue when I enqueue (protect the band from multiple sources modifying the skb chain), and lock it when I dequeue. This is purely for concurrency of adding/popping skb's from the qdisc queues. Right now, we take the whole global lock to add and remove skb's. 
This is the next logical step for separating the queue dependancy on each other. Please let me know if this doesn't make sense, or if you have any questions at all about my reasoning. I agree that this is where we should be on the same page before moving onto anything else in this discussion. :) > Sure, that is doable if the driver can set a per > queue_mapping xmit_win and the qdisc can be taught to say > "give me packets for queue_mapping X" Yes, I like this idea very much. Do that, modify the qdisc to send in chunks from a queue, and the problem should be solved. I will try and find some additional cycles to get my patches completely working, and send them. It'd be easier I think to see what's going on if I did that. I'll also try to make them work with the ideas of xmit_win per queue and batched queue qdisc sends. Stay tuned... Thanks Jamal, -PJ Waskiewicz From davem at davemloft.net Mon Oct 8 15:33:53 2007 From: davem at davemloft.net (David Miller) Date: Mon, 08 Oct 2007 15:33:53 -0700 (PDT) Subject: [ofa-general] Re: parallel networking In-Reply-To: <1191882618.4373.99.camel@localhost> References: <470A3D24.3050803@garzik.org> <20071008.141154.107706003.davem@davemloft.net> <1191882618.4373.99.camel@localhost> Message-ID: <20071008.153353.58431888.davem@davemloft.net> From: jamal Date: Mon, 08 Oct 2007 18:30:18 -0400 > Very quickly there are no more packets for it to dequeue from the > qdisc or the driver is stoped and it has to get out of there. If you > dont have any interupt tied to a specific cpu then you can have many > cpus enter and leave that region all the time. With the lock shuttling back and forth between those cpus, which is what we're trying to avoid. Multiply whatever effect you think you might be able to measure due to that on your 2 or 4 way system, and multiple it up to 64 cpus or so for machines I am using. This is where machines are going, and is going to become the norm. 
From hadi at cyberus.ca Mon Oct 8 15:34:18 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 18:34:18 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071008.142626.26988698.davem@davemloft.net> References: <1191868010.4335.33.camel@localhost> <1191876530.4373.58.camel@localhost> <20071008.142626.26988698.davem@davemloft.net> Message-ID: <1191882858.4373.103.camel@localhost> On Mon, 2007-08-10 at 14:26 -0700, David Miller wrote: > Add xmit_win to struct net_device_subqueue, problem solved. If net_device_subqueue is visible from both driver and core scheduler area (couldnt tell from looking at whats in there already), then that'll do it. cheers, jamal From peter.p.waskiewicz.jr at intel.com Mon Oct 8 15:35:53 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Mon, 8 Oct 2007 15:35:53 -0700 Subject: [ofa-general] RE: parallel networking In-Reply-To: <20071008.153353.58431888.davem@davemloft.net> References: <470A3D24.3050803@garzik.org><20071008.141154.107706003.davem@davemloft.net><1191882618.4373.99.camel@localhost> <20071008.153353.58431888.davem@davemloft.net> Message-ID: > Multiply whatever effect you think you might be able to > measure due to that on your 2 or 4 way system, and multiple > it up to 64 cpus or so for machines I am using. This is > where machines are going, and is going to become the norm. That along with speeds going to 10 GbE with multiple Tx/Rx queues (with 40 and 100 GbE under discussion now), where multiple CPU's hitting the driver are needed to push line rate without cratering the entire machine. 
-PJ Waskiewicz From peter.p.waskiewicz.jr at intel.com Mon Oct 8 15:36:52 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Mon, 8 Oct 2007 15:36:52 -0700 Subject: [ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191882858.4373.103.camel@localhost> References: <1191868010.4335.33.camel@localhost> <1191876530.4373.58.camel@localhost> <20071008.142626.26988698.davem@davemloft.net> <1191882858.4373.103.camel@localhost> Message-ID: > If net_device_subqueue is visible from both driver and core > scheduler area (couldnt tell from looking at whats in there > already), then that'll do it. Yes, I use the net_device_subqueue structs (the state variable in there) in the prio and rr qdiscs right now. It's an indexed list at the very end of struct netdevice. -PJ Waskiewicz From hadi at cyberus.ca Mon Oct 8 16:40:45 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 19:40:45 -0400 Subject: [ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: References: <1191868010.4335.33.camel@localhost> <1191876530.4373.58.camel@localhost> Message-ID: <1191886845.4373.138.camel@localhost> On Mon, 2007-08-10 at 15:33 -0700, Waskiewicz Jr, Peter P wrote: > Addressing your note/issue with different rings being services > concurrently: I'd like to remove the QDISC_RUNNING bit from the global The challenge to deal with is that netdevices, filters, the queues and scheduler are closely inter-twined. So it is not just the scheduling region and QDISC_RUNNING. For example, lets pick just the filters because they are simple to see: You need to attach them to something - whatever that is, you then need to synchronize against config and multiple cpus trying to use them. You could: a) replicate them across cpus and only lock on config, but you are wasting RAM then b) attach them to rings instead of netdevices - but that makes me wonder if those subqueues are now going to become netdevices. 
This also means you change all user space interfaces to know about subqueues. If you recall this was a major point of contention in our earlier discussion. > device; with Tx multiqueue, this bit should be set on each queue (if at > all), allowing multiple Tx rings to be loaded simultaneously. This is the issue i raised - refer to Dave's wording of it. If you access the rings simultaneously you may not be able to guarantee any ordering or proper qos in contention for wire-resources (think strict prio in hardware) - as long as you have the qdisc area. You may actually get away with it with something like DRR. You could totally bypass the qdisc region and go to the driver directly and let it worry about the scheduling but you'd have to make the qdisc area a "passthrough" while providing the illusion to user space that all is as before. > The > biggest issue today with the multiqueue implementation is the global > queue_lock. I see it being a hot source of contention in my testing; my > setup is a 8-core machine (dual quad-core procs) with a 10GbE NIC, using > 8 Tx and 8 Rx queues. On transmit, when loading all 8 queues, the > enqueue/dequeue are hitting that lock quite a bit for the whole device. Yes, the queuelock is expensive; in your case if all 8 hardware threads are contending for that one device, you will suffer. The txlock on the other hand is not that expensive since the contention is for a max of 2 cpus (tx and rx softirq). I tried to use that fact in the batching to move things that i processed under queue lock into the area for txlock. I'd be very interested in some results on such a piece of hardware with the 10G nic to see if these theories make any sense. > I really think that the queue_lock should join the queue_state, so the > device no longer manages the top-level state (since we're operating > per-queue instead of per-device). Refer to above.
> > The multiqueue implementation today enforces the number of qdisc bands > (RR or PRIO) to be equal to the number of Tx rings your hardware/driver > is supporting. Therefore, the queue_lock and queue_state in the kernel > directly relate to the qdisc band management. If the queue stops from > the driver, then the qdisc won't try to dequeue from the band. Good start. > What I'm > working on is to move the lock there too, so I can lock the queue when I > enqueue (protect the band from multiple sources modifying the skb > chain), and lock it when I dequeue. This is purely for concurrency of > adding/popping skb's from the qdisc queues. Ok, so the "concurrency" aspect is what worries me. What i am saying is that sooner or later you have to serialize (which is anti-concurrency). For example, consider CPU0 running a high prio queue and CPU1 running the low prio queue of the same netdevice. Assume CPU0 is getting a lot of interrupts or other work while CPU1 doesn't (so as to create a condition that CPU0 is slower). Then as long as there are packets and there is space on the driver's rings, CPU1 will send more packets per unit time than CPU0. This contradicts the strict prio scheduler which says higher priority packets ALWAYS go out first regardless of the presence of low prio packets. I am not sure i made sense.
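To make the strict prio concern concrete, here is a toy sketch in plain C. It is not kernel code: every name in it (strict_prio_pick, inversions_single_consumer, inversions_per_cpu) is made up for illustration, band 0 is assumed to be the highest priority, and the "CPU0 is busy" condition is modelled simply by letting CPU0 send only once every cpu0_period ticks. It counts how many low prio packets reach the wire while high prio packets are still queued, first with a single dequeuer and then with two CPUs draining their own bands independently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: band 0 is the highest priority. Returns the index of
 * the first band that still has packets queued, or -1 if all are empty. */
int strict_prio_pick(const int *backlog, size_t nbands)
{
    for (size_t i = 0; i < nbands; i++)
        if (backlog[i] > 0)
            return (int)i;
    return -1;
}

/* One dequeuer serving both bands in strict priority order. Counts low prio
 * packets sent while high prio packets were still waiting. */
int inversions_single_consumer(int high, int low)
{
    int backlog[2] = { high, low };
    int inversions = 0;
    int band;

    while ((band = strict_prio_pick(backlog, 2)) >= 0) {
        if (band == 1 && backlog[0] > 0)
            inversions++;       /* never taken: band 0 always wins */
        backlog[band]--;
    }
    return inversions;
}

/* Two independent dequeuers. CPU0 owns the high prio band but only gets to
 * send once every cpu0_period ticks; CPU1 owns the low prio band and sends
 * on every tick. */
int inversions_per_cpu(int high, int low, int cpu0_period)
{
    int inversions = 0;

    assert(cpu0_period > 0);
    for (int tick = 0; high > 0 || low > 0; tick++) {
        if (high > 0 && tick % cpu0_period == 0)
            high--;             /* CPU0 manages to send one packet */
        if (low > 0) {
            if (high > 0)
                inversions++;   /* low prio went out ahead of queued high prio */
            low--;
        }
    }
    return inversions;
}
```

With a single dequeuer the count stays at zero by construction; with the two independent dequeuers it becomes non-zero as soon as the CPU serving the high prio band falls behind, which is the contradiction with strict prio described above.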
cheers, jamal From hadi at cyberus.ca Mon Oct 8 16:42:25 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 19:42:25 -0400 Subject: [ofa-general] Re: parallel networking In-Reply-To: <20071008.153353.58431888.davem@davemloft.net> References: <470A3D24.3050803@garzik.org> <20071008.141154.107706003.davem@davemloft.net> <1191882618.4373.99.camel@localhost> <20071008.153353.58431888.davem@davemloft.net> Message-ID: <1191886945.4373.141.camel@localhost> On Mon, 2007-08-10 at 15:33 -0700, David Miller wrote: > Multiply whatever effect you think you might be able to measure due to > that on your 2 or 4 way system, and multiple it up to 64 cpus or so > for machines I am using. This is where machines are going, and is > going to become the norm. Yes, i keep forgetting that ;-> I need to train my brain to remember that. cheers, jamal From rdreier at cisco.com Mon Oct 8 16:43:48 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 08 Oct 2007 16:43:48 -0700 Subject: [ofa-general] librdmacm feature request In-Reply-To: <470A632D.1050001@ichips.intel.com> (Sean Hefty's message of "Mon, 08 Oct 2007 10:04:45 -0700") References: <1191767680.19888.310.camel@firewall.xsintricity.com> <470A632D.1050001@ichips.intel.com> Message-ID: > Thanks for the feedback. I'll give this some thought and see how > difficult it is to add an rdma_modify_channel() routine. I think this needs to be handled with care, because there is the obvious window while an rdma_modify_channel() operation is pending where an application has to be prepared for an event to appear on either the old channel and the new channel. - R. 
From rdreier at cisco.com Mon Oct 8 16:52:27 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 08 Oct 2007 16:52:27 -0700 Subject: [ofa-general] [PATCH] ipoib_cm: Changed the way QP is being created in ipoib_cm_create_tx_qp In-Reply-To: <200710070930.48454.dotanb@dev.mellanox.co.il> (Dotan Barak's message of "Sun, 7 Oct 2007 09:30:48 +0200") References: <200710070930.48454.dotanb@dev.mellanox.co.il> Message-ID: thanks, applied with the extra line suggested by Eli. From rdreier at cisco.com Mon Oct 8 16:54:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 08 Oct 2007 16:54:59 -0700 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <47090708.6060604@opengridcomputing.com> (Steve Wise's message of "Sun, 07 Oct 2007 11:19:20 -0500") References: <47090708.6060604@opengridcomputing.com> Message-ID: > No mention about the iwarp port space issue? I don't think we're at a stage where I'm prepared to merge something-- we all agree the latest patch has serious drawbacks, and it commits us to a suboptimal interface that is userspace-visible. > I'm at a loss as to how to proceed. Could we try to do some cleanups to the net core to make the alias stuff less painful? eg is there any sane way to make it possible for a device that creates 'eth0' to also create an 'iw0' alias without an assigning an address? - R. From sean.hefty at intel.com Mon Oct 8 17:15:39 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 8 Oct 2007 17:15:39 -0700 Subject: [ofa-general] librdmacm feature request In-Reply-To: References: <1191767680.19888.310.camel@firewall.xsintricity.com><470A632D.1050001@ichips.intel.com> Message-ID: <000101c80a09$86064380$2cc8180a@amr.corp.intel.com> >I think this needs to be handled with care, because there is the >obvious window while an rdma_modify_channel() operation is pending >where an application has to be prepared for an event to appear on >either the old channel and the new channel. 
Agreed - this may work okay for new connection requests, where pending events end up being suppressed in the kernel until the user accepts the connection, but will be challenging as a generic API. Folding this functionality into rdma_accept() may work better as long as it doesn't break the ABI. - Sean From jeff at garzik.org Mon Oct 8 18:13:59 2007 From: jeff at garzik.org (Jeff Garzik) Date: Mon, 08 Oct 2007 21:13:59 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191886845.4373.138.camel@localhost> References: <1191868010.4335.33.camel@localhost> <1191876530.4373.58.camel@localhost> <1191886845.4373.138.camel@localhost> Message-ID: <470AD5D7.1070000@garzik.org> jamal wrote: > Ok, so the "concurency" aspect is what worries me. What i am saying is > that sooner or later you have to serialize (which is anti-concurency) > For example, consider CPU0 running a high prio queue and CPU1 running > the low prio queue of the same netdevice. > Assume CPU0 is getting a lot of interupts or other work while CPU1 > doesnt (so as to create a condition that CPU1 is slower). Then as long > as there packets and there is space on the drivers rings, CPU1 will send > more packets per unit time than CPU0. > This contradicts the strict prio scheduler which says higher priority > packets ALWAYS go out first regardless of the presence of low prio > packets. I am not sure i made sense. You made sense. I think it is important to note simply that the packet scheduling algorithm itself will dictate the level of concurrency you can achieve. Strict prio is fundamentally an interface to a big imaginary queue, with multiple packet insertion points (the individual bands/rings for each prio band). If you assume a scheduler implementation where each prio band is mapped to a separate CPU, you can certainly see where some CPUs could be substantially idle while others are overloaded, largely depending on the data workload (and priority contained within). 
Moreover, you increase L1/L2 cache traffic, not just because of locks, but because of data dependencies: user prio packet NIC TX ring process band scheduler cpu7 1 cpu1 1 cpu5 1 cpu1 1 cpu2 0 cpu0 0 At that point, it is probably more cache- and lock-friendly to keep the current TX softirq scheme. In contrast, a pure round-robin approach is more friendly to concurrency. Jeff From jeff at garzik.org Mon Oct 8 18:31:09 2007 From: jeff at garzik.org (Jeff Garzik) Date: Mon, 08 Oct 2007 21:31:09 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191886845.4373.138.camel@localhost> References: <1191868010.4335.33.camel@localhost> <1191876530.4373.58.camel@localhost> <1191886845.4373.138.camel@localhost> Message-ID: <470AD9DD.2080707@garzik.org> jamal wrote: > The challenge to deal with is that netdevices, filters, the queues and > scheduler are closely inter-twined. So it is not just the scheduling > region and QDISC_RUNNING. For example, lets pick just the filters > because they are simple to see: You need to attach them to something - > whatever that is, you then need to synchronize against config and > multiple cpus trying to use them. You could: > a) replicate them across cpus and only lock on config, but you are > wasting RAM then I think you've pretty much bought into the cost of wasting RAM, when doing multiple TX rings. So logic implies associated costs, like the ones you describe, come along for the ride. > b) attach them to rings instead of netdevices - but that makes me wonder > if those subqueues are now going to become netdevices. This also means > you change all user space interfaces to know about subqueues. If you > recall this was a major contention in our earlier discussion. That's definitely a good question, and I honestly don't see any easy solutions. 
Multiple net devices makes a -lot- of things easier, with regards to existing infrastructure, but it also imposes potentially annoying administrative burdens: Not only must each interface be set up individually, but the userland apps must be made aware of this unique method of concurrency. Jeff From davem at davemloft.net Mon Oct 8 18:41:26 2007 From: davem at davemloft.net (David Miller) Date: Mon, 08 Oct 2007 18:41:26 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <470AD5D7.1070000@garzik.org> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> Message-ID: <20071008.184126.124062865.davem@davemloft.net> From: Jeff Garzik Date: Mon, 08 Oct 2007 21:13:59 -0400 > If you assume a scheduler implementation where each prio band is mapped > to a separate CPU, you can certainly see where some CPUs could be > substantially idle while others are overloaded, largely depending on the > data workload (and priority contained within). Right, which is why Peter added the prio DRR scheduler stuff for TX multiqueue (see net/sched/sch_prio.c:rr_qdisc_ops) because this is what the chips do.
But this doesn't get us to where we want to be as Peter has been explaining a bit these past few days. Ok, we're talking a lot but not pouring much concrete, let's start doing that. I propose: 1) A library for transmit load balancing functions, with an interface that can be made visible to userspace. I can write this and test it on real multiqueue hardware. The whole purpose of this library is to set skb->queue_mapping based upon the load balancing function. Facilities will be added to handle virtualization port selection based upon destination MAC address as one of the "load balancing" methods. 2) Switch the default qdisc away from pfifo_fast to a new DRR fifo with load balancing using the code in #1. I think this is kind of in the territory of what Peter said he is working on. I know this is controversial, but realistically I doubt users benefit at all from the prioritization that pfifo provides. They will, on the other hand, benefit from TX queue load balancing on fast interfaces. 3) Work on discovering a way to make the locking on transmit as localized to the current thread of execution as possible. Things like RCU and statistic replication, techniques we use widely elsewhere in the stack, begin to come to mind. I also want to point out another issue. Any argument wrt. reordering is specious at best because right now reordering from qdisc to device happens anyways. And that's because we drop the qdisc lock first, then we grab the transmit lock on the device and submit the packet. So, after we drop the qdisc lock, another cpu can get the qdisc lock, get the next packet (perhaps a lower priority one) and then sneak in to get the device transmit lock before the first thread can, and thus the packets will be submitted out of order. This, along with other things, makes me believe that ordering really doesn't matter in practice. 
And therefore, in practice, we can treat everything from the qdisc to the real hardware as a FIFO even if something else is going on inside the black box which might reorder packets on the wire. From dledford at redhat.com Mon Oct 8 18:48:27 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 09 Oct 2007 01:48:27 +0000 Subject: [ofa-general] librdmacm feature request In-Reply-To: References: <1191767680.19888.310.camel@firewall.xsintricity.com> <470A632D.1050001@ichips.intel.com> Message-ID: <1191894507.19888.360.camel@firewall.xsintricity.com> On Mon, 2007-10-08 at 16:43 -0700, Roland Dreier wrote: > > Thanks for the feedback. I'll give this some thought and see how > > difficult it is to add an rdma_modify_channel() routine. > > I think this needs to be handled with care, because there is the > obvious window while an rdma_modify_channel() operation is pending > where an application has to be prepared for an event to appear on > either the old channel and the new channel. > > - R. It shouldn't be too hard. Assuming you handle the modify channel as a synchronous action, the thread calling modify channel can't also be in rdma_get_cm_event at the same time. So, if you get there and someone is blocking on that channel and just hasn't been scheduled to run yet, then leave the event where it is while you switch the channel and send new events to the new channel. If they aren't then move any pending events to the new channel as you do the change. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
From jeff at garzik.org Mon Oct 8 18:53:25 2007 From: jeff at garzik.org (Jeff Garzik) Date: Mon, 08 Oct 2007 21:53:25 -0400 Subject: [ofa-general] Re: parallel networking In-Reply-To: <20071008.141154.107706003.davem@davemloft.net> References: <20071007.215124.85709188.davem@davemloft.net> <1191850490.4352.41.camel@localhost> <470A3D24.3050803@garzik.org> <20071008.141154.107706003.davem@davemloft.net> Message-ID: <470ADF15.2090100@garzik.org> David Miller wrote: > From: Jeff Garzik > Date: Mon, 08 Oct 2007 10:22:28 -0400 > >> In terms of overall parallelization, both for TX as well as RX, my gut >> feeling is that we want to move towards an MSI-X, multi-core friendly >> model where packets are LIKELY to be sent and received by the same set >> of [cpus | cores | packages | nodes] that the [userland] processes >> dealing with the data. > > The problem is that the packet schedulers want global guarantees > on packet ordering, not flow centric ones. > > That is the issue Jamal is concerned about. Oh, absolutely. I think, fundamentally, any amount of cross-flow resource management done in software is an obstacle to concurrency. That's not a value judgement, just a statement of fact. "traffic cops" are intentional bottlenecks we add to the process, to enable features like priority flows, filtering, or even simple socket fairness guarantees. Each of those bottlenecks serves a valid purpose, but at the end of the day, it's still a bottleneck. So, improving concurrency may require turning off useful features that nonetheless hurt concurrency. > The more I think about it, the more inevitable it seems that we really > might need multiple qdiscs, one for each TX queue, to pull this full > parallelization off. > > But the semantics of that don't smell so nice either. If the user > attaches a new qdisc to "ethN", does it go to all the TX queues, or > what?
> > All of the traffic shaping technology deals with the device as a unary > object. It doesn't fit to multi-queue at all. Well the easy solutions to networking concurrency are * use virtualization to carve up the machine into chunks * use multiple net devices Since new NIC hardware is actively trying to be friendly to multi-channel/virt scenarios, either of these is reasonably straightforward given the current state of the Linux net stack. Using multiple net devices is especially attractive because it works very well with the existing packet scheduling. Both unfortunately impose a burden on the developer and admin, to force their apps to distribute flows across multiple [VMs | net devs]. The third alternative is to use a single net device, with SMP-friendly packet scheduling. Here you run into the problems you described "device as a unary object" etc. with the current infrastructure. With multiple TX rings, consider that we are pushing the packet scheduling from software to hardware... which implies * hardware-specific packet scheduling * some TC/shaping features not available, because hardware doesn't support it Jeff From herbert at gondor.apana.org.au Mon Oct 8 19:01:15 2007 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Tue, 9 Oct 2007 10:01:15 +0800 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071008.184126.124062865.davem@davemloft.net> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> Message-ID: <20071009020115.GA14635@gondor.apana.org.au> On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote: > > I also want to point out another issue. Any argument wrt. reordering > is specious at best because right now reordering from qdisc to device > happens anyways. This is not true. If your device has a qdisc at all, then you will end up in the function qdisc_restart, where we release the queue lock only after acquiring the TX lock. 
So right now this path does not create any reordering. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert at gondor.apana.org.au Mon Oct 8 19:03:18 2007 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Tue, 9 Oct 2007 10:03:18 +0800 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009020115.GA14635@gondor.apana.org.au> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <20071009020115.GA14635@gondor.apana.org.au> Message-ID: <20071009020318.GA14708@gondor.apana.org.au> On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote: > On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote: > > > > I also want to point out another issue. Any argument wrt. reordering > > is specious at best because right now reordering from qdisc to device > > happens anyways. > > This is not true. > > If your device has a qdisc at all, then you will end up in the > function qdisc_restart, where we release the queue lock only > after acquiring the TX lock. > > So right now this path does not create any reordering. Argh! Someone's just broken this. I think we should restore the original behaviour.
Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert at gondor.apana.org.au Mon Oct 8 19:04:42 2007 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Tue, 9 Oct 2007 10:04:42 +0800 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009020318.GA14708@gondor.apana.org.au> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <20071009020115.GA14635@gondor.apana.org.au> <20071009020318.GA14708@gondor.apana.org.au> Message-ID: <20071009020442.GA14746@gondor.apana.org.au> On Tue, Oct 09, 2007 at 10:03:18AM +0800, Herbert Xu wrote: > On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote: > > On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote: > > > > > > I also want to point out another issue. Any argument wrt. reordering > > > is specious at best because right now reordering from qdisc to device > > > happens anyways. > > > > This is not true. > > > > If your device has a qdisc at all, then you will end up in the > > function qdisc_restart, where we release the queue lock only > > after acquiring the TX lock. > > > > So right now this path does not create any reordering. > > Argh! Someone's just broken this. I think we should restore > the original behaviour. Please revert commit 41843197b17bdfb1f97af0a87c06d24c1620ba90 Author: Jamal Hadi Salim Date: Tue Sep 25 19:27:13 2007 -0700 [NET_SCHED]: explict hold dev tx lock As this change introduces potential reordering and I don't think we've discussed this aspect sufficiently.
Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From jeff at garzik.org Mon Oct 8 19:12:03 2007 From: jeff at garzik.org (Jeff Garzik) Date: Mon, 08 Oct 2007 22:12:03 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071008.184126.124062865.davem@davemloft.net> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> Message-ID: <470AE373.9020207@garzik.org> David Miller wrote: > 1) A library for transmit load balancing functions, with an interface > that can be made visible to userspace. I can write this and test > it on real multiqueue hardware. > > The whole purpose of this library is to set skb->queue_mapping > based upon the load balancing function. > > Facilities will be added to handle virtualization port selection > based upon destination MAC address as one of the "load balancing" > methods. Groovy. I'm interested in working on a load balancer function that approximates skb->queue_mapping = smp_processor_id() I'd be happy to code and test in that direction, based on your lib. > 2) Switch the default qdisc away from pfifo_fast to a new DRR fifo > with load balancing using the code in #1. I think this is kind > of in the territory of what Peter said he is working on. > > I know this is controversial, but realistically I doubt users > benefit at all from the prioritization that pfifo provides. They > will, on the other hand, benefit from TX queue load balancing on > fast interfaces. IMO the net driver really should provide a hint as to what it wants. 8139cp and tg3 would probably prefer multiple TX queue behavior to match silicon behavior -- strict prio. And I'll volunteer to write the net driver code for that, if people want to see how things would look for that type of hardware packet scheduling.
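As a rough illustration of the queue selection idea in 1) above, a load balancing function could look something like the sketch below. This is plain C, not kernel code, and it makes several assumptions: toy_queue_from_cpu and toy_queue_from_flow are hypothetical names, the CPU number is passed in as an argument rather than read via smp_processor_id(), and the flow hash is an arbitrary mixing function picked for the example, not something proposed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Map the submitting CPU onto a TX queue, wrapping around when there are
 * more CPUs than queues. Assumes num_queues > 0. */
uint16_t toy_queue_from_cpu(unsigned int cpu, uint16_t num_queues)
{
    assert(num_queues > 0);
    return (uint16_t)(cpu % num_queues);
}

/* Alternative: keep every packet of one flow on the same TX queue by
 * hashing the addresses and ports. The mixing steps are arbitrary. */
uint16_t toy_queue_from_flow(uint32_t saddr, uint32_t daddr,
                             uint16_t sport, uint16_t dport,
                             uint16_t num_queues)
{
    uint32_t h;

    assert(num_queues > 0);
    h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return (uint16_t)(h % num_queues);
}
```

The per-CPU variant keeps the sending CPU and the TX ring together, at the cost of letting one flow hop between rings if its sender migrates; the per-flow variant keeps a flow on one ring regardless of which CPU submits it.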
> 3) Work on discovering a way to make the locking on transmit as > localized to the current thread of execution as possible. Things > like RCU and statistic replication, techniques we use widely > elsewhere in the stack, begin to come to mind. Definitely. Jeff From hadi at cyberus.ca Mon Oct 8 19:14:30 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 22:14:30 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071008.184126.124062865.davem@davemloft.net> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> Message-ID: <1191896071.4373.156.camel@localhost> On Mon, 2007-08-10 at 18:41 -0700, David Miller wrote: > I also want to point out another issue. Any argument wrt. reordering > is specious at best because right now reordering from qdisc to device > happens anyways. > > And that's because we drop the qdisc lock first, then we grab the > transmit lock on the device and submit the packet. So, after we > drop the qdisc lock, another cpu can get the qdisc lock, get the > next packet (perhaps a lower priority one) and then sneak in to > get the device transmit lock before the first thread can, and > thus the packets will be submitted out of order. > You forgot QDISC_RUNNING Dave;-> the above can't happen. Essentially at any one point in time, we are guaranteed that we can have multiple cpus enqueueing but only one can be dequeueing (the one that managed to grab QDISC_RUNNING) i.e. multiple producers to the qdisc queue but only one consumer. Only the dequeuer has access to the txlock. > This, along with other things, makes me believe that ordering really > doesn't matter in practice. And therefore, in practice, we can treat > everything from the qdisc to the real hardware as a FIFO even if > something else is going on inside the black box which might reorder > packets on the wire.
I think it is important to get the scheduling right - estimations can be a last resort. For example, If i have voip competing for the wire with ftp on two different rings/cpus and i specified that voip should be more important i may consider equipment faulty if it works "most of the time" (when ftp is not clogging the wire) and at times i am asked to repeat what i just said. cheers, jamal From hadi at cyberus.ca Mon Oct 8 19:15:49 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 22:15:49 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009020442.GA14746@gondor.apana.org.au> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <20071009020115.GA14635@gondor.apana.org.au> <20071009020318.GA14708@gondor.apana.org.au> <20071009020442.GA14746@gondor.apana.org.au> Message-ID: <1191896149.4373.157.camel@localhost> On Tue, 2007-09-10 at 10:04 +0800, Herbert Xu wrote: > Please revert > > commit 41843197b17bdfb1f97af0a87c06d24c1620ba90 > Author: Jamal Hadi Salim > Date: Tue Sep 25 19:27:13 2007 -0700 > > [NET_SCHED]: explict hold dev tx lock > > As this change introduces potential reordering and I don't think > we've discussed this aspect sufficiently. How does it introduce reordering? cheers, jamal From herbert at gondor.apana.org.au Mon Oct 8 19:16:20 2007 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Tue, 9 Oct 2007 10:16:20 +0800 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191896071.4373.156.camel@localhost> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <1191896071.4373.156.camel@localhost> Message-ID: <20071009021620.GA14917@gondor.apana.org.au> On Mon, Oct 08, 2007 at 10:14:30PM -0400, jamal wrote: > > You forgot QDISC_RUNNING Dave;-> the above cant happen. 
> Essentially at any one point in time, we are guaranteed that we can have > multiple cpus enqueueing but only one can be dequeueing (the one that > managed to grab QDISC_RUNNING) i.e. multiple producers to the qdisc queue > but only one consumer. Only the dequeuer has access to the txlock. Good point. You had me worried for a sec :) Dave, Jamal's patch is fine as it is and doesn't actually create any packet reordering. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert at gondor.apana.org.au Mon Oct 8 19:16:46 2007 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Tue, 9 Oct 2007 10:16:46 +0800 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191896149.4373.157.camel@localhost> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <20071009020115.GA14635@gondor.apana.org.au> <20071009020318.GA14708@gondor.apana.org.au> <20071009020442.GA14746@gondor.apana.org.au> <1191896149.4373.157.camel@localhost> Message-ID: <20071009021646.GB14917@gondor.apana.org.au> On Mon, Oct 08, 2007 at 10:15:49PM -0400, jamal wrote: > On Tue, 2007-09-10 at 10:04 +0800, Herbert Xu wrote: > > > Please revert > > > > commit 41843197b17bdfb1f97af0a87c06d24c1620ba90 > > Author: Jamal Hadi Salim > > Date: Tue Sep 25 19:27:13 2007 -0700 > > > > [NET_SCHED]: explict hold dev tx lock > > > > As this change introduces potential reordering and I don't think > > we've discussed this aspect sufficiently. > > How does it introduce reordering? No it doesn't.
I'd forgotten about the QDISC_RUNNING bit :) -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From hadi at cyberus.ca Mon Oct 8 19:19:02 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 08 Oct 2007 22:19:02 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009021646.GB14917@gondor.apana.org.au> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <20071009020115.GA14635@gondor.apana.org.au> <20071009020318.GA14708@gondor.apana.org.au> <20071009020442.GA14746@gondor.apana.org.au> <1191896149.4373.157.camel@localhost> <20071009021646.GB14917@gondor.apana.org.au> Message-ID: <1191896342.4373.159.camel@localhost> On Tue, 2007-09-10 at 10:16 +0800, Herbert Xu wrote: > > No it doesn't. I'd forgotten about the QDISC_RUNNING bit :) You should know better, you wrote it and i've been going insane trying to break it for at least a year now ;-> cheers, jamal From herbert at gondor.apana.org.au Mon Oct 8 19:20:03 2007 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Tue, 9 Oct 2007 10:20:03 +0800 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191896342.4373.159.camel@localhost> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <20071009020115.GA14635@gondor.apana.org.au> <20071009020318.GA14708@gondor.apana.org.au> <20071009020442.GA14746@gondor.apana.org.au> <1191896149.4373.157.camel@localhost> <20071009021646.GB14917@gondor.apana.org.au> <1191896342.4373.159.camel@localhost> Message-ID: <20071009022003.GC14917@gondor.apana.org.au> On Mon, Oct 08, 2007 at 10:19:02PM -0400, jamal wrote: > On Tue, 2007-09-10 at 10:16 +0800, Herbert Xu wrote: > > > > > No it doesn't.
> > I'd forgotten about the QDISC_RUNNING bit :)
>
> You should know better, you wrote it and I've been going insane trying to
> break it for at least a year now ;->

Well you've broken me at least :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

From davem at davemloft.net Mon Oct 8 19:43:43 2007
From: davem at davemloft.net (David Miller)
Date: Mon, 08 Oct 2007 19:43:43 -0700 (PDT)
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <20071009020318.GA14708@gondor.apana.org.au>
References: <20071008.184126.124062865.davem@davemloft.net> <20071009020115.GA14635@gondor.apana.org.au> <20071009020318.GA14708@gondor.apana.org.au>
Message-ID: <20071008.194343.52093065.davem@davemloft.net>

From: Herbert Xu
Date: Tue, 9 Oct 2007 10:03:18 +0800

> On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote:
> > On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
> > >
> > > I also want to point out another issue. Any argument wrt. reordering
> > > is specious at best because right now reordering from qdisc to device
> > > happens anyways.
> >
> > This is not true.
> >
> > If your device has a qdisc at all, then you will end up in the
> > function qdisc_restart, where we release the queue lock only
> > after acquiring the TX lock.
> >
> > So right now this path does not create any reordering.
>
> Argh! Someone's just broken this. I think we should restore
> the original behaviour.

Right, that's Jamal's recent patch. It looked funny to me too.

I think we can't make this change, the acquisition of the device
transmit lock before we release the qdisc is the only thing that
prevents reordering between qdisc and device.

Otherwise all of the prioritization is pretty much for nothing as
I described in another email today.

Jamal, I'm pretty sure we have to revert this, you can't change
the locking in this way.
commit 41843197b17bdfb1f97af0a87c06d24c1620ba90 Author: Jamal Hadi Salim Date: Tue Sep 25 19:27:13 2007 -0700 [NET_SCHED]: explict hold dev tx lock For N cpus, with full throttle traffic on all N CPUs, funneling traffic to the same ethernet device, the devices queue lock is contended by all N CPUs constantly. The TX lock is only contended by a max of 2 CPUS. In the current mode of operation, after all the work of entering the dequeue region, we may endup aborting the path if we are unable to get the tx lock and go back to contend for the queue lock. As N goes up, this gets worse. The changes in this patch result in a small increase in performance with a 4CPU (2xdual-core) with no irq binding. Both e1000 and tg3 showed similar behavior; Signed-off-by: Jamal Hadi Salim Signed-off-by: David S. Miller diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index e970e8e..95ae119 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -134,34 +134,19 @@ static inline int qdisc_restart(struct net_device *dev) { struct Qdisc *q = dev->qdisc; struct sk_buff *skb; - unsigned lockless; int ret; /* Dequeue packet */ if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) return 0; - /* - * When the driver has LLTX set, it does its own locking in - * start_xmit. These checks are worth it because even uncongested - * locks can be quite expensive. The driver can do a trylock, as - * is being done here; in case of lock contention it should return - * NETDEV_TX_LOCKED and the packet will be requeued. 
- */ - lockless = (dev->features & NETIF_F_LLTX); - - if (!lockless && !netif_tx_trylock(dev)) { - /* Another CPU grabbed the driver tx lock */ - return handle_dev_cpu_collision(skb, dev, q); - } /* And release queue */ spin_unlock(&dev->queue_lock); + HARD_TX_LOCK(dev, smp_processor_id()); ret = dev_hard_start_xmit(skb, dev); - - if (!lockless) - netif_tx_unlock(dev); + HARD_TX_UNLOCK(dev); spin_lock(&dev->queue_lock); q = dev->qdisc; From davem at davemloft.net Mon Oct 8 19:45:17 2007 From: davem at davemloft.net (David Miller) Date: Mon, 08 Oct 2007 19:45:17 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009020442.GA14746@gondor.apana.org.au> References: <20071009020115.GA14635@gondor.apana.org.au> <20071009020318.GA14708@gondor.apana.org.au> <20071009020442.GA14746@gondor.apana.org.au> Message-ID: <20071008.194517.32744598.davem@davemloft.net> From: Herbert Xu Date: Tue, 9 Oct 2007 10:04:42 +0800 > On Tue, Oct 09, 2007 at 10:03:18AM +0800, Herbert Xu wrote: > > On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote: > > > On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote: > > > > > > > > I also want to point out another issue. Any argument wrt. reordering > > > > is specious at best because right now reordering from qdisc to device > > > > happens anyways. > > > > > > This is not true. > > > > > > If your device has a qdisc at all, then you will end up in the > > > function qdisc_restart, where we release the queue lock only > > > after acquiring the TX lock. > > > > > > So right now this path does not create any reordering. > > > > Argh! Someone's just broken this. I think we should restore > > the original behaviour. 
> > Please revert > > commit 41843197b17bdfb1f97af0a87c06d24c1620ba90 > Author: Jamal Hadi Salim > Date: Tue Sep 25 19:27:13 2007 -0700 > > [NET_SCHED]: explict hold dev tx lock > > As this change introduces potential reordering and I don't think > we've discussed this aspect sufficiently. Agreed, and done. From davem at davemloft.net Mon Oct 8 19:46:36 2007 From: davem at davemloft.net (David Miller) Date: Mon, 08 Oct 2007 19:46:36 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <470AE373.9020207@garzik.org> References: <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <470AE373.9020207@garzik.org> Message-ID: <20071008.194636.56034897.davem@davemloft.net> From: Jeff Garzik Date: Mon, 08 Oct 2007 22:12:03 -0400 > I'm interested in working on a load balancer function that approximates > > skb->queue_mapping = smp_processor_id() > > I'd be happy to code and test in that direction, based on your lib. It's the second algorithm that will be available :-) Just add a "% num_tx_queues" to the result. > IMO the net driver really should provide a hint as to what it wants. > > 8139cp and tg3 would probably prefer multiple TX queue behavior to match > silicon behavior -- strict prio. > > And I'll volunteer to write the net driver code for that, if people want > to see how things would look for that type of hardware packet scheduling. Ok. 
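As a rough illustration of the mapping discussed in the reply above (the CPU index taken modulo the number of TX queues), here is a minimal user-space sketch. The function name pick_tx_queue and its plain integer parameters are assumptions made for illustration only; in the kernel the CPU index would come from smp_processor_id() and the result would be assigned to skb->queue_mapping.

```c
#include <assert.h>

/*
 * Hypothetical helper, not kernel code: map the CPU that is transmitting
 * to one of the device's TX queues. Both arguments are plain integers
 * here; the kernel equivalents are smp_processor_id() and the number of
 * TX queues the driver exposes.
 */
static unsigned int pick_tx_queue(unsigned int cpu, unsigned int num_tx_queues)
{
    assert(num_tx_queues > 0);   /* a modulo by zero would be undefined */
    return cpu % num_tx_queues;  /* the "% num_tx_queues" from the reply */
}
```

With 4 TX queues, CPUs 0..3 map to queues 0..3 and CPU 5 wraps around to queue 1, so a device with fewer queues than CPUs still gets a valid index.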
From herbert at gondor.apana.org.au Mon Oct 8 19:46:00 2007 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Tue, 9 Oct 2007 10:46:00 +0800 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071008.194343.52093065.davem@davemloft.net> References: <20071008.184126.124062865.davem@davemloft.net> <20071009020115.GA14635@gondor.apana.org.au> <20071009020318.GA14708@gondor.apana.org.au> <20071008.194343.52093065.davem@davemloft.net> Message-ID: <20071009024600.GA15215@gondor.apana.org.au> On Mon, Oct 08, 2007 at 07:43:43PM -0700, David Miller wrote: > > Right, that's Jamal's recent patch. It looked funny to me too. Hang on Dave. It was too early in the morning for me :) I'd forgotten about the QDISC_RUNNING bit which did what the queue lock did without actually holding the queue lock. So there is no reordering with or without Jamal's patch. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From davem at davemloft.net Mon Oct 8 19:47:06 2007 From: davem at davemloft.net (David Miller) Date: Mon, 08 Oct 2007 19:47:06 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009021620.GA14917@gondor.apana.org.au> References: <20071008.184126.124062865.davem@davemloft.net> <1191896071.4373.156.camel@localhost> <20071009021620.GA14917@gondor.apana.org.au> Message-ID: <20071008.194706.97044591.davem@davemloft.net> From: Herbert Xu Date: Tue, 9 Oct 2007 10:16:20 +0800 > On Mon, Oct 08, 2007 at 10:14:30PM -0400, jamal wrote: > > > > You forgot QDISC_RUNNING Dave;-> the above cant happen. > > Essentially at any one point in time, we are guaranteed that we can have > > multiple cpus enqueueing but only can be dequeueing (the one that > > managed to grab QDISC_RUNNING) i.e multiple producers to the qdisc queue > > but only one consumer. 
Only the dequeuer has access to the txlock. > > Good point. You had me worried for a sec :) > > Dave, Jamal's patch is fine as it is and doesn't actually create > any packet reordering. Ok, then, I'll un-revert. :-) From krkumar2 at in.ibm.com Mon Oct 8 20:09:31 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 9 Oct 2007 08:39:31 +0530 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191849444.4352.29.camel@localhost> Message-ID: J Hadi Salim wrote on 10/08/2007 06:47:24 PM: > two, there should _never_ be any requeueing even if LLTX in the previous > patches when i supported them; if there is, it is a bug. This is because > we dont send more than what the driver asked for via xmit_win. So if it > asked for more than it can handle, that is a bug. If its available space > changes while we are sending to it, that too is a bug. Driver might ask for 10 and we send 10, but LLTX driver might fail to get lock and return TX_LOCKED. I haven't seen your code in greater detail, but don't you requeue in that case too? 
- KK

From kliteyn at mellanox.co.il Mon Oct 8 22:05:29 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 9 Oct 2007 07:05:29 +0200
Subject: [ofa-general] nightly osm_sim report 2007-10-09:normal completion
Message-ID:

OSM Simulation Regression Summary

[Generated mail - please do NOT reply]

OpenSM binary date = 2007-10-08
OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=520 Pass=520 Fail=0

Pass:
39 Stability  IS1-16.topo
39 Pkey       IS1-16.topo
39 OsmTest    IS1-16.topo
39 OsmStress  IS1-16.topo
39 Multicast  IS1-16.topo
39 LidMgr     IS1-16.topo
13 Stability  IS3-loop.topo
13 Stability  IS3-128.topo
13 Pkey       IS3-128.topo
13 OsmTest    IS3-loop.topo
13 OsmTest    IS3-128.topo
13 OsmStress  IS3-128.topo
13 Multicast  IS3-loop.topo
13 Multicast  IS3-128.topo
13 LidMgr     IS3-128.topo
13 FatTree    merge-roots-4-ary-2-tree.topo
13 FatTree    merge-root-4-ary-3-tree.topo
13 FatTree    gnu-stallion-64.topo
13 FatTree    blend-4-ary-2-tree.topo
13 FatTree    RhinoDDR.topo
13 FatTree    FullGnu.topo
13 FatTree    4-ary-2-tree.topo
13 FatTree    2-ary-4-tree.topo
13 FatTree    12-node-spaced.topo
13 FTreeFail  4-ary-2-tree-missing-sw-link.topo
13 FTreeFail  4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail  4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail  4-ary-2-tree-diff-num-pgroups.topo

Failures:

From monisonlists at gmail.com Tue Oct 9 00:24:49 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Tue, 09 Oct 2007 09:24:49 +0200
Subject: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
In-Reply-To: <23084.1191348619@death>
References: <46F7D770.4090500@voltaire.com> <10376.1190733869@death> <470268A2.7080102@gmail.com> <4702773E.4090201@pobox.com> <23084.1191348619@death>
Message-ID: <470B2CC1.1010205@gmail.com>

Jay Vosburgh wrote:
> Jeff Garzik wrote:
>
>> Moni Shoua wrote:
>>> Jay Vosburgh wrote:
>>>> ACK patches 3 - 9.
>>>> >>>> Roland, are you comfortable with the IB changes in patches 1 and 2? >>>> >>>> Jeff, when Roland acks patches 1 and 2, please apply all 9. >>>> >>>> -J >>> Hi Jeff, >>> Roland acked the IPoIB patches. If you haven't done so already can you please apply them? >>> I'm not sure when 2.6.24 is going to open and I'm afraid to miss it. >> hrm, I don't see them in my inbox for some reason. can someone bounce >> them to me? or give me a git tree to pull from? > > Moni, can you repost the patch series to Jeff, and put the > appropriate "Acked-by" lines in for myself (patches 3 - 8) and Roland > (patches 1 and 2)? You can probably leave off the netdev and > openfabrics lists, but cc me. > > -J > > --- > -Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com > Hi Jeff, I don't commits of the patches in http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=summary (I hope that I'm looking in the right place). Did you get them? thanks MoniS From Sumit.Gaur at Sun.COM Tue Oct 9 00:31:36 2007 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Tue, 09 Oct 2007 13:01:36 +0530 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <46A9C633.7040302@Sun.COM> References: <46A9C633.7040302@Sun.COM> Message-ID: <470B2E58.2040509@Sun.COM> Hi, It is regarding *umad_recv* function of libibumad/src/umad.c file. Is it not possible to recv MAD specific to GSI or SMI type. As per my impression if I have two separate threads to send and receive then I could send MADs to different qp 0 or 1 depend on GSI and SMI MAD. But receiving has no control over it. Please suggest if there is any workaround for it. 
Thanks and Regards sumit From krkumar2 at in.ibm.com Tue Oct 9 01:14:38 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 9 Oct 2007 13:44:38 +0530 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: <1191852320.4352.73.camel@localhost> Message-ID: J Hadi Salim wrote on 10/08/2007 07:35:20 PM: > I dont see something from Krishna's approach that i can take and reuse. > This maybe because my old approaches have evolved from the same path. > There is a long list but as a sample: i used to do a lot more work while > holding the queue lock which i have now moved post queue lock; i dont > have any speacial interfaces/tricks just for batching, i provide hints > to the core of how much the driver can take etc etc. I have offered > Krishna co-authorship if he makes the IPOIB driver to work on my > patches, that offer still stands if he chooses to take it. My feeling is that since the approaches are very different, it would be a good idea to test the two for performance. Do you mind me doing that? Ofcourse others and/or you are more than welcome to do the same. I had sent a note to you yesterday about this, please let me know either way. ******************* Previous mail ****************** Hi Jamal, If you don't mind, I am trying to run your approach vs mine to get some results for comparison. For starters, I am having issues with iperf when using your infrastructure code with my IPoIB driver - about 100MB is sent and then everything stops for some reason. The changes in the IPoIB driver that I made to support batching is to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and increase xmit_win on every xmit completion. Is there anything else that is required from the driver? 
thanks, - KK From ogerlitz at voltaire.com Tue Oct 9 01:33:26 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 09 Oct 2007 10:33:26 +0200 Subject: [ofa-general] [PATCH 13 of 17]: add LRO support In-Reply-To: <1191853008.7337.16.camel@mtls03> References: <1189526095.13053.123.camel@mtls03> <470A06F7.3090602@voltaire.com> <1191846661.7337.4.camel@mtls03> <470A27CB.9040403@voltaire.com> <1191853008.7337.16.camel@mtls03> Message-ID: <470B3CD6.2040808@voltaire.com> Eli Cohen wrote: >> Since you have posted the patch, I am asking you if it has any negative >> influence on packet forwarding. >> >> I am not asking you to test it or whether you tested it with forwarding. >> > > The answer is yes since I do not recalculate TCP checksum as I aggregate > the SKBs so the kernel might forward the TCP segment as multiple IP > packets but with wrong TCP checksum (which is that of the first > aggregated packet) but not of the overall aggregated segment. OK, thanks for this clarification. Can you clarify if/how this patch is related to the "lro: Generic Large Receive Offload for TCP traffic" RFC sent on August this year to netdev (eg see http://lwn.net/Articles/244206) ? Assuming LRO is a --pure software-- optimization, what's the rational to put its whole implementation in the ipoib driver and not divide it to general part implemented in the net core and per driver part implemented per device driver that wants to support LRO (if such second part is needed at all)? If I am wrong and their is some LRO assistance from the connectX HW, what is it doing? Or. 
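To illustrate the checksum problem described in the quoted answer above (an aggregated segment that still carries the checksum of its first packet), here is a minimal sketch using an RFC 1071 style 16-bit one's-complement sum. The helper inet_checksum is hypothetical and simplified: it covers only a payload buffer and leaves out the pseudo-header and TCP header fields that a real TCP checksum includes, and it is not taken from the patch under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified helper: 16-bit one's-complement checksum over a
 * byte buffer (RFC 1071 style). Intended for short buffers; a real TCP
 * checksum also covers the pseudo-header and the TCP header.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    /* Add up the buffer as big-endian 16-bit words. */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    /* An odd trailing byte is padded with a zero low byte. */
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For the payload bytes 00 01 00 02 this returns 0xFFFC; once a second payload 00 03 00 04 is appended, the checksum over the merged bytes is 0xFFF5. A merged segment that keeps the first packet's value therefore no longer matches its contents unless the checksum is recomputed.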
From kliteyn at dev.mellanox.co.il Tue Oct 9 02:00:04 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 09 Oct 2007 11:00:04 +0200 Subject: [ofa-general] [PATCH 1/3] osm: QoS- bug in opening policy file Message-ID: <470B4314.1050702@dev.mellanox.co.il> Fixing bug in opening QoS policy file Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_qos_parser.y | 8 +++++--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index e0faaaf..8e9f282 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -50,6 +50,7 @@ #include #include #include +#include #include #include #include @@ -129,6 +130,7 @@ extern char * __qos_parser_text; extern void __qos_parser_error (char *s); extern int __qos_parser_lex (void); extern FILE * __qos_parser_in; +extern int errno; #define RESET_BUFFER __parser_tmp_struct_reset() @@ -1750,13 +1752,13 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn) osm_qos_policy_destroy(p_subn->p_qos_policy); p_subn->p_qos_policy = NULL; - if (!stat(p_subn->opt.qos_policy_file, &statbuf)) { + if (stat(p_subn->opt.qos_policy_file, &statbuf)) { if (strcmp(p_subn->opt.qos_policy_file,OSM_DEFAULT_QOS_POLICY_FILE)) { osm_log(p_qos_parser_osm_log, OSM_LOG_ERROR, "osm_qos_parse_policy_file: ERR AC01: " - "QoS policy file not found (%s)\n", - p_subn->opt.qos_policy_file); + "Failed opening QoS policy file %s - %s\n", + p_subn->opt.qos_policy_file, strerror(errno)); res = 1; } else -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Tue Oct 9 02:00:38 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 09 Oct 2007 11:00:38 +0200 Subject: [ofa-general] [PATCH 2/3] osm: QoS - fixing memory leaks Message-ID: <470B4336.9000207@dev.mellanox.co.il> Fixing bunch of memory leaks and pointer mismatches in QoS. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_qos_parser.l | 16 ++++++++++++---- opensm/opensm/osm_qos_parser.y | 15 ++++++++------- opensm/opensm/osm_qos_policy.c | 21 +++++++++++++-------- 3 files changed, 33 insertions(+), 19 deletions(-) diff --git a/opensm/opensm/osm_qos_parser.l b/opensm/opensm/osm_qos_parser.l index 0b096f8..60b2d1c 100644 --- a/opensm/opensm/osm_qos_parser.l +++ b/opensm/opensm/osm_qos_parser.l @@ -260,33 +260,41 @@ WHITE_DOTDOT_WHITE [ \t]*:[ \t]* - { SAVE_POS; - __qos_parser_lval = strdup(__qos_parser_text); if (in_description || in_list_of_strings || in_single_string) + { + __qos_parser_lval = strdup(__qos_parser_text); return TK_TEXT; + } return TK_DASH; } : { SAVE_POS; - __qos_parser_lval = strdup(__qos_parser_text); if (in_description || in_list_of_strings || in_single_string) + { + __qos_parser_lval = strdup(__qos_parser_text); return TK_TEXT; + } return TK_DOTDOT; } , { SAVE_POS; - __qos_parser_lval = strdup(__qos_parser_text); if (in_description) + { + __qos_parser_lval = strdup(__qos_parser_text); return TK_TEXT; + } return TK_COMMA; } \* { SAVE_POS; - __qos_parser_lval = strdup(__qos_parser_text); if (in_description || in_list_of_strings || in_single_string) + { + __qos_parser_lval = strdup(__qos_parser_text); return TK_TEXT; + } return TK_ASTERISK; } diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index 8e9f282..2405519 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -2105,15 +2105,15 @@ static void __sort_reduce_rangearr( unsigned last_valid_ind = 0; unsigned valid_cnt = 0; uint64_t ** res_arr; - boolean_t * is_valir_arr; + boolean_t * is_valid_arr; *p_res_arr = NULL; *p_res_arr_len = 0; qsort(arr, arr_len, sizeof(uint64_t*), __cmp_num_range); - is_valir_arr = (boolean_t *)malloc(arr_len * sizeof(boolean_t)); - is_valir_arr[last_valid_ind] = TRUE; + is_valid_arr = (boolean_t *)malloc(arr_len * sizeof(boolean_t)); + is_valid_arr[last_valid_ind] = TRUE; 
valid_cnt++; for (i = 1; i < arr_len; i++) { @@ -2123,18 +2123,18 @@ static void __sort_reduce_rangearr( arr[last_valid_ind][1] = arr[i][1]; free(arr[i]); arr[i] = NULL; - is_valir_arr[i] = FALSE; + is_valid_arr[i] = FALSE; } else if ((arr[i][0] - 1) == arr[last_valid_ind][1]) { arr[last_valid_ind][1] = arr[i][1]; free(arr[i]); arr[i] = NULL; - is_valir_arr[i] = FALSE; + is_valid_arr[i] = FALSE; } else { - is_valir_arr[i] = TRUE; + is_valid_arr[i] = TRUE; last_valid_ind = i; valid_cnt++; } @@ -2143,9 +2143,10 @@ static void __sort_reduce_rangearr( res_arr = (uint64_t **)malloc(valid_cnt * sizeof(uint64_t *)); for (i = 0; i < arr_len; i++) { - if (is_valir_arr[i]) + if (is_valid_arr[i]) res_arr[j++] = arr[i]; } + free(is_valid_arr); free(arr); *p_res_arr = res_arr; diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index c84fb8b..51dd7b9 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -101,12 +101,6 @@ static void __free_single_element(void *p_element, void *context) free(p_element); } -static void __free_port_map_element(cl_map_item_t *p_element, void *context) -{ - if (p_element) - free(p_element); -} - /*************************************************** ***************************************************/ @@ -145,6 +139,9 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) { + osm_qos_port_t * p_port; + osm_qos_port_t * p_old_port; + if (!p) return; @@ -157,7 +154,13 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) cl_list_remove_all(&p->port_name_list); cl_list_destroy(&p->port_name_list); - cl_qmap_apply_func(&p->port_map, __free_port_map_element, NULL); + p_port = (osm_qos_port_t *) cl_qmap_head(&p->port_map); + while (p_port != (osm_qos_port_t *) cl_qmap_end(&p->port_map)) + { + p_old_port = p_port; + p_port = (osm_qos_port_t *) cl_qmap_next(&p_port->map_item); + free(p_old_port); + } 
cl_qmap_remove_all(&p->port_map); free(p); @@ -219,7 +222,7 @@ osm_qos_sl2vl_scope_t *osm_qos_policy_sl2vl_scope_create() if (!p) return NULL; - memset(p, 0, sizeof(osm_qos_vlarb_scope_t)); + memset(p, 0, sizeof(osm_qos_sl2vl_scope_t)); cl_list_init(&p->group_list, 10); cl_list_init(&p->across_from_list, 10); @@ -274,6 +277,8 @@ void osm_qos_policy_qos_level_destroy(osm_qos_level_t * p) if (!p) return; + if (p->name) + free(p->name); if (p->use) free(p->use); -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Tue Oct 9 02:01:40 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 09 Oct 2007 11:01:40 +0200 Subject: [ofa-general] [PATCH 3/3] osm: QoS - parsing port names Message-ID: <470B4374.6040502@dev.mellanox.co.il> Added CA-by-name hash to the QoS policy object and as port names are parsed they use this hash to locate that actual port that the name refers to. For now I prefer to keep this hash local, so it's part of QoS policy object. When the same parser will be used for partitions too, this hash will be moved to be part of the subnet object. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_qos_policy.h | 3 +- opensm/opensm/osm_qos_parser.y | 73 +++++++++++++++++++++++++++----- opensm/opensm/osm_qos_policy.c | 36 +++++++++++++--- 3 files changed, 94 insertions(+), 18 deletions(-) diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h index 30c2e6d..5c32896 100644 --- a/opensm/include/opensm/osm_qos_policy.h +++ b/opensm/include/opensm/osm_qos_policy.h @@ -49,6 +49,7 @@ #include #include +#include #include #include #include @@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t { typedef struct _osm_qos_port_group_t { char *name; /* single string (this port group name) */ char *use; /* single string (description) */ - cl_list_t port_name_list; /* list of port names (.../.../...) 
*/ uint8_t node_types; /* node types bitmask */ cl_qmap_t port_map; } osm_qos_port_group_t; @@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t { cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ osm_qos_level_t *p_default_qos_level; /* default QoS level */ osm_subn_t *p_subn; /* osm subnet object */ + st_table * p_ca_hash; /* hash of CAs by node description */ } osm_qos_policy_t; /***************************************************/ diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index 2405519..cf342d3 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -603,23 +603,74 @@ port_group_use_start: TK_USE { port_group_port_name: port_group_port_name_start string_list { /* 'port-name' in 'port-group' - any num of instances */ - cl_list_iterator_t list_iterator; - char * tmp_str; - - list_iterator = cl_list_head(&tmp_parser_struct.str_list); - while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) + cl_list_iterator_t list_iterator; + osm_node_t * p_node; + osm_physp_t * p_physp; + unsigned port_num; + char * name_str; + char * tmp_str; + char * host_str; + char * ca_str; + char * port_str; + char * node_desc = (char*)malloc(IB_NODE_DESCRIPTION_SIZE + 1); + + /* parsing port name strings */ + for (list_iterator = cl_list_head(&tmp_parser_struct.str_list); + list_iterator != cl_list_end(&tmp_parser_struct.str_list); + list_iterator = cl_list_next(list_iterator)) { tmp_str = (char*)cl_list_obj(list_iterator); + if (tmp_str && *tmp_str) + { + name_str = tmp_str; + host_str = strtok (name_str,"/"); + ca_str = strtok (NULL, "/"); + port_str = strtok (NULL, "/"); + + if (!host_str || !(*host_str) || + !ca_str || !(*ca_str) || + !port_str || !(*port_str) || + (port_str[0] != 'p' && port_str[0] != 'P')) { + yyerror("illegal port name"); + free(tmp_str); + free(node_desc); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } - /* - * TODO: parse port name strings - */ + if 
(!(port_num = strtoul(&port_str[1],NULL,0))) { + yyerror("illegal port number in port name"); + free(tmp_str); + free(node_desc); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } - if (tmp_str) - cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); - list_iterator = cl_list_next(list_iterator); + sprintf(node_desc,"%s %s",host_str,ca_str); + free(tmp_str); + + if (st_lookup(p_qos_policy->p_ca_hash, + (st_data_t)node_desc, + (st_data_t*)&p_node)) + { + /* we found the node, now get the right port */ + CL_ASSERT(p_node); + p_physp = osm_node_get_physp_ptr(p_node, port_num); + if (!p_physp) { + yyerror("port number out of range in port name"); + free(tmp_str); + free(node_desc); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } + /* we found the port, now add it to guid table */ + __parser_add_port_to_port_map(&p_current_port_group->port_map, + p_physp); + } + } } cl_list_remove_all(&tmp_parser_struct.str_list); + free(node_desc); } ; diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 51dd7b9..0d7235f 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -59,6 +59,31 @@ /*************************************************** ***************************************************/ +static void +__build_cabyname_hash(osm_qos_policy_t * p_qos_policy) +{ + osm_node_t * p_node; + cl_qmap_t * p_node_guid_tbl = &p_qos_policy->p_subn->node_guid_tbl; + + p_qos_policy->p_ca_hash = st_init_strtable(); + CL_ASSERT(p_qos_policy->p_ca_hash); + + if (!p_node_guid_tbl || !cl_qmap_count(p_node_guid_tbl)) + return; + + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { + if (p_node->node_info.node_type == IB_NODE_TYPE_CA) + st_insert(p_qos_policy->p_ca_hash, + (st_data_t)p_node->print_desc, + (st_data_t)p_node); + } +} + 
+/*************************************************** + ***************************************************/ + static boolean_t __is_num_in_range_arr(uint64_t ** range_arr, unsigned range_arr_len, uint64_t num) @@ -127,8 +152,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() return NULL; memset(p, 0, sizeof(osm_qos_port_group_t)); - - cl_list_init(&p->port_name_list, 10); cl_qmap_init(&p->port_map); return p; @@ -150,10 +173,6 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) if (p->use) free(p->use); - cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); - cl_list_remove_all(&p->port_name_list); - cl_list_destroy(&p->port_name_list); - p_port = (osm_qos_port_t *) cl_qmap_head(&p->port_map); while (p_port != (osm_qos_port_t *) cl_qmap_end(&p->port_map)) { @@ -423,6 +442,8 @@ osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) cl_list_init(&p_qos_policy->qos_match_rules, 10); p_qos_policy->p_subn = p_subn; + __build_cabyname_hash(p_qos_policy); + return p_qos_policy; } @@ -495,6 +516,9 @@ void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy) cl_list_remove_all(&p_qos_policy->qos_match_rules); cl_list_destroy(&p_qos_policy->qos_match_rules); + if (p_qos_policy->p_ca_hash) + st_free_table(p_qos_policy->p_ca_hash); + free(p_qos_policy); p_qos_policy = NULL; -- 1.5.1.4 From vlad at lists.openfabrics.org Tue Oct 9 02:58:44 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 9 Oct 2007 02:58:44 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071009-0200 daily build status Message-ID: <20071009095847.B8EEAE6085B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod 
--with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.22
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5

Failed:
Build failed on x86_64 with linux-2.6.9-22.ELsmp
Log: Applying patch libiscsi_no_flush_to_2_6_9.patch patching file drivers/scsi/libiscsi.c Hunk #1 FAILED at 1225. Hunk #2 succeeded at 1640 (offset 32 lines). Hunk #3 FAILED at 1784. 2 out of 3 hunks FAILED -- rejects in file drivers/scsi/libiscsi.c Patch libiscsi_no_flush_to_2_6_9.patch does not apply (enforce with -f) Failed executing /usr/bin/quilt ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.19 Log: /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.19_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.19' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.18 Log: /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' 
/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.18_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.18' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.17 Log: /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.17_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.17_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.17_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.17' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.16 
Log: /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:936: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:939: error: invalid type argument of '->' /home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.c:940: error: invalid type argument of '->' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.16_powerpc_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.16_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071009-0200_linux-2.6.16_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.16' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From eli at mellanox.co.il Tue Oct 9 03:09:33 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 09 Oct 2007 12:09:33 +0200 Subject: [ofa-general] [PATCH 13 of 17]: add LRO support In-Reply-To: <470B3CD6.2040808@voltaire.com> References: <1189526095.13053.123.camel@mtls03> <470A06F7.3090602@voltaire.com> <1191846661.7337.4.camel@mtls03> <470A27CB.9040403@voltaire.com> <1191853008.7337.16.camel@mtls03> <470B3CD6.2040808@voltaire.com> Message-ID: <1191924573.7337.35.camel@mtls03> > Can you clarify if/how this patch is related to the "lro: Generic Large > Receive Offload for TCP traffic" RFC sent on August this year to netdev > (eg see http://lwn.net/Articles/244206) ? I referred to mtnic driver when I made this patch which referred to other code examples, possibly from this one too. 
> > Assuming LRO is a --pure software-- optimization, what's the rational to > put its whole implementation in the ipoib driver and not divide it to > general part implemented in the net core and per driver part implemented > per device driver that wants to support LRO (if such second part is > needed at all)? It is a pure software optimization but it relies on the HW to report whether the checksum of the packet is valid or not in order for it to be liable for aggregation. I think it would be good however if the kernel would support this and take this from the specific drivers. > > If I am wrong and their is some LRO assistance from the connectX HW, > what is it doing? > > Or. > > From krkumar2 at in.ibm.com Tue Oct 9 03:58:27 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 9 Oct 2007 16:28:27 +0530 Subject: [ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: Message-ID: Hi Peter, "Waskiewicz Jr, Peter P" wrote on 10/09/2007 04:03:42 AM: > > true, that needs some resolution. Heres a hand-waving thought: > > Assuming all packets of a specific map end up in the same > > qdiscn queue, it seems feasible to ask the qdisc scheduler to > > give us enough packages (ive seen people use that terms to > > refer to packets) for each hardware ring's available space. > > With the patches i posted, i do that via > > dev->xmit_win that assumes only one view of the driver; essentially a > > single ring. > > If that is doable, then it is up to the driver to say "i have > > space for 5 in ring[0], 10 in ring[1] 0 in ring[2]" based on > > what scheduling scheme the driver implements - the dev->blist > > can stay the same. Its a handwave, so there may be issues > > there and there could be better ways to handle this. > > > > Note: The other issue that needs resolving that i raised > > earlier was in regards to multiqueue running on multiple cpus > > servicing different rings concurently. 
> > I can see the qdisc being modified to send batches per queue_mapping. > This shouldn't be too difficult, and if we had the xmit_win per queue > (in the subqueue struct like Dave pointed out). I hope my understanding of multiqueue is correct for this mail to make sense :-) Isn't it enough that the multiqueue+batching drivers handle skbs belonging to different queue's themselves, instead of qdisc having to figure that out? This will reduce costs for most skbs that are neither batched nor sent to multiqueue devices. Eg, driver can keep processing skbs and put to the correct tx_queue as long as mapping remains the same. If the mapping changes, it posts earlier skbs (with the correct lock) and then iterates for the other skbs that have the next different mapping, and so on. (This is required only if driver is supposed to transmit >1 skb in one call, otherwise it is not an issue) Alternatively, supporting drivers could return a different code on mapping change, like: NETDEV_TX_MAPPING_CHANGED (for batching only) so that qdisc_run() could retry. Would that work? Secondly having xmit_win per queue: would it help in multiple skb case? Currently there is no way to tell qdisc to dequeue skbs from a particular band - it returns skb from highest priority band. thanks, - KK From davem at davemloft.net Tue Oct 9 04:02:55 2007 From: davem at davemloft.net (David Miller) Date: Tue, 09 Oct 2007 04:02:55 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: References: Message-ID: <20071009.040255.71088090.davem@davemloft.net> From: Krishna Kumar2 Date: Tue, 9 Oct 2007 16:28:27 +0530 > Isn't it enough that the multiqueue+batching drivers handle skbs > belonging to different queue's themselves, instead of qdisc having > to figure that out? This will reduce costs for most skbs that are > neither batched nor sent to multiqueue devices. 
> > Eg, driver can keep processing skbs and put to the correct tx_queue > as long as mapping remains the same. If the mapping changes, it posts > earlier skbs (with the correct lock) and then iterates for the other > skbs that have the next different mapping, and so on. The complexity in most of these suggestions is beginning to drive me a bit crazy :-) This should be the simplest thing in the world, when TX queue has space, give it packets. Period. When I hear suggestions like "have the driver pick the queue in ->hard_start_xmit() and return some special status if the queue becomes different"..... you know, I really begin to wonder :-) If we have to go back, get into the queueing layer locks, have these special cases, and whatnot, what's the point? This code should eventually be able to run lockless all the way to the TX queue handling code of the driver. The queueing code should know what TX queue the packet will be bound for, and always precisely invoke the driver in a state where the driver can accept the packet. Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. From krkumar2 at in.ibm.com Tue Oct 9 04:20:14 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 9 Oct 2007 16:50:14 +0530 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009.040255.71088090.davem@davemloft.net> Message-ID: Hi Dave, David Miller wrote on 10/09/2007 04:32:55 PM: > > Isn't it enough that the multiqueue+batching drivers handle skbs > > belonging to different queue's themselves, instead of qdisc having > > to figure that out? This will reduce costs for most skbs that are > > neither batched nor sent to multiqueue devices. > > > > Eg, driver can keep processing skbs and put to the correct tx_queue > > as long as mapping remains the same. If the mapping changes, it posts > > earlier skbs (with the correct lock) and then iterates for the other > > skbs that have the next different mapping, and so on. 
> > The complexity in most of these suggestions is beginning to drive me a > bit crazy :-) > > This should be the simplest thing in the world, when TX queue has > space, give it packets. Period. > > When I hear suggestions like "have the driver pick the queue in > ->hard_start_xmit() and return some special status if the queue > becomes different"..... you know, I really begin to wonder :-) > > If we have to go back, get into the queueing layer locks, have these > special cases, and whatnot, what's the point? I understand your point, but the qdisc code itself needs almost no change, as small as:

qdisc_restart()
{
	...
	case NETDEV_TX_MAPPING_CHANGED:
		/*
		 * Driver sent some skbs from one mapping, and found others
		 * are for different queue_mapping. Try again.
		 */
		ret = 1; /* guaranteed to have at least 1 skb in batch list */
		break;
	...
}

Alternatively if the driver does all the dirty work, qdisc needs no change at all. However, I am not sure if this addresses all the concerns raised by you, Peter, Jamal, others. > This code should eventually be able to run lockless all the way to the > TX queue handling code of the driver. The queueing code should know > what TX queue the packet will be bound for, and always precisely > invoke the driver in a state where the driver can accept the packet. This sounds like a good idea :) I need to think more on this, esp as my batching sends multiple skbs of possibly different mappings to device, and those skbs stay in batch list if driver couldn't send them out. thanks, - KK From krkumar2 at in.ibm.com Tue Oct 9 04:21:14 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 9 Oct 2007 16:51:14 +0530 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009.040255.71088090.davem@davemloft.net> Message-ID: David Miller wrote on 10/09/2007 04:32:55 PM: > Ignore LLTX, it sucks, it was a big mistake, and we will get rid of > it. Great, this will make life easy. Any idea how long that would take?
It seems simple enough to do. thanks, - KK From davem at davemloft.net Tue Oct 9 04:24:41 2007 From: davem at davemloft.net (David Miller) Date: Tue, 09 Oct 2007 04:24:41 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: References: <20071009.040255.71088090.davem@davemloft.net> Message-ID: <20071009.042441.30182968.davem@davemloft.net> From: Krishna Kumar2 Date: Tue, 9 Oct 2007 16:51:14 +0530 > David Miller wrote on 10/09/2007 04:32:55 PM: > > > Ignore LLTX, it sucks, it was a big mistake, and we will get rid of > > it. > > Great, this will make life easy. Any idea how long that would take? > It seems simple enough to do. I'd say we can probably try to get rid of it in 2.6.25, this is assuming we get driver authors to cooperate and do the conversions or alternatively some other motivated person. I can just threaten to do them all and that should get the driver maintainers going :-) From hrosenstock at xsigo.com Tue Oct 9 04:43:26 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 09 Oct 2007 04:43:26 -0700 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <470B2E58.2040509@Sun.COM> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM> Message-ID: <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-10-09 at 13:01 +0530, Sumit Gaur - Sun Microsystem wrote: > Hi, > > It is regarding *umad_recv* function of libibumad/src/umad.c file. Is it not > possible to recv MAD specific to GSI or SMI type. As per my impression if I have > two separate threads to send and receive then I could send MADs to different qp > 0 or 1 depend on GSI and SMI MAD. But receiving has no control over it. Please > suggest if there is any workaround for it. See umad_register(). 
-- Hal > > Thanks and Regards > sumit > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jeff at garzik.org Tue Oct 9 05:44:25 2007 From: jeff at garzik.org (Jeff Garzik) Date: Tue, 09 Oct 2007 08:44:25 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009.042441.30182968.davem@davemloft.net> References: <20071009.040255.71088090.davem@davemloft.net> <20071009.042441.30182968.davem@davemloft.net> Message-ID: <470B77A9.600@garzik.org> David Miller wrote: > From: Krishna Kumar2 > Date: Tue, 9 Oct 2007 16:51:14 +0530 > >> David Miller wrote on 10/09/2007 04:32:55 PM: >> >>> Ignore LLTX, it sucks, it was a big mistake, and we will get rid of >>> it. >> Great, this will make life easy. Any idea how long that would take? >> It seems simple enough to do. > > I'd say we can probably try to get rid of it in 2.6.25, this is > assuming we get driver authors to cooperate and do the conversions > or alternatively some other motivated person. > > I can just threaten to do them all and that should get the driver > maintainers going :-) What, like this? :) Jeff -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: patch URL: From herbert at gondor.apana.org.au Tue Oct 9 05:55:13 2007 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Tue, 9 Oct 2007 20:55:13 +0800 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <470B77A9.600@garzik.org> References: <20071009.040255.71088090.davem@davemloft.net> <20071009.042441.30182968.davem@davemloft.net> <470B77A9.600@garzik.org> Message-ID: <20071009125513.GA19650@gondor.apana.org.au> On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote: > David Miller wrote: > > > >I can just threaten to do them all and that should get the driver > >maintainers going :-) > > What, like this? :) Awesome :) -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From jeff at garzik.org Tue Oct 9 06:00:10 2007 From: jeff at garzik.org (Jeff Garzik) Date: Tue, 09 Oct 2007 09:00:10 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009125513.GA19650@gondor.apana.org.au> References: <20071009.040255.71088090.davem@davemloft.net> <20071009.042441.30182968.davem@davemloft.net> <470B77A9.600@garzik.org> <20071009125513.GA19650@gondor.apana.org.au> Message-ID: <470B7B5A.7050307@garzik.org> Herbert Xu wrote: > On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote: >> David Miller wrote: >>> I can just threaten to do them all and that should get the driver >>> maintainers going :-) >> What, like this? :) > > Awesome :) Note my patch is just to get the maintainers going.
:) I'm not going to commit that, since I don't have any way to test any of the drivers I touched (but I wouldn't scream if it appeared in net-2.6.24 either) Jeff From hadi at cyberus.ca Tue Oct 9 06:10:02 2007 From: hadi at cyberus.ca (jamal) Date: Tue, 09 Oct 2007 09:10:02 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: References: Message-ID: <1191935402.4373.172.camel@localhost> On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote: > Driver might ask for 10 and we send 10, but LLTX driver might fail to get > lock and return TX_LOCKED. I haven't seen your code in greater detail, but > don't you requeue in that case too? For others drivers that are non-batching and LLTX, it is possible - at the moment in my patch i whine that the driver is buggy. I will fix this up so it checks for NETIF_F_BTX. Thanks for pointing the above use case. cheers, jamal From hadi at cyberus.ca Tue Oct 9 06:25:42 2007 From: hadi at cyberus.ca (jamal) Date: Tue, 09 Oct 2007 09:25:42 -0400 Subject: [ofa-general] Re: [PATCHES] TX batching In-Reply-To: References: Message-ID: <1191936342.4373.187.camel@localhost> On Tue, 2007-09-10 at 13:44 +0530, Krishna Kumar2 wrote: > My feeling is that since the approaches are very different, My concern is the approaches are different only for short periods of time. For example, I do requeueing, have xmit_win, have ->end_xmit, do batching from core etc; if you see value in any of these concepts, they will appear in your patches and this goes on a loop. Perhaps what we need is a referee and use our energies in something more positive. > it would be a good idea to test the two for performance. Which i dont mind as long as it has an analysis that goes with it. If all you post is "heres what netperf showed", it is not useful at all. There are also a lot of affecting variables. For example, is the receiver a bottleneck? 
To make it worse, I could demonstrate to you that if i slowed down the driver and allowed more packets to queue up on the qdisc, batching will do well. In the past my feeling is you glossed over such details and i am sucker for things like that - hence the conflict. > Do you mind me doing > that? Ofcourse others and/or you are more than welcome to do the same. > > I had sent a note to you yesterday about this, please let me know > either way. > I responded to you - but it may have been lost in the noise; heres a copy: http://marc.info/?l=linux-netdev&m=119185137124008&w=2 cheers, jamal From jlentini at netapp.com Tue Oct 9 06:44:02 2007 From: jlentini at netapp.com (James Lentini) Date: Tue, 9 Oct 2007 09:44:02 -0400 (EDT) Subject: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. In-Reply-To: <470AA729.2050009@opengridcomputing.com> References: <46B883B5.8040702@opengridcomputing.com> <46BB61D0.4090101@opengridcomputing.com> <46BB89C0.4040303@ichips.intel.com> <20070809.145534.102938208.davem@davemloft.net> <470AA729.2050009@opengridcomputing.com> Message-ID: On Mon, 8 Oct 2007, Steve Wise wrote: > The correct solution, IMO, is to enhance the core low level 4-tuple > allocation services to be more generic (eg: not be tied to a struct > sock). Then the host tcp stack and the host rdma stack can allocate > TCP/iWARP ports/4tuples from this common exported service and share > the port space. This allocation service could also be used by other > deep adapters like iscsi adapters if needed. As a developer of an RDMA ULP, NFS-RDMA, I like this approach because it will simplify the configuration of an RDMA device and the services that use it. 
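A user-space illustration of why a shared port space matters; it is not the kernel change discussed in this thread. Once one socket owns a TCP port, the host stack refuses a second bind to the same address and port with EADDRINUSE. An RDMA stack that allocates ports from its own private space gets no such protection, which is what a common allocation service would fix. The helper name `second_bind_is_refused` is hypothetical.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, user space only. Returns 1 if a second bind() to a
 * port that is already owned fails with EADDRINUSE, 0 if it unexpectedly
 * succeeds or fails for another reason, and -1 on setup errors.
 */
int second_bind_is_refused(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int first, second, result = -1;

    first = socket(AF_INET, SOCK_STREAM, 0);
    second = socket(AF_INET, SOCK_STREAM, 0);
    if (first < 0 || second < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;                  /* let the stack pick a free port */

    if (bind(first, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    /* Read back which port the host stack allocated. */
    if (getsockname(first, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    /* Same address and port again: the host port space rejects it. */
    if (bind(second, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        result = (errno == EADDRINUSE);
    else
        result = 0;
out:
    if (first >= 0)
        close(first);
    if (second >= 0)
        close(second);
    return result;
}
```

Calling `second_bind_is_refused()` from a small `main()` and checking for a return value of 1 is enough to see the conflict detection that two unshared port spaces would lack.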
From swise at opengridcomputing.com Tue Oct 9 07:15:32 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 09 Oct 2007 09:15:32 -0500 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: <47090708.6060604@opengridcomputing.com> Message-ID: <470B8D04.30401@opengridcomputing.com> Roland Dreier wrote: > > No mention about the iwarp port space issue? > > I don't think we're at a stage where I'm prepared to merge something-- > we all agree the latest patch has serious drawbacks, and it commits us > to a suboptimal interface that is userspace-visible. > Fair enough. > > I'm at a loss as to how to proceed. > > Could we try to do some cleanups to the net core to make the alias > stuff less painful? eg is there any sane way to make it possible for > a device that creates 'eth0' to also create an 'iw0' alias without an > assigning an address? > Well, "alias" interfaces really don't exist. ethX:iw is really just adding a address record (struct in_ifaddr) to ethX. So in the current core design, adding an alias without an address is really adding the alias with address 0.0.0.0. And I think the core net code assumes if an in_ifaddr struct exists for a device, then its IP address is indeed valid. So I think the changes wouldn't be small to enhance the design to add a concept of an alias interface. I'll look into this more though. Steve. 
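The point about aliases can be pictured with a toy model. This is a sketch under stated assumptions, not kernel code: `toy_netdev`, `toy_ifaddr`, `toy_add_alias` and `toy_count_unaddressed` are hypothetical stand-ins for the real `struct net_device` and `struct in_ifaddr` handling. It shows that an "alias" is only a labelled address record hanging off the one device, so an alias "without an address" ends up as a record holding 0.0.0.0.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins, NOT the kernel's struct net_device / struct in_ifaddr. */
struct toy_ifaddr {
    char label[32];             /* e.g. "eth0" or "eth0:iw" */
    uint32_t address;           /* IPv4 address; 0 means 0.0.0.0 */
    struct toy_ifaddr *next;
};

struct toy_netdev {
    char name[16];
    struct toy_ifaddr *addrs;   /* every "alias" lives in this one list */
};

/* Adding an alias is just prepending an address record to the device. */
int toy_add_alias(struct toy_netdev *dev, const char *suffix, uint32_t address)
{
    struct toy_ifaddr *rec = calloc(1, sizeof(*rec));

    if (!rec)
        return -1;
    snprintf(rec->label, sizeof(rec->label), "%s:%s", dev->name, suffix);
    rec->address = address;
    rec->next = dev->addrs;
    dev->addrs = rec;
    return 0;
}

/* Count records whose address is 0.0.0.0, i.e. aliases "without" an address. */
int toy_count_unaddressed(const struct toy_netdev *dev)
{
    const struct toy_ifaddr *rec;
    int n = 0;

    for (rec = dev->addrs; rec; rec = rec->next)
        if (rec->address == 0)
            n++;
    return n;
}
```

In this model, `toy_add_alias(&dev, "iw", 0)` on a device named "eth0" creates a record labelled "eth0:iw" with address 0.0.0.0. Any code that treats every record in the list as a valid address would then see 0.0.0.0 as one, which is the difficulty described above.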
From krause at cup.hp.com Tue Oct 9 07:59:17 2007 From: krause at cup.hp.com (Michael Krause) Date: Tue, 09 Oct 2007 07:59:17 -0700 Subject: [ofa-general] Re: parallel networking In-Reply-To: <470ADF15.2090100@garzik.org> References: <20071007.215124.85709188.davem@davemloft.net> <1191850490.4352.41.camel@localhost> <470A3D24.3050803@garzik.org> <20071008.141154.107706003.davem@davemloft.net> <470ADF15.2090100@garzik.org> Message-ID: <6.2.0.14.2.20071009073934.02539930@esmail.cup.hp.com> At 06:53 PM 10/8/2007, Jeff Garzik wrote: >David Miller wrote: >>From: Jeff Garzik >>Date: Mon, 08 Oct 2007 10:22:28 -0400 >> >>>In terms of overall parallelization, both for TX as well as RX, my gut >>>feeling is that we want to move towards an MSI-X, multi-core friendly >>>model where packets are LIKELY to be sent and received by the same set >>>of [cpus | cores | packages | nodes] that the [userland] processes >>>dealing with the data. >>The problem is that the packet schedulers want global guarantees >>on packet ordering, not flow centric ones. >>That is the issue Jamal is concerned about. > >Oh, absolutely. > >I think, fundamentally, any amount of cross-flow resource management done >in software is an obstacle to concurrency. > >That's not a value judgement, just a statement of fact. Correct. >"traffic cops" are intentional bottlenecks we add to the process, to >enable features like priority flows, filtering, or even simple socket >fairness guarantees. Each of those bottlenecks serves a valid purpose, >but at the end of the day, it's still a bottleneck. > >So, improving concurrency may require turning off useful features that >nonetheless hurt concurrency. Software needs to get out of the main data path - another fact of life. >>The more I think about it, the more inevitable it seems that we really >>might need multiple qdiscs, one for each TX queue, to pull this full >>parallelization off. >>But the semantics of that don't smell so nice either. 
If the user >>attaches a new qdisc to "ethN", does it go to all the TX queues, or >>what? >>All of the traffic shaping technology deals with the device as a unary >>object. It doesn't fit to multi-queue at all. > >Well the easy solutions to networking concurrency are > >* use virtualization to carve up the machine into chunks > >* use multiple net devices > >Since new NIC hardware is actively trying to be friendly to >multi-channel/virt scenarios, either of these is reasonably >straightforward given the current state of the Linux net stack. Using >multiple net devices is especially attractive because it works very well >with the existing packet scheduling. > >Both unfortunately impose a burden on the developer and admin, to force >their apps to distribute flows across multiple [VMs | net devs]. Not the most optimal approach. >The third alternative is to use a single net device, with SMP-friendly >packet scheduling. Here you run into the problems you described "device >as a unary object" etc. with the current infrastructure. > >With multiple TX rings, consider that we are pushing the packet scheduling >from software to hardware... which implies >* hardware-specific packet scheduling >* some TC/shaping features not available, because hardware doesn't support it For a number of years now, we have designed interconnects to support a reasonable range of arbitration capabilities among hardware resource sets. With reasonable classification by software to identify a hardware resource sets (ideally interpretation of the application's view of its priority combined with policy management software that determines how that should map among competing application views), one can eliminate most of the CPU cycles spent into today's implementations. I and others presented a number of these concepts many years ago during the development which eventually led to IB and iWARP. 
- Each resource set can be assigned to a unique PCIe function or a function group to enable function / group arbitration to the PCIe link. - Each resource set can be assigned to a unique PCIe TC and with improved ordering hints (coming soon) can be used to eliminate false ordering dependencies. - Each resource set can be assigned to a unique IB TC / SL or iWARP 802.1p to signal priority. These can then be used to program respective link arbitration as well as path selection to enable multi-path load balancing. - Many IHV have picked up on the arbitration capabilities and extended them as shown years ago by a number of us to enable resource set arbitration and a variety of QoS based policies. If software defines a reasonable (i.e. small) number of management and control knobs, then these can be easily mapped to most h/w implementations. Some of us are working on how to do this for virtualized environments and I expect these to be applicable to all environments in the end. One other key item to keep in mind is that unless there is contention in the system, the majority of the QoS mechanisms are meaningless and in a very large percentage of customer environments, they simply don't scale with device and interconnect performance. Many applications in fact remain processor / memory constrained and therefore do not stress the I/O subsystem or the external interconnects making most of the software mechanisms rather moot in real customer environments. Simple truth is it is nearly always cheaper to over-provision the I/O / interconnects than to use the software approach which while quite applicable in many environments for the 1 Gbps and below speeds, generally has less meaning / value in the 10 moving to 40 moving to 100 Gbps environments. 
Does not really matter whether one believes in protocol off-load or protocol on-load, the interconnects will be able to handle all commercial workloads and perhaps all but the most extreme HPC (even there one might contend that any software intermediary would be discarded in favor of reducing OS / kernel overhead from the main data path). This isn't to say that software has no role to play only that role needs to shift from main data path overhead to one of policy shaping and programming of h/w based arbitration. This will hold true for both virtualized and non-virtualized environments. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jim.Langston at Sun.COM Tue Oct 9 08:13:08 2007 From: Jim.Langston at Sun.COM (Jim Langston) Date: Tue, 09 Oct 2007 11:13:08 -0400 Subject: [ofa-general] SDP ? Message-ID: <470B9A84.9000008@sun.com>

Hi all,

I'm working on porting SDP to OpenSolaris and am looking at a compile error that I get. Essentially, I have a conflict of types on the compile:

bash-3.00$ /opt/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.. -g -D_POSIX_PTHREAD_SEMANTICS -DSYSCONFDIR=\"/usr/local/etc\" -g -D_POSIX_PTHREAD_SEMANTICS -c port.c -KPIC -DPIC -o .libs/port.o
"port.c", line 1896: identifier redeclared: getsockname
        current : function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to unsigned int) returning int
        previous: function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to void) returning int : "/usr/include/sys/socket.h", line 436

Line 436 in /usr/include/sys/socket.h

extern int getsockname(int, struct sockaddr *_RESTRICT_KYWD, Psocklen_t);

and Psocklen_t

#if defined(_XPG4_2) || defined(_BOOT)
typedef socklen_t *_RESTRICT_KYWD Psocklen_t;
#else
typedef void *_RESTRICT_KYWD Psocklen_t;
#endif /* defined(_XPG4_2) || defined(_BOOT) */

Do I need to change port.c getsockname to type void * ?
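The header excerpt above shows that the third parameter is `socklen_t *` only when `_XPG4_2` or `_BOOT` is defined, and `void *` otherwise. One possible way to keep a single source building in both cases is to mirror that choice in a typedef and use it in the function's own signature. This is a minimal sketch, not the libsdp code: the names `compat_socklen_ptr`, `compat_getsockname` and `demo_local_port` are hypothetical, and it assumes the compiler predefines `__sun` on (Open)Solaris.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical typedef: follow the system header's choice so that the
 * third parameter matches the system prototype of getsockname().
 * Assumption: __sun is predefined when building on (Open)Solaris.
 */
#if defined(__sun) && !defined(_XPG4_2) && !defined(_BOOT)
typedef void *compat_socklen_ptr;
#else
typedef socklen_t *compat_socklen_ptr;
#endif

/* Hypothetical wrapper: forwards to the real getsockname(). */
int compat_getsockname(int fd, struct sockaddr *name, compat_socklen_ptr namelen)
{
    return getsockname(fd, name, namelen);
}

/* Bind to an ephemeral port and return the port reported, or -1 on error. */
int demo_local_port(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int port = -1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;                  /* let the stack pick a free port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        compat_getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
        port = ntohs(addr.sin_port);
    close(fd);
    return port;
}
```

Callers keep passing the address of a `socklen_t`; only the declared parameter type changes with the platform, so the definition no longer conflicts with whichever prototype the header selects.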
Thanks, Jim From rdreier at cisco.com Tue Oct 9 08:16:34 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 08:16:34 -0700 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: <47083E38.2050005@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Sat, 06 Oct 2007 19:02:32 -0700") References: <47083E38.2050005@linux.vnet.ibm.com> Message-ID: > Roland, I submitted an updated patch incorporating some of Sean's comments within > a day or two. Rest of comments pertained to restructuring the code and adding > some additional module parameters. > > This would require more discussions since some of these had been already discussed > previously. We had decided upon this code structure after a lot of discussions and > incorporating these would be undoing some of that. Can you give a link to your current final version of the patch? Sean, what's your opinion of where we stand? Since module parameters create a userspace-visible interface that we are stuck with for a long time, we definitely have to get at least that much right before merging. - R.
URL: From rdreier at cisco.com Tue Oct 9 08:43:24 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 08:43:24 -0700 Subject: [ofa-general] librdmacm feature request References: <1191767680.19888.310.camel@firewall.xsintricity.com> <470A632D.1050001@ichips.intel.com> <1191894507.19888.360.camel@firewall.xsintricity.com> Message-ID: > It shouldn't be too hard. Assuming you handle the modify channel as a > synchronous action, the thread calling modify channel can't also be in > rdma_get_cm_event at the same time. So, if you get there and someone is > blocking on that channel and just hasn't been scheduled to run yet, then > leave the event where it is while you switch the channel and send new > events to the new channel. If they aren't then move any pending events > to the new channel as you do the change. Hmm, how do you move events? Keep in mind that there may be an arbitrary number of pending events that belong to other cm_ids that are queued before the events you want to move. And you can't really do anything too funky with the event channel fd, because you don't want to mess up some other thread that might be waiting for events in poll() or whatever. - R. From dotanb at dev.mellanox.co.il Tue Oct 9 08:48:02 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 9 Oct 2007 17:48:02 +0200 Subject: [ofa-general] [PATCH] core: Check that the function reg_phys_mr is not NULL before executing it Message-ID: <200710091748.02776.dotanb@dev.mellanox.co.il> Check that the function reg_phys_mr is not NULL before executing it. There are devices (for example: mlx4) that their low level driver doesn't support this verb, so this patch will prevent kernel oops on them. 
Signed-off-by: Dotan Barak --- diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 86ed8af..e2d54cb 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -672,6 +672,9 @@ struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, { struct ib_mr *mr; + if (!pd->device->reg_phys_mr) + return ERR_PTR(-ENOSYS); + mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf, mr_access_flags, iova_start); From amiller at vims.edu Tue Oct 9 08:56:03 2007 From: amiller at vims.edu (Adam Miller) Date: Tue, 09 Oct 2007 11:56:03 -0400 Subject: [ofa-general] RLIMIT_MEMLOCK Message-ID: <470BA493.9040501@vims.edu> We have run into this problem with using mpiexec. SLES 10 is on the cluster and we have set the limits under /etc/security/limits.conf and they work there, even when we run mpirun commands work fine but when tying them all in using mpiexec it still comes back with the 32K limit in memory. Any and all users can log in and in bash type "ulimit -a" and tcsh type "limit" and both state the correct full memory limits, but when using mpiexec under both shells they get the 32k limit. Any suggestions? thanks -- Adam Miller The College of William and Mary Virginia Institute of Marine Science -Infrastructure Services Architect- -Information Technology and Networking Services- Watermens Hall Mail: P.O.
Box 1346 Deliveries: Route 1208, Greate Road Gloucester Point, VA 23062-1346, USA p(804)684-7077 f(804)684-7097 email: amiller at vims.edu email cell: amiller at vtext.com From vlad at dev.mellanox.co.il Tue Oct 9 08:57:08 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 09 Oct 2007 17:57:08 +0200 Subject: [ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3 In-Reply-To: <470A9363.4010007@opengridcomputing.com> References: <470A9363.4010007@opengridcomputing.com> Message-ID: <470BA4D4.3080707@dev.mellanox.co.il> Steve Wise wrote: > Vlad/Tziporet, > > Can you please pull version 1.0.3 of libcxgb3 for inclusion in > ofed-1.2.5 and ofed-1.3? It contains a bug fix for older kernels like > RHEL4U4. You can use the master branch for both releases: > > git://git.openfabrics.org/~swise/libcxgb3.git master > > Also, please update the spec file you're using to reflect the release > (1.0.3). The spec file in the libcxgb3 git tree should be correct. > > > Thanks, > > Steve. > Done, Regards, Vladimir From swise at opengridcomputing.com Tue Oct 9 09:15:56 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 09 Oct 2007 11:15:56 -0500 Subject: [ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3 In-Reply-To: <470BA4D4.3080707@dev.mellanox.co.il> References: <470A9363.4010007@opengridcomputing.com> <470BA4D4.3080707@dev.mellanox.co.il> Message-ID: <470BA93C.3010601@opengridcomputing.com> Thanks Vlad, Can you crank a ofed-1.2.5 development build too? Thanks, Steve. Vladimir Sokolovsky wrote: > Steve Wise wrote: >> Vlad/Tziporet, >> >> Can you please pull version 1.0.3 of libcxgb3 for inclusion in >> ofed-1.2.5 and ofed-1.3? It contains a bug fix for older kernels >> like RHEL4U4. You can use the master branch for both releases: >> >> git://git.openfabrics.org/~swise/libcxgb3.git master >> >> Also, please update the spec file you're using to reflect the release >> (1.0.3).
The spec file in the libcxgb3 git tree should be correct. >> >> >> Thanks, >> >> Steve. >> > > Done, > > Regards, > Vladimir From andi at firstfloor.org Tue Oct 9 09:51:51 2007 From: andi at firstfloor.org (Andi Kleen) Date: 09 Oct 2007 18:51:51 +0200 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071008.184126.124062865.davem@davemloft.net> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> Message-ID: David Miller writes: > > 2) Switch the default qdisc away from pfifo_fast to a new DRR fifo > with load balancing using the code in #1. I think this is kind > of in the territory of what Peter said he is working on. Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path. > I know this is controversial, but realistically I doubt users > benefit at all from the prioritization that pfifo provides. I agree. For most interfaces the priority is probably dubious. Even for DSL the prioritization will be likely usually done in a router these days. Also for the fast interfaces where we do TSO priority doesn't work very well anyways -- with large packets there is not too much to prioritize. > 3) Work on discovering a way to make the locking on transmit as > localized to the current thread of execution as possible. Things > like RCU and statistic replication, techniques we use widely > elsewhere in the stack, begin to come to mind. If the data is just passed on to the hardware queue, why is any locking needed at all? 
(except for the driver locking of course) -Andi From mshefty at ichips.intel.com Tue Oct 9 10:18:37 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Oct 2007 10:18:37 -0700 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: <47083E38.2050005@linux.vnet.ibm.com> Message-ID: <470BB7ED.7070007@ichips.intel.com> > Can you give a link to your current final version of the patch? > > Sean, what's your opinion of where we stand? Let me look back over the last version that was sent and reply back later today or tomorrow. Several of my initial comments were on code structure. > Since module parameters create a userspace-visible interface that we > are stuck with for a long time, we definitely have to get at least > that much right before merging. I was taking a slightly different view of the design. It would be nice to agree on whether SRQ should be separated from the QP type before merging upstream, even if the implementation doesn't immediately support all available options. - Sean From vlad at dev.mellanox.co.il Tue Oct 9 10:22:35 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 09 Oct 2007 19:22:35 +0200 Subject: [ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3 In-Reply-To: <470BA93C.3010601@opengridcomputing.com> References: <470A9363.4010007@opengridcomputing.com> <470BA4D4.3080707@dev.mellanox.co.il> <470BA93C.3010601@opengridcomputing.com> Message-ID: <470BB8DB.8090107@dev.mellanox.co.il> Steve Wise wrote: > Thanks Vlad, > > Can you crank a ofed-1.2.5 development build too? > > Thanks, > > Steve. 
> Done: http://www.openfabrics.org/builds/connectx/OFED-1.2.5-20071009-0955.tgz Regards, Vladimir From sean.hefty at intel.com Tue Oct 9 11:20:22 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 9 Oct 2007 11:20:22 -0700 Subject: [ofa-general] [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests Message-ID: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com> Deadlock condition reported by Kanoj Sarcar The deadlock occurs when a connection request arrives at the same time that a wildcard listen is being destroyed. A wildcard listen maintains per device listen requests for each RDMA device in the system. The per device listens are automatically added and removed when RDMA devices are inserted or removed from the system. When a wildcard listen is destroyed, rdma_destroy_id() acquires the rdma_cm's device mutex ('lock') to protect against hot-plug events adding or removing per device listens. It then tries to destroy the per device listens by calling ib_destroy_cm_id() or iw_destroy_cm_id(). It does this while holding the device mutex. However, if the underlying iw/ib CM reports a connection request while this is occurring, the rdma_cm callback function will try to acquire the same device mutex. Since we're in a callback, the ib_destroy_cm_id() or iw_destroy_cm_id() calls will block until their callback thread returns, but the callback is blocked waiting for the device mutex. Fix this by re-working how per device listens are destroyed. Use rdma_destroy_id(), which avoids the deadlock, in place of cma_destroy_listen(). Additional synchronization is added to handle device hot-plug events and ensure that the id is not destroyed twice. Signed-off-by: Sean Hefty --- Fix from discussion started at: http://lists.openfabrics.org/pipermail/general/2007-October/041456.html Kanoj, please verify that this fix looks correct and works for you, and I will queue for 2.6.24. 
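The locking rule described in the commit message above can be tried out in user space. What follows is a minimal sketch under stated assumptions, not the rdma_cm code: every struct and function name is invented for illustration, a pthread mutex stands in for the rdma_cm device mutex, and the "callback" runs synchronously inside the destroy call to model a destroy that cannot return until a callback needing the same lock has finished. An error-checking mutex is used so that holding the lock across the destroy would show up as a failed lock attempt rather than a hang.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Every name here is illustrative; none of this is rdma_cm code. */
struct listen_id {
    struct listen_id *next;
};

struct wildcard {
    pthread_mutex_t lock;       /* plays the role of the device mutex */
    struct listen_id *listens;  /* per-device listens */
    int destroyed;              /* listens torn down so far */
    int callback_blocked;       /* times the callback could not lock */
};

/* Stand-in for a connection-request callback: it needs the lock. */
static void cm_callback(struct wildcard *w)
{
    /* With an error-checking mutex, relocking by the owner fails
     * (EDEADLK) instead of hanging, so the bug becomes countable. */
    if (pthread_mutex_lock(&w->lock) != 0) {
        w->callback_blocked++;
        return;
    }
    pthread_mutex_unlock(&w->lock);
}

/* Stand-in for a destroy call that waits for the callback to finish. */
static void destroy_listen(struct wildcard *w, struct listen_id *l)
{
    cm_callback(w);
    free(l);
}

/* Unlink under the lock, destroy with the lock dropped. */
static void cancel_listens(struct wildcard *w)
{
    pthread_mutex_lock(&w->lock);
    while (w->listens) {
        struct listen_id *l = w->listens;

        w->listens = l->next;            /* whoever unlinks owns l */
        pthread_mutex_unlock(&w->lock);  /* never held across destroy */
        destroy_listen(w, l);
        pthread_mutex_lock(&w->lock);
        w->destroyed++;
    }
    pthread_mutex_unlock(&w->lock);
}

/* Returns the number of listens destroyed, or -1 if a callback blocked. */
static int run_cancel_demo(int n)
{
    struct wildcard w = { .listens = NULL };
    pthread_mutexattr_t attr;
    int i;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&w.lock, &attr);
    pthread_mutexattr_destroy(&attr);

    for (i = 0; i < n; i++) {
        struct listen_id *l = malloc(sizeof(*l));

        assert(l != NULL);
        l->next = w.listens;
        w.listens = l;
    }
    cancel_listens(&w);
    pthread_mutex_destroy(&w.lock);
    return w.callback_blocked ? -1 : w.destroyed;
}
```

Build with -pthread. Moving the destroy_listen() call inside the locked region makes run_cancel_demo() return -1, which is the user-space analogue of the reported deadlock.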
drivers/infiniband/core/cma.c | 70 +++++++++++++---------------------------- 1 files changed, 23 insertions(+), 47 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 9ffb998..21ea92c 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -113,11 +113,12 @@ struct rdma_id_private { struct rdma_bind_list *bind_list; struct hlist_node node; - struct list_head list; - struct list_head listen_list; + struct list_head list; /* listen_any_list or cma_device.list */ + struct list_head listen_list; /* per device listens */ struct cma_device *cma_dev; struct list_head mc_list; + int internal_id; enum cma_state state; spinlock_t lock; struct completion comp; @@ -715,50 +716,27 @@ static void cma_cancel_route(struct rdma_id_private *id_priv) } } -static inline int cma_internal_listen(struct rdma_id_private *id_priv) -{ - return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev && - cma_any_addr(&id_priv->id.route.addr.src_addr); -} - -static void cma_destroy_listen(struct rdma_id_private *id_priv) -{ - cma_exch(id_priv, CMA_DESTROYING); - - if (id_priv->cma_dev) { - switch (rdma_node_get_transport(id_priv->id.device->node_type)) { - case RDMA_TRANSPORT_IB: - if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) - ib_destroy_cm_id(id_priv->cm_id.ib); - break; - case RDMA_TRANSPORT_IWARP: - if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) - iw_destroy_cm_id(id_priv->cm_id.iw); - break; - default: - break; - } - cma_detach_from_dev(id_priv); - } - list_del(&id_priv->listen_list); - - cma_deref_id(id_priv); - wait_for_completion(&id_priv->comp); - - kfree(id_priv); -} - static void cma_cancel_listens(struct rdma_id_private *id_priv) { struct rdma_id_private *dev_id_priv; + /* + * Remove from listen_any_list to prevent added devices from spawning + * additional listen requests. 
+ */ mutex_lock(&lock); list_del(&id_priv->list); while (!list_empty(&id_priv->listen_list)) { dev_id_priv = list_entry(id_priv->listen_list.next, struct rdma_id_private, listen_list); - cma_destroy_listen(dev_id_priv); + /* sync with device removal to avoid duplicate destruction */ + list_del_init(&dev_id_priv->list); + list_del(&dev_id_priv->listen_list); + mutex_unlock(&lock); + + rdma_destroy_id(&dev_id_priv->id); + mutex_lock(&lock); } mutex_unlock(&lock); } @@ -846,6 +824,9 @@ void rdma_destroy_id(struct rdma_cm_id *id) cma_deref_id(id_priv); wait_for_completion(&id_priv->comp); + if (id_priv->internal_id) + cma_deref_id(id_priv->id.context); + kfree(id_priv->id.route.path_rec); kfree(id_priv); } @@ -1401,14 +1382,13 @@ static void cma_listen_on_dev(struct rdma_id_private *id_priv, cma_attach_to_dev(dev_id_priv, cma_dev); list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list); + atomic_inc(&id_priv->refcount); + dev_id_priv->internal_id = 1; ret = rdma_listen(id, id_priv->backlog); if (ret) - goto err; - - return; -err: - cma_destroy_listen(dev_id_priv); + printk(KERN_WARNING "RDMA CMA: cma_listen_on_dev, error %d, " + "listening on device %s", ret, cma_dev->device->name); } static void cma_listen_on_all(struct rdma_id_private *id_priv) @@ -2729,16 +2709,12 @@ static void cma_process_remove(struct cma_device *cma_dev) id_priv = list_entry(cma_dev->id_list.next, struct rdma_id_private, list); - if (cma_internal_listen(id_priv)) { - cma_destroy_listen(id_priv); - continue; - } - + list_del(&id_priv->listen_list); list_del_init(&id_priv->list); atomic_inc(&id_priv->refcount); mutex_unlock(&lock); - ret = cma_remove_id_dev(id_priv); + ret = id_priv->internal_id ? 1 : cma_remove_id_dev(id_priv); cma_deref_id(id_priv); if (ret) rdma_destroy_id(&id_priv->id); From weiny2 at llnl.gov Tue Oct 9 11:23:40 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 9 Oct 2007 11:23:40 -0700 Subject: [ofa-general] Has libmlx4 been released? 
Message-ID: <20071009112340.0719ea4e.weiny2@llnl.gov> looking at git://git.kernel.org/pub/scm/libs/infiniband/libmlx4.git I don't see any tags or branches. If not, when is the initial release planned? Thanks, Ira From shemminger at linux-foundation.org Tue Oct 9 11:22:25 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Tue, 9 Oct 2007 11:22:25 -0700 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> Message-ID: <20071009112225.5f9756e7@freepuppy.rosehill> On 09 Oct 2007 18:51:51 +0200 Andi Kleen wrote: > David Miller writes: > > > > 2) Switch the default qdisc away from pfifo_fast to a new DRR fifo > > with load balancing using the code in #1. I think this is kind > > of in the territory of what Peter said he is working on. > > Hopefully that new qdisc will just use the TX rings of the hardware > directly. They are typically large enough these days. That might avoid > some locking in this critical path. > > > I know this is controversial, but realistically I doubt users > > benefit at all from the prioritization that pfifo provides. > > I agree. For most interfaces the priority is probably dubious. > Even for DSL the prioritization will be likely usually done in a router > these days. > > Also for the fast interfaces where we do TSO priority doesn't work > very well anyways -- with large packets there is not too much > to prioritize. > > > 3) Work on discovering a way to make the locking on transmit as > > localized to the current thread of execution as possible. Things > > like RCU and statistic replication, techniques we use widely > > elsewhere in the stack, begin to come to mind. > > If the data is just passed on to the hardware queue, why is any > locking needed at all? 
(except for the driver locking of course) > > -Andi I wonder about the whole idea of queueing in general at such high speeds. Given the normal bi-modal distribution of packets, and the predominance of 1500 byte MTU; does it make sense to even have any queueing in software at all? -- Stephen Hemminger From andi at firstfloor.org Tue Oct 9 11:30:27 2007 From: andi at firstfloor.org (Andi Kleen) Date: Tue, 9 Oct 2007 20:30:27 +0200 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009112225.5f9756e7@freepuppy.rosehill> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <20071009112225.5f9756e7@freepuppy.rosehill> Message-ID: <20071009183027.GA552@one.firstfloor.org> > I wonder about the whole idea of queueing in general at such high speeds. > Given the normal bi-modal distribution of packets, and the predominance > of 1500 byte MTU; does it make sense to even have any queueing in software > at all? Yes that is my point -- it should just pass it through directly and the driver can then put it into the different per CPU (or per whatever) queues managed by the hardware. The only thing the qdisc needs to do is to set some bit that says "it is ok to put this into difference queues; don't need strict ordering" Otherwise if the drivers did that unconditionally they might cause problems with other qdiscs. This would also require that the driver exports some hint to the upper layer on how large its internal queues are. A device with a short queue would still require pfifo_fast. Long queue devices could just pass through. That again could be a single flag. 
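The two single-bit hints described above can be written down as a small decision function. This is only a sketch of the proposal as worded in this message; the flag names, structs and functions are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; none exist in the kernel. */
#define QDISC_F_REORDER_OK 0x1u  /* qdisc: strict ordering not required */
#define NETDEV_F_LONG_TXQ  0x1u  /* driver hint: hardware TX queues are long */

struct sketch_qdisc {
    unsigned int flags;
};

struct sketch_netdev {
    unsigned int flags;
    unsigned int num_tx_rings;
};

enum tx_path {
    TX_SOFTWARE_QUEUE,  /* short hardware queue: keep a software fifo */
    TX_PASS_THROUGH     /* long hardware queue: hand packets straight down */
};

/* A device with a short queue still needs software queueing. */
static enum tx_path choose_tx_path(const struct sketch_netdev *dev)
{
    assert(dev != NULL);
    return (dev->flags & NETDEV_F_LONG_TXQ) ? TX_PASS_THROUGH
                                            : TX_SOFTWARE_QUEUE;
}

/*
 * The driver may spread packets over per-CPU rings only when the qdisc
 * has said that reordering is acceptable; otherwise it stays on ring 0.
 */
static unsigned int pick_tx_ring(const struct sketch_qdisc *q,
                                 const struct sketch_netdev *dev,
                                 unsigned int cpu)
{
    assert(q != NULL && dev != NULL);
    if (!(q->flags & QDISC_F_REORDER_OK) || dev->num_tx_rings == 0)
        return 0;
    return cpu % dev->num_tx_rings;
}
```

Keeping each hint to a single bit matches the "single flag" wording above; a real interface would presumably need to say how long "long" is.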
-Andi From kanoj at netxen.com Tue Oct 9 11:40:00 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Tue, 09 Oct 2007 11:40:00 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests In-Reply-To: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com> Message-ID: <470BCB00.1040702@netxen.com> Sean, I will take a look at your code changes and comment, and hopefully be able to run a quick test on your patch within this week. Just so I understand, did you discover problems (maybe preexisting race conditions) with my previously posted patch? If yes, please point it out, so its easier to review yours; if not, I will assume your patch implements a better locking scheme and review it as such. Thanks. Kanoj Sean Hefty wrote: >Deadlock condition reported by Kanoj Sarcar >The deadlock occurs when a connection request arrives at the same >time that a wildcard listen is being destroyed. > >A wildcard listen maintains per device listen requests for each >RDMA device in the system. The per device listens are automatically >added and removed when RDMA devices are inserted or removed from >the system. > >When a wildcard listen is destroyed, rdma_destroy_id() acquires >the rdma_cm's device mutex ('lock') to protect against hot-plug >events adding or removing per device listens. It then tries to >destroy the per device listens by calling ib_destroy_cm_id() or >iw_destroy_cm_id(). It does this while holding the device mutex. > >However, if the underlying iw/ib CM reports a connection request >while this is occurring, the rdma_cm callback function will try >to acquire the same device mutex. Since we're in a callback, >the ib_destroy_cm_id() or iw_destroy_cm_id() calls will block until >their callback thread returns, but the callback is blocked waiting for >the device mutex. > >Fix this by re-working how per device listens are destroyed. 
Use >rdma_destroy_id(), which avoids the deadlock, in place of >cma_destroy_listen(). Additional synchronization is added >to handle device hot-plug events and ensure that the id is not destroyed >twice. > >Signed-off-by: Sean Hefty >--- >Fix from discussion started at: >http://lists.openfabrics.org/pipermail/general/2007-October/041456.html > >Kanoj, please verify that this fix looks correct and works for you, and I >will queue for 2.6.24. > > drivers/infiniband/core/cma.c | 70 +++++++++++++---------------------------- > 1 files changed, 23 insertions(+), 47 deletions(-) > >diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c >index 9ffb998..21ea92c 100644 >--- a/drivers/infiniband/core/cma.c >+++ b/drivers/infiniband/core/cma.c >@@ -113,11 +113,12 @@ struct rdma_id_private { > > struct rdma_bind_list *bind_list; > struct hlist_node node; >- struct list_head list; >- struct list_head listen_list; >+ struct list_head list; /* listen_any_list or cma_device.list */ >+ struct list_head listen_list; /* per device listens */ > struct cma_device *cma_dev; > struct list_head mc_list; > >+ int internal_id; > enum cma_state state; > spinlock_t lock; > struct completion comp; >@@ -715,50 +716,27 @@ static void cma_cancel_route(struct rdma_id_private *id_priv) > } > } > >-static inline int cma_internal_listen(struct rdma_id_private *id_priv) >-{ >- return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev && >- cma_any_addr(&id_priv->id.route.addr.src_addr); >-} >- >-static void cma_destroy_listen(struct rdma_id_private *id_priv) >-{ >- cma_exch(id_priv, CMA_DESTROYING); >- >- if (id_priv->cma_dev) { >- switch (rdma_node_get_transport(id_priv->id.device->node_type)) { >- case RDMA_TRANSPORT_IB: >- if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) >- ib_destroy_cm_id(id_priv->cm_id.ib); >- break; >- case RDMA_TRANSPORT_IWARP: >- if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) >- iw_destroy_cm_id(id_priv->cm_id.iw); >- break; >- default: >- break; 
>- } >- cma_detach_from_dev(id_priv); >- } >- list_del(&id_priv->listen_list); >- >- cma_deref_id(id_priv); >- wait_for_completion(&id_priv->comp); >- >- kfree(id_priv); >-} >- > static void cma_cancel_listens(struct rdma_id_private *id_priv) > { > struct rdma_id_private *dev_id_priv; > >+ /* >+ * Remove from listen_any_list to prevent added devices from spawning >+ * additional listen requests. >+ */ > mutex_lock(&lock); > list_del(&id_priv->list); > > while (!list_empty(&id_priv->listen_list)) { > dev_id_priv = list_entry(id_priv->listen_list.next, > struct rdma_id_private, listen_list); >- cma_destroy_listen(dev_id_priv); >+ /* sync with device removal to avoid duplicate destruction */ >+ list_del_init(&dev_id_priv->list); >+ list_del(&dev_id_priv->listen_list); >+ mutex_unlock(&lock); >+ >+ rdma_destroy_id(&dev_id_priv->id); >+ mutex_lock(&lock); > } > mutex_unlock(&lock); > } >@@ -846,6 +824,9 @@ void rdma_destroy_id(struct rdma_cm_id *id) > cma_deref_id(id_priv); > wait_for_completion(&id_priv->comp); > >+ if (id_priv->internal_id) >+ cma_deref_id(id_priv->id.context); >+ > kfree(id_priv->id.route.path_rec); > kfree(id_priv); > } >@@ -1401,14 +1382,13 @@ static void cma_listen_on_dev(struct rdma_id_private *id_priv, > > cma_attach_to_dev(dev_id_priv, cma_dev); > list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list); >+ atomic_inc(&id_priv->refcount); >+ dev_id_priv->internal_id = 1; > > ret = rdma_listen(id, id_priv->backlog); > if (ret) >- goto err; >- >- return; >-err: >- cma_destroy_listen(dev_id_priv); >+ printk(KERN_WARNING "RDMA CMA: cma_listen_on_dev, error %d, " >+ "listening on device %s", ret, cma_dev->device->name); > } > > static void cma_listen_on_all(struct rdma_id_private *id_priv) >@@ -2729,16 +2709,12 @@ static void cma_process_remove(struct cma_device *cma_dev) > id_priv = list_entry(cma_dev->id_list.next, > struct rdma_id_private, list); > >- if (cma_internal_listen(id_priv)) { >- cma_destroy_listen(id_priv); >- continue; >- } >- 
>+ list_del(&id_priv->listen_list); > list_del_init(&id_priv->list); > atomic_inc(&id_priv->refcount); > mutex_unlock(&lock); > >- ret = cma_remove_id_dev(id_priv); >+ ret = id_priv->internal_id ? 1 : cma_remove_id_dev(id_priv); > cma_deref_id(id_priv); > if (ret) > rdma_destroy_id(&id_priv->id); > > > > From pradeeps at linux.vnet.ibm.com Tue Oct 9 11:49:20 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 09 Oct 2007 11:49:20 -0700 Subject: [ofa-general] Updated InfiniBand/RDMA merge plans for 2.6.24 In-Reply-To: References: <47083E38.2050005@linux.vnet.ibm.com> Message-ID: <470BCD30.3050705@linux.vnet.ibm.com> Roland Dreier wrote: > > Roland, I submitted an updated patch incorporating some of Sean's comments within > > a day or two. Rest of comments pertained to restructuring the code and adding > > some additional module parameters. > > > > This would require more discussions since some of these had been already discussed > > previously. We had decided upon this code structure after a lot of discussions and > > incorporating these would be undoing some of that. > > Can you give a link to your current final version of the patch? > Roland, This is the link to the last one that I submitted on 09/18. http://lists.openfabrics.org/pipermail/general/2007-September/040917.html Pradeep From peter.p.waskiewicz.jr at intel.com Tue Oct 9 11:48:45 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Tue, 9 Oct 2007 11:48:45 -0700 Subject: [ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <470AE373.9020207@garzik.org> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <470AE373.9020207@garzik.org> Message-ID: > IMO the net driver really should provide a hint as to what it wants. > > 8139cp and tg3 would probably prefer multiple TX queue > behavior to match silicon behavior -- strict prio. 
If I understand what you just said, I disagree. If your hardware is running strict prio, you don't want to enforce strict prio in the qdisc layer; performing two layers of QoS is excessive, and may lead to results you don't want. The reason I added the DRR qdisc is for the Si that has its own queueing strategy that is not RR. For Si that implements RR (like e1000), you can either use the DRR qdisc, or if you want to prioritize your flows, use PRIO. -PJ Waskiewicz peter.p.waskiewicz.jr at intel.com From rdreier at cisco.com Tue Oct 9 11:54:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 11:54:35 -0700 Subject: [ofa-general] Has libmlx4 been released? In-Reply-To: <20071009112340.0719ea4e.weiny2@llnl.gov> (Ira Weiny's message of "Tue, 9 Oct 2007 11:23:40 -0700") References: <20071009112340.0719ea4e.weiny2@llnl.gov> Message-ID: > looking at git://git.kernel.org/pub/scm/libs/infiniband/libmlx4.git > > I don't see any tags or branches. That's right, I haven't made any real release yet. > If not, when is the initial release planned? Soon I guess. I don't know of any outstanding issues so it's just a matter of doing a release. - R. From jeff at garzik.org Tue Oct 9 12:04:23 2007 From: jeff at garzik.org (Jeff Garzik) Date: Tue, 09 Oct 2007 15:04:23 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <470AE373.9020207@garzik.org> Message-ID: <470BD0B7.4070607@garzik.org> Waskiewicz Jr, Peter P wrote: >> IMO the net driver really should provide a hint as to what it wants. >> >> 8139cp and tg3 would probably prefer multiple TX queue >> behavior to match silicon behavior -- strict prio. > > If I understand what you just said, I disagree. 
If your hardware is > running strict prio, you don't want to enforce strict prio in the qdisc > layer; performing two layers of QoS is excessive, and may lead to > results you don't want. The reason I added the DRR qdisc is for the Si > that has its own queueing strategy that is not RR. For Si that > implements RR (like e1000), you can either use the DRR qdisc, or if you > want to prioritize your flows, use PRIO. A misunderstanding, I think. To my brain, DaveM's item #2 seemed to assume/require the NIC hardware to balance fairly across hw TX rings, which seemed to preclude the 8139cp/tg3 style of strict-prio hardware. That's what I was responding to. As long as there is some modular way to fit 8139cp/tg3 style multi-TX into our universe, I'm happy :) Jeff From peter.p.waskiewicz.jr at intel.com Tue Oct 9 12:07:25 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Tue, 9 Oct 2007 12:07:25 -0700 Subject: [ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <470BD0B7.4070607@garzik.org> References: <1191886845.4373.138.camel@localhost> <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <470AE373.9020207@garzik.org> <470BD0B7.4070607@garzik.org> Message-ID: > A misunderstanding, I think. > > To my brain, DaveM's item #2 seemed to assume/require the NIC > hardware to balance fairly across hw TX rings, which seemed > to preclude the > 8139cp/tg3 style of strict-prio hardware. That's what I was > responding to. > > As long as there is some modular way to fit 8139cp/tg3 style > multi-TX into our universe, I'm happy :) Ah hah. Yes, a misunderstanding on my part. Thanks for the clarification. Methinks more caffeine is required for today... 
-PJ From mshefty at ichips.intel.com Tue Oct 9 12:21:09 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Oct 2007 12:21:09 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests In-Reply-To: <470BCB00.1040702@netxen.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com> <470BCB00.1040702@netxen.com> Message-ID: <470BD4A5.40902@ichips.intel.com> > Just so I understand, did you discover problems (maybe preexisting race > conditions) with my previously posted patch? If yes, please point it > out, so its easier to review yours; if not, I will assume your patch > implements a better locking scheme and review it as such. I tried to explain the issue somewhat in my change commit and code comments. The issue is synchronizing cleanup of the listen_list with device removal. When an RDMA device is added to the system, a new listen request is added for all wildcard listens. Since the original locking held the mutex throughout the cleanup of the listen list, it prevented adding another listen request during that same time. Similar protection was there for handling device removal. When a device is removed from the system, all internal listen requests associated with that device are destroyed. If the associated wildcard listen is also being destroyed, we need to ensure that we don't try to destroy the same listen twice. My patch, like yours, ends up releasing the mutex while cleaning up the listen_list. I chose to eliminate the cma_destroy_listen() call, and use rdma_destroy_id() as a single destruction path instead. This keeps the locking contained to a single function. (I don't like acquiring a lock in one call and releasing it in another. It puts too much assumption on the caller.) What was missing was ensuring that a device removal didn't try to destroy the same listen request. This is handled by adding the list_del*() calls to cma_cancel_listens().
Whichever thread removes the listening id from the device list is responsible for its destruction. And because that thread could be the device removal thread, I added a reference from the per device listen to the wildcard listen. Hopefully this makes sense. - Sean From arthur.jones at qlogic.com Tue Oct 9 12:59:14 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 12:59:14 -0700 Subject: [ofa-general] [PATCH] IB/ipath -- patches for 2.6.24 Message-ID: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> hi roland, here is our current batch of patches. i realize that they are a bit later than you would probably like, i'm sorry about that -- i hope they are straightforward enough to make it into your for-2.6.24 branch. these patches can be git pulled from: git://git.qlogic.com/ipath-linux-2.6 for-roland arthur From arthur.jones at qlogic.com Tue Oct 9 12:59:20 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 12:59:20 -0700 Subject: [ofa-general] [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009195920.7151.4573.stgit@eng-46.internal.keyresearch.com> On iba6110 rev4, support for three more IB counters was added. The LocalLinkIntegrityError counter, the ExcessiveBufferOverrunErrors counter and support for error counting of flow control packets on an invalid VL. These counters trigger GPIO interrupts and the sw keeps track of the counts. Since we also use GPIO interrupts to signal packet reception, we need to turn off the fast interrupts, or we risk losing a GPIO interrupt.
Signed-off-by: Arthur Jones --- drivers/infiniband/hw/ipath/ipath_iba6110.c | 8 ++++++++ drivers/infiniband/hw/ipath/ipath_intr.c | 4 ++-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c index 650745d..e1c5998 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c @@ -1559,6 +1559,14 @@ static int ipath_ht_early_init(struct ipath_devdata *dd) ipath_dev_err(dd, "Unsupported InfiniPath serial " "number %.16s!\n", dd->ipath_serial); + if (dd->ipath_minrev >= 4) { + /* Rev4+ reports extra errors via internal GPIO pins */ + dd->ipath_flags |= IPATH_GPIO_ERRINTRS; + dd->ipath_gpio_mask |= IPATH_GPIO_ERRINTR_MASK; + ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, + dd->ipath_gpio_mask); + } + return 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index b29fe7e..11b3614 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -1085,8 +1085,8 @@ irqreturn_t ipath_intr(int irq, void *data) * GPIO_2 indicates (on some HT4xx boards) that a packet * has arrived for Port 0. Checking for this * is controlled by flag IPATH_GPIO_INTR. - * GPIO_3..5 on IBA6120 Rev2 chips indicate errors - * that we need to count. Checking for this + * GPIO_3..5 on IBA6120 Rev2 and IBA6110 Rev4 chips indicate + * errors that we need to count. Checking for this * is controlled by flag IPATH_GPIO_ERRINTRS. 
*/ u32 gpiostatus; From arthur.jones at qlogic.com Tue Oct 9 12:59:25 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 12:59:25 -0700 Subject: [ofa-general] [PATCH 02/23] IB/ipath - performance optimization for CPU differences In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009195925.7151.65317.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell Different processors have different ordering restrictions for write combining. By taking advantage of this, we can eliminate some write barriers when writing to the send buffers. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_diag.c | 22 +++++----- drivers/infiniband/hw/ipath/ipath_iba6120.c | 2 + drivers/infiniband/hw/ipath/ipath_kernel.h | 2 + drivers/infiniband/hw/ipath/ipath_verbs.c | 62 ++++++++++++++++----------- 4 files changed, 53 insertions(+), 35 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_diag.c b/drivers/infiniband/hw/ipath/ipath_diag.c index cf25cda..4137c77 100644 --- a/drivers/infiniband/hw/ipath/ipath_diag.c +++ b/drivers/infiniband/hw/ipath/ipath_diag.c @@ -446,19 +446,21 @@ static ssize_t ipath_diagpkt_write(struct file *fp, dd->ipath_unit, plen - 1, pbufn); if (dp.pbc_wd == 0) - /* Legacy operation, use computed pbc_wd */ dp.pbc_wd = plen; - - /* we have to flush after the PBC for correctness on some cpus - * or WC buffer can be written out of order */ writeq(dp.pbc_wd, piobuf); - ipath_flush_wc(); - /* copy all by the trigger word, then flush, so it's written + /* + * Copy all by the trigger word, then flush, so it's written * to chip before trigger word, then write trigger word, then - * flush again, so packet is sent. */ - __iowrite32_copy(piobuf + 2, tmpbuf, clen - 1); - ipath_flush_wc(); - __raw_writel(tmpbuf[clen - 1], piobuf + clen + 1); + * flush again, so packet is sent. 
+ */ + if (dd->ipath_flags & IPATH_PIO_FLUSH_WC) { + ipath_flush_wc(); + __iowrite32_copy(piobuf + 2, tmpbuf, clen - 1); + ipath_flush_wc(); + __raw_writel(tmpbuf[clen - 1], piobuf + clen + 1); + } else + __iowrite32_copy(piobuf + 2, tmpbuf, clen); + ipath_flush_wc(); ret = sizeof(dp); diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index 5b6ac9a..a324c6f 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -1273,6 +1273,8 @@ static void ipath_pe_tidtemplate(struct ipath_devdata *dd) static int ipath_pe_early_init(struct ipath_devdata *dd) { dd->ipath_flags |= IPATH_4BYTE_TID; + if (ipath_unordered_wc()) + dd->ipath_flags |= IPATH_PIO_FLUSH_WC; /* * For openfabrics, we need to be able to handle an IB header of diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 7a7966f..d983f92 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -724,6 +724,8 @@ int ipath_set_rx_pol_inv(struct ipath_devdata *dd, u8 new_pol_inv); #define IPATH_LINKACTIVE 0x200 /* link current state is unknown */ #define IPATH_LINKUNK 0x400 + /* Write combining flush needed for PIO */ +#define IPATH_PIO_FLUSH_WC 0x1000 /* no IB cable, or no device on IB cable */ #define IPATH_NOCABLE 0x4000 /* Supports port zero per packet receive interrupts via diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 16aa61f..559d4a6 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -631,7 +631,7 @@ static inline u32 clear_upper_bytes(u32 data, u32 n, u32 off) #endif static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state *ss, - u32 length) + u32 length, unsigned flush_wc) { u32 extra = 0; u32 data = 0; @@ -757,11 +757,14 @@ static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state 
*ss, } /* Update address before sending packet. */ update_sge(ss, length); - /* must flush early everything before trigger word */ - ipath_flush_wc(); - __raw_writel(last, piobuf); - /* be sure trigger word is written */ - ipath_flush_wc(); + if (flush_wc) { + /* must flush early everything before trigger word */ + ipath_flush_wc(); + __raw_writel(last, piobuf); + /* be sure trigger word is written */ + ipath_flush_wc(); + } else + __raw_writel(last, piobuf); } /** @@ -776,6 +779,7 @@ int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, u32 *hdr, u32 len, struct ipath_sge_state *ss) { u32 __iomem *piobuf; + unsigned flush_wc; u32 plen; int ret; @@ -799,47 +803,55 @@ int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, * or WC buffer can be written out of order. */ writeq(plen, piobuf); - ipath_flush_wc(); piobuf += 2; + + flush_wc = dd->ipath_flags & IPATH_PIO_FLUSH_WC; if (len == 0) { /* * If there is just the header portion, must flush before * writing last word of header for correctness, and after * the last header word (trigger word). */ - __iowrite32_copy(piobuf, hdr, hdrwords - 1); - ipath_flush_wc(); - __raw_writel(hdr[hdrwords - 1], piobuf + hdrwords - 1); - ipath_flush_wc(); - ret = 0; - goto bail; + if (flush_wc) { + ipath_flush_wc(); + __iowrite32_copy(piobuf, hdr, hdrwords - 1); + ipath_flush_wc(); + __raw_writel(hdr[hdrwords - 1], piobuf + hdrwords - 1); + ipath_flush_wc(); + } else + __iowrite32_copy(piobuf, hdr, hdrwords); + goto done; } + if (flush_wc) + ipath_flush_wc(); __iowrite32_copy(piobuf, hdr, hdrwords); piobuf += hdrwords; /* The common case is aligned and contained in one segment. */ if (likely(ss->num_sge == 1 && len <= ss->sge.length && !((unsigned long)ss->sge.vaddr & (sizeof(u32) - 1)))) { - u32 w; + u32 dwords; u32 *addr = (u32 *) ss->sge.vaddr; /* Update address before sending packet. */ update_sge(ss, len); /* Need to round up for the last dword in the packet. 
*/ - w = (len + 3) >> 2; - __iowrite32_copy(piobuf, addr, w - 1); - /* must flush early everything before trigger word */ - ipath_flush_wc(); - __raw_writel(addr[w - 1], piobuf + w - 1); - /* be sure trigger word is written */ - ipath_flush_wc(); - ret = 0; - goto bail; + dwords = (len + 3) >> 2; + if (flush_wc) { + __iowrite32_copy(piobuf, addr, dwords - 1); + /* must flush early everything before trigger word */ + ipath_flush_wc(); + __raw_writel(addr[dwords - 1], piobuf + dwords - 1); + /* be sure trigger word is written */ + ipath_flush_wc(); + } else + __iowrite32_copy(piobuf, addr, dwords); + goto done; } - copy_io(piobuf, ss, len); + copy_io(piobuf, ss, len, flush_wc); +done: ret = 0; - bail: return ret; } From arthur.jones at qlogic.com Tue Oct 9 12:59:30 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 12:59:30 -0700 Subject: [ofa-general] [PATCH 03/23] IB/ipath - change UD to queue work requests like RC & UC In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009195930.7151.83770.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell The code to post UD sends tried to process work requests at the time ib_post_send() is called without using a WQE queue. This was fine as long as HW resources were available for sending a packet. This patch changes UD to be handled more like RC and UC and shares more code. 
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_qp.c | 11 - drivers/infiniband/hw/ipath/ipath_rc.c | 61 +++-- drivers/infiniband/hw/ipath/ipath_ruc.c | 308 ++++++++---------------- drivers/infiniband/hw/ipath/ipath_uc.c | 77 ++---- drivers/infiniband/hw/ipath/ipath_ud.c | 372 ++++++++++------------------- drivers/infiniband/hw/ipath/ipath_verbs.c | 241 +++++++++++++------ drivers/infiniband/hw/ipath/ipath_verbs.h | 35 ++- 7 files changed, 494 insertions(+), 611 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index 1324b35..a8c4a6b 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -338,6 +338,7 @@ static void ipath_reset_qp(struct ipath_qp *qp) qp->s_busy = 0; qp->s_flags &= IPATH_S_SIGNAL_REQ_WR; qp->s_hdrwords = 0; + qp->s_wqe = NULL; qp->s_psn = 0; qp->r_psn = 0; qp->r_msn = 0; @@ -751,6 +752,9 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, switch (init_attr->qp_type) { case IB_QPT_UC: case IB_QPT_RC: + case IB_QPT_UD: + case IB_QPT_SMI: + case IB_QPT_GSI: sz = sizeof(struct ipath_sge) * init_attr->cap.max_send_sge + sizeof(struct ipath_swqe); @@ -759,10 +763,6 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, ret = ERR_PTR(-ENOMEM); goto bail; } - /* FALLTHROUGH */ - case IB_QPT_UD: - case IB_QPT_SMI: - case IB_QPT_GSI: sz = sizeof(*qp); if (init_attr->srq) { struct ipath_srq *srq = to_isrq(init_attr->srq); @@ -805,8 +805,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, spin_lock_init(&qp->r_rq.lock); atomic_set(&qp->refcount, 0); init_waitqueue_head(&qp->wait); - tasklet_init(&qp->s_task, ipath_do_ruc_send, - (unsigned long)qp); + tasklet_init(&qp->s_task, ipath_do_send, (unsigned long)qp); INIT_LIST_HEAD(&qp->piowait); INIT_LIST_HEAD(&qp->timerwait); qp->state = IB_QPS_RESET; diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 46744ea..53259da 100644 --- 
a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -81,9 +81,8 @@ static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe) * Note that we are in the responder's side of the QP context. * Note the QP s_lock must be held. */ -static int ipath_make_rc_ack(struct ipath_qp *qp, - struct ipath_other_headers *ohdr, - u32 pmtu, u32 *bth0p, u32 *bth2p) +static int ipath_make_rc_ack(struct ipath_ibdev *dev, struct ipath_qp *qp, + struct ipath_other_headers *ohdr, u32 pmtu) { struct ipath_ack_entry *e; u32 hwords; @@ -192,8 +191,7 @@ static int ipath_make_rc_ack(struct ipath_qp *qp, } qp->s_hdrwords = hwords; qp->s_cur_size = len; - *bth0p = bth0 | (1 << 22); /* Set M bit */ - *bth2p = bth2; + ipath_make_ruc_header(dev, qp, ohdr, bth0, bth2); return 1; bail: @@ -203,32 +201,39 @@ bail: /** * ipath_make_rc_req - construct a request packet (SEND, RDMA r/w, ATOMIC) * @qp: a pointer to the QP - * @ohdr: a pointer to the IB header being constructed - * @pmtu: the path MTU - * @bth0p: pointer to the BTH opcode word - * @bth2p: pointer to the BTH PSN word * * Return 1 if constructed; otherwise, return 0. - * Note the QP s_lock must be held and interrupts disabled. */ -int ipath_make_rc_req(struct ipath_qp *qp, - struct ipath_other_headers *ohdr, - u32 pmtu, u32 *bth0p, u32 *bth2p) +int ipath_make_rc_req(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ipath_other_headers *ohdr; struct ipath_sge_state *ss; struct ipath_swqe *wqe; u32 hwords; u32 len; u32 bth0; u32 bth2; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); char newreq; + unsigned long flags; + int ret = 0; + + ohdr = &qp->s_hdr.u.oth; + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) + ohdr = &qp->s_hdr.u.l.oth; + + /* + * The lock is needed to synchronize between the sending tasklet, + * the receive interrupt handler, and timeout resends. 
+ */ + spin_lock_irqsave(&qp->s_lock, flags); /* Sending responses has higher priority over sending requests. */ if ((qp->r_head_ack_queue != qp->s_tail_ack_queue || (qp->s_flags & IPATH_S_ACK_PENDING) || qp->s_ack_state != OP(ACKNOWLEDGE)) && - ipath_make_rc_ack(qp, ohdr, pmtu, bth0p, bth2p)) + ipath_make_rc_ack(dev, qp, ohdr, pmtu)) goto done; if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) || @@ -560,13 +565,12 @@ int ipath_make_rc_req(struct ipath_qp *qp, qp->s_hdrwords = hwords; qp->s_cur_sge = ss; qp->s_cur_size = len; - *bth0p = bth0 | (qp->s_state << 24); - *bth2p = bth2; + ipath_make_ruc_header(dev, qp, ohdr, bth0 | (qp->s_state << 24), bth2); done: - return 1; - + ret = 1; bail: - return 0; + spin_unlock_irqrestore(&qp->s_lock, flags); + return ret; } /** @@ -627,7 +631,7 @@ static void send_rc_ack(struct ipath_qp *qp) /* * If we can send the ACK, clear the ACK state. */ - if (ipath_verbs_send(dev->dd, hwords, (u32 *) &hdr, 0, NULL) == 0) { + if (ipath_verbs_send(qp, &hdr, hwords, NULL, 0) == 0) { dev->n_unicast_xmit++; goto done; } @@ -757,7 +761,9 @@ void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc) wc->vendor_err = 0; wc->byte_len = 0; wc->qp = &qp->ibqp; + wc->imm_data = 0; wc->src_qp = qp->remote_qpn; + wc->wc_flags = 0; wc->pkey_index = 0; wc->slid = qp->remote_ah_attr.dlid; wc->sl = qp->remote_ah_attr.sl; @@ -1041,7 +1047,9 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, wc.vendor_err = 0; wc.byte_len = 0; wc.qp = &qp->ibqp; + wc.imm_data = 0; wc.src_qp = qp->remote_qpn; + wc.wc_flags = 0; wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; @@ -1454,6 +1462,19 @@ static inline int ipath_rc_rcv_error(struct ipath_ibdev *dev, goto send_ack; } /* + * Try to send a simple ACK to work around a Mellanox bug + * which doesn't accept a RDMA read response or atomic + * response as an ACK for earlier SENDs or RDMA writes. 
+ */ + if (qp->r_head_ack_queue == qp->s_tail_ack_queue && + !(qp->s_flags & IPATH_S_ACK_PENDING) && + qp->s_ack_state == OP(ACKNOWLEDGE)) { + spin_unlock_irqrestore(&qp->s_lock, flags); + qp->r_nak_state = 0; + qp->r_ack_psn = qp->s_ack_queue[i].psn - 1; + goto send_ack; + } + /* * Resend the RDMA read or atomic op which * ACKs this duplicate request. */ diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index c69c252..4b6b7ee 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -31,6 +31,8 @@ * SOFTWARE. */ +#include + #include "ipath_verbs.h" #include "ipath_kernel.h" @@ -106,27 +108,30 @@ void ipath_insert_rnr_queue(struct ipath_qp *qp) spin_unlock_irqrestore(&dev->pending_lock, flags); } -static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe) +/** + * ipath_init_sge - Validate a RWQE and fill in the SGE state + * @qp: the QP + * + * Return 1 if OK. + */ +int ipath_init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe, + u32 *lengthp, struct ipath_sge_state *ss) { - int user = to_ipd(qp->ibqp.pd)->user; int i, j, ret; struct ib_wc wc; - qp->r_len = 0; + *lengthp = 0; for (i = j = 0; i < wqe->num_sge; i++) { if (wqe->sg_list[i].length == 0) continue; /* Check LKEY */ - if ((user && wqe->sg_list[i].lkey == 0) || - !ipath_lkey_ok(qp, &qp->r_sg_list[j], &wqe->sg_list[i], - IB_ACCESS_LOCAL_WRITE)) + if (!ipath_lkey_ok(qp, j ? 
&ss->sg_list[j - 1] : &ss->sge, + &wqe->sg_list[i], IB_ACCESS_LOCAL_WRITE)) goto bad_lkey; - qp->r_len += wqe->sg_list[i].length; + *lengthp += wqe->sg_list[i].length; j++; } - qp->r_sge.sge = qp->r_sg_list[0]; - qp->r_sge.sg_list = qp->r_sg_list + 1; - qp->r_sge.num_sge = j; + ss->num_sge = j; ret = 1; goto bail; @@ -172,6 +177,8 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) u32 tail; int ret; + qp->r_sge.sg_list = qp->r_sg_list; + if (qp->ibqp.srq) { srq = to_isrq(qp->ibqp.srq); handler = srq->ibsrq.event_handler; @@ -199,7 +206,8 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) wqe = get_rwqe_ptr(rq, tail); if (++tail >= rq->size) tail = 0; - } while (!wr_id_only && !init_sge(qp, wqe)); + } while (!wr_id_only && !ipath_init_sge(qp, wqe, &qp->r_len, + &qp->r_sge)); qp->r_wr_id = wqe->wr_id; wq->tail = tail; @@ -239,9 +247,9 @@ bail: /** * ipath_ruc_loopback - handle UC and RC lookback requests - * @sqp: the loopback QP + * @sqp: the sending QP * - * This is called from ipath_do_uc_send() or ipath_do_rc_send() to + * This is called from ipath_do_send() to * forward a WQE addressed to the same HCA. * Note that although we are single threaded due to the tasklet, we still * have to protect against post_send(). We don't have to worry about @@ -450,40 +458,18 @@ again: wc.byte_len = wqe->length; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; - /* XXX do we know which pkey matched? Only needed for GSI. */ wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; wc.dlid_path_bits = 0; + wc.port_num = 1; /* Signal completion event if the solicited bit is set. 
*/ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, wqe->wr.send_flags & IB_SEND_SOLICITED); send_comp: sqp->s_rnr_retry = sqp->s_rnr_retry_cnt; - - if (!(sqp->s_flags & IPATH_S_SIGNAL_REQ_WR) || - (wqe->wr.send_flags & IB_SEND_SIGNALED)) { - wc.wr_id = wqe->wr.wr_id; - wc.status = IB_WC_SUCCESS; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; - wc.byte_len = wqe->length; - wc.qp = &sqp->ibqp; - wc.src_qp = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; - ipath_cq_enter(to_icq(sqp->ibqp.send_cq), &wc, 0); - } - - /* Update s_last now that we are finished with the SWQE */ - spin_lock_irqsave(&sqp->s_lock, flags); - if (++sqp->s_last >= sqp->s_size) - sqp->s_last = 0; - spin_unlock_irqrestore(&sqp->s_lock, flags); + ipath_send_complete(sqp, wqe, IB_WC_SUCCESS); goto again; done: @@ -491,13 +477,11 @@ done: wake_up(&qp->wait); } -static int want_buffer(struct ipath_devdata *dd) +static void want_buffer(struct ipath_devdata *dd) { set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl); ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, dd->ipath_sendctrl); - - return 0; } /** @@ -507,14 +491,11 @@ static int want_buffer(struct ipath_devdata *dd) * * Called when we run out of PIO buffers. */ -static void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) +static void ipath_no_bufs_available(struct ipath_qp *qp, + struct ipath_ibdev *dev) { unsigned long flags; - spin_lock_irqsave(&dev->pending_lock, flags); - if (list_empty(&qp->piowait)) - list_add_tail(&qp->piowait, &dev->piowait); - spin_unlock_irqrestore(&dev->pending_lock, flags); /* * Note that as soon as want_buffer() is called and * possibly before it returns, ipath_ib_piobufavail() @@ -524,101 +505,14 @@ static void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev * We leave the busy flag set so that another post send doesn't * try to put the same QP on the piowait list again. 
*/ + spin_lock_irqsave(&dev->pending_lock, flags); + list_add_tail(&qp->piowait, &dev->piowait); + spin_unlock_irqrestore(&dev->pending_lock, flags); want_buffer(dev->dd); dev->n_piowait++; } /** - * ipath_post_ruc_send - post RC and UC sends - * @qp: the QP to post on - * @wr: the work request to send - */ -int ipath_post_ruc_send(struct ipath_qp *qp, struct ib_send_wr *wr) -{ - struct ipath_swqe *wqe; - unsigned long flags; - u32 next; - int i, j; - int acc; - int ret; - - /* - * Don't allow RDMA reads or atomic operations on UC or - * undefined operations. - * Make sure buffer is large enough to hold the result for atomics. - */ - if (qp->ibqp.qp_type == IB_QPT_UC) { - if ((unsigned) wr->opcode >= IB_WR_RDMA_READ) { - ret = -EINVAL; - goto bail; - } - } else if ((unsigned) wr->opcode > IB_WR_ATOMIC_FETCH_AND_ADD) { - ret = -EINVAL; - goto bail; - } else if (wr->opcode >= IB_WR_ATOMIC_CMP_AND_SWP && - (wr->num_sge == 0 || - wr->sg_list[0].length < sizeof(u64) || - wr->sg_list[0].addr & (sizeof(u64) - 1))) { - ret = -EINVAL; - goto bail; - } else if (wr->opcode >= IB_WR_RDMA_READ && !qp->s_max_rd_atomic) { - ret = -EINVAL; - goto bail; - } - /* IB spec says that num_sge == 0 is OK. */ - if (wr->num_sge > qp->s_max_sge) { - ret = -ENOMEM; - goto bail; - } - spin_lock_irqsave(&qp->s_lock, flags); - next = qp->s_head + 1; - if (next >= qp->s_size) - next = 0; - if (next == qp->s_last) { - spin_unlock_irqrestore(&qp->s_lock, flags); - ret = -EINVAL; - goto bail; - } - - wqe = get_swqe_ptr(qp, qp->s_head); - wqe->wr = *wr; - wqe->ssn = qp->s_ssn++; - wqe->sg_list[0].mr = NULL; - wqe->sg_list[0].vaddr = NULL; - wqe->sg_list[0].length = 0; - wqe->sg_list[0].sge_length = 0; - wqe->length = 0; - acc = wr->opcode >= IB_WR_RDMA_READ ? 
IB_ACCESS_LOCAL_WRITE : 0; - for (i = 0, j = 0; i < wr->num_sge; i++) { - if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0) { - spin_unlock_irqrestore(&qp->s_lock, flags); - ret = -EINVAL; - goto bail; - } - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok(qp, &wqe->sg_list[j], &wr->sg_list[i], - acc)) { - spin_unlock_irqrestore(&qp->s_lock, flags); - ret = -EINVAL; - goto bail; - } - wqe->length += wr->sg_list[i].length; - j++; - } - wqe->wr.num_sge = j; - qp->s_head = next; - spin_unlock_irqrestore(&qp->s_lock, flags); - - ipath_do_ruc_send((unsigned long) qp); - - ret = 0; - -bail: - return ret; -} - -/** * ipath_make_grh - construct a GRH header * @dev: a pointer to the ipath device * @hdr: a pointer to the GRH header being constructed @@ -648,39 +542,66 @@ u32 ipath_make_grh(struct ipath_ibdev *dev, struct ib_grh *hdr, return sizeof(struct ib_grh) / sizeof(u32); } +void ipath_make_ruc_header(struct ipath_ibdev *dev, struct ipath_qp *qp, + struct ipath_other_headers *ohdr, + u32 bth0, u32 bth2) +{ + u16 lrh0; + u32 nwords; + u32 extra_bytes; + + /* Construct the header. 
*/ + extra_bytes = -qp->s_cur_size & 3; + nwords = (qp->s_cur_size + extra_bytes) >> 2; + lrh0 = IPATH_LRH_BTH; + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { + qp->s_hdrwords += ipath_make_grh(dev, &qp->s_hdr.u.l.grh, + &qp->remote_ah_attr.grh, + qp->s_hdrwords, nwords); + lrh0 = IPATH_LRH_GRH; + } + lrh0 |= qp->remote_ah_attr.sl << 4; + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + nwords + SIZE_OF_CRC); + qp->s_hdr.lrh[3] = cpu_to_be16(dev->dd->ipath_lid); + bth0 |= ipath_get_pkey(dev->dd, qp->s_pkey_index); + bth0 |= extra_bytes << 20; + ohdr->bth[0] = cpu_to_be32(bth0 | (1 << 22)); + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); + ohdr->bth[2] = cpu_to_be32(bth2); +} + /** - * ipath_do_ruc_send - perform a send on an RC or UC QP + * ipath_do_send - perform a send on a QP * @data: contains a pointer to the QP * * Process entries in the send work queue until credit or queue is * exhausted. Only allow one CPU to send a packet per QP (tasklet). - * Otherwise, after we drop the QP s_lock, two threads could send - * packets out of order. + * Otherwise, two threads could send packets out of order. 
*/ -void ipath_do_ruc_send(unsigned long data) +void ipath_do_send(unsigned long data) { struct ipath_qp *qp = (struct ipath_qp *)data; struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - unsigned long flags; - u16 lrh0; - u32 nwords; - u32 extra_bytes; - u32 bth0; - u32 bth2; - u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); - struct ipath_other_headers *ohdr; + int (*make_req)(struct ipath_qp *qp); if (test_and_set_bit(IPATH_S_BUSY, &qp->s_busy)) goto bail; - if (unlikely(qp->remote_ah_attr.dlid == dev->dd->ipath_lid)) { + if ((qp->ibqp.qp_type == IB_QPT_RC || + qp->ibqp.qp_type == IB_QPT_UC) && + qp->remote_ah_attr.dlid == dev->dd->ipath_lid) { ipath_ruc_loopback(qp); goto clear; } - ohdr = &qp->s_hdr.u.oth; - if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) - ohdr = &qp->s_hdr.u.l.oth; + if (qp->ibqp.qp_type == IB_QPT_RC) + make_req = ipath_make_rc_req; + else if (qp->ibqp.qp_type == IB_QPT_UC) + make_req = ipath_make_uc_req; + else + make_req = ipath_make_ud_req; again: /* Check for a constructed packet to be sent. */ @@ -689,9 +610,8 @@ again: * If no PIO bufs are available, return. An interrupt will * call ipath_ib_piobufavail() when one is available. */ - if (ipath_verbs_send(dev->dd, qp->s_hdrwords, - (u32 *) &qp->s_hdr, qp->s_cur_size, - qp->s_cur_sge)) { + if (ipath_verbs_send(qp, &qp->s_hdr, qp->s_hdrwords, + qp->s_cur_sge, qp->s_cur_size)) { ipath_no_bufs_available(qp, dev); goto bail; } @@ -700,54 +620,42 @@ again: qp->s_hdrwords = 0; } - /* - * The lock is needed to synchronize between setting - * qp->s_ack_state, resend timer, and post_send(). - */ - spin_lock_irqsave(&qp->s_lock, flags); - - if (!((qp->ibqp.qp_type == IB_QPT_RC) ? - ipath_make_rc_req(qp, ohdr, pmtu, &bth0, &bth2) : - ipath_make_uc_req(qp, ohdr, pmtu, &bth0, &bth2))) { - /* - * Clear the busy bit before unlocking to avoid races with - * adding new work queue items and then failing to process - * them. 
- */ - clear_bit(IPATH_S_BUSY, &qp->s_busy); - spin_unlock_irqrestore(&qp->s_lock, flags); - goto bail; - } + if (make_req(qp)) + goto again; +clear: + clear_bit(IPATH_S_BUSY, &qp->s_busy); +bail:; +} - spin_unlock_irqrestore(&qp->s_lock, flags); +void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, + enum ib_wc_status status) +{ + u32 last = qp->s_last; - /* Construct the header. */ - extra_bytes = (4 - qp->s_cur_size) & 3; - nwords = (qp->s_cur_size + extra_bytes) >> 2; - lrh0 = IPATH_LRH_BTH; - if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { - qp->s_hdrwords += ipath_make_grh(dev, &qp->s_hdr.u.l.grh, - &qp->remote_ah_attr.grh, - qp->s_hdrwords, nwords); - lrh0 = IPATH_LRH_GRH; - } - lrh0 |= qp->remote_ah_attr.sl << 4; - qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); - qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); - qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + nwords + - SIZE_OF_CRC); - qp->s_hdr.lrh[3] = cpu_to_be16(dev->dd->ipath_lid); - bth0 |= ipath_get_pkey(dev->dd, qp->s_pkey_index); - bth0 |= extra_bytes << 20; - ohdr->bth[0] = cpu_to_be32(bth0); - ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); - ohdr->bth[2] = cpu_to_be32(bth2); + if (++last == qp->s_size) + last = 0; + qp->s_last = last; - /* Check for more work to do. */ - goto again; + /* See ch. 
11.2.4.1 and 10.7.3.1 */ + if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || + (wqe->wr.send_flags & IB_SEND_SIGNALED) || + status != IB_WC_SUCCESS) { + struct ib_wc wc; -clear: - clear_bit(IPATH_S_BUSY, &qp->s_busy); -bail: - return; + wc.wr_id = wqe->wr.wr_id; + wc.status = status; + wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = wqe->length; + wc.imm_data = 0; + wc.qp = &qp->ibqp; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); + } } diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c index 8380fbc..767beb9 100644 --- a/drivers/infiniband/hw/ipath/ipath_uc.c +++ b/drivers/infiniband/hw/ipath/ipath_uc.c @@ -37,72 +37,40 @@ /* cut down ridiculously long IB macro names */ #define OP(x) IB_OPCODE_UC_##x -static void complete_last_send(struct ipath_qp *qp, struct ipath_swqe *wqe, - struct ib_wc *wc) -{ - if (++qp->s_last == qp->s_size) - qp->s_last = 0; - if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || - (wqe->wr.send_flags & IB_SEND_SIGNALED)) { - wc->wr_id = wqe->wr.wr_id; - wc->status = IB_WC_SUCCESS; - wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc->vendor_err = 0; - wc->byte_len = wqe->length; - wc->qp = &qp->ibqp; - wc->src_qp = qp->remote_qpn; - wc->pkey_index = 0; - wc->slid = qp->remote_ah_attr.dlid; - wc->sl = qp->remote_ah_attr.sl; - wc->dlid_path_bits = 0; - wc->port_num = 0; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 0); - } -} - /** * ipath_make_uc_req - construct a request packet (SEND, RDMA write) * @qp: a pointer to the QP - * @ohdr: a pointer to the IB header being constructed - * @pmtu: the path MTU - * @bth0p: pointer to the BTH opcode word - * @bth2p: pointer to the BTH PSN word * * Return 1 if constructed; otherwise, return 0. - * Note the QP s_lock must be held and interrupts disabled. 
*/ -int ipath_make_uc_req(struct ipath_qp *qp, - struct ipath_other_headers *ohdr, - u32 pmtu, u32 *bth0p, u32 *bth2p) +int ipath_make_uc_req(struct ipath_qp *qp) { + struct ipath_other_headers *ohdr; struct ipath_swqe *wqe; u32 hwords; u32 bth0; u32 len; - struct ib_wc wc; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + int ret = 0; if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) goto done; + ohdr = &qp->s_hdr.u.oth; + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) + ohdr = &qp->s_hdr.u.l.oth; + /* header size in 32-bit words LRH+BTH = (8+12)/4. */ hwords = 5; bth0 = 1 << 22; /* Set M bit */ /* Get the next send request. */ - wqe = get_swqe_ptr(qp, qp->s_last); + wqe = get_swqe_ptr(qp, qp->s_cur); + qp->s_wqe = NULL; switch (qp->s_state) { default: - /* - * Signal the completion of the last send - * (if there is one). - */ - if (qp->s_last != qp->s_tail) { - complete_last_send(qp, wqe, &wc); - wqe = get_swqe_ptr(qp, qp->s_last); - } - /* Check if send work queue is empty. */ - if (qp->s_tail == qp->s_head) + if (qp->s_cur == qp->s_head) goto done; /* * Start a new request. 
@@ -131,6 +99,9 @@ int ipath_make_uc_req(struct ipath_qp *qp, } if (wqe->wr.send_flags & IB_SEND_SOLICITED) bth0 |= 1 << 23; + qp->s_wqe = wqe; + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; break; case IB_WR_RDMA_WRITE: @@ -157,13 +128,14 @@ int ipath_make_uc_req(struct ipath_qp *qp, if (wqe->wr.send_flags & IB_SEND_SOLICITED) bth0 |= 1 << 23; } + qp->s_wqe = wqe; + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; break; default: goto done; } - if (++qp->s_tail >= qp->s_size) - qp->s_tail = 0; break; case OP(SEND_FIRST): @@ -185,6 +157,9 @@ int ipath_make_uc_req(struct ipath_qp *qp, } if (wqe->wr.send_flags & IB_SEND_SOLICITED) bth0 |= 1 << 23; + qp->s_wqe = wqe; + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; break; case OP(RDMA_WRITE_FIRST): @@ -207,18 +182,22 @@ int ipath_make_uc_req(struct ipath_qp *qp, if (wqe->wr.send_flags & IB_SEND_SOLICITED) bth0 |= 1 << 23; } + qp->s_wqe = wqe; + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; break; } qp->s_len -= len; qp->s_hdrwords = hwords; qp->s_cur_sge = &qp->s_sge; qp->s_cur_size = len; - *bth0p = bth0 | (qp->s_state << 24); - *bth2p = qp->s_next_psn++ & IPATH_PSN_MASK; - return 1; + ipath_make_ruc_header(to_idev(qp->ibqp.device), + qp, ohdr, bth0 | (qp->s_state << 24), + qp->s_next_psn++ & IPATH_PSN_MASK); + ret = 1; done: - return 0; + return ret; } /** diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c index f9a3338..34c4a0a 100644 --- a/drivers/infiniband/hw/ipath/ipath_ud.c +++ b/drivers/infiniband/hw/ipath/ipath_ud.c @@ -36,68 +36,17 @@ #include "ipath_verbs.h" #include "ipath_kernel.h" -static int init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe, - u32 *lengthp, struct ipath_sge_state *ss) -{ - int user = to_ipd(qp->ibqp.pd)->user; - int i, j, ret; - struct ib_wc wc; - - *lengthp = 0; - for (i = j = 0; i < wqe->num_sge; i++) { - if (wqe->sg_list[i].length == 0) - continue; - /* Check LKEY */ - if ((user && wqe->sg_list[i].lkey == 0) || - !ipath_lkey_ok(qp, j 
? &ss->sg_list[j - 1] : &ss->sge, - &wqe->sg_list[i], IB_ACCESS_LOCAL_WRITE)) - goto bad_lkey; - *lengthp += wqe->sg_list[i].length; - j++; - } - ss->num_sge = j; - ret = 1; - goto bail; - -bad_lkey: - wc.wr_id = wqe->wr_id; - wc.status = IB_WC_LOC_PROT_ERR; - wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.imm_data = 0; - wc.qp = &qp->ibqp; - wc.src_qp = 0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; - /* Signal solicited completion event. */ - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); - ret = 0; -bail: - return ret; -} - /** * ipath_ud_loopback - handle send on loopback QPs - * @sqp: the QP - * @ss: the SGE state - * @length: the length of the data to send - * @wr: the work request - * @wc: the work completion entry + * @sqp: the sending QP + * @swqe: the send work request * - * This is called from ipath_post_ud_send() to forward a WQE addressed + * This is called from ipath_make_ud_req() to forward a WQE addressed * to the same HCA. * Note that the receive interrupt handler may be calling ipath_ud_rcv() * while this is being called. */ -static void ipath_ud_loopback(struct ipath_qp *sqp, - struct ipath_sge_state *ss, - u32 length, struct ib_send_wr *wr, - struct ib_wc *wc) +static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe) { struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); struct ipath_qp *qp; @@ -110,12 +59,18 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_rwq *wq; struct ipath_rwqe *wqe; void (*handler)(struct ib_event *, void *); + struct ib_wc wc; u32 tail; u32 rlen; + u32 length; - qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn); - if (!qp) - return; + qp = ipath_lookup_qpn(&dev->qp_table, swqe->wr.wr.ud.remote_qpn); + if (!qp) { + dev->n_pkt_drops++; + goto send_comp; + } + + rsge.sg_list = NULL; /* * Check that the qkey matches (except for QP0, see 9.6.1.4.1). 
@@ -123,39 +78,34 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, * qkey from the QP context instead of the WR (see 10.2.5). */ if (unlikely(qp->ibqp.qp_num && - ((int) wr->wr.ud.remote_qkey < 0 - ? qp->qkey : wr->wr.ud.remote_qkey) != qp->qkey)) { + ((int) swqe->wr.wr.ud.remote_qkey < 0 ? + sqp->qkey : swqe->wr.wr.ud.remote_qkey) != qp->qkey)) { /* XXX OK to lose a count once in a while. */ dev->qkey_violations++; dev->n_pkt_drops++; - goto done; + goto drop; } /* * A GRH is expected to preceed the data even if not * present on the wire. */ - wc->byte_len = length + sizeof(struct ib_grh); + length = swqe->length; + wc.byte_len = length + sizeof(struct ib_grh); - if (wr->opcode == IB_WR_SEND_WITH_IMM) { - wc->wc_flags = IB_WC_WITH_IMM; - wc->imm_data = wr->imm_data; + if (swqe->wr.opcode == IB_WR_SEND_WITH_IMM) { + wc.wc_flags = IB_WC_WITH_IMM; + wc.imm_data = swqe->wr.imm_data; } else { - wc->wc_flags = 0; - wc->imm_data = 0; + wc.wc_flags = 0; + wc.imm_data = 0; } - if (wr->num_sge > 1) { - rsge.sg_list = kmalloc((wr->num_sge - 1) * - sizeof(struct ipath_sge), - GFP_ATOMIC); - } else - rsge.sg_list = NULL; - /* - * Get the next work request entry to find where to put the data. - * Note that it is safe to drop the lock after changing rq->tail - * since ipath_post_receive() won't fill the empty slot. + * This would be a lot simpler if we could call ipath_get_rwqe() + * but that uses state that the receive interrupt handler uses + * so we would need to lock out receive interrupts while doing + * local loopback. */ if (qp->ibqp.srq) { srq = to_isrq(qp->ibqp.srq); @@ -167,32 +117,53 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, rq = &qp->r_rq; } + if (rq->max_sge > 1) { + /* + * XXX We could use GFP_KERNEL if ipath_do_send() + * was always called from the tasklet instead of + * from ipath_post_send(). 
+ */ + rsge.sg_list = kmalloc((rq->max_sge - 1) * + sizeof(struct ipath_sge), + GFP_ATOMIC); + if (!rsge.sg_list) { + dev->n_pkt_drops++; + goto drop; + } + } + + /* + * Get the next work request entry to find where to put the data. + * Note that it is safe to drop the lock after changing rq->tail + * since ipath_post_receive() won't fill the empty slot. + */ spin_lock_irqsave(&rq->lock, flags); wq = rq->wq; tail = wq->tail; - while (1) { - if (unlikely(tail == wq->head)) { - spin_unlock_irqrestore(&rq->lock, flags); - dev->n_pkt_drops++; - goto bail_sge; - } - /* Make sure entry is read after head index is read. */ - smp_rmb(); - wqe = get_rwqe_ptr(rq, tail); - if (++tail >= rq->size) - tail = 0; - if (init_sge(qp, wqe, &rlen, &rsge)) - break; - wq->tail = tail; + /* Validate tail before using it since it is user writable. */ + if (tail >= rq->size) + tail = 0; + if (unlikely(tail == wq->head)) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto drop; + } + wqe = get_rwqe_ptr(rq, tail); + if (!ipath_init_sge(qp, wqe, &rlen, &rsge)) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto drop; } /* Silently drop packets which are too big. 
*/ - if (wc->byte_len > rlen) { + if (wc.byte_len > rlen) { spin_unlock_irqrestore(&rq->lock, flags); dev->n_pkt_drops++; - goto bail_sge; + goto drop; } + if (++tail >= rq->size) + tail = 0; wq->tail = tail; - wc->wr_id = wqe->wr_id; + wc.wr_id = wqe->wr_id; if (handler) { u32 n; @@ -221,13 +192,13 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, } else spin_unlock_irqrestore(&rq->lock, flags); - ah_attr = &to_iah(wr->wr.ud.ah)->attr; + ah_attr = &to_iah(swqe->wr.wr.ud.ah)->attr; if (ah_attr->ah_flags & IB_AH_GRH) { ipath_copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh)); - wc->wc_flags |= IB_WC_GRH; + wc.wc_flags |= IB_WC_GRH; } else ipath_skip_sge(&rsge, sizeof(struct ib_grh)); - sge = &ss->sge; + sge = swqe->sg_list; while (length) { u32 len = sge->length; @@ -241,8 +212,8 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, sge->length -= len; sge->sge_length -= len; if (sge->sge_length == 0) { - if (--ss->num_sge) - *sge = *ss->sg_list++; + if (--swqe->wr.num_sge) + sge++; } else if (sge->length == 0 && sge->mr != NULL) { if (++sge->n >= IPATH_SEGSZ) { if (++sge->m >= sge->mr->mapsz) @@ -256,123 +227,60 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, } length -= len; } - wc->status = IB_WC_SUCCESS; - wc->opcode = IB_WC_RECV; - wc->vendor_err = 0; - wc->qp = &qp->ibqp; - wc->src_qp = sqp->ibqp.qp_num; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.qp = &qp->ibqp; + wc.src_qp = sqp->ibqp.qp_num; /* XXX do we know which pkey matched? Only needed for GSI. */ - wc->pkey_index = 0; - wc->slid = dev->dd->ipath_lid | + wc.pkey_index = 0; + wc.slid = dev->dd->ipath_lid | (ah_attr->src_path_bits & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1)); - wc->sl = ah_attr->sl; - wc->dlid_path_bits = + wc.sl = ah_attr->sl; + wc.dlid_path_bits = ah_attr->dlid & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); + wc.port_num = 1; /* Signal completion event if the solicited bit is set. 
*/ - ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, - wr->send_flags & IB_SEND_SOLICITED); - -bail_sge: + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, + swqe->wr.send_flags & IB_SEND_SOLICITED); +drop: kfree(rsge.sg_list); -done: if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); +send_comp: + ipath_send_complete(sqp, swqe, IB_WC_SUCCESS); } /** - * ipath_post_ud_send - post a UD send on QP + * ipath_make_ud_req - construct a UD request packet * @qp: the QP - * @wr: the work request * - * Note that we actually send the data as it is posted instead of putting - * the request into a ring buffer. If we wanted to use a ring buffer, - * we would need to save a reference to the destination address in the SWQE. + * Return 1 if constructed; otherwise, return 0. */ -int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr) +int ipath_make_ud_req(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ipath_other_headers *ohdr; struct ib_ah_attr *ah_attr; - struct ipath_sge_state ss; - struct ipath_sge *sg_list; - struct ib_wc wc; - u32 hwords; + struct ipath_swqe *wqe; u32 nwords; - u32 len; u32 extra_bytes; u32 bth0; u16 lrh0; u16 lid; - int i; - int ret; + int ret = 0; - if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) { - ret = 0; + if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK))) goto bail; - } - if (wr->wr.ud.ah->pd != qp->ibqp.pd) { - ret = -EPERM; + if (qp->s_cur == qp->s_head) goto bail; - } - /* IB spec says that num_sge == 0 is OK. */ - if (wr->num_sge > qp->s_max_sge) { - ret = -EINVAL; - goto bail; - } - - if (wr->num_sge > 1) { - sg_list = kmalloc((qp->s_max_sge - 1) * sizeof(*sg_list), - GFP_ATOMIC); - if (!sg_list) { - ret = -ENOMEM; - goto bail; - } - } else - sg_list = NULL; - - /* Check the buffer to send. 
*/ - ss.sg_list = sg_list; - ss.sge.mr = NULL; - ss.sge.vaddr = NULL; - ss.sge.length = 0; - ss.sge.sge_length = 0; - ss.num_sge = 0; - len = 0; - for (i = 0; i < wr->num_sge; i++) { - /* Check LKEY */ - if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0) { - ret = -EINVAL; - goto bail; - } - - if (wr->sg_list[i].length == 0) - continue; - if (!ipath_lkey_ok(qp, ss.num_sge ? - sg_list + ss.num_sge - 1 : &ss.sge, - &wr->sg_list[i], 0)) { - ret = -EINVAL; - goto bail; - } - len += wr->sg_list[i].length; - ss.num_sge++; - } - /* Check for invalid packet size. */ - if (len > dev->dd->ipath_ibmtu) { - ret = -EINVAL; - goto bail; - } - extra_bytes = (4 - len) & 3; - nwords = (len + extra_bytes) >> 2; + wqe = get_swqe_ptr(qp, qp->s_cur); /* Construct the header. */ - ah_attr = &to_iah(wr->wr.ud.ah)->attr; - if (ah_attr->dlid == 0) { - ret = -EINVAL; - goto bail; - } + ah_attr = &to_iah(wqe->wr.wr.ud.ah)->attr; if (ah_attr->dlid >= IPATH_MULTICAST_LID_BASE) { if (ah_attr->dlid != IPATH_PERMISSIVE_LID) dev->n_multicast_xmit++; @@ -383,64 +291,53 @@ int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr) lid = ah_attr->dlid & ~((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); if (unlikely(lid == dev->dd->ipath_lid)) { - /* - * Pass in an uninitialized ib_wc to save stack - * space. - */ - ipath_ud_loopback(qp, &ss, len, wr, &wc); + ipath_ud_loopback(qp, wqe); goto done; } } + + extra_bytes = -wqe->length & 3; + nwords = (wqe->length + extra_bytes) >> 2; + + /* header size in 32-bit words LRH+BTH+DETH = (8+12+8)/4. */ + qp->s_hdrwords = 7; + if (wqe->wr.opcode == IB_WR_SEND_WITH_IMM) + qp->s_hdrwords++; + qp->s_cur_size = wqe->length; + qp->s_cur_sge = &qp->s_sge; + qp->s_wqe = wqe; + qp->s_sge.sge = wqe->sg_list[0]; + qp->s_sge.sg_list = wqe->sg_list + 1; + qp->s_sge.num_sge = wqe->wr.num_sge; + if (ah_attr->ah_flags & IB_AH_GRH) { /* Header size in 32-bit words. 
*/ - hwords = 17; + qp->s_hdrwords += ipath_make_grh(dev, &qp->s_hdr.u.l.grh, + &ah_attr->grh, + qp->s_hdrwords, nwords); lrh0 = IPATH_LRH_GRH; ohdr = &qp->s_hdr.u.l.oth; - qp->s_hdr.u.l.grh.version_tclass_flow = - cpu_to_be32((6 << 28) | - (ah_attr->grh.traffic_class << 20) | - ah_attr->grh.flow_label); - qp->s_hdr.u.l.grh.paylen = - cpu_to_be16(((wr->opcode == - IB_WR_SEND_WITH_IMM ? 6 : 5) + - nwords + SIZE_OF_CRC) << 2); - /* next_hdr is defined by C8-7 in ch. 8.4.1 */ - qp->s_hdr.u.l.grh.next_hdr = 0x1B; - qp->s_hdr.u.l.grh.hop_limit = ah_attr->grh.hop_limit; - /* The SGID is 32-bit aligned. */ - qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = - dev->gid_prefix; - qp->s_hdr.u.l.grh.sgid.global.interface_id = - dev->dd->ipath_guid; - qp->s_hdr.u.l.grh.dgid = ah_attr->grh.dgid; /* * Don't worry about sending to locally attached multicast * QPs. It is unspecified by the spec. what happens. */ } else { /* Header size in 32-bit words. */ - hwords = 7; lrh0 = IPATH_LRH_BTH; ohdr = &qp->s_hdr.u.oth; } - if (wr->opcode == IB_WR_SEND_WITH_IMM) { - ohdr->u.ud.imm_data = wr->imm_data; - wc.imm_data = wr->imm_data; - hwords += 1; + if (wqe->wr.opcode == IB_WR_SEND_WITH_IMM) { + ohdr->u.ud.imm_data = wqe->wr.imm_data; bth0 = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE << 24; - } else if (wr->opcode == IB_WR_SEND) { - wc.imm_data = 0; + } else bth0 = IB_OPCODE_UD_SEND_ONLY << 24; - } else { - ret = -EINVAL; - goto bail; - } lrh0 |= ah_attr->sl << 4; if (qp->ibqp.qp_type == IB_QPT_SMI) lrh0 |= 0xF000; /* Set VL (see ch. 
13.5.3.1) */ qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); qp->s_hdr.lrh[1] = cpu_to_be16(ah_attr->dlid); /* DEST LID */ - qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); + qp->s_hdr.lrh[2] = cpu_to_be16(qp->s_hdrwords + nwords + + SIZE_OF_CRC); lid = dev->dd->ipath_lid; if (lid) { lid |= ah_attr->src_path_bits & @@ -448,7 +345,7 @@ int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr) qp->s_hdr.lrh[3] = cpu_to_be16(lid); } else qp->s_hdr.lrh[3] = IB_LID_PERMISSIVE; - if (wr->send_flags & IB_SEND_SOLICITED) + if (wqe->wr.send_flags & IB_SEND_SOLICITED) bth0 |= 1 << 23; bth0 |= extra_bytes << 20; bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPATH_DEFAULT_P_KEY : @@ -460,38 +357,20 @@ int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr) ohdr->bth[1] = ah_attr->dlid >= IPATH_MULTICAST_LID_BASE && ah_attr->dlid != IPATH_PERMISSIVE_LID ? __constant_cpu_to_be32(IPATH_MULTICAST_QPN) : - cpu_to_be32(wr->wr.ud.remote_qpn); - /* XXX Could lose a PSN count but not worth locking */ + cpu_to_be32(wqe->wr.wr.ud.remote_qpn); ohdr->bth[2] = cpu_to_be32(qp->s_next_psn++ & IPATH_PSN_MASK); /* * Qkeys with the high order bit set mean use the * qkey from the QP context instead of the WR (see 10.2.5). */ - ohdr->u.ud.deth[0] = cpu_to_be32((int)wr->wr.ud.remote_qkey < 0 ? - qp->qkey : wr->wr.ud.remote_qkey); + ohdr->u.ud.deth[0] = cpu_to_be32((int)wqe->wr.wr.ud.remote_qkey < 0 ? + qp->qkey : wqe->wr.wr.ud.remote_qkey); ohdr->u.ud.deth[1] = cpu_to_be32(qp->ibqp.qp_num); - if (ipath_verbs_send(dev->dd, hwords, (u32 *) &qp->s_hdr, - len, &ss)) - dev->n_no_piobuf++; done: - /* Queue the completion status entry. */ - if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || - (wr->send_flags & IB_SEND_SIGNALED)) { - wc.wr_id = wr->wr_id; - wc.status = IB_WC_SUCCESS; - wc.vendor_err = 0; - wc.opcode = IB_WC_SEND; - wc.byte_len = len; - wc.qp = &qp->ibqp; - wc.src_qp = 0; - wc.wc_flags = 0; - /* XXX initialize other fields? 
*/ - ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); - } - kfree(sg_list); - - ret = 0; + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; + ret = 1; bail: return ret; @@ -673,6 +552,7 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, */ wc.dlid_path_bits = dlid >= IPATH_MULTICAST_LID_BASE ? 0 : dlid & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); + wc.port_num = 1; /* Signal completion event if the solicited bit is set. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, (ohdr->bth[0] & diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 559d4a6..3cc82b6 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -231,6 +231,103 @@ void ipath_skip_sge(struct ipath_sge_state *ss, u32 length) } /** + * ipath_post_one_send - post one RC, UC, or UD send work request + * @qp: the QP to post on + * @wr: the work request to send + */ +static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr) +{ + struct ipath_swqe *wqe; + u32 next; + int i; + int j; + int acc; + int ret; + unsigned long flags; + + spin_lock_irqsave(&qp->s_lock, flags); + + /* Check that state is OK to post send. */ + if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK)) + goto bail_inval; + + /* IB spec says that num_sge == 0 is OK. */ + if (wr->num_sge > qp->s_max_sge) + goto bail_inval; + + /* + * Don't allow RDMA reads or atomic operations on UC or + * undefined operations. + * Make sure buffer is large enough to hold the result for atomics. 
+ */ + if (qp->ibqp.qp_type == IB_QPT_UC) { + if ((unsigned) wr->opcode >= IB_WR_RDMA_READ) + goto bail_inval; + } else if (qp->ibqp.qp_type == IB_QPT_UD) { + /* Check UD opcode */ + if (wr->opcode != IB_WR_SEND && + wr->opcode != IB_WR_SEND_WITH_IMM) + goto bail_inval; + /* Check UD destination address PD */ + if (qp->ibqp.pd != wr->wr.ud.ah->pd) + goto bail_inval; + } else if ((unsigned) wr->opcode > IB_WR_ATOMIC_FETCH_AND_ADD) + goto bail_inval; + else if (wr->opcode >= IB_WR_ATOMIC_CMP_AND_SWP && + (wr->num_sge == 0 || + wr->sg_list[0].length < sizeof(u64) || + wr->sg_list[0].addr & (sizeof(u64) - 1))) + goto bail_inval; + else if (wr->opcode >= IB_WR_RDMA_READ && !qp->s_max_rd_atomic) + goto bail_inval; + + next = qp->s_head + 1; + if (next >= qp->s_size) + next = 0; + if (next == qp->s_last) + goto bail_inval; + + wqe = get_swqe_ptr(qp, qp->s_head); + wqe->wr = *wr; + wqe->ssn = qp->s_ssn++; + wqe->length = 0; + if (wr->num_sge) { + acc = wr->opcode >= IB_WR_RDMA_READ ? + IB_ACCESS_LOCAL_WRITE : 0; + for (i = 0, j = 0; i < wr->num_sge; i++) { + u32 length = wr->sg_list[i].length; + int ok; + + if (length == 0) + continue; + ok = ipath_lkey_ok(qp, &wqe->sg_list[j], + &wr->sg_list[i], acc); + if (!ok) + goto bail_inval; + wqe->length += length; + j++; + } + wqe->wr.num_sge = j; + } + if (qp->ibqp.qp_type == IB_QPT_UC || + qp->ibqp.qp_type == IB_QPT_RC) { + if (wqe->length > 0x80000000U) + goto bail_inval; + } else if (wqe->length > to_idev(qp->ibqp.device)->dd->ipath_ibmtu) + goto bail_inval; + qp->s_head = next; + + ret = 0; + goto bail; + +bail_inval: + ret = -EINVAL; +bail: + spin_unlock_irqrestore(&qp->s_lock, flags); + return ret; +} + +/** * ipath_post_send - post a send on a QP * @ibqp: the QP to post the send on * @wr: the list of work requests to post @@ -244,35 +341,17 @@ static int ipath_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ipath_qp *qp = to_iqp(ibqp); int err = 0; - /* Check that state is OK to post send. 
*/ - if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK)) { - *bad_wr = wr; - err = -EINVAL; - goto bail; - } - for (; wr; wr = wr->next) { - switch (qp->ibqp.qp_type) { - case IB_QPT_UC: - case IB_QPT_RC: - err = ipath_post_ruc_send(qp, wr); - break; - - case IB_QPT_SMI: - case IB_QPT_GSI: - case IB_QPT_UD: - err = ipath_post_ud_send(qp, wr); - break; - - default: - err = -EINVAL; - } + err = ipath_post_one_send(qp, wr); if (err) { *bad_wr = wr; - break; + goto bail; } } + /* Try to do the send work in the caller's context. */ + ipath_do_send((unsigned long) qp); + bail: return err; } @@ -641,11 +720,11 @@ static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state *ss, u32 len = ss->sge.length; u32 off; - BUG_ON(len == 0); if (len > length) len = length; if (len > ss->sge.sge_length) len = ss->sge.sge_length; + BUG_ON(len == 0); /* If the source address is not aligned, try to align it. */ off = (unsigned long)ss->sge.vaddr & (sizeof(u32) - 1); if (off) { @@ -767,30 +846,15 @@ static void copy_io(u32 __iomem *piobuf, struct ipath_sge_state *ss, __raw_writel(last, piobuf); } -/** - * ipath_verbs_send - send a packet - * @dd: the infinipath device - * @hdrwords: the number of words in the header - * @hdr: the packet header - * @len: the length of the packet in bytes - * @ss: the SGE to send - */ -int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, - u32 *hdr, u32 len, struct ipath_sge_state *ss) +static int ipath_verbs_send_pio(struct ipath_qp *qp, u32 *hdr, u32 hdrwords, + struct ipath_sge_state *ss, u32 len, + u32 plen, u32 dwords) { + struct ipath_devdata *dd = to_idev(qp->ibqp.device)->dd; u32 __iomem *piobuf; unsigned flush_wc; - u32 plen; int ret; - /* +1 is for the qword padding of pbc */ - plen = hdrwords + ((len + 3) >> 2) + 1; - if (unlikely((plen << 2) > dd->ipath_ibmaxlen)) { - ret = -EINVAL; - goto bail; - } - - /* Get a PIO buffer to use. 
*/ piobuf = ipath_getpiobuf(dd, NULL); if (unlikely(piobuf == NULL)) { ret = -EBUSY; @@ -831,13 +895,10 @@ int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, /* The common case is aligned and contained in one segment. */ if (likely(ss->num_sge == 1 && len <= ss->sge.length && !((unsigned long)ss->sge.vaddr & (sizeof(u32) - 1)))) { - u32 dwords; u32 *addr = (u32 *) ss->sge.vaddr; /* Update address before sending packet. */ update_sge(ss, len); - /* Need to round up for the last dword in the packet. */ - dwords = (len + 3) >> 2; if (flush_wc) { __iowrite32_copy(piobuf, addr, dwords - 1); /* must flush early everything before trigger word */ @@ -851,11 +912,37 @@ int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, } copy_io(piobuf, ss, len, flush_wc); done: + if (qp->s_wqe) + ipath_send_complete(qp, qp->s_wqe, IB_WC_SUCCESS); ret = 0; bail: return ret; } +/** + * ipath_verbs_send - send a packet + * @qp: the QP to send on + * @hdr: the packet header + * @hdrwords: the number of words in the header + * @ss: the SGE to send + * @len: the length of the packet in bytes + */ +int ipath_verbs_send(struct ipath_qp *qp, struct ipath_ib_header *hdr, + u32 hdrwords, struct ipath_sge_state *ss, u32 len) +{ + u32 plen; + int ret; + u32 dwords = (len + 3) >> 2; + + /* +1 is for the qword padding of pbc */ + plen = hdrwords + dwords + 1; + + ret = ipath_verbs_send_pio(qp, (u32 *) hdr, hdrwords, + ss, len, plen, dwords); + + return ret; +} + int ipath_snapshot_counters(struct ipath_devdata *dd, u64 *swords, u64 *rwords, u64 *spkts, u64 *rpkts, u64 *xmit_wait) @@ -864,7 +951,6 @@ int ipath_snapshot_counters(struct ipath_devdata *dd, u64 *swords, if (!(dd->ipath_flags & IPATH_INITTED)) { /* no hardware, freeze, etc. 
*/ - ipath_dbg("unit %u not usable\n", dd->ipath_unit); ret = -EINVAL; goto bail; } @@ -890,48 +976,44 @@ bail: int ipath_get_counters(struct ipath_devdata *dd, struct ipath_verbs_counters *cntrs) { + struct ipath_cregs const *crp = dd->ipath_cregs; int ret; if (!(dd->ipath_flags & IPATH_INITTED)) { /* no hardware, freeze, etc. */ - ipath_dbg("unit %u not usable\n", dd->ipath_unit); ret = -EINVAL; goto bail; } cntrs->symbol_error_counter = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_ibsymbolerrcnt); + ipath_snap_cntr(dd, crp->cr_ibsymbolerrcnt); cntrs->link_error_recovery_counter = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkerrrecovcnt); + ipath_snap_cntr(dd, crp->cr_iblinkerrrecovcnt); /* * The link downed counter counts when the other side downs the * connection. We add in the number of times we downed the link * due to local link integrity errors to compensate. */ cntrs->link_downed_counter = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_iblinkdowncnt); + ipath_snap_cntr(dd, crp->cr_iblinkdowncnt); cntrs->port_rcv_errors = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_rxdroppktcnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvovflcnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_portovflcnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_err_rlencnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_invalidrlencnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_erricrccnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_errvcrccnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_errlpcrccnt) + - ipath_snap_cntr(dd, dd->ipath_cregs->cr_badformatcnt) + + ipath_snap_cntr(dd, crp->cr_rxdroppktcnt) + + ipath_snap_cntr(dd, crp->cr_rcvovflcnt) + + ipath_snap_cntr(dd, crp->cr_portovflcnt) + + ipath_snap_cntr(dd, crp->cr_err_rlencnt) + + ipath_snap_cntr(dd, crp->cr_invalidrlencnt) + + ipath_snap_cntr(dd, crp->cr_errlinkcnt) + + ipath_snap_cntr(dd, crp->cr_erricrccnt) + + ipath_snap_cntr(dd, crp->cr_errvcrccnt) + + ipath_snap_cntr(dd, crp->cr_errlpcrccnt) + + ipath_snap_cntr(dd, 
crp->cr_badformatcnt) + dd->ipath_rxfc_unsupvl_errs; cntrs->port_rcv_remphys_errors = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_rcvebpcnt); - cntrs->port_xmit_discards = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_unsupvlcnt); - cntrs->port_xmit_data = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordsendcnt); - cntrs->port_rcv_data = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordrcvcnt); - cntrs->port_xmit_packets = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktsendcnt); - cntrs->port_rcv_packets = - ipath_snap_cntr(dd, dd->ipath_cregs->cr_pktrcvcnt); + ipath_snap_cntr(dd, crp->cr_rcvebpcnt); + cntrs->port_xmit_discards = ipath_snap_cntr(dd, crp->cr_unsupvlcnt); + cntrs->port_xmit_data = ipath_snap_cntr(dd, crp->cr_wordsendcnt); + cntrs->port_rcv_data = ipath_snap_cntr(dd, crp->cr_wordrcvcnt); + cntrs->port_xmit_packets = ipath_snap_cntr(dd, crp->cr_pktsendcnt); + cntrs->port_rcv_packets = ipath_snap_cntr(dd, crp->cr_pktrcvcnt); cntrs->local_link_integrity_errors = (dd->ipath_flags & IPATH_GPIO_ERRINTRS) ? 
dd->ipath_lli_errs : dd->ipath_lli_errors; @@ -1045,8 +1127,9 @@ static int ipath_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr *props) { struct ipath_ibdev *dev = to_idev(ibdev); + struct ipath_devdata *dd = dev->dd; enum ib_mtu mtu; - u16 lid = dev->dd->ipath_lid; + u16 lid = dd->ipath_lid; u64 ibcstat; memset(props, 0, sizeof(*props)); @@ -1054,16 +1137,16 @@ static int ipath_query_port(struct ib_device *ibdev, props->lmc = dev->mkeyprot_resv_lmc & 7; props->sm_lid = dev->sm_lid; props->sm_sl = dev->sm_sl; - ibcstat = dev->dd->ipath_lastibcstat; + ibcstat = dd->ipath_lastibcstat; props->state = ((ibcstat >> 4) & 0x3) + 1; /* See phys_state_show() */ props->phys_state = ipath_cvt_physportstate[ - dev->dd->ipath_lastibcstat & 0xf]; + dd->ipath_lastibcstat & 0xf]; props->port_cap_flags = dev->port_cap_flags; props->gid_tbl_len = 1; props->max_msg_sz = 0x80000000; - props->pkey_tbl_len = ipath_get_npkeys(dev->dd); - props->bad_pkey_cntr = ipath_get_cr_errpkey(dev->dd) - + props->pkey_tbl_len = ipath_get_npkeys(dd); + props->bad_pkey_cntr = ipath_get_cr_errpkey(dd) - dev->z_pkey_violations; props->qkey_viol_cntr = dev->qkey_violations; props->active_width = IB_WIDTH_4X; @@ -1073,12 +1156,12 @@ static int ipath_query_port(struct ib_device *ibdev, props->init_type_reply = 0; /* - * Note: the chips support a maximum MTU of 4096, but the driver + * Note: the chip supports a maximum MTU of 4096, but the driver * hasn't implemented this feature yet, so set the maximum value * to 2048. 
*/ props->max_mtu = IB_MTU_2048; - switch (dev->dd->ipath_ibmtu) { + switch (dd->ipath_ibmtu) { case 4096: mtu = IB_MTU_4096; break; @@ -1427,9 +1510,7 @@ static int disable_timer(struct ipath_devdata *dd) { /* Disable GPIO bit 2 interrupt */ if (dd->ipath_flags & IPATH_GPIO_INTR) { - u64 val; /* Disable GPIO bit 2 interrupt */ - val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_gpio_mask); dd->ipath_gpio_mask &= ~((u64) (1 << IPATH_GPIO_PORT0_BIT)); ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_mask, dd->ipath_gpio_mask); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 1a24c6a..619ad72 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -42,6 +42,8 @@ #include #include +#include "ipath_kernel.h" + #define IPATH_MAX_RDMA_ATOMIC 4 #define QPN_MAX (1 << 24) @@ -59,6 +61,7 @@ */ #define IB_CQ_NONE (IB_CQ_NEXT_COMP + 1) +/* AETH NAK opcode values */ #define IB_RNR_NAK 0x20 #define IB_NAK_PSN_ERROR 0x60 #define IB_NAK_INVALID_REQUEST 0x61 @@ -66,6 +69,7 @@ #define IB_NAK_REMOTE_OPERATIONAL_ERROR 0x63 #define IB_NAK_INVALID_RD_REQUEST 0x64 +/* Flags for checking QP state (see ib_ipath_state_ops[]) */ #define IPATH_POST_SEND_OK 0x01 #define IPATH_POST_RECV_OK 0x02 #define IPATH_PROCESS_RECV_OK 0x04 @@ -239,7 +243,7 @@ struct ipath_mregion { */ struct ipath_sge { struct ipath_mregion *mr; - void *vaddr; /* current pointer into the segment */ + void *vaddr; /* kernel virtual address of segment */ u32 sge_length; /* length of the SGE */ u32 length; /* remaining length of the segment */ u16 m; /* current index: mr->map[m] */ @@ -407,6 +411,7 @@ struct ipath_qp { u32 s_ssn; /* SSN of tail entry */ u32 s_lsn; /* limit sequence number (credit) */ struct ipath_swqe *s_wq; /* send work queue */ + struct ipath_swqe *s_wqe; struct ipath_rq r_rq; /* receive work queue */ struct ipath_sge r_sg_list[0]; /* verified SGEs */ }; @@ -683,8 +688,8 @@ void ipath_sqerror_qp(struct 
ipath_qp *qp, struct ib_wc *wc); void ipath_get_credit(struct ipath_qp *qp, u32 aeth); -int ipath_verbs_send(struct ipath_devdata *dd, u32 hdrwords, - u32 *hdr, u32 len, struct ipath_sge_state *ss); +int ipath_verbs_send(struct ipath_qp *qp, struct ipath_ib_header *hdr, + u32 hdrwords, struct ipath_sge_state *ss, u32 len); void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int sig); @@ -692,8 +697,6 @@ void ipath_copy_sge(struct ipath_sge_state *ss, void *data, u32 length); void ipath_skip_sge(struct ipath_sge_state *ss, u32 length); -int ipath_post_ruc_send(struct ipath_qp *qp, struct ib_send_wr *wr); - void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, int has_grh, void *data, u32 tlen, struct ipath_qp *qp); @@ -733,6 +736,8 @@ int ipath_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr); int ipath_destroy_srq(struct ib_srq *ibsrq); +void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int sig); + int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vector, @@ -782,18 +787,28 @@ int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); void ipath_insert_rnr_queue(struct ipath_qp *qp); +int ipath_init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe, + u32 *lengthp, struct ipath_sge_state *ss); + int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only); u32 ipath_make_grh(struct ipath_ibdev *dev, struct ib_grh *hdr, struct ib_global_route *grh, u32 hwords, u32 nwords); -void ipath_do_ruc_send(unsigned long data); +void ipath_make_ruc_header(struct ipath_ibdev *dev, struct ipath_qp *qp, + struct ipath_other_headers *ohdr, + u32 bth0, u32 bth2); + +void ipath_do_send(unsigned long data); + +void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, + enum ib_wc_status status); + +int ipath_make_rc_req(struct ipath_qp *qp); -int ipath_make_rc_req(struct ipath_qp *qp, struct ipath_other_headers 
*ohdr, - u32 pmtu, u32 *bth0p, u32 *bth2p); +int ipath_make_uc_req(struct ipath_qp *qp); -int ipath_make_uc_req(struct ipath_qp *qp, struct ipath_other_headers *ohdr, - u32 pmtu, u32 *bth0p, u32 *bth2p); +int ipath_make_ud_req(struct ipath_qp *qp); int ipath_register_ib_device(struct ipath_devdata *); From arthur.jones at qlogic.com Tue Oct 9 12:59:35 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 12:59:35 -0700 Subject: [ofa-general] [PATCH 04/23] IB/ipath - Verify host bus bandwidth to chip will not limit performance In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009195935.7151.18898.stgit@eng-46.internal.keyresearch.com> From: Dave Olson There have been a number of issues where host bandwidth via HyperTransport or PCIe to the InfiniPath chip has been limited in some fashion (BIOS, configuration, etc.), resulting in user confusion. This check gives a clear warning that something is wrong and needs to be resolved. Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_driver.c | 85 ++++++++++++++++++++++++++++ 1 files changed, 85 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 6ccba36..8fa2bb5 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include #include @@ -280,6 +281,88 @@ void __attribute__((weak)) ipath_disable_wc(struct ipath_devdata *dd) { } +/* + * Perform a PIO buffer bandwidth write test, to verify proper system + * configuration. Even when all the setup calls work, occasionally + * BIOS or other issues can prevent write combining from working, or + * can cause other bandwidth problems to the chip. 
+ * + * This test simply writes the same buffer over and over again, and + * measures close to the peak bandwidth to the chip (not testing + * data bandwidth to the wire). On chips that use an address-based + * trigger to send packets to the wire, this is easy. On chips that + * use a count to trigger, we want to make sure that the packet doesn't + * go out on the wire, or trigger flow control checks. + */ +static void ipath_verify_pioperf(struct ipath_devdata *dd) +{ + u32 pbnum, cnt, lcnt; + u32 __iomem *piobuf; + u32 *addr; + u64 msecs, emsecs; + + piobuf = ipath_getpiobuf(dd, &pbnum); + if (!piobuf) { + dev_info(&dd->pcidev->dev, + "No PIObufs for checking perf, skipping\n"); + goto done; + + } + + /* + * Enough to give us a reasonable test, less than piobuf size, and + * likely multiple of store buffer length. + */ + cnt = 1024; + + addr = vmalloc(cnt); + if (!addr) { + dev_info(&dd->pcidev->dev, + "Couldn't get memory for checking PIO perf," + " skipping\n"); + goto done; + } + + + preempt_disable(); /* we want reasonably accurate elapsed time */ + msecs = 1 + jiffies_to_msecs(jiffies); + for (lcnt = 0; lcnt < 10000U; lcnt++) { + /* wait until we cross msec boundary */ + if (jiffies_to_msecs(jiffies) >= msecs) + break; + udelay(1); + } + + writeq(0, piobuf); /* length 0, no dwords actually sent */ + ipath_flush_wc(); + + /* + * this is only roughly accurate, since even with preempt we + * still take interrupts that could take a while. 
Running for + * >= 5 msec seems to get us "close enough" to accurate values + */ + msecs = jiffies_to_msecs(jiffies); + for (emsecs = lcnt = 0; emsecs <= 5UL; lcnt++) { + __iowrite32_copy(piobuf + 64, addr, cnt >> 2); + emsecs = jiffies_to_msecs(jiffies) - msecs; + } + + /* 1 GiB/sec, slightly over IB SDR line rate */ + if (lcnt < (emsecs * 1024U)) + ipath_dev_err(dd, + "Performance problem: bandwidth to PIO buffers is " + "only %u MiB/sec\n", + lcnt / (u32) emsecs); + else + ipath_dbg("PIO buffer bandwidth %u MiB/sec is OK\n", + lcnt / (u32) emsecs); + + preempt_enable(); +done: + if (piobuf) /* disarm it, so it's available again */ + ipath_disarm_piobufs(dd, pbnum, 1); +} + static int __devinit ipath_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { @@ -515,6 +598,8 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, ret = 0; } + ipath_verify_pioperf(dd); + ipath_device_create_group(&pdev->dev, dd); ipathfs_add_device(dd); ipath_user_add(dd); From arthur.jones at qlogic.com Tue Oct 9 12:59:41 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 12:59:41 -0700 Subject: [ofa-general] [PATCH 05/23] IB/ipath - Remove unneeded code for ipathfs In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009195940.7151.9338.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell The ipathfs file system is used to export binary data, versus the ASCII data exported through /sys. This patch removes some unneeded files since the data is available through other /sys files.
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_fs.c | 187 -------------------------------- 1 files changed, 0 insertions(+), 187 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c index 2e689b9..262c25d 100644 --- a/drivers/infiniband/hw/ipath/ipath_fs.c +++ b/drivers/infiniband/hw/ipath/ipath_fs.c @@ -130,175 +130,6 @@ static const struct file_operations atomic_counters_ops = { .read = atomic_counters_read, }; -static ssize_t atomic_node_info_read(struct file *file, char __user *buf, - size_t count, loff_t *ppos) -{ - u32 nodeinfo[10]; - struct ipath_devdata *dd; - u64 guid; - - dd = file->f_path.dentry->d_inode->i_private; - - guid = be64_to_cpu(dd->ipath_guid); - - nodeinfo[0] = /* BaseVersion is SMA */ - /* ClassVersion is SMA */ - (1 << 8) /* NodeType */ - | (1 << 0); /* NumPorts */ - nodeinfo[1] = (u32) (guid >> 32); - nodeinfo[2] = (u32) (guid & 0xffffffff); - /* PortGUID == SystemImageGUID for us */ - nodeinfo[3] = nodeinfo[1]; - /* PortGUID == SystemImageGUID for us */ - nodeinfo[4] = nodeinfo[2]; - /* PortGUID == NodeGUID for us */ - nodeinfo[5] = nodeinfo[3]; - /* PortGUID == NodeGUID for us */ - nodeinfo[6] = nodeinfo[4]; - nodeinfo[7] = (4 << 16) /* we support 4 pkeys */ - | (dd->ipath_deviceid << 0); - /* our chip version as 16 bits major, 16 bits minor */ - nodeinfo[8] = dd->ipath_minrev | (dd->ipath_majrev << 16); - nodeinfo[9] = (dd->ipath_unit << 24) | (dd->ipath_vendorid << 0); - - return simple_read_from_buffer(buf, count, ppos, nodeinfo, - sizeof nodeinfo); -} - -static const struct file_operations atomic_node_info_ops = { - .read = atomic_node_info_read, -}; - -static ssize_t atomic_port_info_read(struct file *file, char __user *buf, - size_t count, loff_t *ppos) -{ - u32 portinfo[13]; - u32 tmp, tmp2; - struct ipath_devdata *dd; - - dd = file->f_path.dentry->d_inode->i_private; - - /* so we only initialize non-zero fields. 
*/ - memset(portinfo, 0, sizeof portinfo); - - /* - * Notimpl yet M_Key (64) - * Notimpl yet GID (64) - */ - - portinfo[4] = (dd->ipath_lid << 16); - - /* - * Notimpl yet SMLID. - * CapabilityMask is 0, we don't support any of these - * DiagCode is 0; we don't store any diag info for now Notimpl yet - * M_KeyLeasePeriod (we don't support M_Key) - */ - - /* LocalPortNum is whichever port number they ask for */ - portinfo[7] = (dd->ipath_unit << 24) - /* LinkWidthEnabled */ - | (2 << 16) - /* LinkWidthSupported (really 2, but not IB valid) */ - | (3 << 8) - /* LinkWidthActive */ - | (2 << 0); - tmp = dd->ipath_lastibcstat & IPATH_IBSTATE_MASK; - tmp2 = 5; - if (tmp == IPATH_IBSTATE_INIT) - tmp = 2; - else if (tmp == IPATH_IBSTATE_ARM) - tmp = 3; - else if (tmp == IPATH_IBSTATE_ACTIVE) - tmp = 4; - else { - tmp = 0; /* down */ - tmp2 = tmp & 0xf; - } - - portinfo[8] = (1 << 28) /* LinkSpeedSupported */ - | (tmp << 24) /* PortState */ - | (tmp2 << 20) /* PortPhysicalState */ - | (2 << 16) - - /* LinkDownDefaultState */ - /* M_KeyProtectBits == 0 */ - /* NotImpl yet LMC == 0 (we can support all values) */ - | (1 << 4) /* LinkSpeedActive */ - | (1 << 0); /* LinkSpeedEnabled */ - switch (dd->ipath_ibmtu) { - case 4096: - tmp = 5; - break; - case 2048: - tmp = 4; - break; - case 1024: - tmp = 3; - break; - case 512: - tmp = 2; - break; - case 256: - tmp = 1; - break; - default: /* oops, something is wrong */ - ipath_dbg("Problem, ipath_ibmtu 0x%x not a valid IB MTU, " - "treat as 2048\n", dd->ipath_ibmtu); - tmp = 4; - break; - } - portinfo[9] = (tmp << 28) - /* NeighborMTU */ - /* Notimpl MasterSMSL */ - | (1 << 20) - - /* VLCap */ - /* Notimpl InitType (actually, an SMA decision) */ - /* VLHighLimit is 0 (only one VL) */ - ; /* VLArbitrationHighCap is 0 (only one VL) */ - /* - * Note: the chips support a maximum MTU of 4096, but the driver - * hasn't implemented this feature yet, so set the maximum - * to 2048. 
- */ - portinfo[10] = /* VLArbitrationLowCap is 0 (only one VL) */ - /* InitTypeReply is SMA decision */ - (4 << 16) /* MTUCap 2048 */ - | (7 << 13) /* VLStallCount */ - | (0x1f << 8) /* HOQLife */ - | (1 << 4) - - /* OperationalVLs 0 */ - /* PartitionEnforcementInbound */ - /* PartitionEnforcementOutbound not enforced */ - /* FilterRawinbound not enforced */ - ; /* FilterRawOutbound not enforced */ - /* M_KeyViolations are not counted by hardware, SMA can count */ - tmp = ipath_read_creg32(dd, dd->ipath_cregs->cr_errpkey); - /* P_KeyViolations are counted by hardware. */ - portinfo[11] = ((tmp & 0xffff) << 0); - portinfo[12] = - /* Q_KeyViolations are not counted by hardware */ - (1 << 8) - - /* GUIDCap */ - /* SubnetTimeOut handled by SMA */ - /* RespTimeValue handled by SMA */ - ; - /* LocalPhyErrors are programmed to max */ - portinfo[12] |= (0xf << 20) - | (0xf << 16) /* OverRunErrors are programmed to max */ - ; - - return simple_read_from_buffer(buf, count, ppos, portinfo, - sizeof portinfo); -} - -static const struct file_operations atomic_port_info_ops = { - .read = atomic_port_info_read, -}; - static ssize_t flash_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { @@ -427,22 +258,6 @@ static int create_device_files(struct super_block *sb, goto bail; } - ret = create_file("node_info", S_IFREG|S_IRUGO, dir, &tmp, - &atomic_node_info_ops, dd); - if (ret) { - printk(KERN_ERR "create_file(%s/node_info) " - "failed: %d\n", unit, ret); - goto bail; - } - - ret = create_file("port_info", S_IFREG|S_IRUGO, dir, &tmp, - &atomic_port_info_ops, dd); - if (ret) { - printk(KERN_ERR "create_file(%s/port_info) " - "failed: %d\n", unit, ret); - goto bail; - } - ret = create_file("flash", S_IFREG|S_IWUSR|S_IRUGO, dir, &tmp, &flash_ops, dd); if (ret) { @@ -508,8 +323,6 @@ static int remove_device_files(struct super_block *sb, } remove_file(dir, "flash"); - remove_file(dir, "port_info"); - remove_file(dir, "node_info"); remove_file(dir, 
"atomic_counters"); d_delete(dir); ret = simple_rmdir(root->d_inode, dir); From arthur.jones at qlogic.com Tue Oct 9 12:59:46 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 12:59:46 -0700 Subject: [ofa-general] [PATCH 06/23] IB/ipath - correctly describe workaround for TID write chip bug In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009195946.7151.81689.stgit@eng-46.internal.keyresearch.com> From: Dave Olson This is a comment change, only, correcting the comment to match the implemented workaround, rather than the original workaround, and clarifying why it's needed. Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_iba6120.c | 13 ++++++++----- 1 files changed, 8 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index a324c6f..d43f0b3 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -1143,11 +1143,14 @@ static void ipath_pe_put_tid(struct ipath_devdata *dd, u64 __iomem *tidptr, pa |= 2 << 29; } - /* workaround chip bug 9437 by writing each TID twice - * and holding a spinlock around the writes, so they don't - * intermix with other TID (eager or expected) writes - * Unfortunately, this call can be done from interrupt level - * for the port 0 eager TIDs, so we have to use irqsave + /* + * Workaround chip bug 9437 by writing the scratch register + * before and after the TID, and with an io write barrier. + * We use a spinlock around the writes, so they can't intermix + * with other TID (eager or expected) writes (the chip bug + * is triggered by back to back TID writes). Unfortunately, this + * call can be done from interrupt level for the port 0 eager TIDs, + * so we have to use irqsave locks. 
*/ spin_lock_irqsave(&dd->ipath_tid_lock, flags); ipath_write_kreg(dd, dd->ipath_kregs->kr_scratch, 0xfeeddeaf); From arthur.jones at qlogic.com Tue Oct 9 12:59:51 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 12:59:51 -0700 Subject: [ofa-general] [PATCH 07/23] IB/ipath - UC RDMA WRITE with IMMEDIATE doesn't send the immediate In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009195951.7151.35880.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell This patch fixes a bug in the receive processing for UC RDMA WRITE with immediate which caused the last packet to be dropped. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_uc.c | 21 +++++++++++---------- 1 files changed, 11 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c index 767beb9..2dd8de2 100644 --- a/drivers/infiniband/hw/ipath/ipath_uc.c +++ b/drivers/infiniband/hw/ipath/ipath_uc.c @@ -464,6 +464,16 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, case OP(RDMA_WRITE_LAST_WITH_IMMEDIATE): rdma_last_imm: + if (header_in_data) { + wc.imm_data = *(__be32 *) data; + data += sizeof(__be32); + } else { + /* Immediate data comes after BTH */ + wc.imm_data = ohdr->u.imm_data; + } + hdrsize += 4; + wc.wc_flags = IB_WC_WITH_IMM; + /* Get the number of bytes the message was padded by. */ pad = (be32_to_cpu(ohdr->bth[0]) >> 20) & 3; /* Check for invalid length. 
*/ @@ -484,16 +494,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, dev->n_pkt_drops++; goto done; } - if (header_in_data) { - wc.imm_data = *(__be32 *) data; - data += sizeof(__be32); - } else { - /* Immediate data comes after BTH */ - wc.imm_data = ohdr->u.imm_data; - } - hdrsize += 4; - wc.wc_flags = IB_WC_WITH_IMM; - wc.byte_len = 0; + wc.byte_len = qp->r_len; goto last_imm; case OP(RDMA_WRITE_LAST): From arthur.jones at qlogic.com Tue Oct 9 12:59:56 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 12:59:56 -0700 Subject: [ofa-general] [PATCH 08/23] IB/ipath - future proof eeprom checksum code (contents reading) In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009195956.7151.38955.stgit@eng-46.internal.keyresearch.com> From: Dave Olson In an earlier change, the amount of data read from the flash was mistakenly limited to the size known to the current driver. This causes problems when the length is increased, and written with the new longer version; the checksum would fail because not enough data was read. Always read the full 128 byte length to prevent this. 
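To illustrate the failure mode described above, here is a minimal sketch that is not part of the patch. It assumes a hypothetical 128-byte layout (struct demo_flash) and a simple additive checksum (demo_csum); neither is the real struct ipath_flash or the driver's checksum routine. It only shows why a reader that stops at the start of the reserved area can reject an image that a newer writer checksummed over the full length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-byte flash image; NOT the real struct ipath_flash. */
struct demo_flash {
	uint8_t version;    /* layout version */
	uint8_t csum;       /* checksum over the image, excluding this byte */
	uint8_t known[30];  /* fields the current driver understands */
	uint8_t future[96]; /* reserved area a newer writer may start using */
};

static_assert(sizeof(struct demo_flash) == 128, "image must be 128 bytes");

/* Hypothetical checksum: inverted byte sum of the first len bytes. */
static uint8_t demo_csum(const struct demo_flash *f, size_t len)
{
	const uint8_t *p = (const uint8_t *)f;
	uint8_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		if (i != offsetof(struct demo_flash, csum))
			sum += p[i];
	return (uint8_t)~sum;
}

/* Returns 1 if the stored checksum matches one computed over len bytes. */
static int demo_verify(const struct demo_flash *f, size_t len)
{
	return demo_csum(f, len) == f->csum;
}

/* Simulates a newer tool that uses the reserved area and sums all 128 bytes. */
static void demo_write_newer(struct demo_flash *f)
{
	memset(f, 0, sizeof(*f));
	f->version = 2;
	f->future[0] = 0x5a;
	f->csum = demo_csum(f, sizeof(*f));
}
```

With an image produced by demo_write_newer(), demo_verify(&f, sizeof(f)) succeeds, while demo_verify(&f, offsetof(struct demo_flash, future)) fails because the byte written into the reserved area is never summed.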
Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_eeprom.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_eeprom.c b/drivers/infiniband/hw/ipath/ipath_eeprom.c index b4503e9..bcfa3cc 100644 --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c @@ -596,7 +596,11 @@ void ipath_get_eeprom_info(struct ipath_devdata *dd) goto bail; } - len = offsetof(struct ipath_flash, if_future); + /* + * read full flash, not just currently used part, since it may have + * been written with a newer definition + * */ + len = sizeof(struct ipath_flash); buf = vmalloc(len); if (!buf) { ipath_dev_err(dd, "Couldn't allocate memory to read %u " @@ -737,8 +741,10 @@ int ipath_update_eeprom_log(struct ipath_devdata *dd) /* * The quick-check above determined that there is something worthy * of logging, so get current contents and do a more detailed idea. + * read full flash, not just currently used part, since it may have + * been written with a newer definition */ - len = offsetof(struct ipath_flash, if_future); + len = sizeof(struct ipath_flash); buf = vmalloc(len); ret = 1; if (!buf) { From arthur.jones at qlogic.com Tue Oct 9 13:00:01 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:01 -0700 Subject: [ofa-general] [PATCH 09/23] IB/ipath - Remove redundant code In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200001.7151.74042.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell This patch removes some redundant initialization code. 
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_driver.c | 5 ----- 1 files changed, 0 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 8fa2bb5..e5d058a 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -381,8 +381,6 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, ipath_cdbg(VERBOSE, "initializing unit #%u\n", dd->ipath_unit); - read_bars(dd, pdev, &bar0, &bar1); - ret = pci_enable_device(pdev); if (ret) { /* This can happen iff: @@ -528,9 +526,6 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, goto bail_regions; } - dd->ipath_deviceid = ent->device; /* save for later use */ - dd->ipath_vendorid = ent->vendor; - dd->ipath_pcirev = pdev->revision; #if defined(__powerpc__) From arthur.jones at qlogic.com Tue Oct 9 13:00:06 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:06 -0700 Subject: [ofa-general] [PATCH 10/23] IB/ipath - generate flush CQE when QP is in error state. In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200006.7151.20819.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell Follow the IB spec. (C10-96) for post send which states that a flushed completion event should be generated when the QP is in the error state. 
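The decision this describes can be sketched as a small pure function. This is an illustration only, not the driver code: the enum names are hypothetical, and whether a given state permits posting is passed in as post_send_ok rather than assumed, since that table lives in the driver.

```c
#include <assert.h>

/* QP states named after the IB verbs states; values are illustrative. */
enum demo_qp_state {
	DEMO_QPS_RESET, DEMO_QPS_INIT, DEMO_QPS_RTR, DEMO_QPS_RTS,
	DEMO_QPS_SQD, DEMO_QPS_SQE, DEMO_QPS_ERR
};

enum demo_post_result {
	DEMO_POST_QUEUE, /* state permits posting: queue the work request */
	DEMO_POST_FLUSH, /* return success and generate a flushed completion */
	DEMO_POST_INVAL  /* reject the work request as invalid */
};

/*
 * post_send_ok is nonzero when the caller's state table says the QP
 * state allows post send. Only the SQE and ERR states get a flushed
 * completion; every other non-postable state is an error.
 */
static enum demo_post_result
demo_post_send_action(enum demo_qp_state state, int post_send_ok)
{
	if (post_send_ok)
		return DEMO_POST_QUEUE;
	if (state != DEMO_QPS_SQE && state != DEMO_QPS_ERR)
		return DEMO_POST_INVAL;
	return DEMO_POST_FLUSH;
}
```

The point of the three-way result is that a post to a QP in the error state still returns success to the caller; the failure is reported later through the completion queue.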
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_verbs.c | 22 ++++++++++++++++++++-- 1 files changed, 20 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 3cc82b6..495194b 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -230,6 +230,18 @@ void ipath_skip_sge(struct ipath_sge_state *ss, u32 length) } } +static void ipath_flush_wqe(struct ipath_qp *qp, struct ib_send_wr *wr) +{ + struct ib_wc wc; + + memset(&wc, 0, sizeof(wc)); + wc.wr_id = wr->wr_id; + wc.status = IB_WC_WR_FLUSH_ERR; + wc.opcode = ib_ipath_wc_opcode[wr->opcode]; + wc.qp = &qp->ibqp; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); +} + /** * ipath_post_one_send - post one RC, UC, or UD send work request * @qp: the QP to post on @@ -248,8 +260,14 @@ static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr) spin_lock_irqsave(&qp->s_lock, flags); /* Check that state is OK to post send. */ - if (!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK)) - goto bail_inval; + if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK))) { + if (qp->state != IB_QPS_SQE && qp->state != IB_QPS_ERR) + goto bail_inval; + /* C10-96 says generate a flushed completion entry. */ + ipath_flush_wqe(qp, wr); + ret = 0; + goto bail; + } /* IB spec says that num_sge == 0 is OK. 
*/ if (wr->num_sge > qp->s_max_sge) From arthur.jones at qlogic.com Tue Oct 9 13:00:11 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:11 -0700 Subject: [ofa-general] [PATCH 11/23] IB/ipath - implement IB_EVENT_QP_LAST_WQE_REACHED In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200011.7151.96154.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell This patch implements the IB_EVENT_QP_LAST_WQE_REACHED event which is needed by ib_ipoib to destroy the QP when used in connected mode. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_qp.c | 20 +++++++++++++++++--- drivers/infiniband/hw/ipath/ipath_rc.c | 12 +++++++++++- drivers/infiniband/hw/ipath/ipath_verbs.h | 2 +- 3 files changed, 29 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index a8c4a6b..6a41fdb 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -377,13 +377,15 @@ static void ipath_reset_qp(struct ipath_qp *qp) * @err: the receive completion error to signal if a RWQE is active * * Flushes both send and receive work queues. + * Returns true if last WQE event should be generated. * The QP s_lock should be held and interrupts disabled. 
*/ -void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) +int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ib_wc wc; + int ret = 0; ipath_dbg("QP%d/%d in error state\n", qp->ibqp.qp_num, qp->remote_qpn); @@ -454,7 +456,10 @@ void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) wq->tail = tail; spin_unlock(&qp->r_rq.lock); - } + } else if (qp->ibqp.event_handler) + ret = 1; + + return ret; } /** @@ -473,6 +478,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, struct ipath_qp *qp = to_iqp(ibqp); enum ib_qp_state cur_state, new_state; unsigned long flags; + int lastwqe = 0; int ret; spin_lock_irqsave(&qp->s_lock, flags); @@ -532,7 +538,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, break; case IB_QPS_ERR: - ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); + lastwqe = ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); break; default: @@ -591,6 +597,14 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, qp->state = new_state; spin_unlock_irqrestore(&qp->s_lock, flags); + if (lastwqe) { + struct ib_event ev; + + ev.device = qp->ibqp.device; + ev.element.qp = &qp->ibqp; + ev.event = IB_EVENT_QP_LAST_WQE_REACHED; + qp->ibqp.event_handler(&ev, qp->ibqp.qp_context); + } ret = 0; goto bail; diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 53259da..5c29b2b 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -1497,11 +1497,21 @@ send_ack: static void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) { unsigned long flags; + int lastwqe; spin_lock_irqsave(&qp->s_lock, flags); qp->state = IB_QPS_ERR; - ipath_error_qp(qp, err); + lastwqe = ipath_error_qp(qp, err); spin_unlock_irqrestore(&qp->s_lock, flags); + + if (lastwqe) { + struct ib_event ev; + + ev.device = qp->ibqp.device; + ev.element.qp = &qp->ibqp; + ev.event = 
IB_EVENT_QP_LAST_WQE_REACHED; + qp->ibqp.event_handler(&ev, qp->ibqp.qp_context); + } } static inline void ipath_update_ack_queue(struct ipath_qp *qp, unsigned n) diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 619ad72..a197229 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -672,7 +672,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, int ipath_destroy_qp(struct ib_qp *ibqp); -void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err); +int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err); int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_udata *udata); From arthur.jones at qlogic.com Tue Oct 9 13:00:17 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:17 -0700 Subject: [ofa-general] [PATCH 12/23] IB/ipath - optimize completion queue entry insertion and polling In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200016.7151.1427.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell The code to add an entry to the completion queue stored the QPN, which is needed for the user level verbs view of the completion queue entry, but the kernel struct ib_wc contains a pointer to the QP instead of a QPN. When the kernel polled for a completion queue entry, the QPN was looked up and the QP pointer recovered. This patch stores the CQE differently based on whether the CQ is a kernel CQ or a user CQ, thus avoiding the QPN to QP lookup overhead.
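The idea can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the driver code: all demo_* names are hypothetical, the entries are cut down to two fields, and the ring uses calloc() in place of the driver's allocation. A user CQ stores entries that carry the QP number; a kernel CQ stores the caller's entry unchanged, pointer included, so polling is a plain copy with no QPN lookup.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_qp { uint32_t qp_num; };

/* User-visible entry: identifies the QP by number. */
struct demo_uwc { uint64_t wr_id; uint32_t qp_num; };
/* Kernel entry: identifies the QP by pointer. */
struct demo_kwc { uint64_t wr_id; struct demo_qp *qp; };

struct demo_cq {
	int is_user;     /* nonzero if the ring is meant for user space */
	uint32_t head;   /* index of next slot to fill */
	uint32_t tail;   /* index of next slot to poll */
	uint32_t nslots; /* cqe + 1; one slot stays empty to mark "full" */
	void *slots;     /* array of demo_uwc or demo_kwc, per is_user */
};

static struct demo_cq *demo_cq_create(int is_user, uint32_t cqe)
{
	struct demo_cq *cq = calloc(1, sizeof(*cq));

	if (!cq)
		return NULL;
	cq->is_user = is_user;
	cq->nslots = cqe + 1;
	/* Size the ring by the entry format this CQ will actually hold. */
	cq->slots = calloc(cq->nslots, is_user ? sizeof(struct demo_uwc)
					       : sizeof(struct demo_kwc));
	if (!cq->slots) {
		free(cq);
		return NULL;
	}
	return cq;
}

/* Adds one entry; returns 0 on success, -1 if the ring is full. */
static int demo_cq_enter(struct demo_cq *cq, const struct demo_kwc *entry)
{
	uint32_t next = (cq->head + 1) % cq->nslots;

	assert(cq && entry);
	if (next == cq->tail)
		return -1;
	if (cq->is_user) {
		/* Convert to the user format: pointer becomes QP number. */
		struct demo_uwc *u = (struct demo_uwc *)cq->slots + cq->head;

		u->wr_id = entry->wr_id;
		u->qp_num = entry->qp->qp_num;
	} else {
		/* Kernel format: store the entry as-is. */
		((struct demo_kwc *)cq->slots)[cq->head] = *entry;
	}
	cq->head = next;
	return 0;
}

/* Polls one entry; returns 1 if found, 0 if empty, -1 for a user CQ. */
static int demo_cq_poll(struct demo_cq *cq, struct demo_kwc *out)
{
	if (cq->is_user)
		return -1; /* in-kernel polling only makes sense for kernel CQs */
	if (cq->tail == cq->head)
		return 0;
	*out = ((struct demo_kwc *)cq->slots)[cq->tail];
	cq->tail = (cq->tail + 1) % cq->nslots;
	return 1;
}
```

The trade-off is that the two kinds of CQ are no longer interchangeable: the sketch, like the patch, refuses to poll a user CQ from the kernel side, and the allocation size depends on which entry type the ring holds.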
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_cq.c | 94 +++++++++++++++-------------- drivers/infiniband/hw/ipath/ipath_verbs.h | 6 ++ 2 files changed, 53 insertions(+), 47 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index a6f04d2..645ed71 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -76,22 +76,25 @@ void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited) } return; } - wc->queue[head].wr_id = entry->wr_id; - wc->queue[head].status = entry->status; - wc->queue[head].opcode = entry->opcode; - wc->queue[head].vendor_err = entry->vendor_err; - wc->queue[head].byte_len = entry->byte_len; - wc->queue[head].imm_data = (__u32 __force)entry->imm_data; - wc->queue[head].qp_num = entry->qp->qp_num; - wc->queue[head].src_qp = entry->src_qp; - wc->queue[head].wc_flags = entry->wc_flags; - wc->queue[head].pkey_index = entry->pkey_index; - wc->queue[head].slid = entry->slid; - wc->queue[head].sl = entry->sl; - wc->queue[head].dlid_path_bits = entry->dlid_path_bits; - wc->queue[head].port_num = entry->port_num; - /* Make sure queue entry is written before the head index. */ - smp_wmb(); + if (cq->ip) { + wc->uqueue[head].wr_id = entry->wr_id; + wc->uqueue[head].status = entry->status; + wc->uqueue[head].opcode = entry->opcode; + wc->uqueue[head].vendor_err = entry->vendor_err; + wc->uqueue[head].byte_len = entry->byte_len; + wc->uqueue[head].imm_data = (__u32 __force)entry->imm_data; + wc->uqueue[head].qp_num = entry->qp->qp_num; + wc->uqueue[head].src_qp = entry->src_qp; + wc->uqueue[head].wc_flags = entry->wc_flags; + wc->uqueue[head].pkey_index = entry->pkey_index; + wc->uqueue[head].slid = entry->slid; + wc->uqueue[head].sl = entry->sl; + wc->uqueue[head].dlid_path_bits = entry->dlid_path_bits; + wc->uqueue[head].port_num = entry->port_num; + /* Make sure entry is written before the head index. 
*/ + smp_wmb(); + } else + wc->kqueue[head] = *entry; wc->head = next; if (cq->notify == IB_CQ_NEXT_COMP || @@ -130,6 +133,12 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) int npolled; u32 tail; + /* The kernel can only poll a kernel completion queue */ + if (cq->ip) { + npolled = -EINVAL; + goto bail; + } + spin_lock_irqsave(&cq->lock, flags); wc = cq->queue; @@ -137,31 +146,10 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) if (tail > (u32) cq->ibcq.cqe) tail = (u32) cq->ibcq.cqe; for (npolled = 0; npolled < num_entries; ++npolled, ++entry) { - struct ipath_qp *qp; - if (tail == wc->head) break; - /* Make sure entry is read after head index is read. */ - smp_rmb(); - qp = ipath_lookup_qpn(&to_idev(cq->ibcq.device)->qp_table, - wc->queue[tail].qp_num); - entry->qp = &qp->ibqp; - if (atomic_dec_and_test(&qp->refcount)) - wake_up(&qp->wait); - - entry->wr_id = wc->queue[tail].wr_id; - entry->status = wc->queue[tail].status; - entry->opcode = wc->queue[tail].opcode; - entry->vendor_err = wc->queue[tail].vendor_err; - entry->byte_len = wc->queue[tail].byte_len; - entry->imm_data = wc->queue[tail].imm_data; - entry->src_qp = wc->queue[tail].src_qp; - entry->wc_flags = wc->queue[tail].wc_flags; - entry->pkey_index = wc->queue[tail].pkey_index; - entry->slid = wc->queue[tail].slid; - entry->sl = wc->queue[tail].sl; - entry->dlid_path_bits = wc->queue[tail].dlid_path_bits; - entry->port_num = wc->queue[tail].port_num; + /* The kernel doesn't need a RMB since it has the lock. 
*/ + *entry = wc->kqueue[tail]; if (tail >= cq->ibcq.cqe) tail = 0; else @@ -171,6 +159,7 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) spin_unlock_irqrestore(&cq->lock, flags); +bail: return npolled; } @@ -215,6 +204,7 @@ struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vec struct ipath_cq *cq; struct ipath_cq_wc *wc; struct ib_cq *ret; + u32 sz; if (entries < 1 || entries > ib_ipath_max_cqes) { ret = ERR_PTR(-EINVAL); @@ -235,7 +225,12 @@ struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vec * We need to use vmalloc() in order to support mmap and large * numbers of entries. */ - wc = vmalloc_user(sizeof(*wc) + sizeof(struct ib_wc) * entries); + sz = sizeof(*wc); + if (udata && udata->outlen >= sizeof(__u64)) + sz += sizeof(struct ib_uverbs_wc) * (entries + 1); + else + sz += sizeof(struct ib_wc) * (entries + 1); + wc = vmalloc_user(sz); if (!wc) { ret = ERR_PTR(-ENOMEM); goto bail_cq; @@ -247,9 +242,8 @@ struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, int comp_vec */ if (udata && udata->outlen >= sizeof(__u64)) { int err; - u32 s = sizeof *wc + sizeof(struct ib_wc) * entries; - cq->ip = ipath_create_mmap_info(dev, s, context, wc); + cq->ip = ipath_create_mmap_info(dev, sz, context, wc); if (!cq->ip) { ret = ERR_PTR(-ENOMEM); goto bail_wc; @@ -380,6 +374,7 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) struct ipath_cq_wc *wc; u32 head, tail, n; int ret; + u32 sz; if (cqe < 1 || cqe > ib_ipath_max_cqes) { ret = -EINVAL; @@ -389,7 +384,12 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) /* * Need to use vmalloc() if we want to support large #s of entries. 
*/ - wc = vmalloc_user(sizeof(*wc) + sizeof(struct ib_wc) * cqe); + sz = sizeof(*wc); + if (udata && udata->outlen >= sizeof(__u64)) + sz += sizeof(struct ib_uverbs_wc) * (cqe + 1); + else + sz += sizeof(struct ib_wc) * (cqe + 1); + wc = vmalloc_user(sz); if (!wc) { ret = -ENOMEM; goto bail; @@ -430,7 +430,10 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) goto bail; } for (n = 0; tail != head; n++) { - wc->queue[n] = old_wc->queue[tail]; + if (cq->ip) + wc->uqueue[n] = old_wc->uqueue[tail]; + else + wc->kqueue[n] = old_wc->kqueue[tail]; if (tail == (u32) cq->ibcq.cqe) tail = 0; else @@ -447,9 +450,8 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata) if (cq->ip) { struct ipath_ibdev *dev = to_idev(ibcq->device); struct ipath_mmap_info *ip = cq->ip; - u32 s = sizeof *wc + sizeof(struct ib_wc) * cqe; - ipath_update_mmap_info(dev, ip, s, wc); + ipath_update_mmap_info(dev, ip, sz, wc); spin_lock_irq(&dev->pending_lock); if (list_empty(&ip->pending_mmaps)) list_add(&ip->pending_mmaps, &dev->pending_mmaps); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index a197229..9be9bf9 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -191,7 +191,11 @@ struct ipath_mmap_info { struct ipath_cq_wc { u32 head; /* index of next entry to fill */ u32 tail; /* index of next ib_poll_cq() entry */ - struct ib_uverbs_wc queue[1]; /* this is actually size ibcq.cqe + 1 */ + union { + /* these are actually size ibcq.cqe + 1 */ + struct ib_uverbs_wc uqueue[0]; + struct ib_wc kqueue[0]; + }; }; /* From arthur.jones at qlogic.com Tue Oct 9 13:00:22 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:22 -0700 Subject: [ofa-general] [PATCH 13/23] IB/ipath -- Add ability to set the LMC via the sysfs debugging interface In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: 
<20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200022.7151.83645.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell This patch adds the ability to set the LMC via a sysfs file as if the SM sent a SubnSet(PortInfo) MAD. It is useful for debugging when no SM is running. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_sysfs.c | 40 ++++++++++++++++++++++++++++- 1 files changed, 39 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_sysfs.c b/drivers/infiniband/hw/ipath/ipath_sysfs.c index 16238cd..e1ad7cf 100644 --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c @@ -163,6 +163,42 @@ static ssize_t show_boardversion(struct device *dev, return scnprintf(buf, PAGE_SIZE, "%s", dd->ipath_boardversion); } +static ssize_t show_lmc(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + + return scnprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_lmc); +} + +static ssize_t store_lmc(struct device *dev, + struct device_attribute *attr, + const char *buf, + size_t count) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + u16 lmc = 0; + int ret; + + ret = ipath_parse_ushort(buf, &lmc); + if (ret < 0) + goto invalid; + + if (lmc > 7) { + ret = -EINVAL; + goto invalid; + } + + ipath_set_lid(dd, dd->ipath_lid, lmc); + + goto bail; +invalid: + ipath_dev_err(dd, "attempt to set invalid LMC %u\n", lmc); +bail: + return ret; +} + static ssize_t show_lid(struct device *dev, struct device_attribute *attr, char *buf) @@ -190,7 +226,7 @@ static ssize_t store_lid(struct device *dev, goto invalid; } - ipath_set_lid(dd, lid, 0); + ipath_set_lid(dd, lid, dd->ipath_lmc); goto bail; invalid: @@ -648,6 +684,7 @@ static struct attribute_group driver_attr_group = { }; static DEVICE_ATTR(guid, S_IWUSR | S_IRUGO, show_guid, store_guid); +static DEVICE_ATTR(lmc, S_IWUSR | S_IRUGO, show_lmc, store_lmc); 
static DEVICE_ATTR(lid, S_IWUSR | S_IRUGO, show_lid, store_lid); static DEVICE_ATTR(link_state, S_IWUSR, NULL, store_link_state); static DEVICE_ATTR(mlid, S_IWUSR | S_IRUGO, show_mlid, store_mlid); @@ -667,6 +704,7 @@ static DEVICE_ATTR(logged_errors, S_IRUGO, show_logged_errs, NULL); static struct attribute *dev_attributes[] = { &dev_attr_guid.attr, + &dev_attr_lmc.attr, &dev_attr_lid.attr, &dev_attr_link_state.attr, &dev_attr_mlid.attr, From arthur.jones at qlogic.com Tue Oct 9 13:00:27 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:27 -0700 Subject: [ofa-general] [PATCH 14/23] IB/ipath - remove duplicate copy of LMC In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200027.7151.73840.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell The LMC value was being saved by the SMA in two places. This patch cleans it up so only one copy is kept. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_mad.c | 39 ++++++++++++++++------------- drivers/infiniband/hw/ipath/ipath_ud.c | 10 ++++--- drivers/infiniband/hw/ipath/ipath_verbs.c | 4 +-- drivers/infiniband/hw/ipath/ipath_verbs.h | 2 + 4 files changed, 29 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index d61c030..8f15216 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -245,7 +245,7 @@ static int recv_subn_get_portinfo(struct ib_smp *smp, /* Only return the mkey if the protection field allows it. 
*/ if (smp->method == IB_MGMT_METHOD_SET || dev->mkey == smp->mkey || - (dev->mkeyprot_resv_lmc >> 6) == 0) + dev->mkeyprot == 0) pip->mkey = dev->mkey; pip->gid_prefix = dev->gid_prefix; lid = dev->dd->ipath_lid; @@ -264,7 +264,7 @@ static int recv_subn_get_portinfo(struct ib_smp *smp, pip->portphysstate_linkdown = (ipath_cvt_physportstate[ibcstat & 0xf] << 4) | (get_linkdowndefaultstate(dev->dd) ? 1 : 2); - pip->mkeyprot_resv_lmc = dev->mkeyprot_resv_lmc; + pip->mkeyprot_resv_lmc = (dev->mkeyprot << 6) | dev->dd->ipath_lmc; pip->linkspeedactive_enabled = 0x11; /* 2.5Gbps, 2.5Gbps */ switch (dev->dd->ipath_ibmtu) { case 4096: @@ -401,6 +401,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, struct ib_port_info *pip = (struct ib_port_info *)smp->data; struct ib_event event; struct ipath_ibdev *dev; + struct ipath_devdata *dd; u32 flags; char clientrereg = 0; u16 lid, smlid; @@ -415,6 +416,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, goto err; dev = to_idev(ibdev); + dd = dev->dd; event.device = ibdev; event.element.port_num = port; @@ -423,11 +425,12 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, dev->mkey_lease_period = be16_to_cpu(pip->mkey_lease_period); lid = be16_to_cpu(pip->lid); - if (lid != dev->dd->ipath_lid) { + if (dd->ipath_lid != lid || + dd->ipath_lmc != (pip->mkeyprot_resv_lmc & 7)) { /* Must be a valid unicast LID address. 
*/ if (lid == 0 || lid >= IPATH_MULTICAST_LID_BASE) goto err; - ipath_set_lid(dev->dd, lid, pip->mkeyprot_resv_lmc & 7); + ipath_set_lid(dd, lid, pip->mkeyprot_resv_lmc & 7); event.event = IB_EVENT_LID_CHANGE; ib_dispatch_event(&event); } @@ -461,18 +464,18 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, case 0: /* NOP */ break; case 1: /* SLEEP */ - if (set_linkdowndefaultstate(dev->dd, 1)) + if (set_linkdowndefaultstate(dd, 1)) goto err; break; case 2: /* POLL */ - if (set_linkdowndefaultstate(dev->dd, 0)) + if (set_linkdowndefaultstate(dd, 0)) goto err; break; default: goto err; } - dev->mkeyprot_resv_lmc = pip->mkeyprot_resv_lmc; + dev->mkeyprot = pip->mkeyprot_resv_lmc >> 6; dev->vl_high_limit = pip->vl_high_limit; switch ((pip->neighbormtu_mastersmsl >> 4) & 0xF) { @@ -495,7 +498,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, /* XXX We have already partially updated our state! */ goto err; } - ipath_set_mtu(dev->dd, mtu); + ipath_set_mtu(dd, mtu); dev->sm_sl = pip->neighbormtu_mastersmsl & 0xF; @@ -511,16 +514,16 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, * later. */ if (pip->pkey_violations == 0) - dev->z_pkey_violations = ipath_get_cr_errpkey(dev->dd); + dev->z_pkey_violations = ipath_get_cr_errpkey(dd); if (pip->qkey_violations == 0) dev->qkey_violations = 0; ore = pip->localphyerrors_overrunerrors; - if (set_phyerrthreshold(dev->dd, (ore >> 4) & 0xF)) + if (set_phyerrthreshold(dd, (ore >> 4) & 0xF)) goto err; - if (set_overrunthreshold(dev->dd, (ore & 0xF))) + if (set_overrunthreshold(dd, (ore & 0xF))) goto err; dev->subnet_timeout = pip->clientrereg_resv_subnetto & 0x1F; @@ -538,7 +541,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, * is down or is being set to down. 
*/ state = pip->linkspeed_portstate & 0xF; - flags = dev->dd->ipath_flags; + flags = dd->ipath_flags; lstate = (pip->portphysstate_linkdown >> 4) & 0xF; if (lstate && !(state == IB_PORT_DOWN || state == IB_PORT_NOP)) goto err; @@ -554,7 +557,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, /* FALLTHROUGH */ case IB_PORT_DOWN: if (lstate == 0) - if (get_linkdowndefaultstate(dev->dd)) + if (get_linkdowndefaultstate(dd)) lstate = IPATH_IB_LINKDOWN_SLEEP; else lstate = IPATH_IB_LINKDOWN; @@ -566,7 +569,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, lstate = IPATH_IB_LINKDOWN_DISABLE; else goto err; - ipath_set_linkstate(dev->dd, lstate); + ipath_set_linkstate(dd, lstate); if (flags & IPATH_LINKACTIVE) { event.event = IB_EVENT_PORT_ERR; ib_dispatch_event(&event); @@ -575,7 +578,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, case IB_PORT_ARMED: if (!(flags & (IPATH_LINKINIT | IPATH_LINKACTIVE))) break; - ipath_set_linkstate(dev->dd, IPATH_IB_LINKARM); + ipath_set_linkstate(dd, IPATH_IB_LINKARM); if (flags & IPATH_LINKACTIVE) { event.event = IB_EVENT_PORT_ERR; ib_dispatch_event(&event); @@ -584,7 +587,7 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, case IB_PORT_ACTIVE: if (!(flags & IPATH_LINKARMED)) break; - ipath_set_linkstate(dev->dd, IPATH_IB_LINKACTIVE); + ipath_set_linkstate(dd, IPATH_IB_LINKACTIVE); event.event = IB_EVENT_PORT_ACTIVE; ib_dispatch_event(&event); break; @@ -1350,7 +1353,7 @@ static int process_subn(struct ib_device *ibdev, int mad_flags, if (dev->mkey_lease_timeout && jiffies >= dev->mkey_lease_timeout) { /* Clear timeout and mkey protection field. 
*/ dev->mkey_lease_timeout = 0; - dev->mkeyprot_resv_lmc &= 0x3F; + dev->mkeyprot = 0; } /* @@ -1361,7 +1364,7 @@ static int process_subn(struct ib_device *ibdev, int mad_flags, dev->mkey != smp->mkey && (smp->method == IB_MGMT_METHOD_SET || (smp->method == IB_MGMT_METHOD_GET && - (dev->mkeyprot_resv_lmc >> 7) != 0))) { + dev->mkeyprot >= 2))) { if (dev->mkey_violations != 0xFFFF) ++dev->mkey_violations; if (dev->mkey_lease_timeout || diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c index 34c4a0a..16a2a93 100644 --- a/drivers/infiniband/hw/ipath/ipath_ud.c +++ b/drivers/infiniband/hw/ipath/ipath_ud.c @@ -236,10 +236,10 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe) wc.pkey_index = 0; wc.slid = dev->dd->ipath_lid | (ah_attr->src_path_bits & - ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1)); + ((1 << dev->dd->ipath_lmc) - 1)); wc.sl = ah_attr->sl; wc.dlid_path_bits = - ah_attr->dlid & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); + ah_attr->dlid & ((1 << dev->dd->ipath_lmc) - 1); wc.port_num = 1; /* Signal completion event if the solicited bit is set. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, @@ -289,7 +289,7 @@ int ipath_make_ud_req(struct ipath_qp *qp) } else { dev->n_unicast_xmit++; lid = ah_attr->dlid & - ~((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); + ~((1 << dev->dd->ipath_lmc) - 1); if (unlikely(lid == dev->dd->ipath_lid)) { ipath_ud_loopback(qp, wqe); goto done; @@ -341,7 +341,7 @@ int ipath_make_ud_req(struct ipath_qp *qp) lid = dev->dd->ipath_lid; if (lid) { lid |= ah_attr->src_path_bits & - ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); + ((1 << dev->dd->ipath_lmc) - 1); qp->s_hdr.lrh[3] = cpu_to_be16(lid); } else qp->s_hdr.lrh[3] = IB_LID_PERMISSIVE; @@ -551,7 +551,7 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, * Save the LMC lower bits if the destination LID is a unicast LID. */ wc.dlid_path_bits = dlid >= IPATH_MULTICAST_LID_BASE ? 
0 : - dlid & ((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); + dlid & ((1 << dev->dd->ipath_lmc) - 1); wc.port_num = 1; /* Signal completion event if the solicited bit is set. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 495194b..13aba3d 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -513,7 +513,7 @@ void ipath_ib_rcv(struct ipath_ibdev *dev, void *rhdr, void *data, /* Check for a valid destination LID (see ch. 7.11.1). */ lid = be16_to_cpu(hdr->lrh[1]); if (lid < IPATH_MULTICAST_LID_BASE) { - lid &= ~((1 << (dev->mkeyprot_resv_lmc & 7)) - 1); + lid &= ~((1 << dev->dd->ipath_lmc) - 1); if (unlikely(lid != dev->dd->ipath_lid)) { dev->rcv_errors++; goto bail; @@ -1152,7 +1152,7 @@ static int ipath_query_port(struct ib_device *ibdev, memset(props, 0, sizeof(*props)); props->lid = lid ? lid : __constant_be16_to_cpu(IB_LID_PERMISSIVE); - props->lmc = dev->mkeyprot_resv_lmc & 7; + props->lmc = dd->ipath_lmc; props->sm_lid = dev->sm_lid; props->sm_sl = dev->sm_sl; ibcstat = dd->ipath_lastibcstat; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 9be9bf9..6ccb54f 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -501,7 +501,7 @@ struct ipath_ibdev { int ib_unit; /* This is the device number */ u16 sm_lid; /* in host order */ u8 sm_sl; - u8 mkeyprot_resv_lmc; + u8 mkeyprot; /* non-zero when timer is set */ unsigned long mkey_lease_timeout; From arthur.jones at qlogic.com Tue Oct 9 13:00:32 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:32 -0700 Subject: [ofa-general] [PATCH 15/23] IB/ipath - use counters in ipath_poll and cleanup interrupts in ipath_close In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: 
<20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200032.7151.27634.stgit@eng-46.internal.keyresearch.com> ipath_poll() suffered from a couple subtle bugs. Under the right conditions we could leave recv interrupts enabled on an ipath user context on close, thereby taking potentially unwanted interrupts on the next open -- this is fixed by unconditionally turning off recv interrupts on close. Also, we now use counters rather than set/clear bits which allows us to make sure we catch all interrupts at the cost of changing the semantics slightly (it's now give me all events since the last time I called poll() rather than give me all events since I called _this_ poll routine). We also added some memory barriers which may help ensure we get all notifications in a timely manner. Signed-off-by: Arthur Jones --- drivers/infiniband/hw/ipath/ipath_file_ops.c | 67 ++++++++++++++++---------- drivers/infiniband/hw/ipath/ipath_intr.c | 33 ++++--------- drivers/infiniband/hw/ipath/ipath_kernel.h | 8 ++- 3 files changed, 57 insertions(+), 51 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c index 33ab0d6..016e7c4 100644 --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c @@ -1341,6 +1341,19 @@ bail: return ret; } +static unsigned ipath_poll_hdrqfull(struct ipath_portdata *pd) +{ + unsigned pollflag = 0; + + if ((pd->poll_type & IPATH_POLL_TYPE_OVERFLOW) && + pd->port_hdrqfull != pd->port_hdrqfull_poll) { + pollflag |= POLLIN | POLLRDNORM; + pd->port_hdrqfull_poll = pd->port_hdrqfull; + } + + return pollflag; +} + static unsigned int ipath_poll_urgent(struct ipath_portdata *pd, struct file *fp, struct poll_table_struct *pt) @@ -1350,22 +1363,20 @@ static unsigned int ipath_poll_urgent(struct ipath_portdata *pd, dd = pd->port_dd; - if (test_bit(IPATH_PORT_WAITING_OVERFLOW, &pd->int_flag)) { - pollflag |= POLLERR; - 
clear_bit(IPATH_PORT_WAITING_OVERFLOW, &pd->int_flag); - } + /* variable access in ipath_poll_hdrqfull() needs this */ + rmb(); + pollflag = ipath_poll_hdrqfull(pd); - if (test_bit(IPATH_PORT_WAITING_URG, &pd->int_flag)) { + if (pd->port_urgent != pd->port_urgent_poll) { pollflag |= POLLIN | POLLRDNORM; - clear_bit(IPATH_PORT_WAITING_URG, &pd->int_flag); + pd->port_urgent_poll = pd->port_urgent; } if (!pollflag) { + /* this saves a spin_lock/unlock in interrupt handler... */ set_bit(IPATH_PORT_WAITING_URG, &pd->port_flag); - if (pd->poll_type & IPATH_POLL_TYPE_OVERFLOW) - set_bit(IPATH_PORT_WAITING_OVERFLOW, - &pd->port_flag); - + /* flush waiting flag so don't miss an event... */ + wmb(); poll_wait(fp, &pd->port_wait, pt); } @@ -1376,31 +1387,27 @@ static unsigned int ipath_poll_next(struct ipath_portdata *pd, struct file *fp, struct poll_table_struct *pt) { - u32 head, tail; + u32 head; + u32 tail; unsigned pollflag = 0; struct ipath_devdata *dd; dd = pd->port_dd; + /* variable access in ipath_poll_hdrqfull() needs this */ + rmb(); + pollflag = ipath_poll_hdrqfull(pd); + head = ipath_read_ureg32(dd, ur_rcvhdrhead, pd->port_port); tail = *(volatile u64 *)pd->port_rcvhdrtail_kvaddr; - if (test_bit(IPATH_PORT_WAITING_OVERFLOW, &pd->int_flag)) { - pollflag |= POLLERR; - clear_bit(IPATH_PORT_WAITING_OVERFLOW, &pd->int_flag); - } - - if (tail != head || - test_bit(IPATH_PORT_WAITING_RCV, &pd->int_flag)) { + if (head != tail) pollflag |= POLLIN | POLLRDNORM; - clear_bit(IPATH_PORT_WAITING_RCV, &pd->int_flag); - } - - if (!pollflag) { + else { + /* this saves a spin_lock/unlock in interrupt handler */ set_bit(IPATH_PORT_WAITING_RCV, &pd->port_flag); - if (pd->poll_type & IPATH_POLL_TYPE_OVERFLOW) - set_bit(IPATH_PORT_WAITING_OVERFLOW, - &pd->port_flag); + /* flush waiting flag so we don't miss an event */ + wmb(); set_bit(pd->port_port + INFINIPATH_R_INTRAVAIL_SHIFT, &dd->ipath_rcvctrl); @@ -1917,6 +1924,12 @@ static int ipath_do_user_init(struct file *fp, 
ipath_cdbg(VERBOSE, "Wrote port%d egrhead %x from tail regs\n", pd->port_port, head32); pd->port_tidcursor = 0; /* start at beginning after open */ + + /* initialize poll variables... */ + pd->port_urgent = 0; + pd->port_urgent_poll = 0; + pd->port_hdrqfull_poll = pd->port_hdrqfull; + /* * now enable the port; the tail registers will be written to memory * by the chip as soon as it sees the write to @@ -2039,9 +2052,11 @@ static int ipath_close(struct inode *in, struct file *fp) if (dd->ipath_kregbase) { int i; - /* atomically clear receive enable port. */ + /* atomically clear receive enable port and intr avail. */ clear_bit(INFINIPATH_R_PORTENABLE_SHIFT + port, &dd->ipath_rcvctrl); + clear_bit(pd->port_port + INFINIPATH_R_INTRAVAIL_SHIFT, + &dd->ipath_rcvctrl); ipath_write_kreg( dd, dd->ipath_kregs->kr_rcvctrl, dd->ipath_rcvctrl); /* and read back from chip to be sure that nothing diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 11b3614..61eac8c 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -688,17 +688,9 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) chkerrpkts = 1; dd->ipath_lastrcvhdrqtails[i] = tl; pd->port_hdrqfull++; - if (test_bit(IPATH_PORT_WAITING_OVERFLOW, - &pd->port_flag)) { - clear_bit( - IPATH_PORT_WAITING_OVERFLOW, - &pd->port_flag); - set_bit( - IPATH_PORT_WAITING_OVERFLOW, - &pd->int_flag); - wake_up_interruptible( - &pd->port_wait); - } + /* flush hdrqfull so that poll() sees it */ + wmb(); + wake_up_interruptible(&pd->port_wait); } } } @@ -960,6 +952,8 @@ static void handle_urcv(struct ipath_devdata *dd, u32 istat) int i; int rcvdint = 0; + /* test_bit below needs this... 
*/ + rmb(); portr = ((istat >> INFINIPATH_I_RCVAVAIL_SHIFT) & dd->ipath_i_rcvavail_mask) | ((istat >> INFINIPATH_I_RCVURG_SHIFT) & @@ -967,22 +961,15 @@ static void handle_urcv(struct ipath_devdata *dd, u32 istat) for (i = 1; i < dd->ipath_cfgports; i++) { struct ipath_portdata *pd = dd->ipath_pd[i]; if (portr & (1 << i) && pd && pd->port_cnt) { - if (test_bit(IPATH_PORT_WAITING_RCV, - &pd->port_flag)) { - clear_bit(IPATH_PORT_WAITING_RCV, - &pd->port_flag); - set_bit(IPATH_PORT_WAITING_RCV, - &pd->int_flag); + if (test_and_clear_bit(IPATH_PORT_WAITING_RCV, + &pd->port_flag)) { clear_bit(i + INFINIPATH_R_INTRAVAIL_SHIFT, &dd->ipath_rcvctrl); wake_up_interruptible(&pd->port_wait); rcvdint = 1; - } else if (test_bit(IPATH_PORT_WAITING_URG, - &pd->port_flag)) { - clear_bit(IPATH_PORT_WAITING_URG, - &pd->port_flag); - set_bit(IPATH_PORT_WAITING_URG, - &pd->int_flag); + } else if (test_and_clear_bit(IPATH_PORT_WAITING_URG, + &pd->port_flag)) { + pd->port_urgent++; wake_up_interruptible(&pd->port_wait); } } diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index d983f92..872fb36 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -139,6 +139,12 @@ struct ipath_portdata { u32 port_pionowait; /* total number of rcvhdrqfull errors */ u32 port_hdrqfull; + /* saved total number of rcvhdrqfull errors for poll edge trigger */ + u32 port_hdrqfull_poll; + /* total number of polled urgent packets */ + u32 port_urgent; + /* saved total number of polled urgent packets for poll edge trigger */ + u32 port_urgent_poll; /* pid of process using this port */ pid_t port_pid; /* same size as task_struct .comm[] */ @@ -757,8 +763,6 @@ int ipath_set_rx_pol_inv(struct ipath_devdata *dd, u8 new_pol_inv); #define IPATH_PORT_MASTER_UNINIT 4 /* waiting for an urgent packet to arrive */ #define IPATH_PORT_WAITING_URG 5 - /* waiting for a header overflow */ -#define IPATH_PORT_WAITING_OVERFLOW 
6 /* free up any allocated data at closes */ void ipath_free_data(struct ipath_portdata *dd); From arthur.jones at qlogic.com Tue Oct 9 13:00:37 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:37 -0700 Subject: [ofa-general] [PATCH 16/23] IB/ipath - iba6110 rev4 no longer needs recv header overrun workaround In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200037.7151.87203.stgit@eng-46.internal.keyresearch.com> iba6110 rev3 and earlier had a chip bug where the chip could overrun the recv header queue. rev4 fixed this chip bug, so userspace no longer needs to work around it. Now we only set the workaround flag for older chip versions. Signed-off-by: Arthur Jones --- drivers/infiniband/hw/ipath/ipath_iba6110.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c index e1c5998..d4940be 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c @@ -1599,8 +1599,10 @@ static int ipath_ht_get_base_info(struct ipath_portdata *pd, void *kbase) { struct ipath_base_info *kinfo = kbase; - kinfo->spi_runtime_flags |= IPATH_RUNTIME_HT | - IPATH_RUNTIME_RCVHDR_COPY; + kinfo->spi_runtime_flags |= IPATH_RUNTIME_HT; + + if (pd->port_dd->ipath_minrev < 4) + kinfo->spi_runtime_flags |= IPATH_RUNTIME_RCVHDR_COPY; return 0; } From arthur.jones at qlogic.com Tue Oct 9 13:00:42 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:42 -0700 Subject: [ofa-general] [PATCH 17/23] IB/ipath - indicate to userspace a couple chip bugs In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: 
<20071009200042.7151.39685.stgit@eng-46.internal.keyresearch.com> A couple chip bugs in the iba6110 and in the iba6120 are not in more recent chips. This first bug swaps two of the pioavail register locations. In the second bug, the chip can sometimes forget to dma the pio avail register to memory. We indicate the presence of these bugs with runtime flags and we indicate the presence of the flags by bumping the SWMINOR. Signed-off-by: Arthur Jones --- drivers/infiniband/hw/ipath/ipath_common.h | 4 +++- drivers/infiniband/hw/ipath/ipath_iba6110.c | 3 ++- drivers/infiniband/hw/ipath/ipath_iba6120.c | 3 ++- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_common.h b/drivers/infiniband/hw/ipath/ipath_common.h index 6ad822c..851df8a 100644 --- a/drivers/infiniband/hw/ipath/ipath_common.h +++ b/drivers/infiniband/hw/ipath/ipath_common.h @@ -189,6 +189,8 @@ typedef enum _ipath_ureg { #define IPATH_RUNTIME_RCVHDR_COPY 0x8 #define IPATH_RUNTIME_MASTER 0x10 /* 0x20 and 0x40 are no longer used, but are reserved for ABI compatibility */ +#define IPATH_RUNTIME_FORCE_PIOAVAIL 0x400 +#define IPATH_RUNTIME_PIO_REGSWAPPED 0x800 /* * This structure is returned by ipath_userinit() immediately after @@ -350,7 +352,7 @@ struct ipath_base_info { * may not be implemented; the user code must deal with this if it * cares, or it must abort after initialization reports the difference. 
*/ -#define IPATH_USER_SWMINOR 5 +#define IPATH_USER_SWMINOR 6 #define IPATH_USER_SWVERSION ((IPATH_USER_SWMAJOR<<16) | IPATH_USER_SWMINOR) diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c index d4940be..df42a1e 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c @@ -1599,7 +1599,8 @@ static int ipath_ht_get_base_info(struct ipath_portdata *pd, void *kbase) { struct ipath_base_info *kinfo = kbase; - kinfo->spi_runtime_flags |= IPATH_RUNTIME_HT; + kinfo->spi_runtime_flags |= IPATH_RUNTIME_HT | + IPATH_RUNTIME_PIO_REGSWAPPED; if (pd->port_dd->ipath_minrev < 4) kinfo->spi_runtime_flags |= IPATH_RUNTIME_RCVHDR_COPY; diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index d43f0b3..0103d6f 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -1348,7 +1348,8 @@ static int ipath_pe_get_base_info(struct ipath_portdata *pd, void *kbase) dd = pd->port_dd; done: - kinfo->spi_runtime_flags |= IPATH_RUNTIME_PCIE; + kinfo->spi_runtime_flags |= IPATH_RUNTIME_PCIE | + IPATH_RUNTIME_FORCE_PIOAVAIL | IPATH_RUNTIME_PIO_REGSWAPPED; return 0; } From arthur.jones at qlogic.com Tue Oct 9 13:00:47 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:47 -0700 Subject: [ofa-general] [PATCH 18/23] IB/ipath - fix QHT7040 serial number check In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200047.7151.6884.stgit@eng-46.internal.keyresearch.com> From: Dave Olson Removed all the OEM and bringup boards, and complain and fail initialization if one is found. QHT7040 with GPIO rework (128ywwuuuu) is OK; the older 112ywwuuuu is no longer supported. The check that had been added was failing both the 112 and 128 series. 
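The serial number rule described above (the 128 series is accepted, the 112 series is rejected) can be illustrated outside the driver. This is only a sketch: `serial_is_gpio_rework()` is a hypothetical helper that does not exist in ipath, and it assumes the serial number is available as a NUL-terminated string.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the ipath driver.
 * Returns true when the serial number starts with "128", the range
 * the commit message describes as the GPIO-reworked QHT7040.
 * Anything else (including the older "112" range) returns false.
 */
static bool serial_is_gpio_rework(const char *serial)
{
	return strncmp(serial, "128", 3) == 0;
}
```

A caller would enable GPIO interrupts only when this returns true and fail initialization otherwise, which is the behaviour the patch describes.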
Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_iba6110.c | 44 +++++++++------------------ 1 files changed, 15 insertions(+), 29 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c index df42a1e..ddbebe4 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c @@ -631,56 +631,35 @@ static int ipath_ht_boardname(struct ipath_devdata *dd, char *name, { char *n = NULL; u8 boardrev = dd->ipath_boardrev; - int ret; + int ret = 0; switch (boardrev) { - case 4: /* Ponderosa is one of the bringup boards */ - n = "Ponderosa"; - break; case 5: /* * original production board; two production levels, with * different serial number ranges. See ipath_ht_early_init() for * case where we enable IPATH_GPIO_INTR for later serial # range. + * Original 112* serial number is no longer supported. */ n = "InfiniPath_QHT7040"; break; - case 6: - n = "OEM_Board_3"; - break; case 7: /* small form factor production board */ n = "InfiniPath_QHT7140"; break; - case 8: - n = "LS/X-1"; - break; - case 9: /* Comstock bringup test board */ - n = "Comstock"; - break; - case 10: - n = "OEM_Board_2"; - break; - case 11: - n = "InfiniPath_HT-470"; /* obsoleted */ - break; - case 12: - n = "OEM_Board_4"; - break; default: /* don't know, just print the number */ ipath_dev_err(dd, "Don't yet know about board " "with ID %u\n", boardrev); snprintf(name, namelen, "Unknown_InfiniPath_QHT7xxx_%u", boardrev); + ret = 1; break; } if (n) snprintf(name, namelen, "%s", n); - if (dd->ipath_boardrev != 6 && dd->ipath_boardrev != 7 && - dd->ipath_boardrev != 11) { + if (ret) { ipath_dev_err(dd, "Unsupported InfiniPath board %s!\n", name); - ret = 1; goto bail; } if (dd->ipath_majrev != 3 || (dd->ipath_minrev < 2 || @@ -1554,10 +1533,17 @@ static int ipath_ht_early_init(struct ipath_devdata *dd) * can use GPIO interrupts. They have serial #'s starting * with 128, rather than 112. 
*/ - dd->ipath_flags |= IPATH_GPIO_INTR; - } else - ipath_dev_err(dd, "Unsupported InfiniPath serial " - "number %.16s!\n", dd->ipath_serial); + if (dd->ipath_serial[0] == '1' && + dd->ipath_serial[1] == '2' && + dd->ipath_serial[2] == '8') + dd->ipath_flags |= IPATH_GPIO_INTR; + else { + ipath_dev_err(dd, "Unsupported InfiniPath board " + "(serial number %.16s)!\n", + dd->ipath_serial); + return 1; + } + } if (dd->ipath_minrev >= 4) { /* Rev4+ reports extra errors via internal GPIO pins */ From arthur.jones at qlogic.com Tue Oct 9 13:00:52 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:52 -0700 Subject: [ofa-general] [PATCH 19/23] IB/ipath - Maintain active time on all chips In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200052.7151.87971.stgit@eng-46.internal.keyresearch.com> From: Michael Albaugh There is a count of "active hours" maintained in EEPROM, to aid troubleshooting. The definition of "active" is based on traffic exceeding a threshold in any given 5-second polling interval. As originally written, the check was inadvertently bypassed for chips whose counters were 64-bits wide, and only applied to chips with 32-bit wide counters. This patch moves the test for amount of traffic "out" to a more common location, rather than depending on a side-effect of the software emulation of 64-bit counts on chips whose hardware is only 32-bits wide. 
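The accounting described above can be sketched in isolation, roughly as follows. The names `active_tracker` and `poll_traffic()` and the value of `ACTIVE_THRESHOLD` are made up for illustration (the driver's real threshold is `IPATH_TRAFFIC_ACTIVE_THRESHOLD`, whose value is not shown in this thread). The point is that the delta is computed from cumulative 64-bit totals at poll time, so it does not matter whether the hardware counters are 32 or 64 bits wide.

```c
#include <stdint.h>

#define POLL_INTERVAL_SECS 5      /* polling interval named in the commit message */
#define ACTIVE_THRESHOLD   1000u  /* hypothetical value, for illustration only */

/* Hypothetical state, standing in for the relevant ipath_devdata fields. */
struct active_tracker {
	uint64_t last_total;   /* send + receive word total seen at the previous poll */
	uint64_t active_secs;  /* accumulated "active" time in seconds */
};

/*
 * Called once per polling interval with the cumulative send and
 * receive word counts. Credits the interval as active when the
 * traffic since the previous poll reaches the threshold.
 */
static void poll_traffic(struct active_tracker *t,
			 uint64_t send_words, uint64_t recv_words)
{
	uint64_t total = send_words + recv_words;
	uint64_t delta = total - t->last_total;

	t->last_total = total;
	if (delta >= ACTIVE_THRESHOLD)
		t->active_secs += POLL_INTERVAL_SECS;
}
```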
Signed-off-by: Michael Albaugh --- drivers/infiniband/hw/ipath/ipath_stats.c | 17 ++++++----------- 1 files changed, 6 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_stats.c b/drivers/infiniband/hw/ipath/ipath_stats.c index bae4f56..f027141 100644 --- a/drivers/infiniband/hw/ipath/ipath_stats.c +++ b/drivers/infiniband/hw/ipath/ipath_stats.c @@ -55,7 +55,6 @@ u64 ipath_snap_cntr(struct ipath_devdata *dd, ipath_creg creg) u64 val64; unsigned long t0, t1; u64 ret; - unsigned long flags; t0 = jiffies; /* If fast increment counters are only 32 bits, snapshot them, @@ -92,18 +91,12 @@ u64 ipath_snap_cntr(struct ipath_devdata *dd, ipath_creg creg) if (creg == dd->ipath_cregs->cr_wordsendcnt) { if (val != dd->ipath_lastsword) { dd->ipath_sword += val - dd->ipath_lastsword; - spin_lock_irqsave(&dd->ipath_eep_st_lock, flags); - dd->ipath_traffic_wds += val - dd->ipath_lastsword; - spin_unlock_irqrestore(&dd->ipath_eep_st_lock, flags); dd->ipath_lastsword = val; } val64 = dd->ipath_sword; } else if (creg == dd->ipath_cregs->cr_wordrcvcnt) { if (val != dd->ipath_lastrword) { dd->ipath_rword += val - dd->ipath_lastrword; - spin_lock_irqsave(&dd->ipath_eep_st_lock, flags); - dd->ipath_traffic_wds += val - dd->ipath_lastrword; - spin_unlock_irqrestore(&dd->ipath_eep_st_lock, flags); dd->ipath_lastrword = val; } val64 = dd->ipath_rword; @@ -247,6 +240,7 @@ void ipath_get_faststats(unsigned long opaque) u32 val; static unsigned cnt; unsigned long flags; + u64 traffic_wds; /* * don't access the chip while running diags, or memory diags can @@ -262,12 +256,13 @@ void ipath_get_faststats(unsigned long opaque) * exceeding a threshold, so we need to check the word-counts * even if they are 64-bit. 
*/ - ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordsendcnt); - ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordrcvcnt); + traffic_wds = ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordsendcnt) + + ipath_snap_cntr(dd, dd->ipath_cregs->cr_wordrcvcnt); spin_lock_irqsave(&dd->ipath_eep_st_lock, flags); - if (dd->ipath_traffic_wds >= IPATH_TRAFFIC_ACTIVE_THRESHOLD) + traffic_wds -= dd->ipath_traffic_wds; + dd->ipath_traffic_wds += traffic_wds; + if (traffic_wds >= IPATH_TRAFFIC_ACTIVE_THRESHOLD) atomic_add(5, &dd->ipath_active_time); /* S/B #define */ - dd->ipath_traffic_wds = 0; spin_unlock_irqrestore(&dd->ipath_eep_st_lock, flags); if (dd->ipath_flags & IPATH_32BITCOUNTERS) { From arthur.jones at qlogic.com Tue Oct 9 13:00:58 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:00:58 -0700 Subject: [ofa-general] [PATCH 20/23] IB/ipath - better handling of unexpected GPIO interrupts In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200057.7151.52814.stgit@eng-46.internal.keyresearch.com> From: Michael Albaugh The General Purpose I/O pins can be configured to cause interrupts. At the end of the interrupt code dealing with all known causes, a message is output if any bits remain un-handled. Since this is a "can't happen" scenario, it should only be triggered by bugs elsewhere. It is harmless, and potentially beneficial, to limit the damage by masking any such unexpected interrupts. This patch adds disabling of interrupts from any pins that should not have been allowed to interrupt, in addition to emitting a message. 
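As a rough illustration of the masking step, the following sketch shows the bit manipulation on plain integers. `quarantine_unexpected()` is a hypothetical name, and the sketch assumes the status word passed in contains only the bits left over after all known causes were handled; the real driver also writes the new mask back to the chip, which is omitted here.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not driver code.
 * 'leftover' holds the GPIO status bits that no handler claimed.
 * Any of those bits that are currently enabled in *gpio_mask are
 * removed from the mask so they cannot interrupt again, and are
 * returned so the caller can also clear them in the status register.
 */
static uint32_t quarantine_unexpected(uint32_t leftover, uint32_t *gpio_mask)
{
	uint32_t unexpected = leftover & *gpio_mask;

	*gpio_mask &= ~unexpected;
	return unexpected;
}
```

After the first call the offending bits are no longer in the mask, so a repeat of the same status produces nothing to report.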
Signed-off-by: Michael Albaugh --- drivers/infiniband/hw/ipath/ipath_intr.c | 10 ++++++---- 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 61eac8c..801a20d 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -1124,10 +1124,8 @@ irqreturn_t ipath_intr(int irq, void *data) /* * Some unexpected bits remain. If they could have * caused the interrupt, complain and clear. - * MEA: this is almost certainly non-ideal. - * we should look into auto-disable of unexpected - * GPIO interrupts, possibly on a "three strikes" - * basis. + * To avoid repetition of this condition, also clear + * the mask. It is almost certainly due to error. */ const u32 mask = (u32) dd->ipath_gpio_mask; @@ -1135,6 +1133,10 @@ irqreturn_t ipath_intr(int irq, void *data) ipath_dbg("Unexpected GPIO IRQ bits %x\n", gpiostatus & mask); to_clear |= (gpiostatus & mask); + dd->ipath_gpio_mask &= ~(gpiostatus & mask); + ipath_write_kreg(dd, + dd->ipath_kregs->kr_gpio_mask, + dd->ipath_gpio_mask); } } if (to_clear) { From arthur.jones at qlogic.com Tue Oct 9 13:01:03 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:01:03 -0700 Subject: [ofa-general] [PATCH 21/23] IB/ipath - fix IB_EVENT_PORT_ERR event In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200103.7151.15096.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell The link state event calls were being generated when the SM told the SMA to change link states. This works for IB_EVENT_PORT_ACTIVE but not if the link goes down and stays down. The fix is to generate event calls from the interrupt handler when the HW link state changes. 
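The idea of deriving events from the observed hardware transition, rather than from the SM's request, can be sketched as a small pure function. The enums and `event_for_transition()` below are hypothetical stand-ins for the driver's flag bits and for IB_EVENT_PORT_ACTIVE / IB_EVENT_PORT_ERR. As a simplification that is an assumption of this sketch and not something the patch states, an active-to-active transition reports no event.

```c
/* Hypothetical link states, standing in for the IPATH_LINK* flags. */
enum link_state { LINK_DOWN, LINK_INIT, LINK_ARMED, LINK_ACTIVE };

/* Hypothetical event codes, standing in for the IB event types. */
enum link_event { EV_NONE, EV_PORT_ACTIVE, EV_PORT_ERR };

/*
 * Decide which event to report when the hardware link state changes.
 * Leaving the active state for any other state reports a port error;
 * entering the active state reports port active.
 */
static enum link_event event_for_transition(enum link_state old_state,
					    enum link_state new_state)
{
	if (new_state == LINK_ACTIVE)
		return old_state == LINK_ACTIVE ? EV_NONE : EV_PORT_ACTIVE;
	if (old_state == LINK_ACTIVE)
		return EV_PORT_ERR;
	return EV_NONE;
}
```

Because the decision depends only on the old and new hardware states, a link that goes down and stays down still produces the error event, which is the case the commit message says the SM-driven approach missed.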
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_driver.c | 2 ++ drivers/infiniband/hw/ipath/ipath_intr.c | 17 +++++++++++++++++ drivers/infiniband/hw/ipath/ipath_kernel.h | 2 ++ drivers/infiniband/hw/ipath/ipath_mad.c | 10 ---------- drivers/infiniband/hw/ipath/ipath_verbs.c | 12 ++++++++++-- 5 files changed, 31 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index e5d058a..799fac2 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -2085,6 +2085,8 @@ void ipath_shutdown_device(struct ipath_devdata *dd) INFINIPATH_IBCC_LINKINITCMD_SHIFT); ipath_cancel_sends(dd, 0); + signal_ib_event(dd, IB_EVENT_PORT_ERR); + /* disable IBC */ dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; ipath_write_kreg(dd, dd->ipath_kregs->kr_control, diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 801a20d..6a5dd5c 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -275,6 +275,16 @@ static char *ib_linkstate(u32 linkstate) return ret; } +void signal_ib_event(struct ipath_devdata *dd, enum ib_event_type ev) +{ + struct ib_event event; + + event.device = &dd->verbs_dev->ibdev; + event.element.port_num = 1; + event.event = ev; + ib_dispatch_event(&event); +} + static void handle_e_ibstatuschanged(struct ipath_devdata *dd, ipath_err_t errs, int noprint) { @@ -373,6 +383,8 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, dd->ipath_ibpollcnt = 0; /* some state other than 2 or 3 */ ipath_stats.sps_iblink++; if (ltstate != INFINIPATH_IBCS_LT_STATE_LINKUP) { + if (dd->ipath_flags & IPATH_LINKACTIVE) + signal_ib_event(dd, IB_EVENT_PORT_ERR); dd->ipath_flags |= IPATH_LINKDOWN; dd->ipath_flags &= ~(IPATH_LINKUNK | IPATH_LINKINIT | IPATH_LINKACTIVE | @@ -405,7 +417,10 @@ static void handle_e_ibstatuschanged(struct ipath_devdata 
*dd, *dd->ipath_statusp |= IPATH_STATUS_IB_READY | IPATH_STATUS_IB_CONF; dd->ipath_f_setextled(dd, lstate, ltstate); + signal_ib_event(dd, IB_EVENT_PORT_ACTIVE); } else if ((val & IPATH_IBSTATE_MASK) == IPATH_IBSTATE_INIT) { + if (dd->ipath_flags & IPATH_LINKACTIVE) + signal_ib_event(dd, IB_EVENT_PORT_ERR); /* * set INIT and DOWN. Down is checked by most of the other * code, but INIT is useful to know in a few places. @@ -418,6 +433,8 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, | IPATH_STATUS_IB_READY); dd->ipath_f_setextled(dd, lstate, ltstate); } else if ((val & IPATH_IBSTATE_MASK) == IPATH_IBSTATE_ARM) { + if (dd->ipath_flags & IPATH_LINKACTIVE) + signal_ib_event(dd, IB_EVENT_PORT_ERR); dd->ipath_flags |= IPATH_LINKARMED; dd->ipath_flags &= ~(IPATH_LINKUNK | IPATH_LINKDOWN | IPATH_LINKINIT | diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 872fb36..8786dd7 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -42,6 +42,7 @@ #include #include #include +#include #include "ipath_common.h" #include "ipath_debug.h" @@ -775,6 +776,7 @@ void ipath_get_eeprom_info(struct ipath_devdata *); int ipath_update_eeprom_log(struct ipath_devdata *dd); void ipath_inc_eeprom_err(struct ipath_devdata *dd, u32 eidx, u32 incr); u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg); +void signal_ib_event(struct ipath_devdata *dd, enum ib_event_type ev); /* * Set LED override, only the two LSBs have "public" meaning, but diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index 8f15216..0ae3a7c 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -570,26 +570,16 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, else goto err; ipath_set_linkstate(dd, lstate); - if (flags & IPATH_LINKACTIVE) { - event.event = IB_EVENT_PORT_ERR; - ib_dispatch_event(&event); 
- } break; case IB_PORT_ARMED: if (!(flags & (IPATH_LINKINIT | IPATH_LINKACTIVE))) break; ipath_set_linkstate(dd, IPATH_IB_LINKARM); - if (flags & IPATH_LINKACTIVE) { - event.event = IB_EVENT_PORT_ERR; - ib_dispatch_event(&event); - } break; case IB_PORT_ACTIVE: if (!(flags & IPATH_LINKARMED)) break; ipath_set_linkstate(dd, IPATH_IB_LINKACTIVE); - event.event = IB_EVENT_PORT_ACTIVE; - ib_dispatch_event(&event); break; default: /* XXX We have already partially updated our state! */ diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 13aba3d..74f77e7 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -948,6 +948,7 @@ bail: int ipath_verbs_send(struct ipath_qp *qp, struct ipath_ib_header *hdr, u32 hdrwords, struct ipath_sge_state *ss, u32 len) { + struct ipath_devdata *dd = to_idev(qp->ibqp.device)->dd; u32 plen; int ret; u32 dwords = (len + 3) >> 2; @@ -955,8 +956,15 @@ int ipath_verbs_send(struct ipath_qp *qp, struct ipath_ib_header *hdr, /* +1 is for the qword padding of pbc */ plen = hdrwords + dwords + 1; - ret = ipath_verbs_send_pio(qp, (u32 *) hdr, hdrwords, - ss, len, plen, dwords); + /* Drop non-VL15 packets if we are not in the active state */ + if (!(dd->ipath_flags & IPATH_LINKACTIVE) && + qp->ibqp.qp_type != IB_QPT_SMI) { + if (qp->s_wqe) + ipath_send_complete(qp, qp->s_wqe, IB_WC_SUCCESS); + ret = 0; + } else + ret = ipath_verbs_send_pio(qp, (u32 *) hdr, hdrwords, + ss, len, plen, dwords); return ret; } From arthur.jones at qlogic.com Tue Oct 9 13:01:08 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:01:08 -0700 Subject: [ofa-general] [PATCH 22/23] IB/ipath - remove redundant link state checks In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: 
<20071009200108.7151.47111.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell This patch removes some redundant checks when the SMA changes the link state since the same checks are made in the lower level function that sets the state. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_mad.c | 6 ------ 1 files changed, 0 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index 0ae3a7c..3d1432d 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -402,7 +402,6 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, struct ib_event event; struct ipath_ibdev *dev; struct ipath_devdata *dd; - u32 flags; char clientrereg = 0; u16 lid, smlid; u8 lwe; @@ -541,7 +540,6 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, * is down or is being set to down. */ state = pip->linkspeed_portstate & 0xF; - flags = dd->ipath_flags; lstate = (pip->portphysstate_linkdown >> 4) & 0xF; if (lstate && !(state == IB_PORT_DOWN || state == IB_PORT_NOP)) goto err; @@ -572,13 +570,9 @@ static int recv_subn_set_portinfo(struct ib_smp *smp, ipath_set_linkstate(dd, lstate); break; case IB_PORT_ARMED: - if (!(flags & (IPATH_LINKINIT | IPATH_LINKACTIVE))) - break; ipath_set_linkstate(dd, IPATH_IB_LINKARM); break; case IB_PORT_ACTIVE: - if (!(flags & IPATH_LINKARMED)) - break; ipath_set_linkstate(dd, IPATH_IB_LINKACTIVE); break; default: From arthur.jones at qlogic.com Tue Oct 9 13:01:13 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 09 Oct 2007 13:01:13 -0700 Subject: [ofa-general] [PATCH 23/23] IB/ipath -- Minor fix to ordering of freeing and zeroing of tid pages. 
In-Reply-To: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071009200113.7151.55925.stgit@eng-46.internal.keyresearch.com> From: Dave Olson Fixed to be the same as everywhere else. copy and then zero the page * in the array first, and then pass the copy to the VM routines. Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_file_ops.c | 7 ++++--- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c index 016e7c4..5de3243 100644 --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c @@ -538,6 +538,9 @@ static int ipath_tid_free(struct ipath_portdata *pd, unsigned subport, continue; cnt++; if (dd->ipath_pageshadow[porttid + tid]) { + struct page *p; + p = dd->ipath_pageshadow[porttid + tid]; + dd->ipath_pageshadow[porttid + tid] = NULL; ipath_cdbg(VERBOSE, "PID %u freeing TID %u\n", pd->port_pid, tid); dd->ipath_f_put_tid(dd, &tidbase[tid], @@ -546,9 +549,7 @@ static int ipath_tid_free(struct ipath_portdata *pd, unsigned subport, pci_unmap_page(dd->pcidev, dd->ipath_physshadow[porttid + tid], PAGE_SIZE, PCI_DMA_FROMDEVICE); - ipath_release_user_pages( - &dd->ipath_pageshadow[porttid + tid], 1); - dd->ipath_pageshadow[porttid + tid] = NULL; + ipath_release_user_pages(&p, 1); ipath_stats.sps_pageunlocks++; } else ipath_dbg("Unused tid %u, ignoring\n", tid); From jimmott at austin.rr.com Tue Oct 9 13:04:25 2007 From: jimmott at austin.rr.com (Jim Mott) Date: Tue, 9 Oct 2007 15:04:25 -0500 Subject: [ofa-general] SDP ? In-Reply-To: <470B9A84.9000008@sun.com> References: <470B9A84.9000008@sun.com> Message-ID: <00ac01c80aaf$9c98e700$d5cab500$@rr.com> That should work fine. You might be able to build with -D_XPG4_2 as well. 
-----Original Message-----
From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jim Langston
Sent: Tuesday, October 09, 2007 10:13 AM
To: general at lists.openfabrics.org
Subject: [ofa-general] SDP ?

Hi all,

I'm working on porting SDP to OpenSolaris and am looking at a compile
error that I get. Essentially, I have a conflict of types on the compile:

bash-3.00$ /opt/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.. -g -D_POSIX_PTHREAD_SEMANTICS -DSYSCONFDIR=\"/usr/local/etc\" -g -D_POSIX_PTHREAD_SEMANTICS -c port.c -KPIC -DPIC -o .libs/port.o
"port.c", line 1896: identifier redeclared: getsockname
        current : function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to unsigned int) returning int
        previous: function(int, pointer to struct sockaddr {unsigned short sa_family, array[14] of char sa_data}, pointer to void) returning int : "/usr/include/sys/socket.h", line 436

Line 436 in /usr/include/sys/socket.h

extern int getsockname(int, struct sockaddr *_RESTRICT_KYWD, Psocklen_t);

and Psocklen_t

#if defined(_XPG4_2) || defined(_BOOT)
typedef socklen_t *_RESTRICT_KYWD Psocklen_t;
#else
typedef void *_RESTRICT_KYWD Psocklen_t;
#endif /* defined(_XPG4_2) || defined(_BOOT) */

Do I need to change port.c getsockname to type void * ?

Thanks,

Jim

_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From rdreier at cisco.com Tue Oct 9 13:11:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 09 Oct 2007 13:11:07 -0700
Subject: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses
In-Reply-To: <000401c7feee$ea073180$ff0da8c0@amr.corp.intel.com> (Sean Hefty's message of "Mon, 24 Sep 2007 14:07:28 -0700")
References: <46F7FDE5.9070305@oracle.com> <000401c7feee$ea073180$ff0da8c0@amr.corp.intel.com>
Message-ID:

Did we ever get any confirmation that this fixed the problem that Olaf saw?

From davem at davemloft.net Tue Oct 9 13:14:38 2007
From: davem at davemloft.net (David Miller)
Date: Tue, 09 Oct 2007 13:14:38 -0700 (PDT)
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <470B77A9.600@garzik.org>
References: <20071009.042441.30182968.davem@davemloft.net> <470B77A9.600@garzik.org>
Message-ID: <20071009.131438.74562715.davem@davemloft.net>

From: Jeff Garzik
Date: Tue, 09 Oct 2007 08:44:25 -0400

> David Miller wrote:
> > From: Krishna Kumar2
> > Date: Tue, 9 Oct 2007 16:51:14 +0530
> >
> >> David Miller wrote on 10/09/2007 04:32:55 PM:
> >>
> >>> Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
> >>> it.
> >> Great, this will make life easy. Any idea how long that would take?
> >> It seems simple enough to do.
> >
> > I'd say we can probably try to get rid of it in 2.6.25, this is
> > assuming we get driver authors to cooperate and do the conversions
> > or alternatively some other motivated person.
> >
> > I can just threaten to do them all and that should get the driver
> > maintainers going :-)
>
> What, like this? :)

Thanks, but it's probably going to need some corrections and/or
an audit.
If you unconditionally take those locks in the transmit function, there is probably an ABBA deadlock elsewhere in the driver now, most likely in the TX reclaim processing, and you therefore need to handle that too. From rdreier at cisco.com Tue Oct 9 13:19:17 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 13:19:17 -0700 Subject: [ofa-general] [PATCH] fix some ehca limits In-Reply-To: <20071001153620.GA31830@kryten> (Anton Blanchard's message of "Mon, 1 Oct 2007 10:36:20 -0500") References: <20070930053726.GA28619@kryten> <20071001153620.GA31830@kryten> Message-ID: I didn't see a response to my earlier email about the other uses of min_t(int, x, INT_MAX) so I fixed it up myself and added this to my tree. I don't have a working setup to test yet so please let me know if you see anything wrong with this: commit 919225e60a1a73e3518f257f040f74e9379a61c3 Author: Roland Dreier Date: Tue Oct 9 13:17:42 2007 -0700 IB/ehca: Fix clipping of device limits to INT_MAX Doing min_t(int, foo, INT_MAX) doesn't work correctly, because if foo is bigger than INT_MAX, then when treated as a signed integer, it will become negative and hence such an expression is just an elaborate NOP. Fix such cases in ehca to do min_t(unsigned, foo, INT_MAX) instead. This fixes negative reported values for max_cqe, max_pd and max_ah: Before: max_cqe: -64 max_pd: -1 max_ah: -1 After: max_cqe: 2147483647 max_pd: 2147483647 max_ah: 2147483647 Based on a bug report and fix from Anton Blanchard . 
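
The clipping problem described in the commit message can be reproduced outside the kernel. The following is a minimal userspace sketch, not the ehca code: `min_t()` here is a simplified stand-in for the kernel macro, the helper names are hypothetical, and it assumes the firmware-reported limits are 32-bit unsigned values. The conversion of an out-of-range unsigned value to `int` is implementation-defined in ISO C; the sketch relies on the wrap-around behaviour of GCC and Clang.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's min_t(); the real macro also
 * guards against evaluating its arguments twice. */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

/* Pre-patch pattern: both operands are cast to int before comparing, so a
 * raw value above INT_MAX has already turned negative and always "wins". */
int clip_as_int(uint32_t raw)
{
	return min_t(int, raw, INT_MAX);
}

/* Post-patch pattern: compare as unsigned, then store into the int field. */
int clip_as_unsigned(uint32_t raw)
{
	return (int)min_t(unsigned, raw, INT_MAX);
}

/* Self-check using raw values consistent with the "Before"/"After"
 * numbers quoted in the commit message. */
void check_clipping(void)
{
	assert(clip_as_int(0xFFFFFFC0u) == -64);          /* max_cqe before */
	assert(clip_as_int(0xFFFFFFFFu) == -1);           /* max_pd before  */
	assert(clip_as_unsigned(0xFFFFFFC0u) == INT_MAX); /* max_cqe after  */
	assert(clip_as_unsigned(0xFFFFFFFFu) == INT_MAX); /* max_pd after   */
	assert(clip_as_unsigned(128u) == 128);            /* small values pass through */
}
```

Calling `check_clipping()` from a `main()` exercises both variants; small values are unaffected by either form, which is why the bug only shows up for limits above INT_MAX.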
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 3436c49..4aa3ffa 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -82,17 +82,17 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) props->vendor_id = rblock->vendor_id >> 8; props->vendor_part_id = rblock->vendor_part_id >> 16; props->hw_ver = rblock->hw_ver; - props->max_qp = min_t(int, rblock->max_qp, INT_MAX); - props->max_qp_wr = min_t(int, rblock->max_wqes_wq, INT_MAX); - props->max_sge = min_t(int, rblock->max_sge, INT_MAX); - props->max_sge_rd = min_t(int, rblock->max_sge_rd, INT_MAX); - props->max_cq = min_t(int, rblock->max_cq, INT_MAX); - props->max_cqe = min_t(int, rblock->max_cqe, INT_MAX); - props->max_mr = min_t(int, rblock->max_mr, INT_MAX); - props->max_mw = min_t(int, rblock->max_mw, INT_MAX); - props->max_pd = min_t(int, rblock->max_pd, INT_MAX); - props->max_ah = min_t(int, rblock->max_ah, INT_MAX); - props->max_fmr = min_t(int, rblock->max_mr, INT_MAX); + props->max_qp = min_t(unsigned, rblock->max_qp, INT_MAX); + props->max_qp_wr = min_t(unsigned, rblock->max_wqes_wq, INT_MAX); + props->max_sge = min_t(unsigned, rblock->max_sge, INT_MAX); + props->max_sge_rd = min_t(unsigned, rblock->max_sge_rd, INT_MAX); + props->max_cq = min_t(unsigned, rblock->max_cq, INT_MAX); + props->max_cqe = min_t(unsigned, rblock->max_cqe, INT_MAX); + props->max_mr = min_t(unsigned, rblock->max_mr, INT_MAX); + props->max_mw = min_t(unsigned, rblock->max_mw, INT_MAX); + props->max_pd = min_t(unsigned, rblock->max_pd, INT_MAX); + props->max_ah = min_t(unsigned, rblock->max_ah, INT_MAX); + props->max_fmr = min_t(unsigned, rblock->max_mr, INT_MAX); if (EHCA_BMASK_GET(HCA_CAP_SRQ, shca->hca_cap)) { props->max_srq = props->max_qp; @@ -104,15 +104,15 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) props->local_ca_ack_delay = 
rblock->local_ca_ack_delay; props->max_raw_ipv6_qp - = min_t(int, rblock->max_raw_ipv6_qp, INT_MAX); + = min_t(unsigned, rblock->max_raw_ipv6_qp, INT_MAX); props->max_raw_ethy_qp - = min_t(int, rblock->max_raw_ethy_qp, INT_MAX); + = min_t(unsigned, rblock->max_raw_ethy_qp, INT_MAX); props->max_mcast_grp - = min_t(int, rblock->max_mcast_grp, INT_MAX); + = min_t(unsigned, rblock->max_mcast_grp, INT_MAX); props->max_mcast_qp_attach - = min_t(int, rblock->max_mcast_qp_attach, INT_MAX); + = min_t(unsigned, rblock->max_mcast_qp_attach, INT_MAX); props->max_total_mcast_qp_attach - = min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX); + = min_t(unsigned, rblock->max_total_mcast_qp_attach, INT_MAX); /* translate device capabilities */ props->device_cap_flags = IB_DEVICE_SYS_IMAGE_GUID | From jeff at garzik.org Tue Oct 9 13:20:14 2007 From: jeff at garzik.org (Jeff Garzik) Date: Tue, 09 Oct 2007 16:20:14 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009.131438.74562715.davem@davemloft.net> References: <20071009.042441.30182968.davem@davemloft.net> <470B77A9.600@garzik.org> <20071009.131438.74562715.davem@davemloft.net> Message-ID: <470BE27E.4000101@garzik.org> David Miller wrote: > From: Jeff Garzik > Date: Tue, 09 Oct 2007 08:44:25 -0400 > >> David Miller wrote: >>> From: Krishna Kumar2 >>> Date: Tue, 9 Oct 2007 16:51:14 +0530 >>> >>>> David Miller wrote on 10/09/2007 04:32:55 PM: >>>> >>>>> Ignore LLTX, it sucks, it was a big mistake, and we will get rid of >>>>> it. >>>> Great, this will make life easy. Any idea how long that would take? >>>> It seems simple enough to do. >>> I'd say we can probably try to get rid of it in 2.6.25, this is >>> assuming we get driver authors to cooperate and do the conversions >>> or alternatively some other motivated person. >>> >>> I can just threaten to do them all and that should get the driver >>> maintainers going :-) >> What, like this? 
:) > > Thanks, but it's probably going to need some corrections and/or > an audit. I would be happy if someone wanted to audit that patch. > If you unconditionally take those locks in the transmit function, > there is probably an ABBA deadlock elsewhere in the driver now, most > likely in the TX reclaim processing, and you therefore need to handle > that too. And I most certainly checked the relevant transmit paths and other locking to make sure lock ordering was correct. Jeff From rdreier at cisco.com Tue Oct 9 13:22:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 13:22:44 -0700 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009.042441.30182968.davem@davemloft.net> (David Miller's message of "Tue, 09 Oct 2007 04:24:41 -0700 (PDT)") References: <20071009.040255.71088090.davem@davemloft.net> <20071009.042441.30182968.davem@davemloft.net> Message-ID: > I'd say we can probably try to get rid of it in 2.6.25, this is > assuming we get driver authors to cooperate and do the conversions > or alternatively some other motivated person. > > I can just threaten to do them all and that should get the driver > maintainers going :-) I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to my TODO list so I don't forget. In fact if 2.6.23 drags on long enough I may do it for 2.6.24.... From mshefty at ichips.intel.com Tue Oct 9 13:42:15 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Oct 2007 13:42:15 -0700 Subject: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses In-Reply-To: References: <46F7FDE5.9070305@oracle.com> <000401c7feee$ea073180$ff0da8c0@amr.corp.intel.com> Message-ID: <470BE7A7.3020406@ichips.intel.com> > Did we ever get any confirmation that this fixed the problem that Olaf saw? No. I haven't seen a response. 

From davem at davemloft.net Tue Oct 9 13:43:31 2007
From: davem at davemloft.net (David Miller)
Date: Tue, 09 Oct 2007 13:43:31 -0700 (PDT)
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To:
References: <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net>
Message-ID: <20071009.134331.35664207.davem@davemloft.net>

From: Andi Kleen
Date: 09 Oct 2007 18:51:51 +0200

> Hopefully that new qdisc will just use the TX rings of the hardware
> directly. They are typically large enough these days. That might avoid
> some locking in this critical path.

Indeed, I also realized last night that for the default qdiscs
we do a lot of stupid useless work. If the queue is a FIFO
and the device can take packets, we should send it directly
and not stick it into the qdisc at all.

> If the data is just passed on to the hardware queue, why is any
> locking needed at all? (except for the driver locking of course)

Absolutely.

Our packet scheduler subsystem is great, but by default it should just
get out of the way.

From davem at davemloft.net Tue Oct 9 13:51:45 2007
From: davem at davemloft.net (David Miller)
Date: Tue, 09 Oct 2007 13:51:45 -0700 (PDT)
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To:
References: <20071009.042441.30182968.davem@davemloft.net>
Message-ID: <20071009.135145.95506679.davem@davemloft.net>

From: Roland Dreier
Date: Tue, 09 Oct 2007 13:22:44 -0700

> I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to
> my TODO list so I don't forget.
>
> In fact if 2.6.23 drags on long enough I may do it for 2.6.24....

Before you add new entries to your list, how is that ibm driver NAPI
conversion coming along? :-)

Right now that's a more pressing task to complete.

From shemminger at linux-foundation.org Tue Oct 9 13:53:40 2007
From: shemminger at linux-foundation.org (Stephen Hemminger)
Date: Tue, 9 Oct 2007 13:53:40 -0700
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <20071009.134331.35664207.davem@davemloft.net>
References: <470AD5D7.1070000@garzik.org> <20071008.184126.124062865.davem@davemloft.net> <20071009.134331.35664207.davem@davemloft.net>
Message-ID: <20071009135340.33e5922c@freepuppy.rosehill>

On Tue, 09 Oct 2007 13:43:31 -0700 (PDT) David Miller wrote:

> From: Andi Kleen
> Date: 09 Oct 2007 18:51:51 +0200
>
> > Hopefully that new qdisc will just use the TX rings of the hardware
> > directly. They are typically large enough these days. That might avoid
> > some locking in this critical path.
>
> Indeed, I also realized last night that for the default qdiscs
> we do a lot of stupid useless work. If the queue is a FIFO
> and the device can take packets, we should send it directly
> and not stick it into the qdisc at all.
>
> > If the data is just passed on to the hardware queue, why is any
> > locking needed at all? (except for the driver locking of course)
>
> Absolutely.
>
> Our packet scheduler subsystem is great, but by default it should just
> get out of the way.

I was thinking why not have a default transmit queue len of 0 like
the virtual devices.

--
Stephen Hemminger

From davem at davemloft.net Tue Oct 9 14:22:35 2007
From: davem at davemloft.net (David Miller)
Date: Tue, 09 Oct 2007 14:22:35 -0700 (PDT)
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <20071009135340.33e5922c@freepuppy.rosehill>
References: <20071009.134331.35664207.davem@davemloft.net> <20071009135340.33e5922c@freepuppy.rosehill>
Message-ID: <20071009.142235.74385364.davem@davemloft.net>

From: Stephen Hemminger
Date: Tue, 9 Oct 2007 13:53:40 -0700

> I was thinking why not have a default transmit queue len of 0 like
> the virtual devices.

I'm not so sure.

Even if the device has "huge queues" I still think we need a software
queue for when the hardware one backs up.

It is even beneficial to stick with reasonably sized TX queues because
it keeps the total resident state accessed by the CPU within the
bounds of the L2 cache. If you go past that it actually hurts to make
the TX queue larger instead of helps even if it means you never hit
back pressure.

From davem at davemloft.net Tue Oct 9 14:25:04 2007
From: davem at davemloft.net (David Miller)
Date: Tue, 09 Oct 2007 14:25:04 -0700 (PDT)
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <470BE27E.4000101@garzik.org>
References: <470B77A9.600@garzik.org> <20071009.131438.74562715.davem@davemloft.net> <470BE27E.4000101@garzik.org>
Message-ID: <20071009.142504.116349168.davem@davemloft.net>

From: Jeff Garzik
Date: Tue, 09 Oct 2007 16:20:14 -0400

> David Miller wrote:
> > If you unconditionally take those locks in the transmit function,
> > there is probably an ABBA deadlock elsewhere in the driver now, most
> > likely in the TX reclaim processing, and you therefore need to handle
> > that too.
>
> And I most certainly checked the relevant transmit paths and other
> locking to make sure lock ordering was correct.

Awesome.

From rdreier at cisco.com Tue Oct 9 14:40:21 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 09 Oct 2007 14:40:21 -0700
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <20071009.135145.95506679.davem@davemloft.net> (David Miller's message of "Tue, 09 Oct 2007 13:51:45 -0700 (PDT)")
References: <20071009.042441.30182968.davem@davemloft.net> <20071009.135145.95506679.davem@davemloft.net>
Message-ID:

> Before you add new entries to your list, how is that ibm driver NAPI
> conversion coming along? :-)

I still haven't done much. OK, I will try to get my board booting
again this week.

From rdreier at cisco.com Tue Oct 9 14:52:41 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 09 Oct 2007 14:52:41 -0700
Subject: [ofa-general] Re: [PATCH v3 for 2.6.24] IB/ipoib: enable IGMP for userpsace multicast IB apps
In-Reply-To: (Or Gerlitz's message of "Mon, 8 Oct 2007 10:13:00 +0200 (IST)")
References:
Message-ID:

OK, I will merge this for 2.6.24. However, I still don't really
understand the changelog entry:

> The kernel IB stack allows (through the RDMA CM) user space multicast
> applications to interoperate with IP based apps optionally running at
> a different IP subnet.
>
> To support this inter-op for the case where the receiving party
> resides at the IB side, there is a need to handle IGMP
> (reports/queries) else the local IP router would not forward
> multicast traffic towards the IB network.

So in other words you have a userspace app that joins an IPoIB
multicast group and then it has to do an IP_ADD_MEMBERSHIP socket
option to trigger IGMP messages being sent out, so that traffic gets
routed to it?

> This patch does a lookup on the database used for multicast reference
> counting and enhances IPoIB to ignore multicast group which is
> already handled by user space, all this under a per device policy
> flag. That is when the policy flag allows it, IPoIB will not join and
> attach its QP to a multicast group which has an entry on the database.

And then you don't want the kernel IPoIB driver to actually join the
multicast group for the IP multicast group you added with
IP_ADD_MEMBERSHIP? Why is that exactly -- this is the part I'm
especially hazy on.

 - R.
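
For readers unfamiliar with the socket option mentioned above, here is a minimal sketch of how a userspace program typically asks the IP stack to join an IPv4 multicast group; on an interface where IGMP is in use, this join is what prompts membership reports. It is a generic sockets example, not code from the IPoIB patch: the helper names `fill_mreq()` and `join_group()` are hypothetical, and the choice of INADDR_ANY (let the kernel pick the interface) is an assumption made to keep the sketch short.

```c
#define _DEFAULT_SOURCE /* expose struct ip_mreq under strict -std= modes */
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill an ip_mreq for `group` (dotted quad), leaving the interface
 * choice to the kernel. Returns 0 on success, -1 if `group` does not
 * parse or is not an IPv4 multicast address. */
int fill_mreq(struct ip_mreq *mreq, const char *group)
{
	assert(mreq != NULL);
	memset(mreq, 0, sizeof(*mreq));
	if (inet_pton(AF_INET, group, &mreq->imr_multiaddr) != 1)
		return -1;
	if (!IN_MULTICAST(ntohl(mreq->imr_multiaddr.s_addr)))
		return -1;
	mreq->imr_interface.s_addr = htonl(INADDR_ANY);
	return 0;
}

/* Join `group` on an existing UDP socket `fd`. Returns the setsockopt()
 * result, or -1 if the group address is rejected by fill_mreq(). */
int join_group(int fd, const char *group)
{
	struct ip_mreq mreq;

	if (fill_mreq(&mreq, group) < 0)
		return -1;
	return setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
			  &mreq, sizeof(mreq));
}
```

A caller would create a socket with `socket(AF_INET, SOCK_DGRAM, 0)`, bind it to the group's port, and then call `join_group(fd, "239.1.2.3")`; the address used here is only an illustration.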

From rdreier at cisco.com Tue Oct 9 14:55:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 09 Oct 2007 14:55:31 -0700
Subject: [ofa-general] Re: [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support
In-Reply-To: <20071009195920.7151.4573.stgit@eng-46.internal.keyresearch.com> (Arthur Jones's message of "Tue, 09 Oct 2007 12:59:20 -0700")
References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> <20071009195920.7151.4573.stgit@eng-46.internal.keyresearch.com>
Message-ID:

OK, I'll grudgingly merge these patches, even though they all arrived on
the exact day that Linus released 2.6.23... but you guys really need
to fix your development process so you don't accumulate a huge bolus
of patches that you then vomit out. In the future I'm not going to
accept giant merges like this -- please send your patches as soon as
you've accumulated say 5 or 10.

 - R.

From hadi at cyberus.ca Tue Oct 9 14:56:46 2007
From: hadi at cyberus.ca (jamal)
Date: Tue, 09 Oct 2007 17:56:46 -0400
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <20071009.142235.74385364.davem@davemloft.net>
References: <20071009.134331.35664207.davem@davemloft.net> <20071009135340.33e5922c@freepuppy.rosehill> <20071009.142235.74385364.davem@davemloft.net>
Message-ID: <1191967006.5324.14.camel@localhost>

On Tue, 2007-09-10 at 14:22 -0700, David Miller wrote:

> Even if the device has "huge queues" I still think we need a software
> queue for when the hardware one backs up.

It should be fine to just "pretend" the qdisc exists despite it
sitting in the driver and not have s/ware queues at all to avoid all
the challenges that qdiscs bring; if the h/ware queues are full
because of link pressure etc, you drop. We drop today when the s/ware
queues are full. The driver txmit lock takes the place of the qdisc
queue lock etc. I am assuming there is still need for that locking.
The filter/classification scheme still works as is and selects classes
which map to rings. tc still works as is etc.

cheers,
jamal

From hadi at cyberus.ca Tue Oct 9 15:07:19 2007
From: hadi at cyberus.ca (jamal)
Date: Tue, 09 Oct 2007 18:07:19 -0400
Subject: [ofa-general] [PATCHES] TX batching rev2.5
Message-ID: <1191967639.5324.25.camel@localhost>

Please provide feedback on the code and/or architecture.
They are now updated to work with the latest rebased net-2.6.24
from a few hours ago.
I am on travel mode so won't have time to do more testing for the next
few days - i do consider this to be stable at this point based on
what i have been testing (famous last words).

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

What has changed since i posted last:
-------------------------------------

1) There was some cruft leftover from the prep_frame feature that i
forgot to remove last time - it is now gone.

2) In the shower this AM, i realized that it is plausible that a
batch of packets sent to the driver may all be dropped because
they are badly formatted. Current drivers all return NETDEV_TX_OK
in all such cases. This would lead us to invoke
dev->hard_end_xmit() unnecessarily. I already had a
NETDEV_TX_DROPPED within batching drivers, so i made that global and
made the batching drivers return it if they drop packets.
The core calls dev->hard_end_xmit() when at least one packet makes it
through.

Things i am gonna say that nobody will see (wink)
-------------------------------------------------

Dave please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM to benefit, just in case i
misunderstood.
Herbert, if you can look at at least patch 3 i will appreciate it
(since it kills dev->gso_skb that you introduced).
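
The behaviour described in item 2 above (only invoke the completion hook when at least one packet of the batch was actually queued) can be modelled in plain userspace C. This is a hedged sketch, not the posted patch: `struct sim_dev`, `struct pkt`, the `sim_*` functions and the `TX_OK`/`TX_DROPPED` values are hypothetical stand-ins for the netdevice, the skb, `dev->hard_start_xmit()`, `dev->hard_end_xmit()` and the NETDEV_TX_* return codes, and the ring in this model never drains.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for NETDEV_TX_OK and the NETDEV_TX_DROPPED code mentioned in
 * the email; the numeric values here are arbitrary. */
enum { TX_OK = 0, TX_DROPPED = 1 };

struct pkt {
	int well_formed;   /* 0 models a badly formatted packet */
};

struct sim_dev {
	int xmit_win;      /* descriptors the driver says it can accept */
	int queued;        /* packets placed on the ring, not yet kicked */
	int kicks;         /* number of times the end-of-batch hook ran */
};

/* Models dev->hard_start_xmit(): format and enqueue one packet. */
static int sim_hard_start_xmit(struct sim_dev *dev, const struct pkt *p)
{
	if (!p->well_formed)
		return TX_DROPPED;
	dev->queued++;
	return TX_OK;
}

/* Models dev->hard_end_xmit(): one IO kick for the whole batch, then
 * refresh the advertised window. */
static void sim_hard_end_xmit(struct sim_dev *dev)
{
	dev->xmit_win -= dev->queued;
	dev->queued = 0;
	dev->kicks++;
}

/* Models the core: take at most xmit_win packets, hand each to the
 * driver, and kick the hardware only if something was queued.
 * Returns the number of packets that made it onto the ring. */
int sim_xmit_batch(struct sim_dev *dev, const struct pkt *batch, size_t n)
{
	size_t take = 0, i;
	int sent = 0;

	assert(dev != NULL);
	if (dev->xmit_win > 0)
		take = n < (size_t)dev->xmit_win ? n : (size_t)dev->xmit_win;

	for (i = 0; i < take; i++)
		if (sim_hard_start_xmit(dev, &batch[i]) == TX_OK)
			sent++;

	if (sent > 0)
		sim_hard_end_xmit(dev);
	return sent;
}
```

With a window of 4, a batch of three malformed packets leaves `kicks` at 0, while a batch with two good packets out of three triggers exactly one kick and shrinks the window to 2.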

UPCOMING PATCHES
---------------

As before:
More patches to follow later if i get some feedback - i didn't want to
overload people by dumping too many patches. Most of these patches
mentioned below are ready to go; some need some re-testing and others
need a little porting from an earlier kernel:
- tg3 driver
- tun driver
- pktgen
- netiron driver
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver-howto that i will post soon today.

PERFORMANCE TESTING
--------------------

System under test hardware is still a 2xdual core opteron with a
couple of tg3s.
A test tool generates udp traffic of different sizes for up to 60
seconds per run or a total of 30M packets. I have 4 threads each
running on a specific CPU which keep all the CPUs as busy as they can
sending packets targeted at a directly connected box's udp discard
port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule
which counts and drops all incoming udp packets to discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}.
Each packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

I do plan also to run forwarding and TCP tests in the future when the
dust settles.

cheers,
jamal

From hadi at cyberus.ca Tue Oct 9 15:10:47 2007
From: hadi at cyberus.ca (jamal)
Date: Tue, 09 Oct 2007 18:10:47 -0400
Subject: [ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface Rev2.5
Message-ID: <1191967847.5324.31.camel@localhost>

This patch introduces the netdevice interface for batching.

cheers,
jamal

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 01-introduce-batching-interface.patch
Type: text/x-patch
Size: 8834 bytes
Desc: not available
URL:

From hadi at cyberus.ca Tue Oct 9 15:12:24 2007
From: hadi at cyberus.ca (jamal)
Date: Tue, 09 Oct 2007 18:12:24 -0400
Subject: [ofa-general] [PATCH 2/3][NET_BATCH] Rev2.5 net core use batching
Message-ID: <1191967944.5324.33.camel@localhost>

This patch adds the usage of batching within the core.

cheers,
jamal

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 02-net-core-use-batching.patch
Type: text/x-patch
Size: 4508 bytes
Desc: not available
URL:

From hadi at cyberus.ca Tue Oct 9 15:13:57 2007
From: hadi at cyberus.ca (jamal)
Date: Tue, 09 Oct 2007 18:13:57 -0400
Subject: [ofa-general] [PATCH 3/3][NET_BATCH] Rev2.5 kill dev->gso_skb
Message-ID: <1191968037.5324.36.camel@localhost>

This patch removes dev->gso_skb as it is no longer necessary with
batching code.

cheers,
jamal

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 03-kill-dev-gso-skb.patch
Type: text/x-patch
Size: 2277 bytes
Desc: not available
URL:

From hadi at cyberus.ca Tue Oct 9 15:20:26 2007
From: hadi at cyberus.ca (jamal)
Date: Tue, 09 Oct 2007 18:20:26 -0400
Subject: [ofa-general] Re: [PATCHES] TX batching rev2.5
In-Reply-To: <1191967639.5324.25.camel@localhost>
References: <1191967639.5324.25.camel@localhost>
Message-ID: <1191968426.5324.45.camel@localhost>

On Tue, 2007-09-10 at 18:07 -0400, jamal wrote:

> Please provide feedback on the code and/or architecture.
> They are now updated to work with the latest rebased net-2.6.24
> from a few hours ago.

I should have added i have tested this with just the batching changes
and it is within the performance realm of the changes from yesterday.
If anyone wants exact numbers, i can send them.
cheers,
jamal

From hadi at cyberus.ca Tue Oct 9 15:29:02 2007
From: hadi at cyberus.ca (jamal)
Date: Tue, 09 Oct 2007 18:29:02 -0400
Subject: [ofa-general] [DOC][NET_BATCH]Rev2.5 Driver Howto
Message-ID: <1191968942.5324.48.camel@localhost>

I updated this doc to match the latest patch.

cheers,
jamal

-------------- next part --------------
Here's the beginning of a howto for driver authors.

The intended audience for this howto is people already
familiar with netdevices.

1.0 Netdevice Prerequisites
------------------------------

For hardware-based netdevices, you must have at least hardware that
is capable of doing DMA with many descriptors; i.e., having hardware
with a queue length of 3 (as in some fscked ethernet hardware) is
not very useful in this case.

2.0 What is new in the driver API
-----------------------------------

There is one new method and one new variable introduced that the
driver author needs to be aware of. These are:
1) dev->hard_end_xmit()
2) dev->xmit_win

2.1 Using Core driver changes
-----------------------------

To provide context, let's look at a typical driver abstraction
for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew
on, tx completion interrupts, etc.

[For code cleanliness/readability sake, regardless of this work,
one should break the dev->hard_start_xmit() into those 4 functional
blocks anyways].

A driver which has all 4 parts and needs to support batching is
advised to split its dev->hard_start_xmit() in the following manner:

1) use its dev->hard_end_xmit() method to achieve #d
2) use dev->xmit_win to tell the core how much space you have.

#b and #c can stay in ->hard_start_xmit() (or whichever way you
want to do this)
Section 3. shows more details on the suggested usage.

2.1.1 Theory of operation
--------------------------

1.
Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented
and GSO packets are accounted for as well.
2. Core grabs device's TX_LOCK
3. Core loop for all skbs:
	->invokes driver dev->hard_start_xmit()
4. Core invokes driver dev->hard_end_xmit() if packets xmitted

2.1.1.1 The slippery LLTX
-------------------------

Since these types of drivers are being phased out and they require
extra code, they will not be supported anymore. So as of Oct 07 the
code that supports them has been removed.

2.1.1.2 xmit_win
----------------

The dev->xmit_win variable is set by the driver to tell us how much
space it has in its rings/queues. This detail is then used to figure
out how many packets are retrieved from the qdisc queues (in order to
send to the driver).
dev->xmit_win is introduced to ensure that when we pass the driver a
list of packets it will swallow all of them -- which is useful because
we don't requeue to the qdisc (and avoids burning unnecessary CPU
cycles or introducing any strange re-ordering).
Essentially the driver signals us how much space it has for
descriptors by setting this variable.

2.1.1.2.1 Setting xmit_win
--------------------------

This variable should be set during xmit path shutdown(netif_stop),
wakeup(netif_wake) and ->hard_end_xmit(). In the case of the first
one the value is set to 1 and in the other two it is set to whatever
the driver deems to be available space on the ring.

3.0 Driver Essentials
---------------------

The typical driver tx state machine is:

----
-1-> +Core sends packets
     +--> Driver puts packet onto hardware queue
     +    if hardware queue is full, netif_stop_queue(dev)
     +
-2-> +core stops sending because of netif_stop_queue(dev)
..
.. time passes ...
..
-3-> +---> driver has transmitted packets, opens up tx path by
           invoking netif_wake_queue(dev)
-1-> +Cycle repeats and core sends more packets (step 1).
----

3.1 Driver prerequisite
--------------------------

This is _a very important_ requirement in making batching useful.
The prerequisite for batching changes is that the driver should
provide a low threshold to open up the tx path.
Drivers such as tg3 and e1000 already do this.
Before you invoke netif_wake_queue(dev) you check if there is a
threshold of space reached to insert new packets.

Here's an example of how I added it to the tun driver. Observe the
setting of dev->xmit_win.

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it:

--
if (unlikely(cleaned && netif_carrier_ok(netdev) &&
	     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

	if (netif_queue_stopped(netdev)) {
		int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
		netdev->xmit_win = rspace;
		netif_wake_queue(netdev);
	}
}
---

The tg3 code (with no batching changes) looks like:

-----
	if (netif_queue_stopped(tp->dev) &&
	    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
		netif_wake_queue(tp->dev);
---

3.2 Driver Setup
-----------------

*) On initialization (before netdev registration)
 1) set NETIF_F_BTX in dev->features
    i.e., dev->features |= NETIF_F_BTX
    This makes the core do proper initialization.

 2) set dev->xmit_win to something reasonable like
    maybe half the tx DMA ring size etc.

 3) create proper pointer to the ->hard_end_xmit() method.

3.3 Annotation on the different methods
----------------------------------------
This section shows examples and offers suggestions on how the different
methods and variable could be used.

3.3.1 dev->hard_start_xmit()
----------------------------

Here's an example of a tx routine that is similar to the one I added
to the current tun driver. The bxmit suffix is kept so that you can
turn off batching if needed via an ethtool interface and call the
already existing interface.

----
  static int xxx_net_bxmit(struct net_device *dev)
  {
  ....
  ....
	enqueue onto hardware ring

	if (hardware ring full) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;
	}

   .......
   ..
   .
  }
------

All return codes like NETDEV_TX_OK etc. still apply.
In addition a new code NETDEV_TX_DROPPED should be returned if the
packet is dropped. This helps the core layer to account for
transmitted packets and invoke dev->hard_end_xmit() at the end of a
batch when one or more packets are transmitted.

3.3.2 The tx complete, dev->hard_end_xmit()
-------------------------------------------------

In this method, if there are any IO operations that apply to a
set of packets such as kicking DMA, setting of interrupt thresholds
etc., leave them to the end and apply them once if you have
successfully enqueued. This provides a mechanism for saving a lot of
CPU cycles since IO is cycle expensive.
Here is a simplified tg3 dev->hard_end_xmit():

----
void tg3_complete_xmit(struct net_device *dev)
{
	/* Packets are ready, update Tx producer idx local and on card. */
	tw32_tx_mbox((MAILBOX_SNDHOST_PROD_IDX_0 + TG3_64BIT_REG_LOW), entry);

	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;
		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp)) {
			tg3_set_win(tp);
			netif_wake_queue(dev);
		}
	} else {
		tg3_set_win(tp);
	}

	mmiowb();
	dev->trans_start = jiffies;
}
-------

3.3.3 setting the dev->xmit_win
---------------------------------

As mentioned earlier this variable provides hints on how much
data to send from the core to the driver. Here are the obvious ways:
a) on doing a netif_stop, set it to 1. By default all drivers have
this value set to 1 to emulate old behavior where a driver only
receives one packet at a time.
b) on netif_wake_queue set it to the max available space. You have
to be careful if your hardware does scatter-gather since the core
will pass you scatter-gatherable skbs and so you want to at least
leave enough space for the maximum allowed. Look at the tg3 and
e1000 to see how this is implemented.
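
Rules a) and b) can be sketched in plain userspace C. This is only an
illustration, not driver code: struct toy_ring, toy_ring_unused(),
toy_xmit_win() and TOY_MAX_FRAGS are hypothetical names, and the
reserve of TOY_MAX_FRAGS + 2 descriptors simply mirrors the e1000
snippet in section 3.1. Clamping the result to 1 when little space is
left is an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical stand-ins for driver state; none of these names exist in the kernel. */
#define TOY_MAX_FRAGS 18	/* assumed worst-case fragments per skb */

struct toy_ring {
	unsigned int size;	/* total descriptors in the tx ring */
	unsigned int used;	/* descriptors currently owned by hardware */
	int stopped;		/* non-zero after a netif_stop_queue()-like event */
};

/* Free descriptors left in the ring. */
static unsigned int toy_ring_unused(const struct toy_ring *r)
{
	return r->size - r->used;
}

/*
 * Value a driver might store in dev->xmit_win:
 *  a) 1 while the queue is stopped (old one-packet-at-a-time behavior),
 *  b) otherwise the free space minus a scatter-gather reserve,
 *     clamped so it never drops below 1 (assumption of this sketch).
 */
static unsigned int toy_xmit_win(const struct toy_ring *r)
{
	unsigned int reserve = TOY_MAX_FRAGS + 2;
	unsigned int unused = toy_ring_unused(r);

	if (r->stopped || unused <= reserve)
		return 1;
	return unused - reserve;
}
```

With an empty 256-entry ring this yields a window of 236 packets; once
the ring is nearly full or the queue is stopped, the window falls back
to 1.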
The variable is important because it avoids the core sending
any more than what the driver can handle, therefore avoiding
any need to muck with packet scheduling mechanisms.

Appendix 1: History
-------------------
June 11/2007: Initial revision
June 11/2007: Fixed typo on e1000 netif_wake description ..
Aug 08/2007: Added info on VLAN and the skb->cb[] danger ..
Sep 24/2007: Revised and cleaned up
Sep 25/2007: Cleanups from Randy Dunlap
Oct 08/2007: Removed references to LLTX and packet formatting
Oct 09/2007: Added reference to NETDEV_TX_DROPPED

From rdreier at cisco.com Tue Oct 9 15:44:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 09 Oct 2007 15:44:39 -0700
Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
In-Reply-To: <20071009.135145.95506679.davem@davemloft.net> (David Miller's message of "Tue, 09 Oct 2007 13:51:45 -0700 (PDT)")
References: <20071009.042441.30182968.davem@davemloft.net> <20071009.135145.95506679.davem@davemloft.net>
Message-ID:

> Before you add new entries to your list, how is that ibm driver NAPI
> conversion coming along? :-)

OK, thanks for the kick in the pants, I have a couple of patches for
net-2.6.24 coming (including an unrelated trivial warning fix for
IPoIB).

- R.

From jim at mellanox.com Tue Oct 9 15:45:28 2007
From: jim at mellanox.com (Jim Mott)
Date: Tue, 9 Oct 2007 15:45:28 -0700
Subject: [ofa-general] [PATCH 1/1] IB/SDP - Zero copy bcopy support
In-Reply-To: <20071009195920.7151.4573.stgit@eng-46.internal.keyresearch.com>
References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> <20071009195920.7151.4573.stgit@eng-46.internal.keyresearch.com>
Message-ID:

This patch adds zero copy send support to SDP. Below 2K transfer size,
it is better to bcopy. With larger transfers, this is a net win on
bandwidth. Latency testing is yet to be done.

                      BCOPY          BZCOPY
 1K TCP_STREAM    3555 Mb/sec    2250 Mb/sec
 2K TCP_STREAM    3650 Mb/sec    3785 Mb/sec
 4K TCP_STREAM    3560 Mb/sec    6220 Mb/sec
 8K TCP_STREAM    3555 Mb/sec    6190 Mb/sec
16K TCP_STREAM    5100 Mb/sec    6155 Mb/sec
 1M TCP_STREAM    4630 Mb/sec    6210 Mb/sec

Performance work still remains. Open issues include correct setsockopt
defines (use previous SDP values?), code cleanup, performance tuning,
rigorous regression testing, and multi-OS build+test. Simple testing
to date includes netperf and iperf, ^C recovery, unload/load, and
checking for gross memory leaks on Rhat4u4.

Signed-off-by: Jim Mott

---

Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp.h
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp.h	2007-10-08 08:20:57.000000000 -0500
+++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp.h	2007-10-08 08:31:41.000000000 -0500
@@ -50,6 +50,9 @@ extern int sdp_data_debug_level;
 #define SDP_HEAD_SIZE (PAGE_SIZE / 2 + sizeof(struct sdp_bsdh))
 #define SDP_NUM_WC 4
 
+#define SDP_MIN_ZCOPY_THRESH 1024
+#define SDP_MAX_ZCOPY_THRESH 1048576
+
 #define SDP_OP_RECV 0x800000000LL
 
 enum sdp_mid {
@@ -70,6 +73,13 @@ enum {
 	SDP_MIN_BUFS = 2
 };
 
+enum {
+	SDP_ERR_ERROR = -4,
+	SDP_ERR_FAULT = -3,
+	SDP_NEW_SEG = -2,
+	SDP_DO_WAIT_MEM = -1
+};
+
 struct rdma_cm_id;
 struct rdma_cm_event;
 
@@ -148,6 +158,9 @@ struct sdp_sock {
 	int recv_frags;
 	int send_frags;
 
+	/* ZCOPY data */
+	int zcopy_thresh;
+
 	struct ib_sge ibsge[SDP_MAX_SEND_SKB_FRAGS + 1];
 	struct ib_wc ibwc[SDP_NUM_WC];
 };
Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_main.c
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_main.c	2007-10-08 08:21:05.000000000 -0500
+++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_main.c	2007-10-09 16:52:34.000000000 -0500
@@ -65,6 +65,16 @@ unsigned int csum_partial_copy_from_user
 #include "sdp.h"
 #include
 
+struct bzcopy_state {
+	unsigned
char __user *u_base; + int u_len; + int left; + int page_cnt; + int cur_page; + int cur_offset; + struct page **pages; +}; + MODULE_AUTHOR("Michael S. Tsirkin"); MODULE_DESCRIPTION("InfiniBand SDP module"); MODULE_LICENSE("Dual BSD/GPL"); @@ -117,6 +127,10 @@ static int send_poll_thresh = 8192; module_param_named(send_poll_thresh, send_poll_thresh, int, 0644); MODULE_PARM_DESC(send_poll_thresh, "Send message size thresh hold over which to start polling."); +static int sdp_zcopy_thresh = 0; +module_param_named(sdp_zcopy_thresh, sdp_zcopy_thresh, int, 0644); +MODULE_PARM_DESC(sdp_zcopy_thresh, "Zero copy send threshold; 0=0ff."); + struct workqueue_struct *sdp_workqueue; static struct list_head sock_list; @@ -867,6 +881,12 @@ static int sdp_setsockopt(struct sock *s sdp_push_pending_frames(sk); } break; + case SDP_ZCOPY_THRESH: + if (val < SDP_MIN_ZCOPY_THRESH || val > SDP_MAX_ZCOPY_THRESH) + err = -EINVAL; + else + ssk->zcopy_thresh = val; + break; default: err = -ENOPROTOOPT; break; @@ -904,6 +924,9 @@ static int sdp_getsockopt(struct sock *s case TCP_CORK: val = !!(ssk->nonagle&TCP_NAGLE_CORK); break; + case SDP_ZCOPY_THRESH: + val = ssk->zcopy_thresh ? ssk->zcopy_thresh : sdp_zcopy_thresh; + break; default: return -ENOPROTOOPT; } @@ -1051,10 +1074,252 @@ void sdp_push_one(struct sock *sk, unsig { } -/* Like tcp_sendmsg */ -/* TODO: check locking */ +static struct bzcopy_state *sdp_bz_cleanup(struct bzcopy_state *bz) +{ + int i; + + if (bz->pages) { + for (i = bz->cur_page; i < bz->page_cnt; i++) + put_page(bz->pages[i]); + + kfree(bz->pages); + } + + kfree(bz); + + return NULL; +} + + +static struct bzcopy_state *sdp_bz_setup(struct sdp_sock *ssk, + unsigned char __user *base, + int len, + int size_goal) +{ + struct bzcopy_state *bz; + unsigned long addr; + unsigned long locked, locked_limit; + int done_pages; + int thresh; + + thresh = ssk->zcopy_thresh ? 
: sdp_zcopy_thresh;
+	if (thresh == 0 || len < thresh)
+		return NULL;
+
+	if (!can_do_mlock())
+		return NULL;
+
+	bz = kzalloc(sizeof(*bz), GFP_KERNEL);
+	if (!bz)
+		return NULL;
+
+	/*
+	 * Since we use the TCP segmentation fields of the skb to map user
+	 * pages, we must make sure that everything we send in a single chunk
+	 * fits into the frags array in the skb.
+	 */
+	size_goal = size_goal / PAGE_SIZE + 1;
+	if (size_goal >= MAX_SKB_FRAGS)
+		return NULL;
+
+	addr = (unsigned long)base;
+
+	bz->u_base = base;
+	bz->u_len = len;
+	bz->left = len;
+	bz->cur_offset = addr & ~PAGE_MASK;
+	bz->page_cnt = PAGE_ALIGN(len + bz->cur_offset) >> PAGE_SHIFT;
+	bz->pages = kcalloc(bz->page_cnt, sizeof(struct page *), GFP_KERNEL);
+
+	if (!bz->pages)
+		goto out_1;
+
+	down_write(&current->mm->mmap_sem);
+
+	locked = bz->page_cnt + current->mm->locked_vm;
+	locked_limit = current->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
+
+	if ((locked > locked_limit) && !capable(CAP_IPC_LOCK))
+		goto out_2;
+
+	addr &= PAGE_MASK;
+
+	done_pages = get_user_pages(current, current->mm, addr, bz->page_cnt,
+				    0, 0, bz->pages, NULL);
+	if (unlikely(done_pages != bz->page_cnt)){
+		bz->page_cnt = done_pages;
+		goto out_2;
+	}
+
+	up_write(&current->mm->mmap_sem);
+
+	return bz;
+
+out_2:
+	up_write(&current->mm->mmap_sem);
+out_1:
+	sdp_bz_cleanup(bz);
+
+	return NULL;
+}
+
+
 #define TCP_PAGE(sk) (sk->sk_sndmsg_page)
 #define TCP_OFF(sk) (sk->sk_sndmsg_off)
+static inline int sdp_bcopy_get(struct sock *sk, struct sk_buff *skb,
+				unsigned char __user *from, int copy)
+{
+	int err;
+	struct sdp_sock *ssk = sdp_sk(sk);
+
+	/* Where to copy to? */
+	if (skb_tailroom(skb) > 0) {
+		/* We have some space in skb head. Superb!
*/ + if (copy > skb_tailroom(skb)) + copy = skb_tailroom(skb); + if ((err = skb_add_data(skb, from, copy)) != 0) + return SDP_ERR_FAULT; + } else { + int merge = 0; + int i = skb_shinfo(skb)->nr_frags; + struct page *page = TCP_PAGE(sk); + int off = TCP_OFF(sk); + + if (skb_can_coalesce(skb, i, page, off) && + off != PAGE_SIZE) { + /* We can extend the last page + * fragment. */ + merge = 1; + } else if (i == ssk->send_frags || + (!i && + !(sk->sk_route_caps & NETIF_F_SG))) { + /* Need to add new fragment and cannot + * do this because interface is non-SG, + * or because all the page slots are + * busy. */ + sdp_mark_push(ssk, skb); + return SDP_NEW_SEG; + } else if (page) { + if (off == PAGE_SIZE) { + put_page(page); + TCP_PAGE(sk) = page = NULL; + off = 0; + } + } else + off = 0; + + if (copy > PAGE_SIZE - off) + copy = PAGE_SIZE - off; + + if (!sk_stream_wmem_schedule(sk, copy)) + return SDP_DO_WAIT_MEM; + + if (!page) { + /* Allocate new cache page. */ + if (!(page = sk_stream_alloc_page(sk))) + return SDP_DO_WAIT_MEM; + } + + /* Time to copy data. We are close to + * the end! */ + err = skb_copy_to_page(sk, from, skb, page, + off, copy); + if (err) { + /* If this page was new, give it to the + * socket so it does not get leaked. + */ + if (!TCP_PAGE(sk)) { + TCP_PAGE(sk) = page; + TCP_OFF(sk) = 0; + } + return SDP_ERR_ERROR; + } + + /* Update the skb. 
*/ + if (merge) { + skb_shinfo(skb)->frags[i - 1].size += + copy; + } else { + skb_fill_page_desc(skb, i, page, off, copy); + if (TCP_PAGE(sk)) { + get_page(page); + } else if (off + copy < PAGE_SIZE) { + get_page(page); + TCP_PAGE(sk) = page; + } + } + + TCP_OFF(sk) = off + copy; + } + + return copy; +} + + +static inline int sdp_bzcopy_get(struct sock *sk, struct sk_buff *skb, + unsigned char __user *from, int copy, + struct bzcopy_state *bz) +{ + int this_page, left; + struct sdp_sock *ssk = sdp_sk(sk); + + if (skb_shinfo(skb)->nr_frags == ssk->send_frags) { + sdp_mark_push(ssk, skb); + return SDP_NEW_SEG; + } + + left = copy; + BUG_ON(left > bz->left); + + while (left) { + if (skb_shinfo(skb)->nr_frags == ssk->send_frags) { + copy = copy - left; + break; + } + + this_page = PAGE_SIZE - bz->cur_offset; + + if (left <= this_page) + this_page = left; + + if (!sk_stream_wmem_schedule(sk, copy)) + return SDP_DO_WAIT_MEM; + + skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags, + bz->pages[bz->cur_page], bz->cur_offset, + this_page); + + BUG_ON(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS); + + bz->cur_offset += this_page; + if (bz->cur_offset == PAGE_SIZE) { + bz->cur_offset = 0; + bz->cur_page++; + + BUG_ON(bz->cur_page > bz->page_cnt); + } else { + BUG_ON(bz->cur_offset > PAGE_SIZE); + + if (bz->cur_page != bz->page_cnt || left != this_page) + get_page(bz->pages[bz->cur_page]); + } + + left -= this_page; + + skb->len += this_page; + skb->data_len = skb->len; + skb->truesize += this_page; + sk->sk_wmem_queued += this_page; + sk->sk_forward_alloc -= this_page; + } + + bz->left -= copy; + return copy; +} + + +/* Like tcp_sendmsg */ +/* TODO: check locking */ int sdp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t size) { @@ -1065,6 +1330,7 @@ int sdp_sendmsg(struct kiocb *iocb, stru int mss_now, size_goal; int err, copied; long timeo; + struct bzcopy_state *bz = NULL; lock_sock(sk); sdp_dbg_data(sk, "%s\n", __func__); @@ -1098,6 +1364,8 @@ int 
sdp_sendmsg(struct kiocb *iocb, stru iov++; + bz = sdp_bz_setup(ssk, from, seglen, size_goal); + while (seglen > 0) { int copy; @@ -1141,84 +1409,17 @@ new_segment: sdp_mark_push(ssk, skb); goto new_segment; } - /* Where to copy to? */ - if (skb_tailroom(skb) > 0) { - /* We have some space in skb head. Superb! */ - if (copy > skb_tailroom(skb)) - copy = skb_tailroom(skb); - if ((err = skb_add_data(skb, from, copy)) != 0) - goto do_fault; - } else { - int merge = 0; - int i = skb_shinfo(skb)->nr_frags; - struct page *page = TCP_PAGE(sk); - int off = TCP_OFF(sk); - - if (skb_can_coalesce(skb, i, page, off) && - off != PAGE_SIZE) { - /* We can extend the last page - * fragment. */ - merge = 1; - } else if (i == ssk->send_frags || - (!i && - !(sk->sk_route_caps & NETIF_F_SG))) { - /* Need to add new fragment and cannot - * do this because interface is non-SG, - * or because all the page slots are - * busy. */ - sdp_mark_push(ssk, skb); - goto new_segment; - } else if (page) { - if (off == PAGE_SIZE) { - put_page(page); - TCP_PAGE(sk) = page = NULL; - off = 0; - } - } else - off = 0; - - if (copy > PAGE_SIZE - off) - copy = PAGE_SIZE - off; - if (!sk_stream_wmem_schedule(sk, copy)) + copy = (bz) ? sdp_bzcopy_get(sk, skb, from, copy, bz) : + sdp_bcopy_get(sk, skb, from, copy); + if (unlikely(copy < 0)) { + if (!++copy) goto wait_for_memory; - - if (!page) { - /* Allocate new cache page. */ - if (!(page = sk_stream_alloc_page(sk))) - goto wait_for_memory; - } - - /* Time to copy data. We are close to - * the end! */ - err = skb_copy_to_page(sk, from, skb, page, - off, copy); - if (err) { - /* If this page was new, give it to the - * socket so it does not get leaked. - */ - if (!TCP_PAGE(sk)) { - TCP_PAGE(sk) = page; - TCP_OFF(sk) = 0; - } - goto do_error; - } - - /* Update the skb. 
*/ - if (merge) { - skb_shinfo(skb)->frags[i - 1].size += - copy; - } else { - skb_fill_page_desc(skb, i, page, off, copy); - if (TCP_PAGE(sk)) { - get_page(page); - } else if (off + copy < PAGE_SIZE) { - get_page(page); - TCP_PAGE(sk) = page; - } - } - - TCP_OFF(sk) = off + copy; + if (!++copy) + goto new_segment; + if (!++copy) + goto do_fault; + goto do_error; } if (!copied) @@ -1259,6 +1460,8 @@ wait_for_memory: } out: + if (bz) + bz = sdp_bz_cleanup(bz); if (copied) sdp_push(sk, ssk, flags, mss_now, ssk->nonagle); if (size > send_poll_thresh) @@ -1278,6 +1481,8 @@ do_error: if (copied) goto out; out_err: + if (bz) + bz = sdp_bz_cleanup(bz); err = sk_stream_error(sk, flags, err); release_sock(sk); return err; Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_socket.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_socket.h 2007-09-26 01:30:20.000000000 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_socket.h 2007-10-08 08:33:40.000000000 -0500 @@ -8,6 +8,10 @@ #define PF_INET_SDP AF_INET_SDP #endif +#ifndef SDP_ZCOPY_THRESH +#define SDP_ZCOPY_THRESH 80 +#endif + /* TODO: AF_INET6_SDP ? */ #endif From rdreier at cisco.com Tue Oct 9 15:46:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 15:46:13 -0700 Subject: [ofa-general] [PATCH 1/4] IPoIB: Fix unused variable warning In-Reply-To: (Roland Dreier's message of "Tue, 09 Oct 2007 15:44:39 -0700") References: <20071009.042441.30182968.davem@davemloft.net> <20071009.135145.95506679.davem@davemloft.net> Message-ID: The conversion to use netdevice internal stats left an unused variable in ipoib_neigh_free(), since there's no longer any reason to get netdev_priv() in order to increment dropped packets. Delete the unused priv variable. 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 6b1b4b2..855c9de 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -854,7 +854,6 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh) { - struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; *to_ipoib_neigh(neigh->neighbour) = NULL; while ((skb = __skb_dequeue(&neigh->queue))) { From rdreier at cisco.com Tue Oct 9 15:47:37 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 15:47:37 -0700 Subject: [ofa-general] [PATCH 2/4] ibm_emac: Convert to use napi_struct independent of struct net_device In-Reply-To: (Roland Dreier's message of "Tue, 09 Oct 2007 15:46:13 -0700") References: <20071009.042441.30182968.davem@davemloft.net> <20071009.135145.95506679.davem@davemloft.net> Message-ID: Commit da3dedd9 ("[NET]: Make NAPI polling independent of struct net_device objects.") changed the interface to NAPI polling. Fix up the ibm_emac driver so that it works with this new interface. This is actually a nice cleanup because ibm_emac is one of the drivers that wants to have multiple NAPI structures for a single net_device. Tested with the internal MAC of a PowerPC 440SPe SoC with an AMCC 'Yucca' evaluation board. 
Signed-off-by: Roland Dreier --- drivers/net/ibm_emac/ibm_emac_mal.c | 48 ++++++++++++---------------------- drivers/net/ibm_emac/ibm_emac_mal.h | 2 +- include/linux/netdevice.h | 10 +++++++ 3 files changed, 28 insertions(+), 32 deletions(-) diff --git a/drivers/net/ibm_emac/ibm_emac_mal.c b/drivers/net/ibm_emac/ibm_emac_mal.c index cabd984..cc3ddc9 100644 --- a/drivers/net/ibm_emac/ibm_emac_mal.c +++ b/drivers/net/ibm_emac/ibm_emac_mal.c @@ -207,10 +207,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance) static inline void mal_schedule_poll(struct ibm_ocp_mal *mal) { - if (likely(netif_rx_schedule_prep(&mal->poll_dev))) { + if (likely(napi_schedule_prep(&mal->napi))) { MAL_DBG2("%d: schedule_poll" NL, mal->def->index); mal_disable_eob_irq(mal); - __netif_rx_schedule(&mal->poll_dev); + __napi_schedule(&mal->napi); } else MAL_DBG2("%d: already in poll" NL, mal->def->index); } @@ -273,11 +273,11 @@ static irqreturn_t mal_rxde(int irq, void *dev_instance) return IRQ_HANDLED; } -static int mal_poll(struct net_device *ndev, int *budget) +static int mal_poll(struct napi_struct *napi, int budget) { - struct ibm_ocp_mal *mal = ndev->priv; + struct ibm_ocp_mal *mal = container_of(napi, struct ibm_ocp_mal, napi); struct list_head *l; - int rx_work_limit = min(ndev->quota, *budget), received = 0, done; + int received = 0; MAL_DBG2("%d: poll(%d) %d ->" NL, mal->def->index, *budget, rx_work_limit); @@ -295,38 +295,34 @@ static int mal_poll(struct net_device *ndev, int *budget) list_for_each(l, &mal->poll_list) { struct mal_commac *mc = list_entry(l, struct mal_commac, poll_list); - int n = mc->ops->poll_rx(mc->dev, rx_work_limit); + int n = mc->ops->poll_rx(mc->dev, budget); if (n) { received += n; - rx_work_limit -= n; - if (rx_work_limit <= 0) { - done = 0; + budget -= n; + if (budget <= 0) goto more_work; // XXX What if this is the last one ? 
- } } } /* We need to disable IRQs to protect from RXDE IRQ here */ local_irq_disable(); - __netif_rx_complete(ndev); + __napi_complete(napi); mal_enable_eob_irq(mal); local_irq_enable(); - done = 1; - /* Check for "rotting" packet(s) */ list_for_each(l, &mal->poll_list) { struct mal_commac *mc = list_entry(l, struct mal_commac, poll_list); if (unlikely(mc->ops->peek_rx(mc->dev) || mc->rx_stopped)) { MAL_DBG2("%d: rotting packet" NL, mal->def->index); - if (netif_rx_reschedule(ndev, received)) + if (napi_reschedule(napi)) mal_disable_eob_irq(mal); else MAL_DBG2("%d: already in poll list" NL, mal->def->index); - if (rx_work_limit > 0) + if (budget > 0) goto again; else goto more_work; @@ -335,12 +331,8 @@ static int mal_poll(struct net_device *ndev, int *budget) } more_work: - ndev->quota -= received; - *budget -= received; - - MAL_DBG2("%d: poll() %d <- %d" NL, mal->def->index, *budget, - done ? 0 : 1); - return done ? 0 : 1; + MAL_DBG2("%d: poll() %d <- %d" NL, mal->def->index, budget, received); + return received; } static void mal_reset(struct ibm_ocp_mal *mal) @@ -425,11 +417,8 @@ static int __init mal_probe(struct ocp_device *ocpdev) mal->def = ocpdev->def; INIT_LIST_HEAD(&mal->poll_list); - set_bit(__LINK_STATE_START, &mal->poll_dev.state); - mal->poll_dev.weight = CONFIG_IBM_EMAC_POLL_WEIGHT; - mal->poll_dev.poll = mal_poll; - mal->poll_dev.priv = mal; - atomic_set(&mal->poll_dev.refcnt, 1); + mal->napi.weight = CONFIG_IBM_EMAC_POLL_WEIGHT; + mal->napi.poll = mal_poll; INIT_LIST_HEAD(&mal->list); @@ -520,11 +509,8 @@ static void __exit mal_remove(struct ocp_device *ocpdev) MAL_DBG("%d: remove" NL, mal->def->index); - /* Syncronize with scheduled polling, - stolen from net/core/dev.c:dev_close() - */ - clear_bit(__LINK_STATE_START, &mal->poll_dev.state); - netif_poll_disable(&mal->poll_dev); + /* Synchronize with scheduled polling */ + napi_disable(&mal->napi); if (!list_empty(&mal->list)) { /* This is *very* bad */ diff --git 
a/drivers/net/ibm_emac/ibm_emac_mal.h b/drivers/net/ibm_emac/ibm_emac_mal.h index 64bc338..8f54d62 100644 --- a/drivers/net/ibm_emac/ibm_emac_mal.h +++ b/drivers/net/ibm_emac/ibm_emac_mal.h @@ -195,7 +195,7 @@ struct ibm_ocp_mal { dcr_host_t dcrhost; struct list_head poll_list; - struct net_device poll_dev; + struct napi_struct napi; struct list_head list; u32 tx_chan_mask; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 91cd3f3..4848c7a 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -349,6 +349,16 @@ static inline void napi_schedule(struct napi_struct *n) __napi_schedule(n); } +/* Try to reschedule poll. Called by dev->poll() after napi_complete(). */ +static inline int napi_reschedule(struct napi_struct *napi) +{ + if (napi_schedule_prep(napi)) { + __napi_schedule(napi); + return 1; + } + return 0; +} + /** * napi_complete - NAPI processing complete * @n: napi context From rdreier at cisco.com Tue Oct 9 15:47:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 15:47:59 -0700 Subject: [ofa-general] [PATCH 3/4] ibm_new_emac: Nuke SET_MODULE_OWNER() use In-Reply-To: (Roland Dreier's message of "Tue, 09 Oct 2007 15:46:13 -0700") References: <20071009.042441.30182968.davem@davemloft.net> <20071009.135145.95506679.davem@davemloft.net> Message-ID: Signed-off-by: Roland Dreier --- drivers/net/ibm_newemac/core.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/drivers/net/ibm_newemac/core.c b/drivers/net/ibm_newemac/core.c index ce127b9..8ea5009 100644 --- a/drivers/net/ibm_newemac/core.c +++ b/drivers/net/ibm_newemac/core.c @@ -2549,7 +2549,6 @@ static int __devinit emac_probe(struct of_device *ofdev, dev->ndev = ndev; dev->ofdev = ofdev; dev->blist = blist; - SET_MODULE_OWNER(ndev); SET_NETDEV_DEV(ndev, &ofdev->dev); /* Initialize some embedded data structures */ From rdreier at cisco.com Tue Oct 9 15:48:56 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 
15:48:56 -0700
Subject: [ofa-general] [PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device
In-Reply-To: (Roland Dreier's message of "Tue, 09 Oct 2007 15:46:13 -0700")
References: <20071009.042441.30182968.davem@davemloft.net> <20071009.135145.95506679.davem@davemloft.net>
Message-ID:

Commit da3dedd9 ("[NET]: Make NAPI polling independent of struct
net_device objects.") changed the interface to NAPI polling. Fix up
the ibm_newemac driver so that it works with this new interface. This
is actually a nice cleanup because ibm_newemac is one of the drivers
that wants to have multiple NAPI structures for a single net_device.

Compile-tested only as I don't have a system that uses the ibm_newemac
driver. This conversion follows the conversion for the ibm_emac driver,
which was tested on real PowerPC 440SPe hardware.

Signed-off-by: Roland Dreier
---
 drivers/net/ibm_newemac/mal.c | 55 ++++++++++++++--------------------------
 drivers/net/ibm_newemac/mal.h | 2 +-
 2 files changed, 20 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ibm_newemac/mal.c b/drivers/net/ibm_newemac/mal.c
index c4335b7..5885411 100644
--- a/drivers/net/ibm_newemac/mal.c
+++ b/drivers/net/ibm_newemac/mal.c
@@ -235,10 +235,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance)
 
 static inline void mal_schedule_poll(struct mal_instance *mal)
 {
-	if (likely(netif_rx_schedule_prep(&mal->poll_dev))) {
+	if (likely(napi_schedule_prep(&mal->napi))) {
 		MAL_DBG2(mal, "schedule_poll" NL);
 		mal_disable_eob_irq(mal);
-		__netif_rx_schedule(&mal->poll_dev);
+		__napi_schedule(&mal->napi);
 	} else
 		MAL_DBG2(mal, "already in poll" NL);
 }
@@ -318,8 +318,7 @@ void mal_poll_disable(struct mal_instance *mal, struct mal_commac *commac)
 		msleep(1);
 
 	/* Synchronize with the MAL NAPI poller.
*/ - while (test_bit(__LINK_STATE_RX_SCHED, &mal->poll_dev.state)) - msleep(1); + napi_disable(&mal->napi); } void mal_poll_enable(struct mal_instance *mal, struct mal_commac *commac) @@ -330,11 +329,11 @@ void mal_poll_enable(struct mal_instance *mal, struct mal_commac *commac) // XXX might want to kick a poll now... } -static int mal_poll(struct net_device *ndev, int *budget) +static int mal_poll(struct napi_struct *napi, int budget) { - struct mal_instance *mal = netdev_priv(ndev); + struct mal_instance *mal = container_of(napi, struct mal_instance, napi); struct list_head *l; - int rx_work_limit = min(ndev->quota, *budget), received = 0, done; + int received = 0; unsigned long flags; MAL_DBG2(mal, "poll(%d) %d ->" NL, *budget, @@ -358,26 +357,21 @@ static int mal_poll(struct net_device *ndev, int *budget) int n; if (unlikely(test_bit(MAL_COMMAC_POLL_DISABLED, &mc->flags))) continue; - n = mc->ops->poll_rx(mc->dev, rx_work_limit); + n = mc->ops->poll_rx(mc->dev, budget); if (n) { received += n; - rx_work_limit -= n; - if (rx_work_limit <= 0) { - done = 0; - // XXX What if this is the last one ? - goto more_work; - } + budget -= n; + if (budget <= 0) + goto more_work; // XXX What if this is the last one ? 
} } /* We need to disable IRQs to protect from RXDE IRQ here */ spin_lock_irqsave(&mal->lock, flags); - __netif_rx_complete(ndev); + __napi_complete(napi); mal_enable_eob_irq(mal); spin_unlock_irqrestore(&mal->lock, flags); - done = 1; - /* Check for "rotting" packet(s) */ list_for_each(l, &mal->poll_list) { struct mal_commac *mc = @@ -387,12 +381,12 @@ static int mal_poll(struct net_device *ndev, int *budget) if (unlikely(mc->ops->peek_rx(mc->dev) || test_bit(MAL_COMMAC_RX_STOPPED, &mc->flags))) { MAL_DBG2(mal, "rotting packet" NL); - if (netif_rx_reschedule(ndev, received)) + if (napi_reschedule(napi)) mal_disable_eob_irq(mal); else MAL_DBG2(mal, "already in poll list" NL); - if (rx_work_limit > 0) + if (budget > 0) goto again; else goto more_work; @@ -401,13 +395,8 @@ static int mal_poll(struct net_device *ndev, int *budget) } more_work: - ndev->quota -= received; - *budget -= received; - - MAL_DBG2(mal, "poll() %d <- %d" NL, *budget, - done ? 0 : 1); - - return done ? 0 : 1; + MAL_DBG2(mal, "poll() %d <- %d" NL, budget, received); + return received; } static void mal_reset(struct mal_instance *mal) @@ -538,11 +527,8 @@ static int __devinit mal_probe(struct of_device *ofdev, } INIT_LIST_HEAD(&mal->poll_list); - set_bit(__LINK_STATE_START, &mal->poll_dev.state); - mal->poll_dev.weight = CONFIG_IBM_NEW_EMAC_POLL_WEIGHT; - mal->poll_dev.poll = mal_poll; - mal->poll_dev.priv = mal; - atomic_set(&mal->poll_dev.refcnt, 1); + mal->napi.weight = CONFIG_IBM_NEW_EMAC_POLL_WEIGHT; + mal->napi.poll = mal_poll; INIT_LIST_HEAD(&mal->list); spin_lock_init(&mal->lock); @@ -653,11 +639,8 @@ static int __devexit mal_remove(struct of_device *ofdev) MAL_DBG(mal, "remove" NL); - /* Syncronize with scheduled polling, - stolen from net/core/dev.c:dev_close() - */ - clear_bit(__LINK_STATE_START, &mal->poll_dev.state); - netif_poll_disable(&mal->poll_dev); + /* Synchronize with scheduled polling */ + napi_disable(&mal->napi); if (!list_empty(&mal->list)) { /* This is *very* bad */ diff 
--git a/drivers/net/ibm_newemac/mal.h b/drivers/net/ibm_newemac/mal.h index 57b69dc..cb1a16d 100644 --- a/drivers/net/ibm_newemac/mal.h +++ b/drivers/net/ibm_newemac/mal.h @@ -197,7 +197,7 @@ struct mal_instance { int serr_irq; /* MAL System Error IRQ */ struct list_head poll_list; - struct net_device poll_dev; + struct napi_struct napi; struct list_head list; u32 tx_chan_mask; From mshefty at ichips.intel.com Tue Oct 9 15:51:27 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 09 Oct 2007 15:51:27 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <46F05476.4090809@linux.vnet.ibm.com> References: <46F05476.4090809@linux.vnet.ibm.com> Message-ID: <470C05EF.90409@ichips.intel.com> I skipped over most of the code restructuring comments and focus mainly on design or issues. (Although code restructuring patches tend not to be written or easily accepted unless they fix a bug, and I would personally like to see at least some of the ones previously mentioned addressed before this code is merged. The ones listed below should be trivial to incorporate before merging.) > This version incorporates some of Sean's comments, especially > relating to locking. > > Sean's comments regarding module parameters, code restructure, > ipoib_cm_rx state and the like will require more discussion and > subsequent testing. They will be addressed with additional set > of patches later on. > > This patch has been tested with linux-2.6.23-rc5 derived from Roland's > for-2.6.24 git tree on ppc64 machines using IBM HCA. 
> > Signed-off-by: Pradeep Satyanarayana > --- > > --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-31 12:14:30.000000000 -0500 > +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-18 14:31:07.000000000 -0500 > @@ -95,11 +95,13 @@ enum { > IPOIB_MCAST_FLAG_ATTACHED = 3, > }; > > +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) > #define IPOIB_OP_RECV (1ul << 31) > + > #ifdef CONFIG_INFINIBAND_IPOIB_CM > -#define IPOIB_CM_OP_SRQ (1ul << 30) > +#define IPOIB_CM_OP_RECV (1ul << 30) > #else > -#define IPOIB_CM_OP_SRQ (0) > +#define IPOIB_CM_OP_RECV (0) > #endif > > /* structs */ > @@ -166,11 +168,14 @@ enum ipoib_cm_state { > }; > > struct ipoib_cm_rx { > - struct ib_cm_id *id; > - struct ib_qp *qp; > - struct list_head list; > - struct net_device *dev; > - unsigned long jiffies; > + struct ib_cm_id *id; > + struct ib_qp *qp; > + struct ipoib_cm_rx_buf *rx_ring; /* Used by no srq only */ > + struct list_head list; > + struct net_device *dev; > + unsigned long jiffies; > + u32 index; /* wr_ids are distinguished by index > + * to identify the QP -no srq only */ > enum ipoib_cm_state state; > }; > > @@ -215,6 +220,8 @@ struct ipoib_cm_dev_priv { > struct ib_wc ibwc[IPOIB_NUM_WC]; > struct ib_sge rx_sge[IPOIB_CM_RX_SG]; > struct ib_recv_wr rx_wr; > + struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() > + *for usage of this element */ > }; > > /* > @@ -438,6 +445,7 @@ void ipoib_drain_cq(struct net_device *d > /* We don't support UC connections at the moment */ > #define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC)) > > +extern int max_rc_qp; > static inline int ipoib_cm_admin_enabled(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > --- a/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-31 12:14:30.000000000 -0500 > +++ b/linux-2.6.23-rc5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-18 17:04:06.000000000 -0500 > @@ -49,6 +49,18 @@ 
MODULE_PARM_DESC(cm_data_debug_level, > > #include "ipoib.h" > > +int max_rc_qp = 128; > +static int max_recv_buf = 1024; /* Default is 1024 MB */ > + > +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0444); > +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of no srq RC QPs supported; must be a power of 2"); I thought you were going to remove the power of 2 restriction. And to re-start this discussion, I think we should separate the maximum number of QPs from whether we use SRQ, and let the QP type (UD, UC, RC) be controllable. Smaller clusters may perform better without using SRQ, even if it is available. And supporting UC versus RC seems like it should only take a few additional lines of code. > +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); > +MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB"); Based on your response to my feedback, it sounds like the only reason we're keeping this parameter around is in case the admin sets some of the other values (max QPs, message size, RQ size) incorrectly. I agree with Roland that we need to come up with the correct user interface here, and I'm not convinced that what we have is the most adaptable for where the code could go. What about replacing the 2 proposed parameters with these 3? qp_type - ud, uc, rc use_srq - yes/no (default if available) max_conn_qp - uc or rc limit > + > +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for no srq */ > + > +#define NOSRQ_INDEX_MASK (max_rc_qp -1) Just reserve lower bits of the wr_id for the rx_table to avoid the power of 2 restriction. 
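A minimal user-space sketch of the suggestion above, not taken from the patch: if a fixed number of low wr_id bits is reserved for the rx_index_table slot, the mask no longer depends on max_rc_qp, so the QP limit need not be a power of 2. The names NOSRQ_INDEX_BITS, nosrq_make_wr_id(), nosrq_wr_id_index() and nosrq_wr_id_slot() are hypothetical, and the choice of 16 bits is an assumption. The ring slot stays at bit 32 and above, as in the posted patch (i << 32 | index), which leaves bits 30 and 31 free for the IPOIB_CM_OP_RECV and IPOIB_OP_RECV flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: low wr_id bits reserved for the rx_index_table slot. */
#define NOSRQ_INDEX_BITS 16
#define NOSRQ_INDEX_MASK ((1u << NOSRQ_INDEX_BITS) - 1)	/* fixed, 0xffff */

/* Flag bits at the positions used in the quoted ipoib.h hunk
 * (written with ull here so the sketch is portable in user space). */
#define IPOIB_OP_RECV    (1ull << 31)
#define IPOIB_CM_OP_RECV (1ull << 30)

/* Build a wr_id from a receive ring slot and a table index. */
static uint64_t nosrq_make_wr_id(uint32_t slot, uint32_t index)
{
	assert(index <= NOSRQ_INDEX_MASK);	/* index must fit the reserved bits */
	return ((uint64_t)slot << 32) | index | IPOIB_CM_OP_RECV;
}

/* Recover the rx_index_table index from a completed wr_id. */
static uint32_t nosrq_wr_id_index(uint64_t wr_id)
{
	return (uint32_t)(wr_id & NOSRQ_INDEX_MASK);
}

/* Recover the receive ring slot from a completed wr_id. */
static uint32_t nosrq_wr_id_slot(uint64_t wr_id)
{
	return (uint32_t)(wr_id >> 32);
}
```

For comparison, with max_rc_qp = 100 the patch's mask max_rc_qp - 1 = 99 (binary 1100011) would map index 4 to 0, while the fixed mask leaves every index below 65536 intact.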
> #define IPOIB_CM_IETF_ID 0x1000000000000000ULL > > #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) > @@ -81,20 +93,21 @@ static void ipoib_cm_dma_unmap_rx(struct > ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); > } > > -static int ipoib_cm_post_receive(struct net_device *dev, int id) > +static int post_receive_srq(struct net_device *dev, u64 id) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ib_recv_wr *bad_wr; > int i, ret; > > - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; > + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; > > for (i = 0; i < IPOIB_CM_RX_SG; ++i) > priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; > > ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); > if (unlikely(ret)) { > - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); > + ipoib_warn(priv, "post srq failed for buf %lld (%d)\n", > + (unsigned long long)id, ret); > ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, > priv->cm.srq_ring[id].mapping); > dev_kfree_skb_any(priv->cm.srq_ring[id].skb); > @@ -104,12 +117,47 @@ static int ipoib_cm_post_receive(struct > return ret; > } > > -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, > +static int post_receive_nosrq(struct net_device *dev, u64 id) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ib_recv_wr *bad_wr; > + int i, ret; > + u32 index; > + u32 wr_id; > + struct ipoib_cm_rx *rx_ptr; > + > + index = id & NOSRQ_INDEX_MASK; > + wr_id = id >> 32; > + > + rx_ptr = priv->cm.rx_index_table[index]; > + > + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; > + > + for (i = 0; i < IPOIB_CM_RX_SG; ++i) > + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; > + > + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); > + if (unlikely(ret)) { > + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", > + wr_id, ret); > + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, > + rx_ptr->rx_ring[wr_id].mapping); > + 
dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); > + rx_ptr->rx_ring[wr_id].skb = NULL; > + } > + > + return ret; > +} > + > +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, > + int frags, > u64 mapping[IPOIB_CM_RX_SG]) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct sk_buff *skb; > int i; > + struct ipoib_cm_rx *rx_ptr; > + u32 index, wr_id; > > skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); > if (unlikely(!skb)) > @@ -141,7 +189,14 @@ static struct sk_buff *ipoib_cm_alloc_rx > goto partial_error; > } > > - priv->cm.srq_ring[id].skb = skb; > + if (priv->cm.srq) > + priv->cm.srq_ring[id].skb = skb; > + else { > + index = id & NOSRQ_INDEX_MASK; > + wr_id = id >> 32; > + rx_ptr = priv->cm.rx_index_table[index]; > + rx_ptr->rx_ring[wr_id].skb = skb; > + } > return skb; > > partial_error: > @@ -203,11 +258,14 @@ static struct ib_qp *ipoib_cm_create_rx_ > .recv_cq = priv->cq, > .srq = priv->cm.srq, > .cap.max_send_wr = 1, /* For drain WR */ > + .cap.max_recv_wr = ipoib_recvq_size + 1, > .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ > .sq_sig_type = IB_SIGNAL_ALL_WR, > .qp_type = IB_QPT_RC, > .qp_context = p, > }; > + if (!priv->cm.srq) > + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; We can still toss this check. 
> return ib_create_qp(priv->pd, &attr); > } > > @@ -281,12 +339,131 @@ static int ipoib_cm_send_rep(struct net_ > rep.private_data_len = sizeof data; > rep.flow_control = 0; > rep.rnr_retry_count = req->rnr_retry_count; > - rep.srq = 1; > rep.qp_num = qp->qp_num; > rep.starting_psn = psn; > + rep.srq = !!priv->cm.srq; > return ib_send_cm_rep(cm_id, &rep); > } > > +static void init_context_and_add_list(struct ib_cm_id *cm_id, > + struct ipoib_cm_rx *p, > + struct ipoib_dev_priv *priv) > +{ > + cm_id->context = p; > + p->jiffies = jiffies; > + spin_lock_irq(&priv->lock); > + if (list_empty(&priv->cm.passive_ids)) > + queue_delayed_work(ipoib_workqueue, > + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > + if (priv->cm.srq) { > + /* Add this entry to passive ids list head, but do not re-add > + * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush > + * list. > + */ > + if (p->state == IPOIB_CM_RX_LIVE) > + list_move(&p->list, &priv->cm.passive_ids); > + } > + spin_unlock_irq(&priv->lock); > +} > + > +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, > + struct ipoib_cm_rx *p, unsigned psn) > +{ > + struct net_device *dev = cm_id->context; > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + int ret; > + u32 index; > + u64 i, recv_mem_used; > + > + /* In the SRQ case there is a common rx buffer called the srq_ring. > + * However, for the no srq case we create an rx_ring for every > + * struct ipoib_cm_rx. > + */ > + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL); > + if (!p->rx_ring) { > + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", > + p->qp->qp_num); > + return -ENOMEM; > + } > + > + spin_lock_irq(&priv->lock); > + list_add(&p->list, &priv->cm.passive_ids); > + spin_unlock_irq(&priv->lock); > + > + init_context_and_add_list(cm_id, p, priv); > + spin_lock_irq(&priv->lock); Just to avoid any possible races, how about just holding the lock throughout and remove acquiring it in init_context_and_add_list()? 
> + > + for (index = 0; index < max_rc_qp; index++) > + if (priv->cm.rx_index_table[index] == NULL) > + break; > + > + recv_mem_used = (u64)ipoib_recvq_size * > + (u64)atomic_inc_return(&current_rc_qp) * CM_PACKET_SIZE; > + if ((index == max_rc_qp) || > + (recv_mem_used >= max_recv_buf * (1ul << 20))) { > + spin_unlock_irq(&priv->lock); > + ipoib_warn(priv, "no srq has reached the configurable limit " > + "of either %d RC QPs or, max recv buf size of " > + "0x%x MB\n", max_rc_qp, max_recv_buf); > + > + /* We send a REJ to the remote side indicating that we > + * have no more free RC QPs and leave it to the remote side > + * to take appropriate action. This should leave the > + * current set of QPs unaffected and any subsequent REQs > + * will be able to use RC QPs if they are available. > + */ > + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); > + ret = -EINVAL; > + goto err_alloc_and_post; > + } > + > + priv->cm.rx_index_table[index] = p; > + > + /* We will subsequently use this stored pointer while freeing > + * resources in stale task > + */ > + p->index = index; > + spin_unlock_irq(&priv->lock); > + > + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); > + if (ret) { > + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); > + ipoib_cm_dev_cleanup(dev); > + goto err_alloc_and_post; > + } > + > + for (i = 0; i < ipoib_recvq_size; ++i) { > + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, > + IPOIB_CM_RX_SG - 1, > + p->rx_ring[i].mapping)) { > + ipoib_warn(priv, "failed to allocate receive " > + "buffer %d\n", (int)i); > + ipoib_cm_dev_cleanup(dev); > + ret = -ENOMEM; > + goto err_alloc_and_post; > + } > + > + ret = post_receive_nosrq(dev, i << 32 | index); > + if (ret) { > + ipoib_warn(priv, "post_receive_nosrq " > + "failed for buf %lld\n", (unsigned long long)i); > + ipoib_cm_dev_cleanup(dev); > + ret = -EIO; > + goto err_alloc_and_post; > + } > + } > + > + return 0; > + > +err_alloc_and_post: > + atomic_dec(&current_rc_qp); > + kfree(p->rx_ring);
> + spin_lock_irq(&priv->lock); > + list_del_init(&p->list); > + spin_unlock_irq(&priv->lock); > + return ret; > +} > + > static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) > { > struct net_device *dev = cm_id->context; > @@ -301,9 +478,6 @@ static int ipoib_cm_req_handler(struct i > return -ENOMEM; > p->dev = dev; > p->id = cm_id; > - cm_id->context = p; > - p->state = IPOIB_CM_RX_LIVE; > - p->jiffies = jiffies; > INIT_LIST_HEAD(&p->list); > > p->qp = ipoib_cm_create_rx_qp(dev, p); > @@ -313,19 +487,21 @@ static int ipoib_cm_req_handler(struct i > } > > psn = random32() & 0xffffff; > - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); > - if (ret) > - goto err_modify; > + if (!priv->cm.srq) { > + ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); > + if (ret) > + goto err_modify; > + } else { > + p->rx_ring = NULL; > + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); > + if (ret) > + goto err_modify; > + } > > - spin_lock_irq(&priv->lock); > - queue_delayed_work(ipoib_workqueue, > - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > - /* Add this entry to passive ids list head, but do not re-add it > - * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ > - p->jiffies = jiffies; > - if (p->state == IPOIB_CM_RX_LIVE) > - list_move(&p->list, &priv->cm.passive_ids); > - spin_unlock_irq(&priv->lock); > + if (priv->cm.srq) { > + p->state = IPOIB_CM_RX_LIVE; > + init_context_and_add_list(cm_id, p, priv); > + } Merge this if() statement with the if() immediately above it. 
> > ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); > if (ret) { > @@ -398,29 +574,60 @@ static void skb_put_frags(struct sk_buff > } > } > > -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > +static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) > +{ > + unsigned long flags; > + > + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { > + spin_lock_irqsave(&priv->lock, flags); > + p->jiffies = jiffies; > + /* Move this entry to list head, but do > + * not re-add it if it has been removed. > + */ > + if (p->state == IPOIB_CM_RX_LIVE) > + list_move(&p->list, &priv->cm.passive_ids); > + spin_unlock_irqrestore(&priv->lock, flags); > + } > +} > + > +static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) > +{ > + unsigned long flags; > + > + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { > + spin_lock_irqsave(&priv->lock, flags); > + p->jiffies = jiffies; > + /* Move this entry to list head, but do > + * not re-add it if it has been removed. 
*/ > + if (!list_empty(&p->list)) > + list_move(&p->list, &priv->cm.passive_ids); > + spin_unlock_irqrestore(&priv->lock, flags); > + } > +} > + > +void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; > + u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV; > struct sk_buff *skb, *newskb; > struct ipoib_cm_rx *p; > unsigned long flags; > u64 mapping[IPOIB_CM_RX_SG]; > - int frags; > + int frags, ret; > > - ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", > - wr_id, wc->status); > + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", > + (unsigned long long)wr_id, wc->status); > > if (unlikely(wr_id >= ipoib_recvq_size)) { > - if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { > + if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) { > spin_lock_irqsave(&priv->lock, flags); > list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); > ipoib_cm_start_rx_drain(priv); > queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); > spin_unlock_irqrestore(&priv->lock, flags); > } else > - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", > - wr_id, ipoib_recvq_size); > + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", > + (unsigned long long)wr_id, ipoib_recvq_size); > return; > } > > @@ -428,23 +635,15 @@ void ipoib_cm_handle_rx_wc(struct net_de > > if (unlikely(wc->status != IB_WC_SUCCESS)) { > ipoib_dbg(priv, "cm recv error " > - "(status=%d, wrid=%d vend_err %x)\n", > - wc->status, wr_id, wc->vendor_err); > + "(status=%d, wrid=%lld vend_err %x)\n", > + wc->status, (unsigned long long)wr_id, wc->vendor_err); > ++priv->stats.rx_dropped; > - goto repost; > + goto repost_srq; > } > > if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { > p = wc->qp->qp_context; > - if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { > - spin_lock_irqsave(&priv->lock, flags); > - 
p->jiffies = jiffies; > - /* Move this entry to list head, but do not re-add it > - * if it has been moved out of list. */ > - if (p->state == IPOIB_CM_RX_LIVE) > - list_move(&p->list, &priv->cm.passive_ids); > - spin_unlock_irqrestore(&priv->lock, flags); > - } > + timer_check_srq(priv, p); > } > > frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, > @@ -456,13 +655,109 @@ void ipoib_cm_handle_rx_wc(struct net_de > * If we can't allocate a new RX buffer, dump > * this packet and reuse the old buffer. > */ > - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); > + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", > + (unsigned long long)wr_id); > + ++priv->stats.rx_dropped; > + goto repost_srq; > + } > + > + ipoib_cm_dma_unmap_rx(priv, frags, > + priv->cm.srq_ring[wr_id].mapping); > + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, > + (frags + 1) * sizeof *mapping); > + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", > + wc->byte_len, wc->slid); > + > + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); > + > + skb->protocol = ((struct ipoib_header *) skb->data)->proto; > + skb_reset_mac_header(skb); > + skb_pull(skb, IPOIB_ENCAP_LEN); > + > + dev->last_rx = jiffies; > + ++priv->stats.rx_packets; > + priv->stats.rx_bytes += skb->len; > + > + skb->dev = dev; > + /* XXX get correct PACKET_ type here */ > + skb->pkt_type = PACKET_HOST; > + netif_receive_skb(skb); > + > +repost_srq: > + ret = post_receive_srq(dev, wr_id); > + > + if (unlikely(ret)) > + ipoib_warn(priv, "post_receive_srq failed for buf %lld\n", > + (unsigned long long)wr_id); > + > +} > + > +static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct sk_buff *skb, *newskb; > + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32; > + u32 index; > + struct ipoib_cm_rx *rx_ptr; > + int frags, ret; > + > + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", > + 
(unsigned long long)wr_id, wc->status); > + > + if (unlikely(wr_id >= ipoib_recvq_size)) { > + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", > + (unsigned long long)wr_id, ipoib_recvq_size); > + return; > + } > + > + index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK; > + > + /* This is the only place where rx_ptr could be a NULL - could > + * have just received a packet from a connection that has become > + * stale and so is going away. We will simply drop the packet and > + * let the remote end handle the dropped packet. > + * In the timer_check() function below, p->jiffies is updated and > + * hence the connection will not be stale after that. > + */ > + rx_ptr = priv->cm.rx_index_table[index]; > + if (unlikely(!rx_ptr)) { > + ipoib_warn(priv, "Received packet from a connection " > + "that is going away. Remote end will handle it.\n"); > + return; > + } I thought we could remove this check and the comment above it. It's misleading to keep them around. - Sean From rdreier at cisco.com Tue Oct 9 15:51:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 15:51:59 -0700 Subject: [ofa-general] [PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device In-Reply-To: (Roland Dreier's message of "Tue, 09 Oct 2007 15:48:56 -0700") References: <20071009.042441.30182968.davem@davemloft.net> <20071009.135145.95506679.davem@davemloft.net> Message-ID: Sorry... wrong subject here; it should have been "ibm_newemac: ..." - R. 
From davem at davemloft.net Tue Oct 9 16:17:50 2007 From: davem at davemloft.net (David Miller) Date: Tue, 09 Oct 2007 16:17:50 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 1/4] IPoIB: Fix unused variable warning In-Reply-To: References: <20071009.135145.95506679.davem@davemloft.net> Message-ID: <20071009.161750.02298910.davem@davemloft.net> From: Roland Dreier Date: Tue, 09 Oct 2007 15:46:13 -0700 > The conversion to use netdevice internal stats left an unused variable > in ipoib_neigh_free(), since there's no longer any reason to get > netdev_priv() in order to increment dropped packets. Delete the > unused priv variable. > > Signed-off-by: Roland Dreier Jeff, do you want to merge in Roland's 4 patches to your tree then do a sync with me so I can pull it all in from you? Alternatively I can merge in Roland's work directly if that's easier for you. Just let me know. From pradeeps at linux.vnet.ibm.com Tue Oct 9 16:42:13 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 09 Oct 2007 16:42:13 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470C05EF.90409@ichips.intel.com> References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> Message-ID: <470C11D5.7060903@linux.vnet.ibm.com> Sean, Roland, I looked through Sean's latest comments. Yes, they are fairly easy to fix and I will fix them. The only one that might need some debate is the one associated with module parameters. In previous communications with Roland I got the impression that he wants to keep them (module parameters) to a minimum. So, how do we address that now? Last time around (after Sean's comments) I just addressed the bugs and skipped the rest since I had no idea as to how much time I had for the merge. These days I do not have exclusive access to the machines with IB adapters, which limits the work I can do at a stretch. How much time do I have before this gets merged into the 2.6.24 tree?
Other than the module parameters one I should be able to address the rest either by this evening (west coast US) or maybe in the morning/afternoon tomorrow. Will that be acceptable? Pradeep From davem at davemloft.net Tue Oct 9 17:04:35 2007 From: davem at davemloft.net (David Miller) Date: Tue, 09 Oct 2007 17:04:35 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1191967006.5324.14.camel@localhost> References: <20071009135340.33e5922c@freepuppy.rosehill> <20071009.142235.74385364.davem@davemloft.net> <1191967006.5324.14.camel@localhost> Message-ID: <20071009.170435.43504422.davem@davemloft.net> From: jamal Date: Tue, 09 Oct 2007 17:56:46 -0400 > if the h/ware queues are full because of link pressure etc, you drop. We > drop today when the s/ware queues are full. The driver txmit lock takes > place of the qdisc queue lock etc. I am assuming there is still need for > that locking. The filter/classification scheme still works as is and > select classes which map to rings. tc still works as is etc. I understand your suggestion. We have to keep in mind, however, that the sw queue right now is 1000 packets. I heavily discourage any driver author to try and use any single TX queue of that size. Which means that just dropping on back pressure might not work so well. Or it might be perfect and signal TCP to back off, who knows! :-) While working out this issue in my mind, it occurred to me that we can put the sw queue into the driver as well. The idea is that the network stack, as in the pure hw queue scheme, unconditionally always submits new packets to the driver. Therefore even if the hw TX queue is full, the driver can still queue to an internal sw queue with some limit (say 1000 for ethernet, as is used now). When the hw TX queue gains space, the driver self-batches packets from the sw queue to the hw queue. It sort of obviates the need for mid-level queue batching in the generic networking.
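A minimal user-space sketch of the scheme described above, under stated assumptions: the stack always hands the packet to the driver, the driver parks it on a bounded internal software queue when the hardware TX ring is full, and it refills the ring from that queue when completions free slots. All names (struct txq, txq_xmit(), txq_tx_complete()) and the sizes are hypothetical, packets are plain integers, and locking is omitted entirely; this is not code from any real driver.

```c
#include <assert.h>
#include <stddef.h>

#define HW_RING_SIZE 4	/* hypothetical, deliberately small hw TX ring */
#define SW_QUEUE_MAX 8	/* hypothetical limit (the mail suggests 1000) */

struct txq {
	size_t hw_used;			/* slots currently owned by hardware */
	int sw[SW_QUEUE_MAX];		/* driver-internal FIFO of packet ids */
	size_t sw_head, sw_len;
	unsigned long dropped;
};

/* Stack -> driver. Returns 0 if the packet was accepted (posted to the
 * hw ring or parked on the sw queue), -1 if the sw queue was full too. */
static int txq_xmit(struct txq *q, int pkt)
{
	/* Post directly only when nothing is waiting, to keep ordering. */
	if (q->sw_len == 0 && q->hw_used < HW_RING_SIZE) {
		q->hw_used++;
		return 0;
	}
	if (q->sw_len == SW_QUEUE_MAX) {
		q->dropped++;
		return -1;
	}
	q->sw[(q->sw_head + q->sw_len) % SW_QUEUE_MAX] = pkt;
	q->sw_len++;
	return 0;
}

/* TX completion: hardware released `done` slots. The driver self-batches
 * waiting packets into the freed slots and returns how many it moved. */
static size_t txq_tx_complete(struct txq *q, size_t done)
{
	size_t moved = 0;

	assert(done <= q->hw_used);
	q->hw_used -= done;
	while (q->sw_len > 0 && q->hw_used < HW_RING_SIZE) {
		q->sw_head = (q->sw_head + 1) % SW_QUEUE_MAX;
		q->sw_len--;
		q->hw_used++;
		moved++;
	}
	return moved;
}
```

The drop now happens only when both the ring and the internal queue are full, which corresponds to the "some limit" mentioned in the mail.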
Compared to letting the driver self-batch, the mid-level batching approach is pure overhead. We seem to be sort of all mentioning similar ideas. For example, you can get the above kind of scheme today by using a mid-level queue length of zero, and I believe this idea was mentioned by Stephen Hemminger earlier. I may experiment with this in the NIU driver.
From rdreier at cisco.com Tue Oct 9 17:20:06 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 17:20:06 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470C05EF.90409@ichips.intel.com> (Sean Hefty's message of "Tue, 09 Oct 2007 15:51:27 -0700") References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> Message-ID: > And to re-start this discussion, I think we should separate the > maximum number of QPs from whether we use SRQ, and let the QP type > (UD, UC, RC) be controllable. Smaller clusters may perform better > without using SRQ, even if it is available. And supporting UC versus > RC seems like it should only take a few additional lines of code. Actually supporting UC is trickier than it seems, at least for the SRQ case, since attaching UC QPs to an SRQ requires that the IB spec be extended to allow that (and also define some semantics for how to handle messages that encounter an error in the middle of being received, after a work request has been taken from the SRQ). > I agree with Roland that we need to come up with the correct user > interface here, and I'm not convinced that what we have is the most > adaptable for where the code could go. What about replacing the 2 > proposed parameters with these 3? > > qp_type - ud, uc, rc > use_srq - yes/no (default if available) > max_conn_qp - uc or rc limit I don't think we want the qp_type to be a module parameter -- it seems we already have ud vs. rc handled via the parameter that enables connected mode, and if we want to enable uc we should do that in a similar per-interface way. Similarly if there's any point to making use_srq something that can be controlled, ideally it should be per-interface. But this could be tricky because it may be hard to change at runtime. (Ideally max_conn_qp would be per-interface too but that seems too hard as well) I do agree that the memory limit just seems arbitrary and we can probably do away with that.
- R. From jeff at garzik.org Tue Oct 9 17:32:24 2007 From: jeff at garzik.org (Jeff Garzik) Date: Tue, 09 Oct 2007 20:32:24 -0400 Subject: [ofa-general] Re: [PATCH 1/4] IPoIB: Fix unused variable warning In-Reply-To: <20071009.161750.02298910.davem@davemloft.net> References: <20071009.135145.95506679.davem@davemloft.net> <20071009.161750.02298910.davem@davemloft.net> Message-ID: <470C1D98.6010308@garzik.org> David Miller wrote: > From: Roland Dreier > Date: Tue, 09 Oct 2007 15:46:13 -0700 > >> The conversion to use netdevice internal stats left an unused variable >> in ipoib_neigh_free(), since there's no longer any reason to get >> netdev_priv() in order to increment dropped packets. Delete the >> unused priv variable. >> >> Signed-off-by: Roland Dreier > > Jeff, do you want to merge in Roland's 4 patches to your tree then do > a sync with me so I can pull it all in from you? Grabbing them now... From andi at firstfloor.org Tue Oct 9 17:37:16 2007 From: andi at firstfloor.org (Andi Kleen) Date: Wed, 10 Oct 2007 02:37:16 +0200 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009.170435.43504422.davem@davemloft.net> References: <20071009135340.33e5922c@freepuppy.rosehill> <20071009.142235.74385364.davem@davemloft.net> <1191967006.5324.14.camel@localhost> <20071009.170435.43504422.davem@davemloft.net> Message-ID: <20071010003716.GB552@one.firstfloor.org> On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote: > We have to keep in mind, however, that the sw queue right now is 1000 > packets. I heavily discourage any driver author to try and use any > single TX queue of that size. Why would you discourage them? If 1000 is ok for a software queue why would it not be ok for a hardware queue? > Which means that just dropping on back > pressure might not work so well. > > Or it might be perfect and signal TCP to backoff, who knows! :-) 1000 packets is a lot. I don't have hard data, but gut feeling is less would also do. 
And if the hw queues are not enough a better scheme might be to just manage this in the sockets in sendmsg. e.g. provide a wait queue that drivers can wake up and let them block on more queue. > The idea is that the network stack, as in the pure hw queue scheme, > unconditionally always submits new packets to the driver. Therefore > even if the hw TX queue is full, the driver can still queue to an > internal sw queue with some limit (say 1000 for ethernet, as is used > now). > > > When the hw TX queue gains space, the driver self-batches packets > from the sw queue to the hw queue. I don't really see the advantage over the qdisc in that scheme. It's certainly not simpler and probably more code and would likely also not require less locks (e.g. a currently lockless driver would need a new lock for its sw queue). Also it is unclear to me it would be really any faster. -Andi From jeff at garzik.org Tue Oct 9 17:47:28 2007 From: jeff at garzik.org (Jeff Garzik) Date: Tue, 09 Oct 2007 20:47:28 -0400 Subject: [ofa-general] Re: [PATCH 1/4] IPoIB: Fix unused variable warning In-Reply-To: References: <20071009.042441.30182968.davem@davemloft.net> <20071009.135145.95506679.davem@davemloft.net> Message-ID: <470C2120.30608@garzik.org> Roland Dreier wrote: > The conversion to use netdevice internal stats left an unused variable > in ipoib_neigh_free(), since there's no longer any reason to get > netdev_priv() in order to increment dropped packets. Delete the > unused priv variable. 
> > Signed-off-by: Roland Dreier > --- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 1 - > 1 files changed, 0 insertions(+), 1 deletions(-) applied 1-4 From davem at davemloft.net Tue Oct 9 17:50:25 2007 From: davem at davemloft.net (David Miller) Date: Tue, 09 Oct 2007 17:50:25 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010003716.GB552@one.firstfloor.org> References: <1191967006.5324.14.camel@localhost> <20071009.170435.43504422.davem@davemloft.net> <20071010003716.GB552@one.firstfloor.org> Message-ID: <20071009.175025.59469417.davem@davemloft.net> From: Andi Kleen Date: Wed, 10 Oct 2007 02:37:16 +0200 > On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote: > > We have to keep in mind, however, that the sw queue right now is 1000 > > packets. I heavily discourage any driver author to try and use any > > single TX queue of that size. > > Why would you discourage them? > > If 1000 is ok for a software queue why would it not be ok > for a hardware queue? Because with the software queue, you aren't accessing 1000 slots shared with the hardware device which does shared-ownership transactions on those L2 cache lines with the cpu. Long ago I did a test on gigabit on a cpu with only 256K of L2 cache. Using a smaller TX queue make things go faster, and it's exactly because of these L2 cache effects. > 1000 packets is a lot. I don't have hard data, but gut feeling > is less would also do. I'll try to see how backlogged my 10Gb tests get when a strong sender is sending to a weak receiver. > And if the hw queues are not enough a better scheme might be to > just manage this in the sockets in sendmsg. e.g. provide a wait queue that > drivers can wake up and let them block on more queue. TCP does this already, but it operates in a lossy manner. > I don't really see the advantage over the qdisc in that scheme. > It's certainly not simpler and probably more code and would likely > also not require less locks (e.g. 
a currently lockless driver > would need a new lock for its sw queue). Also it is unclear to me > it would be really any faster. You still need a lock to guard hw TX enqueue from hw TX reclaim. A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you increase the size much more performance starts to go down due to L2 cache thrashing. From pradeeps at linux.vnet.ibm.com Tue Oct 9 18:19:19 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 09 Oct 2007 18:19:19 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> Message-ID: <470C2897.1010105@linux.vnet.ibm.com> > > I do agree that the memory limit just seems arbitrary and we can > probably do away with that. We discussed this previously and had agreed upon limiting the memory foot print to 1GB by default. This module parameter was for larger systems that had plenty of memory and could afford to use more. This way the sys admin could increase the limit. Hence I am not really in favour of removing this. Pradeep From Jim.Langston at Sun.COM Tue Oct 9 18:20:50 2007 From: Jim.Langston at Sun.COM (Jim Langston) Date: Tue, 09 Oct 2007 21:20:50 -0400 Subject: [ofa-general] SDP ? In-Reply-To: <00ac01c80aaf$9c98e700$d5cab500$@rr.com> References: <470B9A84.9000008@sun.com> <00ac01c80aaf$9c98e700$d5cab500$@rr.com> Message-ID: <470C28F2.4030905@sun.com> Hi Jim, Thanks, tried early on with -D_XPG4_2, things went from bad to worse, I'll look at switching from int to void. Jim ////////////// Jim Mott wrote: > That should work fine. You might be able to build with -D_XPG4_2 as well. > > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jim Langston > Sent: Tuesday, October 09, 2007 10:13 AM > To: general at lists.openfabrics.org > Subject: [ofa-general] SDP ? 
> > Hi all, > > I'm working on porting SDP to OpenSolaris and am looking at a > compile error that I get. Essentially, I have a conflict of types on > the compile: > > bash-3.00$ /opt/SUNWspro/bin/cc -DHAVE_CONFIG_H -I. -I. -I.. -g > -D_POSIX_PTHREAD_SEMANTICS -DSYSCONFDIR=\"/usr/local/etc\" -g > -D_POSIX_PTHREAD_SEMANTICS -c port.c -KPIC -DPIC -o .libs/port.o > "port.c", line 1896: identifier redeclared: getsockname > current : function(int, pointer to struct sockaddr {unsigned > short sa_family, array[14] of char sa_data}, pointer to unsigned int) > returning int > previous: function(int, pointer to struct sockaddr {unsigned > short sa_family, array[14] of char sa_data}, pointer to void) returning > int : "/usr/include/sys/socket.h", line 436 > > > Line 436 in /usr/include/sys/socket.h > > extern int getsockname(int, struct sockaddr *_RESTRICT_KYWD, Psocklen_t); > > > and Psocklen_t > > #if defined(_XPG4_2) || defined(_BOOT) > typedef socklen_t *_RESTRICT_KYWD Psocklen_t; > #else > typedef void *_RESTRICT_KYWD Psocklen_t; > #endif /* defined(_XPG4_2) || defined(_BOOT) */ > > > Do I need to change port.c getsockname to type void * ? > > > Thanks, > > Jim > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- ///////////////////////////////////////////// Jim Langston Sun Microsystems, Inc. 
(877) 854-5583 (AccessLine) AIM: jl9594 jim.langston at sun.com From arthur.jones at qlogic.com Tue Oct 9 19:36:25 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 9 Oct 2007 19:36:25 -0700 Subject: [ofa-general] Re: [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support In-Reply-To: References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> <20071009195920.7151.4573.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071010023625.GA31708@bauxite.pathscale.com> hi roland, i didn't realize it was such a PITA for you to take so many at once. i'll make sure to do them in smaller chunks from now on. thanks for taking these... arthur On Tue, Oct 09, 2007 at 02:55:31PM -0700, Roland Dreier wrote: > OK, I'll grudgingly merge these patch, even though they all arrived on > the exact day that Linus released 2.6.23... but you guys really need > to fix your development process so you don't accumulate a huge bolus > of patches that you then vomit out. In the future I'm not going to > accept giant merges like this -- please send your patches as soon as > you've accumulated say 5 or 10. > > - R. From rdreier at cisco.com Tue Oct 9 20:10:15 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Oct 2007 20:10:15 -0700 Subject: [ofa-general] Re: [PATCH 01/23] IB/ipath -- iba6110 rev4 GPIO counters support In-Reply-To: <20071010023625.GA31708@bauxite.pathscale.com> (Arthur Jones's message of "Tue, 9 Oct 2007 19:36:25 -0700") References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> <20071009195920.7151.4573.stgit@eng-46.internal.keyresearch.com> <20071010023625.GA31708@bauxite.pathscale.com> Message-ID: > hi roland, i didn't realize it was such > a PITA for you to take so many at once. > i'll make sure to do them in smaller chunks > from now on. Thanks. The reason its a pain is that it's a lot harder to review a ton of patches when they come late like this. 
Just send the patches as you write them and you have less of a queue to worry about and I can manage my queue a lot better. - R. From pradeeps at linux.vnet.ibm.com Tue Oct 9 20:57:43 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 09 Oct 2007 20:57:43 -0700 Subject: [ofa-general] IPoIB CM (NOSRQ) [Patch V9] revised Message-ID: <470C4DB7.2050103@linux.vnet.ibm.com> This revised version incorporates Sean's comments. The module parameters are unchanged except the restriction on max_rc_qp (that it should be power of 2) has been removed. This patch has been tested with linux-2.6.23-rc7 (derived from Roland's for-2.6.24 git tree) on ppc64 machines using IBM HCA. Signed-off-by: Pradeep Satyanarayana --- --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-03 12:01:58.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-09 19:42:51.000000000 -0500 @@ -69,6 +69,7 @@ enum { IPOIB_TX_RING_SIZE = 64, IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, + IPOIB_MAX_RC_QP = 4096, IPOIB_NUM_WC = 4, @@ -95,11 +96,13 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) #define IPOIB_OP_RECV (1ul << 31) + #ifdef CONFIG_INFINIBAND_IPOIB_CM -#define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_CM_OP_RECV (1ul << 30) #else -#define IPOIB_CM_OP_SRQ (0) +#define IPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -186,11 +189,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp *qp; - struct list_head list; - struct net_device *dev; - unsigned long jiffies; + struct ib_cm_id *id; + struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by no srq only */ + struct list_head list; + struct net_device *dev; + unsigned long jiffies; + u32 index; /* wr_ids are distinguished by index + * to identify the QP -no srq only */ enum ipoib_cm_state state; }; @@ -235,6 +241,8 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct 
ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -458,6 +466,7 @@ void ipoib_drain_cq(struct net_device *d /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC)) +extern int max_rc_qp; static inline int ipoib_cm_admin_enabled(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-31 12:14:30.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-09 21:15:25.000000000 -0500 @@ -49,6 +49,18 @@ MODULE_PARM_DESC(cm_data_debug_level, #include "ipoib.h" +int max_rc_qp = 128; +static int max_recv_buf = 1024; /* Default is 1024 MB */ + +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0444); +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of no srq RC QPs supported"); + +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); +MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB"); + +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for no srq */ + +#define NOSRQ_INDEX_MASK (0xfff) /* This corresponds to a max of 4096 QPs for no srq */ #define IPOIB_CM_IETF_ID 0x1000000000000000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +93,21 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); if 
(unlikely(ret)) { - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); + ipoib_warn(priv, "post srq failed for buf %lld (%d)\n", + (unsigned long long)id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[id].mapping); dev_kfree_skb_any(priv->cm.srq_ring[id].skb); @@ -104,12 +117,47 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index; + u32 wr_id; + struct ipoib_cm_rx *rx_ptr; + + index = id & NOSRQ_INDEX_MASK; + wr_id = id >> 32; + + rx_ptr = priv->cm.rx_index_table[index]; + + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; + + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", + wr_id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ptr->rx_ring[wr_id].mapping); + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); + rx_ptr->rx_ring[wr_id].skb = NULL; + } + + return ret; +} + +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, + int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; int i; + struct ipoib_cm_rx *rx_ptr; + u32 index, wr_id; skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if (unlikely(!skb)) @@ -141,7 +189,14 @@ static struct sk_buff *ipoib_cm_alloc_rx goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (priv->cm.srq) + priv->cm.srq_ring[id].skb = skb; + else { + index = id & NOSRQ_INDEX_MASK; + wr_id = id >> 32; + rx_ptr = priv->cm.rx_index_table[index]; + rx_ptr->rx_ring[wr_id].skb = skb; + } return skb; partial_error: @@ -203,11 +258,14 @@ static struct ib_qp 
*ipoib_cm_create_rx_ .recv_cq = priv->cq, .srq = priv->cm.srq, .cap.max_send_wr = 1, /* For drain WR */ + .cap.max_recv_wr = ipoib_recvq_size + 1, .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, .qp_context = p, }; + + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; return ib_create_qp(priv->pd, &attr); } @@ -281,12 +339,127 @@ static int ipoib_cm_send_rep(struct net_ rep.private_data_len = sizeof data; rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; - rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; + rep.srq = !!priv->cm.srq; return ib_send_cm_rep(cm_id, &rep); } +static void init_context_and_add_list(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, + struct ipoib_dev_priv *priv) +{ + cm_id->context = p; + p->jiffies = jiffies; + if (list_empty(&priv->cm.passive_ids)) + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); + if (priv->cm.srq) { + /* Add this entry to passive ids list head, but do not re-add + * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush + * list. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + } +} + +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, unsigned psn) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u32 index; + u64 i, recv_mem_used; + + /* In the SRQ case there is a common rx buffer called the srq_ring. + * However, for the no srq case we create an rx_ring for every + * struct ipoib_cm_rx. 
+ */ + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL); + if (!p->rx_ring) { + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", + p->qp->qp_num); + return -ENOMEM; + } + + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + + init_context_and_add_list(cm_id, p, priv); + + for (index = 0; index < max_rc_qp; index++) + if (priv->cm.rx_index_table[index] == NULL) + break; + + recv_mem_used = (u64)ipoib_recvq_size * + (u64)atomic_inc_return(&current_rc_qp) * CM_PACKET_SIZE; + if ((index == max_rc_qp) || + (recv_mem_used >= max_recv_buf * (1ul << 20))) { + spin_unlock_irq(&priv->lock); + ipoib_warn(priv, "no srq has reached the configurable limit " + "of either %d RC QPs or, max recv buf size of " + "0x%x MB\n", max_rc_qp, max_recv_buf); + + /* We send a REJ to the remote side indicating that we + * have no more free RC QPs and leave it to the remote side + * to take appropriate action. This should leave the + * current set of QPs unaffected and any subsequent REQs + * will be able to use RC QPs if they are available. 
+ */ + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); + ret = -EINVAL; + goto err_alloc_and_post; + } + + priv->cm.rx_index_table[index] = p; + + /* We will subsequently use this stored pointer while freeing + * resources in stale task + */ + p->index = index; + spin_unlock_irq(&priv->lock); + + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) { + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); + ipoib_cm_dev_cleanup(dev); + goto err_alloc_and_post; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive " + "buffer %d\n", (int)i); + ipoib_cm_dev_cleanup(dev); + ret = -ENOMEM; + goto err_alloc_and_post; + } + + ret = post_receive_nosrq(dev, i << 32 | index); + if (ret) { + ipoib_warn(priv, "post_receive_nosrq " + "failed for buf %lld\n", (unsigned long long)i); + ipoib_cm_dev_cleanup(dev); + ret = -EIO; + goto err_alloc_and_post; + } + } + + return 0; + +err_alloc_and_post: + atomic_dec(&current_rc_qp); + kfree(p->rx_ring); + spin_lock_irq(&priv->lock); + list_del_init(&p->list); + spin_unlock_irq(&priv->lock); + return ret; +} + static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { struct net_device *dev = cm_id->context; @@ -301,9 +474,6 @@ static int ipoib_cm_req_handler(struct i return -ENOMEM; p->dev = dev; p->id = cm_id; - cm_id->context = p; - p->state = IPOIB_CM_RX_LIVE; - p->jiffies = jiffies; INIT_LIST_HEAD(&p->list); p->qp = ipoib_cm_create_rx_qp(dev, p); @@ -313,19 +483,18 @@ static int ipoib_cm_req_handler(struct i } psn = random32() & 0xffffff; - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); - if (ret) - goto err_modify; - - spin_lock_irq(&priv->lock); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); - /* Add this entry to passive ids list head, but do not re-add it - * if IB_EVENT_QP_LAST_WQE_REACHED has 
moved it to flush list. */ - p->jiffies = jiffies; - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irq(&priv->lock); + if (!priv->cm.srq) { + ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); + if (ret) + goto err_modify; + } else { + p->rx_ring = NULL; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) + goto err_modify; + p->state = IPOIB_CM_RX_LIVE; + init_context_and_add_list(cm_id, p, priv); + } ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); if (ret) { @@ -398,29 +567,60 @@ static void skb_put_frags(struct sk_buff } } -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. 
*/ + if (!list_empty(&p->list)) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; + u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV; struct sk_buff *skb, *newskb; struct ipoib_cm_rx *p; unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; - int frags; + int frags, ret; - ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", - wr_id, wc->status); + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", + (unsigned long long)wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { + if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) { spin_lock_irqsave(&priv->lock, flags); list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); ipoib_cm_start_rx_drain(priv); queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); spin_unlock_irqrestore(&priv->lock, flags); } else - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); return; } @@ -428,23 +628,15 @@ void ipoib_cm_handle_rx_wc(struct net_de if (unlikely(wc->status != IB_WC_SUCCESS)) { ipoib_dbg(priv, "cm recv error " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); + "(status=%d, wrid=%lld vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); ++priv->stats.rx_dropped; - goto repost; + goto repost_srq; } if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; - if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { - spin_lock_irqsave(&priv->lock, flags); - p->jiffies = jiffies; - /* Move this entry to list head, but do not re-add it - * if it has been moved out of list. 
*/ - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irqrestore(&priv->lock, flags); - } + timer_check_srq(priv, p); } frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, @@ -456,13 +648,96 @@ void ipoib_cm_handle_rx_wc(struct net_de * If we can't allocate a new RX buffer, dump * this packet and reuse the old buffer. */ - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", + (unsigned long long)wr_id); + ++priv->stats.rx_dropped; + goto repost_srq; + } + + ipoib_cm_dma_unmap_rx(priv, frags, + priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb_reset_mac_header(skb); + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_receive_skb(skb); + +repost_srq: + ret = post_receive_srq(dev, wr_id); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_srq failed for buf %lld\n", + (unsigned long long)wr_id); + +} + +static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32; + u32 index; + struct ipoib_cm_rx *rx_ptr; + int frags, ret; + + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", + (unsigned long long)wr_id, wc->status); + + if (unlikely(wr_id >= ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); + return; + } + + index = (wc->wr_id & 
~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK; + rx_ptr = priv->cm.rx_index_table[index]; + + skb = rx_ptr->rx_ring[wr_id].skb; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ipoib_dbg(priv, "cm recv error " + "(status=%d, wrid=%lld vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); + ++priv->stats.rx_dropped; + goto repost_nosrq; + } + + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) + timer_check_nosrq(priv, rx_ptr); + + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, + mapping); + if (unlikely(!newskb)) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. + */ + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", + (unsigned long long)wr_id); ++priv->stats.rx_dropped; - goto repost; + goto repost_nosrq; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + ipoib_cm_dma_unmap_rx(priv, frags, rx_ptr->rx_ring[wr_id].mapping); + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); @@ -482,10 +757,22 @@ void ipoib_cm_handle_rx_wc(struct net_de skb->pkt_type = PACKET_HOST; netif_receive_skb(skb); -repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); +repost_nosrq: + ret = post_receive_nosrq(dev, wr_id << 32 | index); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_nosrq failed for buf %lld\n", + (unsigned long long)wr_id); +} + +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->cm.srq) + handle_rx_wc_srq(dev, wc); + else + handle_rx_wc_nosrq(dev, wc); } static inline int post_send(struct 
ipoib_dev_priv *priv, @@ -677,6 +964,43 @@ err_cm: return ret; } +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + int i; + + for (i = 0; i < ipoib_recvq_size; ++i) + if (p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + kfree(p->rx_ring); +} + +void dev_stop_nosrq(struct ipoib_dev_priv *priv) +{ + struct ipoib_cm_rx *p; + + spin_lock_irq(&priv->lock); + while (!list_empty(&priv->cm.passive_ids)) { + p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + free_resources_nosrq(priv, p); + list_del(&p->list); + spin_unlock_irq(&priv->lock); + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + atomic_dec(&current_rc_qp); + kfree(p); + spin_lock_irq(&priv->lock); + } + spin_unlock_irq(&priv->lock); + + cancel_delayed_work(&priv->cm.stale_task); + kfree(priv->cm.rx_index_table); +} + void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -691,6 +1015,11 @@ void ipoib_cm_dev_stop(struct net_device ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + if (!priv->cm.srq) { + dev_stop_nosrq(priv); + return; + } + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); @@ -814,7 +1143,9 @@ static struct ib_qp *ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_recv_wr = 0; attr.cap.max_send_sge = 1; + attr.cap.max_recv_sge = 0; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -854,7 +1185,7 @@ static int ipoib_cm_send_req(struct net_ req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + req.srq = !!priv->cm.srq; return ib_send_cm_req(id, &req); } @@ -1198,6 +1529,8 @@ static 
void ipoib_cm_rx_reap(struct work list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); + if (!priv->cm.srq) + atomic_dec(&current_rc_qp); kfree(p); } } @@ -1216,12 +1549,19 @@ static void ipoib_cm_stale_task(struct w p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_move(&p->list, &priv->cm.rx_error_list); - p->state = IPOIB_CM_RX_ERROR; - spin_unlock_irq(&priv->lock); - ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); - if (ret) - ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + if (!priv->cm.srq) { + free_resources_nosrq(priv, p); + list_del_init(&p->list); + priv->cm.rx_index_table[p->index] = NULL; + spin_unlock_irq(&priv->lock); + } else { + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; + spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + } spin_lock_irq(&priv->lock); } @@ -1275,16 +1615,40 @@ int ipoib_cm_add_mode_attr(struct net_de return device_create_file(&dev->dev, &dev_attr_mode); } +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) +{ + struct ib_srq_init_attr srq_init_attr; + int ret; + + srq_init_attr.attr.max_wr = ipoib_recvq_size; + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * + sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring " + "(%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + return 0; +} + int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv *priv = 
netdev_priv(dev); - struct ib_srq_init_attr srq_init_attr = { - .attr = { - .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG - } - }; int ret, i; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1301,20 +1665,32 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; + ret = ib_query_device(priv->ca, &attr); + if (ret) return ret; - } - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", - priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; + if (attr.max_srq) { + /* This device supports SRQ */ + ret = create_srq(dev, priv); + if (ret) + return ret; + priv->cm.rx_index_table = NULL; + } else { + priv->cm.srq = NULL; + priv->cm.srq_ring = NULL; + + /* Every new REQ that arrives creates a struct ipoib_cm_rx. + * These structures form a link list starting with the + * passive_ids. For quick and easy access we maintain a table + * of pointers to struct ipoib_cm_rx called the rx_index_table + */ + priv->cm.rx_index_table = kcalloc(max_rc_qp, + sizeof *priv->cm.rx_index_table, + GFP_KERNEL); + if (!priv->cm.rx_index_table) { + printk(KERN_WARNING "Failed to allocate rx_index_table\n"); + return -ENOMEM; + } } for (i = 0; i < IPOIB_CM_RX_SG; ++i) @@ -1327,17 +1703,24 @@ int ipoib_cm_dev_init(struct net_device priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + /* One can post receive buffers even before the RX QP is created + * only in the SRQ case. 
Therefore for no srq we skip the rest of init + * and do that in ipoib_cm_req_handler() + */ + + if (priv->cm.srq) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (post_receive_srq(dev, i)) { + ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } } } --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 12:39:12.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-10-09 19:02:45.000000000 -0500 @@ -300,7 +300,7 @@ int ipoib_poll(struct net_device *dev, i for (i = 0; i < n; ++i) { struct ib_wc *wc = priv->ibwc + i; - if (wc->wr_id & IPOIB_CM_OP_SRQ) { + if (wc->wr_id & IPOIB_CM_OP_RECV) { ++done; --max; ipoib_cm_handle_rx_wc(dev, wc); @@ -566,7 +566,7 @@ void ipoib_drain_cq(struct net_device *d if (priv->ibwc[i].status == IB_WC_SUCCESS) priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR; - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV) ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-18 12:39:12.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-10-09 19:02:45.000000000 -0500 @@ -175,6 +175,18 @@ int ipoib_transport_dev_init(struct net_ if (!ret) size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; +#ifdef CONFIG_INFINIBAND_IPOIB_CM + + /* We increase the size of 
the CQ in the NOSRQ case to prevent CQ + * overflow. Every new REQ creates a new RX QP and each QP has an + * RX ring associated with it. Therefore we could have + * max_rc_qp*ipoib_recvq_size + ipoib_sendq_size CQEs + * in a CQ. + */ + if (!priv->cm.srq) + size += (max_rc_qp - 1) * ipoib_recvq_size; +#endif + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-03 12:01:58.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-09 21:34:24.000000000 -0500 @@ -1229,6 +1229,7 @@ static int __init ipoib_init_module(void ipoib_sendq_size = roundup_pow_of_two(ipoib_sendq_size); ipoib_sendq_size = min(ipoib_sendq_size, IPOIB_MAX_QUEUE_SIZE); ipoib_sendq_size = max(ipoib_sendq_size, IPOIB_MIN_QUEUE_SIZE); + max_rc_qp = min(max_rc_qp, IPOIB_MAX_RC_QP); ret = ipoib_register_debugfs(); if (ret) From kliteyn at mellanox.co.il Tue Oct 9 22:08:49 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 10 Oct 2007 07:08:49 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-10:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-09 OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 
FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From ogerlitz at voltaire.com Wed Oct 10 01:54:37 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 10 Oct 2007 10:54:37 +0200 Subject: [ofa-general] Re: [PATCH v3 for 2.6.24] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: References: Message-ID: <470C934D.5020805@voltaire.com> Roland Dreier wrote: > OK, I will merge this for 2.6.24. 
However, I still don't really > understand the changelog entry: >> The kernel IB stack allows (through the RDMA CM) user space multicast applications >> to interoperate with IP based apps optionally running at a different IP subnet. >> >> To support this inter-op for the case where the receiving party resides at >> the IB side, there is a need to handle IGMP (reports/queries) else the local >> IP router would not forward multicast traffic towards the IB network. > So in other words you have a userspace app that joins an IPoIB > multicast group and then it has to do an IP_ADD_MEMBERSHIP socket > option to trigger IGMP messages being sent out, so that traffic gets > routed to it? yes >> This patch does a lookup on the database used for multicast reference counting and >> enhances IPoIB to ignore multicast group which is already handled by user space, all >> this under a per device policy flag. That is when the policy flag allows it, IPoIB >> will not join and attach its QP to a multicast group which has an entry on the database. > And then you don't want the kernel IPoIB driver to actually join the > multicast group for the IP multicast group you added with > IP_ADD_MEMBERSHIP? Why is that exactly -- this is the part I'm > especially hazy on. Yes I don't want ipoib to receive packets from this group, so it need not join/attach to it through the flow of the net core calling ipoib's set_multicast_list callback. In the case of IGMP v2 where reports are sent over the actual group, IPoIB does join but as "send only", I have validated this to work fine with my patch. The whole idea is that there's a userspace app that joins through the rdma-cm and attaches its user space QP to this MGID such that it will receive this multicast group packets. Opening a socket and calling add membership on it is done since this is the only means to cause the kernel to issue IGMP reporting etc on this group. 
Other than that IPoIB need not join/attach to this group, doing so on my system (*) cuts the performance by half. When I attach two user space processes to the same group performance is cut by only ~10%, so the 50% drop might turn to be network stack issue or firmware issue or combination of both and other things. At the bottom line, the umcast flag allow users who need to interop with IP routers, to signal IPoIB that they don't want groups joined for user space receiving to be joined/attached by the kernel. are we done? Or. (*) Mellanox Arbel memfull hca (device 25208), firmware 4.7.600 From tziporet at dev.mellanox.co.il Wed Oct 10 02:07:37 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 10 Oct 2007 11:07:37 +0200 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> Message-ID: <470C9659.1030907@mellanox.co.il> Roland Dreier wrote: > Actually supporting UC is trickier than it seems, at least for the SRQ > case, since attaching UC QPs to an SRQ requires that the IB spec be > extended to allow that (and also define some semantics for how to > handle messages that encounter an error in the middle of being > received, after a work request has been taken from the SRQ). > UC with SRQ was just added to IB SPEC ConnectX with our latest FW already supports this, and we can add it to the low level driver if needed. Arbel can support it too but its not implemented yet in FW but can be added later. 
Tziporet From andi at firstfloor.org Wed Oct 10 02:16:44 2007 From: andi at firstfloor.org (Andi Kleen) Date: Wed, 10 Oct 2007 11:16:44 +0200 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009.175025.59469417.davem@davemloft.net> References: <1191967006.5324.14.camel@localhost> <20071009.170435.43504422.davem@davemloft.net> <20071010003716.GB552@one.firstfloor.org> <20071009.175025.59469417.davem@davemloft.net> Message-ID: <20071010091644.GA9807@one.firstfloor.org> > A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you With TSO really? > increase the size much more performance starts to go down due to L2 > cache thrashing. Another possibility would be to consider using cache avoidance instructions while updating the TX ring (e.g. write combining on x86) -Andi From davem at davemloft.net Wed Oct 10 02:25:50 2007 From: davem at davemloft.net (David Miller) Date: Wed, 10 Oct 2007 02:25:50 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010091644.GA9807@one.firstfloor.org> References: <20071010003716.GB552@one.firstfloor.org> <20071009.175025.59469417.davem@davemloft.net> <20071010091644.GA9807@one.firstfloor.org> Message-ID: <20071010.022550.21928751.davem@davemloft.net> From: Andi Kleen Date: Wed, 10 Oct 2007 11:16:44 +0200 > > A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you > > With TSO really? Yes. > > increase the size much more performance starts to go down due to L2 > > cache thrashing. > > Another possibility would be to consider using cache avoidance > instructions while updating the TX ring (e.g. write combining > on x86) The chip I was working with at the time (UltraSPARC-IIi) compressed all the linear stores into 64-byte full cacheline transactions via the store buffer. It's true that it would allocate in the L2 cache on a miss, which is different from your suggestion. 
In fact, such a thing might not pan out well, because most of the time you write a single descriptor or two, and that isn't a full cacheline, which means a read/modify/write is the only coherent way to make such a write to RAM. Sure you could batch, but I'd rather give the chip work to do unless I unequivocably knew I'd have enough pending to fill a cacheline's worth of descriptors. And since you suggest we shouldn't queue in software... :-) From dotanb at dev.mellanox.co.il Wed Oct 10 02:25:18 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 10 Oct 2007 11:25:18 +0200 Subject: [ofa-general] [PATCH (resend)]: libibverbs: Fix several issues that were reported by valgrind Message-ID: <200710101125.18099.dotanb@dev.mellanox.co.il> Fix several issues that were reported by valgrind: * Initialize the reserved attributes * fixing the pointer + size when calling to VALGRIND_MAKE_MEM_DEFINED * adding VALGRIND_MAKE_MEM_DEFINED to the buffers which were filled with the system call "write". 
Signed-off-by: Dotan Barak --- diff --git a/src/cmd.c b/src/cmd.c index 6d4331f..31b6092 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -248,7 +248,7 @@ int ibv_cmd_reg_mr(struct ibv_pd *pd, void *addr, size_t length, if (write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; - VALGRIND_MAKE_MEM_DEFINED(&resp, sizeof resp); + VALGRIND_MAKE_MEM_DEFINED(resp, resp_size); mr->handle = resp->mr_handle; mr->lkey = resp->lkey; @@ -291,7 +291,7 @@ static int ibv_cmd_create_cq_v2(struct ibv_context *context, int cqe, if (write(context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; - VALGRIND_MAKE_MEM_DEFINED(resp, sizeof resp_size); + VALGRIND_MAKE_MEM_DEFINED(resp, resp_size); cq->handle = resp->cq_handle; cq->cqe = resp->cqe; @@ -432,6 +432,7 @@ int ibv_cmd_destroy_cq(struct ibv_cq *cq) IBV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_CQ, &resp, sizeof resp); cmd.cq_handle = cq->handle; + cmd.reserved = 0; if (write(cq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) return errno; @@ -539,10 +540,13 @@ int ibv_cmd_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_SRQ, &resp, sizeof resp); cmd->srq_handle = srq->handle; + cmd->reserved = 0; if (write(srq->context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; + VALGRIND_MAKE_MEM_DEFINED(&resp, sizeof resp); + srq_attr->max_wr = resp.max_wr; srq_attr->max_sge = resp.max_sge; srq_attr->srq_limit = resp.srq_limit; @@ -573,10 +577,13 @@ int ibv_cmd_destroy_srq(struct ibv_srq *srq) IBV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_SRQ, &resp, sizeof resp); cmd.srq_handle = srq->handle; + cmd.reserved = 0; if (write(srq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) return errno; + VALGRIND_MAKE_MEM_DEFINED(&resp, sizeof resp); + pthread_mutex_lock(&srq->mutex); while (srq->events_completed != resp.events_reported) pthread_cond_wait(&srq->cond, &srq->mutex); @@ -657,6 +664,8 @@ int ibv_cmd_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, if 
(write(qp->context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; + VALGRIND_MAKE_MEM_DEFINED(&resp, sizeof resp); + attr->qkey = resp.qkey; attr->rq_psn = resp.rq_psn; attr->sq_psn = resp.sq_psn; @@ -1064,6 +1073,7 @@ int ibv_cmd_destroy_qp(struct ibv_qp *qp) IBV_INIT_CMD_RESP(&cmd, sizeof cmd, DESTROY_QP, &resp, sizeof resp); cmd.qp_handle = qp->handle; + cmd.reserved = 0; if (write(qp->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) return errno; @@ -1086,6 +1096,7 @@ int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) memcpy(cmd.gid, gid->raw, sizeof cmd.gid); cmd.qp_handle = qp->handle; cmd.mlid = lid; + cmd.reserved = 0; if (write(qp->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) return errno; @@ -1101,6 +1112,7 @@ int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) memcpy(cmd.gid, gid->raw, sizeof cmd.gid); cmd.qp_handle = qp->handle; cmd.mlid = lid; + cmd.reserved = 0; if (write(qp->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) return errno; From dotanb at dev.mellanox.co.il Wed Oct 10 02:26:18 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 10 Oct 2007 11:26:18 +0200 Subject: [ofa-general] [PATCH] libibverbs/examples: Fixes some issues in the examples files Message-ID: <200710101126.18284.dotanb@dev.mellanox.co.il> Fixes the following issues in the examples: * memory leaks * warnings reported by valgrind of uninitialized attributes in strcuts Signed-off-by: Dotan Barak --- diff --git a/examples/device_list.c b/examples/device_list.c index b53d4b1..3ce8cbd 100644 --- a/examples/device_list.c +++ b/examples/device_list.c @@ -45,8 +45,9 @@ int main(int argc, char *argv[]) { struct ibv_device **dev_list; + int num_devices, i; - dev_list = ibv_get_device_list(NULL); + dev_list = ibv_get_device_list(&num_devices); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; @@ -55,12 +56,13 @@ int main(int argc, char *argv[]) printf(" %-16s\t node GUID\n", "device"); 
printf(" %-16s\t----------------\n", "------"); - while (*dev_list) { + for (i = 0; i < num_devices; ++i) { printf(" %-16s\t%016llx\n", - ibv_get_device_name(*dev_list), - (unsigned long long) ntohll(ibv_get_device_guid(*dev_list))); - ++dev_list; + ibv_get_device_name(dev_list[i]), + (unsigned long long) ntohll(ibv_get_device_guid(dev_list[i]))); } + ibv_free_device_list(dev_list); + return 0; } diff --git a/examples/devinfo.c b/examples/devinfo.c index d054999..4e4316a 100644 --- a/examples/devinfo.c +++ b/examples/devinfo.c @@ -323,7 +323,7 @@ int main(int argc, char *argv[]) { char *ib_devname = NULL; int ret = 0; - struct ibv_device **dev_list; + struct ibv_device **dev_list, **orig_dev_list; int num_of_hcas; int ib_port = 0; @@ -360,7 +360,7 @@ int main(int argc, char *argv[]) break; case 'l': - dev_list = ibv_get_device_list(&num_of_hcas); + dev_list = orig_dev_list = ibv_get_device_list(&num_of_hcas); if (!dev_list) { fprintf(stderr, "Failed to get IB devices list"); return -1; @@ -375,6 +375,9 @@ int main(int argc, char *argv[]) } printf("\n"); + + ibv_free_device_list(orig_dev_list); + return 0; default: @@ -383,7 +386,7 @@ int main(int argc, char *argv[]) } } - dev_list = ibv_get_device_list(NULL); + dev_list = orig_dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "Failed to get IB device list\n"); return -1; @@ -417,5 +420,7 @@ int main(int argc, char *argv[]) if (ib_devname) free(ib_devname); + ibv_free_device_list(orig_dev_list); + return ret; } diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c index 258eb8f..81fd4a6 100644 --- a/examples/rc_pingpong.c +++ b/examples/rc_pingpong.c @@ -146,6 +146,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por if (n < 0) { fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + free(service); return NULL; } @@ -160,6 +161,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por } freeaddrinfo(res); + 
free(service); if (sockfd < 0) { fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); @@ -214,6 +216,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, if (n < 0) { fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); + free(service); return NULL; } @@ -232,6 +235,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, } freeaddrinfo(res); + free(service); if (sockfd < 0) { fprintf(stderr, "Couldn't listen to port %d\n", port); @@ -358,12 +362,12 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, } { - struct ibv_qp_attr attr; - - attr.qp_state = IBV_QPS_INIT; - attr.pkey_index = 0; - attr.port_num = port; - attr.qp_access_flags = 0; + struct ibv_qp_attr attr = { + .qp_state = IBV_QPS_INIT, + .pkey_index = 0, + .port_num = port, + .qp_access_flags = 0 + }; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | diff --git a/examples/srq_pingpong.c b/examples/srq_pingpong.c index 490ad0a..91fd566 100644 --- a/examples/srq_pingpong.c +++ b/examples/srq_pingpong.c @@ -157,6 +157,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por if (n < 0) { fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + free(service); return NULL; } @@ -171,6 +172,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por } freeaddrinfo(res); + free(service); if (sockfd < 0) { fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); @@ -238,6 +240,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, if (n < 0) { fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); + free(service); return NULL; } @@ -256,6 +259,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, } freeaddrinfo(res); + free(service); if (sockfd < 0) { fprintf(stderr, "Couldn't listen to port %d\n", port); @@ -408,12 +412,12 @@ static struct pingpong_context 
*pp_init_ctx(struct ibv_device *ib_dev, int size, } for (i = 0; i < num_qp; ++i) { - struct ibv_qp_attr attr; - - attr.qp_state = IBV_QPS_INIT; - attr.pkey_index = 0; - attr.port_num = port; - attr.qp_access_flags = 0; + struct ibv_qp_attr attr = { + .qp_state = IBV_QPS_INIT, + .pkey_index = 0, + .port_num = port, + .qp_access_flags = 0 + }; if (ibv_modify_qp(ctx->qp[i], &attr, IBV_QP_STATE | diff --git a/examples/uc_pingpong.c b/examples/uc_pingpong.c index b6051c8..32652f5 100644 --- a/examples/uc_pingpong.c +++ b/examples/uc_pingpong.c @@ -134,6 +134,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por if (n < 0) { fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + free(service); return NULL; } @@ -148,6 +149,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por } freeaddrinfo(res); + free(service); if (sockfd < 0) { fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); @@ -202,6 +204,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, if (n < 0) { fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); + free(service); return NULL; } @@ -220,6 +223,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, } freeaddrinfo(res); + free(service); if (sockfd < 0) { fprintf(stderr, "Couldn't listen to port %d\n", port); @@ -346,12 +350,12 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, } { - struct ibv_qp_attr attr; - - attr.qp_state = IBV_QPS_INIT; - attr.pkey_index = 0; - attr.port_num = port; - attr.qp_access_flags = 0; + struct ibv_qp_attr attr = { + .qp_state = IBV_QPS_INIT, + .pkey_index = 0, + .port_num = port, + .qp_access_flags = 0 + }; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | diff --git a/examples/ud_pingpong.c b/examples/ud_pingpong.c index c631e25..baf69b7 100644 --- a/examples/ud_pingpong.c +++ b/examples/ud_pingpong.c @@ -79,7 +79,6 @@ struct 
pingpong_dest { static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, struct pingpong_dest *dest) { - struct ibv_qp_attr attr; struct ibv_ah_attr ah_attr = { .is_global = 0, .dlid = dest->lid, @@ -87,8 +86,9 @@ static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, .src_path_bits = 0, .port_num = port }; - - attr.qp_state = IBV_QPS_RTR; + struct ibv_qp_attr attr = { + .qp_state = IBV_QPS_RTR + }; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE)) { fprintf(stderr, "Failed to modify QP to RTR\n"); @@ -135,6 +135,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por if (n < 0) { fprintf(stderr, "%s for %s:%d\n", gai_strerror(n), servername, port); + free(service); return NULL; } @@ -149,6 +150,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por } freeaddrinfo(res); + free(service); if (sockfd < 0) { fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port); @@ -203,6 +205,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, if (n < 0) { fprintf(stderr, "%s for port %d\n", gai_strerror(n), port); + free(service); return NULL; } @@ -221,6 +224,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, } freeaddrinfo(res); + free(service); if (sockfd < 0) { fprintf(stderr, "Couldn't listen to port %d\n", port); @@ -347,12 +351,12 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, } { - struct ibv_qp_attr attr; - - attr.qp_state = IBV_QPS_INIT; - attr.pkey_index = 0; - attr.port_num = port; - attr.qkey = 0x11111111; + struct ibv_qp_attr attr = { + .qp_state = IBV_QPS_INIT, + .pkey_index = 0, + .port_num = port, + .qkey = 0x11111111 + }; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | From Sumit.Gaur at Sun.COM Wed Oct 10 02:33:09 2007 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Wed, 10 Oct 2007 15:03:09 +0530 Subject: [ofa-general] question 
regarding umad_recv In-Reply-To: <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM> <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> Message-ID: <470C9C55.3090304@Sun.COM> Hi, I am using madrpc_init which in turn calling umad_register(). There is no problem in sending and receiving data. Only problem comes when two separate user threads(one for SMI recv and another for GSI recv) are trying to recv data using mad_receive(0, timeout) function simultaneously. I get SMI mad in GSI thread and vice versa sometimes. How to get rid of this problem as mad_receive has no control of qp selection. Thanks and Regards sumit Hal Rosenstock wrote: > On Tue, 2007-10-09 at 13:01 +0530, Sumit Gaur - Sun Microsystem wrote: > >>Hi, >> >>It is regarding *umad_recv* function of libibumad/src/umad.c file. Is it not >>possible to recv MAD specific to GSI or SMI type. As per my impression if I have >>two separate threads to send and receive then I could send MADs to different qp >>0 or 1 depend on GSI and SMI MAD. But receiving has no control over it. Please >>suggest if there is any workaround for it. > > > See umad_register(). 
> > -- Hal > > >>Thanks and Regards >>sumit >>_______________________________________________ >>general mailing list >>general at lists.openfabrics.org >>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From herbert at gondor.apana.org.au Wed Oct 10 02:53:16 2007 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Wed, 10 Oct 2007 17:53:16 +0800 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010091644.GA9807@one.firstfloor.org> References: <1191967006.5324.14.camel@localhost> <20071009.170435.43504422.davem@davemloft.net> <20071010003716.GB552@one.firstfloor.org> <20071009.175025.59469417.davem@davemloft.net> <20071010091644.GA9807@one.firstfloor.org> Message-ID: <20071010095316.GA32095@gondor.apana.org.au> On Wed, Oct 10, 2007 at 11:16:44AM +0200, Andi Kleen wrote: > > A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you > > With TSO really? Hardware queues are generally per-page rather than per-skb so it'd fill up quicker than a software queue even with TSO. 
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From andi at firstfloor.org Wed Oct 10 03:23:31 2007 From: andi at firstfloor.org (Andi Kleen) Date: Wed, 10 Oct 2007 12:23:31 +0200 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010.022550.21928751.davem@davemloft.net> References: <20071010003716.GB552@one.firstfloor.org> <20071009.175025.59469417.davem@davemloft.net> <20071010091644.GA9807@one.firstfloor.org> <20071010.022550.21928751.davem@davemloft.net> Message-ID: <20071010102331.GA10496@one.firstfloor.org> On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote: > The chip I was working with at the time (UltraSPARC-IIi) compressed > all the linear stores into 64-byte full cacheline transactions via > the store buffer. That's a pretty old CPU. Conclusions on more modern ones might be different. > In fact, such a thing might not pan out well, because most of the time > you write a single descriptor or two, and that isn't a full cacheline, > which means a read/modify/write is the only coherent way to make such > a write to RAM. x86 WC does R-M-W and is coherent of course. The main difference is just that the result is not cached. When the hardware accesses the cache line then the cache should be also invalidated. > Sure you could batch, but I'd rather give the chip work to do unless > I unequivocably knew I'd have enough pending to fill a cacheline's > worth of descriptors. And since you suggest we shouldn't queue in > software... :-) Hmm, it probably would need to be coupled with batched submission if multiple packets are available you're right. Probably not worth doing explicit queueing though. I suppose it would be an interesting experiment at least. 
-Andi From davem at davemloft.net Wed Oct 10 03:44:46 2007 From: davem at davemloft.net (David Miller) Date: Wed, 10 Oct 2007 03:44:46 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010102331.GA10496@one.firstfloor.org> References: <20071010091644.GA9807@one.firstfloor.org> <20071010.022550.21928751.davem@davemloft.net> <20071010102331.GA10496@one.firstfloor.org> Message-ID: <20071010.034446.85819294.davem@davemloft.net> From: Andi Kleen Date: Wed, 10 Oct 2007 12:23:31 +0200 > On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote: > > The chip I was working with at the time (UltraSPARC-IIi) compressed > > all the linear stores into 64-byte full cacheline transactions via > > the store buffer. > > That's a pretty old CPU. Conclusions on more modern ones might be different. Cache matters, just scale the numbers. > I suppose it would be an interesting experiment at least. Absolutely. I've always gotten very poor results when increasing the TX queue a lot, for example with NIU the point of diminishing returns seems to be in the range of 256-512 TX descriptor entries and this was with 1.6Ghz cpus. 
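[Editorial sketch, not part of any posted message] The cache-line argument in the thread above can be put into rough numbers. The helpers below are only an illustration under stated assumptions: a hypothetical 16-byte TX descriptor, the 64-byte cache line mentioned in the thread, and a ring whose base address is cache-line aligned. None of the function names come from a real driver, and the footprint counts descriptor memory only, not packet buffers or skb metadata.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only; real hardware differs. */
#define CACHELINE_BYTES 64u	/* the 64-byte line mentioned in the thread */
#define DESC_BYTES      16u	/* hypothetical TX descriptor size */

/* Number of descriptors needed to cover one whole cache line. */
static size_t descs_per_cacheline(size_t cacheline, size_t desc)
{
	assert(desc > 0 && cacheline % desc == 0);
	return cacheline / desc;
}

/* Bytes of descriptor memory occupied by a ring of 'entries' descriptors. */
static size_t ring_footprint(size_t entries, size_t desc)
{
	return entries * desc;
}

/*
 * Nonzero if writing 'pending' descriptors starting at ring index 'start'
 * covers only whole cache lines, so that no partially written line (and
 * hence no read/modify/write) is involved. Assumes the ring base itself
 * is cache-line aligned and ignores wrap-around at the end of the ring.
 */
static int fills_whole_lines(size_t start, size_t pending,
			     size_t cacheline, size_t desc)
{
	size_t per_line = descs_per_cacheline(cacheline, desc);

	return pending > 0 && start % per_line == 0 &&
	       pending % per_line == 0;
}
```

With these assumed sizes, a single queued descriptor never fills a line, which is the case described above as needing a read/modify/write; a batch of four descriptors starting on a line boundary does. Under the same assumptions a 256-entry ring occupies 4 KB of descriptor memory and a 512-entry ring 8 KB.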
From vlad at lists.openfabrics.org Wed Oct 10 04:20:46 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 10 Oct 2007 04:20:46 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071010-0200 daily build status Message-ID: <20071010112046.B2C54E60875@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.22 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on 
x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: 
From hadi at cyberus.ca Wed Oct 10 06:08:48 2007 From: hadi at cyberus.ca (jamal) Date: Wed, 10 Oct 2007 09:08:48 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010.034446.85819294.davem@davemloft.net> References: <20071010091644.GA9807@one.firstfloor.org> <20071010.022550.21928751.davem@davemloft.net> <20071010102331.GA10496@one.firstfloor.org> <20071010.034446.85819294.davem@davemloft.net> Message-ID: <1192021728.4853.17.camel@localhost> On Wed, 2007-10-10 at 03:44 -0700, David Miller wrote: > I've always gotten very poor results when increasing the TX queue a > lot, for example with NIU the point of diminishing returns seems to > be in the range of 256-512 TX descriptor entries and this was with > 1.6Ghz cpus. Is it interupt per packet? From my experience, you may find interesting results varying tx interupt mitigation parameters in addition to the ring parameters. Unfortunately when you do that, optimal parameters also depends on packet size. so what may work for 64B, wont work well for 1400B. cheers, jamal From peter.p.waskiewicz.jr at intel.com Wed Oct 10 08:35:31 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Wed, 10 Oct 2007 08:35:31 -0700 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010.034446.85819294.davem@davemloft.net> References: <20071010091644.GA9807@one.firstfloor.org><20071010.022550.21928751.davem@davemloft.net><20071010102331.GA10496@one.firstfloor.org> <20071010.034446.85819294.davem@davemloft.net> Message-ID: > From: Andi Kleen > Date: Wed, 10 Oct 2007 12:23:31 +0200 > > > On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote: > > > The chip I was working with at the time (UltraSPARC-IIi) > compressed > > > all the linear stores into 64-byte full cacheline > transactions via > > > the store buffer. > > > > That's a pretty old CPU. 
Conclusions on more modern ones > might be different. > > Cache matters, just scale the numbers. > > > I suppose it would be an interesting experiment at least. > > Absolutely. > > I've always gotten very poor results when increasing the TX > queue a lot, for example with NIU the point of diminishing > returns seems to be in the range of 256-512 TX descriptor > entries and this was with 1.6Ghz cpus. We've done similar testing with ixgbe to push maximum descriptor counts, and we lost performance very quickly in the same range you're quoting on NIU. Cheers, -PJ Waskiewicz From jackm at dev.mellanox.co.il Wed Oct 10 08:44:21 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 10 Oct 2007 17:44:21 +0200 Subject: [ofa-general] [PATCH v5] IB/mlx4: shrinking WQE In-Reply-To: <20070910142241.GA12546@mellanox.co.il> References: <20070909112917.GA25910@mellanox.co.il> <20070909140201.GD25910@mellanox.co.il> <20070910142241.GA12546@mellanox.co.il> Message-ID: <200710101744.21620.jackm@dev.mellanox.co.il> commit c0aa89f0b295dd0c20b2ff2b1d2eca10cdc84f4b Author: Michael S. Tsirkin Date: Thu Aug 30 15:51:40 2007 +0300 IB/mlx4: shrinking WQE ConnectX supports shrinking wqe, such that a single WR can include multiple units of wqe_shift. This way, WRs can differ in size, and do not have to be a power of 2 in size, saving memory and speeding up send WR posting. Unfortunately, if we do this wqe_index field in CQE can't be used to look up the WR ID anymore, so do this only if selective signalling is off. Further, on 32-bit platforms, we can't use vmap to make the QP buffer virtually contigious. Thus we have to use constant-sized WRs to make sure a WR is always fully within a single page-sized chunk. Finally, we use WR with NOP opcode to avoid wrap-around in the middle of WR. We set NoErrorCompletion bit to avoid getting completions with error for NOP WRs. Since NEC is only supported starting with firmware 2.2.232, we use constant-sized WRs for older firmware. 
And, since MLX QPs only support SEND, we use constant-sized WRs in this case. Signed-off-by: Michael S. Tsirkin --- Changes since v4: fix calls to stamp_send_wqe, and stamping placement inside post_nop_wqe. Found by regression, fixed by Jack Morgenstein. Changes since v3: fix nop formatting. Found by Eli Cohen. Changes since v2: fix memory leak in mlx4_buf_alloc. Found by internal code review. changes since v1: add missing patch hunks Index: infiniband/drivers/infiniband/hw/mlx4/cq.c =================================================================== --- infiniband.orig/drivers/infiniband/hw/mlx4/cq.c 2007-10-10 17:12:05.184757000 +0200 +++ infiniband/drivers/infiniband/hw/mlx4/cq.c 2007-10-10 17:23:02.337140000 +0200 @@ -331,6 +331,12 @@ static int mlx4_ib_poll_one(struct mlx4_ is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP && + is_send)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +359,10 @@ static int mlx4_ib_poll_one(struct mlx4_ if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if ((*cur_qp)->ibqp.srq) { Index: infiniband/drivers/infiniband/hw/mlx4/mlx4_ib.h =================================================================== --- infiniband.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-10-10 17:21:17.844882000 +0200 +++ infiniband/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-10-10 17:23:02.341138000 +0200 @@ -120,6 +120,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int 
sq_spare_wqes; struct mlx4_ib_wq sq; Index: infiniband/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- infiniband.orig/drivers/infiniband/hw/mlx4/qp.c 2007-10-10 17:21:17.853882000 +0200 +++ infiniband/drivers/infiniband/hw/mlx4/qp.c 2007-10-10 17:23:02.350137000 +0200 @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *de static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,88 @@ static void *get_send_wqe(struct mlx4_ib /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). + * + * When max WR is than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. */ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { - u32 *wqe = get_send_wqe(qp, n); + u32 *wqe; int i; + int s; + int ind; + void *buf; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift); + if (qp->sq_max_wqes_per_wr > 1) { + for (i = 0; i < s; i += 64) { + ind = (i >> qp->sq.wqe_shift) + n; + stamp = ind & qp->sq.wqe_cnt ? 
cpu_to_be32(0x7fffffff) : + cpu_to_be32(0xffffffff); + buf = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); + wqe = buf + (i & ((1 << qp->sq.wqe_shift) - 1)); + *wqe = stamp; + } + } else { + buf = get_send_wqe(qp, n); + for (i = 64; i < s; i += 64) { + wqe = buf + i; + *wqe = 0xffffffff; + } + } +} + +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = sizeof(struct mlx4_wqe_ctrl_seg); + + if (qp->ibqp.qp_type == IB_QPT_UD) { + struct mlx4_wqe_datagram_seg *dgram = wqe + sizeof *ctrl; + struct mlx4_av *av = (struct mlx4_av *)dgram->av; + memset(dgram, 0, sizeof *dgram); + av->port_pd = cpu_to_be32((qp->port << 24) | to_mpd(qp->ibqp.pd)->pdn); + s += sizeof(struct mlx4_wqe_datagram_seg); + } + + /* Pad the remainder of the WQE with an inline data segment. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? 
cpu_to_be32(1 << 31) : 0); + + stamp_send_wqe(qp, n + qp->sq_spare_wqes, size); +} + +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -237,6 +310,8 @@ static int set_rq_size(struct mlx4_ib_de static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +327,69 @@ static int set_kernel_sq_size(struct mlx cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this wqe_index field in CQE + * can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. + * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contigious. 
Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. + * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * We set NEC bit to avoid getting completions with error for NOP WRs. + * Since NEC is only supported starting with firmware 2.2.232, + * we use constant-sized WRs for older firmware. + * + * And, since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. + * + * We set WQE size to at least 64 bytes, this way stamping invalidates each WQE. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + if (dev->dev->caps.fw_ver >= MLX4_FW_VER_WQE_CTRL_NEC && + qp->sq_signal_bits && BITS_PER_LONG == 64 && + type != IB_QPT_SMI && type != IB_QPT_GSI) + qp->sq.wqe_shift = ilog2(64); + else + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. 
+ */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +401,8 @@ static int set_kernel_sq_size(struct mlx qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +440,12 @@ static int create_qp_common(struct mlx4_ qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +536,6 @@ static int create_qp_common(struct mlx4_ */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1030,7 @@ static int __mlx4_ib_modify_qp(struct ib ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1266,13 +1392,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int 
uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { @@ -1288,7 +1415,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? @@ -1401,16 +1528,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = ind + qp->sq_spare_wqes; + ind += DIV_ROUND_UP(size * 16, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). 
*/ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size * 16); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1432,8 +1566,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size * 16); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); Index: infiniband/drivers/net/mlx4/alloc.c =================================================================== --- infiniband.orig/drivers/net/mlx4/alloc.c 2007-10-10 17:12:12.259502000 +0200 +++ infiniband/drivers/net/mlx4/alloc.c 2007-10-10 17:23:02.356137000 +0200 @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + kfree(pages); + if (!buf->u.direct.buf) + goto err_free; + } } return 0; @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, dma_free_coherent(&dev->pdev->dev, size, buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, Index: infiniband/include/linux/mlx4/device.h =================================================================== --- infiniband.orig/include/linux/mlx4/device.h 2007-10-10 17:21:17.954882000 +0200 +++ infiniband/include/linux/mlx4/device.h 2007-10-10 17:23:02.363137000 +0200 @@ -133,6 +133,11 @@ enum { MLX4_STAT_RATE_OFFSET = 5 }; +static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) +{ + return (major << 32) | (minor << 16) | subminor; +} + struct 
mlx4_caps { u64 fw_ver; int num_ports; @@ -189,7 +194,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct mlx4_buf_list *page_list; } u; Index: infiniband/include/linux/mlx4/qp.h =================================================================== --- infiniband.orig/include/linux/mlx4/qp.h 2007-10-10 17:12:38.460566000 +0200 +++ infiniband/include/linux/mlx4/qp.h 2007-10-10 17:23:02.366140000 +0200 @@ -154,7 +154,11 @@ struct mlx4_qp_context { u32 reserved5[10]; }; +/* Which firmware version adds support for NEC (NoErrorCompletion) bit */ +#define MLX4_FW_VER_WQE_CTRL_NEC mlx4_fw_ver(2, 2, 232) + enum { + MLX4_WQE_CTRL_NEC = 1 << 29, MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, From eli at mellanox.co.il Wed Oct 10 08:55:32 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 10 Oct 2007 17:55:32 +0200 Subject: [ofa-general] [PATCH] IB/ipoib: optimize receive flow Message-ID: <1192031732.7337.58.camel@mtls03> Optimize IPOIB CM receive flow This patch tries to reduce the number of accesses to the skb object and save CPU cycles and cache misses. 
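[Editorial sketch, not part of the original message] The two ideas in this patch description can be tried outside the kernel. The user-space C below is only an illustration under stated assumptions: it assumes a 4096-byte page, and the names SKETCH_PAGE_SIZE, SKETCH_PAGE_MASK, frag_size_min, frag_size_mask and sum_frag_sizes are hypothetical stand-ins, not identifiers from ipoib_cm.c. It shows that the mask-based per-fragment size gives the same result as clamping with min(), and that the length totals can be accumulated in locals and written back once after the loop rather than touching the skb on every iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch; the kernel uses PAGE_SIZE and PAGE_MASK. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1u))

/* Pre-patch form: clamp the remaining length to one page. */
static unsigned int frag_size_min(unsigned int length)
{
    return length < SKETCH_PAGE_SIZE ? length : SKETCH_PAGE_SIZE;
}

/*
 * Post-patch form: if any bit at or above the page size is set, the
 * fragment is a full page; otherwise it is the low bits of the length.
 */
static unsigned int frag_size_mask(unsigned int length)
{
    return (length & SKETCH_PAGE_MASK) ? SKETCH_PAGE_SIZE
                                       : (length & (SKETCH_PAGE_SIZE - 1u));
}

/*
 * Walk num_frags fragments for a payload of `length` bytes. Totals are
 * kept in locals and reported once at the end, mirroring the idea of
 * updating the skb fields a single time after the loop.
 * Returns the number of bytes placed; *unused_frags receives the count
 * of fragments that were not needed.
 */
static unsigned int sum_frag_sizes(unsigned int length, int num_frags,
                                   int *unused_frags)
{
    unsigned int used_size = 0;
    int unused = 0;
    int i;

    assert(unused_frags != NULL);

    for (i = 0; i < num_frags; ++i) {
        unsigned int size;

        if (length == 0) {
            /* This page is not needed for the received data. */
            ++unused;
            continue;
        }
        size = frag_size_mask(length);
        used_size += size;
        length -= size;
    }

    *unused_frags = unused;
    return used_size;
}
```

For example, with a 5000-byte payload spread over 4 fragments, the first fragment takes 4096 bytes, the second takes 904, and the remaining 2 fragments are counted as unused.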
Signed-off-by: Eli Cohen Index: ofa_kernel-1.2.5/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- ofa_kernel-1.2.5.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-10 15:10:27.000000000 +0200 +++ ofa_kernel-1.2.5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-10 15:35:01.000000000 +0200 @@ -374,6 +374,8 @@ static void skb_put_frags(struct sk_buff { int i, num_frags; unsigned int size; + int unused_frags = 0; + unsigned int used_size = 0; /* put header into skb */ size = min(length, hdr_space); @@ -382,23 +384,25 @@ static void skb_put_frags(struct sk_buff length -= size; num_frags = skb_shinfo(skb)->nr_frags; - for (i = 0; i < num_frags; i++) { + for (i = 0; i < num_frags; ++i) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; if (length == 0) { /* don't need this page */ skb_fill_page_desc(toskb, i, frag->page, 0, PAGE_SIZE); - --skb_shinfo(skb)->nr_frags; + ++unused_frags; } else { - size = min(length, (unsigned) PAGE_SIZE); + size = length & PAGE_MASK ? 
PAGE_SIZE : length & (PAGE_SIZE - 1); frag->size = size; - skb->data_len += size; - skb->truesize += size; - skb->len += size; + used_size += size; length -= size; } } + skb->data_len += used_size; + skb->truesize += used_size; + skb->len += used_size; + skb_shinfo(skb)->nr_frags -= unused_frags; } void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) @@ -437,7 +441,7 @@ void ipoib_cm_handle_rx_wc(struct net_de goto repost; } - if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { + if (unlikely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { spin_lock_irqsave(&priv->lock, flags); From eli at mellanox.co.il Wed Oct 10 08:55:37 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 10 Oct 2007 17:55:37 +0200 Subject: [ofa-general] [PATCH] IB/mthca: optimize post srq Message-ID: <1192031738.7337.59.camel@mtls03> Put likely/unlikely in post srq Signed-off-by: Eli Cohen --- If this approach is accepted I can do the same for mlx4 Index: ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_srq.c =================================================================== --- ofa_kernel-1.2.5.orig/drivers/infiniband/hw/mthca/mthca_srq.c 2007-10-10 15:18:20.000000000 +0200 +++ ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_srq.c 2007-10-10 15:18:40.000000000 +0200 @@ -509,7 +509,7 @@ int mthca_tavor_post_srq_recv(struct ib_ for (nreq = 0; wr; wr = wr->next) { ind = srq->first_free; - if (ind < 0) { + if (unlikely(ind < 0)) { mthca_err(dev, "SRQ %06x full\n", srq->srqn); err = -ENOMEM; *bad_wr = wr; @@ -519,7 +519,7 @@ int mthca_tavor_post_srq_recv(struct ib_ wqe = get_wqe(srq, ind); next_ind = *wqe_to_link(wqe); - if (next_ind < 0) { + if (unlikely(next_ind < 0)) { mthca_err(dev, "SRQ %06x full\n", srq->srqn); err = -ENOMEM; *bad_wr = wr; @@ -631,7 +631,7 @@ int mthca_arbel_post_srq_recv(struct ib_ for (nreq = 0; wr; ++nreq, wr = wr->next) { ind = srq->first_free; - if (ind < 0) { + if 
(unlikely(ind < 0)) { mthca_err(dev, "SRQ %06x full\n", srq->srqn); err = -ENOMEM; *bad_wr = wr; @@ -641,7 +641,7 @@ int mthca_arbel_post_srq_recv(struct ib_ wqe = get_wqe(srq, ind); next_ind = *wqe_to_link(wqe); - if (next_ind < 0) { + if (unlikely(next_ind < 0)) { mthca_err(dev, "SRQ %06x full\n", srq->srqn); err = -ENOMEM; *bad_wr = wr; From eli at mellanox.co.il Wed Oct 10 08:55:40 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 10 Oct 2007 17:55:40 +0200 Subject: [ofa-general] [PATCH 1/3]: IB/core: allow lockless SRQ Message-ID: <1192031740.7337.60.camel@mtls03> Allow to modify a SRQ to be lockless This patch allow the consumer to call ib_modify_srq and specify whether the SRQ is lockless or not. Signed-off-by: Eli Cohen --- This allows the consumer to decide if it needs a lock or not. IPOIB CM for example does not need it and can benefit from this approach. Index: ofa_kernel-1.2.5/include/rdma/ib_verbs.h =================================================================== --- ofa_kernel-1.2.5.orig/include/rdma/ib_verbs.h 2007-10-10 15:29:36.000000000 +0200 +++ ofa_kernel-1.2.5/include/rdma/ib_verbs.h 2007-10-10 15:35:58.000000000 +0200 @@ -442,18 +442,21 @@ enum ib_cq_notify_flags { enum ib_srq_attr_mask { IB_SRQ_MAX_WR = 1 << 0, IB_SRQ_LIMIT = 1 << 1, + IB_SRQ_LOCKNESS = 1 << 2, }; struct ib_srq_attr { u32 max_wr; u32 max_sge; u32 srq_limit; + int use_lock; }; struct ib_srq_init_attr { void (*event_handler)(struct ib_event *, void *); void *srq_context; struct ib_srq_attr attr; + u32 flags; }; struct ib_qp_cap { From eli at mellanox.co.il Wed Oct 10 08:55:42 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 10 Oct 2007 17:55:42 +0200 Subject: [ofa-general] [PATCH 2/3]: IB/mthca: allow lockless SRQ Message-ID: <1192031742.7337.61.camel@mtls03> Add support to mthca for lockless SRQ Signed-off-by: Eli Cohen --- Index: ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_srq.c =================================================================== --- 
ofa_kernel-1.2.5.orig/drivers/infiniband/hw/mthca/mthca_srq.c 2007-10-10 15:18:40.000000000 +0200 +++ ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_srq.c 2007-10-10 15:24:05.000000000 +0200 @@ -394,6 +394,9 @@ int mthca_modify_srq(struct ib_srq *ibsr return -EINVAL; } + if (attr_mask & IB_SRQ_LOCKNESS) + srq->use_lock = !!attr->use_lock; + return 0; } @@ -473,7 +476,8 @@ void mthca_free_srq_wqe(struct mthca_srq ind = wqe_addr >> srq->wqe_shift; - spin_lock(&srq->lock); + if (srq->use_lock) + spin_lock(&srq->lock); if (likely(srq->first_free >= 0)) *wqe_to_link(get_wqe(srq, srq->last_free)) = ind; @@ -483,7 +487,8 @@ void mthca_free_srq_wqe(struct mthca_srq *wqe_to_link(get_wqe(srq, ind)) = -1; srq->last_free = ind; - spin_unlock(&srq->lock); + if (srq->use_lock) + spin_unlock(&srq->lock); } int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, @@ -502,7 +507,8 @@ int mthca_tavor_post_srq_recv(struct ib_ void *wqe; void *prev_wqe; - spin_lock_irqsave(&srq->lock, flags); + if (srq->use_lock) + spin_lock_irqsave(&srq->lock, flags); first_ind = srq->first_free; @@ -609,7 +615,9 @@ int mthca_tavor_post_srq_recv(struct ib_ */ mmiowb(); - spin_unlock_irqrestore(&srq->lock, flags); + if (srq->use_lock) + spin_unlock_irqrestore(&srq->lock, flags); + return err; } @@ -626,7 +634,8 @@ int mthca_arbel_post_srq_recv(struct ib_ int i; void *wqe; - spin_lock_irqsave(&srq->lock, flags); + if (srq->use_lock) + spin_lock_irqsave(&srq->lock, flags); for (nreq = 0; wr; ++nreq, wr = wr->next) { ind = srq->first_free; @@ -692,7 +701,9 @@ int mthca_arbel_post_srq_recv(struct ib_ *srq->db = cpu_to_be32(srq->counter); } - spin_unlock_irqrestore(&srq->lock, flags); + if (srq->use_lock) + spin_unlock_irqrestore(&srq->lock, flags); + return err; } Index: ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_provider.h =================================================================== --- ofa_kernel-1.2.5.orig/drivers/infiniband/hw/mthca/mthca_provider.h 2007-10-10 
15:10:22.000000000 +0200 +++ ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_provider.h 2007-10-10 15:24:05.000000000 +0200 @@ -222,6 +222,7 @@ struct mthca_cq { struct mthca_srq { struct ib_srq ibsrq; spinlock_t lock; + int use_lock; int refcount; int srqn; int max; From eli at mellanox.co.il Wed Oct 10 08:55:47 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 10 Oct 2007 17:55:47 +0200 Subject: [ofa-general] [PATCH 3/3]: IB/ipoib: use lockless SRQ in IPOIB CM Message-ID: <1192031747.7337.62.camel@mtls03> Modify IPOIB CM to use a lockless SRQ IPOIB CM uses NAPI which allows the poll function to be lockless. This patches modifies IPOIB to utilize this. Signed-off-by: Eli Cohen --- Index: ofa_kernel-1.2.5/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- ofa_kernel-1.2.5.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-10 15:10:30.000000000 +0200 +++ ofa_kernel-1.2.5/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-10 15:28:52.000000000 +0200 @@ -1292,6 +1292,9 @@ int ipoib_cm_dev_init(struct net_device .max_sge = IPOIB_CM_RX_SG } }; + struct ib_srq_attr attr = { + .use_lock = 0, + }; int ret, i; INIT_LIST_HEAD(&priv->cm.passive_ids); @@ -1316,6 +1319,12 @@ int ipoib_cm_dev_init(struct net_device return ret; } + ret = ib_modify_srq(priv->cm.srq, &attr, IB_SRQ_LOCKNESS); + if (ret) { + ipoib_cm_dev_cleanup(dev); + return ret; + } + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, GFP_KERNEL); if (!priv->cm.srq_ring) { From andi at firstfloor.org Wed Oct 10 09:02:11 2007 From: andi at firstfloor.org (Andi Kleen) Date: Wed, 10 Oct 2007 18:02:11 +0200 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: References: <20071010.034446.85819294.davem@davemloft.net> Message-ID: <20071010160211.GA14239@one.firstfloor.org> > We've done similar testing with ixgbe to push maximum descriptor counts, > and we lost performance very quickly in the same 
range you're quoting on > NIU. Did you try it with WC writes to the ring or CLFLUSH? -Andi 
From billfink at mindspring.com Wed Oct 10 09:02:15 2007 From: billfink at mindspring.com (Bill Fink) Date: Wed, 10 Oct 2007 12:02:15 -0400 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009.170435.43504422.davem@davemloft.net> References: <20071009135340.33e5922c@freepuppy.rosehill> <20071009.142235.74385364.davem@davemloft.net> <1191967006.5324.14.camel@localhost> <20071009.170435.43504422.davem@davemloft.net> Message-ID: <20071010120215.7ec19323.billfink@mindspring.com> On Tue, 09 Oct 2007, David Miller wrote: > From: jamal > Date: Tue, 09 Oct 2007 17:56:46 -0400 > > > if the h/ware queues are full because of link pressure etc, you drop. We > > drop today when the s/ware queues are full. The driver txmit lock takes > > place of the qdisc queue lock etc. I am assuming there is still need for > > that locking. The filter/classification scheme still works as is and > > select classes which map to rings. tc still works as is etc. > > I understand your suggestion. > > We have to keep in mind, however, that the sw queue right now is 1000 > packets. I heavily discourage any driver author to try and use any > single TX queue of that size. Which means that just dropping on back > pressure might not work so well. > > Or it might be perfect and signal TCP to backoff, who knows! :-) I can't remember the details anymore, but for 10-GigE, I have encountered cases where I was able to significantly increase TCP performance by increasing the txqueuelen to 10000, which is the setting I now use for any 10-GigE testing. 
-Bill From changquing.tang at hp.com Wed Oct 10 09:09:48 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 10 Oct 2007 16:09:48 -0000 Subject: [ofa-general] [PATCH v5] IB/mlx4: shrinking WQE In-Reply-To: <200710101744.21620.jackm@dev.mellanox.co.il> References: <20070909112917.GA25910@mellanox.co.il><20070909140201.GD25910@mellanox.co.il><20070910142241.GA12546@mellanox.co.il> <200710101744.21620.jackm@dev.mellanox.co.il> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403027D232D@G3W0634.americas.hpqcorp.net> Can you provide sample code to use these new features ? --CQ > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Jack Morgenstein > Sent: Wednesday, October 10, 2007 10:44 AM > To: general at lists.openfabrics.org > Cc: Roland Dreier > Subject: [ofa-general] [PATCH v5] IB/mlx4: shrinking WQE > > commit c0aa89f0b295dd0c20b2ff2b1d2eca10cdc84f4b > Author: Michael S. Tsirkin > Date: Thu Aug 30 15:51:40 2007 +0300 > > IB/mlx4: shrinking WQE > > ConnectX supports shrinking wqe, such that a single WR can include > multiple units of wqe_shift. This way, WRs can differ in > size, and > do not have to be a power of 2 in size, saving memory and > speeding up > send WR posting. Unfortunately, if we do this wqe_index > field in CQE > can't be used to look up the WR ID anymore, so do this only if > selective signalling is off. > > Further, on 32-bit platforms, we can't use vmap to make > the QP buffer virtually contigious. Thus we have to use > constant-sized WRs to make sure a WR is always fully within > a single page-sized chunk. > > Finally, we use WR with NOP opcode to avoid wrap-around > in the middle of WR. We set NoErrorCompletion bit to avoid getting > completions with error for NOP WRs. Since NEC is only supported > starting with firmware 2.2.232, we use constant-sized WRs > for older firmware. 
And, since MLX QPs only support SEND, we use > constant-sized WRs in this case. > > Signed-off-by: Michael S. Tsirkin > > --- > > Changes since v4: fix calls to stamp_send_wqe, and stamping placement > inside post_nop_wqe. > Found by regression, fixed by Jack Morgenstein. > Changes since v3: fix nop formatting. > Found by Eli Cohen. > Changes since v2: fix memory leak in mlx4_buf_alloc. > Found by internal code review. > changes since v1: add missing patch hunks > > Index: infiniband/drivers/infiniband/hw/mlx4/cq.c > =================================================================== > --- infiniband.orig/drivers/infiniband/hw/mlx4/cq.c > 2007-10-10 17:12:05.184757000 +0200 > +++ infiniband/drivers/infiniband/hw/mlx4/cq.c > 2007-10-10 17:23:02.337140000 +0200 > @@ -331,6 +331,12 @@ static int mlx4_ib_poll_one(struct mlx4_ > is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == > MLX4_CQE_OPCODE_ERROR; > > + if (unlikely((cqe->owner_sr_opcode & > MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP && > + is_send)) { > + printk(KERN_WARNING "Completion for NOP opcode > detected!\n"); > + return -EINVAL; > + } > + > if (!*cur_qp || > (be32_to_cpu(cqe->my_qpn) & 0xffffff) != > (*cur_qp)->mqp.qpn) { > /* > @@ -353,8 +359,10 @@ static int mlx4_ib_poll_one(struct mlx4_ > > if (is_send) { > wq = &(*cur_qp)->sq; > - wqe_ctr = be16_to_cpu(cqe->wqe_index); > - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); > + if (!(*cur_qp)->sq_signal_bits) { > + wqe_ctr = be16_to_cpu(cqe->wqe_index); > + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); > + } > wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; > ++wq->tail; > } else if ((*cur_qp)->ibqp.srq) { > Index: infiniband/drivers/infiniband/hw/mlx4/mlx4_ib.h > =================================================================== > --- infiniband.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h > 2007-10-10 17:21:17.844882000 +0200 > +++ infiniband/drivers/infiniband/hw/mlx4/mlx4_ib.h > 2007-10-10 17:23:02.341138000 +0200 > @@ -120,6 +120,8 @@ struct 
mlx4_ib_qp { > > u32 doorbell_qpn; > __be32 sq_signal_bits; > + unsigned sq_next_wqe; > + int sq_max_wqes_per_wr; > int sq_spare_wqes; > struct mlx4_ib_wq sq; > > Index: infiniband/drivers/infiniband/hw/mlx4/qp.c > =================================================================== > --- infiniband.orig/drivers/infiniband/hw/mlx4/qp.c > 2007-10-10 17:21:17.853882000 +0200 > +++ infiniband/drivers/infiniband/hw/mlx4/qp.c > 2007-10-10 17:23:02.350137000 +0200 > @@ -30,6 +30,7 @@ > * SOFTWARE. > */ > > +#include > #include > #include > > @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *de > > static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { > - if (qp->buf.nbufs == 1) > + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) > return qp->buf.u.direct.buf + offset; > else > return qp->buf.u.page_list[offset >> > PAGE_SHIFT].buf + @@ -111,16 +112,88 @@ static void > *get_send_wqe(struct mlx4_ib > > /* > * Stamp a SQ WQE so that it is invalid if prefetched by marking the > - * first four bytes of every 64 byte chunk with 0xffffffff, > except for > - * the very first chunk of the WQE. > + * first four bytes of every 64 byte chunk with > + * 0x7FFFFFF | (invalid_ownership_value << 31). > + * > + * When max WR is than or equal to the WQE size, > + * as an optimization, we can stamp WQE with 0xffffffff, > + * and skip the very first chunk of the WQE. > */ > -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) > +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) > { > - u32 *wqe = get_send_wqe(qp, n); > + u32 *wqe; > int i; > + int s; > + int ind; > + void *buf; > + __be32 stamp; > + > + s = roundup(size, 1 << qp->sq.wqe_shift); > + if (qp->sq_max_wqes_per_wr > 1) { > + for (i = 0; i < s; i += 64) { > + ind = (i >> qp->sq.wqe_shift) + n; > + stamp = ind & qp->sq.wqe_cnt ? 
> cpu_to_be32(0x7fffffff) : > + > cpu_to_be32(0xffffffff); > + buf = get_send_wqe(qp, ind & > (qp->sq.wqe_cnt - 1)); > + wqe = buf + (i & ((1 << qp->sq.wqe_shift) - 1)); > + *wqe = stamp; > + } > + } else { > + buf = get_send_wqe(qp, n); > + for (i = 64; i < s; i += 64) { > + wqe = buf + i; > + *wqe = 0xffffffff; > + } > + } > +} > + > +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) { > + struct mlx4_wqe_ctrl_seg *ctrl; > + struct mlx4_wqe_inline_seg *inl; > + void *wqe; > + int s; > + > + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); > + s = sizeof(struct mlx4_wqe_ctrl_seg); > + > + if (qp->ibqp.qp_type == IB_QPT_UD) { > + struct mlx4_wqe_datagram_seg *dgram = wqe + > sizeof *ctrl; > + struct mlx4_av *av = (struct mlx4_av *)dgram->av; > + memset(dgram, 0, sizeof *dgram); > + av->port_pd = cpu_to_be32((qp->port << 24) | > to_mpd(qp->ibqp.pd)->pdn); > + s += sizeof(struct mlx4_wqe_datagram_seg); > + } > + > + /* Pad the remainder of the WQE with an inline data segment. */ > + if (size > s) { > + inl = wqe + s; > + inl->byte_count = cpu_to_be32(1 << 31 | (size - > s - sizeof *inl)); > + } > + ctrl->srcrb_flags = 0; > + ctrl->fence_size = size / 16; > + /* > + * Make sure descriptor is fully written before > + * setting ownership bit (because HW can start > + * executing as soon as we do). > + */ > + wmb(); > > - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) > - wqe[i] = 0xffffffff; > + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | > MLX4_WQE_CTRL_NEC) | > + (n & qp->sq.wqe_cnt ? 
cpu_to_be32(1 << 31) : 0); > + > + stamp_send_wqe(qp, n + qp->sq_spare_wqes, size); } > + > +/* Post NOP WQE to prevent wrap-around in the middle of WR */ static > +inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) { > + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); > + if (unlikely(s < qp->sq_max_wqes_per_wr)) { > + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); > + ind += s; > + } > + return ind; > } > > static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum > mlx4_event type) @@ -237,6 +310,8 @@ static int > set_rq_size(struct mlx4_ib_de static int > set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, > enum ib_qp_type type, struct > mlx4_ib_qp *qp) { > + int s; > + > /* Sanity check SQ size before proceeding */ > if (cap->max_send_wr > dev->dev->caps.max_wqes || > cap->max_send_sge > dev->dev->caps.max_sq_sg || > @@ -252,20 +327,69 @@ static int set_kernel_sq_size(struct mlx > cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) > return -EINVAL; > > - qp->sq.wqe_shift = > ilog2(roundup_pow_of_two(max(cap->max_send_sge * > - sizeof > (struct mlx4_wqe_data_seg), > - > cap->max_inline_data + > - sizeof > (struct mlx4_wqe_inline_seg)) + > - send_wqe_overhead(type))); > - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - > send_wqe_overhead(type)) / > - sizeof (struct mlx4_wqe_data_seg); > + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), > + cap->max_inline_data + sizeof (struct > mlx4_wqe_inline_seg)) + > + send_wqe_overhead(type); > > /* > - * We need to leave 2 KB + 1 WQE of headroom in the SQ to > - * allow HW to prefetch. > + * Hermon supports shrinking wqe, such that a single WR > can include > + * multiple units of wqe_shift. This way, WRs can > differ in size, and > + * do not have to be a power of 2 in size, saving > memory and speeding up > + * send WR posting. 
Unfortunately, if we do this > wqe_index field in CQE > + * can't be used to look up the WR ID anymore, so do > this only if > + * selective signalling is off. > + * > + * Further, on 32-bit platforms, we can't use vmap to make > + * the QP buffer virtually contigious. Thus we have to use > + * constant-sized WRs to make sure a WR is always fully within > + * a single page-sized chunk. > + * > + * Finally, we use NOP opcode to avoid wrap-around in > the middle of WR. > + * We set NEC bit to avoid getting completions with > error for NOP WRs. > + * Since NEC is only supported starting with firmware 2.2.232, > + * we use constant-sized WRs for older firmware. > + * > + * And, since MLX QPs only support SEND, we use > constant-sized WRs in this > + * case. > + * > + * We look for the smallest value of wqe_shift such > that the resulting > + * number of wqes does not exceed device capabilities. > + * > + * We set WQE size to at least 64 bytes, this way > stamping invalidates each WQE. > */ > - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; > - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + > qp->sq_spare_wqes); > + if (dev->dev->caps.fw_ver >= MLX4_FW_VER_WQE_CTRL_NEC && > + qp->sq_signal_bits && BITS_PER_LONG == 64 && > + type != IB_QPT_SMI && type != IB_QPT_GSI) > + qp->sq.wqe_shift = ilog2(64); > + else > + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); > + > + for (;;) { > + if (1 << qp->sq.wqe_shift > > dev->dev->caps.max_sq_desc_sz) > + return -EINVAL; > + > + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << > qp->sq.wqe_shift); > + > + /* > + * We need to leave 2 KB + 1 WR of headroom in the SQ to > + * allow HW to prefetch. 
> + */ > + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) > + qp->sq_max_wqes_per_wr; > + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * > + > qp->sq_max_wqes_per_wr + > + qp->sq_spare_wqes); > + > + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) > + break; > + > + if (qp->sq_max_wqes_per_wr <= 1) > + return -EINVAL; > + > + ++qp->sq.wqe_shift; > + } > + > + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - > + send_wqe_overhead(type)) / sizeof > (struct mlx4_wqe_data_seg); > > qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + > (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 > +401,8 @@ static int set_kernel_sq_size(struct mlx > qp->sq.offset = 0; > } > > - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - > qp->sq_spare_wqes; > + cap->max_send_wr = qp->sq.max_post = > + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / > qp->sq_max_wqes_per_wr; > cap->max_send_sge = qp->sq.max_gs; > /* We don't support inline sends for kernel QPs (yet) */ > cap->max_inline_data = 0; > @@ -315,6 +440,12 @@ static int create_qp_common(struct mlx4_ > qp->rq.tail = 0; > qp->sq.head = 0; > qp->sq.tail = 0; > + qp->sq_next_wqe = 0; > + > + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) > + qp->sq_signal_bits = > cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); > + else > + qp->sq_signal_bits = 0; > > err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, > !!init_attr->srq, qp); > if (err) > @@ -405,11 +536,6 @@ static int create_qp_common(struct mlx4_ > */ > qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); > > - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) > - qp->sq_signal_bits = > cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); > - else > - qp->sq_signal_bits = 0; > - > qp->mqp.event = mlx4_ib_qp_event; > > return 0; > @@ -904,7 +1030,7 @@ static int __mlx4_ib_modify_qp(struct ib > ctrl = get_send_wqe(qp, i); > ctrl->owner_opcode = cpu_to_be32(1 << 31); > > - stamp_send_wqe(qp, i); > + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); > } > } > > @@ -1266,13 +1392,14 @@ int 
mlx4_ib_post_send(struct ib_qp *ibqp > unsigned long flags; > int nreq; > int err = 0; > - int ind; > - int size; > + unsigned ind; > + int uninitialized_var(stamp); > + int uninitialized_var(size); > int i; > > spin_lock_irqsave(&qp->rq.lock, flags); > > - ind = qp->sq.head; > + ind = qp->sq_next_wqe; > > for (nreq = 0; wr; ++nreq, wr = wr->next) { > if (mlx4_wq_overflow(&qp->sq, nreq, > qp->ibqp.send_cq)) { @@ -1288,7 +1415,7 @@ int > mlx4_ib_post_send(struct ib_qp *ibqp > } > > ctrl = wqe = get_send_wqe(qp, ind & > (qp->sq.wqe_cnt - 1)); > - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; > + qp->sq.wrid[(qp->sq.head + nreq) & > (qp->sq.wqe_cnt - 1)] = wr->wr_id; > > ctrl->srcrb_flags = > (wr->send_flags & IB_SEND_SIGNALED ? > @@ -1401,16 +1528,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp > ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | > (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 > << 31) : 0); > > + stamp = ind + qp->sq_spare_wqes; > + ind += DIV_ROUND_UP(size * 16, 1 << qp->sq.wqe_shift); > + > /* > * We can improve latency by not stamping the last > * send queue WQE until after ringing the doorbell, so > * only stamp here if there are still more WQEs to post. > + * > + * Same optimization applies to padding with NOP wqe > + * in case of WQE shrinking (used to prevent wrap-around > + * in the middle of WR). 
> */ > - if (wr->next) > - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & > - (qp->sq.wqe_cnt - 1)); > + if (wr->next) { > + stamp_send_wqe(qp, stamp, size * 16); > + ind = pad_wraparound(qp, ind); > + } > > - ++ind; > } > > out: > @@ -1432,8 +1566,10 @@ out: > */ > mmiowb(); > > - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & > - (qp->sq.wqe_cnt - 1)); > + stamp_send_wqe(qp, stamp, size * 16); > + > + ind = pad_wraparound(qp, ind); > + qp->sq_next_wqe = ind; > } > > spin_unlock_irqrestore(&qp->rq.lock, flags); > Index: infiniband/drivers/net/mlx4/alloc.c > =================================================================== > --- infiniband.orig/drivers/net/mlx4/alloc.c 2007-10-10 > 17:12:12.259502000 +0200 > +++ infiniband/drivers/net/mlx4/alloc.c 2007-10-10 > 17:23:02.356137000 +0200 > @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, > > memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); > } > + > + if (BITS_PER_LONG == 64) { > + struct page **pages; > + pages = kmalloc(sizeof *pages * > buf->nbufs, GFP_KERNEL); > + if (!pages) > + goto err_free; > + for (i = 0; i < buf->nbufs; ++i) > + pages[i] = > virt_to_page(buf->u.page_list[i].buf); > + buf->u.direct.buf = vmap(pages, > buf->nbufs, VM_MAP, PAGE_KERNEL); > + kfree(pages); > + if (!buf->u.direct.buf) > + goto err_free; > + } > } > > return 0; > @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, > dma_free_coherent(&dev->pdev->dev, size, > buf->u.direct.buf, > buf->u.direct.map); > else { > + if (BITS_PER_LONG == 64) > + vunmap(buf->u.direct.buf); > + > for (i = 0; i < buf->nbufs; ++i) > dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, > buf->u.page_list[i].buf, > Index: infiniband/include/linux/mlx4/device.h > =================================================================== > --- infiniband.orig/include/linux/mlx4/device.h > 2007-10-10 17:21:17.954882000 +0200 > +++ infiniband/include/linux/mlx4/device.h 2007-10-10 > 17:23:02.363137000 +0200 > @@ -133,6 +133,11 @@ enum { > 
MLX4_STAT_RATE_OFFSET = 5 > }; > > +static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) { > + return (major << 32) | (minor << 16) | subminor; } > + > struct mlx4_caps { > u64 fw_ver; > int num_ports; > @@ -189,7 +194,7 @@ struct mlx4_buf_list { }; > > struct mlx4_buf { > - union { > + struct { > struct mlx4_buf_list direct; > struct mlx4_buf_list *page_list; > } u; > Index: infiniband/include/linux/mlx4/qp.h > =================================================================== > --- infiniband.orig/include/linux/mlx4/qp.h 2007-10-10 > 17:12:38.460566000 +0200 > +++ infiniband/include/linux/mlx4/qp.h 2007-10-10 > 17:23:02.366140000 +0200 > @@ -154,7 +154,11 @@ struct mlx4_qp_context { > u32 reserved5[10]; > }; > > +/* Which firmware version adds support for NEC > (NoErrorCompletion) bit > +*/ #define MLX4_FW_VER_WQE_CTRL_NEC mlx4_fw_ver(2, 2, 232) > + > enum { > + MLX4_WQE_CTRL_NEC = 1 << 29, > MLX4_WQE_CTRL_FENCE = 1 << 6, > MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, > MLX4_WQE_CTRL_SOLICITED = 1 << 1, > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Wed Oct 10 09:15:17 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 10 Oct 2007 09:15:17 -0700 Subject: [ofa-general] [PATCH v5] IB/mlx4: shrinking WQE In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403027D232D@G3W0634.americas.hpqcorp.net> (Changqing Tang's message of "Wed, 10 Oct 2007 16:09:48 -0000") References: <20070909112917.GA25910@mellanox.co.il> <20070909140201.GD25910@mellanox.co.il> <20070910142241.GA12546@mellanox.co.il> <200710101744.21620.jackm@dev.mellanox.co.il> <349DCDA352EACF42A0C49FA6DCEA8403027D232D@G3W0634.americas.hpqcorp.net> Message-ID: > Can you provide sample code to use these new features ? 
There are no new features, it's purely an internal driver optimization. From changquing.tang at hp.com Wed Oct 10 09:22:44 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 10 Oct 2007 16:22:44 -0000 Subject: [ofa-general] [PATCH 2/3]: IB/mthca: allow lockless SRQ In-Reply-To: <1192031742.7337.61.camel@mtls03> References: <1192031742.7337.61.camel@mtls03> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403027D2385@G3W0634.americas.hpqcorp.net> Can give a few more words about lockless SRQ ? Thanks --CQ > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Eli Cohen > Sent: Wednesday, October 10, 2007 10:56 AM > To: Roland Dreier > Cc: openfabrics > Subject: [ofa-general] [PATCH 2/3]: IB/mthca: allow lockless SRQ > > Add support to mthca for lockless SRQ > > Signed-off-by: Eli Cohen > > --- > > Index: ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_srq.c > =================================================================== > --- > ofa_kernel-1.2.5.orig/drivers/infiniband/hw/mthca/mthca_srq.c > 2007-10-10 15:18:40.000000000 +0200 > +++ ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_srq.c > 2007-10-10 15:24:05.000000000 +0200 > @@ -394,6 +394,9 @@ int mthca_modify_srq(struct ib_srq *ibsr > return -EINVAL; > } > > + if (attr_mask & IB_SRQ_LOCKNESS) > + srq->use_lock = !!attr->use_lock; > + > return 0; > } > > @@ -473,7 +476,8 @@ void mthca_free_srq_wqe(struct mthca_srq > > ind = wqe_addr >> srq->wqe_shift; > > - spin_lock(&srq->lock); > + if (srq->use_lock) > + spin_lock(&srq->lock); > > if (likely(srq->first_free >= 0)) > *wqe_to_link(get_wqe(srq, srq->last_free)) = > ind; @@ -483,7 +487,8 @@ void mthca_free_srq_wqe(struct mthca_srq > *wqe_to_link(get_wqe(srq, ind)) = -1; > srq->last_free = ind; > > - spin_unlock(&srq->lock); > + if (srq->use_lock) > + spin_unlock(&srq->lock); > } > > int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct > ib_recv_wr *wr, @@ 
-502,7 +507,8 @@ int > mthca_tavor_post_srq_recv(struct ib_ > void *wqe; > void *prev_wqe; > > - spin_lock_irqsave(&srq->lock, flags); > + if (srq->use_lock) > + spin_lock_irqsave(&srq->lock, flags); > > first_ind = srq->first_free; > > @@ -609,7 +615,9 @@ int mthca_tavor_post_srq_recv(struct ib_ > */ > mmiowb(); > > - spin_unlock_irqrestore(&srq->lock, flags); > + if (srq->use_lock) > + spin_unlock_irqrestore(&srq->lock, flags); > + > return err; > } > > @@ -626,7 +634,8 @@ int mthca_arbel_post_srq_recv(struct ib_ > int i; > void *wqe; > > - spin_lock_irqsave(&srq->lock, flags); > + if (srq->use_lock) > + spin_lock_irqsave(&srq->lock, flags); > > for (nreq = 0; wr; ++nreq, wr = wr->next) { > ind = srq->first_free; > @@ -692,7 +701,9 @@ int mthca_arbel_post_srq_recv(struct ib_ > *srq->db = cpu_to_be32(srq->counter); > } > > - spin_unlock_irqrestore(&srq->lock, flags); > + if (srq->use_lock) > + spin_unlock_irqrestore(&srq->lock, flags); > + > return err; > } > > Index: ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_provider.h > =================================================================== > --- > ofa_kernel-1.2.5.orig/drivers/infiniband/hw/mthca/mthca_pro > vider.h 2007-10-10 15:10:22.000000000 +0200 > +++ > ofa_kernel-1.2.5/drivers/infiniband/hw/mthca/mthca_provider.h > 2007-10-10 15:24:05.000000000 +0200 > @@ -222,6 +222,7 @@ struct mthca_cq { > struct mthca_srq { > struct ib_srq ibsrq; > spinlock_t lock; > + int use_lock; > int refcount; > int srqn; > int max; > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From eli at mellanox.co.il Wed Oct 10 09:26:45 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 10 Oct 2007 18:26:45 +0200 Subject: [ofa-general] [PATCH 2/3]: IB/mthca: allow lockless SRQ In-Reply-To: 
<349DCDA352EACF42A0C49FA6DCEA8403027D2385@G3W0634.americas.hpqcorp.net> References: <1192031742.7337.61.camel@mtls03> <349DCDA352EACF42A0C49FA6DCEA8403027D2385@G3W0634.americas.hpqcorp.net> Message-ID: <1192033605.7337.78.camel@mtls03> > Can give a few more words about lockless SRQ ? Thanks > The idea is that if the consumer knows that calls to ib_poll_cq and ib_post_srq_recv are serialized, then you don't need to use a spinlock to serialize access to the SRQ's data structures. From peter.p.waskiewicz.jr at intel.com Wed Oct 10 09:42:28 2007 From: peter.p.waskiewicz.jr at intel.com (Waskiewicz Jr, Peter P) Date: Wed, 10 Oct 2007 09:42:28 -0700 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010160211.GA14239@one.firstfloor.org> References: <20071010.034446.85819294.davem@davemloft.net> <20071010160211.GA14239@one.firstfloor.org> Message-ID: > -----Original Message----- > From: Andi Kleen [mailto:andi at firstfloor.org] > Sent: Wednesday, October 10, 2007 9:02 AM > To: Waskiewicz Jr, Peter P > Cc: David Miller; andi at firstfloor.org; hadi at cyberus.ca; > shemminger at linux-foundation.org; jeff at garzik.org; > johnpol at 2ka.mipt.ru; herbert at gondor.apana.org.au; > gaagaan at gmail.com; Robert.Olsson at data.slu.se; > netdev at vger.kernel.org; rdreier at cisco.com; > mcarlson at broadcom.com; jagana at us.ibm.com; > general at lists.openfabrics.org; mchan at broadcom.com; > tgraf at suug.ch; randy.dunlap at oracle.com; sri at us.ibm.com; > kaber at trash.net > Subject: Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net > core use batching > > > We've done similar testing with ixgbe to push maximum descriptor > > counts, and we lost performance very quickly in the same > range you're > > quoting on NIU. > > Did you try it with WC writes to the ring or CLFLUSH? > > -Andi Hmm, I think it might be slightly different, but it still shows queue depth vs. performance. 
I was actually referring to how many descriptors we can represent a packet with before it becomes a problem wrt performance. This morning I tried to actually push my ixgbe NIC hard enough to come close to filling the ring with packets (384-byte packets), and even on my 8-core Xeon I can't do it. My system can't generate enough I/O to fill the hardware queues before CPUs max out. -PJ Waskiewicz From mshefty at ichips.intel.com Wed Oct 10 09:46:38 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 09:46:38 -0700 Subject: [ofa-general] [PATCH 1/3]: IB/core: allow lockless SRQ In-Reply-To: <1192031740.7337.60.camel@mtls03> References: <1192031740.7337.60.camel@mtls03> Message-ID: <470D01EE.2050207@ichips.intel.com> Eli Cohen wrote: > Allow to modify a SRQ to be lockless > > This patch allow the consumer to call ib_modify_srq and specify > whether the SRQ is lockless or not. I would think this needs to be specified at SRQ creation time. Otherwise, you can end up with a race where the SRQ is modified to/from lockless while in a call, resulting in either not releasing a lock, or releasing one that wasn't acquired. - Sean From mshefty at ichips.intel.com Wed Oct 10 09:53:36 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 09:53:36 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470C2897.1010105@linux.vnet.ibm.com> References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> Message-ID: <470D0390.9010209@ichips.intel.com> > We discussed this previously and had agreed upon limiting the memory > foot print to 1GB by default. This module parameter was for larger > systems that had plenty of memory and could afford to use more. > This way the sys admin could increase the limit. > > Hence I am not really in favour of removing this. But doesn't the admin already have the necessary parameters to limit memory usage? (max QPs, RQ depth, and mtu?) 
- Sean From rdreier at cisco.com Wed Oct 10 09:54:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 10 Oct 2007 09:54:59 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470C2897.1010105@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Tue, 09 Oct 2007 18:19:19 -0700") References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> Message-ID: > We discussed this previously and had agreed upon limiting the memory > foot print to 1GB by default. This module parameter was for larger > systems that had plenty of memory and could afford to use more. > This way the sys admin could increase the limit. The problem is that increasing the memory limit doesn't necessarily do anything. The admin would also have to raise the limit on the number of QPs. So why not just limit the number of QPs? - R. From rdreier at cisco.com Wed Oct 10 09:57:02 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 10 Oct 2007 09:57:02 -0700 Subject: [ofa-general] Re: [PATCH 1/3]: IB/core: allow lockless SRQ In-Reply-To: <1192031740.7337.60.camel@mtls03> (Eli Cohen's message of "Wed, 10 Oct 2007 17:55:40 +0200") References: <1192031740.7337.60.camel@mtls03> Message-ID: I don't think we really want to go down this route. There are too many subtleties in locking that consumers would have to worry about, and I don't think anyone would ever get it right. - R. From anton at samba.org Wed Oct 10 09:56:34 2007 From: anton at samba.org (Anton Blanchard) Date: Wed, 10 Oct 2007 11:56:34 -0500 Subject: [ofa-general] [PATCH] fix some ehca limits In-Reply-To: References: <20070930053726.GA28619@kryten> <20071001153620.GA31830@kryten> Message-ID: <20071010165634.GA22835@kryten> Hi Roland, > I didn't see a response to my earlier email about the other uses of > min_t(int, x, INT_MAX) so I fixed it up myself and added this to my > tree. 
I don't have a working setup to test yet so please let me know > if you see anything wrong with this: Thanks for doing this, sorry I didnt get back to you. I pulled your tree and it tested out fine: max_cqe: 2147483647 max_pd: 2147483647 max_ah: 2147483647 Acked-by: Anton Blanchard Anton > commit 919225e60a1a73e3518f257f040f74e9379a61c3 > Author: Roland Dreier > Date: Tue Oct 9 13:17:42 2007 -0700 > > IB/ehca: Fix clipping of device limits to INT_MAX > > Doing min_t(int, foo, INT_MAX) doesn't work correctly, because if foo > is bigger than INT_MAX, then when treated as a signed integer, it will > become negative and hence such an expression is just an elaborate NOP. > > Fix such cases in ehca to do min_t(unsigned, foo, INT_MAX) instead. > This fixes negative reported values for max_cqe, max_pd and max_ah: > > Before: > > max_cqe: -64 > max_pd: -1 > max_ah: -1 > > After: > max_cqe: 2147483647 > max_pd: 2147483647 > max_ah: 2147483647 > > Based on a bug report and fix from Anton Blanchard . 
> > Signed-off-by: Roland Dreier > > diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c > index 3436c49..4aa3ffa 100644 > --- a/drivers/infiniband/hw/ehca/ehca_hca.c > +++ b/drivers/infiniband/hw/ehca/ehca_hca.c > @@ -82,17 +82,17 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) > props->vendor_id = rblock->vendor_id >> 8; > props->vendor_part_id = rblock->vendor_part_id >> 16; > props->hw_ver = rblock->hw_ver; > - props->max_qp = min_t(int, rblock->max_qp, INT_MAX); > - props->max_qp_wr = min_t(int, rblock->max_wqes_wq, INT_MAX); > - props->max_sge = min_t(int, rblock->max_sge, INT_MAX); > - props->max_sge_rd = min_t(int, rblock->max_sge_rd, INT_MAX); > - props->max_cq = min_t(int, rblock->max_cq, INT_MAX); > - props->max_cqe = min_t(int, rblock->max_cqe, INT_MAX); > - props->max_mr = min_t(int, rblock->max_mr, INT_MAX); > - props->max_mw = min_t(int, rblock->max_mw, INT_MAX); > - props->max_pd = min_t(int, rblock->max_pd, INT_MAX); > - props->max_ah = min_t(int, rblock->max_ah, INT_MAX); > - props->max_fmr = min_t(int, rblock->max_mr, INT_MAX); > + props->max_qp = min_t(unsigned, rblock->max_qp, INT_MAX); > + props->max_qp_wr = min_t(unsigned, rblock->max_wqes_wq, INT_MAX); > + props->max_sge = min_t(unsigned, rblock->max_sge, INT_MAX); > + props->max_sge_rd = min_t(unsigned, rblock->max_sge_rd, INT_MAX); > + props->max_cq = min_t(unsigned, rblock->max_cq, INT_MAX); > + props->max_cqe = min_t(unsigned, rblock->max_cqe, INT_MAX); > + props->max_mr = min_t(unsigned, rblock->max_mr, INT_MAX); > + props->max_mw = min_t(unsigned, rblock->max_mw, INT_MAX); > + props->max_pd = min_t(unsigned, rblock->max_pd, INT_MAX); > + props->max_ah = min_t(unsigned, rblock->max_ah, INT_MAX); > + props->max_fmr = min_t(unsigned, rblock->max_mr, INT_MAX); > > if (EHCA_BMASK_GET(HCA_CAP_SRQ, shca->hca_cap)) { > props->max_srq = props->max_qp; > @@ -104,15 +104,15 @@ int ehca_query_device(struct ib_device *ibdev, 
struct ib_device_attr *props) > props->local_ca_ack_delay > = rblock->local_ca_ack_delay; > props->max_raw_ipv6_qp > - = min_t(int, rblock->max_raw_ipv6_qp, INT_MAX); > + = min_t(unsigned, rblock->max_raw_ipv6_qp, INT_MAX); > props->max_raw_ethy_qp > - = min_t(int, rblock->max_raw_ethy_qp, INT_MAX); > + = min_t(unsigned, rblock->max_raw_ethy_qp, INT_MAX); > props->max_mcast_grp > - = min_t(int, rblock->max_mcast_grp, INT_MAX); > + = min_t(unsigned, rblock->max_mcast_grp, INT_MAX); > props->max_mcast_qp_attach > - = min_t(int, rblock->max_mcast_qp_attach, INT_MAX); > + = min_t(unsigned, rblock->max_mcast_qp_attach, INT_MAX); > props->max_total_mcast_qp_attach > - = min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX); > + = min_t(unsigned, rblock->max_total_mcast_qp_attach, INT_MAX); > > /* translate device capabilities */ > props->device_cap_flags = IB_DEVICE_SYS_IMAGE_GUID | From rdreier at cisco.com Wed Oct 10 09:58:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 10 Oct 2007 09:58:39 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib: optimize receive flow In-Reply-To: <1192031732.7337.58.camel@mtls03> (Eli Cohen's message of "Wed, 10 Oct 2007 17:55:32 +0200") References: <1192031732.7337.58.camel@mtls03> Message-ID: > - if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { > + if (unlikely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { This looks dubious -- you've reversed the sense of the test here. if (!likely(foo)) should be converted to if (unlikely(!foo)) instead. From mshefty at ichips.intel.com Wed Oct 10 09:59:13 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 09:59:13 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> Message-ID: <470D04E1.6090101@ichips.intel.com> > I don't think we want the qp_type to be a module parameter -- it seems > we already have ud vs. 
rc handled via the parameter that enables > connected mode, and if we want to enable uc we should do that in a > similar per-interface way. > > Similarly if there's any point to making use_srq something that can be > controlled, ideally it should be per-interface. But this could be > tricky because it may be hard to change at runtime. > > (Ideally max_conn_qp would be per-interface too but that seems too > hard as well) I agree that these should be per interface. They may be difficult to change at runtime without resetting all connections, but as the person not coding it, I would think it would be doable. What happens now when dynamically switching between UD and CM mode? - Sean From rdreier at cisco.com Wed Oct 10 09:59:18 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 10 Oct 2007 09:59:18 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib: optimize receive flow In-Reply-To: <1192031732.7337.58.camel@mtls03> (Eli Cohen's message of "Wed, 10 Oct 2007 17:55:32 +0200") References: <1192031732.7337.58.camel@mtls03> Message-ID: > This patch tries to reduce the number of accesses to the skb > object and save CPU cycles and cache misses. Does it succeed? Did you measure the performance, or look at the generated code to confirm that it helps? - R. From eli at mellanox.co.il Wed Oct 10 10:00:21 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 10 Oct 2007 19:00:21 +0200 Subject: [ofa-general] [PATCH 1/3]: IB/core: allow lockless SRQ In-Reply-To: <470D01EE.2050207@ichips.intel.com> References: <1192031740.7337.60.camel@mtls03> <470D01EE.2050207@ichips.intel.com> Message-ID: <1192035621.7337.82.camel@mtls03> > I would think this needs to be specified at SRQ creation time. > > Otherwise, you can end up with a race where the SRQ is modified to/from > lockless while in a call, resulting in either not releasing a lock, or > releasing one that wasn't acquired. > Yes, you're right. I didn't want to change the creation verb but it looks like it is a better choice. 
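The race discussed above goes away if the lock/lockless choice is fixed when the SRQ is created. What follows is only a minimal user-space sketch of that idea, not the mthca code: the names demo_srq, demo_srq_init and demo_srq_post are hypothetical, and a C11 atomic_flag is assumed as a stand-in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared receive queue; not struct mthca_srq. */
struct demo_srq {
	atomic_flag lock;	/* assumed stand-in for the kernel spinlock */
	bool use_lock;		/* written once at creation, read-only afterwards */
	int posted;		/* number of receive requests posted so far */
};

static void demo_lock(struct demo_srq *srq)
{
	/* Spin until the flag was previously clear, i.e. we now own the lock. */
	while (atomic_flag_test_and_set_explicit(&srq->lock, memory_order_acquire))
		;
}

static void demo_unlock(struct demo_srq *srq)
{
	atomic_flag_clear_explicit(&srq->lock, memory_order_release);
}

/*
 * The caller chooses lockless mode at creation time and thereby promises
 * to serialize all later calls itself. No modify path touches use_lock.
 */
static void demo_srq_init(struct demo_srq *srq, bool lockless)
{
	atomic_flag_clear(&srq->lock);
	srq->use_lock = !lockless;
	srq->posted = 0;
}

/*
 * Since use_lock cannot change between the two tests below, they always
 * agree: a lock that was taken is released, and a lock that was not taken
 * is never released.
 */
static int demo_srq_post(struct demo_srq *srq, int nreq)
{
	if (srq->use_lock)
		demo_lock(srq);

	srq->posted += nreq;

	if (srq->use_lock)
		demo_unlock(srq);

	return srq->posted;
}
```

If the flag could instead be flipped by a modify call while demo_srq_post is between its two tests, the second test could disagree with the first, which is the failure described in the quoted text.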
From eli at mellanox.co.il Wed Oct 10 10:02:19 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 10 Oct 2007 19:02:19 +0200 Subject: [ofa-general] Re: [PATCH] IB/ipoib: optimize receive flow In-Reply-To: References: <1192031732.7337.58.camel@mtls03> Message-ID: <1192035739.7337.85.camel@mtls03> On Wed, 2007-10-10 at 09:58 -0700, Roland Dreier wrote: > > - if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { > > + if (unlikely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { > > This looks dubious -- you've reversed the sense of the test here. > > if (!likely(foo)) > > should be converted to > > if (unlikely(!foo)) > > instead. Sure, you're right. From eli at mellanox.co.il Wed Oct 10 10:06:19 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 10 Oct 2007 19:06:19 +0200 Subject: [ofa-general] Re: [PATCH] IB/ipoib: optimize receive flow In-Reply-To: References: <1192031732.7337.58.camel@mtls03> Message-ID: <1192035979.7337.89.camel@mtls03> > Does it succeed? Did you measure the performance, or look at the > generated code to confirm that it helps? > Actually I ran oprofile and saw that this reduces the time spent on skb_put_frags() (from 14.6% to 11.6% in the test I did).
From hrosenstock at xsigo.com Wed Oct 10 10:29:33 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 10 Oct 2007 10:29:33 -0700 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <470C9C55.3090304@Sun.COM> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM> <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> <470C9C55.3090304@Sun.COM> Message-ID: <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-10 at 15:03 +0530, Sumit Gaur - Sun Microsystem wrote: > Hi, > I am using madrpc_init which in turn calling umad_register(). There is no > problem in sending and receiving data. Only problem comes when two separate user > threads(one for SMI recv and another for GSI recv) are trying to recv data using > mad_receive(0, timeout) function simultaneously. I get SMI mad in GSI thread > and vice versa sometimes. How to get rid of this problem as mad_receive has no > control of qp selection. There is no per thread demuxing. You would need two different mad agents to do this with one looking at the SMI side and the other the GSI side. I haven't looked at libibmad in terms of using this model though. -- Hal > > Thanks and Regards > sumit > > > Hal Rosenstock wrote: > > On Tue, 2007-10-09 at 13:01 +0530, Sumit Gaur - Sun Microsystem wrote: > > > >>Hi, > >> > >>It is regarding *umad_recv* function of libibumad/src/umad.c file. Is it not > >>possible to recv MAD specific to GSI or SMI type. As per my impression if I have > >>two separate threads to send and receive then I could send MADs to different qp > >>0 or 1 depend on GSI and SMI MAD.
But receiving has no control over it. Please > >>suggest if there is any workaround for it. > > > > > > See umad_register(). > > > > -- Hal > > > > > >>Thanks and Regards > >>sumit > >>_______________________________________________ > >>general mailing list > >>general at lists.openfabrics.org > >>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Wed Oct 10 10:30:48 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 10:30:48 -0700 Subject: [ofa-general] IPoIB CM (NOSRQ) [Patch V9] revised In-Reply-To: <470C4DB7.2050103@linux.vnet.ibm.com> References: <470C4DB7.2050103@linux.vnet.ibm.com> Message-ID: <000101c80b63$4c6192f0$bacc180a@amr.corp.intel.com> >@@ -313,19 +483,18 @@ static int ipoib_cm_req_handler(struct i > } > > psn = random32() & 0xffffff; >- ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); >- if (ret) >- goto err_modify; >- >- spin_lock_irq(&priv->lock); >- queue_delayed_work(ipoib_workqueue, >- &priv->cm.stale_task, IPOIB_CM_RX_DELAY); >- /* Add this entry to passive ids list head, but do not re-add it >- * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ >- p->jiffies = jiffies; >- if (p->state == IPOIB_CM_RX_LIVE) >- list_move(&p->list, &priv->cm.passive_ids); >- spin_unlock_irq(&priv->lock); >+ if (!priv->cm.srq) { >+ ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); >+ if (ret) >+ goto err_modify; >+ } else { >+ p->rx_ring = NULL; >+ ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); >+ if (ret) >+ goto err_modify; >+ p->state = IPOIB_CM_RX_LIVE; >+ init_context_and_add_list(cm_id, p, priv); I missed this impact in my previous review. Removing the locking from init_context_and_add_list() means that we need a lock here. 
- Sean From pradeeps at linux.vnet.ibm.com Wed Oct 10 10:38:25 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 10 Oct 2007 10:38:25 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> Message-ID: <470D0E11.9050305@linux.vnet.ibm.com> Roland Dreier wrote: > > We discussed this previously and had agreed upon limiting the memory > > foot print to 1GB by default. This module parameter was for larger > > systems that had plenty of memory and could afford to use more. > > This way the sys admin could increase the limit. > > The problem is that increasing the memory limit doesn't necessarily do > anything. The admin would also have to raise the limit on the number > of QPs. So why not just limit the number of QPs? > Yes, the admin would have to increase the number of QPs as well. However, increasing the number of QPs only does not give a picture to the admin as to how much memory is being used. This way he is able to tune the system to use resources the way he would want to control. Pradeep From sean.hefty at intel.com Wed Oct 10 10:35:44 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 10:35:44 -0700 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM><1191930206.22963.164.camel@hrosenstock-ws.xsigo.com><470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> Message-ID: <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> >There is no per thread demuxing. You would need two different mad agents >to do this with one looking at the SMI side and the other the GSI side. >I haven't looked at libibmad in terms of using this model though. umad_receive() doesn't take the mad_agent as an input parameter. 
The only possibility I see is calling umad_open_port() twice for the same port, with the GSI/SMI registrations going to separate port_id's. - Sean From rdreier at cisco.com Wed Oct 10 10:41:57 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 10 Oct 2007 10:41:57 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470D0E11.9050305@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Wed, 10 Oct 2007 10:38:25 -0700") References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> <470D0E11.9050305@linux.vnet.ibm.com> Message-ID: > Yes, the admin would have to increase the number of QPs as well. However, > increasing the number of QPs only does not give a picture to the admin as > to how much memory is being used.
This way he is able to tune the system to > use resources the way he would want to control. So should we remove the module parameter to limit the number of QPs? I hate adding more and more module parameters, especially when they are not orthogonal and have complex interactions. - R. From pradeeps at linux.vnet.ibm.com Wed Oct 10 10:46:51 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 10 Oct 2007 10:46:51 -0700 Subject: [ofa-general] IPoIB CM (NOSRQ) [Patch V9] revised In-Reply-To: <000101c80b63$4c6192f0$bacc180a@amr.corp.intel.com> References: <470C4DB7.2050103@linux.vnet.ibm.com> <000101c80b63$4c6192f0$bacc180a@amr.corp.intel.com> Message-ID: <470D100B.2070904@linux.vnet.ibm.com> Sean Hefty wrote: >> @@ -313,19 +483,18 @@ static int ipoib_cm_req_handler(struct i >> } >> >> psn = random32() & 0xffffff; >> - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); >> - if (ret) >> - goto err_modify; >> - >> - spin_lock_irq(&priv->lock); >> - queue_delayed_work(ipoib_workqueue, >> - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); >> - /* Add this entry to passive ids list head, but do not re-add it >> - * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ >> - p->jiffies = jiffies; >> - if (p->state == IPOIB_CM_RX_LIVE) >> - list_move(&p->list, &priv->cm.passive_ids); >> - spin_unlock_irq(&priv->lock); >> + if (!priv->cm.srq) { >> + ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); >> + if (ret) >> + goto err_modify; >> + } else { >> + p->rx_ring = NULL; >> + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); >> + if (ret) >> + goto err_modify; >> + p->state = IPOIB_CM_RX_LIVE; >> + init_context_and_add_list(cm_id, p, priv); > > I missed this impact in my previous review. Removing the locking from > init_context_and_add_list() means that we need a lock here. > Yes, you are correct. The no srq case is correct. This impacts the srq case. I will fix that. 
Pradeep From mshefty at ichips.intel.com Wed Oct 10 10:54:49 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 10:54:49 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470D0E11.9050305@linux.vnet.ibm.com> References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> <470D0E11.9050305@linux.vnet.ibm.com> Message-ID: <470D11E9.3020603@ichips.intel.com> > Yes, the admin would have to increase the number of QPs as well. However, > increasing the number of QPs only does not give a picture to the admin as > to how much memory is being used. This way he is able to tune the system to > use resources the way he would want to control. How about providing some way for the admin to see current and maximum memory usage that ipoib could consume based on the current QP and RQ settings? You're using the memory limit to restrict the number of QPs to less than what the user requested. It could instead have been used to restrict the size of the receive queue, or both. Having the extra parameter can be confusing. Consider an admin increasing the RQ size only to find that they end up with fewer QPs. - Sean From pradeeps at linux.vnet.ibm.com Wed Oct 10 11:08:42 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 10 Oct 2007 11:08:42 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470D11E9.3020603@ichips.intel.com> References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> <470D0E11.9050305@linux.vnet.ibm.com> <470D11E9.3020603@ichips.intel.com> Message-ID: <470D152A.9040409@linux.vnet.ibm.com> Sean Hefty wrote: >> Yes, the admin would have to increase the number of QPs as well. >> However, increasing the number of QPs only does not give a picture to >> the admin as to how much memory is being used. 
This way he is able to >> tune the system to >> use resources the way he would want to control. > > How about providing some way for the admin to see current and maximum > memory usage that ipoib could consume based on the current QP and RQ > settings? > > You're using the memory limit to restrict the number of QPs to less than > what the user requested. It could instead have been used to restrict > the size of the receive queue, or both. Having the extra parameter can > be confusing. Consider an admin increasing the RQ size only to find > that they end up with fewer QPs. Yes, the admin could run into the problem that you describe. That is exactly why we have these as module parameters. It gives him/her the flexibility. I am thinking that we are seeing this differently. I don't view that as a problem, but as usefulness. Pradeep From swise at opengridcomputing.com Wed Oct 10 11:14:52 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 10 Oct 2007 13:14:52 -0500 Subject: [ofa-general] RLIMIT_MEMLOCK In-Reply-To: <470BA493.9040501@vims.edu> References: <470BA493.9040501@vims.edu> Message-ID: <470D169C.2060005@opengridcomputing.com> I usually have to add something in /etc/init.d/ssh* and restart the ssh daemon... Adam Miller wrote: > We have run into this problem with using mpiexec. SLES 10 is on the > cluster and we have set the limits under /etc/security/limits.conf and > they work there, even when we run mpirun commands work fine but when > tying them all in using mpiexec it still comes back with the 32K limit > in memory. > > Any and all users can log in and in bash type "ulimit -a" and tcsh type > "limit" and both state the correct full memory limits, but when using > mpiexec under both shells they get the 32k limit. > > Any suggestions?
> > thanks > From rdreier at cisco.com Wed Oct 10 11:26:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 10 Oct 2007 11:26:14 -0700 Subject: [ofa-general] Re: [PATCH] IB/mthca: optimize post srq In-Reply-To: <1192031738.7337.59.camel@mtls03> (Eli Cohen's message of "Wed, 10 Oct 2007 17:55:37 +0200") References: <1192031738.7337.59.camel@mtls03> Message-ID: It makes sense to mark the error paths as unlikely(), so I applied this. > If this approach is accepted I can do the same for mlx4 I just looked at the mlx4 code -- it seems I already marked the error paths as unlikely in the post srq recv function. So I don't think there's anything to do there. - R. From rdreier at cisco.com Wed Oct 10 11:26:41 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 10 Oct 2007 11:26:41 -0700 Subject: [ofa-general] Re: [PATCH 04/23] IB/ipath - Verify host bus bandwidth to chip will not limit performance In-Reply-To: <20071009195935.7151.18898.stgit@eng-46.internal.keyresearch.com> (Arthur Jones's message of "Tue, 09 Oct 2007 12:59:35 -0700") References: <20071009195914.7151.19428.stgit@eng-46.internal.keyresearch.com> <20071009195935.7151.18898.stgit@eng-46.internal.keyresearch.com> Message-ID: thanks, I merged this on top to simplify the error path and fix a memory leak: diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 8fa2bb5..f83fb03 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -305,8 +305,7 @@ static void ipath_verify_pioperf(struct ipath_devdata *dd) if (!piobuf) { dev_info(&dd->pcidev->dev, "No PIObufs for checking perf, skipping\n"); - goto done; - + return; } /* @@ -358,9 +357,12 @@ static void ipath_verify_pioperf(struct ipath_devdata *dd) lcnt / (u32) emsecs); preempt_enable(); + + vfree(addr); + done: - if (piobuf) /* disarm it, so it's available again */ - ipath_disarm_piobufs(dd, pbnum, 1); + /* disarm piobuf, so it's available
again */ + ipath_disarm_piobufs(dd, pbnum, 1); } static int __devinit ipath_init_one(struct pci_dev *pdev, From mshefty at ichips.intel.com Wed Oct 10 11:31:59 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 11:31:59 -0700 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: References: <46D78104.mailJY81GRONO@systemfabricworks.com> Message-ID: <470D1A9F.9050305@ichips.intel.com> Does anyone know what happened with this patch? Steve? I last remember a couple of minor changes being requested, but that was it. - Sean > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > > index 6f42877..9ec910b 100644 > > --- a/drivers/infiniband/core/mad.c > > +++ b/drivers/infiniband/core/mad.c > > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > } > > > > /* Check to post send on QP or process locally */ > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > + smi_check_local_resp_smp(smp, device) == IB_SMI_DISCARD) > > goto out; > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > if (port_priv) { > > mad_priv->mad.mad.mad_hdr.tid = > > ((struct ib_mad *)smp)->mad_hdr.tid; > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > > recv_mad_agent = find_mad_agent(port_priv, > > &mad_priv->mad.mad); > > } > > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > > index 1cfc298..d96fc8e 100644 > > --- a/drivers/infiniband/core/smi.h > > +++ b/drivers/infiniband/core/smi.h > > @@ -71,4 +71,18 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > > (smp->hop_ptr == smp->hop_cnt + 1)) ? 
> > IB_SMI_HANDLE : IB_SMI_DISCARD); > > } > > + > > +/* > > + * Return 1 if the SMP response should be handled by the local management stack > > + */ > > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp *smp, > > + struct ib_device *device) > > +{ > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > + return ((device->process_mad && > > + ib_get_smp_direction(smp) && > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > +} > > + > > #endif /* __SMI_H_ */
From hrosenstock at xsigo.com Wed Oct 10 11:44:13 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 10 Oct 2007 11:44:13 -0700 Subject: [ofa-general] [PATCH] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <470D1A9F.9050305@ichips.intel.com> References: <46D78104.mailJY81GRONO@systemfabricworks.com> <470D1A9F.9050305@ichips.intel.com> Message-ID: <1192041853.17526.78.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-10 at 11:31 -0700, Sean Hefty wrote: > Does anyone know what happened with this patch? Steve? > > I last remember a couple of minor changes being requested, but that was it. Yes, we both requested some minor changes and no revised patch was issued AFAIK. There's also the related mthca router mode patch too which so far is lacking comment. -- Hal > - Sean > > > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > > > index 6f42877..9ec910b 100644 > > > --- a/drivers/infiniband/core/mad.c > > > +++ b/drivers/infiniband/core/mad.c > > > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > > } > > > > > > /* Check to post send on QP or process locally */ > > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > > + smi_check_local_resp_smp(smp, device) == IB_SMI_DISCARD) > > > goto out; > > > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > > @@ -754,6 +755,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > > if (port_priv) { > > > mad_priv->mad.mad.mad_hdr.tid = > > > ((struct ib_mad *)smp)->mad_hdr.tid; > > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > > > recv_mad_agent = find_mad_agent(port_priv, > > > &mad_priv->mad.mad); > > > } > > > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > > > index 1cfc298..d96fc8e
100644 > > > --- a/drivers/infiniband/core/smi.h > > > +++ b/drivers/infiniband/core/smi.h > > > @@ -71,4 +71,18 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > > } > > > + > > > +/* > > > + * Return 1 if the SMP response should be handled by the local management stack > > > + */ > > > +static inline enum smi_action smi_check_local_resp_smp(struct ib_smp *smp, > > > + struct ib_device *device) > > > +{ > > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > > + return ((device->process_mad && > > > + ib_get_smp_direction(smp) && > > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > > +} > > > + > > > #endif /* __SMI_H_ */
From mshefty at ichips.intel.com Wed Oct 10 12:03:19 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 12:03:19 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470D152A.9040409@linux.vnet.ibm.com> References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> <470D0E11.9050305@linux.vnet.ibm.com> <470D11E9.3020603@ichips.intel.com> <470D152A.9040409@linux.vnet.ibm.com> Message-ID: <470D21F7.10209@ichips.intel.com> > Yes, the admin could run into the problem that you describe. That is exactly > why we have these as module parameters. It gives him/her the flexibility. But it doesn't give additional flexibility, and makes it more difficult. Increasing this value by itself may not do anything unless the admin also increase max QPs / RQ size / mtu. Similarly, increasing max QP / RQ size / mtu may not work without also increasing this value. Multiple values need to be manipulated. Decreasing this value can have the side effect of limiting max QP. This side effect is arbitrary. And even if this value is left unchanged, the results of changing other parameters is unknown. The only sure way that the admin can know what will happen is to understand the relationship that max QP / RQ size / mtu have on memory use.
This parameter doesn't remove that need and makes the relationship between them show up in confusing ways. If admins want some way of limiting how much memory is consumed by ipoib, then how about creating a simple userspace app to convert their request into the proper kernel settings? This way, the policy is kept in userspace, rather than hard-coded in the kernel driver. - Sean From pradeeps at linux.vnet.ibm.com Wed Oct 10 12:17:18 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 10 Oct 2007 12:17:18 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470D21F7.10209@ichips.intel.com> References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> <470D0E11.9050305@linux.vnet.ibm.com> <470D11E9.3020603@ichips.intel.com> <470D152A.9040409@linux.vnet.ibm.com> <470D21F7.10209@ichips.intel.com> Message-ID: <470D253E.1070904@linux.vnet.ibm.com> Sean Hefty wrote: >> Yes, the admin could run into the problem that you describe. That is >> exactly >> why we have these as module parameters. It gives him/her the flexibility. > > But it doesn't give additional flexibility, and makes it more difficult. > > Increasing this value by itself may not do anything unless the admin > also increase max QPs / RQ size / mtu. Similarly, increasing max QP / > RQ size / mtu may not work without also increasing this value. Multiple > values need to be manipulated. > > Decreasing this value can have the side effect of limiting max QP. This > side effect is arbitrary. > > And even if this value is left unchanged, the results of changing other > parameters is unknown. > > The only sure way that the admin can know what will happen is to > understand the relationship that max QP / RQ size / mtu have on memory > use. This parameter doesn't remove that need and makes the relationship > between them show up in confusing ways. 
> > If admins want some way of limiting how much memory is consumed by > ipoib, then how about creating a simple userspace app to convert their > request into the proper kernel settings? This way, the policy is kept > in userspace, rather than hard-coded in the kernel driver. > Sean, As we debate this issue I do not want no srq patch to miss the 2.6.24 merge. This has been waiting to be merged for a very long time. We all have a slightly different view point. This was the reason I did not touch the module parameters in my previous patches. Can we agree to continue discussing this, but merge the patch (I will provide the fix that you pointed out)? Pradeep From swise at opengridcomputing.com Wed Oct 10 12:47:53 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 10 Oct 2007 14:47:53 -0500 Subject: [ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3 In-Reply-To: <470BB8DB.8090107@dev.mellanox.co.il> References: <470A9363.4010007@opengridcomputing.com> <470BA4D4.3080707@dev.mellanox.co.il> <470BA93C.3010601@opengridcomputing.com> <470BB8DB.8090107@dev.mellanox.co.il> Message-ID: <470D2C69.3000500@opengridcomputing.com> Hey Vlad, The libcxgb3 rpms built by this ofed-1.2.5 release are still named libcxgb3*-1.0.1 instead of 1.0.3. Can you update your spec files to indicate that the library is release 1.0.3? You'll need to also update the ofed-1.3 spec file I guess. Thanks, Steve. Vladimir Sokolovsky wrote: > Steve Wise wrote: >> Thanks Vlad, >> >> Can you crank a ofed-1.2.5 development build too? >> >> Thanks, >> >> Steve. 
>> > > Done: > > http://www.openfabrics.org/builds/connectx/OFED-1.2.5-20071009-0955.tgz > > Regards, > Vladimir From pradeeps at linux.vnet.ibm.com Wed Oct 10 13:03:30 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 10 Oct 2007 13:03:30 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470D253E.1070904@linux.vnet.ibm.com> References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> <470D0E11.9050305@linux.vnet.ibm.com> <470D11E9.3020603@ichips.intel.com> <470D152A.9040409@linux.vnet.ibm.com> <470D21F7.10209@ichips.intel.com> <470D253E.1070904@linux.vnet.ibm.com> Message-ID: <470D3012.5030706@linux.vnet.ibm.com> Pradeep Satyanarayana wrote: > Sean Hefty wrote: >>> Yes, the admin could run into the problem that you describe. That is >>> exactly >>> why we have these as module parameters. It gives him/her the flexibility. >> But it doesn't give additional flexibility, and makes it more difficult. >> >> Increasing this value by itself may not do anything unless the admin >> also increase max QPs / RQ size / mtu. Similarly, increasing max QP / >> RQ size / mtu may not work without also increasing this value. Multiple >> values need to be manipulated. >> >> Decreasing this value can have the side effect of limiting max QP. This >> side effect is arbitrary. >> >> And even if this value is left unchanged, the results of changing other >> parameters is unknown. >> >> The only sure way that the admin can know what will happen is to >> understand the relationship that max QP / RQ size / mtu have on memory >> use. This parameter doesn't remove that need and makes the relationship >> between them show up in confusing ways. >> >> If admins want some way of limiting how much memory is consumed by >> ipoib, then how about creating a simple userspace app to convert their >> request into the proper kernel settings? 
This way, the policy is kept >> in userspace, rather than hard-coded in the kernel driver. >> > > Sean, > > As we debate this issue I do not want no srq patch to miss the 2.6.24 merge. > This has been waiting to be merged for a very long time. > > We all have a slightly different view point. This was the reason I did not > touch the module parameters in my previous patches. Can we agree to continue > discussing this, but merge the patch (I will provide the fix that you pointed out)? > In the interest of reaching a quick resolution, would it be acceptable if I put in a warning message (printing current memory usage) when memory usage say exceeds 1GB for no srq and also eliminate the max_receive_buffer module parameter. Is that satisfactory? Pradeep From mshefty at ichips.intel.com Wed Oct 10 13:29:32 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 13:29:32 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470D3012.5030706@linux.vnet.ibm.com> References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> <470D0E11.9050305@linux.vnet.ibm.com> <470D11E9.3020603@ichips.intel.com> <470D152A.9040409@linux.vnet.ibm.com> <470D21F7.10209@ichips.intel.com> <470D253E.1070904@linux.vnet.ibm.com> <470D3012.5030706@linux.vnet.ibm.com> Message-ID: <470D362C.3040509@ichips.intel.com> > In the interest of reaching a quick resolution, would it be acceptable if I put > in a warning message (printing current memory usage) when memory usage say exceeds 1GB > for no srq and also eliminate the max_receive_buffer module parameter. Is that > satisfactory? That would work for me. You could even print a warning if the user configures ipoib such that it could exceed X GBs. That would only leave the max_rc_qp parameter, right? Is there any reason not to rename this to max_conn_qp in case UC support were ever added? Or would we want separate parameters for RC and UC? 
(Other changes I suggested could easily wait until we have a UC implementation.) - Sean From pradeeps at linux.vnet.ibm.com Wed Oct 10 13:46:35 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 10 Oct 2007 13:46:35 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V9] patch In-Reply-To: <470D362C.3040509@ichips.intel.com> References: <46F05476.4090809@linux.vnet.ibm.com> <470C05EF.90409@ichips.intel.com> <470C2897.1010105@linux.vnet.ibm.com> <470D0E11.9050305@linux.vnet.ibm.com> <470D11E9.3020603@ichips.intel.com> <470D152A.9040409@linux.vnet.ibm.com> <470D21F7.10209@ichips.intel.com> <470D253E.1070904@linux.vnet.ibm.com> <470D3012.5030706@linux.vnet.ibm.com> <470D362C.3040509@ichips.intel.com> Message-ID: <470D3A2B.6070902@linux.vnet.ibm.com> Sean Hefty wrote: >> In the interest of reaching a quick resolution, would it be acceptable >> if I put >> in a warning message (printing current memory usage) when memory usage >> say exceeds 1GB for no srq and also eliminate the max_receive_buffer >> module parameter. Is that satisfactory? > > That would work for me. You could even print a warning if the user > configures ipoib such that it could exceed X GBs. That brings up questions like how does the user configure ipoib? That means adding additional parameters. It was felt that 1GB was large enough to flag it and so I will stick with that. > > That would only leave the max_rc_qp parameter, right? Is there any > reason not to rename this to max_conn_qp in case UC support were ever > added? Or would we want separate parameters for RC and UC? (Other > changes I suggested could easily wait until we have a UC implementation.) Since we do not know what UC will add to the mix, I would like to keep that separate. 
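The arithmetic behind the 1GB warning discussed in this thread can be sketched roughly as follows: without an SRQ every RC QP owns its own receive ring, so receive memory grows as ring size times QP count times the page-aligned packet size. The function names and constants below (a 4096-byte page, a 65520-byte connected-mode MTU) are assumptions chosen for illustration, not values taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for illustration only; the real values depend on
 * the running kernel and on the ipoib module parameters. */
#define SKETCH_PAGE_SIZE  4096ULL
#define SKETCH_CM_MTU     65520ULL       /* assumed connected-mode MTU */
#define SKETCH_WARN_BYTES (1ULL << 30)   /* 1GB warning threshold */

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Receive memory consumed when each no-SRQ QP has its own ring. */
static uint64_t nosrq_recv_bytes(uint64_t recvq_size, uint64_t num_qps)
{
    return recvq_size * num_qps * align_up(SKETCH_CM_MTU, SKETCH_PAGE_SIZE);
}

/* Non-zero when the estimated usage reaches the warning threshold. */
static int nosrq_should_warn(uint64_t recvq_size, uint64_t num_qps)
{
    return nosrq_recv_bytes(recvq_size, num_qps) >= SKETCH_WARN_BYTES;
}
```

Under these assumed values, 128 QPs with 128 receive buffers each land exactly on the 1GB threshold, which is why the three settings cannot be reasoned about independently.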
Pradeep From mshefty at ichips.intel.com Wed Oct 10 14:01:07 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 14:01:07 -0700 Subject: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. In-Reply-To: <470AA729.2050009@opengridcomputing.com> References: <46B883B5.8040702@opengridcomputing.com> <46BB61D0.4090101@opengridcomputing.com> <46BB89C0.4040303@ichips.intel.com> <20070809.145534.102938208.davem@davemloft.net> <470AA729.2050009@opengridcomputing.com> Message-ID: <470D3D93.2020606@ichips.intel.com> > The hack to use a socket and bind it to claim the port was just for > demonstrating the idea. The correct solution, IMO, is to enhance the > core low level 4-tuple allocation services to be more generic (eg: not > be tied to a struct sock). Then the host tcp stack and the host rdma > stack can allocate TCP/iWARP ports/4tuples from this common exported > service and share the port space. This allocation service could also be > used by other deep adapters like iscsi adapters if needed. Since iWarp runs on top of TCP, the port space is really the same. FWIW, I agree that this proposal is the correct solution to support iWarp. - Sean From kanoj at netxen.com Wed Oct 10 14:42:33 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Wed, 10 Oct 2007 14:42:33 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests In-Reply-To: <470BD4A5.40902@ichips.intel.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com> <470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> Message-ID: <470D4749.8000309@netxen.com> Sean Hefty wrote: >> Just so I understand, did you discover problems (maybe preexisting >> race conditions) with my previously posted patch? If yes, please >> point it out, so it's easier to review yours; if not, I will assume >> your patch implements a better locking scheme and review it as such. > > Sean, I looked over your patch for a while.
Agreed, your patch fixes a race condition that my patch had exposed (I had analyzed the sequence wildcard destruct getting to a device listener before a racing device removal could, but not the reverse order). I do have some issues though: * in your patch, I suggest taking out the warning printk from cma_listen_on_dev() when the listener create attempt fails; it might be that the device is out of resources etc. Since the code takes care of this situation pretty well, I don't see a need for the printk. * I don't see a reason for the internal_id and the device listeners getting a refcount on the wildcard listener. Because, even without these, it is guaranteed that the wildcard listener will exist at least as long as any of the child device listeners are around, by looking at the logic in rdma_destroy_id(). Can you provide some logic for requiring this then? * not that I am very worried (and I suggest resolving this through another subsequent patch if it is really a problem), but I think device removal is still racy wrt non wildcard listeners. Here's the sequence: cma_process_remove()->cma_remove_id_dev() decides it will rdma_destroy_id() the listener id, and at the same time a process context rdma_destroy_id() decides it is going to do the same. There are probably various ways to take care of this, the simple one might be for rdma_destroy_id() to look at the "state" and make a decision about who gets to destroy. Thanks. Kanoj > I tried to explain the issue somewhat in my change commit and code > comments. The issue is synchronizing cleanup of the listen_list with > device removal. > > When an RDMA device is added to the system, a new listen request is > added for all wildcard listens. Since the original locking held the > mutex throughout the cleanup of the listen list, it prevented adding > another listen request during that same time. > > Similar protection was there for handling device removal.
When a > device is removed from the system, all internal listen requests > associated with that device are destroyed. If the associated wildcard > listen is also being destroyed, we need to ensure that we don't try to > destroy the same listen twice. > > My patch, like yours, ends up releasing the mutex while cleaning up > the listen_list. I choose to eliminate the cma_destroy_listen() call, > and use rdma_destroy_id() as a single destruction path instead. This > keeps the locking contained to a single function. (I don't like > acquiring a lock in one call and releasing it in another. It puts too > much assumption on the caller.) > > What was missing was ensuring that a device removal didn't try to > destroy the same listen request. This is handled by the adding the > list_del*() calls to cma_cancel_listens(). Whichever thread removes > the listening id from the device list is responsible for its > destruction. And because that thread could be the device removal > thread, I added a reference from the per device listen to the wildcard > listen. > > Hopefully this makes sense. > > - Sean > From rdreier at cisco.com Wed Oct 10 14:43:28 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 10 Oct 2007 14:43:28 -0700 Subject: [ofa-general] Re: [PATCH v3 for 2.6.24] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: (Or Gerlitz's message of "Mon, 8 Oct 2007 10:13:00 +0200 (IST)") References: Message-ID: OK, at long last I merged the following. I rewrote the changelog to (I think) be more understandable, and also cleaned up a few things in the patch (including whitespace damage...). commit 335a64a5a958002bc238c90de695e120c3c8c120 Author: Or Gerlitz Date: Mon Oct 8 10:13:00 2007 +0200 IPoIB: Allow setting policy to ignore multicast groups The kernel IB stack allows (through the RDMA CM) userspace applications to join and use multicast groups from the IPoIB MGID range. 
This allows multicast traffic to be handled directly from userspace QPs, without going through the kernel stack, which gives better performance for some applications. However, to fully interoperate with IP multicast, such userspace applications need to participate in IGMP reports and queries, or else routers may not forward the multicast traffic to the system where the application is running. The simplest way to do this is to share the kernel IGMP implementation by using the IP_ADD_MEMBERSHIP option to join multicast groups that are being handled directly in userspace. However, in such cases, the actual multicast traffic should not also be handled by the IPoIB interface, because that would burn resources handling multicast packets that will just be discarded in the kernel. To handle this, this patch adds lookup on the database used for IB multicast group reference counting when IPoIB is joining multicast groups, and if a multicast group is already handled by user space, then the IPoIB kernel driver ignores the group. This is controlled by a per-interface policy flag. When the flag is set, IPoIB will not join and attach its QP to a multicast group which already has an entry in the database; when the flag is cleared, IPoIB will behave as before this change. For each IPoIB interface, the /sys/class/net/$intf/umcast attribute controls the policy flag. The default value is off/0. 
Signed-off-by: Or Gerlitz Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index fc16bce..a198ce8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -86,6 +86,7 @@ enum { IPOIB_MCAST_STARTED = 8, IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, + IPOIB_FLAG_UMCAST = 11, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -384,6 +385,7 @@ static inline void ipoib_put_ah(struct ipoib_ah *ah) int ipoib_open(struct net_device *dev); int ipoib_add_pkey_attr(struct net_device *dev); +int ipoib_add_umcast_attr(struct net_device *dev); void ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 900335a..ff17fe3 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1019,6 +1019,37 @@ static ssize_t show_pkey(struct device *dev, } static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); +static ssize_t show_umcast(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); + + return sprintf(buf, "%d\n", test_bit(IPOIB_FLAG_UMCAST, &priv->flags)); +} + +static ssize_t set_umcast(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); + unsigned long umcast_val = simple_strtoul(buf, NULL, 0); + + if (umcast_val > 0) { + set_bit(IPOIB_FLAG_UMCAST, &priv->flags); + ipoib_warn(priv, "ignoring multicast groups joined directly " + "by userspace\n"); + } else + clear_bit(IPOIB_FLAG_UMCAST, &priv->flags); + + return count; +} +static DEVICE_ATTR(umcast, S_IWUSR | S_IRUGO, show_umcast, set_umcast); + +int ipoib_add_umcast_attr(struct net_device *dev) +{ + return device_create_file(&dev->dev, &dev_attr_umcast); +} + static ssize_t 
create_child(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) @@ -1136,6 +1167,8 @@ static struct net_device *ipoib_add_port(const char *format, goto sysfs_failed; if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; + if (ipoib_add_umcast_attr(priv->dev)) + goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_create_child)) goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_delete_child)) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 94a5709..62abfb6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -761,6 +761,7 @@ void ipoib_mcast_restart_task(struct work_struct *work) struct ipoib_mcast *mcast, *tmcast; LIST_HEAD(remove_list); unsigned long flags; + struct ib_sa_mcmember_rec rec; ipoib_dbg_mcast(priv, "restarting multicast task\n"); @@ -794,6 +795,14 @@ void ipoib_mcast_restart_task(struct work_struct *work) if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { struct ipoib_mcast *nmcast; + /* ignore group which is directly joined by userspace */ + if (test_bit(IPOIB_FLAG_UMCAST, &priv->flags) && + !ib_sa_get_mcmember_rec(priv->ca, priv->port, &mgid, &rec)) { + ipoib_dbg_mcast(priv, "ignoring multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + continue; + } + /* Not found or send-only group, let's add a new entry */ ipoib_dbg_mcast(priv, "adding multicast entry for mgid " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c index 6762988..293f5b8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -119,6 +119,8 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) goto sysfs_failed; if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; + if 
(ipoib_add_umcast_attr(priv->dev)) + goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_parent)) goto sysfs_failed; From pradeeps at linux.vnet.ibm.com Wed Oct 10 14:45:10 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 10 Oct 2007 14:45:10 -0700 Subject: [ofa-general] IPoIB CM (NOSRQ) [PATCH V9] updated to incorporate Sean's comments Message-ID: <470D47E6.90308@linux.vnet.ibm.com> This patch has been updated to include all of Sean's comments including elimination of the max_recv_buf module parameter. Instead we print a warning when no srq memory usage exceeds 1GB. Signed-off-by: Pradeep Satyanarayana --- --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-03 12:01:58.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-09 19:42:51.000000000 -0500 @@ -69,6 +69,7 @@ enum { IPOIB_TX_RING_SIZE = 64, IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, + IPOIB_MAX_RC_QP = 4096, IPOIB_NUM_WC = 4, @@ -95,11 +96,13 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) #define IPOIB_OP_RECV (1ul << 31) + #ifdef CONFIG_INFINIBAND_IPOIB_CM -#define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_CM_OP_RECV (1ul << 30) #else -#define IPOIB_CM_OP_SRQ (0) +#define IPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -186,11 +189,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp *qp; - struct list_head list; - struct net_device *dev; - unsigned long jiffies; + struct ib_cm_id *id; + struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by no srq only */ + struct list_head list; + struct net_device *dev; + unsigned long jiffies; + u32 index; /* wr_ids are distinguished by index + * to identify the QP -no srq only */ enum ipoib_cm_state state; }; @@ -235,6 +241,8 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx 
**rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -458,6 +466,7 @@ void ipoib_drain_cq(struct net_device *d /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC)) +extern int max_rc_qp; static inline int ipoib_cm_admin_enabled(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-31 12:14:30.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-10 16:20:50.000000000 -0500 @@ -49,6 +49,14 @@ MODULE_PARM_DESC(cm_data_debug_level, #include "ipoib.h" +int max_rc_qp = 128; + +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0444); +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of no srq RC QPs supported"); + +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for no srq */ + +#define NOSRQ_INDEX_MASK (0xfff) /* This corresponds to a max of 4096 QPs for no srq */ #define IPOIB_CM_IETF_ID 0x1000000000000000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +89,21 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); if (unlikely(ret)) { - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); + ipoib_warn(priv, "post srq failed for buf %lld (%d)\n", + (unsigned long long)id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[id].mapping); 
dev_kfree_skb_any(priv->cm.srq_ring[id].skb); @@ -104,12 +113,47 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index; + u32 wr_id; + struct ipoib_cm_rx *rx_ptr; + + index = id & NOSRQ_INDEX_MASK; + wr_id = id >> 32; + + rx_ptr = priv->cm.rx_index_table[index]; + + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; + + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", + wr_id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ptr->rx_ring[wr_id].mapping); + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); + rx_ptr->rx_ring[wr_id].skb = NULL; + } + + return ret; +} + +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, + int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; int i; + struct ipoib_cm_rx *rx_ptr; + u32 index, wr_id; skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if (unlikely(!skb)) @@ -141,7 +185,14 @@ static struct sk_buff *ipoib_cm_alloc_rx goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (priv->cm.srq) + priv->cm.srq_ring[id].skb = skb; + else { + index = id & NOSRQ_INDEX_MASK; + wr_id = id >> 32; + rx_ptr = priv->cm.rx_index_table[index]; + rx_ptr->rx_ring[wr_id].skb = skb; + } return skb; partial_error: @@ -203,11 +254,14 @@ static struct ib_qp *ipoib_cm_create_rx_ .recv_cq = priv->cq, .srq = priv->cm.srq, .cap.max_send_wr = 1, /* For drain WR */ + .cap.max_recv_wr = ipoib_recvq_size + 1, .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, 
.qp_context = p, }; + + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; return ib_create_qp(priv->pd, &attr); } @@ -281,12 +335,129 @@ static int ipoib_cm_send_rep(struct net_ rep.private_data_len = sizeof data; rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; - rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; + rep.srq = !!priv->cm.srq; return ib_send_cm_rep(cm_id, &rep); } +static void init_context_and_add_list(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, + struct ipoib_dev_priv *priv) +{ + cm_id->context = p; + p->jiffies = jiffies; + if (list_empty(&priv->cm.passive_ids)) + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); + if (priv->cm.srq) { + /* Add this entry to passive ids list head, but do not re-add + * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush + * list. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + } +} + +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, unsigned psn) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u32 index; + u64 i, recv_mem_used; + + /* In the SRQ case there is a common rx buffer called the srq_ring. + * However, for the no srq case we create an rx_ring for every + * struct ipoib_cm_rx. 
+ */ + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL); + if (!p->rx_ring) { + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", + p->qp->qp_num); + return -ENOMEM; + } + + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + + init_context_and_add_list(cm_id, p, priv); + + for (index = 0; index < max_rc_qp; index++) + if (priv->cm.rx_index_table[index] == NULL) + break; + + if (index == max_rc_qp) { + spin_unlock_irq(&priv->lock); + ipoib_warn(priv, "no srq has reached the configurable limit " + "of %d RC QPs\n", max_rc_qp); + + /* We send a REJ to the remote side indicating that we + * have no more free RC QPs and leave it to the remote side + * to take appropriate action. This should leave the + * current set of QPs unaffected and any subsequent REQs + * will be able to use RC QPs if they are available. + */ + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); + ret = -EINVAL; + goto err_alloc_and_post; + } + recv_mem_used = (u64)ipoib_recvq_size * + (u64)atomic_inc_return(&current_rc_qp) * CM_PACKET_SIZE; + + if (recv_mem_used >= (1ul << 30)) + ipoib_warn(priv, "no srq is currently using %d MB of memory\n", + (unsigned int)recv_mem_used >> 20); + + priv->cm.rx_index_table[index] = p; + + /* We will subsequently use this stored pointer while freeing + * resources in stale task + */ + p->index = index; + spin_unlock_irq(&priv->lock); + + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) { + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); + ipoib_cm_dev_cleanup(dev); + goto err_alloc_and_post; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive " + "buffer %d\n", (int)i); + ipoib_cm_dev_cleanup(dev); + ret = -ENOMEM; + goto err_alloc_and_post; + } + + ret = post_receive_nosrq(dev, i << 32 | index); + if (ret) { + ipoib_warn(priv,
"post_receive_nosrq " + "failed for buf %lld\n", (unsigned long long)i); + ipoib_cm_dev_cleanup(dev); + ret = -EIO; + goto err_alloc_and_post; + } + } + + return 0; + +err_alloc_and_post: + atomic_dec(&current_rc_qp); + kfree(p->rx_ring); + spin_lock_irq(&priv->lock); + list_del_init(&p->list); + spin_unlock_irq(&priv->lock); + return ret; +} + static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { struct net_device *dev = cm_id->context; @@ -301,9 +472,6 @@ static int ipoib_cm_req_handler(struct i return -ENOMEM; p->dev = dev; p->id = cm_id; - cm_id->context = p; - p->state = IPOIB_CM_RX_LIVE; - p->jiffies = jiffies; INIT_LIST_HEAD(&p->list); p->qp = ipoib_cm_create_rx_qp(dev, p); @@ -313,19 +481,20 @@ static int ipoib_cm_req_handler(struct i } psn = random32() & 0xffffff; - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); - if (ret) - goto err_modify; - - spin_lock_irq(&priv->lock); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); - /* Add this entry to passive ids list head, but do not re-add it - * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list.
*/ - p->jiffies = jiffies; - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irq(&priv->lock); + if (!priv->cm.srq) { + ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); + if (ret) + goto err_modify; + } else { + p->rx_ring = NULL; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) + goto err_modify; + p->state = IPOIB_CM_RX_LIVE; + spin_lock_irq(&priv->lock); + init_context_and_add_list(cm_id, p, priv); + spin_unlock_irq(&priv->lock); + } ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); if (ret) { @@ -398,29 +567,60 @@ static void skb_put_frags(struct sk_buff } } -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. 
*/ + if (!list_empty(&p->list)) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; + u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV; struct sk_buff *skb, *newskb; struct ipoib_cm_rx *p; unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; - int frags; + int frags, ret; - ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", - wr_id, wc->status); + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", + (unsigned long long)wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { + if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) { spin_lock_irqsave(&priv->lock, flags); list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); ipoib_cm_start_rx_drain(priv); queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); spin_unlock_irqrestore(&priv->lock, flags); } else - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); return; } @@ -428,23 +628,15 @@ void ipoib_cm_handle_rx_wc(struct net_de if (unlikely(wc->status != IB_WC_SUCCESS)) { ipoib_dbg(priv, "cm recv error " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); + "(status=%d, wrid=%lld vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); ++priv->stats.rx_dropped; - goto repost; + goto repost_srq; } if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; - if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { - spin_lock_irqsave(&priv->lock, flags); - p->jiffies = jiffies; - /* Move this entry to list head, but do not re-add it - * if it has been moved out of list. 
*/ - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irqrestore(&priv->lock, flags); - } + timer_check_srq(priv, p); } frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, @@ -456,13 +648,96 @@ void ipoib_cm_handle_rx_wc(struct net_de * If we can't allocate a new RX buffer, dump * this packet and reuse the old buffer. */ - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", + (unsigned long long)wr_id); ++priv->stats.rx_dropped; - goto repost; + goto repost_srq; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + ipoib_cm_dma_unmap_rx(priv, frags, + priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb_reset_mac_header(skb); + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_receive_skb(skb); + +repost_srq: + ret = post_receive_srq(dev, wr_id); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_srq failed for buf %lld\n", + (unsigned long long)wr_id); + +} + +static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32; + u32 index; + struct ipoib_cm_rx *rx_ptr; + int frags, ret; + + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", + (unsigned long long)wr_id, wc->status); + + if (unlikely(wr_id >= 
ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); + return; + } + + index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK; + rx_ptr = priv->cm.rx_index_table[index]; + + skb = rx_ptr->rx_ring[wr_id].skb; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ipoib_dbg(priv, "cm recv error " + "(status=%d, wrid=%lld vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); + ++priv->stats.rx_dropped; + goto repost_nosrq; + } + + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) + timer_check_nosrq(priv, rx_ptr); + + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, + mapping); + if (unlikely(!newskb)) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. + */ + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", + (unsigned long long)wr_id); + ++priv->stats.rx_dropped; + goto repost_nosrq; + } + + ipoib_cm_dma_unmap_rx(priv, frags, rx_ptr->rx_ring[wr_id].mapping); + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); @@ -482,10 +757,22 @@ void ipoib_cm_handle_rx_wc(struct net_de skb->pkt_type = PACKET_HOST; netif_receive_skb(skb); -repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); +repost_nosrq: + ret = post_receive_nosrq(dev, wr_id << 32 | index); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_nosrq failed for buf %lld\n", + (unsigned long long)wr_id); +} + +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->cm.srq) + handle_rx_wc_srq(dev, wc); + else + handle_rx_wc_nosrq(dev, wc); } static inline int 
post_send(struct ipoib_dev_priv *priv, @@ -677,6 +964,43 @@ err_cm: return ret; } +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + int i; + + for (i = 0; i < ipoib_recvq_size; ++i) + if (p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + kfree(p->rx_ring); +} + +void dev_stop_nosrq(struct ipoib_dev_priv *priv) +{ + struct ipoib_cm_rx *p; + + spin_lock_irq(&priv->lock); + while (!list_empty(&priv->cm.passive_ids)) { + p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + free_resources_nosrq(priv, p); + list_del(&p->list); + spin_unlock_irq(&priv->lock); + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + atomic_dec(&current_rc_qp); + kfree(p); + spin_lock_irq(&priv->lock); + } + spin_unlock_irq(&priv->lock); + + cancel_delayed_work(&priv->cm.stale_task); + kfree(priv->cm.rx_index_table); +} + void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -691,6 +1015,11 @@ void ipoib_cm_dev_stop(struct net_device ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + if (!priv->cm.srq) { + dev_stop_nosrq(priv); + return; + } + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); @@ -814,7 +1143,9 @@ static struct ib_qp *ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_recv_wr = 0; attr.cap.max_send_sge = 1; + attr.cap.max_recv_sge = 0; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -854,7 +1185,7 @@ static int ipoib_cm_send_req(struct net_ req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + req.srq = !!priv->cm.srq; return ib_send_cm_req(id, &req); } @@ -1198,6 
+1529,8 @@ static void ipoib_cm_rx_reap(struct work list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); + if (!priv->cm.srq) + atomic_dec(&current_rc_qp); kfree(p); } } @@ -1216,12 +1549,19 @@ static void ipoib_cm_stale_task(struct w p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_move(&p->list, &priv->cm.rx_error_list); - p->state = IPOIB_CM_RX_ERROR; - spin_unlock_irq(&priv->lock); - ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); - if (ret) - ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + if (!priv->cm.srq) { + free_resources_nosrq(priv, p); + list_del_init(&p->list); + priv->cm.rx_index_table[p->index] = NULL; + spin_unlock_irq(&priv->lock); + } else { + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; + spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + } spin_lock_irq(&priv->lock); } @@ -1275,16 +1615,40 @@ int ipoib_cm_add_mode_attr(struct net_de return device_create_file(&dev->dev, &dev_attr_mode); } +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) +{ + struct ib_srq_init_attr srq_init_attr; + int ret; + + srq_init_attr.attr.max_wr = ipoib_recvq_size; + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * + sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring " + "(%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + return 0; +} + int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv 
*priv = netdev_priv(dev); - struct ib_srq_init_attr srq_init_attr = { - .attr = { - .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG - } - }; int ret, i; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1301,20 +1665,32 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; + ret = ib_query_device(priv->ca, &attr); + if (ret) return ret; - } - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", - priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; + if (attr.max_srq) { + /* This device supports SRQ */ + ret = create_srq(dev, priv); + if (ret) + return ret; + priv->cm.rx_index_table = NULL; + } else { + priv->cm.srq = NULL; + priv->cm.srq_ring = NULL; + + /* Every new REQ that arrives creates a struct ipoib_cm_rx. + * These structures form a link list starting with the + * passive_ids. For quick and easy access we maintain a table + * of pointers to struct ipoib_cm_rx called the rx_index_table + */ + priv->cm.rx_index_table = kcalloc(max_rc_qp, + sizeof *priv->cm.rx_index_table, + GFP_KERNEL); + if (!priv->cm.rx_index_table) { + printk(KERN_WARNING "Failed to allocate rx_index_table\n"); + return -ENOMEM; + } } for (i = 0; i < IPOIB_CM_RX_SG; ++i) @@ -1327,17 +1703,24 @@ int ipoib_cm_dev_init(struct net_device priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + /* One can post receive buffers even before the RX QP is created + * only in the SRQ case. 
Therefore for no srq we skip the rest of init + * and do that in ipoib_cm_req_handler() + */ + + if (priv->cm.srq) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (post_receive_srq(dev, i)) { + ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } } } --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 12:39:12.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-10-09 19:02:45.000000000 -0500 @@ -300,7 +300,7 @@ int ipoib_poll(struct net_device *dev, i for (i = 0; i < n; ++i) { struct ib_wc *wc = priv->ibwc + i; - if (wc->wr_id & IPOIB_CM_OP_SRQ) { + if (wc->wr_id & IPOIB_CM_OP_RECV) { ++done; --max; ipoib_cm_handle_rx_wc(dev, wc); @@ -566,7 +566,7 @@ void ipoib_drain_cq(struct net_device *d if (priv->ibwc[i].status == IB_WC_SUCCESS) priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR; - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV) ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-18 12:39:12.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-10-09 19:02:45.000000000 -0500 @@ -175,6 +175,18 @@ int ipoib_transport_dev_init(struct net_ if (!ret) size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; +#ifdef CONFIG_INFINIBAND_IPOIB_CM + + /* We increase the size of 
the CQ in the NOSRQ case to prevent CQ + * overflow. Every new REQ creates a new RX QP and each QP has an + * RX ring associated with it. Therefore we could have + * max_rc_qp*ipoib_recvq_size + ipoib_sendq_size CQEs + * in a CQ. + */ + if (!priv->cm.srq) + size += (max_rc_qp - 1) * ipoib_recvq_size; +#endif + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-03 12:01:58.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-09 21:34:24.000000000 -0500 @@ -1229,6 +1229,7 @@ static int __init ipoib_init_module(void ipoib_sendq_size = roundup_pow_of_two(ipoib_sendq_size); ipoib_sendq_size = min(ipoib_sendq_size, IPOIB_MAX_QUEUE_SIZE); ipoib_sendq_size = max(ipoib_sendq_size, IPOIB_MIN_QUEUE_SIZE); + max_rc_qp = min(max_rc_qp, IPOIB_MAX_RC_QP); ret = ipoib_register_debugfs(); if (ret) From sean.hefty at intel.com Wed Oct 10 14:59:43 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 14:59:43 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroyinglisten requests In-Reply-To: <470D4749.8000309@netxen.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com><470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> <470D4749.8000309@netxen.com> Message-ID: <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> >* in your patch, I suggest taking out the warning printk from >cma_listen_on_dev() when the listener create attempt fails; it might be >that the device is out of resources etc. Since the code takes care of >this situation pretty well, I don't see a need for the printk. That's easy enough to do. >* I don't see a reason for the internal_id and the device listeners >getting a refcount on the wildcard listener. 
Because, even without >these, it is guaranteed that the wildcard listener will exist at least >as long as any of the child device listeners are around, by looking >at the logic in rdma_destroy_id(). Can you provide some logic for >requiring this then? There are 2 ways to destroy an internal_id: destroying its parent (the wildcard listen) or removing its device. When a device is removed, the internal_id is removed from its parent list to ensure that it is only destroyed once. If the parent were to be destroyed at this point, it would destroy any remaining children, then be freed. The internal_id still exists however, and could be generating connection request events, which expect to find the parent. The reference ensures that the parent stays around as long as any children remain. >* not that I am very worried (and I suggest resolving this thru >another subsequent patch if it is really a problem), but I think device >removal is still racy wrt non wildcard listeners. Here's the sequence: >cma_process_remove()->cma_remove_id_dev() decides it will >rdma_destroy_id() the listener id, and at the same time a process >context rdma_destroy_id() decides it is going to do the same. There are >probably various ways to take care of this, the simple one might be for >rdma_destroy_id() to look at the "state" and make a decision about who >gets to destroy. A user cannot both return non-zero from their callback (indicating that the rdma_cm should destroy the id) and call rdma_destroy_id() on the same id. This is equivalent to calling rdma_destroy_id() twice. It's not too difficult for the user to avoid this. 
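The parent reference described in the reply above can be pictured with a small user-space sketch. It is only an illustration under stated assumptions: `parent_listen`, `child_listen` and the helper functions are hypothetical names, not rdma_cm structures or APIs, and a plain `int` counter stands in for whatever atomic reference counting the real code uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a wildcard listen; not the rdma_cm structure. */
struct parent_listen {
	int refcount;	/* one for the owner, plus one per child */
	int *freed;	/* set to 1 when the parent is released (demo only) */
};

/* Hypothetical stand-in for a per-device internal listen. */
struct child_listen {
	struct parent_listen *parent;
};

/* Drop one reference; the parent is freed only when the last one goes. */
static void parent_put(struct parent_listen *p)
{
	if (--p->refcount == 0) {
		*p->freed = 1;
		free(p);
	}
}

/* Create a parent holding a single reference for its owner. */
static struct parent_listen *parent_create(int *freed)
{
	struct parent_listen *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->refcount = 1;
	p->freed = freed;
	*freed = 0;
	return p;
}

/* Each child takes its own reference on the parent. */
static struct child_listen *child_create(struct parent_listen *p)
{
	struct child_listen *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	p->refcount++;
	c->parent = p;
	return c;
}

/* Destroying a child releases the reference it held on the parent. */
static void child_destroy(struct child_listen *c)
{
	struct parent_listen *p = c->parent;

	free(c);
	parent_put(p);
}
```

In this sketch, if the owner calls `parent_put()` while a child still exists, the parent's memory stays valid until `child_destroy()` drops the final reference, which mirrors the ordering guarantee the reply describes.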
- Sean From eli at dev.mellanox.co.il Wed Oct 10 15:13:22 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 11 Oct 2007 00:13:22 +0200 Subject: [ofa-general] Re: [PATCH] IB/mthca: optimize post srq In-Reply-To: References: <1192031738.7337.59.camel@mtls03> Message-ID: <4e6a6b3c0710101513i29226964r84570c6cc685e26e@mail.gmail.com> > > > > If this approach is accepted I can do the same for mlx4 > > I just looked at the mlx4 code -- it seems I already marked the error > paths as unlikely in the post srq recv function. So I don't think > there's anything to do there. > > - R. The comment was meant to be for the lockless approach but I put it here by mistake... From davem at davemloft.net Wed Oct 10 15:37:05 2007 From: davem at davemloft.net (David Miller) Date: Wed, 10 Oct 2007 15:37:05 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <1192021728.4853.17.camel@localhost> References: <20071010102331.GA10496@one.firstfloor.org> <20071010.034446.85819294.davem@davemloft.net> <1192021728.4853.17.camel@localhost> Message-ID: <20071010.153705.94557376.davem@davemloft.net> From: jamal Date: Wed, 10 Oct 2007 09:08:48 -0400 > On Wed, 2007-10-10 at 03:44 -0700, David Miller wrote: > > > I've always gotten very poor results when increasing the TX queue a > > lot, for example with NIU the point of diminishing returns seems to > > be in the range of 256-512 TX descriptor entries and this was with > > 1.6Ghz cpus. > > Is it interrupt per packet? From my experience, you may find interesting > results varying tx interrupt mitigation parameters in addition to the > ring parameters. > Unfortunately when you do that, optimal parameters also depend on > packet size. so what may work for 64B, won't work well for 1400B. No, it was not interrupt per-packet, I was telling the chip to interrupt me every 1/4 of the ring. 
From davem at davemloft.net Wed Oct 10 15:53:22 2007 From: davem at davemloft.net (David Miller) Date: Wed, 10 Oct 2007 15:53:22 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010120215.7ec19323.billfink@mindspring.com> References: <1191967006.5324.14.camel@localhost> <20071009.170435.43504422.davem@davemloft.net> <20071010120215.7ec19323.billfink@mindspring.com> Message-ID: <20071010.155322.43010360.davem@davemloft.net> From: Bill Fink Date: Wed, 10 Oct 2007 12:02:15 -0400 > On Tue, 09 Oct 2007, David Miller wrote: > > > We have to keep in mind, however, that the sw queue right now is 1000 > > packets. I heavily discourage any driver author to try and use any > > single TX queue of that size. Which means that just dropping on back > > pressure might not work so well. > > > > Or it might be perfect and signal TCP to backoff, who knows! :-) > > I can't remember the details anymore, but for 10-GigE, I have encountered > cases where I was able to significantly increase TCP performance by > increasing the txqueuelen to 10000, which is the setting I now use for > any 10-GigE testing. For some reason this does not surprise me. We bumped the ethernet default up to 1000 for gigabit. From Sunkyoung.Shin at falconstor.com Wed Oct 10 15:54:10 2007 From: Sunkyoung.Shin at falconstor.com (Sunkyoung Shin) Date: Wed, 10 Oct 2007 18:54:10 -0400 Subject: [ofa-general] rdma retry number Message-ID: <63272BF021AFD644870BF7D57AAB387001476CAF@CORPEXCH01.FalconStor.Net> Hello, During a failover test, we found that iscsi over iser reconnected to the iscsi target only after 100 seconds, due to the default max timeout (8 sec) and retry number (15). The max timeout was adjustable with the module parameter, max_timeout, of ib_cm.ko, but the retry number wasn't. Can we add the retry number as a module parameter of rdma_cm.ko? I added the patch below based on the ofed version, OFED-1.2-20070626-0917. 
diff -Naur ofa_kernel-1.2.orig/drivers/infiniband/core/cma.c ofa_kernel-1.2/drivers/infiniband/core/cma.c --- ofa_kernel-1.2.orig/drivers/infiniband/core/cma.c 2007-06-26 12:17:47.000000000 -0400 +++ ofa_kernel-1.2/drivers/infiniband/core/cma.c 2007-10-10 18:41:09.000000000 -0400 @@ -53,6 +53,10 @@ #define CMA_CM_RESPONSE_TIMEOUT 20 #define CMA_MAX_CM_RETRIES 15 +static int cma_max_cm_retries = CMA_MAX_CM_RETRIES; +module_param_named(cma_max_cm_retries, cma_max_cm_retries, int, 0644); +MODULE_PARM_DESC(cma_max_cm_retries, "the number of retry"); + static void cma_add_one(struct ib_device *device); static void cma_remove_one(struct ib_device *device); @@ -1985,7 +1989,7 @@ req.service_id = cma_get_service_id(id_priv->id.ps, &route->addr.dst_addr); req.timeout_ms = 1 << (CMA_CM_RESPONSE_TIMEOUT - 8); - req.max_cm_retries = CMA_MAX_CM_RETRIES; + req.max_cm_retries = cma_max_cm_retries; ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req); if (ret) { @@ -2045,7 +2049,7 @@ req.rnr_retry_count = conn_param->rnr_retry_count; req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; - req.max_cm_retries = CMA_MAX_CM_RETRIES; + req.max_cm_retries = cma_max_cm_retries; req.srq = id_priv->srq ? 1 : 0; ret = ib_send_cm_req(id_priv->cm_id.ib, &req); Sunkyoung Shin FalconStor Software, Inc. From davem at davemloft.net Wed Oct 10 16:04:54 2007 From: davem at davemloft.net (David Miller) Date: Wed, 10 Oct 2007 16:04:54 -0700 (PDT) Subject: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space. In-Reply-To: <470D3D93.2020606@ichips.intel.com> References: <20070809.145534.102938208.davem@davemloft.net> <470AA729.2050009@opengridcomputing.com> <470D3D93.2020606@ichips.intel.com> Message-ID: <20071010.160454.25158026.davem@davemloft.net> From: Sean Hefty Date: Wed, 10 Oct 2007 14:01:07 -0700 > > The hack to use a socket and bind it to claim the port was just for > > demostrating the idea. 
The correct solution, IMO, is to enhance the > > core low level 4-tuple allocation services to be more generic (eg: not > > be tied to a struct sock). Then the host tcp stack and the host rdma > > stack can allocate TCP/iWARP ports/4tuples from this common exported > > service and share the port space. This allocation service could also be > > used by other deep adapters like iscsi adapters if needed. > > Since iWarp runs on top of TCP, the port space is really the same. > FWIW, I agree that this proposal is the correct solution to support iWarp. But you can be sure it's not going to happen, sorry. It would mean that we'd need to export the entire TCP socket table so that when iWARP connections are created you can search to make sure there is not an existing full 4-tuple that is the same. It is not just about local TCP ports. iWARP needs to live in its separate little container and not contaminate the rest of the networking, this is the deal. Any suggested such change which breaks that deal will be NACK'd by all of the core networking developers. From kanoj at netxen.com Wed Oct 10 16:17:10 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Wed, 10 Oct 2007 16:17:10 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroyinglisten requests In-Reply-To: <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com><470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> <470D4749.8000309@netxen.com> <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> Message-ID: <470D5D76.7010305@netxen.com> Sean Hefty wrote: >>* in your patch, I suggest taking out the warning printk from >>cma_listen_on_dev() when the listener create attempt fails; it might be >>that the device is out of resources etc. Since the code takes care of >>this situation pretty well, I don't see a need for the printk. >> >> > >That's easy enough to do. 
> > > >>* I don't see a reason for the internal_id and the device listeners >>getting a refcount on the wildcard listener. Because, even without >>these, it is guaranteed that the wildcard listener will exist at least >>as long as any of the children device listener's are around, by looking >>at the logic in rdma_destroy_id(). Can you provide some logic for >>requring this then? >> >> > >There are 2 ways to destroy an internal_id: destroying its parent (the wildcard >listen) or removing its device. When a device is removed, the internal_id is >removed from its parent list to ensure that it is only destroyed once. If the >parent were to be destroyed at this point, it would destroy any remaining >children, then be freed. The internal_id still exists however, and could be >generating connection request events, which expects to fine the parent. The >reference ensures that the parent stays around as long as any children remain. > > Ok, makes sense. > > >>* not that I am very worried (and I suggesting resolving this thru >>another subsequent patch if it is really a problem), but I think device >>removal is still racy wrt non wildcard listeners. Here's the sequence: >>cma_process_remove()->cma_remove_id_dev() decides it will >>rdma_destroy_id() the listener id, and at the same time a process >>context rdma_destroy_id() decides it is going to do the same. There are >>probably various ways to take care of this, the simple one might be for >>rdma_destroy_id() to look at the "state" and make a decision about who >>gets to destroy. >> >> > >A user cannot both return non-zero from their callback (indicating that the >rdma_cm should destroy the id) and call rdma_destroy_id() on the same id. This >is equivalent to call rdma_destroy_id() twice. It's not too difficult for the >user to avoid this. > >- Sean > > > I don't understand your response. ucma.c for example can call rdma_create_id() and rdma_destroy_id(), correct? 
What says that when ucma.c does a rdma_destroy_id() on a nonwildcard listener, a device removal is not attempting to do the same on the listener? If this is possible, the code paths I mentioned above can still trigger a double destruct on a listener, correct? Kanoj From sean.hefty at intel.com Wed Oct 10 16:30:46 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 16:30:46 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroyinglisten requests In-Reply-To: <470D5D76.7010305@netxen.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com><470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> <470D4749.8000309@netxen.com> <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> <470D5D76.7010305@netxen.com> Message-ID: <000001c80b95$9603c040$39c8180a@amr.corp.intel.com> >I don't understand your response. ucma.c for example can call >rdma_create_id() and rdma_destroy_id(), correct? What says that when >ucma.c does a rdma_destroy_id() on a nonwildcard listener, a device >removal is not attempting to do the same on the listener? If this is >possible, the code paths I mentioned above can still trigger a double >destruct on a listener, correct? Device removal only automatically destroys internal listens, and a non-wildcard listen would never generate an internal listen. Internal listens are used to map wildcard listens across multiple RDMA devices. Their creation and destruction is contained to the cma. From the viewpoint of the device removal code, a nonwildcard listen is treated the same as a connected id. The ucma only destroys id's from an event callback if the id is for a new connection which it can't handle. Hope this makes sense. 
- Sean From sean.hefty at intel.com Wed Oct 10 17:00:40 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 17:00:40 -0700 Subject: [ofa-general] RE: rdma retry number In-Reply-To: <63272BF021AFD644870BF7D57AAB387001476CAF@CORPEXCH01.FalconStor.Net> References: <63272BF021AFD644870BF7D57AAB387001476CAF@CORPEXCH01.FalconStor.Net> Message-ID: <000101c80b99$c2dfd550$39c8180a@amr.corp.intel.com> >During a failover test, we found that iscsi over iser reconnected to the >iscsi target only after 100 seconds, due to the default max timeout (8 sec) and >retry number (15). The max timeout was adjustable with the module >parameter, max_timeout, of ib_cm.ko, but the retry number wasn't. Can we >add the retry number as a module parameter of rdma_cm.ko? I added the >patch below based on the ofed version, OFED-1.2-20070626-0917. Note that you can abort a connection operation by destroying the corresponding rdma_cm_id. Does iser try to re-establish a connection over the same path on failover? I'm wondering why it tried to connect over the failed path first. >+static int cma_max_cm_retries = CMA_MAX_CM_RETRIES; >+module_param_named(cma_max_cm_retries, cma_max_cm_retries, int, 0644); >+MODULE_PARM_DESC(cma_max_cm_retries, "the number of retry"); This must be a value between 0 and 15. I need to see if there's a better way to support users that want smaller connection timeouts. 
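The range limit mentioned in the reply above could be enforced wherever the parameter value is consumed. The following is a minimal user-space sketch, assuming only the 0 to 15 range stated in the reply; `clamp_cm_retries` and `report_cm_retries` are hypothetical helpers written for illustration and are not part of cma.c or the posted patch.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: the valid retry range is 0..15, as stated in the reply. */
#define CM_RETRIES_MIN 0
#define CM_RETRIES_MAX 15

/* Hypothetical helper: force a user-supplied value into the valid range. */
static int clamp_cm_retries(int requested)
{
	if (requested < CM_RETRIES_MIN)
		return CM_RETRIES_MIN;
	if (requested > CM_RETRIES_MAX)
		return CM_RETRIES_MAX;
	return requested;
}

/* Hypothetical helper: print the requested and effective values,
 * then return the effective (clamped) value. */
static int report_cm_retries(int requested)
{
	int used = clamp_cm_retries(requested);

	assert(used >= CM_RETRIES_MIN && used <= CM_RETRIES_MAX);
	printf("requested %d, using %d\n", requested, used);
	return used;
}
```

Clamping at the point of use, rather than rejecting the value when it is set, is just one possible design choice; it keeps an out-of-range setting from ever reaching the request structure.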
- Sean From kanoj at netxen.com Wed Oct 10 17:03:20 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Wed, 10 Oct 2007 17:03:20 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroyinglisten requests In-Reply-To: <000001c80b95$9603c040$39c8180a@amr.corp.intel.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com><470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> <470D4749.8000309@netxen.com> <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> <470D5D76.7010305@netxen.com> <000001c80b95$9603c040$39c8180a@amr.corp.intel.com> Message-ID: <470D6848.2080806@netxen.com> Sean Hefty wrote: >>I don't understand your response. ucma.c for example can call >>rdma_create_id() and rdma_destroy_id(), correct? What says that when >>ucma.c does a rdma_destroy_id() on a nonwildcard listener, a device >>removal is not attempting to do the same on the listener? If this is >>possible, the code paths I mentioned above can still trigger a double >>destruct on a listener, correct? >> >> > >Device removal only automatically destroys internal listens, and a non-wildcard >listen would never generate an internal listen. Internal listens are used to > > Oh, ok. I must be missing something though. cma_process_remove() goes thru the device's id_list, and non-wildcard listeners do show up on this list (say thru rdma_bind_addr() -> cma_acquire_dev() -> cma_attach_to_dev()). So, cma_process_remove() would end up attempting a rdma_destroy_id(), no? Wait, I see ... cma_remove_id_dev() would return 0 from the event_handler, ensuring cma_process_remove() does not invoke rdma_destroy_id(), is that it? Kanoj >map wildcard listens across multiple RDMA devices. Their creation and >destruction is contained to the cma. From the viewpoint of the device removal >code, a nonwildcard listen is treated the same as a connected id. > >The ucma only destroys id's from an event callback if the id is for a new >connection which it can't handle. 
> >Hope this makes sense. > >- Sean > > > From mshefty at ichips.intel.com Wed Oct 10 17:16:49 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 17:16:49 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroyinglisten requests In-Reply-To: <470D6848.2080806@netxen.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com><470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> <470D4749.8000309@netxen.com> <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> <470D5D76.7010305@netxen.com> <000001c80b95$9603c040$39c8180a@amr.corp.intel.com> <470D6848.2080806@netxen.com> Message-ID: <470D6B71.9000807@ichips.intel.com> > Wait, I see ... cma_remove_id_dev() would return 0 from the > event_handler, ensuring cma_process_remove() does not invoke > rdma_destroy_id(), is that it? yep - the destruction of the id is controlled by the user From kanoj at netxen.com Wed Oct 10 17:43:57 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Wed, 10 Oct 2007 17:43:57 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroyinglisten requests In-Reply-To: <470D6B71.9000807@ichips.intel.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com><470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> <470D4749.8000309@netxen.com> <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> <470D5D76.7010305@netxen.com> <000001c80b95$9603c040$39c8180a@amr.corp.intel.com> <470D6848.2080806@netxen.com> <470D6B71.9000807@ichips.intel.com> Message-ID: <470D71CD.9090007@netxen.com> Sean Hefty wrote: >> Wait, I see ... cma_remove_id_dev() would return 0 from the >> event_handler, ensuring cma_process_remove() does not invoke >> rdma_destroy_id(), is that it? > > > yep - the destruction of the id is controlled by the user > Ok, one last thing while we are here. cma_process_remove() -> cma_remove_id_dev() generates the event for device removal. 
This is ok to do as long as it can be guaranteed that a racing rdma_destroy_id() has not returned back to caller, correct? IE, the caller must be willing to accept device removal events until its rdma_destroy_id() returns. If so, why is cma_remove_id_dev() trying so hard to not generate the event when rdma_destroy_id() has gotten to the point of setting CMA_DESTROYING? Could it not just generate the event, happy in the knowledge that the refcount bump done by cma_process_remove() will prevent the rdma_destroy_id() call from returning? If it could, that could mean all the cma_exch() code can be deleted from cma.c, and the CMA_DESTROYING state can also go away (your patch has taken out the only other reason CMA_DESTROYING was needed). Kanoj From swelch at systemfabricworks.com Wed Oct 10 19:59:16 2007 From: swelch at systemfabricworks.com (swelch at systemfabricworks.com) Date: Wed, 10 Oct 2007 21:59:16 -0500 Subject: [ofa-general] [PATCH V2] infiniband/core: Enable loopback of DR SMP responses from userspace Message-ID: <470D9184.mail1TC11IRJX@systemfabricworks.com> Sean, Roland, The local loopback of an outgoing DR SMP response is limited to those that originate at the driver specific SMA implementation during the drivers process_mad() function. This patch[v2] enables the DR SMP response originating in user space (or elsewhere) to be delivered back up the stack on the same node. In this case the driver specific process_mad() function does not consume or process the MAD so it must be manually copied to the MAD buffer which is to be handed off to a local agent. This is version 2 of the patch, the comments are updated, function renamed to better reflect IB specification terminology, and setting of the TID removed which this patch elminates the need for. 
Thanks, Steve Signed-off-by: Steve Welch --- diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..3c26cea 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, } /* Check to post send on QP or process locally */ - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && + smi_check_local_outgoing_smp(smp, device) == IB_SMI_DISCARD) goto out; local = kmalloc(sizeof *local, GFP_ATOMIC); @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, port_priv = ib_get_mad_port(mad_agent_priv->agent.device, mad_agent_priv->agent.port_num); if (port_priv) { - mad_priv->mad.mad.mad_hdr.tid = - ((struct ib_mad *)smp)->mad_hdr.tid; + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); recv_mad_agent = find_mad_agent(port_priv, &mad_priv->mad.mad); } diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h index 1cfc298..53407b1 100644 --- a/drivers/infiniband/core/smi.h +++ b/drivers/infiniband/core/smi.h @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, u8 node_type, int port_num); /* - * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM + * via process_mad */ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, struct ib_device *device) @@ -71,4 +72,19 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, (smp->hop_ptr == smp->hop_cnt + 1)) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); } + +/* + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM + * via process_mad + */ +static inline enum smi_action smi_check_local_outgoing_smp(struct ib_smp *smp, + struct ib_device *device) +{ + /* C14-13:3 -- We're at the end of the DR segment of path */ + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ + return ((device->process_mad && + ib_get_smp_direction(smp) && + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); +} + #endif /* __SMI_H_ */ From dledford at redhat.com Wed Oct 10 20:26:19 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 11 Oct 2007 03:26:19 +0000 Subject: [ofa-general] librdmacm and libmthca question Message-ID: <1192073179.19888.405.camel@firewall.xsintricity.com> OK, I ran into an issue with librdmacm and I was curious what the answers to these issues are. First, the rdma_connect/rdma_accept functions both require a connection param struct. That struct tells librdmacm what you want in terms of responder_resources and initiator_depth. Reading the man page, that's the number of outstanding RMDA reads and RDMA atomic operations. In usage, I found that the QP max_recv_wr and max_send_wr are totally unrelated to this (I at first thought they could be the same). In fact, on mthca hardware I found the hard limit to be either 4 or 5 (4 worked, 6 didn't, didn't try 5, assumed 4). So even with a send queue depth of 128, I couldn't get above a 4 depth on initiator_depth. I think it might be of value to document somewhere that the initiator depth and responder resources are not directly related to the actual work queue depth, and that without some sort of intervention, are not that high. However, I spent a *lot* of time tracking this down because the failure doesn't occur until rdma_accept time. Passing an impossibly high value in initiator_depth or responder_resources doesn't fail on rdma_connect. 
This leads one to believe that the values are OK, even though they fail when you use the same values in rdma_accept. A note to this effect in the man pages would help. Second, now that I know that mthca hardware fails with initiator depth or responder resources > 4, it raises several unanswered questions: 1) Can this limit be adjusted by module parameters, and if so, which ones? 2) Does this limit represent the limit on outstanding RDMA READ/Atomic operations in a) progress, b) queue, or c) registration? 3) The answer to #2 implies the answer to this, but I would like a specific response. If I attempt to register more IBV_ACCESS_REMOTE_READ memory regions than responder resources, what happens? If I attempt to queue more IBV_WR_RDMA_READ work requests than initiator_depth, what happens? If there are more IBV_WR_RDMA_READ requests in queue than initiator_depth and it hits the initiator_depth + 1 request while still processing the preceding requests, what happens? -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From swelch at systemfabricworks.com Wed Oct 10 20:29:25 2007 From: swelch at systemfabricworks.com (swelch at systemfabricworks.com) Date: Wed, 10 Oct 2007 22:29:25 -0500 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace Message-ID: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> Sean, Roland, This patch [v3] replaces the [v2] patch; it includes those changes but renames the smi function testing returning SMP requests to the name Hal recommends. This patch allows userspace DR SMP responses to be looped back and delivered to a local mad agent by the management stack.
Thanks, Steve Signed-off-by: Steve Welch --- drivers/infiniband/core/mad.c | 6 +++--- drivers/infiniband/core/smi.h | 18 +++++++++++++++++- 2 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..98148d6 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, } /* Check to post send on QP or process locally */ - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) goto out; local = kmalloc(sizeof *local, GFP_ATOMIC); @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, port_priv = ib_get_mad_port(mad_agent_priv->agent.device, mad_agent_priv->agent.port_num); if (port_priv) { - mad_priv->mad.mad.mad_hdr.tid = - ((struct ib_mad *)smp)->mad_hdr.tid; + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); recv_mad_agent = find_mad_agent(port_priv, &mad_priv->mad.mad); } diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h index 1cfc298..aff96ba 100644 --- a/drivers/infiniband/core/smi.h +++ b/drivers/infiniband/core/smi.h @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, u8 node_type, int port_num); /* - * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM + * via process_mad */ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, struct ib_device *device) @@ -71,4 +72,19 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, (smp->hop_ptr == smp->hop_cnt + 1)) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); } + +/* + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM + * via process_mad + */ +static inline enum smi_action smi_check_local_returning_smp(struct ib_smp *smp, + struct ib_device *device) +{ + /* C14-13:3 -- We're at the end of the DR segment of path */ + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ + return ((device->process_mad && + ib_get_smp_direction(smp) && + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); +} + #endif /* __SMI_H_ */ From kliteyn at mellanox.co.il Wed Oct 10 22:10:46 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 11 Oct 2007 07:10:46 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-11:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-10 OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From dotanb at dev.mellanox.co.il Wed Oct 10 23:25:26 2007 From: dotanb at 
dev.mellanox.co.il (Dotan Barak) Date: Thu, 11 Oct 2007 08:25:26 +0200 Subject: [ofa-general] librdmacm and libmthca question In-Reply-To: <1192073179.19888.405.camel@firewall.xsintricity.com> References: <1192073179.19888.405.camel@firewall.xsintricity.com> Message-ID: <470DC1D6.7020108@dev.mellanox.co.il> Hi. I can try to answer some of the questions that you have which are related to the core/verbs. Doug Ledford wrote: > OK, I ran into an issue with librdmacm and I was curious what the > answers to these issues are. > > First, the rdma_connect/rdma_accept functions both require a connection > param struct. That struct tells librdmacm what you want in terms of > responder_resources and initiator_depth. Reading the man page, that's > the number of outstanding RMDA reads and RDMA atomic operations. In > usage, I found that the QP max_recv_wr and max_send_wr are totally > unrelated to this (I at first thought they could be the same). In fact, > on mthca hardware I found the hard limit to be either 4 or 5 (4 worked, > 6 didn't, didn't try 5, assumed 4). So even with a send queue depth of > 128, I couldn't get above a 4 depth on initiator_depth. I think it > might be of value to document somewhere that the initiator depth and > responder resources are not directly related to the actual work queue > depth, and that without some sort of intervention, are not that high. > > However, I spent a *lot* of time tracking this down because the failure > doesn't occur until rdma_accept time. Passing an impossibly high value > in initiator_depth or responder_resources doesn't fail on rdma_connect. > This leads one to believe that the values are OK, even though they fail > when you use the same values in rdma_accept. A note to this effect in > the man pages would help. 
> > Second, now that I know that mthca hardware fails with initiator depth > or responder resources > 4, it raises several unanswered questions: > > 1) Can this limit be adjusted by module parameters, and if so, which > ones? > This value is an attribute of the device (there is an upper limit on how many outstanding RDMA Reads/Atomics it supports). The mthca low-level driver is loaded with a default value of 4 (which is less than the device capability), but there is a module parameter called rdb_per_qp which can be changed to support a higher value. > 2) Does this limit represent the limit on outstanding RMDA READ/Atomic > operations in a) progress, b) queue, or c) registration? > This value limits the number of RDMA Reads/Atomics which can be processed in parallel on this QP. For example: if you posted 100 RDMA Reads and the QP was configured to support only 4, then 4 RDMA Reads will be processed in parallel at any time; when one finishes, another one begins, until all 100 have been processed. So the answer is a), in progress. > 3) The answer to #2 implies the answer to this, but I would like a > specific response. If I attempt to register more IBV_ACCESS_REMOTE_READ > memory regions than responder resources, what happens? If I attempt to > queue more IBV_WR_RDMA_READ work requests than initiator_depth, what > happens? If there are more IBV_WR_RDMA_READ requests in queue than > initiator_depth and it hits the initiator_depth + 1 request while still > processing the proceeding requests, what happens? > There isn't any connection between the number of Memory Regions that you have (it doesn't matter which permission you registered them with) and the value that you gave to the QP to handle RDMA Reads/Atomics.
(A MR can be shared with several QPs) I hope that i helped you with this info Dotan From sean.hefty at intel.com Wed Oct 10 23:48:56 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 10 Oct 2007 23:48:56 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroyinglisten requests In-Reply-To: <470D71CD.9090007@netxen.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com><470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> <470D4749.8000309@netxen.com> <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> <470D5D76.7010305@netxen.com> <000001c80b95$9603c040$39c8180a@amr.corp.intel.com> <470D6848.2080806@netxen.com> <470D6B71.9000807@ichips.intel.com> <470D71CD.9090007@netxen.com> Message-ID: <000501c80bd2$cbe7c160$6acc180a@amr.corp.intel.com> >cma_process_remove() -> cma_remove_id_dev() generates the event for >device removal. This is ok to do as long as it can be guaranteed that a >racing rdma_destroy_id() has not returned back to caller, correct? > >IE, the caller must be willing to accept device removal events until its >rdma_destroy_id() returns. Correct - rdma_destroy_id() blocks until all callbacks from the rdma_cm have completed. >If so, why is cma_remove_id_dev() trying so hard to not generate the >event when rdma_destroy_id() has gotten to the point of setting >CMA_DESTROYING? Could it not just generate the event, happy in the >knowledge that the refcount bump done by cma_process_remove() will >prevent the rdma_destroy_id() call from returning? There are two ways for the user to destroy an rdma_cm_id. They can either call rdma_destroy_id() directly or return a non-zero value from a callback. In order to support the latter, all callbacks to a user on the same rdma_cm_id must be serialized, and once the user has returned a non-zero value no further callbacks can occur. (Otherwise the user wouldn't know when it was safe to deallocate their connection context.) 
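The second destruction path (returning non-zero from a callback) can be pictured with a toy, single-threaded model. All names below are made up for illustration and this is not the cma.c implementation; in particular it does not model the locking that serializes callbacks, only the rule that nothing is delivered after a handler has asked for destruction.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an rdma_cm_id; both fields are illustrative only. */
struct toy_cm_id {
    int destroyed;   /* set once a handler returns non-zero */
    int delivered;   /* number of callbacks actually invoked */
};

typedef int (*toy_event_handler)(struct toy_cm_id *id, int event);

/*
 * Deliver one event. Returns 1 if the handler was invoked and 0 if the
 * event was suppressed because the id was already marked destroyed.
 */
static int toy_deliver_event(struct toy_cm_id *id,
                             toy_event_handler handler, int event)
{
    assert(id != NULL && handler != NULL);
    if (id->destroyed)
        return 0;               /* no callbacks after destruction */
    id->delivered++;
    if (handler(id, event) != 0)
        id->destroyed = 1;      /* user asked for the id to go away */
    return 1;
}

/* Example handler: requests destruction when it sees event 2. */
static int toy_handler(struct toy_cm_id *id, int event)
{
    (void)id;
    return event == 2;
}
```

Delivering events 1, 2 and 3 in that order through toy_deliver_event() invokes the handler twice; the third event is dropped because the handler returned non-zero on the second.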
Since a device removal can occur at any point, the device removal callback must be serialized with any other callback in progress. It does this by marking that the device has been removed. This prevents any new callbacks from being invoked, but a callback may already be in progress. The device removal code waits for that callback to complete. After it completes, it needs to see if the user wants to destroy the rdma_cm_id - meaning they returned a non-zero value from the first callback. If so, then the device removal callback cannot be invoked. One other point is that all event callbacks for a given rdma_cm_id end up being serialized by default. Only device removal event requires special handling, since that thread can run at any time. If you look at some of the callback handlers (named *_handler), you'll see calls to disable/enable remove, which provides this serialization. - Sean From krkumar2 at in.ibm.com Wed Oct 10 23:52:23 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Thu, 11 Oct 2007 12:22:23 +0530 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071009.134331.35664207.davem@davemloft.net> Message-ID: Hi Dave, David Miller wrote on 10/10/2007 02:13:31 AM: > > Hopefully that new qdisc will just use the TX rings of the hardware > > directly. They are typically large enough these days. That might avoid > > some locking in this critical path. > > Indeed, I also realized last night that for the default qdiscs > we do a lot of stupid useless work. If the queue is a FIFO > and the device can take packets, we should send it directly > and not stick it into the qdisc at all. Since you are talking of how it should be done in the *current* code, I feel LLTX drivers will not work nicely with this. 
Actually I was trying this change a couple of weeks back, but felt that doing so would result in out-of-order packets (skbs present in the queue which were not sent out because of an LLTX failure will be sent out only at the next net_tx_action, while other skbs are sent ahead). One option is to first call qdisc_run() and then process this skb, but that is ugly (requeue handling). However I guess this can be done cleanly once LLTX is removed. Thanks, - KK From sean.hefty at intel.com Thu Oct 11 00:00:31 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 11 Oct 2007 00:00:31 -0700 Subject: [ofa-general] librdmacm and libmthca question In-Reply-To: <1192073179.19888.405.camel@firewall.xsintricity.com> References: <1192073179.19888.405.camel@firewall.xsintricity.com> Message-ID: <000601c80bd4$69fefcf0$6acc180a@amr.corp.intel.com> >might be of value to document somewhere that the initiator depth and >responder resources are not directly related to the actual work queue >depth, and that without some sort of intervention, are not that high. FYI - I am currently updating the librdmacm man pages based on your other e-mail. I will make sure to document these values. >This leads one to believe that the values are OK, even though they fail >when you use the same values in rdma_accept. A note to this effect in >the man pages would help. I will at least note how to obtain the correct maximum for local responder_resources by querying the local HCA. The correct setting for initiator_depth needs to come from the remote endpoint. Please continue to let me know what other problems you run into. I'd like to address as many issues as possible for the next release. (New features will be pushed out some.)
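A rough sketch of how the two values could be chosen before they are placed in the connection parameters. It assumes the local maximum has already been read from the HCA attributes (for example the max_qp_rd_atom field filled in by ibv_query_device()) and that the peer's responder_resources has been learned by some application-specific exchange. The struct and helper names are hypothetical; only the two field names come from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the two connection parameters discussed above. */
struct example_conn_limits {
    uint8_t responder_resources;  /* incoming RDMA Reads/Atomics we can serve */
    uint8_t initiator_depth;      /* outgoing RDMA Reads/Atomics we may issue */
};

/* Smaller of two values, saturated to what fits in a uint8_t field. */
static uint8_t example_min_u8(unsigned a, unsigned b)
{
    unsigned m = a < b ? a : b;
    return (uint8_t)(m > 255 ? 255 : m);
}

/*
 * wanted:                     what the application would like to use
 * local_max_rd_atom:          limit reported by the local HCA (assumed input)
 * remote_responder_resources: value advertised by the peer (assumed input)
 */
static struct example_conn_limits
example_pick_limits(unsigned wanted, unsigned local_max_rd_atom,
                    unsigned remote_responder_resources)
{
    struct example_conn_limits l;

    l.responder_resources = example_min_u8(wanted, local_max_rd_atom);
    l.initiator_depth = example_min_u8(wanted, remote_responder_resources);
    /* We must never plan to issue more than the peer can serve. */
    assert(l.initiator_depth <= remote_responder_resources);
    return l;
}
```

With the mthca default of 4 on both sides, asking for 128 would be trimmed to 4 and 4, which matches the behaviour reported at the start of this thread.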
- Sean From ogerlitz at voltaire.com Thu Oct 11 00:59:41 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 11 Oct 2007 09:59:41 +0200 Subject: [ofa-general] Re: [ewg] rdma retry number In-Reply-To: <63272BF021AFD644870BF7D57AAB387001476CAF@CORPEXCH01.FalconStor.Net> References: <63272BF021AFD644870BF7D57AAB387001476CAF@CORPEXCH01.FalconStor.Net> Message-ID: <470DD7ED.80606@voltaire.com> Sunkyoung Shin wrote: > During failover test, we found the iscsi over iser reconnected to the > iscs target after 100 seconds due to the default max timeout (8sec) and > retry number (15). The max timeout was adjustable with the module > parameter, max_timeout, of ib_cm.ko, but the retry number wasn't. Can we > add the retry number as module parameter of rdma_cm.ko? I added the > patch below based on the ofed version, OFED-1.2-20070626-0917. I understand that you want the QP timeout/retries to be smaller, and not the CM timeout/retries and hence there might be some confusion here which the following rdma-cm code snip from cma_connect_ib() might help resolving: ... > req.qp_num = id_priv->qp_num; > req.qp_type = IB_QPT_RC; > req.starting_psn = id_priv->seq_num; > req.responder_resources = conn_param->responder_resources; > req.initiator_depth = conn_param->initiator_depth; > req.flow_control = conn_param->flow_control; > req.retry_count = conn_param->retry_count; > req.rnr_retry_count = conn_param->rnr_retry_count; > req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > req.max_cm_retries = CMA_MAX_CM_RETRIES; > req.srq = id_priv->srq ? 1 : 0; > > ret = ib_send_cm_req(id_priv->cm_id.ib, &req); ... The user is in total control on the QP retry count through the rdma-cm connection param structure, the req.max_cm_retries has nothing to do with the QP timeout. 
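To put numbers on the QP-side settings (as opposed to the CM retries), here is a small sketch of the IB local ACK timeout encoding, T = 4.096 us * 2^t, together with a deliberately simple budget of one full timeout per retry. The function names are hypothetical, and the one-timeout-per-retry model is an approximation, not a statement about how any particular HCA behaves.

```c
#include <assert.h>
#include <stdint.h>

/* Local ACK timeout for encoded value t, in nanoseconds (4.096 us * 2^t). */
static uint64_t example_ack_timeout_ns(unsigned t)
{
    /* Assumption: t is a 5-bit field and 0 does not denote a finite timeout. */
    assert(t >= 1 && t <= 31);
    return 4096ULL << t;
}

/* Rough estimate: each of retry_count attempts waits one full timeout. */
static uint64_t example_retry_budget_ns(unsigned t, unsigned retry_count)
{
    return example_ack_timeout_ns(t) * retry_count;
}
```

For instance, t = 19 gives 2^31 ns, about 2.1 seconds per attempt, and a retry count of 7 gives about 15 seconds in total under this model.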
The RC QP timeout is derived by the IB CM internally (on OFED, through the module parameter which you have changed), and neither the rdma-cm nor its consumer has direct control over it. This follows the spirit of the IB spec, in which the SM/SA is the one to calculate and return to the host a parameter named "packet life time" for the path, so the IB CM combines the packet life time and something called the "hca ack delay". Currently the IB CM just uses 2 * path.packet_life_time as an estimate for the timeout (the packet life time plus the hca ack delay), see cm_init_av_by_path() in core/cm.c. Note that the actual timeout is T = 4.096us * 2^t where t is the value plugged into the QP. Hence setting t = path.packet_life_time + 1 does what I described above. In an examination I did in the past, I think that OpenSM always returned path.packet_life_time = 18, and the same for some vendor SMs. This means that the timeout is about 2^(2+18+1)us = 2^21us, roughly 2 seconds. The number of retries set by the iser initiator is seven (see iser_route_handler()), so seven times two gives 14 seconds, which suggests that your report of the 100 seconds it took the initiator to reconnect possibly points to a different problem. Or. From vlad at dev.mellanox.co.il Thu Oct 11 01:10:41 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 11 Oct 2007 10:10:41 +0200 Subject: [ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3 In-Reply-To: <470D2C69.3000500@opengridcomputing.com> References: <470A9363.4010007@opengridcomputing.com> <470BA4D4.3080707@dev.mellanox.co.il> <470BA93C.3010601@opengridcomputing.com> <470BB8DB.8090107@dev.mellanox.co.il> <470D2C69.3000500@opengridcomputing.com> Message-ID: <470DDA81.4060108@dev.mellanox.co.il> Steve Wise wrote: > Hey Vlad, > > The libcxgb3 rpms built by this ofed-1.2.5 release are still named > libcxgb3*-1.0.1 instead of 1.0.3. Can you update your spec files to > indicate that the library is release 1.0.3? > > You'll need to also update the ofed-1.3 spec file I guess.
> > Thanks, > > Steve. > Hi Steve, You should update libcxgb3 version in the configure.in file: Update version to 1.0.3 Signed-off-by: Vladimir Sokolovsky --- diff --git a/configure.in b/configure.in index 6f916d3..15406b7 100644 --- a/configure.in +++ b/configure.in @@ -1,11 +1,11 @@ dnl Process this file with autoconf to produce a configure script. AC_PREREQ(2.57) -AC_INIT(libcxgb3, 1.0.1, general at lists.openfabrics.org) +AC_INIT(libcxgb3, 1.0.3, general at lists.openfabrics.org) AC_CONFIG_SRCDIR([src/iwch.h]) AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) -AM_INIT_AUTOMAKE(libcxgb3, 1.0.1) +AM_INIT_AUTOMAKE(libcxgb3, 1.0.3) AM_PROG_LIBTOOL AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], Regards, Vladimir
From vlad at lists.openfabrics.org Thu Oct 11 02:52:52 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 11 Oct 2007 02:52:52 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071011-0200 daily build status Message-ID: <20071011095252.A56E4E60876@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on x86_64 with
linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Failed: From ogerlitz at voltaire.com Thu Oct 11 04:15:29 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 11 Oct 2007 13:15:29 +0200 Subject: [ofa-general] Re: [PATCH v3 for 2.6.24] IB/ipoib: enable IGMP for userpsace multicast IB apps In-Reply-To: References: Message-ID: <470E05D1.5040605@voltaire.com> Roland Dreier wrote: > OK, at long last I merged the following. I rewrote the changelog to > (I think) be more understandable, and also cleaned up a few things in > the patch (including whitespace damage...). thanks for all your work and sorry for the white space damage. Or. From hrosenstock at xsigo.com Thu Oct 11 04:35:32 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 11 Oct 2007 04:35:32 -0700 Subject: [ofa-general] Re: [ewg] rdma retry number In-Reply-To: <470DD7ED.80606@voltaire.com> References: <63272BF021AFD644870BF7D57AAB387001476CAF@CORPEXCH01.FalconStor.Net> <470DD7ED.80606@voltaire.com> Message-ID: <1192102532.17526.151.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-10-11 at 09:59 +0200, Or Gerlitz wrote: > In examination I did on the past I think that the openSM always returns > path.packet_life_time = 18 and same for some vendor SMs. 
That's the default for OpenSM but it is configurable on a subnet wide basis. -- Hal From swise at opengridcomputing.com Thu Oct 11 05:42:32 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 11 Oct 2007 07:42:32 -0500 Subject: [ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3 In-Reply-To: <470DDA81.4060108@dev.mellanox.co.il> References: <470A9363.4010007@opengridcomputing.com> <470BA4D4.3080707@dev.mellanox.co.il> <470BA93C.3010601@opengridcomputing.com> <470BB8DB.8090107@dev.mellanox.co.il> <470D2C69.3000500@opengridcomputing.com> <470DDA81.4060108@dev.mellanox.co.il> Message-ID: <470E1A38.2020902@opengridcomputing.com> oops. Lemme fix this up then we'll re-pull. Thanks, Steve. Vladimir Sokolovsky wrote: > Steve Wise wrote: >> Hey Vlad, >> >> The libcxgb3 rpms built by this ofed-1.2.5 release are still named >> libcxgb3*-1.0.1 instead of 1.0.3. Can you update your spec files to >> indicate that the library is release 1.0.3? >> >> You'll need to also update the ofed-1.3 spec file I guess. >> >> Thanks, >> >> Steve. >> > > Hi Steve, > You should update libcxgb3 version in the configure.in file: > > Update version to 1.0.3 > > Signed-off-by: Vladimir Sokolovsky > --- > diff --git a/configure.in b/configure.in > index 6f916d3..15406b7 100644 > --- a/configure.in > +++ b/configure.in > @@ -1,11 +1,11 @@ > dnl Process this file with autoconf to produce a configure script. 
> > AC_PREREQ(2.57) > -AC_INIT(libcxgb3, 1.0.1, general at lists.openfabrics.org) > +AC_INIT(libcxgb3, 1.0.3, general at lists.openfabrics.org) > AC_CONFIG_SRCDIR([src/iwch.h]) > AC_CONFIG_AUX_DIR(config) > AM_CONFIG_HEADER(config.h) > -AM_INIT_AUTOMAKE(libcxgb3, 1.0.1) > +AM_INIT_AUTOMAKE(libcxgb3, 1.0.3) > AM_PROG_LIBTOOL > > AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for > presence of ib libraries], > > > Regards, > Vladimir From ogerlitz at voltaire.com Thu Oct 11 07:27:47 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 11 Oct 2007 16:27:47 +0200 Subject: [ofa-general] OFED October 8 meeting summary on OFED 1.3 In-Reply-To: <470A8D36.7050407@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C901563EFB@mtlexch01.mtl.com> <470A8D36.7050407@mellanox.co.il> Message-ID: <470E32E3.1040101@voltaire.com> Tziporet Koren wrote: > OFED October 8 meeting summary on OFED 1.3 alpha readiness > Meeting summary: > ============ > 1. Alpha release is planed for this week (Wed or Thursday) > 2. Requests for the beta release Hi Tziporet, As of the centrality and importance of IPoIB, my take is that at this point the correct thing to do would be to remove the stateless offload patches et al that for the most part did not pass any review on the general list. I am talking on the LSO patches, the checksum patches whose related discussion has started but never ended. The LRO patch is problematic and need to be fixed as Eli has agreed. Its also pure SW optimization and I don't see why rush to merge it into OFED 1.3. As for the interrupt mitigation patches, I have sent some comments and Eli only replied on some of them, nothing was changed or fixed yet, and other then this, no review has been done. This is just my opinion, can be nice to see what others have to say. 
I am referring to these patches under kernel_patches/fixes > t_0010_ipoib_high_dma.patch > t_0017_ipoib_sg.patch > t_0019_hw_csum.patch > t_0020_core_csum.patch > t_0030_mthca_checksum_offload.patch > t_0040_mlx4_checksum_offload.patch > t_0050_ipoib_checksum_offload.patch > t_0060_ipoib_qp_init_attr.patch > t_0080_mlx4_qp_max_msg.patch > t_0090_core_lso.patch > t_0100_mlx4_lso.patch > t_0110_ipoib_lso.patch > t_0120_ipoib_ethtool.patch > t_0130_ipoib_lro.patch > t_0140_core_modify_cq.patch > t_0150_mlx4_modify_cq.patch > t_0160_ipoib_modify_cq.patch > t_0170_cq_coal.patch Or. From dledford at redhat.com Thu Oct 11 07:39:44 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 11 Oct 2007 14:39:44 +0000 Subject: [ofa-general] librdmacm and libmthca question In-Reply-To: <000601c80bd4$69fefcf0$6acc180a@amr.corp.intel.com> References: <1192073179.19888.405.camel@firewall.xsintricity.com> <000601c80bd4$69fefcf0$6acc180a@amr.corp.intel.com> Message-ID: <1192113584.19888.406.camel@firewall.xsintricity.com> On Thu, 2007-10-11 at 00:00 -0700, Sean Hefty wrote: > >might be of value to document somewhere that the initiator depth and > >responder resources are not directly related to the actual work queue > >depth, and that without some sort of intervention, are not that high. > > FYI - I am currently updating the librdmacm man pages based on your other > e-mail. I will make sure to document these values. > > >This leads one to believe that the values are OK, even though they fail > >when you use the same values in rdma_accept. A note to this effect in > >the man pages would help. > > I will at least note how to obtain the correct maximum for local > responder_resources by querying the local HCA. The correct setting for > initiator_depth needs to come from the remote endpoint. Works for Me (TM) > Please continue to let me know what other problems you run into. I'd like to > address as many issues as possible for the next release. 
(New features will be > pushed out some.) Will do. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From dledford at redhat.com Thu Oct 11 07:41:49 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 11 Oct 2007 14:41:49 +0000 Subject: [ofa-general] librdmacm and libmthca question In-Reply-To: <470DC1D6.7020108@dev.mellanox.co.il> References: <1192073179.19888.405.camel@firewall.xsintricity.com> <470DC1D6.7020108@dev.mellanox.co.il> Message-ID: <1192113709.19888.409.camel@firewall.xsintricity.com> On Thu, 2007-10-11 at 08:25 +0200, Dotan Barak wrote: > Hi. > > I can try to answer some of the questions that you have which are > related to the core/verbs. > > 1) Can this limit be adjusted by module parameters, and if so, which > > ones? > > > This value is an attribute of the device (there is an upper limit on how > many outstanding RDMA Reads/atomic > it supports). > The mthca low level driver is being loaded with default value of 4 > (which is less than the device capability), > but there is a module parameter called (rdb_per_qp) which can be > changed to support higher value. Thanks. I thought the rdb_per_qp might be related, but I wasn't sure. > > 2) Does this limit represent the limit on outstanding RDMA READ/Atomic > > operations in a) progress, b) queue, or c) registration? > > > This value limits the number of RDMA read/Atomic which can be processed > in parallel in this QP. > for example: you posted 100 RDMA Reads, and the QP was configured to > support only 4, > so 4 RDMA Reads will be processed every time in parallel, when one will > be finished, another one > will begin until all of your 100 will be processed: so the answer is a), > in progress. Cool.
Then it's not really all that much of a limit in terms of my usage anyway. Having 4 running in parallel should be plenty to keep the wire busy. > > 3) The answer to #2 implies the answer to this, but I would like a > > specific response. If I attempt to register more IBV_ACCESS_REMOTE_READ > > memory regions than responder resources, what happens? If I attempt to > > queue more IBV_WR_RDMA_READ work requests than initiator_depth, what > > happens? If there are more IBV_WR_RDMA_READ requests in queue than > > initiator_depth and it hits the initiator_depth + 1 request while still > > processing the preceding requests, what happens? > > > There isn't any connection between the number of Memory Regions that you > have (it doesn't matter > which permission you registered them with) and the value that you gave > to the QP to handle RDMA Reads/ > Atomic. (A MR can be shared with several QPs) > > I hope that I helped you with this info Yep, very much so. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From ogerlitz at voltaire.com Thu Oct 11 07:42:22 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 11 Oct 2007 16:42:22 +0200 Subject: [ofa-general] Re: OFED October 8 meeting summary - ofa devcon In-Reply-To: <470A8D36.7050407@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C901563EFB@mtlexch01.mtl.com> <470A8D36.7050407@mellanox.co.il> Message-ID: <470E364E.3050306@voltaire.com> Tziporet Koren wrote: > OFED October 8 meeting summary on OFED 1.3 alpha readiness > 3. We discussed some ideas for talks in the developer's summit.
The > following ideas were raised: sa caching (Intel), QoS support (Sean), > Extended RC (MPI team) We can discuss the connectx ipoib stateless offload approach/patches. It can also be nice if someone would present the TX batching approach now discussed and implemented in netdev, since its relevant also to IPoIB. I will be mostly off for the coming two weeks, but once back, be happy to help with setting the agenda for the meeting etc. Or. From eli at mellanox.co.il Thu Oct 11 07:55:55 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 11 Oct 2007 16:55:55 +0200 Subject: [ofa-general] OFED October 8 meeting summary on OFED 1.3 In-Reply-To: <470E32E3.1040101@voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901563EFB@mtlexch01.mtl.com> <470A8D36.7050407@mellanox.co.il> <470E32E3.1040101@voltaire.com> Message-ID: <1192114555.7337.103.camel@mtls03> > I am talking on the LSO patches, the checksum patches whose related > discussion has started but never ended. The LRO patch is problematic and > need to be fixed as Eli has agreed. Adding facilities to disable LRO is not a big problem. As I said, I will send a patch that does that. > Its also pure SW optimization and I > don't see why rush to merge it into OFED 1.3. > > As for the interrupt mitigation patches, I have sent some comments and > Eli only replied on some of them, nothing was changed or fixed yet, and > other then this, no review has been done. I am not aware that you have more questions about interrupt mitigation. We even discussed this over the phone so I assumed you don't have more questions. Anyway, please send any questions you still have. > > This is just my opinion, can be nice to see what others have to say. 
> > I am referring to these patches under kernel_patches/fixes > > > t_0010_ipoib_high_dma.patch > > t_0017_ipoib_sg.patch > > t_0019_hw_csum.patch > > t_0020_core_csum.patch > > t_0030_mthca_checksum_offload.patch > > t_0040_mlx4_checksum_offload.patch > > t_0050_ipoib_checksum_offload.patch > > t_0060_ipoib_qp_init_attr.patch > > t_0080_mlx4_qp_max_msg.patch > > t_0090_core_lso.patch > > t_0100_mlx4_lso.patch > > t_0110_ipoib_lso.patch > > t_0120_ipoib_ethtool.patch > > t_0130_ipoib_lro.patch > > t_0140_core_modify_cq.patch > > t_0150_mlx4_modify_cq.patch > > t_0160_ipoib_modify_cq.patch > > t_0170_cq_coal.patch > > Or. > > From ogerlitz at voltaire.com Thu Oct 11 08:07:56 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 11 Oct 2007 17:07:56 +0200 Subject: [ofa-general] OFED October 8 meeting summary on OFED 1.3 In-Reply-To: <1192114555.7337.103.camel@mtls03> References: <6C2C79E72C305246B504CBA17B5500C901563EFB@mtlexch01.mtl.com> <470A8D36.7050407@mellanox.co.il> <470E32E3.1040101@voltaire.com> <1192114555.7337.103.camel@mtls03> Message-ID: <470E3C4C.1000506@voltaire.com> Eli Cohen wrote: >> As for the interrupt mitigation patches, I have sent some comments and >> Eli only replied on some of them, nothing was changed or fixed yet, and >> other then this, no review has been done. > I am not aware that you have more questions about interrupt mitigation. > We even discussed this over the phone so I assumed you don't have more > questions. Anyway, please send any questions you still have. I don't have question, I had comments to the upstream patches you have posted which you did not address. But that's no --the-- issue, the thing is that you add lots of code to ipoib in ofed 1.3 without this code passing upstream review and acceptance for merge. Or. 
From okir at lst.de Thu Oct 11 08:16:24 2007 From: okir at lst.de (Olaf Kirch) Date: Thu, 11 Oct 2007 17:16:24 +0200 Subject: [ofa-general] [PATCH] rdma/cm: add locking around QP accesses Message-ID: <200710111716.25862.okir@lst.de> Hi, > > Did we ever get any confirmation that this fixed the problem that Olaf saw? > > No. I haven't seen a response. Sorry, my fault. Yes, this patch seems to fix the issue. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From jackm at dev.mellanox.co.il Thu Oct 11 08:36:43 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 11 Oct 2007 17:36:43 +0200 Subject: [ofa-general] [PATCH v6] IB/mlx4: shrinking WQE In-Reply-To: References: <20070909112917.GA25910@mellanox.co.il> <349DCDA352EACF42A0C49FA6DCEA8403027D232D@G3W0634.americas.hpqcorp.net> Message-ID: <200710111736.44136.jackm@dev.mellanox.co.il> commit c0aa89f0b295dd0c20b2ff2b1d2eca10cdc84f4b Author: Michael S. Tsirkin Date: Thu Aug 30 15:51:40 2007 +0300 IB/mlx4: shrinking WQE ConnectX supports shrinking wqe, such that a single WR can include multiple units of wqe_shift. This way, WRs can differ in size, and do not have to be a power of 2 in size, saving memory and speeding up send WR posting. Unfortunately, if we do this wqe_index field in CQE can't be used to look up the WR ID anymore, so do this only if selective signalling is off. Further, on 32-bit platforms, we can't use vmap to make the QP buffer virtually contigious. Thus we have to use constant-sized WRs to make sure a WR is always fully within a single page-sized chunk. Finally, we use WR with NOP opcode to avoid wrap-around in the middle of WR. We set NoErrorCompletion bit to avoid getting completions with error for NOP WRs. Since NEC is only supported starting with firmware 2.2.232, we use constant-sized WRs for older firmware. And, since MLX QPs only support SEND, we use constant-sized WRs in this case. 
When stamping during NOP posting, do stamping following setting of the NOP wqe valid bit. Signed-off-by: Michael S. Tsirkin Signed-off-by: Jack Morgenstein --- Changes since v5: stamp_send_wqe: fix call to get_send_wqe() in code path where constant size WQEs are used (eventually, caused kernel oops in MAD post_send). Changes since v4: fix calls to stamp_send_wqe, and stamping placement                   inside post_nop_wqe. Found by regression, fixed by Jack Morgenstein. Changes since v3: fix nop formatting. Found by Eli Cohen. Changes since v2: fix memory leak in mlx4_buf_alloc. Found by internal code review. changes since v1: add missing patch hunks Index: infiniband/drivers/infiniband/hw/mlx4/cq.c =================================================================== --- infiniband.orig/drivers/infiniband/hw/mlx4/cq.c 2007-10-10 17:12:05.184757000 +0200 +++ infiniband/drivers/infiniband/hw/mlx4/cq.c 2007-10-10 17:23:02.337140000 +0200 @@ -331,6 +331,12 @@ static int mlx4_ib_poll_one(struct mlx4_ is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; + if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP && + is_send)) { + printk(KERN_WARNING "Completion for NOP opcode detected!\n"); + return -EINVAL; + } + if (!*cur_qp || (be32_to_cpu(cqe->my_qpn) & 0xffffff) != (*cur_qp)->mqp.qpn) { /* @@ -353,8 +359,10 @@ static int mlx4_ib_poll_one(struct mlx4_ if (is_send) { wq = &(*cur_qp)->sq; - wqe_ctr = be16_to_cpu(cqe->wqe_index); - wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + if (!(*cur_qp)->sq_signal_bits) { + wqe_ctr = be16_to_cpu(cqe->wqe_index); + wq->tail += (u16) (wqe_ctr - (u16) wq->tail); + } wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; } else if ((*cur_qp)->ibqp.srq) { Index: infiniband/drivers/infiniband/hw/mlx4/mlx4_ib.h =================================================================== --- infiniband.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-10-10 17:21:17.844882000 +0200 +++ 
infiniband/drivers/infiniband/hw/mlx4/mlx4_ib.h 2007-10-10 17:23:02.341138000 +0200 @@ -120,6 +120,8 @@ struct mlx4_ib_qp { u32 doorbell_qpn; __be32 sq_signal_bits; + unsigned sq_next_wqe; + int sq_max_wqes_per_wr; int sq_spare_wqes; struct mlx4_ib_wq sq; Index: infiniband/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- infiniband.orig/drivers/infiniband/hw/mlx4/qp.c 2007-10-10 17:21:17.853882000 +0200 +++ infiniband/drivers/infiniband/hw/mlx4/qp.c 2007-10-10 17:23:02.350137000 +0200 @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include @@ -92,7 +93,7 @@ static int is_qp0(struct mlx4_ib_dev *de static void *get_wqe(struct mlx4_ib_qp *qp, int offset) { - if (qp->buf.nbufs == 1) + if (BITS_PER_LONG == 64 || qp->buf.nbufs == 1) return qp->buf.u.direct.buf + offset; else return qp->buf.u.page_list[offset >> PAGE_SHIFT].buf + @@ -111,16 +112,88 @@ static void *get_send_wqe(struct mlx4_ib /* * Stamp a SQ WQE so that it is invalid if prefetched by marking the - * first four bytes of every 64 byte chunk with 0xffffffff, except for - * the very first chunk of the WQE. + * first four bytes of every 64 byte chunk with + * 0x7FFFFFF | (invalid_ownership_value << 31). + * + * When max WR is than or equal to the WQE size, + * as an optimization, we can stamp WQE with 0xffffffff, + * and skip the very first chunk of the WQE. */ -static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n) +static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) { - u32 *wqe = get_send_wqe(qp, n); + u32 *wqe; int i; + int s; + int ind; + void *buf; + __be32 stamp; + + s = roundup(size, 1 << qp->sq.wqe_shift); + if (qp->sq_max_wqes_per_wr > 1) { + for (i = 0; i < s; i += 64) { + ind = (i >> qp->sq.wqe_shift) + n; + stamp = ind & qp->sq.wqe_cnt ? 
cpu_to_be32(0x7fffffff) : + cpu_to_be32(0xffffffff); + buf = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); + wqe = buf + (i & ((1 << qp->sq.wqe_shift) - 1)); + *wqe = stamp; + } + } else { + buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + for (i = 64; i < s; i += 64) { + wqe = buf + i; + *wqe = 0xffffffff; + } + } +} + +static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size) +{ + struct mlx4_wqe_ctrl_seg *ctrl; + struct mlx4_wqe_inline_seg *inl; + void *wqe; + int s; + + ctrl = wqe = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = sizeof(struct mlx4_wqe_ctrl_seg); + + if (qp->ibqp.qp_type == IB_QPT_UD) { + struct mlx4_wqe_datagram_seg *dgram = wqe + sizeof *ctrl; + struct mlx4_av *av = (struct mlx4_av *)dgram->av; + memset(dgram, 0, sizeof *dgram); + av->port_pd = cpu_to_be32((qp->port << 24) | to_mpd(qp->ibqp.pd)->pdn); + s += sizeof(struct mlx4_wqe_datagram_seg); + } + + /* Pad the remainder of the WQE with an inline data segment. */ + if (size > s) { + inl = wqe + s; + inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl)); + } + ctrl->srcrb_flags = 0; + ctrl->fence_size = size / 16; + /* + * Make sure descriptor is fully written before + * setting ownership bit (because HW can start + * executing as soon as we do). + */ + wmb(); - for (i = 16; i < 1 << (qp->sq.wqe_shift - 2); i += 16) - wqe[i] = 0xffffffff; + ctrl->owner_opcode = cpu_to_be32(MLX4_OPCODE_NOP | MLX4_WQE_CTRL_NEC) | + (n & qp->sq.wqe_cnt ? 
cpu_to_be32(1 << 31) : 0); + + stamp_send_wqe(qp, n + qp->sq_spare_wqes, size); +} + +/* Post NOP WQE to prevent wrap-around in the middle of WR */ +static inline unsigned pad_wraparound(struct mlx4_ib_qp *qp, int ind) +{ + unsigned s = qp->sq.wqe_cnt - (ind & (qp->sq.wqe_cnt - 1)); + if (unlikely(s < qp->sq_max_wqes_per_wr)) { + post_nop_wqe(qp, ind, s << qp->sq.wqe_shift); + ind += s; + } + return ind; } static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type) @@ -237,6 +310,8 @@ static int set_rq_size(struct mlx4_ib_de static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { + int s; + /* Sanity check SQ size before proceeding */ if (cap->max_send_wr > dev->dev->caps.max_wqes || cap->max_send_sge > dev->dev->caps.max_sq_sg || @@ -252,20 +327,69 @@ static int set_kernel_sq_size(struct mlx cap->max_send_sge + 2 > dev->dev->caps.max_sq_sg) return -EINVAL; - qp->sq.wqe_shift = ilog2(roundup_pow_of_two(max(cap->max_send_sge * - sizeof (struct mlx4_wqe_data_seg), - cap->max_inline_data + - sizeof (struct mlx4_wqe_inline_seg)) + - send_wqe_overhead(type))); - qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / - sizeof (struct mlx4_wqe_data_seg); + s = max(cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg), + cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + + send_wqe_overhead(type); /* - * We need to leave 2 KB + 1 WQE of headroom in the SQ to - * allow HW to prefetch. + * Hermon supports shrinking wqe, such that a single WR can include + * multiple units of wqe_shift. This way, WRs can differ in size, and + * do not have to be a power of 2 in size, saving memory and speeding up + * send WR posting. Unfortunately, if we do this wqe_index field in CQE + * can't be used to look up the WR ID anymore, so do this only if + * selective signalling is off. + * + * Further, on 32-bit platforms, we can't use vmap to make + * the QP buffer virtually contigious. 
Thus we have to use + * constant-sized WRs to make sure a WR is always fully within + * a single page-sized chunk. + * + * Finally, we use NOP opcode to avoid wrap-around in the middle of WR. + * We set NEC bit to avoid getting completions with error for NOP WRs. + * Since NEC is only supported starting with firmware 2.2.232, + * we use constant-sized WRs for older firmware. + * + * And, since MLX QPs only support SEND, we use constant-sized WRs in this + * case. + * + * We look for the smallest value of wqe_shift such that the resulting + * number of wqes does not exceed device capabilities. + * + * We set WQE size to at least 64 bytes, this way stamping invalidates each WQE. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; - qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); + if (dev->dev->caps.fw_ver >= MLX4_FW_VER_WQE_CTRL_NEC && + qp->sq_signal_bits && BITS_PER_LONG == 64 && + type != IB_QPT_SMI && type != IB_QPT_GSI) + qp->sq.wqe_shift = ilog2(64); + else + qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); + + for (;;) { + if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + + qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift); + + /* + * We need to leave 2 KB + 1 WR of headroom in the SQ to + * allow HW to prefetch. 
+ */ + qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + qp->sq_max_wqes_per_wr; + qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr * + qp->sq_max_wqes_per_wr + + qp->sq_spare_wqes); + + if (qp->sq.wqe_cnt <= dev->dev->caps.max_wqes) + break; + + if (qp->sq_max_wqes_per_wr <= 1) + return -EINVAL; + + ++qp->sq.wqe_shift; + } + + qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + (qp->sq.wqe_cnt << qp->sq.wqe_shift); @@ -277,7 +401,8 @@ static int set_kernel_sq_size(struct mlx qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_wr = qp->sq.max_post = + (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; cap->max_send_sge = qp->sq.max_gs; /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -315,6 +440,12 @@ static int create_qp_common(struct mlx4_ qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq_next_wqe = 0; + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) + qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); + else + qp->sq_signal_bits = 0; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -405,11 +536,6 @@ static int create_qp_common(struct mlx4_ */ qp->doorbell_qpn = swab32(qp->mqp.qpn << 8); - if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) - qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE); - else - qp->sq_signal_bits = 0; - qp->mqp.event = mlx4_ib_qp_event; return 0; @@ -904,7 +1030,7 @@ static int __mlx4_ib_modify_qp(struct ib ctrl = get_send_wqe(qp, i); ctrl->owner_opcode = cpu_to_be32(1 << 31); - stamp_send_wqe(qp, i); + stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift); } } @@ -1266,13 +1392,14 @@ int mlx4_ib_post_send(struct ib_qp *ibqp unsigned long flags; int nreq; int err = 0; - int ind; - int size; + unsigned ind; + int uninitialized_var(stamp); + int 
uninitialized_var(size); int i; spin_lock_irqsave(&qp->rq.lock, flags); - ind = qp->sq.head; + ind = qp->sq_next_wqe; for (nreq = 0; wr; ++nreq, wr = wr->next) { if (mlx4_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { @@ -1288,7 +1415,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp } ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); - qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; + qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] = wr->wr_id; ctrl->srcrb_flags = (wr->send_flags & IB_SEND_SIGNALED ? @@ -1401,16 +1528,23 @@ int mlx4_ib_post_send(struct ib_qp *ibqp ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] | (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0); + stamp = ind + qp->sq_spare_wqes; + ind += DIV_ROUND_UP(size * 16, 1 << qp->sq.wqe_shift); + /* * We can improve latency by not stamping the last * send queue WQE until after ringing the doorbell, so * only stamp here if there are still more WQEs to post. + * + * Same optimization applies to padding with NOP wqe + * in case of WQE shrinking (used to prevent wrap-around + * in the middle of WR). 
*/ - if (wr->next) - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes) & - (qp->sq.wqe_cnt - 1)); + if (wr->next) { + stamp_send_wqe(qp, stamp, size * 16); + ind = pad_wraparound(qp, ind); + } - ++ind; } out: @@ -1432,8 +1566,10 @@ out: */ mmiowb(); - stamp_send_wqe(qp, (ind + qp->sq_spare_wqes - 1) & - (qp->sq.wqe_cnt - 1)); + stamp_send_wqe(qp, stamp, size * 16); + + ind = pad_wraparound(qp, ind); + qp->sq_next_wqe = ind; } spin_unlock_irqrestore(&qp->rq.lock, flags); Index: infiniband/drivers/net/mlx4/alloc.c =================================================================== --- infiniband.orig/drivers/net/mlx4/alloc.c 2007-10-10 17:12:12.259502000 +0200 +++ infiniband/drivers/net/mlx4/alloc.c 2007-10-10 17:23:02.356137000 +0200 @@ -151,6 +151,19 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, memset(buf->u.page_list[i].buf, 0, PAGE_SIZE); } + + if (BITS_PER_LONG == 64) { + struct page **pages; + pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL); + if (!pages) + goto err_free; + for (i = 0; i < buf->nbufs; ++i) + pages[i] = virt_to_page(buf->u.page_list[i].buf); + buf->u.direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL); + kfree(pages); + if (!buf->u.direct.buf) + goto err_free; + } } return 0; @@ -170,6 +183,9 @@ void mlx4_buf_free(struct mlx4_dev *dev, dma_free_coherent(&dev->pdev->dev, size, buf->u.direct.buf, buf->u.direct.map); else { + if (BITS_PER_LONG == 64) + vunmap(buf->u.direct.buf); + for (i = 0; i < buf->nbufs; ++i) dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->u.page_list[i].buf, Index: infiniband/include/linux/mlx4/device.h =================================================================== --- infiniband.orig/include/linux/mlx4/device.h 2007-10-10 17:21:17.954882000 +0200 +++ infiniband/include/linux/mlx4/device.h 2007-10-10 17:23:02.363137000 +0200 @@ -133,6 +133,11 @@ enum { MLX4_STAT_RATE_OFFSET = 5 }; +static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) +{ + return (major << 32) | (minor << 16) | subminor; +} + struct 
mlx4_caps { u64 fw_ver; int num_ports; @@ -189,7 +194,7 @@ struct mlx4_buf_list { }; struct mlx4_buf { - union { + struct { struct mlx4_buf_list direct; struct mlx4_buf_list *page_list; } u; Index: infiniband/include/linux/mlx4/qp.h =================================================================== --- infiniband.orig/include/linux/mlx4/qp.h 2007-10-10 17:12:38.460566000 +0200 +++ infiniband/include/linux/mlx4/qp.h 2007-10-10 17:23:02.366140000 +0200 @@ -154,7 +154,11 @@ struct mlx4_qp_context { u32 reserved5[10]; }; +/* Which firmware version adds support for NEC (NoErrorCompletion) bit */ +#define MLX4_FW_VER_WQE_CTRL_NEC mlx4_fw_ver(2, 2, 232) + enum { + MLX4_WQE_CTRL_NEC = 1 << 29, MLX4_WQE_CTRL_FENCE = 1 << 6, MLX4_WQE_CTRL_CQ_UPDATE = 3 << 2, MLX4_WQE_CTRL_SOLICITED = 1 << 1, From dledford at redhat.com Thu Oct 11 09:37:02 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 11 Oct 2007 16:37:02 +0000 Subject: [ofa-general] More librdmacm stuff Message-ID: <1192120622.19888.417.camel@firewall.xsintricity.com> OK, one of the issues I ran into was that in order to call rdma_bind_addr(), you had to have an IP address to bind. What if you just want to run your RDMA devs without IPoIB? An rdma_bind_dev() would be useful. Then, when a listen request comes in, knowing whether or not the other end has an IP addr, whether or not your end has an IP addr, and what they may be is buried inside cm_id->route->addr->src_addr and cm_id->route->addr->dst_addr. A couple wrappers to get at these in a safe manner, and to possibly check that they are even valid at the same time, would be good I think. Also, I assume these are persistent structs that won't be free()d out from under us, so I assume I can pass those out to upper layers? -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From dledford at redhat.com Thu Oct 11 10:00:28 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 11 Oct 2007 17:00:28 +0000 Subject: [ofa-general] verbs/hardware question Message-ID: <1192122028.19888.431.camel@firewall.xsintricity.com> So, one of the options when creating a QP is the max inline data size. If I understand this correctly, for any send up to that size, the payload of that send will be transmitted to the receiving side along with the request to send. This reduces back and forth packet counts on the wire in the case that the receiving side is good to go, because it basically just responds with "OK, got it" and you're done. The trade off of course is that if there is a resource shortage on the receiving side, then it sends a RNR packet back, and however much payload data you sent over the wire with the original request to send was just wasted bandwidth as it was thrown away on the receiving side. So, if my understanding of that is correct, then inline data improves latency and maximum bandwidth up until the point where the receiving side starts to have resource problems, then it wastes bandwidth and doesn't help latency at all. So, if a person wanted to write their program to use inline data up until this point of congestion, then quit using it until the congestion clears, how would they go about doing that? Would I have to set RNR retry count to something ridiculously small and take the RNR error (along with the corresponding queue flush and the pain that brings in terms of requeuing all the flushed events) and do an ibv_modify_qp to turn off inline data until some number of sends have completed without error? Or is there possibly a counter somewhere that I can monitor? Or should I just forget about trying to optimize this part of my code? 
Separate question, when using an SRQ, and let's say you have more than 1 QP associated with that SRQ, then does a single QP going into QP_ERR state flush the SRQ requests, or only the send requests on the QP that's in error? And if you get down to only 1 QP left attached to the SRQ, and you then set that QP to the error state, will it flush the SRQ entries? Reading everything I can on SRQs, it's not clear to me how you might flush one, especially since setting the SRQ itself to error state specifically does *not* flush the posted and unused recv requests. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From swise at opengridcomputing.com Thu Oct 11 10:39:30 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 11 Oct 2007 12:39:30 -0500 Subject: [ofa-general] verbs/hardware question In-Reply-To: <1192122028.19888.431.camel@firewall.xsintricity.com> References: <1192122028.19888.431.camel@firewall.xsintricity.com> Message-ID: <470E5FD2.7090906@opengridcomputing.com> Doug Ledford wrote: > So, one of the options when creating a QP is the max inline data size. > If I understand this correctly, for any send up to that size, the > payload of that send will be transmitted to the receiving side along > with the request to send. What it really means is the payload is DMA'd to the HW on the local side in the work request itself as opposed to being DMA'd down in a 2nd transaction after the WR is DMA'd and processed. It has no end-to-end significance, other than to reduce the latency needed to transfer the data.
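[Editorial sketch, not part of the original message.] The point above, that inline data only changes how the payload reaches the local adapter, can be illustrated with a small self-contained C model. In libibverbs the request is made per work request through the IBV_SEND_INLINE send flag, and the payload has to fit within the max_inline_data the QP was created with. Everything named sketch_* or SKETCH_* below is a hypothetical stand-in so the example compiles without <infiniband/verbs.h>; the flag values are illustrative and are not the library's.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the send flags defined in <infiniband/verbs.h>.
 * Names and values are assumptions made for this sketch only. */
enum sketch_send_flags {
    SKETCH_SEND_SIGNALED = 1 << 0,
    SKETCH_SEND_INLINE   = 1 << 1
};

/*
 * Hypothetical helper: pick the send flags for one work request.
 * max_inline models the inline limit the QP was created with.
 *
 * Inline only affects the local side: the payload is copied into the
 * work request itself instead of being fetched by a second DMA.
 * Payloads larger than the limit take the ordinary path, and nothing
 * about the packets on the wire differs either way.
 */
unsigned int sketch_send_flags_for(size_t payload_len, uint32_t max_inline,
                                   int signaled)
{
    unsigned int flags = signaled ? SKETCH_SEND_SIGNALED : 0;

    if (payload_len <= max_inline)
        flags |= SKETCH_SEND_INLINE;

    return flags;
}
```

With real verbs code the same decision would set IBV_SEND_INLINE in the send_flags of the struct ibv_send_wr passed to ibv_post_send(); the helper above only models the size check.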
> This reduces back and forth packet counts on > the wire in the case that the receiving side is good to go, because it > basically just responds with "OK, got it" and you're done. I don't think this is true. Definitely not with iWARP. INLINE is just an optimization to push small amts of data downto the local adapter as part of the work request, thus avoiding 2 DMA's. > The trade > off of course is that if there is a resource shortage on the receiving > side, then it sends a RNR packet back, and however much payload data you > sent over the wire with the original request to send was just wasted > bandwidth as it was thrown away on the receiving side. > > So, if my understanding of that is correct, then inline data improves > latency and maximum bandwidth up until the point where the receiving > side starts to have resource problems, then it wastes bandwidth and > doesn't help latency at all. So, if a person wanted to write their > program to use inline data up until this point of congestion, then quit > using it until the congestion clears, how would they go about doing > that? Even though you create the QP with the inline option, only WRs that pass in the IBV_SEND_INLINE flag will do inline processing, so you can control this functionality at a per-WR basis. > Would I have to set RNR retry count to something ridiculously > small and take the RNR error (along with the corresponding queue flush > and the pain that brings in terms of requeuing all the flushed events) > and do an ibv_modify_qp to turn off inline data until some number of > sends have completed without error? Or is there possibly a counter > somewhere that I can monitor? Or should I just forget about trying to > optimize this part of my code? > > Separate question, when using an SRQ, and let's say you have more than 1 > QP associate with that SRQ, then does a single QP going into QP_ERR > state flush the SRQ requests, or only the send requests on the QP that's > in error? 
And if you get down to only 1 QP left attached to the SRQ, > and you then set that QP to the error state, will it flush the SRQ > entries? Reading everything I can on SRQs, it's not clear to me how you > might flush one, especially since setting the SRQ itself to error state > specifically does *not* flush the posted and unused recv requests. > > > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From dledford at redhat.com Thu Oct 11 11:24:37 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 11 Oct 2007 14:24:37 -0400 Subject: [ofa-general] verbs/hardware question In-Reply-To: <470E5FD2.7090906@opengridcomputing.com> References: <1192122028.19888.431.camel@firewall.xsintricity.com> <470E5FD2.7090906@opengridcomputing.com> Message-ID: <1192127077.19888.434.camel@firewall.xsintricity.com> On Thu, 2007-10-11 at 12:39 -0500, Steve Wise wrote: > Doug Ledford wrote: > > So, one of the options when creating a QP is the max inline data size. > > If I understand this correctly, for any send up to that size, the > > payload of that send will be transmitted to the receiving side along > > with the request to send. > > What it really means is the payload is DMA'd to the HW on the local side > in the work request itself as opposed being DMA'd down in a 2nd > transaction after the WR is DMA'd and processed. It has no end-to-end > significance. Other than to reduce the latency needed to transfer the data. OK, that clears things up for me ;-) > > This reduces back and forth packet counts on > > the wire in the case that the receiving side is good to go, because it > > basically just responds with "OK, got it" and you're done. > > I don't think this is true. Definitely not with iWARP. 
INLINE is just > an optimization to push small amts of data downto the local adapter as > part of the work request, thus avoiding 2 DMA's. > Even though you create the QP with the inline option, only WRs that pass > in the IBV_SEND_INLINE flag will do inline processing, so you can > control this functionality at a per-WR basis. Hmm..that raises a question on my part. You don't call ibv_reg_mr on the wr itself, so if the data is pushed with the wr, do you still need to call ibv_reg_mr on the data separately? -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From swise at opengridcomputing.com Thu Oct 11 11:40:11 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 11 Oct 2007 13:40:11 -0500 Subject: [ofa-general] verbs/hardware question In-Reply-To: <1192127077.19888.434.camel@firewall.xsintricity.com> References: <1192122028.19888.431.camel@firewall.xsintricity.com> <470E5FD2.7090906@opengridcomputing.com> <1192127077.19888.434.camel@firewall.xsintricity.com> Message-ID: <470E6E0B.7000902@opengridcomputing.com> Doug Ledford wrote: > On Thu, 2007-10-11 at 12:39 -0500, Steve Wise wrote: >> Doug Ledford wrote: >>> So, one of the options when creating a QP is the max inline data size. >>> If I understand this correctly, for any send up to that size, the >>> payload of that send will be transmitted to the receiving side along >>> with the request to send. >> What it really means is the payload is DMA'd to the HW on the local side >> in the work request itself as opposed being DMA'd down in a 2nd >> transaction after the WR is DMA'd and processed. It has no end-to-end >> significance. Other than to reduce the latency needed to transfer the data. 
> > OK, that clears things up for me ;-) > >>> This reduces back and forth packet counts on >>> the wire in the case that the receiving side is good to go, because it >>> basically just responds with "OK, got it" and you're done. >> I don't think this is true. Definitely not with iWARP. INLINE is just >> an optimization to push small amts of data downto the local adapter as >> part of the work request, thus avoiding 2 DMA's. > >> Even though you create the QP with the inline option, only WRs that pass >> in the IBV_SEND_INLINE flag will do inline processing, so you can >> control this functionality at a per-WR basis. > > Hmm..that raises a question on my part. You don't call ibv_reg_mr on > the wr itself, so if the data is pushed with the wr, do you still need > to call ibv_reg_mr on the data separately? > The WR DMA'd by the HW is actually built in memory that is setup for the adapter to DMA from. Whether that is really done via ibv_reg_mr or some other method is provider/vendor specific. So the WR you pass into ibv_post_send() is always copied and munged into the HW-specific memory and format. For inline sends, the data you pass in via the SGL is copied into the HW-specific WR memory as well. And from the man page on ibv_post_send(), I conclude you do _not_ have to register the payload memory used in an INLINE send: > IBV_SEND_INLINE Send data in given gather list as inline data > in a send WQE. Valid only for Send and RDMA Write. The L_Key will not be checked. Steve. 
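The copy-at-post-time behaviour described above can be illustrated with a toy model. This is a sketch only and is not libibverbs code: toy_sge, toy_wqe, toy_post_inline and TOY_MAX_INLINE are hypothetical names invented for the illustration, with TOY_MAX_INLINE standing in for the QP's max inline data size. It only shows why an inline payload needs no L_Key and why the caller may reuse its buffer once the post returns: the gather list is copied into the work queue entry itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_INLINE 64   /* stand-in for the QP's max inline data size */

/* Hypothetical gather element: plain pointer and length, no lkey. */
struct toy_sge {
    const void *addr;
    uint32_t    length;
};

/* Hypothetical work queue entry with room for an inline payload. */
struct toy_wqe {
    uint32_t inline_len;
    uint8_t  inline_data[TOY_MAX_INLINE];
};

/*
 * Copy every gather element into the WQE, roughly as a provider might
 * for an inline send. Returns 0 on success, or -1 if the total length
 * exceeds the inline limit (the WQE is left untouched in that case).
 */
static int toy_post_inline(struct toy_wqe *wqe,
                           const struct toy_sge *sgl, int num_sge)
{
    size_t total = 0;
    int i;

    /* First pass: make sure the whole payload fits. */
    for (i = 0; i < num_sge; i++)
        total += sgl[i].length;
    if (total > TOY_MAX_INLINE)
        return -1;

    /* Second pass: copy the payload into the WQE itself. */
    total = 0;
    for (i = 0; i < num_sge; i++) {
        memcpy(wqe->inline_data + total, sgl[i].addr, sgl[i].length);
        total += sgl[i].length;
    }
    wqe->inline_len = (uint32_t)total;
    return 0;
}
```

With the real API the equivalent choice is made per work request, by setting IBV_SEND_INLINE in the WR's send_flags as described above; the actual WQE layout and copy are provider-specific.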
From kanoj at netxen.com Thu Oct 11 12:21:27 2007 From: kanoj at netxen.com (Kanoj Sarcar) Date: Thu, 11 Oct 2007 12:21:27 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroyinglisten requests In-Reply-To: <000501c80bd2$cbe7c160$6acc180a@amr.corp.intel.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com><470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> <470D4749.8000309@netxen.com> <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> <470D5D76.7010305@netxen.com> <000001c80b95$9603c040$39c8180a@amr.corp.intel.com> <470D6848.2080806@netxen.com> <470D6B71.9000807@ichips.intel.com> <470D71CD.9090007@netxen.com> <000501c80bd2$cbe7c160$6acc180a@amr.corp.intel.com> Message-ID: <470E77B7.4070405@netxen.com> Sean Hefty wrote: >>cma_process_remove() -> cma_remove_id_dev() generates the event for >>device removal. This is ok to do as long as it can be guaranteed that a >>racing rdma_destroy_id() has not returned back to caller, correct? >> >>IE, the caller must be willing to accept device removal events until its >>rdma_destroy_id() returns. >> >> > >Correct - rdma_destroy_id() blocks until all callbacks from the rdma_cm have >completed. > > > >>If so, why is cma_remove_id_dev() trying so hard to not generate the >>event when rdma_destroy_id() has gotten to the point of setting >>CMA_DESTROYING? Could it not just generate the event, happy in the >>knowledge that the refcount bump done by cma_process_remove() will >>prevent the rdma_destroy_id() call from returning? >> >> > >There are two ways for the user to destroy an rdma_cm_id. They can either call >rdma_destroy_id() directly or return a non-zero value from a callback. In order >to support the latter, all callbacks to a user on the same rdma_cm_id must be >serialized, and once the user has returned a non-zero value no further callbacks >can occur. (Otherwise the user wouldn't know when it was safe to deallocate >their connection context.) 
> >Since a device removal can occur at any point, the device removal callback must >be serialized with any other callback in progress. It does this by marking that >the device has been removed. This prevents any new callbacks from being >invoked, but a callback may already be in progress. The device removal code >waits for that callback to complete. After it completes, it needs to see if the >user wants to destroy the rdma_cm_id - meaning they returned a non-zero value >from the first callback. If so, then the device removal callback cannot be >invoked. > >One other point is that all event callbacks for a given rdma_cm_id end up being >serialized by default. Only device removal event requires special handling, >since that thread can run at any time. If you look at some of the callback >handlers (named *_handler), you'll see calls to disable/enable remove, which >provides this serialization. > >- Sean > > > Ok, thanks, I see how CMA_DESTROYING is used to correctly implement the callback-initiated destruct. I don't understand the reason for callback-initiated destruct in the first place, but that's too off topic ... With this new information, I will revisit the thread posted at http://lists.openfabrics.org/pipermail/general/2007-September/040614.html to see whether the problem being discussed there is really nonexistent.
Kanoj From krause at cup.hp.com Thu Oct 11 14:04:38 2007 From: krause at cup.hp.com (Michael Krause) Date: Thu, 11 Oct 2007 14:04:38 -0700 Subject: [ofa-general] verbs/hardware question In-Reply-To: <470E6E0B.7000902@opengridcomputing.com> References: <1192122028.19888.431.camel@firewall.xsintricity.com> <470E5FD2.7090906@opengridcomputing.com> <1192127077.19888.434.camel@firewall.xsintricity.com> <470E6E0B.7000902@opengridcomputing.com> Message-ID: <6.2.0.14.2.20071011140023.02f13a90@esmail.cup.hp.com> At 11:40 AM 10/11/2007, Steve Wise wrote: >Doug Ledford wrote: >>On Thu, 2007-10-11 at 12:39 -0500, Steve Wise wrote: >>>Doug Ledford wrote: >>>>So, one of the options when creating a QP is the max inline data size. >>>>If I understand this correctly, for any send up to that size, the >>>>payload of that send will be transmitted to the receiving side along >>>>with the request to send. >>>What it really means is the payload is DMA'd to the HW on the local side >>>in the work request itself as opposed being DMA'd down in a 2nd >>>transaction after the WR is DMA'd and processed. Typically it is a series of MMIO writes coalesced and not DMA operations on the PCI bus. In-line eliminates the latency associated with a MMIO write to trigger a DMA Read request to be generated by the device and then the subsequent completion(s) which is then followed by another DMA Read request and one or more completions. >>> It has no end-to-end significance. Correct. WR + Data in-line is a common technique used in a variety of I/O solutions for a number of years now. The degree of performance gains from write coalesce varies by processor / chipset as well as over time. The in-line itself >>> Other than to reduce the latency needed to transfer the data. >>OK, that clears things up for me ;-) >> >>>>This reduces back and forth packet counts on >>>>the wire in the case that the receiving side is good to go, because it >>>>basically just responds with "OK, got it" and you're done. 
>>>I don't think this is true. Definitely not with iWARP. INLINE is just >>>an optimization to push small amts of data downto the local adapter as >>>part of the work request, thus avoiding 2 DMA's. Correct. It is a local only operation. Mike >>>Even though you create the QP with the inline option, only WRs that pass >>>in the IBV_SEND_INLINE flag will do inline processing, so you can >>>control this functionality at a per-WR basis. >>Hmm..that raises a question on my part. You don't call ibv_reg_mr on >>the wr itself, so if the data is pushed with the wr, do you still need >>to call ibv_reg_mr on the data separately? > >The WR DMA'd by the HW is actually built in memory that is setup for the >adapter to DMA from. Whether that is really done via ibv_reg_mr or some >other method is provider/vendor specific. So the WR you pass into >ibv_post_send() is always copied and munged into the HW-specific memory >and format. For inline sends, the data you pass in via the SGL is copied >into the HW-specific WR memory as well. > >And from the man page on ibv_post_send(), I conclude you do _not_ have to >register the payload memory used in an INLINE send: > >> IBV_SEND_INLINE Send data in given gather list as inline data >> in a send WQE. Valid only for Send and RDMA Write. The >> L_Key will not be checked. > > >Steve. >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From swise at opengridcomputing.com Thu Oct 11 14:36:59 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 11 Oct 2007 16:36:59 -0500 Subject: [ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3 In-Reply-To: <470E1A38.2020902@opengridcomputing.com> References: <470A9363.4010007@opengridcomputing.com> <470BA4D4.3080707@dev.mellanox.co.il> <470BA93C.3010601@opengridcomputing.com> <470BB8DB.8090107@dev.mellanox.co.il> <470D2C69.3000500@opengridcomputing.com> <470DDA81.4060108@dev.mellanox.co.il> <470E1A38.2020902@opengridcomputing.com> Message-ID: <470E977B.5080107@opengridcomputing.com> Ok, can you re-pull to get the configure.in change? Sorry for the pain. Steve. Steve Wise wrote: > oops. > > Lemme fix this up then we'll re-pull. > > Thanks, > > Steve. > > > Vladimir Sokolovsky wrote: >> Steve Wise wrote: >>> Hey Vlad, >>> >>> The libcxgb3 rpms built by this ofed-1.2.5 release are still named >>> libcxgb3*-1.0.1 instead of 1.0.3. Can you update your spec files to >>> indicate that the library is release 1.0.3? >>> >>> You'll need to also update the ofed-1.3 spec file I guess. >>> >>> Thanks, >>> >>> Steve. >>> >> >> Hi Steve, >> You should update libcxgb3 version in the configure.in file: >> >> Update version to 1.0.3 >> >> Signed-off-by: Vladimir Sokolovsky >> --- >> diff --git a/configure.in b/configure.in >> index 6f916d3..15406b7 100644 >> --- a/configure.in >> +++ b/configure.in >> @@ -1,11 +1,11 @@ >> dnl Process this file with autoconf to produce a configure script. 
>> >> AC_PREREQ(2.57) >> -AC_INIT(libcxgb3, 1.0.1, general at lists.openfabrics.org) >> +AC_INIT(libcxgb3, 1.0.3, general at lists.openfabrics.org) >> AC_CONFIG_SRCDIR([src/iwch.h]) >> AC_CONFIG_AUX_DIR(config) >> AM_CONFIG_HEADER(config.h) >> -AM_INIT_AUTOMAKE(libcxgb3, 1.0.1) >> +AM_INIT_AUTOMAKE(libcxgb3, 1.0.3) >> AM_PROG_LIBTOOL >> >> AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for >> presence of ib libraries], >> >> >> Regards, >> Vladimir > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Thu Oct 11 14:53:44 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 11 Oct 2007 14:53:44 -0700 Subject: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroyinglisten requests In-Reply-To: <470E77B7.4070405@netxen.com> References: <000001c80aa1$0e5c1fb0$3c98070a@amr.corp.intel.com><470BCB00.1040702@netxen.com> <470BD4A5.40902@ichips.intel.com> <470D4749.8000309@netxen.com> <000001c80b88$dd9e5c60$51c8180a@amr.corp.intel.com> <470D5D76.7010305@netxen.com> <000001c80b95$9603c040$39c8180a@amr.corp.intel.com> <470D6848.2080806@netxen.com> <470D6B71.9000807@ichips.intel.com> <470D71CD.9090007@netxen.com> <000501c80bd2$cbe7c160$6acc180a@amr.corp.intel.com> <470E77B7.4070405@netxen.com> Message-ID: <000001c80c51$32a09c50$3acc180a@amr.corp.intel.com> >I don't understand the reason for callback initiated destruct in the >first place, but thats too off topic ... This was a feature request by users. It ends up being convenient from the user's perspective, since it avoids needing to queue the id's to another thread for destruction. This is particularly important for connection requests, where the id is a new id. The user may not be able to allocate space for their context. 
This is how it is used by the ucma and sdp (out of tree module). >With this new information, I will revisit the thread posted at >http://lists.openfabrics.org/pipermail/general/2007-September/040614.html >to see if really the problem being talked about there is non existant. The expected behavior of the IB and iWarp CMs is that a destroy call for a given id will block while there's a callback in progress which references it. The rdma_cm will only work if the underlying CMs adhere to this. I would personally treat issues in iWarp CM synchronization as separate. - Sean From mshefty at ichips.intel.com Thu Oct 11 15:12:51 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 11 Oct 2007 15:12:51 -0700 Subject: [ofa-general] More librdmacm stuff In-Reply-To: <1192120622.19888.417.camel@firewall.xsintricity.com> References: <1192120622.19888.417.camel@firewall.xsintricity.com> Message-ID: <470E9FE3.5000200@ichips.intel.com> Doug Ledford wrote: > OK, one of the issues I ran into was that in order to call > rdma_bind_addr(), you had to have an IP address to bind. What if you > just want to run your RDMA devs without IPoIB? An rdma_bind_dev() would > be useful. Then, when a listen request comes in, knowing whether or not > the other end has an IP addr, whether or not your end has an IP addr, > and what they may be is buried inside cm_id->route->addr->src_addr and > cm_id->route->addr->dst_addr. A couple wrappers to get at these in a > safe manner, and to possibly check that they are even valid at the same > time, would be good I think. Also, I assume these are persistent > structs that won't be free()d out from under us, so I assume I can pass > those out to upper layers? At the top level, the listen call operates on a 16-bit port value that, for IB, is eventually mapped to a 64-bit service ID. The connect request targets a listen by IP address / port number. The IP address is used to identify the correct remote device. 
How would the active side locate the correct listener? Or, are you only talking about loopback connections on the same device? (Hmm... maybe we can do something fancy with loopback addresses which would avoid using IPoIB.) The src/dst addresses are persistent once set, but I will add creating wrapper functions to return them to my to-do list. - Sean From Ramaswamy.Tummala at Sun.COM Thu Oct 11 15:35:48 2007 From: Ramaswamy.Tummala at Sun.COM (Ramaswamy Tummala) Date: Thu, 11 Oct 2007 15:35:48 -0700 Subject: [ofa-general] openfabrics CMA interfaces for iWARP Message-ID: <470EA544.9030101@Sun.COM> I have a few questions about the openfabrics CMA interfaces for iWARP. I'd appreciate it if anyone could clarify them. - If the RNIC's modify_qp() entry point is called to move the QP state to CLOSING or ERROR while there are WQEs on the SQ and RQ, does the RNIC flush the incomplete WRs on the SQ or RQ? If so, does the RNIC wait until the flush is complete before returning from modify_qp() to the caller? If the RNIC does not wait for the flush to complete, how does the caller know when the flush is complete (so that the caller can poll the CQ to retrieve the CQ entries)? [ Another possibility is that, when the RNIC's modify_qp() entry point is called to move the QP state to CLOSING while there are WQEs on the SQ, the RNIC internally moves the QP state to ERROR. My question is still whether the RNIC waits until the incomplete WRs have been flushed from the SQ and RQ before returning from modify_qp() to the caller, even though it internally transitioned the QP state to ERROR. If the RNIC does not wait for the flush to complete, how does the caller know when the flush is complete? ] - If the RNIC's modify_qp() entry point is called to move the QP state to CLOSING, does the RNIC just initiate an LLP CLOSE and return to the caller, or does it wait until the LLP CLOSE is complete?
- It appears that the RNIC should send an IW_CM_EVENT_DISCONNECT event to the CMA before it starts closing or aborting the connection (except when the disconnect was initiated by the CMA itself, for example by the CMA calling the RNIC's modify_qp entry point to move the QP state to CLOSING or ERROR). Is this correct? - It appears that the RNIC should send an IW_CM_EVENT_CLOSE event after the connection has been closed. Should this event be sent on both the active and passive sides? - The RNIC has add_ref(struct ib_qp *qp) and rem_ref(struct ib_qp *qp) entry points. What is the expected use of the CMA calling these entry points? My general thinking is that the CMA can increase the reference count on the QP (i.e. add_ref) to prevent the QP from being destroyed by the RNIC. But it is the CMA that initiates destruction of the QP by calling the RNIC's destroy_qp() entry point. So the CMA could maintain the reference count for the QP in its own private data (instead of calling the RNIC's add_ref entry point) and not call destroy_qp() while the reference count is nonzero. - It appears that if the RNIC's accept() entry point is called to accept an incoming connection, the RNIC, after successfully processing the accept, would send an IW_CM_EVENT_ESTABLISHED event to the CMA. What event should the RNIC send if the call to accept() succeeded, but the RNIC later encountered an error sending the MPA reply message to the remote peer, or some other error? In this case, although the call to accept() succeeded, the connection might still not be established, so the RNIC cannot send an IW_CM_EVENT_ESTABLISHED event. - It appears that a client of the CMA needs to call rdma_resolve_route() after a successful rdma_resolve_addr(). Is there a reason for having two interfaces instead of one that combines the functionality of both? Thanks, Ramaswamy.
From dledford at redhat.com Thu Oct 11 15:46:42 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 11 Oct 2007 22:46:42 +0000 Subject: [ofa-general] More librdmacm stuff In-Reply-To: <470E9FE3.5000200@ichips.intel.com> References: <1192120622.19888.417.camel@firewall.xsintricity.com> <470E9FE3.5000200@ichips.intel.com> Message-ID: <1192142802.19888.437.camel@firewall.xsintricity.com> On Thu, 2007-10-11 at 15:12 -0700, Sean Hefty wrote: > Doug Ledford wrote: > > OK, one of the issues I ran into was that in order to call > > rdma_bind_addr(), you had to have an IP address to bind. What if you > > just want to run your RDMA devs without IPoIB? An rdma_bind_dev() would > > be useful. Then, when a listen request comes in, knowing whether or not > > the other end has an IP addr, whether or not your end has an IP addr, > > and what they may be is buried inside cm_id->route->addr->src_addr and > > cm_id->route->addr->dst_addr. A couple wrappers to get at these in a > > safe manner, and to possibly check that they are even valid at the same > > time, would be good I think. Also, I assume these are persistent > > structs that won't be free()d out from under us, so I assume I can pass > > those out to upper layers? > > At the top level, the listen call operates on a 16-bit port value that, > for IB, is eventually mapped to a 64-bit service ID. The connect > request targets a listen by IP address / port number. The IP address is > used to identify the correct remote device. How would the active side > locate the correct listener? I'm more referring to when you call rdma_bind_addr to bind to your device before you call rdma_connect. In that instance, your address isn't for the eventual destination, but just to bind you to your local rdma device. For that, an rdma_bind_dev that took an ibv context and a port number on that device would avoid having to specify an IP address that you don't really care about. 
> Or, are you only talking about loopback > connections on the same device? (Hmm... maybe we can do something fancy > with loopback addresses which would avoid using IPoIB.) > > The src/dst addresses are persistent once set, but I will add creating > wrapper functions to return them to my to-do list. > > - Sean -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From pradeeps at linux.vnet.ibm.com Thu Oct 11 16:41:46 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 11 Oct 2007 16:41:46 -0700 Subject: [ofa-general] Draft patch to address bugzilla bug#728 Message-ID: <470EB4BA.40509@linux.vnet.ibm.com> This is a draft patch to address the following bug: https://bugs.openfabrics.org/show_bug.cgi?id=728 There are still a few debug prints and the like which needs to be cleaned up. A few error conditions still need to be addressed. Please ignore them. I have done some minimal test runs on both mthca and ehca and it seems to work. I have not yet updated the no srq code in this patch. I wanted to post this patch for comments before proceeding further. While working on this I observed that for mthca max_srq_sge returned by ib_query_device() is not equal to max_sge returned by ib_query_srq(). Why is that?
Signed-off-by: Pradeep Satyanarayana --- --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-03 12:01:58.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-10 20:31:43.000000000 -0500 @@ -212,7 +212,7 @@ struct ipoib_cm_tx { struct ipoib_cm_rx_buf { struct sk_buff *skb; - u64 mapping[IPOIB_CM_RX_SG]; + u64 *mapping; }; struct ipoib_cm_dev_priv { --- a/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-31 12:14:30.000000000 -0500 +++ b/linux-2.6.23-rc7/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-11 17:55:35.000000000 -0500 @@ -61,24 +61,27 @@ static struct ib_qp_attr ipoib_cm_err_at }; #define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) static struct ib_send_wr ipoib_cm_rx_drain_wr = { .wr_id = IPOIB_CM_RX_DRAIN_WRID, .opcode = IB_WR_SEND, }; +static int num_frags, order, fragment_size; + static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, int frags, - u64 mapping[IPOIB_CM_RX_SG]) + u64 *mapping) { int i; ib_dma_unmap_single(priv->ca, mapping[0], IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); for (i = 0; i < frags; ++i) - ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); + ib_dma_unmap_single(priv->ca, mapping[i + 1], fragment_size, DMA_FROM_DEVICE); } static int ipoib_cm_post_receive(struct net_device *dev, int id) @@ -89,13 +92,13 @@ static int ipoib_cm_post_receive(struct priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; - for (i = 0; i < IPOIB_CM_RX_SG; ++i) + for (i = 0; i < num_frags; ++i) priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); if (unlikely(ret)) { ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); - ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + ipoib_cm_dma_unmap_rx(priv, num_frags - 1, priv->cm.srq_ring[id].mapping); 
dev_kfree_skb_any(priv->cm.srq_ring[id].skb); priv->cm.srq_ring[id].skb = NULL; @@ -105,7 +108,7 @@ static int ipoib_cm_post_receive(struct } static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, - u64 mapping[IPOIB_CM_RX_SG]) + u64 *mapping) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; @@ -129,7 +132,7 @@ static struct sk_buff *ipoib_cm_alloc_rx } for (i = 0; i < frags; i++) { - struct page *page = alloc_page(GFP_ATOMIC); + struct page *page = alloc_pages(GFP_ATOMIC, order); if (!page) goto partial_error; @@ -405,12 +408,15 @@ void ipoib_cm_handle_rx_wc(struct net_de struct sk_buff *skb, *newskb; struct ipoib_cm_rx *p; unsigned long flags; - u64 mapping[IPOIB_CM_RX_SG]; + u64 *mapping; int frags; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); + /* What happens if this fails, look at the kfree() also */ + mapping = (u64 *)kzalloc(num_frags * sizeof(u64 *), GFP_ATOMIC); + if (unlikely(wr_id >= ipoib_recvq_size)) { if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { spin_lock_irqsave(&priv->lock, flags); @@ -448,7 +454,7 @@ void ipoib_cm_handle_rx_wc(struct net_de } frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, - (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + (unsigned)IPOIB_CM_HEAD_SIZE)) / fragment_size; newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); if (unlikely(!newskb)) { @@ -486,6 +492,7 @@ repost: if (unlikely(ipoib_cm_post_receive(dev, wr_id))) ipoib_warn(priv, "ipoib_cm_post_receive failed " "for buf %d\n", wr_id); + kfree(mapping); } static inline int post_send(struct ipoib_dev_priv *priv, @@ -1281,10 +1288,11 @@ int ipoib_cm_dev_init(struct net_device struct ib_srq_init_attr srq_init_attr = { .attr = { .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG } }; - int ret, i; + int ret, i, j, max_sge_supported; + struct ib_srq_attr srq_attr; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); 
INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1301,13 +1309,48 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); + ret = ib_query_device(priv->ca, &attr); + if (ret) + return ret; + + printk(KERN_WARNING "max_srq_sge=%d\n", attr.max_srq_sge); + + srq_init_attr.attr.max_sge = attr.max_srq_sge; + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); if (IS_ERR(priv->cm.srq)) { + printk(KERN_WARNING "ib_create_srq() failed!!!\n"); ret = PTR_ERR(priv->cm.srq); priv->cm.srq = NULL; return ret; } + ret = ib_query_srq(priv->cm.srq, &srq_attr); + if (ret) { + printk(KERN_WARNING "ib_query_srq() failed with %d\n", ret); + return -EINVAL; + } + + /* We want max_sge_supported to be a power of 2, but + * <= srq_attr.max_sge + */ + printk(KERN_WARNING "srq_attr.max_sge=%d\n", srq_attr.max_sge); + max_sge_supported = roundup_pow_of_two(min((u32)(attr.max_srq_sge), srq_attr.max_sge)); + if (max_sge_supported != srq_attr.max_sge) + max_sge_supported = max_sge_supported >> 1; + + if (IPOIB_CM_RX_SG >= max_sge_supported) { + fragment_size = CM_PACKET_SIZE/max_sge_supported; + num_frags = CM_PACKET_SIZE/fragment_size; + } else { + fragment_size = CM_PACKET_SIZE/IPOIB_CM_RX_SG; + num_frags = IPOIB_CM_RX_SG; + } + order = get_order(fragment_size); + printk(KERN_WARNING "Computed values of order=%d, max_sge_supported=%d," + " fragment_size=0x%x, num_frags=%d\n", order, max_sge_supported, + fragment_size, num_frags); + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, GFP_KERNEL); if (!priv->cm.srq_ring) { @@ -1317,18 +1360,32 @@ int ipoib_cm_dev_init(struct net_device return -ENOMEM; } - for (i = 0; i < IPOIB_CM_RX_SG; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) { + priv->cm.srq_ring[i].mapping = kzalloc(num_frags * sizeof(u64 *), + GFP_KERNEL); + if (!priv->cm.srq_ring[i].mapping) { + printk(KERN_WARNING "%s: failed to allocate mapping for srq_ring\n", + priv->ca->name); + for (j = i; j > 0; j--) + 
kfree(priv->cm.srq_ring[j].mapping); + kfree(priv->cm.srq_ring); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + } + + for (i = 0; i < num_frags; ++i) priv->cm.rx_sge[i].lkey = priv->mr->lkey; priv->cm.rx_sge[0].length = IPOIB_CM_HEAD_SIZE; - for (i = 1; i < IPOIB_CM_RX_SG; ++i) - priv->cm.rx_sge[i].length = PAGE_SIZE; + for (i = 1; i < num_frags; ++i) + priv->cm.rx_sge[i].length = fragment_size; priv->cm.rx_wr.next = NULL; priv->cm.rx_wr.sg_list = priv->cm.rx_sge; - priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; + priv->cm.rx_wr.num_sge = num_frags; for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + if (!ipoib_cm_alloc_rx_skb(dev, i, num_frags - 1, priv->cm.srq_ring[i].mapping)) { ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); ipoib_cm_dev_cleanup(dev); @@ -1362,13 +1419,15 @@ void ipoib_cm_dev_cleanup(struct net_dev priv->cm.srq = NULL; if (!priv->cm.srq_ring) return; - for (i = 0; i < ipoib_recvq_size; ++i) + for (i = 0; i < ipoib_recvq_size; ++i) { if (priv->cm.srq_ring[i].skb) { ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping); dev_kfree_skb_any(priv->cm.srq_ring[i].skb); priv->cm.srq_ring[i].skb = NULL; } + kfree(priv->cm.srq_ring[i].mapping); + } kfree(priv->cm.srq_ring); priv->cm.srq_ring = NULL; } From rdreier at cisco.com Thu Oct 11 18:08:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 11 Oct 2007 18:08:52 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git for-linus Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get the batch of changes queued up for the 2.6.24 merge window (although I still have a few more things to merge later, once Dave Miller's networking tree has landed too): Ali Ayoub (1): IB/sa: Error handling thinko fix 
Anton Blanchard (3): IB/fmr_pool: Clean up some error messages in fmr_pool.c IB/ehca: Make output clearer by removing some debug messages IB/ehca: Export module parameters in sysfs Arthur Jones (4): IB/ipath: iba6110 rev4 GPIO counters support IB/ipath: Use counters in ipath_poll and cleanup interrupts in ipath_close IB/ipath: iba6110 rev4 no longer needs recv header overrun workaround IB/ipath: Indicate a couple of chip bugs to userspace Dave Olson (5): IB/ipath: Verify host bus bandwidth to chip will not limit performance IB/ipath: Correctly describe workaround for TID write chip bug IB/ipath: Future proof eeprom checksum code (contents reading) IB/ipath: Fix QHT7040 serial number check IB/ipath: Minor fix to ordering of freeing and zeroing of tid pages. Dotan Barak (2): mlx4_core: Use enum value GO_BIT_TIMEOUT_MSECS IPoIB/cm: Clean up initialization of QP attr in ipoib_cm_create_tx_qp() Eli Cohen (3): IPoIB: Fix typo to end statement with ';' instead of ',' IPoIB: Fix error path memory leak IB/mthca: Mark error paths as unlikely() in post_srq_recv functions Hoang-Nam Nguyen (4): IB/ehca: Use remap_4k_pfn() to map firmware contexts to user space IB/ehca: Fix large page HW cap defines IB/ehca: Fix mem leak of firmware ctrlblock in ehca_create_srq() IB/ehca: Adjust 64-bit alignment of create QP response for userspace Jack Morgenstein (5): IB/mlx4: Display misc device information under /sys/class/infiniband/ mlx4_core: Support ICM tables in coherent memory mlx4_core: Write MTTs from CPU instead with of WRITE_MTT FW command IB/mlx4: Implement FMRs mlx4_core: Increase max number of QPs per multicast group to 56 Joachim Fenkes (11): IB/ehca: Refactor hvcall tracing IB/ehca: Print return codes as signed decimal integers IB/ehca: ehca_gen_warn() should always print IB/ehca: Add check for max #SGE to create_qp() IB/ehca: Path migration support IB/ehca: Serialize MR alloc and MR free hvCalls IB/ehca: Replace get_paca()->paca_index by the more portable 
raw_smp_processor_id() IB/ehca: Bump version number and change its format IB/umem: Add hugetlb flag to struct ib_umem IB/ehca: Only use MR large pages for hugetlb regions IB/ehca: Return srq_attr->max_sge in ehca_query_srq() Michael Albaugh (2): IB/ipath: Maintain active time on all chips IB/ipath: Better handling of unexpected GPIO interrupts Michael S. Tsirkin (2): mlx4_core: Enable MSI-X by default IB/mthca: Enable MSI-X by default Or Gerlitz (1): IPoIB: Allow setting policy to ignore multicast groups Peter Oruba (1): IB/mthca: Use PCI-X/PCI-Express read control interfaces Ralph Campbell (13): IB/core: Fix handling of multicast response failures IB/ipath: Performance optimization for CPU differences IB/ipath: Change UD to queue work requests like RC & UC IB/ipath: Remove unneeded code for ipathfs IB/ipath: UC RDMA WRITE with IMMEDIATE doesn't send the immediate IB/ipath: Remove redundant code IB/ipath: Generate flush CQE when QP is in error state IB/ipath: Implement IB_EVENT_QP_LAST_WQE_REACHED IB/ipath: Optimize completion queue entry insertion and polling IB/ipath: Add ability to set the LMC via the sysfs debugging interface IB/ipath: Remove duplicate copy of LMC IB/ipath: Fix IB_EVENT_PORT_ERR event IB/ipath: Remove redundant link state checks Roland Dreier (18): IPoIB: Make sure no receives are handled when stopping device IB: find_first_zero_bit() takes unsigned pointer mlx4_core: Don't free special QPs in QP number bitmap IB/mlx4: Use __set_data_seg() in mlx4_ib_post_recv() IB/ehca: Include from ehca_classes.h IB/mlx4: Fix up SRQ limit_watermark endianness IB/iser: Remove unnecessary includes mlx4_core: Change capability decoding: SRC->XRC IB/umad: Add P_Key index support IB/umad: Fix bit ordering and 32-on-64 problems on big endian systems IB/uverbs: Make ib_uverbs_release_event_file() static mlx4_core: Reserve the correct number of MTT segments mlx4_core: Fix meaning of dev->caps.reserved_mtts IB/mthca: Increase max number of QPs per multicast group to 
56 IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up mlx4_core: Use mmiowb() to avoid firmware commands getting jumbled up IB/ehca: Fix clipping of device limits to INT_MAX mlx4_core: Fix section mismatches Satyam Sharma (1): IB/ehca: Misc cpuinit section annotations and #ifdef cleanups Sean Hefty (7): IPoIB: Specify Traffic Class with path record queries for QoS support IB/sa: Add new QoS fields to path record RDMA/cma: Add ability to specify type of service RDMA/ucma: Allow user space to set service type IB/srp: Add QoS support through service ID IB/cm: Modify interface to send MRAs in response to duplicate messages RDMA/cma: Queue IB CM MRAs to avoid unnecessary remote retries Stefan Roscher (2): IB/ehca: Small QP userspace support IB/ehca: Support more than 4k QPs for userspace and kernelspace Steve Wise (2): RDMA/cxgb3: Make the iw_cxgb3 module parameters writable RDMA/cma: Use neigh_event_send() to start neighbour discovery Documentation/infiniband/user_mad.txt | 14 + drivers/infiniband/core/addr.c | 3 +- drivers/infiniband/core/cm.c | 51 ++-- drivers/infiniband/core/cma.c | 46 +++- drivers/infiniband/core/device.c | 4 +- drivers/infiniband/core/fmr_pool.c | 22 +- drivers/infiniband/core/multicast.c | 2 +- drivers/infiniband/core/sa_query.c | 12 +- drivers/infiniband/core/ucma.c | 74 +++++- drivers/infiniband/core/umem.c | 20 ++- drivers/infiniband/core/user_mad.c | 151 +++++++--- drivers/infiniband/core/uverbs.h | 1 - drivers/infiniband/core/uverbs_main.c | 16 +- drivers/infiniband/hw/cxgb3/iwch_cm.c | 16 +- drivers/infiniband/hw/ehca/ehca_classes.h | 14 +- drivers/infiniband/hw/ehca/ehca_cq.c | 23 +- drivers/infiniband/hw/ehca/ehca_hca.c | 34 +- drivers/infiniband/hw/ehca/ehca_irq.c | 33 +-- drivers/infiniband/hw/ehca/ehca_main.c | 52 ++-- drivers/infiniband/hw/ehca/ehca_mcast.c | 4 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 102 ++++---- drivers/infiniband/hw/ehca/ehca_qp.c | 169 +++++++---- drivers/infiniband/hw/ehca/ehca_reqs.c | 2 +- 
drivers/infiniband/hw/ehca/ehca_sqp.c | 2 +- drivers/infiniband/hw/ehca/ehca_tools.h | 19 +- drivers/infiniband/hw/ehca/ehca_uverbs.c | 46 ++-- drivers/infiniband/hw/ehca/hcp_if.c | 105 ++++--- drivers/infiniband/hw/ehca/ipz_pt_fn.c | 1 + drivers/infiniband/hw/ipath/ipath_common.h | 4 +- drivers/infiniband/hw/ipath/ipath_cq.c | 94 +++--- drivers/infiniband/hw/ipath/ipath_diag.c | 22 +- drivers/infiniband/hw/ipath/ipath_driver.c | 93 ++++++- drivers/infiniband/hw/ipath/ipath_eeprom.c | 10 +- drivers/infiniband/hw/ipath/ipath_file_ops.c | 74 +++-- drivers/infiniband/hw/ipath/ipath_fs.c | 187 ------------ drivers/infiniband/hw/ipath/ipath_iba6110.c | 57 ++-- drivers/infiniband/hw/ipath/ipath_iba6120.c | 18 +- drivers/infiniband/hw/ipath/ipath_intr.c | 64 +++-- drivers/infiniband/hw/ipath/ipath_kernel.h | 12 +- drivers/infiniband/hw/ipath/ipath_mad.c | 53 ++-- drivers/infiniband/hw/ipath/ipath_qp.c | 31 ++- drivers/infiniband/hw/ipath/ipath_rc.c | 73 ++++-- drivers/infiniband/hw/ipath/ipath_ruc.c | 308 +++++++------------- drivers/infiniband/hw/ipath/ipath_stats.c | 17 +- drivers/infiniband/hw/ipath/ipath_sysfs.c | 40 +++- drivers/infiniband/hw/ipath/ipath_uc.c | 98 +++---- drivers/infiniband/hw/ipath/ipath_ud.c | 382 ++++++++---------------- drivers/infiniband/hw/ipath/ipath_verbs.c | 329 ++++++++++++++------- drivers/infiniband/hw/ipath/ipath_verbs.h | 45 ++- drivers/infiniband/hw/mlx4/main.c | 50 +++ drivers/infiniband/hw/mlx4/mlx4_ib.h | 16 + drivers/infiniband/hw/mlx4/mr.c | 100 ++++++- drivers/infiniband/hw/mlx4/qp.c | 14 +- drivers/infiniband/hw/mlx4/srq.c | 2 +- drivers/infiniband/hw/mthca/mthca_cmd.c | 6 + drivers/infiniband/hw/mthca/mthca_dev.h | 2 +- drivers/infiniband/hw/mthca/mthca_main.c | 110 ++++---- drivers/infiniband/hw/mthca/mthca_srq.c | 8 +- drivers/infiniband/ulp/ipoib/ipoib.h | 24 ++- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 18 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 8 + drivers/infiniband/ulp/ipoib/ipoib_main.c | 45 +++- 
drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 31 +-- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 2 +- drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 2 + drivers/infiniband/ulp/iser/iser_initiator.c | 2 - drivers/infiniband/ulp/iser/iser_memory.c | 2 - drivers/infiniband/ulp/iser/iser_verbs.c | 1 - drivers/infiniband/ulp/srp/ib_srp.c | 2 + drivers/net/mlx4/cmd.c | 11 +- drivers/net/mlx4/cq.c | 2 +- drivers/net/mlx4/eq.c | 13 +- drivers/net/mlx4/fw.c | 2 +- drivers/net/mlx4/icm.c | 134 +++++++-- drivers/net/mlx4/icm.h | 9 +- drivers/net/mlx4/main.c | 130 +++++---- drivers/net/mlx4/mcg.c | 2 +- drivers/net/mlx4/mlx4.h | 10 +- drivers/net/mlx4/mr.c | 242 +++++++++++++--- drivers/net/mlx4/pd.c | 2 +- drivers/net/mlx4/qp.c | 5 +- drivers/net/mlx4/srq.c | 4 +- include/linux/mlx4/device.h | 27 ++ include/rdma/ib_cm.h | 7 +- include/rdma/ib_sa.h | 11 +- include/rdma/ib_umem.h | 1 + include/rdma/ib_user_mad.h | 70 +++++- include/rdma/rdma_cm.h | 14 + include/rdma/rdma_user_cm.h | 18 ++ 89 files changed, 2498 insertions(+), 1710 deletions(-) From davem at davemloft.net Thu Oct 11 18:17:19 2007 From: davem at davemloft.net (David Miller) Date: Thu, 11 Oct 2007 18:17:19 -0700 (PDT) Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus In-Reply-To: References: Message-ID: <20071011.181719.78707713.davem@davemloft.net> From: Roland Dreier Date: Thu, 11 Oct 2007 18:08:52 -0700 > This will get the batch of changes queued up for the 2.6.24 merge > window (although I still have a few more things to merge later, once > Dave Miller's networking tree has landed too): Roland are you absolutely sure this won't create merge conflicts with my 8MB net-2.6 merge, inside of which there are many infiniband driver changes? I really wish you would submit your inifiniband work through normal network driver channels, such as Jeff Garzik. 
Jeff has been syncing on almost a daily basis with me so that I wouldn't have to worry about changes coming out of left field and adding additional merge issues for an already difficult merge. Even if you're confident there won't be merge issues, could you just wait for the net-2.6 stuff to go in first? Thanks. From sean.hefty at intel.com Thu Oct 11 18:51:28 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 11 Oct 2007 18:51:28 -0700 Subject: [ofa-general] More librdmacm stuff In-Reply-To: <1192142802.19888.437.camel@firewall.xsintricity.com> References: <1192120622.19888.417.camel@firewall.xsintricity.com><470E9FE3.5000200@ichips.intel.com> <1192142802.19888.437.camel@firewall.xsintricity.com> Message-ID: <000101c80c72$684e5bf0$3acc180a@amr.corp.intel.com> >I'm more referring to when you call rdma_bind_addr to bind to your >device before you call rdma_connect. In that instance, your address >isn't for the eventual destination, but just to bind you to your local >rdma device. For that, an rdma_bind_dev that took an ibv context and a >port number on that device would avoid having to specify an IP address >that you don't really care about. Maybe I'm missing something, but you would still use IP addressing to identify the remote system, which requires IPoIB anyway. My expectation is that the side that calls rdma_connect() would usually call rdma_resolve_addr(), and not use rdma_bind_addr(). This way the local device binding occurs based on the routing tables to the remote address. 
- Sean From rdreier at cisco.com Thu Oct 11 19:21:06 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 11 Oct 2007 19:21:06 -0700 Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus In-Reply-To: <20071011.181719.78707713.davem@davemloft.net> (David Miller's message of "Thu, 11 Oct 2007 18:17:19 -0700 (PDT)") References: <20071011.181719.78707713.davem@davemloft.net> Message-ID: > > This will get the batch of changes queued up for the 2.6.24 merge > > window (although I still have a few more things to merge later, once > > Dave Miller's networking tree has landed too): > > Roland are you absolutely sure this won't create merge conflicts with > my 8MB net-2.6 merge, inside of which there are many infiniband > driver changes? I'm not absolutely sure of anything but I have merged our two git trees quite a few times during the 2.6.23 cycle and I have not seen any conflicts. Unless you've added some more IB changes very recently I don't think there should be any problem. > I really wish you would submit your inifiniband work through normal > network driver channels, such as Jeff Garzik. Jeff has been syncing > on almost a daily basis with me so that I wouldn't have to worry about > changes coming out of left field and adding additional merge issues > for an already difficult merge. I'm not sure what you mean. During the 2.6.23 cycle I've been sending any patches that potentially could conflict with the net-2.6 tree to you and Jeff so that you can merge them upstream via your tree. Or do you mean Jeff should become the maintainer of drivers/infiniband?? Can't you guys just keep the networking stuff contained in its little box so it doesn't create maintenance problems for InfiniBand stuff? > Even if you're confident there won't be merge issues, could you just > wait for the net-2.6 stuff to go in first? I don't mind waiting but I guess it's up to Linus really. - R. 
From pradeeps at linux.vnet.ibm.com Thu Oct 11 19:30:49 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 11 Oct 2007 19:30:49 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git for-linus In-Reply-To: References: Message-ID: <470EDC59.2030607@linux.vnet.ibm.com> Yesterday afternoon I submitted the no srq patch incorporating all of Sean's comments. I did not see that in this list? When do you plan to merge that? Pradeep From davem at davemloft.net Thu Oct 11 19:36:34 2007 From: davem at davemloft.net (David Miller) Date: Thu, 11 Oct 2007 19:36:34 -0700 (PDT) Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus In-Reply-To: References: <20071011.181719.78707713.davem@davemloft.net> Message-ID: <20071011.193634.48801952.davem@davemloft.net> From: Roland Dreier Date: Thu, 11 Oct 2007 19:21:06 -0700 > I'm not sure what you mean. During the 2.6.23 cycle I've been sending > any patches that potentially could conflict with the net-2.6 tree to > you and Jeff so that you can merge them upstream via your tree. Or do > you mean Jeff should become the maintainer of drivers/infiniband?? Not the maintainer, I'm just saying you should gateway your patches through him. From torvalds at linux-foundation.org Thu Oct 11 19:58:04 2007 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Thu, 11 Oct 2007 19:58:04 -0700 (PDT) Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus In-Reply-To: <20071011.181719.78707713.davem@davemloft.net> References: <20071011.181719.78707713.davem@davemloft.net> Message-ID: On Thu, 11 Oct 2007, David Miller wrote: > > Even if you're confident there won't be merge issues, could you just > wait for the net-2.6 stuff to go in first? I pulled the net stuff first, and merged the IB stuff afterwards. No conflicts in IB, but there *were* conflicts with the networking pull for other reasons. 
That horrid, horrid mess that is called include/linux/mod_devicetable.h and scripts/mod/file2alias.c must go at some point. The thing is unmaintainable. Different maintainers add their own structures to both, and functions to both, and it's just messy. That's not how maintainable and modularized code should be written. Now it broke on sdio vs ssb, but there was actually a conflict earlier with the Kbuild merge (which I aborted for other reasons), so this file really is starting to be a problem. The merge was fairly straightforward and stupid - it's not like the code added is *complicated*, but all those small functions and structrues are set up to be a maze of very similar lines, so the merge is actually much worse than it should be - because there is inherent similarity, some lines are automatically auto-merged, making the result just harder to visualize. So I merged it all, and I don't expect any problems, but I'm hoping somebody is thinking about that mod_devicetable.h/file2alias.c mess. I'm not entirely sure who to blame on that thing. I'm adding Greg to the Cc, on the assumption that blaming him is usually the right thing to do ;) Oh, and obviously, the NAPI changes may well have resulted in a merge that had no actual *conflicts* in it, but whether the end result works or not (and whether any IB drivers need updating due to the NAPI changes), I cannot tell. I've pushed out my tree, so people who are competent or just morbidly curious should start looking at it: it's got the following things merged now: - x86 merge - mmc - v4l-dvb - blackfin - avr32 - block layer updates - Jeff's dmi-const - Purdie's blacklight and led trees - ide - mips - net - infiniband and it all builds for me, but hey, I don't use half of it. Oh, btw, one final note: because of just a *ton* of renames, if you actually want git to do rename-detection for you and do automatic merges across those x86 renames, you should likely add [diff] renamelimit=0 to your .gitconfig file. 
Otherwise, the rename detection heuristics may end up saying "I'm not going to even bother finding renames in that mess". (That final note really shouldn't affect any normal users, but I thought I'd mention it in case somebody is going to want git to merge things across the x86 merge, and gets stuck not realizing why some versions of git might not notice the renames). Linus From dledford at redhat.com Thu Oct 11 20:26:43 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 11 Oct 2007 23:26:43 -0400 Subject: [ofa-general] More librdmacm stuff In-Reply-To: <000101c80c72$684e5bf0$3acc180a@amr.corp.intel.com> References: <1192120622.19888.417.camel@firewall.xsintricity.com> <470E9FE3.5000200@ichips.intel.com> <1192142802.19888.437.camel@firewall.xsintricity.com> <000101c80c72$684e5bf0$3acc180a@amr.corp.intel.com> Message-ID: <1192159603.19888.442.camel@firewall.xsintricity.com> On Thu, 2007-10-11 at 18:51 -0700, Sean Hefty wrote: > >I'm more referring to when you call rdma_bind_addr to bind to your > >device before you call rdma_connect. In that instance, your address > >isn't for the eventual destination, but just to bind you to your local > >rdma device. For that, an rdma_bind_dev that took an ibv context and a > >port number on that device would avoid having to specify an IP address > >that you don't really care about. > > Maybe I'm missing something, but you would still use IP addressing to identify > the remote system, which requires IPoIB anyway. My expectation is that the side > that calls rdma_connect() would usually call rdma_resolve_addr(), and not use > rdma_bind_addr(). This way the local device binding occurs based on the routing > tables to the remote address. Think multiport cards and wanting to use a specific port (for load balancing or other reasons). 
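The two positions in this exchange can be sketched in a few lines of C. This is only a minimal illustration, assuming IPv4 IPoIB addresses; `make_ipv4` and `pick_src` are hypothetical helper names, not librdmacm functions. Returning NULL for the source leaves the choice of local device to the routing tables (Sean's suggestion), while supplying the IPoIB address of one particular port pins the connection to that port (Doug's multiport case). The librdmacm calls themselves appear only in comments, so the sketch builds without the library.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Fill *sa with an IPv4 address given as a dotted quad.
 * Returns 0 on success, -1 if the string does not parse. */
static int make_ipv4(struct sockaddr_in *sa, const char *ip)
{
    memset(sa, 0, sizeof(*sa));
    sa->sin_family = AF_INET;
    return inet_pton(AF_INET, ip, &sa->sin_addr) == 1 ? 0 : -1;
}

/* Hypothetical helper: choose the source-address argument.
 *   local_ip == NULL  -> return NULL, routing picks the local device
 *   local_ip != NULL  -> fill *storage and return it, pinning the
 *                        connection to the port that owns that address
 * On a parse failure *err is set to -1 and NULL is returned. */
static struct sockaddr *pick_src(struct sockaddr_in *storage,
                                 const char *local_ip, int *err)
{
    *err = 0;
    if (local_ip == NULL)
        return NULL;
    if (make_ipv4(storage, local_ip) != 0) {
        *err = -1;
        return NULL;
    }
    return (struct sockaddr *)storage;
}

/* With librdmacm, the result would then be used roughly like this
 * (id, dst and timeout_ms are assumed to be set up by the caller):
 *
 *     src = pick_src(&local, local_ip, &err);
 *     rdma_resolve_addr(id, src, (struct sockaddr *)&dst, timeout_ms);
 *
 * Calling rdma_bind_addr(id, src) before connecting is the other way
 * to bind to a specific local address, as described in the thread. */
```

Either way the port is still identified through an IP address, which is the limitation the proposed rdma_bind_dev() is meant to remove.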
-- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From davem at davemloft.net Thu Oct 11 20:28:57 2007 From: davem at davemloft.net (David Miller) Date: Thu, 11 Oct 2007 20:28:57 -0700 (PDT) Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus In-Reply-To: References: <20071011.181719.78707713.davem@davemloft.net> Message-ID: <20071011.202857.74752210.davem@davemloft.net> From: Linus Torvalds Date: Thu, 11 Oct 2007 19:58:04 -0700 (PDT) > > > On Thu, 11 Oct 2007, David Miller wrote: > > > > Even if you're confident there won't be merge issues, could you just > > wait for the net-2.6 stuff to go in first? > > I pulled the net stuff first, and merged the IB stuff afterwards. No > conflicts in IB, but there *were* conflicts with the networking pull for > other reasons. > > That horrid, horrid mess that is called include/linux/mod_devicetable.h > and scripts/mod/file2alias.c must go at some point. The thing is > unmaintainable. Different maintainers add their own structures to both, > and functions to both, and it's just messy. That's not how maintainable > and modularized code should be written. > > Now it broke on sdio vs ssb, but there was actually a conflict earlier > with the Kbuild merge (which I aborted for other reasons), so this file > really is starting to be a problem. > > The merge was fairly straightforward and stupid - it's not like the code > added is *complicated*, but all those small functions and structrues are > set up to be a maze of very similar lines, so the merge is actually much > worse than it should be - because there is inherent similarity, some lines > are automatically auto-merged, making the result just harder to visualize. 
> > So I merged it all, and I don't expect any problems, but I'm hoping > somebody is thinking about that mod_devicetable.h/file2alias.c mess. It all looks good from here.
From gregkh at suse.de Thu Oct 11 20:52:15 2007 From: gregkh at suse.de (Greg KH) Date: Thu, 11 Oct 2007 20:52:15 -0700 Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus In-Reply-To: References: <20071011.181719.78707713.davem@davemloft.net> Message-ID: <20071012035215.GA10190@suse.de> On Thu, Oct 11, 2007 at 07:58:04PM -0700, Linus Torvalds wrote: > > So I merged it all, and I don't expect any problems, but I'm hoping > somebody is thinking about that mod_devicetable.h/file2alias.c mess. > > I'm not entirely sure who to blame on that thing. I'm adding Greg to the > Cc, on the assumption that blaming him is usually the right thing to do ;) Hey, it wasn't me this time, I haven't even built my trees for you to pull from and break everything yet :) But yeah, splitting up the mod_devicetable.h/file2alias.c mess is a very good idea, I'll see what I can come up with tomorrow. thanks, greg k-h From torvalds at linux-foundation.org Thu Oct 11 21:03:14 2007 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Thu, 11 Oct 2007 21:03:14 -0700 (PDT) Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus In-Reply-To: <20071012035215.GA10190@suse.de> References: <20071011.181719.78707713.davem@davemloft.net> <20071012035215.GA10190@suse.de> Message-ID: On Thu, 11 Oct 2007, Greg KH wrote: > On Thu, Oct 11, 2007 at 07:58:04PM -0700, Linus Torvalds wrote: > > > > So I merged it all, and I don't expect any problems, but I'm hoping > > somebody is thinking about that mod_devicetable.h/file2alias.c mess. > > > > I'm not entirely sure who to blame on that thing.
I'm adding Greg to the > > Cc, on the assumption that blaming him is usually the right thing to do ;) > > Hey, it wasn't me this time, I haven't even built my trees for you to > pull from and break everything yet :) No, I meant more in the "who the hell is responsible for designing those *files*" rather than who is responsible for the particular merge mess that happened to involve them this time around. > But yeah, splitting up the mod_devicetable.h/file2alias.c mess is a very > good idea, I'll see what I can come up with tomorrow. I don't think it's a huge issue, but I wanted to bring it up because these days we're normally so good with these kinds of things that it actually stood out a bit. I used to do these kinds of nasty merges all the time with init/main.c and the configuration files, until we split them up. So I'm certainly perfectly able and used to doing them, it's just that I also think that we have generally learnt to do so much better. In other words: no hurry or pressure, I just wanted to bring it up, since during the merge I got flashbacks to various "bad old times" that I had hoped we had mostly left behind. Those files were originally designed/set up by Rusty. I could have blamed him, or perhaps Sam as a kbuild guy, but the reason I cc'd you is that I think this kind of smells like a "device model"ish thing... Hmm? 
Linus From kliteyn at mellanox.co.il Thu Oct 11 22:09:01 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 12 Oct 2007 07:09:01 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-12:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-11 OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From Sumit.Gaur at Sun.COM Thu Oct 11 23:39:06 2007 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Fri, 12 Oct 2007 12:09:06 +0530 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM> <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> <470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> Message-ID: <470F168A.50703@Sun.COM> Hi , Sean Hefty wrote: >>There is no 
per thread demuxing. You would need two different mad agents >>to do this with one looking at the SMI side and the other the GSI side. >>I haven't looked at libibmad in terms of using this model though. > > > umad_receive() doesn't take the mad_agent as an input parameter. The only > possibility I see is calling umad_open_port() twice for the same port, with the > GSI/SMI registrations going to separate port_id's. I think this solution is also not possible as calling umad_open_port() twice for the same port and ca_name is always gives error in port_alloc because dev_to_umad_id generate same umad_id for same ca_name and portnum. ibwarn: [9634] port_alloc: umad port id 1 is already allocated for mthca0 2 So looks like it is impossible to generate two separate portid for the same port. > > - Seanumad_open_port() From HNGUYEN at de.ibm.com Thu Oct 11 23:55:14 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Fri, 12 Oct 2007 08:55:14 +0200 Subject: [ofa-general] [PATCH] fix some ehca limits In-Reply-To: Message-ID: general-bounces at lists.openfabrics.org wrote on 09.10.2007 22:19:17: > I didn't see a response to my earlier email about the other uses of > min_t(int, x, INT_MAX) so I fixed it up myself and added this to my > tree. I don't have a working setup to test yet so please let me know > if you see anything wrong with this: > > commit 919225e60a1a73e3518f257f040f74e9379a61c3 > Author: Roland Dreier > Date: Tue Oct 9 13:17:42 2007 -0700 > > IB/ehca: Fix clipping of device limits to INT_MAX Roland, apologize for this late response. Acked-by: Hoang-Nam Nguyen
From harake at cscs.ch Fri Oct 12 02:06:09 2007 From: harake at cscs.ch (H.N.HARAKE) Date: Fri, 12 Oct 2007 11:06:09 +0200 Subject: [ofa-general] OFED and SLES SP1 Message-ID: <760BED53-C287-4E6E-9A09-66B72A20BC60@cscs.ch> I built the packages of OFED version 1.2.51 on SLES 10 SP1 kernel 2.6.16.53-0.8, building the packages and installing the rpms went smooth. But when i try to insert the module i get the error below. any help would be appreciated Best Regards H. N. Harake modprobe ib_ipoib FATAL: Error inserting ib_ipoib (/lib/modules/2.6.16.53-0.8-smp/updates/kernel/drivers/infiniband/ulp/ ipoib/ib_ipoib.ko): Unknown symbol in module, or unknown parameter (see dmesg) dmesg : ib_ipoib: disagrees about version of symbol ib_create_qp ib_ipoib: Unknown symbol ib_create_qp ib_ipoib: disagrees about version of symbol ib_create_srq ib_ipoib: Unknown symbol ib_create_srq ib_ipoib: disagrees about version of symbol ib_modify_qp ib_ipoib: Unknown symbol ib_modify_qp ib_ipoib: disagrees about version of symbol ib_destroy_ah ib_ipoib: Unknown symbol ib_destroy_ah ib_ipoib: disagrees about version of symbol ib_query_pkey ib_ipoib: Unknown symbol ib_query_pkey ib_ipoib: disagrees about version of symbol ib_init_ah_from_path ib_ipoib: Unknown symbol ib_init_ah_from_path ib_ipoib: disagrees about version of symbol ib_destroy_qp ib_ipoib: Unknown symbol ib_destroy_qp ib_ipoib: disagrees about version of symbol ib_send_cm_rtu ib_ipoib: Unknown symbol ib_send_cm_rtu ib_ipoib: disagrees about version of symbol ib_send_cm_req ib_ipoib: Unknown symbol ib_send_cm_req ib_ipoib: disagrees about version of symbol ib_sa_join_multicast ib_ipoib: Unknown symbol ib_sa_join_multicast ib_ipoib: disagrees about version of symbol ib_find_pkey ib_ipoib: Unknown symbol ib_find_pkey ib_ipoib: disagrees about version of symbol ib_dealloc_pd ib_ipoib: Unknown symbol ib_dealloc_pd ib_ipoib: disagrees about version of symbol ib_query_gid ib_ipoib: Unknown symbol ib_query_gid ib_ipoib: disagrees
about version of symbol ib_attach_mcast ib_ipoib: Unknown symbol ib_attach_mcast ib_ipoib: Unknown symbol icmpv6_send ib_ipoib: disagrees about version of symbol ib_send_cm_rej ib_ipoib: Unknown symbol ib_send_cm_rej From harake at cscs.ch Fri Oct 12 02:07:17 2007 From: harake at cscs.ch (H.N.HARAKE) Date: Fri, 12 Oct 2007 11:07:17 +0200 Subject: [ofa-general] Disable IPV6 Message-ID: How to disable ipv6 and relative symbols to the kernel_ib package (ib_ipoib) or others I am getting an unknown symbol icmpv6_send error when i try to insert the module ib_ipoib in the kernel? regards H.N. Harake From vlad at lists.openfabrics.org Fri Oct 12 02:54:43 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 12 Oct 2007 02:54:43 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071012-0200 daily build status Message-ID: <20071012095443.88F2DE603C4@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 
Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From makc at sgi.com Fri Oct 12 03:11:14 2007 From: makc at sgi.com (Max Matveev) Date: Fri, 12 Oct 2007 20:11:14 +1000 Subject: [ofa-general] perfquery looking at the wrong bits Message-ID: <18191.18498.739648.956805@kuku.melbourne.sgi.com> In OFED 1.2 perfquery attempts to check if a port supports extended counters: } else { /* Should ClassPortInfo be implemented in libibmad ? */ pc2 = (uint16_t *)&pc[2]; /* CapabilityMask */ cap_mask = *pc2; if (!(cap_mask & 0x100)) /* 1.2 errata: bit 9 is extended counter support */ IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask); It seems to be what there are at least 2 problems here: 1. bit 9, if we're counting from 0, will have mask of 0x200, not 0x100. 
mask of 0x100 will be for counter aggregation according to IBA 1.2. 2. If capmask is 16 bit big-endian word, then we're looking at the wrong byte on x86, we must ntohs(*pc2) first. max
From hrosenstock at xsigo.com Fri Oct 12 04:36:51 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 12 Oct 2007 04:36:51 -0700 Subject: [ofa-general] perfquery looking at the wrong bits ([PATCH]) In-Reply-To: <18191.18498.739648.956805@kuku.melbourne.sgi.com> References: <18191.18498.739648.956805@kuku.melbourne.sgi.com> Message-ID: <1192189011.14052.253.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-12 at 20:11 +1000, Max Matveev wrote: > In OFED 1.2 perfquery attempts to check if a port supports extended > counters: > > } else { > /* Should ClassPortInfo be implemented in libibmad ? */ > pc2 = (uint16_t *)&pc[2]; /* CapabilityMask */ > cap_mask = *pc2; > if (!(cap_mask & 0x100)) /* 1.2 errata: bit 9 is extended counter support */ > IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask); > > It seems to be what there are at least 2 problems here: The good news is it only prints out a warning message and continues on. > 1. bit 9, if we're counting from 0, will have mask of 0x200, > not 0x100.
mask of 0x100 will be for counter aggregation according > to IBA 1.2. > > 2. If capmask is 16 bit big-endian word, then we're looking > at the wrong byte on x86, we must ntohs(*pc2) first. Here's a patch to fix this for OFED 1.3: perfquery.c: Fix issue checking PerfMgt:ClassPortInfo.CapabilityMask 1. bit 9, if we're counting from 0, will have mask of 0x200, not 0x100. mask of 0x100 will be for counter aggregation according to IBA 1.2. 2. If capmask is 16 bit big-endian word, then we're looking at the wrong byte on x86, we must ntohs(*pc2) first. Found-by: Max Matveev Compile tested only Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 2ae3281..53b3fb3 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -40,8 +40,9 @@ #include #include #include +#include -#define __BUILD_VERSION_TAG__ 1.2.1 +#define __BUILD_VERSION_TAG__ 1.2.2 #include #include #include @@ -202,8 +203,8 @@ main(int argc, char **argv) } else { /* Should ClassPortInfo be implemented in libibmad ? 
*/ pc2 = (uint16_t *)&pc[2]; /* CapabilityMask */ - cap_mask = *pc2; - if (!(cap_mask & 0x100)) /* 1.2 errata: bit 9 is extended counter support */ + cap_mask = ntohs(*pc2); + if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask); if (!port_performance_ext_query(pc, &portid, port, timeout)) > max > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Fri Oct 12 04:50:17 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 12 Oct 2007 04:50:17 -0700 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <470F168A.50703@Sun.COM> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM> <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> <470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> Message-ID: <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-12 at 12:09 +0530, Sumit Gaur - Sun Microsystem wrote: > Hi , > > Sean Hefty wrote: > >>There is no per thread demuxing. You would need two different mad agents > >>to do this with one looking at the SMI side and the other the GSI side. > >>I haven't looked at libibmad in terms of using this model though. > > > > > > umad_receive() doesn't take the mad_agent as an input parameter. The only > > possibility I see is calling umad_open_port() twice for the same port, with the > > GSI/SMI registrations going to separate port_id's. 
> I think this solution is also not possible as calling umad_open_port() twice for > the same port and ca_name is always gives error in port_alloc because > dev_to_umad_id generate same umad_id for same ca_name and portnum. > > ibwarn: [9634] port_alloc: umad port id 1 is already allocated for mthca0 2 > > So looks like it is impossible to generate two separate portid for the same port. It might be possible to support this with some changes to libibumad. Sasha ? -- Hal > > > > - Seanumad_open_port() From makc at sgi.com Fri Oct 12 05:50:01 2007 From: makc at sgi.com (Max Matveev) Date: Fri, 12 Oct 2007 22:50:01 +1000 Subject: [ofa-general] perfquery looking at the wrong bits ([PATCH]) In-Reply-To: <1192189011.14052.253.camel@hrosenstock-ws.xsigo.com> References: <18191.18498.739648.956805@kuku.melbourne.sgi.com> <1192189011.14052.253.camel@hrosenstock-ws.xsigo.com> Message-ID: <18191.28025.470216.391224@kuku.melbourne.sgi.com> >>>>> "HR" == Hal Rosenstock writes: HR> Here's a patch to fix this for OFED 1.3: While you're there, can you change pointer dereference with memcpy, e.g.: memcpy (&capmask, pc+2, sizeof(capmask)); capmask = ntohs(capmask); Those pointer dereferenes are royal pain on ia64 unless you can guarantee what pc is always aligned properly. max From jeff at garzik.org Fri Oct 12 05:58:06 2007 From: jeff at garzik.org (Jeff Garzik) Date: Fri, 12 Oct 2007 08:58:06 -0400 Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus In-Reply-To: References: <20071011.181719.78707713.davem@davemloft.net> Message-ID: <470F6F5E.4080601@garzik.org> Linus Torvalds wrote: > Oh, and obviously, the NAPI changes may well have resulted in a merge that > had no actual *conflicts* in it, but whether the end result works or not > (and whether any IB drivers need updating due to the NAPI changes), I > cannot tell. 
I've pushed out my tree, so people who are competent or just > morbidly curious should start looking at it: it's got the following things > merged now: > > - x86 merge > - mmc > - v4l-dvb > - blackfin > - avr32 > - block layer updates > - Jeff's dmi-const > - Purdie's blacklight and led trees > - ide > - mips > - net > - infiniband > > and it all builds for me, but hey, I don't use half of it. works here on intel x86-64, amd64, and 32-bit pentium4. and without disk corruption, so I may now attend to the libata merge :) Jeff From rdreier at cisco.com Fri Oct 12 06:07:18 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 12 Oct 2007 06:07:18 -0700 Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus References: <20071011.181719.78707713.davem@davemloft.net> <20071011.193634.48801952.davem@davemloft.net> Message-ID: > > I'm not sure what you mean. During the 2.6.23 cycle I've been sending > > any patches that potentially could conflict with the net-2.6 tree to > > you and Jeff so that you can merge them upstream via your tree. Or do > > you mean Jeff should become the maintainer of drivers/infiniband?? > > Not the maintainer, I'm just saying you should gateway > your patches through him. What value do you see in that? It just seems like it creates more work for Jeff and gives no benefit over the current status quo. - R. 
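For the perfquery CapabilityMask thread above: a minimal standalone sketch of the check with the memcpy and ntohs changes Max suggested. The helper names and the PM_CAP_EXT_COUNTERS macro are made up for illustration and are not part of perfquery or libibmad; the 2-byte offset (pc+2) and the 0x200 mask are taken from the messages in this thread, and the buffer is assumed to be a plain byte array.

```c
#include <arpa/inet.h>  /* ntohs */
#include <assert.h>
#include <stdint.h>
#include <string.h>     /* memcpy */

/* Bit 9, counting from 0, as pointed out in the thread. */
#define PM_CAP_EXT_COUNTERS 0x200

/*
 * Hypothetical helper: read the 16-bit CapabilityMask that sits at
 * byte offset 2 of the ClassPortInfo payload. memcpy avoids an
 * unaligned 16-bit load (the ia64 concern), and ntohs converts the
 * big-endian wire value to host byte order.
 */
static uint16_t class_port_info_cap_mask(const uint8_t *pc)
{
    uint16_t cap_mask;

    assert(pc != NULL);
    memcpy(&cap_mask, pc + 2, sizeof(cap_mask));
    return ntohs(cap_mask);
}

/* Hypothetical helper: nonzero if the extended-counters bit is set. */
static int has_extended_counters(const uint8_t *pc)
{
    return (class_port_info_cap_mask(pc) & PM_CAP_EXT_COUNTERS) != 0;
}
```

A payload whose bytes 2 and 3 are 0x02 0x00 decodes to 0x0200 on both little-endian and big-endian hosts, so has_extended_counters() reports the bit as set; bytes 0x01 0x00 decode to 0x0100, which leaves bit 9 clear.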
From hrosenstock at xsigo.com Fri Oct 12 06:22:58 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 12 Oct 2007 06:22:58 -0700 Subject: [ofa-general] perfquery looking at the wrong bits ([PATCH]) In-Reply-To: <18191.28025.470216.391224@kuku.melbourne.sgi.com> References: <18191.18498.739648.956805@kuku.melbourne.sgi.com> <1192189011.14052.253.camel@hrosenstock-ws.xsigo.com> <18191.28025.470216.391224@kuku.melbourne.sgi.com> Message-ID: <1192195378.14052.266.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-12 at 22:50 +1000, Max Matveev wrote: > >>>>> "HR" == Hal Rosenstock writes: > > HR> Here's a patch to fix this for OFED 1.3: > > While you're there, can you change pointer dereference with memcpy, > e.g.: > > memcpy (&capmask, pc+2, sizeof(capmask)); > capmask = ntohs(capmask); > > > Those pointer dereferenes are royal pain on ia64 unless you can > guarantee what pc is always aligned properly. PATCHv2 shortly. Stay tuned. -- Hal > > max > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Fri Oct 12 06:30:57 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 12 Oct 2007 06:30:57 -0700 Subject: [ofa-general] [PATCHv2] infiniband-diags/perfquery.c: Fix issues when checking PerfMgt:ClassPortInfo.CapabilityMask Message-ID: <1192195857.14052.272.camel@hrosenstock-ws.xsigo.com> infiniband-diags/perfquery.c: Fix issues when checking PerfMgt:ClassPortInfo.CapabilityMask 1. bit 9, if we're counting from 0, will have mask of 0x200, not 0x100. mask of 0x100 will be for counter aggregation according to IBA 1.2. 2. If capmask is 16 bit big-endian word, then we're looking at the wrong byte on x86, we must ntohs(*pc2) first. 3. 
Also, change pointer dereference with memcpy, e.g.: memcpy (&capmask, pc+2, sizeof(capmask)); capmask = ntohs(capmask); Those pointer dereferenes are royal pain on ia64 unless you can guarantee what pc is always aligned properly. Found-by: Max Matveev Compile tested only Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 2ae3281..148e452 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -40,8 +40,9 @@ #include #include #include +#include -#define __BUILD_VERSION_TAG__ 1.2.1 +#define __BUILD_VERSION_TAG__ 1.2.2 #include #include #include @@ -97,7 +98,7 @@ main(int argc, char **argv) char *ca = 0; int ca_port = 0; int extended = 0; - uint16_t cap_mask, *pc2; + uint16_t cap_mask; static char const str_opts[] = "C:P:s:t:dGearRVhu"; static const struct option long_opts[] = { @@ -201,9 +202,9 @@ main(int argc, char **argv) mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); } else { /* Should ClassPortInfo be implemented in libibmad ? 
*/ - pc2 = (uint16_t *)&pc[2]; /* CapabilityMask */ - cap_mask = *pc2; - if (!(cap_mask & 0x100)) /* 1.2 errata: bit 9 is extended counter support */ + memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ + cap_mask = ntohs(cap_mask); + if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask); if (!port_performance_ext_query(pc, &portid, port, timeout)) From makc at sgi.com Fri Oct 12 06:35:48 2007 From: makc at sgi.com (Max Matveev) Date: Fri, 12 Oct 2007 23:35:48 +1000 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/perfquery.c: Fix issues when checking PerfMgt:ClassPortInfo.CapabilityMask In-Reply-To: <1192195857.14052.272.camel@hrosenstock-ws.xsigo.com> References: <1192195857.14052.272.camel@hrosenstock-ws.xsigo.com> Message-ID: <18191.30772.41824.765698@kuku.melbourne.sgi.com> >>>>> "HR" == Hal Rosenstock writes: HR> infiniband-diags/perfquery.c: Fix issues when checking HR> PerfMgt:ClassPortInfo.CapabilityMask Looks good. Thanks. max
From swise at opengridcomputing.com Fri Oct 12 08:35:39 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 12 Oct 2007 10:35:39 -0500 Subject: [ofa-general] More librdmacm stuff In-Reply-To: <1192159603.19888.442.camel@firewall.xsintricity.com> References: <1192120622.19888.417.camel@firewall.xsintricity.com> <470E9FE3.5000200@ichips.intel.com> <1192142802.19888.437.camel@firewall.xsintricity.com> <000101c80c72$684e5bf0$3acc180a@amr.corp.intel.com> <1192159603.19888.442.camel@firewall.xsintricity.com> Message-ID: <470F944B.3020108@opengridcomputing.com> Doug Ledford wrote: > On Thu, 2007-10-11 at 18:51 -0700, Sean Hefty wrote: >>> I'm more referring to when you call rdma_bind_addr to bind to your >>> device before you call rdma_connect. In that instance, your address >>> isn't for the eventual destination, but just to bind you to your local >>> rdma device. For that, an rdma_bind_dev that took an ibv context and a >>> port number on that device would avoid having to specify an IP address >>> that you don't really care about. >> Maybe I'm missing something, but you would still use IP addressing to identify >> the remote system, which requires IPoIB anyway. My expectation is that the side >> that calls rdma_connect() would usually call rdma_resolve_addr(), and not use >> rdma_bind_addr(). This way the local device binding occurs based on the routing >> tables to the remote address. > > Think multiport cards and wanting to use a specific port (for load > balancing or other reasons). > But IP addresses associated with IPoIB and iWARP netdevs _do_ map to a specific port, yes?
> > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jesse.brandeburg at intel.com Fri Oct 12 09:08:58 2007 From: jesse.brandeburg at intel.com (Brandeburg, Jesse) Date: Fri, 12 Oct 2007 09:08:58 -0700 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071010003716.GB552@one.firstfloor.org> Message-ID: <36D9DB17C6DE9E40B059440DB8D95F5203887DD1@orsmsx418.amr.corp.intel.com> Andi Kleen wrote: >> When the hw TX queue gains space, the driver self-batches packets >> from the sw queue to the hw queue. > > I don't really see the advantage over the qdisc in that scheme. > It's certainly not simpler and probably more code and would likely > also not require less locks (e.g. a currently lockless driver > would need a new lock for its sw queue). Also it is unclear to me > it would be really any faster. related to this comment, does Linux have a lockless (using atomics) singly linked list element? That would be very useful in a driver hot path. From rdreier at cisco.com Fri Oct 12 09:47:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 12 Oct 2007 09:47:14 -0700 Subject: [ofa-general] Draft patch to address bugzilla bug#728 In-Reply-To: <470EB4BA.40509@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Thu, 11 Oct 2007 16:41:46 -0700") References: <470EB4BA.40509@linux.vnet.ibm.com> Message-ID: > This is a draft patch to address the following bug: > https://bugs.openfabrics.org/show_bug.cgi?id=728 Might be nice to include a description with the patch, so everyone doesn't have to go figure out this bug report (the issue I guess is that ehca doesn't support enough SG entries to handle 16 4K pages on the IPoIB CM receive queue). 
> While working on this I observed that for mthca max_srq_sge > returned by ib_query_device() is not equal to max_sge returned > by ib_query_srq(). Why is that? Not sure. I'll take a look. What are the two values that you get? > struct ipoib_cm_rx_buf { > struct sk_buff *skb; > - u64 mapping[IPOIB_CM_RX_SG]; > + u64 *mapping; > }; I think it would be much simpler just to leave the array here. You waste a few bytes in the worst case but the memory used for each ipoib_cm_rx_buf structures is much less than the actual receive buffers it points to anyway, so I think the overhead is negligible. > + if (IPOIB_CM_RX_SG >= max_sge_supported) { > + fragment_size = CM_PACKET_SIZE/max_sge_supported; > + num_frags = CM_PACKET_SIZE/fragment_size; > + } else { > + fragment_size = CM_PACKET_SIZE/IPOIB_CM_RX_SG; > + num_frags = IPOIB_CM_RX_SG; > + } > + order = get_order(fragment_size); I think that if the device can't handle enough SG entries to handle the full CM_PACKET_SIZE with PAGE_SIZE fragments, we just have to reduce the size of the receive buffers. Trying to allocate multi-page receive fragments (especially with GFP_ATOMIC on the receive path) is almost certainly going to fail once memory gets fragmented. Lots of other ethernet drivers have been forced to avoid multi-page allocations when using jumbo frames because of serious issues observed in practice, so we should avoid making the same mistake. - R. 
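A rough sketch of the alternative Roland describes above: keep PAGE_SIZE fragments and shrink the receive buffer when the device supports too few SG entries, rather than allocating multi-page fragments. All names here are hypothetical, and the 4K page and 16-page packet size are assumptions based on the "16 4K pages" figure mentioned in this message; this is not the IPoIB code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_PACKET_SIZE (16u * SKETCH_PAGE_SIZE)

/* Hypothetical result type: buffer size and number of page-sized fragments. */
struct rx_layout {
    size_t   buf_size;
    unsigned num_frags;
};

/*
 * Hypothetical helper: cap the receive buffer at what max_sge
 * single-page fragments can cover, so no fragment is ever larger
 * than one page.
 */
static struct rx_layout rx_layout_for(unsigned max_sge)
{
    struct rx_layout l;
    size_t cap;

    assert(max_sge > 0);
    cap = (size_t)max_sge * SKETCH_PAGE_SIZE;
    l.buf_size = cap < SKETCH_PACKET_SIZE ? cap : SKETCH_PACKET_SIZE;
    /* Round up so a partial last page still gets its own fragment. */
    l.num_frags = (unsigned)((l.buf_size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
    return l;
}
```

Under these assumptions, a device limited to 4 SG entries would get a 16 KB buffer in 4 fragments, while one supporting 16 or more keeps the full 64 KB in 16 fragments. The cost is a smaller usable receive size on the limited device, in exchange for never needing a multi-page allocation on the receive path.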
From shemminger at linux-foundation.org Fri Oct 12 10:05:00 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Fri, 12 Oct 2007 10:05:00 -0700 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <36D9DB17C6DE9E40B059440DB8D95F5203887DD1@orsmsx418.amr.corp.intel.com> References: <20071010003716.GB552@one.firstfloor.org> <36D9DB17C6DE9E40B059440DB8D95F5203887DD1@orsmsx418.amr.corp.intel.com> Message-ID: <20071012100500.02255243@freepuppy.rosehill> On Fri, 12 Oct 2007 09:08:58 -0700 "Brandeburg, Jesse" wrote: > Andi Kleen wrote: > >> When the hw TX queue gains space, the driver self-batches packets > >> from the sw queue to the hw queue. > > > > I don't really see the advantage over the qdisc in that scheme. > > It's certainly not simpler and probably more code and would likely > > also not require less locks (e.g. a currently lockless driver > > would need a new lock for its sw queue). Also it is unclear to me > > it would be really any faster. > > related to this comment, does Linux have a lockless (using atomics) > singly linked list element? That would be very useful in a driver hot > path. Use RCU? or write a generic version and get it reviewed. You really want someone with knowledge of all the possible barrier impacts to review it. -- Stephen Hemminger From mschlining at datadirectnet.com Fri Oct 12 10:22:43 2007 From: mschlining at datadirectnet.com (Martin W. Schlining III) Date: Fri, 12 Oct 2007 13:22:43 -0400 Subject: [ofa-general] Building an OFED distribution package Message-ID: <470FAD63.7090702@datadirectnet.com> I'd like to patch the OFED-1.2.5 source file ib_srp.h (or use the modified source file) and rebuild the source RPM (whichever one ib_srp.h comes from) and the OFED 1.2.5 distribution package. Just to make things nice and neat for local use. The goal is to have a local OFED distribution package that already contains the changes I need. 
I'll probably want to do the same for OFED 1.3 when it is released. Now, how do I do this? There is a promising entry on the OpenFabrics web page called "How to integrate a patch in a local copy of OFED and rebuild the source rpm ?", but the link is out of date. I'm not an expert on rpm building or patches, so I may need a few mundane details. What is the best way to do this? Martin From pradeeps at linux.vnet.ibm.com Fri Oct 12 11:22:21 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Fri, 12 Oct 2007 11:22:21 -0700 Subject: [ofa-general] Draft patch to address bugzilla bug#728 In-Reply-To: References: <470EB4BA.40509@linux.vnet.ibm.com> Message-ID: <470FBB5D.5090506@linux.vnet.ibm.com> > > While working on this I observed that for mthca max_srq_sge > > returned by ib_query_device() is not equal to max_sge returned > > by ib_query_srq(). Why is that? > > Not sure. I'll take a look. What are the two values that you get? I get 28 and 16. This is on InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) HCAs that we have. > > + if (IPOIB_CM_RX_SG >= max_sge_supported) { > > + fragment_size = CM_PACKET_SIZE/max_sge_supported; > > + num_frags = CM_PACKET_SIZE/fragment_size; > > + } else { > > + fragment_size = CM_PACKET_SIZE/IPOIB_CM_RX_SG; > > + num_frags = IPOIB_CM_RX_SG; > > + } > > + order = get_order(fragment_size); > > I think that if the device can't handle enough SG entries to handle > the full CM_PACKET_SIZE with PAGE_SIZE fragments, we just have to > reduce the size of the receive buffers. Trying to allocate multi-page > receive fragments (especially with GFP_ATOMIC on the receive path) is > almost certainly going to fail once memory gets fragmented. Lots > of other ethernet drivers have been forced to avoid multi-page > allocations when using jumbo frames because of serious issues observed > in practice, so we should avoid making the same mistake. I sort of expected that this might come up, hence the draft patch. 
If we are driving the systems so hard that in steady state (i.e on the receipt of every packet) one may fail to allocate a handful of multi-page (read 4K page) fragments, what will happen when one uses say Rhel5.1 where we need to allocate only one 64K page? Won't that fail too? Are you suggesting that we reduce the MTU to be sized according to the PAGE_SIZE * max_num_sg supported? If that is correct, then it is a MTU vs memory trade-off -right? Pradeep From andi at firstfloor.org Fri Oct 12 11:27:24 2007 From: andi at firstfloor.org (Andi Kleen) Date: Fri, 12 Oct 2007 20:27:24 +0200 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <36D9DB17C6DE9E40B059440DB8D95F5203887DD1@orsmsx418.amr.corp.intel.com> References: <20071010003716.GB552@one.firstfloor.org> <36D9DB17C6DE9E40B059440DB8D95F5203887DD1@orsmsx418.amr.corp.intel.com> Message-ID: <20071012182724.GA12933@one.firstfloor.org> > related to this comment, does Linux have a lockless (using atomics) > singly linked list element? That would be very useful in a driver hot > path. No; it doesn't. At least not a portable one. Besides they tend to be not faster anyways because e.g. cmpxchg tends to be as slow as an explicit spinlock. -Andi From andi at firstfloor.org Fri Oct 12 11:29:49 2007 From: andi at firstfloor.org (Andi Kleen) Date: Fri, 12 Oct 2007 20:29:49 +0200 Subject: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching In-Reply-To: <20071012100500.02255243@freepuppy.rosehill> References: <20071010003716.GB552@one.firstfloor.org> <36D9DB17C6DE9E40B059440DB8D95F5203887DD1@orsmsx418.amr.corp.intel.com> <20071012100500.02255243@freepuppy.rosehill> Message-ID: <20071012182949.GB12933@one.firstfloor.org> > Use RCU? or write a generic version and get it reviewed. You really > want someone with knowledge of all the possible barrier impacts to > review it. I guess he was thinking of using cmpxchg; but we don't support this in portable code. 
RCU is not really suitable for this because it assumes writing is relatively rare, which is definitely not the case for a qdisc. Also general list management with RCU is quite expensive anyways -- it would require a full copy (that is the 'C' in RCU which Linux generally doesn't use at all) -Andi From gmk at lbl.gov Fri Oct 12 13:27:00 2007 From: gmk at lbl.gov (Greg Kurtzer) Date: Fri, 12 Oct 2007 13:27:00 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure Message-ID: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> I am getting this error with the current packaged releases: # ibcheckerrors perfquery: iberror: failed: perfquery Error check on lid 2 (Topspin DDR-HCAe LX x8) port all: FAILED perfquery: iberror: failed: perfquery Error check on lid 1 (ibtest1 HCA-1) port all: FAILED ## Summary: 2 nodes checked, 0 bad nodes found ## 2 ports checked, 0 ports have errors beyond threshold ibcheckerrors seems to be calling the script ibcheckerrs with a port of 255 and ibcheckerrs is calling the binary perfquery with the "-a" option (which is apparently not working as it is supposed to query all ports) which then calls into libibmad (port_performance_query and then pma_query). I am using libibmad-1.1.2 and infiniband-diags-1.3.2 which are the latest versions from the download directory on the web site. Is it possible this is a bug, or did I grab incompatible versions of the packages? Many thanks! Greg note: Please CC me with any replies as I am not (yet) a list member. Thanks again. 
:) -- Greg Kurtzer gmk at lbl.gov From hrosenstock at xsigo.com Fri Oct 12 13:56:42 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 12 Oct 2007 13:56:42 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> Message-ID: <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> Greg, On Fri, 2007-10-12 at 13:27 -0700, Greg Kurtzer wrote: > I am getting this error with the current packaged releases: > > > # ibcheckerrors > perfquery: iberror: failed: perfquery > Error check on lid 2 (Topspin DDR-HCAe LX x8) port all: FAILED > perfquery: iberror: failed: perfquery > Error check on lid 1 (ibtest1 HCA-1) port all: FAILED What do: perfquery 2 -a and perfquery 1 -a show ? -- Hal > > ## Summary: 2 nodes checked, 0 bad nodes found > ## 2 ports checked, 0 ports have errors beyond threshold > > > ibcheckerrors seems to be calling the script ibcheckerrs with a port > of 255 and ibcheckerrs is calling the binary perfquery with the "-a" > option (which is apparently not working as it is supposed to query > all ports) which then calls into libibmad (port_performance_query and > then pma_query). > > I am using libibmad-1.1.2 and infiniband-diags-1.3.2 which are the > latest versions from the download directory on the web site. Is it > possible this is a bug, or did I grab incompatible versions of the > packages? > > Many thanks! > > Greg > > note: Please CC me with any replies as I am not (yet) a list member. > Thanks again. 
:) > > > -- > Greg Kurtzer > gmk at lbl.gov > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From gmk at lbl.gov Fri Oct 12 13:58:16 2007 From: gmk at lbl.gov (Greg Kurtzer) Date: Fri, 12 Oct 2007 13:58:16 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> Message-ID: # perfquery 1 -a perfquery: iberror: failed: perfquery # perfquery 2 -a perfquery: iberror: failed: perfquery Thanks! On Oct 12, 2007, at 1:56 PM, Hal Rosenstock wrote: > Greg, > > On Fri, 2007-10-12 at 13:27 -0700, Greg Kurtzer wrote: >> I am getting this error with the current packaged releases: >> >> >> # ibcheckerrors >> perfquery: iberror: failed: perfquery >> Error check on lid 2 (Topspin DDR-HCAe LX x8) port all: FAILED >> perfquery: iberror: failed: perfquery >> Error check on lid 1 (ibtest1 HCA-1) port all: FAILED > > What do: > perfquery 2 -a > and > perfquery 1 -a > show ? > > -- Hal > >> >> ## Summary: 2 nodes checked, 0 bad nodes found >> ## 2 ports checked, 0 ports have errors beyond threshold >> >> >> ibcheckerrors seems to be calling the script ibcheckerrs with a port >> of 255 and ibcheckerrs is calling the binary perfquery with the "-a" >> option (which is apparently not working as it is supposed to query >> all ports) which then calls into libibmad (port_performance_query and >> then pma_query). >> >> I am using libibmad-1.1.2 and infiniband-diags-1.3.2 which are the >> latest versions from the download directory on the web site. Is it >> possible this is a bug, or did I grab incompatible versions of the >> packages? >> >> Many thanks! 
>> >> Greg >> >> note: Please CC me with any replies as I am not (yet) a list member. >> Thanks again. :) >> >> >> -- >> Greg Kurtzer >> gmk at lbl.gov >> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/ >> openib-general -- Greg Kurtzer gmk at lbl.gov From gmk at lbl.gov Fri Oct 12 13:59:54 2007 From: gmk at lbl.gov (Greg Kurtzer) Date: Fri, 12 Oct 2007 13:59:54 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> Message-ID: <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> I should have also mentioned that specifying the port works just fine: # perfquery 1 1 # Port counters: Lid 1 port 1 PortSelect:......................1 CounterSelect:...................0x0000 SymbolErrors:....................0 LinkRecovers:....................0 LinkDowned:......................0 RcvErrors:.......................0 RcvRemotePhysErrors:.............0 RcvSwRelayErrors:................0 XmtDiscards:.....................0 XmtConstraintErrors:.............0 RcvConstraintErrors:.............0 LinkIntegrityErrors:.............0 ExcBufOverrunErrors:.............0 VL15Dropped:.....................0 XmtData:.........................2543853998 RcvData:.........................2544066226 XmtPkts:.........................9673830 RcvPkts:.........................9673818 # perfquery 2 1 # Port counters: Lid 2 port 1 PortSelect:......................1 CounterSelect:...................0x0000 SymbolErrors:....................0 LinkRecovers:....................0 LinkDowned:......................0 RcvErrors:.......................0 RcvRemotePhysErrors:.............0 RcvSwRelayErrors:................0 XmtDiscards:.....................0 XmtConstraintErrors:.............0 
RcvConstraintErrors:.............0 LinkIntegrityErrors:.............0 ExcBufOverrunErrors:.............0 VL15Dropped:.....................0 XmtData:.........................2783649428 RcvData:.........................2762924012 XmtPkts:.........................30190437 RcvPkts:.........................30190457 On Oct 12, 2007, at 1:58 PM, Greg Kurtzer wrote: > > # perfquery 1 -a > perfquery: iberror: failed: perfquery > # perfquery 2 -a > perfquery: iberror: failed: perfquery > > Thanks! > > > On Oct 12, 2007, at 1:56 PM, Hal Rosenstock wrote: > >> Greg, >> >> On Fri, 2007-10-12 at 13:27 -0700, Greg Kurtzer wrote: >>> I am getting this error with the current packaged releases: >>> >>> >>> # ibcheckerrors >>> perfquery: iberror: failed: perfquery >>> Error check on lid 2 (Topspin DDR-HCAe LX x8) port all: FAILED >>> perfquery: iberror: failed: perfquery >>> Error check on lid 1 (ibtest1 HCA-1) port all: FAILED >> >> What do: >> perfquery 2 -a >> and >> perfquery 1 -a >> show ? >> >> -- Hal >> >>> >>> ## Summary: 2 nodes checked, 0 bad nodes found >>> ## 2 ports checked, 0 ports have errors beyond threshold >>> >>> >>> ibcheckerrors seems to be calling the script ibcheckerrs with a port >>> of 255 and ibcheckerrs is calling the binary perfquery with the "-a" >>> option (which is apparently not working as it is supposed to query >>> all ports) which then calls into libibmad (port_performance_query >>> and >>> then pma_query). >>> >>> I am using libibmad-1.1.2 and infiniband-diags-1.3.2 which are the >>> latest versions from the download directory on the web site. Is it >>> possible this is a bug, or did I grab incompatible versions of the >>> packages? >>> >>> Many thanks! >>> >>> Greg >>> >>> note: Please CC me with any replies as I am not (yet) a list member. >>> Thanks again. 
:) >>> >>> >>> -- >>> Greg Kurtzer >>> gmk at lbl.gov >>> >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/ >>> openib-general > > -- > Greg Kurtzer > gmk at lbl.gov > > > -- Greg Kurtzer gmk at lbl.gov From hrosenstock at xsigo.com Fri Oct 12 14:03:58 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 12 Oct 2007 14:03:58 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> Message-ID: <1192223038.4962.53.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-12 at 13:58 -0700, Greg Kurtzer wrote: > # perfquery 1 -a > perfquery: iberror: failed: perfquery > # perfquery 2 -a > perfquery: iberror: failed: perfquery That's what I thought was going on. > Thanks! > > > On Oct 12, 2007, at 1:56 PM, Hal Rosenstock wrote: > > > Greg, > > > > On Fri, 2007-10-12 at 13:27 -0700, Greg Kurtzer wrote: > >> I am getting this error with the current packaged releases: > >> > >> > >> # ibcheckerrors > >> perfquery: iberror: failed: perfquery > >> Error check on lid 2 (Topspin DDR-HCAe LX x8) port all: FAILED > >> perfquery: iberror: failed: perfquery > >> Error check on lid 1 (ibtest1 HCA-1) port all: FAILED > > > > What do: > > perfquery 2 -a > > and > > perfquery 1 -a > > show ? > > > > -- Hal > > > >> > >> ## Summary: 2 nodes checked, 0 bad nodes found > >> ## 2 ports checked, 0 ports have errors beyond threshold > >> > >> > >> ibcheckerrors seems to be calling the script ibcheckerrs with a port > >> of 255 and ibcheckerrs is calling the binary perfquery with the "-a" > >> option (which is apparently not working as it is supposed to query > >> all ports) which then calls into libibmad (port_performance_query and > >> then pma_query). 
> >> > >> I am using libibmad-1.1.2 and infiniband-diags-1.3.2 which are the > >> latest versions from the download directory on the web site. Is it > >> possible this is a bug, or did I grab incompatible versions of the > >> packages? > >> > >> Many thanks! > >> > >> Greg > >> > >> note: Please CC me with any replies as I am not (yet) a list member. > >> Thanks again. :) > >> > >> > >> -- > >> Greg Kurtzer > >> gmk at lbl.gov > >> > >> > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit http://openib.org/mailman/listinfo/ > >> openib-general > > -- > Greg Kurtzer > gmk at lbl.gov > > From rdreier at cisco.com Fri Oct 12 14:15:17 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 12 Oct 2007 14:15:17 -0700 Subject: [ofa-general] Draft patch to address bugzilla bug#728 In-Reply-To: <470FBB5D.5090506@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Fri, 12 Oct 2007 11:22:21 -0700") References: <470EB4BA.40509@linux.vnet.ibm.com> <470FBB5D.5090506@linux.vnet.ibm.com> Message-ID: > > > While working on this I observed that for mthca max_srq_sge > > > returned by ib_query_device() is not equal to max_sge returned > > > by ib_query_srq(). Why is that? > > > > Not sure. I'll take a look. What are the two values that you get? > > I get 28 and 16. This is on InfiniBand: Mellanox Technologies MT23108 > InfiniHost (rev a1) HCAs that we have. I don't see anything in the mthca code that could cause this. As far as I can see, the SRQ code just returns the same limit that the consumer passes in. Are you sure you're not just seeing the effect of your code that picks the largest power of two less than the max_sg limit you get from the driver? 
> > > + if (IPOIB_CM_RX_SG >= max_sge_supported) { > > > + fragment_size = CM_PACKET_SIZE/max_sge_supported; > > > + num_frags = CM_PACKET_SIZE/fragment_size; > > > + } else { > > > + fragment_size = CM_PACKET_SIZE/IPOIB_CM_RX_SG; > > > + num_frags = IPOIB_CM_RX_SG; > > > + } > > > + order = get_order(fragment_size); > > > > I think that if the device can't handle enough SG entries to handle > > the full CM_PACKET_SIZE with PAGE_SIZE fragments, we just have to > > reduce the size of the receive buffers. Trying to allocate multi-page > > receive fragments (especially with GFP_ATOMIC on the receive path) is > > almost certainly going to fail once memory gets fragmented. Lots > > of other ethernet drivers have been forced to avoid multi-page > > allocations when using jumbo frames because of serious issues observed > > in practice, so we should avoid making the same mistake. > > I sort of expected that this might come up, hence the draft patch. If we > are driving the systems so hard that in steady state (i.e on the receipt > of every packet) one may fail to allocate a handful of multi-page (read 4K > page) fragments, what will happen when one uses say Rhel5.1 where we need > to allocate only one 64K page? Won't that fail too? Order 0 allocations don't fail because of fragmentation, so it should actually work better. But I guess there's a reason RH is giving up on 64K pages for now. > Are you suggesting that we reduce the MTU to be sized according to the > PAGE_SIZE * max_num_sg supported? If that is correct, then it is a MTU vs > memory trade-off -right? Yes, reduce the MTU to the largest receive buffer that we can handle with PAGE_SIZE fragments. It's not really a tradeoff between memory and MTU -- more a tradeoff between MTU and working better on real systems where memory gets fragmented. - R. 
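The difference between the two approaches discussed above (splitting a fixed packet size into fewer, larger fragments versus capping the buffer at one page per SG entry) can be sketched in plain user-space C. This is only an illustration under stated assumptions: the helper names sketch_get_order() and sketch_max_rx_buf() are hypothetical and are not part of IPoIB or of the draft patch, and the page size and SG limits are passed in as parameters rather than taken from a real device.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space stand-in for the kernel's get_order():
 * the smallest order n such that (page_size << n) >= size.
 * Order 0 means a single page; anything above 0 is a multi-page
 * allocation, which is what the thread warns about.
 */
static unsigned int sketch_get_order(size_t size, size_t page_size)
{
    unsigned int order = 0;
    size_t span = page_size;

    assert(page_size != 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/*
 * Largest receive buffer that can be built from single-page (order 0)
 * fragments only: one page per scatter/gather entry, capped by both the
 * device limit (max_sge) and the number of entries the driver would
 * like to use (wanted_sg). The MTU would then be derived from this size.
 */
static size_t sketch_max_rx_buf(unsigned int max_sge, unsigned int wanted_sg,
                                size_t page_size)
{
    unsigned int num_frags = max_sge < wanted_sg ? max_sge : wanted_sg;

    return (size_t)num_frags * page_size;
}
```

As an illustration with assumed numbers: with 4K pages and a device limited to 8 SG entries, splitting a 64K buffer 8 ways gives 8K fragments, so sketch_get_order(8192, 4096) is 1 (a multi-page allocation), whereas sketch_max_rx_buf(8, 16, 4096) caps the buffer at 32K and keeps every fragment at order 0.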
From hrosenstock at xsigo.com Fri Oct 12 14:15:41 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 12 Oct 2007 14:15:41 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> Message-ID: <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-12 at 13:59 -0700, Greg Kurtzer wrote: > I should have also mentioned that specifying the port works just fine: > > # perfquery 1 1 > # Port counters: Lid 1 port 1 > PortSelect:......................1 > CounterSelect:...................0x0000 > SymbolErrors:....................0 > LinkRecovers:....................0 > LinkDowned:......................0 > RcvErrors:.......................0 > RcvRemotePhysErrors:.............0 > RcvSwRelayErrors:................0 > XmtDiscards:.....................0 > XmtConstraintErrors:.............0 > RcvConstraintErrors:.............0 > LinkIntegrityErrors:.............0 > ExcBufOverrunErrors:.............0 > VL15Dropped:.....................0 > XmtData:.........................2543853998 > RcvData:.........................2544066226 > XmtPkts:.........................9673830 > RcvPkts:.........................9673818 > # perfquery 2 1 > # Port counters: Lid 2 port 1 > PortSelect:......................1 > CounterSelect:...................0x0000 > SymbolErrors:....................0 > LinkRecovers:....................0 > LinkDowned:......................0 > RcvErrors:.......................0 > RcvRemotePhysErrors:.............0 > RcvSwRelayErrors:................0 > XmtDiscards:.....................0 > XmtConstraintErrors:.............0 > RcvConstraintErrors:.............0 > LinkIntegrityErrors:.............0 > ExcBufOverrunErrors:.............0 > VL15Dropped:.....................0 > XmtData:.........................2783649428 > 
RcvData:.........................2762924012 > XmtPkts:.........................30190437 > RcvPkts:.........................30190457 Are these one port HCAs ? Wonder what PerfMgt ClassPortInfo:CapabilityMask is. Could I get you to add some "debug" code to display this ? perfquery should probably check this (when -a is being used). The scripts need more changes to work in this mode as I think they assume "all ports" is supported. -- Hal > > > On Oct 12, 2007, at 1:58 PM, Greg Kurtzer wrote: > > > > > # perfquery 1 -a > > perfquery: iberror: failed: perfquery > > # perfquery 2 -a > > perfquery: iberror: failed: perfquery > > > > Thanks! > > > > > > On Oct 12, 2007, at 1:56 PM, Hal Rosenstock wrote: > > > >> Greg, > >> > >> On Fri, 2007-10-12 at 13:27 -0700, Greg Kurtzer wrote: > >>> I am getting this error with the current packaged releases: > >>> > >>> > >>> # ibcheckerrors > >>> perfquery: iberror: failed: perfquery > >>> Error check on lid 2 (Topspin DDR-HCAe LX x8) port all: FAILED > >>> perfquery: iberror: failed: perfquery > >>> Error check on lid 1 (ibtest1 HCA-1) port all: FAILED > >> > >> What do: > >> perfquery 2 -a > >> and > >> perfquery 1 -a > >> show ? > >> > >> -- Hal > >> > >>> > >>> ## Summary: 2 nodes checked, 0 bad nodes found > >>> ## 2 ports checked, 0 ports have errors beyond threshold > >>> > >>> > >>> ibcheckerrors seems to be calling the script ibcheckerrs with a port > >>> of 255 and ibcheckerrs is calling the binary perfquery with the "-a" > >>> option (which is apparently not working as it is supposed to query > >>> all ports) which then calls into libibmad (port_performance_query > >>> and > >>> then pma_query). > >>> > >>> I am using libibmad-1.1.2 and infiniband-diags-1.3.2 which are the > >>> latest versions from the download directory on the web site. Is it > >>> possible this is a bug, or did I grab incompatible versions of the > >>> packages? > >>> > >>> Many thanks! 
> >>> > >>> Greg > >>> > >>> note: Please CC me with any replies as I am not (yet) a list member. > >>> Thanks again. :) > >>> > >>> > >>> -- > >>> Greg Kurtzer > >>> gmk at lbl.gov > >>> > >>> > >>> _______________________________________________ > >>> general mailing list > >>> general at lists.openfabrics.org > >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>> > >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/ > >>> openib-general > > > > -- > > Greg Kurtzer > > gmk at lbl.gov > > > > > > > > -- > Greg Kurtzer > gmk at lbl.gov > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From gmk at lbl.gov Fri Oct 12 14:23:52 2007 From: gmk at lbl.gov (Greg Kurtzer) Date: Fri, 12 Oct 2007 14:23:52 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> Message-ID: <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> On Oct 12, 2007, at 2:15 PM, Hal Rosenstock wrote: > On Fri, 2007-10-12 at 13:59 -0700, Greg Kurtzer wrote: >> I should have also mentioned that specifying the port works just >> fine: >> >> # perfquery 1 1 >> # Port counters: Lid 1 port 1 >> PortSelect:......................1 >> CounterSelect:...................0x0000 >> SymbolErrors:....................0 >> LinkRecovers:....................0 >> LinkDowned:......................0 >> RcvErrors:.......................0 >> RcvRemotePhysErrors:.............0 >> RcvSwRelayErrors:................0 >> XmtDiscards:.....................0 >> XmtConstraintErrors:.............0 >> 
RcvConstraintErrors:.............0 >> LinkIntegrityErrors:.............0 >> ExcBufOverrunErrors:.............0 >> VL15Dropped:.....................0 >> XmtData:.........................2543853998 >> RcvData:.........................2544066226 >> XmtPkts:.........................9673830 >> RcvPkts:.........................9673818 >> # perfquery 2 1 >> # Port counters: Lid 2 port 1 >> PortSelect:......................1 >> CounterSelect:...................0x0000 >> SymbolErrors:....................0 >> LinkRecovers:....................0 >> LinkDowned:......................0 >> RcvErrors:.......................0 >> RcvRemotePhysErrors:.............0 >> RcvSwRelayErrors:................0 >> XmtDiscards:.....................0 >> XmtConstraintErrors:.............0 >> RcvConstraintErrors:.............0 >> LinkIntegrityErrors:.............0 >> ExcBufOverrunErrors:.............0 >> VL15Dropped:.....................0 >> XmtData:.........................2783649428 >> RcvData:.........................2762924012 >> XmtPkts:.........................30190437 >> RcvPkts:.........................30190457 > > Are these one port HCAs ? Yes. > Wonder what PerfMgt ClassPortInfo:CapabilityMask is. Could I get > you to > add some "debug" code to display this ? Of course. Please contact me directly on the specifics. > perfquery should probably check this (when -a is being used). > > The scripts need more changes to work in this mode as I think they > assume "all ports" is supported. Gotcha. Let me know what I can do to help! :) Thanks, Greg -- Greg Kurtzer gmk at lbl.gov From gshipman at ornl.gov Fri Oct 12 14:35:01 2007 From: gshipman at ornl.gov (Shipman, Galen M.) Date: Fri, 12 Oct 2007 17:35:01 -0400 Subject: [ofa-general] ***SPAM*** SRP Initiator port -> SRP Target port mismatch? Message-ID: My setup is as follows: I have 4 ports on a DDN, each port has 2 LUNs mapped to it. I have 4 ports active on the SRP initiator machine. 
I want to do a one to one mapping of SRP initiator ports to SRP target ports. All ports are connected via a switch. Here are the 4 ports as seen by SRP: [lce2 ~] $ /usr/sbin/ibsrpdm -cd /dev/infiniband/umad0 | /bin/sed 's/pkey/max_sect=16384,max_cmd_per_lun=6,pkey/' id_ext=25000001ff040528,ioc_guid=25000001ff040528,dgid=fe8000000000000025000 001ff040528,max_sect=16384,max_cmd_per_lun=6,pkey=ffff,service_id=280504ff01 000025 id_ext=27000001ff040528,ioc_guid=27000001ff040528,dgid=fe8000000000000027000 001ff040528,max_sect=16384,max_cmd_per_lun=6,pkey=ffff,service_id=280504ff01 000027 id_ext=21000001ff0404ec,ioc_guid=21000001ff0404ec,dgid=fe8000000000000021000 001ff0404ec,max_sect=16384,max_cmd_per_lun=6,pkey=ffff,service_id=ec0404ff01 000021 id_ext=23000001ff0404ec,ioc_guid=23000001ff0404ec,dgid=fe8000000000000023000 001ff0404ec,max_sect=16384,max_cmd_per_lun=6,pkey=ffff,service_id=ec0404ff01 000023 I add one of the above SRP target ports to an SRP initiator port: echo id_ext=25000001ff040528,ioc_guid=25000001ff040528,dgid=fe8000000000000025000 001ff040528,max_sect=16384,max_cmd_per_lun=6,pkey=ffff,service_id=280504ff01 000025 > /sys/class/infiniband_srp/srp-mthca0-1/add_target Unfortunately this results in the device showing up on BOTH ports on the SRP initiator's HCA0: find /sys/class/infiniband_srp/srp-mthca*/device/host*/target*/*/block -ls 31820 0 lrwxrwxrwx 1 root root 0 Oct 12 17:19 /sys/class/infiniband_srp/srp-mthca0-1/device/host24/target24:0:0/24:0:0:4/b lock -> ../../../../../../../block/sdb 31869 0 lrwxrwxrwx 1 root root 0 Oct 12 17:19 /sys/class/infiniband_srp/srp-mthca0-1/device/host24/target24:0:0/24:0:0:6/b lock -> ../../../../../../../block/sdc 31820 0 lrwxrwxrwx 1 root root 0 Oct 12 17:19 /sys/class/infiniband_srp/srp-mthca0-2/device/host24/target24:0:0/24:0:0:4/b lock -> ../../../../../../../block/sdb 31869 0 lrwxrwxrwx 1 root root 0 Oct 12 17:19 /sys/class/infiniband_srp/srp-mthca0-2/device/host24/target24:0:0/24:0:0:6/b lock -> 
../../../../../../../block/sdc Note that I added the SRP target only to srp-mthca0-1, but it now shows up under both srp-mthca0-1 and srp-mthca0-2. This is not what I want, I want to ensure that data written to say /dev/sdb always goes over srp_mthca0-1 and NOT srp_mthca0-2. Any ideas on what is going on here? I am new to SRP so I may be missing something very trivial here. Thanks, Galen From gmk at lbl.gov Fri Oct 12 14:47:45 2007 From: gmk at lbl.gov (Greg Kurtzer) Date: Fri, 12 Oct 2007 14:47:45 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> Message-ID: <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> On Oct 12, 2007, at 2:23 PM, Greg Kurtzer wrote: >> >> Are these one port HCAs ? > > Yes. > >> Wonder what PerfMgt ClassPortInfo:CapabilityMask is. Could I get >> you to >> add some "debug" code to display this ? > > Of course. Please contact me directly on the specifics. Investigating now... "perfquery -e" fails, but I noticed in the source code that it should print the cap_mask right before it dies. 
So hopefully this is what you are looking for: # perfquery -de ibwarn: [25274] smp_query: attr 0x11 mod 0x0 route DR path 0 ibwarn: [25274] mad_rpc: data offs 64 sz 64 mad data 0101 0101 0005 ad00 000b f0cb 0005 ad00 000b f0c8 0005 ad00 000b f0c9 0040 6274 0000 00a0 0100 05ad 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ibwarn: [25274] smp_query: attr 0x15 mod 0x0 route DR path 0 ibwarn: [25274] mad_rpc: data offs 64 sz 64 mad data 0000 0000 0000 0000 fe80 0000 0000 0000 0001 0001 0251 0a6a 0000 0000 0103 0302 3452 0023 4030 0008 0804 ff30 0000 0000 0000 2012 1088 0000 0000 0000 0000 0000 ibwarn: [25274] pma_query: lid 1 port 1 ibwarn: [25274] mad_rpc: data offs 64 sz 192 mad data 0101 0000 0000 0014 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ibwarn: [25274] main: PerfMgt ClassPortInfo 0x0 extended counters not indicated ibwarn: [25274] pma_query: lid 1 port 1 ibwarn: [25274] mad_rpc: MAD completed with error status 0xc perfquery: iberror: [pid 25274] main: failed: perfextquery Thanks! -- Greg Kurtzer gmk at lbl.gov From rdreier at cisco.com Fri Oct 12 14:51:48 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 12 Oct 2007 14:51:48 -0700 Subject: [ofa-general] Re: [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> (swelch@systemfabricworks.com's message of "Wed, 10 Oct 2007 22:29:25 -0500") References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> Message-ID: I've lost the plot again. Do we agree on a patch to apply for 2.6.24? 
If so can I get a final version that includes a changelog that explains what's wrong with the current code and how the patch fixes the problem? - R. From hrosenstock at xsigo.com Fri Oct 12 14:59:22 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 12 Oct 2007 14:59:22 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> Message-ID: <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-12 at 14:47 -0700, Greg Kurtzer wrote: > ibwarn: [25274] pma_query: lid 1 port 1 > ibwarn: [25274] mad_rpc: data offs 64 sz 192 > mad data > 0101 0000 0000 0014 0000 0000 0000 0000 Thanks; AllPortSelect is off in CapabilityMask which is consistent with the behavior. (It would be trivial for those HCA PMAs to indicate AllPortSelect is supported (since it's the same as supporting one port) and then all would be fine but that's not a requirement). A check should be added in perfquery for this. I will generate a patch for that but that won't fix the problem. I will try to find time to look at the scripts and see what it will take to fix this. Where AllPortSelect is not supported, they need to drop back to individual ports. 
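A minimal sketch of the fallback described above, in C. It assumes that AllPortSelect is bit 8 of the PerfMgt ClassPortInfo CapabilityMask; the struct, the callback type and every sketch_* name are hypothetical and are not the libibmad or perfquery API. The fake query function exists only to make the sketch self-contained and mimics a PMA that rejects port 255.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ALL_PORT_SELECT (1u << 8) /* assumed AllPortSelect bit */
#define SKETCH_ALL_PORTS       255       /* "all ports" port number */

/* Hypothetical subset of the port counters. */
struct sketch_counters {
    uint64_t xmt_pkts;
    uint64_t rcv_pkts;
    uint64_t symbol_errors;
};

/* Hypothetical per-port query: returns 0 on success, nonzero on failure. */
typedef int (*sketch_query_fn)(int port, struct sketch_counters *out,
                               void *ctx);

/*
 * If the PMA advertises AllPortSelect, issue a single query for port 255.
 * Otherwise drop back to individual ports and sum the results.
 */
static int sketch_query_all(unsigned int cap_mask, int num_ports,
                            sketch_query_fn query, void *ctx,
                            struct sketch_counters *total)
{
    struct sketch_counters one;
    int port;

    if (cap_mask & SKETCH_ALL_PORT_SELECT)
        return query(SKETCH_ALL_PORTS, total, ctx);

    total->xmt_pkts = 0;
    total->rcv_pkts = 0;
    total->symbol_errors = 0;
    for (port = 1; port <= num_ports; port++) {
        if (query(port, &one, ctx) != 0)
            return -1;
        total->xmt_pkts += one.xmt_pkts;
        total->rcv_pkts += one.rcv_pkts;
        total->symbol_errors += one.symbol_errors;
    }
    return 0;
}

/* Fake PMA for demonstration only: fails for port 255, else fixed data. */
static int sketch_fake_query(int port, struct sketch_counters *out, void *ctx)
{
    (void)ctx;
    if (port == SKETCH_ALL_PORTS)
        return -1;
    out->xmt_pkts = 10u * (unsigned int)port;
    out->rcv_pkts = (unsigned int)port;
    out->symbol_errors = 0;
    return 0;
}
```

Real port counters are fixed-width and saturating, so a real implementation would probably need to clamp the sums instead of adding into 64-bit fields as done here.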
-- Hal From hrosenstock at xsigo.com Fri Oct 12 15:14:04 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 12 Oct 2007 15:14:04 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> Message-ID: <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-12 at 14:59 -0700, Hal Rosenstock wrote: > On Fri, 2007-10-12 at 14:47 -0700, Greg Kurtzer wrote: > > ibwarn: [25274] pma_query: lid 1 port 1 > > ibwarn: [25274] mad_rpc: data offs 64 sz 192 > > mad data > > 0101 0000 0000 0014 0000 0000 0000 0000 > > Thanks; AllPortSelect is off in CapabilityMask which is consistent with > the behavior. (It would be trivial for those HCA PMAs to indicate > AllPortSelect is supported (since it's the same as supporting one port) > and then all would be fine but that's not a requirement). > > A check should be added in perfquery for this.I will generate a patch > for that but that won't fix the problem. Actually, perfquery gets the number of ports and could do multiple PerfGets, one per port, and accumulate the "all" ports. This approach may be better than dealing with the scripts. -- Hal > I will try to find time to look at the scripts and see what it will take > to fix this. Where AllPortSelect is not supported, they need to drop > back to individual ports. 
> > -- Hal > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Fri Oct 12 15:17:02 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 12 Oct 2007 15:17:02 -0700 Subject: [ofa-general] ***SPAM*** SRP Initiator port -> SRP Target port mismatch? In-Reply-To: (Galen M. Shipman's message of "Fri, 12 Oct 2007 17:35:01 -0400") References: Message-ID: > find /sys/class/infiniband_srp/srp-mthca*/device/host*/target*/*/block -ls > 31820 0 lrwxrwxrwx 1 root root 0 Oct 12 17:19 > /sys/class/infiniband_srp/srp-mthca0-1/device/host24/target24:0:0/24:0:0:4/b > lock -> ../../../../../../../block/sdb > 31869 0 lrwxrwxrwx 1 root root 0 Oct 12 17:19 > /sys/class/infiniband_srp/srp-mthca0-1/device/host24/target24:0:0/24:0:0:6/b > lock -> ../../../../../../../block/sdc > 31820 0 lrwxrwxrwx 1 root root 0 Oct 12 17:19 > /sys/class/infiniband_srp/srp-mthca0-2/device/host24/target24:0:0/24:0:0:4/b > lock -> ../../../../../../../block/sdb > 31869 0 lrwxrwxrwx 1 root root 0 Oct 12 17:19 > /sys/class/infiniband_srp/srp-mthca0-2/device/host24/target24:0:0/24:0:0:6/b > lock -> ../../../../../../../block/sdc I think the /device/ part of the path is a link to the underlying device, so since srp-mthca0-1 and srp-mthca0-2 both have the same HCA, you will see targets appear twice when you list them that way. - R. 
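One way to check whether two sysfs entries are links to the same underlying device, as suggested above, is to canonicalize both paths and compare them. The following is a minimal sketch using POSIX realpath(3); the function name sketch_same_target() is hypothetical, and nothing here is specific to SRP or to sysfs.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Resolve both paths (following symbolic links) and compare the results.
 * Returns 1 if they resolve to the same canonical path, 0 if they resolve
 * to different paths, and -1 if either path cannot be resolved.
 */
static int sketch_same_target(const char *a, const char *b)
{
    char *ra;
    char *rb;
    int result = -1;

    assert(a != NULL && b != NULL);
    ra = realpath(a, NULL); /* POSIX.1-2008: buffer is malloc()ed */
    rb = realpath(b, NULL);
    if (ra != NULL && rb != NULL)
        result = (strcmp(ra, rb) == 0);
    free(ra); /* free(NULL) is a no-op */
    free(rb);
    return result;
}
```

For the listing quoted above, one could pass /sys/class/infiniband_srp/srp-mthca0-1/device and /sys/class/infiniband_srp/srp-mthca0-2/device; a return value of 1 would be consistent with both entries pointing at the same HCA, which would explain why the targets are listed twice.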
From gmk at lbl.gov Fri Oct 12 15:38:59 2007
From: gmk at lbl.gov (Greg Kurtzer)
Date: Fri, 12 Oct 2007 15:38:59 -0700
Subject: [ofa-general] ibcheckerrrors/perfquery failure
In-Reply-To: <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com>
References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com>
Message-ID:

On Oct 12, 2007, at 3:14 PM, Hal Rosenstock wrote:
> On Fri, 2007-10-12 at 14:59 -0700, Hal Rosenstock wrote:
>> On Fri, 2007-10-12 at 14:47 -0700, Greg Kurtzer wrote:
>>> ibwarn: [25274] pma_query: lid 1 port 1
>>> ibwarn: [25274] mad_rpc: data offs 64 sz 192
>>> mad data
>>> 0101 0000 0000 0014 0000 0000 0000 0000
>>
>> Thanks; AllPortSelect is off in CapabilityMask which is consistent with
>> the behavior. (It would be trivial for those HCA PMAs to indicate
>> AllPortSelect is supported (since it's the same as supporting one port)
>> and then all would be fine but that's not a requirement).
>>
>> A check should be added in perfquery for this. I will generate a patch
>> for that but that won't fix the problem.
>
> Actually, perfquery gets the number of ports and could do multiple
> PerfGets, one per port, and accumulate the "all" ports.
>
> This approach may be better than dealing with the scripts.

Excellent! I will be happy to test the patch when ready.

Thanks again!
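
The per-port fallback quoted above could look roughly like the sketch below: read the counters for each port in turn and sum them into a single "all ports" record. This is not perfquery code. The struct, its two fields, and the function names are hypothetical, and clamping each sum at the width of its field is an assumption that the thread does not confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of per-port counters; field names are illustrative. */
struct port_counters {
	uint16_t symbol_errors;
	uint32_t xmit_data;
};

/* Add val into *dest, clamping at the 16-bit field width (assumed policy). */
static void add_sat16(uint16_t *dest, uint16_t val)
{
	uint32_t sum = (uint32_t)*dest + val;

	*dest = sum > UINT16_MAX ? UINT16_MAX : (uint16_t)sum;
}

/* Add val into *dest, clamping at the 32-bit field width (assumed policy). */
static void add_sat32(uint32_t *dest, uint32_t val)
{
	uint64_t sum = (uint64_t)*dest + val;

	*dest = sum > UINT32_MAX ? UINT32_MAX : (uint32_t)sum;
}

/*
 * Sum the counters of nports individual ports into one record, standing
 * in for the single "all ports" query the device does not support.
 */
static struct port_counters accumulate_ports(const struct port_counters *ports,
					     size_t nports)
{
	struct port_counters total = { 0, 0 };
	size_t i;

	assert(ports != NULL || nports == 0);
	for (i = 0; i < nports; i++) {
		add_sat16(&total.symbol_errors, ports[i].symbol_errors);
		add_sat32(&total.xmit_data, ports[i].xmit_data);
	}
	return total;
}
```

In a real tool the array would be filled by one PerfGet per port, with the port count taken from the node information that perfquery already fetches.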
Greg -- Greg Kurtzer gmk at lbl.gov From sean.hefty at intel.com Fri Oct 12 17:05:22 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 12 Oct 2007 17:05:22 -0700 Subject: [ofa-general] [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch Message-ID: <000001c80d2c$c08ce380$f4cc180a@amr.corp.intel.com> Please pull from: git://git.openfabrics.org/~shefty/rdma-dev.git for-roland This will pick up a couple of recent rdma_cm bug fixes. drivers/infiniband/core/cma.c | 160 +++++++++++++++++++++--------------------- 1 files changed, 83 insertions(+), 77 deletions(-) Sean Hefty (2): rdma/cm: add locking around QP accesses rdma/cm: fix deadlock destroying listen requests diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 93644f8..ee946cc 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -114,13 +114,16 @@ struct rdma_id_private { struct rdma_bind_list *bind_list; struct hlist_node node; - struct list_head list; - struct list_head listen_list; + struct list_head list; /* listen_any_list or cma_device.list */ + struct list_head listen_list; /* per device listens */ struct cma_device *cma_dev; struct list_head mc_list; + int internal_id; enum cma_state state; spinlock_t lock; + struct mutex qp_mutex; + struct completion comp; atomic_t refcount; wait_queue_head_t wait_remove; @@ -389,6 +392,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, id_priv->id.event_handler = event_handler; id_priv->id.ps = ps; spin_lock_init(&id_priv->lock); + mutex_init(&id_priv->qp_mutex); init_completion(&id_priv->comp); atomic_set(&id_priv->refcount, 1); init_waitqueue_head(&id_priv->wait_remove); @@ -474,61 +478,86 @@ EXPORT_SYMBOL(rdma_create_qp); void rdma_destroy_qp(struct rdma_cm_id *id) { - ib_destroy_qp(id->qp); + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, id); + mutex_lock(&id_priv->qp_mutex); + ib_destroy_qp(id_priv->id.qp); + id_priv->id.qp = 
NULL; + mutex_unlock(&id_priv->qp_mutex); } EXPORT_SYMBOL(rdma_destroy_qp); -static int cma_modify_qp_rtr(struct rdma_cm_id *id) +static int cma_modify_qp_rtr(struct rdma_id_private *id_priv) { struct ib_qp_attr qp_attr; int qp_attr_mask, ret; - if (!id->qp) - return 0; + mutex_lock(&id_priv->qp_mutex); + if (!id_priv->id.qp) { + ret = 0; + goto out; + } /* Need to update QP attributes from default values. */ qp_attr.qp_state = IB_QPS_INIT; - ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); if (ret) - return ret; + goto out; - ret = ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); if (ret) - return ret; + goto out; qp_attr.qp_state = IB_QPS_RTR; - ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); if (ret) - return ret; + goto out; - return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); +out: + mutex_unlock(&id_priv->qp_mutex); + return ret; } -static int cma_modify_qp_rts(struct rdma_cm_id *id) +static int cma_modify_qp_rts(struct rdma_id_private *id_priv) { struct ib_qp_attr qp_attr; int qp_attr_mask, ret; - if (!id->qp) - return 0; + mutex_lock(&id_priv->qp_mutex); + if (!id_priv->id.qp) { + ret = 0; + goto out; + } qp_attr.qp_state = IB_QPS_RTS; - ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); if (ret) - return ret; + goto out; - return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); +out: + mutex_unlock(&id_priv->qp_mutex); + return ret; } -static int cma_modify_qp_err(struct rdma_cm_id *id) +static int cma_modify_qp_err(struct rdma_id_private *id_priv) { struct ib_qp_attr qp_attr; + int ret; - if (!id->qp) - return 0; + mutex_lock(&id_priv->qp_mutex); + if (!id_priv->id.qp) 
{ + ret = 0; + goto out; + } qp_attr.qp_state = IB_QPS_ERR; - return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, IB_QP_STATE); +out: + mutex_unlock(&id_priv->qp_mutex); + return ret; } static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv, @@ -717,50 +746,27 @@ static void cma_cancel_route(struct rdma_id_private *id_priv) } } -static inline int cma_internal_listen(struct rdma_id_private *id_priv) -{ - return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev && - cma_any_addr(&id_priv->id.route.addr.src_addr); -} - -static void cma_destroy_listen(struct rdma_id_private *id_priv) -{ - cma_exch(id_priv, CMA_DESTROYING); - - if (id_priv->cma_dev) { - switch (rdma_node_get_transport(id_priv->id.device->node_type)) { - case RDMA_TRANSPORT_IB: - if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) - ib_destroy_cm_id(id_priv->cm_id.ib); - break; - case RDMA_TRANSPORT_IWARP: - if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) - iw_destroy_cm_id(id_priv->cm_id.iw); - break; - default: - break; - } - cma_detach_from_dev(id_priv); - } - list_del(&id_priv->listen_list); - - cma_deref_id(id_priv); - wait_for_completion(&id_priv->comp); - - kfree(id_priv); -} - static void cma_cancel_listens(struct rdma_id_private *id_priv) { struct rdma_id_private *dev_id_priv; + /* + * Remove from listen_any_list to prevent added devices from spawning + * additional listen requests. 
+ */ mutex_lock(&lock); list_del(&id_priv->list); while (!list_empty(&id_priv->listen_list)) { dev_id_priv = list_entry(id_priv->listen_list.next, struct rdma_id_private, listen_list); - cma_destroy_listen(dev_id_priv); + /* sync with device removal to avoid duplicate destruction */ + list_del_init(&dev_id_priv->list); + list_del(&dev_id_priv->listen_list); + mutex_unlock(&lock); + + rdma_destroy_id(&dev_id_priv->id); + mutex_lock(&lock); } mutex_unlock(&lock); } @@ -848,6 +854,9 @@ void rdma_destroy_id(struct rdma_cm_id *id) cma_deref_id(id_priv); wait_for_completion(&id_priv->comp); + if (id_priv->internal_id) + cma_deref_id(id_priv->id.context); + kfree(id_priv->id.route.path_rec); kfree(id_priv); } @@ -857,11 +866,11 @@ static int cma_rep_recv(struct rdma_id_private *id_priv) { int ret; - ret = cma_modify_qp_rtr(&id_priv->id); + ret = cma_modify_qp_rtr(id_priv); if (ret) goto reject; - ret = cma_modify_qp_rts(&id_priv->id); + ret = cma_modify_qp_rts(id_priv); if (ret) goto reject; @@ -871,7 +880,7 @@ static int cma_rep_recv(struct rdma_id_private *id_priv) return 0; reject: - cma_modify_qp_err(&id_priv->id); + cma_modify_qp_err(id_priv); ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; @@ -947,7 +956,7 @@ static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) /* ignore event */ goto out; case IB_CM_REJ_RECEIVED: - cma_modify_qp_err(&id_priv->id); + cma_modify_qp_err(id_priv); event.status = ib_event->param.rej_rcvd.reason; event.event = RDMA_CM_EVENT_REJECTED; event.param.conn.private_data = ib_event->private_data; @@ -1404,14 +1413,13 @@ static void cma_listen_on_dev(struct rdma_id_private *id_priv, cma_attach_to_dev(dev_id_priv, cma_dev); list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list); + atomic_inc(&id_priv->refcount); + dev_id_priv->internal_id = 1; ret = rdma_listen(id, id_priv->backlog); if (ret) - goto err; - - return; -err: - cma_destroy_listen(dev_id_priv); + 
printk(KERN_WARNING "RDMA CMA: cma_listen_on_dev, error %d, " + "listening on device %s", ret, cma_dev->device->name); } static void cma_listen_on_all(struct rdma_id_private *id_priv) @@ -2264,7 +2272,7 @@ static int cma_connect_iw(struct rdma_id_private *id_priv, sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr; cm_id->remote_addr = *sin; - ret = cma_modify_qp_rtr(&id_priv->id); + ret = cma_modify_qp_rtr(id_priv); if (ret) goto out; @@ -2331,7 +2339,7 @@ static int cma_accept_ib(struct rdma_id_private *id_priv, int qp_attr_mask, ret; if (id_priv->id.qp) { - ret = cma_modify_qp_rtr(&id_priv->id); + ret = cma_modify_qp_rtr(id_priv); if (ret) goto out; @@ -2370,7 +2378,7 @@ static int cma_accept_iw(struct rdma_id_private *id_priv, struct iw_cm_conn_param iw_param; int ret; - ret = cma_modify_qp_rtr(&id_priv->id); + ret = cma_modify_qp_rtr(id_priv); if (ret) return ret; @@ -2442,7 +2450,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) return 0; reject: - cma_modify_qp_err(id); + cma_modify_qp_err(id_priv); rdma_reject(id, NULL, 0); return ret; } @@ -2512,7 +2520,7 @@ int rdma_disconnect(struct rdma_cm_id *id) switch (rdma_node_get_transport(id->device->node_type)) { case RDMA_TRANSPORT_IB: - ret = cma_modify_qp_err(id); + ret = cma_modify_qp_err(id_priv); if (ret) goto out; /* Initiate or respond to a disconnect. 
*/ @@ -2543,9 +2551,11 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast) cma_disable_remove(id_priv, CMA_ADDR_RESOLVED)) return 0; + mutex_lock(&id_priv->qp_mutex); if (!status && id_priv->id.qp) status = ib_attach_mcast(id_priv->id.qp, &multicast->rec.mgid, multicast->rec.mlid); + mutex_unlock(&id_priv->qp_mutex); memset(&event, 0, sizeof event); event.status = status; @@ -2757,16 +2767,12 @@ static void cma_process_remove(struct cma_device *cma_dev) id_priv = list_entry(cma_dev->id_list.next, struct rdma_id_private, list); - if (cma_internal_listen(id_priv)) { - cma_destroy_listen(id_priv); - continue; - } - + list_del(&id_priv->listen_list); list_del_init(&id_priv->list); atomic_inc(&id_priv->refcount); mutex_unlock(&lock); - ret = cma_remove_id_dev(id_priv); + ret = id_priv->internal_id ? 1 : cma_remove_id_dev(id_priv); cma_deref_id(id_priv); if (ret) rdma_destroy_id(&id_priv->id); From akpm at linux-foundation.org Fri Oct 12 18:10:19 2007 From: akpm at linux-foundation.org (Andrew Morton) Date: Fri, 12 Oct 2007 18:10:19 -0700 Subject: [ofa-general] Re: [GIT PULL] please pull infiniband.git for-linus In-Reply-To: <20071011.181719.78707713.davem@davemloft.net> References: <20071011.181719.78707713.davem@davemloft.net> Message-ID: <20071012181019.3bb138cd.akpm@linux-foundation.org> On Thu, 11 Oct 2007 18:17:19 -0700 (PDT) David Miller wrote: > From: Roland Dreier > Date: Thu, 11 Oct 2007 18:08:52 -0700 > > > This will get the batch of changes queued up for the 2.6.24 merge > > window (although I still have a few more things to merge later, once > > Dave Miller's networking tree has landed too): > > Roland are you absolutely sure this won't create merge conflicts with > my 8MB net-2.6 merge, inside of which there are many infiniband > driver changes? I'd have told him if there were any such problems. 
There might of course be runtime problems, but I'm sure the infiniband developers are testing -mm kernels so that any such problems will be picked up beforehand (heh, I kill me).

From kliteyn at mellanox.co.il Fri Oct 12 22:11:59 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 13 Oct 2007 07:11:59 +0200
Subject: [ofa-general] nightly osm_sim report 2007-10-13:normal completion
Message-ID:

OSM Simulation Regression Summary

[Generated mail - please do NOT reply]

OpenSM binary date = 2007-10-12
OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=520 Pass=520 Fail=0

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:

From vlad at lists.openfabrics.org Sat Oct 13 02:55:04 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 13 Oct 2007 02:55:04 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20071013-0200 daily build status
Message-ID: <20071013095504.6E734E60881@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From hal.rosenstock at gmail.com Sat Oct 13 05:23:21 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 13 Oct 2007 08:23:21 -0400
Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace
In-Reply-To: <470D9895.mail1ZJ11FSUP@systemfabricworks.com>
References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com>
Message-ID:

Hi Steve,

On 10/10/07, swelch at systemfabricworks.com wrote:
>
>
> Sean, Roland,
>
> This patch [v3] replaces the [v2] patch; it
includes those changes but renames > the smi function testing returning SMP requests to the name Hal recommends. > > This patch allows userspace DR SMP responses to be looped back and delivered > to a local mad agent by the management stack. > > Thanks, Steve Looks pretty good. A few things below and a couple of nits embedded: I think the original description was more detailed and should be added to the above: The local loopback of an outgoing DR SMP response is limited to those that originate at the driver specific SMA implementation during the drivers process_mad() function. This patch enables the DR SMP response originating in user space (or elsewhere) to be delivered back up the stack on the same node. In this case the driver specific process_mad() function does not consume or process the MAD so it must be manually copied to the MAD buffer which is to be handed off to a local agent. > Signed-off-by: Steve Welch My main concern is verifying this with the various HCA drivers (Mellanox (in normal HCA mode), iPath, and eHCA) as well as switches (Suri, can you try this ?) in addition to running this on a node where OpenSM resides (Sasha, can you try this ?). How much of this have you done ? Thanks. > --- > drivers/infiniband/core/mad.c | 6 +++--- > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > 2 files changed, 20 insertions(+), 4 deletions(-) > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..98148d6 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > } > > /* Check to post send on QP or process locally */ > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && Should this routine now be named smi_check_local_outgoing_smp for consistency ? 
> + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) > goto out; > > local = kmalloc(sizeof *local, GFP_ATOMIC); > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > mad_agent_priv->agent.port_num); > if (port_priv) { > - mad_priv->mad.mad.mad_hdr.tid = > - ((struct ib_mad *)smp)->mad_hdr.tid; > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > recv_mad_agent = find_mad_agent(port_priv, > &mad_priv->mad.mad); > } > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > index 1cfc298..aff96ba 100644 > --- a/drivers/infiniband/core/smi.h > +++ b/drivers/infiniband/core/smi.h > @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, > u8 node_type, int port_num); > > /* > - * Return 1 if the SMP should be handled by the local SMA/SM via process_mad > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > */ > static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > struct ib_device *device) > @@ -71,4 +72,19 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > (smp->hop_ptr == smp->hop_cnt + 1)) ? > IB_SMI_HANDLE : IB_SMI_DISCARD); > } > + > +/* > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > + */ > +static inline enum smi_action smi_check_local_returning_smp(struct ib_smp *smp, > + struct ib_device *device) Nit. Not sure this lines up properly. > +{ > + /* C14-13:3 -- We're at the end of the DR segment of path */ > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > + return ((device->process_mad && > + ib_get_smp_direction(smp) && > + !smp->hop_ptr) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD);
> +}
> +
> #endif /* __SMI_H_ */
>

-- Hal

From swelch at systemfabricworks.com Sat Oct 13 10:18:37 2007
From: swelch at systemfabricworks.com (Steve Welch)
Date: Sat, 13 Oct 2007 12:18:37 -0500
Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace
In-Reply-To:
References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com>
Message-ID: <005001c80dbd$17eeb8c0$a865a8c0@catcher>

Hi Hal,

>
> Looks pretty good. A few things below and a couple of nits embedded:
>
> I think the original description was more detailed and should be added
> to the above:

When I submit the next revision I will update the description to put the detail back in.

> Signed-off-by: Steve Welch
>
> My main concern is verifying this with the various HCA drivers
> (Mellanox (in normal HCA mode), iPath, and eHCA) as well as switches
> (Suri, can you try this ?) in addition to running this on a node where
> OpenSM resides (Sasha, can you try this ?). How much of this have you
> done ? Thanks.
>

Good point, I think we are good with regard to the SM and mthca. I have run the code with the mthca driver loaded in non-router mode, and verified proper operation (ports can be brought up, so process_mad() is handing off SMP requests to the internal SMA, etc.). I've also run the SM on that host, again local ports are brought up and the SM is able to bring up the attached fabric. Local user space utilities like smpquery operate normally for local and remote queries using both directed route and LID routed addressing. However, I have not run on top of the iPath or eHCA.

A quick code inspection of the iPath driver indicates that the desired effect will not be achieved with that driver in every case. For the SM info attribute it looks OK and is currently handled properly. For DR SMPs with the GET_RESPONSE method the iPath driver returns IB_MAD_RESULT_FAILURE instead of IB_MAD_RESULT_SUCCESS. This will cause the core mad processing to drop the SMP MAD instead of attempting to pass it on to a local agent. Of course this iPath behavior exists with or without this patch. I'm not sure why the iPath driver considers this a failure; it does not consume or process the MAD in that case, but the MAD has passed their incoming sanity checks. The comment in this code indicates they intended to do the right thing, but are just returning the wrong status (see ipath_mad.c, process_subn()). I just don't think this is a code path that has been exercised on iPath; it requires a user space SMA sending DR SMP responses that must be looped back locally. To get consistent behavior iPath will need a change, but I do not have the hardware required to make and test that change.

I'm not sure about the eHCA driver; it appears not to implement the process_mad() IB device function.

> > }
> > +
> > +/*
> > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM
> > + * via process_mad
> > + */
> > +static inline enum smi_action smi_check_local_returning_smp(struct ib_smp *smp,
> > + struct ib_device *device)
>
> Nit. Not sure this lines up properly.
>

The function names are a little verbose and we're pushing 80 columns, so the second parameter could not line up exactly with the first without exceeding the limit. I can break the first line up if that is preferred.

Thanks for your feedback,
Steve

From sashak at voltaire.com Sat Oct 13 10:32:14 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 13 Oct 2007 19:32:14 +0200
Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/perfquery.c: Fix issues when checking PerfMgt:ClassPortInfo.CapabilityMask
In-Reply-To: <1192195857.14052.272.camel@hrosenstock-ws.xsigo.com>
References: <1192195857.14052.272.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20071013173214.GC12364@sashak.voltaire.com>

On 06:30 Fri 12 Oct , Hal Rosenstock wrote:
> infiniband-diags/perfquery.c: Fix issues when checking
> PerfMgt:ClassPortInfo.CapabilityMask
>
> 1. bit 9, if we're counting from 0, will have mask of 0x200,
> not 0x100. mask of 0x100 will be for counter aggregation according
> to IBA 1.2.
>
> 2. If capmask is 16 bit big-endian word, then we're looking
> at the wrong byte on x86, we must ntohs(*pc2) first.
>
> 3. Also, change pointer dereference with memcpy,
> e.g.:
>
> memcpy (&capmask, pc+2, sizeof(capmask));
> capmask = ntohs(capmask);
>
> Those pointer dereferences are royal pain on ia64 unless you can
> guarantee that pc is always aligned properly.
>
> Found-by: Max Matveev
>
> Compile tested only
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

I have a question below (not directly related to this specific patch).

>
> diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c
> index 2ae3281..148e452 100644
> --- a/infiniband-diags/src/perfquery.c
> +++ b/infiniband-diags/src/perfquery.c
> @@ -40,8 +40,9 @@
> #include
> #include
> #include
> +#include
>
> -#define __BUILD_VERSION_TAG__ 1.2.1
> +#define __BUILD_VERSION_TAG__ 1.2.2

What is the motivation for this change, and in general what is __BUILD_VERSION_TAG__ supposed to show? If it is just a unique build version then I guess it would be better to use the infiniband-diags version + git-describe sequence.
If it is per-tool "compat" string, then likely we don't need to change it each time when tools behavior is not changed.

Sasha

From sashak at voltaire.com Sat Oct 13 12:33:38 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 13 Oct 2007 21:33:38 +0200
Subject: [ofa-general] Re: [PATCH 1/3] osm: QoS- bug in opening policy file
In-Reply-To: <470B4314.1050702@dev.mellanox.co.il>
References: <470B4314.1050702@dev.mellanox.co.il>
Message-ID: <20071013193338.GF12364@sashak.voltaire.com>

Hi Yevgeny,

On 11:00 Tue 09 Oct , Yevgeny Kliteynik wrote:
> Fixing bug in opening QoS policy file
>
> Signed-off-by: Yevgeny Kliteynik
> ---
> opensm/opensm/osm_qos_parser.y | 8 +++++---
> 1 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y
> index e0faaaf..8e9f282 100644
> --- a/opensm/opensm/osm_qos_parser.y
> +++ b/opensm/opensm/osm_qos_parser.y
> @@ -50,6 +50,7 @@
> #include
> #include
> #include
> +#include
> #include
> #include
> #include
> @@ -129,6 +130,7 @@ extern char * __qos_parser_text;
> extern void __qos_parser_error (char *s);
> extern int __qos_parser_lex (void);
> extern FILE * __qos_parser_in;
> +extern int errno;
>
> #define RESET_BUFFER __parser_tmp_struct_reset()
>
> @@ -1750,13 +1752,13 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn)
> osm_qos_policy_destroy(p_subn->p_qos_policy);
> p_subn->p_qos_policy = NULL;
>
> - if (!stat(p_subn->opt.qos_policy_file, &statbuf)) {
> + if (stat(p_subn->opt.qos_policy_file, &statbuf)) {

Why this stat() check is needed at all?
Right after this there are fopen() - all checks could be done according to status there, right? Sasha > > if (strcmp(p_subn->opt.qos_policy_file,OSM_DEFAULT_QOS_POLICY_FILE)) { > osm_log(p_qos_parser_osm_log, OSM_LOG_ERROR, > "osm_qos_parse_policy_file: ERR AC01: " > - "QoS policy file not found (%s)\n", > - p_subn->opt.qos_policy_file); > + "Failed opening QoS policy file %s - %s\n", > + p_subn->opt.qos_policy_file, strerror(errno)); > res = 1; > } > else > -- > 1.5.1.4 > > From sashak at voltaire.com Sat Oct 13 13:25:59 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 13 Oct 2007 22:25:59 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <470B4374.6040502@dev.mellanox.co.il> References: <470B4374.6040502@dev.mellanox.co.il> Message-ID: <20071013202559.GG12364@sashak.voltaire.com> Hi Yevgeny, On 11:01 Tue 09 Oct , Yevgeny Kliteynik wrote: > Added CA-by-name hash to the QoS policy object and Why it is called "CA"-by-name? In the code below I see that hash is created for all nodes (including switches and routers). > as port names are parsed they use this hash to locate > that actual port that the name refers to. > For now I prefer to keep this hash local, so it's part > of QoS policy object. > When the same parser will be used for partitions too, > this hash will be moved to be part of the subnet object. 
> > Signed-off-by: Yevgeny Kliteynik > --- > opensm/include/opensm/osm_qos_policy.h | 3 +- > opensm/opensm/osm_qos_parser.y | 73 +++++++++++++++++++++++++++----- > opensm/opensm/osm_qos_policy.c | 36 +++++++++++++--- > 3 files changed, 94 insertions(+), 18 deletions(-) > > diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h > index 30c2e6d..5c32896 100644 > --- a/opensm/include/opensm/osm_qos_policy.h > +++ b/opensm/include/opensm/osm_qos_policy.h > @@ -49,6 +49,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t { > typedef struct _osm_qos_port_group_t { > char *name; /* single string (this port group name) */ > char *use; /* single string (description) */ > - cl_list_t port_name_list; /* list of port names (.../.../...) */ > uint8_t node_types; /* node types bitmask */ > cl_qmap_t port_map; > } osm_qos_port_group_t; > @@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t { > cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ > osm_qos_level_t *p_default_qos_level; /* default QoS level */ > osm_subn_t *p_subn; /* osm subnet object */ > + st_table * p_ca_hash; /* hash of CAs by node description */ > } osm_qos_policy_t; > > /***************************************************/ > diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y > index 2405519..cf342d3 100644 > --- a/opensm/opensm/osm_qos_parser.y > +++ b/opensm/opensm/osm_qos_parser.y > @@ -603,23 +603,74 @@ port_group_use_start: TK_USE { > > port_group_port_name: port_group_port_name_start string_list { > /* 'port-name' in 'port-group' - any num of instances */ > - cl_list_iterator_t list_iterator; > - char * tmp_str; > - > - list_iterator = cl_list_head(&tmp_parser_struct.str_list); > - while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) > + cl_list_iterator_t list_iterator; > + osm_node_t * p_node; > + osm_physp_t * p_physp; > + unsigned port_num; > 
+ char * name_str; > + char * tmp_str; > + char * host_str; > + char * ca_str; > + char * port_str; > + char * node_desc = (char*)malloc(IB_NODE_DESCRIPTION_SIZE + 1); > + > + /* parsing port name strings */ > + for (list_iterator = cl_list_head(&tmp_parser_struct.str_list); > + list_iterator != cl_list_end(&tmp_parser_struct.str_list); > + list_iterator = cl_list_next(list_iterator)) > { > tmp_str = (char*)cl_list_obj(list_iterator); > + if (tmp_str && *tmp_str) > + { > + name_str = tmp_str; > + host_str = strtok (name_str,"/"); > + ca_str = strtok (NULL, "/"); > + port_str = strtok (NULL, "/"); > + > + if (!host_str || !(*host_str) || > + !ca_str || !(*ca_str) || > + !port_str || !(*port_str) || > + (port_str[0] != 'p' && port_str[0] != 'P')) { > + yyerror("illegal port name"); > + free(tmp_str); > + free(node_desc); > + cl_list_remove_all(&tmp_parser_struct.str_list); > + return 1; > + } > > - /* > - * TODO: parse port name strings > - */ > + if (!(port_num = strtoul(&port_str[1],NULL,0))) { > + yyerror("illegal port number in port name"); > + free(tmp_str); > + free(node_desc); > + cl_list_remove_all(&tmp_parser_struct.str_list); > + return 1; > + } > > - if (tmp_str) > - cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); > - list_iterator = cl_list_next(list_iterator); > + sprintf(node_desc,"%s %s",host_str,ca_str); > + free(tmp_str); > + > + if (st_lookup(p_qos_policy->p_ca_hash, > + (st_data_t)node_desc, > + (st_data_t*)&p_node)) I am not following this. Hash key is generated as "host_str ca_str", but below where hash table is filled NodeDescription string is used. Why this should be same? > + { > + /* we found the node, now get the right port */ > + CL_ASSERT(p_node); Why this CL_ASSERT() needed? 
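As a side note, the "host/ca/Pn" splitting done in the hunk above can be looked at in isolation with a small stand-alone sketch. The function parse_port_name() is hypothetical and not the parser's actual code; the trailing-garbage check and the upper bound of 255 (assuming 8-bit port numbers) are additions made here for illustration. Like the hunk, it rejects port 0 and modifies the input string through strtok().

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-alone parser for "host/ca/Pn" port names.
 * Modifies name in place (strtok), like the hunk under review.
 * Returns 0 on success, -1 on a malformed name.
 */
static int parse_port_name(char *name, char **host, char **ca,
			   unsigned *port_num)
{
	char *port, *end;
	unsigned long n;

	/* strtok() never returns an empty token, so NULL checks suffice. */
	*host = strtok(name, "/");
	*ca = strtok(NULL, "/");
	port = strtok(NULL, "/");

	if (!*host || !*ca || !port)
		return -1;
	if (port[0] != 'p' && port[0] != 'P')
		return -1;

	errno = 0;
	n = strtoul(port + 1, &end, 0);
	/* Reject "P", "P1x", overflow, values above 255, and port 0. */
	if (end == port + 1 || *end != '\0' || errno || n == 0 || n > 255)
		return -1;

	*port_num = (unsigned)n;
	return 0;
}
```

Given a writable copy of "host1/mthca0/P2", this yields host "host1", CA "mthca0" and port number 2; the lookup key would then be built from the first two parts, which is exactly where the question about matching NodeDescription arises.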
> + p_physp = osm_node_get_physp_ptr(p_node, port_num); > + if (!p_physp) { > + yyerror("port number out of range in port name"); > + free(tmp_str); > + free(node_desc); > + cl_list_remove_all(&tmp_parser_struct.str_list); > + return 1; > + } > + /* we found the port, now add it to guid table */ > + __parser_add_port_to_port_map(&p_current_port_group->port_map, > + p_physp); > + } > + } > } > cl_list_remove_all(&tmp_parser_struct.str_list); > + free(node_desc); > } > ; > > diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c > index 51dd7b9..0d7235f 100644 > --- a/opensm/opensm/osm_qos_policy.c > +++ b/opensm/opensm/osm_qos_policy.c > @@ -59,6 +59,31 @@ > /*************************************************** > ***************************************************/ > > +static void > +__build_cabyname_hash(osm_qos_policy_t * p_qos_policy) > +{ > + osm_node_t * p_node; > + cl_qmap_t * p_node_guid_tbl = &p_qos_policy->p_subn->node_guid_tbl; > + > + p_qos_policy->p_ca_hash = st_init_strtable(); > + CL_ASSERT(p_qos_policy->p_ca_hash); > + > + if (!p_node_guid_tbl || !cl_qmap_count(p_node_guid_tbl)) > + return; > + > + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); > + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); > + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { > + if (p_node->node_info.node_type == IB_NODE_TYPE_CA) > + st_insert(p_qos_policy->p_ca_hash, > + (st_data_t)p_node->print_desc, > + (st_data_t)p_node); Hmm, why do you think NodeDescription will be unique for each node in a fabric? 
Sasha > + } > +} > + > +/*************************************************** > + ***************************************************/ > + > static boolean_t > __is_num_in_range_arr(uint64_t ** range_arr, > unsigned range_arr_len, uint64_t num) > @@ -127,8 +152,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() > return NULL; > > memset(p, 0, sizeof(osm_qos_port_group_t)); > - > - cl_list_init(&p->port_name_list, 10); > cl_qmap_init(&p->port_map); > > return p; > @@ -150,10 +173,6 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) > if (p->use) > free(p->use); > > - cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); > - cl_list_remove_all(&p->port_name_list); > - cl_list_destroy(&p->port_name_list); > - > p_port = (osm_qos_port_t *) cl_qmap_head(&p->port_map); > while (p_port != (osm_qos_port_t *) cl_qmap_end(&p->port_map)) > { > @@ -423,6 +442,8 @@ osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) > cl_list_init(&p_qos_policy->qos_match_rules, 10); > > p_qos_policy->p_subn = p_subn; > + __build_cabyname_hash(p_qos_policy); > + > return p_qos_policy; > } > > @@ -495,6 +516,9 @@ void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy) > cl_list_remove_all(&p_qos_policy->qos_match_rules); > cl_list_destroy(&p_qos_policy->qos_match_rules); > > + if (p_qos_policy->p_ca_hash) > + st_free_table(p_qos_policy->p_ca_hash); > + > free(p_qos_policy); > > p_qos_policy = NULL; > -- > 1.5.1.4 > > From sashak at voltaire.com Sat Oct 13 13:35:49 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 13 Oct 2007 22:35:49 +0200 Subject: [ofa-general] Re: [PATCH 2/3] osm: QoS - fixing memory leaks In-Reply-To: <470B4336.9000207@dev.mellanox.co.il> References: <470B4336.9000207@dev.mellanox.co.il> Message-ID: <20071013203549.GH12364@sashak.voltaire.com> On 11:00 Tue 09 Oct , Yevgeny Kliteynik wrote: > Fixing bunch of memory leaks and pointer mismatches in QoS. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. 
Sasha From sashak at voltaire.com Sat Oct 13 14:02:39 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 13 Oct 2007 23:02:39 +0200 Subject: [ofa-general] Re: OpenSM prints guids twice In-Reply-To: <4709E55B.8070901@dev.mellanox.co.il> References: <4709E55B.8070901@dev.mellanox.co.il> Message-ID: <20071013210239.GJ12364@sashak.voltaire.com> Hi Yevgeny, On 10:07 Mon 08 Oct , Yevgeny Kliteynik wrote: > > I noticed the following problem a while ago - when the whole > duplicated guids and re-reading files mails were running, > but never had a chance to dig deeper. > > Anyway, sometimes OpenSM 'sees' the same HCA ports twice. It is just how osm_state_mgr_report() is done - it iterates nodes by port_guid_tbl map and not by node_guid_tbl. I have no idea why it was done this way, likely just a bug. Anyway the patch below fixes this. Sasha commit b272c11fa910And07a0b02d5544ea75507f69515c Author: Sasha Khapyorsky Date: Mon Oct 8 15:02:54 2007 +0200 opensm: report message fix Generate OpenSM report message node by node (not by ports), and so eliminate duplicated nodes reporting. Signed-off-by: Sasha Khapyorsky diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index e5ef89d..4646c8a 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1169,7 +1169,6 @@ static void __osm_topology_file_create(IN osm_state_mgr_t * const p_mgr) static void __osm_state_mgr_report(IN osm_state_mgr_t * const p_mgr) { const cl_qmap_t *p_tbl; - const osm_port_t *p_port; const osm_node_t *p_node; const osm_physp_t *p_physp; const osm_physp_t *p_remote_physp; @@ -1191,23 +1190,22 @@ static void __osm_state_mgr_report(IN osm_state_mgr_t * const p_mgr) ": # : Sta : LID : LMC : MTU : LWA : LSA : Port GUID " " : Neighbor Port (Port #)\n"); - p_tbl = &p_mgr->p_subn->port_guid_tbl; + p_tbl = &p_mgr->p_subn->node_guid_tbl; /* * Hold lock non-exclusively while we perform these read-only operations. 
*/ CL_PLOCK_ACQUIRE(p_mgr->p_lock); - p_port = (osm_port_t *) cl_qmap_head(p_tbl); - while (p_port != (osm_port_t *) cl_qmap_end(p_tbl)) { + p_node = (osm_node_t *) cl_qmap_head(p_tbl); + while (p_node != (osm_node_t *) cl_qmap_end(p_tbl)) { if (osm_log_is_active(p_mgr->p_log, OSM_LOG_DEBUG)) { osm_log(p_mgr->p_log, OSM_LOG_DEBUG, "__osm_state_mgr_report: " - "Processing port 0x%016" PRIx64 "\n", - cl_ntoh64(osm_port_get_guid(p_port))); + "Processing node 0x%016" PRIx64 "\n", + cl_ntoh64(osm_node_get_node_guid(p_node))); } - p_node = p_port->p_node; node_type = osm_node_get_type(p_node); if (node_type == IB_NODE_TYPE_SWITCH) start_port = 0; @@ -1311,7 +1309,7 @@ static void __osm_state_mgr_report(IN osm_state_mgr_t * const p_mgr) osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, "------------------------------------------------------" "------------------------------------------------\n"); - p_port = (osm_port_t *) cl_qmap_next(&p_port->map_item); + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item); } CL_PLOCK_RELEASE(p_mgr->p_lock); From sashak at voltaire.com Sat Oct 13 14:03:57 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 13 Oct 2007 23:03:57 +0200 Subject: [ofa-general] [PATCH] opensm: move osm_state_mgr dumpers to osm_dump.c In-Reply-To: <20071013210239.GJ12364@sashak.voltaire.com> References: <4709E55B.8070901@dev.mellanox.co.il> <20071013210239.GJ12364@sashak.voltaire.com> Message-ID: <20071013210357.GK12364@sashak.voltaire.com> This moves osm_state_mgr dumpers (__osm_topology_file_create() and __osm_state_mgr_report()) to osm_dump.c. 
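The shape of the moved dumpers, one callback invoked per node by a generic table walk, can be sketched roughly as follows. All types and names here (struct node, struct dump_context, dump_nodes(), print_node()) are simplified, hypothetical stand-ins and not the OpenSM definitions; a plain array takes the place of the node map.

```c
#include <stdio.h>

/* Hypothetical stand-ins for OpenSM's node and dump-context types. */
struct node {
	unsigned long long guid;
	unsigned num_ports;
};

struct dump_context {
	FILE *file;		/* may be NULL when only counting */
	unsigned nodes_seen;
};

typedef void (*node_dump_fn)(const struct node *n, struct dump_context *cxt);

/* Visit every node exactly once and hand it to the dumper callback. */
static void dump_nodes(const struct node *tbl, size_t count,
		       node_dump_fn fn, struct dump_context *cxt)
{
	size_t i;

	for (i = 0; i < count; i++)
		fn(&tbl[i], cxt);
}

/* Example dumper: one line per node, so multi-port nodes are not repeated. */
static void print_node(const struct node *n, struct dump_context *cxt)
{
	cxt->nodes_seen++;
	if (cxt->file)
		fprintf(cxt->file, "node 0x%016llx ports %u\n",
			n->guid, n->num_ports);
}
```

Because the walk is keyed by node rather than by port, a two-port HCA is handed to the callback once, which is the property the earlier report fix relied on.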
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_dump.c | 223 ++++++++++++++++++++++++++++ opensm/opensm/osm_state_mgr.c | 329 ----------------------------------------- 2 files changed, 223 insertions(+), 329 deletions(-) diff --git a/opensm/opensm/osm_dump.c b/opensm/opensm/osm_dump.c index 7935c16..b7d99b2 100644 --- a/opensm/opensm/osm_dump.c +++ b/opensm/opensm/osm_dump.c @@ -370,6 +370,213 @@ static void dump_ucast_lfts(cl_map_item_t * p_map_item, void *cxt) fprintf(file, "%u lids dumped\n", max_lid); } +static void dump_topology_node(cl_map_item_t * p_map_item, void *cxt) +{ + osm_node_t *p_node = (osm_node_t *) p_map_item; + FILE *file = ((struct dump_context *)cxt)->file; + uint32_t cPort; + osm_node_t *p_nbnode; + osm_physp_t *p_physp, *p_default_physp, *p_rphysp; + uint8_t link_speed_act; + + if (!p_node->node_info.num_ports) + return; + + for (cPort = 1; cPort < osm_node_get_num_physp(p_node); cPort++) { + uint8_t port_state; + + p_physp = osm_node_get_physp_ptr(p_node, cPort); + if (!osm_physp_is_valid(p_physp)) + continue; + + p_rphysp = p_physp->p_remote_physp; + if (!p_rphysp || !osm_physp_is_valid(p_rphysp)) + continue; + + CL_ASSERT(cPort == p_physp->port_num); + + if (p_node->node_info.node_type == IB_NODE_TYPE_SWITCH) + p_default_physp = osm_node_get_physp_ptr(p_node, 0); + else + p_default_physp = p_physp; + + fprintf(file, "{ %s%s Ports:%02X" + " SystemGUID:%016" PRIx64 + " NodeGUID:%016" PRIx64 + " PortGUID:%016" PRIx64 + " VenID:%06X DevID:%04X Rev:%08X {%s} LID:%04X PN:%02X } ", + p_node->node_info.node_type == IB_NODE_TYPE_SWITCH ? + "SW" : p_node->node_info.node_type == + IB_NODE_TYPE_CA ? "CA" : p_node->node_info.node_type == + IB_NODE_TYPE_ROUTER ? "Rt" : "**", + p_default_physp->port_info.base_lid == + p_default_physp->port_info. + master_sm_base_lid ? 
"-SM" : "", + p_node->node_info.num_ports, + cl_ntoh64(p_node->node_info.sys_guid), + cl_ntoh64(p_node->node_info.node_guid), + cl_ntoh64(p_physp->port_guid), + cl_ntoh32(ib_node_info_get_vendor_id + (&p_node->node_info)), + cl_ntoh16(p_node->node_info.device_id), + cl_ntoh32(p_node->node_info.revision), + p_node->print_desc, + cl_ntoh16(p_default_physp->port_info.base_lid), cPort); + + p_nbnode = p_rphysp->p_node; + + if (p_nbnode->node_info.node_type == IB_NODE_TYPE_SWITCH) + p_default_physp = osm_node_get_physp_ptr(p_nbnode, 0); + else + p_default_physp = p_rphysp; + + fprintf(file, "{ %s%s Ports:%02X" + " SystemGUID:%016" PRIx64 + " NodeGUID:%016" PRIx64 + " PortGUID:%016" PRIx64 + " VenID:%08X DevID:%04X Rev:%08X {%s} LID:%04X PN:%02X } ", + p_nbnode->node_info.node_type == IB_NODE_TYPE_SWITCH ? + "SW" : p_nbnode->node_info.node_type == + IB_NODE_TYPE_CA ? "CA" : + p_nbnode->node_info.node_type == IB_NODE_TYPE_ROUTER ? + "Rt" : "**", + p_default_physp->port_info.base_lid == + p_default_physp->port_info. + master_sm_base_lid ? "-SM" : "", + p_nbnode->node_info.num_ports, + cl_ntoh64(p_nbnode->node_info.sys_guid), + cl_ntoh64(p_nbnode->node_info.node_guid), + cl_ntoh64(p_rphysp->port_guid), + cl_ntoh32(ib_node_info_get_vendor_id + (&p_nbnode->node_info)), + cl_ntoh32(p_nbnode->node_info.device_id), + cl_ntoh32(p_nbnode->node_info.revision), + p_nbnode->print_desc, + cl_ntoh16(p_default_physp->port_info.base_lid), + p_rphysp->port_num); + + port_state = ib_port_info_get_port_state(&p_physp->port_info); + link_speed_act = + ib_port_info_get_link_speed_active(&p_physp->port_info); + + fprintf(file, "PHY=%s LOG=%s SPD=%s\n", + p_physp->port_info.link_width_active == 1 ? "1x" : + p_physp->port_info.link_width_active == 2 ? "4x" : + p_physp->port_info.link_width_active == 8 ? "12x" : + "??", + port_state == IB_LINK_ACTIVE ? "ACT" : + port_state == IB_LINK_ARMED ? "ARM" : + port_state == IB_LINK_INIT ? "INI" : "DWN", + link_speed_act == 1 ? 
"2.5" : + link_speed_act == 2 ? "5" : + link_speed_act == 4 ? "10" : "??"); + } +} + +static void print_node_report(cl_map_item_t * p_map_item, void *cxt) +{ + osm_node_t *p_node = (osm_node_t *) p_map_item; + osm_opensm_t *osm = ((struct dump_context *)cxt)->p_osm; + osm_log_t *log = &osm->log; + const osm_physp_t *p_physp, *p_remote_physp; + const ib_port_info_t *p_pi; + uint8_t port_num; + uint32_t num_ports; + uint8_t node_type; + + if (osm_log_is_active(log, OSM_LOG_DEBUG)) + osm_log(log, OSM_LOG_DEBUG, "__osm_state_mgr_report: " + "Processing node 0x%016" PRIx64 "\n", + cl_ntoh64(osm_node_get_node_guid(p_node))); + + node_type = osm_node_get_type(p_node); + + num_ports = osm_node_get_num_physp(p_node); + port_num = node_type == IB_NODE_TYPE_SWITCH ? 0 : 1; + for (; port_num < num_ports; port_num++) { + p_physp = osm_node_get_physp_ptr(p_node, port_num); + if (!osm_physp_is_valid(p_physp)) + continue; + + osm_log_printf(log, OSM_LOG_VERBOSE, "%-11s : %s : %02X :", + osm_get_manufacturer_str(cl_ntoh64 + (osm_node_get_node_guid + (p_node))), + osm_get_node_type_str_fixed_width + (node_type), port_num); + + p_pi = &p_physp->port_info; + + /* + * Port state is not defined for switch port 0 + */ + if (port_num == 0) + osm_log_printf(log, OSM_LOG_VERBOSE, " :"); + else + osm_log_printf(log, OSM_LOG_VERBOSE, " %s :", + osm_get_port_state_str_fixed_width + (ib_port_info_get_port_state(p_pi))); + + /* + * LID values are only meaningful in select cases. 
+ */ + if (ib_port_info_get_port_state(p_pi) != IB_LINK_DOWN + && ((node_type == IB_NODE_TYPE_SWITCH && port_num == 0) + || node_type != IB_NODE_TYPE_SWITCH)) + osm_log_printf(log, OSM_LOG_VERBOSE, " %04X : %01X :", + cl_ntoh16(p_pi->base_lid), + ib_port_info_get_lmc(p_pi)); + else + osm_log_printf(log, OSM_LOG_VERBOSE, " : :"); + + if (port_num != 0) + osm_log_printf(log, OSM_LOG_VERBOSE, " %s : %s : %s ", + osm_get_mtu_str + (ib_port_info_get_neighbor_mtu(p_pi)), + osm_get_lwa_str(p_pi->link_width_active), + osm_get_lsa_str + (ib_port_info_get_link_speed_active + (p_pi))); + else + osm_log_printf(log, OSM_LOG_VERBOSE, + " : : "); + + if (osm_physp_get_port_guid(p_physp) == osm->subn.sm_port_guid) + osm_log_printf(log, OSM_LOG_VERBOSE, + "* %016" PRIx64 " *", + cl_ntoh64(osm_physp_get_port_guid + (p_physp))); + else + osm_log_printf(log, OSM_LOG_VERBOSE, + ": %016" PRIx64 " :", + cl_ntoh64(osm_physp_get_port_guid + (p_physp))); + + if (port_num + && (ib_port_info_get_port_state(p_pi) != IB_LINK_DOWN)) { + p_remote_physp = osm_physp_get_remote(p_physp); + if (p_remote_physp + && osm_physp_is_valid(p_remote_physp)) + osm_log_printf(log, OSM_LOG_VERBOSE, + " %016" PRIx64 " (%02X)", + cl_ntoh64 + (osm_physp_get_port_guid + (p_remote_physp)), + osm_physp_get_port_num + (p_remote_physp)); + else + osm_log_printf(log, OSM_LOG_VERBOSE, + " UNKNOWN"); + } + + osm_log_printf(log, OSM_LOG_VERBOSE, "\n"); + } + + osm_log_printf(log, OSM_LOG_VERBOSE, + "------------------------------------------------------" + "------------------------------------------------\n"); +} + /********************************************************************** **********************************************************************/ static void dump_qmap(osm_opensm_t * p_osm, FILE * file, @@ -410,6 +617,18 @@ static void dump_qmap_to_file(osm_opensm_t * p_osm, const char *file_name, /********************************************************************** 
**********************************************************************/ +static void print_report(osm_opensm_t * osm) +{ + osm_log_printf(&osm->log, OSM_LOG_VERBOSE, + "\n===================================================" + "====================================================" + "\nVendor : Ty " + ": # : Sta : LID : LMC : MTU : LWA : LSA : Port GUID " + " : Neighbor Port (Port #)\n"); + + dump_qmap(osm, NULL, &osm->subn.node_guid_tbl, print_node_report); +} + void osm_dump_mcast_routes(osm_opensm_t * osm) { if (osm_log_is_active(&osm->log, OSM_LOG_ROUTING)) { @@ -436,4 +655,8 @@ void osm_dump_all(osm_opensm_t * osm) dump_qmap_to_file(osm, "opensm.mcfdbs", &osm->subn.sw_guid_tbl, dump_mcast_routes); } + dump_qmap_to_file(osm, "opensm-subnet.lst", &osm->subn.node_guid_tbl, + dump_topology_node); + if (osm_log_is_active(&osm->log, OSM_LOG_VERBOSE)) + print_report(osm); } diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index cd1e4c0..d0ce37d 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -69,8 +69,6 @@ #include #include -#define SUBNET_LIST_FILENAME "/opensm-subnet.lst" - osm_signal_t osm_qos_setup(IN osm_opensm_t * p_osm); /********************************************************************** @@ -988,331 +986,6 @@ static ib_api_status_t __osm_state_mgr_light_sweep_start(IN osm_state_mgr_t * /********************************************************************** **********************************************************************/ -static void __osm_topology_file_create(IN osm_state_mgr_t * const p_mgr) -{ - const osm_node_t *p_node; - char *file_name; - FILE *rc; - - OSM_LOG_ENTER(p_mgr->p_log, __osm_topology_file_create); - - CL_PLOCK_ACQUIRE(p_mgr->p_lock); - - file_name = (char *)malloc(strlen(p_mgr->p_subn->opt.dump_files_dir) - + strlen(SUBNET_LIST_FILENAME) + 1); - - CL_ASSERT(file_name); - - strcpy(file_name, p_mgr->p_subn->opt.dump_files_dir); - strcat(file_name, SUBNET_LIST_FILENAME); - - if 
((rc = fopen(file_name, "w")) == NULL) { - osm_log(p_mgr->p_log, OSM_LOG_DEBUG, - "__osm_topology_file_create: " - "fopen failed for file:%s\n", file_name); - - CL_PLOCK_RELEASE(p_mgr->p_lock); - goto Exit; - } - - p_node = (osm_node_t *) cl_qmap_head(&p_mgr->p_subn->node_guid_tbl); - while (p_node != - (osm_node_t *) cl_qmap_end(&p_mgr->p_subn->node_guid_tbl)) { - if (p_node->node_info.num_ports) { - uint32_t cPort; - osm_node_t *p_nbnode; - osm_physp_t *p_physp; - osm_physp_t *p_default_physp; - osm_physp_t *p_rphysp; - uint8_t link_speed_act; - - for (cPort = 1; cPort < osm_node_get_num_physp(p_node); - cPort++) { - uint8_t port_state; - - p_physp = osm_node_get_physp_ptr(p_node, cPort); - - if (!osm_physp_is_valid(p_physp)) - continue; - - p_rphysp = p_physp->p_remote_physp; - - if ((p_rphysp == NULL) - || (!osm_physp_is_valid(p_rphysp))) - continue; - - CL_ASSERT(cPort == p_physp->port_num); - - if (p_node->node_info.node_type == - IB_NODE_TYPE_SWITCH) { - p_default_physp = - osm_node_get_physp_ptr(p_node, 0); - } else { - p_default_physp = p_physp; - } - - fprintf(rc, "{ %s%s Ports:%02X" - " SystemGUID:%016" PRIx64 - " NodeGUID:%016" PRIx64 - " PortGUID:%016" PRIx64 - " VenID:%06X DevID:%04X Rev:%08X {%s} LID:%04X PN:%02X } ", - (p_node->node_info.node_type == - IB_NODE_TYPE_SWITCH) ? "SW" : (p_node-> - node_info. - node_type - == - IB_NODE_TYPE_CA) - ? "CA" : (p_node->node_info.node_type == - IB_NODE_TYPE_ROUTER) ? "Rt" : - "**", - (p_default_physp->port_info.base_lid == - p_default_physp->port_info. - master_sm_base_lid) ? "-SM" : "", - p_node->node_info.num_ports, - cl_ntoh64(p_node->node_info.sys_guid), - cl_ntoh64(p_node->node_info.node_guid), - cl_ntoh64(p_physp->port_guid), - cl_ntoh32(ib_node_info_get_vendor_id - (&p_node->node_info)), - cl_ntoh16(p_node->node_info.device_id), - cl_ntoh32(p_node->node_info.revision), - p_node->print_desc, - cl_ntoh16(p_default_physp->port_info. 
- base_lid), cPort); - - p_nbnode = p_rphysp->p_node; - - if (p_nbnode->node_info.node_type == - IB_NODE_TYPE_SWITCH) { - p_default_physp = - osm_node_get_physp_ptr(p_nbnode, 0); - } else { - p_default_physp = p_rphysp; - } - - fprintf(rc, "{ %s%s Ports:%02X" - " SystemGUID:%016" PRIx64 - " NodeGUID:%016" PRIx64 - " PortGUID:%016" PRIx64 - " VenID:%08X DevID:%04X Rev:%08X {%s} LID:%04X PN:%02X } ", - (p_nbnode->node_info.node_type == - IB_NODE_TYPE_SWITCH) ? "SW" - : (p_nbnode->node_info.node_type == - IB_NODE_TYPE_CA) ? "CA" : (p_nbnode-> - node_info. - node_type - == - IB_NODE_TYPE_ROUTER) - ? "Rt" : "**", - (p_default_physp->port_info.base_lid == - p_default_physp->port_info. - master_sm_base_lid) ? "-SM" : "", - p_nbnode->node_info.num_ports, - cl_ntoh64(p_nbnode->node_info.sys_guid), - cl_ntoh64(p_nbnode->node_info. - node_guid), - cl_ntoh64(p_rphysp->port_guid), - cl_ntoh32(ib_node_info_get_vendor_id - (&p_nbnode->node_info)), - cl_ntoh32(p_nbnode->node_info. - device_id), - cl_ntoh32(p_nbnode->node_info.revision), - p_nbnode->print_desc, - cl_ntoh16(p_default_physp->port_info. - base_lid), - p_rphysp->port_num); - - port_state = - ib_port_info_get_port_state(&p_physp-> - port_info); - link_speed_act = - ib_port_info_get_link_speed_active - (&p_physp->port_info); - - fprintf(rc, "PHY=%s LOG=%s SPD=%s\n", - (p_physp->port_info.link_width_active == - 1) ? "1x" : (p_physp->port_info. - link_width_active == - 2) ? "4x" : (p_physp-> - port_info. - link_width_active - == - 8) ? "12x" : - "??", - ((port_state == - IB_LINK_ACTIVE) ? "ACT" : (port_state - == - IB_LINK_ARMED) - ? "ARM" : (port_state == - IB_LINK_INIT) ? "INI" : - "DWN"), - (link_speed_act == - 1) ? "2.5" : (link_speed_act == - 2) ? "5" - : (link_speed_act == 4) ? 
"10" : "??"); - } - } - p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item); - } - - CL_PLOCK_RELEASE(p_mgr->p_lock); - - fclose(rc); - - Exit: - free(file_name); - OSM_LOG_EXIT(p_mgr->p_log); -} - -/********************************************************************** - **********************************************************************/ -static void __osm_state_mgr_report(IN osm_state_mgr_t * const p_mgr) -{ - const cl_qmap_t *p_tbl; - const osm_node_t *p_node; - const osm_physp_t *p_physp; - const osm_physp_t *p_remote_physp; - const ib_port_info_t *p_pi; - uint8_t port_num; - uint32_t num_ports; - uint8_t node_type; - - if (!osm_log_is_active(p_mgr->p_log, OSM_LOG_VERBOSE)) - return; - - OSM_LOG_ENTER(p_mgr->p_log, __osm_state_mgr_report); - - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - "\n===================================================" - "====================================================" - "\nVendor : Ty " - ": # : Sta : LID : LMC : MTU : LWA : LSA : Port GUID " - " : Neighbor Port (Port #)\n"); - - p_tbl = &p_mgr->p_subn->node_guid_tbl; - - /* - * Hold lock non-exclusively while we perform these read-only operations. - */ - - CL_PLOCK_ACQUIRE(p_mgr->p_lock); - p_node = (osm_node_t *) cl_qmap_head(p_tbl); - while (p_node != (osm_node_t *) cl_qmap_end(p_tbl)) { - if (osm_log_is_active(p_mgr->p_log, OSM_LOG_DEBUG)) - osm_log(p_mgr->p_log, OSM_LOG_DEBUG, - "__osm_state_mgr_report: " - "Processing node 0x%016" PRIx64 "\n", - cl_ntoh64(osm_node_get_node_guid(p_node))); - - node_type = osm_node_get_type(p_node); - - num_ports = osm_node_get_num_physp(p_node); - port_num = node_type == IB_NODE_TYPE_SWITCH ? 
0 : 1; - for (; port_num < num_ports; port_num++) { - p_physp = osm_node_get_physp_ptr(p_node, port_num); - if (!osm_physp_is_valid(p_physp)) - continue; - - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - "%-11s : %s : %02X :", - osm_get_manufacturer_str(cl_ntoh64 - (osm_node_get_node_guid - (p_node))), - osm_get_node_type_str_fixed_width - (node_type), port_num); - - p_pi = &p_physp->port_info; - - /* - * Port state is not defined for switch port 0 - */ - if (port_num == 0) - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - " :"); - else - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - " %s :", - osm_get_port_state_str_fixed_width - (ib_port_info_get_port_state - (p_pi))); - - /* - * LID values are only meaningful in select cases. - */ - if (ib_port_info_get_port_state(p_pi) != IB_LINK_DOWN - && - ((node_type == IB_NODE_TYPE_SWITCH && port_num == 0) - || node_type != IB_NODE_TYPE_SWITCH)) - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - " %04X : %01X :", - cl_ntoh16(p_pi->base_lid), - ib_port_info_get_lmc(p_pi)); - else - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - " : :"); - - if (port_num != 0) - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - " %s : %s : %s ", - osm_get_mtu_str - (ib_port_info_get_neighbor_mtu - (p_pi)), - osm_get_lwa_str(p_pi-> - link_width_active), - osm_get_lsa_str - (ib_port_info_get_link_speed_active - (p_pi))); - else - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - " : : "); - - if (osm_physp_get_port_guid(p_physp) == - p_mgr->p_subn->sm_port_guid) - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - "* %016" PRIx64 " *", - cl_ntoh64(osm_physp_get_port_guid - (p_physp))); - else - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - ": %016" PRIx64 " :", - cl_ntoh64(osm_physp_get_port_guid - (p_physp))); - - if (port_num - && (ib_port_info_get_port_state(p_pi) != - IB_LINK_DOWN)) { - p_remote_physp = osm_physp_get_remote(p_physp); - if (p_remote_physp - && osm_physp_is_valid(p_remote_physp)) - osm_log_printf(p_mgr->p_log, - 
OSM_LOG_VERBOSE, - " %016" PRIx64 " (%02X)", - cl_ntoh64 - (osm_physp_get_port_guid - (p_remote_physp)), - osm_physp_get_port_num - (p_remote_physp)); - else - osm_log_printf(p_mgr->p_log, - OSM_LOG_VERBOSE, - " UNKNOWN"); - } - - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, "\n"); - } - - osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, - "------------------------------------------------------" - "------------------------------------------------\n"); - p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item); - } - - CL_PLOCK_RELEASE(p_mgr->p_lock); - OSM_LOG_EXIT(p_mgr->p_log); -} - -/********************************************************************** - **********************************************************************/ static void __process_idle_time_queue_done(IN osm_state_mgr_t * const p_mgr) { cl_qlist_t *p_list = &p_mgr->idle_time_list; @@ -2636,9 +2309,7 @@ void osm_state_mgr_process(IN osm_state_mgr_t * const p_mgr, } p_mgr->p_subn->need_update = 0; - __osm_topology_file_create(p_mgr); osm_dump_all(p_mgr->p_subn->p_osm); - __osm_state_mgr_report(p_mgr); __osm_state_mgr_up_msg(p_mgr); if (osm_log_is_active(p_mgr->p_log, OSM_LOG_VERBOSE)) -- 1.5.3.4.206.g58ba4 From sashak at voltaire.com Sat Oct 13 14:12:36 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 13 Oct 2007 23:12:36 +0200 Subject: [ofa-general] [PATCH] libibcommon: remove static _version_build var from common.h Message-ID: <20071013211236.GL12364@sashak.voltaire.com> This removes _version_build static variable if __BUILD_VERSION_TAG__ is not defined. When it is unconditional this variable is just duplicated over all objects compiled from source files where common.h is included (explicitly or implicitly). 
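A stripped-down sketch of the guarded form is shown below. It is a single-file illustration rather than the real common.h: the tag is defined inline here only so the sketch compiles on its own (in practice it would presumably come from the build), and __attribute__((unused)) assumes GCC or a compatible compiler.

```c
/* Assumed to be set by the build; defined here so the sketch is complete. */
#define __BUILD_VERSION_TAG__ 1.2.2

#ifdef __BUILD_VERSION_TAG__

/* Two-step expansion so the tag's value, not its name, is stringified. */
#undef stringify
#undef tostring
#define stringify(s) tostring(s)
#define tostring(s) #s

/* Only objects built with a version tag carry this static string. */
__attribute__((unused)) static char _build_version[] =
	"BUILD VERSION: " stringify(__BUILD_VERSION_TAG__)
	" Build date: " __DATE__ " " __TIME__;

__attribute__((unused)) static inline char *get_build_version(void)
{
	return _build_version;
}

#endif /* __BUILD_VERSION_TAG__ */
```

With the define removed, neither the string nor the accessor is emitted, so sources that include the header without setting a tag no longer get a private copy of the build string.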
Signed-off-by: Sasha Khapyorsky --- libibcommon/include/infiniband/common.h | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h index af4ab7a..72147d8 100644 --- a/libibcommon/include/infiniband/common.h +++ b/libibcommon/include/infiniband/common.h @@ -143,17 +143,15 @@ uint64_t getcurrenttime(void); /* hash.c */ uint32_t fhash(uint8_t *k, int length, uint32_t initval); +#ifdef __BUILD_VERSION_TAG__ + #undef stringify #undef tostring #define stringify(s) tostring(s) #define tostring(s) #s -#ifdef __BUILD_VERSION_TAG__ __attribute__((unused)) static char _build_version[] = { "BUILD VERSION: " stringify(__BUILD_VERSION_TAG__) " Build date: " __DATE__ " " __TIME__ }; -#else -__attribute__((unused)) static char _build_version[] = { __DATE__ " " __TIME__ }; -#endif __attribute__((unused)) static inline char* get_build_version(void) @@ -161,6 +159,8 @@ get_build_version(void) return _build_version; } +#endif + END_C_DECLS #endif /* __COMMON_H__ */ -- 1.5.3.4.206.g58ba4 From rdreier at cisco.com Sat Oct 13 14:17:10 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 13 Oct 2007 14:17:10 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git for-linus Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get one bug fix for a problem that completely kills the mlx4 driver: Roland Dreier (1): mlx4_core: Fix infinite loop on device initialization drivers/net/mlx4/main.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index e029b8a..89b3f0b 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -884,7 +884,7 @@ static int __devinit mlx4_init_one(struct pci_dev *pdev, 
++mlx4_version_printed; } - return mlx4_init_one(pdev, id); + return __mlx4_init_one(pdev, id); } static void mlx4_remove_one(struct pci_dev *pdev) From kliteyn at mellanox.co.il Sat Oct 13 22:24:20 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 14 Oct 2007 07:24:20 +0200 Subject: [ofa-general] nightly osm_sim report 2007-10-14:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM binary date = 2007-10-13 OpenSM git rev = Tue_Oct_2_22:28:56_2007 [d5c34ddc158599abff9f09a6cc6c8cad67745f0b] ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From kliteyn at dev.mellanox.co.il Sun Oct 14 01:53:33 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 14 Oct 2007 10:53:33 +0200 Subject: [ofa-general] Re: OpenSM prints guids twice In-Reply-To: <20071013210239.GJ12364@sashak.voltaire.com> References: <4709E55B.8070901@dev.mellanox.co.il> <20071013210239.GJ12364@sashak.voltaire.com> Message-ID: <4711D90D.1070502@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > 
> On 10:07 Mon 08 Oct , Yevgeny Kliteynik wrote: >> I noticed the following problem a while ago - when the whole >> duplicated guids and re-reading files mails were running, >> but never had a chance to dig deeper. >> >> Anyway, sometimes OpenSM 'sees' the same HCA ports twice. > > It is just how osm_state_mgr_report() is done - it iterates nodes by > port_guid_tbl map and not by node_guid_tbl. I have no idea why it was > done this way, likely just a bug. Anyway the patch below fixes this. Great, thanks. -- Yevgeny > Sasha > > > commit b272c11fa910And07a0b02d5544ea75507f69515c > Author: Sasha Khapyorsky > Date: Mon Oct 8 15:02:54 2007 +0200 > > opensm: report message fix > > Generate OpenSM report message node by node (not by ports), and so > eliminate duplicated nodes reporting. > > Signed-off-by: Sasha Khapyorsky > > diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c > index e5ef89d..4646c8a 100644 > --- a/opensm/opensm/osm_state_mgr.c > +++ b/opensm/opensm/osm_state_mgr.c > @@ -1169,7 +1169,6 @@ static void __osm_topology_file_create(IN osm_state_mgr_t * const p_mgr) > static void __osm_state_mgr_report(IN osm_state_mgr_t * const p_mgr) > { > const cl_qmap_t *p_tbl; > - const osm_port_t *p_port; > const osm_node_t *p_node; > const osm_physp_t *p_physp; > const osm_physp_t *p_remote_physp; > @@ -1191,23 +1190,22 @@ static void __osm_state_mgr_report(IN osm_state_mgr_t * const p_mgr) > ": # : Sta : LID : LMC : MTU : LWA : LSA : Port GUID " > " : Neighbor Port (Port #)\n"); > > - p_tbl = &p_mgr->p_subn->port_guid_tbl; > + p_tbl = &p_mgr->p_subn->node_guid_tbl; > > /* > * Hold lock non-exclusively while we perform these read-only operations. 
> */ > > CL_PLOCK_ACQUIRE(p_mgr->p_lock); > - p_port = (osm_port_t *) cl_qmap_head(p_tbl); > - while (p_port != (osm_port_t *) cl_qmap_end(p_tbl)) { > + p_node = (osm_node_t *) cl_qmap_head(p_tbl); > + while (p_node != (osm_node_t *) cl_qmap_end(p_tbl)) { > if (osm_log_is_active(p_mgr->p_log, OSM_LOG_DEBUG)) { > osm_log(p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_state_mgr_report: " > - "Processing port 0x%016" PRIx64 "\n", > - cl_ntoh64(osm_port_get_guid(p_port))); > + "Processing node 0x%016" PRIx64 "\n", > + cl_ntoh64(osm_node_get_node_guid(p_node))); > } > > - p_node = p_port->p_node; > node_type = osm_node_get_type(p_node); > if (node_type == IB_NODE_TYPE_SWITCH) > start_port = 0; > @@ -1311,7 +1309,7 @@ static void __osm_state_mgr_report(IN osm_state_mgr_t * const p_mgr) > osm_log_printf(p_mgr->p_log, OSM_LOG_VERBOSE, > "------------------------------------------------------" > "------------------------------------------------\n"); > - p_port = (osm_port_t *) cl_qmap_next(&p_port->map_item); > + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item); > } > > CL_PLOCK_RELEASE(p_mgr->p_lock); > From vlad at dev.mellanox.co.il Sun Oct 14 01:55:14 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 14 Oct 2007 10:55:14 +0200 Subject: [ofa-general] Re: [ewg] libcxgb3-1.0.3 available for ofed-1.2.5 and ofed-1.3 In-Reply-To: <470E977B.5080107@opengridcomputing.com> References: <470A9363.4010007@opengridcomputing.com> <470BA4D4.3080707@dev.mellanox.co.il> <470BA93C.3010601@opengridcomputing.com> <470BB8DB.8090107@dev.mellanox.co.il> <470D2C69.3000500@opengridcomputing.com> <470DDA81.4060108@dev.mellanox.co.il> <470E1A38.2020902@opengridcomputing.com> <470E977B.5080107@opengridcomputing.com> Message-ID: <4711D972.2050205@dev.mellanox.co.il> Steve Wise wrote: > Ok, can you re-pull to get the configure.in change? > > Sorry for the pain. > > Steve. > > Done. 
Regards, Vladimir From kliteyn at dev.mellanox.co.il Sun Oct 14 02:00:50 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 14 Oct 2007 11:00:50 +0200 Subject: [ofa-general] Re: [PATCH 1/3] osm: QoS- bug in opening policy file In-Reply-To: <20071013193338.GF12364@sashak.voltaire.com> References: <470B4314.1050702@dev.mellanox.co.il> <20071013193338.GF12364@sashak.voltaire.com> Message-ID: <4711DAC2.5000807@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 11:00 Tue 09 Oct , Yevgeny Kliteynik wrote: >> Fixing bug in opening QoS policy file >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/opensm/osm_qos_parser.y | 8 +++++--- >> 1 files changed, 5 insertions(+), 3 deletions(-) >> >> diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y >> index e0faaaf..8e9f282 100644 >> --- a/opensm/opensm/osm_qos_parser.y >> +++ b/opensm/opensm/osm_qos_parser.y >> @@ -50,6 +50,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -129,6 +130,7 @@ extern char * __qos_parser_text; >> extern void __qos_parser_error (char *s); >> extern int __qos_parser_lex (void); >> extern FILE * __qos_parser_in; >> +extern int errno; >> >> #define RESET_BUFFER __parser_tmp_struct_reset() >> >> @@ -1750,13 +1752,13 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn) >> osm_qos_policy_destroy(p_subn->p_qos_policy); >> p_subn->p_qos_policy = NULL; >> >> - if (!stat(p_subn->opt.qos_policy_file, &statbuf)) { >> + if (stat(p_subn->opt.qos_policy_file, &statbuf)) { > > Why this stat() check is needed at all? Right after this there are > fopen() - all checks could be done according to status there, right? Good point. 
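A minimal sketch of the approach suggested above: drop the stat() check and rely on the fopen() status alone, reporting the reason through strerror(errno). This is a standalone illustration, not the OpenSM code; open_policy_file and its error buffer arguments are hypothetical names introduced only for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of OpenSM): open a policy file for reading.
 * On failure, the reason is taken from errno as set by fopen(); no prior
 * stat() call is needed, since fopen() itself fails with ENOENT, EACCES, etc.
 */
static FILE *open_policy_file(const char *path, char *errbuf, size_t errlen)
{
	FILE *f = fopen(path, "r");

	if (!f)
		snprintf(errbuf, errlen,
			 "Failed opening QoS policy file %s - %s",
			 path, strerror(errno));
	return f;
}
```

A caller would log errbuf only when the returned pointer is NULL, which also keeps the "file is missing" and "file is unreadable" cases in a single error path.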
-- Yevgeny > Sasha > >> if (strcmp(p_subn->opt.qos_policy_file,OSM_DEFAULT_QOS_POLICY_FILE)) { >> osm_log(p_qos_parser_osm_log, OSM_LOG_ERROR, >> "osm_qos_parse_policy_file: ERR AC01: " >> - "QoS policy file not found (%s)\n", >> - p_subn->opt.qos_policy_file); >> + "Failed opening QoS policy file %s - %s\n", >> + p_subn->opt.qos_policy_file, strerror(errno)); >> res = 1; >> } >> else >> -- >> 1.5.1.4 >> >> > From kliteyn at dev.mellanox.co.il Sun Oct 14 02:03:29 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 14 Oct 2007 11:03:29 +0200 Subject: [ofa-general] [PATCH] osm: QoS - bug in opening policy file Message-ID: <4711DB61.7000900@dev.mellanox.co.il> Fixing bug in opening QoS policy file Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_qos_parser.y | 24 +++++++----------------- 1 files changed, 7 insertions(+), 17 deletions(-) diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index a5067d0..3e41cfe 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -50,7 +50,7 @@ #include #include #include -#include +#include #include #include #include @@ -129,6 +129,7 @@ extern char * __qos_parser_text; extern void __qos_parser_error (char *s); extern int __qos_parser_lex (void); extern FILE * __qos_parser_in; +extern int errno; #define RESET_BUFFER __parser_tmp_struct_reset() @@ -1741,7 +1742,6 @@ number_from_range_2: TK_NUMBER { int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn) { int res = 0; - struct stat statbuf; static boolean_t first_time = TRUE; p_qos_parser_osm_log = &p_subn->p_osm->log; @@ -1750,13 +1750,14 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn) osm_qos_policy_destroy(p_subn->p_qos_policy); p_subn->p_qos_policy = NULL; - if (!stat(p_subn->opt.qos_policy_file, &statbuf)) { - + __qos_parser_in = fopen (p_subn->opt.qos_policy_file, "r"); + if (!__qos_parser_in) + { if (strcmp(p_subn->opt.qos_policy_file,OSM_DEFAULT_QOS_POLICY_FILE)) { 
osm_log(p_qos_parser_osm_log, OSM_LOG_ERROR, "osm_qos_parse_policy_file: ERR AC01: " - "QoS policy file not found (%s)\n", - p_subn->opt.qos_policy_file); + "Failed opening QoS policy file %s - %s\n", + p_subn->opt.qos_policy_file, strerror(errno)); res = 1; } else @@ -1768,17 +1769,6 @@ int osm_qos_parse_policy_file(IN osm_subn_t * const p_subn) goto Exit; } - __qos_parser_in = fopen (p_subn->opt.qos_policy_file, "r"); - if (!__qos_parser_in) - { - osm_log(p_qos_parser_osm_log, OSM_LOG_ERROR, - "osm_qos_parse_policy_file: ERR AC02: " - "Failed opening QoS policy file (%s)\n", - p_subn->opt.qos_policy_file); - res = 1; - goto Exit; - } - if (first_time) { first_time = FALSE; -- 1.5.1.4 From sashak at voltaire.com Sun Oct 14 02:26:48 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 14 Oct 2007 11:26:48 +0200 Subject: [ofa-general] Re: [PATCH] osm: QoS - bug in opening policy file In-Reply-To: <4711DB61.7000900@dev.mellanox.co.il> References: <4711DB61.7000900@dev.mellanox.co.il> Message-ID: <20071014092648.GM12364@sashak.voltaire.com> On 11:03 Sun 14 Oct , Yevgeny Kliteynik wrote: > Fixing bug in opening QoS policy file > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha
From vlad at lists.openfabrics.org Sun Oct 14 02:57:22 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 14 Oct 2007 02:57:22 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071014-0200 daily build status Message-ID: <20071014095722.4DC12E60873@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed
on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Failed:
From kliteyn at dev.mellanox.co.il Sun Oct 14 03:24:54 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 14 Oct 2007 12:24:54 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <20071013202559.GG12364@sashak.voltaire.com> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> Message-ID: <4711EE76.4070107@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 11:01 Tue 09 Oct , Yevgeny Kliteynik wrote: >> Added CA-by-name hash to the QoS policy object and > > Why it is called "CA"-by-name? In the code below I see that hash is > created for all nodes (including switches and routers).
In osm_qos_policy.c: if (p_node->node_info.node_type == IB_NODE_TYPE_CA) st_insert(p_qos_policy->p_ca_hash, (st_data_t)p_node->print_desc, (st_data_t)p_node); >> as port names are parsed they use this hash to locate >> that actual port that the name refers to. >> For now I prefer to keep this hash local, so it's part >> of QoS policy object. >> When the same parser will be used for partitions too, >> this hash will be moved to be part of the subnet object. >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/include/opensm/osm_qos_policy.h | 3 +- >> opensm/opensm/osm_qos_parser.y | 73 +++++++++++++++++++++++++++----- >> opensm/opensm/osm_qos_policy.c | 36 +++++++++++++--- >> 3 files changed, 94 insertions(+), 18 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h >> index 30c2e6d..5c32896 100644 >> --- a/opensm/include/opensm/osm_qos_policy.h >> +++ b/opensm/include/opensm/osm_qos_policy.h >> @@ -49,6 +49,7 @@ >> >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t { >> typedef struct _osm_qos_port_group_t { >> char *name; /* single string (this port group name) */ >> char *use; /* single string (description) */ >> - cl_list_t port_name_list; /* list of port names (.../.../...) 
*/ >> uint8_t node_types; /* node types bitmask */ >> cl_qmap_t port_map; >> } osm_qos_port_group_t; >> @@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t { >> cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ >> osm_qos_level_t *p_default_qos_level; /* default QoS level */ >> osm_subn_t *p_subn; /* osm subnet object */ >> + st_table * p_ca_hash; /* hash of CAs by node description */ >> } osm_qos_policy_t; >> >> /***************************************************/ >> diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y >> index 2405519..cf342d3 100644 >> --- a/opensm/opensm/osm_qos_parser.y >> +++ b/opensm/opensm/osm_qos_parser.y >> @@ -603,23 +603,74 @@ port_group_use_start: TK_USE { >> >> port_group_port_name: port_group_port_name_start string_list { >> /* 'port-name' in 'port-group' - any num of instances */ >> - cl_list_iterator_t list_iterator; >> - char * tmp_str; >> - >> - list_iterator = cl_list_head(&tmp_parser_struct.str_list); >> - while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) >> + cl_list_iterator_t list_iterator; >> + osm_node_t * p_node; >> + osm_physp_t * p_physp; >> + unsigned port_num; >> + char * name_str; >> + char * tmp_str; >> + char * host_str; >> + char * ca_str; >> + char * port_str; >> + char * node_desc = (char*)malloc(IB_NODE_DESCRIPTION_SIZE + 1); >> + >> + /* parsing port name strings */ >> + for (list_iterator = cl_list_head(&tmp_parser_struct.str_list); >> + list_iterator != cl_list_end(&tmp_parser_struct.str_list); >> + list_iterator = cl_list_next(list_iterator)) >> { >> tmp_str = (char*)cl_list_obj(list_iterator); >> + if (tmp_str && *tmp_str) >> + { >> + name_str = tmp_str; >> + host_str = strtok (name_str,"/"); >> + ca_str = strtok (NULL, "/"); >> + port_str = strtok (NULL, "/"); >> + >> + if (!host_str || !(*host_str) || >> + !ca_str || !(*ca_str) || >> + !port_str || !(*port_str) || >> + (port_str[0] != 'p' && port_str[0] != 'P')) { >> + yyerror("illegal port name"); 
>> + free(tmp_str); >> + free(node_desc); >> + cl_list_remove_all(&tmp_parser_struct.str_list); >> + return 1; >> + } >> >> - /* >> - * TODO: parse port name strings >> - */ >> + if (!(port_num = strtoul(&port_str[1],NULL,0))) { >> + yyerror("illegal port number in port name"); >> + free(tmp_str); >> + free(node_desc); >> + cl_list_remove_all(&tmp_parser_struct.str_list); >> + return 1; >> + } >> >> - if (tmp_str) >> - cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); >> - list_iterator = cl_list_next(list_iterator); >> + sprintf(node_desc,"%s %s",host_str,ca_str); >> + free(tmp_str); >> + >> + if (st_lookup(p_qos_policy->p_ca_hash, >> + (st_data_t)node_desc, >> + (st_data_t*)&p_node)) > > I am not following this. Hash key is generated as "host_str ca_str", but > below where hash table is filled NodeDescription string is used. Why > this should be same? Because that's how node description is created. From /etc/init.d/openibd: # Add node description to sysfs IBSYSDIR="/sys/class/infiniband" if [ -d ${IBSYSDIR} ]; then declare -i hca_id=1 for hca in ${IBSYSDIR}/* do if [ -e ${hca}/node_desc ]; then echo -n "$(hostname -s) HCA-${hca_id}" >> ${hca}/node_desc fi let hca_id++ done fi >> + { >> + /* we found the node, now get the right port */ >> + CL_ASSERT(p_node); > > Why this CL_ASSERT() needed? It's not - it was just for debugging. I can remove it. 
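The mapping discussed above, from a port name of the form host/ca/Pn to the node description key "host ca" plus a port number, can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the parser code from the patch: parse_port_name is a hypothetical helper, NODE_DESC_SIZE is assumed to mirror IB_NODE_DESCRIPTION_SIZE, and the upper bound of 255 on the port number is an assumption of this sketch. Unlike the strtok() based code in the patch, it rejects empty components such as "host//P1".

```c
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed to mirror IB_NODE_DESCRIPTION_SIZE; callers would size the
 * node_desc buffer as NODE_DESC_SIZE + 1. */
#define NODE_DESC_SIZE 64

/*
 * Hypothetical helper (not OpenSM code): split "host/ca/Pn" into the node
 * description "host ca" and the port number n. The input string is left
 * untouched, so its ownership stays with the caller.
 * Returns 0 on success, -1 on a malformed name.
 */
static int parse_port_name(const char *name, char *node_desc, size_t desc_len,
			   unsigned *port_num)
{
	const char *s1, *s2;
	char *end;
	unsigned long n;
	int len;

	if (!name)
		return -1;
	s1 = strchr(name, '/');			/* end of host part */
	if (!s1 || s1 == name)
		return -1;
	s2 = strchr(s1 + 1, '/');		/* end of CA part */
	if (!s2 || s2 == s1 + 1)
		return -1;
	if (s2[1] != 'p' && s2[1] != 'P')	/* port part must be Pn or pn */
		return -1;
	if (!isdigit((unsigned char)s2[2]))
		return -1;

	errno = 0;
	n = strtoul(s2 + 2, &end, 10);
	/* port 0 is rejected, as in the patch; 255 is this sketch's limit */
	if (errno || *end != '\0' || n == 0 || n > 255)
		return -1;

	len = snprintf(node_desc, desc_len, "%.*s %.*s",
		       (int)(s1 - name), name, (int)(s2 - s1 - 1), s1 + 1);
	if (len < 0 || (size_t)len >= desc_len)
		return -1;

	*port_num = (unsigned)n;
	return 0;
}
```

Because the helper never modifies or frees its input, the caller keeps exactly one place where the original string is released, whichever error path is taken. The resulting key only matches a node if its NodeDescription was written in the "hostname HCA-n" form produced by the openibd script quoted above.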
>> + p_physp = osm_node_get_physp_ptr(p_node, port_num); >> + if (!p_physp) { >> + yyerror("port number out of range in port name"); >> + free(tmp_str); >> + free(node_desc); >> + cl_list_remove_all(&tmp_parser_struct.str_list); >> + return 1; >> + } >> + /* we found the port, now add it to guid table */ >> + __parser_add_port_to_port_map(&p_current_port_group->port_map, >> + p_physp); >> + } >> + } >> } >> cl_list_remove_all(&tmp_parser_struct.str_list); >> + free(node_desc); >> } >> ; >> >> diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c >> index 51dd7b9..0d7235f 100644 >> --- a/opensm/opensm/osm_qos_policy.c >> +++ b/opensm/opensm/osm_qos_policy.c >> @@ -59,6 +59,31 @@ >> /*************************************************** >> ***************************************************/ >> >> +static void >> +__build_cabyname_hash(osm_qos_policy_t * p_qos_policy) >> +{ >> + osm_node_t * p_node; >> + cl_qmap_t * p_node_guid_tbl = &p_qos_policy->p_subn->node_guid_tbl; >> + >> + p_qos_policy->p_ca_hash = st_init_strtable(); >> + CL_ASSERT(p_qos_policy->p_ca_hash); >> + >> + if (!p_node_guid_tbl || !cl_qmap_count(p_node_guid_tbl)) >> + return; >> + >> + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); >> + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); >> + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { >> + if (p_node->node_info.node_type == IB_NODE_TYPE_CA) >> + st_insert(p_qos_policy->p_ca_hash, >> + (st_data_t)p_node->print_desc, >> + (st_data_t)p_node); > > Hmm, why do you think NodeDescription will be unique for each node in a > fabric? NodeDescription is a combination of host id and hca number. If nobody "plays" with these values (and doesn't modify this area of /etc/init.d/openibd), then NodeDescription will be unique. But IB spec doesn't require it to be unique. In fact, it doesn't say anything at all about how this NodeDescription should look. 
Moreover, if the device won't be found in /sys/class/infiniband, or if it won't have /sys/class/infiniband/${hca}/node_desc, I have no idea what would be the content of NodeDescription. I am, however, trying to give best value for the money :) OSM doesn't know what is the host id of a certain hca. The only thing I can think of right now is that OpenSM can check the NodeDescription before inserting it to the hash (which can be done in a consequent patch). It can check that the description looks like this: -- Yevgeny > Sasha > >> + } >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> static boolean_t >> __is_num_in_range_arr(uint64_t ** range_arr, >> unsigned range_arr_len, uint64_t num) >> @@ -127,8 +152,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() >> return NULL; >> >> memset(p, 0, sizeof(osm_qos_port_group_t)); >> - >> - cl_list_init(&p->port_name_list, 10); >> cl_qmap_init(&p->port_map); >> >> return p; >> @@ -150,10 +173,6 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) >> if (p->use) >> free(p->use); >> >> - cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); >> - cl_list_remove_all(&p->port_name_list); >> - cl_list_destroy(&p->port_name_list); >> - >> p_port = (osm_qos_port_t *) cl_qmap_head(&p->port_map); >> while (p_port != (osm_qos_port_t *) cl_qmap_end(&p->port_map)) >> { >> @@ -423,6 +442,8 @@ osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) >> cl_list_init(&p_qos_policy->qos_match_rules, 10); >> >> p_qos_policy->p_subn = p_subn; >> + __build_cabyname_hash(p_qos_policy); >> + >> return p_qos_policy; >> } >> >> @@ -495,6 +516,9 @@ void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy) >> cl_list_remove_all(&p_qos_policy->qos_match_rules); >> cl_list_destroy(&p_qos_policy->qos_match_rules); >> >> + if (p_qos_policy->p_ca_hash) >> + st_free_table(p_qos_policy->p_ca_hash); >> + >> free(p_qos_policy); >> >> 
p_qos_policy = NULL; >> -- >> 1.5.1.4 >> >> > From kliteyn at dev.mellanox.co.il Sun Oct 14 03:46:55 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 14 Oct 2007 12:46:55 +0200 Subject: [ofa-general] [PATCH V2] osm: QoS - parsing port names Message-ID: <4711F39F.3050202@dev.mellanox.co.il> [V2 - removed CL_ASSERT and re-diffed with the latest code] Added CA-by-name hash to the QoS policy object and as port names are parsed they use this hash to locate that actual port that the name refers to. For now I prefer to keep this hash local, so it's part of QoS policy object. When the same parser will be used for partitions too, this hash will be moved to be part of the subnet object. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_qos_policy.h | 3 +- opensm/opensm/osm_qos_parser.y | 72 +++++++++++++++++++++++++++----- opensm/opensm/osm_qos_policy.c | 36 +++++++++++++--- 3 files changed, 93 insertions(+), 18 deletions(-) diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h index 30c2e6d..5c32896 100644 --- a/opensm/include/opensm/osm_qos_policy.h +++ b/opensm/include/opensm/osm_qos_policy.h @@ -49,6 +49,7 @@ #include #include +#include #include #include #include @@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t { typedef struct _osm_qos_port_group_t { char *name; /* single string (this port group name) */ char *use; /* single string (description) */ - cl_list_t port_name_list; /* list of port names (.../.../...) 
*/ uint8_t node_types; /* node types bitmask */ cl_qmap_t port_map; } osm_qos_port_group_t; @@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t { cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ osm_qos_level_t *p_default_qos_level; /* default QoS level */ osm_subn_t *p_subn; /* osm subnet object */ + st_table * p_ca_hash; /* hash of CAs by node description */ } osm_qos_policy_t; /***************************************************/ diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index d2917d3..ffc51da 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -602,23 +602,73 @@ port_group_use_start: TK_USE { port_group_port_name: port_group_port_name_start string_list { /* 'port-name' in 'port-group' - any num of instances */ - cl_list_iterator_t list_iterator; - char * tmp_str; - - list_iterator = cl_list_head(&tmp_parser_struct.str_list); - while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) + cl_list_iterator_t list_iterator; + osm_node_t * p_node; + osm_physp_t * p_physp; + unsigned port_num; + char * name_str; + char * tmp_str; + char * host_str; + char * ca_str; + char * port_str; + char * node_desc = (char*)malloc(IB_NODE_DESCRIPTION_SIZE + 1); + + /* parsing port name strings */ + for (list_iterator = cl_list_head(&tmp_parser_struct.str_list); + list_iterator != cl_list_end(&tmp_parser_struct.str_list); + list_iterator = cl_list_next(list_iterator)) { tmp_str = (char*)cl_list_obj(list_iterator); + if (tmp_str && *tmp_str) + { + name_str = tmp_str; + host_str = strtok (name_str,"/"); + ca_str = strtok (NULL, "/"); + port_str = strtok (NULL, "/"); + + if (!host_str || !(*host_str) || + !ca_str || !(*ca_str) || + !port_str || !(*port_str) || + (port_str[0] != 'p' && port_str[0] != 'P')) { + yyerror("illegal port name"); + free(tmp_str); + free(node_desc); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } - /* - * TODO: parse port name strings - */ + if 
(!(port_num = strtoul(&port_str[1],NULL,0))) { + yyerror("illegal port number in port name"); + free(tmp_str); + free(node_desc); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } - if (tmp_str) - cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); - list_iterator = cl_list_next(list_iterator); + sprintf(node_desc,"%s %s",host_str,ca_str); + free(tmp_str); + + if (st_lookup(p_qos_policy->p_ca_hash, + (st_data_t)node_desc, + (st_data_t*)&p_node)) + { + /* we found the node, now get the right port */ + p_physp = osm_node_get_physp_ptr(p_node, port_num); + if (!p_physp) { + yyerror("port number out of range in port name"); + free(tmp_str); + free(node_desc); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } + /* we found the port, now add it to guid table */ + __parser_add_port_to_port_map(&p_current_port_group->port_map, + p_physp); + } + } } cl_list_remove_all(&tmp_parser_struct.str_list); + free(node_desc); } ; diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 51dd7b9..0d7235f 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -59,6 +59,31 @@ /*************************************************** ***************************************************/ +static void +__build_cabyname_hash(osm_qos_policy_t * p_qos_policy) +{ + osm_node_t * p_node; + cl_qmap_t * p_node_guid_tbl = &p_qos_policy->p_subn->node_guid_tbl; + + p_qos_policy->p_ca_hash = st_init_strtable(); + CL_ASSERT(p_qos_policy->p_ca_hash); + + if (!p_node_guid_tbl || !cl_qmap_count(p_node_guid_tbl)) + return; + + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { + if (p_node->node_info.node_type == IB_NODE_TYPE_CA) + st_insert(p_qos_policy->p_ca_hash, + (st_data_t)p_node->print_desc, + (st_data_t)p_node); + } +} + +/*************************************************** + 
***************************************************/ + static boolean_t __is_num_in_range_arr(uint64_t ** range_arr, unsigned range_arr_len, uint64_t num) @@ -127,8 +152,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() return NULL; memset(p, 0, sizeof(osm_qos_port_group_t)); - - cl_list_init(&p->port_name_list, 10); cl_qmap_init(&p->port_map); return p; @@ -150,10 +173,6 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) if (p->use) free(p->use); - cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); - cl_list_remove_all(&p->port_name_list); - cl_list_destroy(&p->port_name_list); - p_port = (osm_qos_port_t *) cl_qmap_head(&p->port_map); while (p_port != (osm_qos_port_t *) cl_qmap_end(&p->port_map)) { @@ -423,6 +442,8 @@ osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) cl_list_init(&p_qos_policy->qos_match_rules, 10); p_qos_policy->p_subn = p_subn; + __build_cabyname_hash(p_qos_policy); + return p_qos_policy; } @@ -495,6 +516,9 @@ void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy) cl_list_remove_all(&p_qos_policy->qos_match_rules); cl_list_destroy(&p_qos_policy->qos_match_rules); + if (p_qos_policy->p_ca_hash) + st_free_table(p_qos_policy->p_ca_hash); + free(p_qos_policy); p_qos_policy = NULL; -- 1.5.1.4 From hrosenstock at xsigo.com Sun Oct 14 04:34:21 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 14 Oct 2007 04:34:21 -0700 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/perfquery.c: Fix issues when checking PerfMgt:ClassPortInfo.CapabilityMask In-Reply-To: <20071013173214.GC12364@sashak.voltaire.com> References: <1192195857.14052.272.camel@hrosenstock-ws.xsigo.com> <20071013173214.GC12364@sashak.voltaire.com> Message-ID: <1192361661.4962.124.camel@hrosenstock-ws.xsigo.com> On Sat, 2007-10-13 at 19:32 +0200, Sasha Khapyorsky wrote: > On 06:30 Fri 12 Oct , Hal Rosenstock wrote: > > infiniband-diags/perfquery.c: Fix issues when checking > > 
PerfMgt:ClassPortInfo.CapabilityMask > > > > 1. bit 9, if we're counting from 0, will have mask of 0x200, > > not 0x100. mask of 0x100 will be for counter aggregation according > > to IBA 1.2. > > > > 2. If capmask is a 16-bit big-endian word, then we're looking > > at the wrong byte on x86, we must ntohs(*pc2) first. > > > > 3. Also, replace the pointer dereference with memcpy, > > e.g.: > > > > memcpy (&capmask, pc+2, sizeof(capmask)); > > capmask = ntohs(capmask); > > > > Those pointer dereferences are a royal pain on ia64 unless you can > > guarantee that pc is always aligned properly. > > > > Found-by: Max Matveev > > > > Compile tested only > > > > Signed-off-by: Hal Rosenstock > > Applied. Thanks. > > I have a question below (not related directly to this specific patch). > > > > > diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c > > index 2ae3281..148e452 100644 > > --- a/infiniband-diags/src/perfquery.c > > +++ b/infiniband-diags/src/perfquery.c > > @@ -40,8 +40,9 @@ > > #include > > #include > > #include > > +#include > > > > -#define __BUILD_VERSION_TAG__ 1.2.1 > > +#define __BUILD_VERSION_TAG__ 1.2.2 > > What is the motivation for this change and in general what is > __BUILD_VERSION_TAG__ supposed to show? It predates me but I had been using it as an indicator of when changes (major or minor) were made to the tool. > If it is just a unique build > version then I guess it would be better to use infiniband-diags version + > git-describe sequence. If a new release is generated for each change of this sort (which it wasn't), then this is fine. > If it is a per-tool "compat" string, then likely we > don't need to change it each time when the tool's behavior is not changed. It was a per-tool thing but could be different depending on the processes being used. 
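A minimal sketch of the three points in the quoted commit message, not the actual perfquery.c code: the 16-bit mask is copied out of the buffer with memcpy instead of being dereferenced through a cast pointer, converted with ntohs, and then tested against 0x200. The offset of 2 follows the `pc+2` in the quoted snippet; the macro and function names are hypothetical.

```c
#include <arpa/inet.h>  /* ntohs */
#include <stdint.h>
#include <string.h>     /* memcpy */

/* Hypothetical names; only the 0x200 value and the offset of 2 come from the thread. */
#define CAPMASK_OFFSET 2
#define CAPMASK_BIT9   0x200  /* bit 9 counting from 0; 0x100 would be bit 8 */

/* Read the big-endian 16-bit capability mask without an unaligned dereference. */
static uint16_t read_capmask(const uint8_t *pc)
{
	uint16_t capmask;

	memcpy(&capmask, pc + CAPMASK_OFFSET, sizeof(capmask));
	return ntohs(capmask);
}

/* Return 1 if bit 9 of the capability mask is set, 0 otherwise. */
static int capmask_has_bit9(const uint8_t *pc)
{
	return (read_capmask(pc) & CAPMASK_BIT9) != 0;
}
```

Because memcpy works on bytes, this form makes no alignment assumption about pc, which is the ia64 concern raised in point 3.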
-- Hal > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Sun Oct 14 04:43:04 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 14 Oct 2007 04:43:04 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <005001c80dbd$17eeb8c0$a865a8c0@catcher> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <005001c80dbd$17eeb8c0$a865a8c0@catcher> Message-ID: <1192362184.4962.133.camel@hrosenstock-ws.xsigo.com> Hi Steve, On Sat, 2007-10-13 at 12:18 -0500, Steve Welch wrote: > Hi Hal, > > > > > Looks pretty good. A few things below and a couple of nits embedded: > > > > I think the original description was more detailed and should be added > > to the above: > > When I submit the next revision I will update the description to put > the detail back in. Thanks. > > Signed-off-by: Steve Welch > > > > My main concern is verifying this with the various HCA drivers > > (Mellanox (in normal HCA mode), iPath, and eHCA) as well as switches > > (Suri, can you try this ?) in addition to running this on a node where > > OpenSM resides (Sasha, can you try this ?). How much of this have you > > done ? Thanks. > > > > Good point, I think we are good with regard to the SM and mthca. > I have run the code with the mthca driver loaded in non-router mode, > and verified proper operation (ports can be brought up, so > process_mad() is handing off SMP requests to the internal SMA, > etc.). I've also run the SM on that host, again local ports are > brought up and the SM is able to bring up the attached fabric. Local > user space utilities like smpquery operate normally for local and > remote queries using both directed route and LID routed addressing. 
> > However, I have not run on top of the iPath or eHCA. I don't think this is currently utilized by eHCA as all this is done in firmware but there is at least one known switch implementation out there which should IMO be reverified with this change. > A quick code > inspection of the iPath driver indicates that the desired effect > will not be achieved with that driver in every case. For the > SM info attribute it looks OK and is handled properly currently. > For DR SMPs with the GET_RESPONSE method the iPath driver returns > IB_MAD_RESULT_FAILURE instead of IB_MAD_RESULT_SUCCESS. > This will cause the core mad processing to drop the SMP MAD instead > of attempting to pass it on to a local agent. Of course this > iPath behavior exists with or without this patch. I'm not sure > why the iPath driver considers this a failure, it does not > consume or process the MAD in that case, but the MAD has passed > their incoming sanity checks. The comment in this code indicates > they intended to do the right thing, but are just returning the > wrong status (see ipath_mad.c, process_subn()). I don't know either but that could be a separate patch. Maybe Ralph could comment on this. > I just don't think this is a code path that has been exercised > on iPath, it requires a user space SMA sending DR SMP responses > that must be looped back locally. To get consistent behavior iPath > will need a change, but I do not have the hardware required to > make and test that change. > > I'm not sure about the eHCA driver, it appears to not implement > the process_mad() IB device function. Right; it currently does not expose QP0. It is all done in firmware. > > > } > > > + > > > +/* > > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > > SMA/SM > > > + * via process_mad > > > + */ > > > +static inline enum smi_action smi_check_local_returning_smp(struct > > ib_smp *smp, > > > + struct ib_device > > *device) > > > > Nit. Not sure this lines up properly. 
> > > The function names are a little verbose and we're pushing 80 columns, so > the second parameter could not line up exactly with the first without exceeding > the limit. I can break the first line up if that is preferred. I agree they are verbose but I think that makes them clearer. Maybe they can be shortened: Just make their names is_local_outgoing/returning_smp, perhaps ? -- Hal > Thanks for your feedback, > Steve From sashak at voltaire.com Sun Oct 14 08:11:15 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 14 Oct 2007 17:11:15 +0200 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM> <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> <470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071014151115.GD6489@sashak.voltaire.com> On 04:50 Fri 12 Oct , Hal Rosenstock wrote: > On Fri, 2007-10-12 at 12:09 +0530, Sumit Gaur - Sun Microsystem wrote: > > Hi , > > > > Sean Hefty wrote: > > >>There is no per thread demuxing. You would need two different mad agents > > >>to do this with one looking at the SMI side and the other the GSI side. > > >>I haven't looked at libibmad in terms of using this model though. > > > > > > > > > umad_receive() doesn't take the mad_agent as an input parameter. The only > > > possibility I see is calling umad_open_port() twice for the same port, with the > > > GSI/SMI registrations going to separate port_id's. 
> > I think this solution is also not possible as calling umad_open_port() twice for > > the same port and ca_name is always gives error in port_alloc because > > dev_to_umad_id generate same umad_id for same ca_name and portnum. > > > > ibwarn: [9634] port_alloc: umad port id 1 is already allocated for mthca0 2 > > > > So looks like it is impossible to generate two separate portid for the same port. > > It might be possible to support this with some changes to libibumad. > Sasha ? Yes, it could be possible this way. Sumit, could you try this patch? Sasha diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 589684c..5ccdcfb 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -82,6 +82,7 @@ int umaddebug = 0; #define UMAD_DEV_NAME_SZ 32 #define UMAD_DEV_FILE_SZ 256 +#define MAX_OPEN_PORTS 2048 static char *def_ca_name = "mthca0"; static int def_ca_port = 1; @@ -94,54 +95,18 @@ typedef struct Port { int id; } Port; -static Port ports[UMAD_MAX_PORTS]; +static Port *open_ports[MAX_OPEN_PORTS]; /************************************* * Port */ static Port * -port_alloc(int portid, char *dev, int portnum) -{ - Port *port = ports + portid; - - if (portid < 0 || portid >= UMAD_MAX_PORTS) { - IBWARN("bad umad portid %d", portid); - errno = EINVAL; - return 0; - } - - if (port->dev_name[0]) { - IBWARN("umad port id %d is already allocated for %s %d", - portid, port->dev_name, port->dev_port); - errno = EBUSY; - return 0; - } - - strncpy(port->dev_name, dev, UMAD_CA_NAME_LEN); - port->dev_port = portnum; - port->id = portid; - - return port; -} - -static Port * port_get(int portid) { - Port *port = ports + portid; - - if (portid < 0 || portid >= UMAD_MAX_PORTS) - return 0; - - if (port->dev_name[0] == 0) - return 0; - - return port; -} + if (portid < 0 || portid >= MAX_OPEN_PORTS) + return NULL; -static void -port_free(Port *port) -{ - memset(port, 0, sizeof *port); + return open_ports[portid]; } static int @@ -571,7 +536,7 @@ umad_get_ca_portguids(char 
*ca_name, uint64_t *portguids, int max) int umad_open_port(char *ca_name, int portnum) { - int umad_id; + int umad_id, fd; Port *port; TRACE("ca %s port %d", ca_name, portnum); @@ -584,19 +549,35 @@ umad_open_port(char *ca_name, int portnum) if ((umad_id = dev_to_umad_id(ca_name, portnum)) < 0) return -EINVAL; - if (!(port = port_alloc(umad_id, ca_name, portnum))) - return -errno; + port = malloc(sizeof(*port)); + if (!port) + return -ENOMEM; + memset(port, 0, sizeof(*port)); snprintf(port->dev_file, sizeof port->dev_file - 1, "%s/umad%d", UMAD_DEV_DIR , umad_id); - if ((port->dev_fd = open(port->dev_file, O_RDWR|O_NONBLOCK)) < 0) { + fd = open(port->dev_file, O_RDWR|O_NONBLOCK); + if (fd < 0) { DEBUG("open %s failed: %s", port->dev_file, strerror(errno)); + free(port); return -EIO; + } else if (fd >= MAX_OPEN_PORTS) { + DEBUG("no ports space for %s", port->dev_file); + errno = ENOMEM; + free(port); + return -ENOMEM; } + port->id = umad_id; + port->dev_port = portnum; + port->dev_fd = fd; + strncpy(port->dev_name, ca_name, UMAD_CA_NAME_LEN); + + open_ports[fd] = port; + DEBUG("opened %s fd %d portid %d", port->dev_file, port->dev_fd, port->id); - return port->id; + return fd; } int @@ -677,7 +658,8 @@ umad_close_port(int portid) close(port->dev_fd); - port_free(port); + open_ports[portid] = NULL; + free(port); DEBUG("closed %s fd %d", port->dev_file, port->dev_fd); return 0; From sashak at voltaire.com Sun Oct 14 09:03:14 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 14 Oct 2007 18:03:14 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <4711EE76.4070107@dev.mellanox.co.il> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> Message-ID: <20071014160314.GE6489@sashak.voltaire.com> On 12:24 Sun 14 Oct , Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: > > Hi Yevgeny, > > On 11:01 Tue 09 Oct , Yevgeny Kliteynik wrote: > >> Added 
CA-by-name hash to the QoS policy object and > > Why it is called "CA"-by-name? In the code below I see that hash is > > created for all nodes (including switches and routers). > > In osm_qos_policy.c: > > if (p_node->node_info.node_type == IB_NODE_TYPE_CA) > st_insert(p_qos_policy->p_ca_hash, > (st_data_t)p_node->print_desc, > (st_data_t)p_node); Ok, so what is wrong with switches and routers? Why it cannot be specified by "name"? > >> diff --git a/opensm/opensm/osm_qos_parser.y > >> b/opensm/opensm/osm_qos_parser.y > >> index 2405519..cf342d3 100644 > >> --- a/opensm/opensm/osm_qos_parser.y > >> +++ b/opensm/opensm/osm_qos_parser.y > >> @@ -603,23 +603,74 @@ port_group_use_start: TK_USE { > >> > >> port_group_port_name: port_group_port_name_start string_list { > >> /* 'port-name' in 'port-group' - any num of > >> instances */ > >> - cl_list_iterator_t list_iterator; > >> - char * tmp_str; > >> - > >> - list_iterator = > >> cl_list_head(&tmp_parser_struct.str_list); > >> - while( list_iterator != > >> cl_list_end(&tmp_parser_struct.str_list) ) > >> + cl_list_iterator_t list_iterator; > >> + osm_node_t * p_node; > >> + osm_physp_t * p_physp; > >> + unsigned port_num; > >> + char * name_str; > >> + char * tmp_str; > >> + char * host_str; > >> + char * ca_str; > >> + char * port_str; > >> + char * node_desc = > >> (char*)malloc(IB_NODE_DESCRIPTION_SIZE + 1); > >> + > >> + /* parsing port name strings */ > >> + for (list_iterator = > >> cl_list_head(&tmp_parser_struct.str_list); > >> + list_iterator != > >> cl_list_end(&tmp_parser_struct.str_list); > >> + list_iterator = > >> cl_list_next(list_iterator)) > >> { > >> tmp_str = > >> (char*)cl_list_obj(list_iterator); > >> + if (tmp_str && *tmp_str) > >> + { > >> + name_str = tmp_str; > >> + host_str = strtok (name_str,"/"); > >> + ca_str = strtok (NULL, "/"); > >> + port_str = strtok (NULL, "/"); > >> + > >> + if (!host_str || !(*host_str) || > >> + !ca_str || !(*ca_str) || > >> + !port_str || !(*port_str) || > >> + 
(port_str[0] != 'p' && > >> port_str[0] != 'P')) { > >> + yyerror("illegal port name"); > >> + free(tmp_str); > >> + free(node_desc); > >> + > >> cl_list_remove_all(&tmp_parser_struct.str_list); > >> + return 1; > >> + } > >> > >> - /* > >> - * TODO: parse port name strings > >> - */ > >> + if (!(port_num = > >> strtoul(&port_str[1],NULL,0))) { > >> + yyerror("illegal port number in > >> port name"); > >> + free(tmp_str); > >> + free(node_desc); > >> + > >> cl_list_remove_all(&tmp_parser_struct.str_list); > >> + return 1; > >> + } > >> > >> - if (tmp_str) > >> - > >> cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); > >> - list_iterator = > >> cl_list_next(list_iterator); > >> + sprintf(node_desc,"%s > >> %s",host_str,ca_str); > >> + free(tmp_str); > >> + > >> + if > >> (st_lookup(p_qos_policy->p_ca_hash, > >> + (st_data_t)node_desc, > >> + (st_data_t*)&p_node)) > > I am not following this. Hash key is generated as "host_str ca_str", but > > below where hash table is filled NodeDescription string is used. Why > > this should be same? > > Because that's how node description is created. > From /etc/init.d/openibd: > > # Add node description to sysfs > IBSYSDIR="/sys/class/infiniband" > if [ -d ${IBSYSDIR} ]; then > declare -i hca_id=1 > for hca in ${IBSYSDIR}/* > do > if [ -e ${hca}/node_desc ]; then > echo -n "$(hostname -s) HCA-${hca_id}" >> ${hca}/node_desc > fi > let hca_id++ > done > fi This script is optional, even when used the way how node_desc is generated can be easily changed. I think it is not good idea to copy the algorithm to OpenSM code and in this way to enforce an user to use the only this hardcoded node_desc format. Actually this (or another similar) script is sort of config file, as well as qos policy file, and both are in admin's hands. So basically I agree that it is ok to require to define node_desc (if an admin wishes to use names for her QoS). 
_But_ we cannot dictate how it should be generated - it clearly must be user's and not our choice. So instead of approaching hardcoded node_desc format I think that name definition in qos policy file should refer node_desc as whole string (well, in improved case it could be single substring with wild cards). > >> + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); > >> + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); > >> + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { > >> + if (p_node->node_info.node_type == IB_NODE_TYPE_CA) > >> + st_insert(p_qos_policy->p_ca_hash, > >> + (st_data_t)p_node->print_desc, > >> + (st_data_t)p_node); > > Hmm, why do you think NodeDescription will be unique for each node in a > > fabric? > > NodeDescription is a combination of host id and hca number. > If nobody "plays" with these values (and doesn't modify this area > of /etc/init.d/openibd), then NodeDescription will be unique. > But IB spec doesn't require it to be unique. In fact, it doesn't say > anything at all about how this NodeDescription should look. Yes, it is the point. OTOH as I stated above I think it ok to require node_desc setup for "by node_desc" resolution, and in this case an user is responsible to have it unique. But let's do by node_desc and not by "$host $hca" or any another hardcoded format. > Moreover, if the device won't be found in /sys/class/infiniband, > or if it won't have /sys/class/infiniband/${hca}/node_desc, I have no > idea what would be the content of NodeDescription. > > I am, however, trying to give best value for the money :) > > OSM doesn't know what is the host id of a certain hca. > The only thing I can think of right now is that OpenSM can > check the NodeDescription before inserting it to the hash > (which can be done in a consequent patch). This is handled internally in st_insert() - new value will just replace old one. > It can check that the description looks like this: > Please no. 
:) Sasha From sashak at voltaire.com Sun Oct 14 12:52:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 14 Oct 2007 21:52:51 +0200 Subject: [ofa-general] [PATCH] libibcommon, infiniband-diags: move get_build_version() to diags Message-ID: <20071014195251.GH6489@sashak.voltaire.com> Move get_build_version() function (which is ifdefed by __BUILD_VERSION_TAG__) to infiniband-diags, where it is only used. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/include/ibdiag_common.h | 11 +++++++++++ infiniband-diags/src/ibstat.c | 4 ++-- infiniband-diags/src/smpdump.c | 6 +++--- libibcommon/include/infiniband/common.h | 18 ------------------ 4 files changed, 16 insertions(+), 23 deletions(-) diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h index 2d463c5..159e929 100644 --- a/infiniband-diags/include/ibdiag_common.h +++ b/infiniband-diags/include/ibdiag_common.h @@ -65,4 +65,15 @@ void iberror(const char *fn, char *msg, ...); /* NOTE: this modifies the parameter "nodedesc". 
*/ char *clean_nodedesc(char *nodedesc); +#ifdef __BUILD_VERSION_TAG__ + +#define stringify(s) to_string(s) +#define to_string(s) #s + +static inline const char* get_build_version(void) +{ + return "BUILD VERSION: " stringify(__BUILD_VERSION_TAG__) " Build date: " __DATE__ " " __TIME__ ; +} +#endif + #endif /* _IBDIAG_COMMON_H_ */ diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c index 4653390..aa55d83 100644 --- a/infiniband-diags/src/ibstat.c +++ b/infiniband-diags/src/ibstat.c @@ -62,11 +62,11 @@ #include #include -#define DEBUG if (debug) IBWARN +#include static int debug; -static char *argv0 = "ibstat"; +char *argv0 = "ibstat"; static char *node_type_str[] = { "???", diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c index 5eceea7..c325771 100644 --- a/infiniband-diags/src/smpdump.c +++ b/infiniband-diags/src/smpdump.c @@ -61,7 +61,7 @@ #include #include -#define DEBUG if (debug) IBWARN +#include static const uint8_t CLASS_SUBN_DIRECTED_ROUTE = 0x81; static const uint8_t CLASS_SUBN_LID_ROUTE = 0x1; @@ -73,9 +73,9 @@ static const uint8_t CLASS_SUBN_LID_ROUTE = 0x1; static int mad_agent; static int drmad_tid = 0x123; -static int debug; +static int debug, verbose; -static char *argv0 = "smpdump"; +char *argv0 = "smpdump"; typedef struct { char path[64]; diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h index 72147d8..4eb3872 100644 --- a/libibcommon/include/infiniband/common.h +++ b/libibcommon/include/infiniband/common.h @@ -143,24 +143,6 @@ uint64_t getcurrenttime(void); /* hash.c */ uint32_t fhash(uint8_t *k, int length, uint32_t initval); -#ifdef __BUILD_VERSION_TAG__ - -#undef stringify -#undef tostring - -#define stringify(s) tostring(s) -#define tostring(s) #s - -__attribute__((unused)) static char _build_version[] = { "BUILD VERSION: " stringify(__BUILD_VERSION_TAG__) " Build date: " __DATE__ " " __TIME__ }; - -__attribute__((unused)) static inline char* 
-get_build_version(void) -{ - return _build_version; -} - -#endif - END_C_DECLS #endif /* __COMMON_H__ */ -- 1.5.3.rc2.29.gc4640f From sashak at voltaire.com Sun Oct 14 13:02:53 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 14 Oct 2007 22:02:53 +0200 Subject: [ofa-general] Re: [PATCH] opensm: osm_console.h replaced string literals with macro definitions In-Reply-To: <470409EE.8010905@llnl.gov> References: <470409EE.8010905@llnl.gov> Message-ID: <20071014200253.GI6489@sashak.voltaire.com> On 14:30 Wed 03 Oct , Timothy A. Meier wrote: > Sasha - another small patch. I think I fixed the line wrap issue, but have > also attached > the patch just in case. This still wraps lines: > + || strcmp(opt.console, OSM_LOOPBACK_CONSOLE) == > 0 The attached patch was fine. > From f1ea67d05410373c90441962e1f3005aa6212b05 Mon Sep 17 00:00:00 2001 > From: Tim Meier > Date: Wed, 3 Oct 2007 14:05:03 -0700 > Subject: [PATCH] opensm: osm_console.h replaced string literals with macro > definitions > > Several string constants are used to define and control the behavior > of the OSM Console. This patch formalizes those constants, and uses > them in a consistent manner. > > Signed-off-by: Tim Meier Applied. Thanks. 
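A minimal sketch of the idea in the quoted commit message, replacing scattered string literals with named macros so that every comparison uses the same spelling. It is not the applied patch: only OSM_LOOPBACK_CONSOLE appears in this thread, while OSM_LOCAL_CONSOLE, both string values, and the helper function are assumptions made for illustration.

```c
#include <string.h>

/* Only OSM_LOOPBACK_CONSOLE is named in the thread; the other name and both values are assumed. */
#define OSM_LOCAL_CONSOLE    "local"
#define OSM_LOOPBACK_CONSOLE "loopback"

/* Hypothetical helper: accept a console option only if it names a known mode. */
static int console_mode_is_known(const char *console)
{
	if (!console)
		return 0;
	return strcmp(console, OSM_LOCAL_CONSOLE) == 0
	    || strcmp(console, OSM_LOOPBACK_CONSOLE) == 0;
}
```

With the literals defined in one place, a misspelled mode name fails at compile time as an undefined macro instead of silently never matching.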
Sasha From kliteyn at dev.mellanox.co.il Sun Oct 14 15:32:45 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 15 Oct 2007 00:32:45 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <20071014160314.GE6489@sashak.voltaire.com> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> Message-ID: <4712990D.1060801@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > On 12:24 Sun 14 Oct , Yevgeny Kliteynik wrote: >> Sasha Khapyorsky wrote: >>> Hi Yevgeny, >>> On 11:01 Tue 09 Oct , Yevgeny Kliteynik wrote: >>>> Added CA-by-name hash to the QoS policy object and >>> Why it is called "CA"-by-name? In the code below I see that hash is >>> created for all nodes (including switches and routers). >> In osm_qos_policy.c: >> >> if (p_node->node_info.node_type == IB_NODE_TYPE_CA) >> st_insert(p_qos_policy->p_ca_hash, >> (st_data_t)p_node->print_desc, >> (st_data_t)p_node); > > Ok, so what is wrong with switches and routers? Why it cannot be > specified by "name"? Switches have the NodeDescription filled by FW, and it's usually the same string for all the switches. Also, what would be the meaning of "host id" for switches? 
As for routers - I don't know what's going on there, since I didn't get the chance to lay my hands on it yet (those IB routers are hard to get right now :-) >>>> diff --git a/opensm/opensm/osm_qos_parser.y >>>> b/opensm/opensm/osm_qos_parser.y >>>> index 2405519..cf342d3 100644 >>>> --- a/opensm/opensm/osm_qos_parser.y >>>> +++ b/opensm/opensm/osm_qos_parser.y >>>> @@ -603,23 +603,74 @@ port_group_use_start: TK_USE { >>>> >>>> port_group_port_name: port_group_port_name_start string_list { >>>> /* 'port-name' in 'port-group' - any num of >>>> instances */ >>>> - cl_list_iterator_t list_iterator; >>>> - char * tmp_str; >>>> - >>>> - list_iterator = >>>> cl_list_head(&tmp_parser_struct.str_list); >>>> - while( list_iterator != >>>> cl_list_end(&tmp_parser_struct.str_list) ) >>>> + cl_list_iterator_t list_iterator; >>>> + osm_node_t * p_node; >>>> + osm_physp_t * p_physp; >>>> + unsigned port_num; >>>> + char * name_str; >>>> + char * tmp_str; >>>> + char * host_str; >>>> + char * ca_str; >>>> + char * port_str; >>>> + char * node_desc = >>>> (char*)malloc(IB_NODE_DESCRIPTION_SIZE + 1); >>>> + >>>> + /* parsing port name strings */ >>>> + for (list_iterator = >>>> cl_list_head(&tmp_parser_struct.str_list); >>>> + list_iterator != >>>> cl_list_end(&tmp_parser_struct.str_list); >>>> + list_iterator = >>>> cl_list_next(list_iterator)) >>>> { >>>> tmp_str = >>>> (char*)cl_list_obj(list_iterator); >>>> + if (tmp_str && *tmp_str) >>>> + { >>>> + name_str = tmp_str; >>>> + host_str = strtok (name_str,"/"); >>>> + ca_str = strtok (NULL, "/"); >>>> + port_str = strtok (NULL, "/"); >>>> + >>>> + if (!host_str || !(*host_str) || >>>> + !ca_str || !(*ca_str) || >>>> + !port_str || !(*port_str) || >>>> + (port_str[0] != 'p' && >>>> port_str[0] != 'P')) { >>>> + yyerror("illegal port name"); >>>> + free(tmp_str); >>>> + free(node_desc); >>>> + >>>> cl_list_remove_all(&tmp_parser_struct.str_list); >>>> + return 1; >>>> + } >>>> >>>> - /* >>>> - * TODO: parse port name strings >>>> - 
*/ >>>> + if (!(port_num = >>>> strtoul(&port_str[1],NULL,0))) { >>>> + yyerror("illegal port number in >>>> port name"); >>>> + free(tmp_str); >>>> + free(node_desc); >>>> + >>>> cl_list_remove_all(&tmp_parser_struct.str_list); >>>> + return 1; >>>> + } >>>> >>>> - if (tmp_str) >>>> - >>>> cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); >>>> - list_iterator = >>>> cl_list_next(list_iterator); >>>> + sprintf(node_desc,"%s >>>> %s",host_str,ca_str); >>>> + free(tmp_str); >>>> + >>>> + if >>>> (st_lookup(p_qos_policy->p_ca_hash, >>>> + (st_data_t)node_desc, >>>> + (st_data_t*)&p_node)) >>> I am not following this. Hash key is generated as "host_str ca_str", but >>> below where hash table is filled NodeDescription string is used. Why >>> this should be same? >> Because that's how node description is created. >> From /etc/init.d/openibd: >> >> # Add node description to sysfs >> IBSYSDIR="/sys/class/infiniband" >> if [ -d ${IBSYSDIR} ]; then >> declare -i hca_id=1 >> for hca in ${IBSYSDIR}/* >> do >> if [ -e ${hca}/node_desc ]; then >> echo -n "$(hostname -s) HCA-${hca_id}" >> ${hca}/node_desc >> fi >> let hca_id++ >> done >> fi > > This script is optional, even when used the way how node_desc is > generated can be easily changed. I think it is not good idea to copy the > algorithm to OpenSM code and in this way to enforce an user to use the > only this hardcoded node_desc format. > > Actually this (or another similar) script is sort of config file, as > well as qos policy file, and both are in admin's hands. So basically I > agree that it is ok to require to define node_desc (if an admin wishes > to use names for her QoS). _But_ we cannot dictate how it should be > generated - it clearly must be user's and not our choice. > > So instead of approaching hardcoded node_desc format I think that name > definition in qos policy file should refer node_desc as whole string > (well, in improved case it could be single substring with wild cards). 
Ok, so let's elaborate on this. Currently NodeDescription is filled by the openibd script. Although this script can be modified by admin, I doubt that an average admin would like to tweak it. Thus, I believe that in most cases the NodeDescription will look this way: "node-id hca-num". If we want to allow port names to have number ranges or asterisk (and we do want it), then we have to have *some* format. So here's my suggestion: 1. First of all, when the ca-by-name hash is created, osm will check that the NodeDescriptions are unique. If they aren't - parsing of the port names will be off, even if specified in the policy file. 2. If a port name doesn't have any special characters, it will be compared to the NodeDescription as is, and it'd better be unique 3. If the admin would like to include num. ranges and asterisks in the port name, he has to make sure that the NodeDescription is created like it is created now by openibd. Sounds ok? >>>> + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); >>>> + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); >>>> + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { >>>> + if (p_node->node_info.node_type == IB_NODE_TYPE_CA) >>>> + st_insert(p_qos_policy->p_ca_hash, >>>> + (st_data_t)p_node->print_desc, >>>> + (st_data_t)p_node); >>> Hmm, why do you think NodeDescription will be unique for each node in a >>> fabric? >> NodeDescription is a combination of host id and hca number. >> If nobody "plays" with these values (and doesn't modify this area >> of /etc/init.d/openibd), then NodeDescription will be unique. >> But IB spec doesn't require it to be unique. In fact, it doesn't say >> anything at all about how this NodeDescription should look. > > Yes, it is the point. > > OTOH as I stated above I think it ok to require node_desc setup for "by > node_desc" resolution, and in this case an user is responsible to have > it unique. 
> > But let's do by node_desc and not by "$host $hca" or any another > hardcoded format. > >> Moreover, if the device won't be found in /sys/class/infiniband, >> or if it won't have /sys/class/infiniband/${hca}/node_desc, I have no >> idea what would be the content of NodeDescription. >> >> I am, however, trying to give best value for the money :) >> >> OSM doesn't know what is the host id of a certain hca. >> The only thing I can think of right now is that OpenSM can >> check the NodeDescription before inserting it to the hash >> (which can be done in a consequent patch). > > This is handled internally in st_insert() - new value will just replace > old one. I mean that OSM will check the description format, not its existence in the hash >> It can check that the description looks like this: >> > > Please no. :) OK, but only because you asked nicely :) -- Yevgeny > Sasha > From sashak at voltaire.com Sun Oct 14 20:53:09 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 05:53:09 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <4712990D.1060801@dev.mellanox.co.il> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> Message-ID: <20071015035309.GN12364@sashak.voltaire.com> Hi Yevgeny, On 00:32 Mon 15 Oct , Yevgeny Kliteynik wrote: > >>>> Added CA-by-name hash to the QoS policy object and > >>> Why it is called "CA"-by-name? In the code below I see that hash is > >>> created for all nodes (including switches and routers). > >> In osm_qos_policy.c: > >> > >> if (p_node->node_info.node_type == IB_NODE_TYPE_CA) > >> st_insert(p_qos_policy->p_ca_hash, > >> (st_data_t)p_node->print_desc, > >> (st_data_t)p_node); > > Ok, so what is wrong with switches and routers? Why it cannot be > > specified by "name"? 
> > Switches have the NodeDescription filled by FW, and it's usually the > same string for all the switches. It need not be the same. Also I suppose that the node description can be changed at least for some managed switches even today. > Also, what would be the meaning of > "host id" for switches? I don't like the "host id" approach - it assumes a predefined node description format. > As for routers - I don't know what's going on there, since I didn't > get the chance to lay my hands on it yet (those IB routers are hard > to get right now :-) So I think it is better to include switches and routers here and to use "node name" instead of "CA name". In the worst case, when all records are the same, we will lose one or two hash table entries per fabric. > >> From /etc/init.d/openibd: > >> > >> # Add node description to sysfs > >> IBSYSDIR="/sys/class/infiniband" > >> if [ -d ${IBSYSDIR} ]; then > >> declare -i hca_id=1 > >> for hca in ${IBSYSDIR}/* > >> do > >> if [ -e ${hca}/node_desc ]; then > >> echo -n "$(hostname -s) HCA-${hca_id}" >> > >> ${hca}/node_desc > >> fi > >> let hca_id++ > >> done > >> fi > > This script is optional, even when used the way how node_desc is > > generated can be easily changed. I think it is not good idea to copy the > > algorithm to OpenSM code and in this way to enforce an user to use the > > only this hardcoded node_desc format. > > Actually this (or another similar) script is sort of config file, as > > well as qos policy file, and both are in admin's hands. So basically I > > agree that it is ok to require to define node_desc (if an admin wishes > > to use names for her QoS). _But_ we cannot dictate how it should be > > generated - it clearly must be user's and not our choice. > > So instead of approaching hardcoded node_desc format I think that name > > definition in qos policy file should refer node_desc as whole string > > (well, in improved case it could be single substring with wild cards). > > Ok, so let's elaborate on this. 
> > Currently NodeDescription is filled by the openibd script. > Although this script can be modified by admin, I doubt that an average > admin would like to tweak it. Thus, I believe that in most cases the > NodeDescription will look this way: "node-id hca-num". It must not be so. For instance I'm not using this script at all, OpenSM must not be installed as part of OFED, nodes can run other than Linux OSes (Solar*, Win*, etc.), etc. > If we want to allow port names to have number ranges or asterisk > (and we do want it), then we have to have *some* format. Why? What is wrong with plain strings? Look at example - you are able to use wildcards with 'ls' command even if file names doesn't have any predefined format. Right? > So here's my suggestion: > 1. First of all, when the ca-by-name hash is created, osm will check that > the NodeDescriptions are unique. If they aren't - parsing of the port > names will be off, even if specified in the policy file. It is overkill IMO - an user is responsible to setup things properly, finally it is her choice in which tricky way to use it. > 2. If a port name doesn't have any special characters, it will be compared > to the NodeDescription as is, and it'd better be unique Ok. > 3. If the admin would like to include num. ranges and asterisks in the > port name, he has to make sure that the NodeDescription is created > like it is created now by openibd. Again, why this limitation is needed? What is wrong with wildcards like "myname*", "hostname[1-3] *", etc.? > Sounds ok? (2) + (3 without format limitation) looks fine for me. 
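For illustration only (this is not code from the thread or from OpenSM): a minimal sketch of the plain-string wildcard idea argued for above, assuming POSIX fnmatch(3) is an acceptable matcher for whole NodeDescription strings. The helper name node_desc_matches is hypothetical.

```c
#include <fnmatch.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration, not an OpenSM API):
 * returns 1 if the whole node description matches the shell-style pattern,
 * 0 otherwise. No fixed "host HCA-n" layout is assumed.
 */
int node_desc_matches(const char *pattern, const char *node_desc)
{
	if (pattern == NULL || node_desc == NULL)
		return 0;

	/* fnmatch() returns 0 on a match; flags == 0 selects plain glob rules. */
	return fnmatch(pattern, node_desc, 0) == 0;
}
```

With this, the pattern "hostname[1-3] *" matches "hostname2 HCA-1" but not "hostname4 HCA-1", and a pattern without special characters only matches the identical string, which corresponds to point (2) above.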
Sasha From Sumit.Gaur at Sun.COM Sun Oct 14 21:16:16 2007 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Mon, 15 Oct 2007 09:46:16 +0530 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <20071014151115.GD6489@sashak.voltaire.com> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM> <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> <470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> Message-ID: <4712E990.9020906@Sun.COM> Thanks Sasha for patch I will try it and looks like it would work. Regards sumit Sasha Khapyorsky wrote: > On 04:50 Fri 12 Oct , Hal Rosenstock wrote: > >>On Fri, 2007-10-12 at 12:09 +0530, Sumit Gaur - Sun Microsystem wrote: >> >>>Hi , >>> >>>Sean Hefty wrote: >>> >>>>>There is no per thread demuxing. You would need two different mad agents >>>>>to do this with one looking at the SMI side and the other the GSI side. >>>>>I haven't looked at libibmad in terms of using this model though. >>>> >>>> >>>>umad_receive() doesn't take the mad_agent as an input parameter. The only >>>>possibility I see is calling umad_open_port() twice for the same port, with the >>>>GSI/SMI registrations going to separate port_id's. >>> >>>I think this solution is also not possible as calling umad_open_port() twice for >>>the same port and ca_name is always gives error in port_alloc because >>>dev_to_umad_id generate same umad_id for same ca_name and portnum. >>> >>>ibwarn: [9634] port_alloc: umad port id 1 is already allocated for mthca0 2 >>> >>>So looks like it is impossible to generate two separate portid for the same port. >> >>It might be possible to support this with some changes to libibumad. >>Sasha ? > > > Yes, it could be possible this way. > > Sumit, could you try this patch? 
> > Sasha > > > diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c > index 589684c..5ccdcfb 100644 > --- a/libibumad/src/umad.c > +++ b/libibumad/src/umad.c > @@ -82,6 +82,7 @@ int umaddebug = 0; > > #define UMAD_DEV_NAME_SZ 32 > #define UMAD_DEV_FILE_SZ 256 > +#define MAX_OPEN_PORTS 2048 > > static char *def_ca_name = "mthca0"; > static int def_ca_port = 1; > @@ -94,54 +95,18 @@ typedef struct Port { > int id; > } Port; > > -static Port ports[UMAD_MAX_PORTS]; > +static Port *open_ports[MAX_OPEN_PORTS]; > > /************************************* > * Port > */ > static Port * > -port_alloc(int portid, char *dev, int portnum) > -{ > - Port *port = ports + portid; > - > - if (portid < 0 || portid >= UMAD_MAX_PORTS) { > - IBWARN("bad umad portid %d", portid); > - errno = EINVAL; > - return 0; > - } > - > - if (port->dev_name[0]) { > - IBWARN("umad port id %d is already allocated for %s %d", > - portid, port->dev_name, port->dev_port); > - errno = EBUSY; > - return 0; > - } > - > - strncpy(port->dev_name, dev, UMAD_CA_NAME_LEN); > - port->dev_port = portnum; > - port->id = portid; > - > - return port; > -} > - > -static Port * > port_get(int portid) > { > - Port *port = ports + portid; > - > - if (portid < 0 || portid >= UMAD_MAX_PORTS) > - return 0; > - > - if (port->dev_name[0] == 0) > - return 0; > - > - return port; > -} > + if (portid < 0 || portid >= MAX_OPEN_PORTS) > + return NULL; > > -static void > -port_free(Port *port) > -{ > - memset(port, 0, sizeof *port); > + return open_ports[portid]; > } > > static int > @@ -571,7 +536,7 @@ umad_get_ca_portguids(char *ca_name, uint64_t *portguids, int max) > int > umad_open_port(char *ca_name, int portnum) > { > - int umad_id; > + int umad_id, fd; > Port *port; > > TRACE("ca %s port %d", ca_name, portnum); > @@ -584,19 +549,35 @@ umad_open_port(char *ca_name, int portnum) > if ((umad_id = dev_to_umad_id(ca_name, portnum)) < 0) > return -EINVAL; > > - if (!(port = port_alloc(umad_id, ca_name, portnum))) > - return 
-errno; > + port = malloc(sizeof(*port)); > + if (!port) > + return -ENOMEM; > + memset(port, 0, sizeof(*port)); > > snprintf(port->dev_file, sizeof port->dev_file - 1, "%s/umad%d", > UMAD_DEV_DIR , umad_id); > > - if ((port->dev_fd = open(port->dev_file, O_RDWR|O_NONBLOCK)) < 0) { > + fd = open(port->dev_file, O_RDWR|O_NONBLOCK); > + if (fd < 0) { > DEBUG("open %s failed: %s", port->dev_file, strerror(errno)); > + free(port); > return -EIO; > + } else if (fd >= MAX_OPEN_PORTS) { > + DEBUG("no ports space for %s", port->dev_file); > + errno = ENOMEM; > + free(port); > + return -ENOMEM; > } > > + port->id = umad_id; > + port->dev_port = portnum; > + port->dev_fd = fd; > + strncpy(port->dev_name, ca_name, UMAD_CA_NAME_LEN); > + > + open_ports[fd] = port; > + > DEBUG("opened %s fd %d portid %d", port->dev_file, port->dev_fd, port->id); > - return port->id; > + return fd; > } > > int > @@ -677,7 +658,8 @@ umad_close_port(int portid) > > close(port->dev_fd); > > - port_free(port); > + open_ports[portid] = NULL; > + free(port); > > DEBUG("closed %s fd %d", port->dev_file, port->dev_fd); > return 0; From anton at samba.org Sun Oct 14 22:49:07 2007 From: anton at samba.org (Anton Blanchard) Date: Mon, 15 Oct 2007 00:49:07 -0500 Subject: [ofa-general] [PATCH] Use round_jiffies() in ehca timer Message-ID: <20071015054907.GE3257@kryten> Use round_jiffies() to align the 1 second timer with other timers and potentially save power by sleeping cores for longer. 
Signed-off-by: Anton Blanchard --- diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 403467f..23000b7 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -902,7 +902,7 @@ void ehca_poll_eqs(unsigned long data) ehca_process_eq(shca, 0); } } - mod_timer(&poll_eqs_timer, jiffies + HZ); + mod_timer(&poll_eqs_timer, round_jiffies(jiffies + HZ)); spin_unlock(&shca_list_lock); } From anton at samba.org Sun Oct 14 22:50:56 2007 From: anton at samba.org (Anton Blanchard) Date: Mon, 15 Oct 2007 00:50:56 -0500 Subject: [ofa-general] [PATCH] Use round_jiffies() in IPoIB code Message-ID: <20071015055056.GF3257@kryten> Use round_jiffies() to align the 1 second ah_reap_task with other work and potentially save power by sleeping cores for longer. Signed-off-by: Anton Blanchard --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 1a77e79..f1fa3c0 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -436,7 +436,8 @@ void ipoib_reap_ah(struct work_struct *work) __ipoib_reap_ah(dev); if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) - queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, + round_jiffies_relative(HZ)); } int ipoib_ib_dev_open(struct net_device *dev) @@ -472,7 +473,8 @@ int ipoib_ib_dev_open(struct net_device *dev) } clear_bit(IPOIB_STOP_REAPER, &priv->flags); - queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, + round_jiffies_relative(HZ)); set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); From sweitzen at cisco.com Sun Oct 14 22:56:38 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Sun, 14 Oct 2007 22:56:38 -0700 Subject: [ofa-general] OFA_KERNEL_PARAMS is missing from OFED 1.3 install.pl Message-ID: Vlad, I don't see a 
way to configure OFED 1.3 during installation with OFA_KERNEL_PARAMS like I could in 1.2.5 and earlier. I am specifically looking for the params --without-modprobe, --without-ipoibconf, and --with-madeye-mod. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems From sweitzen at cisco.com Sun Oct 14 23:09:57 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Sun, 14 Oct 2007 23:09:57 -0700 Subject: [ofa-general] OFA_KERNEL_PARAMS is missing from OFED 1.3 install.pl In-Reply-To: References: Message-ID: I also don't see a way to use K_VER to compile for a kernel other than the currently booted kernel, like I could in 1.2.5 and earlier. Scott ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Scott Weitzenkamp (sweitzen) Sent: Sunday, October 14, 2007 10:57 PM To: OpenFabricsEWG; Vladimir Sokolovsky Cc: general at lists.openfabrics.org Subject: [ofa-general] OFA_KERNEL_PARAMS is missing from OFED 1.3 install.pl Vlad, I don't see a way to configure OFED 1.3 during installation with OFA_KERNEL_PARAMS like I could in 1.2.5 and earlier. I am specifically looking for the params --without-modprobe, --without-ipoibconf, and --with-madeye-mod. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems 
From eli at dev.mellanox.co.il Mon Oct 15 00:56:51 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 15 Oct 2007 09:56:51 +0200 Subject: [ofa-general] [PATCH 6/11 v1] IB/ipoib: add checksum offload support In-Reply-To: <1190721304.4947.134.camel@mtls03> References: <1190637551.4947.66.camel@mtls03> <46F8E40C.3030203@voltaire.com> <1190721304.4947.134.camel@mtls03> Message-ID: <1192435011.7337.151.camel@mtls03> Add checksum offload support to ipoib Signed-off-by: Eli Cohen Signed-off-by: Ali Ayub --- This version make sure that set_tx_csum() and set_rx_csum() get called before ipoib_dev_init() so that whether NETIF_F_SG is set or not will properly affect the size the create QP. Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-14 18:07:31.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-14 18:07:37.000000000 +0200 @@ -87,6 +87,7 @@ enum { IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_FLAG_ADMIN_CM = 10, IPOIB_FLAG_HW_CSUM = 11, + IPOIB_FLAG_RX_CSUM = 12, IPOIB_MAX_BACKOFF_SECONDS = 16, Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-14 18:07:31.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-14 18:07:37.000000000 +0200 @@ -1262,6 +1262,13 @@ static ssize_t set_mode(struct device *d set_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); ipoib_warn(priv, "enabling connected mode " "will cause multicast packet drops\n"); + + /* clear ipv6 flag too */ + dev->features &= ~NETIF_F_IP_CSUM; + + priv->tx_wr.send_flags &= + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + ipoib_flush_paths(dev); return count; } 
@@ -1270,6 +1277,10 @@ static ssize_t set_mode(struct device *d clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); dev->mtu = min(priv->mcast_mtu, dev->mtu); ipoib_flush_paths(dev); + + if (priv->ca->flags & IB_DEVICE_IP_CSUM) + dev->features |= NETIF_F_IP_CSUM; /* ipv6 too */ + return count; } Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-10-14 18:07:31.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-10-14 18:07:37.000000000 +0200 @@ -37,6 +37,7 @@ #include #include +#include #include @@ -235,6 +236,16 @@ static void ipoib_ib_handle_rx_wc(struct skb->dev = dev; /* XXX get correct PACKET_ type here */ skb->pkt_type = PACKET_HOST; + + /* check rx csum */ + if (test_bit(IPOIB_FLAG_RX_CSUM, &priv->flags) && likely(wc->csum_ok)) { + /* Note: this is a specific requirement for Mellanox + HW but since it is the only HW currently supporting + checksum offload I put it here */ + if ((((struct iphdr *)(skb->data))->ihl) == 5) + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + netif_receive_skb(skb); repost: @@ -400,6 +411,15 @@ void ipoib_send(struct net_device *dev, return; } + if (priv->ca->flags & IB_DEVICE_IP_CSUM && + skb->ip_summed == CHECKSUM_PARTIAL) + priv->tx_wr.send_flags |= + IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM; + else + priv->tx_wr.send_flags &= + ~(IB_SEND_UDP_TCP_CSUM | IB_SEND_IP_CSUM); + + if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, tx_req->mapping, skb_headlen(skb), Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-14 18:07:31.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-14 18:08:10.000000000 +0200 @@ -1128,6 +1128,29 @@ 
int ipoib_add_pkey_attr(struct net_devic return device_create_file(&dev->dev, &dev_attr_pkey); } +static void set_tx_csum(struct net_device *dev, struct ib_device *hca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags)) + return; + + if (!(hca->flags & IB_DEVICE_IP_CSUM)) + return; + + dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; /* turn on ipv6 too */ +} + +static void set_rx_csum(struct net_device *dev, struct ib_device *hca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!(hca->flags & IB_DEVICE_IP_CSUM)) + return; + + set_bit(IPOIB_FLAG_RX_CSUM, &priv->flags); +} + static struct net_device *ipoib_add_port(const char *format, struct ib_device *hca, u8 port) { @@ -1166,6 +1189,8 @@ static struct net_device *ipoib_add_port } else memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + set_tx_csum(priv->dev, hca); + set_rx_csum(priv->dev, hca); result = ipoib_dev_init(priv->dev, hca, port); if (result < 0) { From eli at mellanox.co.il Mon Oct 15 00:57:40 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 15 Oct 2007 09:57:40 +0200 Subject: [ofa-general] [PATCH regenrated] IB/ipoib: enable IGMP for userpsace multicast IB apps Message-ID: <1192435060.7337.152.camel@mtls03> The kernel IB stack allows (through the RDMA CM) user space multicast applications to interoperate with IP based apps optionally running at a different IP subnet. To support this inter-op for the case where the receiving party resides at the IB side, there is a need to handle IGMP (reports/queries) else the local IP router would not forward multicast traffic towards the IB network. This patch does a lookup on the database used for multicast reference counting and enhances IPoIB to ignore multicast group which is already handled by user space, all this under a per device policy flag. 
That is when the policy flag allows it, IPoIB will not join and attach its QP to a multicast group which has an entry on the database. For each IPoIB device, the /sys/class/net/$dev/umcast attribute controls the policy flag where the default value is being off (zero). The flag can be read and set/unset through sysfs. Signed-off-by: Or Gerlitz --- This is the same patch the Or committed to the ofa git tree but this one was regenerated to prevent conflicts which were introduced after the fix to the checksum offload patch I sent (fix was sent just before this one). Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-10-14 17:46:43.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-10-14 17:57:10.000000000 +0200 @@ -761,6 +761,7 @@ void ipoib_mcast_restart_task(struct wor struct ipoib_mcast *mcast, *tmcast; LIST_HEAD(remove_list); unsigned long flags; + struct ib_sa_mcmember_rec rec; ipoib_dbg_mcast(priv, "restarting multicast task\n"); @@ -794,6 +795,15 @@ void ipoib_mcast_restart_task(struct wor if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { struct ipoib_mcast *nmcast; + /* ignore group which is directly joined by user space */ + if (test_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags) && + !ib_sa_get_mcmember_rec(priv->ca, priv->port, &mgid, &rec)) + { + ipoib_dbg_mcast(priv, "ignoring multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + continue; + } + /* Not found or send-only group, let's add a new entry */ ipoib_dbg_mcast(priv, "adding multicast entry for mgid " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-14 
17:55:52.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-14 17:57:10.000000000 +0200 @@ -88,6 +88,7 @@ enum { IPOIB_FLAG_ADMIN_CM = 10, IPOIB_FLAG_HW_CSUM = 11, IPOIB_FLAG_RX_CSUM = 12, + IPOIB_FLAG_ADMIN_UMCAST_ALLOWED = 13, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -498,6 +499,7 @@ static inline void ipoib_put_ah(struct i int ipoib_open(struct net_device *dev); int ipoib_add_pkey_attr(struct net_device *dev); +int ipoib_add_umcast_attr(struct net_device *dev); void ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-14 17:55:50.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-10-14 18:05:36.000000000 +0200 @@ -1131,6 +1131,45 @@ int ipoib_add_pkey_attr(struct net_devic return device_create_file(&dev->dev, &dev_attr_pkey); } +static ssize_t show_umcast(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); + + if (test_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags)) + return sprintf(buf, "1\n"); + else + return sprintf(buf, "0\n"); +} + +static ssize_t set_umcast(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); + unsigned long umcast_val = simple_strtoul(buf, NULL, 0); + + if (umcast_val > 0) { + set_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags); + ipoib_warn(priv, "ignoring multicast groups joined directly " + "by user space\n"); + return count; + } + + if (!umcast_val) { + clear_bit(IPOIB_FLAG_ADMIN_UMCAST_ALLOWED, &priv->flags); + return count; + } + + return -EINVAL; +} +static DEVICE_ATTR(umcast, S_IWUSR | S_IRUGO, show_umcast, set_umcast); + +int 
ipoib_add_umcast_attr(struct net_device *dev) +{ + return device_create_file(&dev->dev, &dev_attr_umcast); +} + static void set_tx_csum(struct net_device *dev, struct ib_device *hca) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -1233,6 +1272,8 @@ static struct net_device *ipoib_add_port goto sysfs_failed; if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; + if (ipoib_add_umcast_attr(priv->dev)) + goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_create_child)) goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_delete_child)) Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2007-10-14 17:46:43.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2007-10-14 17:57:10.000000000 +0200 @@ -119,6 +119,8 @@ int ipoib_vlan_add(struct net_device *pd goto sysfs_failed; if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; + if (ipoib_add_umcast_attr(priv->dev)) + goto sysfs_failed; if (device_create_file(&priv->dev->dev, &dev_attr_parent)) goto sysfs_failed; From eli at mellanox.co.il Mon Oct 15 01:01:32 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 15 Oct 2007 10:01:32 +0200 Subject: [ofa-general] [PATCH]: IB/ipoib control LRO with a module param Message-ID: <1192435292.7337.154.camel@mtls03> Allow to control LRO Use a module parameter to control whether LRO is enabled or disabled. This is required when the host is configured as a router, in which case using LRO would cause IP packets to be forwarded with wrong TCP checksum. 
Signed-off-by: Eli Cohen --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-10-14 12:10:17.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-10-14 12:34:16.000000000 +0200 @@ -52,6 +52,10 @@ MODULE_PARM_DESC(data_debug_level, "Enable data path debug tracing if > 0"); #endif +static int lro_enabled = 1; +module_param(lro_enabled, int, 0644); +MODULE_PARM_DESC(lro_enabled, "Enable LRO when nonzero"); + static DEFINE_MUTEX(pkey_mutex); struct ipoib_ah *ipoib_create_ah(struct net_device *dev, @@ -245,7 +249,7 @@ static void ipoib_ib_handle_rx_wc(struct checksum offload I put it here */ if ((((struct iphdr *)(skb->data))->ihl) == 5) { skb->ip_summed = CHECKSUM_UNNECESSARY; - if (!ipoib_lro_rx(priv, skb)) + if (lro_enabled && !ipoib_lro_rx(priv, skb)) goto repost; } } From kliteyn at dev.mellanox.co.il Mon Oct 15 01:48:03 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 15 Oct 2007 10:48:03 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <20071015035309.GN12364@sashak.voltaire.com> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> Message-ID: <47132943.9090301@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 00:32 Mon 15 Oct , Yevgeny Kliteynik wrote: >>>>>> Added CA-by-name hash to the QoS policy object and >>>>> Why it is called "CA"-by-name? In the code below I see that hash is >>>>> created for all nodes (including switches and routers). 
>>>> In osm_qos_policy.c: >>>> >>>> if (p_node->node_info.node_type == IB_NODE_TYPE_CA) >>>> st_insert(p_qos_policy->p_ca_hash, >>>> (st_data_t)p_node->print_desc, >>>> (st_data_t)p_node); >>> Ok, so what is wrong with switches and routers? Why it cannot be >>> specified by "name"? >> Switches have the NodeDescription filled by FW, and it's usually the >> same string for all the switches. > > It must not be same. Also I suppose that node description can be changed > at least for some managed switches even today. Come on, man... How many cluster administrators that you know will actually go and set NodeDescription on switches??? I don't want to give user an easy way to make mistakes. If the user wants to include all the switches in the port group, there's an easy way to do it just by saying "node-type: SWITCH". If the user is so advanced that he wants to create port groups with a specific switches, it can be done by specifying guids. >> Also, what would be the meaning of >> "host id" for switches? > > I don't like "host id" approach - it assume a predefined node > description format. > >> As for routers - I don't know what's going on there, since I didn't >> get the chance to lay my hands on it yet (those IB routers are hard >> to get right now :-) > > So I think it is better to include switches and routers here and to use > "node name" instead of "CA name". In worst case when all records are > same we will lost one or two hash table entries per fabric. >>>> From /etc/init.d/openibd: >>>> >>>> # Add node description to sysfs >>>> IBSYSDIR="/sys/class/infiniband" >>>> if [ -d ${IBSYSDIR} ]; then >>>> declare -i hca_id=1 >>>> for hca in ${IBSYSDIR}/* >>>> do >>>> if [ -e ${hca}/node_desc ]; then >>>> echo -n "$(hostname -s) HCA-${hca_id}" >> >>>> ${hca}/node_desc >>>> fi >>>> let hca_id++ >>>> done >>>> fi >>> This script is optional, even when used the way how node_desc is >>> generated can be easily changed. 
I think it is not good idea to copy the >>> algorithm to OpenSM code and in this way to enforce an user to use the >>> only this hardcoded node_desc format. >>> Actually this (or another similar) script is sort of config file, as >>> well as qos policy file, and both are in admin's hands. So basically I >>> agree that it is ok to require to define node_desc (if an admin wishes >>> to use names for her QoS). _But_ we cannot dictate how it should be >>> generated - it clearly must be user's and not our choice. >>> So instead of approaching hardcoded node_desc format I think that name >>> definition in qos policy file should refer node_desc as whole string >>> (well, in improved case it could be single substring with wild cards). >> Ok, so let's elaborate on this. >> >> Currently NodeDescription is filled by the openibd script. >> Although this script can be modified by admin, I doubt that an average >> admin would like to tweak it. Thus, I believe that in most cases the >> NodeDescription will look this way: "node-id hca-num". > > It must not be so. For instance I'm not using this script at all, OpenSM > must not be installed as part of OFED, nodes can run other than Linux > OSes (Solar*, Win*, etc.), etc. > >> If we want to allow port names to have number ranges or asterisk >> (and we do want it), then we have to have *some* format. > > Why? What is wrong with plain strings? > > Look at example - you are able to use wildcards with 'ls' command even > if file names doesn't have any predefined format. Right? > >> So here's my suggestion: >> 1. First of all, when the ca-by-name hash is created, osm will check that >> the NodeDescriptions are unique. If they aren't - parsing of the port >> names will be off, even if specified in the policy file. > > It is overkill IMO - an user is responsible to setup things properly, > finally it is her choice in which tricky way to use it. > >> 2. 
If a port name doesn't have any special characters, it will be compared >> to the NodeDescription as is, and it'd better be unique > > Ok. > >> 3. If the admin would like to include num. ranges and asterisks in the >> port name, he has to make sure that the NodeDescription is created >> like it is created now by openibd. > > Again, why this limitation is needed? What is wrong with wildcards like > "myname*", "hostname[1-3] *", etc.? In the policy file the user specifies *port* names, not *node* names. You HAVE to have SOME format in order to understand where is the port number. -- Yevgeny >> Sounds ok? > > (2) + (3 without format limitation) looks fine for me. > > Sasha > From vlad at lists.openfabrics.org Mon Oct 15 02:56:12 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 15 Oct 2007 02:56:12 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071015-0200 daily build status Message-ID: <20071015095612.319E6E608F6@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 
Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on ppc64 with linux-2.6.23 Log: -include include/linux/autoconf.h \ -include /home/vlad/tmp/ofa_1_3_kernel-20071015-0200_linux-2.6.23_ppc64_check/include/linux/autoconf.h \ ' \ modules make[1]: Entering directory `/home/vlad/kernel.org/ppc64/linux-2.6.23' Makefile:492: /home/vlad/kernel.org/ppc64/linux-2.6.23/arch/ppc64/Makefile: No such file or directory make[1]: *** No rule to make target `/home/vlad/kernel.org/ppc64/linux-2.6.23/arch/ppc64/Makefile'. Stop. 
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.23' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on powerpc with linux-2.6.23 Log: /home/vlad/tmp/ofa_1_3_kernel-20071015-0200_linux-2.6.23_powerpc_check/drivers/infiniband/core/addr.c:361: error: 'struct neighbour' has no member named 'nud_state' /home/vlad/tmp/ofa_1_3_kernel-20071015-0200_linux-2.6.23_powerpc_check/drivers/infiniband/core/addr.c: In function 'addr_init': /home/vlad/tmp/ofa_1_3_kernel-20071015-0200_linux-2.6.23_powerpc_check/drivers/infiniband/core/addr.c:376: error: 'ENOMEM' undeclared (first use in this function) make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071015-0200_linux-2.6.23_powerpc_check/drivers/infiniband/core/addr.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071015-0200_linux-2.6.23_powerpc_check/drivers/infiniband/core] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071015-0200_linux-2.6.23_powerpc_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071015-0200_linux-2.6.23_powerpc_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/powerpc/linux-2.6.23' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sashak at voltaire.com Mon Oct 15 03:39:18 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 12:39:18 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <47132943.9090301@dev.mellanox.co.il> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> Message-ID: <20071015103918.GO12364@sashak.voltaire.com> On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote: > >> Switches 
have the NodeDescription filled by FW, and it's usually the > >> same string for all the switches. > > It must not be same. Also I suppose that node description can be changed > > at least for some managed switches even today. > > Come on, man... > How many cluster administrators that you know will actually go and set > NodeDescription on switches??? I know at least one asked for this. > I don't want to give user an easy way to make mistakes. > If the user wants to include all the switches in the port group, there's an > easy way to do it just by saying "node-type: SWITCH". > If the user is so advanced that he wants to create port groups with a > specific > switches, it can be done by specifying guids. The same is true for CAs. So what is your point with "by name" resolution then? > >> 3. If the admin would like to include num. ranges and asterisks in the > >> port name, he has to make sure that the NodeDescription is created > >> like it is created now by openibd. > > Again, why this limitation is needed? What is wrong with wildcards like > > "myname*", "hostname[1-3] *", etc.? > > In the policy file the user specifies *port* names, not *node* names. Sure, I meant only node's component here. Have it in 'node name' + 'port number' form. What is easier? 
Sasha From hrosenstock at xsigo.com Mon Oct 15 04:31:37 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 15 Oct 2007 04:31:37 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com> Message-ID: <1192447897.4962.162.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-12 at 15:14 -0700, Hal Rosenstock wrote: > On Fri, 2007-10-12 at 14:59 -0700, Hal Rosenstock wrote: > > On Fri, 2007-10-12 at 14:47 -0700, Greg Kurtzer wrote: > > > ibwarn: [25274] pma_query: lid 1 port 1 > > > ibwarn: [25274] mad_rpc: data offs 64 sz 192 > > > mad data > > > 0101 0000 0000 0014 0000 0000 0000 0000 > > > > Thanks; AllPortSelect is off in CapabilityMask which is consistent with > > the behavior. (It would be trivial for those HCA PMAs to indicate > > AllPortSelect is supported (since it's the same as supporting one port) > > and then all would be fine but that's not a requirement). > > > > A check should be added in perfquery for this.I will generate a patch > > for that but that won't fix the problem. > > Actually, perfquery gets the number of ports and could do multiple > PerfGets, one per port, and accumulate the "all" ports. > > This approach may be better than dealing with the scripts. Can you try this and let me know if this resolves your issue ? 
The patch is against the master (OFED 1.3): diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 148e452..c976fc5 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -42,7 +43,7 @@ #include #include -#define __BUILD_VERSION_TAG__ 1.2.2 +#define __BUILD_VERSION_TAG__ 1.2.3 #include #include #include @@ -99,6 +100,9 @@ main(int argc, char **argv) int ca_port = 0; int extended = 0; uint16_t cap_mask; + int allports = 0; + int node_type, num_ports; + uint8_t data[IB_SMP_DATA_SIZE]; static char const str_opts[] = "C:P:s:t:dGearRVhu"; static const struct option long_opts[] = { @@ -191,6 +195,35 @@ main(int argc, char **argv) /* PerfMgt ClassPortInfo is a required attribute */ if (!perf_classportinfo_query(pc, &portid, port, timeout)) IBERROR("classportinfo query"); + /* ClassPortInfo should be supported as part of libibmad */ + memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ + cap_mask = ntohs(cap_mask); + if (!(cap_mask & 0x100)) /* bit 8 is AllPortSelect */ + if (port == 255) { + allports = 1; + IBWARN("AllPortSelect not supported"); + } + + if (allports == 1) { + + /* + * Simulate all ports support in PMA + * Determine node type, number of (physical) ports, + * and, if switch, whether SP0 is enhanced + * to determine first and last port to query + */ + + /* For now, support single port CAs */ + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) + IBERROR("smp query nodeinfo failed"); + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); + if (node_type != IB_NODE_CA) /* NodeType other than CA ? 
*/ + IBERROR("smp query nodeinfo: Node type not CA"); + mad_decode_field(data, IB_NODE_NPORTS_F, &num_ports); + if (num_ports != 1) + IBERROR("smp query nodeinfo: %d ports; only 1 supported currently", num_ports); + port = num_ports; + } if (reset_only) goto do_reset; @@ -201,9 +234,6 @@ main(int argc, char **argv) mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); } else { - /* Should ClassPortInfo be implemented in libibmad ? */ - memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ - cap_mask = ntohs(cap_mask); if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask); > > -- Hal > > > I will try to find time to look at the scripts and see what it will take > > to fix this. Where AllPortSelect is not supported, they need to drop > > back to individual ports. > > > > -- Hal From kliteyn at dev.mellanox.co.il Mon Oct 15 04:39:30 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 15 Oct 2007 13:39:30 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <20071015103918.GO12364@sashak.voltaire.com> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> 
<47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> Message-ID: <47135172.9080208@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote: >>>> Switches have the NodeDescription filled by FW, and it's usually the >>>> same string for all the switches. >>> It must not be same. Also I suppose that node description can be changed >>> at least for some managed switches even today. >> Come on, man... >> How many cluster administrators that you know will actually go and set >> NodeDescription on switches??? > > I know at least one asked for this. > >> I don't want to give user an easy way to make mistakes. >> If the user wants to include all the switches in the port group, there's an >> easy way to do it just by saying "node-type: SWITCH". >> If the user is so advanced that he wants to create port groups with a >> specific >> switches, it can be done by specifying guids. > > The same is true for CAs. So what is your point with "by name" > resolution then? I'm sure you realize the difference between host names and switch names. But never mind, forget it. >>>> 3. If the admin would like to include num. ranges and asterisks in the >>>> port name, he has to make sure that the NodeDescription is created >>>> like it is created now by openibd. >>> Again, why this limitation is needed? What is wrong with wildcards like >>> "myname*", "hostname[1-3] *", etc.? >> In the policy file the user specifies *port* names, not *node* names. > > Sure, I meant only node's component here. Have it in 'node name' + 'port > number' form. What is easier? 'node_name'/Pn Where node_name is compared AS IS (including possible white spaces, slashes, brackets or anything else) to the NodeDescription content, and 'n' is port number. In the future I'll think about enhancing the node_name parsing with wild cards. Looks OK? 
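For illustration only, the 'node_name'/Pn form described above could be parsed along these lines. This is a rough sketch and not OpenSM code: parse_port_name() and port_name_matches() are hypothetical names, and it assumes the quotes around node_name are not part of the string being parsed. The split is done at the last "/P" so that node_name may itself contain slashes, white spaces or brackets, and the comparison against NodeDescription is exact, with no wildcards.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of OpenSM): split "node_name/Pn" at the
 * LAST "/P", so node_name may contain slashes, spaces or brackets.
 * Returns 0 on success, -1 if the string is not in that form.
 */
static int parse_port_name(const char *str, char *node_name, size_t size,
			   unsigned *port_num)
{
	const char *sep = NULL, *p = str;
	char *end;
	unsigned long n;
	size_t len;

	assert(str && node_name && port_num);

	/* remember the last occurrence of "/P" */
	while ((p = strstr(p, "/P")) != NULL) {
		sep = p;
		p += 2;
	}
	if (!sep || sep == str)
		return -1;	/* no separator, or empty node name */

	/* everything after "/P" must be a decimal port number */
	if (sep[2] < '0' || sep[2] > '9')
		return -1;
	n = strtoul(sep + 2, &end, 10);
	/* IB port numbers are 8 bits wide; 255 is reserved for "all ports" */
	if (*end != '\0' || n > 254)
		return -1;

	len = (size_t)(sep - str);
	if (len >= size)
		return -1;
	memcpy(node_name, str, len);
	node_name[len] = '\0';
	*port_num = (unsigned)n;
	return 0;
}

/* node_name is compared AS IS to the NodeDescription content */
static int port_name_matches(const char *str, const char *node_desc,
			     unsigned port)
{
	char name[65];	/* NodeDescription is at most 64 bytes */
	unsigned n;

	if (parse_port_name(str, name, sizeof name, &n) < 0)
		return 0;
	return n == port && strcmp(name, node_desc) == 0;
}
```

With this sketch a trailing white space in NodeDescription already makes the match fail, which is what "AS IS" comparison implies; wildcard support would have to replace the strcmp() call.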
-- Yevgeny > Sasha > From sashak at voltaire.com Mon Oct 15 05:08:48 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 14:08:48 +0200 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <4712E990.9020906@Sun.COM> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM> <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> <470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> Message-ID: <20071015120848.GP12364@sashak.voltaire.com> On 09:46 Mon 15 Oct , Sumit Gaur - Sun Microsystem wrote: > Thanks Sasha for patch I will try it and looks like it would work. Looking more at this I found that if we are going to identify opened umad device by file descriptor value (as it is done in the patch I sent) we perfectly can remove that 'struct Port' completely. This is simpler approach in general, supports multiple open()s and as side effect turns libibumad to be thread-safe. Hal, do you remember what was original motivation to track opened umad devices internally in libibumad (with 'struct Port ports[]'). Am I missing something? The patch is below. 
Sasha diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 589684c..a3bbf54 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -80,70 +80,14 @@ typedef struct ib_user_mad_reg_req { int umaddebug = 0; -#define UMAD_DEV_NAME_SZ 32 #define UMAD_DEV_FILE_SZ 256 static char *def_ca_name = "mthca0"; static int def_ca_port = 1; -typedef struct Port { - char dev_file[UMAD_DEV_FILE_SZ]; - char dev_name[UMAD_DEV_NAME_SZ]; - int dev_port; - int dev_fd; - int id; -} Port; - -static Port ports[UMAD_MAX_PORTS]; - /************************************* * Port */ -static Port * -port_alloc(int portid, char *dev, int portnum) -{ - Port *port = ports + portid; - - if (portid < 0 || portid >= UMAD_MAX_PORTS) { - IBWARN("bad umad portid %d", portid); - errno = EINVAL; - return 0; - } - - if (port->dev_name[0]) { - IBWARN("umad port id %d is already allocated for %s %d", - portid, port->dev_name, port->dev_port); - errno = EBUSY; - return 0; - } - - strncpy(port->dev_name, dev, UMAD_CA_NAME_LEN); - port->dev_port = portnum; - port->id = portid; - - return port; -} - -static Port * -port_get(int portid) -{ - Port *port = ports + portid; - - if (portid < 0 || portid >= UMAD_MAX_PORTS) - return 0; - - if (port->dev_name[0] == 0) - return 0; - - return port; -} - -static void -port_free(Port *port) -{ - memset(port, 0, sizeof *port); -} - static int find_cached_ca(char *ca_name, umad_ca_t *ca) { @@ -571,8 +515,8 @@ umad_get_ca_portguids(char *ca_name, uint64_t *portguids, int max) int umad_open_port(char *ca_name, int portnum) { - int umad_id; - Port *port; + char dev_file[UMAD_DEV_FILE_SZ]; + int umad_id, fd; TRACE("ca %s port %d", ca_name, portnum); @@ -584,19 +528,16 @@ umad_open_port(char *ca_name, int portnum) if ((umad_id = dev_to_umad_id(ca_name, portnum)) < 0) return -EINVAL; - if (!(port = port_alloc(umad_id, ca_name, portnum))) - return -errno; - - snprintf(port->dev_file, sizeof port->dev_file - 1, "%s/umad%d", + snprintf(dev_file, sizeof dev_file - 1, 
"%s/umad%d", UMAD_DEV_DIR , umad_id); - if ((port->dev_fd = open(port->dev_file, O_RDWR|O_NONBLOCK)) < 0) { - DEBUG("open %s failed: %s", port->dev_file, strerror(errno)); + if ((fd = open(dev_file, O_RDWR|O_NONBLOCK)) < 0) { + DEBUG("open %s failed: %s", dev_file, strerror(errno)); return -EIO; } - DEBUG("opened %s fd %d portid %d", port->dev_file, port->dev_fd, port->id); - return port->id; + DEBUG("opened %s fd %d portid %d", dev_file, fd, umad_id); + return fd; } int @@ -667,26 +608,16 @@ umad_release_port(umad_port_t *port) } int -umad_close_port(int portid) +umad_close_port(int fd) { - Port *port; - - TRACE("portid %d", portid); - if (!(port = port_get(portid))) - return -EINVAL; - - close(port->dev_fd); - - port_free(port); - - DEBUG("closed %s fd %d", port->dev_file, port->dev_fd); + close(fd); + DEBUG("closed fd %d", fd); return 0; } void * umad_get_mad(void *umad) { - TRACE("umad %p", umad); return ((struct ib_user_mad *)umad)->data; } @@ -753,21 +684,15 @@ umad_set_addr_net(void *umad, int dlid, int dqp, int sl, int qkey) } int -umad_send(int portid, int agentid, void *umad, int length, +umad_send(int fd, int agentid, void *umad, int length, int timeout_ms, int retries) { struct ib_user_mad *mad = umad; - Port *port; int n; - TRACE("portid %d agentid %d umad %p timeout %u", - portid, agentid, umad, timeout_ms); + TRACE("fd %d agentid %d umad %p timeout %u", + fd, agentid, umad, timeout_ms); errno = 0; - if (!(port = port_get(portid))) { - if (!errno) - errno = EINVAL; - return -EINVAL; - } mad->timeout_ms = timeout_ms; mad->retries = retries; @@ -776,7 +701,7 @@ umad_send(int portid, int agentid, void *umad, int length, if (umaddebug > 1) umad_dump(mad); - n = write(port->dev_fd, mad, length + sizeof *mad); + n = write(fd, mad, length + sizeof *mad); if (n == length + sizeof *mad) return 0; @@ -806,33 +731,26 @@ dev_poll(int fd, int timeout_ms) } int -umad_recv(int portid, void *umad, int *length, int timeout_ms) +umad_recv(int fd, void *umad, int 
*length, int timeout_ms) { struct ib_user_mad *mad = umad; - Port *port; int n; errno = 0; - TRACE("portid %d umad %p timeout %u", portid, umad, timeout_ms); + TRACE("fd %d umad %p timeout %u", fd, umad, timeout_ms); if (!umad || !length) { errno = EINVAL; return -EINVAL; } - if (!(port = port_get(portid))) { - if (!errno) - errno = EINVAL; - return -EINVAL; - } - - if (timeout_ms && (n = dev_poll(port->dev_fd, timeout_ms)) < 0) { + if (timeout_ms && (n = dev_poll(fd, timeout_ms)) < 0) { if (!errno) errno = -n; return n; } - n = read(port->dev_fd, umad, sizeof *mad + *length); + n = read(fd, umad, sizeof *mad + *length); VALGRIND_MAKE_MEM_DEFINED(umad, sizeof *mad + *length); @@ -861,43 +779,29 @@ umad_recv(int portid, void *umad, int *length, int timeout_ms) } int -umad_poll(int portid, int timeout_ms) +umad_poll(int fd, int timeout_ms) { - Port *port; - - TRACE("portid %d timeout %u", portid, timeout_ms); - if (!(port = port_get(portid))) - return -EINVAL; - - return dev_poll(port->dev_fd, timeout_ms); + TRACE("fd %d timeout %u", fd, timeout_ms); + return dev_poll(fd, timeout_ms); } int -umad_get_fd(int portid) +umad_get_fd(int fd) { - Port *port; - - TRACE("portid %d", portid); - if (!(port = port_get(portid))) - return -EINVAL; - - return port->dev_fd; + TRACE("fd %d", fd); + return fd; } int -umad_register_oui(int portid, int mgmt_class, uint8_t rmpp_version, +umad_register_oui(int fd, int mgmt_class, uint8_t rmpp_version, uint8_t oui[3], uint32_t method_mask[4]) { struct ib_user_mad_reg_req req; - Port *port; - TRACE("portid %d mgmt_class %u rmpp_version %d oui 0x%x%x%x method_mask %p", - portid, mgmt_class, (int)rmpp_version, (int)oui[0], (int)oui[1], + TRACE("fd %d mgmt_class %u rmpp_version %d oui 0x%x%x%x method_mask %p", + fd, mgmt_class, (int)rmpp_version, (int)oui[0], (int)oui[1], (int)oui[2], method_mask); - if (!(port = port_get(portid))) - return -EINVAL; - if (mgmt_class < 0x30 || mgmt_class > 0x4f) { DEBUG("mgmt class %d not in vendor range 2", 
mgmt_class); return -EINVAL; @@ -916,31 +820,27 @@ umad_register_oui(int portid, int mgmt_class, uint8_t rmpp_version, VALGRIND_MAKE_MEM_DEFINED(&req, sizeof req); - if (!ioctl(port->dev_fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { - DEBUG("portid %d registered to use agent %d qp %d class 0x%x oui %p", - portid, req.id, req.qpn, req.mgmt_class, oui); + if (!ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { + DEBUG("fd %d registered to use agent %d qp %d class 0x%x oui %p", + fd, req.id, req.qpn, req.mgmt_class, oui); return req.id; /* return agentid */ } - DEBUG("portid %d registering qp %d class 0x%x version %d oui %p failed: %m", - portid, req.qpn, req.mgmt_class, req.mgmt_class_version, oui); + DEBUG("fd %d registering qp %d class 0x%x version %d oui %p failed: %m", + fd, req.qpn, req.mgmt_class, req.mgmt_class_version, oui); return -EPERM; } int -umad_register(int portid, int mgmt_class, int mgmt_version, +umad_register(int fd, int mgmt_class, int mgmt_version, uint8_t rmpp_version, uint32_t method_mask[4]) { struct ib_user_mad_reg_req req; - Port *port; uint32_t oui = htonl(IB_OPENIB_OUI); int qp; - TRACE("portid %d mgmt_class %u mgmt_version %u rmpp_version %d method_mask %p", - portid, mgmt_class, mgmt_version, rmpp_version, method_mask); - - if (!(port = port_get(portid))) - return -EINVAL; + TRACE("fd %d mgmt_class %u mgmt_version %u rmpp_version %d method_mask %p", + fd, mgmt_class, mgmt_version, rmpp_version, method_mask); req.qpn = qp = (mgmt_class == 0x1 || mgmt_class == 0x81) ? 
0 : 1; req.mgmt_class = mgmt_class; @@ -956,28 +856,22 @@ umad_register(int portid, int mgmt_class, int mgmt_version, VALGRIND_MAKE_MEM_DEFINED(&req, sizeof req); - if (!ioctl(port->dev_fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { - DEBUG("portid %d registered to use agent %d qp %d", - portid, req.id, qp); + if (!ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { + DEBUG("fd %d registered to use agent %d qp %d", + fd, req.id, qp); return req.id; /* return agentid */ } - DEBUG("portid %d registering qp %d class 0x%x version %d failed: %m", - portid, qp, mgmt_class, mgmt_version); + DEBUG("fd %d registering qp %d class 0x%x version %d failed: %m", + fd, qp, mgmt_class, mgmt_version); return -EPERM; } int -umad_unregister(int portid, int agentid) +umad_unregister(int fd, int agentid) { - Port *port; - - TRACE("portid %d unregistering agent %d", portid, agentid); - - if (!(port = port_get(portid))) - return -EINVAL; - - return ioctl(port->dev_fd, IB_USER_MAD_UNREGISTER_AGENT, &agentid); + TRACE("fd %d unregistering agent %d", fd, agentid); + return ioctl(fd, IB_USER_MAD_UNREGISTER_AGENT, &agentid); } int From keshetti.mahesh at yahoo.co.in Mon Oct 15 05:04:47 2007 From: keshetti.mahesh at yahoo.co.in (Keshetti Mahesh) Date: Mon, 15 Oct 2007 17:34:47 +0530 (IST) Subject: [ofa-general] ***SPAM*** [query] ucast file loading with openSM Message-ID: <522058.84170.qm@web8323.mail.in.yahoo.com> Is it compulsory to give lid matrix file while loading ucast routing file with openSM? I observed errors when I've tried to load an ucast file (-U) without giving lid matrix file (-M) to openSM. (I am using OFED-1.2) regards, Mahesh 
From keshetti85-student at yahoo.co.in Mon Oct 15 05:23:23 2007 From: keshetti85-student at yahoo.co.in (keshetti85-student at yahoo.co.in) Date: Mon, 15 Oct 2007 17:53:23 +0530 (IST) Subject: [ofa-general] ***SPAM*** [query] ucast file loading with openSM Message-ID: <245037.48085.qm@web8327.mail.in.yahoo.com> Is it compulsory to give lid matrix file while loading unicast routing file with openSM (in OFED-1.2) ? I got some errors when I've tried to load an ucast file (-U) without giving lid matrix file (-M) to openSM. regards, Mahesh From sashak at voltaire.com Mon Oct 15 05:37:27 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 14:37:27 +0200 Subject: [ofa-general] [query] ucast file loading with openSM In-Reply-To: <522058.84170.qm@web8323.mail.in.yahoo.com> References: <522058.84170.qm@web8323.mail.in.yahoo.com> Message-ID: <20071015123727.GQ12364@sashak.voltaire.com> On 17:34 Mon 15 Oct , Keshetti Mahesh wrote: > Is it compulsory to give lid matrix file while loading ucast routing file > with openSM? No, it should be optional. > I observed errors when I've tried to load an ucast file (-U) without giving > lid matrix file (-M) to openSM. (I am using OFED-1.2) Do you mean like this: do_lid_matrix_file_load: ERR 6304: lid matrix file name is not defined; using default lid matrix generation algorithm If so, just ignore - there should be warning. I will fix. 
Sasha From sashak at voltaire.com Mon Oct 15 05:39:14 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 14:39:14 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <47135172.9080208@dev.mellanox.co.il> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> <47135172.9080208@dev.mellanox.co.il> Message-ID: <20071015123914.GR12364@sashak.voltaire.com> On 13:39 Mon 15 Oct , Yevgeny Kliteynik wrote: > > 'node_name'/Pn > Where node_name is compared AS IS (including possible white spaces, slashes, > brackets or anything else) to the NodeDescription content, and 'n' is port > number. > In the future I'll think about enhancing the node_name parsing with wild > cards. > > Looks OK? Perfect! 
Sasha From hrosenstock at xsigo.com Mon Oct 15 06:18:53 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 15 Oct 2007 06:18:53 -0700 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <20071015120848.GP12364@sashak.voltaire.com> References: <46A9C633.7040302@Sun.COM> <470B2E58.2040509@Sun.COM> <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> <470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> Message-ID: <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> On Mon, 2007-10-15 at 14:08 +0200, Sasha Khapyorsky wrote: > On 09:46 Mon 15 Oct , Sumit Gaur - Sun Microsystem wrote: > > Thanks Sasha for patch I will try it and looks like it would work. > > Looking more at this I found that if we are going to identify opened > umad device by file descriptor value (as it is done in the patch I sent) > we perfectly can remove that 'struct Port' completely. This is simpler > approach in general, supports multiple open()s and as side effect turns > libibumad to be thread-safe. > > Hal, do you remember what was original motivation to track opened umad > devices internally in libibumad (with 'struct Port ports[]'). Am I > missing something? Sasha, I don't recall as this is from a very very long time ago but in looking at this, I agree with your assessment that it can be simplified (and there appears to be no real need for what is contained in struct Port other than the fd). The only downside I see is the subtle change in the public umad_ APIs changing int portid -> int fd. I suppose all the tools would continue to work without change here even if libibumad were changed underneath it, right ? BTW, when you do this, the umad man pages should all be updated for this change. 
-- Hal From sashak at voltaire.com Mon Oct 15 06:54:32 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 15:54:32 +0200 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> References: <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> <470C9C55.3090304@Sun.COM> 
<1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071015135432.GU12364@sashak.voltaire.com> Hi Hal, On 06:18 Mon 15 Oct , Hal Rosenstock wrote: > > I don't recall as this is from a very very long time ago but in looking > at this, I agree with your assessment that it can be simplified (and > there appears to be no real need for what is contained in struct Port > other than the fd). The only downside I see is the subtle change in the > public umad_ APIs changing int portid -> int fd. There is no API change at all - umad_open_port() still return unique integer descriptor as it was before. Here we are only changing undocumented (at least I'm not able to find any public description about what umad_open_port() should return) behavior of this API (by replacing mad device number as umad_open_port() return value, but if we want to support multiple open()s there is no choice - device number is not suitable for this). > I suppose all the tools > would continue to work without change here even if libibumad were > changed underneath it, right ? Right. > BTW, when you do this, the umad man pages > should all be updated for this change. I see only that umad_open_port.3 should be fixed - it says that return value is "0" on success, which is not correct anyway. Not really related to the patch. Do you see another places to fix in man? 
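A rough sketch of the point about multiple open()s, for readers following along. It is not libibumad code: `open_by_devnum()`, `close_by_devnum()` and `open_by_fd()` are hypothetical helpers, and `MAX_DEVS` is an arbitrary limit chosen for illustration. The first pair models a handle that is the device number (one slot per device, so a second open of the same device has to fail), the last one models a handle that is simply the descriptor returned by open(2), where each open yields its own unique value.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_DEVS 4  /* hypothetical limit, for illustration only */

/* Model of the old scheme: one tracking slot per device number. */
static int slot_busy[MAX_DEVS];

int open_by_devnum(int devnum)
{
    if (devnum < 0 || devnum >= MAX_DEVS)
        return -EINVAL;
    if (slot_busy[devnum])
        return -EBUSY;  /* a second open of the same device cannot succeed */
    slot_busy[devnum] = 1;
    return devnum;      /* the handle is the device number itself */
}

void close_by_devnum(int devnum)
{
    if (devnum >= 0 && devnum < MAX_DEVS)
        slot_busy[devnum] = 0;
}

/* Model of the new scheme: the handle is whatever open(2) returns. */
int open_by_fd(const char *dev_file)
{
    int fd;

    assert(dev_file != NULL);
    fd = open(dev_file, O_RDWR | O_NONBLOCK);
    return fd < 0 ? -EIO : fd;  /* every successful open gets a distinct fd */
}
```

With the first scheme, opening device 1 twice returns 1 and then -EBUSY. With the second, opening the same device file twice (for example "/dev/null" as a stand-in) returns two different descriptors, and no shared table is needed.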
Sasha From sashak at voltaire.com Mon Oct 15 07:03:02 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 16:03:02 +0200 Subject: [ofa-general] [PATCH] libibumad/man: fix umad_open_port man page In-Reply-To: <20071015135432.GU12364@sashak.voltaire.com> References: <470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> Message-ID: <20071015140302.GV12364@sashak.voltaire.com> Fix umad_open_port man page - describe correct return values. Signed-off-by: Sasha Khapyorsky --- libibumad/man/umad_open_port.3 | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/libibumad/man/umad_open_port.3 b/libibumad/man/umad_open_port.3 index 0051f4d..d3c6056 100644 --- a/libibumad/man/umad_open_port.3 +++ b/libibumad/man/umad_open_port.3 @@ -22,7 +22,7 @@ for details). .fi .SH "RETURN VALUE" .B umad_open_port() -returns 0 on success, and a negative value on error as follows: +returns an uniquie 0 or positive value of umad device descriptor on success, and a negative value on error as follows: -ENODEV IB device can\'t be resolved -EINVAL port is not valid (bad .I portnum\fR -- 1.5.3.4.206.g58ba4 From jsquyres at cisco.com Mon Oct 15 06:51:03 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 15 Oct 2007 09:51:03 -0400 Subject: [ofa-general] Save the date: OFA Developer's Summit: November 15-16 in Nevada In-Reply-To: <20070927221328.GA16000@cuprite.pathscale.com> References: <20070927221328.GA16000@cuprite.pathscale.com> Message-ID: <71E64D94-510F-4CCD-84FC-9BF9921AEEDA@cisco.com> Johann -- Is there an agenda worked up yet? 
I'm trying to make travel arrangements and wanted to see if flying out Friday afternoon was a possibility. On Sep 27, 2007, at 6:13 PM, Johann George wrote: > We hope you will plan on attending the OpenFabrics Developer's > Summit being held November 15-16, 2007 at the Boomtown Hotel in > Verdi, Nevada. It will begin at 1pm on Thursday, November 15th > and run until the early evening. Friday's session will begin at > 8am and end at noon. > > Last year, this turned out to be a good forum to work through issues > that required collaboration. If you have items that ought to be on > the agenda, please email them to me. We will have a proposed agenda > shortly. > > This event takes place at the tail end of SC07. The Boomtown hotel is > about a twenty minute drive from the Reno-Sparks convention center > where SC07 is being held. Rooms are available if needed at the > Boomtown hotel starting at $70/night. > > Thanks for your participation. > > Johann > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general -- Jeff Squyres Cisco Systems From jsquyres at cisco.com Mon Oct 15 07:08:10 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 15 Oct 2007 10:08:10 -0400 Subject: [ofa-general] Open MPI update Message-ID: I have updated Open MPI in openfabrics.org:~jsquyres/ofed_1.2 to be version 1.2.4. It includes configuration for Mellanox ConnectX hardware (which helps latency) and some other bug fixes. The new version should start getting picked up in the 1.3 nightly builds. 
-- Jeff Squyres Cisco Systems From jsquyres at cisco.com Mon Oct 15 07:13:56 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 15 Oct 2007 10:13:56 -0400 Subject: [ofa-general] Downloads web page Message-ID: Shouldn't the "OFED" link be the first/most prominent download link on http://www.openfabrics.org/downloads.htm? All the others are supplemental and for unusual / specific user needs. For newbies, there's no indication that OFED is *the* main distribution. The OFED link is buried deep down in the list; it is not indicated at all that OFED is what they should be downloading (rather than all the individual libraries). Indeed, the acronym "OFED" is not even defined at all. More specifically: can the download page be re-worked to be a bit more optimized for the common and newbie use-cases? Right now, it's both non-obvious and sub optimal for what is expected to be the most common case (downloading the entire OFED distribution -- you may even have to scroll down to find the OFED link). Thanks. -- Jeff Squyres Cisco Systems From sashak at voltaire.com Mon Oct 15 07:28:57 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 16:28:57 +0200 Subject: [ofa-general] [PATCH] libibumad: remove opened umad devices internal tracking. In-Reply-To: <20071015135432.GU12364@sashak.voltaire.com> References: <470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> Message-ID: <20071015142857.GX12364@sashak.voltaire.com> Opened umad devices will be tracked by its file descriptor and not by umad device number. 
This is simpler in general, doesn't change the formal API, and in this way multiple opens of the same device are supported. Also this doesn't require internal device tracking in libibumad (the struct Port ports[] array). That tracking is removed completely here, which makes the library thread-safe. Signed-off-by: Sasha Khapyorsky --- libibumad/src/umad.c | 192 +++++++++++-------------------------------------- 1 files changed, 43 insertions(+), 149 deletions(-) diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 589684c..a3bbf54 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -80,70 +80,14 @@ typedef struct ib_user_mad_reg_req { int umaddebug = 0; -#define UMAD_DEV_NAME_SZ 32 #define UMAD_DEV_FILE_SZ 256 static char *def_ca_name = "mthca0"; static int def_ca_port = 1; -typedef struct Port { - char dev_file[UMAD_DEV_FILE_SZ]; - char dev_name[UMAD_DEV_NAME_SZ]; - int dev_port; - int dev_fd; - int id; -} Port; - -static Port ports[UMAD_MAX_PORTS]; - /************************************* * Port */ -static Port * -port_alloc(int portid, char *dev, int portnum) -{ - Port *port = ports + portid; - - if (portid < 0 || portid >= UMAD_MAX_PORTS) { - IBWARN("bad umad portid %d", portid); - errno = EINVAL; - return 0; - } - - if (port->dev_name[0]) { - IBWARN("umad port id %d is already allocated for %s %d", - portid, port->dev_name, port->dev_port); - errno = EBUSY; - return 0; - } - - strncpy(port->dev_name, dev, UMAD_CA_NAME_LEN); - port->dev_port = portnum; - port->id = portid; - - return port; -} - -static Port * -port_get(int portid) -{ - Port *port = ports + portid; - - if (portid < 0 || portid >= UMAD_MAX_PORTS) - return 0; - - if (port->dev_name[0] == 0) - return 0; - - return port; -} - -static void -port_free(Port *port) -{ - memset(port, 0, sizeof *port); -} - static int find_cached_ca(char *ca_name, umad_ca_t *ca) { @@ -571,8 +515,8 @@ umad_get_ca_portguids(char *ca_name, uint64_t *portguids, int max) int umad_open_port(char *ca_name, int portnum) { - int
umad_id; - Port *port; + char dev_file[UMAD_DEV_FILE_SZ]; + int umad_id, fd; TRACE("ca %s port %d", ca_name, portnum); @@ -584,19 +528,16 @@ umad_open_port(char *ca_name, int portnum) if ((umad_id = dev_to_umad_id(ca_name, portnum)) < 0) return -EINVAL; - if (!(port = port_alloc(umad_id, ca_name, portnum))) - return -errno; - - snprintf(port->dev_file, sizeof port->dev_file - 1, "%s/umad%d", + snprintf(dev_file, sizeof dev_file - 1, "%s/umad%d", UMAD_DEV_DIR , umad_id); - if ((port->dev_fd = open(port->dev_file, O_RDWR|O_NONBLOCK)) < 0) { - DEBUG("open %s failed: %s", port->dev_file, strerror(errno)); + if ((fd = open(dev_file, O_RDWR|O_NONBLOCK)) < 0) { + DEBUG("open %s failed: %s", dev_file, strerror(errno)); return -EIO; } - DEBUG("opened %s fd %d portid %d", port->dev_file, port->dev_fd, port->id); - return port->id; + DEBUG("opened %s fd %d portid %d", dev_file, fd, umad_id); + return fd; } int @@ -667,26 +608,16 @@ umad_release_port(umad_port_t *port) } int -umad_close_port(int portid) +umad_close_port(int fd) { - Port *port; - - TRACE("portid %d", portid); - if (!(port = port_get(portid))) - return -EINVAL; - - close(port->dev_fd); - - port_free(port); - - DEBUG("closed %s fd %d", port->dev_file, port->dev_fd); + close(fd); + DEBUG("closed fd %d", fd); return 0; } void * umad_get_mad(void *umad) { - TRACE("umad %p", umad); return ((struct ib_user_mad *)umad)->data; } @@ -753,21 +684,15 @@ umad_set_addr_net(void *umad, int dlid, int dqp, int sl, int qkey) } int -umad_send(int portid, int agentid, void *umad, int length, +umad_send(int fd, int agentid, void *umad, int length, int timeout_ms, int retries) { struct ib_user_mad *mad = umad; - Port *port; int n; - TRACE("portid %d agentid %d umad %p timeout %u", - portid, agentid, umad, timeout_ms); + TRACE("fd %d agentid %d umad %p timeout %u", + fd, agentid, umad, timeout_ms); errno = 0; - if (!(port = port_get(portid))) { - if (!errno) - errno = EINVAL; - return -EINVAL; - } mad->timeout_ms = timeout_ms; 
mad->retries = retries; @@ -776,7 +701,7 @@ umad_send(int portid, int agentid, void *umad, int length, if (umaddebug > 1) umad_dump(mad); - n = write(port->dev_fd, mad, length + sizeof *mad); + n = write(fd, mad, length + sizeof *mad); if (n == length + sizeof *mad) return 0; @@ -806,33 +731,26 @@ dev_poll(int fd, int timeout_ms) } int -umad_recv(int portid, void *umad, int *length, int timeout_ms) +umad_recv(int fd, void *umad, int *length, int timeout_ms) { struct ib_user_mad *mad = umad; - Port *port; int n; errno = 0; - TRACE("portid %d umad %p timeout %u", portid, umad, timeout_ms); + TRACE("fd %d umad %p timeout %u", fd, umad, timeout_ms); if (!umad || !length) { errno = EINVAL; return -EINVAL; } - if (!(port = port_get(portid))) { - if (!errno) - errno = EINVAL; - return -EINVAL; - } - - if (timeout_ms && (n = dev_poll(port->dev_fd, timeout_ms)) < 0) { + if (timeout_ms && (n = dev_poll(fd, timeout_ms)) < 0) { if (!errno) errno = -n; return n; } - n = read(port->dev_fd, umad, sizeof *mad + *length); + n = read(fd, umad, sizeof *mad + *length); VALGRIND_MAKE_MEM_DEFINED(umad, sizeof *mad + *length); @@ -861,43 +779,29 @@ umad_recv(int portid, void *umad, int *length, int timeout_ms) } int -umad_poll(int portid, int timeout_ms) +umad_poll(int fd, int timeout_ms) { - Port *port; - - TRACE("portid %d timeout %u", portid, timeout_ms); - if (!(port = port_get(portid))) - return -EINVAL; - - return dev_poll(port->dev_fd, timeout_ms); + TRACE("fd %d timeout %u", fd, timeout_ms); + return dev_poll(fd, timeout_ms); } int -umad_get_fd(int portid) +umad_get_fd(int fd) { - Port *port; - - TRACE("portid %d", portid); - if (!(port = port_get(portid))) - return -EINVAL; - - return port->dev_fd; + TRACE("fd %d", fd); + return fd; } int -umad_register_oui(int portid, int mgmt_class, uint8_t rmpp_version, +umad_register_oui(int fd, int mgmt_class, uint8_t rmpp_version, uint8_t oui[3], uint32_t method_mask[4]) { struct ib_user_mad_reg_req req; - Port *port; - TRACE("portid %d 
mgmt_class %u rmpp_version %d oui 0x%x%x%x method_mask %p", - portid, mgmt_class, (int)rmpp_version, (int)oui[0], (int)oui[1], + TRACE("fd %d mgmt_class %u rmpp_version %d oui 0x%x%x%x method_mask %p", + fd, mgmt_class, (int)rmpp_version, (int)oui[0], (int)oui[1], (int)oui[2], method_mask); - if (!(port = port_get(portid))) - return -EINVAL; - if (mgmt_class < 0x30 || mgmt_class > 0x4f) { DEBUG("mgmt class %d not in vendor range 2", mgmt_class); return -EINVAL; @@ -916,31 +820,27 @@ umad_register_oui(int portid, int mgmt_class, uint8_t rmpp_version, VALGRIND_MAKE_MEM_DEFINED(&req, sizeof req); - if (!ioctl(port->dev_fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { - DEBUG("portid %d registered to use agent %d qp %d class 0x%x oui %p", - portid, req.id, req.qpn, req.mgmt_class, oui); + if (!ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { + DEBUG("fd %d registered to use agent %d qp %d class 0x%x oui %p", + fd, req.id, req.qpn, req.mgmt_class, oui); return req.id; /* return agentid */ } - DEBUG("portid %d registering qp %d class 0x%x version %d oui %p failed: %m", - portid, req.qpn, req.mgmt_class, req.mgmt_class_version, oui); + DEBUG("fd %d registering qp %d class 0x%x version %d oui %p failed: %m", + fd, req.qpn, req.mgmt_class, req.mgmt_class_version, oui); return -EPERM; } int -umad_register(int portid, int mgmt_class, int mgmt_version, +umad_register(int fd, int mgmt_class, int mgmt_version, uint8_t rmpp_version, uint32_t method_mask[4]) { struct ib_user_mad_reg_req req; - Port *port; uint32_t oui = htonl(IB_OPENIB_OUI); int qp; - TRACE("portid %d mgmt_class %u mgmt_version %u rmpp_version %d method_mask %p", - portid, mgmt_class, mgmt_version, rmpp_version, method_mask); - - if (!(port = port_get(portid))) - return -EINVAL; + TRACE("fd %d mgmt_class %u mgmt_version %u rmpp_version %d method_mask %p", + fd, mgmt_class, mgmt_version, rmpp_version, method_mask); req.qpn = qp = (mgmt_class == 0x1 || mgmt_class == 0x81) ? 
0 : 1; req.mgmt_class = mgmt_class; @@ -956,28 +856,22 @@ umad_register(int portid, int mgmt_class, int mgmt_version, VALGRIND_MAKE_MEM_DEFINED(&req, sizeof req); - if (!ioctl(port->dev_fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { - DEBUG("portid %d registered to use agent %d qp %d", - portid, req.id, qp); + if (!ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { + DEBUG("fd %d registered to use agent %d qp %d", + fd, req.id, qp); return req.id; /* return agentid */ } - DEBUG("portid %d registering qp %d class 0x%x version %d failed: %m", - portid, qp, mgmt_class, mgmt_version); + DEBUG("fd %d registering qp %d class 0x%x version %d failed: %m", + fd, qp, mgmt_class, mgmt_version); return -EPERM; } int -umad_unregister(int portid, int agentid) +umad_unregister(int fd, int agentid) { - Port *port; - - TRACE("portid %d unregistering agent %d", portid, agentid); - - if (!(port = port_get(portid))) - return -EINVAL; - - return ioctl(port->dev_fd, IB_USER_MAD_UNREGISTER_AGENT, &agentid); + TRACE("fd %d unregistering agent %d", fd, agentid); + return ioctl(fd, IB_USER_MAD_UNREGISTER_AGENT, &agentid); } int -- 1.5.3.4.206.g58ba4 From sashak at voltaire.com Mon Oct 15 07:30:23 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 16:30:23 +0200 Subject: [ofa-general] [PATCH] opensm: only warning if no lid matrix or lft file given Message-ID: <20071015143023.GY12364@sashak.voltaire.com> When the 'file' routing engine is used, it is up to the user to load an LFT dump file, a lid matrix file, or both (this is specified with the optional -U and -L command line options). When one of the files is not given, OpenSM reports an error (but works properly). This patch turns that error into just a warning.
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_file.c | 11 +++++------ 1 files changed, 5 insertions(+), 6 deletions(-) diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c index 9928831..ed3ca10 100644 --- a/opensm/opensm/osm_ucast_file.c +++ b/opensm/opensm/osm_ucast_file.c @@ -136,9 +136,8 @@ static int do_ucast_file_load(void *context) file_name = p_osm->subn.opt.ucast_dump_file; if (!file_name) { - osm_log(&p_osm->log, OSM_LOG_ERROR | OSM_LOG_SYS, - "do_ucast_file_load: ERR 6301: " - "ucast dump file name is not defined; " + osm_log(&p_osm->log, OSM_LOG_VERBOSE, "do_ucast_file_load: " + "ucast dump file name is not given; " "using default routing algorithm\n"); return -1; } @@ -274,9 +273,9 @@ static int do_lid_matrix_file_load(void *context) file_name = p_osm->subn.opt.lid_matrix_dump_file; if (!file_name) { - osm_log(&p_osm->log, OSM_LOG_ERROR | OSM_LOG_SYS, - "do_lid_matrix_file_load: ERR 6304: " - "lid matrix file name is not defined; " + osm_log(&p_osm->log, OSM_LOG_VERBOSE, + "do_lid_matrix_file_load: " + "lid matrix file name is not given; " "using default lid matrix generation algorithm\n"); return -1; } -- 1.5.3.4.206.g58ba4 From tziporet at mellanox.co.il Mon Oct 15 07:31:28 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 15 Oct 2007 16:31:28 +0200 Subject: [ofa-general] OFED 1.3 Alpha release is available Message-ID: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> Hi, OFED 1.3 Alpha release is available on http://www.openfabrics.org/builds/ofed-1.3/release/ File: OFED-1.3-alpha2.tgz To get BUILD_ID run ofed_info Please report any issues in bugzilla https://bugs.openfabrics.org/ The beta release is expected on 29 October Tziporet & Vlad ======================================================================== Release information: -------------------- OS support: Novell: - SLES10 - SLES10 SP1 Redhat: - Redhat EL4 up4 and up5 - Redhat EL5 kernel.org: - 2.6.23 Note: Fedora C6 and Open 
SUSE 10.2 and Redhat EL4 up3 are not part of the official list. We keep the backport patches for these OSes and make sure OFED compile and loaded properly but will not do full QA cycle. Systems: * x86_64 * x86 * ia64 * ppc64* *Note: On PPC64 installation fails on the packages: ibutils, mvapich2, MPI tests over Open MPI. Main Changes from OFED 1.2.5 ============================ 1. General changes o Kernel code based on 2.6.23 o Quality of Service support in OpenSM, CMA, IPoIB, SRP o Added Neteffect driver (nes) 2. Package and install o There is a new install script. See OFED_Installation_Guide.txt for more details on the new installation and build procedures. Note: There is an easy way to install in one command line without a conf file, and without the interactive mode. Example: ./install.pl --all --prefix /usr/local o User space packages are now in different source RPMs (as opposed to one source RPM in previous OFED releases). o The option for a build without installing is not supported any more. o Added an option to generate tarball with kernel sources for each kernel. 3. IPoIB o Stateless offloads o IGMP for user-space multicast IB o NAPI is enabled default o High availability is supported via the bonding module only (removed ipoib tool scripts) 4. SDP - these are not yet in the alpha release o Keep-alive o Asynch IO o Send Zero Copy 5. iSER o ??? 6. qlgc_vnic o Update for PathScale HCA 7. RDS o RDMA API (using FMRs) - under work 8. uDAPL - these are not yet in the alpha release o Add DAT 2.0 API run-time library and development support. uDAPL 2.0 will include IB extensions for IB rdma write with immediate data and IB atomic operations. o Both uDAPL 1.2 and 2.0 packages will be provided and will co-exist 9. Libraries a. libibverbs 1.1.1 o Added Extended RC transport type b. librdmacm (uCMA) 1.0.3 10. 
OSM o More routing performance improvements o Even more speedups o Better packaging/installation o "Native" daemon mode o Performance management o Quality of Service manager: Based on IBTA annex 11. Management o Multiple partitions 12. MPI: a. OSU MVAPICH o Version is 0.9.9 - same as in 1.2.5 - to be replaced later b. Open MPI o Version is 1.2.2-1 - same as in 1.2.5 - to be replaced later c. OSU MVAPICH2 o Version was updated to 1.0-1. Tasks that should be completed for the beta release: ---------------------------------------------------- 1. Integrate all SDP features 2. Complete RDS work 3. Apply patches that fix warning of backport patches 4. Fix compilation problems on PPC 5. Add qperf test from Qlogic 6. Rebase kernel code on 2.6.24 rc1 (depending it's availability) 7. Support RHEL 5 up1 8. SPEC files should be part of each user space package From hrosenstock at xsigo.com Mon Oct 15 07:38:20 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 15 Oct 2007 07:38:20 -0700 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <20071015135432.GU12364@sashak.voltaire.com> References: <1191930206.22963.164.camel@hrosenstock-ws.xsigo.com> <470C9C55.3090304@Sun.COM> <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> Message-ID: <1192459100.4962.197.camel@hrosenstock-ws.xsigo.com> Hi Sasha, On Mon, 2007-10-15 at 15:54 +0200, Sasha Khapyorsky wrote: > Hi Hal, > > On 06:18 Mon 15 Oct , Hal Rosenstock wrote: > > > > I don't recall as this is from a very very long time ago but in looking > > at this, I agree with your assessment that it can be simplified (and > > there appears to be no real need for what is 
contained in struct Port > > other than the fd). The only downside I see is the subtle change in the > > public umad_ APIs changing int portid -> int fd. > > There is no API change at all - umad_open_port() still return unique > integer descriptor as it was before. Here we are only changing > undocumented (at least I'm not able to find any public description about > what umad_open_port() should return) behavior of this API (by replacing > mad device number as umad_open_port() return value, It's all the other APIs which say umad_xxx(int portid, ...) are now umad_xxxx(int fd, ...). A subtle change. > but if we want to > support multiple open()s there is no choice - device number is not > suitable for this). Understood. > > I suppose all the tools > > would continue to work without change here even if libibumad were > > changed underneath it, right ? > > Right. > > > BTW, when you do this, the umad man pages > > should all be updated for this change. > > I see only that umad_open_port.3 should be fixed - it says that return > value is "0" on success, which is not correct anyway. Not really related > to the patch. Do you see another places to fix in man? Don't a number of them indicate int portid as an input parameter (and this should now be int fd) ? Just grep for portid in those man pages... 
-- Hal > > Sasha From sashak at voltaire.com Mon Oct 15 08:06:24 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 17:06:24 +0200 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <1192459100.4962.197.camel@hrosenstock-ws.xsigo.com> References: <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> <1192459100.4962.197.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071015150624.GZ12364@sashak.voltaire.com> On 07:38 Mon 15 Oct , Hal Rosenstock wrote: > Hi Sasha, > > On Mon, 2007-10-15 at 15:54 +0200, Sasha Khapyorsky wrote: > > Hi Hal, > > > > On 06:18 Mon 15 Oct , Hal Rosenstock wrote: > > > > > > I don't recall as this is from a very very long time ago but in looking > > > at this, I agree with your assessment that it can be simplified (and > > > there appears to be no real need for what is contained in struct Port > > > other than the fd). The only downside I see is the subtle change in the > > > public umad_ APIs changing int portid -> int fd. > > > > There is no API change at all - umad_open_port() still return unique > > integer descriptor as it was before.
Here we are only changing > > undocumented (at least I'm not able to find any public description about > > what umad_open_port() should return) behavior of this API (by replacing > > mad device number as umad_open_port() return value, > > It's all the other APIs which say umad_xxx(int portid, ...) are now > umad_xxxx(int fd, ...). A subtle change. I changed this only in the umad.c file (to make it clear for reviewers of the internal implementation) and kept it as 'portid' in the header where the API is described - a user should not care what the internal meaning of portid is. To get the fd explicitly there is the umad_get_fd(portid) method. > > > but if we want to > > support multiple open()s there is no choice - device number is not > > suitable for this). > > Understood. > > > > I suppose all the tools > > > would continue to work without change here even if libibumad were > > > changed underneath it, right ? > > > > Right. > > > > > BTW, when you do this, the umad man pages > > > should all be updated for this change. > > > > I see only that umad_open_port.3 should be fixed - it says that return > > value is "0" on success, which is not correct anyway. Not really related > > to the patch. Do you see another places to fix in man? > > Don't a number of them indicate int portid as an input parameter (and > this should now be int fd) ? Just grep for portid in those man pages... I don't think we want to make the "portid = fd" feature, which is internal in nature, part of the public API. 'portid' is fine IMO because it doesn't mean a lot - just "0 or a unique positive value...", which is pretty suitable for a public API.
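To illustrate the "portid is opaque, ask the library for the fd" idea, here is a minimal sketch. It does not use libibumad: `demo_get_fd()` is a hypothetical stand-in for an accessor in the spirit of umad_get_fd(portid), and `wait_readable()` is a hypothetical caller. The assumption baked into the stand-in (handle equals fd) is exactly the internal detail the caller is not supposed to rely on.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical library side. In this model the handle happens to be the
 * file descriptor, but only this function is allowed to know that.
 */
int demo_get_fd(int portid)
{
    assert(portid >= 0);
    return portid;
}

/*
 * Hypothetical caller side: waits until the port is readable without ever
 * assuming that portid is an fd. Returns 1 if readable, 0 on timeout and
 * a negative errno value on failure.
 */
int wait_readable(int portid, int timeout_ms)
{
    struct pollfd pfd;
    int n;

    pfd.fd = demo_get_fd(portid);  /* go through the accessor */
    pfd.events = POLLIN;
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -errno;
    return n;
}
```

If the internal mapping ever changed again, only the accessor would need to be touched; callers written this way would keep working.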
Sasha From johann.george at qlogic.com Mon Oct 15 08:25:17 2007 From: johann.george at qlogic.com (Johann George) Date: Mon, 15 Oct 2007 08:25:17 -0700 Subject: [ofa-general] Save the date: OFA Developer's Summit: November 15-16 in Nevada In-Reply-To: <71E64D94-510F-4CCD-84FC-9BF9921AEEDA@cisco.com> References: <20070927221328.GA16000@cuprite.pathscale.com> <71E64D94-510F-4CCD-84FC-9BF9921AEEDA@cisco.com> Message-ID: <20071015152517.GB26086@cuprite.pathscale.com> Jeff, We have not finalized the agenda; although there is a list of topics that will be covered on the OpenFabrics website. Go to the front page of the website and in the OFA Developer Summit box, Click on the "here" in "To register, click here". Here is a direct link: http://www.acteva.com/booking.cfm?bevaid=143964 The summit will end at noon on Friday and lunch will be served from noon until 1pm. Johann On Mon, Oct 15, 2007 at 09:51:03AM -0400, Jeff Squyres wrote: > Johann -- > > Is there an agenda worked up yet? I'm trying to make travel > arrangements and wanted to see if flying out Friday afternoon was a > possibility. > > > On Sep 27, 2007, at 6:13 PM, Johann George wrote: > > >We hope you will plan on attending the OpenFabrics Developer's > >Summit being held November 15-16, 2007 at the Boomtown Hotel in > >Verdi, Nevada. It will begin at 1pm on Thursday, November 15th > >and run until the early evening. Friday's session will begin at > >8am and end at noon. > > > >Last year, this turned out to be a good forum to work through issues > >that required collaboration. If you have items that ought to be on > >the agenda, please email them to me. We will have a proposed agenda > >shortly. > > > >This event takes place at the tail end of SC07. The Boomtown hotel is > >about a twenty minute drive from the Reno-Sparks convention center > >where SC07 is being held. Rooms are available if needed at the > >Boomtown hotel starting at $70/night. > > > >Thanks for your participation. 
> > > >Johann > >_______________________________________________ > >general mailing list > >general at lists.openfabrics.org > >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > >To unsubscribe, please visit http://openib.org/mailman/listinfo/ > >openib-general > > > -- > Jeff Squyres > Cisco Systems From erezz at Voltaire.COM Mon Oct 15 08:30:37 2007 From: erezz at Voltaire.COM (Erez Zilber) Date: Mon, 15 Oct 2007 17:30:37 +0200 Subject: [ofa-general] Re: [ewg] OFED 1.3 Alpha release is available In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> Message-ID: <4713879D.3090606@Voltaire.COM> > > 5. iSER > o ??? > open-iscsi is based on r865.12. There are no changes in iSER itself. Erez From swise at opengridcomputing.com Mon Oct 15 08:35:26 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 15 Oct 2007 10:35:26 -0500 Subject: [ofa-general] openfabrics CMA interfaces for iWARP In-Reply-To: <470EA544.9030101@Sun.COM> References: <470EA544.9030101@Sun.COM> Message-ID: <471388BE.3000504@opengridcomputing.com> Ramaswamy Tummala wrote: > I have a few questions about the openfabrics CMA interfaces for iWARP. > I'd appreciate if anyone could clarify them. > > - If RNIC's modify_qp() entry point is called to move the QP state to > CLOSING or > ERROR while there are some WQEs on SQ and RQ, does RNIC flush the > incomplete > WRs on the SQ or RQ? It is really up to the device, but if the rnic supports the RDMAC verbs then it will. > If so, does RNIC wait until the flush is complete > before returning modify_qp() to the caller? If RNIC does not wait for the > flush to complete how does the caller know when the flush is complete > (so that caller can poll CQ to retrieve the CQ entries)? This is up to the provider also. I don't think the verbs specify that the flush will be done by the time you return from modify_qp(). 
It's up to the application to deal with knowing when the flush is done.

>
> [ Another possibility is, when RNIC's modify_qp() entry point called to
> move the QP state to CLOSING while there some WQEs on the SQ, the RNIC would
> internally move the QP state to ERROR. My question still is does RNIC
> wait until the flushing of incomplete WRs from SQ and RQ are done before
> returning modify_qp() to the caller even though it internally transitioned
> the QP state to ERROR. If RNIC does not wait for the flush to complete
> how does the caller know when the flush is complete? ]
>

I think the answer is no, you cannot depend on the flush being complete
when you exit modify_qp...

> - If RNIC's modify_qp() entry point called to move the QP state to CLOSING,
>   does RNIC just initiate LLP CLOSE and return to the caller?, or does it wait
>   until LLP CLOSE is complete?.

It initiates the LLP CLOSE. It does not wait for the close to complete.

>
> - It appears that RNIC should send IW_CM_EVENT_DISCONNECT event to CMA prior
>   to the start of closing or aborting the connection (except in the case
>   where the disconnect has been initiated by CMA itself, for example by CMA
>   calling modify_qp entry point of RNIC to move the QP state to CLOSING or
>   ERROR). Is this correct?

I'm not sure I understand your question.

>
> - It appears that RNIC should send IW_CM_EVENT_CLOSE event after the connection
>   has been closed. Should this event be sent on both active and passive sides
>   after the connection has been closed?

Yes.

>
> - RNIC has add_ref(struct ib_qp *qp), and rem_ref(struct ib_qp *qp) entry
>   points. What is the expected use of CMA calling these entry points? My general
>   thinking is that CMA can increase the reference count on QP (i.e. add_ref)
>   to prevent the QP from being destroyed by RNIC. But, it is the CMA that
>   initiates destroying of QP by calling destroy_qp() entry point of RNIC.
>   So, CMA could maintain the reference count for QP in its own private data
>   (instead of calling RNIC's add_ref entry point) and not call
>   destroy_qp() entry point of RNIC if the reference count is not zero.

The iWCM keeps the ref on the QP while the QP is directly associated with
an iw_cm_id.

>
> - It appears that if RNIC's accept() entry point is called to accept an
>   incoming connection, the RNIC, after successful processing of accept,
>   would send IW_CM_EVENT_ESTABLISHED event to CMA. What event RNIC should
>   send if the call to accept() succeeded, but later RNIC encountered some
>   error in sending MPA reply message to the remote peer or some other error?
>   In this case although the call to accept() succeeded, the connection could
>   still be not be established. So the RNIC can not send
>   IW_CM_EVENT_ESTABLISHED event.

It is OK to block in the provider until the connection and QP are bound
and in FPDU mode. That's what the Chelsio device does. So when the
ESTABLISHED event is posted, the MPA reply was sent and ACKed and the
QP/connection moved into FPDU mode. Then any problems in the connection
would be posted as IW_CM_EVENT_CLOSE or IW_CM_EVENT_DISCONNECT.

>
> - It appears that a client of CMA needs to call rdma_resolve_route() after
>   a successful rdma_resolve_addr(). Any reason for the existence of two
>   interfaces instead of one interface that combines the functionality of
>   both the interfaces?

It's an InfiniBand requirement, I think. iWARP doesn't do anything for
resolve route. The "next hop route" in iWARP terms is actually determined
as part of address resolution...

Steve.
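
One way an application might "deal with knowing when the flush is done", as discussed in the reply above, is to count the work requests it has posted and keep reaping completions until that count drops to zero. The following is only a userspace sketch of that bookkeeping, under the assumption that every outstanding WR eventually produces exactly one completion. The mock_* types and functions are hypothetical stand-ins invented for this illustration and are not part of the verbs API; a real consumer would poll its actual CQ instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a CQ and its entries; not the verbs API. */
enum mock_wc_status { MOCK_WC_SUCCESS, MOCK_WC_FLUSH_ERR };

struct mock_wc {
	unsigned long wr_id;
	enum mock_wc_status status;
};

struct mock_cq {
	const struct mock_wc *entries;	/* completions the "device" produced */
	size_t count;			/* total entries available */
	size_t next;			/* next entry to hand out */
};

/* Return up to max completions, like a non-blocking CQ poll. */
size_t mock_poll_cq(struct mock_cq *cq, size_t max, struct mock_wc *out)
{
	size_t n = 0;

	while (n < max && cq->next < cq->count)
		out[n++] = cq->entries[cq->next++];
	return n;
}

/*
 * 'outstanding' is the number of WRs posted but not yet completed at the
 * time the QP was asked to move to ERROR. Reap completions until that
 * count reaches zero. Returns how many of them carried the flush status,
 * or -1 if the CQ ran dry first (real code would wait and poll again
 * rather than give up).
 */
int drain_after_error(struct mock_cq *cq, unsigned int outstanding)
{
	struct mock_wc wc[4];
	int flushed = 0;

	while (outstanding > 0) {
		size_t want = outstanding < 4 ? outstanding : 4;
		size_t i, n = mock_poll_cq(cq, want, wc);

		assert(n <= want);
		if (n == 0)
			return -1;
		for (i = 0; i < n; i++) {
			outstanding--;
			if (wc[i].status == MOCK_WC_FLUSH_ERR)
				flushed++;
		}
	}
	return flushed;
}
```

The design point is that the application, not modify_qp(), decides when the queues are drained: it already knows how many WRs it posted, so no extra notification from the provider is assumed.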
From tziporet at dev.mellanox.co.il Mon Oct 15 08:54:06 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 15 Oct 2007 17:54:06 +0200
Subject: [ofa-general] Re: [ewg] OFED 1.3 Alpha release is available
In-Reply-To: <4713879D.3090606@Voltaire.COM>
References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> <4713879D.3090606@Voltaire.COM>
Message-ID: <47138D1E.9060000@mellanox.co.il>

Erez Zilber wrote:
>
>>
>> 5. iSER
>>    o ???

thanks - will update the RN accordingly

Tziporet

From monisonlists at gmail.com Mon Oct 15 09:00:55 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Mon, 15 Oct 2007 18:00:55 +0200
Subject: [ofa-general] [PATCH V7 0/8] net/bonding: ADD IPoIB support for the bonding driver
Message-ID: <47138EB7.40703@gmail.com>

This is the 7th version of this patch series. See link to V6 below.

Changes from the previous version
---------------------------------

* Some patches required modifications to remove offsets so they can be
  applied with git-apply
* Patch #3 was first modified by Jay and later by me to make it work with
  header_ops
* Patch #8 was changed to fix the problem that caused 'ifconfig down' to
  get stuck (dev_close was called twice)

Jay, I removed the Acked-by lines from patches 3 & 8. Can you please add
them back after you approve?

thanks
  MoniS

Link to V6:
http://lists.openfabrics.org/pipermail/general/2007-September/041139.html

From monis at voltaire.com Mon Oct 15 09:03:54 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 15 Oct 2007 18:03:54 +0200
Subject: [ofa-general] [PATCH V7 1/8] IB/ipoib: Bound the net device to the ipoib_neigh structure
In-Reply-To: <47138EB7.40703@gmail.com>
References: <47138EB7.40703@gmail.com>
Message-ID: <47138F6A.8040103@voltaire.com>

IPoIB uses a two layer neighboring scheme, such that for each struct
neighbour whose device is an ipoib one, there is a struct ipoib_neigh buddy
which is created on demand at the tx flow by an
ipoib_neigh_alloc(skb->dst->neighbour) call.
When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device. On the tx flow the bonding code gets an skb such that skb->dev points to the master device, it changes this skb to point on the slave device and calls the slave hard_start_xmit function. Under this scheme, ipoib_neigh_destructor assumption that for each struct neighbour it gets, n->dev is an ipoib device and hence netdev_priv(n->dev) can be casted to struct ipoib_dev_priv is buggy. To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one, when n->dev->flags has the IFF_MASTER bit set. Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz Acked-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib.h | 4 +++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++--------- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++- 3 files changed, 20 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 6545fa7..1b3327a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -349,6 +349,7 @@ #endif struct sk_buff_head queue; struct neighbour *neighbour; + struct net_device *dev; struct list_head list; }; @@ -365,7 +366,8 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh, + struct net_device *dev); void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index e072f3c..cae026c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -517,7 +517,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct 
ipoib_neigh *neigh; - neigh = ipoib_neigh_alloc(skb->dst->neighbour); + neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev); if (!neigh) { ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -817,6 +817,13 @@ static void ipoib_neigh_cleanup(struct n unsigned long flags; struct ipoib_ah *ah = NULL; + neigh = *to_ipoib_neigh(n); + if (neigh) { + priv = netdev_priv(neigh->dev); + ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n", + n->dev->name); + } else + return; ipoib_dbg(priv, "neigh_cleanup for %06x " IPOIB_GID_FMT "\n", IPOIB_QPN(n->ha), @@ -824,13 +831,10 @@ static void ipoib_neigh_cleanup(struct n spin_lock_irqsave(&priv->lock, flags); - neigh = *to_ipoib_neigh(n); - if (neigh) { - if (neigh->ah) - ah = neigh->ah; - list_del(&neigh->list); - ipoib_neigh_free(n->dev, neigh); - } + if (neigh->ah) + ah = neigh->ah; + list_del(&neigh->list); + ipoib_neigh_free(n->dev, neigh); spin_unlock_irqrestore(&priv->lock, flags); @@ -838,7 +842,8 @@ static void ipoib_neigh_cleanup(struct n ipoib_put_ah(ah); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour, + struct net_device *dev) { struct ipoib_neigh *neigh; @@ -847,6 +852,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(st return NULL; neigh->neighbour = neighbour; + neigh->dev = dev; *to_ipoib_neigh(neighbour) = neigh; skb_queue_head_init(&neigh->queue); ipoib_cm_set(neigh, NULL); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 827820e..9bcfc7a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -705,7 +705,8 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour, + skb->dev); if (neigh) { kref_get(&mcast->ah->ref); From 
monisonlists at gmail.com Mon Oct 15 09:08:23 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Mon, 15 Oct 2007 18:08:23 +0200
Subject: [ofa-general] [PATCH V7 0/8] net/bonding: ADD IPoIB support for the bonding driver
In-Reply-To: <47138EB7.40703@gmail.com>
References: <47138EB7.40703@gmail.com>
Message-ID: <47139077.1070702@gmail.com>

Moni Shoua wrote:
> This is the 7th version of this patch series. See link to V6 below.
>

I forgot to mention that the patches are relative to
jgarzik/netdev-2.6.git#master. I couldn't compile the 2.6.24 or the
upstream branches, so I used the master branch to test the fixes.

From monis at voltaire.com Mon Oct 15 09:09:52 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 15 Oct 2007 18:09:52 +0200
Subject: [ofa-general] [PATCH V7 2/8] IB/ipoib: Verify address handle validity on send
In-Reply-To: <47138EB7.40703@gmail.com>
References: <47138EB7.40703@gmail.com>
Message-ID: <471390D0.30501@voltaire.com>

When the bonding device senses a carrier loss of its active slave it
replaces that slave with a new one. Between the time the carrier of an
IPoIB device goes down and the time its ipoib_neigh is destroyed, it is
possible that the bonding driver will send a packet on a new slave that
uses an old ipoib_neigh. This patch detects and prevents this from
happening.
Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz Acked-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cae026c..362610d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -692,9 +692,10 @@ static int ipoib_start_xmit(struct sk_bu goto out; } } else if (neigh->ah) { - if (unlikely(memcmp(&neigh->dgid.raw, + if (unlikely((memcmp(&neigh->dgid.raw, skb->dst->neighbour->ha + 4, - sizeof(union ib_gid)))) { + sizeof(union ib_gid))) || + (neigh->dev != dev))) { spin_lock(&priv->lock); /* * It's safe to call ipoib_put_ah() inside From monis at voltaire.com Mon Oct 15 09:10:44 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 15 Oct 2007 18:10:44 +0200 Subject: [ofa-general] [PATCH V7 3/8] net/bonding: Enable bonding to enslave non ARPHRD_ETHER In-Reply-To: <47138EB7.40703@gmail.com> References: <47138EB7.40703@gmail.com> Message-ID: <47139104.2070605@voltaire.com> This patch changes some of the bond netdevice attributes and functions to be that of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. Basically it overrides those setting done by ether_setup(), which are netdevice **type** dependent and hence might be not appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves from dissimilar ether types, as was concluded over the v1 discussion. 
IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3 bytes IB QP (Queue Pair) number and 16 bytes IB port GID (Global ID) of the port this IPoIB device is bounded to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (i have omitted here some details which are not important for the bonding RFC). Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 34 ++++++++++++++++++++++++++++++++++ 1 files changed, 34 insertions(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 64bfec3..4f61958 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1238,6 +1238,21 @@ static int bond_compute_features(struct return 0; } + +static void bond_setup_by_slave(struct net_device *bond_dev, + struct net_device *slave_dev) +{ + bond_dev->neigh_setup = slave_dev->neigh_setup; + + bond_dev->type = slave_dev->type; + bond_dev->hard_header_len = slave_dev->hard_header_len; + bond_dev->addr_len = slave_dev->addr_len; + bond_dev->header_ops = slave_dev->header_ops; + + memcpy(bond_dev->broadcast, slave_dev->broadcast, + slave_dev->addr_len); +} + /* enslave device to bond device */ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) { @@ -1312,6 +1327,25 @@ int bond_enslave(struct net_device *bond goto err_undo_flags; } + /* set bonding device ether type by slave - bonding netdevices are + * created with ether_setup, so when the slave type is not ARPHRD_ETHER + * there is a need to override some of the type dependent attribs/funcs. 
+ * + * bond ether type mutual exclusion - don't allow slaves of dissimilar + * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond + */ + if (bond->slave_cnt == 0) { + if (slave_dev->type != ARPHRD_ETHER) + bond_setup_by_slave(bond_dev, slave_dev); + } else if (bond_dev->type != slave_dev->type) { + printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different " + "from other slaves (%d), can not enslave it.\n", + slave_dev->name, + slave_dev->type, bond_dev->type); + res = -EINVAL; + goto err_undo_flags; + } + if (slave_dev->set_mac_address == NULL) { printk(KERN_ERR DRV_NAME ": %s: Error: The slave device you specified does " From monis at voltaire.com Mon Oct 15 09:11:31 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 15 Oct 2007 18:11:31 +0200 Subject: [ofa-general] [PATCH V7 4/8] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address() In-Reply-To: <47138EB7.40703@gmail.com> References: <47138EB7.40703@gmail.com> Message-ID: <47139133.9050103@voltaire.com> This patch allows for enslaving netdevices which do not support the set_mac_address() function. In that case the bond mac address is the one of the active slave, where remote peers are notified on the mac address (neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code). 
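
The behaviour described above (the bond address tracking the active slave when slaves cannot have their address set) can be pictured with a small userspace model. This is a sketch only, not the kernel code in this patch: the mock_* structures and mock_change_active_slave() are hypothetical names made up for the illustration, MOCK_MAX_ADDR_LEN is an assumed buffer size, and only the do_set_mac_addr flag name is taken from the patch itself.

```c
#include <assert.h>
#include <string.h>

/* Assumption: big enough for a 20-byte IPoIB hardware address. */
#define MOCK_MAX_ADDR_LEN 32

struct mock_dev {
	unsigned char dev_addr[MOCK_MAX_ADDR_LEN];
	unsigned char addr_len;
};

struct mock_bond {
	struct mock_dev dev;		/* the bond (master) device */
	int do_set_mac_addr;		/* 1: bond pushes its address to slaves */
	const struct mock_dev *active;	/* current active slave, may be NULL */
	int grat_arps;			/* stands in for notifying remote peers */
};

/*
 * Model of a fail-over: when the bond does not set slave addresses, it
 * adopts the address of the new active slave, then announces the change.
 */
void mock_change_active_slave(struct mock_bond *bond,
			      const struct mock_dev *new_active)
{
	assert(!new_active || new_active->addr_len <= MOCK_MAX_ADDR_LEN);

	bond->active = new_active;
	if (new_active && !bond->do_set_mac_addr)
		memcpy(bond->dev.dev_addr, new_active->dev_addr,
		       new_active->addr_len);
	bond->grat_arps++;
}
```

With do_set_mac_addr cleared, each fail-over changes the address peers must use, which is why the announcement step matters in this mode.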
Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz Acked-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 87 +++++++++++++++++++++++++++------------- drivers/net/bonding/bonding.h | 1 2 files changed, 60 insertions(+), 28 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 4f61958..32dc75e 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1096,6 +1096,14 @@ void bond_change_active_slave(struct bon if (new_active) { bond_set_slave_active_flags(new_active); } + + /* when bonding does not set the slave MAC address, the bond MAC + * address is the one of the active slave. + */ + if (new_active && !bond->do_set_mac_addr) + memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, + new_active->dev->addr_len); + bond_send_gratuitous_arp(bond); } } @@ -1347,13 +1355,22 @@ int bond_enslave(struct net_device *bond } if (slave_dev->set_mac_address == NULL) { - printk(KERN_ERR DRV_NAME - ": %s: Error: The slave device you specified does " - "not support setting the MAC address. " - "Your kernel likely does not support slave " - "devices.\n", bond_dev->name); - res = -EOPNOTSUPP; - goto err_undo_flags; + if (bond->slave_cnt == 0) { + printk(KERN_WARNING DRV_NAME + ": %s: Warning: The first slave device you " + "specified does not support setting the MAC " + "address. This bond MAC address would be that " + "of the active slave.\n", bond_dev->name); + bond->do_set_mac_addr = 0; + } else if (bond->do_set_mac_addr) { + printk(KERN_ERR DRV_NAME + ": %s: Error: The slave device you specified " + "does not support setting the MAC addres,." + "but this bond uses this practice. \n" + , bond_dev->name); + res = -EOPNOTSUPP; + goto err_undo_flags; + } } new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL); @@ -1374,16 +1391,18 @@ int bond_enslave(struct net_device *bond */ memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); - /* - * Set slave to master's mac address. 
The application already - * set the master's mac address to that of the first slave - */ - memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); - addr.sa_family = slave_dev->type; - res = dev_set_mac_address(slave_dev, &addr); - if (res) { - dprintk("Error %d calling set_mac_address\n", res); - goto err_free; + if (bond->do_set_mac_addr) { + /* + * Set slave to master's mac address. The application already + * set the master's mac address to that of the first slave + */ + memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); + addr.sa_family = slave_dev->type; + res = dev_set_mac_address(slave_dev, &addr); + if (res) { + dprintk("Error %d calling set_mac_address\n", res); + goto err_free; + } } res = netdev_set_master(slave_dev, bond_dev); @@ -1608,9 +1627,11 @@ err_close: dev_close(slave_dev); err_restore_mac: - memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } err_free: kfree(new_slave); @@ -1783,10 +1804,12 @@ int bond_release(struct net_device *bond /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address */ - memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address */ + memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB | IFF_SLAVE_INACTIVE | IFF_BONDING | @@ -1873,10 +1896,12 @@ static int bond_release_all(struct net_d /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address*/ - 
memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address*/ + memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB | IFF_SLAVE_INACTIVE); @@ -3914,6 +3939,9 @@ static int bond_set_mac_address(struct n dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None")); + if (!bond->do_set_mac_addr) + return -EOPNOTSUPP; + if (!is_valid_ether_addr(sa->sa_data)) { return -EADDRNOTAVAIL; } @@ -4300,6 +4328,9 @@ #ifdef CONFIG_PROC_FS bond_create_proc_entry(bond); #endif + /* set do_set_mac_addr to true on startup */ + bond->do_set_mac_addr = 1; + list_add_tail(&bond->bond_list, &bond_dev_list); return 0; diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index 2a6af7d..5011ba9 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -185,6 +185,7 @@ struct bonding { struct timer_list mii_timer; struct timer_list arp_timer; s8 kill_timers; + s8 do_set_mac_addr; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; From monis at voltaire.com Mon Oct 15 09:12:31 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 15 Oct 2007 18:12:31 +0200 Subject: [ofa-general] [PATCH V7 5/8] net/bonding: Enable IP multicast for bonding IPoIB devices In-Reply-To: <47138EB7.40703@gmail.com> References: <47138EB7.40703@gmail.com> Message-ID: <4713916F.8040006@voltaire.com> Allow to enslave devices when the bonding device is not up. Over the discussion held at the previous post this seemed to be the most clean way to go, where it is not expected to cause instabilities. Normally, the bonding driver is UP before any enslavement takes place. 
Once a netdevice is UP, the network stack has it join some multicast
groups (e.g. the all-hosts group 224.0.0.1). Since ether_setup() has set the
bonding device type to ARPHRD_ETHER and the address length to ETH_ALEN, the
net core code computes a wrong multicast link address for these joins. This
is because ip_eth_mc_map() is called for them, whereas for multicast joins
taking place after the enslavement another ip_xxx_mc_map() is called (e.g.
ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND).

Signed-off-by: Moni Shoua
Signed-off-by: Or Gerlitz
Acked-by: Jay Vosburgh
---
 drivers/net/bonding/bond_main.c  |    5 +++--
 drivers/net/bonding/bond_sysfs.c |    6 ++----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 32dc75e..d7e43ba 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1281,8 +1281,9 @@ int bond_enslave(struct net_device *bond
 
 	/* bond must be initialized by bond_open() before enslaving */
 	if (!(bond_dev->flags & IFF_UP)) {
-		dprintk("Error, master_dev is not up\n");
-		return -EPERM;
+		printk(KERN_WARNING DRV_NAME
+			" %s: master_dev is not up in bond_enslave\n",
+			bond_dev->name);
 	}
 
 	/* already enslaved */
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 6f49ca7..ca4e429 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru
 
 	/* Quick sanity check -- is the bond interface up? */
 	if (!(bond->dev->flags & IFF_UP)) {
-		printk(KERN_ERR DRV_NAME
-		       ": %s: Unable to update slaves because interface is down.\n",
+		printk(KERN_WARNING DRV_NAME
+		       ": %s: doing slave updates when interface is down.\n",
 		       bond->dev->name);
-		ret = -EPERM;
-		goto out;
 	}
 
 	/* Note: We can't hold bond->lock here, as bond_create grabs it.
*/ From monis at voltaire.com Mon Oct 15 09:13:10 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 15 Oct 2007 18:13:10 +0200 Subject: [ofa-general] [PATCH V7 6/8] net/bonding: Handlle wrong assumptions that slave is always an Ethernet device In-Reply-To: <47138EB7.40703@gmail.com> References: <47138EB7.40703@gmail.com> Message-ID: <47139196.30501@voltaire.com> bonding sometimes uses Ethernet constants (such as MTU and address length) which are not good when it enslaves non Ethernet devices (such as InfiniBand). Signed-off-by: Moni Shoua Acked-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 3 ++- drivers/net/bonding/bond_sysfs.c | 10 ++++++++-- drivers/net/bonding/bonding.h | 1 + 3 files changed, 11 insertions(+), 3 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index d7e43ba..3f082dc 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1225,7 +1225,8 @@ static int bond_compute_features(struct struct slave *slave; struct net_device *bond_dev = bond->dev; unsigned long features = bond_dev->features; - unsigned short max_hard_header_len = ETH_HLEN; + unsigned short max_hard_header_len = max((u16)ETH_HLEN, + bond_dev->hard_header_len); int i; features &= ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES); diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index ca4e429..583c568 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -260,6 +260,7 @@ static ssize_t bonding_store_slaves(stru char command[IFNAMSIZ + 1] = { 0, }; char *ifname; int i, res, found, ret = count; + u32 original_mtu; struct slave *slave; struct net_device *dev = NULL; struct bonding *bond = to_bond(d); @@ -325,6 +326,7 @@ static ssize_t bonding_store_slaves(stru } /* Set the slave's MTU to match the bond */ + original_mtu = dev->mtu; if (dev->mtu != bond->dev->mtu) { if (dev->change_mtu) { res = dev->change_mtu(dev, @@ -339,6 +341,9 @@ static ssize_t 
bonding_store_slaves(stru } rtnl_lock(); res = bond_enslave(bond->dev, dev); + bond_for_each_slave(bond, slave, i) + if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) + slave->original_mtu = original_mtu; rtnl_unlock(); if (res) { ret = res; @@ -351,6 +356,7 @@ static ssize_t bonding_store_slaves(stru bond_for_each_slave(bond, slave, i) if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) { dev = slave->dev; + original_mtu = slave->original_mtu; break; } if (dev) { @@ -365,9 +371,9 @@ static ssize_t bonding_store_slaves(stru } /* set the slave MTU to the default */ if (dev->change_mtu) { - dev->change_mtu(dev, 1500); + dev->change_mtu(dev, original_mtu); } else { - dev->mtu = 1500; + dev->mtu = original_mtu; } } else { diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index 5011ba9..ad9c632 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -156,6 +156,7 @@ struct slave { s8 link; /* one of BOND_LINK_XXXX */ s8 state; /* one of BOND_STATE_XXXX */ u32 original_flags; + u32 original_mtu; u32 link_failure_count; u16 speed; u8 duplex; From monis at voltaire.com Mon Oct 15 09:13:55 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 15 Oct 2007 18:13:55 +0200 Subject: [ofa-general] PATCH V6 7/8] net/bonding: Delay sending of gratuitous ARP to avoid failure In-Reply-To: <47138EB7.40703@gmail.com> References: <47138EB7.40703@gmail.com> Message-ID: <471391C3.6060108@voltaire.com> Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit in dev->state field is on. This improves the chances for the arp packet to be transmitted. 
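
The deferral described above amounts to a small "owed notification" state machine: remember that an ARP is owed while a link event is still pending, and send it from the periodic monitor once the pending state clears. The following userspace model is a sketch of that idea only. All mock_* names are hypothetical and invented for this illustration; linkwatch_pending merely stands in for the LINK_STATE_LINKWATCH_PENDING bit, and arps_sent stands in for actually transmitting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_slave {
	bool linkwatch_pending;	/* stands in for the pending link-watch bit */
};

struct mock_arp_bond {
	struct mock_slave *curr_active_slave;	/* may be NULL */
	bool send_grat_arp;			/* an ARP is still owed to peers */
	int arps_sent;				/* counts simulated transmissions */
};

/* True while the active slave still has a link event waiting to be handled. */
bool mock_link_event_pending(const struct mock_arp_bond *bond)
{
	return bond->curr_active_slave != NULL &&
	       bond->curr_active_slave->linkwatch_pending;
}

/* Called right after a fail-over selected a new active slave. */
void mock_failover_notify(struct mock_arp_bond *bond)
{
	if (mock_link_event_pending(bond))
		bond->send_grat_arp = true;	/* defer to the monitor */
	else
		bond->arps_sent++;		/* send immediately */
}

/* Called from the periodic link monitor. */
void mock_monitor_tick(struct mock_arp_bond *bond)
{
	if (!bond->send_grat_arp || mock_link_event_pending(bond))
		return;				/* nothing owed, or still too early */
	bond->arps_sent++;
	bond->send_grat_arp = false;
}
```

In this model the ARP is sent exactly once, on the first monitor tick after the pending state clears; later ticks do nothing until another fail-over defers a new one.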
Signed-off-by: Moni Shoua Acked-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 24 +++++++++++++++++++++--- drivers/net/bonding/bonding.h | 1 + 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 3f082dc..c017042 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1103,8 +1103,14 @@ void bond_change_active_slave(struct bon if (new_active && !bond->do_set_mac_addr) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); - - bond_send_gratuitous_arp(bond); + if (bond->curr_active_slave && + test_bit(__LINK_STATE_LINKWATCH_PENDING, + &bond->curr_active_slave->dev->state)) { + dprintk("delaying gratuitous arp on %s\n", + bond->curr_active_slave->dev->name); + bond->send_grat_arp = 1; + } else + bond_send_gratuitous_arp(bond); } } @@ -2074,6 +2080,17 @@ void bond_mii_monitor(struct net_device * program could monitor the link itself if needed. 
 	 */
 
+	if (bond->send_grat_arp) {
+		if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING,
+				&bond->curr_active_slave->dev->state))
+			dprintk("Needs to send gratuitous arp but not yet\n");
+		else {
+			dprintk("sending delayed gratuitous arp on %s\n",
+				bond->dev->name);
+			bond_send_gratuitous_arp(bond);
+			bond->send_grat_arp = 0;
+		}
+	}
 	read_lock(&bond->curr_slave_lock);
 	oldcurrent = bond->curr_active_slave;
 	read_unlock(&bond->curr_slave_lock);
@@ -2475,7 +2492,7 @@ static void bond_send_gratuitous_arp(str
 
 	if (bond->master_ip) {
 		bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
-				  bond->master_ip, 0);
+				bond->master_ip, 0);
 	}
 
 	list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
@@ -4281,6 +4298,7 @@ static int bond_init(struct net_device *
 	bond->current_arp_slave = NULL;
 	bond->primary_slave = NULL;
 	bond->dev = bond_dev;
+	bond->send_grat_arp = 0;
 	INIT_LIST_HEAD(&bond->vlan_list);
 
 	/* Initialize the device entry points */
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index ad9c632..e0e06a8 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -187,6 +187,7 @@ struct bonding {
 	struct timer_list arp_timer;
 	s8 kill_timers;
 	s8 do_set_mac_addr;
+	s8 send_grat_arp;
 	struct net_device_stats stats;
 #ifdef CONFIG_PROC_FS
 	struct proc_dir_entry *proc_entry;

From monis at voltaire.com Mon Oct 15 09:14:43 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 15 Oct 2007 18:14:43 +0200
Subject: [ofa-general] [PATCH V7 8/8] net/bonding: Destroy bonding master when last slave is gone
In-Reply-To: <47138EB7.40703@gmail.com>
References: <47138EB7.40703@gmail.com>
Message-ID: <471391F3.5020209@voltaire.com>

When bonding enslaves non-Ethernet devices it takes pointers to functions
in the module that owns the slaves. In this case it becomes unsafe
to keep the bonding master registered after the last slave was released,
because we don't know if the pointers are still valid.
Destroying the bond when slave_cnt is zero ensures that these functions
will not be used anymore.

Signed-off-by: Moni Shoua
---
 drivers/net/bonding/bond_main.c  |   37 ++++++++++++++++++++++++++++++++++++-
 drivers/net/bonding/bond_sysfs.c |    9 +++++----
 drivers/net/bonding/bonding.h    |    3 +++
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c017042..23edf18 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1257,6 +1257,7 @@ static int bond_compute_features(struct
 static void bond_setup_by_slave(struct net_device *bond_dev,
 				struct net_device *slave_dev)
 {
+	struct bonding *bond = bond_dev->priv;
 	bond_dev->neigh_setup = slave_dev->neigh_setup;
 
 	bond_dev->type = slave_dev->type;
@@ -1266,6 +1267,7 @@ static void bond_setup_by_slave(struct n
 
 	memcpy(bond_dev->broadcast, slave_dev->broadcast,
 		slave_dev->addr_len);
+	bond->setup_by_slave = 1;
 }
 
 /* enslave device to bond device */
@@ -1829,6 +1831,35 @@ int bond_release(struct net_device *bond
 }
 
 /*
+* Destroy a bonding device.
+* Must be under rtnl_lock when this function is called.
+*/
+void bond_destroy(struct bonding *bond)
+{
+	bond_deinit(bond->dev);
+	bond_destroy_sysfs_entry(bond);
+	unregister_netdevice(bond->dev);
+}
+
+/*
+* First release a slave and then destroy the bond if no more slaves are left.
+* Must be under rtnl_lock when this function is called.
+*/
+int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev)
+{
+	struct bonding *bond = bond_dev->priv;
+	int ret;
+
+	ret = bond_release(bond_dev, slave_dev);
+	if ((ret == 0) && (bond->slave_cnt == 0)) {
+		printk(KERN_INFO DRV_NAME ": %s: destroying bond %s.\n",
+		       bond_dev->name, bond_dev->name);
+		bond_destroy(bond);
+	}
+	return ret;
+}
+
+/*
  * This function releases all slaves.
*/ static int bond_release_all(struct net_device *bond_dev) @@ -3311,7 +3342,10 @@ static int bond_slave_netdev_event(unsig switch (event) { case NETDEV_UNREGISTER: if (bond_dev) { - bond_release(bond_dev, slave_dev); + if (bond->setup_by_slave) + bond_release_and_destroy(bond_dev, slave_dev); + else + bond_release(bond_dev, slave_dev); } break; case NETDEV_CHANGE: @@ -4299,6 +4333,7 @@ static int bond_init(struct net_device * bond->primary_slave = NULL; bond->dev = bond_dev; bond->send_grat_arp = 0; + bond->setup_by_slave = 0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 583c568..b5d2a13 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -164,9 +164,7 @@ static ssize_t bonding_store_bonds(struc printk(KERN_INFO DRV_NAME ": %s is being deleted...\n", bond->dev->name); - bond_deinit(bond->dev); - bond_destroy_sysfs_entry(bond); - unregister_netdevice(bond->dev); + bond_destroy(bond); rtnl_unlock(); goto out; } @@ -363,7 +361,10 @@ static ssize_t bonding_store_slaves(stru printk(KERN_INFO DRV_NAME ": %s: Removing slave %s\n", bond->dev->name, dev->name); rtnl_lock(); - res = bond_release(bond->dev, dev); + if (bond->setup_by_slave) + res = bond_release_and_destroy(bond->dev, dev); + else + res = bond_release(bond->dev, dev); rtnl_unlock(); if (res) { ret = res; diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index e0e06a8..85e579b 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -188,6 +188,7 @@ struct bonding { s8 kill_timers; s8 do_set_mac_addr; s8 send_grat_arp; + s8 setup_by_slave; struct net_device_stats stats; #ifdef CONFIG_PROC_FS struct proc_dir_entry *proc_entry; @@ -295,6 +296,8 @@ static inline void bond_unset_master_alb struct vlan_entry *bond_next_vlan(struct bonding *bond, struct vlan_entry *curr); int bond_dev_queue_xmit(struct bonding *bond, 
struct sk_buff *skb, struct net_device *slave_dev); int bond_create(char *name, struct bond_params *params, struct bonding **newbond); +void bond_destroy(struct bonding *bond); +int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev); void bond_deinit(struct net_device *bond_dev); int bond_create_sysfs(void); void bond_destroy_sysfs(void); From suri at baymicrosystems.com Mon Oct 15 09:19:25 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Mon, 15 Oct 2007 12:19:25 -0400 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDR SMP responses from userspace In-Reply-To: <1192362184.4962.133.camel@hrosenstock-ws.xsigo.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <005001c80dbd$17eeb8c0$a865a8c0@catcher> <1192362184.4962.133.camel@hrosenstock-ws.xsigo.com> Message-ID: <012e01c80f47$2ab5eec0$1914a8c0@md.baymicrosystems.com> >From a code review it looked OK as far as a switch implementation was concerned. Let me apply the patch and try once before we commit this, please give me a few days, Thanks, Suri > -----Original Message----- > From: Hal Rosenstock [mailto:hrosenstock at xsigo.com] > Sent: Sunday, October 14, 2007 7:43 AM > To: Steve Welch > Cc: 'Hal Rosenstock'; rdreier at cisco.com; general at lists.openfabrics.org; suri at baymicrosystems.com; > ralph.campbell at qlogic.com > Subject: RE: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDR SMP responses from > userspace > > Hi Steve, > > On Sat, 2007-10-13 at 12:18 -0500, Steve Welch wrote: > > Hi Hal, > > > > > > > > Looks pretty good. A few things below and a couple of nits embedded: > > > > > > I think the original description was more detailed and should be added > > > to the above: > > > > When I submit the next revision I will update the description to put > > the detail back in. > > Thanks. 
> > > > Signed-off-by: Steve Welch > > > > > > My main concern is verifying this with the various HCA drivers > > > (Mellanox (in normal HCA mode), iPath, and eHCA) as well as switches > > > (Suri, can you try this ?) in addition to running this on a node where > > > OpenSM resides (Sasha, can you try this ?). How much of this have you > > > done ? Thanks. > > > > > > > Good point, I think we are good with regard to the SM and mthca. > > I have run the code with the mthca driver loaded in non-router mode, > > and verified proper operation (ports can be brought up, so > > process_mad() is handing off SMP requests to the internal SMA, > > etc.). I've also run the SM on that host, again local ports are > > brought up and the SM is able to bring up the attached fabric. Local > > user space utilities like smpquery operate normally for local and > > remote queries using both directed route and LID routed addressing. > > > > However, I have not run on top of the iPath or eHCA. > > I don't think this currently is utilized by eHCA as all this is done in > firmware but there is at least one known switch implementation out there > which should IMO be reverified with this change. > > > A quick code > > inspection of the iPath driver indicates that the desired effect > > will not be achieved with that driver in every case. For the > > SM info attribute it looks OK and is handled properly currently. > > For DR SMP's with the GET_RESPONSE method the iPath driver returns > > IB_MAD_RESULT_FAILURE instead of IB_MAD_RESULT_SUCCESS. > > This will cause the core mad processing to drop the SMP MAD instead > > of attempting to pass it on to a local agent. Of course this > > iPath behavior exists with or without this patch. I'm not sure > > why the iPath driver considers this a failure, it does not > > consume or process the MAD in that case, but the MAD has passed > > their incoming sanity checks. 
The comment in this code indicates > > they intended to do the right thing, but are just returning the > > wrong status (see ipath_mad.c, process_subn()). > > I don't know either but that could be a separate patch. Maybe Ralph > could comment on this. > > > I just don't think this is a code path that has been exercised > > on iPath; it requires a user space SMA sending DR SMP responses > > that must be looped back locally. To get consistent behavior iPath > > will need a change, but I do not have the hardware required to > > make and test that change. > > > > I'm not sure about the eHCA driver; it appears to not implement > > the process_mad() IB device function. > > Right; it currently does not expose QP0. It is all done in firmware. > > > > > } > > > > + > > > > +/* > > > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > > > SMA/SM > > > > + * via process_mad > > > > + */ > > > > +static inline enum smi_action smi_check_local_returning_smp(struct > > > ib_smp *smp, > > > > + struct ib_device > > > *device) > > > > > > Nit. Not sure this lines up properly. > > > > > The function names are a little verbose and we're pushing 80 columns, so > > the second parameter could not line up exactly with the first without exceeding > > the limit. I can break the first line up if that is preferred. > > I agree they are verbose but I think that makes them clearer. Maybe they > can be shortened: Just make their names is_local_outgoing/returning_smp, > perhaps ? 
> > -- Hal > > > Thanks for your feedback, > > Steve > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Mon Oct 15 10:06:30 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Oct 2007 10:06:30 -0700 Subject: [ofa-general] openfabrics CMA interfaces for iWARP In-Reply-To: <471388BE.3000504@opengridcomputing.com> References: <470EA544.9030101@Sun.COM> <471388BE.3000504@opengridcomputing.com> Message-ID: <47139E16.2010302@ichips.intel.com> >> - It appears that a client of CMA needs to call rdma_resolve_route() >> after >> a successful rdma_resolve_addr(). Any reason for the existence of two >> interfaces instead of one interface that combines the functionality of >> both the interfaces? > > It's an InfiniBand requirement, I think. iWARP doesn't do anything for > resolve route. The "next hop route" in iWARP terms is actually > determined as part of address resolution... Steve's correct. For IB, these are two distinct operations (mapping IP address to GID & path record query). The only benefit to combining the steps is to save a single call down. The drawback is that it ends up complicating the rdma_cm implementation in case the user calls rdma_destroy_id() between the two steps. 
- Sean From gmk at lbl.gov Mon Oct 15 11:06:14 2007 From: gmk at lbl.gov (Greg Kurtzer) Date: Mon, 15 Oct 2007 11:06:14 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <1192447897.4962.162.camel@hrosenstock-ws.xsigo.com> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com> <1192447897.4962.162.camel@hrosenstock-ws.xsigo.com> Message-ID: <6BEA33D7-DCD7-4809-A75D-47801FA3EA87@lbl.gov> Yes, the patch fixes perfquery so now it supports the "-a" option properly but I had to make some minor tweaks as I am running the released 1.3.2 version. I also disabled the IBWARN and also tweaked ibcheckerrs just enough so that ibcheckerrors is reporting properly now. Attached is the patch that includes both of the above modifications and integrates properly against the 1.3.2 released tree. Again, thank you. :) Greg -------------- next part -------------- A non-text attachment was scrubbed... Name: infiniband-diags-1.3.2-allports_workaround.patch Type: application/octet-stream Size: 3375 bytes Desc: not available URL: -------------- next part -------------- On Oct 15, 2007, at 4:31 AM, Hal Rosenstock wrote: > On Fri, 2007-10-12 at 15:14 -0700, Hal Rosenstock wrote: >> On Fri, 2007-10-12 at 14:59 -0700, Hal Rosenstock wrote: >>> On Fri, 2007-10-12 at 14:47 -0700, Greg Kurtzer wrote: >>>> ibwarn: [25274] pma_query: lid 1 port 1 >>>> ibwarn: [25274] mad_rpc: data offs 64 sz 192 >>>> mad data >>>> 0101 0000 0000 0014 0000 0000 0000 0000 >>> >>> Thanks; AllPortSelect is off in CapabilityMask which is >>> consistent with >>> the behavior. 
(It would be trivial for those HCA PMAs to indicate >>> AllPortSelect is supported (since it's the same as supporting one >>> port) >>> and then all would be fine but that's not a requirement). >>> >>> A check should be added in perfquery for this. I will generate a >>> patch >>> for that but that won't fix the problem. >> >> Actually, perfquery gets the number of ports and could do multiple >> PerfGets, one per port, and accumulate the "all" ports. >> >> This approach may be better than dealing with the scripts. > > Can you try this and let me know if this resolves your issue ? The > patch > is against the master (OFED 1.3): > > diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/ > src/perfquery.c > index 148e452..c976fc5 100644 > --- a/infiniband-diags/src/perfquery.c > +++ b/infiniband-diags/src/perfquery.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the > GNU > @@ -42,7 +43,7 @@ > #include > #include > > -#define __BUILD_VERSION_TAG__ 1.2.2 > +#define __BUILD_VERSION_TAG__ 1.2.3 > #include > #include > #include > @@ -99,6 +100,9 @@ main(int argc, char **argv) > int ca_port = 0; > int extended = 0; > uint16_t cap_mask; > + int allports = 0; > + int node_type, num_ports; > + uint8_t data[IB_SMP_DATA_SIZE]; > > static char const str_opts[] = "C:P:s:t:dGearRVhu"; > static const struct option long_opts[] = { > @@ -191,6 +195,35 @@ main(int argc, char **argv) > /* PerfMgt ClassPortInfo is a required attribute */ > if (!perf_classportinfo_query(pc, &portid, port, timeout)) > IBERROR("classportinfo query"); > + /* ClassPortInfo should be supported as part of libibmad */ > + memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ > + cap_mask = ntohs(cap_mask); > + if (!(cap_mask & 0x100)) /* bit 8 is AllPortSelect */ > + if (port == 255) { > + allports = 1; > + IBWARN("AllPortSelect not supported"); > + } > + > + if (allports == 1) { > + > + /* > + * Simulate all ports support in PMA > + * Determine node type, number of (physical) ports, > + * and, if switch, whether SP0 is enhanced > + * to determine first and last port to query > + */ > + > + /* For now, support single port CAs */ > + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) > + IBERROR("smp query nodeinfo failed"); > + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); > + if (node_type != IB_NODE_CA) /* NodeType other than CA ? */ > + IBERROR("smp query nodeinfo: Node type not CA"); > + mad_decode_field(data, IB_NODE_NPORTS_F, &num_ports); > + if (num_ports != 1) > + IBERROR("smp query nodeinfo: %d ports; only 1 supported > currently", num_ports); > + port = num_ports; > + } > > if (reset_only) > goto do_reset; > @@ -201,9 +234,6 @@ main(int argc, char **argv) > > mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); > } else { > - /* Should ClassPortInfo be implemented in libibmad ? 
*/ > - memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ > - cap_mask = ntohs(cap_mask); > if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended > counter support */ > IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not > indicated\n", cap_mask); > > > >> >> -- Hal >> >>> I will try to find time to look at the scripts and see what it >>> will take >>> to fix this. Where AllPortSelect is not supported, they need to drop >>> back to individual ports. >>> >>> -- Hal -- Greg Kurtzer gmk at lbl.gov From harms at alcf.anl.gov Mon Oct 15 11:15:31 2007 From: harms at alcf.anl.gov (Kevin Harms) Date: Mon, 15 Oct 2007 13:15:31 -0500 Subject: [ofa-general] SRP and thread count Message-ID: <0F6CA3F2-0F4D-48F8-B184-62B2482A6E04@alcf.anl.gov> We are using a storage array that is connected via IB and using SRP (OFED-1.2). When I try to use 6 or more threads the system takes a large performance hit: 4 threads using iozone => 630 MB/s, 6 threads using iozone => 60 MB/s. This is on SLES10 SP1. Has anyone else seen this problem? 
thanks, Kevin Harms From jgarzik at pobox.com Mon Oct 15 11:23:02 2007 From: jgarzik at pobox.com (Jeff Garzik) Date: Mon, 15 Oct 2007 14:23:02 -0400 Subject: [ofa-general] Re: [PATCH V7 0/8] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <47138EB7.40703@gmail.com> References: <47138EB7.40703@gmail.com> Message-ID: <4713B006.9090908@pobox.com> Moni Shoua wrote: > This is the 7th version of this patch series. See link to V6 below. > > Changes from the previous version > --------------------------------- > > * Some patches required modifications to remove offsets so they can be applied with git-apply > * Patch #3 was first modified by Jay and later by me to make it work with header_ops > * patch #8 was changed to fix the problem that caused 'ifconfig down' to stuck (dev_close was called twice) I just applied the latest version Jay sent, are there any remaining changes? 
From hrosenstock at xsigo.com Mon Oct 15 11:35:51 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 15 Oct 2007 11:35:51 -0700 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <20071015150624.GZ12364@sashak.voltaire.com> References: <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> <1192459100.4962.197.camel@hrosenstock-ws.xsigo.com> <20071015150624.GZ12364@sashak.voltaire.com> Message-ID: <1192473351.4962.275.camel@hrosenstock-ws.xsigo.com> On Mon, 2007-10-15 at 17:06 +0200, Sasha Khapyorsky wrote: > On 07:38 Mon 15 Oct , Hal Rosenstock wrote: > > Hi Sasha, > > > > On Mon, 2007-10-15 at 15:54 +0200, Sasha Khapyorsky wrote: > > > Hi Hal, > > > > > > On 06:18 Mon 15 Oct , Hal Rosenstock wrote: > > > > > > > > I don't recall as this is from a very very long time ago but in looking > > > > at this, I agree with your assessment that it can be simplified (and > > > > there appears to be no real need for what is contained in struct Port > > > > other than the fd). The only downside I see is the subtle change in the > > > > public umad_ APIs changing int portid -> int fd. > > > > > > There is no API change at all - umad_open_port() still return unique > > > integer descriptor as it was before. Here we are only changing > > > undocumented (at least I'm not able to find any public description about > > > what umad_open_port() should return) behavior of this API (by replacing > > > mad device number as umad_open_port() return value, > > > > It's all the other APIs which say umad_xxx(int portid, ...) are now > > umad_xxxx(int fd, ...). A subtle change. 
> > I changed this only in umad.c files (to make it clear for internal > implementation reviewers) and saved it as 'portid' in the header where > API is described - a user should not care what the internal meaning of > portid is. For getting fd explicitly there is umad_get_fd(portid) > method. Which could be eliminated as redundant; not sure anything is using this API. > > > but if we want to > > > support multiple open()s there is no choice - device number is not > > > suitable for this). > > > > Understood. > > > > > > I suppose all the tools > > > > would continue to work without change here even if libibumad were > > > > changed underneath it, right ? > > > > > > Right. > > > > > > > BTW, when you do this, the umad man pages > > > > should all be updated for this change. > > > > > > I see only that umad_open_port.3 should be fixed - it says that return > > > value is "0" on success, which is not correct anyway. Not really related > > > to the patch. Do you see any other places to fix in the man pages? > > > > Don't a number of them indicate int portid as an input parameter (and > > this should now be int fd) ? Just grep for portid in those man pages... > > Don't think we want to make the internal in its nature "portid = fd > feature" to be part of the public API. 'portid' is fine IMO because it > doesn't mean a lot - just "0 or an unique positive value...", pretty > suitable for public API. portid gives a level of abstraction but is it needed ? If we were starting today, would you say the same thing ? 
-- Hal > Sasha From robert.j.woodruff at intel.com Mon Oct 15 11:38:30 2007 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 15 Oct 2007 11:38:30 -0700 Subject: [ofa-general] OpenFabrics Downloads page Message-ID: Hi All, Hay Arlin was nice enough to work with the OpenFabrics website people to update the downloads page to allow developers to manage their own directories. http://www.openfabrics.org/downloads.htm So what developers need to do is create a WEB_README in the appropriate directory under downloads and put their latest released packages into that directory, simple as that. You can then update the directory yourself as new packages are released and they will show up via the downloads page of the website. woody From swise at opengridcomputing.com Mon Oct 15 11:50:55 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 15 Oct 2007 13:50:55 -0500 Subject: [ofa-general] Re: [ewg] OFED 1.3 Alpha release is available In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> Message-ID: <4713B68F.8080807@opengridcomputing.com> I see an iSCSI build failure trying to build on rhel4u4 with a 2.6.20.6 kernel... Bug 738 opened. 
------- gcc -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi/.scsi_transport_iscsi.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/3.4.6/include -D__KERNEL__ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/kernel_addons/backport/2.6.20/include/ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/debug -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/net/cxgb3 -Iinclude -include include/linux/autoconf.h -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/linux/autoconf.h -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Os -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -fomit-frame-pointer -g -Wdeclaration-after-statement -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(scsi_transport_iscsi)" -D"KBUILD_MODNAME=KBUILD_STR(scsi_transport_iscsi)" -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi/.tmp_scsi_transport_iscsi.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi/scsi_transport_iscsi.c /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi/scsi_transport_iscsi.c: In function `iscsi_if_rx': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi/scsi_transport_iscsi.c:1122: warning: implicit declaration of function `nlmsg_hdr' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi/scsi_transport_iscsi.c:1122: warning: assignment makes pointer from integer without a cast /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi/scsi_transport_iscsi.c: In function `iscsi_transport_init': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi/scsi_transport_iscsi.c:1527: error: too many arguments to function `netlink_kernel_create' make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi/scsi_transport_iscsi.o] 
Error 1 make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/scsi] Error 2 make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3] Error 2 make[1]: Leaving directory `/opt/kernel/linux-2.6.20.6' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.68413 (%build) RPM build errors: user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.68413 (%build) Tziporet Koren wrote: > Hi, > > OFED 1.3 Alpha release is available on > http://www.openfabrics.org/builds/ofed-1.3/release/ > File: OFED-1.3-alpha2.tgz > To get BUILD_ID run ofed_info > > Please report any issues in bugzilla https://bugs.openfabrics.org/ > > The beta release is expected on 29 October > > Tziporet & Vlad > > ======================================================================== > > Release information: > -------------------- > OS support: > Novell: > - SLES10 > - SLES10 SP1 > Redhat: > - Redhat EL4 up4 and up5 > - Redhat EL5 > kernel.org: > - 2.6.23 > > Note: Fedora C6 and Open SUSE 10.2 and Redhat EL4 up3 are not part of > the > official list. We keep the backport patches for these OSes and make sure > > OFED compile and loaded properly but will not do full QA cycle. > > Systems: > * x86_64 > * x86 > * ia64 > * ppc64* > > *Note: On PPC64 installation fails on the packages: ibutils, mvapich2, > MPI tests over Open MPI. > > > Main Changes from OFED 1.2.5 > ============================ > 1. General changes > o Kernel code based on 2.6.23 > o Quality of Service support in OpenSM, CMA, IPoIB, SRP > o Added Neteffect driver (nes) > > 2. Package and install > o There is a new install script. See OFED_Installation_Guide.txt for > more details on the new installation and build procedures. > Note: There is an easy way to install in one command line > without a conf file, and without the interactive mode. 
> Example: ./install.pl --all --prefix /usr/local > o User space packages are now in different source RPMs (as opposed to > one source RPM in previous OFED releases). > o The option for a build without installing is not supported any > more. > o Added an option to generate tarball with kernel sources for each > kernel. > > 3. IPoIB > o Stateless offloads > o IGMP for user-space multicast IB > o NAPI is enabled default > o High availability is supported via the bonding module only (removed > ipoib tool scripts) > > 4. SDP - these are not yet in the alpha release > o Keep-alive > o Asynch IO > o Send Zero Copy > > 5. iSER > o ??? > > 6. qlgc_vnic > o Update for PathScale HCA > > 7. RDS > o RDMA API (using FMRs) - under work > > 8. uDAPL - these are not yet in the alpha release > o Add DAT 2.0 API run-time library and development support. > uDAPL 2.0 will include IB extensions for IB rdma write with > immediate > data and IB atomic operations. > o Both uDAPL 1.2 and 2.0 packages will be provided and will co-exist > > 9. Libraries > a. libibverbs 1.1.1 > o Added Extended RC transport type > b. librdmacm (uCMA) 1.0.3 > > 10. OSM > o More routing performance improvements > o Even more speedups > o Better packaging/installation > o "Native" daemon mode > o Performance management > o Quality of Service manager: Based on IBTA annex > > 11. Management > o Multiple partitions > > 12. MPI: > a. OSU MVAPICH > o Version is 0.9.9 - same as in 1.2.5 - to be replaced later > b. Open MPI > o Version is 1.2.2-1 - same as in 1.2.5 - to be replaced later > c. OSU MVAPICH2 > o Version was updated to 1.0-1. > > > > Tasks that should be completed for the beta release: > ---------------------------------------------------- > 1. Integrate all SDP features > 2. Complete RDS work > 3. Apply patches that fix warning of backport patches > 4. Fix compilation problems on PPC > 5. Add qperf test from Qlogic > 6. Rebase kernel code on 2.6.24 rc1 (depending it's availability) > 7. 
Support RHEL 5 up1 > 8. SPEC files should be part of each user space package > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From hrosenstock at xsigo.com Mon Oct 15 12:17:39 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 15 Oct 2007 12:17:39 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <6BEA33D7-DCD7-4809-A75D-47801FA3EA87@lbl.gov> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com> <1192447897.4962.162.camel@hrosenstock-ws.xsigo.com> <6BEA33D7-DCD7-4809-A75D-47801FA3EA87@lbl.gov> Message-ID: <1192475859.4962.295.camel@hrosenstock-ws.xsigo.com> Hi Greg, On Mon, 2007-10-15 at 11:06 -0700, Greg Kurtzer wrote: > Yes, the patch fixes perfquery so now it supports the "-a" option > properly but I had to make some minor tweaks as I am running the > released 1.3.2 version. > > I also disabled the IBWARN Does this somehow "get in the way" ? > and also tweaked ibcheckerrs just enough > so that ibcheckerrors is reporting properly now. Ah, I didn't try that. Good catch. BTW, that same change is applicable to some other scripts. > Attached is the patch that includes both of the above modifications > and integrates properly against the 1.3.2 released tree. > > Again, thank you. 
:) Thanks for testing this out :-) -- Hal > Greg > > > > > > > On Oct 15, 2007, at 4:31 AM, Hal Rosenstock wrote: > > > On Fri, 2007-10-12 at 15:14 -0700, Hal Rosenstock wrote: > >> On Fri, 2007-10-12 at 14:59 -0700, Hal Rosenstock wrote: > >>> On Fri, 2007-10-12 at 14:47 -0700, Greg Kurtzer wrote: > >>>> ibwarn: [25274] pma_query: lid 1 port 1 > >>>> ibwarn: [25274] mad_rpc: data offs 64 sz 192 > >>>> mad data > >>>> 0101 0000 0000 0014 0000 0000 0000 0000 > >>> > >>> Thanks; AllPortSelect is off in CapabilityMask which is > >>> consistent with > >>> the behavior. (It would be trivial for those HCA PMAs to indicate > >>> AllPortSelect is supported (since it's the same as supporting one > >>> port) > >>> and then all would be fine but that's not a requirement). > >>> > >>> A check should be added in perfquery for this. I will generate a > >>> patch > >>> for that but that won't fix the problem. > >> > >> Actually, perfquery gets the number of ports and could do multiple > >> PerfGets, one per port, and accumulate the "all" ports. > >> > >> This approach may be better than dealing with the scripts. > > > > Can you try this and let me know if this resolves your issue ? The > > patch > > is against the master (OFED 1.3): > > > > diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/ > > src/perfquery.c > > index 148e452..c976fc5 100644 > > --- a/infiniband-diags/src/perfquery.c > > +++ b/infiniband-diags/src/perfquery.c > > @@ -1,5 +1,6 @@ > > /* > > * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > > * > > * This software is available to you under a choice of one of two > > * licenses. 
You may choose to be licensed under the terms of the > > GNU > > @@ -42,7 +43,7 @@ > > #include > > #include > > > > -#define __BUILD_VERSION_TAG__ 1.2.2 > > +#define __BUILD_VERSION_TAG__ 1.2.3 > > #include > > #include > > #include > > @@ -99,6 +100,9 @@ main(int argc, char **argv) > > int ca_port = 0; > > int extended = 0; > > uint16_t cap_mask; > > + int allports = 0; > > + int node_type, num_ports; > > + uint8_t data[IB_SMP_DATA_SIZE]; > > > > static char const str_opts[] = "C:P:s:t:dGearRVhu"; > > static const struct option long_opts[] = { > > @@ -191,6 +195,35 @@ main(int argc, char **argv) > > /* PerfMgt ClassPortInfo is a required attribute */ > > if (!perf_classportinfo_query(pc, &portid, port, timeout)) > > IBERROR("classportinfo query"); > > + /* ClassPortInfo should be supported as part of libibmad */ > > + memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ > > + cap_mask = ntohs(cap_mask); > > + if (!(cap_mask & 0x100)) /* bit 8 is AllPortSelect */ > > + if (port == 255) { > > + allports = 1; > > + IBWARN("AllPortSelect not supported"); > > + } > > + > > + if (allports == 1) { > > + > > + /* > > + * Simulate all ports support in PMA > > + * Determine node type, number of (physical) ports, > > + * and, if switch, whether SP0 is enhanced > > + * to determine first and last port to query > > + */ > > + > > + /* For now, support single port CAs */ > > + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) > > + IBERROR("smp query nodeinfo failed"); > > + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); > > + if (node_type != IB_NODE_CA) /* NodeType other than CA ? 
*/ > > + IBERROR("smp query nodeinfo: Node type not CA"); > > + mad_decode_field(data, IB_NODE_NPORTS_F, &num_ports); > > + if (num_ports != 1) > > + IBERROR("smp query nodeinfo: %d ports; only 1 supported > > currently", num_ports); > > + port = num_ports; > > + } > > > > if (reset_only) > > goto do_reset; > > @@ -201,9 +234,6 @@ main(int argc, char **argv) > > > > mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); > > } else { > > - /* Should ClassPortInfo be implemented in libibmad ? */ > > - memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ > > - cap_mask = ntohs(cap_mask); > > if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended > > counter support */ > > IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not > > indicated\n", cap_mask); > > > > > > > >> > >> -- Hal > >> > >>> I will try to find time to look at the scripts and see what it > >>> will take > >>> to fix this. Where AllPortSelect is not supported, they need to drop > >>> back to individual ports. 
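The CapabilityMask test in the patch quoted above, and the per-port fallback Hal describes, can be sketched in isolation. This is only a minimal illustration, assuming the bit positions given in the patch (0x100 for AllPortSelect, 0x200 for extended counters) and the CapabilityMask offset it uses (bytes 2-3 of ClassPortInfo). The helper names, the read_counter_fn callback and demo_counter are hypothetical and are not part of perfquery or libibmad.

```c
#include <arpa/inet.h>   /* ntohs */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PM_CAP_ALL_PORT_SELECT 0x100 /* bit 8, as tested in the quoted patch */
#define PM_CAP_EXT_COUNTERS    0x200 /* bit 9, as tested in the quoted patch */
#define PM_ALL_PORTS           255   /* port number meaning "all ports" */

/* Extract the 16-bit CapabilityMask stored in network byte order at
 * byte offset 2 of a PerfMgt ClassPortInfo buffer. */
static uint16_t classportinfo_cap_mask(const uint8_t *pc)
{
    uint16_t mask;

    memcpy(&mask, pc + 2, sizeof(mask));
    return ntohs(mask);
}

/* Return 1 when "all ports" was requested but the PMA does not
 * advertise AllPortSelect, so the caller has to query port by port. */
static int needs_per_port_fallback(uint16_t cap_mask, int port)
{
    return port == PM_ALL_PORTS && !(cap_mask & PM_CAP_ALL_PORT_SELECT);
}

/* Hypothetical callback: returns one counter value for one port. */
typedef uint32_t (*read_counter_fn)(int port);

/* Emulate an "all ports" read by summing ports 1..num_ports.
 * A 64-bit total is used so the sum itself cannot wrap here. */
static uint64_t accumulate_ports(read_counter_fn read_one, int num_ports)
{
    uint64_t total = 0;

    assert(read_one != NULL);
    for (int port = 1; port <= num_ports; port++)
        total += read_one(port);
    return total;
}

/* Hypothetical stand-in for a real per-port PerfGet. */
static uint32_t demo_counter(int port)
{
    return (uint32_t)port * 10;
}
```

With the MAD data quoted earlier in the thread (0101 0000 ...), bytes 2-3 are both zero, so the decoded mask is 0x0000 and a request for port 255 takes the fallback path, which matches Hal's reading that AllPortSelect is off.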
> >>> > >>> -- Hal > > -- > Greg Kurtzer > gmk at lbl.gov > > > From gmk at lbl.gov Mon Oct 15 12:26:18 2007 From: gmk at lbl.gov (Greg Kurtzer) Date: Mon, 15 Oct 2007 12:26:18 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: <1192475859.4962.295.camel@hrosenstock-ws.xsigo.com> References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com> <1192447897.4962.162.camel@hrosenstock-ws.xsigo.com> <6BEA33D7-DCD7-4809-A75D-47801FA3EA87@lbl.gov> <1192475859.4962.295.camel@hrosenstock-ws.xsigo.com> Message-ID: On Oct 15, 2007, at 12:17 PM, Hal Rosenstock wrote: > Hi Greg, > > On Mon, 2007-10-15 at 11:06 -0700, Greg Kurtzer wrote: >> Yes, the patch fixes perfquery so now it supports the "-a" option >> properly but I had to make some minor tweaks as I am running the >> released 1.3.2 version. >> >> I also disabled the IBWARN > > Does this somehow "get in the way" ? Well the scripts report on the warnings, thus the output gets rather messy. ;) > >> and also tweaked ibcheckerrs just enough >> so that ibcheckerrors is reporting properly now. 
> > Ah, I didn't try that. Good catch. BTW, that same change is applicable > to some other scripts. > >> Attached is the patch that includes both of the above modifications >> and integrates properly against the 1.3.2 released tree. >> >> Again, thank you. :) > > Thanks for testing this out :-) Anytime! I am glad to be able to help out. :) Great work guys! -- Greg Kurtzer gmk at lbl.gov From hrosenstock at xsigo.com Mon Oct 15 12:29:30 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 15 Oct 2007 12:29:30 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com> <1192447897.4962.162.camel@hrosenstock-ws.xsigo.com> <6BEA33D7-DCD7-4809-A75D-47801FA3EA87@lbl.gov> <1192475859.4962.295.camel@hrosenstock-ws.xsigo.com> Message-ID: <1192476570.4962.304.camel@hrosenstock-ws.xsigo.com> On Mon, 2007-10-15 at 12:26 -0700, Greg Kurtzer wrote: > Well the scripts report on the warnings, thus the output gets rather > messy. ;) Guess I'll need to come up with something better for this. I think that warning is useful in perfquery even though it gets in the way of the scripts. 
-- Hal From mshefty at ichips.intel.com Mon Oct 15 12:36:31 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 15 Oct 2007 12:36:31 -0700 Subject: [ofa-general] librdmacm feature request In-Reply-To: <1191767680.19888.310.camel@firewall.xsintricity.com> References: <1191767680.19888.310.camel@firewall.xsintricity.com> Message-ID: <4713C13F.70409@ichips.intel.com> > 3) The man pages on rdma_connect() and rdma_accept() aren't really > clear on the role of the connection parameters struct that gets passed > in. Specifically, it doesn't say whether or not the initiator_depth and > responder_resources in the parm struct present in the listen event are > what the other side set, or if they are already swapped to indicate the > minimum/maximum that we can set on our side of the connection. Also, I've added documentation regarding initiator_depth and responder_resources, plus fully defined the data carried in rdma_cm_event. > the initial message pointer is not detailed. When we call > rdma_accept/rdma_reject, does our parm struct need to have that same > pointer? Do we need to free that mem? Can we supply a new initial > message and not leak the memory associated with the incoming initial > message? Can you clarify what message you're referring to? My assumption is rdma_cm_event, but I want to make sure. I should have the documentation updates available for review later today or tomorrow. 
- Sean From sashak at voltaire.com Mon Oct 15 12:51:40 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 15 Oct 2007 21:51:40 +0200 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <1192473351.4962.275.camel@hrosenstock-ws.xsigo.com> References: <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> <1192459100.4962.197.camel@hrosenstock-ws.xsigo.com> <20071015150624.GZ12364@sashak.voltaire.com> <1192473351.4962.275.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071015195140.GA12364@sashak.voltaire.com> On 11:35 Mon 15 Oct , Hal Rosenstock wrote: > > > > I changed this only in umad.c files (to make it clear for internal > > implementation reviewers) and saved it as 'portid' in the header where > > API is described - an user should not care what internal meaning of > > portid is. For getting fd explicitly there is umad_get_fd(portid) > > method. > > Which could be eliminated as redundant; not sure anything is using this > API. Then this would be API change. I think it was useful method when poll() and select() on multiple file descriptors was needed and somebody can have it in a code. > > > Don't a number of them indicate int portid as an input parameter (and > > > this should now be int fd) ? Just grep for portid in those man pages... > > > > Don't think we want to make the internal in its nature "portid = fd > > feature" to be part of the public API. 'portid' is fine IMO because it > > doesn't mean a lot - just "0 or an unique positive value...", pretty > > suitable for public API. > > portid gives a level of abstraction but is it needed ? Only in sense of keeping the same API meaning. Seems you don't think it is very critical, cannot say I disagree so much. 
Hmm, let's change portid -> fd and deprecate umad_get_fd() after OFED? > If we were > starting today, would you say the same thing ? Most likely not. :) Sasha From hrosenstock at xsigo.com Mon Oct 15 12:48:00 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 15 Oct 2007 12:48:00 -0700 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <20071015195140.GA12364@sashak.voltaire.com> References: <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> <1192459100.4962.197.camel@hrosenstock-ws.xsigo.com> <20071015150624.GZ12364@sashak.voltaire.com> <1192473351.4962.275.camel@hrosenstock-ws.xsigo.com> <20071015195140.GA12364@sashak.voltaire.com> Message-ID: <1192477680.4962.309.camel@hrosenstock-ws.xsigo.com> On Mon, 2007-10-15 at 21:51 +0200, Sasha Khapyorsky wrote: > On 11:35 Mon 15 Oct , Hal Rosenstock wrote: > > > > > > I changed this only in umad.c files (to make it clear for internal > > > implementation reviewers) and saved it as 'portid' in the header where > > > API is described - an user should not care what internal meaning of > > > portid is. For getting fd explicitly there is umad_get_fd(portid) > > > method. > > > > Which could be eliminated as redundant; not sure anything is using this > > API. > > Then this would be API change. True. > I think it was useful method when poll() > and select() on multiple file descriptors was needed and somebody can > have it in a code. Good point. Wasn't thinking of that use model. > > > > Don't a number of them indicate int portid as an input parameter (and > > > > this should now be int fd) ? Just grep for portid in those man pages... > > > > > > Don't think we want to make the internal in its nature "portid = fd > > > feature" to be part of the public API.
'portid' is fine IMO because it > > > doesn't mean a lot - just "0 or an unique positive value...", pretty > > > suitable for public API. > > > > portid gives a level of abstraction but is it needed ? > > Only in sense of keeping the same API meaning. > > Seems you don't think it is very critical, cannot say I disagree so much. > Hmm, let's change portid -> fd and deprecate umad_get_fd() after OFED? I was just wondering (looking at the change as to how far should it go). Seems like what you propose is fine. We can see what additional feedback there is. No need to change after OFED 1.3. > > If we were > > starting today, would you say the same thing ? > > Most likely not. :) That's what I thought might be the case :) I was just wondering if this was the "legacy" code motivation and it seems that this is what is the primary factor in keeping it portid. I'm fine with that. -- Hal > Sasha From sean.hefty at intel.com Mon Oct 15 12:56:50 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 15 Oct 2007 12:56:50 -0700 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <20071015195140.GA12364@sashak.voltaire.com> References: <470F168A.50703@Sun.COM><1192189817.14052.259.camel@hrosenstock-ws.xsigo.com><20071014151115.GD6489@sashak.voltaire.com><4712E990.9020906@Sun.COM><20071015120848.GP12364@sashak.voltaire.com><1192454333.4962.174.camel@hrosenstock-ws.xsigo.com><20071015135432.GU12364@sashak.voltaire.com><1192459100.4962.197.camel@hrosenstock-ws.xsigo.com><20071015150624.GZ12364@sashak.voltaire.com><1192473351.4962.275.camel@hrosenstock-ws.xsigo.com> <20071015195140.GA12364@sashak.voltaire.com> Message-ID: <000001c80f65$87bea170$3c98070a@amr.corp.intel.com> >Seems you don't think it is very critical,
cannot say I disagree so much. >Hmm, let's change portid -> fd and deprecate umad_get_fd() after OFED? My vote is to retain some sort of abstraction. Once we get rid of it, it will be very hard to add it back in. My concern is that multi-thread receive handling isn't easily supported when RMPP is involved, and having umad_recv take an abstract 'id' gives us some flexibility that could come in useful someday. E.g. something like: umad_recv() -> returns too small, gives necessary size + id specific to a mad umad_recv(mad id, new size ...) -> returns reassembled rmpp mad would allow multiple threads to block for receives, with only one needing to deal with the rmpp mad. - Sean From dledford at redhat.com Mon Oct 15 12:56:06 2007 From: dledford at redhat.com (Doug Ledford) Date: Mon, 15 Oct 2007 15:56:06 -0400 Subject: [ofa-general] librdmacm feature request In-Reply-To: <4713C13F.70409@ichips.intel.com> References: <1191767680.19888.310.camel@firewall.xsintricity.com> <4713C13F.70409@ichips.intel.com> Message-ID: <1192478166.4400.25.camel@firewall.xsintricity.com> On Mon, 2007-10-15 at 12:36 -0700, Sean Hefty wrote: > > 3) The man pages on rdma_connect() and rdma_accept() aren't really > > clear on the role of the connection parameters struct that gets passed > > in. Specifically, it doesn't say whether or not the initiator_depth and > > responder_resources in the parm struct present in the listen event are > > what the other side set, or if they are already swapped to indicate the > > minimum/maximum that we can set on our side of the connection. Also, > > I've added documentation regarding initiator_depth and > responder_resources, plus fully defined the data carried in rdma_cm_event. > > > the initial message pointer is not detailed. When we call > > rdma_accept/rdma_reject, does our parm struct need to have that same > > pointer? Do we need to free that mem?
Can we supply a new initial > > message and not leak the memory associated with the incoming initial > > message? > > Can you clarify what message you're referring to? My assumption is > rdma_cm_event, but I want to make sure. No, I'm referring to the private_data pointer. After looking through the code, I can tell that on rdma_connect the contents of this pointer are copied to kernel space prior to the call returning, so it's safe to be a stack variable. That wasn't clear. And when you get a private_data pointer in the event struct during a connection request event, then from what I can tell the memory gets freed when you call rdma_ack_cm_event; this also wasn't clear. Some libraries have been known to malloc() memory and pass it to the program and expect the program to free() it when it's done. > I should have the documentation updates available for review later today > or tomorrow. > > - Sean -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From fubar at us.ibm.com Mon Oct 15 13:34:46 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Mon, 15 Oct 2007 13:34:46 -0700 Subject: [ofa-general] Re: [PATCH V7 0/8] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <4713B006.9090908@pobox.com> References: <47138EB7.40703@gmail.com> <4713B006.9090908@pobox.com> Message-ID: <27349.1192480486@death> Jeff Garzik wrote: >Moni Shoua wrote: >> This is the 7th version of this patch series. See link to V6 below.
>> >> Changes from the previous version >> --------------------------------- >> >> * Some patches required modifications to remove offsets so they can be applied with git-apply >> * Patch #3 was first modified by Jay and later by me to make it work with header_ops >> * patch #8 was changed to fix the problem that caused 'ifconfig down' to stuck (dev_close was called twice) > >I just applied the latest version Jay sent, are there any remaining changes? Yes, Moni changed patches 3 and 8 from the series I posted to fix a couple of problems. The others aren't changed from my posting of the series. Since I see you've just pushed it, do you want a patch to correct just the two individual things, or would you rather have new patches? -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com From jgarzik at pobox.com Mon Oct 15 13:50:23 2007 From: jgarzik at pobox.com (Jeff Garzik) Date: Mon, 15 Oct 2007 16:50:23 -0400 Subject: [ofa-general] Re: [PATCH V7 0/8] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <27349.1192480486@death> References: <47138EB7.40703@gmail.com> <4713B006.9090908@pobox.com> <27349.1192480486@death> Message-ID: <4713D28F.3010904@pobox.com> Jay Vosburgh wrote: > Since I see you've just pushed it, do you want a patch to > correct just the two individual things, or would you rather have new > patches? On top of what was just pushed, please. 
Jeff From fubar at us.ibm.com Mon Oct 15 14:53:53 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Mon, 15 Oct 2007 14:53:53 -0700 Subject: [ofa-general] Re: [PATCH V7 0/8] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <4713D28F.3010904@pobox.com> References: <47138EB7.40703@gmail.com> <4713B006.9090908@pobox.com> <27349.1192480486@death> <4713D28F.3010904@pobox.com> Message-ID: <31162.1192485233@death> Jeff Garzik wrote: >Jay Vosburgh wrote: >> Since I see you've just pushed it, do you want a patch to >> correct just the two individual things, or would you rather have new >> patches? > > >On top of what was just pushed, please. Ok, I'll figure that out and then rebase the locking stuff on top of that. -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com From jgarzik at pobox.com Mon Oct 15 14:56:31 2007 From: jgarzik at pobox.com (Jeff Garzik) Date: Mon, 15 Oct 2007 17:56:31 -0400 Subject: [ofa-general] Re: [PATCH V7 0/8] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <31162.1192485233@death> References: <47138EB7.40703@gmail.com> <4713B006.9090908@pobox.com> <27349.1192480486@death> <4713D28F.3010904@pobox.com> <31162.1192485233@death> Message-ID: <4713E20F.9080305@pobox.com> Jay Vosburgh wrote: > Jeff Garzik wrote: > >> Jay Vosburgh wrote: >>> Since I see you've just pushed it, do you want a patch to >>> correct just the two individual things, or would you rather have new >>> patches? >> >> On top of what was just pushed, please. > > Ok, I'll figure that out and then rebase the locking stuff on > top of that. FWIW Linus just pulled, so you may diff against mainline if you wish. Whatever is easiest for you... 
Jeff From kliteyn at dev.mellanox.co.il Mon Oct 15 15:42:36 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 16 Oct 2007 00:42:36 +0200 Subject: [ofa-general] [PATCH] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit Message-ID: <4713ECDC.2090205@dev.mellanox.co.il> Adding ClassPortInfo:CapabilityMask2 field and turning on OSM QoS capability bit (OSM_CAP2_IS_QOS_SUPPORTED). Signed-off-by: Yevgeny Kliteynik --- infiniband-diags/src/saquery.c | 6 +- opensm/include/iba/ib_types.h | 137 +++++++++++++++++++++++++++++++- opensm/include/opensm/osm_base.h | 12 +++ opensm/opensm/osm_sa_class_port_info.c | 4 +- opensm/osmtest/osmtest.c | 13 +++- 5 files changed, 162 insertions(+), 10 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index a9a8da4..e17ec5a 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -262,7 +262,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) "\t\tBase version.............%d\n" "\t\tClass version............%d\n" "\t\tCapability mask..........0x%04X\n" - "\t\tResponse time value......0x%08X\n" + "\t\tCapability mask 2........0x%08X\n" + "\t\tResponse time value......0x%02X\n" "\t\tRedirect GID.............0x%s\n" "\t\tRedirect TC/SL/FL........0x%08X\n" "\t\tRedirect LID.............0x%04X\n" @@ -279,7 +280,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) class_port_info->base_ver, class_port_info->class_ver, cl_ntoh16(class_port_info->cap_mask), - class_port_info->resp_time_val, + ib_class_cap_mask2(class_port_info), + ib_class_resp_time_val(class_port_info), sprint_gid(&(class_port_info->redir_gid), gid_str, GID_STR_LEN), cl_ntoh32(class_port_info->redir_tc_sl_fl), cl_ntoh16(class_port_info->redir_lid), diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 0969755..e1785f1 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -3247,8 +3247,7 @@
typedef struct _ib_class_port_info { uint8_t base_ver; uint8_t class_ver; ib_net16_t cap_mask; - uint8_t reserved[3]; - uint8_t resp_time_val; + uint32_t cap_mask2_resp_time; ib_gid_t redir_gid; ib_net32_t redir_tc_sl_fl; ib_net16_t redir_lid; @@ -3275,8 +3274,9 @@ typedef struct _ib_class_port_info { * cap_mask * Supported capabilities of this management class. * -* resp_time_value -* Maximum expected response time. +* cap_mask2_resp_time +* Maximum expected response time and additional +* supported capabilities of this management class. * * redr_gid * GID to use for redirection, or zero @@ -3322,6 +3322,135 @@ typedef struct _ib_class_port_info { * *********/ +/****f* IBA Base: Types/ib_class_set_resp_time_val +* NAME +* ib_class_set_resp_time_val +* +* DESCRIPTION +* Set maximum expected responce time. +* +* SYNOPSIS +*/ +static inline void OSM_API +ib_class_set_resp_time_val(IN ib_class_port_info_t * const p_cpi, + IN const uint8_t val) +{ + p_cpi->cap_mask2_resp_time = + (p_cpi->cap_mask2_resp_time & CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | + cl_hton32(val & IB_CLASS_RESP_TIME_MASK); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* val +* [in] Responce time value to set. +* +* RETURN VALUES +* None +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + +/****f* IBA Base: Types/ib_class_resp_time_val +* NAME +* ib_class_resp_time_val +* +* DESCRIPTION +* Get responce time value. +* +* SYNOPSIS +*/ +static inline uint8_t OSM_API +ib_class_resp_time_val(IN ib_class_port_info_t * const p_cpi) +{ + return (uint8_t)(cl_ntoh32(p_cpi->cap_mask2_resp_time) & + IB_CLASS_RESP_TIME_MASK); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* RETURN VALUES +* Responce time value. +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + +/****f* IBA Base: Types/ib_class_set_cap_mask_2 +* NAME +* ib_class_set_cap_mask_2 +* +* DESCRIPTION +* Set ClassPortInfo:CapabilityMask2. 
+* +* SYNOPSIS +*/ +static inline void OSM_API +ib_class_set_cap_mask2(IN ib_class_port_info_t * const p_cpi, + IN const uint32_t cap_mask2) +{ + p_cpi->cap_mask2_resp_time = (p_cpi->cap_mask2_resp_time & + CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | + cl_hton32(cap_mask2 << 5); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* cap_mask_2 +* [in] CapabilityMask2 value to set. +* +* RETURN VALUES +* None +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + +/****f* IBA Base: Types/ib_class_cap_mask2 +* NAME +* ib_class_cap_mask2 +* +* DESCRIPTION +* Get ClassPortInfo:CapabilityMask2. +* +* SYNOPSIS +*/ +static inline uint32_t OSM_API +ib_class_cap_mask2(IN const ib_class_port_info_t * const p_cpi) +{ + return (cl_ntoh32(p_cpi->cap_mask2_resp_time) >> 5); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* RETURN VALUES +* CapabilityMask2 of the ClassPortInfo. +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + /****s* IBA Base: Types/ib_sm_info_t * NAME * ib_sm_info_t diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index e635dcb..26ef067 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -661,6 +661,18 @@ typedef enum _osm_thread_state { #define OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED (1 << 13) /***********/ +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED +* Name +* OSM_CAP2_IS_QOS_SUPPORTED +* +* DESCRIPTION +* QoS is supported +* +* SYNOPSIS +*/ +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) +/***********/ + /****d* OpenSM: Base/osm_sm_state_t * NAME * osm_sm_state_t diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c index d5c9f82..96d8898 100644 --- a/opensm/opensm/osm_sa_class_port_info.c +++ b/opensm/opensm/osm_sa_class_port_info.c @@ -170,7 +170,7 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, } } rtv += 8; - p_resp_cpi->resp_time_val = rtv; + 
ib_class_set_resp_time_val(p_resp_cpi, rtv); p_resp_cpi->redir_gid = zero_gid; p_resp_cpi->redir_tc_sl_fl = 0; p_resp_cpi->redir_lid = 0; @@ -209,6 +209,8 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, p_resp_cpi->cap_mask = OSM_CAP_IS_SUBN_GET_SET_NOTICE_SUP | OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED; #endif + ib_class_set_cap_mask2(p_resp_cpi, OSM_CAP2_IS_QOS_SUPPORTED); + if (p_rcv->p_subn->opt.no_multicast_option != TRUE) p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c index 73933a3..de54f2d 100644 --- a/opensm/osmtest/osmtest.c +++ b/opensm/osmtest/osmtest.c @@ -713,10 +713,17 @@ ib_api_status_t osmtest_validate_sa_class_port_info(IN osmtest_t * const p_osmt) (ib_class_port_info_t *) ib_sa_mad_get_payload_ptr(p_resp_sa_madp); osm_log(&p_osmt->log, OSM_LOG_INFO, - "osmtest_validate_sa_class_port_info:\n-----------------------------\nSA Class Port Info:\n" - " base_ver:%u\n class_ver:%u\n cap_mask:0x%X\n resp_time_val:0x%X\n-----------------------------\n", + "osmtest_validate_sa_class_port_info:\n" + "-----------------------------\n" + "SA Class Port Info:\n" + " base_ver:%u\n" + " class_ver:%u\n" + " cap_mask:0x%X\n" + " cap_mask2:0x%X\n" + " resp_time_val:0x%X\n" + "-----------------------------\n", p_cpi->base_ver, p_cpi->class_ver, cl_ntoh16(p_cpi->cap_mask), - p_cpi->resp_time_val); + ib_class_cap_mask2(p_cpi), ib_class_resp_time_val(p_cpi)); Exit: #if 0 -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Mon Oct 15 15:44:52 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 16 Oct 2007 00:44:52 +0200 Subject: [ofa-general] [PATCH] osm: QoS parser - adding support for quoted string Message-ID: <4713ED64.8070709@dev.mellanox.co.il> Adding support for quoted strings in the policy file parser. 
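For illustration only: the lexer action in this patch copies the matched token starting after its opening quote and then overwrites the closing quote with a terminator. A minimal standalone C sketch of the same idea follows. The helper name strip_quotes() is hypothetical and not part of OpenSM, and the extra checks are assumptions added for use outside flex, where the QUOTED_TEXT pattern already guarantees both quotes are present.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of OpenSM): return a heap-allocated copy
 * of tok without its surrounding double quotes. Returns NULL if tok is
 * not wrapped in quotes or if allocation fails. Caller frees the result.
 */
static char *strip_quotes(const char *tok)
{
	size_t len = strlen(tok);
	char *out;

	if (len < 2 || tok[0] != '"' || tok[len - 1] != '"')
		return NULL;

	out = malloc(len - 1);		/* len - 2 payload bytes plus NUL */
	if (!out)
		return NULL;

	memcpy(out, tok + 1, len - 2);	/* skip the opening quote */
	out[len - 2] = '\0';		/* drop the closing quote */
	return out;
}
```

An empty quoted string ("") yields an empty string rather than NULL, which matches what the strdup-and-truncate approach in the lexer rule would produce.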
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_qos_parser.l | 8 ++++++++ 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_qos_parser.l b/opensm/opensm/osm_qos_parser.l index 60b2d1c..bfc4637 100644 --- a/opensm/opensm/osm_qos_parser.l +++ b/opensm/opensm/osm_qos_parser.l @@ -171,6 +171,7 @@ WHITE [ \t]+ NEW_LINE \n COMMENT \#.*\n WHITE_DOTDOT_WHITE [ \t]*:[ \t]* +QUOTED_TEXT \"[^\"]*\" %% @@ -298,6 +299,13 @@ WHITE_DOTDOT_WHITE [ \t]*:[ \t]* return TK_ASTERISK; } +{QUOTED_TEXT} { + SAVE_POS; + __qos_parser_lval = strdup(&__qos_parser_text[1]); + __qos_parser_lval[strlen(__qos_parser_lval)-1] = '\0'; + return TK_TEXT; + } + . { SAVE_POS; __qos_parser_lval = strdup(__qos_parser_text); return TK_TEXT;} %% -- 1.5.1.4 From Ramaswamy.Tummala at Sun.COM Mon Oct 15 16:43:01 2007 From: Ramaswamy.Tummala at Sun.COM (Ramaswamy Tummala) Date: Mon, 15 Oct 2007 16:43:01 -0700 Subject: [ofa-general] openfabrics CMA interfaces for iWARP In-Reply-To: <471388BE.3000504@opengridcomputing.com> References: <470EA544.9030101@Sun.COM> <471388BE.3000504@opengridcomputing.com> Message-ID: <4713FB05.6060105@Sun.COM> Thanks Steve for answering the questions. >> - It appears that RNIC should send IW_CM_EVENT_DISCONNECT event to CMA >> prior >> to the start of closing or aborting the connection (except in the case >> where the disconnect has been initiated by CMA itself, for example >> by CMA >> calling modify_qp entry point of RNIC to move the QP state to >> CLOSING or >> ERROR). Is this correct? > > I'm not sure I understand your question. Basically, I am trying to understand when RNIC should send IW_CM_EVENT_DISCONNECT event. Thank you, Ramaswamy. Steve Wise wrote: > > > Ramaswamy Tummala wrote: >> I have a few questions about the openfabrics CMA interfaces for iWARP. >> I'd appreciate if anyone could clarify them. 
>> >> - If RNIC's modify_qp() entry point is called to move the QP state to >> CLOSING or >> ERROR while there are some WQEs on SQ and RQ, does RNIC flush the >> incomplete >> WRs on the SQ or RQ? > > It is really up to the device, but if the rnic supports the RDMAC verbs > then it will. > >> If so, does RNIC wait until the flush is complete >> before returning modify_qp() to the caller? If RNIC does not wait >> for the >> flush to complete how does the caller know when the flush is complete >> (so that caller can poll CQ to retrieve the CQ entries)? > > This is up to the provider also. I don't think the verbs specify that > the flush will be done by the time you return from modify_qp(). Its up > to the application to deal with knowing when the flush is done. > >> >> [ Another possibility is, when RNIC's modify_qp() entry point called to >> move the QP state to CLOSING while there some WQEs on the SQ, the >> RNIC would >> internally move the QP state to ERROR. My question still is does RNIC >> wait until the flushing of incomplete WRs from SQ and RQ are done >> before >> returning modify_qp() to the caller even though it internally >> transitioned >> the QP state to ERROR. If RNIC does not wait for the flush to complete >> how does the caller know when the flush is complete? ] >> > > I think the answer is no, you cannot depend on the flush being complete > when you exit modify_qp... > >> - If RNIC's modify_qp() entry point called to move the QP state to >> CLOSING, >> does RNIC just initiate LLP CLOSE and return to the caller?, or does >> it wait >> until LLP CLOSE is complete?. > > It initiates the LLP CLOSE. It does not wait for the close to complete. 
> >> >> - It appears that RNIC should send IW_CM_EVENT_DISCONNECT event to CMA >> prior >> to the start of closing or aborting the connection (except in the case >> where the disconnect has been initiated by CMA itself, for example >> by CMA >> calling modify_qp entry point of RNIC to move the QP state to >> CLOSING or >> ERROR). Is this correct? > > I'm not sure I understand your question. > >> >> - It appears that RNIC should send IW_CM_EVENT_CLOSE event after the >> connection >> has been closed. Should this event be sent on both active and >> passive sides >> after the connection has been closed? > > Yes. > >> >> - RNIC has add_ref(struct ib_qp *qp), and rem_ref(struct ib_qp *qp) entry >> points. What is the expected use of CMA calling these entry points? >> My general >> thinking is that CMA can increase the reference count on QP (i.e. >> add_ref) >> to prevent the QP from being destroyed by RNIC. But, it is the CMA that >> initiates destroying of QP by calling destroy_qp() entry point of RNIC. >> So, CMA could maintain the reference count for QP in its own private >> data >> (instead of calling RNIC's add_ref entry point) and not call >> destroy_qp() entry point of RNIC if the reference count is not zero. > > The iWCM keeps the ref on the QP while the QP is directly associated > with a iw_cm_id. > >> >> - It appears that if RNIC's accept() entry point is called to accept an >> incoming connection, the RNIC, after successful processing of accept, >> would send IW_CM_EVENT_ESTABLISHED event to CMA. What event RNIC should >> send if the call to accept() succeeded, but later RNIC encountered some >> error in sending MPA reply message to the remote peer or some other >> error? >> In this case although the call to accept() succeeded, the connection >> could >> still be not be established. So the RNIC can not send >> IW_CM_EVENT_ESTABLISHED event. > > It is ok to block in the provider until the connection and qp are bound > and in FPDU mode. 
That's what the chelsio device does. So when the > ESTABLISHED event is posted, the MPA reply was sent and ACKed and the QP > /connection moved into FPDU mode. Then any problems in the connection > would be posted as IW_CM_EVENT_CLOSE or IW_CM_EVENT_DISCONNECT. > >> >> - It appears that a client of CMA needs to call rdma_resolve_route() >> after >> a successful rdma_resolve_addr(). Any reason for the existence of two >> interfaces instead of one interface that combines the functionality of >> both the interfaces? > > It's an InfiniBand requirement, I think. iWARP doesn't do anything for > resolve route. The "next hop route" in iWARP terms is actually > determined as part of address resolution... > > Steve. From fubar at us.ibm.com Mon Oct 15 16:44:27 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Mon, 15 Oct 2007 16:44:27 -0700 Subject: [ofa-general] [PATCH linux-2.6] bonding: two small fixes for IPoIB support In-Reply-To: <4713E20F.9080305@pobox.com> References: <47138EB7.40703@gmail.com> <4713B006.9090908@pobox.com> <27349.1192480486@death> <4713D28F.3010904@pobox.com> <31162.1192485233@death> <4713E20F.9080305@pobox.com> Message-ID: <9245.1192491867@death> Two small fixes to IPoIB support for bonding: 1- copy header_ops from slave to bonding for IPoIB slaves 2- move release and destroy logic to UNREGISTER from GOING_DOWN notifier to avoid double release Set bonding to version 3.2.1.
Signed-off-by: Moni Shoua Signed-off-by: Jay Vosburgh --- drivers/net/bonding/bond_main.c | 11 +++++------ drivers/net/bonding/bonding.h | 4 ++-- 2 files changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index db80f24..6f85cc3 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1263,6 +1263,7 @@ static void bond_setup_by_slave(struct net_device *bond_dev, struct bonding *bond = bond_dev->priv; bond_dev->neigh_setup = slave_dev->neigh_setup; + bond_dev->header_ops = slave_dev->header_ops; bond_dev->type = slave_dev->type; bond_dev->hard_header_len = slave_dev->hard_header_len; @@ -3351,7 +3352,10 @@ static int bond_slave_netdev_event(unsigned long event, struct net_device *slave switch (event) { case NETDEV_UNREGISTER: if (bond_dev) { - bond_release(bond_dev, slave_dev); + if (bond->setup_by_slave) + bond_release_and_destroy(bond_dev, slave_dev); + else + bond_release(bond_dev, slave_dev); } break; case NETDEV_CHANGE: @@ -3366,11 +3370,6 @@ static int bond_slave_netdev_event(unsigned long event, struct net_device *slave * ... Or is it this? */ break; - case NETDEV_GOING_DOWN: - dprintk("slave %s is going down\n", slave_dev->name); - if (bond->setup_by_slave) - bond_release_and_destroy(bond_dev, slave_dev); - break; case NETDEV_CHANGEMTU: /* * TODO: Should slaves be allowed to diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index a8bbd56..b818060 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -22,8 +22,8 @@ #include "bond_3ad.h" #include "bond_alb.h" -#define DRV_VERSION "3.2.0" -#define DRV_RELDATE "September 13, 2007" +#define DRV_VERSION "3.2.1" +#define DRV_RELDATE "October 15, 2007" #define DRV_NAME "bonding" #define DRV_DESCRIPTION "Ethernet Channel Bonding Driver" -- 1.5.3.1 From meier3 at llnl.gov Mon Oct 15 16:52:49 2007 From: meier3 at llnl.gov (Timothy A. 
Meier) Date: Mon, 15 Oct 2007 16:52:49 -0700 Subject: [ofa-general] [PATCH] opensm & osm_console: modified console framework to support multiple connections Message-ID: <4713FD51.4010506@llnl.gov> Sasha, This patch is setting up for adding Remote/Secure Console capability using SSL/TLS (which we need at LLNL). It's a big patch because I changed to an abstract server model, instead of the original single connection and synchronous model. There is no significant functional difference (yet). ======== From cb69c1e2c8ea526bcb1e81d079bfa787eda09ba8 Mon Sep 17 00:00:00 2001 From: Tim Meier Date: Mon, 15 Oct 2007 16:08:10 -0700 Subject: [PATCH] opensm & osm_console: modified console framework to support multiple connections Provided an abstract console service that supports the current connection types (local, loopback, socket) as well as supporting the addition of a secure connection type. * A server implementation supports multiple connections, and reduces the possibility of an inadvertent denial of service (currently vulnerable). * An IO abstraction (CIO) is employed to facilitate the future implementation of a secure socket (SSL / TLS) connection, while maintaining backward compatibility.
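As a rough illustration of the IO abstraction idea described above, here is a minimal sketch, assuming a handle that only wraps a stdio output stream. The names demo_cio_t and demo_cio_printf are hypothetical and exist only for this example; this is not the implementation of cio_printf from the patch, only a sketch of why routing all console output through one wrapper makes a later secure transport easier to add.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for a console IO handle (illustration only):
 * just the output stream, with no socket or poll state. */
typedef struct demo_cio {
	FILE *out;
} demo_cio_t;

/* printf-style wrapper. Command handlers call this instead of fprintf(),
 * so the transport behind the handle could later be swapped (for example
 * to a secure socket) without touching the callers.
 * Returns the number of characters written, or -1 on a bad handle. */
static int demo_cio_printf(demo_cio_t *cio, const char *format, ...)
{
	va_list args;
	int n;

	if (!cio || !cio->out)
		return -1;

	va_start(args, format);
	n = vfprintf(cio->out, format, args);
	va_end(args);
	return n;
}
```

A handler would then be written against the handle, e.g. demo_cio_printf(&cio, "Current log level is 0x%x\n", level), rather than against a FILE pointer directly.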
Signed-off-by: Tim Meier --- opensm/include/opensm/osm_console.h | 35 +- opensm/opensm/main.c | 77 ++- opensm/opensm/osm_console.c | 1500 +++++++++++++++++++++++++---------- 3 files changed, 1177 insertions(+), 435 deletions(-) diff --git a/opensm/include/opensm/osm_console.h b/opensm/include/opensm/osm_console.h index 33e41e7..75111a4 100644 --- a/opensm/include/opensm/osm_console.h +++ b/opensm/include/opensm/osm_console.h @@ -49,6 +49,14 @@ #define OSM_DEFAULT_CONSOLE OSM_DISABLE_CONSOLE #define OSM_DEFAULT_CONSOLE_PORT 10000 #define OSM_DAEMON_NAME "opensm" +#define OSM_QUIT_CMD "quit" +#define OSM_LOOP_PERIOD_SEC 2 + +#define CIO_BUFSIZE 1024 +#define CIO_INFO_SIZE 128 +#define CIO_NOTE_SIZE 64 +#define CIO_MAX_CONNECTS 5 +#define CIO_CONNECTION_PORT 10000 #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -59,10 +67,29 @@ #endif /* __cplusplus */ BEGIN_C_DECLS -void osm_console_init(osm_subn_opt_t * opt, osm_opensm_t * p_osm); -void osm_console(osm_opensm_t * p_osm); -void osm_console_prompt(FILE * out); -void osm_console_close_socket(osm_opensm_t * p_osm); + +/* TODO move when fully implemented */ +typedef struct _CIO_t +{ + int fd; // file descriptor (socket) + FILE *out; + FILE *err; + FILE *in; + struct pollfd *pfd; +} CIO_t; + +int osm_console_server(osm_subn_opt_t *p_opt, osm_opensm_t *p_osm); +void osm_console_server_init(osm_subn_opt_t *opt, osm_opensm_t *p_osm); +void osm_console_server_destroy(osm_opensm_t *p_osm); +int is_console_enabled(osm_subn_opt_t *p_opt); + +/* TODO move along with other IO abstraction code */ +int cio_printf( CIO_t *cio, const char *format, ...); +int cio_flush( CIO_t *cio); +int cio_getline( char **lineptr, size_t *n, CIO_t *cio); +int cio_open( CIO_t *cio); +int cio_close( CIO_t *cio); +int cio_poll(CIO_t *cio, int timeout); END_C_DECLS #endif /* _OSM_CONSOLE_H_ */ diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 0250551..b744157 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -229,11 
+229,13 @@ void show_usage(void) " SMPs.\n" " Without -maxsmps, OpenSM defaults to a maximum of\n" " 4 outstanding SMPs.\n\n"); - printf("-console [off|local" #ifdef ENABLE_OSM_CONSOLE_SOCKET - "|socket|loopback" + printf("-console [%s|%s|%s|%s]", OSM_DISABLE_CONSOLE, OSM_LOCAL_CONSOLE, + OSM_REMOTE_CONSOLE, OSM_LOOPBACK_CONSOLE); +#else + printf("-console [%s|%s]", OSM_DISABLE_CONSOLE, OSM_LOCAL_CONSOLE); #endif - "]\n This option activates the OpenSM console (default off).\n\n"); + printf("]\n This option activates the OpenSM console (default off).\n\n"); #ifdef ENABLE_OSM_CONSOLE_SOCKET printf("-console-port \n" " Specify an alternate telnet port for the console (default %d).\n\n", @@ -566,6 +568,45 @@ static int daemonize(osm_opensm_t * osm) return 0; } +/* simple server to provide an interface to support + * interactive (and non-interactive) commands + * loop here until an exit signal is received + * + * currently just support a command console + */ +void osm_opensm_server(osm_subn_opt_t *p_opt, osm_opensm_t *p_osm) +{ + if(is_console_enabled(p_opt)) + osm_console_server_init(p_opt, p_osm); + + /* + Sit here forever - dwelling or running the server + */ + while (!osm_exit_flag) + { + if(is_console_enabled(p_opt)) + osm_console_server(p_opt, p_osm); + else + cl_thread_suspend( 10000); + + if (osm_usr1_flag) + { + osm_usr1_flag = 0; + osm_log_reopen_file(&(p_osm->log)); + } + if (osm_hup_flag) + { + osm_hup_flag = 0; + /* a HUP signal should only start a new heavy sweep */ + p_osm->subn.force_immediate_heavy_sweep = TRUE; + osm_opensm_sweep(p_osm); + } + } + + if(is_console_enabled(p_opt)) + osm_console_server_destroy(p_osm); +} + /********************************************************************** **********************************************************************/ int main(int argc, char *argv[]) @@ -1034,34 +1075,8 @@ int main(int argc, char *argv[]) osm_exit_flag = 1; } } else { - osm_console_init(&opt, &osm); - - /* - Sit here forever - */ - while 
(!osm_exit_flag) { - if (strcmp(opt.console, OSM_LOCAL_CONSOLE) == 0 -#ifdef ENABLE_OSM_CONSOLE_SOCKET - || strcmp(opt.console, OSM_REMOTE_CONSOLE) == 0 - || strcmp(opt.console, OSM_LOOPBACK_CONSOLE) == 0 -#endif - ) - osm_console(&osm); - else - cl_thread_suspend(10000); - - if (osm_usr1_flag) { - osm_usr1_flag = 0; - osm_log_reopen_file(&osm.log); - } - if (osm_hup_flag) { - osm_hup_flag = 0; - /* a HUP signal should only start a new heavy sweep */ - osm.subn.force_immediate_heavy_sweep = TRUE; - osm_opensm_sweep(&osm); - } - } - osm_console_close_socket(&osm); + // start a server that runs indefinately + osm_opensm_server(&opt, &osm); } #if 0 diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index c6e02ab..9d62774 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -38,15 +38,16 @@ #define _GNU_SOURCE /* for getline */ #include #include +#include #include #include #include #include #ifdef ENABLE_OSM_CONSOLE_SOCKET #include -#endif #include #include +#endif #include #include #include @@ -57,20 +58,113 @@ #include #include +typedef struct _LoopCmd +{ + int on; + int running; + int delay_s; + void (*loop_function)(osm_opensm_t *p_osm, CIO_t *out); + cl_thread_t loopThread; // a specific thread for each looping cmd +} LoopCmd; + +// unique attributes for each connection +typedef struct _osm_console_thread_t +{ + int used; + unsigned short int port; + int authorized; + int state; + char name[CIO_INFO_SIZE]; + char in_buff[CIO_BUFSIZE]; + char out_buff[CIO_BUFSIZE]; + char client_type[CIO_NOTE_SIZE]; // maps to option->console (off|local|socket) + char client_ip[CIO_NOTE_SIZE]; + char client_hn[CIO_INFO_SIZE]; + unsigned int thread_num; // a unique ever increasing number + osm_opensm_t *p_osm; // the global opensm singleton (protect with lock) + CIO_t io; // the io streams for the connection + LoopCmd loop_command; + cl_thread_t consoleThread; // a specific thread each console connection + struct timeval connect_time; +} 
osm_console_thread_t; + struct command { - char *name; - void (*help_function) (FILE * out, int detail); - void (*parse_function) (char **p_last, osm_opensm_t * p_osm, - FILE * out); + char *name; + void (*help_function)(CIO_t *out, int detail); + void (*parse_function)(char **p_last, osm_console_thread_t *p_oct, CIO_t *out); }; -struct { - int on; - int delay_s; - time_t previous; - void (*loop_function) (osm_opensm_t * p_osm, FILE * out); -} loop_command = { -on: 0, delay_s: 2, loop_function:NULL}; +/* connection pool for remote clients - currently only consoles */ +static osm_console_thread_t ConsoleThreadPool[CIO_MAX_CONNECTS]; +static cl_plock_t ThreadLock; +static volatile unsigned int cio_thread_counter = 0; +static struct timeval ServerTime; + +/********************************************************************** + * convenience function + **********************************************************************/ +CIO_t* getCIO(osm_console_thread_t *oct) +{ + return &oct->io; +} + +/********************************************************************** + * thread pool primitive: counts the number currently in use + **********************************************************************/ +int num_console_threads(void) +{ + // count them up + + int i; + int num = 0; + + cl_plock_acquire(&ThreadLock); + for(i = 0; i < CIO_MAX_CONNECTS; ++i) + { + if(ConsoleThreadPool[i].used != 0) + num++; + } + cl_plock_release(&ThreadLock); + + return num; +} + +/********************************************************************** + * thread pool primitive: the current value reflects the number of + * connection attempts made since program execution. 
+ **********************************************************************/ +unsigned int get_console_thread_counter(void) +{ + return cio_thread_counter; +} + +int is_loopback(char* str) +{ + // convenience - checks if socket based connection + if(str) + return (strcmp(str, OSM_LOOPBACK_CONSOLE) == 0); +return 0; +} + +int is_remote(char* str) +{ + // convenience - checks if socket based connection + if(str) + return (strcmp(str, OSM_REMOTE_CONSOLE) == 0) + || is_loopback(str); +return 0; +} + +int is_console_enabled(osm_subn_opt_t *p_opt) +{ + // checks for a variety of types of consoles - default is off or 0 + if(p_opt) + return ((strcmp(p_opt->console, OSM_LOCAL_CONSOLE) == 0) + || (strcmp(p_opt->console, OSM_LOOPBACK_CONSOLE) == 0) + || (strcmp(p_opt->console, OSM_REMOTE_CONSOLE) == 0)); +return 0; +} + static const struct command console_cmds[]; @@ -79,114 +173,103 @@ static inline char *next_token(char **p_last) return strtok_r(NULL, " \t\n\r", p_last); } -static void help_command(FILE * out, int detail) +static void help_command(CIO_t *out, int detail) { int i; - fprintf(out, "Supported commands and syntax:\n"); - fprintf(out, "help []\n"); + cio_printf(out, "Supported commands and syntax:\n"); + cio_printf(out, "help []\n"); /* skip help command */ for (i = 1; console_cmds[i].name; i++) console_cmds[i].help_function(out, 0); } -static void help_quit(FILE * out, int detail) +static void help_quit(CIO_t *out, int detail) { - fprintf(out, "quit (not valid in local mode; use ctl-c)\n"); + cio_printf(out, "%s -- stops the console\n", OSM_QUIT_CMD); + if (detail) { + cio_printf(out, " OpenSM will continue, to kill; \n"); + cio_printf(out, " use ctrl-C in local mode or\n"); + cio_printf(out, " kill the process\n"); + } } -static void help_loglevel(FILE * out, int detail) + +static void help_loglevel(CIO_t *out, int detail) { - fprintf(out, "loglevel []\n"); + cio_printf(out, "loglevel []\n"); if (detail) { - fprintf(out, " log-level is OR'ed from the following\n"); 
- fprintf(out, " OSM_LOG_NONE 0x%02X\n", - OSM_LOG_NONE); - fprintf(out, " OSM_LOG_ERROR 0x%02X\n", - OSM_LOG_ERROR); - fprintf(out, " OSM_LOG_INFO 0x%02X\n", - OSM_LOG_INFO); - fprintf(out, " OSM_LOG_VERBOSE 0x%02X\n", - OSM_LOG_VERBOSE); - fprintf(out, " OSM_LOG_DEBUG 0x%02X\n", - OSM_LOG_DEBUG); - fprintf(out, " OSM_LOG_FUNCS 0x%02X\n", - OSM_LOG_FUNCS); - fprintf(out, " OSM_LOG_FRAMES 0x%02X\n", - OSM_LOG_FRAMES); - fprintf(out, " OSM_LOG_ROUTING 0x%02X\n", - OSM_LOG_ROUTING); - fprintf(out, " OSM_LOG_SYS 0x%02X\n", - OSM_LOG_SYS); - fprintf(out, "\n"); - fprintf(out, " OSM_LOG_DEFAULT_LEVEL 0x%02X\n", - OSM_LOG_DEFAULT_LEVEL); + cio_printf(out, " log-level is OR'ed from the following\n"); + cio_printf(out, " OSM_LOG_NONE 0x%02X\n", OSM_LOG_NONE); + cio_printf(out, " OSM_LOG_ERROR 0x%02X\n", OSM_LOG_ERROR); + cio_printf(out, " OSM_LOG_INFO 0x%02X\n", OSM_LOG_INFO); + cio_printf(out, " OSM_LOG_VERBOSE 0x%02X\n", OSM_LOG_VERBOSE); + cio_printf(out, " OSM_LOG_DEBUG 0x%02X\n", OSM_LOG_DEBUG); + cio_printf(out, " OSM_LOG_FUNCS 0x%02X\n", OSM_LOG_FUNCS); + cio_printf(out, " OSM_LOG_FRAMES 0x%02X\n", OSM_LOG_FRAMES); + cio_printf(out, " OSM_LOG_ROUTING 0x%02X\n", OSM_LOG_ROUTING); + cio_printf(out, " OSM_LOG_SYS 0x%02X\n", OSM_LOG_SYS); + cio_printf(out, "\n"); + cio_printf(out, " OSM_LOG_DEFAULT_LEVEL 0x%02X\n", OSM_LOG_DEFAULT_LEVEL); } } -static void help_priority(FILE * out, int detail) +static void help_priority(CIO_t *out, int detail) { - fprintf(out, "priority []\n"); + cio_printf(out, "priority []\n"); } -static void help_resweep(FILE * out, int detail) +static void help_resweep(CIO_t *out, int detail) { - fprintf(out, "resweep [heavy|light]\n"); + cio_printf(out, "resweep [heavy|light]\n"); } -static void help_status(FILE * out, int detail) +static void help_status(CIO_t *out, int detail) { - fprintf(out, "status [loop]\n"); + cio_printf(out, "status [loop]\n"); if (detail) { - fprintf(out, " loop -- type \"q\" to quit\n"); + cio_printf(out, " loop -- type 
\"q\" to quit\n"); } } -static void help_logflush(FILE * out, int detail) +static void help_logflush(CIO_t *out, int detail) { - fprintf(out, "logflush -- flush the opensm.log file\n"); + cio_printf(out, "logflush -- flush the opensm.log file\n"); } -static void help_querylid(FILE * out, int detail) +static void help_querylid(CIO_t *out, int detail) { - fprintf(out, - "querylid lid -- print internal information about the lid specified\n"); + cio_printf(out, + "querylid lid -- print internal information about the lid specified\n"); } -static void help_portstatus(FILE * out, int detail) +static void help_portstatus(CIO_t *out, int detail) { - fprintf(out, "portstatus [ca|switch|router]\n"); + cio_printf(out, "portstatus [ca|switch|router]\n"); if (detail) { - fprintf(out, "summarize port status\n"); - fprintf(out, - " [ca|switch|router] -- limit the results to the node type specified\n"); + cio_printf(out, "summarize port status\n"); + cio_printf(out, " [ca|switch|router] -- limit the results to the node type specified\n"); } } #ifdef ENABLE_OSM_PERF_MGR -static void help_perfmgr(FILE * out, int detail) +static void help_perfmgr(CIO_t *out, int detail) { - fprintf(out, - "perfmgr [enable|disable|clear_counters|dump_counters|sweep_time[seconds]]\n"); + cio_printf(out, "perfmgr [enable|disable|clear_counters|dump_counters|sweep_time[seconds]]\n"); if (detail) { - fprintf(out, - "perfmgr -- print the performance manager state\n"); - fprintf(out, - " [enable|disable] -- change the perfmgr state\n"); - fprintf(out, - " [sweep_time] -- change the perfmgr sweep time (requires [seconds] option)\n"); - fprintf(out, - " [clear_counters] -- clear the counters stored\n"); - fprintf(out, - " [dump_counters [mach]] -- dump the counters (optionally in [mach]ine readable format)\n"); + cio_printf(out, "perfmgr -- print the performance manager state\n"); + cio_printf(out, " [enable|disable] -- change the perfmgr state\n"); + cio_printf(out, " [sweep_time] -- change the perfmgr sweep 
time (requires [seconds] option)\n"); + cio_printf(out, " [clear_counters] -- clear the counters stored\n"); + cio_printf(out, " [dump_counters [mach]] -- dump the counters (optionally in [mach]ine readable format)\n"); } } #endif /* ENABLE_OSM_PERF_MGR */ /* more help routines go here */ -static void help_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void help_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { char *p_cmd; int i, found = 0; @@ -203,21 +286,21 @@ static void help_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) } } if (!found) { - fprintf(out, "%s : Command not found\n\n", p_cmd); + cio_printf(out, "%s : Command not found\n\n", p_cmd); help_command(out, 0); } } } -static void loglevel_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void loglevel_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { + osm_opensm_t *p_osm = p_oct->p_osm; char *p_cmd; int level; p_cmd = next_token(p_last); if (!p_cmd) - fprintf(out, "Current log level is 0x%x\n", - osm_log_get_level(&p_osm->log)); + cio_printf(out, "Current log level is 0x%x\n", osm_log_get_level(&p_osm->log)); else { /* Handle x, 0x, and decimal specification of log level */ if (!strncmp(p_cmd, "x", 1)) { @@ -231,31 +314,29 @@ static void loglevel_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) level = strtol(p_cmd, NULL, 10); } if ((level >= 0) && (level < 256)) { - fprintf(out, "Setting log level to 0x%x\n", level); + cio_printf(out, "Setting log level to 0x%x\n", level); osm_log_set_level(&p_osm->log, level); } else - fprintf(out, "Invalid log level 0x%x\n", level); + cio_printf(out, "Invalid log level 0x%x\n", level); } } -static void priority_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void priority_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { + osm_opensm_t *p_osm = p_oct->p_osm; char *p_cmd; int priority; p_cmd = next_token(p_last); if (!p_cmd) - fprintf(out, "Current sm-priority is %d\n", 
- p_osm->subn.opt.sm_priority); + cio_printf(out, "Current sm-priority is %d\n", p_osm->subn.opt.sm_priority); else { priority = strtol(p_cmd, NULL, 0); if (0 > priority || 15 < priority) - fprintf(out, - "Invalid sm-priority %d; must be between 0 and 15\n", - priority); + cio_printf(out, "Invalid sm-priority %d; must be between 0 and 15\n", priority); else { - fprintf(out, "Setting sm-priority to %d\n", priority); - p_osm->subn.opt.sm_priority = (uint8_t) priority; + cio_printf(out, "Setting sm-priority to %d\n", priority); + p_osm->subn.opt.sm_priority = (uint8_t)priority; /* Does the SM state machine need a kick now ? */ } } @@ -371,24 +452,23 @@ static char *sm_state_mgr_str(osm_sm_state_t state) } } -static void print_status(osm_opensm_t * p_osm, FILE * out) +static void print_status(osm_opensm_t *p_osm, CIO_t *out) { if (out) { - fprintf(out, " OpenSM Version : %s\n", OSM_VERSION); - fprintf(out, " SM State/Mgr State : %s/%s\n", + cio_printf(out, " OpenSM Version : %s\n", OSM_VERSION); + cio_printf(out, " SM State/Mgr State : %s/%s\n", sm_state_str(p_osm->subn.sm_state), sm_state_mgr_str(p_osm->sm.state_mgr.state)); - fprintf(out, " SA State : %s\n", + cio_printf(out, " SA State : %s\n", sa_state_str(p_osm->sa.state)); - fprintf(out, " Routing Engine : %s\n", - p_osm->routing_engine.name ? p_osm->routing_engine. - name : "null (min-hop)"); + cio_printf(out, " Routing Engine : %s\n", + p_osm->routing_engine.name ? 
p_osm->routing_engine.name : "null (min-hop)"); #ifdef ENABLE_OSM_PERF_MGR - fprintf(out, "\n PerfMgr state/sweep state : %s/%s\n", + cio_printf(out, "\n PerfMgr state/sweep state : %s/%s\n", osm_perfmgr_get_state_str(&(p_osm->perfmgr)), osm_perfmgr_get_sweep_state_str(&(p_osm->perfmgr))); #endif - fprintf(out, "\n MAD stats\n" + cio_printf(out, "\n MAD stats\n" " ---------\n" " QP0 MADs outstanding : %d\n" " QP0 MADs outstanding (on wire) : %d\n" @@ -412,7 +492,7 @@ static void print_status(osm_opensm_t * p_osm, FILE * out) p_osm->stats.sa_mads_sent, p_osm->stats.sa_mads_rcvd_unknown, p_osm->stats.sa_mads_ignored); - fprintf(out, "\n Subnet flags\n" + cio_printf(out, "\n Subnet flags\n" " ------------\n" " Ignore existing lfts : %d\n" " Subnet Init errors : %d\n" @@ -426,32 +506,24 @@ static void print_status(osm_opensm_t * p_osm, FILE * out) p_osm->subn.moved_to_master_state, p_osm->subn.first_time_master_sweep, p_osm->subn.coming_out_of_standby); - fprintf(out, "\n"); - } -} - -static int loop_command_check_time(void) -{ - time_t cur = time(NULL); - if ((loop_command.previous + loop_command.delay_s) < cur) { - loop_command.previous = cur; - return (1); + cio_printf(out, "\n"); } - return (0); } -static void status_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void status_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { + osm_opensm_t *p_osm = p_oct->p_osm; char *p_cmd; p_cmd = next_token(p_last); if (p_cmd) { if (strcmp(p_cmd, "loop") == 0) { - fprintf(out, "Looping on status command...\n"); - fflush(out); - loop_command.on = 1; - loop_command.previous = time(NULL); - loop_command.loop_function = print_status; + cio_printf(out, "Looping on status command...\n"); + cio_flush(out); + p_oct->loop_command.on = 1; + p_oct->loop_command.delay_s = OSM_LOOP_PERIOD_SEC; + p_oct->loop_command.running = 0; + p_oct->loop_command.loop_function = print_status; } else { help_status(out, 1); return; @@ -460,14 +532,15 @@ static void 
status_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) print_status(p_osm, out); } -static void resweep_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void resweep_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { + osm_opensm_t *p_osm = p_oct->p_osm; char *p_cmd; p_cmd = next_token(p_last); if (!p_cmd || (strcmp(p_cmd, "heavy") != 0 && strcmp(p_cmd, "light") != 0)) { - fprintf(out, "Invalid resweep command\n"); + cio_printf(out, "Invalid resweep command\n"); help_resweep(out, 1); } else { if (strcmp(p_cmd, "heavy") == 0) { @@ -477,20 +550,21 @@ static void resweep_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) } } -static void logflush_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void logflush_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { - fflush(p_osm->log.out_port); + fflush(p_oct->p_osm->log.out_port); } -static void querylid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void querylid_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { - int p = 0; - uint16_t lid = 0; + osm_opensm_t *p_osm = p_oct->p_osm; + int p = 0; + uint16_t lid = 0; osm_port_t *p_port = NULL; char *p_cmd = next_token(p_last); if (!p_cmd) { - fprintf(out, "no LID specified\n"); + cio_printf(out, "no LID specified\n"); help_querylid(out, 1); return; } @@ -503,8 +577,8 @@ static void querylid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) if (!p_port) goto invalid_lid; - fprintf(out, "Query results for LID %d\n", lid); - fprintf(out, + cio_printf(out, "Query results for LID %d\n", lid); + cio_printf(out, " GUID : 0x%016" PRIx64 "\n" " Node Desc : %s\n" " Node Type : %s\n" @@ -518,20 +592,19 @@ static void querylid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) p = 0; else p = 1; - for ( /* see above */ ; p < p_port->p_node->physp_tbl_size; p++) { - fprintf(out, + for (/* see above */; p < p_port->p_node->physp_tbl_size; p++) { + cio_printf(out, " Port %d health : 
%s\n", p, - p_port->p_node->physp_table[p]. - healthy ? "OK" : "ERROR"); + p_port->p_node->physp_table[p].healthy ? "OK" : "ERROR"); } cl_plock_release(&p_osm->lock); return; - invalid_lid: +invalid_lid: cl_plock_release(&p_osm->lock); - fprintf(out, "Invalid lid %d\n", lid); + cio_printf(out, "Invalid lid %d\n", lid); return; } @@ -564,11 +637,11 @@ __tag_port_report(port_report_t ** head, uint64_t node_guid, *head = rep; } -static void __print_port_report(FILE * out, port_report_t * head) +static void __print_port_report(CIO_t *out, port_report_t *head) { port_report_t *item = head; while (item != NULL) { - fprintf(out, " 0x%016" PRIx64 " %d (%s)\n", + cio_printf(out, " 0x%016"PRIx64" %d (%s)\n", item->node_guid, item->port_num, item->print_desc); port_report_t *next = item->next; free(item); @@ -689,10 +762,11 @@ static void __get_stats(cl_map_item_t * const p_map_item, void *context) } } -static void portstatus_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void portstatus_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { - fabric_stats_t fs; - struct timeval before, after; + osm_opensm_t *p_osm = p_oct->p_osm; + fabric_stats_t fs; + struct timeval before, after; char *p_cmd; memset(&fs, 0, sizeof(fs)); @@ -706,7 +780,7 @@ static void portstatus_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) } else if (strcmp(p_cmd, "router") == 0) { fs.node_type_lim = IB_NODE_TYPE_ROUTER; } else { - fprintf(out, "Node type not understood\n"); + cio_printf(out, "Node type not understood\n"); help_portstatus(out, 1); return; } @@ -723,58 +797,56 @@ static void portstatus_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) gettimeofday(&after, NULL); /* report the stats */ - fprintf(out, "\"%s\" port status:\n", - fs.node_type_lim ? ib_get_node_type_str(fs. 
- node_type_lim) : "ALL"); - fprintf(out, - " %" PRIu64 " port(s) scanned on %" PRIu64 - " nodes in %lu us\n", fs.total_ports, fs.total_nodes, - after.tv_usec - before.tv_usec); + cio_printf(out, "\"%s\" port status:\n", + fs.node_type_lim ? ib_get_node_type_str(fs.node_type_lim) : "ALL"); + cio_printf(out, " %"PRIu64" port(s) scanned on %"PRIu64" nodes in %lu us\n", + fs.total_ports, fs.total_nodes, after.tv_usec - before.tv_usec); if (fs.ports_down) - fprintf(out, " %" PRIu64 " down\n", fs.ports_down); + cio_printf(out, " %"PRIu64" down\n", fs.ports_down); if (fs.ports_active) - fprintf(out, " %" PRIu64 " active\n", fs.ports_active); + cio_printf(out, " %"PRIu64" active\n", fs.ports_active); if (fs.ports_1X) - fprintf(out, " %" PRIu64 " at 1X\n", fs.ports_1X); + cio_printf(out, " %"PRIu64" at 1X\n", fs.ports_1X); if (fs.ports_4X) - fprintf(out, " %" PRIu64 " at 4X\n", fs.ports_4X); + cio_printf(out, " %"PRIu64" at 4X\n", fs.ports_4X); if (fs.ports_8X) - fprintf(out, " %" PRIu64 " at 8X\n", fs.ports_8X); + cio_printf(out, " %"PRIu64" at 8X\n", fs.ports_8X); if (fs.ports_12X) - fprintf(out, " %" PRIu64 " at 12X\n", fs.ports_12X); + cio_printf(out, " %"PRIu64" at 12X\n", fs.ports_12X); if (fs.ports_sdr) - fprintf(out, " %" PRIu64 " at 2.5 Gbps\n", fs.ports_sdr); + cio_printf(out, " %"PRIu64" at 2.5 Gbps\n", fs.ports_sdr); if (fs.ports_ddr) - fprintf(out, " %" PRIu64 " at 5.0 Gbps\n", fs.ports_ddr); + cio_printf(out, " %"PRIu64" at 5.0 Gbps\n", fs.ports_ddr); if (fs.ports_qdr) - fprintf(out, " %" PRIu64 " at 10.0 Gbps\n", fs.ports_qdr); + cio_printf(out, " %"PRIu64" at 10.0 Gbps\n", fs.ports_qdr); if (fs.ports_disabled + fs.ports_reduced_speed + fs.ports_reduced_width - > 0) { - fprintf(out, "\nPossible issues:\n"); + > 0) { + cio_printf(out, "\nPossible issues:\n"); } if (fs.ports_disabled) { - fprintf(out, " %" PRIu64 " disabled\n", fs.ports_disabled); + cio_printf(out, " %"PRIu64" disabled\n", fs.ports_disabled); __print_port_report(out, fs.disabled_ports); } if 
(fs.ports_reduced_speed) { - fprintf(out, " %" PRIu64 " with reduced speed\n", + cio_printf(out, " %"PRIu64" with reduced speed\n", fs.ports_reduced_speed); __print_port_report(out, fs.reduced_speed_ports); } if (fs.ports_reduced_width) { - fprintf(out, " %" PRIu64 " with reduced width\n", + cio_printf(out, " %"PRIu64" with reduced width\n", fs.ports_reduced_width); __print_port_report(out, fs.reduced_width_ports); } - fprintf(out, "\n"); + cio_printf(out, "\n"); } #ifdef ENABLE_OSM_PERF_MGR -static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void perfmgr_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { + osm_opensm_t *p_osm = p_oct->p_osm; char *p_cmd; p_cmd = next_token(p_last); @@ -803,309 +875,937 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) osm_perfmgr_set_sweep_time_s(&(p_osm->perfmgr), time_s); } else { - fprintf(out, + cio_printf(out, "sweep_time requires a time period (in seconds) to be specified\n"); } } else { - fprintf(out, "\"%s\" option not found\n", p_cmd); + cio_printf(out, "\"%s\" option not found\n", p_cmd); } } else { - fprintf(out, "Performance Manager status:\n" + cio_printf(out, "Performance Manager status:\n" "state : %s\n" "sweep state : %s\n" "sweep time : %us\n" - "outstanding queries/max : %d/%u\n" - "loaded event plugin : %s\n", + "outstanding queries/max : %d/%u\n", osm_perfmgr_get_state_str(&(p_osm->perfmgr)), osm_perfmgr_get_sweep_state_str(&(p_osm->perfmgr)), osm_perfmgr_get_sweep_time_s(&(p_osm->perfmgr)), p_osm->perfmgr.outstanding_queries, - p_osm->perfmgr.max_outstanding_queries, - p_osm->perfmgr.event_plugin ? 
- p_osm->perfmgr.event_plugin->plugin_name : "NONE"); + p_osm->perfmgr.max_outstanding_queries); } } #endif /* ENABLE_OSM_PERF_MGR */ -/* This is public to be able to close it on exit */ -void osm_console_close_socket(osm_opensm_t * p_osm) +static void help_version(CIO_t *out, int detail) { - if (p_osm->console.socket > 0) { - close(p_osm->console.in_fd); - p_osm->console.in_fd = -1; - p_osm->console.out_fd = -1; - p_osm->console.in = NULL; - p_osm->console.out = NULL; - } + cio_printf(out, "version -- print the OSM version\n"); } -static void quit_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) +static void version_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) { - osm_console_close_socket(p_osm); + cio_printf(out, "%s build %s %s\n", OSM_VERSION, __DATE__, __TIME__); } -static void help_version(FILE * out, int detail) +/********************************************************************** + * thread pool primitive: returns the thread structure to the pool, and + * makes it available + **********************************************************************/ +int free_console_thread(osm_console_thread_t *oct) { - fprintf(out, "version -- print the OSM version\n"); + // just clear the used flag, mark as available + oct->used = 0; + return 1; +} + +/********************************************************************** + * Cleans up the thread that was established for a connection. + * The connection should already be closed. This method releases + * any resources and destroy the thread (done automagically??) + * + * refer to: osm_console_thread and osm_console_thread_init +**********************************************************************/ +int osm_console_thread_destroy(osm_console_thread_t *oct) +{ + free_console_thread(oct); + + // there are a few end cases that might need this (e.g. 
not completely init) + cio_close(getCIO(oct)); + + return 0; } -static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) + +/********************************************************************** + * Gracefully shut down the console connection, release resources + * refer to: osm_console_init + **********************************************************************/ +void osm_console_destroy(osm_console_thread_t *p_oct) +{ + osm_opensm_t *p_osm = p_oct->p_osm; + CIO_t *out = getCIO(p_oct); + + osm_log(&(p_osm->log), OSM_LOG_INFO, + "osm_console_destroy: Console connection being closed: %s (%s) s#%d\n", p_oct->client_hn, + p_oct->client_ip, out->fd); + fflush(p_osm->log.out_port); + cio_printf(out, "Closing this connection from osm_console_destroy\n"); + + cio_close(out); + } + +/********************************************************************** + * thread pool primitive: kills and disconnects connections. If the + * argument is a current thread, it will NOT be cleared (will be skipped) + **********************************************************************/ +int kill_console_thread_pool(osm_console_thread_t* p_oct, osm_opensm_t *p_osm) +{ + // kill everything but my connection if p_oct is in the list + int i; + osm_console_thread_t* oct; + CIO_t *p_out = getCIO(p_oct); + CIO_t *out = getCIO(p_oct); + + // brute force this, don't use locks because don't want to get deadlocked +// cl_plock_acquire(&ThreadLock); + for(i = 0; i < CIO_MAX_CONNECTS; ++i) + { + oct = &ConsoleThreadPool[i]; + if((oct) && (oct->used) && (p_oct != oct)) + { + cio_printf(p_out, " killing thread: %s\n", oct->name); + out = getCIO(oct); + + // disconnect gracefully?? 
+ osm_log(&(p_osm->log), OSM_LOG_INFO, + "kill_console_thread_pool: %d (s#%d)\n", i, out->fd); + + // return all the console resources + osm_console_destroy(oct); + + // return all the thread and connection resources + osm_console_thread_destroy(oct); + } + } +// cl_plock_release(&ThreadLock); + return i; +} + +/********************************************************************** + * releases all of the resources used by all of the connections, by + * closing sockets, freeing threads, etc.. + * + * a good method for handling a kill signal + **********************************************************************/ +int free_console_threads(osm_opensm_t *p_osm) { - fprintf(out, "%s build %s %s\n", OSM_VERSION, __DATE__, __TIME__); + // just make sure everything is gone + int rtnval = kill_console_thread_pool(NULL, p_osm); + return rtnval; } + +/********************************************************************** + * thread pool primitive: clears and initializes all the threads. If the + * argument is a current thread, it will NOT be cleared (will be skipped) + **********************************************************************/ +int print_console_thread_pool(osm_console_thread_t* p_oct, osm_opensm_t *p_osm, CIO_t *out) +{ + // show whats in use, and whats available + + int i; + osm_console_thread_t* oct; + + char *t_string = ctime(&(ServerTime.tv_sec)); + t_string[strlen(t_string)-1]=0; + cio_printf(out, "OSM Server - Up since: %s, Users: %d, * = this connection\n", t_string, num_console_threads()); + + // (careful not to double lock .. 
num_console_threads() + cl_plock_acquire(&ThreadLock); + + for(i = 0; i < CIO_MAX_CONNECTS; ++i) + { + oct = &ConsoleThreadPool[i]; + if((oct) && (oct->used)) + { + if(p_oct == oct) + cio_printf(out, "*"); + else + cio_printf(out, " "); + cio_printf(out, "Thread: %s [%d]\n", oct->name, oct->thread_num); + cio_printf(out, " User: %s, (%s)\n", oct->client_hn, oct->client_ip); + t_string = ctime(&(oct->connect_time.tv_sec)); + t_string[strlen(t_string)-1]=0; + cio_printf(out, " Since: %s\n", t_string); + cio_printf(out, " Port: %d\n", oct->port); + cio_printf(out, " Socket: %d\n", oct->io.fd); + cio_printf(out, " State: %d\n", oct->state); + } + } + cl_plock_release(&ThreadLock); + return i; +} + +/* close and free up resources used by socket */ +static void osm_console_deinit_socket(osm_opensm_t *p_osm) +{ + if (p_osm->console.socket > 0) + { + osm_log(&(p_osm->log), OSM_LOG_INFO, + "osm_console: Closing the primary (listening) socket connection (%d)\n", p_osm->console.in_fd); + + close(p_osm->console.in_fd); + p_osm->console.in_fd = -1; + p_osm->console.out_fd = -1; + p_osm->console.in = NULL; + p_osm->console.out = NULL; + } +} + +/* do everything necessary to gracefully turn off the console */ +void osm_console_server_destroy(osm_opensm_t *p_osm) +{ + /* make sure consoles are closed before stopping the main listener socket */ + free_console_threads(p_osm); + + cl_plock_destroy(&ThreadLock); + + /* close the socket, listening for connections */ + osm_console_deinit_socket(p_osm); +} + +/* turns off the console, signature needs to match the parse_funciton() */ +static void quit_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t *out) +{ + // set the "done" flag used by the isDone() method + p_oct->authorized = 0; // temporarily use this as the done flag + + // do other necessary things to clean up and turn off +} + + /* more parse routines go here */ static const struct command console_cmds[] = { - {"help", &help_command, &help_parse}, - {"quit", &help_quit, 
&quit_parse}, - {"loglevel", &help_loglevel, &loglevel_parse}, - {"priority", &help_priority, &priority_parse}, - {"resweep", &help_resweep, &resweep_parse}, - {"status", &help_status, &status_parse}, - {"logflush", &help_logflush, &logflush_parse}, - {"querylid", &help_querylid, &querylid_parse}, - {"portstatus", &help_portstatus, &portstatus_parse}, - {"version", &help_version, &version_parse}, + { "help", &help_command, &help_parse}, + { OSM_QUIT_CMD, &help_quit, &quit_parse}, + { "loglevel", &help_loglevel, &loglevel_parse}, + { "priority", &help_priority, &priority_parse}, + { "resweep", &help_resweep, &resweep_parse}, + { "status", &help_status, &status_parse}, + { "logflush", &help_logflush, &logflush_parse}, + { "querylid", &help_querylid, &querylid_parse}, + { "portstatus", &help_portstatus, &portstatus_parse}, + { "version", &help_version, &version_parse}, #ifdef ENABLE_OSM_PERF_MGR {"perfmgr", &help_perfmgr, &perfmgr_parse}, #endif /* ENABLE_OSM_PERF_MGR */ {NULL, NULL, NULL} /* end of array */ }; -static void parse_cmd_line(char *line, osm_opensm_t * p_osm) -{ - char *p_cmd, *p_last; - int i, found = 0; - FILE *out = p_osm->console.out; - - while (isspace(*line)) - line++; - if (!*line) - return; - /* find first token which is the command */ - p_cmd = strtok_r(line, " \t\n\r", &p_last); - if (p_cmd) { - for (i = 0; console_cmds[i].name; i++) { - if (loop_command.on) { - if (!strcmp(p_cmd, "q")) { - loop_command.on = 0; - } - found = 1; - break; - } - if (!strcmp(p_cmd, console_cmds[i].name)) { - found = 1; - console_cmds[i].parse_function(&p_last, p_osm, - out); - break; - } - } - if (!found) { - fprintf(out, "%s : Command not found\n\n", p_cmd); - help_command(out, 0); - } - } else { - fprintf(out, "Error parsing command line: `%s'\n", line); - } - if (loop_command.on) { - fprintf(out, "use \"q\" to quit loop\n"); - fflush(out); - } +static void parse_cmd_line(char *line, osm_console_thread_t *oct) +{ + char *p_cmd, *p_last; + int i, found = 0; + CIO_t 
*out = getCIO(oct); + + while (isspace(*line)) + line++; + if (!*line) + return; + + /* find first token which is the command */ + p_cmd = strtok_r(line, " \t\n\r", &p_last); + if (p_cmd) { + for (i = 0; console_cmds[i].name; i++) { + if (oct->loop_command.on ) { + if (!strcmp(p_cmd, "q")) { + oct->loop_command.on = 0; + } + found = 1; + break; + } + if (!strcmp(p_cmd, console_cmds[i].name)) { + found = 1; + console_cmds[i].parse_function(&p_last, oct, out); + break; + } + } + if (!found) { + cio_printf(out, "%s : Command not found\n\n", p_cmd); + help_command(out, 0); + } + } else { + cio_printf(out, "Error parsing command line: `%s'\n", line); + } } -void osm_console_prompt(FILE * out) +void osm_console_prompt(CIO_t *out, int loop_prompt) { if (out) { - fprintf(out, "OpenSM %s", OSM_COMMAND_PROMPT); - fflush(out); + if(loop_prompt) + cio_printf(out, "use \"q\" to quit loop\n"); + else + cio_printf(out, "OpenSM %s", OSM_COMMAND_PROMPT); + cio_flush(out); } } -void osm_console_init(osm_subn_opt_t * opt, osm_opensm_t * p_osm) +/* open and setup socket connection */ +static void osm_console_init_socket(osm_opensm_t *p_osm, uint16_t console_port, char* console_type) { - p_osm->console.socket = -1; - /* set up the file descriptors for the console */ - if (strcmp(opt->console, OSM_LOCAL_CONSOLE) == 0) { - p_osm->console.in = stdin; - p_osm->console.out = stdout; - p_osm->console.in_fd = fileno(stdin); - p_osm->console.out_fd = fileno(stdout); - - osm_console_prompt(p_osm->console.out); #ifdef ENABLE_OSM_CONSOLE_SOCKET - } else if (strcmp(opt->console, OSM_REMOTE_CONSOLE) == 0 - || strcmp(opt->console, OSM_LOOPBACK_CONSOLE) == 0) { - struct sockaddr_in sin; - int optval = 1; - - if ((p_osm->console.socket = - socket(AF_INET, SOCK_STREAM, 0)) < 0) { - osm_log(&(p_osm->log), OSM_LOG_ERROR, - "osm_console_init: ERR 4B01: Failed to open console socket: %s\n", - strerror(errno)); - return; - } - setsockopt(p_osm->console.socket, SOL_SOCKET, SO_REUSEADDR, - &optval, 
sizeof(optval)); - sin.sin_family = AF_INET; - sin.sin_port = htons(opt->console_port); - if (strcmp(opt->console, OSM_REMOTE_CONSOLE) == 0) - sin.sin_addr.s_addr = htonl(INADDR_ANY); - else - sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - if (bind(p_osm->console.socket, &sin, sizeof(sin)) < 0) { - osm_log(&(p_osm->log), OSM_LOG_ERROR, - "osm_console_init: ERR 4B02: Failed to bind console socket: %s\n", - strerror(errno)); - return; - } - if (listen(p_osm->console.socket, 1) < 0) { - osm_log(&(p_osm->log), OSM_LOG_ERROR, - "osm_console_init: ERR 4B03: Failed to listen on socket: %s\n", - strerror(errno)); - return; - } - signal(SIGPIPE, SIG_IGN); /* protect ourselves from closed pipes */ - p_osm->console.in = NULL; - p_osm->console.out = NULL; - p_osm->console.in_fd = -1; - p_osm->console.out_fd = -1; - osm_log(&(p_osm->log), OSM_LOG_INFO, - "osm_console_init: Console listening on port %d\n", - opt->console_port); + struct sockaddr_in sin; + int optval = 1; + + osm_log(&(p_osm->log), OSM_LOG_INFO, "osm_console_init_socket: Initializing the socket: %d\n", console_port); + + if ((p_osm->console.socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) + { + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init_socket: ERR 4B01: Failed to open console socket: %s\n", strerror(errno)); + return; + } + setsockopt(p_osm->console.socket, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval)); + sin.sin_family = AF_INET; + sin.sin_port = htons(console_port); + + // loopback or ... 
+ if(is_loopback(console_type)) + sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK); + else + sin.sin_addr.s_addr = htonl(INADDR_ANY); + if (bind(p_osm->console.socket, &sin, sizeof(sin))< 0) + { + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init_socket: ERR 4B02: Failed to bind console socket: %s\n", strerror(errno)); + return; + } + if (listen(p_osm->console.socket, 2)< 0) + { + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init_socket: ERR 4B03: Failed to listen on socket: %s\n", strerror(errno)); + return; + } + + signal(SIGPIPE, SIG_IGN); /* protect ourselves from closed pipes */ + p_osm->console.in = NULL; + p_osm->console.out = NULL; + p_osm->console.in_fd = -1; + p_osm->console.out_fd = -1; + osm_log(&(p_osm->log), OSM_LOG_INFO, "osm_console_init_socket: Console listening on port %d\n", console_port); #endif - } } -#ifdef ENABLE_OSM_CONSOLE_SOCKET -static void handle_osm_connection(osm_opensm_t * p_osm, int new_fd, - char *client_ip, char *client_hn) +/********************************************************************** + * thread pool primitive: gets the next available thread structure from + * the pool. + * + * refer to free_console_thread() + **********************************************************************/ +osm_console_thread_t* new_console_thread(void) { - char *p_line; - size_t len; - ssize_t n; - - if (p_osm->console.in_fd >= 0) { - FILE *file = fdopen(new_fd, "w+"); - - fprintf(file, "OpenSM Console connection already in use\n" - " kill other session (y/n)? 
"); - fflush(file); - p_line = NULL; - n = getline(&p_line, &len, file); - if (n > 0 && (p_line[0] == 'y' || p_line[0] == 'Y')) { - osm_console_close_socket(p_osm); - } else { - close(new_fd); - return; - } - } - p_osm->console.in_fd = new_fd; - p_osm->console.out_fd = p_osm->console.in_fd; - p_osm->console.in = fdopen(p_osm->console.in_fd, "w+"); - p_osm->console.out = p_osm->console.in; - osm_console_prompt(p_osm->console.out); - osm_log(&(p_osm->log), OSM_LOG_INFO, - "osm_console_init: Console connection accepted: %s (%s)\n", - client_hn, client_ip); + // return the next available thread from the pool + // just iterate through.. + + int i; + osm_console_thread_t* next = NULL; + + cl_plock_acquire(&ThreadLock); + for(i = 0; i < CIO_MAX_CONNECTS; ++i) + { + next = &ConsoleThreadPool[i]; + if(next->used == 0) + break; + } + + if(i >= CIO_MAX_CONNECTS) + next = NULL; // full + else + { + // immediately mark this as NOT available + next->used = 1; + next->thread_num = ++cio_thread_counter; + gettimeofday(&(next->connect_time), NULL); + } + cl_plock_release(&ThreadLock); + + return next; } -static int connection_ok(char *client_ip, char *client_hn) +/********************************************************************** + * thread pool primitive: clears and initializes all the threads. 
If the + * argument is a current thread, it will NOT be cleared (will be skipped) + **********************************************************************/ +int init_console_thread_pool(osm_console_thread_t* p_oct, osm_subn_opt_t *opt, osm_opensm_t *p_osm) { - return (hosts_ctl - (OSM_DAEMON_NAME, client_hn, client_ip, "STRING_UNKNOWN")); + // initialize + + int i; + osm_console_thread_t* oct; + + cl_plock_acquire(&ThreadLock); + for(i = 0; i < CIO_MAX_CONNECTS; ++i) + { + oct = &ConsoleThreadPool[i]; + if(p_oct == NULL || p_oct != oct) + { + oct->used = 0; + oct->thread_num = -1; + oct->authorized = 0; + oct->port = CIO_CONNECTION_PORT; + oct->io.fd = -1; + oct->state = 0; + oct->p_osm = p_osm; + if(opt != NULL) + { + oct->port = opt->console_port; + strncpy(oct->name, opt->console, CIO_INFO_SIZE); + } + } + } + cl_plock_release(&ThreadLock); + return i; } -#endif -void osm_console(osm_opensm_t * p_osm) +void osm_console_server_init(osm_subn_opt_t *opt, osm_opensm_t *p_osm) { - struct pollfd pollfd[2]; - char *p_line; - size_t len; - ssize_t n; - struct pollfd *fds; - nfds_t nfds; - - pollfd[0].fd = p_osm->console.socket; - pollfd[0].events = POLLIN; - pollfd[0].revents = 0; - - pollfd[1].fd = p_osm->console.in_fd; - pollfd[1].events = POLLIN; - pollfd[1].revents = 0; - - fds = p_osm->console.socket < 0 ? &pollfd[1] : pollfd; - nfds = p_osm->console.socket < 0 || pollfd[1].fd < 0 ? 
1 : 2; - - if (loop_command.on && loop_command_check_time() && - loop_command.loop_function) { - if (p_osm->console.out) { - loop_command.loop_function(p_osm, p_osm->console.out); - fflush(p_osm->console.out); - } else { - loop_command.on = 0; - } - } + int status = 0; + + cl_plock_construct(&ThreadLock); + status = cl_plock_init(&ThreadLock); + if (status != IB_SUCCESS) + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_server_init: lock initialization error\n"); + + init_console_thread_pool(NULL, opt, p_osm); + + gettimeofday(&ServerTime, NULL); // start time + + p_osm->console.socket = -1; + + /* set up the file descriptors for the console */ + if (strcmp(opt->console, OSM_LOCAL_CONSOLE)== 0) + { + p_osm->console.in = stdin; + p_osm->console.out = stdout; + p_osm->console.in_fd = fileno(stdin); + p_osm->console.out_fd = fileno(stdout); + } + else if (is_remote(opt->console)) + { + osm_console_init_socket(p_osm, opt->console_port, opt->console); + } + // TODO - other types of "console" connections here +} - if (poll(fds, nfds, 1000) <= 0) - return; +/********************************************************************** + * Main Loop Thread. 
+ * + * Continuously loop on this command until turned off + **********************************************************************/ +void osm_loop_thread(void *p_ptr) +{ + osm_console_thread_t *oct = ( osm_console_thread_t * ) p_ptr; + CIO_t *p_io = getCIO(oct); + + oct->loop_command.running = 1; + while (oct->loop_command.on && oct->loop_command.loop_function) + { + if (p_io->out) + { + // dwell here + cl_thread_suspend(oct->loop_command.delay_s * 1000); + oct->loop_command.loop_function(oct->p_osm, p_io); + + // send the cmd prompt + osm_console_prompt(p_io, oct->loop_command.on); + cio_flush(p_io); + } + else + { + oct->loop_command.on = 0; + } + } + oct->loop_command.running = 0; + return; +} +/********************************************************************** + * Do authentication & authorization check + **********************************************************************/ +static int is_authorized(osm_console_thread_t *p_oct) +{ #ifdef ENABLE_OSM_CONSOLE_SOCKET - if (pollfd[0].revents & POLLIN) { - int new_fd = 0; - struct sockaddr_in sin; - socklen_t len = sizeof(sin); - char client_ip[64]; - char client_hn[128]; - struct hostent *hent; - if ((new_fd = accept(p_osm->console.socket, &sin, &len)) < 0) { - osm_log(&(p_osm->log), OSM_LOG_ERROR, - "osm_console: ERR 4B04: Failed to accept console socket: %s\n", - strerror(errno)); - p_osm->console.in_fd = -1; - return; - } - if (inet_ntop - (AF_INET, &sin.sin_addr, client_ip, - sizeof(client_ip)) == NULL) { - snprintf(client_ip, 64, "STRING_UNKNOWN"); - } - if ((hent = gethostbyaddr((const char *)&sin.sin_addr, - sizeof(struct in_addr), - AF_INET)) == NULL) { - snprintf(client_hn, 128, "STRING_UNKNOWN"); - } else { - snprintf(client_hn, 128, "%s", hent->h_name); - } - if (connection_ok(client_ip, client_hn)) { - handle_osm_connection(p_osm, new_fd, client_ip, - client_hn); - } else { - osm_log(&(p_osm->log), OSM_LOG_ERROR, - "osm_console: ERR 4B05: Console connection denied: %s (%s)\n", - client_hn, 
client_ip); - close(new_fd); - } - return; - } + //// oct->authorized = pam_authorize(pTs); + p_oct->authorized = !is_remote(p_oct->client_type) || + hosts_ctl(OSM_DAEMON_NAME, p_oct->client_hn, p_oct->client_ip, "STRING_UNKNOWN"); +#else + p_oct->authorized = 1; #endif + return p_oct->authorized; +} - if (pollfd[1].revents & POLLIN) { - p_line = NULL; - /* Get input line */ - n = getline(&p_line, &len, p_osm->console.in); - if (n > 0) { - /* Parse and act on input */ - parse_cmd_line(p_line, p_osm); - if (!loop_command.on) { - osm_console_prompt(p_osm->console.out); - } - } else - osm_console_close_socket(p_osm); - if (p_line) - free(p_line); - } +/* + * determine if the connection should be closed + */ +static int is_done(osm_console_thread_t *oct) +{ + int done = 0; // set to 1 when finished + + /* Look for a condition that signals the connection should be closed */ + if (!(oct->authorized) || !strcmp(oct->in_buff, OSM_QUIT_CMD) || osm_exit_flag) + { + done = 1; + } + return (done); +} + +/* + * handle basic output to the client + * + * this includes results from a command, error information + * or any appropriate feedback + */ +static int output(osm_console_thread_t *oct) +{ + CIO_t *out = getCIO(oct); + + // send the output buffer to the client + cio_printf(out, oct->out_buff); + cio_flush(out); + + // clear the output buffer?? 
+ oct->out_buff[0] = 0; + + // send the cmd prompt + if(!oct->loop_command.on) + osm_console_prompt(out, 0); + + return (is_done(oct)); +} + +/* + * handle basic input from the socket + */ +static int input(osm_console_thread_t *oct) +{ + char *p_line = NULL; + size_t len; + ssize_t n; + CIO_t *p_io = getCIO(oct); + + // if we are in a loop command, the don't block + if(oct->loop_command.on && !cio_poll(p_io, 1000)) + return 0; + + /* Get input line */ + n = cio_getline(&p_line, &len, p_io); + if (n > 0) + { + // got something, so copy it to the input buffer + sprintf(oct->in_buff, "%s", p_line); + + if(p_line) + free(p_line); + } + + return (0); +} + +/* + * process the command in the input buffer - + * take action, produce results, copy to output buffer + */ +static int commands(osm_console_thread_t *oct) +{ + osm_opensm_t *p_osm = oct->p_osm; + + ib_api_status_t status = IB_INSUFFICIENT_RESOURCES; + + parse_cmd_line(oct->in_buff, oct); + + /* if parsed and executed then clear the input buffer + */ + oct->in_buff[0] = 0; + + /* special case, only allow one loop command + */ + if(!oct->loop_command.running && oct->loop_command.on && oct->loop_command.loop_function) + { + status = cl_thread_init(&oct->loop_command.loopThread, osm_loop_thread, oct, "Loop command"); + if (status != IB_SUCCESS) + { + // something bad + osm_log(&(p_osm->log), OSM_LOG_ERROR, + "commands: Couldn't create a thread for the loop command!\n"); + return -1; + } + } + return (0); +} + +/********************************************************************** + * Initialization and configuration of the console connection. + * (security & authorization, plus some bookkeeping) + * + * returns 1 if okay + * 0 if not authorized + * -1 if too many connections + * -2 if error?? 
+ **********************************************************************/ +int osm_console_init(osm_console_thread_t *p_oct) +{ + // the first opportunity to do thread specific actions + + int status = 0; // not authorized + int max_connects_exceeded = (num_console_threads() >= CIO_MAX_CONNECTS); + + osm_opensm_t *p_osm = p_oct->p_osm; + CIO_t *p_io = getCIO(p_oct); + + // check for authorization + if(is_authorized(p_oct)) + { + // check for available connections (too many?) + if (!max_connects_exceeded) + { + cio_open(p_io); + + osm_log(&(p_osm->log), OSM_LOG_INFO, "osm_console_init: Console connection accepted: %s (%s) s#%d\n", p_oct->client_hn, + p_oct->client_ip, p_io->fd); + status = 1; + } + else + { + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init: ERR 4B06: No available connections: %s (%s) t#%d\n", p_oct->client_hn, + p_oct->client_ip, num_console_threads()); + status = -1; + } + } + else + { + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init: ERR 4B05: Console connection denied: %s (%s)\n", p_oct->client_hn, + p_oct->client_ip); + status = 0; + } + + fflush(p_osm->log.out_port); + return status; +} + +/********************************************************************** + * The console I/O and command loop + * refer to: osm_console_init and osm_console_destroy + **********************************************************************/ +void osm_console(osm_console_thread_t *oct) +{ + cl_thread_suspend(100); // wait for other threads to initialize + + // provide feedback from the server (probably from a previous command) + while(!output(oct)) + { + // read the socket + input(oct); + + // process or act on the input + commands(oct); + } + // final methods?? } + +/********************************************************************** + * Main Console Thread. + * + * Finish setting up the connection ( secure & authorized) and misc config + * + * Loop continuously in the osm_console method. 
+ * + * Clean up, and gracefully exit when done + **********************************************************************/ +void osm_console_thread(void *p_ptr) +{ + osm_console_thread_t *p_oct = ( osm_console_thread_t * ) p_ptr; + + /* Finish setting up the connection (secure & authorized) and misc config */ + if(osm_console_init(p_oct) == 1) + { + // do all i/o and commands until done + osm_console(p_oct); + + // done, so close down the console gracefully + osm_console_destroy(p_oct); + } + + // nothing left to do but destroy our own thread, return to pool + osm_console_thread_destroy(p_oct); + return; +} + +/* Prepare to launch the console by encapsulating all the necessary data in a thread + * safe data structure. + * + * Support for single (local) or multiple (socket) threads. + * + * initialize the console data structure for a thread, and then.. + * if socket + * create the thread + * else + * run inline + * + * refer to: osm_console_thread and osm_console_thread_destroy + * + */ +int osm_console_thread_init(int socket, struct sockaddr_in *sin, osm_subn_opt_t *p_opt, osm_opensm_t *p_osm) +{ + static int n_local = 0; + osm_console_thread_t *oct; // see free_console_thread() !! + ib_api_status_t status = IB_INSUFFICIENT_RESOURCES; + + // have we used up all available connections? + if ((!is_remote(p_opt->console) && n_local) || ((oct = new_console_thread())== NULL)) + { + if(n_local) + cl_thread_suspend( 100000); // denied, dwell here before trying again. + else + osm_log(&(p_osm->log), OSM_LOG_ERROR, + "osm_console_thread_init: Maximum number of connections exceeded, connection denied (%d)\n", + num_console_threads()); + return status; + } + + if(!is_remote(p_opt->console)) + n_local++; // only one local connection... 
+ + /* fill in the osm_console_thread_t structure (can't be NULL) */ + oct->authorized = 0; + oct->state = 0; + oct->p_osm = p_osm; + oct->io.fd = socket; + oct->port = p_opt->console_port; + snprintf(oct->client_type, CIO_NOTE_SIZE, p_opt->console); + +#ifdef ENABLE_OSM_CONSOLE_SOCKET + /* get then name and ip of the client (console connection) */ + if(is_remote(oct->client_type)) + { + /* get the clients ip address */ + if (inet_ntop(AF_INET, &sin->sin_addr, oct->client_ip, sizeof(oct->client_ip))== NULL) + { + snprintf(oct->client_ip, CIO_NOTE_SIZE, "STRING_UNKNOWN"); + } + + /* get the clients host name */ + struct hostent *hent; + if ((hent = gethostbyaddr((const char *)&sin->sin_addr, sizeof(struct in_addr), AF_INET)) == NULL) + { + snprintf(oct->client_hn, CIO_INFO_SIZE, "STRING_UNKNOWN"); + } + else + { + snprintf(oct->client_hn, CIO_INFO_SIZE, "%s", hent->h_name); + } + } + else +#endif + { + if(gethostname(oct->client_hn, CIO_INFO_SIZE)) + { + snprintf(oct->client_hn, CIO_INFO_SIZE, "localhost"); + snprintf(oct->client_ip, CIO_NOTE_SIZE, "localhost"); + } + else + snprintf(oct->client_ip, CIO_NOTE_SIZE, oct->client_hn); + } + + + // create a name for the thread, based on the connection + snprintf(oct->name, CIO_INFO_SIZE, "%s %d", OSM_CONSOLE_NAME, oct->io.fd); + + // ***** Finally, create a new thread for this connection ****** + status = cl_thread_init(&oct->consoleThread, osm_console_thread, oct, oct->name); + if (status != IB_SUCCESS) + { + // something bad + osm_log(&(p_osm->log), OSM_LOG_ERROR, + "osm_console_thread_init: Couldn't create a thread for the socket!\n"); + + // free up the thread, wasn't actually used + osm_console_thread_destroy(oct); + return -1; + } + return 0; +} + + +/* Multi-threaded service to handle zero or more osm_consoles + * + * Typically the OSM runs as a daemon process, with zero + * consoles. Occationally it is necessary to remotely connect + * to the OSM through a console connection. 
+ *
+ * Allow one Master remote console and many Slaves.
+ *
+ * Provide a mechanism to release and assume Master role.
+ *
+ */
+int osm_console_server(osm_subn_opt_t *p_opt, osm_opensm_t *p_osm)
+{
+    struct sockaddr_in sin;
+    int s = 0;
+
+    /* don't enter this code section, if the exit flag is true */
+    if (!osm_exit_flag)
+    {
+        // handle IO from local or remote console
+        // blocks here until a client tries to connect
+
+        /*
+         * this version is supposed to block
+         *
+         * the block is released when a connection occurs, which causes a new
+         * thread to be spawned to handle the connection. The new thread cleans
+         * up after itself.
+         *
+         * return only happens after a successful connection has been established,
+         * and needs to be prepared for another connection.
+         */
+#ifdef ENABLE_OSM_CONSOLE_SOCKET
+        socklen_t len = sizeof(sin);
+        // accept() takes a generic struct sockaddr *, so cast the IPv4 address
+        if (is_remote(p_opt->console) && ((s = accept(p_osm->console.socket, (struct sockaddr *)&sin, &len)) < 0))
+        {
+            // kill sig can cause this... which would be normal during a shutdown
+            osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_server: not accepting socket connections\n");
+            return -1;
+        }
+        else
+#endif
+            // create a thread to handle the i/o on this connection
+            osm_console_thread_init(s, &sin, p_opt, p_osm);
+    }
+    else
+        free_console_threads(p_osm); // clean up
+    return s;
+}
+
+
+/**********************************************************************
+ * Function Name:
+ *    cio_vprintf
+ *
+ * This routine formats a message and uses a Stream IO abstraction to determine
+ * how and where to write the message out (stdout, socket, ssl, etc.)
+ *
+ * Side Effects:
+ *    Output longer than CIO_BUFSIZE - 1 characters is truncated.
+ *
+ *    cio     pointer to the Connection IO data structure - an IO Stream abstraction
+ *
+ *    format  A string literal that describes the desired text and formatting. See printf().
+ *
+ *    args    A variable argument list, of the type available between a va_start() and
+ *            va_end() block.
+ *
+ * Always returns 0
+ ******************************************************************************/
+
+ int cio_vprintf( CIO_t *cio, const char *format, va_list args)
+ {
+     char msg_buffer[CIO_BUFSIZE];
+
+     // create the formatted string and place it in the local string buffer
+     // (bounded, so a long message cannot overflow the stack buffer)
+     vsnprintf(msg_buffer, sizeof(msg_buffer), format, args);
+
+     // send it out the proper I/O channel; fputs() so that any '%' in the
+     // already formatted text is not interpreted as a format specifier
+     fputs(msg_buffer, cio->out);
+
+     return 0;
+ }
+
+/******************************************************************************
+ * Function Name:
+ *    cio_printf
+ *
+ * This is an abstract form of the standard fprintf() routine. It can be used
+ * in an identical manner, with the exception of the first argument that needs
+ * to be the Connection IO abstraction, rather than a FILE.
+ *
+ * Side Effects:
+ *    Output longer than CIO_BUFSIZE - 1 characters is truncated.
+ *
+ *    cio     pointer to the Connection IO data structure - an IO Stream abstraction
+ *
+ *    format  A string literal that describes the desired text and formatting. See printf().
+ *
+ *    args    A variable argument list, of the type available between a va_start() and
+ *            va_end() block.
+ *
+ * Always returns 0, from cio_vprintf()
+ ******************************************************************************/
+
+ int cio_printf( CIO_t *cio, const char *format, ...)
+ {
+     int returnval = 0;
+     va_list args;
+
+     // Sink Filter or Message Filter. Does it get printed??
+     if(1)
+     {
+         va_start(args, format);
+         returnval = cio_vprintf(cio, format, args);
+         va_end(args);
+     }
+     return returnval;
+ }
+
+ int cio_flush( CIO_t *cio)
+ {
+     int returnval = fflush(cio->out);
+
+     return returnval;
+ }
+
+ int cio_getline( char **lineptr, size_t *n, CIO_t *cio)
+ {
+     int returnval = getline(lineptr, n, cio->in);
+
+     return returnval;
+ }
+
+ int cio_open( CIO_t *cio)
+ {
+     // returns zero, if opened fine, -1 otherwise
+
+     struct pollfd *pd = (struct pollfd* )malloc(sizeof(struct pollfd));
+     if (pd == NULL)
+         return -1; // should not happen
+
+     cio->in = fdopen(cio->fd, "w+");
+     cio->out = cio->in;
+     cio->err = cio->in;
+
+     cio->pfd = pd;
+     cio->pfd[0].fd = cio->fd;
+     cio->pfd[0].events = POLLIN;
+     cio->pfd[0].revents = 0;
+
+     return (cio->in == NULL) ? -1 : 0;
+ }
+
+ int cio_close( CIO_t *cio)
+ {
+     int rtnval = -1;
+     if(cio == NULL)
+         return rtnval; // nothing to close, and cio must not be dereferenced
+     if(cio->fd > 0)
+     {
+         free(cio->pfd);
+         cio->pfd = NULL; // avoid a dangling pointer after the free
+         rtnval = close(cio->fd);
+     }
+     cio->fd = 0;
+     return rtnval;
+ }
+
+ /* return true if input available */
+ int cio_poll(CIO_t *cio, int timeout)
+ {
+     // if timeout is less than 1, return true, alw
+     if(timeout < 1)
+         return 1;
+     return (poll(cio->pfd, 1, timeout) > 0);
+ }
--
1.5.1

========

--
Timothy A. Meier
Computer Scientist
ICCD/High Performance Computing
925.422.3341
meier3 at llnl.gov
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 0001-opensm-osm_console-modified-console-framework-to.patch
URL:

From transter at gmail.com Mon Oct 15 16:54:57 2007
From: transter at gmail.com (lbt)
Date: Mon, 15 Oct 2007 16:54:57 -0700
Subject: [ofa-general] Missing IB_EVENT_PATH_MIG events
Message-ID:

Hi,

I'm trying out APM with OFED 1.2, using a Mellanox dual-port HCA (ib_mthca driver). When I have several RC QPs that I am trying to migrate (software-triggered migration using ib_modify_qp), I've noticed that sometimes 1 or 2 of the remote QPs never generate an IB_EVENT_PATH_MIG or even an IB_EVENT_PATH_MIG_ERR ...
it seems that it just gets lost.

I looked through some of the ib_mthca patches in git.kernel.org/?p=linux/kernel/git/roland/infiniband.git, and incorporated the mmiowb patch for ib_mthca commands (http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=76d7cc0345a037e8eea426f8abc710abd22946dd). But I am still seeing the same issue.

I have a test case that repeats software-triggered migrations + rearming in a loop, and this problem usually occurs in the first few cycles, but is not too frequent.

If anyone has any ideas on what might be wrong, or tips on where to look or what to do to debug this, that would be very much appreciated!

For example, this is the console output I will see (printed out by our rcqp event handler):

On the local end - initiates software-triggered migration, using ib_modify_qp:
Event IB_EVENT_PATH_MIG occurred on QP#1043
Event IB_EVENT_PATH_MIG occurred on QP#1040
Event IB_EVENT_PATH_MIG occurred on QP#1033

On the remote end:
Event IB_EVENT_PATH_MIG occurred on QP#1040
Event IB_EVENT_PATH_MIG occurred on QP#1043

Thanks so much for any pointers!
Lan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sweitzen at cisco.com Mon Oct 15 19:08:18 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 15 Oct 2007 19:08:18 -0700
Subject: [ofa-general] OFED 1.3 Alpha release is available
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com>
Message-ID:

I have added version 1.3alpha2 to bugzilla.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren
> Sent: Monday, October 15, 2007 7:31 AM
> To: ewg at lists.openfabrics.org
> Cc: general at lists.openfabrics.org
> Subject: [ofa-general] OFED 1.3 Alpha release is available
>
> Hi,
>
> OFED 1.3 Alpha release is available on
> http://www.openfabrics.org/builds/ofed-1.3/release/
> File: OFED-1.3-alpha2.tgz
> To get BUILD_ID run ofed_info
>
> Please report any issues in bugzilla https://bugs.openfabrics.org/
>
> The beta release is expected on 29 October
>
> Tziporet & Vlad
>
> ========================================================================
>
> Release information:
> --------------------
> OS support:
>   Novell:
>     - SLES10
>     - SLES10 SP1
>   Redhat:
>     - Redhat EL4 up4 and up5
>     - Redhat EL5
>   kernel.org:
>     - 2.6.23
>
> Note: Fedora C6 and Open SUSE 10.2 and Redhat EL4 up3 are not part of the
> official list. We keep the backport patches for these OSes and make sure
> OFED compiles and loads properly, but will not do a full QA cycle.
>
> Systems:
>   * x86_64
>   * x86
>   * ia64
>   * ppc64*
>
> *Note: On PPC64 installation fails on the packages: ibutils, mvapich2,
> MPI tests over Open MPI.
>
>
> Main Changes from OFED 1.2.5
> ============================
> 1. General changes
>   o Kernel code based on 2.6.23
>   o Quality of Service support in OpenSM, CMA, IPoIB, SRP
>   o Added Neteffect driver (nes)
>
> 2. Package and install
>   o There is a new install script. See OFED_Installation_Guide.txt for
>     more details on the new installation and build procedures.
>     Note: There is an easy way to install in one command line
>     without a conf file, and without the interactive mode.
> Example: ./install.pl --all --prefix /usr/local > o User space packages are now in different source RPMs (as > opposed to > one source RPM in previous OFED releases). > o The option for a build without installing is not supported any > more. > o Added an option to generate a tarball with kernel sources for each > kernel. > > 3. IPoIB > o Stateless offloads > o IGMP for user-space multicast IB > o NAPI is enabled by default > o High availability is supported via the bonding module > only (removed > ipoib tool scripts) > > 4. SDP - these are not yet in the alpha release > o Keep-alive > o Async I/O > o Send Zero Copy > > 5. iSER > o ??? > > 6. qlgc_vnic > o Update for PathScale HCA > > 7. RDS > o RDMA API (using FMRs) - work in progress > > 8. uDAPL - these are not yet in the alpha release > o Add DAT 2.0 API run-time library and development support. > uDAPL 2.0 will include IB extensions for IB RDMA write with > immediate > data and IB atomic operations. > o Both uDAPL 1.2 and 2.0 packages will be provided and > will co-exist > > 9. Libraries > a. libibverbs 1.1.1 > o Added Extended RC transport type > b. librdmacm (uCMA) 1.0.3 > > 10. OSM > o More routing performance improvements > o Even more speedups > o Better packaging/installation > o "Native" daemon mode > o Performance management > o Quality of Service manager: Based on IBTA annex > > 11. Management > o Multiple partitions > > 12. MPI: > a. OSU MVAPICH > o Version is 0.9.9 - same as in 1.2.5 - to be replaced later > b. Open MPI > o Version is 1.2.2-1 - same as in 1.2.5 - to be replaced later > c. OSU MVAPICH2 > o Version was updated to 1.0-1. > > > > Tasks that should be completed for the beta release: > ---------------------------------------------------- > 1. Integrate all SDP features > 2. Complete RDS work > 3. Apply patches that fix warnings in backport patches > 4. Fix compilation problems on PPC > 5. Add qperf test from Qlogic > 6. Rebase kernel code on 2.6.24 rc1 (depending on its availability) > 7.
Support RHEL 5 up1 > 8. SPEC files should be part of each user space package From sweitzen at cisco.com Mon Oct 15 19:46:21 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 15 Oct 2007 19:46:21 -0700 Subject: [ofa-general] OFA_KERNEL_PARAMS is missing from OFED 1.3 install.pl In-Reply-To: References: Message-ID: Vlad, I opened bug 740 for this, can you please fix? Scott ________________________________ From: Scott Weitzenkamp (sweitzen) Sent: Sunday, October 14, 2007 11:10 PM To: Scott Weitzenkamp (sweitzen); OpenFabricsEWG; Vladimir Sokolovsky Cc: general at lists.openfabrics.org Subject: RE: [ofa-general] OFA_KERNEL_PARAMS is missing from OFED 1.3 install.pl I also don't see a way to use K_VER to compile for a kernel other than the currently booted kernel, like I could in 1.2.5 and earlier. Scott ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Scott Weitzenkamp (sweitzen) Sent: Sunday, October 14, 2007 10:57 PM To: OpenFabricsEWG; Vladimir Sokolovsky Cc: general at lists.openfabrics.org Subject: [ofa-general] OFA_KERNEL_PARAMS is missing from OFED 1.3 install.pl Vlad, I don't see a way to configure OFED 1.3 during installation with OFA_KERNEL_PARAMS like I could in 1.2.5 and earlier. I am specifically looking for the params --without-modprobe, --without-ipoibconf, and --with-madeye-mod. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems
From rdreier at cisco.com Mon Oct 15 19:47:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 15 Oct 2007 19:47:49 -0700 Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes In-Reply-To: <20070726014931.GL10235@sgi.com> (akepner@sgi.com's message of "Wed, 25 Jul 2007 18:49:31 -0700") References: <20070726014931.GL10235@sgi.com> Message-ID: Sorry for taking so long on this. Anyway, here is what I am testing now and planning on merging assuming no bugs turn up. It seems to generate better code on every architecture I tried (32- and 64-bit x86, 32-bit powerpc and ia64). On ia64, the .text for ib_mthca.ko shrinks by almost 1500 bytes! I don't know how to provoke the unaligned traps on ia64, so I'm not positive this will fix the issue, but the compiler should be able to see what's going on so I'm assuming it works. Confirmation of this and/or review would be appreciated. Thanks, Roland >From 81e3286b6f7905ec4bb3ca61107f8c8800c40e9f Mon Sep 17 00:00:00 2001 From: Roland Dreier Date: Sun, 14 Oct 2007 20:40:27 -0700 Subject: [PATCH] IB/mthca: Avoid alignment traps when writing doorbells Architectures such as ia64 see alignment traps when doing a 64-bit read from __be32 doorbell[2] arrays to do doorbell writes in mthca_write64(). Fix this by just passing the two halves of the doorbell value into mthca_write64(). This actually improves the generated code by allowing the compiler to see what's going on better.
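A minimal user-space sketch of the idea described above, not the driver code itself: building the 64-bit doorbell from two 32-bit halves by value and converting it to big-endian once should give the same eight bytes as converting each half separately into a two-element array, without ever reading a 4-byte-aligned array through a 64-bit pointer. The helpers put_be32() and put_be64() are hypothetical stand-ins for the kernel's cpu_to_be32() and cpu_to_be64(), and the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for cpu_to_be32(): store v as 4 big-endian bytes. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Hypothetical stand-in for cpu_to_be64(): store v as 8 big-endian bytes. */
static void put_be64(uint8_t *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Old shape: two 32-bit halves converted separately and kept in an array. */
static void doorbell_from_array(uint8_t out[8], uint32_t hi, uint32_t lo)
{
	put_be32(out, hi);
	put_be32(out + 4, lo);
}

/* New shape: combine hi and lo by value, then convert the 64-bit result. */
static void doorbell_from_halves(uint8_t out[8], uint32_t hi, uint32_t lo)
{
	put_be64(out, (uint64_t)hi << 32 | lo);
}

/* Returns 1 when both constructions produce the same 8 bytes. */
static int doorbell_layouts_match(uint32_t hi, uint32_t lo)
{
	uint8_t a[8], b[8];

	doorbell_from_array(a, hi, lo);
	doorbell_from_halves(b, hi, lo);
	return memcmp(a, b, sizeof(a)) == 0;
}

/* Small self-check over a couple of arbitrary example values. */
static void doorbell_selftest(void)
{
	assert(doorbell_layouts_match(0x12345678u, 0x9abcdef0u));
	assert(doorbell_layouts_match(0u, 0xffffffffu));
}
```

Calling doorbell_selftest() from a main() is enough to exercise it. The sketch only checks the byte layout; it does not model MMIO writes, locking or the alignment trap itself.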
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cq.c | 48 +++++++++---------------- drivers/infiniband/hw/mthca/mthca_doorbell.h | 13 ++++--- drivers/infiniband/hw/mthca/mthca_eq.c | 21 ++--------- drivers/infiniband/hw/mthca/mthca_qp.c | 49 +++++++++----------------- drivers/infiniband/hw/mthca/mthca_srq.c | 11 +----- 5 files changed, 48 insertions(+), 94 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index be6e1e0..f6ebed0 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -204,16 +204,11 @@ static void dump_cqe(struct mthca_dev *dev, void *cqe_ptr) static inline void update_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, int incr) { - __be32 doorbell[2]; - if (mthca_is_memfree(dev)) { *cq->set_ci_db = cpu_to_be32(cq->cons_index); wmb(); } else { - doorbell[0] = cpu_to_be32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn); - doorbell[1] = cpu_to_be32(incr - 1); - - mthca_write64(doorbell, + mthca_write64(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn, incr - 1, dev->kar + MTHCA_CQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); /* @@ -731,17 +726,12 @@ repoll: int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags) { - __be32 doorbell[2]; - - doorbell[0] = cpu_to_be32(((flags & IB_CQ_SOLICITED_MASK) == - IB_CQ_SOLICITED ? - MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL : - MTHCA_TAVOR_CQ_DB_REQ_NOT) | - to_mcq(cq)->cqn); - doorbell[1] = (__force __be32) 0xffffffff; + u32 dbhi = ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ? 
+ MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL : + MTHCA_TAVOR_CQ_DB_REQ_NOT) | + to_mcq(cq)->cqn; - mthca_write64(doorbell, - to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL, + mthca_write64(dbhi, 0xffffffff, to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&to_mdev(cq->device)->doorbell_lock)); return 0; @@ -750,19 +740,20 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags) int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) { struct mthca_cq *cq = to_mcq(ibcq); - __be32 doorbell[2]; + __be32 db_rec[2]; u32 sn; + u32 dbhi; __be32 ci; sn = cq->arm_sn & 3; ci = cpu_to_be32(cq->cons_index); - doorbell[0] = ci; - doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | - ((flags & IB_CQ_SOLICITED_MASK) == - IB_CQ_SOLICITED ? 1 : 2)); + db_rec[0] = ci; + db_rec[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | + ((flags & IB_CQ_SOLICITED_MASK) == + IB_CQ_SOLICITED ? 1 : 2)); - mthca_write_db_rec(doorbell, cq->arm_db); + mthca_write_db_rec(db_rec, cq->arm_db); /* * Make sure that the doorbell record in host memory is @@ -770,15 +761,12 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) */ wmb(); - doorbell[0] = cpu_to_be32((sn << 28) | - ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ? - MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL : - MTHCA_ARBEL_CQ_DB_REQ_NOT) | - cq->cqn); - doorbell[1] = ci; + dbhi = (sn << 28) | + ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ? 
+ MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL : + MTHCA_ARBEL_CQ_DB_REQ_NOT) | cq->cqn; - mthca_write64(doorbell, - to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL, + mthca_write64(dbhi, ci, to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->doorbell_lock)); return 0; diff --git a/drivers/infiniband/hw/mthca/mthca_doorbell.h b/drivers/infiniband/hw/mthca/mthca_doorbell.h index dd9a44d..b374dc3 100644 --- a/drivers/infiniband/hw/mthca/mthca_doorbell.h +++ b/drivers/infiniband/hw/mthca/mthca_doorbell.h @@ -58,10 +58,10 @@ static inline void mthca_write64_raw(__be64 val, void __iomem *dest) __raw_writeq((__force u64) val, dest); } -static inline void mthca_write64(__be32 val[2], void __iomem *dest, +static inline void mthca_write64(u32 hi, u32 lo, void __iomem *dest, spinlock_t *doorbell_lock) { - __raw_writeq(*(u64 *) val, dest); + __raw_writeq((__force u64) cpu_to_be64((u64) hi << 32 | lo), dest); } static inline void mthca_write_db_rec(__be32 val[2], __be32 *db) @@ -87,14 +87,17 @@ static inline void mthca_write64_raw(__be64 val, void __iomem *dest) __raw_writel(((__force u32 *) &val)[1], dest + 4); } -static inline void mthca_write64(__be32 val[2], void __iomem *dest, +static inline void mthca_write64(u32 hi, u32 lo, void __iomem *dest, spinlock_t *doorbell_lock) { unsigned long flags; + hi = (__force u32) cpu_to_be32(hi); + lo = (__force u32) cpu_to_be32(lo); + spin_lock_irqsave(doorbell_lock, flags); - __raw_writel((__force u32) val[0], dest); - __raw_writel((__force u32) val[1], dest + 4); + __raw_writel(hi, dest); + __raw_writel(lo, dest + 4); spin_unlock_irqrestore(doorbell_lock, flags); } diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c index 8592b26..b29de51 100644 --- a/drivers/infiniband/hw/mthca/mthca_eq.c +++ b/drivers/infiniband/hw/mthca/mthca_eq.c @@ -173,11 +173,6 @@ static inline u64 async_mask(struct mthca_dev *dev) static inline void tavor_set_eq_ci(struct mthca_dev *dev, 
struct mthca_eq *eq, u32 ci) { - __be32 doorbell[2]; - - doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn); - doorbell[1] = cpu_to_be32(ci & (eq->nent - 1)); - /* * This barrier makes sure that all updates to ownership bits * done by set_eqe_hw() hit memory before the consumer index @@ -187,7 +182,7 @@ static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u * having set_eqe_hw() overwrite the owner field. */ wmb(); - mthca_write64(doorbell, + mthca_write64(MTHCA_EQ_DB_SET_CI | eq->eqn, ci & (eq->nent - 1), dev->kar + MTHCA_EQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } @@ -212,12 +207,7 @@ static inline void set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) static inline void tavor_eq_req_not(struct mthca_dev *dev, int eqn) { - __be32 doorbell[2]; - - doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); - doorbell[1] = 0; - - mthca_write64(doorbell, + mthca_write64(MTHCA_EQ_DB_REQ_NOT | eqn, 0, dev->kar + MTHCA_EQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } @@ -230,12 +220,7 @@ static inline void arbel_eq_req_not(struct mthca_dev *dev, u32 eqn_mask) static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) { if (!mthca_is_memfree(dev)) { - __be32 doorbell[2]; - - doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); - doorbell[1] = cpu_to_be32(cqn); - - mthca_write64(doorbell, + mthca_write64(MTHCA_EQ_DB_DISARM_CQ | eqn, cqn, dev->kar + MTHCA_EQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index df01b20..183f68c 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1799,15 +1799,11 @@ int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, out: if (likely(nreq)) { - __be32 doorbell[2]; - - doorbell[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) + - qp->send_wqe_offset) | f0 | op0); - doorbell[1] = 
cpu_to_be32((qp->qpn << 8) | size0); - wmb(); - mthca_write64(doorbell, + mthca_write64(((qp->sq.next_ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0, + (qp->qpn << 8) | size0, dev->kar + MTHCA_SEND_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); /* @@ -1829,7 +1825,6 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); - __be32 doorbell[2]; unsigned long flags; int err = 0; int nreq; @@ -1907,13 +1902,10 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) { nreq = 0; - doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); - doorbell[1] = cpu_to_be32(qp->qpn << 8); - wmb(); - mthca_write64(doorbell, - dev->kar + MTHCA_RECEIVE_DOORBELL, + mthca_write64((qp->rq.next_ind << qp->rq.wqe_shift) | size0, + qp->qpn << 8, dev->kar + MTHCA_RECEIVE_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); qp->rq.next_ind = ind; @@ -1923,13 +1915,10 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, out: if (likely(nreq)) { - doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); - doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); - wmb(); - mthca_write64(doorbell, - dev->kar + MTHCA_RECEIVE_DOORBELL, + mthca_write64((qp->rq.next_ind << qp->rq.wqe_shift) | size0, + qp->qpn << 8 | nreq, dev->kar + MTHCA_RECEIVE_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } @@ -1951,7 +1940,7 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); - __be32 doorbell[2]; + u32 dbhi; void *wqe; void *prev_wqe; unsigned long flags; @@ -1981,11 +1970,6 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, if (unlikely(nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB)) { nreq = 0; - doorbell[0] = 
cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) | - ((qp->sq.head & 0xffff) << 8) | - f0 | op0); - doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); - qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB; /* @@ -2000,7 +1984,11 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, * write MMIO send doorbell. */ wmb(); - mthca_write64(doorbell, + + dbhi = (MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) | + ((qp->sq.head & 0xffff) << 8) | f0 | op0; + + mthca_write64(dbhi, (qp->qpn << 8) | size0, dev->kar + MTHCA_SEND_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } @@ -2154,11 +2142,6 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, out: if (likely(nreq)) { - doorbell[0] = cpu_to_be32((nreq << 24) | - ((qp->sq.head & 0xffff) << 8) | - f0 | op0); - doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); - qp->sq.head += nreq; /* @@ -2173,8 +2156,10 @@ out: * write MMIO send doorbell. */ wmb(); - mthca_write64(doorbell, - dev->kar + MTHCA_SEND_DOORBELL, + + dbhi = (nreq << 24) | ((qp->sq.head & 0xffff) << 8) | f0 | op0; + + mthca_write64(dbhi, (qp->qpn << 8 | size0), dev->kar + MTHCA_SEND_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index 3f58c11..553d681 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -491,7 +491,6 @@ int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, { struct mthca_dev *dev = to_mdev(ibsrq->device); struct mthca_srq *srq = to_msrq(ibsrq); - __be32 doorbell[2]; unsigned long flags; int err = 0; int first_ind; @@ -563,16 +562,13 @@ int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) { nreq = 0; - doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); - doorbell[1] = cpu_to_be32(srq->srqn << 8); - /* * Make sure that descriptors are written * before 
doorbell is rung. */ wmb(); - mthca_write64(doorbell, + mthca_write64(first_ind << srq->wqe_shift, srq->srqn << 8, dev->kar + MTHCA_RECEIVE_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); @@ -581,16 +577,13 @@ int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, } if (likely(nreq)) { - doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); - doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq); - /* * Make sure that descriptors are written before * doorbell is rung. */ wmb(); - mthca_write64(doorbell, + mthca_write64(first_ind << srq->wqe_shift, (srq->srqn << 8) | nreq, dev->kar + MTHCA_RECEIVE_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } -- 1.5.3.2 From rdreier at cisco.com Mon Oct 15 20:20:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 15 Oct 2007 20:20:35 -0700 Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes In-Reply-To: (Roland Dreier's message of "Mon, 15 Oct 2007 19:47:49 -0700") References: <20070726014931.GL10235@sgi.com> Message-ID: err... Now With Even Fewer Bugs! Here's the version that actually passed some of my tests... >From ab8403c424a35364a3a2c753f7c5917fcbb4d809 Mon Sep 17 00:00:00 2001 From: Roland Dreier Date: Sun, 14 Oct 2007 20:40:27 -0700 Subject: [PATCH] IB/mthca: Avoid alignment traps when writing doorbells Architectures such as ia64 see alignment traps when doing a 64-bit read from __be32 doorbell[2] arrays to do doorbell writes in mthca_write64(). Fix this by just passing the two halves of the doorbell value into mthca_write64(). This actually improves the generated code by allowing the compiler to see what's going on better. 
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cq.c | 53 +++++++++---------------- drivers/infiniband/hw/mthca/mthca_doorbell.h | 13 ++++-- drivers/infiniband/hw/mthca/mthca_eq.c | 21 +--------- drivers/infiniband/hw/mthca/mthca_qp.c | 45 +++++++-------------- drivers/infiniband/hw/mthca/mthca_srq.c | 11 +---- 5 files changed, 47 insertions(+), 96 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index be6e1e0..6bd9f13 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -204,16 +204,11 @@ static void dump_cqe(struct mthca_dev *dev, void *cqe_ptr) static inline void update_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, int incr) { - __be32 doorbell[2]; - if (mthca_is_memfree(dev)) { *cq->set_ci_db = cpu_to_be32(cq->cons_index); wmb(); } else { - doorbell[0] = cpu_to_be32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn); - doorbell[1] = cpu_to_be32(incr - 1); - - mthca_write64(doorbell, + mthca_write64(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn, incr - 1, dev->kar + MTHCA_CQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); /* @@ -731,17 +726,12 @@ repoll: int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags) { - __be32 doorbell[2]; + u32 dbhi = ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ? + MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL : + MTHCA_TAVOR_CQ_DB_REQ_NOT) | + to_mcq(cq)->cqn; - doorbell[0] = cpu_to_be32(((flags & IB_CQ_SOLICITED_MASK) == - IB_CQ_SOLICITED ? 
- MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL : - MTHCA_TAVOR_CQ_DB_REQ_NOT) | - to_mcq(cq)->cqn); - doorbell[1] = (__force __be32) 0xffffffff; - - mthca_write64(doorbell, - to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL, + mthca_write64(dbhi, 0xffffffff, to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&to_mdev(cq->device)->doorbell_lock)); return 0; @@ -750,19 +740,16 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags) int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) { struct mthca_cq *cq = to_mcq(ibcq); - __be32 doorbell[2]; - u32 sn; - __be32 ci; - - sn = cq->arm_sn & 3; - ci = cpu_to_be32(cq->cons_index); + __be32 db_rec[2]; + u32 dbhi; + u32 sn = cq->arm_sn & 3; - doorbell[0] = ci; - doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | - ((flags & IB_CQ_SOLICITED_MASK) == - IB_CQ_SOLICITED ? 1 : 2)); + db_rec[0] = cpu_to_be32(cq->cons_index); + db_rec[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | + ((flags & IB_CQ_SOLICITED_MASK) == + IB_CQ_SOLICITED ? 1 : 2)); - mthca_write_db_rec(doorbell, cq->arm_db); + mthca_write_db_rec(db_rec, cq->arm_db); /* * Make sure that the doorbell record in host memory is @@ -770,14 +757,12 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) */ wmb(); - doorbell[0] = cpu_to_be32((sn << 28) | - ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ? - MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL : - MTHCA_ARBEL_CQ_DB_REQ_NOT) | - cq->cqn); - doorbell[1] = ci; + dbhi = (sn << 28) | + ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ? 
+ MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL : + MTHCA_ARBEL_CQ_DB_REQ_NOT) | cq->cqn; - mthca_write64(doorbell, + mthca_write64(dbhi, cq->cons_index, to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->doorbell_lock)); diff --git a/drivers/infiniband/hw/mthca/mthca_doorbell.h b/drivers/infiniband/hw/mthca/mthca_doorbell.h index dd9a44d..b374dc3 100644 --- a/drivers/infiniband/hw/mthca/mthca_doorbell.h +++ b/drivers/infiniband/hw/mthca/mthca_doorbell.h @@ -58,10 +58,10 @@ static inline void mthca_write64_raw(__be64 val, void __iomem *dest) __raw_writeq((__force u64) val, dest); } -static inline void mthca_write64(__be32 val[2], void __iomem *dest, +static inline void mthca_write64(u32 hi, u32 lo, void __iomem *dest, spinlock_t *doorbell_lock) { - __raw_writeq(*(u64 *) val, dest); + __raw_writeq((__force u64) cpu_to_be64((u64) hi << 32 | lo), dest); } static inline void mthca_write_db_rec(__be32 val[2], __be32 *db) @@ -87,14 +87,17 @@ static inline void mthca_write64_raw(__be64 val, void __iomem *dest) __raw_writel(((__force u32 *) &val)[1], dest + 4); } -static inline void mthca_write64(__be32 val[2], void __iomem *dest, +static inline void mthca_write64(u32 hi, u32 lo, void __iomem *dest, spinlock_t *doorbell_lock) { unsigned long flags; + hi = (__force u32) cpu_to_be32(hi); + lo = (__force u32) cpu_to_be32(lo); + spin_lock_irqsave(doorbell_lock, flags); - __raw_writel((__force u32) val[0], dest); - __raw_writel((__force u32) val[1], dest + 4); + __raw_writel(hi, dest); + __raw_writel(lo, dest + 4); spin_unlock_irqrestore(doorbell_lock, flags); } diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c index 8592b26..b29de51 100644 --- a/drivers/infiniband/hw/mthca/mthca_eq.c +++ b/drivers/infiniband/hw/mthca/mthca_eq.c @@ -173,11 +173,6 @@ static inline u64 async_mask(struct mthca_dev *dev) static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) { - __be32 
doorbell[2]; - - doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn); - doorbell[1] = cpu_to_be32(ci & (eq->nent - 1)); - /* * This barrier makes sure that all updates to ownership bits * done by set_eqe_hw() hit memory before the consumer index @@ -187,7 +182,7 @@ static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u * having set_eqe_hw() overwrite the owner field. */ wmb(); - mthca_write64(doorbell, + mthca_write64(MTHCA_EQ_DB_SET_CI | eq->eqn, ci & (eq->nent - 1), dev->kar + MTHCA_EQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } @@ -212,12 +207,7 @@ static inline void set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci) static inline void tavor_eq_req_not(struct mthca_dev *dev, int eqn) { - __be32 doorbell[2]; - - doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); - doorbell[1] = 0; - - mthca_write64(doorbell, + mthca_write64(MTHCA_EQ_DB_REQ_NOT | eqn, 0, dev->kar + MTHCA_EQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } @@ -230,12 +220,7 @@ static inline void arbel_eq_req_not(struct mthca_dev *dev, u32 eqn_mask) static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) { if (!mthca_is_memfree(dev)) { - __be32 doorbell[2]; - - doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); - doorbell[1] = cpu_to_be32(cqn); - - mthca_write64(doorbell, + mthca_write64(MTHCA_EQ_DB_DISARM_CQ | eqn, cqn, dev->kar + MTHCA_EQ_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index df01b20..0e5461c 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1799,15 +1799,11 @@ int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, out: if (likely(nreq)) { - __be32 doorbell[2]; - - doorbell[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) + - qp->send_wqe_offset) | f0 | op0); - doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); - wmb(); - 
mthca_write64(doorbell, + mthca_write64(((qp->sq.next_ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0, + (qp->qpn << 8) | size0, dev->kar + MTHCA_SEND_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); /* @@ -1829,7 +1825,6 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); - __be32 doorbell[2]; unsigned long flags; int err = 0; int nreq; @@ -1907,13 +1902,10 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) { nreq = 0; - doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); - doorbell[1] = cpu_to_be32(qp->qpn << 8); - wmb(); - mthca_write64(doorbell, - dev->kar + MTHCA_RECEIVE_DOORBELL, + mthca_write64((qp->rq.next_ind << qp->rq.wqe_shift) | size0, + qp->qpn << 8, dev->kar + MTHCA_RECEIVE_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); qp->rq.next_ind = ind; @@ -1923,13 +1915,10 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, out: if (likely(nreq)) { - doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); - doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); - wmb(); - mthca_write64(doorbell, - dev->kar + MTHCA_RECEIVE_DOORBELL, + mthca_write64((qp->rq.next_ind << qp->rq.wqe_shift) | size0, + qp->qpn << 8 | nreq, dev->kar + MTHCA_RECEIVE_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } @@ -1951,7 +1940,7 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); - __be32 doorbell[2]; + u32 dbhi; void *wqe; void *prev_wqe; unsigned long flags; @@ -1981,10 +1970,8 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, if (unlikely(nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB)) { nreq = 0; - doorbell[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) | - 
((qp->sq.head & 0xffff) << 8) | - f0 | op0); - doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + dbhi = (MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) | + ((qp->sq.head & 0xffff) << 8) | f0 | op0; qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB; @@ -2000,7 +1987,8 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, * write MMIO send doorbell. */ wmb(); - mthca_write64(doorbell, + + mthca_write64(dbhi, (qp->qpn << 8) | size0, dev->kar + MTHCA_SEND_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } @@ -2154,10 +2142,7 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, out: if (likely(nreq)) { - doorbell[0] = cpu_to_be32((nreq << 24) | - ((qp->sq.head & 0xffff) << 8) | - f0 | op0); - doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + dbhi = (nreq << 24) | ((qp->sq.head & 0xffff) << 8) | f0 | op0; qp->sq.head += nreq; @@ -2173,8 +2158,8 @@ out: * write MMIO send doorbell. */ wmb(); - mthca_write64(doorbell, - dev->kar + MTHCA_SEND_DOORBELL, + + mthca_write64(dbhi, (qp->qpn << 8) | size0, dev->kar + MTHCA_SEND_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index 3f58c11..553d681 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -491,7 +491,6 @@ int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, { struct mthca_dev *dev = to_mdev(ibsrq->device); struct mthca_srq *srq = to_msrq(ibsrq); - __be32 doorbell[2]; unsigned long flags; int err = 0; int first_ind; @@ -563,16 +562,13 @@ int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) { nreq = 0; - doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); - doorbell[1] = cpu_to_be32(srq->srqn << 8); - /* * Make sure that descriptors are written * before doorbell is rung. 
*/ wmb(); - mthca_write64(doorbell, + mthca_write64(first_ind << srq->wqe_shift, srq->srqn << 8, dev->kar + MTHCA_RECEIVE_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); @@ -581,16 +577,13 @@ int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr, } if (likely(nreq)) { - doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); - doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq); - /* * Make sure that descriptors are written before * doorbell is rung. */ wmb(); - mthca_write64(doorbell, + mthca_write64(first_ind << srq->wqe_shift, (srq->srqn << 8) | nreq, dev->kar + MTHCA_RECEIVE_DOORBELL, MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); } -- 1.5.3.2 From sureid at cisco.com Mon Oct 15 20:28:04 2007 From: sureid at cisco.com (Suzanne Reid -X (sureid - Spherion at Cisco)) Date: Mon, 15 Oct 2007 23:28:04 -0400 Subject: [ofa-general] add to mailing list Message-ID: <068A634310232F4BBBCEDB50F6AD2E6902219573@xmb-rtp-20b.amer.cisco.com> Please add me to your mailing list. Thanks Suzanne Reid Sr. Recruiter-CDO Staffing and Management Cisco Systems 318-254-0486 Office sureid at cisco.com Change the way you work - visit www.cisco.com/jobs From dotanb at dev.mellanox.co.il Mon Oct 15 23:15:06 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 16 Oct 2007 08:15:06 +0200 Subject: [ofa-general] Missing IB_EVENT_PATH_MIG events In-Reply-To: References: Message-ID: <471456EA.3060403@dev.mellanox.co.il> Hi. lbt wrote: > Hi, > > I'm trying out APM with OFED 1.2 , using Mellanox dual-port HCA > (ib_mthca driver).
When I have several RCQP's that I am trying to > migrate (software triggered migration using ib_modify_qp), I've > noticed that sometimes 1 or 2 of the remote QP's never generate an > IB_EVENT_PATH_MIG or even an IB_EVENT_PATH_MIG_ERR ... it seems that > it just gets lost. I looked through some of the ib_mthca patches in > git.kernel.org/?p=linux/kernel/git/roland/infiniband.git > , and > incorporated the mmiowb patch for ib_mthca commands > (http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=76d7cc0345a037e8eea426f8abc710abd22946dd > ). > But still seeing same issue. I have a test case that repeats > software-triggered migrations + rearming in a loop, and this problem > usually occurs in the first few cycles, but is not too frequent. If > anyone has any ideas on what might be wrong, or tips on where I can > look/do to debug this, that would be very much appreciated! > > For example, this is the console output I will see (printed out by our > rcqp event handler): > On the local end - initiates software triggered migration, using > ib_modify_qp: > Event IB_EVENT_PATH_MIG occurred on QP#1043 > Event IB_EVENT_PATH_MIG occurred on QP#1040 > Event IB_EVENT_PATH_MIG occurred on QP#1033 > > On the remote end: > Event IB_EVENT_PATH_MIG occurred on QP#1040 > Event IB_EVENT_PATH_MIG occurred on QP#1043 Is the timeout value (in the QP attributes) 0? If the answer is no, can you please supply some more details on this?
thanks Dotan From kliteyn at dev.mellanox.co.il Mon Oct 15 23:57:31 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 16 Oct 2007 08:57:31 +0200 Subject: [ofa-general] [PATCH] osm: Adding two SA MAD class-specific status values Message-ID: <471460DB.5070703@dev.mellanox.co.il> Adding two SA MAD class-specific status values: - ERR_REQ_DENIED - ERR_REQ_PRIORITY_SUGGESTED Signed-off-by: Yevgeny Kliteynik --- opensm/include/iba/ib_types.h | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index e1785f1..c6f16b9 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -903,6 +903,8 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) #define IB_SA_MAD_STATUS_TOO_MANY_RECORDS (CL_HTON16(0x0400)) #define IB_SA_MAD_STATUS_INVALID_GID (CL_HTON16(0x0500)) #define IB_SA_MAD_STATUS_INSUF_COMPS (CL_HTON16(0x0600)) +#define IB_SA_MAD_STATUS_DENIED (CL_HTON16(0x0700)) +#define IB_SA_MAD_STATUS_PRIO_SUGGESTED (CL_HTON16(0x0800)) #define IB_DM_MAD_STATUS_NO_IOC_RESP (CL_HTON16(0x0100)) #define IB_DM_MAD_STATUS_NO_SVC_ENTRIES (CL_HTON16(0x0200)) -- 1.5.1.4 From monis at voltaire.com Tue Oct 16 00:56:54 2007 From: monis at voltaire.com (Moni Shoua) Date: Tue, 16 Oct 2007 09:56:54 +0200 Subject: [ofa-general] Re: [PATCH linux-2.6] bonding: two small fixes for IPoIB support In-Reply-To: <9245.1192491867@death> References: <47138EB7.40703@gmail.com> <4713B006.9090908@pobox.com> <27349.1192480486@death> <4713D28F.3010904@pobox.com> <31162.1192485233@death> <4713E20F.9080305@pobox.com> <9245.1192491867@death> Message-ID: <47146EC6.1000109@voltaire.com> Jay Vosburgh wrote: > Two small fixes to IPoIB support for bonding: > > 1- copy header_ops from slave to bonding for IPoIB slaves > 2- move release and destroy logic to UNREGISTER from GOING_DOWN > notifier to avoid double release > > Set bonding to version 3.2.1. 
> > Signed-off-by: Moni Shoua > Signed-off-by: Jay Vosburgh > > --- > drivers/net/bonding/bond_main.c | 11 +++++------ > drivers/net/bonding/bonding.h | 4 ++-- > 2 files changed, 7 insertions(+), 8 deletions(-) > > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c > index db80f24..6f85cc3 100644 > --- a/drivers/net/bonding/bond_main.c > +++ b/drivers/net/bonding/bond_main.c > @@ -1263,6 +1263,7 @@ static void bond_setup_by_slave(struct net_device *bond_dev, > struct bonding *bond = bond_dev->priv; > > bond_dev->neigh_setup = slave_dev->neigh_setup; > + bond_dev->header_ops = slave_dev->header_ops; > > bond_dev->type = slave_dev->type; > bond_dev->hard_header_len = slave_dev->hard_header_len; > @@ -3351,7 +3352,10 @@ static int bond_slave_netdev_event(unsigned long event, struct net_device *slave > switch (event) { > case NETDEV_UNREGISTER: > if (bond_dev) { > - bond_release(bond_dev, slave_dev); > + if (bond->setup_by_slave) > + bond_release_and_destroy(bond_dev, slave_dev); > + else > + bond_release(bond_dev, slave_dev); > } > break; > case NETDEV_CHANGE: > @@ -3366,11 +3370,6 @@ static int bond_slave_netdev_event(unsigned long event, struct net_device *slave > * ... Or is it this? 
> */ > break; > - case NETDEV_GOING_DOWN: > - dprintk("slave %s is going down\n", slave_dev->name); > - if (bond->setup_by_slave) > - bond_release_and_destroy(bond_dev, slave_dev); > - break; > case NETDEV_CHANGEMTU: > /* > * TODO: Should slaves be allowed to > diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h > index a8bbd56..b818060 100644 > --- a/drivers/net/bonding/bonding.h > +++ b/drivers/net/bonding/bonding.h > @@ -22,8 +22,8 @@ > #include "bond_3ad.h" > #include "bond_alb.h" > > -#define DRV_VERSION "3.2.0" > -#define DRV_RELDATE "September 13, 2007" > +#define DRV_VERSION "3.2.1" > +#define DRV_RELDATE "October 15, 2007" > #define DRV_NAME "bonding" > #define DRV_DESCRIPTION "Ethernet Channel Bonding Driver" > Jay, Thanks for this work. Jeff, Thanks for applying. I noticed that this patch isn't applied. It includes important fixes. Can you please apply it also? MoniS From madhu.lakshmanan at qlogic.com Tue Oct 16 02:50:10 2007 From: madhu.lakshmanan at qlogic.com (Lakshmanan, Madhu) Date: Tue, 16 Oct 2007 04:50:10 -0500 Subject: [ofa-general] Building an OFED distribution package In-Reply-To: <470FAD63.7090702@datadirectnet.com> References: <470FAD63.7090702@datadirectnet.com> Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE06119341AEF@EPEXCH2.qlogic.org> > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On > Behalf Of Martin W. Schlining III > Subject: [ofa-general] Building an OFED distribution package > > I'd like to patch the OFED-1.2.5 source file ib_srp.h (or use the modified > source file) and rebuild the source RPM (whichever one ib_srp.h comes > from) and > the OFED 1.2.5 distribution package. > > I'll probably want to do the same for OFED 1.3 when it is released. > > Now, how do I do this? > > Martin > [Madhu: ] There is a utility - ofed_patch.sh - under the 'docs' directory. Instructions on how to use it can be found in the 'OFED_tips.txt' document and by running 'ofed_patch -h'. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From vlad at lists.openfabrics.org Tue Oct 16 02:56:18 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 16 Oct 2007 02:56:18 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071016-0200 daily build status Message-ID: <20071016095618.C3177E6086D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.23 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.19 
Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From hrosenstock at xsigo.com Tue Oct 16 03:50:28 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 16 Oct 2007 03:50:28 -0700 Subject: [ofa-general] [PATCH] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <4713ECDC.2090205@dev.mellanox.co.il> References: <4713ECDC.2090205@dev.mellanox.co.il> Message-ID: <1192531828.5492.50.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, Looks good; just some nits below. -- Hal On Tue, 2007-10-16 at 00:42 +0200, Yevgeny Kliteynik wrote: > Adding ClassPortInfo:CapabilityMask2 field and turning > on OSM QoS capabiliry bit (OSM_CAP2_IS_QOS_SUPPORTED). 
> > Signed-off-by: Yevgeny Kliteynik > --- > infiniband-diags/src/saquery.c | 6 +- > opensm/include/iba/ib_types.h | 137 +++++++++++++++++++++++++++++++- > opensm/include/opensm/osm_base.h | 12 +++ > opensm/opensm/osm_sa_class_port_info.c | 4 +- > opensm/osmtest/osmtest.c | 13 +++- > 5 files changed, 162 insertions(+), 10 deletions(-) > > diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c > index a9a8da4..e17ec5a 100644 > --- a/infiniband-diags/src/saquery.c > +++ b/infiniband-diags/src/saquery.c > @@ -262,7 +262,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) > "\t\tBase version.............%d\n" > "\t\tClass version............%d\n" > "\t\tCapability mask..........0x%04X\n" > - "\t\tResponse time value......0x%08X\n" > + "\t\tCapability mask 2........0x%08X\n" > + "\t\tResponse time value......0x%02X\n" > "\t\tRedirect GID.............0x%s\n" > "\t\tRedirect TC/SL/FL........0x%08X\n" > "\t\tRedirect LID.............0x%04X\n" > @@ -279,7 +280,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) > class_port_info->base_ver, > class_port_info->class_ver, > cl_ntoh16(class_port_info->cap_mask), > - class_port_info->resp_time_val, > + ib_class_cap_mask2(class_port_info), > + ib_class_resp_time_val(class_port_info), > sprint_gid(&(class_port_info->redir_gid), gid_str, GID_STR_LEN), > cl_ntoh32(class_port_info->redir_tc_sl_fl), > cl_ntoh16(class_port_info->redir_lid), > diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h > index 0969755..e1785f1 100644 > --- a/opensm/include/iba/ib_types.h > +++ b/opensm/include/iba/ib_types.h > @@ -3247,8 +3247,7 @@ typedef struct _ib_class_port_info { > uint8_t base_ver; > uint8_t class_ver; > ib_net16_t cap_mask; > - uint8_t reserved[3]; > - uint8_t resp_time_val; > + uint32_t cap_mask2_resp_time; Is this ib_net32_t ? 
> ib_gid_t redir_gid; > ib_net32_t redir_tc_sl_fl; > ib_net16_t redir_lid; > @@ -3275,8 +3274,9 @@ typedef struct _ib_class_port_info { > * cap_mask > * Supported capabilities of this management class. > * > -* resp_time_value > -* Maximum expected response time. > +* cap_mask2_resp_time > +* Maximum expected response time and additional > +* supported capabilities of this management class. > * > * redr_gid > * GID to use for redirection, or zero > @@ -3322,6 +3322,135 @@ typedef struct _ib_class_port_info { > * > *********/ > > +/****f* IBA Base: Types/ib_class_set_resp_time_val > +* NAME > +* ib_class_set_resp_time_val > +* > +* DESCRIPTION > +* Set maximum expected responce time. ^^^^^^^^ typo response > +* > +* SYNOPSIS > +*/ > +static inline void OSM_API > +ib_class_set_resp_time_val(IN ib_class_port_info_t * const p_cpi, > + IN const uint8_t val) > +{ > + p_cpi->cap_mask2_resp_time = > + (p_cpi->cap_mask2_resp_time & CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | > + cl_hton32(val & IB_CLASS_RESP_TIME_MASK); > +} > + > +/* > +* PARAMETERS > +* p_cpi > +* [in] Pointer to the class port info object. > +* > +* val > +* [in] Responce time value to set. ^^^^^^^^ typo Response > +* > +* RETURN VALUES > +* None > +* > +* NOTES > +* > +* SEE ALSO > +* ib_class_port_info_t > +*********/ > + > +/****f* IBA Base: Types/ib_class_resp_time_val > +* NAME > +* ib_class_resp_time_val > +* > +* DESCRIPTION > +* Get responce time value. ^^^^^^^^ typo response > +* > +* SYNOPSIS > +*/ > +static inline uint8_t OSM_API > +ib_class_resp_time_val(IN ib_class_port_info_t * const p_cpi) > +{ > + return (uint8_t)(cl_ntoh32(p_cpi->cap_mask2_resp_time) & > + IB_CLASS_RESP_TIME_MASK); > +} > + > +/* > +* PARAMETERS > +* p_cpi > +* [in] Pointer to the class port info object. > +* > +* RETURN VALUES > +* Responce time value. 
^^^^^^^^ typo Response > +* > +* NOTES > +* > +* SEE ALSO > +* ib_class_port_info_t > +*********/ > + > +/****f* IBA Base: Types/ib_class_set_cap_mask_2 > +* NAME > +* ib_class_set_cap_mask_2 How about ib_class_set_cap_mask2 for this ? > +* > +* DESCRIPTION > +* Set ClassPortInfo:CapabilityMask2. > +* > +* SYNOPSIS > +*/ > +static inline void OSM_API > +ib_class_set_cap_mask2(IN ib_class_port_info_t * const p_cpi, > + IN const uint32_t cap_mask2) > +{ > + p_cpi->cap_mask2_resp_time = (p_cpi->cap_mask2_resp_time & > + CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | > + cl_hton32(cap_mask2 << 5); > +} > + > +/* > +* PARAMETERS > +* p_cpi > +* [in] Pointer to the class port info object. > +* > +* cap_mask_2 > +* [in] CapabilityMask2 value to set. > +* > +* RETURN VALUES > +* None > +* > +* NOTES > +* > +* SEE ALSO > +* ib_class_port_info_t > +*********/ > + > +/****f* IBA Base: Types/ib_class_cap_mask2 > +* NAME > +* ib_class_cap_mask2 > +* > +* DESCRIPTION > +* Get ClassPortInfo:CapabilityMask2. > +* > +* SYNOPSIS > +*/ > +static inline uint32_t OSM_API > +ib_class_cap_mask2(IN const ib_class_port_info_t * const p_cpi) > +{ > + return (cl_ntoh32(p_cpi->cap_mask2_resp_time) >> 5); > +} > + > +/* > +* PARAMETERS > +* p_cpi > +* [in] Pointer to the class port info object. > +* > +* RETURN VALUES > +* CapabilityMask2 of the ClassPortInfo. 
> +* > +* NOTES > +* > +* SEE ALSO > +* ib_class_port_info_t > +*********/ > + > /****s* IBA Base: Types/ib_sm_info_t > * NAME > * ib_sm_info_t > diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h > index e635dcb..26ef067 100644 > --- a/opensm/include/opensm/osm_base.h > +++ b/opensm/include/opensm/osm_base.h > @@ -661,6 +661,18 @@ typedef enum _osm_thread_state { > #define OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED (1 << 13) > /***********/ > > +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED > +* Name > +* OSM_CAP2_IS_QOS_SUPPORTED > +* > +* DESCRIPTION > +* QoS is supported > +* > +* SYNOPSIS > +*/ > +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) > +/***********/ > + > /****d* OpenSM: Base/osm_sm_state_t > * NAME > * osm_sm_state_t > diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c > index d5c9f82..96d8898 100644 > --- a/opensm/opensm/osm_sa_class_port_info.c > +++ b/opensm/opensm/osm_sa_class_port_info.c > @@ -170,7 +170,7 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, > } > } > rtv += 8; > - p_resp_cpi->resp_time_val = rtv; > + ib_class_set_resp_time_val(p_resp_cpi, rtv); > p_resp_cpi->redir_gid = zero_gid; > p_resp_cpi->redir_tc_sl_fl = 0; > p_resp_cpi->redir_lid = 0; > @@ -209,6 +209,8 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, > p_resp_cpi->cap_mask = OSM_CAP_IS_SUBN_GET_SET_NOTICE_SUP | > OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED; > #endif > + ib_class_set_cap_mask2(p_resp_cpi, OSM_CAP2_IS_QOS_SUPPORTED); > + > if (p_rcv->p_subn->opt.no_multicast_option != TRUE) > p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; > p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); > diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c > index 73933a3..de54f2d 100644 > --- a/opensm/osmtest/osmtest.c > +++ b/opensm/osmtest/osmtest.c > @@ -713,10 +713,17 @@ ib_api_status_t osmtest_validate_sa_class_port_info(IN osmtest_t * const p_osmt) > (ib_class_port_info_t 
*) ib_sa_mad_get_payload_ptr(p_resp_sa_madp); > > osm_log(&p_osmt->log, OSM_LOG_INFO, > - "osmtest_validate_sa_class_port_info:\n-----------------------------\nSA Class Port Info:\n" > - " base_ver:%u\n class_ver:%u\n cap_mask:0x%X\n resp_time_val:0x%X\n-----------------------------\n", > + "osmtest_validate_sa_class_port_info:\n" > + "-----------------------------\n" > + "SA Class Port Info:\n" > + " base_ver:%u\n" > + " class_ver:%u\n" > + " cap_mask:0x%X\n" > + " cap_mask2:0x%X\n" > + " resp_time_val:0x%X\n" > + "-----------------------------\n", > p_cpi->base_ver, p_cpi->class_ver, cl_ntoh16(p_cpi->cap_mask), > - p_cpi->resp_time_val); > + ib_class_cap_mask2(p_cpi), ib_class_resp_time_val(p_cpi)); > > Exit: > #if 0 From john.leidel at gmail.com Tue Oct 16 05:44:05 2007 From: john.leidel at gmail.com (John Leidel) Date: Tue, 16 Oct 2007 07:44:05 -0500 Subject: [ofa-general] Sockets Direct Message-ID: <1192538645.4558.27.camel@e521.site> All, when writing applications to arbitrarily use SDP, which of the address family designations do I use: AF_INET_OFFLOAD *OR* AF_INET_SDP cheers john From swise at opengridcomputing.com Tue Oct 16 06:19:11 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 16 Oct 2007 08:19:11 -0500 Subject: [ofa-general] openfabrics CMA interfaces for iWARP In-Reply-To: <4713FB05.6060105@Sun.COM> References: <470EA544.9030101@Sun.COM> <471388BE.3000504@opengridcomputing.com> <4713FB05.6060105@Sun.COM> Message-ID: <4714BA4F.5050306@opengridcomputing.com> Ramaswamy Tummala wrote: > Thanks Steve for answering the questions. > > >> - It appears that RNIC should send IW_CM_EVENT_DISCONNECT event to CMA > >> prior > >> to the start of closing or aborting the connection (except in the > case > >> where the disconnect has been initiated by CMA itself, for example > >> by CMA > >> calling modify_qp entry point of RNIC to move the QP state to > >> CLOSING or > >> ERROR). Is this correct? > > > > I'm not sure I understand your question. 
> > Basically, I am trying to understand when RNIC should send > IW_CM_EVENT_DISCONNECT event. > When you receive a FIN while the endpoint is in FPDU mode and the qp is in RTS. From jackm at dev.mellanox.co.il Tue Oct 16 07:00:35 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 16 Oct 2007 16:00:35 +0200 Subject: [ofa-general] [PATCH] mlx4_core: enable changing default max HCA resource limits at run time Message-ID: <200710161600.35511.jackm@dev.mellanox.co.il> mlx4-core: enable changing default max HCA resource limits at run time. Enable module-initialization time modification of default HCA maximum resource limits via module parameters, as is done in mthca. Signed-off-by: Jack Morgenstein Index: infiniband/drivers/net/mlx4/main.c =================================================================== --- infiniband.orig/drivers/net/mlx4/main.c 2007-10-10 17:21:17.938882000 +0200 +++ infiniband/drivers/net/mlx4/main.c 2007-10-16 15:51:29.571850000 +0200 @@ -85,6 +85,29 @@ static struct mlx4_profile default_profi .num_mtt = 1 << 20, }; +module_param_named(num_qp, default_profile.num_qp, int, 0444); +MODULE_PARM_DESC(num_qp, "maximum number of QPs per HCA"); + +module_param_named(num_srq, default_profile.num_srq, int, 0444); +MODULE_PARM_DESC(num_srq, "maximum number of SRQs per HCA"); + +module_param_named(rdmarc_per_qp, default_profile.rdmarc_per_qp, int, 0444); +MODULE_PARM_DESC(rdmarc_per_qp, "number of RDMARC buffers per QP"); + +module_param_named(num_cq, default_profile.num_cq, int, 0444); +MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA"); + +module_param_named(num_mcg, default_profile.num_mcg, int, 0444); +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA"); + +module_param_named(num_mpt, default_profile.num_mpt, int, 0444); +MODULE_PARM_DESC(num_mpt, + "maximum number of memory protection table entries per HCA"); + +module_param_named(num_mtt, default_profile.num_mtt, int, 0444); +MODULE_PARM_DESC(num_mtt, + 
"maximum number of memory translation table segments per HCA"); + static int __devinit mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) { int err; From kliteyn at dev.mellanox.co.il Tue Oct 16 07:11:50 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 16 Oct 2007 16:11:50 +0200 Subject: [ofa-general] [PATCH v3] osm: QoS - parsing port names Message-ID: <4714C6A6.7050300@dev.mellanox.co.il> Added node-by-name hash to the QoS policy object and as port names are parsed they use this hash to locate that actual port that the name refers to. For now I prefer to keep this hash local, so it's part of QoS policy object. When the same parser will be used for partitions too, this hash will be moved to be part of the subnet object. V3 changes (vs. V2): - node-by-name instead of ca-by-name - removed any constraints on the format of node name Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_qos_policy.h | 3 +- opensm/opensm/osm_qos_parser.y | 64 ++++++++++++++++++++++++++------ opensm/opensm/osm_qos_policy.c | 38 ++++++++++++++++--- 3 files changed, 86 insertions(+), 19 deletions(-) diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h index 30c2e6d..61fc325 100644 --- a/opensm/include/opensm/osm_qos_policy.h +++ b/opensm/include/opensm/osm_qos_policy.h @@ -49,6 +49,7 @@ #include #include +#include #include #include #include @@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t { typedef struct _osm_qos_port_group_t { char *name; /* single string (this port group name) */ char *use; /* single string (description) */ - cl_list_t port_name_list; /* list of port names (.../.../...) 
*/ uint8_t node_types; /* node types bitmask */ cl_qmap_t port_map; } osm_qos_port_group_t; @@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t { cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ osm_qos_level_t *p_default_qos_level; /* default QoS level */ osm_subn_t *p_subn; /* osm subnet object */ + st_table * p_node_hash; /* node by name hash */ } osm_qos_policy_t; /***************************************************/ diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index d2917d3..5a6e0c9 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -245,7 +245,8 @@ qos_policy_entry: port_groups_section * use: our SRP storage targets * port-guid: 0x1000000000000001,0x1000000000000002 * ... - * port-name: vs1/HCA-1/P1 + * port-name: vs1 HCA-1/P1 + * port-name: node_and_HCA_name/P2 * ... * pkey: 0x00FF-0x0FFF * ... @@ -602,21 +603,60 @@ port_group_use_start: TK_USE { port_group_port_name: port_group_port_name_start string_list { /* 'port-name' in 'port-group' - any num of instances */ - cl_list_iterator_t list_iterator; - char * tmp_str; - - list_iterator = cl_list_head(&tmp_parser_struct.str_list); - while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) + cl_list_iterator_t list_iterator; + osm_node_t * p_node; + osm_physp_t * p_physp; + unsigned port_num; + char * tmp_str; + char * port_str; + + /* parsing port name strings */ + for (list_iterator = cl_list_head(&tmp_parser_struct.str_list); + list_iterator != cl_list_end(&tmp_parser_struct.str_list); + list_iterator = cl_list_next(list_iterator)) { tmp_str = (char*)cl_list_obj(list_iterator); + if (tmp_str) + { + /* last slash in port name string is a separator + between node name and port number */ + port_str = strrchr(tmp_str, '/'); + if (!port_str || (strlen(port_str) < 3) || + (port_str[1] != 'p' && port_str[1] != 'P')) { + yyerror("illegal port name"); + free(tmp_str); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; 
+ } - /* - * TODO: parse port name strings - */ + if (!(port_num = strtoul(&port_str[2],NULL,0))) { + yyerror("illegal port number in port name"); + free(tmp_str); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } - if (tmp_str) - cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); - list_iterator = cl_list_next(list_iterator); + /* separate node name from port number */ + port_str[0] = '\0'; + + if (st_lookup(p_qos_policy->p_node_hash, + (st_data_t)tmp_str, + (st_data_t*)&p_node)) + { + /* we found the node, now get the right port */ + p_physp = osm_node_get_physp_ptr(p_node, port_num); + if (!p_physp) { + yyerror("port number out of range in port name"); + free(tmp_str); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } + /* we found the port, now add it to guid table */ + __parser_add_port_to_port_map(&p_current_port_group->port_map, + p_physp); + } + free(tmp_str); + } } cl_list_remove_all(&tmp_parser_struct.str_list); } diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 51dd7b9..1207295 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -59,6 +59,33 @@ /*************************************************** ***************************************************/ +static void +__build_nodebyname_hash(osm_qos_policy_t * p_qos_policy) +{ + osm_node_t * p_node; + cl_qmap_t * p_node_guid_tbl = &p_qos_policy->p_subn->node_guid_tbl; + + p_qos_policy->p_node_hash = st_init_strtable(); + CL_ASSERT(p_qos_policy->p_node_hash); + + if (!p_node_guid_tbl || !cl_qmap_count(p_node_guid_tbl)) + return; + + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { + if (!st_lookup(p_qos_policy->p_node_hash, + (st_data_t)p_node->print_desc, + (st_data_t*)&p_node)) + st_insert(p_qos_policy->p_node_hash, + (st_data_t)p_node->print_desc, + 
(st_data_t)p_node); + } +} + +/*************************************************** + ***************************************************/ + static boolean_t __is_num_in_range_arr(uint64_t ** range_arr, unsigned range_arr_len, uint64_t num) @@ -127,8 +154,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() return NULL; memset(p, 0, sizeof(osm_qos_port_group_t)); - - cl_list_init(&p->port_name_list, 10); cl_qmap_init(&p->port_map); return p; @@ -150,10 +175,6 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) if (p->use) free(p->use); - cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); - cl_list_remove_all(&p->port_name_list); - cl_list_destroy(&p->port_name_list); - p_port = (osm_qos_port_t *) cl_qmap_head(&p->port_map); while (p_port != (osm_qos_port_t *) cl_qmap_end(&p->port_map)) { @@ -423,6 +444,8 @@ osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) cl_list_init(&p_qos_policy->qos_match_rules, 10); p_qos_policy->p_subn = p_subn; + __build_nodebyname_hash(p_qos_policy); + return p_qos_policy; } @@ -495,6 +518,9 @@ void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy) cl_list_remove_all(&p_qos_policy->qos_match_rules); cl_list_destroy(&p_qos_policy->qos_match_rules); + if (p_qos_policy->p_node_hash) + st_free_table(p_qos_policy->p_node_hash); + free(p_qos_policy); p_qos_policy = NULL; -- 1.5.1.4 From swise at opengridcomputing.com Tue Oct 16 07:24:31 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 16 Oct 2007 09:24:31 -0500 Subject: [ofa-general] mpi-selector busted in ofed-1.3-alpha2 Message-ID: <4714C99F.8050505@opengridcomputing.com> The mpi-selector tool doesn't seem to work in ofed-1.3. Bug 742 opened... Steve. 
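Editorial sketch, referring back to the "[PATCH v3] osm: QoS - parsing port names" message above: the parser there treats the last slash in a port-name string as the separator between the node name and the port number, requires a "P" or "p" right after that slash, and rejects a port number of 0. The stand-alone C fragment below illustrates only that splitting rule under those assumptions. The function name split_port_name is hypothetical and is not part of OpenSM; the node lookup (st_lookup) and osm_node_get_physp_ptr steps from the patch are deliberately left out.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only (not an OpenSM function).
 * Splits "<node name>/P<num>" in place: on success the buffer 'name'
 * holds just the node name and *port_num holds the port number.
 * Returns 0 on success, -1 if the string is not a legal port name.
 */
int split_port_name(char *name, unsigned *port_num)
{
	/* the last slash separates the node name from the port number */
	char *port_str = strrchr(name, '/');
	unsigned long num;

	/* need at least "/P<digit>" after the node name */
	if (!port_str || strlen(port_str) < 3 ||
	    (port_str[1] != 'p' && port_str[1] != 'P'))
		return -1;

	/* base 0 as in the patch; a result of 0 is treated as illegal */
	num = strtoul(&port_str[2], NULL, 0);
	if (num == 0)
		return -1;

	/* cut the node name off at the slash */
	port_str[0] = '\0';
	*port_num = (unsigned)num;
	return 0;
}
```

With the buffer "vs1 HCA-1/P1" this leaves "vs1 HCA-1" in the buffer and stores 1 as the port number. A name with no slash, such as "vs1 HCA-1", is rejected. Because only the last slash counts, node names that themselves contain slashes (for example "vs1/HCA-1/P1") still split at the port suffix.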
From kliteyn at dev.mellanox.co.il Tue Oct 16 07:24:43 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 16 Oct 2007 16:24:43 +0200 Subject: [ofa-general] [PATCH] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <1192531828.5492.50.camel@hrosenstock-ws.xsigo.com> References: <4713ECDC.2090205@dev.mellanox.co.il> <1192531828.5492.50.camel@hrosenstock-ws.xsigo.com> Message-ID: <4714C9AB.50502@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Yevgeny, > > Looks good; just some nits below. Thanks. I'll repost the patch. -- Yevgeny > > -- Hal > > On Tue, 2007-10-16 at 00:42 +0200, Yevgeny Kliteynik wrote: >> Adding ClassPortInfo:CapabilityMask2 field and turning >> on OSM QoS capabiliry bit (OSM_CAP2_IS_QOS_SUPPORTED). >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> infiniband-diags/src/saquery.c | 6 +- >> opensm/include/iba/ib_types.h | 137 +++++++++++++++++++++++++++++++- >> opensm/include/opensm/osm_base.h | 12 +++ >> opensm/opensm/osm_sa_class_port_info.c | 4 +- >> opensm/osmtest/osmtest.c | 13 +++- >> 5 files changed, 162 insertions(+), 10 deletions(-) >> >> diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c >> index a9a8da4..e17ec5a 100644 >> --- a/infiniband-diags/src/saquery.c >> +++ b/infiniband-diags/src/saquery.c >> @@ -262,7 +262,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) >> "\t\tBase version.............%d\n" >> "\t\tClass version............%d\n" >> "\t\tCapability mask..........0x%04X\n" >> - "\t\tResponse time value......0x%08X\n" >> + "\t\tCapability mask 2........0x%08X\n" >> + "\t\tResponse time value......0x%02X\n" >> "\t\tRedirect GID.............0x%s\n" >> "\t\tRedirect TC/SL/FL........0x%08X\n" >> "\t\tRedirect LID.............0x%04X\n" >> @@ -279,7 +280,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) >> class_port_info->base_ver, >> class_port_info->class_ver, >> cl_ntoh16(class_port_info->cap_mask), >> - 
class_port_info->resp_time_val, >> + ib_class_cap_mask2(class_port_info), >> + ib_class_resp_time_val(class_port_info), >> sprint_gid(&(class_port_info->redir_gid), gid_str, GID_STR_LEN), >> cl_ntoh32(class_port_info->redir_tc_sl_fl), >> cl_ntoh16(class_port_info->redir_lid), >> diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h >> index 0969755..e1785f1 100644 >> --- a/opensm/include/iba/ib_types.h >> +++ b/opensm/include/iba/ib_types.h >> @@ -3247,8 +3247,7 @@ typedef struct _ib_class_port_info { >> uint8_t base_ver; >> uint8_t class_ver; >> ib_net16_t cap_mask; >> - uint8_t reserved[3]; >> - uint8_t resp_time_val; >> + uint32_t cap_mask2_resp_time; > > Is this ib_net32_t ? > >> ib_gid_t redir_gid; >> ib_net32_t redir_tc_sl_fl; >> ib_net16_t redir_lid; >> @@ -3275,8 +3274,9 @@ typedef struct _ib_class_port_info { >> * cap_mask >> * Supported capabilities of this management class. >> * >> -* resp_time_value >> -* Maximum expected response time. >> +* cap_mask2_resp_time >> +* Maximum expected response time and additional >> +* supported capabilities of this management class. >> * >> * redr_gid >> * GID to use for redirection, or zero >> @@ -3322,6 +3322,135 @@ typedef struct _ib_class_port_info { >> * >> *********/ >> >> +/****f* IBA Base: Types/ib_class_set_resp_time_val >> +* NAME >> +* ib_class_set_resp_time_val >> +* >> +* DESCRIPTION >> +* Set maximum expected responce time. > ^^^^^^^^ > typo response >> +* >> +* SYNOPSIS >> +*/ >> +static inline void OSM_API >> +ib_class_set_resp_time_val(IN ib_class_port_info_t * const p_cpi, >> + IN const uint8_t val) >> +{ >> + p_cpi->cap_mask2_resp_time = >> + (p_cpi->cap_mask2_resp_time & CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | >> + cl_hton32(val & IB_CLASS_RESP_TIME_MASK); >> +} >> + >> +/* >> +* PARAMETERS >> +* p_cpi >> +* [in] Pointer to the class port info object. >> +* >> +* val >> +* [in] Responce time value to set. 
> ^^^^^^^^ > typo Response >> +* >> +* RETURN VALUES >> +* None >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_class_port_info_t >> +*********/ >> + >> +/****f* IBA Base: Types/ib_class_resp_time_val >> +* NAME >> +* ib_class_resp_time_val >> +* >> +* DESCRIPTION >> +* Get responce time value. > ^^^^^^^^ > typo response >> +* >> +* SYNOPSIS >> +*/ >> +static inline uint8_t OSM_API >> +ib_class_resp_time_val(IN ib_class_port_info_t * const p_cpi) >> +{ >> + return (uint8_t)(cl_ntoh32(p_cpi->cap_mask2_resp_time) & >> + IB_CLASS_RESP_TIME_MASK); >> +} >> + >> +/* >> +* PARAMETERS >> +* p_cpi >> +* [in] Pointer to the class port info object. >> +* >> +* RETURN VALUES >> +* Responce time value. > ^^^^^^^^ > typo Response >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_class_port_info_t >> +*********/ >> + >> +/****f* IBA Base: Types/ib_class_set_cap_mask_2 >> +* NAME >> +* ib_class_set_cap_mask_2 > > How about ib_class_set_cap_mask2 for this ? > >> +* >> +* DESCRIPTION >> +* Set ClassPortInfo:CapabilityMask2. >> +* >> +* SYNOPSIS >> +*/ >> +static inline void OSM_API >> +ib_class_set_cap_mask2(IN ib_class_port_info_t * const p_cpi, >> + IN const uint32_t cap_mask2) >> +{ >> + p_cpi->cap_mask2_resp_time = (p_cpi->cap_mask2_resp_time & >> + CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | >> + cl_hton32(cap_mask2 << 5); >> +} >> + >> +/* >> +* PARAMETERS >> +* p_cpi >> +* [in] Pointer to the class port info object. >> +* >> +* cap_mask_2 >> +* [in] CapabilityMask2 value to set. >> +* >> +* RETURN VALUES >> +* None >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_class_port_info_t >> +*********/ >> + >> +/****f* IBA Base: Types/ib_class_cap_mask2 >> +* NAME >> +* ib_class_cap_mask2 >> +* >> +* DESCRIPTION >> +* Get ClassPortInfo:CapabilityMask2. 
>> +* >> +* SYNOPSIS >> +*/ >> +static inline uint32_t OSM_API >> +ib_class_cap_mask2(IN const ib_class_port_info_t * const p_cpi) >> +{ >> + return (cl_ntoh32(p_cpi->cap_mask2_resp_time) >> 5); >> +} >> + >> +/* >> +* PARAMETERS >> +* p_cpi >> +* [in] Pointer to the class port info object. >> +* >> +* RETURN VALUES >> +* CapabilityMask2 of the ClassPortInfo. >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_class_port_info_t >> +*********/ >> + >> /****s* IBA Base: Types/ib_sm_info_t >> * NAME >> * ib_sm_info_t >> diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h >> index e635dcb..26ef067 100644 >> --- a/opensm/include/opensm/osm_base.h >> +++ b/opensm/include/opensm/osm_base.h >> @@ -661,6 +661,18 @@ typedef enum _osm_thread_state { >> #define OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED (1 << 13) >> /***********/ >> >> +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED >> +* Name >> +* OSM_CAP2_IS_QOS_SUPPORTED >> +* >> +* DESCRIPTION >> +* QoS is supported >> +* >> +* SYNOPSIS >> +*/ >> +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) >> +/***********/ >> + >> /****d* OpenSM: Base/osm_sm_state_t >> * NAME >> * osm_sm_state_t >> diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c >> index d5c9f82..96d8898 100644 >> --- a/opensm/opensm/osm_sa_class_port_info.c >> +++ b/opensm/opensm/osm_sa_class_port_info.c >> @@ -170,7 +170,7 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, >> } >> } >> rtv += 8; >> - p_resp_cpi->resp_time_val = rtv; >> + ib_class_set_resp_time_val(p_resp_cpi, rtv); >> p_resp_cpi->redir_gid = zero_gid; >> p_resp_cpi->redir_tc_sl_fl = 0; >> p_resp_cpi->redir_lid = 0; >> @@ -209,6 +209,8 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, >> p_resp_cpi->cap_mask = OSM_CAP_IS_SUBN_GET_SET_NOTICE_SUP | >> OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED; >> #endif >> + ib_class_set_cap_mask2(p_resp_cpi, OSM_CAP2_IS_QOS_SUPPORTED); >> + >> if 
(p_rcv->p_subn->opt.no_multicast_option != TRUE) >> p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; >> p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); >> diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c >> index 73933a3..de54f2d 100644 >> --- a/opensm/osmtest/osmtest.c >> +++ b/opensm/osmtest/osmtest.c >> @@ -713,10 +713,17 @@ ib_api_status_t osmtest_validate_sa_class_port_info(IN osmtest_t * const p_osmt) >> (ib_class_port_info_t *) ib_sa_mad_get_payload_ptr(p_resp_sa_madp); >> >> osm_log(&p_osmt->log, OSM_LOG_INFO, >> - "osmtest_validate_sa_class_port_info:\n-----------------------------\nSA Class Port Info:\n" >> - " base_ver:%u\n class_ver:%u\n cap_mask:0x%X\n resp_time_val:0x%X\n-----------------------------\n", >> + "osmtest_validate_sa_class_port_info:\n" >> + "-----------------------------\n" >> + "SA Class Port Info:\n" >> + " base_ver:%u\n" >> + " class_ver:%u\n" >> + " cap_mask:0x%X\n" >> + " cap_mask2:0x%X\n" >> + " resp_time_val:0x%X\n" >> + "-----------------------------\n", >> p_cpi->base_ver, p_cpi->class_ver, cl_ntoh16(p_cpi->cap_mask), >> - p_cpi->resp_time_val); >> + ib_class_cap_mask2(p_cpi), ib_class_resp_time_val(p_cpi)); >> >> Exit: >> #if 0 > From kliteyn at dev.mellanox.co.il Tue Oct 16 07:24:33 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 16 Oct 2007 16:24:33 +0200 Subject: [ofa-general] [PATCH V2] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit Message-ID: <4714C9A1.5010304@dev.mellanox.co.il> Adding ClassPortInfo:CapabilityMask2 field and turning on OSM QoS capability bit (OSM_CAP2_IS_QOS_SUPPORTED).
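As a rough illustration of the field layout this patch relies on, the sketch below packs a 27-bit CapabilityMask2 and a 5-bit response time value into one 32-bit network-order word. It is a standalone approximation, not the OpenSM code: htonl()/ntohl() stand in for cl_hton32()/cl_ntoh32(), the mask value 0x1F is an assumption inferred from the 5-bit shift (the definition of IB_CLASS_RESP_TIME_MASK is not shown in the patch), and the cpi_* function names are hypothetical. Each setter in this sketch clears its own field before OR-ing in the new value, so the other field is preserved.

```c
#include <stdint.h>
#include <arpa/inet.h>	/* htonl/ntohl stand in for cl_hton32/cl_ntoh32 */

/* Assumed layout: response time in the low 5 bits (host order),
 * CapabilityMask2 in the upper 27 bits. */
#define RESP_TIME_MASK  0x1Fu
#define CAP_MASK2_SHIFT 5

/* Hypothetical helper: replace the response time, keep CapabilityMask2. */
static inline uint32_t cpi_set_resp_time(uint32_t net_word, uint8_t val)
{
	return (net_word & htonl(~RESP_TIME_MASK)) |
	       htonl((uint32_t)val & RESP_TIME_MASK);
}

/* Hypothetical helper: extract the 5-bit response time. */
static inline uint8_t cpi_get_resp_time(uint32_t net_word)
{
	return (uint8_t)(ntohl(net_word) & RESP_TIME_MASK);
}

/* Hypothetical helper: replace CapabilityMask2, keep the response time. */
static inline uint32_t cpi_set_cap_mask2(uint32_t net_word, uint32_t cap_mask2)
{
	return (net_word & htonl(RESP_TIME_MASK)) |
	       htonl(cap_mask2 << CAP_MASK2_SHIFT);
}

/* Hypothetical helper: extract the 27-bit CapabilityMask2. */
static inline uint32_t cpi_get_cap_mask2(uint32_t net_word)
{
	return ntohl(net_word) >> CAP_MASK2_SHIFT;
}
```

With OSM_CAP2_IS_QOS_SUPPORTED defined as (1 << 1), passing it to a setter like cpi_set_cap_mask2() would place the QoS bit at bit 6 of the host-order word.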
Signed-off-by: Yevgeny Kliteynik --- infiniband-diags/src/saquery.c | 6 +- opensm/include/iba/ib_types.h | 137 +++++++++++++++++++++++++++++++- opensm/include/opensm/osm_base.h | 12 +++ opensm/opensm/osm_sa_class_port_info.c | 4 +- opensm/osmtest/osmtest.c | 13 +++- 5 files changed, 162 insertions(+), 10 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index a9a8da4..e17ec5a 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -262,7 +262,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) "\t\tBase version.............%d\n" "\t\tClass version............%d\n" "\t\tCapability mask..........0x%04X\n" - "\t\tResponse time value......0x%08X\n" + "\t\tCapability mask 2........0x%08X\n" + "\t\tResponse time value......0x%02X\n" "\t\tRedirect GID.............0x%s\n" "\t\tRedirect TC/SL/FL........0x%08X\n" "\t\tRedirect LID.............0x%04X\n" @@ -279,7 +280,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) class_port_info->base_ver, class_port_info->class_ver, cl_ntoh16(class_port_info->cap_mask), - class_port_info->resp_time_val, + ib_class_cap_mask2(class_port_info), + ib_class_resp_time_val(class_port_info), sprint_gid(&(class_port_info->redir_gid), gid_str, GID_STR_LEN), cl_ntoh32(class_port_info->redir_tc_sl_fl), cl_ntoh16(class_port_info->redir_lid), diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 0969755..3685007 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -3247,8 +3247,7 @@ typedef struct _ib_class_port_info { uint8_t base_ver; uint8_t class_ver; ib_net16_t cap_mask; - uint8_t reserved[3]; - uint8_t resp_time_val; + ib_net32_t cap_mask2_resp_time; ib_gid_t redir_gid; ib_net32_t redir_tc_sl_fl; ib_net16_t redir_lid; @@ -3275,8 +3274,9 @@ typedef struct _ib_class_port_info { * cap_mask * Supported capabilities of this management class. * -* resp_time_value -* Maximum expected response time. 
+* cap_mask2_resp_time +* Maximum expected response time and additional +* supported capabilities of this management class. * * redr_gid * GID to use for redirection, or zero @@ -3322,6 +3322,135 @@ typedef struct _ib_class_port_info { * *********/ +/****f* IBA Base: Types/ib_class_set_resp_time_val +* NAME +* ib_class_set_resp_time_val +* +* DESCRIPTION +* Set maximum expected response time. +* +* SYNOPSIS +*/ +static inline void OSM_API +ib_class_set_resp_time_val(IN ib_class_port_info_t * const p_cpi, + IN const uint8_t val) +{ + p_cpi->cap_mask2_resp_time = + (p_cpi->cap_mask2_resp_time & CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | + cl_hton32(val & IB_CLASS_RESP_TIME_MASK); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* val +* [in] Response time value to set. +* +* RETURN VALUES +* None +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + +/****f* IBA Base: Types/ib_class_resp_time_val +* NAME +* ib_class_resp_time_val +* +* DESCRIPTION +* Get response time value. +* +* SYNOPSIS +*/ +static inline uint8_t OSM_API +ib_class_resp_time_val(IN ib_class_port_info_t * const p_cpi) +{ + return (uint8_t)(cl_ntoh32(p_cpi->cap_mask2_resp_time) & + IB_CLASS_RESP_TIME_MASK); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* RETURN VALUES +* Response time value. +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + +/****f* IBA Base: Types/ib_class_set_cap_mask2 +* NAME +* ib_class_set_cap_mask2 +* +* DESCRIPTION +* Set ClassPortInfo:CapabilityMask2. +* +* SYNOPSIS +*/ +static inline void OSM_API +ib_class_set_cap_mask2(IN ib_class_port_info_t * const p_cpi, + IN const uint32_t cap_mask2) +{ + p_cpi->cap_mask2_resp_time = (p_cpi->cap_mask2_resp_time & + CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | + cl_hton32(cap_mask2 << 5); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* cap_mask2 +* [in] CapabilityMask2 value to set. 
+* +* RETURN VALUES +* None +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + +/****f* IBA Base: Types/ib_class_cap_mask2 +* NAME +* ib_class_cap_mask2 +* +* DESCRIPTION +* Get ClassPortInfo:CapabilityMask2. +* +* SYNOPSIS +*/ +static inline uint32_t OSM_API +ib_class_cap_mask2(IN const ib_class_port_info_t * const p_cpi) +{ + return (cl_ntoh32(p_cpi->cap_mask2_resp_time) >> 5); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* RETURN VALUES +* CapabilityMask2 of the ClassPortInfo. +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + /****s* IBA Base: Types/ib_sm_info_t * NAME * ib_sm_info_t diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index e635dcb..26ef067 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -661,6 +661,18 @@ typedef enum _osm_thread_state { #define OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED (1 << 13) /***********/ +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED +* Name +* OSM_CAP2_IS_QOS_SUPPORTED +* +* DESCRIPTION +* QoS is supported +* +* SYNOPSIS +*/ +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) +/***********/ + /****d* OpenSM: Base/osm_sm_state_t * NAME * osm_sm_state_t diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c index d5c9f82..96d8898 100644 --- a/opensm/opensm/osm_sa_class_port_info.c +++ b/opensm/opensm/osm_sa_class_port_info.c @@ -170,7 +170,7 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, } } rtv += 8; - p_resp_cpi->resp_time_val = rtv; + ib_class_set_resp_time_val(p_resp_cpi, rtv); p_resp_cpi->redir_gid = zero_gid; p_resp_cpi->redir_tc_sl_fl = 0; p_resp_cpi->redir_lid = 0; @@ -209,6 +209,8 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, p_resp_cpi->cap_mask = OSM_CAP_IS_SUBN_GET_SET_NOTICE_SUP | OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED; #endif + ib_class_set_cap_mask2(p_resp_cpi, OSM_CAP2_IS_QOS_SUPPORTED); + if 
(p_rcv->p_subn->opt.no_multicast_option != TRUE) p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c index 73933a3..de54f2d 100644 --- a/opensm/osmtest/osmtest.c +++ b/opensm/osmtest/osmtest.c @@ -713,10 +713,17 @@ ib_api_status_t osmtest_validate_sa_class_port_info(IN osmtest_t * const p_osmt) (ib_class_port_info_t *) ib_sa_mad_get_payload_ptr(p_resp_sa_madp); osm_log(&p_osmt->log, OSM_LOG_INFO, - "osmtest_validate_sa_class_port_info:\n-----------------------------\nSA Class Port Info:\n" - " base_ver:%u\n class_ver:%u\n cap_mask:0x%X\n resp_time_val:0x%X\n-----------------------------\n", + "osmtest_validate_sa_class_port_info:\n" + "-----------------------------\n" + "SA Class Port Info:\n" + " base_ver:%u\n" + " class_ver:%u\n" + " cap_mask:0x%X\n" + " cap_mask2:0x%X\n" + " resp_time_val:0x%X\n" + "-----------------------------\n", p_cpi->base_ver, p_cpi->class_ver, cl_ntoh16(p_cpi->cap_mask), - p_cpi->resp_time_val); + ib_class_cap_mask2(p_cpi), ib_class_resp_time_val(p_cpi)); Exit: #if 0 -- 1.5.1.4 From hrosenstock at xsigo.com Tue Oct 16 08:06:03 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 16 Oct 2007 08:06:03 -0700 Subject: [ofa-general] ibcheckerrrors/perfquery failure In-Reply-To: References: <4BAF8FA8-C252-4FD1-8784-2AACA967D4F4@lbl.gov> <1192222602.4962.49.camel@hrosenstock-ws.xsigo.com> <0CCC56F3-F5AF-436B-9C8F-097BAF56B02B@lbl.gov> <1192223741.4962.60.camel@hrosenstock-ws.xsigo.com> <1AECA938-DF36-412E-AC3A-4FD300324AEA@lbl.gov> <5130D6AB-FB4E-405B-A1D1-4F1B66C0354A@lbl.gov> <1192226362.4962.66.camel@hrosenstock-ws.xsigo.com> <1192227244.4962.72.camel@hrosenstock-ws.xsigo.com> <1192447897.4962.162.camel@hrosenstock-ws.xsigo.com> <6BEA33D7-DCD7-4809-A75D-47801FA3EA87@lbl.gov> <1192475859.4962.295.camel@hrosenstock-ws.xsigo.com> Message-ID: <1192547163.5921.36.camel@hrosenstock-ws.xsigo.com> Hi again Greg, On Mon, 
2007-10-15 at 12:26 -0700, Greg Kurtzer wrote:
> On Oct 15, 2007, at 12:17 PM, Hal Rosenstock wrote:
>
> > Hi Greg,
> >
> > On Mon, 2007-10-15 at 11:06 -0700, Greg Kurtzer wrote:
> >> Yes, the patch fixes perfquery so now it supports the "-a" option
> >> properly but I had to make some minor tweaks as I am running the
> >> released 1.3.2 version.
> >>
> >> I also disabled the IBWARN
> >
> > Does this somehow "get in the way" ?
>
> Well the scripts report on the warnings, thus the output gets rather
> messy. ;)
>
> >
> >> and also tweaked ibcheckerrs just enough
> >> so that ibcheckerrors is reporting properly now.
> >
> > Ah, I didn't try that. Good catch. BTW, that same change is applicable
> > to some other scripts.
> >
> >> Attached is the patch that includes both of the above modifications
> >> and integrates properly against the 1.3.2 released tree.
> >>
> >> Again, thank you. :)
> >
> > Thanks for testing this out :-)
>
> Anytime! I am glad to be able to help out. :)

I am working on a slightly modified approach which will not require
neutering the portnum 255 check in the scripts and will handle the new
IBWARN. Would you be willing to try this out ?

-- Hal

> Great work guys!
>
> --
> Greg Kurtzer
> gmk at lbl.gov
>
>
From fenkes at de.ibm.com Tue Oct 16 08:22:28 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Tue, 16 Oct 2007 17:22:28 +0200
Subject: [ofa-general] [PATCH 0/5] IB/ehca: SRQ and MR/MW fixes
Message-ID: <200710161722.29144.fenkes@de.ibm.com>

Here are some more fixes for the eHCA driver, fixing some problems we found
during internal system test.

[1/5] fixes the QP pointer determination for SRQ base QPs
[2/5] fixes a masking error in {,re}reg_phys_mr()
[3/5] fixes a bug in alloc_fmr() and simplifies some code
[4/5] refactors hca_cap_mr_pgsize and fixes a problem with ib_srp
[5/5] enables large page MRs by default

I built the patches on top of Roland's for-2.6.24 git branch. Please review
and queue them for 2.6.24-rc1 if you're okay with them.
Thanks! Cheers, Joachim -- Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2) Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany eMail: fenkes at de.ibm.com From fenkes at de.ibm.com Tue Oct 16 08:24:07 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 16 Oct 2007 17:24:07 +0200 Subject: [ofa-general] [PATCH 1/5] IB/ehca: Supply QP token for SRQ base QPs In-Reply-To: <200710161722.29144.fenkes@de.ibm.com> References: <200710161722.29144.fenkes@de.ibm.com> Message-ID: <200710161724.08286.fenkes@de.ibm.com> Because hardware reports the SRQ token in RWQEs of SRQ base QPs, supply the base QP token as SRQ token, so we can properly find the SRQ base QP. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index e2bd62b..de18264 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -451,7 +451,6 @@ static struct ehca_qp *internal_create_qp( has_srq = 1; parms.ext_type = EQPT_SRQBASE; parms.srq_qpn = my_srq->real_qp_num; - parms.srq_token = my_srq->token; } if (is_llqp && has_srq) { @@ -583,6 +582,9 @@ static struct ehca_qp *internal_create_qp( goto create_qp_exit1; } + if (has_srq) + parms.srq_token = my_qp->token; + parms.servicetype = ibqptype2servicetype(qp_type); if (parms.servicetype < 0) { ret = -EINVAL; -- 1.5.2 From fenkes at de.ibm.com Tue Oct 16 08:25:50 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 16 Oct 2007 17:25:50 +0200 Subject: [ofa-general] [PATCH 2/5] IB/ehca: Fix masking error in {, re}reg_phys_mr() In-Reply-To: <200710161722.29144.fenkes@de.ibm.com> References: <200710161722.29144.fenkes@de.ibm.com> Message-ID: <200710161725.50371.fenkes@de.ibm.com> Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 4 ++-- 
1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index da88738..16c9efd 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -259,7 +259,7 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, pginfo.u.phy.num_phys_buf = num_phys_buf; pginfo.u.phy.phys_buf_array = phys_buf_array; pginfo.next_hwpage = - ((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize; + ((u64)iova_start & ~PAGE_MASK) / hw_pgsize; ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags, e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, @@ -547,7 +547,7 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, pginfo.u.phy.num_phys_buf = num_phys_buf; pginfo.u.phy.phys_buf_array = phys_buf_array; pginfo.next_hwpage = - ((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize; + ((u64)iova_start & ~PAGE_MASK) / hw_pgsize; } if (mr_rereg_mask & IB_MR_REREG_ACCESS) new_acl = mr_access_flags; -- 1.5.2 From fenkes at de.ibm.com Tue Oct 16 08:26:54 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 16 Oct 2007 17:26:54 +0200 Subject: [ofa-general] [PATCH 3/5] IB/ehca: Fix ehca_encode_hwpage_size() and alloc_fmr() In-Reply-To: <200710161722.29144.fenkes@de.ibm.com> References: <200710161722.29144.fenkes@de.ibm.com> Message-ID: <200710161726.54648.fenkes@de.ibm.com> Simplify ehca_encode_hwpage_size(), fixing an infinite loop for pgsize == 0 in the process. Fix the bug in alloc_fmr() that triggered the loop. 
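A userspace sketch of the simplified encoding may help here. It assumes the four sizes implied by the WARN_ON bounds in the patch below (log2 of 12, 16, 20 and 24, that is 4K, 64K, 1M and 16M) and uses a small portable loop in place of the kernel's ilog2(). The function names are hypothetical, and invalid input returns -1 rather than triggering a warning.

```c
#include <stdint.h>

/* Portable stand-in for the kernel's ilog2(): index of the highest set bit.
 * Returns -1 for 0. */
static int ilog2_u32(uint32_t v)
{
	int log = -1;

	while (v) {
		v >>= 1;
		log++;
	}
	return log;
}

/* Map an MR page size to a hardware code: 0, 1, 2, 3 for 4K, 64K, 1M, 16M.
 * Returns -1 for 0, for sizes that are not a power of two, and for powers
 * of two outside that set. There is no loop that depends on pgsize having
 * a particular bit set, so pgsize == 0 cannot hang. */
static int encode_hwpage_size(uint32_t pgsize)
{
	int log = ilog2_u32(pgsize);

	if (pgsize & (pgsize - 1))	/* more than one bit set */
		return -1;
	if (log < 12 || log > 24 || (log & 3))
		return -1;
	return (log - 12) / 4;
}
```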
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 15 ++++----------- 1 files changed, 4 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 16c9efd..b9a788c 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -72,17 +72,9 @@ enum ehca_mr_pgsize { static u32 ehca_encode_hwpage_size(u32 pgsize) { - u32 idx = 0; - pgsize >>= 12; - /* - * map mr page size into hw code: - * 0, 1, 2, 3 for 4K, 64K, 1M, 64M - */ - while (!(pgsize & 1)) { - idx++; - pgsize >>= 4; - } - return idx; + int log = ilog2(pgsize); + WARN_ON(log < 12 || log > 24 || log & 3); + return (log - 12) / 4; } static u64 ehca_get_max_hwpage_size(struct ehca_shca *shca) @@ -826,6 +818,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, /* register MR on HCA */ memset(&pginfo, 0, sizeof(pginfo)); + pginfo.hwpage_size = hw_pgsize; /* * pginfo.num_hwpages==0, ie register_rpages() will not be called * but deferred to map_phys_fmr() -- 1.5.2 From fenkes at de.ibm.com Tue Oct 16 08:31:14 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 16 Oct 2007 17:31:14 +0200 Subject: [ofa-general] [PATCH 4/5] IB/ehca: Change meaning of hca_cap_mr_pgsize In-Reply-To: <200710161722.29144.fenkes@de.ibm.com> References: <200710161722.29144.fenkes@de.ibm.com> Message-ID: <200710161731.14577.fenkes@de.ibm.com> ehca_shca.hca_cap_mr_pgsize now contains all supported page sizes ORed together. This makes some checks easier to code and understand, plus we can return this value verbatim in query_hca(), fixing a problem with SRP (reported by Anton Blanchard -- thanks!). 
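To make the new meaning concrete, here is a small userspace sketch. It assumes the capability word simply has one bit set per supported page size in bytes (for example 0x1000 | 0x1000000 for 4K and 16M); the helper names are hypothetical and not part of the driver. It shows the two operations that become simple with this representation: taking the largest supported size, and stepping a desired size down in factors of 16 until it is supported.

```c
#include <stdint.h>

#define PGSIZE_4K 0x1000u

/* Largest supported page size: the highest set bit of the capability word.
 * Returns 0 if no size is advertised. */
static uint32_t max_hwpage_size(uint32_t cap)
{
	uint32_t size = 0x80000000u;

	if (!cap)
		return 0;
	while (!(size & cap))
		size >>= 1;
	return size;
}

/* Shrink 'want' by factors of 16 until it is a supported size.
 * The loop stops at 4K at the latest, so it terminates even if the
 * capability word is malformed; 4K is assumed to be always usable. */
static uint32_t pick_hwpage_size(uint32_t want, uint32_t cap)
{
	while (want > PGSIZE_4K && !(want & cap))
		want >>= 4;
	return want;
}
```

Because the word holds the sizes themselves rather than firmware flag bits, a value of this form can be compared directly against a requested size such as 1 << page_shift.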
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 1 - drivers/infiniband/hw/ehca/ehca_hca.c | 1 + drivers/infiniband/hw/ehca/ehca_main.c | 18 ++++++++++++- drivers/infiniband/hw/ehca/ehca_mrmw.c | 38 ++++++++++++++-------------- 4 files changed, 36 insertions(+), 22 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 0f7a55d..365bc5d 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -323,7 +323,6 @@ extern int ehca_static_rate; extern int ehca_port_act_time; extern int ehca_use_hp_mr; extern int ehca_scaling_code; -extern int ehca_mr_largepage; struct ipzu_queue_resp { u32 qe_size; /* queue entry size */ diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 4aa3ffa..15806d1 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -77,6 +77,7 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) } memset(props, 0, sizeof(struct ib_device_attr)); + props->page_size_cap = shca->hca_cap_mr_pgsize; props->fw_ver = rblock->hw_ver; props->max_mr_size = rblock->max_mr_size; props->vendor_id = rblock->vendor_id >> 8; diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 403467f..d477dc3 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -260,13 +260,20 @@ static struct cap_descr { { HCA_CAP_MINI_QP, "HCA_CAP_MINI_QP" }, }; -int ehca_sense_attributes(struct ehca_shca *shca) +static int ehca_sense_attributes(struct ehca_shca *shca) { int i, ret = 0; u64 h_ret; struct hipz_query_hca *rblock; struct hipz_query_port *port; + static const u32 pgsize_map[] = { + HCA_CAP_MR_PGSIZE_4K, 0x1000, + HCA_CAP_MR_PGSIZE_64K, 0x10000, + HCA_CAP_MR_PGSIZE_1M, 0x100000, + HCA_CAP_MR_PGSIZE_16M, 0x1000000, + }; + rblock = 
ehca_alloc_fw_ctrlblock(GFP_KERNEL); if (!rblock) { ehca_gen_err("Cannot allocate rblock memory."); @@ -329,8 +336,15 @@ int ehca_sense_attributes(struct ehca_shca *shca) if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap)) ehca_gen_dbg(" %s", hca_cap_descr[i].descr); - shca->hca_cap_mr_pgsize = rblock->memory_page_size_supported; + /* translate supported MR page sizes; always support 4K */ + shca->hca_cap_mr_pgsize = EHCA_PAGESIZE; + if (ehca_mr_largepage) { /* support extra sizes only if enabled */ + for (i = 0; i < ARRAY_SIZE(pgsize_map); i += 2) + if (rblock->memory_page_size_supported & pgsize_map[i]) + shca->hca_cap_mr_pgsize |= pgsize_map[i + 1]; + } + /* query max MTU from first port -- it's the same for all ports */ port = (struct hipz_query_port *)rblock; h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port); if (h_ret != H_SUCCESS) { diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index b9a788c..bb97915 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -79,9 +79,7 @@ static u32 ehca_encode_hwpage_size(u32 pgsize) static u64 ehca_get_max_hwpage_size(struct ehca_shca *shca) { - if (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M) - return EHCA_MR_PGSIZE16M; - return EHCA_MR_PGSIZE4K; + return 1UL << ilog2(shca->hca_cap_mr_pgsize); } static struct ehca_mr *ehca_mr_new(void) @@ -288,7 +286,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, container_of(pd->device, struct ehca_shca, ib_device); struct ehca_pd *e_pd = container_of(pd, struct ehca_pd, ib_pd); struct ehca_mr_pginfo pginfo; - int ret; + int ret, page_shift; u32 num_kpages; u32 num_hwpages; u64 hwpage_size; @@ -343,19 +341,20 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, /* determine number of MR pages */ num_kpages = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE); /* select proper hw_pgsize */ - if (ehca_mr_largepage && - 
(shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)) { - int page_shift = PAGE_SHIFT; - if (e_mr->umem->hugetlb) { - /* determine page_shift, clamp between 4K and 16M */ - page_shift = (fls64(length - 1) + 3) & ~3; - page_shift = min(max(page_shift, EHCA_MR_PGSHIFT4K), - EHCA_MR_PGSHIFT16M); - } - hwpage_size = 1UL << page_shift; - } else - hwpage_size = EHCA_MR_PGSIZE4K; /* ehca1 only supports 4k */ - ehca_dbg(pd->device, "hwpage_size=%lx", hwpage_size); + page_shift = PAGE_SHIFT; + if (e_mr->umem->hugetlb) { + /* determine page_shift, clamp between 4K and 16M */ + page_shift = (fls64(length - 1) + 3) & ~3; + page_shift = min(max(page_shift, EHCA_MR_PGSHIFT4K), + EHCA_MR_PGSHIFT16M); + } + hwpage_size = 1UL << page_shift; + + /* now that we have the desired page size, shift until it's + * supported, too. 4K is always supported, so this terminates. + */ + while (!(hwpage_size & shca->hca_cap_mr_pgsize)) + hwpage_size >>= 4; reg_user_mr_fallback: num_hwpages = NUM_CHUNKS((virt % hwpage_size) + length, hwpage_size); @@ -801,8 +800,9 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, ib_fmr = ERR_PTR(-EINVAL); goto alloc_fmr_exit0; } - hw_pgsize = ehca_get_max_hwpage_size(shca); - if ((1 << fmr_attr->page_shift) != hw_pgsize) { + + hw_pgsize = 1 << fmr_attr->page_shift; + if (!(hw_pgsize & shca->hca_cap_mr_pgsize)) { ehca_err(pd->device, "unsupported fmr_attr->page_shift=%x", fmr_attr->page_shift); ib_fmr = ERR_PTR(-EINVAL); -- 1.5.2 From fenkes at de.ibm.com Tue Oct 16 08:31:59 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Tue, 16 Oct 2007 17:31:59 +0200 Subject: [ofa-general] [PATCH 5/5] IB/ehca: Enable large page MRs by default In-Reply-To: <200710161722.29144.fenkes@de.ibm.com> References: <200710161722.29144.fenkes@de.ibm.com> Message-ID: <200710161731.59688.fenkes@de.ibm.com> Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_main.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c 
b/drivers/infiniband/hw/ehca/ehca_main.c
index d477dc3..2f51c13 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -65,7 +65,7 @@ int ehca_port_act_time = 30;
 int ehca_poll_all_eqs = 1;
 int ehca_static_rate = -1;
 int ehca_scaling_code = 0;
-int ehca_mr_largepage = 0;
+int ehca_mr_largepage = 1;

 module_param_named(open_aqp1, ehca_open_aqp1, int, S_IRUGO);
 module_param_named(debug_level, ehca_debug_level, int, S_IRUGO);
-- 
1.5.2

From sweitzen at cisco.com Tue Oct 16 09:19:52 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 16 Oct 2007 09:19:52 -0700
Subject: [ofa-general] Sockets Direct
In-Reply-To: <1192538645.4558.27.camel@e521.site>
References: <1192538645.4558.27.camel@e521.site>
Message-ID:

You should use AF_INET_SDP, although it can be hard to find; see
https://bugs.openfabrics.org//show_bug.cgi?id=25.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> John Leidel
> Sent: Tuesday, October 16, 2007 5:44 AM
> To: openib-general at openib.org
> Subject: [ofa-general] Sockets Direct
>
> All, when writing applications to arbitrarily use SDP, which of the
> address family designations do I use:
>
> AF_INET_OFFLOAD
>
> *OR*
>
> AF_INET_SDP
>
>
> cheers
> john
>

From changquing.tang at hp.com Tue Oct 16 09:32:49 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Tue, 16 Oct 2007 16:32:49 -0000
Subject: [ofa-general] librdmacm port selection for rdma_bind_addr()
Message-ID: <349DCDA352EACF42A0C49FA6DCEA84030287C8DD@G3W0634.americas.hpqcorp.net>

Sean:
Is there a way to let the system choose a port for me? With TCP/IP, if
the port is set to 0, the system returns an unused port.

Thanks.

--CQ

From mshefty at ichips.intel.com Tue Oct 16 09:40:16 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 16 Oct 2007 09:40:16 -0700
Subject: [ofa-general] librdmacm port selection for rdma_bind_addr()
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA84030287C8DD@G3W0634.americas.hpqcorp.net>
References: <349DCDA352EACF42A0C49FA6DCEA84030287C8DD@G3W0634.americas.hpqcorp.net>
Message-ID: <4714E970.3060507@ichips.intel.com>

> Is there a way to let the system choose a port for me? With TCP/IP, if
> the port is set to 0, the system returns an unused port.

Yes - binding to port 0 will return a usable port.

From sean.hefty at intel.com Tue Oct 16 09:59:39 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 16 Oct 2007 09:59:39 -0700
Subject: [ofa-general] [PATCH] librdmacm/man: update man pages to clarify connection request params
Message-ID: <000001c81015$f09a70b0$3c98070a@amr.corp.intel.com>

Document connection request parameters in rdma_connect(), rdma_accept(),
and rdma_get_cm_event(), specifically regarding initiator_depth and
responder_resources.

Signed-off-by: Sean Hefty
---
Doug, these are the updates to the man pages that I've made to better
document some of the parameters based on your feedback. I will look at some
code changes to try to trap errors setting initiator_depth and
responder_resources earlier in the connection setup.

I will request that these changes go into OFED 1.3.

The only other request I can recall was adding the ability to migrate an id
to another event channel. I still need to look into this more, but this
would miss OFED 1.3.
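The accept-side constraint that the rdma_accept.3 text in this patch describes can be sketched in a few lines. The code below is illustrative only: it does not call librdmacm, the struct and function names are hypothetical, and the two limits are passed in as plain integers standing for the local device attributes max_qp_rd_atom and max_qp_init_rd_atom. It picks values that are no larger than either the local limit or the value reported in the connect request event.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two rdma_conn_param fields discussed here. */
struct rd_atom_params {
	uint8_t responder_resources;
	uint8_t initiator_depth;
};

/* Return the smaller of the requested value and the device limit.
 * A negative limit is treated as 0. */
static uint8_t clamp_u8(uint8_t requested, int device_limit)
{
	if (device_limit < 0)
		device_limit = 0;
	return requested <= device_limit ? requested : (uint8_t)device_limit;
}

/* Choose accept-side values from the values reported in the connect
 * request event and the local device limits (both assumed inputs). */
static struct rd_atom_params
choose_accept_params(struct rd_atom_params from_event,
		     int max_qp_rd_atom, int max_qp_init_rd_atom)
{
	struct rd_atom_params p;

	p.responder_resources =
		clamp_u8(from_event.responder_resources, max_qp_rd_atom);
	p.initiator_depth =
		clamp_u8(from_event.initiator_depth, max_qp_init_rd_atom);
	return p;
}
```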
man/rdma_accept.3 | 22 +++++++++--- man/rdma_ack_cm_event.3 | 3 +- man/rdma_connect.3 | 12 +++++-- man/rdma_get_cm_event.3 | 83 ++++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 109 insertions(+), 11 deletions(-) diff --git a/man/rdma_accept.3 b/man/rdma_accept.3 index e7c3788..c0c12d8 100644 --- a/man/rdma_accept.3 +++ b/man/rdma_accept.3 @@ -11,7 +11,8 @@ rdma_accept \- Called to accept a connection request. .IP "id" 12 Connection identifier associated with the request. .IP "conn_param" 12 -Information needed to establish the connection. +Information needed to establish the connection. See CONNECTION PROPERTIES +below for details. .SH "DESCRIPTION" Called from the listening side to accept a connection or datagram service lookup request. @@ -25,13 +26,16 @@ rdma_accept is called on the new rdma_cm_id. .SH "CONNECTION PROPERTIES" The following properties are used to configure the communication and specified by the conn_param parameter when accepting a connection or datagram -communication request. Users should use the conn_param values reported in -the connection request event to determine appropriate values for these fields -when accepting. +communication request. Users should use the rdma_conn_param values reported +in the connection request event to determine appropriate values for these +fields when accepting. Users may reference the rdma_conn_param structure in +the connection event directly, or can reference their own structure. If the +rdma_conn_param structure from an event is referenced, the event must not be +acked until after this call returns. .IP private_data References a user-controlled data buffer. The contents of the buffer are -transparently passed to the remote side as part of the communication request. -May be NULL if private_data is not required. +copied and transparently passed to the remote side as part of the +communication request. May be NULL if private_data is not required. 
.IP private_data_len Specifies the size of the user-controlled data buffer. Note that the actual amount of data transferred to the remote side is transport dependent and may @@ -39,9 +43,15 @@ be larger than that requested. .IP responder_resources The maximum number of outstanding RDMA read and atomic operations that the local side will accept from the remote side. Applies only to RDMA_PS_TCP. +This value must be less than or equal to the local RDMA device attribute +max_qp_rd_atom and the responder_resources value reported in the connect +request event. .IP initiator_depth The maximum number of outstanding RDMA read and atomic operations that the local side will have to the remote side. Applies only to RDMA_PS_TCP. +This value must be less than or equal to the local RDMA device attribute +max_qp_init_rd_atom and the initiator_depth value reported in the connect +request event. .IP flow_control Specifies if hardware flow control should be used. Applies only to RDMA_PS_TCP. .IP retry_count diff --git a/man/rdma_ack_cm_event.3 b/man/rdma_ack_cm_event.3 index 20ccd9c..3c24357 100644 --- a/man/rdma_ack_cm_event.3 +++ b/man/rdma_ack_cm_event.3 @@ -12,6 +12,7 @@ Event to be released. .SH "DESCRIPTION" All events which are allocated by rdma_get_cm_event must be released, there should be a one-to-one correspondence between successful gets -and acks. +and acks. This call frees the event structure and any memory that it +references. .SH "SEE ALSO" rdma_get_cm_event(3), rdma_destroy_id(3) diff --git a/man/rdma_connect.3 b/man/rdma_connect.3 index a0a9095..71d5594 100644 --- a/man/rdma_connect.3 +++ b/man/rdma_connect.3 @@ -11,7 +11,7 @@ rdma_connect \- Initiate an active connection request. .IP "id" 12 RDMA identifier. .IP "conn_param" 12 -connection parameters. +connection parameters. See CONNECTION PROPERTIES below for details. .SH "DESCRIPTION" For an rdma_cm_id of type RDMA_PS_TCP, this call initiates a connection request to a remote destination. 
For an rdma_cm_id of type RDMA_PS_UDP, it initiates @@ -25,8 +25,8 @@ by the conn_param parameter when connecting or establishing datagram communication. .IP private_data References a user-controlled data buffer. The contents of the buffer are -transparently passed to the remote side as part of the communication request. -May be NULL if private_data is not required. +copied and transparently passed to the remote side as part of the +communication request. May be NULL if private_data is not required. .IP private_data_len Specifies the size of the user-controlled data buffer. Note that the actual amount of data transferred to the remote side is transport dependent and may @@ -34,9 +34,15 @@ be larger than that requested. .IP responder_resources The maximum number of outstanding RDMA read and atomic operations that the local side will accept from the remote side. Applies only to RDMA_PS_TCP. +This value must be less than or equal to the local RDMA device attribute +max_qp_rd_atom and remote RDMA device attribute max_qp_init_rd_atom. The +remote endpoint can adjust this value when accepting the connection. .IP initiator_depth The maximum number of outstanding RDMA read and atomic operations that the local side will have to the remote side. Applies only to RDMA_PS_TCP. +This value must be less than or equal to the local RDMA device attribute +max_qp_init_rd_atom and remote RDMA device attribute max_qp_rd_atom. The +remote endpoint can adjust this value when accepting the connection. .IP flow_control Specifies if hardware flow control should be used. Applies only to RDMA_PS_TCP. .IP retry_count diff --git a/man/rdma_get_cm_event.3 b/man/rdma_get_cm_event.3 index 252a7ab..987ead5 100644 --- a/man/rdma_get_cm_event.3 +++ b/man/rdma_get_cm_event.3 @@ -21,7 +21,88 @@ modifying the file descriptor associated with the given channel. All events that are reported must be acknowledged by calling rdma_ack_cm_event. 
Destruction of an rdma_cm_id will block until related events have been acknowledged. -.SH "EVENTS" +.SH "EVENT DATA" +Communication event details are returned in the rdma_cm_event structure. +This structure is allocated by the rdma_cm and released by the +rdma_ack_cm_event routine. Details of the rdma_cm_event structure are +given below. +.IP "id" 12 +The rdma_cm identifier associated with the event. If the event type is +RDMA_CM_EVENT_CONNECT_REQUEST, then this references a new id for that +communication. +.IP "listen_id" 12 +For RDMA_CM_EVENT_CONNECT_REQUEST event types, this references the +corresponding listening request identifier. +.IP "event" 12 +Specifies the type of communication event which occurred. See EVENT TYPES +below. +.IP "status" 12 +Returns any asynchronous error information associated with an event. The +status is zero unless the corresponding operation failed. +.IP "param" 12 +Provides additional details based on the type of event. Users should +select the conn or ud subfields based on the rdma_port_space of the +rdma_cm_id associated with the event. See UD EVENT DATA and CONN EVENT +DATA below. +.SH "UD EVENT DATA" +Event parameters related to unreliable datagram (UD) services: RDMA_PS_UDP and +RDMA_PS_IPOIB. The UD event data is valid for RDMA_CM_EVENT_ESTABLISHED and +RDMA_CM_EVENT_MULTICAST_JOIN events, unless stated otherwise. +.IP "private_data" 12 +References any user-specified data associated with RDMA_CM_EVENT_CONNECT_REQUEST +or RDMA_CM_EVENT_ESTABLISHED events. The data referenced by this field matches +that specified by the remote side when calling rdma_connect or rdma_accept. +This field is NULL if the event does not include private data. The buffer +referenced by this pointer is deallocated when calling rdma_ack_cm_event. +.IP "private_data_len" 12 +The size of the private data buffer. Users should note that the size of +the private data buffer may be larger than the amount of private data +sent by the remote side. 
Any additional space in the buffer will be +zeroed out. +.IP "ah_attr" 12 +Address information needed to send data to the remote endpoint(s). +Users should use this structure when allocating their address handle. +.IP "qp_num" 12 +QP number of the remote endpoint or multicast group. +.IP "qkey" 12 +QKey needed to send data to the remote endpoint(s). +.SH "CONN EVENT DATA" +Event parameters related to connected QP services: RDMA_PS_TCP. The +connection related event data is valid for RDMA_CM_EVENT_CONNECT_REQUEST +and RDMA_CM_EVENT_ESTABLISHED events, unless stated otherwise. +.IP "private_data" 12 +References any user-specified data associated with the event. The data +referenced by this field matches that specified by the remote side when +calling rdma_connect or rdma_accept. This field is NULL if the event +does not include private data. The buffer referenced by this pointer is +deallocated when calling rdma_ack_cm_event. +.IP "private_data_len" 12 +The size of the private data buffer. Users should note that the size of +the private data buffer may be larger than the amount of private data +sent by the remote side. Any additional space in the buffer will be +zeroed out. +.IP "responder_resources" 12 +The number of responder resources requested of the recipient. +This field matches the initiator depth specified by the remote node when +calling rdma_connect and rdma_accept. +.IP "initiator_depth" 12 +The maximum number of outstanding RDMA read/atomic operations +that the recipient may have outstanding. This field matches the responder +resources specified by the remote node when calling rdma_connect and +rdma_accept. +.IP "flow_control" 12 +Indicates if hardware level flow control is provided. +.IP "retry_count" 12 +For RDMA_CM_EVENT_CONNECT_REQUEST events only, indicates the number of times +that the recipient should retry send operations. +.IP "rnr_retry_count" 12 +The number of times that the recipient should retry receiver not ready (RNR) +NACK errors. 
+.IP "srq" 12
+Specifies if the sender is using a shared-receive queue.
+.IP "qp_num" 12
+Indicates the remote QP number for the connection.
+.SH "EVENT TYPES"
 The following types of communication events may be reported.
 .IP RDMA_CM_EVENT_ADDR_RESOLVED
 Address resolution (rdma_resolve_addr) completed successfully.

From transter at gmail.com Tue Oct 16 10:40:38 2007
From: transter at gmail.com (lbt)
Date: Tue, 16 Oct 2007 10:40:38 -0700
Subject: [ofa-general] Missing IB_EVENT_PATH_MIG events
In-Reply-To: <471456EA.3060403@dev.mellanox.co.il>
References: <471456EA.3060403@dev.mellanox.co.il>
Message-ID:

Thanks for your reply Dotan!

The timeout is set to 16. Here is some more info. Please let me know if
there is any other info I can provide.

Setup:
- 2 Nodes, each has a dual-port HCA (board_id: MT_0150000001, InfiniHost III
  firmware 25218, v. 5.2.0) - this is the latest Mellanox firmware I believe
- port 1 of each node is connected to one IB switch, and likewise for port 2
  --> thus have 2 separate IB subnets, providing 2 possible paths between
  the 2 nodes
- IB switch is InfiniScale MT43132 **
- Using OFED 1.2 driver stack

Our software creates RCQPs between 2 nodes, with primary and alternate path
specified.

Test does the following:
Using 10 RCQPs
1. Hardware triggered migration by bringing down the port of the primary
   path (haven't ever seen a problem with the hardware triggered migrations)
2. Restore the port --> reloads alternate path
   - Local QPs send LAP
   - Remote QPs reply with APR
3. Redistributes RCQP's across both ports for load balancing using software
   triggered migrations for the RCQPs selected for migration.
   a. Local QPs: use ib_modify_qp to trigger migration --> get
      IB_EVENT_PATH_MIG on local QPs
   b. Remote QPs: IB_EVENT_PATH_MIG
   c. Local QPs: after software-triggered migration completes, reloads
      alternate path by sending LAP
   d. Remote QPs: reply with APR

Keep doing this in a loop.
The issue is that in 3b, not all the remote QP's report an IB_EVENT for the
path migration triggered in 3a. I noticed that when this happens it's usually
in the first and/or second cycle (subsequent cycles don't manifest this
issue), and it occurs on the last RCQP's that were migrated in 3a.

BTW: Do you know if there is a way I can determine/dump which events are in
the Event Queue?

Thanks again!
Lan

On 10/15/07, Dotan Barak wrote:
>
> Hi.
>
> lbt wrote:
> > Hi,
> >
> > I'm trying out APM with OFED 1.2, using Mellanox dual-port HCA
> > (ib_mthca driver). When I have several RCQP's that I am trying to
> > migrate (software triggered migration using ib_modify_qp), I've
> > noticed that sometimes 1 or 2 of the remote QP's never generate an
> > IB_EVENT_PATH_MIG or even an IB_EVENT_PATH_MIG_ERR ... it seems that
> > it just gets lost. I looked through some of the ib_mthca patches in
> > git.kernel.org/?p=linux/kernel/git/roland/infiniband.git, and
> > incorporated the mmiowb patch for ib_mthca commands
> > (http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=76d7cc0345a037e8eea426f8abc710abd22946dd).
> > But still seeing same issue. I have a test case that repeats
> > software-triggered migrations + rearming in a loop, and this problem
> > usually occurs in the first few cycles, but is not too frequent. If
> > anyone has any ideas on what might be wrong, or tips on where I can
> > look/do to debug this, that would be very much appreciated!
> >
> > For example, this is the console output I will see (printed out by our
> > rcqp event handler):
> > On the local end - initiates software triggered migration, using
> > ib_modify_qp:
> > Event IB_EVENT_PATH_MIG occurred on QP#1043
> > Event IB_EVENT_PATH_MIG occurred on QP#1040
> > Event IB_EVENT_PATH_MIG occurred on QP#1033
> >
> > On the remote end:
> > Event IB_EVENT_PATH_MIG occurred on QP#1040
> > Event IB_EVENT_PATH_MIG occurred on QP#1043
> Is the timeout value (in the QP attributes) 0?
> If the answer is no, can you please supply some more details on this?
>
> thanks
> Dotan

From hrosenstock at xsigo.com Tue Oct 16 11:18:00 2007
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Tue, 16 Oct 2007 11:18:00 -0700
Subject: [ofa-general] [RFC/PATCHv2] infiniband-diags: Support PMAs which don't support AllPortSelect option
Message-ID: <1192558680.5921.59.camel@hrosenstock-ws.xsigo.com>

infiniband-diags: Support PMAs which don't support AllPortSelect option

Currently only support single port HCAs but can be extended for other
devices if needed

Signed-off-by: Hal Rosenstock

diff --git a/infiniband-diags/scripts/ibcheckerrs.in b/infiniband-diags/scripts/ibcheckerrs.in
index aa29525..1a2d228 100644
--- a/infiniband-diags/scripts/ibcheckerrs.in
+++ b/infiniband-diags/scripts/ibcheckerrs.in
@@ -186,6 +186,8 @@ BEGIN {
 /^CounterSelect/ {next}
+/AllPortSelect/ {next}
+
 /^ib/ {print $0}
 /ibpanic:/ {print $0}
 /ibwarn:/ {print $0}
diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c
index 148e452..17aafb6 100644
--- a/infiniband-diags/src/perfquery.c
+++ b/infiniband-diags/src/perfquery.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved.
+ * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.
You may choose to be licensed under the terms of the GNU @@ -42,7 +43,7 @@ #include #include -#define __BUILD_VERSION_TAG__ 1.2.2 +#define __BUILD_VERSION_TAG__ 1.2.3 #include #include #include @@ -99,6 +100,9 @@ main(int argc, char **argv) int ca_port = 0; int extended = 0; uint16_t cap_mask; + int allports = 0; + int node_type, num_ports; + uint8_t data[IB_SMP_DATA_SIZE]; static char const str_opts[] = "C:P:s:t:dGearRVhu"; static const struct option long_opts[] = { @@ -191,6 +195,35 @@ main(int argc, char **argv) /* PerfMgt ClassPortInfo is a required attribute */ if (!perf_classportinfo_query(pc, &portid, port, timeout)) IBERROR("classportinfo query"); + /* ClassPortInfo should be supported as part of libibmad */ + memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ + cap_mask = ntohs(cap_mask); + if (!(cap_mask & 0x100)) /* bit 8 is AllPortSelect */ + if (port == 255) { + allports = 1; + IBWARN("AllPortSelect not supported"); + } + + if (allports == 1) { + + /* + * Simulate all ports support in PMA + * Determine node type, number of (physical) ports, + * and, if switch, whether SP0 is enhanced + * to determine first and last port to query + */ + + /* For now, support single port CAs */ + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) + IBERROR("smp query nodeinfo failed"); + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); + if (node_type != IB_NODE_CA) /* NodeType other than CA ? 
*/ + IBERROR("smp query nodeinfo: Node type not CA"); + mad_decode_field(data, IB_NODE_NPORTS_F, &num_ports); + if (num_ports != 1) + IBERROR("smp query nodeinfo: %d ports; only 1 supported currently", num_ports); + port = num_ports; + } if (reset_only) goto do_reset; @@ -199,17 +232,20 @@ main(int argc, char **argv) if (!port_performance_query(pc, &portid, port, timeout)) IBERROR("perfquery"); + if (allports == 1) + pc[1] = 255; /* fake PortSelect */ + mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); } else { - /* Should ClassPortInfo be implemented in libibmad ? */ - memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ - cap_mask = ntohs(cap_mask); if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask); if (!port_performance_ext_query(pc, &portid, port, timeout)) IBERROR("perfextquery"); + if (allports == 1) + pc[1] = 255; /* fake PortSelect */ + mad_dump_perfcounters_ext(buf, sizeof buf, pc, sizeof pc); } From hrosenstock at xsigo.com Tue Oct 16 11:20:29 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 16 Oct 2007 11:20:29 -0700 Subject: [Fwd: [ofa-general] [RFC/PATCHv2] infiniband-diags: Support PMAs which don't support AllPortSelect option] Message-ID: <1192558829.5921.63.camel@hrosenstock-ws.xsigo.com> Greg, Can you try this version out ? This is again against OFED 1.3. Differences from previous version are: 1. In perfquery.c, two places where PortSelect is faked out when all ports is being "simulated". 2. Trivial mod to ibcheckerrs.in to eliminate AllPortSelect warning. Thanks! 
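
For anyone reading along, the capability check that the patch adds to perfquery.c can be sketched on its own. This is only an illustrative sketch, not code from infiniband-diags or libibmad: the macro and helper names below are hypothetical. Only the byte offset (2), the ntohs() conversion, the 0x100 AllPortSelect bit and the port value 255 are taken from the patch above.

```c
#include <arpa/inet.h>  /* ntohs */
#include <stdint.h>
#include <string.h>

/* Hypothetical names; the values are the ones used in the patch. */
#define PMA_CAP_ALLPORTSELECT 0x100  /* CapabilityMask bit 8 */
#define PMA_ALL_PORTS         255    /* PortSelect value meaning "all ports" */

/*
 * Read the 16-bit CapabilityMask stored in network byte order at byte
 * offset 2 of a PerfMgt ClassPortInfo buffer, as the patch does with pc+2.
 */
static uint16_t pma_cap_mask(const uint8_t *cpi)
{
	uint16_t mask;

	memcpy(&mask, cpi + 2, sizeof(mask));
	return ntohs(mask);
}

/*
 * Return 1 when the caller asked for all ports but the PMA does not
 * advertise AllPortSelect, i.e. when all-ports support must be simulated.
 */
static int pma_must_simulate_allports(const uint8_t *cpi, int port)
{
	return port == PMA_ALL_PORTS &&
	       !(pma_cap_mask(cpi) & PMA_CAP_ALLPORTSELECT);
}
```

In the patch itself the same condition sets allports = 1, after which the query is issued to the single physical port and PortSelect is set back to 255 in the reply buffer before it is dumped.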
-- Hal -------- Forwarded Message -------- From: Hal Rosenstock To: Greg Kurtzer Cc: general at lists.openfabrics.org Subject: [ofa-general] [RFC/PATCHv2] infiniband-diags: Support PMAs which don't support AllPortSelect option Date: Tue, 16 Oct 2007 11:18:00 -0700 infiniband-diags: Support PMAs which don't support AllPortSelect option Currently only support single port HCAs but can be extended for other devices if needed Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/scripts/ibcheckerrs.in b/infiniband-diags/scripts/ibcheckerrs.in index aa29525..1a2d228 100644 --- a/infiniband-diags/scripts/ibcheckerrs.in +++ b/infiniband-diags/scripts/ibcheckerrs.in @@ -186,6 +186,8 @@ BEGIN { /^CounterSelect/ {next} +/AllPortSelect/ {next} + /^ib/ {print $0} /ibpanic:/ {print $0} /ibwarn:/ {print $0} diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 148e452..17aafb6 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -42,7 +43,7 @@ #include #include -#define __BUILD_VERSION_TAG__ 1.2.2 +#define __BUILD_VERSION_TAG__ 1.2.3 #include #include #include @@ -99,6 +100,9 @@ main(int argc, char **argv) int ca_port = 0; int extended = 0; uint16_t cap_mask; + int allports = 0; + int node_type, num_ports; + uint8_t data[IB_SMP_DATA_SIZE]; static char const str_opts[] = "C:P:s:t:dGearRVhu"; static const struct option long_opts[] = { @@ -191,6 +195,35 @@ main(int argc, char **argv) /* PerfMgt ClassPortInfo is a required attribute */ if (!perf_classportinfo_query(pc, &portid, port, timeout)) IBERROR("classportinfo query"); + /* ClassPortInfo should be supported as part of libibmad */ + memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ + cap_mask = ntohs(cap_mask); + if (!(cap_mask & 0x100)) /* bit 8 is AllPortSelect */ + if (port == 255) { + allports = 1; + IBWARN("AllPortSelect not supported"); + } + + if (allports == 1) { + + /* + * Simulate all ports support in PMA + * Determine node type, number of (physical) ports, + * and, if switch, whether SP0 is enhanced + * to determine first and last port to query + */ + + /* For now, support single port CAs */ + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) + IBERROR("smp query nodeinfo failed"); + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); + if (node_type != IB_NODE_CA) /* NodeType other than CA ? 
*/ + IBERROR("smp query nodeinfo: Node type not CA"); + mad_decode_field(data, IB_NODE_NPORTS_F, &num_ports); + if (num_ports != 1) + IBERROR("smp query nodeinfo: %d ports; only 1 supported currently", num_ports); + port = num_ports; + } if (reset_only) goto do_reset; @@ -199,17 +232,20 @@ main(int argc, char **argv) if (!port_performance_query(pc, &portid, port, timeout)) IBERROR("perfquery"); + if (allports == 1) + pc[1] = 255; /* fake PortSelect */ + mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); } else { - /* Should ClassPortInfo be implemented in libibmad ? */ - memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ - cap_mask = ntohs(cap_mask); if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask); if (!port_performance_ext_query(pc, &portid, port, timeout)) IBERROR("perfextquery"); + if (allports == 1) + pc[1] = 255; /* fake PortSelect */ + mad_dump_perfcounters_ext(buf, sizeof buf, pc, sizeof pc); } _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From suri at baymicrosystems.com Tue Oct 16 11:53:32 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Tue, 16 Oct 2007 14:53:32 -0400 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> Message-ID: <017601c81025$ddb7be20$1914a8c0@md.baymicrosystems.com> Steve: This patch looks good on my system, meaning it did not break any of my usual tests (switch related). 
Thanks, Suri > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > Of swelch at systemfabricworks.com > Sent: Wednesday, October 10, 2007 11:29 PM > To: rdreier at cisco.com; sean.hefty at intel.com; general at lists.openfabrics.org > Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace > > > > Sean, Roland, > > This patch [v3] replaces the [v2] patch; it includes those changes but renames > the smi function testing returning SMP requests to the name Hal recommends. > > This patch allows userspace DR SMP responses to be looped back and delivered > to a local mad agent by the management stack. > > Thanks, Steve > > Signed-off-by: Steve Welch > --- > drivers/infiniband/core/mad.c | 6 +++--- > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > 2 files changed, 20 insertions(+), 4 deletions(-) > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..98148d6 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > } > > /* Check to post send on QP or process locally */ > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) > goto out; > > local = kmalloc(sizeof *local, GFP_ATOMIC); > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > mad_agent_priv->agent.port_num); > if (port_priv) { > - mad_priv->mad.mad.mad_hdr.tid = > - ((struct ib_mad *)smp)->mad_hdr.tid; > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > recv_mad_agent = find_mad_agent(port_priv, > &mad_priv->mad.mad); > } > diff --git 
a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > index 1cfc298..aff96ba 100644 > --- a/drivers/infiniband/core/smi.h > +++ b/drivers/infiniband/core/smi.h > @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, > u8 node_type, int port_num); > > /* > - * Return 1 if the SMP should be handled by the local SMA/SM via process_mad > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > */ > static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > struct ib_device *device) > @@ -71,4 +72,19 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > (smp->hop_ptr == smp->hop_cnt + 1)) ? > IB_SMI_HANDLE : IB_SMI_DISCARD); > } > + > +/* > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > + */ > +static inline enum smi_action smi_check_local_returning_smp(struct ib_smp *smp, > + struct ib_device *device) > +{ > + /* C14-13:3 -- We're at the end of the DR segment of path */ > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > + return ((device->process_mad && > + ib_get_smp_direction(smp) && > + !smp->hop_ptr) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); > +} > + > #endif /* __SMI_H_ */ > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swelch at systemfabricworks.com Tue Oct 16 12:01:12 2007 From: swelch at systemfabricworks.com (Steve Welch) Date: Tue, 16 Oct 2007 14:01:12 -0500 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <017601c81025$ddb7be20$1914a8c0@md.baymicrosystems.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <017601c81025$ddb7be20$1914a8c0@md.baymicrosystems.com> Message-ID: <001001c81026$ec087cc0$a865a8c0@catcher> > -----Original Message----- > From: Suresh Shelvapille [mailto:suri at baymicrosystems.com] > Steve: > > This patch looks good on my system, meaning it did not break any of my > usual > tests (switch related). > Suri, thanks for testing this. Hal, I will resubmit the patch to the list to include the detailed description as we discussed previously. Thanks, Steve > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org [mailto:general- > bounces at lists.openfabrics.org] On Behalf > > Of swelch at systemfabricworks.com > > Sent: Wednesday, October 10, 2007 11:29 PM > > To: rdreier at cisco.com; sean.hefty at intel.com; > general at lists.openfabrics.org > > Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR > SMP responses from userspace > > > > > > > > Sean, Roland, > > > > This patch [v3] replaces the [v2] patch; it includes those changes but > renames > > the smi function testing returning SMP requests to the name Hal > recommends. > > > > This patch allows userspace DR SMP responses to be looped back and > delivered > > to a local mad agent by the management stack. 
> > > > Thanks, Steve > > > > Signed-off-by: Steve Welch > > --- > > drivers/infiniband/core/mad.c | 6 +++--- > > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > > 2 files changed, 20 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/infiniband/core/mad.c > b/drivers/infiniband/core/mad.c > > index 6f42877..98148d6 100644 > > --- a/drivers/infiniband/core/mad.c > > +++ b/drivers/infiniband/core/mad.c > > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct > ib_mad_agent_private *mad_agent_priv, > > } > > > > /* Check to post send on QP or process locally */ > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) > > goto out; > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct > ib_mad_agent_private *mad_agent_priv, > > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > > mad_agent_priv->agent.port_num); > > if (port_priv) { > > - mad_priv->mad.mad.mad_hdr.tid = > > - ((struct ib_mad *)smp)->mad_hdr.tid; > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > > recv_mad_agent = find_mad_agent(port_priv, > > &mad_priv->mad.mad); > > } > > diff --git a/drivers/infiniband/core/smi.h > b/drivers/infiniband/core/smi.h > > index 1cfc298..aff96ba 100644 > > --- a/drivers/infiniband/core/smi.h > > +++ b/drivers/infiniband/core/smi.h > > @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct > ib_smp *smp, > > u8 node_type, int port_num); > > > > /* > > - * Return 1 if the SMP should be handled by the local SMA/SM via > process_mad > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > SMA/SM > > + * via process_mad > > */ > > static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > > struct ib_device *device) > > @@ -71,4 +72,19 @@ static inline enum smi_action > 
smi_check_local_smp(struct ib_smp *smp, > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > } > > + > > +/* > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > SMA/SM > > + * via process_mad > > + */ > > +static inline enum smi_action smi_check_local_returning_smp(struct > ib_smp *smp, > > + struct ib_device *device) > > +{ > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > + return ((device->process_mad && > > + ib_get_smp_direction(smp) && > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > +} > > + > > #endif /* __SMI_H_ */ > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From hrosenstock at xsigo.com Tue Oct 16 12:05:21 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 16 Oct 2007 12:05:21 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <001001c81026$ec087cc0$a865a8c0@catcher> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <017601c81025$ddb7be20$1914a8c0@md.baymicrosystems.com> <001001c81026$ec087cc0$a865a8c0@catcher> Message-ID: <1192561522.5921.80.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-10-16 at 14:01 -0500, Steve Welch wrote: > > -----Original Message----- > > From: Suresh Shelvapille [mailto:suri at baymicrosystems.com] > > Steve: > > > > This patch looks good on my system, meaning it did not break any of my > > usual > > tests (switch related). > > > Suri, thanks for testing this. > > Hal, I will resubmit the patch to the list to include the detailed > description as we discussed previously. Great; thanks. It'd be nice to hear from the iPathers to really nail this one. 
-- Hal > Thanks, > Steve > > > > > > -----Original Message----- > > > From: general-bounces at lists.openfabrics.org [mailto:general- > > bounces at lists.openfabrics.org] On Behalf > > > Of swelch at systemfabricworks.com > > > Sent: Wednesday, October 10, 2007 11:29 PM > > > To: rdreier at cisco.com; sean.hefty at intel.com; > > general at lists.openfabrics.org > > > Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR > > SMP responses from userspace > > > > > > > > > > > > Sean, Roland, > > > > > > This patch [v3] replaces the [v2] patch; it includes those changes but > > renames > > > the smi function testing returning SMP requests to the name Hal > > recommends. > > > > > > This patch allows userspace DR SMP responses to be looped back and > > delivered > > > to a local mad agent by the management stack. > > > > > > Thanks, Steve > > > > > > Signed-off-by: Steve Welch > > > --- > > > drivers/infiniband/core/mad.c | 6 +++--- > > > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > > > 2 files changed, 20 insertions(+), 4 deletions(-) > > > > > > diff --git a/drivers/infiniband/core/mad.c > > b/drivers/infiniband/core/mad.c > > > index 6f42877..98148d6 100644 > > > --- a/drivers/infiniband/core/mad.c > > > +++ b/drivers/infiniband/core/mad.c > > > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct > > ib_mad_agent_private *mad_agent_priv, > > > } > > > > > > /* Check to post send on QP or process locally */ > > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > > + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) > > > goto out; > > > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct > > ib_mad_agent_private *mad_agent_priv, > > > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > > > mad_agent_priv->agent.port_num); > > > if (port_priv) { > > > - 
mad_priv->mad.mad.mad_hdr.tid = > > > - ((struct ib_mad *)smp)->mad_hdr.tid; > > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct > ib_mad)); > > > recv_mad_agent = find_mad_agent(port_priv, > > > &mad_priv->mad.mad); > > > } > > > diff --git a/drivers/infiniband/core/smi.h > > b/drivers/infiniband/core/smi.h > > > index 1cfc298..aff96ba 100644 > > > --- a/drivers/infiniband/core/smi.h > > > +++ b/drivers/infiniband/core/smi.h > > > @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct > > ib_smp *smp, > > > u8 node_type, int port_num); > > > > > > /* > > > - * Return 1 if the SMP should be handled by the local SMA/SM via > > process_mad > > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > > SMA/SM > > > + * via process_mad > > > */ > > > static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > > > struct ib_device *device) > > > @@ -71,4 +72,19 @@ static inline enum smi_action > > smi_check_local_smp(struct ib_smp *smp, > > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > > } > > > + > > > +/* > > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > > SMA/SM > > > + * via process_mad > > > + */ > > > +static inline enum smi_action smi_check_local_returning_smp(struct > > ib_smp *smp, > > > + struct ib_device *device) > > > +{ > > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > > + return ((device->process_mad && > > > + ib_get_smp_direction(smp) && > > > + !smp->hop_ptr) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); > > > +} > > > + > > > #endif /* __SMI_H_ */ > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > > general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue Oct 16 12:29:24 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 16 Oct 2007 12:29:24 -0700 Subject: [ofa-general] [PATCH] Use round_jiffies() in IPoIB code In-Reply-To: <20071015055056.GF3257@kryten> (Anton Blanchard's message of "Mon, 15 Oct 2007 00:50:56 -0500") References: <20071015055056.GF3257@kryten> Message-ID: Looks good, queued for 2.6.24. From rdreier at cisco.com Tue Oct 16 12:26:25 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 16 Oct 2007 12:26:25 -0700 Subject: [ofa-general] Re: [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch In-Reply-To: <000001c80d2c$c08ce380$f4cc180a@amr.corp.intel.com> (Sean Hefty's message of "Fri, 12 Oct 2007 17:05:22 -0700") References: <000001c80d2c$c08ce380$f4cc180a@amr.corp.intel.com> Message-ID: Thanks, I queued up both of these patches for my next merge. From rdreier at cisco.com Tue Oct 16 12:34:01 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 16 Oct 2007 12:34:01 -0700 Subject: [ofa-general] [PATCH] Use round_jiffies() in IPoIB code In-Reply-To: <20071015055056.GF3257@kryten> (Anton Blanchard's message of "Mon, 15 Oct 2007 00:50:56 -0500") References: <20071015055056.GF3257@kryten> Message-ID: BTW, a question about all this wakeup-saving stuff. 
For a case like this, would we be better off using INIT_DELAYED_WORK_DEFERRABLE(), since we don't really care about cleaning up stale address handles on a system that is totally idle and blissfully sleeping? Similarly for the mthca code that polls the HCA for a catastrophic error (internal HCA crash), do you think we should be rounding the 5*HZ interval, or using a deferrable timer? - R. From dledford at redhat.com Tue Oct 16 13:26:40 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 16 Oct 2007 16:26:40 -0400 Subject: [ofa-general] Re: [PATCH] librdmacm/man: update man pages to clarify connection request params In-Reply-To: <000001c81015$f09a70b0$3c98070a@amr.corp.intel.com> References: <000001c81015$f09a70b0$3c98070a@amr.corp.intel.com> Message-ID: <1192566400.1666.23.camel@firewall.xsintricity.com> On Tue, 2007-10-16 at 09:59 -0700, Sean Hefty wrote: > Document connection request parameters in rdma_connect(), > rdma_accept(), and rdma_get_cm_event(), specifically regarding > initiator_depth and responder_resources. > > Signed-off-by: Sean Hefty > --- > Doug, these are the updates to the man pages that I've made to > better document some of the parameters based on your feedback. I > will look at some code changes to try to trap errors setting > initiator_depth and responder_resources earlier in the > connection setup. I will request that these changes go into > OFED 1.3. Looks good, thanks. > The only other request I can recall was adding the ability to migrate > an id to another event channel. I still need to look into this more, > but this would miss OFED 1.3. 
> > man/rdma_accept.3 | 22 +++++++++--- > man/rdma_ack_cm_event.3 | 3 +- > man/rdma_connect.3 | 12 +++++-- > man/rdma_get_cm_event.3 | 83 ++++++++++++++++++++++++++++++++++++++++++++++- > 4 files changed, 109 insertions(+), 11 deletions(-) > > diff --git a/man/rdma_accept.3 b/man/rdma_accept.3 > index e7c3788..c0c12d8 100644 > --- a/man/rdma_accept.3 > +++ b/man/rdma_accept.3 > @@ -11,7 +11,8 @@ rdma_accept \- Called to accept a connection request. > .IP "id" 12 > Connection identifier associated with the request. > .IP "conn_param" 12 > -Information needed to establish the connection. > +Information needed to establish the connection. See CONNECTION PROPERTIES > +below for details. > .SH "DESCRIPTION" > Called from the listening side to accept a connection or datagram > service lookup request. > @@ -25,13 +26,16 @@ rdma_accept is called on the new rdma_cm_id. > .SH "CONNECTION PROPERTIES" > The following properties are used to configure the communication and specified > by the conn_param parameter when accepting a connection or datagram > -communication request. Users should use the conn_param values reported in > -the connection request event to determine appropriate values for these fields > -when accepting. > +communication request. Users should use the rdma_conn_param values reported > +in the connection request event to determine appropriate values for these > +fields when accepting. Users may reference the rdma_conn_param structure in > +the connection event directly, or can reference their own structure. If the > +rdma_conn_param structure from an event is referenced, the event must not be > +acked until after this call returns. > .IP private_data > References a user-controlled data buffer. The contents of the buffer are > -transparently passed to the remote side as part of the communication request. > -May be NULL if private_data is not required. > +copied and transparently passed to the remote side as part of the > +communication request. 
May be NULL if private_data is not required. > .IP private_data_len > Specifies the size of the user-controlled data buffer. Note that the actual > amount of data transferred to the remote side is transport dependent and may > @@ -39,9 +43,15 @@ be larger than that requested. > .IP responder_resources > The maximum number of outstanding RDMA read and atomic operations that the > local side will accept from the remote side. Applies only to RDMA_PS_TCP. > +This value must be less than or equal to the local RDMA device attribute > +max_qp_rd_atom and the responder_resources value reported in the connect > +request event. > .IP initiator_depth > The maximum number of outstanding RDMA read and atomic operations that the > local side will have to the remote side. Applies only to RDMA_PS_TCP. > +This value must be less than or equal to the local RDMA device attribute > +max_qp_init_rd_atom and the initiator_depth value reported in the connect > +request event. > .IP flow_control > Specifies if hardware flow control should be used. Applies only to RDMA_PS_TCP. > .IP retry_count > diff --git a/man/rdma_ack_cm_event.3 b/man/rdma_ack_cm_event.3 > index 20ccd9c..3c24357 100644 > --- a/man/rdma_ack_cm_event.3 > +++ b/man/rdma_ack_cm_event.3 > @@ -12,6 +12,7 @@ Event to be released. > .SH "DESCRIPTION" > All events which are allocated by rdma_get_cm_event must be released, > there should be a one-to-one correspondence between successful gets > -and acks. > +and acks. This call frees the event structure and any memory that it > +references. > .SH "SEE ALSO" > rdma_get_cm_event(3), rdma_destroy_id(3) > diff --git a/man/rdma_connect.3 b/man/rdma_connect.3 > index a0a9095..71d5594 100644 > --- a/man/rdma_connect.3 > +++ b/man/rdma_connect.3 > @@ -11,7 +11,7 @@ rdma_connect \- Initiate an active connection request. > .IP "id" 12 > RDMA identifier. > .IP "conn_param" 12 > -connection parameters. > +connection parameters. See CONNECTION PROPERTIES below for details. 
> .SH "DESCRIPTION" > For an rdma_cm_id of type RDMA_PS_TCP, this call initiates a connection request > to a remote destination. For an rdma_cm_id of type RDMA_PS_UDP, it initiates > @@ -25,8 +25,8 @@ by the conn_param parameter when connecting or establishing datagram > communication. > .IP private_data > References a user-controlled data buffer. The contents of the buffer are > -transparently passed to the remote side as part of the communication request. > -May be NULL if private_data is not required. > +copied and transparently passed to the remote side as part of the > +communication request. May be NULL if private_data is not required. > .IP private_data_len > Specifies the size of the user-controlled data buffer. Note that the actual > amount of data transferred to the remote side is transport dependent and may > @@ -34,9 +34,15 @@ be larger than that requested. > .IP responder_resources > The maximum number of outstanding RDMA read and atomic operations that the > local side will accept from the remote side. Applies only to RDMA_PS_TCP. > +This value must be less than or equal to the local RDMA device attribute > +max_qp_rd_atom and remote RDMA device attribute max_qp_init_rd_atom. The > +remote endpoint can adjust this value when accepting the connection. > .IP initiator_depth > The maximum number of outstanding RDMA read and atomic operations that the > local side will have to the remote side. Applies only to RDMA_PS_TCP. > +This value must be less than or equal to the local RDMA device attribute > +max_qp_init_rd_atom and remote RDMA device attribute max_qp_rd_atom. The > +remote endpoint can adjust this value when accepting the connection. > .IP flow_control > Specifies if hardware flow control should be used. Applies only to RDMA_PS_TCP. 
> .IP retry_count > diff --git a/man/rdma_get_cm_event.3 b/man/rdma_get_cm_event.3 > index 252a7ab..987ead5 100644 > --- a/man/rdma_get_cm_event.3 > +++ b/man/rdma_get_cm_event.3 > @@ -21,7 +21,88 @@ modifying the file descriptor associated with the given channel. All > events that are reported must be acknowledged by calling rdma_ack_cm_event. > Destruction of an rdma_cm_id will block until related events have been > acknowledged. > -.SH "EVENTS" > +.SH "EVENT DATA" > +Communication event details are returned in the rdma_cm_event structure. > +This structure is allocated by the rdma_cm and released by the > +rdma_ack_cm_event routine. Details of the rdma_cm_event structure are > +given below. > +.IP "id" 12 > +The rdma_cm identifier associated with the event. If the event type is > +RDMA_CM_EVENT_CONNECT_REQUEST, then this references a new id for that > +communication. > +.IP "listen_id" 12 > +For RDMA_CM_EVENT_CONNECT_REQUEST event types, this references the > +corresponding listening request identifier. > +.IP "event" 12 > +Specifies the type of communication event which occurred. See EVENT TYPES > +below. > +.IP "status" 12 > +Returns any asynchronous error information associated with an event. The > +status is zero unless the corresponding operation failed. > +.IP "param" 12 > +Provides additional details based on the type of event. Users should > +select the conn or ud subfields based on the rdma_port_space of the > +rdma_cm_id associated with the event. See UD EVENT DATA and CONN EVENT > +DATA below. > +.SH "UD EVENT DATA" > +Event parameters related to unreliable datagram (UD) services: RDMA_PS_UDP and > +RDMA_PS_IPOIB. The UD event data is valid for RDMA_CM_EVENT_ESTABLISHED and > +RDMA_CM_EVENT_MULTICAST_JOIN events, unless stated otherwise. > +.IP "private_data" 12 > +References any user-specified data associated with RDMA_CM_EVENT_CONNECT_REQUEST > +or RDMA_CM_EVENT_ESTABLISHED events. 
The data referenced by this field matches > +that specified by the remote side when calling rdma_connect or rdma_accept. > +This field is NULL if the event does not include private data. The buffer > +referenced by this pointer is deallocated when calling rdma_ack_cm_event. > +.IP "private_data_len" 12 > +The size of the private data buffer. Users should note that the size of > +the private data buffer may be larger than the amount of private data > +sent by the remote side. Any additional space in the buffer will be > +zeroed out. > +.IP "ah_attr" 12 > +Address information needed to send data to the remote endpoint(s). > +Users should use this structure when allocating their address handle. > +.IP "qp_num" 12 > +QP number of the remote endpoint or multicast group. > +.IP "qkey" 12 > +QKey needed to send data to the remote endpoint(s). > +.SH "CONN EVENT DATA" > +Event parameters related to connected QP services: RDMA_PS_TCP. The > +connection related event data is valid for RDMA_CM_EVENT_CONNECT_REQUEST > +and RDMA_CM_EVENT_ESTABLISHED events, unless stated otherwise. > +.IP "private_data" 12 > +References any user-specified data associated with the event. The data > +referenced by this field matches that specified by the remote side when > +calling rdma_connect or rdma_accept. This field is NULL if the event > +does not include private data. The buffer referenced by this pointer is > +deallocated when calling rdma_ack_cm_event. > +.IP "private_data_len" 12 > +The size of the private data buffer. Users should note that the size of > +the private data buffer may be larger than the amount of private data > +sent by the remote side. Any additional space in the buffer will be > +zeroed out. > +.IP "responder_resources" 12 > +The number of responder resources requested of the recipient. > +This field matches the initiator depth specified by the remote node when > +calling rdma_connect and rdma_accept. 
> +.IP "initiator_depth" 12 > +The maximum number of outstanding RDMA read/atomic operations > +that the recipient may have outstanding. This field matches the responder > +resources specified by the remote node when calling rdma_connect and > +rdma_accept. > +.IP "flow_control" 12 > +Indicates if hardware level flow control is provided. > +.IP "retry_count" 12 > +For RDMA_CM_EVENT_CONNECT_REQUEST events only, indicates the number of times > +that the recipient should retry send operations. > +.IP "rnr_retry_count" 12 > +The number of times that the recipient should retry receiver not ready (RNR) > +NACK errors. > +.IP "srq" 12 > +Specifies if the sender is using a shared-receive queue. > +.IP "qp_num" 12 > +Indicates the remote QP number for the connection. > +.SH "EVENT TYPES" > The following types of communication events may be reported. > .IP RDMA_CM_EVENT_ADDR_RESOLVED > Address resolution (rdma_resolve_addr) completed successfully. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From anton at samba.org Tue Oct 16 13:27:15 2007 From: anton at samba.org (Anton Blanchard) Date: Tue, 16 Oct 2007 15:27:15 -0500 Subject: [ofa-general] [PATCH] Use round_jiffies() in IPoIB code In-Reply-To: References: <20071015055056.GF3257@kryten> Message-ID: <20071016202715.GB15989@kryten> Hi Roland, On Tue, Oct 16, 2007 at 12:34:01PM -0700, Roland Dreier wrote: > BTW, a question about all this wakeup-saving stuff. For a case like > this, would we be better off INIT_DELAYED_WORK_DEFERRABLE(), since we > don't really care about cleaning up stale address handles on a system > that is totally idle and blissfully sleeping? 
> > Similarly for the mthca code that polls the HCA for a catastrophic > error (internal HCA crash), do you think we should be rounding the > 5*HZ interval, or using a deferrable timer? Good question, I've been playing it safe and just rounding timers, but it does sound like both these cases would be served fine by using INIT_DELAYED_WORK_DEFERRABLE. Want me to make a patch? Anton From sashak at voltaire.com Tue Oct 16 13:40:21 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 16 Oct 2007 22:40:21 +0200 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <000001c80f65$87bea170$3c98070a@amr.corp.intel.com> References: <20071015195140.GA12364@sashak.voltaire.com> <000001c80f65$87bea170$3c98070a@amr.corp.intel.com> Message-ID: <20071016204021.GC12364@sashak.voltaire.com> On 12:56 Mon 15 Oct , Sean Hefty wrote: > >Seems you don't think it is very critical, cannot say I disagree so much. > >Hmm, let's change portid -> fd and deprecate umad_get_fd() after OFED? > > My vote is to retain some sort of abstraction. Once we get rid of it, it will > be very hard to add it back in. That is true, but I cannot find a scenario where using fd as umad device handle could be insufficient. Even if we will need to create some internally tracked per device data again (unlikely) fd can serve us as an index just as well. The whole issue is all about naming and seems minor to me - without actual API change we can rename it once and rename again later if it will be needed or keep things as they are - both options are fine. And since there is concern, let's do nothing and stay with "as is". > My concern is that multi-thread receive handling isn't easily supported when > RMPP is involved, and having umad_recv take an abstract 'id' gives us some > flexibility that could come in useful someday. > > E.g. something like: > umad_recv() -> returns too small, gives necessary size + id specific to a mad > umad_recv(mad id, new size ...) 
-> returns reassembled rmpp mad With this second umad_recv() we also will need to specify which umad device to use, I think API change will be required, right? (the option to encode both fd and mad id as first umad_recv() parameter looks messy for me.) Sasha From sashak at voltaire.com Tue Oct 16 13:53:03 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 16 Oct 2007 22:53:03 +0200 Subject: [ofa-general] Re: [PATCH] osm: QoS parser - adding support for quoted string In-Reply-To: <4713ED64.8070709@dev.mellanox.co.il> References: <4713ED64.8070709@dev.mellanox.co.il> Message-ID: <20071016205303.GE12364@sashak.voltaire.com> On 00:44 Tue 16 Oct , Yevgeny Kliteynik wrote: > Adding support for quoted strings in the policy file parser. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From gmk at lbl.gov Tue Oct 16 14:24:17 2007 From: gmk at lbl.gov (Greg Kurtzer) Date: Tue, 16 Oct 2007 14:24:17 -0700 Subject: [Fwd: [ofa-general] [RFC/PATCHv2] infiniband-diags: Support PMAs which don't support AllPortSelect option] In-Reply-To: <1192558829.5921.63.camel@hrosenstock-ws.xsigo.com> References: <1192558829.5921.63.camel@hrosenstock-ws.xsigo.com> Message-ID: <91A65CB4-3008-4B1E-AA28-5C49DC8CED09@lbl.gov> On Oct 16, 2007, at 11:20 AM, Hal Rosenstock wrote: > Greg, > > Can you try this version out ? This is again against OFED 1.3. Will the OFED 1.3 infiniband-diags build and link properly against the OFED 1.2 tree? If so, maybe I can just try to update just that. Are there available nightly builds for infiniband-diags? If using the OFED 1.3 diags package against the OFED 1.2 API is not recommended, I will need to test this by merging by hand. > > Differences from previous version are: > 1. In perfquery.c, two places where PortSelect is faked out when all > ports is being "simulated". > 2. Trivial mod to ibcheckerrs.in to eliminate AllPortSelect warning. Looks great. I will let you know. Thanks again! 
Greg -- Greg Kurtzer gmk at lbl.gov From sashak at voltaire.com Tue Oct 16 14:54:24 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 16 Oct 2007 23:54:24 +0200 Subject: [Fwd: [ofa-general] [RFC/PATCHv2] infiniband-diags: Support PMAs which don't support AllPortSelect option] In-Reply-To: <91A65CB4-3008-4B1E-AA28-5C49DC8CED09@lbl.gov> References: <1192558829.5921.63.camel@hrosenstock-ws.xsigo.com> <91A65CB4-3008-4B1E-AA28-5C49DC8CED09@lbl.gov> Message-ID: <20071016215424.GG12364@sashak.voltaire.com> On 14:24 Tue 16 Oct , Greg Kurtzer wrote: > > On Oct 16, 2007, at 11:20 AM, Hal Rosenstock wrote: > > > Greg, > > > > Can you try this version out ? This is again against OFED 1.3. > > Will the OFED 1.3 infiniband-diags build and link properly against the OFED > 1.2 tree? I guess whole management tree should work. > If so, maybe I can just try to update just that. Are there > available nightly builds for infiniband-diags? Just grab recent git sources (master): git clone git://git.openfabrics.org/~sasahak/management Sasha From hrosenstock at xsigo.com Tue Oct 16 14:54:13 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 16 Oct 2007 14:54:13 -0700 Subject: [Fwd: [ofa-general] [RFC/PATCHv2] infiniband-diags: Support PMAs which don't support AllPortSelect option] In-Reply-To: <91A65CB4-3008-4B1E-AA28-5C49DC8CED09@lbl.gov> References: <1192558829.5921.63.camel@hrosenstock-ws.xsigo.com> <91A65CB4-3008-4B1E-AA28-5C49DC8CED09@lbl.gov> Message-ID: <1192571653.5921.139.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-10-16 at 14:24 -0700, Greg Kurtzer wrote: > On Oct 16, 2007, at 11:20 AM, Hal Rosenstock wrote: > > > Greg, > > > > Can you try this version out ? This is again against OFED 1.3. > > Will the OFED 1.3 infiniband-diags build and link properly against > the OFED 1.2 tree? If so, maybe I can just try to update just that. > Are there available nightly builds for infiniband-diags? 
> > If using the OFED 1.3 diags package against the OFED 1.2 API is not > recommended, I will need to test this by merging by hand. It is fine to test the changes applied by hand if that is easier. That's why I outlined them. -- Hal > > > > Differences from previous version are: > > 1. In perfquery.c, two places where PortSelect is faked out when all > > ports is being "simulated". > > 2. Trivial mod to ibcheckerrs.in to eliminate AllPortSelect warning. > > Looks great. I will let you know. > > Thanks again! > > Greg > > -- > Greg Kurtzer > gmk at lbl.gov > > From gmk at lbl.gov Tue Oct 16 15:14:36 2007 From: gmk at lbl.gov (Greg Kurtzer) Date: Tue, 16 Oct 2007 15:14:36 -0700 Subject: [Fwd: [ofa-general] [RFC/PATCHv2] infiniband-diags: Support PMAs which don't support AllPortSelect option] In-Reply-To: <1192571653.5921.139.camel@hrosenstock-ws.xsigo.com> References: <1192558829.5921.63.camel@hrosenstock-ws.xsigo.com> <91A65CB4-3008-4B1E-AA28-5C49DC8CED09@lbl.gov> <1192571653.5921.139.camel@hrosenstock-ws.xsigo.com> Message-ID: <2C7B8916-2A76-40C5-90EA-3FE65BC8A9CB@lbl.gov> On Oct 16, 2007, at 2:54 PM, Hal Rosenstock wrote: > On Tue, 2007-10-16 at 14:24 -0700, Greg Kurtzer wrote: >> On Oct 16, 2007, at 11:20 AM, Hal Rosenstock wrote: >> >>> Greg, >>> >>> Can you try this version out ? This is again against OFED 1.3. >> >> Will the OFED 1.3 infiniband-diags build and link properly against >> the OFED 1.2 tree? If so, maybe I can just try to update just that. >> Are there available nightly builds for infiniband-diags? >> >> If using the OFED 1.3 diags package against the OFED 1.2 API is not >> recommended, I will need to test this by merging by hand. > > It is fine to test the changes applied by hand if that is easier. > That's > why I outlined them. Patch (attached for reference) seems to be working great! :) Thanks, Greg -------------- next part -------------- A non-text attachment was scrubbed... 
Name: infiniband-diags-1.3.2-singleport.patch Type: application/octet-stream Size: 3205 bytes Desc: not available URL: -------------- next part -------------- -- Greg Kurtzer gmk at lbl.gov From ardavis at ichips.intel.com Tue Oct 16 15:35:53 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 16 Oct 2007 15:35:53 -0700 Subject: [ofa-general] [PATCH] DAT 2.0 change to include extended event data with all builds. Message-ID: <47153CC9.4000701@ichips.intel.com> James, Can you comment on the following patch to DAT v2.0. Do you see any issues? If ok, I would like to get this accepted and rolled back into the specification. Thanks, -arlin -- Modify dat.h dat_event to include event_extension_data[8]. Extend struct dat_event outside of extension build switch to enable non-extended applications to work with extended libraries. Otherwise, there is a potential for the event callee to write back too much event data and exceed callers non-extended event buffer. -- Signed-off by: Arlin Davis --- a/dat/include/dat/dat.h +++ b/dat/include/dat/dat.h @@ -944,9 +944,7 @@ typedef struct dat_event DAT_EVENT_NUMBER event_number; DAT_EVD_HANDLE evd_handle; DAT_EVENT_DATA event_data; -#ifdef DAT_EXTENSIONS DAT_UINT64 event_extension_data[8]; -#endif /* DAT_EXTENSIONS */ } DAT_EVENT; /* Provider/registration info */ From sean.hefty at intel.com Tue Oct 16 15:50:42 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 16 Oct 2007 15:50:42 -0700 Subject: [ofa-general] [PATCH] librdmacm/cma: provide sanity checks for max outstanding rdma ops Message-ID: <002a01c81046$fb3db6e0$3c98070a@amr.corp.intel.com> Ensure that the responder_resources and initiator_depth values provided by the user are supported by the local hardware. This traps errors sooner during connection establishment (when calling rdma_connect), rather than waiting until the modify QP fails (after calling rdma_accept). 
Signed-off-by: Sean Hefty --- src/cma.c | 31 +++++++++++++++++++++++++++++-- 1 files changed, 29 insertions(+), 2 deletions(-) diff --git a/src/cma.c b/src/cma.c index c0ae114..00ea394 100644 --- a/src/cma.c +++ b/src/cma.c @@ -116,6 +116,8 @@ struct cma_device { struct ibv_context *verbs; uint64_t guid; int port_cnt; + uint8_t max_initiator_depth; + uint8_t max_responder_resources; }; struct cma_id_private { @@ -246,6 +248,8 @@ static int ucma_init(void) } cma_dev->port_cnt = attr.phys_port_cnt; + cma_dev->max_initiator_depth = (uint8_t) attr.max_qp_init_rd_atom; + cma_dev->max_responder_resources = (uint8_t) attr.max_qp_rd_atom; } out: pthread_mutex_unlock(&mut); @@ -810,6 +814,21 @@ void rdma_destroy_qp(struct rdma_cm_id *id) ibv_destroy_qp(id->qp); } +static int ucma_valid_param(struct cma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + if (id_priv->id.ps != RDMA_PS_TCP) + return 0; + + if ((conn_param->responder_resources > + id_priv->cma_dev->max_responder_resources) || + (conn_param->initiator_depth > + id_priv->cma_dev->max_initiator_depth)) + return -EINVAL; + + return 0; +} + static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, struct rdma_conn_param *src, uint32_t qp_num, uint8_t srq) @@ -837,8 +856,12 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) void *msg; int ret, size; - CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_CONNECT, size); id_priv = container_of(id, struct cma_id_private, id); + ret = ucma_valid_param(id_priv, conn_param); + if (ret) + return ret; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_CONNECT, size); cmd->id = id_priv->handle; if (id->qp) ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, @@ -882,6 +905,11 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) void *msg; int ret, size; + id_priv = container_of(id, struct cma_id_private, id); + ret = ucma_valid_param(id_priv, conn_param); + if (ret) + return ret; + if (!ucma_is_ud_ps(id->ps)) { 
ret = ucma_modify_qp_rtr(id); if (ret) @@ -889,7 +917,6 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) } CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_ACCEPT, size); - id_priv = container_of(id, struct cma_id_private, id); cmd->id = id_priv->handle; cmd->uid = (uintptr_t) id_priv; if (id->qp) From sashak at voltaire.com Tue Oct 16 16:32:37 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 17 Oct 2007 01:32:37 +0200 Subject: [ofa-general] Re: [PATCH] osm: Adding two SA MAD class-specific status values In-Reply-To: <471460DB.5070703@dev.mellanox.co.il> References: <471460DB.5070703@dev.mellanox.co.il> Message-ID: <20071016233237.GI12364@sashak.voltaire.com> Hi Yevgeny, On 08:57 Tue 16 Oct , Yevgeny Kliteynik wrote: > Adding two SA MAD class-specific status values: > - ERR_REQ_DENIED > - ERR_REQ_PRIORITY_SUGGESTED > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/include/iba/ib_types.h | 2 ++ > 1 files changed, 2 insertions(+), 0 deletions(-) > > diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h > index e1785f1..c6f16b9 100644 > --- a/opensm/include/iba/ib_types.h > +++ b/opensm/include/iba/ib_types.h > @@ -903,6 +903,8 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > #define IB_SA_MAD_STATUS_TOO_MANY_RECORDS (CL_HTON16(0x0400)) > #define IB_SA_MAD_STATUS_INVALID_GID (CL_HTON16(0x0500)) > #define IB_SA_MAD_STATUS_INSUF_COMPS (CL_HTON16(0x0600)) > +#define IB_SA_MAD_STATUS_DENIED (CL_HTON16(0x0700)) > +#define IB_SA_MAD_STATUS_PRIO_SUGGESTED (CL_HTON16(0x0800)) Where those values are defined? In published by IBTA drafts both are 7 (STATUS_DENIED in Volume 1 Release 1.2.1 and STATUS_PRIO_SUGGESTED in Annex QoS13: QoS Management v0.9). 
Sasha > > #define IB_DM_MAD_STATUS_NO_IOC_RESP (CL_HTON16(0x0100)) > #define IB_DM_MAD_STATUS_NO_SVC_ENTRIES (CL_HTON16(0x0200)) > -- > 1.5.1.4 > > From sashak at voltaire.com Tue Oct 16 16:35:33 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 17 Oct 2007 01:35:33 +0200 Subject: [ofa-general] Re: [PATCH] osm: Adding two SA MAD class-specific status values In-Reply-To: <20071016233237.GI12364@sashak.voltaire.com> References: <471460DB.5070703@dev.mellanox.co.il> <20071016233237.GI12364@sashak.voltaire.com> Message-ID: <20071016233533.GJ12364@sashak.voltaire.com> On 01:32 Wed 17 Oct , Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 08:57 Tue 16 Oct , Yevgeny Kliteynik wrote: > > Adding two SA MAD class-specific status values: > > - ERR_REQ_DENIED > > - ERR_REQ_PRIORITY_SUGGESTED > > > > Signed-off-by: Yevgeny Kliteynik > > --- > > opensm/include/iba/ib_types.h | 2 ++ > > 1 files changed, 2 insertions(+), 0 deletions(-) > > > > diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h > > index e1785f1..c6f16b9 100644 > > --- a/opensm/include/iba/ib_types.h > > +++ b/opensm/include/iba/ib_types.h > > @@ -903,6 +903,8 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > > #define IB_SA_MAD_STATUS_TOO_MANY_RECORDS (CL_HTON16(0x0400)) > > #define IB_SA_MAD_STATUS_INVALID_GID (CL_HTON16(0x0500)) > > #define IB_SA_MAD_STATUS_INSUF_COMPS (CL_HTON16(0x0600)) > > +#define IB_SA_MAD_STATUS_DENIED (CL_HTON16(0x0700)) > > +#define IB_SA_MAD_STATUS_PRIO_SUGGESTED (CL_HTON16(0x0800)) > > Where those values are defined? In published by IBTA drafts both are 7 > (STATUS_DENIED in Volume 1 Release 1.2.1 and STATUS_PRIO_SUGGESTED in > Annex QoS13: QoS Management v0.9). Never mind - I overlooked that it is already changed in 1.2.1 draft. 
Sasha From sashak at voltaire.com Tue Oct 16 16:36:07 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 17 Oct 2007 01:36:07 +0200 Subject: [ofa-general] Re: [PATCH] osm: Adding two SA MAD class-specific status values In-Reply-To: <471460DB.5070703@dev.mellanox.co.il> References: <471460DB.5070703@dev.mellanox.co.il> Message-ID: <20071016233607.GK12364@sashak.voltaire.com> On 08:57 Tue 16 Oct , Yevgeny Kliteynik wrote: > Adding two SA MAD class-specific status values: > - ERR_REQ_DENIED > - ERR_REQ_PRIORITY_SUGGESTED > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From hrosenstock at xsigo.com Tue Oct 16 16:25:37 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 16 Oct 2007 16:25:37 -0700 Subject: [ofa-general] [PATCHv2] infiniband-diags/perfquery.c: Support PMAs which don't support AllPortSelect option Message-ID: <1192577137.5921.176.camel@hrosenstock-ws.xsigo.com> infiniband-diags/perfquery.c: Support PMAs which don't support AllPortSelect option Currently only single port HCAs are supported in this mode, but this can be extended to other devices if needed Tested-by: Greg Kurtzer Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 148e452..17aafb6 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -42,7 +43,7 @@ #include #include -#define __BUILD_VERSION_TAG__ 1.2.2 +#define __BUILD_VERSION_TAG__ 1.2.3 #include #include #include @@ -99,6 +100,9 @@ main(int argc, char **argv) int ca_port = 0; int extended = 0; uint16_t cap_mask; + int allports = 0; + int node_type, num_ports; + uint8_t data[IB_SMP_DATA_SIZE]; static char const str_opts[] = "C:P:s:t:dGearRVhu"; static const struct option long_opts[] = { @@ -191,6 +195,35 @@ main(int argc, char **argv) /* PerfMgt ClassPortInfo is a required attribute */ if (!perf_classportinfo_query(pc, &portid, port, timeout)) IBERROR("classportinfo query"); + /* ClassPortInfo should be supported as part of libibmad */ + memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ + cap_mask = ntohs(cap_mask); + if (!(cap_mask & 0x100)) /* bit 8 is AllPortSelect */ + if (port == 255) { + allports = 1; + IBWARN("AllPortSelect not supported"); + } + + if (allports == 1) { + + /* + * Simulate all ports support in PMA + * Determine node type, number of (physical) ports, + * and, if switch, whether SP0 is enhanced + * to determine first and last port to query + */ + + /* For now, support single port CAs */ + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) + IBERROR("smp query nodeinfo failed"); + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); + if (node_type != IB_NODE_CA) /* NodeType other than CA ? 
*/ + IBERROR("smp query nodeinfo: Node type not CA"); + mad_decode_field(data, IB_NODE_NPORTS_F, &num_ports); + if (num_ports != 1) + IBERROR("smp query nodeinfo: %d ports; only 1 supported currently", num_ports); + port = num_ports; + } if (reset_only) goto do_reset; @@ -199,17 +232,20 @@ main(int argc, char **argv) if (!port_performance_query(pc, &portid, port, timeout)) IBERROR("perfquery"); + if (allports == 1) + pc[1] = 255; /* fake PortSelect */ + mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); } else { - /* Should ClassPortInfo be implemented in libibmad ? */ - memcpy(&cap_mask, pc+2, sizeof(cap_mask)); /* CapabilityMask */ - cap_mask = ntohs(cap_mask); if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ IBWARN("PerfMgt ClassPortInfo 0x%x extended counters not indicated\n", cap_mask); if (!port_performance_ext_query(pc, &portid, port, timeout)) IBERROR("perfextquery"); + if (allports == 1) + pc[1] = 255; /* fake PortSelect */ + mad_dump_perfcounters_ext(buf, sizeof buf, pc, sizeof pc); } From hrosenstock at xsigo.com Tue Oct 16 16:30:57 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 16 Oct 2007 16:30:57 -0700 Subject: [ofa-general] [PATCHv2] infiniband-diags/scripts: Updated for perfquery support of no AllPortSelect option Message-ID: <1192577457.5921.179.camel@hrosenstock-ws.xsigo.com> infiniband-diags/scripts: Updated for perfquery support of no AllPortSelect option Eliminate new ibwarn message added to perfquery to let user know AllPortSelect option is not supported by specified PMA Tested-by: Greg Kurtzer Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/scripts/ibcheckerrs.in b/infiniband-diags/scripts/ibcheckerrs.in index aa29525..1a2d228 100644 --- a/infiniband-diags/scripts/ibcheckerrs.in +++ b/infiniband-diags/scripts/ibcheckerrs.in @@ -186,6 +186,8 @@ BEGIN { /^CounterSelect/ {next} +/AllPortSelect/ {next} + /^ib/ {print $0} /ibpanic:/ {print $0} /ibwarn:/ {print $0} diff --git 
a/infiniband-diags/scripts/ibdatacounts.in b/infiniband-diags/scripts/ibdatacounts.in index bbdff71..ccf9f34 100644 --- a/infiniband-diags/scripts/ibdatacounts.in +++ b/infiniband-diags/scripts/ibdatacounts.in @@ -130,6 +130,8 @@ function blue(s) /^CounterSelect/ {next} +/AllPortSelect/ {next} + /^ib/ {print $0} /ibpanic:/ {print $0} /ibwarn:/ {print $0} From eddiem at sgi.com Tue Oct 16 16:35:38 2007 From: eddiem at sgi.com (Edward Mascarenhas) Date: Tue, 16 Oct 2007 16:35:38 -0700 Subject: [ofa-general] Running OpenSM on large clusters Message-ID: <200710161635.38818.eddiem@sgi.com> Has anyone seen issues with running OpenSM on large (1500+ nodes) clusters? We are seeing 1000s of the following message in the system log __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is already overloaded with 6736 messages and queue time of:10006[msec] It seems like a huge number of datagrams are being generated resulting in increased time to bring up the fabric. Is there a threshold of cluster size beyond which we are likely to see these messages. How many MADs are generated during bring up? What is the largest cluster size for which OpenSM has been tried by others? Thanks, Edward From jgarzik at pobox.com Tue Oct 16 18:15:58 2007 From: jgarzik at pobox.com (Jeff Garzik) Date: Tue, 16 Oct 2007 21:15:58 -0400 Subject: [ofa-general] Re: [PATCH linux-2.6] bonding: two small fixes for IPoIB support In-Reply-To: <9245.1192491867@death> References: <47138EB7.40703@gmail.com> <4713B006.9090908@pobox.com> <27349.1192480486@death> <4713D28F.3010904@pobox.com> <31162.1192485233@death> <4713E20F.9080305@pobox.com> <9245.1192491867@death> Message-ID: <4715624E.8010909@pobox.com> Jay Vosburgh wrote: > Two small fixes to IPoIB support for bonding: > > 1- copy header_ops from slave to bonding for IPoIB slaves > 2- move release and destroy logic to UNREGISTER from GOING_DOWN > notifier to avoid double release > > Set bonding to version 3.2.1. 
> > Signed-off-by: Moni Shoua > Signed-off-by: Jay Vosburgh > applied From sweitzen at cisco.com Tue Oct 16 17:46:38 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 16 Oct 2007 17:46:38 -0700 Subject: [ofa-general] OFED 1.3 Alpha release is available In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> Message-ID: > 3. IPoIB > o Stateless offloads > o NAPI is enabled default How does one measure these changes using tools like netperf or iperf? Do I need a specific HCA type? > 4. SDP - these are not yet in the alpha release > o Keep-alive > o Asynch IO > o Send Zero Copy If it didn't make it into alpha, perhaps it should not go into 1.3, so we can hold the release date better? What ever happened to NFS RDMA? Scott From akepner at sgi.com Tue Oct 16 19:02:10 2007 From: akepner at sgi.com (akepner at sgi.com) Date: Tue, 16 Oct 2007 19:02:10 -0700 Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes In-Reply-To: References: <20070726014931.GL10235@sgi.com> Message-ID: <20071017020210.GN5601@sgi.com> On Mon, Oct 15, 2007 at 08:20:35PM -0700, Roland Dreier wrote: > ..... > Here's the version that actually passed some of my tests... > > From ab8403c424a35364a3a2c753f7c5917fcbb4d809 Mon Sep 17 00:00:00 2001 > From: Roland Dreier > Date: Sun, 14 Oct 2007 20:40:27 -0700 > .... Works on ia64. Tested-by: Arthur Kepner Thanks. -- Arthur From johann.george at qlogic.com Tue Oct 16 19:32:48 2007 From: johann.george at qlogic.com (Johann George) Date: Tue, 16 Oct 2007 19:32:48 -0700 Subject: [ofa-general] OpenFabrics Developer's Summit: Open for Registration Message-ID: <20071017023248.GA20953@cuprite.pathscale.com> You can now register for the OpenFabrics Developer's Summit being held November 15-16, 2007 at the Boomtown Hotel in Verdi, Nevada. It begin at 1pm on Thursday, November 15th and ends at 1pm on Friday. 
Dinner will be provided on Thursday as well as breakfast and lunch on Friday. The registration cost is $195. For the first time, we have a student rate of $95. Click on this link to register. http://www.acteva.com/booking.cfm?bevaid=143964 The hotel is a twenty minute drive from the Reno-Sparks Convention Center. If you need a room for Thursday night, they are currently available at the Boomtown hotel starting at $59/night. We are still working on the agenda and hope to have a draft shortly. Here are some of the planned sessions and topics: * OFED: Feedback from Customers * iWARP Stack Unification: Convergence and Interoperability Issues * OFED 1.3 Update and Procedure Review * The Journey of a Patch: from Submission to Distros * Updates on MVAPICH, OpenMPI and NFSoRDMA. * OFED 1.4 Planned Features * OpenFabrics Logo Program: Experience So Far * SA Caching, IPoIB Stateless Offloads, Management Tools * Using Extended RC * Fibre-Channel over InfiniBand * Windows Stacks Unification and Release Process. * Parallel File Systems and OFED * Feedback from Distros on OFED Incorporation If you have a topic to present that would be of interest to this community, please email me. And do register early. It helps us to better plan. Thanks. Johann From tom at opengridcomputing.com Tue Oct 16 19:37:21 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 16 Oct 2007 21:37:21 -0500 Subject: [ofa-general] OFED 1.3 Alpha release is available In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> Message-ID: <1192588641.31031.27.camel@trinity.ogc.int> On Tue, 2007-10-16 at 17:46 -0700, Scott Weitzenkamp (sweitzen) wrote: > > 3. IPoIB > > o Stateless offloads > > o NAPI is enabled default > > How does one measure these changes using tools like netperf or iperf? > Do I need a specific HCA type? > > > 4. 
SDP - these are not yet in the alpha release > > o Keep-alive > > o Asynch IO > > o Send Zero Copy > > If it didn't make it into alpha, perhaps it should not go into 1.3, so > we can hold the release date better? > > What ever happened to NFS RDMA? The SVC transport switch and SVC-UDP/TCP/RDMA transport drivers are targeted for 2.6.25. To track this activity, see nfs at lists.sourceforge.net. > > Scott > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bramesh at vt.edu Tue Oct 16 20:09:55 2007 From: bramesh at vt.edu (Bharath Ramesh) Date: Tue, 16 Oct 2007 23:09:55 -0400 Subject: [ofa-general] Question on IB RDMA read timing. Message-ID: <20071017030955.GB15679@vt.edu> I wrote a simple test program to actual time it takes for RDMA read over IB. I find a huge difference in the numbers returned by timing. I was wondering if someone could help me in finding what I might be doing wrong in the way I am measuring the time. Steps I do for timing is as follows. 1) Create the send WR for RDMA Read. 2) call gettimeofday () 3) ibv_post_send () the WR 4) Loop around ibv_poll_cq () till I get the completion event. 5) call gettimeofday (); The difference in time would give me the time it takes to perform RDMA read over IB. I constantly get around 35 microsecs as the timing which seems to be really large considering the latency of IB. I am measuring the time for transferring 4K bytes of data. If anyone wants I can send the code that I have written. I am not subscribed to the list, if you could please cc me in the reply. 
Thanks, Bharath --- Bharath Ramesh http://people.cs.vt.edu/~bramesh From changquing.tang at hp.com Tue Oct 16 21:05:18 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 17 Oct 2007 04:05:18 -0000 Subject: [ofa-general] librdmacm port selection for rdma_bind_addr() In-Reply-To: <4714E970.3060507@ichips.intel.com> References: <349DCDA352EACF42A0C49FA6DCEA84030287C8DD@G3W0634.americas.hpqcorp.net> <4714E970.3060507@ichips.intel.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403028B047A@G3W0634.americas.hpqcorp.net> Below is the piece of rping.c code, how do I pick the returned port ? Do you reset sin.sin_port inside rdma_bind_addr() if I pass 0 to sin.sin_port ? I need the returned port to tell client side to call rdma_resolve_addr(). If I am right, rdma_resolve_addr() needs dest port number. static int rping_bind_server(struct rping_cb *cb) { struct sockaddr_in sin; int ret; memset(&sin, 0, sizeof(sin)); sin.sin_family = AF_INET; sin.sin_addr.s_addr = cb->addr; sin.sin_port = 0; ///////////cb->port; ret = rdma_bind_addr(cb->cm_id, (struct sockaddr *) &sin); if (ret) { fprintf(stderr, "rdma_bind_addr error %d\n", ret); return ret; } DEBUG_LOG("rdma_bind_addr successful\n"); --CQ > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 16, 2007 11:40 AM > To: Tang, Changqing > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] librdmacm port selection for > rdma_bind_addr() > > > Is there a way to let system choose a port for me ? > like TCP/IP, if > > port is set to 0, system will return an unused port. > > Yes - binding to port 0 will return a usable port. > From dotanb at dev.mellanox.co.il Wed Oct 17 00:14:53 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 17 Oct 2007 09:14:53 +0200 Subject: [ofa-general] Missing IB_EVENT_PATH_MIG events In-Reply-To: References: <471456EA.3060403@dev.mellanox.co.il> Message-ID: <4715B66D.7000408@dev.mellanox.co.il> Hi. 
The value of the timeout that you sent me means that the QP will wait for ~ 0.25 second before any retry, so it might take some time for the QPs to start doing the APM (depend of the retry_cnt value). From your description i believe that not all of the QPs have started to move to the alternate path. did you make sure that the APM state machine (in every QP) is ARMED before moving the port down ? (data packets must we sent between the local/remote QPs in order to make the QP in this state) (If you can send me the code that handles this scenario i will try to reproduce it here, in out lab) thanks Dotan lbt wrote: > Thanks for your reply Dotan! > > The timeout is set to 16. > > Here is some more info. Please let me know if there is any other info > I can provide. > Setup: > - 2 Nodes, each has a dual-port HCA (board_id: MT_0150000001, > InfiniHost III firmware 25218, v. 5.2.0) - this is the latest Mellanox > firmware I believe > - port 1 of each node is connected to one IB switch, and likewise for > port 2 --> thus have 2 separate IB subnets, providing 2 possible paths > between the 2 nodes > - IB switch is InfiniScale MT43132 ** > - Using OFED 1.2 driver stack > > Our software creates RCQPs between 2 nodes, with primary and alternate > path specified. > Test does the following: Using 10 RCQPs > 1. Hardware triggered migration by bringing down the port of the > primary path (haven't ever seen a problem with the hardware triggered > migrations) > 2. Restore the port --> reloads alternate path > - Local QPs send LAP > - Remote QPs reply with APR > 3. Redistributes RCQP's across both ports for load balancing using > software triggered migrations for the RCQPs selected for migration. > a. Local QPs: use ib_modify_qp to trigger migration --> get > IB_EVENT_PATH_MIG on local QPs > b. Remote QPs: IB_EVENT_PATH_MIG > c. Local QPs: after software-triggered migration completes, reloads > alternate path by sending LAP > d. 
Remote QPs: reply with APR > > Keep doing this in a loop. The issue is that in 3b, not all the remote > QP's reporte an IB_EVENT for the path migration triggered in 3a. I > noticed that when this happens it's usually in the first and/or second > cycle (subsequent cycles don't manifest this issue), and it occurs on > the last RCQP's that were migrated in 3a. > > BTW: Do you know if there there is a way I can determine/dump which > events are in the Event Queue? > > Thanks again! > Lan > > On 10/15/07, *Dotan Barak* > wrote: > > Hi. > > lbt wrote: > > Hi, > > > > I'm trying out APM with OFED 1.2 , using Mellanox dual-port HCA > > (ib_mthca driver). When I have several RCQP's that I am trying to > > migrate (software triggered migration using ib_modify_qp), I've > > noticed that sometimes 1 or 2 of the remote QP's never generate an > > IB_EVENT_PATH_MIG or even an IB_EVENT_PATH_MIG_ERR ... it seems that > > it just gets lost. I looked through some of the ib_mthca patches in > > git.kernel.org/?p=linux/kernel/git/roland/infiniband.git > > > < > http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git>, and > > incorporated the mmiowb patch for ib_mthca commands > > ( > http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=76d7cc0345a037e8eea426f8abc710abd22946dd > > < > http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=76d7cc0345a037e8eea426f8abc710abd22946dd>). > > But still seeing same issue. I have a test case that repeates > > software-triggered migrations + rearming in a loop, and this > problem > > usually occurs in the first few cycles, but is not too frequent. If > > anyone has any ideas on what might be wrong, or tips on where I can > > look/do to debug this, that would be very much appreciated! 
> > > > For example, this is the console output I will see (printed out > by our > > rcqp event handler): > > On the local end - initiates software triggered migration, using > > ib_modify_qp: > > Event IB_EVENT_PATH_MIG occurred on QP#1043 > > Event IB_EVENT_PATH_MIG occurred on QP#1040 > > Event IB_EVENT_PATH_MIG occurred on QP#1033 > > > > On the remote end: > > Event IB_EVENT_PATH_MIG occurred on QP#1040 > > Event IB_EVENT_PATH_MIG occurred on QP#1043 > Is > the timeout value (in the QP attributes) is 0? > If the answer is no, can you please supply some more details on this? > > > thanks > Dotan > > From dotanb at dev.mellanox.co.il Wed Oct 17 00:44:04 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 17 Oct 2007 09:44:04 +0200 Subject: [ofa-general] Question on IB RDMA read timing. In-Reply-To: <20071017030955.GB15679@vt.edu> References: <20071017030955.GB15679@vt.edu> Message-ID: <4715BD44.2090200@dev.mellanox.co.il> Hi. Bharath Ramesh wrote: > I wrote a simple test program to actual time it takes for RDMA read over > IB. I find a huge difference in the numbers returned by timing. I was > wondering if someone could help me in finding what I might be doing > wrong in the way I am measuring the time. > > Steps I do for timing is as follows. > > 1) Create the send WR for RDMA Read. > 2) call gettimeofday () > 3) ibv_post_send () the WR > 4) Loop around ibv_poll_cq () till I get the completion event. > 5) call gettimeofday (); > > The difference in time would give me the time it takes to perform RDMA > read over IB. I constantly get around 35 microsecs as the timing which > seems to be really large considering the latency of IB. I am measuring > the time for transferring 4K bytes of data. If anyone wants I can send > the code that I have written. I am not subscribed to the list, if you > could please cc me in the reply. 
> I don't familiar with the implementation of gettimeofday, but i believe that this function do a context switch (and/or spend some time in the function to fill the struct that you supply to it) I suggest you to call gettimeoday, execute N times the following commands: 1) ibv_post_send () the WR 2) Loop around ibv_poll_cq () till I get the completion event. and then call gettimeoday again to calculate the average time for an RDMA read OR you can call a better function to get the CPU/machine time, like the performance tests do, for example: https://svn.openfabrics.org/svn/openib/gen2/branches/1.1/src/userspace/perftest/get_clock.h and https://svn.openfabrics.org/svn/openib/gen2/branches/1.1/src/userspace/perftest/get_clock.c Dotan From glebn at voltaire.com Wed Oct 17 00:56:31 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 17 Oct 2007 09:56:31 +0200 Subject: [ofa-general] Question on IB RDMA read timing. In-Reply-To: <4715BD44.2090200@dev.mellanox.co.il> References: <20071017030955.GB15679@vt.edu> <4715BD44.2090200@dev.mellanox.co.il> Message-ID: <20071017075631.GA5089@minantech.com> On Wed, Oct 17, 2007 at 09:44:04AM +0200, Dotan Barak wrote: > Hi. > > Bharath Ramesh wrote: >> I wrote a simple test program to actual time it takes for RDMA read over >> IB. I find a huge difference in the numbers returned by timing. I was >> wondering if someone could help me in finding what I might be doing >> wrong in the way I am measuring the time. >> >> Steps I do for timing is as follows. >> >> 1) Create the send WR for RDMA Read. >> 2) call gettimeofday () >> 3) ibv_post_send () the WR >> 4) Loop around ibv_poll_cq () till I get the completion event. >> 5) call gettimeofday (); >> >> The difference in time would give me the time it takes to perform RDMA >> read over IB. I constantly get around 35 microsecs as the timing which >> seems to be really large considering the latency of IB. I am measuring >> the time for transferring 4K bytes of data. 
If anyone wants I can send >> the code that I have written. I am not subscribed to the list, if you >> could please cc me in the reply. >> > > I don't familiar with the implementation of gettimeofday, but i believe > that this function do a context switch > (and/or spend some time in the function to fill the struct that you supply > to it) > Here: struct timeval tv_s, tv_e; gettimeofday(&tv_s, NULL); gettimeofday(&tv_e, NULL); printf("%d\n", tv_e.tv_usec - tv_s.tv_usec); Compile and run it. The overhead of two calls to gettimeofday is at most 1 microsecond. -- Gleb.
From eli at mellanox.co.il Wed Oct 17 02:18:39 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 17 Oct 2007 11:18:39 +0200 Subject: [ofa-general] [PATCH] IB/ipoib: Add likely in data path Message-ID: <1192612719.16674.2.camel@mtls03> Add likely in data path For connected mode, it is likely that if the neighbour has a cm object than IPOIB_FLAG_OPER_UP is set. Signed-off-by: Eli Cohen --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 894b1dc..350a048 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -680,7 +680,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) neigh = *to_ipoib_neigh(skb->dst->neighbour); if (ipoib_cm_get(neigh)) { - if (ipoib_cm_up(neigh)) { + if (likely(ipoib_cm_up(neigh))) { ipoib_cm_send(dev, skb, ipoib_cm_get(neigh)); goto out; } -- 1.5.3.4 From eli at mellanox.co.il Wed Oct 17 02:20:16 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 17 Oct 2007 11:20:16 +0200 Subject: [ofa-general] [PATCH]: IB/ipoib: Remove always true if condition Message-ID: <1192612816.16674.4.camel@mtls03> Remove always true if condition The state of the cm_rx object is set to IPOIB_CM_RX_LIVE and could not change so the if conditon is redundant.
Signed-off-by: Eli Cohen --- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 08b4676..0e5339e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -323,8 +323,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even /* Add this entry to passive ids list head, but do not re-add it * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ p->jiffies = jiffies; - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); + list_move(&p->list, &priv->cm.passive_ids); spin_unlock_irq(&priv->lock); ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); -- 1.5.3.4 From vlad at lists.openfabrics.org Wed Oct 17 02:54:06 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 17 Oct 2007 02:54:06 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071017-0200 daily build status Message-ID: <20071017095406.EF0B7E60886@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18 Passed on 
ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.22 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Wed Oct 17 04:30:49 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 17 Oct 2007 13:30:49 +0200 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <200710161635.38818.eddiem@sgi.com> References: <200710161635.38818.eddiem@sgi.com> Message-ID: <20071017113049.GA6329@sashak.voltaire.com> On 16:35 Tue 16 Oct , Edward Mascarenhas wrote: > > Has anyone seen issues with running OpenSM on large (1500+ nodes) > clusters? 
> > We are seeing 1000s of the following message in the system log > > __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is > already overloaded with 6736 messages and queue time of:10006[msec] I guess you see this during fabric bringup when SA processor is not available yet. Which version of OpenSM you are using - we did some improvements in this area in recent versions (partially in OFED-1.2)? > It seems like a huge number of datagrams are being generated resulting > in increased time to bring up the fabric. > > Is there a threshold of cluster size beyond which we are likely to see > these messages. > > How many MADs are generated during bring up? A lot :). Exact number will depend on exact topology and requested configuration. Could you send us output of ibnetdiscover? > What is the largest cluster size for which OpenSM has been tried by > others? I hope others will answer. Largest cluster known for me was Thunderbird (4480 nodes), there are some details: http://openfabrics.org/archives/nov2006sc/ofa_devel_111606.pdf Sasha From tziporet at dev.mellanox.co.il Wed Oct 17 04:52:37 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 17 Oct 2007 13:52:37 +0200 Subject: [ewg] RE: [ofa-general] OFED 1.3 Alpha release is available In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> Message-ID: <4715F785.6080105@mellanox.co.il> Scott Weitzenkamp (sweitzen) wrote: >> 3. IPoIB >> o Stateless offloads >> o NAPI is enabled default >> > > How does one measure these changes using tools like netperf or iperf? > We use netperf and iperf > Do I need a specific HCA type? > NAPI - for all HCAs Stateless offloads are available only for UD mode: 1. Checksum offload: Uses HW ability to generate/validate checksum. Available in Arbel and ConnectX 2. LSO - available in ConnectX only 3. LRO - SW mostly but depends on checksum offload - thus supported in Arbel and ConnectX > >> 4. 
SDP - these are not yet in the alpha release >> o Keep-alive >> o Asynch IO >> o Send Zero Copy >> > > If it didn't make it into alpha, perhaps it should not go into 1.3, so > we can hold the release date better? > Since the code is running and tested and Jim just has not succeed to arrange it all in the git on time I think it should be in I cc Jim so he can answer in more details on the status. > What ever happened to NFS RDMA? > > > No one agreed to become a maintainer of NFSoRDMA for OFED Tziporet From jlentini at netapp.com Wed Oct 17 06:31:09 2007 From: jlentini at netapp.com (James Lentini) Date: Wed, 17 Oct 2007 09:31:09 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] DAT 2.0 change to include extended event data with all builds. In-Reply-To: <47153CC9.4000701@ichips.intel.com> References: <47153CC9.4000701@ichips.intel.com> Message-ID: This makes to me. On Tue, 16 Oct 2007, Arlin Davis wrote: > James, > > Can you comment on the following patch to DAT v2.0. Do you see any > issues? If ok, I would like to get this accepted and rolled back > into the specification. > > Thanks, > > -arlin > > > -- > Modify dat.h dat_event to include event_extension_data[8]. > Extend struct dat_event outside of extension build > switch to enable non-extended applications to work > with extended libraries. Otherwise, there is a potential > for the event callee to write back too much event data > and exceed callers non-extended event buffer. 
> -- > > Signed-off by: Arlin Davis > > --- a/dat/include/dat/dat.h > +++ b/dat/include/dat/dat.h > @@ -944,9 +944,7 @@ typedef struct dat_event > DAT_EVENT_NUMBER event_number; > DAT_EVD_HANDLE evd_handle; > DAT_EVENT_DATA event_data; > -#ifdef DAT_EXTENSIONS > DAT_UINT64 event_extension_data[8]; > -#endif /* DAT_EXTENSIONS */ > } DAT_EVENT; > > /* Provider/registration info */ > From jlentini at netapp.com Wed Oct 17 06:33:30 2007 From: jlentini at netapp.com (James Lentini) Date: Wed, 17 Oct 2007 09:33:30 -0400 (EDT) Subject: [ofa-general] Re: [PATCH] DAT 2.0 change to include extended event data with all builds. In-Reply-To: References: <47153CC9.4000701@ichips.intel.com> Message-ID: On Wed, 17 Oct 2007, James Lentini wrote: > > This makes to me. ^ sense :) > On Tue, 16 Oct 2007, Arlin Davis wrote: > > > James, > > > > Can you comment on the following patch to DAT v2.0. Do you see any > > issues? If ok, I would like to get this accepted and rolled back > > into the specification. > > > > Thanks, > > > > -arlin > > > > > > -- > > Modify dat.h dat_event to include event_extension_data[8]. > > Extend struct dat_event outside of extension build > > switch to enable non-extended applications to work > > with extended libraries. Otherwise, there is a potential > > for the event callee to write back too much event data > > and exceed callers non-extended event buffer. 
> > -- > > > > Signed-off by: Arlin Davis > > > > --- a/dat/include/dat/dat.h > > +++ b/dat/include/dat/dat.h > > @@ -944,9 +944,7 @@ typedef struct dat_event > > DAT_EVENT_NUMBER event_number; > > DAT_EVD_HANDLE evd_handle; > > DAT_EVENT_DATA event_data; > > -#ifdef DAT_EXTENSIONS > > DAT_UINT64 event_extension_data[8]; > > -#endif /* DAT_EXTENSIONS */ > > } DAT_EVENT; > > > > /* Provider/registration info */ > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eli at mellanox.co.il Wed Oct 17 06:47:17 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 17 Oct 2007 15:47:17 +0200 Subject: [ofa-general] [PATCH] IB/ipoib: CQ coalescing for connected mode Message-ID: <1192628837.16674.19.camel@mtls03> Add CQ moderation for Connected mode QPs Add predefined CQ coalescing parameters to CQs used by connected mode QPs for reporting send completions. Using CQ coalescing was proven to have good effect in connected mode. Signed-off-by: Eli Cohen --- This patch has been pushed into ofa.git tree to kernel_patches/fixes. It relies on the previously sent patch that added the ib_modify_cq verb. 
Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-17 15:20:17.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-10-17 18:13:37.000000000 +0200 @@ -914,6 +914,9 @@ static int ipoib_cm_tx_init(struct ipoib goto err_cq; } + if (ib_modify_cq(p->cq, IPOIB_CQ_COUNT, IPOIB_CQ_PERIOD)) + ipoib_dbg(priv, "modify CQ failed\n"); + ret = ib_req_notify_cq(p->cq, IB_CQ_NEXT_COMP); if (ret) { ipoib_warn(priv, "failed to request completion notification: %d\n", ret); Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-17 15:20:19.000000000 +0200 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-10-17 18:07:36.000000000 +0200 @@ -794,4 +794,9 @@ extern int ipoib_debug_level; #define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff) +enum { + IPOIB_CQ_COUNT = 16, + IPOIB_CQ_PERIOD = 10, +}; + #endif /* _IPOIB_H */
for Windows XP: Retail price now - $29.95; Our now just - $10.95 *Adobe Acrobat 8.0 Professional for Mac: Retail price for now - $449.00; Our only for today - $79.95 *Macromedia Studio 8: Retail price for this time - $999.00; Our only - $99.95 *Adobe Photoshop CS2 V 9.0: Retail price for now - $599.00; Our just - $69.95 *Microsoft Money Home & Business 7: Retail price this day - $89.90; - $39.95 *Microsoft Visual Basic 6.0 Professional: Retail price now - $419.00; - $49.95 COME TO US! MoreThan those I shed for him. Twas pretty though plagueTo. The clearness of our deservings. IntWhich might be felt that we. Th other and thine eyesSee it. Mine aimBut know I think and. Ratherthan lack it where there. He does weigh too light my. -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Wed Oct 17 07:38:06 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 17 Oct 2007 09:38:06 -0500 Subject: [ofa-general] librdmacm port selection for rdma_bind_addr() In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403028B047A@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA84030287C8DD@G3W0634.americas.hpqcorp.net> <4714E970.3060507@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403028B047A@G3W0634.americas.hpqcorp.net> Message-ID: <47161E4E.1000005@opengridcomputing.com> Tang, Changqing wrote: > Below is the piece of rping.c code, how do I pick the returned port ? Do > you reset sin.sin_port inside rdma_bind_addr() if I pass 0 to > sin.sin_port ? > Yes. > I need the returned port to tell client side to call > rdma_resolve_addr(). If I am right, rdma_resolve_addr() needs dest port > number. > Sean, Does resolve_addr really need anything more than the ip address? For iWARP it doesn't. 
> > static int rping_bind_server(struct rping_cb *cb) > { > struct sockaddr_in sin; > int ret; > > memset(&sin, 0, sizeof(sin)); > sin.sin_family = AF_INET; > sin.sin_addr.s_addr = cb->addr; > sin.sin_port = 0; ///////////cb->port; > > ret = rdma_bind_addr(cb->cm_id, (struct sockaddr *) &sin); > if (ret) { > fprintf(stderr, "rdma_bind_addr error %d\n", ret); > return ret; > } > DEBUG_LOG("rdma_bind_addr successful\n"); > > --CQ > >> -----Original Message----- >> From: Sean Hefty [mailto:mshefty at ichips.intel.com] >> Sent: Tuesday, October 16, 2007 11:40 AM >> To: Tang, Changqing >> Cc: general at lists.openfabrics.org >> Subject: Re: [ofa-general] librdmacm port selection for >> rdma_bind_addr() >> >>> Is there a way to let system choose a port for me ? >> like TCP/IP, if >>> port is set to 0, system will return an unused port. >> Yes - binding to port 0 will return a usable port. >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Thomas.Talpey at netapp.com Wed Oct 17 07:42:13 2007 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 17 Oct 2007 10:42:13 -0400 Subject: [ofa-general] OFED 1.3 Alpha release is available In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> Message-ID: At 08:46 PM 10/16/2007, Scott Weitzenkamp (sweitzen) wrote: >If it didn't make it into alpha, perhaps it should not go into 1.3, so >we can hold the release date better? > >What ever happened to NFS RDMA? The NFS/RDMA client is queued for 2.6.24-rc1, it has been in the NFS client maintainer's tree for some time and was pulled by Linus last week. I haven't announced it yet because it appears the 2.6.24 merge window is a bit of a mess! But I expect it to contain the client. 
If you want to see it in its current state, go to git://linux-nfs.org/nfs-2.6 I thought OFED1.3 was intended to be 2.6.24-based. In that case why would it exclude other 2.6.24 content simply because it wasn't there for an early Alpha? Tom. From changquing.tang at hp.com Wed Oct 17 07:49:38 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Wed, 17 Oct 2007 14:49:38 -0000 Subject: [ofa-general] librdmacm port selection for rdma_bind_addr() In-Reply-To: <47161E4E.1000005@opengridcomputing.com> References: <349DCDA352EACF42A0C49FA6DCEA84030287C8DD@G3W0634.americas.hpqcorp.net> <4714E970.3060507@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403028B047A@G3W0634.americas.hpqcorp.net> <47161E4E.1000005@opengridcomputing.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403028B07EB@G3W0634.americas.hpqcorp.net> Thanks, For all the sample code, the call rdma_resolve_addr() specify a port number, which is the same as port number when calling rdma_bind_addr() on the other side, I hope port number is not necessary for rdma_resolve_addr() call. --CQ > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Wednesday, October 17, 2007 9:38 AM > To: Tang, Changqing > Cc: Sean Hefty; general at lists.openfabrics.org > Subject: Re: [ofa-general] librdmacm port selection for > rdma_bind_addr() > > > > Tang, Changqing wrote: > > Below is the piece of rping.c code, how do I pick the > returned port ? > > Do you reset sin.sin_port inside rdma_bind_addr() if I pass 0 to > > sin.sin_port ? > > > > Yes. > > > I need the returned port to tell client side to call > > rdma_resolve_addr(). If I am right, rdma_resolve_addr() needs dest > > port number. > > > > Sean, Does resolve_addr really need anything more than the ip > address? > For iWARP it doesn't. 
> > > > > > static int rping_bind_server(struct rping_cb *cb) { > > struct sockaddr_in sin; > > int ret; > > > > memset(&sin, 0, sizeof(sin)); > > sin.sin_family = AF_INET; > > sin.sin_addr.s_addr = cb->addr; > > sin.sin_port = 0; ///////////cb->port; > > > > ret = rdma_bind_addr(cb->cm_id, (struct sockaddr *) &sin); > > if (ret) { > > fprintf(stderr, "rdma_bind_addr error %d\n", ret); > > return ret; > > } > > DEBUG_LOG("rdma_bind_addr successful\n"); > > > > --CQ > > > >> -----Original Message----- > >> From: Sean Hefty [mailto:mshefty at ichips.intel.com] > >> Sent: Tuesday, October 16, 2007 11:40 AM > >> To: Tang, Changqing > >> Cc: general at lists.openfabrics.org > >> Subject: Re: [ofa-general] librdmacm port selection for > >> rdma_bind_addr() > >> > >>> Is there a way to let system choose a port for me ? > >> like TCP/IP, if > >>> port is set to 0, system will return an unused port. > >> Yes - binding to port 0 will return a usable port. > >> > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > From jackm at dev.mellanox.co.il Wed Oct 17 07:58:02 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 17 Oct 2007 16:58:02 +0200 Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes In-Reply-To: References: <20070726014931.GL10235@sgi.com> Message-ID: <200710171658.03184.jackm@dev.mellanox.co.il> On Tuesday 16 October 2007 05:20, Roland Dreier wrote: >  int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags) >  { >         struct mthca_cq *cq = to_mcq(ibcq); > -       __be32 doorbell[2]; > -       u32 sn; > -       __be32 ci; > - > -       sn = cq->arm_sn & 3; > -       ci = cpu_to_be32(cq->cons_index); > +       __be32 db_rec[2]; > +       u32 dbhi; > +       u32 sn = cq->arm_sn & 3; 
>   > -       doorbell[0] = ci; > -       doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | > -                                 ((flags & IB_CQ_SOLICITED_MASK) == > -                                  IB_CQ_SOLICITED ? 1 : 2)); > +       db_rec[0] = cpu_to_be32(cq->cons_index); > +       db_rec[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) | > +                               ((flags & IB_CQ_SOLICITED_MASK) == > +                                IB_CQ_SOLICITED ? 1 : 2)); >   > -       mthca_write_db_rec(doorbell, cq->arm_db); > +       mthca_write_db_rec(db_rec, cq->arm_db); >   Patch looks good, but don't you have the same 64-bit alignment problem in mthca_write_db_rec() ? - Jack From moshek at voltaire.com Wed Oct 17 08:26:28 2007 From: moshek at voltaire.com (Moshe Kazir) Date: Wed, 17 Oct 2007 17:26:28 +0200 Subject: [ofa-general] RE: [ewg] OFED 1.3 Alpha release is available In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> Message-ID: <39C75744D164D948A170E9792AF8E7CA4D2B8A@exil.voltaire.com> To save other's time -> Please add to the OFED_Instalation_Guide.txt That for compile and install of the open-iscsi rpm on PPC64 SLES 10 SP 1 , bison and yylex are required. Also, OFED-1.2 install.sh checked for the availability of the requiered packages before the start of compile and install. Will this ability be added to OFED-1.3 ? 
Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Monday, October 15, 2007 4:31 PM To: ewg at lists.openfabrics.org Cc: general at lists.openfabrics.org Subject: [ewg] OFED 1.3 Alpha release is available Hi, OFED 1.3 Alpha release is available on http://www.openfabrics.org/builds/ofed-1.3/release/ File: OFED-1.3-alpha2.tgz To get BUILD_ID run ofed_info Please report any issues in bugzilla https://bugs.openfabrics.org/ The beta release is expected on 29 October Tziporet & Vlad ======================================================================== Release information: -------------------- OS support: Novell: - SLES10 - SLES10 SP1 Redhat: - Redhat EL4 up4 and up5 - Redhat EL5 kernel.org: - 2.6.23 Note: Fedora C6 and Open SUSE 10.2 and Redhat EL4 up3 are not part of the official list. We keep the backport patches for these OSes and make sure OFED compile and loaded properly but will not do full QA cycle. Systems: * x86_64 * x86 * ia64 * ppc64* *Note: On PPC64 installation fails on the packages: ibutils, mvapich2, MPI tests over Open MPI. Main Changes from OFED 1.2.5 ============================ 1. General changes o Kernel code based on 2.6.23 o Quality of Service support in OpenSM, CMA, IPoIB, SRP o Added Neteffect driver (nes) 2. Package and install o There is a new install script. See OFED_Installation_Guide.txt for more details on the new installation and build procedures. Note: There is an easy way to install in one command line without a conf file, and without the interactive mode. Example: ./install.pl --all --prefix /usr/local o User space packages are now in different source RPMs (as opposed to one source RPM in previous OFED releases). 
o The option for a build without installing is not supported any more. o Added an option to generate tarball with kernel sources for each kernel. 3. IPoIB o Stateless offloads o IGMP for user-space multicast IB o NAPI is enabled default o High availability is supported via the bonding module only (removed ipoib tool scripts) 4. SDP - these are not yet in the alpha release o Keep-alive o Asynch IO o Send Zero Copy 5. iSER o ??? 6. qlgc_vnic o Update for PathScale HCA 7. RDS o RDMA API (using FMRs) - under work 8. uDAPL - these are not yet in the alpha release o Add DAT 2.0 API run-time library and development support. uDAPL 2.0 will include IB extensions for IB rdma write with immediate data and IB atomic operations. o Both uDAPL 1.2 and 2.0 packages will be provided and will co-exist 9. Libraries a. libibverbs 1.1.1 o Added Extended RC transport type b. librdmacm (uCMA) 1.0.3 10. OSM o More routing performance improvements o Even more speedups o Better packaging/installation o "Native" daemon mode o Performance management o Quality of Service manager: Based on IBTA annex 11. Management o Multiple partitions 12. MPI: a. OSU MVAPICH o Version is 0.9.9 - same as in 1.2.5 - to be replaced later b. Open MPI o Version is 1.2.2-1 - same as in 1.2.5 - to be replaced later c. OSU MVAPICH2 o Version was updated to 1.0-1. Tasks that should be completed for the beta release: ---------------------------------------------------- 1. Integrate all SDP features 2. Complete RDS work 3. Apply patches that fix warning of backport patches 4. Fix compilation problems on PPC 5. Add qperf test from Qlogic 6. Rebase kernel code on 2.6.24 rc1 (depending it's availability) 7. Support RHEL 5 up1 8. 
SPEC files should be part of each user space package _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From mshefty at ichips.intel.com Wed Oct 17 08:57:13 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 17 Oct 2007 08:57:13 -0700 Subject: [ofa-general] librdmacm port selection for rdma_bind_addr() In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403028B07EB@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA84030287C8DD@G3W0634.americas.hpqcorp.net> <4714E970.3060507@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403028B047A@G3W0634.americas.hpqcorp.net> <47161E4E.1000005@opengridcomputing.com> <349DCDA352EACF42A0C49FA6DCEA8403028B07EB@G3W0634.americas.hpqcorp.net> Message-ID: <471630D9.60204@ichips.intel.com> > Thanks, For all the sample code, the call rdma_resolve_addr() specify > a port number, which is the same as port number when calling > rdma_bind_addr() > on the other side, Correct - rdma_resolve_addr() does a couple of things: 1. It calls rdma_bind_addr() for the local id if it has not already been called. 2. It sets the destination address, including port number. 3. For IB, it maps the destination IP address to a DGID. > I hope port number is not necessary for rdma_resolve_addr() call. This is the only call that sets the destination address and port number. Although the IP address mapping doesn't use the port number, it would still need to be set before calling rdma_resolve_route() to support QoS. 
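The thread compares rdma_bind_addr() with port 0 to the TCP/IP behavior of binding to port 0. As a minimal sketch of that TCP side only (plain POSIX sockets, not librdmacm), the port the system picked can be read back with getsockname() and then advertised to clients out of band. The helper name bind_ephemeral_port() is hypothetical. With librdmacm the corresponding read-back would presumably be rdma_get_src_port() on the bound rdma_cm_id.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a TCP socket to port 0 on the loopback address and return the
 * port number the kernel selected (host byte order), or -1 on error.
 * bind_ephemeral_port() is a hypothetical helper, not part of any library.
 */
int bind_ephemeral_port(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int port = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;	/* 0 asks the system to choose an unused port */

	/* After a successful bind, getsockname() reports the chosen port. */
	if (bind(fd, (struct sockaddr *) &sin, sizeof(sin)) == 0 &&
	    getsockname(fd, (struct sockaddr *) &sin, &len) == 0)
		port = ntohs(sin.sin_port);

	close(fd);
	return port;
}
```

A server would call this kind of read-back once after binding and hand the result to its clients, which then pass it as the destination port when resolving the address.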
- Sean From swise at opengridcomputing.com Wed Oct 17 09:09:15 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 17 Oct 2007 11:09:15 -0500 Subject: [ofa-general] librdmacm port selection for rdma_bind_addr() In-Reply-To: <471630D9.60204@ichips.intel.com> References: <349DCDA352EACF42A0C49FA6DCEA84030287C8DD@G3W0634.americas.hpqcorp.net> <4714E970.3060507@ichips.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403028B047A@G3W0634.americas.hpqcorp.net> <47161E4E.1000005@opengridcomputing.com> <349DCDA352EACF42A0C49FA6DCEA8403028B07EB@G3W0634.americas.hpqcorp.net> <471630D9.60204@ichips.intel.com> Message-ID: <471633AB.2090904@opengridcomputing.com> Sean Hefty wrote: >> Thanks, For all the sample code, the call rdma_resolve_addr() specify >> a port number, which is the same as port number when calling >> rdma_bind_addr() >> on the other side, > > Correct - rdma_resolve_addr() does a couple of things: > > 1. It calls rdma_bind_addr() for the local id if it has not already been > called. > 2. It sets the destination address, including port number. > 3. For IB, it maps the destination IP address to a DGID. > >> I hope port number is not necessary for rdma_resolve_addr() call. > > This is the only call that sets the destination address and port number. > Although the IP address mapping doesn't use the port number, it would > still need to be set before calling rdma_resolve_route() to support QoS. > Right. And if you have a service that's using ephemeral ports, then you need some way to advertise the port chosen by the transport to your clients. This would be out of band with respect to the rdma connection(s). Steve. From rick.jones2 at hp.com Wed Oct 17 09:17:30 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 17 Oct 2007 09:17:30 -0700 Subject: [ofa-general] Question on IB RDMA read timing. 
In-Reply-To: <20071017075631.GA5089@minantech.com> References: <20071017030955.GB15679@vt.edu> <4715BD44.2090200@dev.mellanox.co.il> <20071017075631.GA5089@minantech.com> Message-ID: <4716359A.9030303@hp.com> Gleb Natapov wrote: > On Wed, Oct 17, 2007 at 09:44:04AM +0200, Dotan Barak wrote: > >>Hi. >> >>Bharath Ramesh wrote: >> >>>I wrote a simple test program to actual time it takes for RDMA read over >>>IB. I find a huge difference in the numbers returned by timing. I was >>>wondering if someone could help me in finding what I might be doing >>>wrong in the way I am measuring the time. >>> >>>Steps I do for timing is as follows. >>> >>>1) Create the send WR for RDMA Read. >>>2) call gettimeofday () >>>3) ibv_post_send () the WR >>>4) Loop around ibv_poll_cq () till I get the completion event. >>>5) call gettimeofday (); >>> >>>The difference in time would give me the time it takes to perform RDMA >>>read over IB. I constantly get around 35 microsecs as the timing which >>>seems to be really large considering the latency of IB. I am measuring >>>the time for transferring 4K bytes of data. If anyone wants I can send >>>the code that I have written. I am not subscribed to the list, if you >>>could please cc me in the reply. >>> >> >>I don't familiar with the implementation of gettimeofday, but i believe >>that this function do a context switch >>(and/or spend some time in the function to fill the struct that you supply >>to it) >> > > Here: > struct timeval tv_s, tv_e; > gettimeofday(&tv_s, NULL); > gettimeofday(&tv_e, NULL); > printf("%d\n", tv_e.tv_usec - tv_s.tv_usec); > Compile and run it. The overhead of two calls to gettimeofday is at most > 1 microsecond. Unless there is contention with other gettimeofday() calls on the system - on SMP etc there are locks involved in making sure that each call to gettimeofday() does not go backwards and the like, and on some systems, with enough callers to gettimeofday() one can run into lock contention. 
So, while 99 times out of ten gettimeofday() may be "cheap" it really isn't a good idea to ass-u-me it will always be cheap. And besides, the most efficient call is the one which is never made, so the suggestion to perform N operations between the calls is probably still a good one. Even for measuring the overhead of gettimeofday() :) Also, while it may not be so much the case these days, certainly in the past there were "gettimeofday()" implementations which may have rather coarse granularity. Now, some CPUs offer interval timer/registers/whatever - for example the ITC on Itanium or CR16 on PA-RISC, I'm sure there are other examples - which can be used for measuring very short things. Under some OSes - HP-UX and Solaris are two with which I am familiar - there is a "gethrtime()" interface which uses those without the user having to deal with inline assembly. That should have lower overhead than gettimeofday() although even then it would probably be best, if one is indeed going for the average, to use those to measure the time to perform N operations. If one does use gethrtime(), it should only be for measuring short things, and those "timestamps" should not be interspersed with those from gettimeofday(). The two are really separate "timespaces" if you will. Gethrtime() does not get tick adjustment like gettimeofday() does/can. rick jones FWIW, netperf uses gettimeofday() to measure the overall runtime of a netperf test, and gethrtime() (when available) to measure the individual times for "transactions" such as the exchange of a request/response, or time spend in send() or recv() or whatnot. > -- > Gleb. 
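The suggestion to perform N operations between a single pair of gettimeofday() calls can be sketched as below. This is an illustration under assumptions, not code from the thread: elapsed_usec(), avg_usec_per_op() and call_gettimeofday() are hypothetical helper names. Unlike the quoted two-call snippet, the subtraction uses both tv_sec and tv_usec, so it stays correct when the interval crosses a second boundary.

```c
#include <stddef.h>
#include <sys/time.h>

/* Elapsed microseconds from *start to *end, using tv_sec and tv_usec. */
long long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
	return (long long) (end->tv_sec - start->tv_sec) * 1000000LL +
	       (long long) (end->tv_usec - start->tv_usec);
}

/*
 * Average cost in microseconds of one call to op(), measured over n calls
 * bracketed by a single pair of gettimeofday() calls, so the timer
 * overhead is amortized over all n operations.
 */
double avg_usec_per_op(void (*op)(void), int n)
{
	struct timeval tv_s, tv_e;
	int i;

	if (n <= 0)
		return 0.0;

	gettimeofday(&tv_s, NULL);
	for (i = 0; i < n; i++)
		op();
	gettimeofday(&tv_e, NULL);

	return (double) elapsed_usec(&tv_s, &tv_e) / n;
}

/* Example operation: one gettimeofday() call, to estimate its own cost. */
void call_gettimeofday(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
}
```

Calling avg_usec_per_op(call_gettimeofday, 1000000) gives a rough per-call overhead on the local machine. For the RDMA read case, op() would post one work request and poll for its completion.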
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bbr at lampreynetworks.com Wed Oct 17 10:05:40 2007 From: bbr at lampreynetworks.com (Barry Reinhold) Date: Wed, 17 Oct 2007 13:05:40 -0400 Subject: [ofa-general] Expected behavior Message-ID: <006801c810df$f3a1d580$dae58080$@com> During the OFA plugfest at UNH I came across a problem in which a verbs consumer application made a call to ibv_poll_cq after having called ibv_destroy_qp(). The application segfaults in a call to a module supplied by a verb provider that is invoked as a result of the call to ibv_destroy_qp(). As the writer of the application I am unclear as to ownership of this problem - is this behavior "badness" in the code of the verb provider, or is it an issue in my application? The application is processing an abortive teardown process in which it is attempting to terminate the RDMA stream and recover associated resources. The peer at the other end may be doing the same thing at the "same" time. The application, when aborting, does the following: 1. Sets the qp_state to IBV_QPS_ERR 2. Sleeps for a second 3. Calls ibv_destroy_qp 4. Calls ibv_destroy_cq The event processing thread is doing the following: 1. Calls ibv_get_cq_event 2. Calls ibv_req_notify_cq 3. while (ibv_poll_cq(cq_event, 1, xx) == 1) {}; 4. more stuff Does the application need to ensure that ibv_poll_cq is never called after the associated qp is destroyed? Barry Reinhold (603) 868-8411 bbr at lampreynetworks.com -------------- next part -------------- An HTML attachment was scrubbed...
URL: From rdreier at cisco.com Wed Oct 17 10:10:16 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 17 Oct 2007 10:10:16 -0700 Subject: [ofa-general] Expected behavior In-Reply-To: <006801c810df$f3a1d580$dae58080$@com> (Barry Reinhold's message of "Wed, 17 Oct 2007 13:05:40 -0400") References: <006801c810df$f3a1d580$dae58080$@com> Message-ID: > Does the application need to ensure that ibv_poll_cq is never called after > the associated qp is destroyed? No, definitely not, since a CQ may be attached to arbitrarily many QPs, and clearly you need to be able to poll the CQ even after one of the QPs is destroyed. However, you say your app does: > 1. Sets the qp_state to IBV_QPS_ERR > 2. Sleeps for a second > 3. Calls ibv_destroy_qp > 4. Calls ibv_destroy_cq and if you are destroying a CQ, then you definitely do need to make sure you don't poll the CQ after calling ibv_destroy_cq. - R. From weiny2 at llnl.gov Wed Oct 17 11:38:53 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 17 Oct 2007 11:38:53 -0700 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <200710161635.38818.eddiem@sgi.com> References: <200710161635.38818.eddiem@sgi.com> Message-ID: <20071017113853.1b8e5946.weiny2@llnl.gov> On Tue, 16 Oct 2007 16:35:38 -0700 Edward Mascarenhas wrote: > > Has anyone seen issues with running OpenSM on large (1500+ nodes) > clusters? > > We are seeing 1000s of the following message in the system log > > __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is > already overloaded with 6736 messages and queue time of:10006[msec] > > It seems like a huge number of datagrams are being generated resulting > in increased time to bring up the fabric. > > Is there a threshold of cluster size beyond which we are likely to see > these messages. > > How many MADs are generated during bring up? > > What is the largest cluster size for which OpenSM has been tried by > others? > We have atlas running with 1152 nodes.
OpenSM is able to route it with up/down routing in ~2min. We don't see messages like you state above. But we have been using the OpenSM from OFED 1.2. Hope this helps, Ira From sean.hefty at intel.com Wed Oct 17 11:39:40 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 17 Oct 2007 11:39:40 -0700 Subject: [ofa-general] [PATCH] librdmacm: provide wrapper functions to extract src/dst addresses Message-ID: <000001c810ed$14147e00$28c8180a@amr.corp.intel.com> Provide wrapper functions to retrieve the source and destination addresses. This is based on feedback from Doug Ledford. Signed-off-by: Sean Hefty --- If there are no objections, I would like to include this change in the next release of librdmacm, and request that it go into OFED 1.3. Makefile.am | 2 ++ include/rdma/rdma_cma.h | 10 ++++++++++ man/rdma_bind_addr.3 | 3 ++- man/rdma_cm.7 | 1 + man/rdma_get_dst_addr.3 | 16 ++++++++++++++++ man/rdma_get_dst_port.3 | 3 ++- man/rdma_get_src_addr.3 | 17 +++++++++++++++++ man/rdma_get_src_port.3 | 3 ++- man/rdma_resolve_addr.3 | 3 ++- 9 files changed, 54 insertions(+), 4 deletions(-) diff --git a/Makefile.am b/Makefile.am index 1195bd9..c688283 100644 --- a/Makefile.am +++ b/Makefile.am @@ -49,6 +49,8 @@ man_MANS = \ man/rdma_get_devices.3 \ man/rdma_get_src_port.3 \ man/rdma_get_dst_port.3 \ + man/rdma_get_src_addr.3 \ + man/rdma_get_dst_addr.3 \ man/rdma_join_multicast.3 \ man/rdma_leave_multicast.3 \ man/rdma_listen.3 \ diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h index b0848d5..8ebcaf6 100644 --- a/include/rdma/rdma_cma.h +++ b/include/rdma/rdma_cma.h @@ -494,6 +494,16 @@ static inline uint16_t rdma_get_dst_port(struct rdma_cm_id *id) ((struct sockaddr_in *) &id->route.addr.dst_addr)->sin_port; } +static inline struct sockaddr *rdma_get_src_addr(struct rdma_cm_id *id) +{ + return &id->route.addr.src_addr; +} + +static inline struct sockaddr *rdma_get_dst_addr(struct rdma_cm_id *id) +{ + return &id->route.addr.dst_addr; +} + /** * 
rdma_get_devices - Get list of RDMA devices currently available. * @num_devices: If non-NULL, set to the number of devices returned. diff --git a/man/rdma_bind_addr.3 b/man/rdma_bind_addr.3 index bed7f0b..dc7a868 100644 --- a/man/rdma_bind_addr.3 +++ b/man/rdma_bind_addr.3 @@ -25,4 +25,5 @@ address. If used to bind to port 0, the rdma_cm will select an available port and return it to the user. .SH "SEE ALSO" -rdma_create_id(3), rdma_listen(3), rdma_resolve_addr(3), rdma_create_qp(3) +rdma_create_id(3), rdma_listen(3), rdma_resolve_addr(3), rdma_create_qp(3), +rdma_get_src_addr(3), rdma_get_src_port(3) diff --git a/man/rdma_cm.7 b/man/rdma_cm.7 index bfb3493..2e07706 100644 --- a/man/rdma_cm.7 +++ b/man/rdma_cm.7 @@ -110,5 +110,6 @@ rdma_resolve_route(3), rdma_connect(3), rdma_listen(3), rdma_accept(3), rdma_reject(3), rdma_join_multicast(3), rdma_leave_multicast(3), rdma_notify(3), rdma_ack_cm_event(3), rdma_disconnect(3), rdma_destroy_qp(3), rdma_destroy_id(3), rdma_destroy_event_channel(3), rdma_get_devices(3), rdma_free_devices(3), +rdma_get_dst_addr(3), rdma_get_src_addr(3), rdma_get_dst_port(3), rdma_get_src_port(3), rdma_set_option(3) ucmatose(1), udaddy(1), mckey(1), rping(1) diff --git a/man/rdma_get_dst_addr.3 b/man/rdma_get_dst_addr.3 new file mode 100644 index 0000000..054445f --- /dev/null +++ b/man/rdma_get_dst_addr.3 @@ -0,0 +1,16 @@ +.TH "RDMA_GET_DST_ADDR" 3 "2007-05-15" "librdmacm" "Librdmacm Programmer's Manual" librdmacm +.SH NAME +rdma_get_dst_addr \- Returns the remote IP address of a bound rdma_cm_id. +.SH SYNOPSIS +.B "#include " +.P +.B "struct sockaddr *" rdma_get_dst_addr +.BI "(struct rdma_cm_id *" id ");" +.SH ARGUMENTS +.IP "id" 12 +RDMA identifier. +.SH "DESCRIPTION" +Returns the remote IP address associated with an rdma_cm_id. 
+.SH "SEE ALSO" +rdma_resolve_addr(3), rdma_get_src_port(3), rdma_get_dst_port(3), +rdma_get_src_addr(3) diff --git a/man/rdma_get_dst_port.3 b/man/rdma_get_dst_port.3 index 88e6ec2..658c9f7 100644 --- a/man/rdma_get_dst_port.3 +++ b/man/rdma_get_dst_port.3 @@ -13,4 +13,5 @@ RDMA identifier. Returns the remote port number for an rdma_cm_id that has been bound to a remote address. .SH "SEE ALSO" -rdma_connect(3), rdma_accept(3), rdma_get_cm_event(3), rdma_get_src_port(3) +rdma_connect(3), rdma_accept(3), rdma_get_cm_event(3), rdma_get_src_port(3), +rdma_get_src_addr(3), rdma_get_dst_addr(3) diff --git a/man/rdma_get_src_addr.3 b/man/rdma_get_src_addr.3 new file mode 100644 index 0000000..fa9b256 --- /dev/null +++ b/man/rdma_get_src_addr.3 @@ -0,0 +1,17 @@ +.TH "RDMA_GET_SRC_ADDR" 3 "2007-05-15" "librdmacm" "Librdmacm Programmer's Manual" librdmacm +.SH NAME +rdma_get_src_addr \- Returns the local IP address of a bound rdma_cm_id. +.SH SYNOPSIS +.B "#include " +.P +.B "struct sockaddr *" rdma_get_src_addr +.BI "(struct rdma_cm_id *" id ");" +.SH ARGUMENTS +.IP "id" 12 +RDMA identifier. +.SH "DESCRIPTION" +Returns the local IP address for an rdma_cm_id that has been bound to +a local device. +.SH "SEE ALSO" +rdma_bind_addr(3), rdma_resolve_addr(3), rdma_get_src_port(3), +rdma_get_dst_port(3), rdma_get_dst_addr(3) diff --git a/man/rdma_get_src_port.3 b/man/rdma_get_src_port.3 index 63ee564..88f0920 100644 --- a/man/rdma_get_src_port.3 +++ b/man/rdma_get_src_port.3 @@ -13,4 +13,5 @@ RDMA identifier. Returns the local port number for an rdma_cm_id that has been bound to a local address. .SH "SEE ALSO" -rdma_bind_addr(3), rdma_resolve_addr(3), rdma_get_dst_port(3) +rdma_bind_addr(3), rdma_resolve_addr(3), rdma_get_dst_port(3), +rdma_get_src_addr(3), rdma_get_dst_addr(3) diff --git a/man/rdma_resolve_addr.3 b/man/rdma_resolve_addr.3 index 32cd5cf..a9b7f61 100644 --- a/man/rdma_resolve_addr.3 +++ b/man/rdma_resolve_addr.3 @@ -33,4 +33,5 @@ an RDMA device. 
This call is typically made from the active side of a connection before calling rdma_resolve_route and rdma_connect. .SH "SEE ALSO" rdma_create_id(3), rdma_resolve_route(3), rdma_connect(3), rdma_create_qp(3), -rdma_get_cm_event(3), rdma_bind_addr(3) +rdma_get_cm_event(3), rdma_bind_addr(3), rdma_get_src_port(3), +rdma_get_dst_port(3), rdma_get_src_addr(3), rdma_get_dst_addr(3) From sashak at voltaire.com Wed Oct 17 12:03:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 17 Oct 2007 21:03:51 +0200 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <20071017113853.1b8e5946.weiny2@llnl.gov> References: <200710161635.38818.eddiem@sgi.com> <20071017113853.1b8e5946.weiny2@llnl.gov> Message-ID: <20071017190351.GF6945@sashak.voltaire.com> On 11:38 Wed 17 Oct , Ira Weiny wrote: > On Tue, 16 Oct 2007 16:35:38 -0700 > Edward Mascarenhas wrote: > > > > > Has anyone seen issues with running OpenSM on large (1500+ nodes) > > clusters? > > > > We are seeing 1000s of the following message in the system log > > > > __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is > > already overloaded with 6736 messages and queue time of:10006[msec] > > > > It seems like a huge number of datagrams are being generated resulting > > in increased time to bring up the fabric. > > > > Is there a threshold of cluster size beyond which we are likely to see > > these messages. > > > > How many MADs are generated during bring up? > > > > What is the largest cluster size for which OpenSM has been tried by > > others? > > > > We have atlas running with 1152 nodes. OpenSM is able to route it with up/down > routing in ~2min. 2min is a lot for OpenSM with up/down. Is it pure OpenSM time or from bring-up power-on? Sasha > We don't see messages like you state above. But we have been using the OpenSM > from OFED 1.2. 
> > Hope this helps, > Ira > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From chrise at sgi.com Wed Oct 17 12:04:49 2007 From: chrise at sgi.com (Chris Elmquist) Date: Wed, 17 Oct 2007 14:04:49 -0500 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <20071017113853.1b8e5946.weiny2@llnl.gov> References: <200710161635.38818.eddiem@sgi.com> <20071017113853.1b8e5946.weiny2@llnl.gov> Message-ID: <20071017190449.GO20546@sgi.com> On Wednesday (10/17/2007 at 11:38AM -0700), Ira Weiny wrote: > On Tue, 16 Oct 2007 16:35:38 -0700 > Edward Mascarenhas wrote: > > > > > Has anyone seen issues with running OpenSM on large (1500+ nodes) > > clusters? > > [...] > > We have atlas running with 1152 nodes. OpenSM is able to route it with up/down > routing in ~2min. > > We don't see messages like you state above. But we have been using the OpenSM > from OFED 1.2. > > Hope this helps, > Ira Ira, Thank you for the information. Can you describe the configuration of the machine on which you run that OpenSM? How much horsepower and the type of HCA used? I suspect that the machine on which we run OpenSM may be underpowered for what we are asking of it... Chris -- Chris Elmquist mailto:chrise at sgi.com (651)683-3093 Silicon Graphics, Inc. 
Eagan, MN From sashak at voltaire.com Wed Oct 17 12:23:36 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 17 Oct 2007 21:23:36 +0200 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <20071017190351.GF6945@sashak.voltaire.com> References: <200710161635.38818.eddiem@sgi.com> <20071017113853.1b8e5946.weiny2@llnl.gov> <20071017190351.GF6945@sashak.voltaire.com> Message-ID: <20071017192336.GG6945@sashak.voltaire.com> On 21:03 Wed 17 Oct , Sasha Khapyorsky wrote: > > > > We have atlas running with 1152 nodes. OpenSM is able to route it with up/down > > routing in ~2min. > > 2min is a lot for OpenSM with up/down. Is it pure OpenSM time or from > bring-up power-on? With simulator (ibsim) and atlas I have 7+ seconds with master OpenSM: ------------------------------------------------- OpenSM 3.1.5 Command Line Arguments: Creating new log file Run Once Log File: ./osm.log ------------------------------------------------- OpenSM 3.1.5 Using default GUID 0x2c9020021a5ed Entering MASTER state SUBNET UP Exiting SM real 0m7.324s user 0m5.860s sys 0m2.980s Sasha From cdmaest at sandia.gov Wed Oct 17 12:40:22 2007 From: cdmaest at sandia.gov (Maestas, Christopher Daniel) Date: Wed, 17 Oct 2007 13:40:22 -0600 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <20071017192336.GG6945@sashak.voltaire.com> References: <200710161635.38818.eddiem@sgi.com> <20071017113853.1b8e5946.weiny2@llnl.gov> <20071017190351.GF6945@sashak.voltaire.com> <20071017192336.GG6945@sashak.voltaire.com> Message-ID: <347180497203A942A6AA82C85846CBC904D2705A@ES23SNLNT.srn.sandia.gov> We had some similar experiences at 4480 nodes using Open SM 3.0.0 svn tag 10188. With a clean mapping I think it took ~3-5 minutes. 
-----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha Khapyorsky Sent: Wednesday, October 17, 2007 1:24 PM To: Ira Weiny Cc: Edward Mascarenhas; general at lists.openfabrics.org Subject: Re: [ofa-general] Running OpenSM on large clusters On 21:03 Wed 17 Oct , Sasha Khapyorsky wrote: > > > > We have atlas running with 1152 nodes. OpenSM is able to route it > > with up/down routing in ~2min. > > 2min is a lot for OpenSM with up/down. Is it pure OpenSM time or from > bring-up power-on? With simulator (ibsim) and atlas I have 7+ seconds with master OpenSM: ------------------------------------------------- OpenSM 3.1.5 Command Line Arguments: Creating new log file Run Once Log File: ./osm.log ------------------------------------------------- OpenSM 3.1.5 Using default GUID 0x2c9020021a5ed Entering MASTER state SUBNET UP Exiting SM real 0m7.324s user 0m5.860s sys 0m2.980s Sasha _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From ralph.campbell at qlogic.com Wed Oct 17 12:56:59 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 17 Oct 2007 12:56:59 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <1192561522.5921.80.camel@hrosenstock-ws.xsigo.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <017601c81025$ddb7be20$1914a8c0@md.baymicrosystems.com> <001001c81026$ec087cc0$a865a8c0@catcher> <1192561522.5921.80.camel@hrosenstock-ws.xsigo.com> Message-ID: <1192651019.30322.446.camel@brick.pathscale.com> On Tue, 2007-10-16 at 12:05 -0700, Hal Rosenstock wrote: > On Tue, 2007-10-16 at 14:01 -0500, Steve Welch wrote: > > > -----Original Message----- > > > From: Suresh Shelvapille [mailto:suri at baymicrosystems.com] > > > Steve: > > > > > > This patch looks good on my system, meaning it did not break any of my > > > usual > > > tests (switch related). > > > > > Suri, thanks for testing this. > > > > Hal, I will resubmit the patch to the list to include the detailed > > description as we discussed previously. > > Great; thanks. It'd be nice to hear from the iPathers to really nail > this one.
> > -- Hal I have reviewed the patch but I noticed a number of problems with handling initial or final LID routed parts of the directed route SMP. I will be sending several small patches as soon as I finish testing them. The InfiniPath changes need to be part of the loopback of DR SMP responses from userspace so I will post a modified patch for that too. From hrosenstock at xsigo.com Wed Oct 17 13:03:41 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 17 Oct 2007 13:03:41 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <1192651019.30322.446.camel@brick.pathscale.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <017601c81025$ddb7be20$1914a8c0@md.baymicrosystems.com> <001001c81026$ec087cc0$a865a8c0@catcher> <1192561522.5921.80.camel@hrosenstock-ws.xsigo.com> <1192651019.30322.446.camel@brick.pathscale.com> Message-ID: <1192651421.5921.358.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-17 at 12:56 -0700, Ralph Campbell wrote: > On Tue, 2007-10-16 at 12:05 -0700, Hal Rosenstock wrote: > > On Tue, 2007-10-16 at 14:01 -0500, Steve Welch wrote: > > > > -----Original Message----- > > > > From: Suresh Shelvapille [mailto:suri at baymicrosystems.com] > > > > Steve: > > > > > > > > This patch looks good on my system, meaning it did not break any of my > > > > usual > > > > tests (switch related). > > > > > > > Suri, thanks for testing this. > > > > > > Hal, I will resubmit the patch to the list to include the detailed > > > description as we discussed previously. > > > > Great; thanks. It'd be nice to hear from the iPathers to really nail > > this one. > > > > -- Hal > > I have reviewed the patch but I noticed a number of > problems with handling initial or final LID routed parts of the > directed route SMP. So called "combined routes" is a separate issue and not currently supported. IMO this should be a separate patch if you are going to add this in. 
> I will be sending several small patches > as soon as I finish testing them. Great; thanks. > The InfiniPath changes > need to be part of the loopback of DR SMP responses from userspace > so I will post a modified patch for that too. Any chance while you're in this area, you can add support for CapMask changed trap 144 (for isSM) in the iPath SMA? From ralph.campbell at qlogic.com Wed Oct 17 13:04:50 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 17 Oct 2007 13:04:50 -0700 Subject: [ofa-general] [PATCH] IB/core - delete redundant check for DR SMP Message-ID: <1192651490.30322.450.camel@brick.pathscale.com> The function handle_outgoing_dr_smp() is only called if the MAD to be sent is a directed route SMP. Thus, the check for IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is redundant. Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..c483d6e 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -679,8 +679,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, struct ib_wc mad_wc; struct ib_send_wr *send_wr = &mad_send_wr->send_wr; - if (device->node_type == RDMA_NODE_IB_SWITCH && - smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + if (device->node_type == RDMA_NODE_IB_SWITCH) port_num = send_wr->wr.ud.port_num; else port_num = mad_agent_priv->agent.port_num; From ralph.campbell at qlogic.com Wed Oct 17 13:07:17 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 17 Oct 2007 13:07:17 -0700 Subject: [ofa-general] [PATCH] IB/core - Don't modify outgoing DR SMP if first part is LID routed Message-ID: <1192651637.30322.453.camel@brick.pathscale.com> The code in handle_outgoing_dr_smp() checks to see if the directed route SMP has an initial LID routed part and correctly does not modify the hop pointer but it then proceeds to process the packet as if there was no initial LID routed part.
Instead, if there is an initial LID routed part, the packet should just be sent on to the destination and not processed further since it can't be destined for the local SM/SMA. Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..3c01236 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -691,9 +691,10 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, * If we are at the start of the LID routed part, don't update the * hop_ptr or hop_cnt. See section 14.2.2, Vol 1 IB spec. */ - if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) == - IB_LID_PERMISSIVE && - smi_handle_dr_smp_send(smp, device->node_type, port_num) == + if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) != + IB_LID_PERMISSIVE) + goto out; + if (smi_handle_dr_smp_send(smp, device->node_type, port_num) == IB_SMI_DISCARD) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); From ralph.campbell at qlogic.com Wed Oct 17 13:16:03 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 17 Oct 2007 13:16:03 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <1192651421.5921.358.camel@hrosenstock-ws.xsigo.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <017601c81025$ddb7be20$1914a8c0@md.baymicrosystems.com> <001001c81026$ec087cc0$a865a8c0@catcher> <1192561522.5921.80.camel@hrosenstock-ws.xsigo.com> <1192651019.30322.446.camel@brick.pathscale.com> <1192651421.5921.358.camel@hrosenstock-ws.xsigo.com> Message-ID: <1192652163.30322.456.camel@brick.pathscale.com> On Wed, 2007-10-17 at 13:03 -0700, Hal Rosenstock wrote: > Any chance while your in this area, you can add support for CapMask > changed trap 144 (for isSM) in the iPath SMA ? Probably. I assume the IsCapabilityMaskNoticeSupported bit would then need to be set in SubnGet(PortInfo). 
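For reference, a minimal sketch of the bookkeeping being discussed: flipping isSM in PortInfo:CapabilityMask and deciding whether a CapabilityMask-changed notice (trap 144) would be due. This is not the iPath SMA code. The helper name `update_is_sm` is hypothetical, and the bit positions (isSM at bit 1, IsCapabilityMaskNoticeSupported at bit 22) are assumptions that should be checked against the spec and the kernel headers before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in PortInfo:CapabilityMask (host byte order). */
#define CAP_IS_SM                 (1u << 1)
#define CAP_IS_CAPMASK_NOTICE_SUP (1u << 22)

/*
 * Hypothetical helper: apply an isSM change to the capability mask and
 * report whether a CapabilityMask-changed notice (trap 144) would be due.
 */
static inline bool update_is_sm(uint32_t *cap_mask, bool is_sm)
{
	uint32_t old_mask = *cap_mask;
	uint32_t new_mask = is_sm ? (old_mask | CAP_IS_SM)
				  : (old_mask & ~CAP_IS_SM);

	*cap_mask = new_mask;

	/* Only a real change on a port that advertises notice support counts. */
	return new_mask != old_mask &&
	       (new_mask & CAP_IS_CAPMASK_NOTICE_SUP) != 0;
}
```

The caller would build and send the trap itself; the sketch only covers the "did the mask change and is the notice advertised" decision.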
From hrosenstock at xsigo.com Wed Oct 17 13:20:51 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 17 Oct 2007 13:20:51 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <1192652163.30322.456.camel@brick.pathscale.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <017601c81025$ddb7be20$1914a8c0@md.baymicrosystems.com> <001001c81026$ec087cc0$a865a8c0@catcher> <1192561522.5921.80.camel@hrosenstock-ws.xsigo.com> <1192651019.30322.446.camel@brick.pathscale.com> <1192651421.5921.358.camel@hrosenstock-ws.xsigo.com> <1192652163.30322.456.camel@brick.pathscale.com> Message-ID: <1192652451.5921.371.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-17 at 13:16 -0700, Ralph Campbell wrote: > On Wed, 2007-10-17 at 13:03 -0700, Hal Rosenstock wrote: > > > Any chance while your in this area, you can add support for CapMask > > changed trap 144 (for isSM) in the iPath SMA ? > > Probably. Thanks. > I assume the IsCapabilityMaskNoticeSupported bit > would then need to be set in SubnGet(PortInfo). Indeed. From sashak at voltaire.com Wed Oct 17 13:36:02 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 17 Oct 2007 22:36:02 +0200 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <347180497203A942A6AA82C85846CBC904D2705A@ES23SNLNT.srn.sandia.gov> References: <200710161635.38818.eddiem@sgi.com> <20071017113853.1b8e5946.weiny2@llnl.gov> <20071017190351.GF6945@sashak.voltaire.com> <20071017192336.GG6945@sashak.voltaire.com> <347180497203A942A6AA82C85846CBC904D2705A@ES23SNLNT.srn.sandia.gov> Message-ID: <20071017203602.GM6945@sashak.voltaire.com> On 13:40 Wed 17 Oct , Maestas, Christopher Daniel wrote: > We had some similar experiences at 4480 nodes using Open SM 3.0.0 svn > tag 10188. It is something pre-OFED-1.2. > With a clean mapping I think it took ~3-5 minutes. 
There were many performance improvements since that (but probably my simulator is too fast anyway :)). BTW could you send me output of ibnetdiscover? I will be able to re-run it with ibsim too. Sasha > > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha > Khapyorsky > Sent: Wednesday, October 17, 2007 1:24 PM > To: Ira Weiny > Cc: Edward Mascarenhas; general at lists.openfabrics.org > Subject: Re: [ofa-general] Running OpenSM on large clusters > > On 21:03 Wed 17 Oct , Sasha Khapyorsky wrote: > > > > > > We have atlas running with 1152 nodes. OpenSM is able to route it > > > with up/down routing in ~2min. > > > > 2min is a lot for OpenSM with up/down. Is it pure OpenSM time or from > > bring-up power-on? > > With simulator (ibsim) and atlas I have 7+ seconds with master OpenSM: > > ------------------------------------------------- > OpenSM 3.1.5 > Command Line Arguments: > Creating new log file > Run Once > Log File: ./osm.log > ------------------------------------------------- > OpenSM 3.1.5 > > Using default GUID 0x2c9020021a5ed > Entering MASTER state > > SUBNET UP > > Exiting SM > > > real 0m7.324s > user 0m5.860s > sys 0m2.980s > > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From mshefty at ichips.intel.com Wed Oct 17 13:32:35 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 17 Oct 2007 13:32:35 -0700 Subject: [ofa-general] [PATCH] IB/core - delete redundant check for DR SMP In-Reply-To: <1192651490.30322.450.camel@brick.pathscale.com> References: <1192651490.30322.450.camel@brick.pathscale.com> Message-ID: <47167163.9030509@ichips.intel.com> Ralph Campbell wrote: > The function handle_outgoing_dr_smp() is only called if the > MAD 
to be sent is a directed route SMP. Thus, the check for > IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is redundant. > > Signed-off-by: Ralph Campbell Acked-by: Sean Hefty > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..c483d6e 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -679,8 +679,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > struct ib_wc mad_wc; > struct ib_send_wr *send_wr = &mad_send_wr->send_wr; > > - if (device->node_type == RDMA_NODE_IB_SWITCH && > - smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > + if (device->node_type == RDMA_NODE_IB_SWITCH) > port_num = send_wr->wr.ud.port_num; > else > port_num = mad_agent_priv->agent.port_num; > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hrosenstock at xsigo.com Wed Oct 17 13:39:51 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 17 Oct 2007 13:39:51 -0700 Subject: [ofa-general] [PATCH] IB/core - delete redundant check for DR SMP In-Reply-To: <47167163.9030509@ichips.intel.com> References: <1192651490.30322.450.camel@brick.pathscale.com> <47167163.9030509@ichips.intel.com> Message-ID: <1192653591.5921.380.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-17 at 13:32 -0700, Sean Hefty wrote: > Ralph Campbell wrote: > > The function handle_outgoing_dr_smp() is only called if the > > MAD to be sent is a directed route SMP. Thus, the check for > > IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is redundant. 
> > > > Signed-off-by: Ralph Campbell > > Acked-by: Sean Hefty Acked-by: Hal Rosenstock > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > > index 6f42877..c483d6e 100644 > > --- a/drivers/infiniband/core/mad.c > > +++ b/drivers/infiniband/core/mad.c > > @@ -679,8 +679,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > struct ib_wc mad_wc; > > struct ib_send_wr *send_wr = &mad_send_wr->send_wr; > > > > - if (device->node_type == RDMA_NODE_IB_SWITCH && > > - smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > > + if (device->node_type == RDMA_NODE_IB_SWITCH) > > port_num = send_wr->wr.ud.port_num; > > else > > port_num = mad_agent_priv->agent.port_num; > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From weiny2 at llnl.gov Wed Oct 17 14:12:13 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 17 Oct 2007 14:12:13 -0700 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <20071017192336.GG6945@sashak.voltaire.com> References: <200710161635.38818.eddiem@sgi.com> <20071017113853.1b8e5946.weiny2@llnl.gov> <20071017190351.GF6945@sashak.voltaire.com> <20071017192336.GG6945@sashak.voltaire.com> Message-ID: <20071017141213.71ddd269.weiny2@llnl.gov> Ok, perhaps my estimate was a bit too fast. However, it should be on the order of minutes. I could get a hard number from a sysadmin but I don't think that is the issue. 
Ira On Wed, 17 Oct 2007 21:23:36 +0200 Sasha Khapyorsky wrote: > On 21:03 Wed 17 Oct , Sasha Khapyorsky wrote: > > > > > > We have atlas running with 1152 nodes. OpenSM is able to route it with up/down > > > routing in ~2min. > > > > 2min is a lot for OpenSM with up/down. Is it pure OpenSM time or from > > bring-up power-on? > > With simulator (ibsim) and atlas I have 7+ seconds with master OpenSM: > > ------------------------------------------------- > OpenSM 3.1.5 > Command Line Arguments: > Creating new log file > Run Once > Log File: ./osm.log > ------------------------------------------------- > OpenSM 3.1.5 > > Using default GUID 0x2c9020021a5ed > Entering MASTER state > > SUBNET UP > > Exiting SM > > > real 0m7.324s > user 0m5.860s > sys 0m2.980s > > > Sasha From weiny2 at llnl.gov Wed Oct 17 14:19:37 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 17 Oct 2007 14:19:37 -0700 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <20071017190449.GO20546@sgi.com> References: <200710161635.38818.eddiem@sgi.com> <20071017113853.1b8e5946.weiny2@llnl.gov> <20071017190449.GO20546@sgi.com> Message-ID: <20071017141937.0fe16063.weiny2@llnl.gov> On Wed, 17 Oct 2007 14:04:49 -0500 Chris Elmquist wrote: > On Wednesday (10/17/2007 at 11:38AM -0700), Ira Weiny wrote: > > On Tue, 16 Oct 2007 16:35:38 -0700 > > Edward Mascarenhas wrote: > > > > > > > > Has anyone seen issues with running OpenSM on large (1500+ nodes) > > > clusters? > > > > [...] > > > > > We have atlas running with 1152 nodes. OpenSM is able to route it with up/down > > routing in ~2min. > > > > We don't see messages like you state above. But we have been using the OpenSM > > from OFED 1.2. > > > > Hope this helps, > > Ira > > Ira, > > Thank you for the information. Can you describe the configuration of > the machine on which you run that OpenSM? How much horsepower and the > type of HCA used? 
> > I suspect that the machine on which we run OpenSM may be underpowered for > what we are asking of it... > > Chris > The node is a 4 socket MB with 2.4Gig dual core opterons (8 cores total). OpenSM is the biggest thing running on that node but I don't recall it taking all 8 cores for any length of time... The HCA's are Mellanox on a PCIe bus. ibstat is included below. Ira 14:12:48 > ibstat CA 'mthca0' CA type: MT25208 Number of ports: 2 Firmware version: 5.2.916 Hardware version: 20 Node GUID: 0x0002c9020021a5ec System image GUID: 0x0002c9020021a5ef Port 1: State: Active Physical state: LinkUp Rate: 20 Base lid: 1388 LMC: 0 SM lid: 1388 Capability mask: 0x02510a6a Port GUID: 0x0002c9020021a5ed Port 2: State: Down Physical state: Polling Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510a68 Port GUID: 0x0002c9020021a5ee From sashak at voltaire.com Wed Oct 17 15:13:22 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 18 Oct 2007 00:13:22 +0200 Subject: [ofa-general] Re: [PATCH V2] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <4714C9A1.5010304@dev.mellanox.co.il> References: <4714C9A1.5010304@dev.mellanox.co.il> Message-ID: <20071017221322.GN6945@sashak.voltaire.com> Hi Yevgeny, On 16:24 Tue 16 Oct , Yevgeny Kliteynik wrote: > Adding ClassPortInfo:CapabilityMask2 field and turning > on OSM QoS capabiliry bit (OSM_CAP2_IS_QOS_SUPPORTED). 
^^^^^^^^^^ capability > > Signed-off-by: Yevgeny Kliteynik > --- > infiniband-diags/src/saquery.c | 6 +- > opensm/include/iba/ib_types.h | 137 +++++++++++++++++++++++++++++++- > opensm/include/opensm/osm_base.h | 12 +++ > opensm/opensm/osm_sa_class_port_info.c | 4 +- > opensm/osmtest/osmtest.c | 13 +++- > 5 files changed, 162 insertions(+), 10 deletions(-) > > diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c > index a9a8da4..e17ec5a 100644 > --- a/infiniband-diags/src/saquery.c > +++ b/infiniband-diags/src/saquery.c > @@ -262,7 +262,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) > "\t\tBase version.............%d\n" > "\t\tClass version............%d\n" > "\t\tCapability mask..........0x%04X\n" > - "\t\tResponse time value......0x%08X\n" > + "\t\tCapability mask 2........0x%08X\n" > + "\t\tResponse time value......0x%02X\n" > "\t\tRedirect GID.............0x%s\n" > "\t\tRedirect TC/SL/FL........0x%08X\n" > "\t\tRedirect LID.............0x%04X\n" > @@ -279,7 +280,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) > class_port_info->base_ver, > class_port_info->class_ver, > cl_ntoh16(class_port_info->cap_mask), > - class_port_info->resp_time_val, > + ib_class_cap_mask2(class_port_info), > + ib_class_resp_time_val(class_port_info), > sprint_gid(&(class_port_info->redir_gid), gid_str, GID_STR_LEN), > cl_ntoh32(class_port_info->redir_tc_sl_fl), > cl_ntoh16(class_port_info->redir_lid), > diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h > index 0969755..3685007 100644 > --- a/opensm/include/iba/ib_types.h > +++ b/opensm/include/iba/ib_types.h > @@ -3247,8 +3247,7 @@ typedef struct _ib_class_port_info { > uint8_t base_ver; > uint8_t class_ver; > ib_net16_t cap_mask; > - uint8_t reserved[3]; > - uint8_t resp_time_val; > + ib_net32_t cap_mask2_resp_time; > ib_gid_t redir_gid; > ib_net32_t redir_tc_sl_fl; > ib_net16_t redir_lid; > @@ -3275,8 +3274,9 @@ typedef struct _ib_class_port_info { 
> * cap_mask > * Supported capabilities of this management class. > * > -* resp_time_value > -* Maximum expected response time. > +* cap_mask2_resp_time > +* Maximum expected response time and additional > +* supported capabilities of this management class. > * > * redr_gid > * GID to use for redirection, or zero > @@ -3322,6 +3322,135 @@ typedef struct _ib_class_port_info { > * > *********/ > > +/****f* IBA Base: Types/ib_class_set_resp_time_val > +* NAME > +* ib_class_set_resp_time_val > +* > +* DESCRIPTION > +* Set maximum expected response time. > +* > +* SYNOPSIS > +*/ > +static inline void OSM_API > +ib_class_set_resp_time_val(IN ib_class_port_info_t * const p_cpi, > + IN const uint8_t val) > +{ > + p_cpi->cap_mask2_resp_time = > + (p_cpi->cap_mask2_resp_time & CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | Shouldn't this be ~IB_CLASS_RESP_TIME_MASK? > + cl_hton32(val & IB_CLASS_RESP_TIME_MASK); > +} > + > +/* > +* PARAMETERS > +* p_cpi > +* [in] Pointer to the class port info object. > +* > +* val > +* [in] Response time value to set. > +* > +* RETURN VALUES > +* None > +* > +* NOTES > +* > +* SEE ALSO > +* ib_class_port_info_t > +*********/ > + > +/****f* IBA Base: Types/ib_class_resp_time_val > +* NAME > +* ib_class_resp_time_val > +* > +* DESCRIPTION > +* Get response time value. > +* > +* SYNOPSIS > +*/ > +static inline uint8_t OSM_API > +ib_class_resp_time_val(IN ib_class_port_info_t * const p_cpi) > +{ > + return (uint8_t)(cl_ntoh32(p_cpi->cap_mask2_resp_time) & > + IB_CLASS_RESP_TIME_MASK); > +} > + > +/* > +* PARAMETERS > +* p_cpi > +* [in] Pointer to the class port info object. > +* > +* RETURN VALUES > +* Response time value. > +* > +* NOTES > +* > +* SEE ALSO > +* ib_class_port_info_t > +*********/ > + > +/****f* IBA Base: Types/ib_class_set_cap_mask2 > +* NAME > +* ib_class_set_cap_mask2 > +* > +* DESCRIPTION > +* Set ClassPortInfo:CapabilityMask2.
> +* > +* SYNOPSIS > +*/ > +static inline void OSM_API > +ib_class_set_cap_mask2(IN ib_class_port_info_t * const p_cpi, > + IN const uint32_t cap_mask2) > +{ > + p_cpi->cap_mask2_resp_time = (p_cpi->cap_mask2_resp_time & > + CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | > + cl_hton32(cap_mask2 << 5); > +} > + > +/* > +* PARAMETERS > +* p_cpi > +* [in] Pointer to the class port info object. > +* > +* cap_mask2 > +* [in] CapabilityMask2 value to set. > +* > +* RETURN VALUES > +* None > +* > +* NOTES > +* > +* SEE ALSO > +* ib_class_port_info_t > +*********/ > + > +/****f* IBA Base: Types/ib_class_cap_mask2 > +* NAME > +* ib_class_cap_mask2 > +* > +* DESCRIPTION > +* Get ClassPortInfo:CapabilityMask2. > +* > +* SYNOPSIS > +*/ > +static inline uint32_t OSM_API > +ib_class_cap_mask2(IN const ib_class_port_info_t * const p_cpi) > +{ > + return (cl_ntoh32(p_cpi->cap_mask2_resp_time) >> 5); > +} > + > +/* > +* PARAMETERS > +* p_cpi > +* [in] Pointer to the class port info object. > +* > +* RETURN VALUES > +* CapabilityMask2 of the ClassPortInfo. > +* > +* NOTES > +* > +* SEE ALSO > +* ib_class_port_info_t > +*********/ > + > /****s* IBA Base: Types/ib_sm_info_t > * NAME > * ib_sm_info_t > diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h > index e635dcb..26ef067 100644 > --- a/opensm/include/opensm/osm_base.h > +++ b/opensm/include/opensm/osm_base.h > @@ -661,6 +661,18 @@ typedef enum _osm_thread_state { > #define OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED (1 << 13) > /***********/ > > +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED > +* Name > +* OSM_CAP2_IS_QOS_SUPPORTED > +* > +* DESCRIPTION > +* QoS is supported > +* > +* SYNOPSIS > +*/ > +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) This one is IB specific. I guess it should be somewhere in ib_types.h. 
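To make the mask question raised earlier in this review concrete, here is a rough standalone sketch of how the two subfields of cap_mask2_resp_time could be packed so that each setter preserves the other one. It is not the OpenSM code: the function names are hypothetical, htonl/ntohl stand in for cl_hton32/cl_ntoh32, and the 0x1F mask value is an assumption inferred from the ">> 5" in the getter (the patch does not show the definition of IB_CLASS_RESP_TIME_MASK).

```c
#include <arpa/inet.h>	/* htonl, ntohl */
#include <stdint.h>

/* Assumption: RespTimeValue occupies the low 5 bits of the 32-bit field. */
#define RESP_TIME_MASK 0x1Fu

/* Replace only the 5-bit response time; keep the 27 CapabilityMask2 bits. */
static inline uint32_t set_resp_time(uint32_t field_be, uint8_t val)
{
	return (field_be & htonl(~RESP_TIME_MASK)) |
	       htonl(val & RESP_TIME_MASK);
}

/* Replace only CapabilityMask2; keep the response time bits. */
static inline uint32_t set_cap_mask2(uint32_t field_be, uint32_t cap_mask2)
{
	return (field_be & htonl(RESP_TIME_MASK)) | htonl(cap_mask2 << 5);
}

/* Extract the response time from the network-order field. */
static inline uint8_t get_resp_time(uint32_t field_be)
{
	return (uint8_t)(ntohl(field_be) & RESP_TIME_MASK);
}

/* Extract CapabilityMask2 from the network-order field. */
static inline uint32_t get_cap_mask2(uint32_t field_be)
{
	return ntohl(field_be) >> 5;
}
```

With the non-inverted mask in the response time setter, the old response time bits would be kept and the CapabilityMask2 bits dropped, so the order in which the two setters are called would start to matter.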
> +/***********/ > + > /****d* OpenSM: Base/osm_sm_state_t > * NAME > * osm_sm_state_t > diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c > index d5c9f82..96d8898 100644 > --- a/opensm/opensm/osm_sa_class_port_info.c > +++ b/opensm/opensm/osm_sa_class_port_info.c > @@ -170,7 +170,7 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, > } > } > rtv += 8; > - p_resp_cpi->resp_time_val = rtv; > + ib_class_set_resp_time_val(p_resp_cpi, rtv); > p_resp_cpi->redir_gid = zero_gid; > p_resp_cpi->redir_tc_sl_fl = 0; > p_resp_cpi->redir_lid = 0; > @@ -209,6 +209,8 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, > p_resp_cpi->cap_mask = OSM_CAP_IS_SUBN_GET_SET_NOTICE_SUP | > OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED; > #endif > + ib_class_set_cap_mask2(p_resp_cpi, OSM_CAP2_IS_QOS_SUPPORTED); Shouldn't it check subn->opts.qos? Sasha > + > if (p_rcv->p_subn->opt.no_multicast_option != TRUE) > p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; > p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); > diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c > index 73933a3..de54f2d 100644 > --- a/opensm/osmtest/osmtest.c > +++ b/opensm/osmtest/osmtest.c > @@ -713,10 +713,17 @@ ib_api_status_t osmtest_validate_sa_class_port_info(IN osmtest_t * const p_osmt) > (ib_class_port_info_t *) ib_sa_mad_get_payload_ptr(p_resp_sa_madp); > > osm_log(&p_osmt->log, OSM_LOG_INFO, > - "osmtest_validate_sa_class_port_info:\n-----------------------------\nSA Class Port Info:\n" > - " base_ver:%u\n class_ver:%u\n cap_mask:0x%X\n resp_time_val:0x%X\n-----------------------------\n", > + "osmtest_validate_sa_class_port_info:\n" > + "-----------------------------\n" > + "SA Class Port Info:\n" > + " base_ver:%u\n" > + " class_ver:%u\n" > + " cap_mask:0x%X\n" > + " cap_mask2:0x%X\n" > + " resp_time_val:0x%X\n" > + "-----------------------------\n", > p_cpi->base_ver, p_cpi->class_ver, cl_ntoh16(p_cpi->cap_mask), > - 
p_cpi->resp_time_val); > + ib_class_cap_mask2(p_cpi), ib_class_resp_time_val(p_cpi)); > > Exit: > #if 0 > -- > 1.5.1.4 > From mshefty at ichips.intel.com Wed Oct 17 15:07:01 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 17 Oct 2007 15:07:01 -0700 Subject: [ofa-general] [PATCH] IB/core - Don't modify outgoing DR SMP if first part is LID routed In-Reply-To: <1192651637.30322.453.camel@brick.pathscale.com> References: <1192651637.30322.453.camel@brick.pathscale.com> Message-ID: <47168785.3030700@ichips.intel.com> Ralph Campbell wrote: > The code in handle_outgoing_dr_smp() checks to see if the directed > route SMP has an initial LID routed part and correctly does not > modify the hop pointer but it then proceeds to process the packet > as if there was no initial LID routed part. Instead, if there > is an initial LID routed part, the packet should just be sent on > to the destination and not processed further since it can't be > destined for the local SM/SMA. This makes sense to me at first read, but I need more time studying the spec and existing code before reaching any conclusions. Can't a DR SMP be entirely LID routed, meaning that this SMP could be for the local node? (I know that doesn't seem to make sense, but is it permitted?) > @@ -691,9 +691,10 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > * If we are at the start of the LID routed part, don't update the > * hop_ptr or hop_cnt. See section 14.2.2, Vol 1 IB spec. > */ > - if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) == > - IB_LID_PERMISSIVE && > - smi_handle_dr_smp_send(smp, device->node_type, port_num) == > + if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) != > + IB_LID_PERMISSIVE) > + goto out; With this change, I would move the LID check up higher in the function, to avoid setting the port_num as a minor nit optimization. 
- Sean From hrosenstock at xsigo.com Wed Oct 17 15:11:59 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 17 Oct 2007 15:11:59 -0700 Subject: [ofa-general] [PATCH] IB/core - Don't modify outgoing DR SMP if first part is LID routed In-Reply-To: <47168785.3030700@ichips.intel.com> References: <1192651637.30322.453.camel@brick.pathscale.com> <47168785.3030700@ichips.intel.com> Message-ID: <1192659119.5921.402.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-17 at 15:07 -0700, Sean Hefty wrote: > Ralph Campbell wrote: > > The code in handle_outgoing_dr_smp() checks to see if the directed > > route SMP has an initial LID routed part and correctly does not > > modify the hop pointer but it then proceeds to process the packet > > as if there was no initial LID routed part. Instead, if there > > is an initial LID routed part, the packet should just be sent on > > to the destination and not processed further since it can't be > > destined for the local SM/SMA. > > This makes sense to me at first read, but I need more time studying the > spec and existing code before reaching any conclusions. Can't a DR SMP > be entirely LID routed, meaning that this SMP could be for the local > node? (I know that doesn't seem to make sense, but is it permitted?) Yes (and you can do this with smpquery and sminfo). > > @@ -691,9 +691,10 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > * If we are at the start of the LID routed part, don't update the > > * hop_ptr or hop_cnt. See section 14.2.2, Vol 1 IB spec. > > */ > > - if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) == > > - IB_LID_PERMISSIVE && > > - smi_handle_dr_smp_send(smp, device->node_type, port_num) == > > + if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) != > > + IB_LID_PERMISSIVE) > > + goto out; > > With this change, I would move the LID check up higher in the function, > to avoid setting the port_num as a minor nit optimization. 
> > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From sashak at voltaire.com Wed Oct 17 15:34:22 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 18 Oct 2007 00:34:22 +0200 Subject: [ofa-general] Re: [PATCH V2] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <20071017221322.GN6945@sashak.voltaire.com> References: <4714C9A1.5010304@dev.mellanox.co.il> <20071017221322.GN6945@sashak.voltaire.com> Message-ID: <20071017223422.GP6945@sashak.voltaire.com> On 00:13 Thu 18 Oct , Sasha Khapyorsky wrote: > > diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h > > index 0969755..3685007 100644 > > --- a/opensm/include/iba/ib_types.h > > +++ b/opensm/include/iba/ib_types.h > > @@ -3247,8 +3247,7 @@ typedef struct _ib_class_port_info { > > uint8_t base_ver; > > uint8_t class_ver; > > ib_net16_t cap_mask; > > - uint8_t reserved[3]; > > - uint8_t resp_time_val; > > + ib_net32_t cap_mask2_resp_time; This will break ibutils. We are in OFED already, so I think the patch for ibutils should be committed/pushed at same time.
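To make the layout change concrete, here is a minimal sketch of how two values can share one 32-bit network-order field. It is not the ib_types.h code: the helper names and macros are hypothetical, and it assumes the IBA ClassPortInfo layout of a 27-bit CapabilityMask2 in the upper bits followed by a 5-bit RespTimeValue in the low bits. Any consumer that still reads resp_time_val as a separate byte has to switch to accessors of this kind.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical names; assumed layout: [CapabilityMask2:27][RespTimeValue:5]. */
#define RESP_TIME_MASK  0x1Fu
#define CAP_MASK2_SHIFT 5

/* Store a response time value without disturbing the capability bits. */
static uint32_t set_resp_time_val(uint32_t field_be, uint8_t rtv)
{
	uint32_t host = ntohl(field_be);

	host = (host & ~RESP_TIME_MASK) | (rtv & RESP_TIME_MASK);
	return htonl(host);
}

/* Store the capability bits without disturbing the response time value. */
static uint32_t set_cap_mask2(uint32_t field_be, uint32_t cap_mask2)
{
	uint32_t host = ntohl(field_be);

	host = (host & RESP_TIME_MASK) | (cap_mask2 << CAP_MASK2_SHIFT);
	return htonl(host);
}

/* Read back the low 5 bits. */
static uint8_t get_resp_time_val(uint32_t field_be)
{
	return (uint8_t)(ntohl(field_be) & RESP_TIME_MASK);
}

/* Read back the upper 27 bits. */
static uint32_t get_cap_mask2(uint32_t field_be)
{
	return ntohl(field_be) >> CAP_MASK2_SHIFT;
}
```

Keeping the field in network order and converting inside the accessors means callers never touch the raw bits, which is the property the ib_class_set_resp_time_val() and ib_class_cap_mask2() calls in the patch rely on.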
Sasha From mshefty at ichips.intel.com Wed Oct 17 15:32:25 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 17 Oct 2007 15:32:25 -0700 Subject: [ofa-general] [PATCH] IB/core - Don't modify outgoing DR SMP if first part is LID routed In-Reply-To: <1192659119.5921.402.camel@hrosenstock-ws.xsigo.com> References: <1192651637.30322.453.camel@brick.pathscale.com> <47168785.3030700@ichips.intel.com> <1192659119.5921.402.camel@hrosenstock-ws.xsigo.com> Message-ID: <47168D79.2070607@ichips.intel.com> Hal Rosenstock wrote: > On Wed, 2007-10-17 at 15:07 -0700, Sean Hefty wrote: >> Ralph Campbell wrote: >>> The code in handle_outgoing_dr_smp() checks to see if the directed >>> route SMP has an initial LID routed part and correctly does not >>> modify the hop pointer but it then proceeds to process the packet >>> as if there was no initial LID routed part. Instead, if there >>> is an initial LID routed part, the packet should just be sent on >>> to the destination and not processed further since it can't be >>> destined for the local SM/SMA. >> This makes sense to me at first read, but I need more time studying the >> spec and existing code before reaching any conclusions. Can't a DR SMP >> be entirely LID routed, meaning that this SMP could be for the local >> node? (I know that doesn't seem to make sense, but is it permitted?) > > Yes (and you can do this with smpquery and sminfo). So, I think we want to remove the comment from the changelog that states that the SMP 'can't be destined for the local SM/SMA'. I _think_ the code change itself is okay, as long as we handle this on the receive side, which is needed anyway, and is part of the missing support that you pointed out in a separate thread. Does this seem correct? 
- Sean From hrosenstock at xsigo.com Wed Oct 17 15:36:45 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 17 Oct 2007 15:36:45 -0700 Subject: [ofa-general] [PATCH] IB/core - Don't modify outgoing DR SMP if first part is LID routed In-Reply-To: <47168D79.2070607@ichips.intel.com> References: <1192651637.30322.453.camel@brick.pathscale.com> <47168785.3030700@ichips.intel.com> <1192659119.5921.402.camel@hrosenstock-ws.xsigo.com> <47168D79.2070607@ichips.intel.com> Message-ID: <1192660605.5921.405.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-17 at 15:32 -0700, Sean Hefty wrote: > Hal Rosenstock wrote: > > On Wed, 2007-10-17 at 15:07 -0700, Sean Hefty wrote: > >> Ralph Campbell wrote: > >>> The code in handle_outgoing_dr_smp() checks to see if the directed > >>> route SMP has an initial LID routed part and correctly does not > >>> modify the hop pointer but it then proceeds to process the packet > >>> as if there was no initial LID routed part. Instead, if there > >>> is an initial LID routed part, the packet should just be sent on > >>> to the destination and not processed further since it can't be > >>> destined for the local SM/SMA. > > >> This makes sense to me at first read, but I need more time studying the > >> spec and existing code before reaching any conclusions. Can't a DR SMP > >> be entirely LID routed, meaning that this SMP could be for the local > >> node? (I know that doesn't seem to make sense, but is it permitted?) > > > > Yes (and you can do this with smpquery and sminfo). > > So, I think we want to remove the comment from the changelog that states > that the SMP 'can't be destined for the local SM/SMA'. I think that language came from the spec. Need to do some homework on this. > I _think_ the code change itself is okay, as long as we handle this on the receive > side, which is needed anyway, and is part of the missing support that > you pointed out in a separate thread. Does this seem correct? I'm not sure until I do my homework. 
-- Hal > - Sean
From ralph.campbell at qlogic.com Wed Oct 17 15:44:18 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 17 Oct 2007 15:44:18 -0700 Subject: [ofa-general] [PATCH] IB/core - Don't modify outgoing DR SMP if first part is LID routed In-Reply-To: <47168785.3030700@ichips.intel.com> References: <1192651637.30322.453.camel@brick.pathscale.com> <47168785.3030700@ichips.intel.com> Message-ID: <1192661058.30322.488.camel@brick.pathscale.com> I need to double check too. I was thinking the packet was originating on the node calling handle_outgoing_dr_smp() and therefore, if there is a leading LID part, just send the packet. But, if it is forwarding the MAD, I think the check is more complex. It may be necessary for handle_outgoing_dr_smp() to classify whether the packet should be sent, modify the directed route hop_ptr, or process locally, etc. similar to smi_check_forward_dr_smp(). On Wed, 2007-10-17 at 15:07 -0700, Sean Hefty wrote: > Ralph Campbell wrote: > > The code in handle_outgoing_dr_smp() checks to see if the directed > > route SMP has an initial LID routed part and correctly does not > > modify the hop pointer but it then proceeds to process the packet > > as if there was no initial LID routed part. Instead, if there > > is an initial LID routed part, the packet should just be sent on > > to the destination and not processed further since it can't be > > destined for the local SM/SMA. > > This makes sense to me at first read, but I need more time studying the > spec and existing code before reaching any conclusions. Can't a DR SMP > be entirely LID routed, meaning that this SMP could be for the local > node? (I know that doesn't seem to make sense, but is it permitted?) > > > @@ -691,9 +691,10 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > > * If we are at the start of the LID routed part, don't update the > > * hop_ptr or hop_cnt. 
See section 14.2.2, Vol 1 IB spec. > > */ > > - if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) == > > - IB_LID_PERMISSIVE && > > - smi_handle_dr_smp_send(smp, device->node_type, port_num) == > > + if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) != > > + IB_LID_PERMISSIVE) > > + goto out; > > With this change, I would move the LID check up higher in the function, > to avoid setting the port_num as a minor nit optimization. > > - Sean From sashak at voltaire.com Wed Oct 17 15:57:09 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 18 Oct 2007 00:57:09 +0200 Subject: [ofa-general] Re: [PATCH v3] osm: QoS - parsing port names In-Reply-To: <4714C6A6.7050300@dev.mellanox.co.il> References: <4714C6A6.7050300@dev.mellanox.co.il> Message-ID: <20071017225709.GQ6945@sashak.voltaire.com> On 16:11 Tue 16 Oct , Yevgeny Kliteynik wrote: > > Added node-by-name hash to the QoS policy object and > as port names are parsed they use this hash to locate > that actual port that the name refers to. > For now I prefer to keep this hash local, so it's part > of QoS policy object. > When the same parser will be used for partitions too, > this hash will be moved to be part of the subnet object. > > V3 changes (vs. 
V2): > - node-by-name instead of ca-by-name > - removed any constraints on the format of node name > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/include/opensm/osm_qos_policy.h | 3 +- > opensm/opensm/osm_qos_parser.y | 64 ++++++++++++++++++++++++++------ > opensm/opensm/osm_qos_policy.c | 38 ++++++++++++++++--- > 3 files changed, 86 insertions(+), 19 deletions(-) > > diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h > index 30c2e6d..61fc325 100644 > --- a/opensm/include/opensm/osm_qos_policy.h > +++ b/opensm/include/opensm/osm_qos_policy.h > @@ -49,6 +49,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t { > typedef struct _osm_qos_port_group_t { > char *name; /* single string (this port group name) */ > char *use; /* single string (description) */ > - cl_list_t port_name_list; /* list of port names (.../.../...) */ > uint8_t node_types; /* node types bitmask */ > cl_qmap_t port_map; > } osm_qos_port_group_t; > @@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t { > cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ > osm_qos_level_t *p_default_qos_level; /* default QoS level */ > osm_subn_t *p_subn; /* osm subnet object */ > + st_table * p_node_hash; /* node by name hash */ > } osm_qos_policy_t; > > /***************************************************/ > diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y > index d2917d3..5a6e0c9 100644 > --- a/opensm/opensm/osm_qos_parser.y > +++ b/opensm/opensm/osm_qos_parser.y > @@ -245,7 +245,8 @@ qos_policy_entry: port_groups_section > * use: our SRP storage targets > * port-guid: 0x1000000000000001,0x1000000000000002 > * ... > - * port-name: vs1/HCA-1/P1 > + * port-name: vs1 HCA-1/P1 > + * port-name: node_and_HCA_name/P2 Maybe node_desc is cleaner instead of node_and_HCA_name. > * ... > * pkey: 0x00FF-0x0FFF > * ... 
> @@ -602,21 +603,60 @@ port_group_use_start: TK_USE { > > port_group_port_name: port_group_port_name_start string_list { > /* 'port-name' in 'port-group' - any num of instances */ > - cl_list_iterator_t list_iterator; > - char * tmp_str; > - > - list_iterator = cl_list_head(&tmp_parser_struct.str_list); > - while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) > + cl_list_iterator_t list_iterator; > + osm_node_t * p_node; > + osm_physp_t * p_physp; > + unsigned port_num; > + char * tmp_str; > + char * port_str; > + > + /* parsing port name strings */ > + for (list_iterator = cl_list_head(&tmp_parser_struct.str_list); > + list_iterator != cl_list_end(&tmp_parser_struct.str_list); > + list_iterator = cl_list_next(list_iterator)) > { > tmp_str = (char*)cl_list_obj(list_iterator); > + if (tmp_str) > + { > + /* last slash in port name string is a separator > + between node name and port number */ > + port_str = strrchr(tmp_str, '/'); > + if (!port_str || (strlen(port_str) < 3) || If port number is not specified it could be nice wildcarding - all ports for this node. There is no wild card expansion with multiple ports mapping in this patch, so this comment is just idea for future use, no need to change yet. 
> + (port_str[1] != 'p' && port_str[1] != 'P')) { > + yyerror("illegal port name"); > + free(tmp_str); > + cl_list_remove_all(&tmp_parser_struct.str_list); > + return 1; > + } > > - /* > - * TODO: parse port name strings > - */ > + if (!(port_num = strtoul(&port_str[2],NULL,0))) { > + yyerror("illegal port number in port name"); > + free(tmp_str); > + cl_list_remove_all(&tmp_parser_struct.str_list); > + return 1; > + } > > - if (tmp_str) > - cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); > - list_iterator = cl_list_next(list_iterator); > + /* separate node name from port number */ > + port_str[0] = '\0'; > + > + if (st_lookup(p_qos_policy->p_node_hash, > + (st_data_t)tmp_str, > + (st_data_t*)&p_node)) > + { > + /* we found the node, now get the right port */ > + p_physp = osm_node_get_physp_ptr(p_node, port_num); > + if (!p_physp) { > + yyerror("port number out of range in port name"); > + free(tmp_str); > + cl_list_remove_all(&tmp_parser_struct.str_list); > + return 1; > + } > + /* we found the port, now add it to guid table */ > + __parser_add_port_to_port_map(&p_current_port_group->port_map, > + p_physp); > + } > + free(tmp_str); > + } > } > cl_list_remove_all(&tmp_parser_struct.str_list); > } > diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c > index 51dd7b9..1207295 100644 > --- a/opensm/opensm/osm_qos_policy.c > +++ b/opensm/opensm/osm_qos_policy.c > @@ -59,6 +59,33 @@ > /*************************************************** > ***************************************************/ > > +static void > +__build_nodebyname_hash(osm_qos_policy_t * p_qos_policy) > +{ > + osm_node_t * p_node; > + cl_qmap_t * p_node_guid_tbl = &p_qos_policy->p_subn->node_guid_tbl; > + > + p_qos_policy->p_node_hash = st_init_strtable(); > + CL_ASSERT(p_qos_policy->p_node_hash); > + > + if (!p_node_guid_tbl || !cl_qmap_count(p_node_guid_tbl)) > + return; > + > + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); > + p_node != 
(osm_node_t *) cl_qmap_end(p_node_guid_tbl); > + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { > + if (!st_lookup(p_qos_policy->p_node_hash, > + (st_data_t)p_node->print_desc, > + (st_data_t*)&p_node)) > + st_insert(p_qos_policy->p_node_hash, > + (st_data_t)p_node->print_desc, > + (st_data_t)p_node); st_lookup() is not needed? st_insert() replace entry if it exists. In case of identical node_desc last will appear. Sasha > + } > +} > + > +/*************************************************** > + ***************************************************/ > + > static boolean_t > __is_num_in_range_arr(uint64_t ** range_arr, > unsigned range_arr_len, uint64_t num) > @@ -127,8 +154,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() > return NULL; > > memset(p, 0, sizeof(osm_qos_port_group_t)); > - > - cl_list_init(&p->port_name_list, 10); > cl_qmap_init(&p->port_map); > > return p; > @@ -150,10 +175,6 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) > if (p->use) > free(p->use); > > - cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); > - cl_list_remove_all(&p->port_name_list); > - cl_list_destroy(&p->port_name_list); > - > p_port = (osm_qos_port_t *) cl_qmap_head(&p->port_map); > while (p_port != (osm_qos_port_t *) cl_qmap_end(&p->port_map)) > { > @@ -423,6 +444,8 @@ osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) > cl_list_init(&p_qos_policy->qos_match_rules, 10); > > p_qos_policy->p_subn = p_subn; > + __build_nodebyname_hash(p_qos_policy); > + > return p_qos_policy; > } > > @@ -495,6 +518,9 @@ void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy) > cl_list_remove_all(&p_qos_policy->qos_match_rules); > cl_list_destroy(&p_qos_policy->qos_match_rules); > > + if (p_qos_policy->p_node_hash) > + st_free_table(p_qos_policy->p_node_hash); > + > free(p_qos_policy); > > p_qos_policy = NULL; > -- > 1.5.1.4 > From eddiem at sgi.com Wed Oct 17 16:24:40 2007 From: eddiem at sgi.com (Edward Mascarenhas) 
Date: Wed, 17 Oct 2007 16:24:40 -0700 Subject: [ofa-general] Running OpenSM on large clusters In-Reply-To: <20071017113049.GA6329@sashak.voltaire.com> References: <200710161635.38818.eddiem@sgi.com> <20071017113049.GA6329@sashak.voltaire.com> Message-ID: <200710171624.40828.eddiem@sgi.com> On Wednesday 17 October 2007 04:30:49 am Sasha Khapyorsky wrote: > On 16:35 Tue 16 Oct , Edward Mascarenhas wrote: > > Has anyone seen issues with running OpenSM on large (1500+ nodes) > > clusters? > > > > We are seeing 1000s of the following message in the system log > > > > __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is > > already overloaded with 6736 messages and queue time > > of:10006[msec] > > I guess you see this during fabric bringup when SA processor is not > available yet. Which version of OpenSM are you using - we did some > improvements in this area in recent versions (partially in > OFED-1.2)? > Yes, during fabric bringup. We are using OFED 1.2. Edward From ralph.campbell at qlogic.com Wed Oct 17 17:33:23 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 17 Oct 2007 17:33:23 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> Message-ID: <1192667603.30322.504.camel@brick.pathscale.com> Steve's patch plus the attached patch for ib_ipath allows loopback to work and doesn't seem to obviously break anything. I was wondering though about adding the code from smi_check_local_returning_smp() to smi_check_local_smp() instead of defining a separate function. That got me thinking about what happens when a return path DR SMP is received and ib_mad_recv_done_handler() calls smi_check_local_smp(). Now I'm trying to convince myself one way or the other whether the same checks in ib_mad_recv_done_handler() are needed or not. 
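For illustration, the merged check being considered could look roughly like the sketch below. It is a userspace model, not the smi.h code: the enum, struct and function names are hypothetical, the has_process_mad flag stands in for device->process_mad, and the request branch assumes the existing helper also requires the direction bit to be clear (the quoted hunk only shows its hop pointer comparison).

```c
#include <stdint.h>

/* Hypothetical stand-ins for enum smi_action and struct ib_smp. */
enum smi_action_model { SMI_DISCARD, SMI_HANDLE };

struct smp_model {
	int     returning; /* direction bit: 0 = outgoing request, 1 = returning response */
	uint8_t hop_ptr;
	uint8_t hop_cnt;
};

/*
 * One function covering both directions:
 *  - request:  hop_ptr == hop_cnt + 1 means the end of the DR segment
 *              has been reached, so the local SMA/SM handles it;
 *  - response: hop_ptr == 0 means the response is back at its origin,
 *              so it is given to the local SM.
 */
static enum smi_action_model check_local(const struct smp_model *smp,
					 int has_process_mad)
{
	if (!has_process_mad)
		return SMI_DISCARD;

	if (!smp->returning)
		return smp->hop_ptr == smp->hop_cnt + 1 ? SMI_HANDLE : SMI_DISCARD;

	return smp->hop_ptr == 0 ? SMI_HANDLE : SMI_DISCARD;
}
```

Folding the two tests together changes the result for every caller of the existing helper, including the receive path, which is why the behaviour of ib_mad_recv_done_handler() matters before choosing this over a separate function.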
On Wed, 2007-10-10 at 22:29 -0500, swelch at systemfabricworks.com wrote: > > Sean, Roland, > > This patch [v3] replaces the [v2] patch; it includes those changes but renames > the smi function testing returning SMP requests to the name Hal recommends. > > This patch allows userspace DR SMP responses to be looped back and delivered > to a local mad agent by the management stack. > > Thanks, Steve > > Signed-off-by: Steve Welch > --- > drivers/infiniband/core/mad.c | 6 +++--- > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > 2 files changed, 20 insertions(+), 4 deletions(-) > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..98148d6 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > } > > /* Check to post send on QP or process locally */ > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) > goto out; > > local = kmalloc(sizeof *local, GFP_ATOMIC); > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > mad_agent_priv->agent.port_num); > if (port_priv) { > - mad_priv->mad.mad.mad_hdr.tid = > - ((struct ib_mad *)smp)->mad_hdr.tid; > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > recv_mad_agent = find_mad_agent(port_priv, > &mad_priv->mad.mad); > } > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > index 1cfc298..aff96ba 100644 > --- a/drivers/infiniband/core/smi.h > +++ b/drivers/infiniband/core/smi.h > @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, > u8 node_type, int port_num); > > /* > - * Return 1 if the SMP should be handled by the local SMA/SM via process_mad > + 
* Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > */ > static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > struct ib_device *device) > @@ -71,4 +72,19 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > (smp->hop_ptr == smp->hop_cnt + 1)) ? > IB_SMI_HANDLE : IB_SMI_DISCARD); > } > + > +/* > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > + */ > +static inline enum smi_action smi_check_local_returning_smp(struct ib_smp *smp, > + struct ib_device *device) > +{ > + /* C14-13:3 -- We're at the end of the DR segment of path */ > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > + return ((device->process_mad && > + ib_get_smp_direction(smp) && > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > +} > + > #endif /* __SMI_H_ */ diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index 3d1432d..1978c34 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -1434,7 +1434,7 @@ static int process_subn(struct ib_device *ibdev, int mad_flags, * before checking for other consumers. * Just tell the caller to process it normally. */ - ret = IB_MAD_RESULT_FAILURE; + ret = IB_MAD_RESULT_SUCCESS; goto bail; default: smp->status |= IB_SMP_UNSUP_METHOD; @@ -1516,7 +1516,7 @@ static int process_perf(struct ib_device *ibdev, u8 port_num, * before checking for other consumers. * Just tell the caller to process it normally. 
*/ - ret = IB_MAD_RESULT_FAILURE; + ret = IB_MAD_RESULT_SUCCESS; goto bail; default: pmp->status |= IB_SMP_UNSUP_METHOD; From ralph.campbell at qlogic.com Wed Oct 17 18:06:42 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 17 Oct 2007 18:06:42 -0700 Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler() Message-ID: <1192669602.30322.510.camel@brick.pathscale.com> In ib_mad_recv_done_handler(), the response pointer is checked for NULL after allocating it. It is then checked again in the local process_mad() path but there is no possibility of it changing in between. Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..f82900d 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1931,15 +1931,6 @@ local: if (port_priv->device->process_mad) { int ret; - if (!response) { - printk(KERN_ERR PFX "No memory for response MAD\n"); - /* - * Is it better to assume that - * it wouldn't be processed ? - */ - goto out; - } - ret = port_priv->device->process_mad(port_priv->device, 0, port_priv->port_num, wc, &recv->grh, From swelch at systemfabricworks.com Wed Oct 17 18:59:29 2007 From: swelch at systemfabricworks.com (Steve Welch) Date: Wed, 17 Oct 2007 20:59:29 -0500 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDR SMP responses from userspace In-Reply-To: <1192667603.30322.504.camel@brick.pathscale.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <1192667603.30322.504.camel@brick.pathscale.com> Message-ID: <001001c8112a$8575fd20$a865a8c0@catcher> I believe in the case of ib_mad_recv_done_handler(), the call smi_check_forward_dr_smp() will return 0 indicating it should be handled by the local stack because the hop pointer will equal 0 (in the case where the DR SMP response should be delivered to the stack). The smi_check_local_smp() call would not be reached. 
The second part of the original fix is not required either in ib_mad_recv_done_handler(); when the device process mad routine does not reply or consume the MAD it uses the original receive mad to deliver the MAD to the local agent, eliminating the need for the memcpy. Steve > -----Original Message----- > From: Ralph Campbell [mailto:ralph.campbell at qlogic.com] > Sent: Wednesday, October 17, 2007 7:33 PM > To: swelch at systemfabricworks.com > Cc: rdreier at cisco.com; sean.hefty at intel.com; general at lists.openfabrics.org > Subject: Re: [ofa-general] [PATCH V3] infiniband/core: Enable loopback > ofDR SMP responses from userspace > > Steve's patch plus the attached patch for ib_ipath allows loopback > to work and doesn't seem to obviously break anything. > > I was wondering though about adding the code from > smi_check_local_returning_smp() to smi_check_local_smp() > instead of defining a separate function. > That got me thinking about what happens when a return path DR SMP > is received and ib_mad_recv_done_handler() calls smi_check_local_smp(). > Now I'm trying to convince myself one way or the other whether > the same checks in ib_mad_recv_done_handler() are needed or not. > > On Wed, 2007-10-10 at 22:29 -0500, swelch at systemfabricworks.com wrote: > > > > Sean, Roland, > > > > This patch [v3] replaces the [v2] patch; it includes those changes but > renames > > the smi function testing returning SMP requests to the name Hal > recommends. > > > > This patch allows userspace DR SMP responses to be looped back and > delivered > > to a local mad agent by the management stack. 
> > > > Thanks, Steve > > > > Signed-off-by: Steve Welch > > --- > > drivers/infiniband/core/mad.c | 6 +++--- > > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > > 2 files changed, 20 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/infiniband/core/mad.c > b/drivers/infiniband/core/mad.c > > index 6f42877..98148d6 100644 > > --- a/drivers/infiniband/core/mad.c > > +++ b/drivers/infiniband/core/mad.c > > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct > ib_mad_agent_private *mad_agent_priv, > > } > > > > /* Check to post send on QP or process locally */ > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) > > goto out; > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct > ib_mad_agent_private *mad_agent_priv, > > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > > mad_agent_priv->agent.port_num); > > if (port_priv) { > > - mad_priv->mad.mad.mad_hdr.tid = > > - ((struct ib_mad *)smp)->mad_hdr.tid; > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > > recv_mad_agent = find_mad_agent(port_priv, > > &mad_priv->mad.mad); > > } > > diff --git a/drivers/infiniband/core/smi.h > b/drivers/infiniband/core/smi.h > > index 1cfc298..aff96ba 100644 > > --- a/drivers/infiniband/core/smi.h > > +++ b/drivers/infiniband/core/smi.h > > @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct > ib_smp *smp, > > u8 node_type, int port_num); > > > > /* > > - * Return 1 if the SMP should be handled by the local SMA/SM via > process_mad > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > SMA/SM > > + * via process_mad > > */ > > static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > > struct ib_device *device) > > @@ -71,4 +72,19 @@ static inline enum smi_action > 
smi_check_local_smp(struct ib_smp *smp, > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > } > > + > > +/* > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > SMA/SM > > + * via process_mad > > + */ > > +static inline enum smi_action smi_check_local_returning_smp(struct > ib_smp *smp, > > + struct ib_device *device) > > +{ > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > + return ((device->process_mad && > > + ib_get_smp_direction(smp) && > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > +} > > + > > #endif /* __SMI_H_ */ > > > > diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c > b/drivers/infiniband/hw/ipath/ipath_mad.c > index 3d1432d..1978c34 100644 > --- a/drivers/infiniband/hw/ipath/ipath_mad.c > +++ b/drivers/infiniband/hw/ipath/ipath_mad.c > @@ -1434,7 +1434,7 @@ static int process_subn(struct ib_device *ibdev, int > mad_flags, > * before checking for other consumers. > * Just tell the caller to process it normally. > */ > - ret = IB_MAD_RESULT_FAILURE; > + ret = IB_MAD_RESULT_SUCCESS; > goto bail; > default: > smp->status |= IB_SMP_UNSUP_METHOD; > @@ -1516,7 +1516,7 @@ static int process_perf(struct ib_device *ibdev, u8 > port_num, > * before checking for other consumers. > * Just tell the caller to process it normally. 
> */ > - ret = IB_MAD_RESULT_FAILURE; > + ret = IB_MAD_RESULT_SUCCESS; > goto bail; > default: > pmp->status |= IB_SMP_UNSUP_METHOD; From rdreier at cisco.com Wed Oct 17 21:34:45 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 17 Oct 2007 21:34:45 -0700 Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes In-Reply-To: <200710171658.03184.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Wed, 17 Oct 2007 16:58:02 +0200") References: <20070726014931.GL10235@sgi.com> <200710171658.03184.jackm@dev.mellanox.co.il> Message-ID: > Patch looks good, but don't you have the same 64-bit alignment problem in mthca_write_db_rec() ? Good point. Arthur? From rdreier at cisco.com Wed Oct 17 21:48:00 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 17 Oct 2007 21:48:00 -0700 Subject: [ofa-general] Re: [PATCH 5/5] IB/ehca: Enable large page MRs by default In-Reply-To: <200710161731.59688.fenkes@de.ibm.com> (Joachim Fenkes's message of "Tue, 16 Oct 2007 17:31:59 +0200") References: <200710161722.29144.fenkes@de.ibm.com> <200710161731.59688.fenkes@de.ibm.com> Message-ID: thanks, applied 1-5 From tziporet at dev.mellanox.co.il Thu Oct 18 02:04:57 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 18 Oct 2007 11:04:57 +0200 Subject: [ewg] RE: [ofa-general] OFED 1.3 Alpha release is available In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> Message-ID: <471721B9.3090306@mellanox.co.il> Talpey, Thomas wrote: > At 08:46 PM 10/16/2007, Scott Weitzenkamp (sweitzen) wrote: > >> What ever happened to NFS RDMA? >> > > The NFS/RDMA client is queued for 2.6.24-rc1, it has been in the NFS > client maintainer's tree for some time and was pulled by Linus last week. > I haven't announced it yet because it appears the 2.6.24 merge window > is a bit of a mess! But I expect it to contain the client. 
> > If you want to see it in its current state, go to > git://linux-nfs.org/nfs-2.6 > > I thought OFED1.3 was intended to be 2.6.24-based. In that case > why would it exclude other 2.6.24 content simply because it wasn't > there for an early Alpha? > > > > Being in upstream kernel is very good but for NFS RDMA to be in OFED we need backport patches that supports all OSes (e.g. SLES 10 based on kernel 2.6.16, RHEL 5 based on kernel 2.6.18) and someone has to do this work Tziporet From dotanb at dev.mellanox.co.il Thu Oct 18 02:10:37 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 18 Oct 2007 11:10:37 +0200 Subject: [ofa-general] Question on IB RDMA read timing. In-Reply-To: <4716359A.9030303@hp.com> References: <20071017030955.GB15679@vt.edu> <4715BD44.2090200@dev.mellanox.co.il> <20071017075631.GA5089@minantech.com> <4716359A.9030303@hp.com> Message-ID: <4717230D.8030501@dev.mellanox.co.il> Bharath Ramesh, if you wish you can send the source to me and i will review it. Dotan Rick Jones wrote: > Gleb Natapov wrote: >> On Wed, Oct 17, 2007 at 09:44:04AM +0200, Dotan Barak wrote: >> >>> Hi. >>> >>> Bharath Ramesh wrote: >>> >>>> I wrote a simple test program to actual time it takes for RDMA read >>>> over >>>> IB. I find a huge difference in the numbers returned by timing. I was >>>> wondering if someone could help me in finding what I might be doing >>>> wrong in the way I am measuring the time. >>>> >>>> Steps I do for timing is as follows. >>>> >>>> 1) Create the send WR for RDMA Read. >>>> 2) call gettimeofday () >>>> 3) ibv_post_send () the WR >>>> 4) Loop around ibv_poll_cq () till I get the completion event. >>>> 5) call gettimeofday (); >>>> >>>> The difference in time would give me the time it takes to perform RDMA >>>> read over IB. I constantly get around 35 microsecs as the timing which >>>> seems to be really large considering the latency of IB. I am measuring >>>> the time for transferring 4K bytes of data. 
If anyone wants I can send >>>> the code that I have written. I am not subscribed to the list, if you >>>> could please cc me in the reply. >>>> >>> >>> I don't familiar with the implementation of gettimeofday, but i >>> believe that this function do a context switch >>> (and/or spend some time in the function to fill the struct that you >>> supply to it) >>> >> >> Here: >> struct timeval tv_s, tv_e; >> gettimeofday(&tv_s, NULL); >> gettimeofday(&tv_e, NULL); >> printf("%d\n", tv_e.tv_usec - tv_s.tv_usec); >> Compile and run it. The overhead of two calls to gettimeofday is at most >> 1 microsecond. > > Unless there is contention with other gettimeofday() calls on the > system - on SMP etc there are locks involved in making sure that each > call to gettimeofday() does not go backwards and the like, and on some > systems, with enough callers to gettimeofday() one can run into lock > contention. So, while 99 times out of ten gettimeofday() may be > "cheap" it really isn't a good idea to ass-u-me it will always be cheap. > > And besides, the most efficient call is the one which is never made, > so the suggestion to perform N operations between the calls is > probably still a good one. Even for measuring the overhead of > gettimeofday() :) > > Also, while it may not be so much the case these days, certainly in > the past there were "gettimeofday()" implementations which may have > rather coarse granularity. > > Now, some CPUs offer interval timer/registers/whatever - for example > the ITC on Itanium or CR16 on PA-RISC, I'm sure there are other > examples - which can be used for measuring very short things. Under > some OSes - HP-UX and Solaris are two with which I am familiar - there > is a "gethrtime()" interface which uses those without the user having > to deal with inline assembly. 
That should have lower overhead than > gettimeofday() although even then it would probably be best, if one is > indeed going for the average, to use those to measure the time to > perform N operations. > > If one does use gethrtime(), it should only be for measuring short > things, and those "timestamps" should not be interspersed with those > from gettimeofday(). The two are really separate "timespaces" if you > will. Gethrtime() does not get tick adjustment like gettimeofday() > does/can. > > rick jones > > FWIW, netperf uses gettimeofday() to measure the overall runtime of a > netperf test, and gethrtime() (when available) to measure the > individual times for "transactions" such as the exchange of a > request/response, or time spend in send() or recv() or whatnot. > >> -- >> Gleb. >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > > From vlad at lists.openfabrics.org Thu Oct 18 02:55:56 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 18 Oct 2007 02:55:56 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071018-0200 daily build status Message-ID: <20071018095556.BB3C6E60853@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 
with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.23 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From suri at baymicrosystems.com Thu Oct 18 06:31:57 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 18 Oct 2007 09:31:57 -0400 Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer checkin ib_mad_recv_done_handler() 
In-Reply-To: <1192669602.30322.510.camel@brick.pathscale.com> References: <1192669602.30322.510.camel@brick.pathscale.com> Message-ID: <01e701c8118b$44a094c0$1914a8c0@md.baymicrosystems.com> Ralph: Which version are you looking at? We cleaned it up already in 2.6.23(rcx) and I don't see it. Thanks, Suri > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > Of Ralph Campbell > Sent: Wednesday, October 17, 2007 9:07 PM > To: openib > Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer checkin > ib_mad_recv_done_handler() > > In ib_mad_recv_done_handler(), the response pointer is checked for > NULL after allocating it. It is then checked again in the local > process_mad() path but there is no possibility of it changing > in between. > > Signed-off-by: Ralph Campbell > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..f82900d 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -1931,15 +1931,6 @@ local: > if (port_priv->device->process_mad) { > int ret; > > - if (!response) { > - printk(KERN_ERR PFX "No memory for response MAD\n"); > - /* > - * Is it better to assume that > - * it wouldn't be processed ? 
> - */ > - goto out; > - } > - > ret = port_priv->device->process_mad(port_priv->device, 0, > port_priv->port_num, > wc, &recv->grh, > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Oct 18 07:02:15 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 18 Oct 2007 16:02:15 +0200 Subject: [ofa-general] [PATCH] libibumad: umad_get_issm_path() addition In-Reply-To: <20071015142857.GX12364@sashak.voltaire.com> References: <1192037373.17526.51.camel@hrosenstock-ws.xsigo.com> <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> <20071015142857.GX12364@sashak.voltaire.com> Message-ID: <20071018140215.GD6329@sashak.voltaire.com> Recently return value of umad_open_port() was changed from number of umad device to fd (in order to support multiple opens and for other goodies). However OpenSM vendor (osm_vendor_ibumad) used this previous undocumented umad_open_port() return value semantic for resolving number of /dev/../issm* device (which is needed for PortInfo capmask IsSM bit setup). So currently it is broken. This patch introduces new umad_get_issm_path() function for explicit issm device path resolving which should be used by OpenSM vendor. 
Signed-off-by: Sasha Khapyorsky --- libibumad/include/infiniband/umad.h | 2 ++ libibumad/libibumad.ver | 2 +- libibumad/src/libibumad.map | 1 + libibumad/src/umad.c | 18 ++++++++++++++++++ 4 files changed, 22 insertions(+), 1 deletions(-) diff --git a/libibumad/include/infiniband/umad.h b/libibumad/include/infiniband/umad.h index 779ac73..2ec8b37 100644 --- a/libibumad/include/infiniband/umad.h +++ b/libibumad/include/infiniband/umad.h @@ -155,6 +155,8 @@ int umad_release_ca(umad_ca_t *ca); int umad_get_port(char *ca_name, int portnum, umad_port_t *port); int umad_release_port(umad_port_t *port); +int umad_get_issm_path(char *ca_name, int portnum, char path[], int max); + int umad_open_port(char *ca_name, int portnum); int umad_close_port(int portid); diff --git a/libibumad/libibumad.ver b/libibumad/libibumad.ver index f2ffbe2..a46b2c5 100644 --- a/libibumad/libibumad.ver +++ b/libibumad/libibumad.ver @@ -6,4 +6,4 @@ # API_REV - advance on any added API # RUNNING_REV - advance any change to the vendor files # AGE - number of backward versions the API still supports -LIBVERSION=1:0:0 +LIBVERSION=1:1:0 diff --git a/libibumad/src/libibumad.map b/libibumad/src/libibumad.map index 211438e..9444aa9 100644 --- a/libibumad/src/libibumad.map +++ b/libibumad/src/libibumad.map @@ -11,6 +11,7 @@ IBUMAD_1.0 { umad_release_port; umad_close_port; umad_get_mad; + umad_get_issm_path; umad_size; umad_set_grh; umad_set_pkey; diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 5baa5b8..41373e7 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -513,6 +513,24 @@ umad_get_ca_portguids(char *ca_name, uint64_t *portguids, int max) } int +umad_get_issm_path(char *ca_name, int portnum, char path[], int max) +{ + int umad_id; + + TRACE("ca %s port %d", ca_name, portnum); + + if (!(ca_name = resolve_ca_name(ca_name, &portnum))) + return -ENODEV; + + if ((umad_id = dev_to_umad_id(ca_name, portnum)) < 0) + return -EINVAL; + + snprintf(path, max - 1, "%s/issm%u", 
UMAD_DEV_DIR , umad_id); + + return 0; +} + +int umad_open_port(char *ca_name, int portnum) { char dev_file[UMAD_DEV_FILE_SZ]; -- 1.5.3.4.206.g58ba4 From sashak at voltaire.com Thu Oct 18 07:03:41 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 18 Oct 2007 16:03:41 +0200 Subject: [ofa-general] [PATCH] opensm/vendor: use umad_get_issm_path() in osm_vendor_set_sm() In-Reply-To: <20071018140215.GD6329@sashak.voltaire.com> References: <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> <20071015142857.GX12364@sashak.voltaire.com> <20071018140215.GD6329@sashak.voltaire.com> Message-ID: <20071018140341.GE6329@sashak.voltaire.com> Instead of assuming undocumented feature that umad_open_port() will return umad device number (which is not so anymore) use newly introduced umad_get_issm_path() for real issm device name resolving. 
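For readers without the libibumad patch at hand, here is a rough sketch of the path formatting that osm_vendor_bind() relies on. It is not the library code: sketch_format_issm_path() and SKETCH_UMAD_DEV_DIR are made-up names, the directory value is assumed from the "/dev/infiniband/issm%d" string being removed below, and the CA name/port to umad id lookup is left out entirely.

```c
#include <stdio.h>

/* Assumed value, taken from the hard-coded string the patch removes. */
#define SKETCH_UMAD_DEV_DIR "/dev/infiniband"

/*
 * Hypothetical stand-in for the path-formatting step only.
 * Writes "<dir>/issm<id>" into path (at most max bytes, NUL included).
 * Returns 0 on success, -1 if the buffer is missing or too small.
 */
static int sketch_format_issm_path(unsigned umad_id, char path[], int max)
{
	int n;

	if (!path || max <= 0)
		return -1;

	n = snprintf(path, (size_t)max, "%s/issm%u",
		     SKETCH_UMAD_DEV_DIR, umad_id);
	if (n < 0 || n >= max)
		return -1;	/* output would have been truncated */

	return 0;
}
```

The sketch reports truncation as an error so that a caller never tries to open() a shortened path; whether the real function should do the same is a design choice, not something the patch states.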
Signed-off-by: Sasha Khapyorsky --- opensm/include/vendor/osm_vendor_ibumad.h | 1 + opensm/libvendor/osm_vendor_ibumad.c | 34 +++++++++++++++++----------- 2 files changed, 22 insertions(+), 13 deletions(-) diff --git a/opensm/include/vendor/osm_vendor_ibumad.h b/opensm/include/vendor/osm_vendor_ibumad.h index f86aeef..743b393 100644 --- a/opensm/include/vendor/osm_vendor_ibumad.h +++ b/opensm/include/vendor/osm_vendor_ibumad.h @@ -165,6 +165,7 @@ typedef struct _osm_vendor { int umad_port_id; void *receiver; int issmfd; + char issm_path[256]; } osm_vendor_t; #define OSM_BIND_INVALID_HANDLE 0 diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c index ca2e2c9..6d78573 100644 --- a/opensm/libvendor/osm_vendor_ibumad.c +++ b/opensm/libvendor/osm_vendor_ibumad.c @@ -828,6 +828,17 @@ osm_vendor_bind(IN osm_vendor_t * const p_vend, goto Exit; } + if (umad_get_issm_path(p_vend->umad_port.ca_name, + p_vend->umad_port.portnum, + p_vend->issm_path, + sizeof(p_vend->issm_path)) < 0) { + osm_log(p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_bind: ERR 5424: " + "Cannot resolve issm path for port %s:%u\n", + p_vend->umad_port.ca_name, p_vend->umad_port.portnum); + goto Exit; + } + if (!(p_bind = malloc(sizeof(*p_bind)))) { osm_log(p_vend->p_log, OSM_LOG_ERROR, "osm_vendor_bind: ERR 5425: " @@ -1164,27 +1175,24 @@ void osm_vendor_set_sm(IN osm_bind_handle_t h_bind, IN boolean_t is_sm_val) { osm_umad_bind_info_t *p_bind = (osm_umad_bind_info_t *) h_bind; osm_vendor_t *p_vend = p_bind->p_vend; - char issmstring[24]; OSM_LOG_ENTER(p_vend->p_log, osm_vendor_set_sm); - sprintf(issmstring, "/dev/infiniband/issm%d", p_vend->umad_port_id); if (TRUE == is_sm_val) { - p_vend->issmfd = open(issmstring, O_NONBLOCK); + p_vend->issmfd = open(p_vend->issm_path, O_NONBLOCK); if (p_vend->issmfd < 0) { osm_log(p_vend->p_log, OSM_LOG_ERROR, "osm_vendor_set_sm: ERR 5431: " - "setting IS_SM capability" - " mask failed; errno %d\n", errno); + "setting IS_SM capmask: 
cannot open file " + "\'%s\': %s\n", + p_vend->issm_path, strerror(errno)); p_vend->issmfd = -1; } - } else { - if (p_vend->issmfd != -1) { - if (0 != close(p_vend->issmfd)) - osm_log(p_vend->p_log, OSM_LOG_ERROR, - "osm_vendor_set_sm: ERR 5432: " - "clearing IS_SM capability" - " mask failed: errno %d\n", errno); - } + } else if (p_vend->issmfd != -1) { + if (0 != close(p_vend->issmfd)) + osm_log(p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_set_sm: ERR 5432: " + "clearing IS_SM capmask: cannot close: %s\n", + strerror(errno)); p_vend->issmfd = -1; } OSM_LOG_EXIT(p_vend->p_log); -- 1.5.3.4.206.g58ba4 From pw at osc.edu Thu Oct 18 07:14:31 2007 From: pw at osc.edu (Pete Wyckoff) Date: Thu, 18 Oct 2007 10:14:31 -0400 Subject: [ofa-general] [PATCH] librdmacm: provide wrapper functions to extract src/dst addresses In-Reply-To: <000001c810ed$14147e00$28c8180a@amr.corp.intel.com> References: <000001c810ed$14147e00$28c8180a@amr.corp.intel.com> Message-ID: <20071018141431.GA32055@osc.edu> sean.hefty at intel.com wrote on Wed, 17 Oct 2007 11:39 -0700: > Provide wrapper functions to retrieve the source and destination > addresses. This is based on feedback from Doug Ledford. [..] > +static inline struct sockaddr *rdma_get_src_addr(struct rdma_cm_id *id) > +{ > + return &id->route.addr.src_addr; > +} > + > +static inline struct sockaddr *rdma_get_dst_addr(struct rdma_cm_id *id) > +{ > + return &id->route.addr.dst_addr; > +} I like the idea of making these fields easier to use, but find the naming a bit confusing for one particular example. Server process does rdma_bind_addr(), rdma_listen(). Gets RDMA_CM_EVENT_CONNECT_REQUEST. Inspects the rdma_cm_event's event->id using your new function calls above. The question the server asks is, "shall I permit this remote node to connect to me, based on IP address?" If yes, it will call rdma_accept(). In sockets, we use getpeername(), which is natural: the other side is the peer. 
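To make the sockets naming concrete, a small sketch (plain TCP over loopback, nothing RDMA-specific; the sketch_* helper names are invented for this mail): from the accepted socket's point of view, getsockname() is "me" and getpeername() is the connecting client.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fetch the local (peer == 0) or peer (peer != 0) port of a socket. */
static int sketch_port_of(int fd, int peer, unsigned short *port)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof(sa);
	int rc = peer ? getpeername(fd, (struct sockaddr *)&sa, &len)
		      : getsockname(fd, (struct sockaddr *)&sa, &len);

	if (rc < 0)
		return -1;
	*port = ntohs(sa.sin_port);
	return 0;
}

/*
 * Connect a client to a listener on 127.0.0.1 and report the local and
 * peer ports as seen by the server's accepted socket and by the client.
 * Returns 0 on success, -1 on any socket error.
 */
static int sketch_local_and_peer(unsigned short *srv_local,
				 unsigned short *srv_peer,
				 unsigned short *cli_local,
				 unsigned short *cli_peer)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int lfd, cfd = -1, afd = -1, ret = -1;

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;		/* let the kernel pick a free port */

	if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(lfd, 1) < 0 ||
	    getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
		goto out;

	cfd = socket(AF_INET, SOCK_STREAM, 0);
	if (cfd < 0 ||
	    connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;

	afd = accept(lfd, NULL, NULL);
	if (afd < 0)
		goto out;

	if (sketch_port_of(afd, 0, srv_local) ||
	    sketch_port_of(afd, 1, srv_peer) ||
	    sketch_port_of(cfd, 0, cli_local) ||
	    sketch_port_of(cfd, 1, cli_peer))
		goto out;

	ret = 0;
out:
	if (afd >= 0)
		close(afd);
	if (cfd >= 0)
		close(cfd);
	close(lfd);
	return ret;
}
```

On success the server's peer port equals the client's local port and vice versa, which is the "local"/"peer" symmetry I would like the RDMA accessors to mirror.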
With this new RDMA interface, one may be tempted to use rdma_get_src_addr(), since the connecting peer was the "source" of the connect request message, but no, rdma_get_dst_addr() is what is required to find out the peer address. In general, these RDMA connections are bidirectional just like sockets, so the concept of src and dst are not clear. You have made the identification of "src" == "me" and "dst" == "peer". I'd suggest using the more natural getsockname() and getpeername() approach from sockets, or similar. How about rdma_get_local_addr(id) { return src_addr; } rdma_get_peer_addr(id) { return dst_addr; } instead? -- Pete From hrosenstock at xsigo.com Thu Oct 18 07:34:09 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 18 Oct 2007 07:34:09 -0700 Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer checkin ib_mad_recv_done_handler() In-Reply-To: <01e701c8118b$44a094c0$1914a8c0@md.baymicrosystems.com> References: <1192669602.30322.510.camel@brick.pathscale.com> <01e701c8118b$44a094c0$1914a8c0@md.baymicrosystems.com> Message-ID: <1192718049.5921.447.camel@hrosenstock-ws.xsigo.com> Suri, On Thu, 2007-10-18 at 09:31 -0400, Suresh Shelvapille wrote: > Ralph: > > Which version are you looking at? We cleaned it up already in 2.6.23(rcx) > and I don't see it. Which patch cleaned this up ? I can't seem to find this right now. -- Hal > > Thanks, > Suri > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > > Of Ralph Campbell > > Sent: Wednesday, October 17, 2007 9:07 PM > > To: openib > > Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer checkin > > ib_mad_recv_done_handler() > > > > In ib_mad_recv_done_handler(), the response pointer is checked for > > NULL after allocating it. It is then checked again in the local > > process_mad() path but there is no possibility of it changing > > in between. 
> > > > Signed-off-by: Ralph Campbell > > > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > > index 6f42877..f82900d 100644 > > --- a/drivers/infiniband/core/mad.c > > +++ b/drivers/infiniband/core/mad.c > > @@ -1931,15 +1931,6 @@ local: > > if (port_priv->device->process_mad) { > > int ret; > > > > - if (!response) { > > - printk(KERN_ERR PFX "No memory for response MAD\n"); > > - /* > > - * Is it better to assume that > > - * it wouldn't be processed ? > > - */ > > - goto out; > > - } > > - > > ret = port_priv->device->process_mad(port_priv->device, 0, > > port_priv->port_num, > > wc, &recv->grh, > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From suri at baymicrosystems.com Thu Oct 18 08:08:55 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 18 Oct 2007 11:08:55 -0400 Subject: [ofa-general] RE: [PATCHv3] mad.c: Fix memory leak in switch handling and improve error handling References: <021601c7d07a$3c6fb5d0$1914a8c0@surioffice> Message-ID: <01f301c81198$d1566360$1914a8c0@md.baymicrosystems.com> Hal: I don't have a copy of the email with a patch but, here is the email chain that I and you exchanged. This patch was submitted as part of the memory leak fix and it should have made it to 2.6.23 rc-2 or 3 I think. 
Thanks, Suri > > > > > > > > mad.c: Fix memory leak in switch handling and improve error handling > > > > > > > > Signed-off-by: Suresh Shelvapille > > > > Signed-off-by: Hal Rosenstock > > > > > > > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > > > > index bc547f1..6310dc3 100644 > > > > --- a/drivers/infiniband/core/mad.c > > > > +++ b/drivers/infiniband/core/mad.c > > > > @@ -1847,11 +1847,6 @@ static void ib_mad_recv_done_handler(struct > > > > ib_mad_port_private *port_priv, > > > > struct ib_mad_agent_private *mad_agent; > > > > int port_num; > > > > > > > > - response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); > > > > - if (!response) > > > > - printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " > > > > - "for response buffer\n"); > > > > - > > > > mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; > > > > qp_info = mad_list->mad_queue->qp_info; > > > > dequeue_mad(mad_list); > > > > @@ -1879,6 +1874,13 @@ static void ib_mad_recv_done_handler(struct > > > > ib_mad_port_private *port_priv, > > > > if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) > > > > goto out; > > > > > > > > + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); > > > > + if (!response) { > > > > + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " > > > > + "for response buffer\n"); > > > > + goto out; > > > > + } > > > > + > > > > if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) > > > > port_num = wc->port_num; > > > > else > > > > @@ -1914,12 +1916,11 @@ static void ib_mad_recv_done_handler(struct > > > > ib_mad_port_private *port_priv, > > > > response->header.recv_wc.recv_buf.mad = > > > > &response->mad.mad; > > > > response->header.recv_wc.recv_buf.grh = &response->grh; > > > > > > > > - if (!agent_send_response(&response->mad.mad, > > > > - &response->grh, wc, > > > > - port_priv->device, > > > > - > > > > smi_get_fwd_port(&recv->mad.smp), > > > > - qp_info->qp->qp_num)) > > > > - response = NULL; > > > 
> + agent_send_response(&response->mad.mad, > > > > + &response->grh, wc, > > > > + port_priv->device, > > > > + smi_get_fwd_port(&recv->mad.smp), > > > > + qp_info->qp->qp_num); > > > > > > > > goto out; > > > > } > > > > @@ -1930,15 +1931,6 @@ local: > > > > if (port_priv->device->process_mad) { > > > > int ret; > > > > > > > > - if (!response) { > > > > - printk(KERN_ERR PFX "No memory for response MAD\n"); > > > > - /* > > > > - * Is it better to assume that > > > > - * it wouldn't be processed ? > > > > - */ > > > > - goto out; > > > > - } > > > > - > > > > ret = port_priv->device->process_mad(port_priv->device, 0, > > > > port_priv->port_num, > > > > wc, &recv->grh, > > > > > > From hrosenstock at xsigo.com Thu Oct 18 08:17:47 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 18 Oct 2007 08:17:47 -0700 Subject: [ofa-general] RE: [PATCHv3] mad.c: Fix memory leak in switch handling and improve error handling In-Reply-To: <01f301c81198$d1566360$1914a8c0@md.baymicrosystems.com> References: <021601c7d07a$3c6fb5d0$1914a8c0@surioffice> <01f301c81198$d1566360$1914a8c0@md.baymicrosystems.com> Message-ID: <1192720667.5921.455.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-10-18 at 11:08 -0400, Suresh Shelvapille wrote: > Hal: > > I don't have a copy of the email with a patch but, here is the email chain that I and > you exchanged. This patch was submitted as part of the memory leak fix and it > should have made it to 2.6.23 rc-2 or 3 I think. OK; but the patch that was actually committed on 8/3 (445d68070c9c02acdda38e6d69bd43096f521035) does not include this so I think it is needed (again) as that change did not include the removal of this if clause after the local label. We somehow dropped this. Roland ? 
-- Hal > Thanks, > Suri > > > > > > > > > > > > mad.c: Fix memory leak in switch handling and improve error handling > > > > > > > > > > Signed-off-by: Suresh Shelvapille > > > > > Signed-off-by: Hal Rosenstock > > > > > > > > > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > > > > > index bc547f1..6310dc3 100644 > > > > > --- a/drivers/infiniband/core/mad.c > > > > > +++ b/drivers/infiniband/core/mad.c > > > > > @@ -1847,11 +1847,6 @@ static void ib_mad_recv_done_handler(struct > > > > > ib_mad_port_private *port_priv, > > > > > struct ib_mad_agent_private *mad_agent; > > > > > int port_num; > > > > > > > > > > - response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); > > > > > - if (!response) > > > > > - printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " > > > > > - "for response buffer\n"); > > > > > - > > > > > mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; > > > > > qp_info = mad_list->mad_queue->qp_info; > > > > > dequeue_mad(mad_list); > > > > > @@ -1879,6 +1874,13 @@ static void ib_mad_recv_done_handler(struct > > > > > ib_mad_port_private *port_priv, > > > > > if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) > > > > > goto out; > > > > > > > > > > + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); > > > > > + if (!response) { > > > > > + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " > > > > > + "for response buffer\n"); > > > > > + goto out; > > > > > + } > > > > > + > > > > > if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) > > > > > port_num = wc->port_num; > > > > > else > > > > > @@ -1914,12 +1916,11 @@ static void ib_mad_recv_done_handler(struct > > > > > ib_mad_port_private *port_priv, > > > > > response->header.recv_wc.recv_buf.mad = > > > > > &response->mad.mad; > > > > > response->header.recv_wc.recv_buf.grh = &response->grh; > > > > > > > > > > - if (!agent_send_response(&response->mad.mad, > > > > > - &response->grh, wc, > > > > > - port_priv->device, > > > > 
> - > > > > > smi_get_fwd_port(&recv->mad.smp), > > > > > - qp_info->qp->qp_num)) > > > > > - response = NULL; > > > > > + agent_send_response(&response->mad.mad, > > > > > + &response->grh, wc, > > > > > + port_priv->device, > > > > > + smi_get_fwd_port(&recv->mad.smp), > > > > > + qp_info->qp->qp_num); > > > > > > > > > > goto out; > > > > > } > > > > > @@ -1930,15 +1931,6 @@ local: > > > > > if (port_priv->device->process_mad) { > > > > > int ret; > > > > > > > > > > - if (!response) { > > > > > - printk(KERN_ERR PFX "No memory for response MAD\n"); > > > > > - /* > > > > > - * Is it better to assume that > > > > > - * it wouldn't be processed ? > > > > > - */ > > > > > - goto out; > > > > > - } > > > > > - > > > > > ret = port_priv->device->process_mad(port_priv->device, 0, > > > > > port_priv->port_num, > > > > > wc, &recv->grh, > > > > > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Thu Oct 18 08:19:33 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 18 Oct 2007 08:19:33 -0700 Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler() In-Reply-To: <1192669602.30322.510.camel@brick.pathscale.com> References: <1192669602.30322.510.camel@brick.pathscale.com> Message-ID: <1192720773.5921.457.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-17 at 18:06 -0700, Ralph Campbell wrote: > In ib_mad_recv_done_handler(), the response pointer is checked for > NULL after allocating it. It is then checked again in the local > process_mad() path but there is no possibility of it changing > in between. > > Signed-off-by: Ralph Campbell Yes, this appears to be no longer needed (and as Suri pointed out was dropped part of a previous patch). Good catch. 
Acked-by: Hal Rosenstock > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..f82900d 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -1931,15 +1931,6 @@ local: > if (port_priv->device->process_mad) { > int ret; > > - if (!response) { > - printk(KERN_ERR PFX "No memory for response MAD\n"); > - /* > - * Is it better to assume that > - * it wouldn't be processed ? > - */ > - goto out; > - } > - > ret = port_priv->device->process_mad(port_priv->device, 0, > port_priv->port_num, > wc, &recv->grh, > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at dev.mellanox.co.il Thu Oct 18 08:36:43 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 18 Oct 2007 17:36:43 +0200 Subject: [ofa-general] [PATCH] mlx4_ib: userspace qp sq-size sanity check Message-ID: <200710181736.43962.jackm@dev.mellanox.co.il> Add userspace-qp sq size sanity check. The minimum sq stride value below is taken from the MT25408 PRM (section 11.10, Table 306, log_sq_stride definition). Signed-off-by: Jack Morgenstein --- Roland, Without this check, userspace can submit arbitrarily large/small values for the number of WQEs and the stride. This can crash the kernel. 
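For readers following the thread, the bounds check described above can be sketched as stand-alone C. This is only an illustration under stated assumptions, not the driver code: struct sq_caps, log2_floor(), log2_ceil() and check_user_sq_size() are names made up here, and the sketch compares logarithms instead of shifting by a value that has not been validated yet.

```c
#include <errno.h>
#include <stdint.h>

enum { MIN_SQ_STRIDE_LOG = 6 };	/* same value as MLX4_IB_MIN_SQ_STRIDE in the patch */

/* Hypothetical stand-in for the device limits the driver consults. */
struct sq_caps {
	uint32_t max_wqes;		/* maximum number of work queue entries */
	uint32_t max_sq_desc_sz;	/* maximum send descriptor size in bytes */
};

/* Largest n with (1 << n) <= v; v must be non-zero. */
static unsigned log2_floor(uint32_t v)
{
	unsigned n = 0;

	while (v >>= 1)
		n++;
	return n;
}

/* Smallest n with (1 << n) >= v; v must be non-zero. */
static unsigned log2_ceil(uint32_t v)
{
	unsigned n = log2_floor(v);

	return (v & (v - 1)) ? n + 1 : n;
}

/* Returns 0 if the user-supplied log2 sizes are acceptable, -EINVAL otherwise. */
static int check_user_sq_size(const struct sq_caps *caps,
			      uint8_t log_sq_bb_count, uint8_t log_sq_stride)
{
	if (!caps->max_wqes || !caps->max_sq_desc_sz)
		return -EINVAL;
	/* (1 << count) > max_wqes is the same as count > floor(log2(max_wqes)) */
	if (log_sq_bb_count > log2_floor(caps->max_wqes))
		return -EINVAL;
	if (log_sq_stride > log2_ceil(caps->max_sq_desc_sz))
		return -EINVAL;
	if (log_sq_stride < MIN_SQ_STRIDE_LOG)
		return -EINVAL;
	return 0;
}
```

With limits of, say, 16384 WQEs and a 1008-byte descriptor (made-up numbers), a request for 2^14 entries with stride 2^6 passes, while 2^15 entries, a stride of 2^5, or a stride of 2^11 are all rejected.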
Index: ofed_kernel/drivers/infiniband/hw/mlx4/qp.c =================================================================== --- ofed_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-10-18 15:10:58.779428000 +0200 +++ ofed_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-10-18 16:34:37.118955000 +0200 @@ -63,6 +63,10 @@ struct mlx4_ib_sqp { u8 header_buf[MLX4_IB_UD_HEADER_SIZE]; }; +enum { + MLX4_IB_MIN_SQ_STRIDE = 6 +}; + static const __be32 mlx4_ib_opcode[] = { [IB_WR_SEND] = __constant_cpu_to_be32(MLX4_OPCODE_SEND), [IB_WR_SEND_WITH_IMM] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_IMM), @@ -285,9 +289,17 @@ static int set_kernel_sq_size(struct mlx return 0; } -static int set_user_sq_size(struct mlx4_ib_qp *qp, +static int set_user_sq_size(struct mlx4_ib_dev *dev, + struct mlx4_ib_qp *qp, struct mlx4_ib_create_qp *ucmd) { + /* Sanity check SQ size before proceeding */ + if ((1 << ucmd->log_sq_bb_count) > dev->dev->caps.max_wqes || + ucmd->log_sq_stride > + ilog2(roundup_pow_of_two(dev->dev->caps.max_sq_desc_sz)) || + ucmd->log_sq_stride < MLX4_IB_MIN_SQ_STRIDE) + return -EINVAL; + qp->sq.wqe_cnt = 1 << ucmd->log_sq_bb_count; qp->sq.wqe_shift = ucmd->log_sq_stride; @@ -330,7 +342,7 @@ static int create_qp_common(struct mlx4_ qp->sq_no_prefetch = ucmd.sq_no_prefetch; - err = set_user_sq_size(qp, &ucmd); + err = set_user_sq_size(dev, qp, &ucmd); if (err) goto err; From yangdong at ncic.ac.cn Thu Oct 18 08:41:37 2007 From: yangdong at ncic.ac.cn (yangdong) Date: Thu, 18 Oct 2007 23:41:37 +0800 Subject: [ofa-general] problems about ibv_req_notify_cq Message-ID: <47177EB1.8070609@ncic.ac.cn> hello: As usual(i do it in my test proc), after i do rdma_create_id, rdma_resolve_addr, rdma_resolve_route, i can find cm_id->verbs->ops.req_notify_cq and cm_id->verbs->ops.poll_cq is not NULL, so that i can invoke ibv_req_notify_cq, which actually invokes cm_id->verbs->ops.req_notify_cq. 
When I set up a connection in my real program, however, I found that
cm_id->verbs->ops.req_notify_cq and cm_id->verbs->ops.poll_cq are NULL.
The calls are the same as in my test program. The only difference is that
the test program does everything in main(), while in the real program I
wrapped these OpenIB calls in my own library and invoke them through it.
As a result I cannot use ibv_req_notify_cq, as shown below:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47563940566528 (LWP 11806)]
0x0000000000000000 in ?? ()
(gdb) backtrace
#0 0x0000000000000000 in ?? ()
#1 0x0000000000416038 in ibv_req_notify_cq (cq=0x795520, solicited_only=0) at verbs.h:857

From robert.j.woodruff at intel.com Thu Oct 18 08:56:02 2007
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 18 Oct 2007 08:56:02 -0700
Subject: [ewg] RE: [ofa-general] OFED 1.3 Alpha release is available
In-Reply-To: <471721B9.3090306@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> <471721B9.3090306@mellanox.co.il>
Message-ID:

>Being in upstream kernel is very good but for NFS RDMA to be in OFED we
>need backport patches that supports all OSes
>(e.g. SLES 10 based on kernel 2.6.16, RHEL 5 based on kernel 2.6.18) and
>someone has to do this work

>Tziporet

Not sure I buy that argument. I think in the past we have had some
features in OFED that were only available on certain kernels/distros.
If I recall, for example, I think that for a while iser was not available
for all kernels until the backport patches were developed.

Are you proposing we remove something that is in an upstream kernel from OFED?
If so, perhaps we should discuss on the next conference call, as I thought
that generally all features that are upstream get included into OFED and then
perhaps some features that are not yet upstream are added on.
That seems to be the process that we have followed in the past anyway.

my 2 cents woody From sashak at voltaire.com Thu Oct 18 09:39:29 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 18 Oct 2007 18:39:29 +0200 Subject: [ofa-general] [PATCH] libibumad/man: umad_get_issm_path() man page In-Reply-To: <20071018140215.GD6329@sashak.voltaire.com> References: <000201c80b63$fcf9b430$bacc180a@amr.corp.intel.com> <470F168A.50703@Sun.COM> <1192189817.14052.259.camel@hrosenstock-ws.xsigo.com> <20071014151115.GD6489@sashak.voltaire.com> <4712E990.9020906@Sun.COM> <20071015120848.GP12364@sashak.voltaire.com> <1192454333.4962.174.camel@hrosenstock-ws.xsigo.com> <20071015135432.GU12364@sashak.voltaire.com> <20071015142857.GX12364@sashak.voltaire.com> <20071018140215.GD6329@sashak.voltaire.com> Message-ID: <20071018163929.GF6329@sashak.voltaire.com> Man page for umad_get_issm_path(). Signed-off-by: Sasha Khapyorsky --- libibumad/Makefile.am | 3 +- libibumad/man/umad_get_issm_path.3 | 38 ++++++++++++++++++++++++++++++++++++ 2 files changed, 40 insertions(+), 1 deletions(-) create mode 100644 libibumad/man/umad_get_issm_path.3 diff --git a/libibumad/Makefile.am b/libibumad/Makefile.am index dd4a996..7674654 100644 --- a/libibumad/Makefile.am +++ b/libibumad/Makefile.am @@ -13,7 +13,8 @@ man_MANS = man/umad_debug.3 man/umad_get_ca.3 \ man/umad_set_grh_net.3 man/umad_set_grh.3 \ man/umad_set_addr_net.3 man/umad_set_addr.3 man/umad_set_pkey.3 \ man/umad_register.3 man/umad_register_oui.3 man/umad_unregister.3 \ - man/umad_send.3 man/umad_recv.3 man/umad_poll.3 + man/umad_send.3 man/umad_recv.3 man/umad_poll.3 \ + man/umad_get_issm_path.3 lib_LTLIBRARIES = libibumad.la diff --git a/libibumad/man/umad_get_issm_path.3 b/libibumad/man/umad_get_issm_path.3 new file mode 100644 index 0000000..ac538c9 --- /dev/null +++ b/libibumad/man/umad_get_issm_path.3 @@ -0,0 +1,38 @@ +.\" -*- nroff -*- +.\" +.TH UMAD_GET_ISSM_PATH 3 "Oct 18, 2007" "OpenIB" "OpenIB Programmer\'s Manual" +.SH "NAME" +umad_get_issm_path \- get path of issm device +.SH 
"SYNOPSIS"
+.nf
+.B #include
+.sp
+.BI "int umad_get_issm_path(char " "*ca_name" ", int " "portnum", char *path, int max);
+.fi
+.SH "DESCRIPTION"
+.B umad_get_issm_path()
+resolves path to issm device (which used for setting/clearing PortInfo:CapMask IsSM bit) for
+.I portnum
+of the IB device
+.I ca_name
+, it stores resolved path in
+.I path
+array which cannot exceed
+.I max
+bytes in length (including NULL terminator).
+.fi
+Opening issm device sets PortInfo:CapMask IsSM bit and closing clears it.
+.fi
+.SH "RETURN VALUE"
+.B umad_open_port()
+returns 0 on success and a negative value on error as follows:
+ -ENODEV IB device can\'t be resolved
+ -EINVAL port is not valid (bad
+.I portnum\fR
+or no umad device)
+.SH "SEE ALSO"
+.BR umad_open_port (3),
+.BR umad_get_port (3)
+.SH "AUTHOR"
+.TP
+Sasha Khapyorsky
--
1.5.3.4.206.g58ba4

From rdreier at cisco.com Thu Oct 18 09:27:50 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 18 Oct 2007 09:27:50 -0700
Subject: [ofa-general] Re: [PATCH] mlx4_ib: userspace qp sq-size sanity check
In-Reply-To: <200710181736.43962.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 18 Oct 2007 17:36:43 +0200")
References: <200710181736.43962.jackm@dev.mellanox.co.il>
Message-ID:

thanks, applied.

From sashak at voltaire.com Thu Oct 18 09:44:32 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 18 Oct 2007 18:44:32 +0200
Subject: [ofa-general] [PATCH] opensm/lash: remove debug printfs - speed-up algorithm
In-Reply-To: <200710171624.40828.eddiem@sgi.com>
References: <200710161635.38818.eddiem@sgi.com> <20071017113049.GA6329@sashak.voltaire.com> <200710171624.40828.eddiem@sgi.com>
Message-ID: <20071018164432.GG6329@sashak.voltaire.com>

Remove old debug prints (with printf()). We already have a generic LFT
dumper (it runs right after the subnet is up), which produces similar
output. This alone reduces LASH running time by a factor of 4-5 and makes
it comparable to other routing algorithms.

Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_lash.c | 38 -------------------------------------- 1 files changed, 0 insertions(+), 38 deletions(-) diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 2c62708..1e91192 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -1272,43 +1272,6 @@ static void populate_fwd_tbls(lash_t * p_lash) OSM_LOG_EXIT(p_log); } -static void print_fwd_table(IN const osm_switch_t * p_sw) -{ - uint16_t max_lid_ho, lid_ho; - uint64_t switch_guid = osm_lash_get_switch_guid(p_sw); - - max_lid_ho = p_sw->max_lid_ho; - printf("FWDTBL: 0x%016" PRIx64 " max LID 0x%04X\n", - cl_ntoh64(switch_guid), max_lid_ho); - - // starting at 1, not 0. Assuming no LID with an ID of 0 - for (lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++) { - uint8_t port_num = osm_switch_get_port_by_lid(p_sw, lid_ho); - - if (port_num == OSM_NO_PATH) - printf("0x%04X : UNREACHABLE\n", lid_ho); - else - printf("0x%04X : %d \n", lid_ho, port_num); - } - printf("\n"); -} - -static void print_fwd_tables(lash_t * p_lash) -{ - osm_subn_t *p_subn = &p_lash->p_osm->subn; - osm_switch_t *p_next_sw, *p_sw; - - p_next_sw = (osm_switch_t *) cl_qmap_head(&p_subn->sw_guid_tbl); - while (p_next_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl)) { - p_sw = p_next_sw; - p_next_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item); - - if (p_sw && p_sw->p_node) { - print_fwd_table(p_sw); - } - } -} - static void osm_lash_process_switch(lash_t * p_lash, osm_switch_t * p_sw) { osm_log_t *p_log = &p_lash->p_osm->log; @@ -1492,7 +1455,6 @@ static int lash_process(void *context) goto Exit; populate_fwd_tbls(p_lash); - print_fwd_tables(p_lash); Exit: free_lash_structures(p_lash); -- 1.5.3.4.206.g58ba4 From akepner at sgi.com Thu Oct 18 09:50:01 2007 From: akepner at sgi.com (akepner at sgi.com) Date: Thu, 18 Oct 2007 09:50:01 -0700 Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes In-Reply-To: 
References: <20070726014931.GL10235@sgi.com> <200710171658.03184.jackm@dev.mellanox.co.il> Message-ID: <20071018165001.GZ5601@sgi.com> On Wed, Oct 17, 2007 at 09:34:45PM -0700, Roland Dreier wrote: > > Patch looks good, but don't you have the same 64-bit > > alignment problem in mthca_write_db_rec() ? > > Good point. Arthur? Yeah - I wouldn't have seen this since I wasn't using a memfree card for testing (and it'd depend on where the db_rec variable was placed on the stack frame, too.) Anyway, it sure looks like it could be a problem. -- Arthur From mshefty at ichips.intel.com Thu Oct 18 10:02:21 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 18 Oct 2007 10:02:21 -0700 Subject: [ofa-general] [PATCH] librdmacm: provide wrapper functions to extract src/dst addresses In-Reply-To: <20071018141431.GA32055@osc.edu> References: <000001c810ed$14147e00$28c8180a@amr.corp.intel.com> <20071018141431.GA32055@osc.edu> Message-ID: <4717919D.4090305@ichips.intel.com> > rdma_get_local_addr(id) { return src_addr; } > rdma_get_peer_addr(id) { return dst_addr; } There are already calls for rdma_get_src/dst_port(). I kept the same naming convention, but matching the socket calls makes more sense. - Sean From mshefty at ichips.intel.com Thu Oct 18 10:09:46 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 18 Oct 2007 10:09:46 -0700 Subject: [ofa-general] problems about ibv_req_notify_cq In-Reply-To: <47177EB1.8070609@ncic.ac.cn> References: <47177EB1.8070609@ncic.ac.cn> Message-ID: <4717935A.60105@ichips.intel.com> > Program received signal SIGSEGV, Segmentation fault. > [Switching to Thread 47563940566528 (LWP 11806)] > 0x0000000000000000 in ?? () > (gdb) backtrace > #0 0x0000000000000000 in ?? () > #1 0x0000000000416038 in ibv_req_notify_cq (cq=0x795520, > solicited_only=0) at verbs.h:857 Can you post the code that has this problem? 
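One way to read the backtrace above: ibv_req_notify_cq() in verbs.h is a thin inline wrapper that calls through a function pointer in the context's ops table, so an empty entry there shows up as frame #0 at address 0. The sketch below is self-contained and uses mock types only (mock_context, mock_cq, mock_req_notify_cq and fake_provider_notify are invented for illustration, not the real libibverbs structures). The guard is meant as a debugging aid and does not explain why the table would be empty.

```c
#include <errno.h>
#include <stddef.h>

struct mock_cq;

/* Mock of a provider ops table: one function pointer per verb. */
struct mock_context_ops {
	int (*req_notify_cq)(struct mock_cq *cq, int solicited_only);
};

struct mock_context {
	struct mock_context_ops ops;
};

struct mock_cq {
	struct mock_context *context;
};

/* Guarded dispatch: report a missing op instead of jumping to address 0. */
static int mock_req_notify_cq(struct mock_cq *cq, int solicited_only)
{
	if (!cq || !cq->context || !cq->context->ops.req_notify_cq)
		return -ENOSYS;
	return cq->context->ops.req_notify_cq(cq, solicited_only);
}

/* Stand-in for the callback a loaded provider library would install. */
static int fake_provider_notify(struct mock_cq *cq, int solicited_only)
{
	(void)cq;
	return solicited_only ? 1 : 0;
}
```

A check like this placed just before the failing call in the library build would at least confirm whether the CQ's context carries a populated ops table at that point, which is the part that differs from the working test program.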
From ralph.campbell at qlogic.com Thu Oct 18 10:26:35 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 18 Oct 2007 10:26:35 -0700 Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer checkin ib_mad_recv_done_handler() In-Reply-To: <01e701c8118b$44a094c0$1914a8c0@md.baymicrosystems.com> References: <1192669602.30322.510.camel@brick.pathscale.com> <01e701c8118b$44a094c0$1914a8c0@md.baymicrosystems.com> Message-ID: <1192728395.30322.516.camel@brick.pathscale.com> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git master branch as of a few days ago. I guess you figured out what happened to the earlier patch though already. On Thu, 2007-10-18 at 09:31 -0400, Suresh Shelvapille wrote: > Ralph: > > Which version are you looking at? We cleaned it up already in 2.6.23(rcx) > and I don't see it. > > Thanks, > Suri > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > > Of Ralph Campbell > > Sent: Wednesday, October 17, 2007 9:07 PM > > To: openib > > Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer checkin > > ib_mad_recv_done_handler() > > > > In ib_mad_recv_done_handler(), the response pointer is checked for > > NULL after allocating it. It is then checked again in the local > > process_mad() path but there is no possibility of it changing > > in between. > > > > Signed-off-by: Ralph Campbell > > > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > > index 6f42877..f82900d 100644 > > --- a/drivers/infiniband/core/mad.c > > +++ b/drivers/infiniband/core/mad.c > > @@ -1931,15 +1931,6 @@ local: > > if (port_priv->device->process_mad) { > > int ret; > > > > - if (!response) { > > - printk(KERN_ERR PFX "No memory for response MAD\n"); > > - /* > > - * Is it better to assume that > > - * it wouldn't be processed ? 
> > - */ > > - goto out; > > - } > > - > > ret = port_priv->device->process_mad(port_priv->device, 0, > > port_priv->port_num, > > wc, &recv->grh, > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hartlch14 at gmail.com Thu Oct 18 11:05:54 2007 From: hartlch14 at gmail.com (Chuck Hartley) Date: Thu, 18 Oct 2007 14:05:54 -0400 Subject: [ofa-general] Expected RDMA performance Message-ID: We have recently started using IB and are wondering if we are getting the expected level of performance out of it. Searching the Wiki and various websites didn't reveal any tables showing what "typical" performance is. Our hardware is several SuperMicro X7DBT-INF with onboard DDR HCA's, and a 24 port MT47396 switch. We are running Fedora Core 6 (kernel 2.6.20-1.2948.fc6), 16GB memory, dual quad core 2.33GHz Xeons and the latest BIOS/firmware versions for the components. 
Here is the output we get for RDMA write BW (read test is similar):

RDMA_Write BW Test
Inline data is used up to 400 bytes message
Number of qp's running 1
Connection type : RC
Each Qp will post up to 100 messages each time
local address: LID 0x02, QPN 0x2d0409, PSN 0x7721a RKey 0x4a043100 VAddr 0x002aaaabafd000
remote address: LID 0x04, QPN 0x50405, PSN 0xd8c024, RKey 0xe002600 VAddr 0x002aaaabaff000
Mtu : 2048
------------------------------------------------------------------
 #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]
      2         5000             2.67                2.67
      4         5000             5.36                5.35
      8         5000            10.75               10.74
     16         5000            20.82               20.78
     32         5000            41.60               41.57
     64         5000            82.57               82.45
    128         5000           156.77              156.73
    256         5000           230.09              229.86
    512         5000           677.01              676.31
   1024         5000          1101.94             1099.43
   2048         5000          1238.14             1237.90
   4096         5000          1288.37             1288.03
   8192         5000          1320.57             1320.40
  16384         5000          1330.15             1330.11
  32768         5000          1343.19             1343.19
  65536         5000          1347.03             1347.02
 131072         5000          1348.83             1348.82
 262144         5000          1341.16             1341.16
 524288         5000          1340.69             1340.69
1048576         5000          1341.46             1340.97
2097152         5000          1342.01             1342.01
4194304         5000          1342.10             1342.09
8388608         5000          1342.12             1342.12
------------------------------------------------------------------

Is this typical RDMA performance? What is the maximum theoretical BW for
DDR IB - 1525MB/sec? Or am I doing some math incorrectly? No kernel
parameters have been modified, though I did not think much could be done
to affect RDMA performance.

We are having some issues with an SSD unit that I want to understand, but
need to make sure the basic IB installation is working correctly first.
If the numbers above are good, then I'll move on to the SSD question. If
the numbers are low, then I need some pointers to what I should look at /
change.

Thanks,
Chuck

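As a rough cross-check on the math asked about above, here is a sketch of the arithmetic. It assumes a 4x DDR link (4 lanes at 5 Gbit/s signalling each) with 8b/10b encoding; the function names are made up for illustration. It gives the payload ceiling of the link itself and ignores packet headers and the host bus.

```c
#include <stdio.h>

/* Payload bytes per second for a serial link that uses 8b/10b encoding. */
static double link_payload_bytes_per_sec(int lanes, double gbit_per_lane)
{
	double signalling_bits = lanes * gbit_per_lane * 1e9;
	double data_bits = signalling_bits * 8.0 / 10.0;	/* 8b/10b overhead */

	return data_bits / 8.0;
}

/* Print the ceiling in both decimal and binary megabytes. */
static void print_link(const char *name, int lanes, double gbit_per_lane)
{
	double bps = link_payload_bytes_per_sec(lanes, gbit_per_lane);

	printf("%-12s %8.1f MB/s (10^6)  %8.1f MiB/s (2^20)\n",
	       name, bps / 1e6, bps / 1048576.0);
}
```

Calling print_link("IB 4x DDR", 4, 5.0) and print_link("PCIe 1.x x8", 8, 2.5) gives the same ceiling for both, 2e9 bytes/s, which is 2000 MB/s in decimal units or roughly 1907 in units of 2^20 bytes. Which unit the benchmark's "MB/sec" column uses is not stated in the output. If the HCA sits behind a PCIe 1.x x8 link (an assumption about this board), the bus has no headroom over the IB link, and its own packet overhead may be what holds the measured plateau near 1340 MB/sec.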
From sweitzen at cisco.com Thu Oct 18 11:59:30 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Thu, 18 Oct 2007 11:59:30 -0700
Subject: [ewg] RE: [ofa-general] OFED 1.3 Alpha release is available
In-Reply-To: <4715F785.6080105@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> <4715F785.6080105@mellanox.co.il>
Message-ID:

> >> 4. SDP - these are not yet in the alpha release
> >>     o Keep-alive
> >>     o Asynch IO
> >>     o Send Zero Copy
> >>
> >
> > If it didn't make it into alpha, perhaps it should not go into 1.3, so
> > we can hold the release date better?
> >
> Since the code is running and tested and Jim just has not succeed to
> arrange it all in the git on time I think it should be in
> I cc Jim so he can answer in more details on the status.

I don't object to it going in, but new features often take some time to
stabilize (has the new SDP been tested on ppc64, for example?), so any
new major features going in beyond this point have a high probability of
delaying the release.

Scott

From ralph.campbell at qlogic.com Thu Oct 18 13:08:45 2007
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 18 Oct 2007 13:08:45 -0700
Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback of DR SMP responses from userspace
In-Reply-To: <001001c8112a$8575fd20$a865a8c0@catcher>
References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <1192667603.30322.504.camel@brick.pathscale.com> <001001c8112a$8575fd20$a865a8c0@catcher>
Message-ID: <1192738125.30322.544.camel@brick.pathscale.com>

If ib_mad_recv_done_handler() on a switch device gets a DR SMP with:
D == 1,
smp->hop_ptr == 1,
smp->dr_slid != IB_LID_PERMISSIVE,
then
smi_handle_dr_smp_recv() returns IB_SMI_HANDLE,
smi_check_forward_dr_smp() returns IB_SMI_SEND,
smi_handle_dr_smp_send() does smp->hop_ptr-- and returns IB_SMI_HANDLE,
smi_check_local_smp() returns IB_SMI_DISCARD.
Instead, it should send a copy of the received packet with LRH:DLID set to the smp->dr_slid. So I think smi_check_local_returning_smp() needs to be called before smi_check_local_smp() and do the appropriate code for forwarding the packet. Alternatively, we could move the code from smi_check_local_returning_smp() into smi_check_local_smp() and let the device's process_mad() function do the forwarding. On Wed, 2007-10-17 at 20:59 -0500, Steve Welch wrote: > I believe in the case of ib_mad_recv_done_handler(), the call > smi_check_forward_dr_smp() will return 0 indicating it should be > handled by the local stack because the hop pointer will equal > 0 (in the case where the DR SMP response should be delivered to > the stack). The smi_check_local_smp() call would not be reached. > > The second part of the original fix is not required either > in ib_mad_recv_done_handler(); when the device process mad > routine does not reply or consume the MAD it uses the > original receive mad to deliver to the MAD to the local agent, > eliminating the need for the memcpy. > > Steve > > -----Original Message----- > > From: Ralph Campbell [mailto:ralph.campbell at qlogic.com] > > Sent: Wednesday, October 17, 2007 7:33 PM > > To: swelch at systemfabricworks.com > > Cc: rdreier at cisco.com; sean.hefty at intel.com; general at lists.openfabrics.org > > Subject: Re: [ofa-general] [PATCH V3] infiniband/core: Enable loopback > > ofDR SMP responses from userspace > > > > Steve's patch plus the attached patch for ib_ipath allows loopback > > to work and doesn't seem to obviously break anything. > > > > I was wondering though about adding the code from > > smi_check_local_returning_smp() to smi_check_local_smp() > > instead of defining a separate function. > > That got me thinking about what happens when a return path DR SMP > > is received and ib_mad_recv_done_handler() calls smi_check_local_smp(). 
> > Now I'm trying to convince myself one way or the other whether > > the same checks inib_mad_recv_done_handler() are needed or not. > > > > On Wed, 2007-10-10 at 22:29 -0500, swelch at systemfabricworks.com wrote: > > > > > > Sean, Roland, > > > > > > This patch [v3] replaces the [v2] patch; it includes those changes but > > renames > > > the smi function testing returning SMP requests to the name Hal > > recommends. > > > > > > This patch allows userspace DR SMP responses to be looped back and > > delivered > > > to a local mad agent by the management stack. > > > > > > Thanks, Steve > > > > > > Signed-off-by: Steve Welch > > > --- > > > drivers/infiniband/core/mad.c | 6 +++--- > > > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > > > 2 files changed, 20 insertions(+), 4 deletions(-) > > > > > > diff --git a/drivers/infiniband/core/mad.c > > b/drivers/infiniband/core/mad.c > > > index 6f42877..98148d6 100644 > > > --- a/drivers/infiniband/core/mad.c > > > +++ b/drivers/infiniband/core/mad.c > > > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct > > ib_mad_agent_private *mad_agent_priv, > > > } > > > > > > /* Check to post send on QP or process locally */ > > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > > + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) > > > goto out; > > > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct > > ib_mad_agent_private *mad_agent_priv, > > > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > > > mad_agent_priv->agent.port_num); > > > if (port_priv) { > > > - mad_priv->mad.mad.mad_hdr.tid = > > > - ((struct ib_mad *)smp)->mad_hdr.tid; > > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct > ib_mad)); > > > recv_mad_agent = find_mad_agent(port_priv, > > > &mad_priv->mad.mad); > > > } > > > diff --git a/drivers/infiniband/core/smi.h > > 
b/drivers/infiniband/core/smi.h > > > index 1cfc298..aff96ba 100644 > > > --- a/drivers/infiniband/core/smi.h > > > +++ b/drivers/infiniband/core/smi.h > > > @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct > > ib_smp *smp, > > > u8 node_type, int port_num); > > > > > > /* > > > - * Return 1 if the SMP should be handled by the local SMA/SM via > > process_mad > > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > > SMA/SM > > > + * via process_mad > > > */ > > > static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > > > struct ib_device *device) > > > @@ -71,4 +72,19 @@ static inline enum smi_action > > smi_check_local_smp(struct ib_smp *smp, > > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > > } > > > + > > > +/* > > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > > SMA/SM > > > + * via process_mad > > > + */ > > > +static inline enum smi_action smi_check_local_returning_smp(struct > > ib_smp *smp, > > > + struct ib_device *device) > > > +{ > > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > > + return ((device->process_mad && > > > + ib_get_smp_direction(smp) && > > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > > +} > > > + > > > #endif /* __SMI_H_ */ > > > > > > > > diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c > > b/drivers/infiniband/hw/ipath/ipath_mad.c > > index 3d1432d..1978c34 100644 > > --- a/drivers/infiniband/hw/ipath/ipath_mad.c > > +++ b/drivers/infiniband/hw/ipath/ipath_mad.c > > @@ -1434,7 +1434,7 @@ static int process_subn(struct ib_device *ibdev, int > > mad_flags, > > * before checking for other consumers. > > * Just tell the caller to process it normally. 
> > */ > > - ret = IB_MAD_RESULT_FAILURE; > > + ret = IB_MAD_RESULT_SUCCESS; > > goto bail; > > default: > > smp->status |= IB_SMP_UNSUP_METHOD; > > @@ -1516,7 +1516,7 @@ static int process_perf(struct ib_device *ibdev, u8 > > port_num, > > * before checking for other consumers. > > * Just tell the caller to process it normally. > > */ > > - ret = IB_MAD_RESULT_FAILURE; > > + ret = IB_MAD_RESULT_SUCCESS; > > goto bail; > > default: > > pmp->status |= IB_SMP_UNSUP_METHOD; > > From pradeeps at linux.vnet.ibm.com Thu Oct 18 15:00:58 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 18 Oct 2007 15:00:58 -0700 Subject: [ofa-general] ConnectX problems on Sles10sp1 Message-ID: <4717D79A.3010207@linux.vnet.ibm.com> This originally started as a problem that ibv_devinfo was showing: "No IB devices found". We are using OFED 1.2.5. Started to dig this a little and see no entries under /sys/class/infiniband as can be seen below. lsmod | grep ib mlx4_ib 74560 0 ib_addr 28704 1 rdma_cm ib_ipoib 124200 0 ib_cm 65904 2 rdma_cm,ib_ipoib ib_sa 77880 3 rdma_cm,ib_ipoib,ib_cm ipv6 466288 21 ib_ipoib ib_uverbs 72440 1 rdma_ucm ib_umad 40928 0 ib_mad 72592 4 mlx4_ib,ib_cm,ib_sa,ib_umad ib_core 106688 10 mlx4_ib,rdma_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_sa,ib_uverbs,ib_umad,ib_mad mlx4_core 117904 1 mlx4_ib libata 196100 1 ipr scsi_mod 228216 4 sg,ipr,libata,sd_mod modinfo mlx4_ib filename: /lib/modules/2.6.16.46-0.12-ppc64/updates/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko version: 0.01 license: Dual BSD/GPL description: Mellanox ConnectX HCA InfiniBand driver author: Roland Dreier srcversion: E9808B3F9850220A7A35677 depends: mlx4_core,ib_core,ib_mad,ib_core vermagic: 2.6.16.46-0.12-ppc64 SMP gcc-4.1 modinfo mlx4_core filename: /lib/modules/2.6.16.46-0.12-ppc64/updates/kernel/drivers/net/mlx4/mlx4_core.ko version: 0.01 license: Dual BSD/GPL description: Mellanox ConnectX HCA low-level driver author: Roland Dreier srcversion: 2FD23F27A2C14EE6DA1D7D7 alias: 
pci:v000015B3d0000673Csv*sd*bc*sc*i* alias: pci:v000015B3d00006732sv*sd*bc*sc*i* alias: pci:v000015B3d00006354sv*sd*bc*sc*i* alias: pci:v000015B3d0000634Asv*sd*bc*sc*i* alias: pci:v000015B3d00006340sv*sd*bc*sc*i* depends: vermagic: 2.6.16.46-0.12-ppc64 SMP gcc-4.1 parm: debug_level:Enable debug tracing if > 0 (int) parm: msi_x:attempt to use MSI-X if nonzero (int) parm: ierr_reset_disable:disable reset on Internal Error event if nonzero (int) ls -l /sys/class/infiniband total 0 My suspicion is that Installed FW version is 2.0.150, that is lower than the OFED 1.2.5 Wiki suggestion for ConnectX IB (fw-25408 Rev 2.2.000) maybe causing problems. However when I load the mlx4_ib module no errors are seen. Additionally, even though the uverbs module is loaded, I do not see the uverbs devices (udev is running). Only the rdma_cm device as shown below. ls -l /dev/infiniband/ total 0 crw-rw-rw- 1 root root 10, 62 Oct 18 17:13 rdma_cm Any suggestions? Pradeep From rdreier at cisco.com Thu Oct 18 19:09:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 18 Oct 2007 19:09:07 -0700 Subject: [ofa-general] ConnectX problems on Sles10sp1 In-Reply-To: <4717D79A.3010207@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Thu, 18 Oct 2007 15:00:58 -0700") References: <4717D79A.3010207@linux.vnet.ibm.com> Message-ID: > My suspicion is that Installed FW version is 2.0.150, that is lower than the OFED 1.2.5 Wiki > suggestion for ConnectX IB (fw-25408 Rev 2.2.000) maybe causing problems. However when I load the > mlx4_ib module no errors are seen. You really need to update your firmware. Are you sure there are no kernel messages? Maybe not from mlx4_ib, but from mlx4_core? - R. 
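The version comparison behind the firmware suspicion above can be sketched as follows. This assumes firmware versions are three dot-separated decimal fields, as in "2.0.150" and "2.2.000"; fw_version_parse() and fw_version_cmp() are illustrative helpers written for this note, not part of any Mellanox tool or of the driver.

```c
#include <stdio.h>

struct fw_version {
	unsigned major, minor, sub;
};

/* Parse "major.minor.sub"; returns 0 on success, -1 on malformed input. */
static int fw_version_parse(const char *s, struct fw_version *v)
{
	char trailing;

	/* Exactly three fields: a trailing character would make sscanf return 4. */
	return sscanf(s, "%u.%u.%u%c",
		      &v->major, &v->minor, &v->sub, &trailing) == 3 ? 0 : -1;
}

/* Returns a negative, zero or positive value like strcmp(). */
static int fw_version_cmp(const struct fw_version *a, const struct fw_version *b)
{
	if (a->major != b->major)
		return a->major < b->major ? -1 : 1;
	if (a->minor != b->minor)
		return a->minor < b->minor ? -1 : 1;
	if (a->sub != b->sub)
		return a->sub < b->sub ? -1 : 1;
	return 0;
}
```

Parsing "2.0.150" and "2.2.000" and comparing them yields a negative result, that is, the installed firmware is older than the revision named on the wiki.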
From pradeeps at linux.vnet.ibm.com Thu Oct 18 19:30:07 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 18 Oct 2007 19:30:07 -0700 Subject: [ofa-general] ConnectX problems on Sles10sp1 In-Reply-To: References: <4717D79A.3010207@linux.vnet.ibm.com> Message-ID: <471816AF.6080204@linux.vnet.ibm.com> Roland Dreier wrote: > > My suspicion is that Installed FW version is 2.0.150, that is lower than the OFED 1.2.5 Wiki > > suggestion for ConnectX IB (fw-25408 Rev 2.2.000) maybe causing problems. However when I load the > > mlx4_ib module no errors are seen. > > You really need to update your firmware. > > Are you sure there are no kernel messages? Maybe not from mlx4_ib, > but from mlx4_core? > There are no kernel messages at all. The other vexing issue really is that I need the device to be available (if recollect I need the PSID for example to download the firmware for the correct adapter) to update the firmware, at least if I were to use mstflint. Any ways around that? Pradeep From rdreier at cisco.com Thu Oct 18 20:30:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 18 Oct 2007 20:30:53 -0700 Subject: [ofa-general] ConnectX problems on Sles10sp1 In-Reply-To: <471816AF.6080204@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Thu, 18 Oct 2007 19:30:07 -0700") References: <4717D79A.3010207@linux.vnet.ibm.com> <471816AF.6080204@linux.vnet.ibm.com> Message-ID: > There are no kernel messages at all. not even stuff like the following? mlx4_core: Mellanox ConnectX core driver v0.01 (May 1, 2007) mlx4_core: Initializing 0000:0d:00.0 if not then you have some problem with your kernel/driver build. You could also set debug_level=1 in your mlx4_core module options and see if that produces any information. But in the end I'm pretty sure you're just going to get a message from the driver confirming that the installed firmware has an unsupported command interface version. 
> The other vexing issue really is that I need the device to be > available (if recollect I need the PSID for example to download the > firmware for the correct adapter) to update the firmware, at least > if I were to use mstflint. Any ways around that? I'm sure you can use mstflint without the driver loaded by specifying the PCI bus/device number, although I don't know the details (it should be described in the README). Also tvflash will work in that situation too (latest git tree should be able to burn Hermon firmware). - R.
From vlad at lists.openfabrics.org Fri Oct 19 02:55:45 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 19 Oct 2007 02:55:45 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071019-0200 daily build status Message-ID: <20071019095545.9A29FE603CD@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on
ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From swelch at systemfabricworks.com Fri Oct 19 08:05:15 2007 From: swelch at systemfabricworks.com (Steve Welch) Date: Fri, 19 Oct 2007 10:05:15 -0500 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDRSMP responses from userspace In-Reply-To: <1192738125.30322.544.camel@brick.pathscale.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <1192667603.30322.504.camel@brick.pathscale.com> <001001c8112a$8575fd20$a865a8c0@catcher> <1192738125.30322.544.camel@brick.pathscale.com> Message-ID: <000601c81261$74c822e0$bc0da8c0@catcher> Ralph, This looks like it could be a problem, but it is not directly related to the original patch; i.e. loop back of a DR SMP response being sent from userspace. I think you could probably combine the routines as you suggest or call them both as is done to solve the original problem. If you want to combine this change with the original patch to make a more uniform change across both problems that would be fine with me. Hal, Sean, do you have an opinion? 
Steve > -----Original Message----- > From: Ralph Campbell [mailto:ralph.campbell at qlogic.com] > Sent: Thursday, October 18, 2007 3:09 PM > To: Steve Welch > Cc: rdreier at cisco.com; sean.hefty at intel.com; general at lists.openfabrics.org > Subject: RE: [ofa-general] [PATCH V3] infiniband/core: Enable loopback > ofDRSMP responses from userspace > > If ib_mad_recv_done_handler() on a switch device gets a DR SMP with: > D == 1, smp->hop_ptr == 1, smp->dr_slid != IB_LID_PERMISSIVE, > then smi_handle_dr_smp_recv() returns IB_SMI_HANDLE, > smi_check_forward_dr_smp() returns IB_SMI_SEND, > smi_handle_dr_smp_send() does smp->hop_ptr-- and returns IB_SMI_HANDLE, > smi_check_local_smp() returns IB_SMI_DISCARD. > > Instead, it should send a copy of the received packet with LRH:DLID > set to the smp->dr_slid. So I think smi_check_local_returning_smp() > needs to be called before smi_check_local_smp() > and do the appropriate code for forwarding the packet. > Alternatively, we could move the code from > smi_check_local_returning_smp() into smi_check_local_smp() and > let the device's process_mad() function do the forwarding. > > On Wed, 2007-10-17 at 20:59 -0500, Steve Welch wrote: > > I believe in the case of ib_mad_recv_done_handler(), the call > > smi_check_forward_dr_smp() will return 0 indicating it should be > > handled by the local stack because the hop pointer will equal > > 0 (in the case where the DR SMP response should be delivered to > > the stack). The smi_check_local_smp() call would not be reached. > > > > The second part of the original fix is not required either > > in ib_mad_recv_done_handler(); when the device process mad > > routine does not reply or consume the MAD it uses the > > original receive mad to deliver to the MAD to the local agent, > > eliminating the need for the memcpy. 
> > > > Steve > > > -----Original Message----- > > > From: Ralph Campbell [mailto:ralph.campbell at qlogic.com] > > > Sent: Wednesday, October 17, 2007 7:33 PM > > > To: swelch at systemfabricworks.com > > > Cc: rdreier at cisco.com; sean.hefty at intel.com; > general at lists.openfabrics.org > > > Subject: Re: [ofa-general] [PATCH V3] infiniband/core: Enable loopback > > > ofDR SMP responses from userspace > > > > > > Steve's patch plus the attached patch for ib_ipath allows loopback > > > to work and doesn't seem to obviously break anything. > > > > > > I was wondering though about adding the code from > > > smi_check_local_returning_smp() to smi_check_local_smp() > > > instead of defining a separate function. > > > That got me thinking about what happens when a return path DR SMP > > > is received and ib_mad_recv_done_handler() calls > smi_check_local_smp(). > > > Now I'm trying to convince myself one way or the other whether > > > the same checks inib_mad_recv_done_handler() are needed or not. > > > > > > On Wed, 2007-10-10 at 22:29 -0500, swelch at systemfabricworks.com wrote: > > > > > > > > Sean, Roland, > > > > > > > > This patch [v3] replaces the [v2] patch; it includes those changes > but > > > renames > > > > the smi function testing returning SMP requests to the name Hal > > > recommends. > > > > > > > > This patch allows userspace DR SMP responses to be looped back and > > > delivered > > > > to a local mad agent by the management stack. 
> > > > > > > > Thanks, Steve > > > > > > > > Signed-off-by: Steve Welch > > > > --- > > > > drivers/infiniband/core/mad.c | 6 +++--- > > > > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > > > > 2 files changed, 20 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/drivers/infiniband/core/mad.c > > > b/drivers/infiniband/core/mad.c > > > > index 6f42877..98148d6 100644 > > > > --- a/drivers/infiniband/core/mad.c > > > > +++ b/drivers/infiniband/core/mad.c > > > > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct > > > ib_mad_agent_private *mad_agent_priv, > > > > } > > > > > > > > /* Check to post send on QP or process locally */ > > > > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > > > > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > > > > + smi_check_local_returning_smp(smp, device) == > IB_SMI_DISCARD) > > > > goto out; > > > > > > > > local = kmalloc(sizeof *local, GFP_ATOMIC); > > > > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct > > > ib_mad_agent_private *mad_agent_priv, > > > > port_priv = ib_get_mad_port(mad_agent_priv- > >agent.device, > > > > mad_agent_priv->agent.port_num); > > > > if (port_priv) { > > > > - mad_priv->mad.mad.mad_hdr.tid = > > > > - ((struct ib_mad *)smp)->mad_hdr.tid; > > > > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct > > ib_mad)); > > > > recv_mad_agent = find_mad_agent(port_priv, > > > > &mad_priv->mad.mad); > > > > } > > > > diff --git a/drivers/infiniband/core/smi.h > > > b/drivers/infiniband/core/smi.h > > > > index 1cfc298..aff96ba 100644 > > > > --- a/drivers/infiniband/core/smi.h > > > > +++ b/drivers/infiniband/core/smi.h > > > > @@ -59,7 +59,8 @@ extern enum smi_action > smi_handle_dr_smp_send(struct > > > ib_smp *smp, > > > > u8 node_type, int port_num); > > > > > > > > /* > > > > - * Return 1 if the SMP should be handled by the local SMA/SM via > > > process_mad > > > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > > > SMA/SM 
> > > > + * via process_mad > > > > */ > > > > static inline enum smi_action smi_check_local_smp(struct ib_smp > *smp, > > > > struct ib_device *device) > > > > @@ -71,4 +72,19 @@ static inline enum smi_action > > > smi_check_local_smp(struct ib_smp *smp, > > > > (smp->hop_ptr == smp->hop_cnt + 1)) ? > > > > IB_SMI_HANDLE : IB_SMI_DISCARD); > > > > } > > > > + > > > > +/* > > > > + * Return IB_SMI_HANDLE if the SMP should be handled by the local > > > SMA/SM > > > > + * via process_mad > > > > + */ > > > > +static inline enum smi_action smi_check_local_returning_smp(struct > > > ib_smp *smp, > > > > + struct ib_device *device) > > > > +{ > > > > + /* C14-13:3 -- We're at the end of the DR segment of path */ > > > > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > > > > + return ((device->process_mad && > > > > + ib_get_smp_direction(smp) && > > > > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > > > > +} > > > > + > > > > #endif /* __SMI_H_ */ > > > > > > > > > > > > diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c > > > b/drivers/infiniband/hw/ipath/ipath_mad.c > > > index 3d1432d..1978c34 100644 > > > --- a/drivers/infiniband/hw/ipath/ipath_mad.c > > > +++ b/drivers/infiniband/hw/ipath/ipath_mad.c > > > @@ -1434,7 +1434,7 @@ static int process_subn(struct ib_device *ibdev, > int > > > mad_flags, > > > * before checking for other consumers. > > > * Just tell the caller to process it normally. > > > */ > > > - ret = IB_MAD_RESULT_FAILURE; > > > + ret = IB_MAD_RESULT_SUCCESS; > > > goto bail; > > > default: > > > smp->status |= IB_SMP_UNSUP_METHOD; > > > @@ -1516,7 +1516,7 @@ static int process_perf(struct ib_device *ibdev, > u8 > > > port_num, > > > * before checking for other consumers. > > > * Just tell the caller to process it normally. 
> > > */ > > > - ret = IB_MAD_RESULT_FAILURE; > > > + ret = IB_MAD_RESULT_SUCCESS; > > > goto bail; > > > default: > > > pmp->status |= IB_SMP_UNSUP_METHOD; > > > > From cap at nsc.liu.se Fri Oct 19 08:20:54 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Fri, 19 Oct 2007 17:20:54 +0200 Subject: [ofa-general] Expected RDMA performance In-Reply-To: References: Message-ID: <200710191720.58526.cap@nsc.liu.se> On Thursday 18 October 2007, Chuck Hartley wrote: ... > 8388608 5000 1342.12 1342.12 > ------------------------------------------------------------------ > > Is this typical RDMA performance? It's close to what I've seen on similar hw. ~1400 is what you can push through the 8x pci-e of the intel 5000 chipset (confirmed by trying 4x pci-e which has shown ~700).
> What is the maximum theoretical BW for > DDR IB - 1525MB/sec? No, it's 20 Gbps on the wire and 8b/10b encoded, so 16 Gbps effective, which is 2000 MB/s (base 10) or 1907 MiB/s (base 2). On our system (with a different HCA) we see quite a difference with snoop-filter off (bios option). With snoop off (our) application performance goes up (not very surprising) but IB performance goes down (latency 0.4us worse and bw ~1400->1200). /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From cap at nsc.liu.se Fri Oct 19 08:31:07 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Fri, 19 Oct 2007 17:31:07 +0200 Subject: [ofa-general] ConnectX problems on Sles10sp1 In-Reply-To: References: <4717D79A.3010207@linux.vnet.ibm.com> <471816AF.6080204@linux.vnet.ibm.com> Message-ID: <200710191731.16893.cap@nsc.liu.se> On Friday 19 October 2007, Roland Dreier wrote: ... > But in the end I'm pretty sure you're just going to get a message from > the driver confirming that the installed firmware has an unsupported > command interface version. Yes, we got the following with 2.0.150 (no special debug etc.): mlx4_core 0000:17:00.0: Installed FW has unsupported command interface revision 1. kernel: mlx4_core 0000:17:00.0: (Installed FW version is 2.0.150) > > The other vexing issue really is that I need the device to be > > available (if recollect I need the PSID for example to download the > > firmware for the correct adapter) to update the firmware, at least > > if I were to use mstflint. Any ways around that? > > I'm sure you can use mstflint without the driver loaded by specifying > the PCI bus/device number, Yes, we just had to do something similar, "mstflint -d xx:yy.z q" will give you the PSID etc.
We also had to fiddle some pci registers to get through to the "fw handicapped" HCA but I have no idea if that will be needed for all HCAs with such old fw. /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From krause at cup.hp.com Fri Oct 19 09:09:57 2007 From: krause at cup.hp.com (Michael Krause) Date: Fri, 19 Oct 2007 09:09:57 -0700 Subject: [ofa-general] Expected RDMA performance In-Reply-To: <200710191720.58526.cap@nsc.liu.se> References: <200710191720.58526.cap@nsc.liu.se> Message-ID: <6.2.0.14.2.20071019090759.02e24a00@esmail.cup.hp.com> At 08:20 AM 10/19/2007, Peter Kjellstrom wrote: >On Thursday 18 October 2007, Chuck Hartley wrote: >... > > 8388608 5000 1342.12 1342.12 > > ------------------------------------------------------------------ > > > > Is this typical RDMA performance? > >It's close to what I've seen on similar hw. ~1400 is what you can push >through >the 8x pci-e of the intel 5000 chipset (confirmed by trying 4x pci-e which >has shown ~700). > > > What is the maximum theoretical BW for > > DDR IB - 1525MB/sec? > >No, it's 20 Gbps on the wire and 8/10 encoded so 16 Gbps effective which is >2000 MB/s (10-base) and 1907 MiB/s (2-base). There is also IB protocol overhead combined with driver / device control traffic overhead (which consumes device as well as PCI resources / bandwidth), and end-to-end control traffic, which is also a function of how the application is constructed. In general, hitting about 80-85% of the theoretical maximum is possible. >On our system (with a different HCA) we see quite a difference with >snoop-filter off (bios option). With snoop off (our) application performance >goes up (not very suprising) but IB performance goes down (latency 0.4us >worse and bw ~1400->1200). Mike -------------- next part -------------- An HTML attachment was scrubbed...
URL: From mshefty at ichips.intel.com Fri Oct 19 09:19:00 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 19 Oct 2007 09:19:00 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDRSMP responses from userspace In-Reply-To: <000601c81261$74c822e0$bc0da8c0@catcher> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <1192667603.30322.504.camel@brick.pathscale.com> <001001c8112a$8575fd20$a865a8c0@catcher> <1192738125.30322.544.camel@brick.pathscale.com> <000601c81261$74c822e0$bc0da8c0@catcher> Message-ID: <4718D8F4.3070305@ichips.intel.com> > This looks like it could be a problem, but it is not directly > related to the original patch; i.e. loop back of a DR SMP > response being sent from userspace. > > I think you could probably combine the routines as you suggest > or call them both as is done to solve the original problem. > If you want to combine this change with the original patch > to make a more uniform change across both problems that > would be fine with me. Hal, Sean, do you have an opinion? I think Hal pointed this out, but the mad layer doesn't currently support mixing DR and LID routed SMPs. I would like patches to add/fix that to be separate. 
- Sean From ralph.campbell at qlogic.com Fri Oct 19 09:26:38 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 19 Oct 2007 09:26:38 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDRSMP responses from userspace In-Reply-To: <4718D8F4.3070305@ichips.intel.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <1192667603.30322.504.camel@brick.pathscale.com> <001001c8112a$8575fd20$a865a8c0@catcher> <1192738125.30322.544.camel@brick.pathscale.com> <000601c81261$74c822e0$bc0da8c0@catcher> <4718D8F4.3070305@ichips.intel.com> Message-ID: <1192811198.6112.1.camel@brick.pathscale.com> On Fri, 2007-10-19 at 09:19 -0700, Sean Hefty wrote: > > This looks like it could be a problem, but it is not directly > > related to the original patch; i.e. loop back of a DR SMP > > response being sent from userspace. > > > > I think you could probably combine the routines as you suggest > > or call them both as is done to solve the original problem. > > If you want to combine this change with the original patch > > to make a more uniform change across both problems that > > would be fine with me. Hal, Sean, do you have an opinion? > > I think Hal pointed this out, but the mad layer doesn't currently > support mixing DR and LID routed SMPs. I would like patches to add/fix > that to be separate. > > - Sean I agree. I just wanted to review the original patch enough to be sure I understood the limitations. Now that my memory of DR SMPs is refreshed, if one of us resubmits the V3 patch plus the ipath change plus the extended description, I will ACK it. 
From hrosenstock at xsigo.com Fri Oct 19 09:56:15 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 19 Oct 2007 09:56:15 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDRSMP responses from userspace In-Reply-To: <4718D8F4.3070305@ichips.intel.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <1192667603.30322.504.camel@brick.pathscale.com> <001001c8112a$8575fd20$a865a8c0@catcher> <1192738125.30322.544.camel@brick.pathscale.com> <000601c81261$74c822e0$bc0da8c0@catcher> <4718D8F4.3070305@ichips.intel.com> Message-ID: <1192812975.23494.176.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-19 at 09:19 -0700, Sean Hefty wrote: > > This looks like it could be a problem, but it is not directly > > related to the original patch; i.e. loop back of a DR SMP > > response being sent from userspace. > > > > I think you could probably combine the routines as you suggest > > or call them both as is done to solve the original problem. > > If you want to combine this change with the original patch > > to make a more uniform change across both problems that > > would be fine with me. Hal, Sean, do you have an opinion? > > I think Hal pointed this out, but the mad layer doesn't currently > support mixing DR and LID routed SMPs. To be precise, pure DR and LR SMPs are supported. Additionally, there is one combined route (LR part followed by a DR part) which can be initiated but that's as far as it goes. That latter mode is used by some ibportstate forms. > I would like patches to add/fix that to be separate. Me too. 
-- Hal > > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Fri Oct 19 09:58:29 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 19 Oct 2007 09:58:29 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDRSMP responses from userspace In-Reply-To: <1192811198.6112.1.camel@brick.pathscale.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <1192667603.30322.504.camel@brick.pathscale.com> <001001c8112a$8575fd20$a865a8c0@catcher> <1192738125.30322.544.camel@brick.pathscale.com> <000601c81261$74c822e0$bc0da8c0@catcher> <4718D8F4.3070305@ichips.intel.com> <1192811198.6112.1.camel@brick.pathscale.com> Message-ID: <1192813109.23494.179.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-19 at 09:26 -0700, Ralph Campbell wrote: > On Fri, 2007-10-19 at 09:19 -0700, Sean Hefty wrote: > > > This looks like it could be a problem, but it is not directly > > > related to the original patch; i.e. loop back of a DR SMP > > > response being sent from userspace. > > > > > > I think you could probably combine the routines as you suggest > > > or call them both as is done to solve the original problem. > > > If you want to combine this change with the original patch > > > to make a more uniform change across both problems that > > > would be fine with me. Hal, Sean, do you have an opinion? > > > > I think Hal pointed this out, but the mad layer doesn't currently > > support mixing DR and LID routed SMPs. I would like patches to add/fix > > that to be separate. > > > > - Sean > > I agree. I just wanted to review the original patch enough to be > sure I understood the limitations. 
Now that my memory of DR SMPs > is refreshed, if one of us resubmits the V3 patch plus the ipath > change plus the extended description, I will ACK it. Yes, Steve should resubmit the V3 patch updated with the extended description. I think your patch to eliminate the unneeded if clause can be a separate patch (and I acked that already). Is that what you mean by the ipath change? Is there something else right now? -- Hal From ralph.campbell at qlogic.com Fri Oct 19 10:08:40 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 19 Oct 2007 10:08:40 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDRSMP responses from userspace In-Reply-To: <1192813109.23494.179.camel@hrosenstock-ws.xsigo.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <1192667603.30322.504.camel@brick.pathscale.com> <001001c8112a$8575fd20$a865a8c0@catcher> <1192738125.30322.544.camel@brick.pathscale.com> <000601c81261$74c822e0$bc0da8c0@catcher> <4718D8F4.3070305@ichips.intel.com> <1192811198.6112.1.camel@brick.pathscale.com> <1192813109.23494.179.camel@hrosenstock-ws.xsigo.com> Message-ID: <1192813720.6112.17.camel@brick.pathscale.com> On Fri, 2007-10-19 at 09:58 -0700, Hal Rosenstock wrote: > Yes, Steve should resubmit the V3 patch updated with the extended > description. I think your patch to eliminate the unneeded if clause can > be a separate patch (and I acked that already). > > Is that what you mean by the ipath change ? Is there something else > right now ? > > -- Hal I was referring to the change to ipath_mad.c to change the two instances of IB_MAD_RESULT_FAILURE to IB_MAD_RESULT_SUCCESS.
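For readers following the thread, the two local-delivery checks discussed above can be modelled outside the kernel. What follows is a simplified, standalone C sketch, not the kernel code: struct model_smp, the has_process_mad flag and delivered_locally() are hypothetical names introduced only for illustration, and the first two conditions of the existing smi_check_local_smp() (a process_mad handler exists and the direction bit is clear) are an assumption, since the quoted diff shows only its hop pointer comparison.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel definitions (illustration only). */
enum smi_action { IB_SMI_DISCARD, IB_SMI_HANDLE };

struct model_smp {
	unsigned char hop_ptr;  /* current position in the directed route */
	unsigned char hop_cnt;  /* number of hops in the directed route */
	bool returning;         /* direction bit D: true on the return path */
};

/*
 * Outgoing request that has reached the end of its directed route.
 * The first two conditions are assumed; the diff shows only the last one.
 */
enum smi_action model_check_local_smp(const struct model_smp *smp,
				      bool has_process_mad)
{
	return (has_process_mad && !smp->returning &&
		smp->hop_ptr == smp->hop_cnt + 1) ?
		IB_SMI_HANDLE : IB_SMI_DISCARD;
}

/* Returning response whose hop pointer is 0, as added by the patch. */
enum smi_action model_check_local_returning_smp(const struct model_smp *smp,
						bool has_process_mad)
{
	return (has_process_mad && smp->returning && !smp->hop_ptr) ?
		IB_SMI_HANDLE : IB_SMI_DISCARD;
}

/*
 * Mirrors the patched condition in handle_outgoing_dr_smp(): the SMP
 * skips local processing only if both checks return IB_SMI_DISCARD.
 */
bool delivered_locally(const struct model_smp *smp, bool has_process_mad)
{
	return model_check_local_smp(smp, has_process_mad) == IB_SMI_HANDLE ||
	       model_check_local_returning_smp(smp, has_process_mad) == IB_SMI_HANDLE;
}
```

In this model a returning response with hop_ptr == 0 is caught only by the second check, which is the case the patch is meant to enable; a returning response with a non-zero hop pointer fails both checks and is not handled locally.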
From hrosenstock at xsigo.com Fri Oct 19 10:12:49 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 19 Oct 2007 10:12:49 -0700 Subject: [ofa-general] [PATCH V3] infiniband/core: Enable loopback ofDRSMP responses from userspace In-Reply-To: <1192813720.6112.17.camel@brick.pathscale.com> References: <470D9895.mail1ZJ11FSUP@systemfabricworks.com> <1192667603.30322.504.camel@brick.pathscale.com> <001001c8112a$8575fd20$a865a8c0@catcher> <1192738125.30322.544.camel@brick.pathscale.com> <000601c81261$74c822e0$bc0da8c0@catcher> <4718D8F4.3070305@ichips.intel.com> <1192811198.6112.1.camel@brick.pathscale.com> <1192813109.23494.179.camel@hrosenstock-ws.xsigo.com> <1192813720.6112.17.camel@brick.pathscale.com> Message-ID: <1192813969.23494.191.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-19 at 10:08 -0700, Ralph Campbell wrote: > On Fri, 2007-10-19 at 09:58 -0700, Hal Rosenstock wrote: > > > Yes, Steve should resubmit the V3 patch updated with the extended > > description. I think your patch to eliminate the unneeded if clause can > > be a separate patch (and I acked that already). > > > > Is that what you mean by the ipath change ? Is there something else > > right now ? > > > > -- Hal > > I was refering to the change to ipath_mad.c to change the two > instances of IB_MAD_RESULT_FAILURE to IB_MAD_RESULT_SUCCESS. Ah. OK; that's a third one. I'll look at that one and comment on it. From swelch at systemfabricworks.com Fri Oct 19 10:41:28 2007 From: swelch at systemfabricworks.com (swelch at systemfabricworks.com) Date: Fri, 19 Oct 2007 12:41:28 -0500 Subject: [ofa-general] [PATCH V4] infiniband/core: Enable loopback of DR SMP responses from userspace Message-ID: <4718EC48.mail8WG11T65C@systemfabricworks.com> This patch [v4] replaces the [v3] patch; it is identical except that the patch description has been updated to restore the detailed description absent from the [v3] patch.
The local loopback of an outgoing DR SMP response is limited to those that originate at the driver specific SMA implementation during the driver specific process_mad() function. This patch enables a returning DR SMP originating in userspace (or elsewhere) to be delivered to the local management stack. In this specific case the driver process_mad() function does not consume or process the MAD, so a response MAD has not been created and the original MAD must manually be copied to the MAD buffer that is to be handed off to the local agent. For consistent behavior on top of iPath hardware, a subsequent patch to be submitted by Ralph Campbell to update process_mad() return values is required. Thanks, Steve Signed-off-by: Steve Welch --- drivers/infiniband/core/mad.c | 6 +++--- drivers/infiniband/core/smi.h | 18 +++++++++++++++++- 2 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..98148d6 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, } /* Check to post send on QP or process locally */ - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) goto out; local = kmalloc(sizeof *local, GFP_ATOMIC); @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, port_priv = ib_get_mad_port(mad_agent_priv->agent.device, mad_agent_priv->agent.port_num); if (port_priv) { - mad_priv->mad.mad.mad_hdr.tid = - ((struct ib_mad *)smp)->mad_hdr.tid; + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); recv_mad_agent = find_mad_agent(port_priv, &mad_priv->mad.mad); } diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h index 1cfc298..aff96ba 100644 --- a/drivers/infiniband/core/smi.h +++
b/drivers/infiniband/core/smi.h @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, u8 node_type, int port_num); /* - * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM + * via process_mad */ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, struct ib_device *device) @@ -71,4 +72,19 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, (smp->hop_ptr == smp->hop_cnt + 1)) ? IB_SMI_HANDLE : IB_SMI_DISCARD); } + +/* + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM + * via process_mad + */ +static inline enum smi_action smi_check_local_returning_smp(struct ib_smp *smp, + struct ib_device *device) +{ + /* C14-13:3 -- We're at the end of the DR segment of path */ + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ + return ((device->process_mad && + ib_get_smp_direction(smp) && + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); +} + #endif /* __SMI_H_ */ From weiny2 at llnl.gov Fri Oct 19 11:11:36 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 19 Oct 2007 11:11:36 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: Formalize BuildRequires for rpmbuild Message-ID: <20071019111136.48518c07.weiny2@llnl.gov> >From 33d2c9cca44ce13aa8f35b2228369a33f7a45a70 Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Wed, 17 Oct 2007 15:23:55 -0700 Subject: [PATCH] Formalize BuildRequires for rpmbuild the mock build tool in particular requires specific build requires Signed-off-by: Ira K. 
Weiny --- infiniband-diags/infiniband-diags.spec.in | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/infiniband-diags.spec.in b/infiniband-diags/infiniband-diags.spec.in index 30c14a9..f4a08ab 100644 --- a/infiniband-diags/infiniband-diags.spec.in +++ b/infiniband-diags/infiniband-diags.spec.in @@ -11,7 +11,7 @@ Group: System Environment/Libraries BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) Source: http://www.openfabrics.org/downloads/management/@TARBALL@ Url: http://openfabrics.org/ -BuildRequires: libibmad-devel, opensm-devel +BuildRequires: libibmad-devel, opensm-devel, libibcommon-devel, libibumad-devel Provides: perl(IBswcountlimits) %description -- 1.5.1 -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-Formalize-BuildRequires-for-rpmbuild.patch Type: application/octet-stream Size: 1063 bytes Desc: not available URL: From ralph.campbell at qlogic.com Fri Oct 19 11:51:08 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 19 Oct 2007 11:51:08 -0700 Subject: [ofa-general] Re: [PATCH V4] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <4718EC48.mail8WG11T65C@systemfabricworks.com> References: <4718EC48.mail8WG11T65C@systemfabricworks.com> Message-ID: <1192819868.6112.33.camel@brick.pathscale.com> On Fri, 2007-10-19 at 12:41 -0500, swelch at systemfabricworks.com wrote: > > This patch [v4] replaces the [v3] patch; it's identicial other than > the patch description has been updated to put back in the detailed > patch description absent from the [v3] patch. > > The local loopback of an outgoing DR SMP response is limited to those > that originate at the driver specific SMA implementation during the > driver specific process_mad() function. This patch enables a > returning DR SMP originating in userspace (or elsewhere) to be > delivered to the local managment stack. 
In this specific case > the driver process_mad() function does not consume or process > the MAD, so a response MAD has not been created and the original > MAD must manually be copied to the MAD buffer that is to be handed > off to the local agent. > > For consistent behavior on top of iPath hardware, a subsequent patch > to be submitted by Ralph Campbell to update process_mad() return values > is required. > > Thanks, Steve Acked-by: Ralph Campbell > Signed-off-by: Steve Welch > --- > drivers/infiniband/core/mad.c | 6 +++--- > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > 2 files changed, 20 insertions(+), 4 deletions(-) > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..98148d6 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > } > > /* Check to post send on QP or process locally */ > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) > goto out; > > local = kmalloc(sizeof *local, GFP_ATOMIC); > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > mad_agent_priv->agent.port_num); > if (port_priv) { > - mad_priv->mad.mad.mad_hdr.tid = > - ((struct ib_mad *)smp)->mad_hdr.tid; > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > recv_mad_agent = find_mad_agent(port_priv, > &mad_priv->mad.mad); > } > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > index 1cfc298..aff96ba 100644 > --- a/drivers/infiniband/core/smi.h > +++ b/drivers/infiniband/core/smi.h > @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, > u8 node_type, int port_num); > > /* > - * Return 1 if the SMP should be
handled by the local SMA/SM via process_mad > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > */ > static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > struct ib_device *device) > @@ -71,4 +72,19 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > (smp->hop_ptr == smp->hop_cnt + 1)) ? > IB_SMI_HANDLE : IB_SMI_DISCARD); > } > + > +/* > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > + */ > +static inline enum smi_action smi_check_local_returning_smp(struct ib_smp *smp, > + struct ib_device *device) > +{ > + /* C14-13:3 -- We're at the end of the DR segment of path */ > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > + return ((device->process_mad && > + ib_get_smp_direction(smp) && > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > +} > + > #endif /* __SMI_H_ */
From richard.frank at oracle.com Fri Oct 19 12:03:14 2007 From: richard.frank at oracle.com (Richard Frank) Date: Fri, 19 Oct 2007 15:03:14 -0400 Subject: [ofa-general] Expected RDMA performance In-Reply-To: <200710191720.58526.cap@nsc.liu.se> References: <200710191720.58526.cap@nsc.liu.se> Message-ID: <4718FF72.801@oracle.com> Does it follow then that it's possible to get 1400 mbytes / sec out + 1400 mbytes / sec in for a total of 2800 mbytes rdma write + rdma read? Peter Kjellstrom wrote: > On Thursday 18 October 2007, Chuck Hartley wrote: > ... > >> 8388608 5000 1342.12 1342.12 >> ------------------------------------------------------------------ >> >> Is this typical RDMA performance? >> > > It's close to what I've seen on similar hw. ~1400 is what you can push through > the 8x pci-e of the intel 5000 chipset (confirmed by trying 4x pci-e which > has shown ~700). > > >> What is the maximum theoretical BW for >> DDR IB - 1525MB/sec? >> > > No, it's 20 Gbps on the wire and 8/10 encoded so 16 Gbps effective which is > 2000 MB/s (10-base) and 1907 MiB/s (2-base). > > On our system (with a different HCA) we see quite a difference with > snoop-filter off (bios option). With snoop off (our) application performance > goes up (not very surprising) but IB performance goes down (latency 0.4us > worse and bw ~1400->1200).
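[Editorial sketch, not part of the original message.] The DDR bandwidth figures quoted above can be reproduced with a few lines of C. This is a minimal sketch that assumes only what the reply states: 20 Gbps signalling on the wire and 8/10 line encoding. The function names are hypothetical and do not come from any OFED tool or library.

```c
#include <assert.h>

/* Hypothetical helpers; only the 20 Gbps and 8/10 figures come from the thread. */

/* Payload bit rate after 8/10 line encoding: 8 data bits per 10 wire bits. */
static double ib_effective_bps(double wire_bps)
{
    return wire_bps * 8.0 / 10.0;
}

/* Convert bits per second to decimal megabytes per second (1 MB = 10^6 bytes). */
static double bps_to_mb_per_s(double bps)
{
    return bps / 8.0 / 1e6;
}

/* Convert bits per second to binary mebibytes per second (1 MiB = 2^20 bytes). */
static double bps_to_mib_per_s(double bps)
{
    return bps / 8.0 / (1024.0 * 1024.0);
}

/* Self-check against the numbers given in the reply: 16 Gbps, 2000 MB/s, ~1907 MiB/s. */
static void check_ddr_figures(void)
{
    double eff = ib_effective_bps(20e9);
    double mib = bps_to_mib_per_s(eff);

    assert(eff == 16e9);
    assert(bps_to_mb_per_s(eff) == 2000.0);
    assert(mib > 1907.0 && mib < 1908.0);
}
```

The measured ~1400 MB/s sits well below this link-level ceiling, which is consistent with the reply's point that the PCIe slot of the host chipset, not the IB link, is the limiting factor on that hardware.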
> > /Peter > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ggrundstrom at neteffect.com Fri Oct 19 12:57:10 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 14:57:10 -0500 Subject: [ofa-general] [PATCH 0/14 v2] nes: NetEffect 10Gb RNIC Driver Message-ID: <200710191957.l9JJvAgC021662@neteffect.com> This is the second posting for the series of patches containing the source code for the NetEffect 10Gb RNIC adapter. The driver is split into two components - a kernel driver module and a userspace library. The code can also be found in the following git trees. git.openfabrics.org/~glenn/libnes.git git.openfabrics.org/~glenn/linux-2.6.git Thanks, Glenn. From ggrundstrom at neteffect.com Fri Oct 19 13:01:30 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:01:30 -0500 Subject: [ofa-general] [PATCH 1/14 v2] nes: module and device initialization Message-ID: <200710192001.l9JK1U8O021689@neteffect.com> Kernel module and device initialization routines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes.c 2007-10-19 09:43:32.000000000 -0500 @@ -0,0 +1,811 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "nes.h" + +#include +#include +#include +#include + +#ifdef SPIN_BUG_ON +#undef SPIN_BUG_ON +#define SPIN_BUG_ON (...) 
+#endif + +MODULE_AUTHOR("NetEffect"); +MODULE_DESCRIPTION("NetEffect RNIC Low-level iWARP Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +int max_mtu = 9000; +int nics_per_function = 1; + +#ifdef NES_INT_MODERATE +int interrupt_mod_interval = 128; +#else +int interrupt_mod_interval = 0; +#endif + +/* Interoperability */ +int mpa_version = 1; +module_param(mpa_version, int, 0); +MODULE_PARM_DESC(mpa_version, "MPA version to be used in MPA Req/Resp (0 or 1)"); + +/* Interoperability */ +int disable_mpa_crc = 0; +module_param(disable_mpa_crc, int, 0); +MODULE_PARM_DESC(disable_mpa_crc, "Disable checking of MPA CRC"); + +unsigned int send_first = 0; +module_param(send_first, uint, 0); +MODULE_PARM_DESC(send_first, "Send RDMA Message First on Active Connection"); + + +unsigned int nes_drv_opt = 0; +module_param(nes_drv_opt, uint, 0); +MODULE_PARM_DESC(nes_drv_opt, "Driver option parameters"); + +unsigned int nes_debug_level = 0xffffffff; +module_param(nes_debug_level, uint, 0644); +MODULE_PARM_DESC(nes_debug_level, "Enable debug output level"); + +LIST_HEAD(nes_adapter_list); +LIST_HEAD(nes_dev_list); + +atomic_t qps_destroyed; +atomic_t cqp_reqs_allocated; +atomic_t cqp_reqs_freed; +atomic_t cqp_reqs_dynallocated; +atomic_t cqp_reqs_dynfreed; +atomic_t cqp_reqs_queued; +atomic_t cqp_reqs_redriven; + +static void nes_print_macaddr(struct net_device *netdev); +static irqreturn_t nes_interrupt(int, void *); +static int __devinit nes_probe(struct pci_dev *, const struct pci_device_id *); +static void __devexit nes_remove(struct pci_dev *); +static int __init nes_init_module(void); +static void __exit nes_exit_module(void); + +static struct pci_device_id nes_pci_table[] = { + {PCI_VENDOR_ID_NETEFFECT, PCI_DEVICE_ID_NETEFFECT_NE020, PCI_ANY_ID, PCI_ANY_ID}, + {0} +}; + +MODULE_DEVICE_TABLE(pci, nes_pci_table); + +static int nes_inetaddr_event(struct notifier_block *, unsigned long, void *); +static int nes_net_event(struct notifier_block *,
unsigned long, void *); +static int notifiers_registered = 0; + + +static struct notifier_block nes_inetaddr_notifier = { + .notifier_call = nes_inetaddr_event +}; + +static struct notifier_block nes_net_notifier = { + .notifier_call = nes_net_event +}; + + + + +/** + * nes_inetaddr_event + */ +static int nes_inetaddr_event(struct notifier_block *notifier, + unsigned long event, void *ptr) +{ + struct in_ifaddr *ifa = ptr; + struct net_device *event_netdev = ifa->ifa_dev->dev; + struct nes_device *nesdev; + struct net_device *netdev; + struct nes_vnic *nesvnic; + unsigned int addr; + unsigned int mask; + + addr = ntohl(ifa->ifa_address); + mask = ntohl(ifa->ifa_mask); + nes_debug(NES_DBG_NETDEV, "nes_inetaddr_event: ip address %08X, netmask %08X.\n", + addr, mask); + list_for_each_entry(nesdev, &nes_dev_list, list) { + nes_debug(NES_DBG_NETDEV, "Nesdev list entry = 0x%p. (%s)\n", + nesdev, nesdev->netdev[0]->name); + netdev = nesdev->netdev[0]; + nesvnic = netdev_priv(netdev); + if (netdev == event_netdev) { + if (0 == nesvnic->rdma_enabled) { + nes_debug(NES_DBG_NETDEV, "Returning without processing event for %s since" + " RDMA is not enabled.\n", + netdev->name); + return NOTIFY_OK; + } + /* we have ifa->ifa_address/mask here if we need it */ + switch (event) { + case NETDEV_DOWN: + nes_debug(NES_DBG_NETDEV, "event:DOWN\n"); + nes_write_indexed(nesdev, + NES_IDX_DST_IP_ADDR+(0x10*PCI_FUNC(nesdev->pcidev->devfn)), 0); + + nesvnic->local_ipaddr = 0; + return NOTIFY_OK; + break; + case NETDEV_UP: + nes_debug(NES_DBG_NETDEV, "event:UP\n"); + + if (nesvnic->local_ipaddr != 0) { + nes_debug(NES_DBG_NETDEV, "Interface already has local_ipaddr\n"); + return NOTIFY_OK; + } + /* Add the address to the IP table */ + nesvnic->local_ipaddr = ifa->ifa_address; + + nes_write_indexed(nesdev, + NES_IDX_DST_IP_ADDR+(0x10*PCI_FUNC(nesdev->pcidev->devfn)), + ntohl(ifa->ifa_address)); + return NOTIFY_OK; + break; + default: + break; + } + } + } + + return NOTIFY_DONE; +} + + +/** + * 
nes_net_event + */ +static int nes_net_event(struct notifier_block *notifier, + unsigned long event, void *ptr) +{ + struct neighbour *neigh = ptr; + struct nes_device *nesdev; + struct net_device *netdev; + struct nes_vnic *nesvnic; + + switch (event) { + case NETEVENT_NEIGH_UPDATE: + list_for_each_entry(nesdev, &nes_dev_list, list) { + /* nes_debug(NES_DBG_NETDEV, "Nesdev list entry = 0x%p.\n", nesdev); */ + netdev = nesdev->netdev[0]; + nesvnic = netdev_priv(netdev); + if (netdev == neigh->dev) { + if (0 == nesvnic->rdma_enabled) { + nes_debug(NES_DBG_NETDEV, "Skipping device %s since no RDMA\n", + netdev->name); + } else { + if (neigh->nud_state & NUD_VALID) { + nes_manage_arp_cache(neigh->dev, neigh->ha, + ntohl(*(u32 *)neigh->primary_key), NES_ARP_ADD); + } else { + nes_manage_arp_cache(neigh->dev, neigh->ha, + ntohl(*(u32 *)neigh->primary_key), NES_ARP_DELETE); + } + } + return NOTIFY_OK; + } + } + break; + default: + nes_debug(NES_DBG_NETDEV, "NETEVENT_ %lu undefined\n", event); + break; + } + + return NOTIFY_DONE; +} + + +/** + * nes_add_ref + */ +void nes_add_ref(struct ib_qp *ibqp) +{ + struct nes_qp *nesqp; + + nesqp = to_nesqp(ibqp); + nes_debug(NES_DBG_QP, "Bumping refcount for QP%u. 
Pre-inc value = %u\n", + ibqp->qp_num, atomic_read(&nesqp->refcount)); + atomic_inc(&nesqp->refcount); +} + + +/** + * nes_rem_ref + */ +void nes_rem_ref(struct ib_qp *ibqp) +{ + u64 u64temp; + struct nes_qp *nesqp; + struct nes_vnic *nesvnic = to_nesvnic(ibqp->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_cqp_request *cqp_request; + + nesqp = to_nesqp(ibqp); + + if (atomic_read(&nesqp->refcount) == 0) { + printk(KERN_INFO PFX "%s: Reference count already 0 for QP%d, last aeq = 0x%04X.\n", + __FUNCTION__, ibqp->qp_num, nesqp->last_aeq ); + BUG(); + } + + if (atomic_dec_and_test(&nesqp->refcount)) { + atomic_inc(&qps_destroyed); + + /* Free the control structures */ + pci_free_consistent(nesdev->pcidev, nesqp->qp_mem_size, nesqp->hwqp.sq_vbase, + nesqp->hwqp.sq_pbase); + + nesadapter->qp_table[nesqp->hwqp.qp_id-NES_FIRST_QPN] = NULL; + nes_free_resource(nesadapter, nesadapter->allocated_qps, nesqp->hwqp.qp_id); + + /* Destroy the QP */ + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_QP, "Failed to get a cqp_request.\n"); + return; + } + cqp_request->waiting = 0; + cqp_wqe = &cqp_request->cqp_wqe; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = + cpu_to_le32(NES_CQP_DESTROY_QP | NES_CQP_QP_TYPE_IWARP); + + if (nesqp->hte_added) { + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32(NES_CQP_QP_DEL_HTE); + nesqp->hte_added = 0; + } + + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesqp->hwqp.qp_id); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = + cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = + cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + u64temp = (u64)nesqp->nesqp_context_pbase; + 
cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_HIGH_IDX] = + cpu_to_le32((u32)(u64temp >> 32)); + + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + kfree(nesqp->allocated_buffer); + } +} + + +/** + * nes_get_qp + */ +struct ib_qp *nes_get_qp(struct ib_device *device, int qpn) { + struct nes_vnic *nesvnic = to_nesvnic(device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + + if ((qpn<NES_FIRST_QPN) || (qpn>=(NES_FIRST_QPN+nesadapter->max_qp))) + return NULL; + + return &nesadapter->qp_table[qpn-NES_FIRST_QPN]->ibqp; +} + + +/** + * nes_print_macaddr + */ +static void nes_print_macaddr(struct net_device *netdev) +{ + nes_debug(NES_DBG_INIT, "%s: MAC %02X:%02X:%02X:%02X:%02X:%02X, IRQ %u\n", + netdev->name, + netdev->dev_addr[0], netdev->dev_addr[1], netdev->dev_addr[2], + netdev->dev_addr[3], netdev->dev_addr[4], netdev->dev_addr[5], + netdev->irq); +} + + +/** + * nes_interrupt - handle interrupts + */ +static irqreturn_t nes_interrupt(int irq, void *dev_id) +{ + struct nes_device *nesdev = (struct nes_device *)dev_id; + int handled = 0; + u32 int_mask; + u32 int_req; + u32 int_stat; + u32 intf_int_stat; + u32 timer_stat; + + if (nesdev->msi_enabled) { + /* No need to read the interrupt pending register if msi is enabled */ + handled = 1; + } else { + if (unlikely(nesdev->nesadapter->hw_rev == NE020_REV)) { + /* Master interrupt enable provides synchronization for kicking off bottom half + when interrupt sharing is going on */ + int_mask = nes_read32(nesdev->regs + NES_INT_MASK); + if (int_mask & 0x80000000) { + /* Check interrupt status to see if this might be ours */ + int_stat = nes_read32(nesdev->regs + NES_INT_STAT); + int_req = nesdev->int_req; + if (int_stat&int_req) { + /* if interesting CEQ or AEQ is pending, claim the interrupt */ + if ((int_stat&int_req) &
(~(NES_INT_TIMER|NES_INT_INTF))) { + handled = 1; + } else { + if (((int_stat & int_req) & NES_INT_TIMER) == NES_INT_TIMER) { + /* Timer might be running but might be for another function */ + timer_stat = nes_read32(nesdev->regs + NES_TIMER_STAT); + if ((timer_stat & nesdev->timer_int_req) != 0) { + handled = 1; + } + } + if ((((int_stat & int_req) & NES_INT_INTF) == NES_INT_INTF) && + (0 == handled)) { + intf_int_stat = nes_read32(nesdev->regs+NES_INTF_INT_STAT); + if ((intf_int_stat & nesdev->intf_int_req) != 0) { + handled = 1; + } + } + } + if (handled) { + nes_write32(nesdev->regs+NES_INT_MASK, int_mask & (~0x80000000)); + int_mask = nes_read32(nesdev->regs+NES_INT_MASK); + /* Save off the status to save an additional read */ + nesdev->int_stat = int_stat; + nesdev->napi_isr_ran = 1; + } + } + } + } else { + handled = nes_read32(nesdev->regs+NES_INT_PENDING); + } + } + + if (handled) { +#ifdef NES_NAPI + if (0 == nes_napi_isr(nesdev)) { +#endif + tasklet_schedule(&nesdev->dpc_tasklet); +#ifdef NES_NAPI + } +#endif + return IRQ_HANDLED; + } else { + return IRQ_NONE; + } +} + + +/** + * nes_probe - Device initialization + */ +static int __devinit nes_probe(struct pci_dev *pcidev, const struct pci_device_id *ent) +{ + struct net_device *netdev = NULL; + struct nes_device *nesdev = NULL; + int ret = 0; + struct nes_vnic *nesvnic = NULL; + void __iomem *mmio_regs = NULL; + u8 hw_rev; + + assert(pcidev != NULL); + assert(ent != NULL); + + printk(KERN_INFO PFX "NetEffect RNIC driver v%s loading. (%s)\n", + DRV_VERSION, pci_name(pcidev)); + + ret = pci_enable_device(pcidev); + if (ret) { + printk(KERN_ERR PFX "Unable to enable PCI device. 
(%s)\n", pci_name(pcidev)); + goto bail0; + } + + nes_debug(NES_DBG_INIT, "BAR0 (@0x%08lX) size = 0x%lX bytes\n", + (long unsigned int)pci_resource_start(pcidev, BAR_0), + (long unsigned int)pci_resource_len(pcidev, BAR_0)); + nes_debug(NES_DBG_INIT, "BAR1 (@0x%08lX) size = 0x%lX bytes\n", + (long unsigned int)pci_resource_start(pcidev, BAR_1), + (long unsigned int)pci_resource_len(pcidev, BAR_1)); + + /* Make sure PCI base addr are MMIO */ + if (!(pci_resource_flags(pcidev, BAR_0) & IORESOURCE_MEM) || + !(pci_resource_flags(pcidev, BAR_1) & IORESOURCE_MEM)) { + printk(KERN_ERR PFX "PCI regions not an MMIO resource\n"); + ret = -ENODEV; + goto bail1; + } + + /* Reserve PCI I/O and memory resources */ + ret = pci_request_regions(pcidev, DRV_NAME); + if (ret) { + printk(KERN_ERR PFX "Unable to request regions. (%s)\n", pci_name(pcidev)); + goto bail1; + } + + if ((sizeof(dma_addr_t) > 4)) { + ret = pci_set_dma_mask(pcidev, DMA_64BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "64b DMA mask configuration failed\n"); + goto bail2; + } + ret = pci_set_consistent_dma_mask(pcidev, DMA_64BIT_MASK); + if (ret) { + printk(KERN_ERR PFX "64b DMA consistent mask configuration failed\n"); + goto bail2; + } + } else { + ret = pci_set_dma_mask(pcidev, DMA_32BIT_MASK); + if (ret < 0) { + printk(KERN_ERR PFX "32b DMA mask configuration failed\n"); + goto bail2; + } + ret = pci_set_consistent_dma_mask(pcidev, DMA_32BIT_MASK); + if (ret) { + printk(KERN_ERR PFX "32b DMA consistent mask configuration failed\n"); + goto bail2; + } + } + + pci_set_master(pcidev); + + /* Allocate hardware structure */ + nesdev = kmalloc(sizeof(struct nes_device), GFP_KERNEL); + if (!nesdev) { + printk(KERN_ERR PFX "%s: Unable to alloc hardware struct\n", pci_name(pcidev)); + ret = -ENOMEM; + goto bail2; + } + + memset(nesdev, 0, sizeof(struct nes_device)); + nes_debug(NES_DBG_INIT, "Allocated nes device at %p\n", nesdev); + nesdev->pcidev = pcidev; + pci_set_drvdata(pcidev, nesdev); + + 
pci_read_config_byte(pcidev, 0x0008, &hw_rev); + nes_debug(NES_DBG_INIT, "hw_rev=%u\n", hw_rev); + + spin_lock_init(&nesdev->indexed_regs_lock); + + /* Remap the PCI registers in adapter BAR0 to kernel VA space */ + mmio_regs = ioremap_nocache(pci_resource_start(pcidev, BAR_0), sizeof(mmio_regs)); + if (mmio_regs == NULL) { + printk(KERN_ERR PFX "Unable to remap BAR0\n"); + ret = -EIO; + goto bail3; + } + nesdev->regs = mmio_regs; + nesdev->index_reg = 0x50 + (PCI_FUNC(pcidev->devfn)*8) + mmio_regs; + + /* Ensure interrupts are disabled */ + nes_write32(nesdev->regs+NES_INT_MASK, 0x7fffffff); + +#ifdef CONFIG_PCI_MSI + if (nes_drv_opt & NES_DRV_OPT_ENABLE_MSI) { + if (!pci_enable_msi(nesdev->pcidev)) { + nesdev->msi_enabled = 1; + nes_debug(NES_DBG_INIT, "MSI is enabled for device %s\n", + pci_name(pcidev)); + } else { + nes_debug(NES_DBG_INIT, "MSI is disabled by linux for device %s\n", + pci_name(pcidev)); + } + } else { + nes_debug(NES_DBG_INIT, "MSI not requested due to driver options for device %s\n", + pci_name(pcidev)); + } +#else + nes_debug(NES_DBG_INIT, "MSI not supported by this kernel for device %s\n", + pci_name(pcidev)); +#endif + + nesdev->et_rx_coalesce_usecs_irq = interrupt_mod_interval; + nesdev->csr_start = pci_resource_start(nesdev->pcidev, BAR_0); + nesdev->doorbell_start = pci_resource_start(nesdev->pcidev, BAR_1); + + /* Init the adapter */ + nesdev->nesadapter = nes_init_adapter(nesdev, hw_rev); + if (!nesdev->nesadapter) { + printk(KERN_ERR PFX "Unable to initialize adapter.\n" ); + ret = -ENOMEM; + goto bail5; + } + + nesdev->mac_index = PCI_FUNC(nesdev->pcidev->devfn)%nesdev->nesadapter->port_count; + tasklet_init(&nesdev->dpc_tasklet, nes_dpc, (unsigned long)nesdev); + + /* bring up the Control QP */ + if (nes_init_cqp(nesdev)) { + ret = -ENODEV; + goto bail6; + } + + /* Arm the CCQ */ + nes_write32(nesdev->regs+NES_CQE_ALLOC, NES_CQE_ALLOC_NOTIFY_NEXT | + PCI_FUNC(nesdev->pcidev->devfn)); + nes_read32(nesdev->regs+NES_CQE_ALLOC); + + /* 
Enable the interrupts */ + nesdev->int_req = (0x101 << PCI_FUNC(nesdev->pcidev->devfn)) | + (1 << (PCI_FUNC(nesdev->pcidev->devfn)+16)); + if (PCI_FUNC(nesdev->pcidev->devfn) < 4) { + nesdev->int_req |= (1 << (PCI_FUNC(nesdev->pcidev->devfn)+24)); + } + + /* TODO: This really should be the first driver to load, not function 0 */ + if (0 == PCI_FUNC(nesdev->pcidev->devfn)) { + /* pick up PCI and critical errors if the first driver to load */ + nesdev->intf_int_req = NES_INTF_INT_PCIERR | NES_INTF_INT_CRITERR; + nesdev->int_req |= NES_INT_INTF; + } else { + nesdev->intf_int_req = 0; + } + nesdev->intf_int_req |= (1 << (PCI_FUNC(nesdev->pcidev->devfn)+16)); + nes_write_indexed(nesdev, NES_IDX_DEBUG_ERROR_MASKS0, 0); + nes_write_indexed(nesdev, NES_IDX_DEBUG_ERROR_MASKS1, 0); + nes_write_indexed(nesdev, NES_IDX_DEBUG_ERROR_MASKS2, 0x00001265); + nes_write_indexed(nesdev, NES_IDX_DEBUG_ERROR_MASKS4, 0x18021804); + + nes_write_indexed(nesdev, NES_IDX_DEBUG_ERROR_MASKS3, 0x17801790); + + /* deal with both periodic and one_shot */ + nesdev->timer_int_req = 0x101 << PCI_FUNC(nesdev->pcidev->devfn); + nesdev->nesadapter->timer_int_req |= nesdev->timer_int_req; + nes_debug(NES_DBG_INIT, "setting int_req for function %u, nesdev = 0x%04X, adapter = 0x%04X\n", + PCI_FUNC(nesdev->pcidev->devfn), + nesdev->timer_int_req, nesdev->nesadapter->timer_int_req); + + nes_write32(nesdev->regs+NES_INTF_INT_MASK, ~(nesdev->intf_int_req)); + + list_add_tail(&nesdev->list, &nes_dev_list); + + /* Request an interrupt line for the driver */ +#ifdef IRQF_SHARED + ret = request_irq(pcidev->irq, nes_interrupt, IRQF_SHARED, DRV_NAME, nesdev); +#else + ret = request_irq(pcidev->irq, nes_interrupt, SA_SHIRQ, DRV_NAME, nesdev); +#endif + if (ret) { + printk(KERN_ERR PFX "%s: requested IRQ %u is busy\n", + pci_name(pcidev), pcidev->irq); + goto bail65; + } + + nes_write32(nesdev->regs+NES_INT_MASK, ~nesdev->int_req); + + if (!notifiers_registered) { + register_inetaddr_notifier(&nes_inetaddr_notifier); 
+ register_netevent_notifier(&nes_net_notifier); + notifiers_registered = 1; + } + + /* Initialize network devices */ + if ((netdev = nes_netdev_init(nesdev, mmio_regs)) == NULL) { + goto bail7; + } + + /* Register network device */ + ret = register_netdev(netdev); + if (ret) { + printk(KERN_ERR PFX "Unable to register netdev, ret = %d\n", ret); + nes_netdev_destroy(netdev); + goto bail7; + } + + nes_print_macaddr(netdev); + /* create a CM core for this netdev */ + nesvnic = netdev_priv(netdev); + + nesdev->netdev_count++; + nesdev->nesadapter->netdev_count++; + + + printk(KERN_ERR PFX "%s: NetEffect RNIC driver successfully loaded.\n", + pci_name(pcidev)); + return 0; + + bail7: + printk(KERN_ERR PFX "bail7\n"); + while (nesdev->netdev_count > 0) { + nesdev->netdev_count--; + nesdev->nesadapter->netdev_count--; + + unregister_netdev(nesdev->netdev[nesdev->netdev_count]); + nes_netdev_destroy(nesdev->netdev[nesdev->netdev_count]); + } + + nes_debug(NES_DBG_INIT, "netdev_count=%d, nesadapter->netdev_count=%d\n", + nesdev->netdev_count, nesdev->nesadapter->netdev_count); + + if (notifiers_registered) { + unregister_netevent_notifier(&nes_net_notifier); + unregister_inetaddr_notifier(&nes_inetaddr_notifier); + notifiers_registered = 0; + } + + list_del(&nesdev->list); + nes_destroy_cqp(nesdev); + + bail65: + printk(KERN_ERR PFX "bail65\n"); + free_irq(pcidev->irq, nesdev); +#ifdef CONFIG_PCI_MSI + if (nesdev->msi_enabled) { + pci_disable_msi(pcidev); + } +#endif + bail6: + printk(KERN_ERR PFX "bail6\n"); + tasklet_kill(&nesdev->dpc_tasklet); + /* Deallocate the Adapter Structure */ + nes_destroy_adapter(nesdev->nesadapter); + + bail5: + printk(KERN_ERR PFX "bail5\n"); + iounmap(nesdev->regs); + + bail3: + printk(KERN_ERR PFX "bail3\n"); + kfree(nesdev); + + bail2: + pci_release_regions(pcidev); + + bail1: + pci_disable_device(pcidev); + + bail0: + return ret; +} + + +/** + * nes_remove - unload from kernel + */ +static void __devexit nes_remove(struct pci_dev *pcidev) 
+{ + struct nes_device *nesdev = pci_get_drvdata(pcidev); + struct net_device *netdev; + int netdev_index=0; + + nes_debug(NES_DBG_SHUTDOWN, "called.\n"); + + if (nesdev->netdev_count) { + netdev = nesdev->netdev[netdev_index]; + if (netdev) { + netif_stop_queue(netdev); + unregister_netdev(netdev); + nes_netdev_destroy(netdev); + + nesdev->netdev[netdev_index] = NULL; + nesdev->netdev_count--; + nesdev->nesadapter->netdev_count--; + } + } + if (notifiers_registered) { + unregister_netevent_notifier(&nes_net_notifier); + unregister_inetaddr_notifier(&nes_inetaddr_notifier); + notifiers_registered = 0; + } + + list_del(&nesdev->list); + nes_destroy_cqp(nesdev); + tasklet_kill(&nesdev->dpc_tasklet); + + /* Deallocate the Adapter Structure */ + nes_destroy_adapter(nesdev->nesadapter); + + free_irq(pcidev->irq, nesdev); + +#ifdef CONFIG_PCI_MSI + if (nesdev->msi_enabled) { + pci_disable_msi(pcidev); + } +#endif + + iounmap(nesdev->regs); + kfree(nesdev); + + /* nes_debug(NES_DBG_SHUTDOWN, "calling pci_release_regions.\n"); */ + pci_release_regions(pcidev); + pci_disable_device(pcidev); + pci_set_drvdata(pcidev, NULL); +} + + +static struct pci_driver nes_pci_driver = { + .name = DRV_NAME, + .id_table = nes_pci_table, + .probe = nes_probe, + .remove = __devexit_p(nes_remove), +}; + + +/** + * nes_init_module - module initialization entry point + */ +static int __init nes_init_module(void) +{ + int retval; + retval = nes_cm_start(); + if (retval) { + printk(KERN_ERR PFX "Unable to start NetEffect iWARP CM.\n"); + return retval; + } +#ifdef OFED_1_2 + return(pci_module_init(&nes_pci_driver)); +#else + return(pci_register_driver(&nes_pci_driver)); +#endif +} + + +/** + * nes_exit_module - module unload entry point + */ +static void __exit nes_exit_module(void) +{ + nes_cm_stop(); + pci_unregister_driver(&nes_pci_driver); +} + + +module_init(nes_init_module); +module_exit(nes_exit_module); + From ggrundstrom at neteffect.com Fri Oct 19 13:04:08 2007 From: ggrundstrom at 
neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:04:08 -0500 Subject: [ofa-general] [PATCH 2/14 v2] nes: device structures and defines Message-ID: <200710192004.l9JK48dm021704@neteffect.com> Main include file for device structures and defines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes.h 2007-10-19 09:59:12.000000000 -0500 @@ -0,0 +1,613 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef __NES_H +#define __NES_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#define TBIRD +#define NES_TWO_PORT +#define NES_ENABLE_CQE_READ +#define NES_SEND_FIRST_WRITE + +#define QUEUE_DISCONNECTS + +#define DRV_BUILD "1" + +#define DRV_NAME "iw_nes" +#define DRV_VERSION "0.5 Build " DRV_BUILD +#define PFX DRV_NAME ": " + +/* + * NetEffect PCI vendor id and NE010 PCI device id. + */ +#ifndef PCI_VENDOR_ID_NETEFFECT /* not in pci.ids yet */ +#define PCI_VENDOR_ID_NETEFFECT 0x1678 +#define PCI_DEVICE_ID_NETEFFECT_NE020 0x0100 +#endif + +#define NE020_REV 4 +#define NE020_REV1 5 + +#define BAR_0 0 +#define BAR_1 2 + +#define RX_BUF_SIZE (1536 + 8) + +#define NES_REG0_SIZE (4 * 1024) +#define NES_TX_TIMEOUT (6*HZ) +#define NES_FIRST_QPN 64 +#define NES_SW_CONTEXT_ALIGN 1024 + +#define NES_NIC_MAX_NICS 16 +#define NES_MAX_ARP_TABLE_SIZE 4096 + +#define MAX_DPC_ITERATIONS 128 + +#define NES_DRV_OPT_ENABLE_MPA_VER_0 0x00000001 +#define NES_DRV_OPT_DISABLE_MPA_CRC 0x00000002 +#define NES_DRV_OPT_DISABLE_FIRST_WRITE 0x00000004 +#define NES_DRV_OPT_DISABLE_INTF 0x00000008 +#define NES_DRV_OPT_ENABLE_MSI 0x00000010 +#define NES_DRV_OPT_DUAL_LOGICAL_PORT 0x00000020 +#define NES_DRV_OPT_SUPRESS_OPTION_BC 0x00000040 +#define NES_DRV_OPT_NO_INLINE_DATA 0x00000080 + +#define NES_AEQ_EVENT_TIMEOUT 2500 +#define NES_DISCONNECT_EVENT_TIMEOUT 2000 + +/* debug levels */ +#define NES_DBG_HW 0x00000001 +#define NES_DBG_INIT 0x00000002 +#define NES_DBG_ISR 0x00000004 +#define NES_DBG_PHY 0x00000008 +#define NES_DBG_NETDEV 0x00000010 +#define NES_DBG_CM 0x00000020 +#define NES_DBG_CM1 0x00000040 +#define NES_DBG_NIC_RX 0x00000080 +#define NES_DBG_NIC_TX 0x00000100 +#define NES_DBG_CQP 0x00000200 +#define NES_DBG_MMAP 0x00000400 +#define NES_DBG_MR 0x00000800 +#define NES_DBG_PD 0x00001000 +#define NES_DBG_CQ 0x00002000 +#define NES_DBG_QP 
0x00004000 +#define NES_DBG_MOD_QP 0x00008000 +#define NES_DBG_AEQ 0x00010000 +#define NES_DBG_IW_RX 0x00020000 +#define NES_DBG_IW_TX 0x00040000 +#define NES_DBG_SHUTDOWN 0x00080000 +#define NES_DBG_RSVD1 0x10000000 +#define NES_DBG_RSVD2 0x20000000 +#define NES_DBG_RSVD3 0x40000000 +#define NES_DBG_RSVD4 0x80000000 +#define NES_DBG_ALL 0xffffffff + +#ifdef CONFIG_INFINIBAND_NES_DEBUG +#define assert(expr) \ +if(!(expr)) { \ + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n", \ + #expr, __FILE__, __FUNCTION__, __LINE__); \ +} + +#define nes_debug(level, fmt, args...) \ + if (level & nes_debug_level) \ + printk(KERN_ERR PFX "%s[%u]: " fmt, __FUNCTION__, __LINE__, ##args) + +#ifndef dprintk +#define dprintk(fmt, args...) do { printk(KERN_ERR PFX fmt, ##args); } while (0) +#endif +#define NES_EVENT_TIMEOUT 1200000 +/* #define NES_EVENT_TIMEOUT 1200 */ +#else +#define assert(expr) do {} while (0) +#define nes_debug(level, fmt, args...) +#define dprintk(fmt, args...) do {} while (0) + +#define NES_EVENT_TIMEOUT 100000 +#endif + +#include "nes_hw.h" +#include "nes_verbs.h" +#include "nes_context.h" +#include "nes_user.h" +#include "nes_cm.h" + +extern int max_mtu; +extern int nics_per_function; +#define max_frame_len (max_mtu+ETH_HLEN) +extern int interrupt_mod_interval; +extern int nes_if_count; +extern int mpa_version; +extern int disable_mpa_crc; +extern unsigned int send_first; +extern unsigned int nes_drv_opt; +extern unsigned int nes_debug_level; + +extern struct list_head nes_adapter_list; +extern struct list_head nes_dev_list; + +extern struct nes_cm_core *g_cm_core; + +extern atomic_t cm_connects; +extern atomic_t cm_accepts; +extern atomic_t cm_disconnects; +extern atomic_t cm_closes; +extern atomic_t cm_connecteds; +extern atomic_t cm_connect_reqs; +extern atomic_t cm_rejects; +extern atomic_t mod_qp_timouts; +extern atomic_t qps_created; +extern atomic_t qps_destroyed; +extern atomic_t sw_qps_destroyed; +extern u32 mh_detected; +extern u32 
mh_pauses_sent; +extern u32 cm_packets_sent; +extern u32 cm_packets_bounced; +extern u32 cm_packets_created; +extern u32 cm_packets_received; +extern u32 cm_packets_dropped; +extern u32 cm_packets_retrans; +extern u32 cm_listens_created; +extern u32 cm_listens_destroyed; +extern u32 cm_backlog_drops; +extern atomic_t cm_nodes_created; +extern atomic_t cm_nodes_destroyed; +extern atomic_t cm_accel_dropped_pkts; +extern atomic_t cm_resets_recvd; + +extern u32 crit_err_count; +extern u32 mh_detected; +extern u32 mh_pauses_sent; + +extern atomic_t cqp_reqs_allocated; +extern atomic_t cqp_reqs_freed; +extern atomic_t cqp_reqs_dynallocated; +extern atomic_t cqp_reqs_dynfreed; +extern atomic_t cqp_reqs_queued; +extern atomic_t cqp_reqs_redriven; + + +struct nes_device { + struct nes_adapter *nesadapter; + void __iomem *regs; + void __iomem *index_reg; + struct pci_dev *pcidev; + struct net_device *netdev[NES_NIC_MAX_NICS]; + u64 link_status_interrupts; + struct tasklet_struct dpc_tasklet; + spinlock_t indexed_regs_lock; + unsigned long doorbell_start; + unsigned long csr_start; + unsigned long mac_tx_errors; + unsigned long mac_pause_frames_sent; + unsigned long mac_pause_frames_received; + unsigned long mac_rx_errors; + unsigned long mac_rx_crc_errors; + unsigned long mac_rx_symbol_err_frames; + unsigned long mac_rx_jabber_frames; + unsigned long mac_rx_oversized_frames; + unsigned long mac_rx_short_frames; + unsigned int mac_index; + unsigned int nes_stack_start; + + /* Control Structures */ + void *cqp_vbase; + dma_addr_t cqp_pbase; + u32 cqp_mem_size; + u8 ceq_index; + u8 nic_ceq_index; + struct nes_hw_cqp cqp; + struct nes_hw_cq ccq; + struct list_head cqp_avail_reqs; + struct list_head cqp_pending_reqs; + struct nes_cqp_request *nes_cqp_requests; + + u32 int_req; + u32 int_stat; + u32 timer_int_req; + u32 timer_only_int_count; + u32 intf_int_req; + u32 et_rx_coalesce_usecs_irq; + u32 last_mac_tx_pauses; + u32 last_used_chunks_tx; + struct list_head list; + + u16 
base_doorbell_index; + u8 msi_enabled; + u8 netdev_count; + u8 napi_isr_ran; + u8 disable_rx_flow_control; + u8 disable_tx_flow_control; +}; + + +static inline int nes_skb_is_gso(const struct sk_buff *skb) +{ + return skb_shinfo(skb)->gso_size; +} + +#define nes_skb_linearize(_skb) skb_linearize(_skb) + + +/* Read from memory-mapped device */ +static inline u32 nes_read_indexed(struct nes_device *nesdev, u32 reg_index) +{ + unsigned long flags; + void __iomem *addr = nesdev->index_reg; + u32 value; + + spin_lock_irqsave(&nesdev->indexed_regs_lock, flags); + + writel(reg_index, addr); + value = readl((void __iomem *)addr + 4); + + spin_unlock_irqrestore(&nesdev->indexed_regs_lock, flags); + return value; +} + +static inline u32 nes_read32(const void __iomem* addr) +{ + return readl(addr); +} + +static inline u16 nes_read16(const void __iomem* addr) +{ + return readw(addr); +} + +static inline u8 nes_read8(const void __iomem* addr) +{ + return readb(addr); +} + +/* Write to memory-mapped device */ +static inline void nes_write_indexed(struct nes_device *nesdev, u32 reg_index, u32 val) +{ + unsigned long flags; + void __iomem *addr = nesdev->index_reg; + + spin_lock_irqsave(&nesdev->indexed_regs_lock, flags); + + writel(reg_index, addr); + writel(val, (void __iomem *)addr + 4); + + spin_unlock_irqrestore(&nesdev->indexed_regs_lock, flags); +} + +static inline void nes_write32(void __iomem *addr, u32 val) +{ + writel(val, addr); +} + +static inline void nes_write16(void __iomem *addr, u16 val) +{ + writew(val, addr); +} + +static inline void nes_write8(void __iomem *addr, u8 val) +{ + writeb(val, addr); +} + + + +static inline int nes_alloc_resource(struct nes_adapter *nesadapter, + unsigned long *resource_array, u32 max_resources, + u32 *req_resource_num, u32 *next) +{ + unsigned long flags; + u32 resource_num; + + spin_lock_irqsave(&nesadapter->resource_lock, flags); + + resource_num = find_next_zero_bit(resource_array, max_resources, *next); + if (resource_num >= 
max_resources) { + resource_num = find_first_zero_bit(resource_array, max_resources); + if (resource_num >= max_resources) { + printk(KERN_ERR PFX "%s: No available resources.\n", __FUNCTION__); + spin_unlock_irqrestore(&nesadapter->resource_lock, flags); + return -EMFILE; + } + } + nes_debug(NES_DBG_HW, "find_next_zero_bit returned = %u (max = %u).\n", + resource_num, max_resources); + set_bit(resource_num, resource_array); + *next = resource_num+1; + if (*next == max_resources) { + *next = 0; + } + spin_unlock_irqrestore(&nesadapter->resource_lock, flags); + *req_resource_num = resource_num; + + return 0; +} + +static inline int nes_is_resource_allocated(struct nes_adapter *nesadapter, + unsigned long *resource_array, u32 resource_num) +{ + unsigned long flags; + int bit_is_set; + + spin_lock_irqsave(&nesadapter->resource_lock, flags); + + bit_is_set = test_bit(resource_num, resource_array); + nes_debug(NES_DBG_HW, "resource_num %u is%s allocated.\n", + resource_num, (bit_is_set ? "": " not")); + spin_unlock_irqrestore(&nesadapter->resource_lock, flags); + + return bit_is_set; +} + +static inline void nes_free_resource(struct nes_adapter *nesadapter, + unsigned long *resource_array, u32 resource_num) +{ + unsigned long flags; + + spin_lock_irqsave(&nesadapter->resource_lock, flags); + clear_bit(resource_num, resource_array); + spin_unlock_irqrestore(&nesadapter->resource_lock, flags); +} + +static inline struct nes_vnic *to_nesvnic(struct ib_device *ibdev) { + return(container_of(ibdev, struct nes_ib_device, ibdev)->nesvnic); +} + +static inline struct nes_pd *to_nespd(struct ib_pd *ibpd) { + return(container_of(ibpd, struct nes_pd, ibpd)); +} + +static inline struct nes_ucontext *to_nesucontext(struct ib_ucontext *ibucontext) { + return(container_of(ibucontext, struct nes_ucontext, ibucontext)); +} + +static inline struct nes_mr *to_nesmr(struct ib_mr *ibmr) { + return(container_of(ibmr, struct nes_mr, ibmr)); +} + +static inline struct nes_mr 
*to_nesmr_from_ibfmr(struct ib_fmr *ibfmr) { + return(container_of(ibfmr, struct nes_mr, ibfmr)); +} + +static inline struct nes_mr *to_nesmw(struct ib_mw *ibmw) { + return(container_of(ibmw, struct nes_mr, ibmw)); +} + +static inline struct nes_fmr *to_nesfmr(struct nes_mr *nesmr) { + return(container_of(nesmr, struct nes_fmr, nesmr)); +} + +static inline struct nes_cq *to_nescq(struct ib_cq *ibcq) { + return(container_of(ibcq, struct nes_cq, ibcq)); +} + +static inline struct nes_qp *to_nesqp(struct ib_qp *ibqp) { + return(container_of(ibqp, struct nes_qp, ibqp)); +} + + +#define NES_CQP_REQUEST_NOT_HOLDING_LOCK 0 +#define NES_CQP_REQUEST_HOLDING_LOCK 1 +#define NES_CQP_REQUEST_NO_DOORBELL_RING 0 +#define NES_CQP_REQUEST_RING_DOORBELL 1 + +static inline struct nes_cqp_request + *nes_get_cqp_request(struct nes_device *nesdev, int holding_lock) { + unsigned long flags; + struct nes_cqp_request *cqp_request = NULL; + + if (!holding_lock) { + spin_lock_irqsave(&nesdev->cqp.lock, flags); + } + if (!list_empty(&nesdev->cqp_avail_reqs)) { + cqp_request = list_entry(nesdev->cqp_avail_reqs.next, + struct nes_cqp_request, list); + atomic_inc(&cqp_reqs_allocated); + list_del_init(&cqp_request->list); + } else if (!holding_lock) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + cqp_request = kzalloc(sizeof(struct nes_cqp_request), + GFP_KERNEL); + if (cqp_request) { + cqp_request->dynamic = 1; + INIT_LIST_HEAD(&cqp_request->list); + atomic_inc(&cqp_reqs_dynallocated); + } + spin_lock_irqsave(&nesdev->cqp.lock, flags); + } + if (!holding_lock) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + + if (cqp_request) { + init_waitqueue_head(&cqp_request->waitq); + cqp_request->waiting = 0; + cqp_request->request_done = 0; + init_waitqueue_head(&cqp_request->waitq); + nes_debug(NES_DBG_CQP, "Got cqp request %p from the available list \n", + cqp_request); + } else + printk(KERN_ERR PFX "%s: Could not allocate a CQP request.\n", + __FUNCTION__); + + return 
cqp_request; +} + +static inline void nes_post_cqp_request(struct nes_device *nesdev, + struct nes_cqp_request *cqp_request, int holding_lock, int ring_doorbell) +{ + /* caller must be holding CQP lock */ + struct nes_hw_cqp_wqe *cqp_wqe; + unsigned long flags; + u32 cqp_head; + + if (!holding_lock) { + spin_lock_irqsave(&nesdev->cqp.lock, flags); + } + + if (((((nesdev->cqp.sq_tail+(nesdev->cqp.sq_size*2))-nesdev->cqp.sq_head) & + (nesdev->cqp.sq_size - 1)) != 1) + && (list_empty(&nesdev->cqp_pending_reqs))) { + cqp_head = nesdev->cqp.sq_head++; + nesdev->cqp.sq_head &= nesdev->cqp.sq_size-1; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + memcpy(cqp_wqe, &cqp_request->cqp_wqe, sizeof(*cqp_wqe)); + barrier(); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = cpu_to_le32((u32)((u64)(cqp_request))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = cpu_to_le32((u32)(((u64)(cqp_request))>>32)); + nes_debug(NES_DBG_CQP, "CQP request (opcode 0x%02X), line 1 = 0x%08X put on CQPs SQ," + " request = %p, cqp_head = %u, cqp_tail = %u, cqp_size = %u," + " waiting = %d, refcount = %d.\n", + le32_to_cpu(cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX])&0x3f, + le32_to_cpu(cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX]), cqp_request, + nesdev->cqp.sq_head, nesdev->cqp.sq_tail, nesdev->cqp.sq_size, + cqp_request->waiting, atomic_read(&cqp_request->refcount)); + barrier(); + if (ring_doorbell) { + /* Ring doorbell (1 WQEs) */ + nes_write32(nesdev->regs+NES_WQE_ALLOC, 0x01800000 | nesdev->cqp.qp_id); + } + + barrier(); + } else { + atomic_inc(&cqp_reqs_queued); + nes_debug(NES_DBG_CQP, "CQP request %p (opcode 0x%02X), line 1 = 0x%08X" + " put on the pending queue.\n", + cqp_request, + cqp_request->cqp_wqe.wqe_words[NES_CQP_WQE_OPCODE_IDX]&0x3f, + cqp_request->cqp_wqe.wqe_words[NES_CQP_WQE_ID_IDX]); + list_add_tail(&cqp_request->list, &nesdev->cqp_pending_reqs); + } + + if (!holding_lock) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + + return; +} + + +/* Utils */ 
+#define CRC32C_POLY 0x1EDC6F41 +#define ORDER 32 +#define REFIN 1 +#define REFOUT 1 +#define NES_HASH_CRC_INITAL_VALUE 0xFFFFFFFF +#define NES_HASH_CRC_FINAL_XOR 0xFFFFFFFF + +/* nes.c */ +void nes_add_ref(struct ib_qp *); +void nes_rem_ref(struct ib_qp *); +struct ib_qp *nes_get_qp(struct ib_device *, int); + +/* nes_hw.c */ +struct nes_adapter *nes_init_adapter(struct nes_device *, u8); +unsigned int nes_reset_adapter_ne020(struct nes_device *, u8 *); +int nes_init_serdes(struct nes_device *, u8, u8, u8); +void nes_init_csr_ne020(struct nes_device *, u8, u8); +void nes_destroy_adapter(struct nes_adapter *); +int nes_init_cqp(struct nes_device *); +int nes_init_phy(struct nes_device *); +int nes_init_nic_qp(struct nes_device *, struct net_device *); +void nes_destroy_nic_qp(struct nes_vnic *); +int nes_napi_isr(struct nes_device *); +void nes_dpc(unsigned long); +void nes_process_ceq(struct nes_device *, struct nes_hw_ceq *); +void nes_process_aeq(struct nes_device *, struct nes_hw_aeq *); +void nes_process_mac_intr(struct nes_device *, u32); +void nes_nic_napi_ce_handler(struct nes_device *, struct nes_hw_nic_cq *); +void nes_nic_ce_handler(struct nes_device *, struct nes_hw_nic_cq *); +void nes_cqp_ce_handler(struct nes_device *, struct nes_hw_cq *); +void nes_process_iwarp_aeqe(struct nes_device *, struct nes_hw_aeqe *); +void nes_iwarp_ce_handler(struct nes_device *, struct nes_hw_cq *); +int nes_destroy_cqp(struct nes_device *); +int nes_nic_cm_xmit(struct sk_buff *, struct net_device *); + +/* nes_nic.c */ +void nes_netdev_exit(struct nes_vnic *); +struct net_device *nes_netdev_init(struct nes_device *, void __iomem *); +void nes_netdev_destroy(struct net_device *); +int nes_nic_cm_xmit(struct sk_buff *, struct net_device *); + +/* nes_cm.c */ +void *nes_cm_create(struct net_device *); +int nes_cm_recv(struct sk_buff *, struct net_device *); +void nes_update_arp(unsigned char *, u32, u32, u16, u16); +void nes_manage_arp_cache(struct net_device *, unsigned 
char *, u32, u32); +void nes_sock_release(struct nes_qp *, unsigned long *); +struct nes_cm_core *nes_cm_alloc_core(void); +void nes_disconnect_worker(void *); +void flush_wqes(struct nes_device *nesdev, struct nes_qp *, u32, u32); +int nes_manage_apbvt(struct nes_vnic *, u32, u32, u32); + +int nes_cm_disconn(struct nes_qp *); +void nes_cm_disconn_worker(void *); + +/* nes_verbs.c */ +int nes_hw_modify_qp(struct nes_device *, struct nes_qp *, u32, u32); +int nes_modify_qp(struct ib_qp *, struct ib_qp_attr *, int, struct ib_udata *); +struct nes_ib_device *nes_init_ofa_device(struct net_device *); +void nes_destroy_ofa_device(struct nes_ib_device *); +int nes_register_ofa_device(struct nes_ib_device *); +void nes_unregister_ofa_device(struct nes_ib_device *); + +/* nes_util.c */ +int nes_read_eeprom_values(struct nes_device *, struct nes_adapter *); +void nes_write_1G_phy_reg(struct nes_device *, u8, u8, u16); +void nes_read_1G_phy_reg(struct nes_device *, u8, u8, u16 *); +void nes_write_10G_phy_reg(struct nes_device *, u16, u8, u16); +void nes_read_10G_phy_reg(struct nes_device *, u16, u8); +int nes_arp_table(struct nes_device *, u32, u8 *, u32); +void nes_mh_fix(unsigned long); +void nes_dump_mem(unsigned int, void *, int); +u32 nes_crc32(u32, u32, u32, u32, u8 *, u32, u32, u32); + +#endif /* __NES_H */ From ggrundstrom at neteffect.com Fri Oct 19 13:06:56 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:06:56 -0500 Subject: [ofa-general] [PATCH 3/14 v2] nes: connection manager routines Message-ID: <200710192006.l9JK6ur0021720@neteffect.com> NetEffect connection manager routines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_cm.c 2007-10-19 09:43:32.000000000 -0500 @@ -0,0 +1,3055 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + + +#define TCPOPT_TIMESTAMP 8 + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "nes.h" + +u32 cm_packets_sent; +u32 cm_packets_bounced; +u32 cm_packets_dropped; +u32 cm_packets_retrans; +u32 cm_packets_created; +u32 cm_packets_received; +u32 cm_listens_created; +u32 cm_listens_destroyed; +u32 cm_backlog_drops; +atomic_t cm_nodes_created; +atomic_t cm_nodes_destroyed; +atomic_t cm_accel_dropped_pkts; +atomic_t cm_resets_recvd; + +static inline int mini_cm_accelerated(struct nes_cm_core *, struct nes_cm_node *); +static struct nes_cm_listener *mini_cm_listen(struct nes_cm_core *, + struct nes_vnic *, struct nes_cm_info *); +static int add_ref_cm_node(struct nes_cm_node *); +static int rem_ref_cm_node(struct nes_cm_core *, struct nes_cm_node *); +static int mini_cm_del_listen(struct nes_cm_core *, struct nes_cm_listener *); + + +/* External CM API Interface */ +/* instance of function pointers for client API */ +/* set address of this instance to cm_core->cm_ops at cm_core alloc */ +static struct nes_cm_ops nes_cm_api = { + mini_cm_accelerated, + mini_cm_listen, + mini_cm_del_listen, + mini_cm_connect, + mini_cm_close, + mini_cm_accept, + mini_cm_reject, + mini_cm_recv_pkt, + mini_cm_dealloc_core, + mini_cm_get, + mini_cm_set +}; + +struct nes_cm_core *g_cm_core; + +atomic_t cm_connects; +atomic_t cm_accepts; +atomic_t cm_disconnects; +atomic_t cm_closes; +atomic_t cm_connecteds; +atomic_t cm_connect_reqs; +atomic_t cm_rejects; + + +/** + * create_event + */ +static struct nes_cm_event *create_event(struct nes_cm_node *cm_node, + enum nes_cm_event_type type) +{ + struct nes_cm_event *event; + + if (!cm_node->cm_id) + return NULL; + + /* allocate an empty event */ + event = (struct nes_cm_event *)kzalloc(sizeof(*event), GFP_ATOMIC); + + if (!event) + return NULL; + + event->type = type; + 
event->cm_node = cm_node; + event->cm_info.rem_addr = cm_node->rem_addr; + event->cm_info.loc_addr = cm_node->loc_addr; + event->cm_info.rem_port = cm_node->rem_port; + event->cm_info.loc_port = cm_node->loc_port; + event->cm_info.cm_id = cm_node->cm_id; + + nes_debug(NES_DBG_CM, "Created event=%p, type=%u, dst_addr=%08x[%x]," + " src_addr=%08x[%x]\n", + event, type, + event->cm_info.loc_addr, event->cm_info.loc_port, + event->cm_info.rem_addr, event->cm_info.rem_port); + + nes_cm_post_event(event); + return event; +} + + +/** + * send_mpa_request + */ +int send_mpa_request(struct nes_cm_node *cm_node) +{ + struct sk_buff *skb; + int ret; + + skb = get_free_pkt(cm_node); + if (!skb) { + nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n"); + return -1; + } + + /* send an MPA Request frame */ + form_cm_frame(skb, cm_node, NULL, 0, &cm_node->mpa_frame, + cm_node->mpa_frame_size, SET_ACK); + + ret = schedule_nes_timer(cm_node, skb, NES_TIMER_TYPE_SEND, 1, 0); + if (ret < 0) { + return ret; + } + + return 0; +} + + +/** + * parse_mpa - process a received TCP pkt, we are expecting an + * IETF MPA frame + */ +static int parse_mpa(struct nes_cm_node *cm_node, u8 *buffer, u32 len) +{ + struct ietf_mpa_frame *mpa_frame; + + /* assume req frame is in tcp data payload */ + if (len < sizeof(struct ietf_mpa_frame)) { + nes_debug(NES_DBG_CM, "The received ietf buffer was too small (%x)\n", len); + return -1; + } + + mpa_frame = (struct ietf_mpa_frame *)buffer; + cm_node->mpa_frame_size = (u32)ntohs(mpa_frame->priv_data_len); + + if (cm_node->mpa_frame_size + sizeof(struct ietf_mpa_frame) != len) { + nes_debug(NES_DBG_CM, "The received ietf buffer was not right" + " complete (%x + %x != %x)\n", + cm_node->mpa_frame_size, (u32)sizeof(struct ietf_mpa_frame), len); + return -1; + } + + /* copy entire MPA frame to our cm_node's frame */ + memcpy(cm_node->mpa_frame_buf, buffer + sizeof(struct ietf_mpa_frame), + cm_node->mpa_frame_size); + + return 0; +} + + +/** + * handle_exception_pkt 
- process an exception packet. + * We have been in a TSA state, and we have now received SW + * TCP/IP traffic should be a FIN request or IP pkt with options + */ +static int handle_exception_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb) +{ + int ret = 0; +#ifdef OFED_1_2 + struct tcphdr *tcph = skb->h.th; +#else + struct tcphdr *tcph = tcp_hdr(skb); +#endif + + /* first check to see if this a FIN pkt */ + if (tcph->fin) { + /* we need to ACK the FIN request */ + send_ack(cm_node); + + /* check which side we are (client/server) and set next state accordingly */ + if (cm_node->tcp_cntxt.client) + cm_node->state = NES_CM_STATE_CLOSING; + else { + /* we are the server side */ + cm_node->state = NES_CM_STATE_CLOSE_WAIT; + /* since this is a self contained CM we don't wait for */ + /* an APP to close us, just send final FIN immediately */ + ret = send_fin(cm_node, NULL); + cm_node->state = NES_CM_STATE_LAST_ACK; + } + } else { + ret = -EINVAL; + } + + return ret; +} + + +/** + * form_cm_frame - get a free packet and build empty frame Use + * node info to build. 
+ */ +struct sk_buff *form_cm_frame(struct sk_buff *skb, struct nes_cm_node *cm_node, + void *options, u32 optionsize, void *data, u32 datasize, u8 flags) +{ + struct tcphdr *tcph; + struct iphdr *iph; + struct ethhdr *ethh; + u8 *buf; + u16 packetsize = sizeof(*iph); + + packetsize += sizeof(*tcph); + packetsize += optionsize + datasize; + + memset(skb->data, 0x00, ETH_HLEN + sizeof(*iph) + sizeof(*tcph)); + + skb->len = 0; + buf = skb_put(skb, packetsize + ETH_HLEN); + + ethh = (struct ethhdr *) buf; + buf += ETH_HLEN; + +#ifdef OFED_1_2 + iph = skb->nh.iph = (struct iphdr *)buf; + buf += sizeof(*iph); + tcph = skb->h.th = (struct tcphdr *)buf; + skb->mac.raw = skb->data; +#else + iph = (struct iphdr *)buf; + buf += sizeof(*iph); + tcph = (struct tcphdr *)buf; + skb_reset_mac_header(skb); + skb_set_network_header(skb, ETH_HLEN); + skb_set_transport_header(skb, ETH_HLEN+sizeof(*iph)); +#endif + buf += sizeof(*tcph); + + skb->ip_summed = CHECKSUM_PARTIAL; + skb->protocol = ntohs(0x800); + skb->data_len = 0; + skb->mac_len = ETH_HLEN; + + memcpy(ethh->h_dest, cm_node->rem_mac, ETH_ALEN); + memcpy(ethh->h_source, cm_node->loc_mac, ETH_ALEN); + ethh->h_proto = htons(0x0800); + + iph->version = IPVERSION; + iph->ihl = 5; /* 5 * 4Byte words, IP headr len */ + iph->tos = 0; + iph->tot_len = htons(packetsize); + iph->id = htons(++cm_node->tcp_cntxt.loc_id); + + iph->frag_off = ntohs(0x4000); + iph->ttl = 0x40; + iph->protocol= 0x06; /* IPPROTO_TCP */ + + iph->saddr = htonl(cm_node->loc_addr); + iph->daddr = htonl(cm_node->rem_addr); + + tcph->source = htons(cm_node->loc_port); + tcph->dest = htons(cm_node->rem_port); + tcph->seq = htonl(cm_node->tcp_cntxt.loc_seq_num); + + if (flags & SET_ACK) { + cm_node->tcp_cntxt.loc_ack_num = cm_node->tcp_cntxt.rcv_nxt; + tcph->ack_seq = htonl(cm_node->tcp_cntxt.loc_ack_num); + tcph->ack = 1; + } else + tcph->ack_seq = 0; + + if (flags & SET_SYN) { + cm_node->tcp_cntxt.loc_seq_num ++; + tcph->syn = 1; + } else + 
cm_node->tcp_cntxt.loc_seq_num += datasize; /* data (no headers) */ + + if (flags & SET_FIN) + tcph->fin = 1; + + if (flags & SET_RST) + tcph->rst = 1; + + tcph->doff = (u16)((sizeof(*tcph) + optionsize + 3)>> 2); + tcph->window = htons(cm_node->tcp_cntxt.rcv_wnd); + tcph->urg_ptr = 0; + if (optionsize) + memcpy(buf, options, optionsize); + buf += optionsize; + if (datasize) + memcpy(buf, data, datasize); + + skb_shinfo(skb)->nr_frags = 0; + cm_packets_created++; + + return skb; +} + + +/** + * print_core - dump a cm core + */ +static void print_core(struct nes_cm_core *core) +{ + nes_debug(NES_DBG_CM, "---------------------------------------------\n"); + nes_debug(NES_DBG_CM, "CM Core -- (core = %p )\n", core); + if (!core) + return; + nes_debug(NES_DBG_CM, "---------------------------------------------\n"); + nes_debug(NES_DBG_CM, "Session ID : %u \n", atomic_read(&core->session_id)); + + nes_debug(NES_DBG_CM, "State : %u \n", core->state); + + nes_debug(NES_DBG_CM, "Tx Free cnt : %u \n", skb_queue_len(&core->tx_free_list)); + nes_debug(NES_DBG_CM, "Listen Nodes : %u \n", atomic_read(&core->listen_node_cnt)); + nes_debug(NES_DBG_CM, "Active Nodes : %u \n", atomic_read(&core->node_cnt)); + + nes_debug(NES_DBG_CM, "core : %p \n", core); + + nes_debug(NES_DBG_CM, "-------------- end core ---------------\n"); +} + + +/** + * schedule_nes_timer + * note - cm_node needs to be protected before calling this. 
Encase in: + * rem_ref_cm_node(cm_core, cm_node);add_ref_cm_node(cm_node); + */ +int schedule_nes_timer(struct nes_cm_node *cm_node, struct sk_buff *skb, + enum nes_timer_type type, int send_retrans, + int close_when_complete) +{ + unsigned long flags; + struct nes_cm_core *cm_core; + struct nes_timer_entry *new_send; + int ret = 0; + u32 was_timer_set; + + if (!cm_node) + return -EINVAL; + new_send = kzalloc(sizeof(*new_send), GFP_ATOMIC); + if (!new_send) + return -1; + + /* new_send->timetosend = currenttime */ + new_send->retrycount = NES_DEFAULT_RETRYS; + new_send->retranscount = NES_DEFAULT_RETRANS; + new_send->skb = skb; + new_send->timetosend = jiffies; + new_send->type = type; + new_send->netdev = cm_node->netdev; + new_send->send_retrans = send_retrans; + new_send->close_when_complete = close_when_complete; + + if (type == NES_TIMER_TYPE_CLOSE) { + new_send->timetosend += (HZ/2); /* TODO: decide on the correct value here */ + spin_lock_irqsave(&cm_node->recv_list_lock, flags); + list_add_tail(&new_send->list, &cm_node->recv_list); + spin_unlock_irqrestore(&cm_node->recv_list_lock, flags); + } + + if (type == NES_TIMER_TYPE_SEND) { +#ifdef OFED_1_2 + new_send->seq_num = htonl(skb->h.th->seq); +#else + new_send->seq_num = htonl(tcp_hdr(skb)->seq); +#endif + atomic_inc(&new_send->skb->users); + + ret = nes_nic_cm_xmit(new_send->skb, cm_node->netdev); + if (ret != NETDEV_TX_OK) { + nes_debug(NES_DBG_CM, "Error sending packet %p (jiffies = %lu)\n", + new_send, jiffies); + atomic_dec(&new_send->skb->users); + new_send->timetosend = jiffies; + } else { + cm_packets_sent++; + if (!send_retrans) { + if (close_when_complete) + rem_ref_cm_node(cm_node->cm_core, cm_node); + dev_kfree_skb_any(new_send->skb); + kfree(new_send); + return ret; + } + new_send->timetosend = jiffies + NES_RETRY_TIMEOUT; + } + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + list_add_tail(&new_send->list, &cm_node->retrans_list); + spin_unlock_irqrestore(&cm_node->retrans_list_lock, 
flags); + } + if (type == NES_TIMER_TYPE_RECV) { +#ifdef OFED_1_2 + new_send->seq_num = htonl(skb->h.th->seq); +#else + new_send->seq_num = htonl(tcp_hdr(skb)->seq); +#endif + new_send->timetosend = jiffies; + spin_lock_irqsave(&cm_node->recv_list_lock, flags); + list_add_tail(&new_send->list, &cm_node->recv_list); + spin_unlock_irqrestore(&cm_node->recv_list_lock, flags); + } + cm_core = cm_node->cm_core; + + was_timer_set = timer_pending(&cm_core->tcp_timer); + + if (!was_timer_set) { + cm_core->tcp_timer.expires = new_send->timetosend; + add_timer(&cm_core->tcp_timer); + } + + return ret; +} + + +/** + * nes_cm_timer_tick + */ +void nes_cm_timer_tick(unsigned long pass) +{ + unsigned long flags, qplockflags; + unsigned long nexttimeout = jiffies + NES_LONG_TIME; + struct iw_cm_id *cm_id; + struct nes_cm_node *cm_node; + struct nes_timer_entry *send_entry, *recv_entry; + struct list_head *list_core, *list_core_temp; + struct list_head *list_node, *list_node_temp; + struct nes_cm_core *cm_core = g_cm_core; + struct nes_qp *nesqp; + struct sk_buff *skb; + u32 settimer = 0; + int ret = NETDEV_TX_OK; + int node_done; + + spin_lock_irqsave(&cm_core->ht_lock, flags); + + list_for_each_safe(list_node, list_core_temp, &cm_core->connected_nodes) { + cm_node = container_of(list_node, struct nes_cm_node, list); + add_ref_cm_node(cm_node); + spin_unlock_irqrestore(&cm_core->ht_lock, flags); + spin_lock_irqsave(&cm_node->recv_list_lock, flags); + list_for_each_safe(list_core, list_node_temp, &cm_node->recv_list) { + recv_entry = container_of(list_core, struct nes_timer_entry, list); + if ((time_after(recv_entry->timetosend, jiffies)) && + (recv_entry->type == NES_TIMER_TYPE_CLOSE)) { + if (nexttimeout > recv_entry->timetosend || !settimer) { + nexttimeout = recv_entry->timetosend; + settimer = 1; + } + continue; + } + list_del(&recv_entry->list); + cm_id = cm_node->cm_id; + spin_unlock_irqrestore(&cm_node->recv_list_lock, flags); + if (recv_entry->type == 
NES_TIMER_TYPE_CLOSE) { + nesqp = (struct nes_qp *)recv_entry->skb; + spin_lock_irqsave(&nesqp->lock, qplockflags); + if (nesqp->cm_id) { + nes_debug(NES_DBG_CM, "QP%u: cm_id = %p, refcount = %d: " + "****** HIT A NES_TIMER_TYPE_CLOSE" + " with something to do!!! ******\n", + nesqp->hwqp.qp_id, cm_id, + atomic_read(&nesqp->refcount)); + nesqp->hw_tcp_state = NES_AEQE_TCP_STATE_CLOSED; + nesqp->last_aeq = NES_AEQE_AEID_RESET_SENT; + nesqp->ibqp_state = IB_QPS_ERR; + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_cm_disconn(nesqp); + } else { + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_debug(NES_DBG_CM, "QP%u: cm_id = %p, refcount = %d:" + " ****** HIT A NES_TIMER_TYPE_CLOSE" + " with nothing to do!!! ******\n", + nesqp->hwqp.qp_id, cm_id, + atomic_read(&nesqp->refcount)); + nes_rem_ref(&nesqp->ibqp); + } + if (cm_id){ + cm_id->rem_ref(cm_id); + } + } + kfree(recv_entry); + spin_lock_irqsave(&cm_node->recv_list_lock, flags); + } + spin_unlock_irqrestore(&cm_node->recv_list_lock, flags); + + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + node_done = 0; + list_for_each_safe(list_core, list_node_temp, &cm_node->retrans_list) { + if (node_done) { + break; + } + send_entry = container_of(list_core, struct nes_timer_entry, list); + if (time_after(send_entry->timetosend, jiffies)) { + if (cm_node->state != NES_CM_STATE_TSA) { + if ((nexttimeout > send_entry->timetosend) || !settimer) { + nexttimeout = send_entry->timetosend; + settimer = 1; + } + node_done = 1; + continue; + } else { + list_del(&send_entry->list); + skb = send_entry->skb; + spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + dev_kfree_skb_any(skb); + kfree(send_entry); + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + continue; + } + } + if (send_entry->type == NES_TIMER_NODE_CLEANUP) { + list_del(&send_entry->list); + spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + kfree(send_entry); + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); 
+ continue; + } + if ((send_entry->seq_num < cm_node->tcp_cntxt.rem_ack_num ) || + (cm_node->state == NES_CM_STATE_TSA) || + (cm_node->state == NES_CM_STATE_CLOSED)) { + skb = send_entry->skb; + list_del(&send_entry->list); + spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + kfree(send_entry); + dev_kfree_skb_any(skb); + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + continue; + } + + if (!send_entry->retranscount || !send_entry->retrycount) { + cm_packets_dropped++; + skb = send_entry->skb; + list_del(&send_entry->list); + spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + dev_kfree_skb_any(skb); + kfree(send_entry); + if (cm_node->state == NES_CM_STATE_SYN_RCVD) { + /* this node never even generated an indication up to the cm */ + rem_ref_cm_node(cm_core, cm_node); + } else { + cm_node->state = NES_CM_STATE_CLOSED; + create_event(cm_node, NES_CM_EVENT_ABORTED); + } + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + continue; + } + /* this seems like the correct place, but leave send entry unprotected */ + // spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + atomic_inc(&send_entry->skb->users); + cm_packets_retrans++; + nes_debug(NES_DBG_CM, "Retransmitting send_entry %p for node %p," + " jiffies = %lu, time to send = %lu, retranscount = %u, " + "send_entry->seq_num = 0x%08X, cm_node->tcp_cntxt.rem_ack_num = 0x%08X\n", + send_entry, cm_node, jiffies, send_entry->timetosend, send_entry->retranscount, + send_entry->seq_num, cm_node->tcp_cntxt.rem_ack_num); + + spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + ret = nes_nic_cm_xmit(send_entry->skb, cm_node->netdev); + if (ret != NETDEV_TX_OK) { + cm_packets_bounced++; + atomic_dec(&send_entry->skb->users); + send_entry->retrycount--; + nexttimeout = jiffies + NES_SHORT_TIME; + settimer = 1; + node_done = 1; + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + continue; + } else { + cm_packets_sent++; + } + 
spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + list_del(&send_entry->list); + nes_debug(NES_DBG_CM, "Packet Sent: retrans count = %u, retry count = %u.\n", + send_entry->retranscount, send_entry->retrycount); + if (send_entry->send_retrans) { + send_entry->retranscount--; + send_entry->timetosend = jiffies + NES_RETRY_TIMEOUT; + if (nexttimeout > send_entry->timetosend || !settimer) { + nexttimeout = send_entry->timetosend; + settimer = 1; + } + list_add(&send_entry->list, &cm_node->retrans_list); + continue; + } else { + int close_when_complete; + skb = send_entry->skb; + close_when_complete = send_entry->close_when_complete; + spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + if (close_when_complete) { + BUG_ON(atomic_read(&cm_node->ref_count) == 1); + rem_ref_cm_node(cm_core, cm_node); + } + dev_kfree_skb_any(skb); + kfree(send_entry); + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + continue; + } + } + spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + + rem_ref_cm_node(cm_core, cm_node); + + spin_lock_irqsave(&cm_core->ht_lock, flags); + if (ret != NETDEV_TX_OK) + break; + } + spin_unlock_irqrestore(&cm_core->ht_lock, flags); + + if (settimer) { + if (!timer_pending(&cm_core->tcp_timer)) { + cm_core->tcp_timer.expires = nexttimeout; + add_timer(&cm_core->tcp_timer); + } + } +} + + +/** + * send_syn + */ +int send_syn(struct nes_cm_node *cm_node, u32 sendack) +{ + int ret; + int flags = SET_SYN; + struct sk_buff *skb; + char optionsbuffer[sizeof(struct option_mss) + + sizeof(struct option_windowscale) + + sizeof(struct option_base) + 1]; + + int optionssize = 0; + /* Sending MSS option */ + union all_known_options *options; + + if (!cm_node) + return -EINVAL; + + options = (union all_known_options *)&optionsbuffer[optionssize]; + options->as_mss.optionnum = OPTION_NUMBER_MSS; + options->as_mss.length = sizeof(struct option_mss); + options->as_mss.mss = htons(cm_node->tcp_cntxt.mss); + optionssize += sizeof(struct 
option_mss); + + options = (union all_known_options *)&optionsbuffer[optionssize]; + options->as_windowscale.optionnum = OPTION_NUMBER_WINDOW_SCALE; + options->as_windowscale.length = sizeof(struct option_windowscale); + options->as_windowscale.shiftcount = NES_CM_DEFAULT_RCV_WND_SCALE; + optionssize += sizeof(struct option_windowscale); + + if (sendack && !(NES_DRV_OPT_SUPRESS_OPTION_BC & nes_drv_opt) + ) { + options = (union all_known_options *)&optionsbuffer[optionssize]; + options->as_base.optionnum = OPTION_NUMBER_WRITE0; + options->as_base.length = sizeof(struct option_base); + optionssize += sizeof(struct option_base); + /* we need the size to be a multiple of 4 */ + options = (union all_known_options *)&optionsbuffer[optionssize]; + options->as_end = 1; + optionssize += 1; + options = (union all_known_options *)&optionsbuffer[optionssize]; + options->as_end = 1; + optionssize += 1; + } + + options = (union all_known_options *)&optionsbuffer[optionssize]; + options->as_end = OPTION_NUMBER_END; + optionssize += 1; + + skb = get_free_pkt(cm_node); + if (!skb) { + nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n"); + return -1; + } + + if (sendack) + flags |= SET_ACK; + + form_cm_frame(skb, cm_node, optionsbuffer, optionssize, NULL, 0, flags); + ret = schedule_nes_timer(cm_node, skb, NES_TIMER_TYPE_SEND, 1, 0); + + return ret; +} + + +/** + * send_reset + */ +int send_reset(struct nes_cm_node *cm_node) +{ + int ret; + struct sk_buff *skb = get_free_pkt(cm_node); + + if (!skb) { + nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n"); + return -1; + } + + add_ref_cm_node(cm_node); + form_cm_frame(skb, cm_node, NULL, 0, NULL, 0, SET_RST | SET_ACK); + ret = schedule_nes_timer(cm_node, skb, NES_TIMER_TYPE_SEND, 0, 1); + + return ret; +} + + +/** + * send_ack + */ +int send_ack(struct nes_cm_node *cm_node) +{ + int ret; + struct sk_buff *skb = get_free_pkt(cm_node); + + if (!skb) { + nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n"); + return -1; + } + + 
form_cm_frame(skb, cm_node, NULL, 0, NULL, 0, SET_ACK); + ret = schedule_nes_timer(cm_node, skb, NES_TIMER_TYPE_SEND, 0, 0); + + return ret; +} + + +/** + * send_fin + */ +int send_fin(struct nes_cm_node *cm_node, struct sk_buff *skb) +{ + int ret; + + /* if we didn't get a frame get one */ + if (!skb) + skb = get_free_pkt(cm_node); + + if (!skb) { + nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n"); + return -1; + } + + form_cm_frame(skb, cm_node, NULL, 0, NULL, 0, SET_ACK | SET_FIN); + ret = schedule_nes_timer(cm_node, skb, NES_TIMER_TYPE_SEND, 1, 0); + + return ret; +} + + +/** + * get_free_pkt + */ +struct sk_buff *get_free_pkt(struct nes_cm_node *cm_node) +{ + struct sk_buff *skb, *new_skb; + + /* check to see if we need to repopulate the free tx pkt queue */ + if (skb_queue_len(&cm_node->cm_core->tx_free_list) < NES_CM_FREE_PKT_LO_WATERMARK) { + while (skb_queue_len(&cm_node->cm_core->tx_free_list) < + cm_node->cm_core->free_tx_pkt_max) { + /* replace the frame we took, we won't get it back */ + new_skb = dev_alloc_skb(cm_node->cm_core->mtu); + BUG_ON(!new_skb); + /* add a replacement frame to the free tx list head */ + skb_queue_head(&cm_node->cm_core->tx_free_list, new_skb); + } + } + + skb = skb_dequeue(&cm_node->cm_core->tx_free_list); + + return skb; +} + + +/** + * make_hashkey - generate hash key from node tuple + */ +static inline int make_hashkey(u16 loc_port, nes_addr_t loc_addr, u16 rem_port, + nes_addr_t rem_addr) +{ + u32 hashkey = 0; + + hashkey = loc_addr + rem_addr + loc_port + rem_port; + hashkey = (hashkey % NES_CM_HASHTABLE_SIZE); + + return hashkey; +} + + +/** + * find_node - find a cm node that matches the reference cm node + */ +static struct nes_cm_node *find_node(struct nes_cm_core *cm_core, + u16 rem_port, nes_addr_t rem_addr, u16 loc_port, nes_addr_t loc_addr) +{ + unsigned long flags; + u32 hashkey; + struct list_head *list_pos; + struct list_head *hte; + struct nes_cm_node *cm_node; + + /* make a hash index key for this packet 
*/ + hashkey = make_hashkey(loc_port, loc_addr, rem_port, rem_addr); + + /* get a handle on the hte */ + hte = &cm_core->connected_nodes; + + nes_debug(NES_DBG_CM, "Searching for an owner node:%x:%x from core %p->%p\n", + loc_addr, loc_port, cm_core, hte); + + /* walk list and find cm_node associated with this session ID */ + spin_lock_irqsave(&cm_core->ht_lock, flags); + list_for_each(list_pos, hte) { + cm_node = container_of(list_pos, struct nes_cm_node, list); + /* compare quad, return node handle if a match */ + nes_debug(NES_DBG_CM, "finding node %x:%x =? %x:%x ^ %x:%x =? %x:%x\n", + cm_node->loc_addr, cm_node->loc_port, + loc_addr, loc_port, + cm_node->rem_addr, cm_node->rem_port, + rem_addr, rem_port); + if ((cm_node->loc_addr == loc_addr) && (cm_node->loc_port == loc_port) && + (cm_node->rem_addr == rem_addr) && (cm_node->rem_port == rem_port)) { + add_ref_cm_node(cm_node); + spin_unlock_irqrestore(&cm_core->ht_lock, flags); + return cm_node; + } + } + spin_unlock_irqrestore(&cm_core->ht_lock, flags); + + /* no owner node */ + return NULL; +} + + +/** + * find_listener - find a cm node listening on this addr-port pair + */ +static struct nes_cm_listener * find_listener(struct nes_cm_core *cm_core, + nes_addr_t dst_addr, u16 dst_port, enum nes_cm_listener_state listener_state) +{ + unsigned long flags; + struct list_head *listen_list; + struct nes_cm_listener *listen_node; + + /* walk list and find cm_node associated with this session ID */ + spin_lock_irqsave(&cm_core->listen_list_lock, flags); + list_for_each(listen_list, &cm_core->listen_list.list) { + listen_node = container_of(listen_list, struct nes_cm_listener, list); + /* compare node pair, return node handle if a match */ + if (((listen_node->loc_addr == dst_addr) || + listen_node->loc_addr == 0x00000000) && + (listen_node->loc_port == dst_port) && + (listener_state & listen_node->listener_state)) { + atomic_inc(&listen_node->ref_count); + spin_unlock_irqrestore(&cm_core->listen_list_lock, flags); + 
return listen_node; + } + } + spin_unlock_irqrestore(&cm_core->listen_list_lock, flags); + + nes_debug(NES_DBG_CM, "Unable to find listener- %x:%x\n", + dst_addr, dst_port); + + /* no listener */ + return NULL; +} + + +/** + * add_hte_node - add a cm node to the hash table + */ +static int add_hte_node(struct nes_cm_core *cm_core, struct nes_cm_node *cm_node) +{ + unsigned long flags; + u32 hashkey; + struct list_head *hte; + + if (!cm_node || !cm_core) + return -EINVAL; + + nes_debug(NES_DBG_CM, "Adding Node to Active Connection HT\n"); + + /* first, make an index into our hash table */ + hashkey = make_hashkey(cm_node->loc_port, cm_node->loc_addr, + cm_node->rem_port, cm_node->rem_addr); + cm_node->hashkey = hashkey; + + spin_lock_irqsave(&cm_core->ht_lock, flags); + + /* get a handle on the hash table element (list head for this slot) */ + hte = &cm_core->connected_nodes; + list_add_tail(&cm_node->list, hte); + atomic_inc(&cm_core->ht_node_cnt); + + spin_unlock_irqrestore(&cm_core->ht_lock, flags); + + return 0; +} + + +/** + * mini_cm_dec_refcnt_listen + */ +static int mini_cm_dec_refcnt_listen(struct nes_cm_core *cm_core, + struct nes_cm_listener *listener) +{ + int ret = 1; + unsigned long flags; + + spin_lock_irqsave(&cm_core->listen_list_lock, flags); + if (!atomic_dec_return(&listener->ref_count)) { + list_del(&listener->list); + + /* decrement our listen node count */ + atomic_dec(&cm_core->listen_node_cnt); + + spin_unlock_irqrestore(&cm_core->listen_list_lock, flags); + + if (listener->nesvnic) { + nes_manage_apbvt(listener->nesvnic, listener->loc_port, + PCI_FUNC(listener->nesvnic->nesdev->pcidev->devfn), NES_MANAGE_APBVT_DEL); + } + + nes_debug(NES_DBG_CM, "destroying listener (%p)\n", listener); + + kfree(listener); + ret = 0; + cm_listens_destroyed++; + } else { + spin_unlock_irqrestore(&cm_core->listen_list_lock, flags); + } + if (listener) { + if (atomic_read(&listener->pend_accepts_cnt) > 0) + nes_debug(NES_DBG_CM, "destroying listener (%p)" + " 
with non-zero pending accepts=%u\n", + listener, atomic_read(&listener->pend_accepts_cnt)); + } + + return ret; +} + + +/** + * mini_cm_del_listen + */ +static int mini_cm_del_listen(struct nes_cm_core *cm_core, + struct nes_cm_listener *listener) +{ + listener->listener_state = NES_CM_LISTENER_PASSIVE_STATE; + listener->cm_id = NULL; /* going to be destroyed pretty soon */ + return mini_cm_dec_refcnt_listen(cm_core, listener); +} + + +/** + * mini_cm_accelerated + */ +static inline int mini_cm_accelerated(struct nes_cm_core *cm_core, + struct nes_cm_node *cm_node) +{ + u32 was_timer_set; + cm_node->accelerated = 1; + + if (cm_node->accept_pend) { + BUG_ON(!cm_node->listener); + atomic_dec( &cm_node->listener->pend_accepts_cnt ); + BUG_ON(atomic_read(&cm_node->listener->pend_accepts_cnt) < 0 ); + } + + was_timer_set = timer_pending(&cm_core->tcp_timer); + if (!was_timer_set) { + cm_core->tcp_timer.expires = jiffies + NES_SHORT_TIME; + add_timer(&cm_core->tcp_timer); + } + + return 0; +} + + +/** + * nes_addr_send_arp + */ +static void nes_addr_send_arp(u32 dst_ip) +{ + struct rtable *rt; + struct flowi fl; + + memset(&fl, 0, sizeof fl); + fl.nl_u.ip4_u.daddr = htonl(dst_ip); + if (ip_route_output_key(&rt, &fl)) { + printk("%s: ip_route_output_key failed for 0x%08X\n", + __FUNCTION__, dst_ip); + return; + } + + neigh_event_send(rt->u.dst.neighbour, NULL); + ip_rt_put(rt); +} + + +/** + * make_cm_node - create a new instance of a cm node + */ +static struct nes_cm_node *make_cm_node(struct nes_cm_core *cm_core, + struct nes_vnic *nesvnic, struct nes_cm_info *cm_info, + struct nes_cm_listener *listener) +{ + struct nes_cm_node *cm_node; + struct timespec ts; + int arpindex = 0; + struct nes_device *nesdev; + struct nes_adapter *nesadapter; + + /* create an hte and cm_node for this instance */ + cm_node = (struct nes_cm_node *)kzalloc(sizeof(*cm_node), GFP_ATOMIC); + if (!cm_node) + return NULL; + + memset(cm_node, 0, sizeof(struct nes_cm_node)); + /* set our node 
specific transport info */ + cm_node->loc_addr = cm_info->loc_addr; + cm_node->rem_addr = cm_info->rem_addr; + cm_node->loc_port = cm_info->loc_port; + cm_node->rem_port = cm_info->rem_port; + cm_node->send_write0 = send_first; + nes_debug(NES_DBG_CM, "Make node addresses : loc = %x:%x, rem = %x:%x\n", + cm_node->loc_addr, cm_node->loc_port, cm_node->rem_addr, cm_node->rem_port); + cm_node->listener = listener; + cm_node->netdev = nesvnic->netdev; + cm_node->cm_id = cm_info->cm_id; + memcpy(cm_node->loc_mac, nesvnic->netdev->dev_addr, ETH_ALEN); + + INIT_LIST_HEAD(&cm_node->retrans_list); + spin_lock_init(&cm_node->retrans_list_lock); + INIT_LIST_HEAD(&cm_node->recv_list); + spin_lock_init(&cm_node->recv_list_lock); + + cm_node->loopbackpartner = NULL; + atomic_set(&cm_node->ref_count, 1); + /* associate our parent CM core */ + cm_node->cm_core = cm_core; + cm_node->tcp_cntxt.loc_id = NES_CM_DEF_LOCAL_ID; + cm_node->tcp_cntxt.rcv_wscale = NES_CM_DEFAULT_RCV_WND_SCALE; + cm_node->tcp_cntxt.rcv_wnd = NES_CM_DEFAULT_RCV_WND_SCALED >> + NES_CM_DEFAULT_RCV_WND_SCALE; + ts = current_kernel_time(); + cm_node->tcp_cntxt.loc_seq_num = htonl(ts.tv_nsec); + cm_node->tcp_cntxt.mss = nesvnic->max_frame_size - sizeof(struct iphdr) - + sizeof(struct tcphdr) - ETH_HLEN; + cm_node->tcp_cntxt.rcv_nxt = 0; + /* get a unique session ID , add thread_id to an upcounter to handle race */ + atomic_inc(&cm_core->node_cnt); + atomic_inc(&cm_core->session_id); + cm_node->session_id = (u32)(atomic_read(&cm_core->session_id) + current->tgid); + cm_node->conn_type = cm_info->conn_type; + cm_node->apbvt_set = 0; + cm_node->accept_pend = 0; + + cm_node->nesvnic = nesvnic; + /* get some device handles, for arp lookup */ + nesdev = nesvnic->nesdev; + nesadapter = nesdev->nesadapter; + + cm_node->loopbackpartner = NULL; + /* get the mac addr for the remote node */ + arpindex = nes_arp_table(nesdev, cm_node->rem_addr, NULL, NES_ARP_RESOLVE); + if (arpindex < 0) { + kfree(cm_node); + 
nes_addr_send_arp(cm_info->rem_addr); + return NULL; + } + + /* copy the mac addr to node context */ + memcpy(cm_node->rem_mac, nesadapter->arp_table[arpindex].mac_addr, ETH_ALEN); + nes_debug(NES_DBG_CM, "Remote mac addr from arp table:%02x," + " %02x, %02x, %02x, %02x, %02x\n", + cm_node->rem_mac[0], cm_node->rem_mac[1], + cm_node->rem_mac[2], cm_node->rem_mac[3], + cm_node->rem_mac[4], cm_node->rem_mac[5]); + + add_hte_node(cm_core, cm_node); + atomic_inc(&cm_nodes_created); + + return cm_node; +} + + +/** + * add_ref_cm_node - take a reference on a cm node + */ +static int add_ref_cm_node(struct nes_cm_node *cm_node) +{ + atomic_inc(&cm_node->ref_count); + return 0; +} + + +/** + * rem_ref_cm_node - drop a reference on a cm node, destroying it when the count reaches zero + */ +static int rem_ref_cm_node(struct nes_cm_core *cm_core, + struct nes_cm_node *cm_node) +{ + unsigned long flags, qplockflags; + struct nes_timer_entry *send_entry; + struct nes_timer_entry *recv_entry; + struct iw_cm_id *cm_id; + struct list_head *list_core, *list_node_temp; + struct nes_qp *nesqp; + + if (!cm_node) + return -EINVAL; + + spin_lock_irqsave(&cm_node->cm_core->ht_lock, flags); + if (atomic_dec_return(&cm_node->ref_count)) { + spin_unlock_irqrestore(&cm_node->cm_core->ht_lock, flags); + return 0; + } + list_del(&cm_node->list); + atomic_dec(&cm_core->ht_node_cnt); + spin_unlock_irqrestore(&cm_node->cm_core->ht_lock, flags); + + /* if the node is destroyed before connection was accelerated */ + if (!cm_node->accelerated && cm_node->accept_pend) { + BUG_ON(!cm_node->listener); + atomic_dec(&cm_node->listener->pend_accepts_cnt); + BUG_ON(atomic_read(&cm_node->listener->pend_accepts_cnt) < 0); + } + + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + list_for_each_safe(list_core, list_node_temp, &cm_node->retrans_list) { + send_entry = container_of(list_core, struct nes_timer_entry, list); + list_del(&send_entry->list); + spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + 
dev_kfree_skb_any(send_entry->skb); + kfree(send_entry); + spin_lock_irqsave(&cm_node->retrans_list_lock, flags); + continue; + } + spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); + + spin_lock_irqsave(&cm_node->recv_list_lock, flags); + list_for_each_safe(list_core, list_node_temp, &cm_node->recv_list) { + recv_entry = container_of(list_core, struct nes_timer_entry, list); + list_del(&recv_entry->list); + cm_id = cm_node->cm_id; + spin_unlock_irqrestore(&cm_node->recv_list_lock, flags); + if (recv_entry->type == NES_TIMER_TYPE_CLOSE) { + nesqp = (struct nes_qp *)recv_entry->skb; + spin_lock_irqsave(&nesqp->lock, qplockflags); + if (nesqp->cm_id) { + nes_debug(NES_DBG_CM, "QP%u: cm_id = %p: ****** HIT A NES_TIMER_TYPE_CLOSE" + " with something to do!!! ******\n", + nesqp->hwqp.qp_id, cm_id); + nesqp->hw_tcp_state = NES_AEQE_TCP_STATE_CLOSED; + nesqp->last_aeq = NES_AEQE_AEID_RESET_SENT; + nesqp->ibqp_state = IB_QPS_ERR; + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_cm_disconn(nesqp); + } else { + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_debug(NES_DBG_CM, "QP%u: cm_id = %p: ****** HIT A NES_TIMER_TYPE_CLOSE" + " with nothing to do!!! 
******\n", + nesqp->hwqp.qp_id, cm_id); + nes_rem_ref(&nesqp->ibqp); + } + cm_id->rem_ref(cm_id); + } else if (recv_entry->type == NES_TIMER_TYPE_RECV) { + dev_kfree_skb_any(recv_entry->skb); + } + kfree(recv_entry); + spin_lock_irqsave(&cm_node->recv_list_lock, flags); + } + spin_unlock_irqrestore(&cm_node->recv_list_lock, flags); + + if (cm_node->listener) { + mini_cm_dec_refcnt_listen(cm_core, cm_node->listener); + } else { + if (cm_node->apbvt_set && cm_node->nesvnic) { + nes_manage_apbvt(cm_node->nesvnic, cm_node->loc_port, + PCI_FUNC(cm_node->nesvnic->nesdev->pcidev->devfn), + NES_MANAGE_APBVT_DEL); + } + } + + kfree(cm_node); + atomic_dec(&cm_core->node_cnt); + atomic_inc(&cm_nodes_destroyed); + + return 0; +} + + +/** + * process_options + */ +static void process_options(struct nes_cm_node *cm_node, u8 *optionsloc, u32 optionsize) +{ + u32 tmp; + u32 offset = 0; + union all_known_options *all_options; + + while (offset < optionsize) { + all_options = (union all_known_options *)(optionsloc + offset); + switch (all_options->as_base.optionnum) { + case OPTION_NUMBER_END: + offset = optionsize; + break; + case OPTION_NUMBER_NONE: + offset += 1; + continue; + case OPTION_NUMBER_MSS: + tmp = htons(all_options->as_mss.mss); + if (tmp < cm_node->tcp_cntxt.mss) + cm_node->tcp_cntxt.mss = tmp; + break; + case OPTION_NUMBER_WINDOW_SCALE: + cm_node->tcp_cntxt.snd_wscale = all_options->as_windowscale.shiftcount; + break; + case OPTION_NUMBER_WRITE0: + cm_node->send_write0 = 1; + break; + default: + nes_debug(NES_DBG_CM, "TCP Option not understood: %x\n", + all_options->as_base.optionnum); + break; + } + offset += all_options->as_base.length; + } +} + + +/** + * process_packet + */ +int process_packet(struct nes_cm_node *cm_node, struct sk_buff *skb, + struct nes_cm_core *cm_core) +{ + int optionsize; + int datasize; + int ret = 0; +#ifdef OFED_1_2 + struct tcphdr *tcph = skb->h.th; +#else + struct tcphdr *tcph = tcp_hdr(skb); +#endif + u32 inc_sequence; + + if ((!tcph) 
|| (NES_CM_STATE_TSA == cm_node->state)) { + BUG_ON(!tcph); + atomic_inc(&cm_accel_dropped_pkts); + return -1; + } + + if (tcph->rst) { + atomic_inc(&cm_resets_recvd); + nes_debug(NES_DBG_CM, "Received Reset, cm_node = %p, state = %u. refcnt=%d\n", + cm_node, cm_node->state, atomic_read(&cm_node->ref_count)); + switch (cm_node->state) { + case NES_CM_STATE_LISTENING: + rem_ref_cm_node(cm_core, cm_node); + case NES_CM_STATE_TSA: + case NES_CM_STATE_CLOSED: + break; + case NES_CM_STATE_SYN_RCVD: + nes_debug(NES_DBG_CM, "Received a reset for local 0x%08X:%04X," + " remote 0x%08X:%04X, node state = %u\n", + cm_node->loc_addr, cm_node->loc_port, + cm_node->rem_addr, cm_node->rem_port, + cm_node->state); + rem_ref_cm_node(cm_core, cm_node); + break; + case NES_CM_STATE_ONE_SIDE_ESTABLISHED: + case NES_CM_STATE_ESTABLISHED: + case NES_CM_STATE_MPAREQ_SENT: + default: + nes_debug(NES_DBG_CM, "Received a reset for local 0x%08X:%04X," + " remote 0x%08X:%04X, node state = %u refcnt=%d\n", + cm_node->loc_addr, cm_node->loc_port, + cm_node->rem_addr, cm_node->rem_port, + cm_node->state, atomic_read(&cm_node->ref_count)); + // create event + cm_node->state = NES_CM_STATE_CLOSED; + + create_event(cm_node, NES_CM_EVENT_ABORTED); + break; + + } + return -1; + } + + optionsize = (tcph->doff << 2) - sizeof(struct tcphdr); + +#ifdef OFED_1_2 + skb_pull(skb, skb->nh.iph->ihl << 2); +#else + skb_pull(skb, ip_hdr(skb)->ihl << 2); +#endif + skb_pull(skb, tcph->doff << 2); + + datasize = skb->len; + inc_sequence = ntohl(tcph->seq); + nes_debug(NES_DBG_CM, "datasize = %u, sequence = 0x%08X, ack_seq = 0x%08X," + " rcv_nxt = 0x%08X Flags: %s %s.\n", + datasize, inc_sequence,ntohl(tcph->ack_seq), + cm_node->tcp_cntxt.rcv_nxt, (tcph->syn ? "SYN":""), + (tcph->ack ? 
"ACK":"")); + + if (!tcph->syn && (inc_sequence != cm_node->tcp_cntxt.rcv_nxt)) { + nes_debug(NES_DBG_CM, "dropping packet, datasize = %u, sequence = 0x%08X," + " ack_seq = 0x%08X, rcv_nxt = 0x%08X Flags: %s.\n", + datasize, inc_sequence,ntohl(tcph->ack_seq), + cm_node->tcp_cntxt.rcv_nxt, (tcph->ack ? "ACK":"")); + if (cm_node->state == NES_CM_STATE_LISTENING) { + rem_ref_cm_node(cm_core, cm_node); + } + return -1; + } + + cm_node->tcp_cntxt.rcv_nxt = inc_sequence + datasize; + + cm_node->tcp_cntxt.snd_wnd = htons(tcph->window) << + cm_node->tcp_cntxt.snd_wscale; + + if (cm_node->tcp_cntxt.snd_wnd > cm_node->tcp_cntxt.max_snd_wnd) { + cm_node->tcp_cntxt.max_snd_wnd = cm_node->tcp_cntxt.snd_wnd; + } + + if (optionsize) { + u8 *optionsloc = (u8 *)&tcph[1]; + process_options(cm_node, optionsloc, optionsize); + } + + if (tcph->ack) { + cm_node->tcp_cntxt.rem_ack_num = ntohl(tcph->ack_seq); + switch (cm_node->state) { + case NES_CM_STATE_SYN_RCVD: + case NES_CM_STATE_SYN_SENT: + /* read and stash current sequence number */ + if (cm_node->tcp_cntxt.rem_ack_num > cm_node->tcp_cntxt.loc_seq_num) { + nes_debug(NES_DBG_CM, "ERROR - cm_node->tcp_cntxt.rem_ack_num >" + " cm_node->tcp_cntxt.loc_seq_num\n"); + send_reset(cm_node); + return 0; + } + if (cm_node->state == NES_CM_STATE_SYN_SENT) + cm_node->state = NES_CM_STATE_ONE_SIDE_ESTABLISHED; + else { + cm_node->state = NES_CM_STATE_ESTABLISHED; + } + break; + case NES_CM_STATE_LAST_ACK: + cm_node->state = NES_CM_STATE_CLOSED; + break; + case NES_CM_STATE_FIN_WAIT1: + cm_node->state = NES_CM_STATE_FIN_WAIT2; + break; + case NES_CM_STATE_CLOSING: + cm_node->state = NES_CM_STATE_TIME_WAIT; + /* need to schedule this to happen in 2MSL timeouts */ + cm_node->state = NES_CM_STATE_CLOSED; + break; + case NES_CM_STATE_ONE_SIDE_ESTABLISHED: + case NES_CM_STATE_ESTABLISHED: + case NES_CM_STATE_MPAREQ_SENT: + case NES_CM_STATE_CLOSE_WAIT: + case NES_CM_STATE_TIME_WAIT: + case NES_CM_STATE_CLOSED: + break; + case NES_CM_STATE_LISTENING: 
+ if (!(tcph->syn)) { + nes_debug(NES_DBG_CM, "Received an ack without a SYN on a listening port\n"); + send_reset(cm_node); + /* send_reset bumps refcount, this should have been a new node */ + rem_ref_cm_node(cm_core, cm_node); + return -1; + } else { + nes_debug(NES_DBG_CM, "Received an ack on a listening port (syn-ack maybe?)\n"); + } + break; + case NES_CM_STATE_TSA: + nes_debug(NES_DBG_CM, "Received a packet with the ack bit set while in TSA state\n"); + break; + case NES_CM_STATE_UNKNOWN: + case NES_CM_STATE_INITED: + case NES_CM_STATE_ACCEPTING: + case NES_CM_STATE_FIN_WAIT2: + default: + nes_debug(NES_DBG_CM, "Received ack from unknown state: %x\n", + cm_node->state); + send_reset(cm_node); + break; + } + } + + if (tcph->syn) { + if (cm_node->state == NES_CM_STATE_LISTENING) { + /* do not exceed backlog */ + atomic_inc(&cm_node->listener->pend_accepts_cnt); + if (atomic_read(&cm_node->listener->pend_accepts_cnt) > + cm_node->listener->backlog) { + nes_debug(NES_DBG_CM, "drop syn due to backlog pressure \n"); + cm_backlog_drops++; + atomic_dec(&cm_node->listener->pend_accepts_cnt); + rem_ref_cm_node(cm_core, cm_node); + return 0; + } + cm_node->accept_pend = 1; + + } + if (datasize == 0) + cm_node->tcp_cntxt.rcv_nxt ++; + + if (cm_node->state == NES_CM_STATE_LISTENING) { + cm_node->state = NES_CM_STATE_SYN_RCVD; + send_syn(cm_node, 1); + } + if (cm_node->state == NES_CM_STATE_ONE_SIDE_ESTABLISHED) { + cm_node->state = NES_CM_STATE_ESTABLISHED; + /* send final handshake ACK */ + ret = send_ack(cm_node); + if (ret < 0) + return ret; + + cm_node->state = NES_CM_STATE_MPAREQ_SENT; + ret = send_mpa_request(cm_node); + if (ret < 0) + return ret; + } + } + + if (tcph->fin) { + cm_node->tcp_cntxt.rcv_nxt++; + switch (cm_node->state) { + case NES_CM_STATE_SYN_RCVD: + case NES_CM_STATE_SYN_SENT: + case NES_CM_STATE_ONE_SIDE_ESTABLISHED: + case NES_CM_STATE_ESTABLISHED: + case NES_CM_STATE_ACCEPTING: + case NES_CM_STATE_MPAREQ_SENT: + cm_node->state = 
NES_CM_STATE_CLOSE_WAIT; + cm_node->state = NES_CM_STATE_LAST_ACK; + ret = send_fin(cm_node, NULL); + break; + case NES_CM_STATE_FIN_WAIT1: + cm_node->state = NES_CM_STATE_CLOSING; + ret = send_ack(cm_node); + break; + case NES_CM_STATE_FIN_WAIT2: + cm_node->state = NES_CM_STATE_TIME_WAIT; + cm_node->tcp_cntxt.loc_seq_num ++; + ret = send_ack(cm_node); + /* need to schedule this to happen in 2MSL timeouts */ + cm_node->state = NES_CM_STATE_CLOSED; + break; + case NES_CM_STATE_CLOSE_WAIT: + case NES_CM_STATE_LAST_ACK: + case NES_CM_STATE_CLOSING: + case NES_CM_STATE_TSA: + default: + nes_debug(NES_DBG_CM, "Received a fin while in %x state\n", + cm_node->state); + ret = -EINVAL; + break; + } + } + + if (datasize) { + u8 *dataloc = skb->data; + /* figure out what state we are in and handle transition to next state */ + switch (cm_node->state) { + case NES_CM_STATE_LISTENING: + case NES_CM_STATE_SYN_RCVD: + case NES_CM_STATE_SYN_SENT: + case NES_CM_STATE_FIN_WAIT1: + case NES_CM_STATE_FIN_WAIT2: + case NES_CM_STATE_CLOSE_WAIT: + case NES_CM_STATE_LAST_ACK: + case NES_CM_STATE_CLOSING: + break; + case NES_CM_STATE_MPAREQ_SENT: + /* recv the mpa res frame, ret=frame len (incl priv data) */ + ret = parse_mpa(cm_node, dataloc, datasize); + if (ret < 0) + break; + /* set the req frame payload len in skb */ + /* we are done handling this state, set node to a TSA state */ + cm_node->state = NES_CM_STATE_TSA; + send_ack(cm_node); + create_event(cm_node, NES_CM_EVENT_CONNECTED); + break; + + case NES_CM_STATE_ESTABLISHED: + /* we are expecting an MPA req frame */ + ret = parse_mpa(cm_node, dataloc, datasize); + if (ret < 0) { + break; + } + cm_node->state = NES_CM_STATE_TSA; + send_ack(cm_node); + /* we got a valid MPA request, create an event */ + create_event(cm_node, NES_CM_EVENT_MPA_REQ); + break; + case NES_CM_STATE_TSA: + handle_exception_pkt(cm_node, skb); + break; + case NES_CM_STATE_UNKNOWN: + case NES_CM_STATE_INITED: + default: + ret = -1; + } + } + + return ret; +} 
+ + +/** + * mini_cm_listen - create a listen node with params + */ +static struct nes_cm_listener *mini_cm_listen(struct nes_cm_core *cm_core, + struct nes_vnic *nesvnic, struct nes_cm_info *cm_info) +{ + struct nes_cm_listener *listener; + unsigned long flags; + + /* cannot have multiple matching listeners */ + listener = find_listener( cm_core, htonl(cm_info->loc_addr), + htons(cm_info->loc_port), NES_CM_LISTENER_EITHER_STATE); + if (listener && listener->listener_state == NES_CM_LISTENER_ACTIVE_STATE) { + /* find automatically incs ref count ??? */ + atomic_dec(&listener->ref_count); + nes_debug(NES_DBG_CM, "Not creating listener since it already exists\n"); + return NULL; + } + + if (!listener) { + /* create a CM listen node (1/2 node to compare incoming traffic to) */ + listener = (struct nes_cm_listener *)kzalloc(sizeof(*listener), GFP_ATOMIC); + if (!listener) { + nes_debug(NES_DBG_CM, "Not creating listener memory allocation failed\n"); + return NULL; + } + + memset(listener, 0, sizeof(struct nes_cm_listener)); + listener->loc_addr = htonl(cm_info->loc_addr); + listener->loc_port = htons(cm_info->loc_port); + listener->reused_node = 0; + + atomic_set(&listener->ref_count, 1); + } + /* passive case */ + /* find already inc'ed the ref count */ + else { + listener->reused_node = 1; + } + + listener->cm_id = cm_info->cm_id; + atomic_set(&listener->pend_accepts_cnt, 0); + listener->cm_core = cm_core; + listener->nesvnic = nesvnic; + atomic_inc(&cm_core->node_cnt); + atomic_inc(&cm_core->session_id); + + listener->session_id = (u32)(atomic_read(&cm_core->session_id) + current->tgid); + listener->conn_type = cm_info->conn_type; + listener->backlog = cm_info->backlog; + listener->listener_state = NES_CM_LISTENER_ACTIVE_STATE; + + if (!listener->reused_node) { + spin_lock_irqsave(&cm_core->listen_list_lock, flags); + list_add(&listener->list, &cm_core->listen_list.list); + spin_unlock_irqrestore(&cm_core->listen_list_lock, flags); + 
atomic_inc(&cm_core->listen_node_cnt); + } + + nes_debug(NES_DBG_CM, "Api - listen(): addr=0x%08X, port=0x%04x," + " listener = %p, backlog = %d, cm_id = %p.\n", + ntohl(cm_info->loc_addr), ntohs(cm_info->loc_port), + listener, listener->backlog, listener->cm_id); + + return listener; +} + + +/** + * mini_cm_connect - make a connection node with params + */ +struct nes_cm_node * mini_cm_connect(struct nes_cm_core *cm_core, + struct nes_vnic *nesvnic, struct ietf_mpa_frame *mpa_frame, + struct nes_cm_info *cm_info) +{ + int ret = 0; + struct nes_cm_node *cm_node; + struct nes_cm_listener *loopbackremotelistener; + struct nes_cm_node *loopbackremotenode; + + u16 mpa_frame_size = sizeof(struct ietf_mpa_frame) + + ntohs(mpa_frame->priv_data_len); + + cm_info->loc_addr = htonl(cm_info->loc_addr); + cm_info->rem_addr = htonl(cm_info->rem_addr); + cm_info->loc_port = htons(cm_info->loc_port); + cm_info->rem_port = htons(cm_info->rem_port); + + /* create a CM connection node */ + cm_node = make_cm_node(cm_core, nesvnic, cm_info, NULL); + if (!cm_node) + return NULL; + + // set our node side to client (active) side + cm_node->tcp_cntxt.client = 1; + + if (cm_info->loc_addr == cm_info->rem_addr) { + loopbackremotelistener = find_listener(cm_core, cm_node->rem_addr, + cm_node->rem_port, NES_CM_LISTENER_ACTIVE_STATE); + if (NULL == loopbackremotelistener) { + create_event(cm_node, NES_CM_EVENT_ABORTED); + } else { + u16 temp; + temp = cm_info->loc_port; + cm_info->loc_port = cm_info->rem_port; + cm_info->rem_port = temp; + loopbackremotenode = make_cm_node(cm_core, nesvnic, cm_info, + loopbackremotelistener); + loopbackremotenode->loopbackpartner = cm_node; + cm_node->loopbackpartner = loopbackremotenode; + memcpy(loopbackremotenode->mpa_frame_buf, &mpa_frame->priv_data, + mpa_frame_size); + loopbackremotenode->mpa_frame_size = mpa_frame_size - + sizeof(struct ietf_mpa_frame); + + create_event(loopbackremotenode, NES_CM_EVENT_MPA_REQ); + // we are done handling this state, set 
node to a TSA state + cm_node->state = NES_CM_STATE_TSA; + } + return cm_node; + } + + /* set our node side to client (active) side */ + cm_node->tcp_cntxt.client = 1; + /* init our MPA frame ptr */ + memcpy(&cm_node->mpa_frame, mpa_frame, mpa_frame_size); + cm_node->mpa_frame_size = mpa_frame_size; + + /* send a syn and goto syn sent state */ + cm_node->state = NES_CM_STATE_SYN_SENT; + ret = send_syn(cm_node, 0); + + nes_debug(NES_DBG_CM, "Api - connect(): dest addr=0x%08X, port=0x%04x," + " cm_node=%p, cm_id = %p.\n", + cm_node->rem_addr, cm_node->rem_port, cm_node, cm_node->cm_id); + + return cm_node; +} + + +/** + * mini_cm_accept - accept a connection + * This function is never called + */ +int mini_cm_accept(struct nes_cm_core *cm_core, struct ietf_mpa_frame *mpa_frame, + struct nes_cm_node *cm_node) +{ + return 0; +} + + +/** + * mini_cm_reject - reject and teardown a connection + */ +int mini_cm_reject(struct nes_cm_core *cm_core, + struct ietf_mpa_frame *mpa_frame, + struct nes_cm_node *cm_node) +{ + int ret = 0; + struct sk_buff *skb; + u16 mpa_frame_size = sizeof(struct ietf_mpa_frame) + + ntohs(mpa_frame->priv_data_len); + + skb = get_free_pkt(cm_node); + if (!skb) { + nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n"); + return -1; + } + + /* send an MPA Request frame */ + form_cm_frame(skb, cm_node, NULL, 0, mpa_frame, mpa_frame_size, SET_ACK | SET_FIN); + ret = schedule_nes_timer(cm_node, skb, NES_TIMER_TYPE_SEND, 1, 0); + + cm_node->state = NES_CM_STATE_CLOSED; + ret = send_fin(cm_node, NULL); + + if (ret < 0) { + printk(KERN_INFO PFX "failed to send MPA Reply (reject)\n"); + return ret; + } + + return ret; +} + + +/** + * mini_cm_close + */ +int mini_cm_close(struct nes_cm_core *cm_core, struct nes_cm_node *cm_node) +{ + int ret = 0; + + if (!cm_core || !cm_node) + return -EINVAL; + + switch (cm_node->state) { + /* if passed in node is null, create a reference key node for node search */ + /* check if we found an owner node for this pkt */ + case 
NES_CM_STATE_SYN_RCVD: + case NES_CM_STATE_SYN_SENT: + case NES_CM_STATE_ONE_SIDE_ESTABLISHED: + case NES_CM_STATE_ESTABLISHED: + case NES_CM_STATE_ACCEPTING: + case NES_CM_STATE_MPAREQ_SENT: + cm_node->state = NES_CM_STATE_FIN_WAIT1; + send_fin(cm_node, NULL); + break; + case NES_CM_STATE_CLOSE_WAIT: + cm_node->state = NES_CM_STATE_LAST_ACK; + send_fin(cm_node, NULL); + break; + case NES_CM_STATE_FIN_WAIT1: + case NES_CM_STATE_FIN_WAIT2: + case NES_CM_STATE_LAST_ACK: + case NES_CM_STATE_TIME_WAIT: + case NES_CM_STATE_CLOSING: + ret = -1; + break; + case NES_CM_STATE_LISTENING: + case NES_CM_STATE_UNKNOWN: + case NES_CM_STATE_INITED: + case NES_CM_STATE_CLOSED: + case NES_CM_STATE_TSA: + ret = rem_ref_cm_node(cm_core, cm_node); + break; + } + cm_node->cm_id = NULL; + return ret; +} + + +/** + * mini_cm_recv_pkt - receive an Ethernet packet, and process it through CM + * node state machine + */ +int mini_cm_recv_pkt(struct nes_cm_core *cm_core, struct nes_vnic *nesvnic, + struct sk_buff *skb) +{ + struct nes_cm_node *cm_node = NULL; + struct nes_cm_listener *listener = NULL; + struct iphdr *iph; + struct tcphdr *tcph; + struct nes_cm_info nfo; + int ret = 0; + + if (!skb || skb->len < sizeof(struct iphdr) + sizeof(struct tcphdr)) { + ret = -EINVAL; + goto out; + } + + iph = (struct iphdr *)skb->data; + tcph = (struct tcphdr *)(skb->data + sizeof(struct iphdr)); +#ifdef OFED_1_2 + skb->nh.iph = iph; + skb->h.th = tcph; +#else + skb_reset_network_header(skb); + skb_set_transport_header(skb, sizeof(*iph)); +#endif + skb->len = ntohs(iph->tot_len); + + nfo.loc_addr = ntohl(iph->daddr); + nfo.loc_port = ntohs(tcph->dest); + nfo.rem_addr = ntohl(iph->saddr); + nfo.rem_port = ntohs(tcph->source); + + /* note: this call is going to increment cm_node ref count */ + cm_node = find_node(cm_core, + nfo.rem_port, nfo.rem_addr, + nfo.loc_port, nfo.loc_addr); + + if (!cm_node) { + listener = find_listener(cm_core, nfo.loc_addr, nfo.loc_port, + NES_CM_LISTENER_ACTIVE_STATE); + if (listener)
{ + nfo.cm_id = listener->cm_id; + nfo.conn_type = listener->conn_type; + } else { + nfo.cm_id = NULL; + nfo.conn_type = 0; + } + + cm_node = make_cm_node(cm_core, nesvnic, &nfo, listener); + if (!cm_node) { + nes_debug(NES_DBG_CM, "Unable to allocate node\n"); + ret = -1; + goto out; + } + if (!listener) { + nes_debug(NES_DBG_CM, "Packet found for unknown port %x refcnt=%d\n", + nfo.loc_port, atomic_read(&cm_node->ref_count)); + if (!tcph->rst) { + nes_debug(NES_DBG_CM, "Packet found for unknown port=%d" + " rem_port=%d refcnt=%d\n", + nfo.loc_port, nfo.rem_port, atomic_read(&cm_node->ref_count)); + + cm_node->tcp_cntxt.rcv_nxt = ntohl(tcph->seq); + cm_node->tcp_cntxt.loc_seq_num = ntohl(tcph->ack_seq); + send_reset(cm_node); + } + rem_ref_cm_node(cm_core, cm_node); + ret = -1; + goto out; + } + add_ref_cm_node(cm_node); + cm_node->state = NES_CM_STATE_LISTENING; + } + + nes_debug(NES_DBG_CM, "Processing Packet for node %p, data = (%p):\n", + cm_node, skb->data); + process_packet(cm_node, skb, cm_core); + + rem_ref_cm_node(cm_core, cm_node); + out: + if (skb) + dev_kfree_skb_any(skb); + return ret; +} + + +/** + * nes_cm_alloc_core - allocate a top level instance of a cm core + */ +struct nes_cm_core *nes_cm_alloc_core(void) +{ + int i; + + struct nes_cm_core *cm_core; + struct sk_buff *skb = NULL; + + /* setup the CM core */ + /* alloc top level core control structure */ + cm_core = kzalloc(sizeof(*cm_core), GFP_KERNEL); + if (!cm_core) + return NULL; + + INIT_LIST_HEAD(&cm_core->connected_nodes); + init_timer(&cm_core->tcp_timer); + cm_core->tcp_timer.function = nes_cm_timer_tick; + + cm_core->mtu = NES_CM_DEFAULT_MTU; + cm_core->state = NES_CM_STATE_INITED; + cm_core->free_tx_pkt_max = NES_CM_DEFAULT_FREE_PKTS; + + atomic_set(&cm_core->session_id, 0); + atomic_set(&cm_core->events_posted, 0); + + /* init the packet lists */ + skb_queue_head_init(&cm_core->tx_free_list); + + for (i=0; i < NES_CM_DEFAULT_FRAME_CNT; i++) { + skb = dev_alloc_skb(cm_core->mtu); + if 
(!skb) { + kfree(cm_core); + return NULL; + } + /* add 'raw' skb to free frame list */ + skb_queue_head(&cm_core->tx_free_list, skb); + } + + cm_core->api = &nes_cm_api; + + spin_lock_init(&cm_core->ht_lock); + spin_lock_init(&cm_core->listen_list_lock); + + INIT_LIST_HEAD(&cm_core->listen_list.list); + + nes_debug(NES_DBG_CM, "Init CM Core completed -- cm_core=%p\n", cm_core); + + nes_debug(NES_DBG_CM, "Enable QUEUE EVENTS\n"); + cm_core->event_wq = create_singlethread_workqueue("nesewq"); + cm_core->post_event = nes_cm_post_event; + nes_debug(NES_DBG_CM, "Enable QUEUE DISCONNECTS\n"); + cm_core->disconn_wq = create_singlethread_workqueue("nesdwq"); + + print_core(cm_core); + return cm_core; +} + + +/** + * mini_cm_dealloc_core - deallocate a top level instance of a cm core + */ +int mini_cm_dealloc_core(struct nes_cm_core *cm_core) +{ + nes_debug(NES_DBG_CM, "De-Alloc CM Core (%p)\n", cm_core); + + if (!cm_core) + return -EINVAL; + + barrier(); + + if (timer_pending(&cm_core->tcp_timer)) { + del_timer(&cm_core->tcp_timer); + } + + destroy_workqueue(cm_core->event_wq); + destroy_workqueue(cm_core->disconn_wq); + nes_debug(NES_DBG_CM, "\n"); + kfree(cm_core); + + return 0; +} + + +/** + * mini_cm_get + */ +int mini_cm_get(struct nes_cm_core *cm_core) +{ + return cm_core->state; +} + + +/** + * mini_cm_set + */ +int mini_cm_set(struct nes_cm_core *cm_core, u32 type, u32 value) +{ + int ret = 0; + + switch (type) { + case NES_CM_SET_PKT_SIZE: + cm_core->mtu = value; + break; + case NES_CM_SET_FREE_PKT_Q_SIZE: + cm_core->free_tx_pkt_max = value; + break; + default: + /* unknown set option */ + ret = -EINVAL; + } + + return ret; +} + + +/** + * nes_cm_init_tsa_conn setup HW; MPA frames must be + * successfully exchanged when this is called + */ +static int nes_cm_init_tsa_conn(struct nes_qp *nesqp, struct nes_cm_node *cm_node) +{ + int ret = 0; + + if (!nesqp) + return -EINVAL; + + nesqp->nesqp_context->misc |= cpu_to_le32(NES_QPCONTEXT_MISC_IPV4 | + 
NES_QPCONTEXT_MISC_NO_NAGLE | NES_QPCONTEXT_MISC_DO_NOT_FRAG | + NES_QPCONTEXT_MISC_DROS); + + if (cm_node->tcp_cntxt.snd_wscale || cm_node->tcp_cntxt.rcv_wscale) + nesqp->nesqp_context->misc |= cpu_to_le32(NES_QPCONTEXT_MISC_WSCALE); + + nesqp->nesqp_context->misc2 |= cpu_to_le32(64 << NES_QPCONTEXT_MISC2_TTL_SHIFT); + + nesqp->nesqp_context->mss |= cpu_to_le32(((u32)cm_node->tcp_cntxt.mss) << 16); + + nesqp->nesqp_context->tcp_state_flow_label |= cpu_to_le32( + (u32)NES_QPCONTEXT_TCPSTATE_EST << NES_QPCONTEXT_TCPFLOW_TCP_STATE_SHIFT); + + nesqp->nesqp_context->pd_index_wscale |= cpu_to_le32( + (cm_node->tcp_cntxt.snd_wscale << NES_QPCONTEXT_PDWSCALE_SND_WSCALE_SHIFT) & + NES_QPCONTEXT_PDWSCALE_SND_WSCALE_MASK); + + nesqp->nesqp_context->pd_index_wscale |= cpu_to_le32( + (cm_node->tcp_cntxt.rcv_wscale << NES_QPCONTEXT_PDWSCALE_RCV_WSCALE_SHIFT) & + NES_QPCONTEXT_PDWSCALE_RCV_WSCALE_MASK); + + nesqp->nesqp_context->keepalive = cpu_to_le32(0x80); + nesqp->nesqp_context->ts_recent = 0; + nesqp->nesqp_context->ts_age = 0; + nesqp->nesqp_context->snd_nxt = cpu_to_le32(cm_node->tcp_cntxt.loc_seq_num); + nesqp->nesqp_context->snd_wnd = cpu_to_le32(cm_node->tcp_cntxt.snd_wnd); + nesqp->nesqp_context->rcv_nxt = cpu_to_le32(cm_node->tcp_cntxt.rcv_nxt); + nesqp->nesqp_context->rcv_wnd = cpu_to_le32(cm_node->tcp_cntxt.rcv_wnd << + cm_node->tcp_cntxt.rcv_wscale); + nesqp->nesqp_context->snd_max = cpu_to_le32(cm_node->tcp_cntxt.loc_seq_num); + nesqp->nesqp_context->snd_una = cpu_to_le32(cm_node->tcp_cntxt.loc_seq_num); + nesqp->nesqp_context->srtt = 0; + nesqp->nesqp_context->rttvar = cpu_to_le32(0x6); + nesqp->nesqp_context->ssthresh = cpu_to_le32(0x3FFFC000); + nesqp->nesqp_context->cwnd = cpu_to_le32(2*cm_node->tcp_cntxt.mss); + nesqp->nesqp_context->snd_wl1 = cpu_to_le32(cm_node->tcp_cntxt.rcv_nxt); + nesqp->nesqp_context->snd_wl2 = cpu_to_le32(cm_node->tcp_cntxt.loc_seq_num); + nesqp->nesqp_context->max_snd_wnd = cpu_to_le32(cm_node->tcp_cntxt.max_snd_wnd); + + 
nes_debug(NES_DBG_CM, "QP%u: rcv_nxt = 0x%08X, snd_nxt = 0x%08X," + " Setting MSS to %u, PDWscale = 0x%08X, rcv_wnd = %u, context misc = 0x%08X.\n", + nesqp->hwqp.qp_id, le32_to_cpu(nesqp->nesqp_context->rcv_nxt), + le32_to_cpu(nesqp->nesqp_context->snd_nxt), + cm_node->tcp_cntxt.mss, le32_to_cpu(nesqp->nesqp_context->pd_index_wscale), + le32_to_cpu(nesqp->nesqp_context->rcv_wnd), + le32_to_cpu(nesqp->nesqp_context->misc)); + nes_debug(NES_DBG_CM, " snd_wnd = 0x%08X.\n", le32_to_cpu(nesqp->nesqp_context->snd_wnd)); + nes_debug(NES_DBG_CM, " snd_cwnd = 0x%08X.\n", le32_to_cpu(nesqp->nesqp_context->cwnd)); + nes_debug(NES_DBG_CM, " max_swnd = 0x%08X.\n", le32_to_cpu(nesqp->nesqp_context->max_snd_wnd)); + + nes_debug(NES_DBG_CM, "Change cm_node state to TSA\n"); + cm_node->state = NES_CM_STATE_TSA; + + return ret; +} + + +/** + * nes_cm_disconn + */ +int nes_cm_disconn(struct nes_qp *nesqp) +{ + unsigned long flags; + + spin_lock_irqsave(&nesqp->lock, flags); + if (0==nesqp->disconn_pending) { + nesqp->disconn_pending++; + spin_unlock_irqrestore(&nesqp->lock, flags); + /* nes_add_ref(&nesqp->ibqp); */ + /* init our disconnect work element, to */ + /* NES_INIT_WORK(&nesqp->disconn_work, nes_disconnect_worker, (void *)nesqp); */ + INIT_WORK(&nesqp->disconn_work, nes_disconnect_worker); + + queue_work(g_cm_core->disconn_wq, &nesqp->disconn_work); + } else { + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_rem_ref(&nesqp->ibqp); + } + + return 0; +} + + +/** + * nes_disconnect_worker + */ +void nes_disconnect_worker(void *parm) +{ + struct work_struct *work = parm; + struct nes_qp *nesqp = container_of(work, struct nes_qp, disconn_work); + + nes_debug(NES_DBG_CM, "processing AEQE id 0x%04X for QP%u.\n", + nesqp->last_aeq, nesqp->hwqp.qp_id); + nes_cm_disconn_true(nesqp); +} + + +/** + * nes_cm_disconn_true + */ +int nes_cm_disconn_true(struct nes_qp *nesqp) +{ + unsigned long flags; + int ret = 0; + struct iw_cm_id *cm_id; + struct iw_cm_event cm_event; + struct 
nes_vnic *nesvnic; + u16 last_ae; + u8 original_hw_tcp_state; + u8 original_ibqp_state; + u8 issued_disconnect_reset = 0; + + if (!nesqp) { + nes_debug(NES_DBG_CM, "disconnect_worker nesqp is NULL\n"); + return -1; + } + + spin_lock_irqsave(&nesqp->lock, flags); + cm_id = nesqp->cm_id; + /* make sure we haven't already closed this connection */ + if (!cm_id) { + nes_debug(NES_DBG_CM, "QP%u disconnect_worker cmid is NULL\n", + nesqp->hwqp.qp_id); + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_rem_ref(&nesqp->ibqp); + return -1; + } + + nesvnic = to_nesvnic(nesqp->ibqp.device); + nes_debug(NES_DBG_CM, "Disconnecting QP%u\n", nesqp->hwqp.qp_id); + + original_hw_tcp_state = nesqp->hw_tcp_state; + original_ibqp_state = nesqp->ibqp_state; + last_ae = nesqp->last_aeq; + + + nes_debug(NES_DBG_CM, "set ibqp_state=%u\n", nesqp->ibqp_state); + + if ((nesqp->cm_id) && (cm_id->event_handler)) { + if ((original_hw_tcp_state == NES_AEQE_TCP_STATE_CLOSE_WAIT) || + ((original_ibqp_state == IB_QPS_RTS) && + (last_ae == NES_AEQE_AEID_LLP_CONNECTION_RESET))) { + atomic_inc(&cm_disconnects); + cm_event.event = IW_CM_EVENT_DISCONNECT; + if (last_ae == NES_AEQE_AEID_LLP_CONNECTION_RESET) { + issued_disconnect_reset = 1; + cm_event.status = IW_CM_EVENT_STATUS_RESET; + nes_debug(NES_DBG_CM, "Generating a CM Disconnect Event (status reset) for " + " QP%u, cm_id = %p. \n", + nesqp->hwqp.qp_id, cm_id); + } else { + cm_event.status = IW_CM_EVENT_STATUS_OK; + } + + cm_event.local_addr = cm_id->local_addr; + cm_event.remote_addr = cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + + nes_debug(NES_DBG_CM, "Generating a CM Disconnect Event for " + " QP%u, SQ Head = %u, SQ Tail = %u. 
cm_id = %p, refcount = %u.\n", + nesqp->hwqp.qp_id, + nesqp->hwqp.sq_head, nesqp->hwqp.sq_tail, cm_id, + atomic_read(&nesqp->refcount)); + + spin_unlock_irqrestore(&nesqp->lock, flags); + ret = cm_id->event_handler(cm_id, &cm_event); + if (ret) + nes_debug(NES_DBG_CM, "OFA CM event_handler returned, ret=%d\n", ret); + spin_lock_irqsave(&nesqp->lock, flags); + } + + nesqp->disconn_pending = 0; + /* There might have been another AE while the lock was released */ + original_hw_tcp_state = nesqp->hw_tcp_state; + original_ibqp_state = nesqp->ibqp_state; + last_ae = nesqp->last_aeq; + + if ((0 == issued_disconnect_reset) && (nesqp->cm_id) && + ((original_hw_tcp_state == NES_AEQE_TCP_STATE_CLOSED) || + (original_hw_tcp_state == NES_AEQE_TCP_STATE_TIME_WAIT) || + (last_ae == NES_AEQE_AEID_RDMAP_ROE_BAD_LLP_CLOSE) || + (last_ae == NES_AEQE_AEID_LLP_CONNECTION_RESET))) { + atomic_inc(&cm_closes); + nesqp->cm_id = NULL; + nesqp->in_disconnect = 0; + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_disconnect(nesqp, 1); + + cm_id->provider_data = nesqp; + /* Send up the close complete event */ + cm_event.event = IW_CM_EVENT_CLOSE; + cm_event.status = IW_CM_EVENT_STATUS_OK; + cm_event.provider_data = cm_id->provider_data; + cm_event.local_addr = cm_id->local_addr; + cm_event.remote_addr = cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + + ret = cm_id->event_handler(cm_id, &cm_event); + if (ret) { + nes_debug(NES_DBG_CM, "OFA CM event_handler returned, ret=%d\n", ret); + } + + cm_id->rem_ref(cm_id); + + spin_lock_irqsave(&nesqp->lock, flags); + if (0 == nesqp->flush_issued) { + nesqp->flush_issued = 1; + spin_unlock_irqrestore(&nesqp->lock, flags); + flush_wqes(nesvnic->nesdev, nesqp, NES_CQP_FLUSH_RQ, 1); + } else { + spin_unlock_irqrestore(&nesqp->lock, flags); + } + + /* This reference is from either ModifyQP or the AE processing, + there is still a race here with modifyqp */ + nes_rem_ref(&nesqp->ibqp); + + } else { + cm_id = 
nesqp->cm_id; + spin_unlock_irqrestore(&nesqp->lock, flags); + /* check to see if the inbound reset beat the outbound reset */ + if ((!cm_id) && (last_ae==NES_AEQE_AEID_RESET_SENT)) { + nes_debug(NES_DBG_CM, "QP%u: Decrementing refcount due to inbound reset" + " beating the outbound reset.\n", + nesqp->hwqp.qp_id); + nes_rem_ref(&nesqp->ibqp); + } + } + } else { + nesqp->disconn_pending = 0; + spin_unlock_irqrestore(&nesqp->lock, flags); + } + nes_rem_ref(&nesqp->ibqp); + + return 0; +} + + +/** + * nes_disconnect + */ +int nes_disconnect(struct nes_qp *nesqp, int abrupt) +{ + int ret = 0; + struct nes_vnic *nesvnic; + struct nes_device *nesdev; + + nesvnic = to_nesvnic(nesqp->ibqp.device); + if (!nesvnic) + return -EINVAL; + + nesdev = nesvnic->nesdev; + + nes_debug(NES_DBG_CM, "netdev refcnt = %u.\n", + atomic_read(&nesvnic->netdev->refcnt)); + + if (nesqp->active_conn) { + + /* indicate this connection is NOT active */ + nesqp->active_conn = 0; + } else { + /* Need to free the Last Streaming Mode Message */ + if (nesqp->ietf_frame) { + pci_free_consistent(nesdev->pcidev, + nesqp->private_data_len+sizeof(struct ietf_mpa_frame), + nesqp->ietf_frame, nesqp->ietf_frame_pbase); + } + } + + /* close the CM node down if it is still active */ + if (nesqp->cm_node) { + nes_debug(NES_DBG_CM, "Call close API\n"); + + g_cm_core->api->close(g_cm_core, nesqp->cm_node); + nesqp->cm_node = NULL; + } + + return ret; +} + + +/** + * nes_accept + */ +int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + u64 u64temp; + struct ib_qp *ibqp; + struct nes_qp *nesqp; + struct nes_vnic *nesvnic; + struct nes_device *nesdev; + struct nes_cm_node *cm_node; + struct nes_adapter *adapter; + struct ib_qp_attr attr; + struct iw_cm_event cm_event; + struct nes_hw_qp_wqe *wqe; + struct nes_v4_quad nes_quad; + int ret; + + ibqp = nes_get_qp(cm_id->device, conn_param->qpn); + if (!ibqp) + return -EINVAL; + + /* get all our handles */ + nesqp = to_nesqp(ibqp); + nesvnic = 
to_nesvnic(nesqp->ibqp.device); + nesdev = nesvnic->nesdev; + adapter = nesdev->nesadapter; + + /* since this is from a listen, we were able to put node handle into cm_id */ + cm_node = (struct nes_cm_node *)cm_id->provider_data; + + /* associate the node with the QP */ + nesqp->cm_node = (void *)cm_node; + + nes_debug(NES_DBG_CM, "QP%u, cm_node=%p, jiffies = %lu\n", + nesqp->hwqp.qp_id, cm_node, jiffies); + atomic_inc(&cm_accepts); + + nes_debug(NES_DBG_CM, "netdev refcnt = %u.\n", + atomic_read(&nesvnic->netdev->refcnt)); + + /* allocate the ietf frame and space for private data */ + nesqp->ietf_frame = pci_alloc_consistent(nesdev->pcidev, + sizeof(struct ietf_mpa_frame) + conn_param->private_data_len, + &nesqp->ietf_frame_pbase); + + if (!nesqp->ietf_frame) { + nes_debug(NES_DBG_CM, "Unable to allocate memory for private data\n"); + return -ENOMEM; + } + + + /* setup the MPA frame */ + nesqp->private_data_len = conn_param->private_data_len; + memcpy(nesqp->ietf_frame->key, IEFT_MPA_KEY_REP, IETF_MPA_KEY_SIZE); + + memcpy(nesqp->ietf_frame->priv_data, conn_param->private_data, + conn_param->private_data_len); + + nesqp->ietf_frame->priv_data_len = cpu_to_be16(conn_param->private_data_len); + nesqp->ietf_frame->rev = mpa_version; + nesqp->ietf_frame->flags = IETF_MPA_FLAGS_CRC; + + /* setup our first outgoing iWarp send WQE (the IETF frame response) */ + wqe = &nesqp->hwqp.sq_vbase[0]; + + u64temp = (u64)nesqp; + u64temp |= NES_SW_CONTEXT_ALIGN>>1; + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)(u64temp)); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(u64temp>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] = + cpu_to_le32(NES_IWARP_SQ_WQE_STREAMING | NES_IWARP_SQ_WQE_WRPDU); + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = + cpu_to_le32(conn_param->private_data_len + sizeof(struct ietf_mpa_frame)); + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_LOW_IDX] = + cpu_to_le32((u32)nesqp->ietf_frame_pbase); + 
wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX] = + cpu_to_le32((u32)((u64)nesqp->ietf_frame_pbase >> 32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_LENGTH0_IDX] = + cpu_to_le32(conn_param->private_data_len + sizeof(struct ietf_mpa_frame)); + wqe->wqe_words[NES_IWARP_SQ_WQE_STAG0_IDX] = 0; + + nesqp->nesqp_context->ird_ord_sizes |= cpu_to_le32( + NES_QPCONTEXT_ORDIRD_LSMM_PRESENT | NES_QPCONTEXT_ORDIRD_WRPDU); + nesqp->skip_lsmm = 1; + + + /* Cache the cm_id in the qp */ + nesqp->cm_id = cm_id; + cm_node->cm_id = cm_id; + + /* nesqp->cm_node = (void *)cm_id->provider_data; */ + cm_id->provider_data = nesqp; + nesqp->active_conn = 0; + + nes_cm_init_tsa_conn(nesqp, cm_node); + + nesqp->nesqp_context->tcpPorts[0] = cpu_to_le16(ntohs(cm_id->local_addr.sin_port)); + nesqp->nesqp_context->tcpPorts[1] = cpu_to_le16(ntohs(cm_id->remote_addr.sin_port)); + nesqp->nesqp_context->ip0 = cpu_to_le32(ntohl(cm_id->remote_addr.sin_addr.s_addr)); + + nesqp->nesqp_context->misc2 |= cpu_to_le32( + (u32)PCI_FUNC(nesdev->pcidev->devfn) << NES_QPCONTEXT_MISC2_SRC_IP_SHIFT); + + nesqp->nesqp_context->arp_index_vlan |= cpu_to_le32( + nes_arp_table(nesdev, le32_to_cpu(nesqp->nesqp_context->ip0), NULL, + NES_ARP_RESOLVE) << 16); + + nesqp->nesqp_context->ts_val_delta = cpu_to_le32( + jiffies - nes_read_indexed(nesdev, NES_IDX_TCP_NOW)); + + nesqp->nesqp_context->ird_index = cpu_to_le32(nesqp->hwqp.qp_id); + + nesqp->nesqp_context->ird_ord_sizes |= cpu_to_le32( + ((u32)1 << NES_QPCONTEXT_ORDIRD_IWARP_MODE_SHIFT)); + nesqp->nesqp_context->ird_ord_sizes |= cpu_to_le32((u32)conn_param->ord); + + memset(&nes_quad, 0, sizeof(nes_quad)); + + nes_quad.DstIpAdrIndex = (u32)PCI_FUNC(nesdev->pcidev->devfn) << 27; + nes_quad.SrcIpadr = cm_id->remote_addr.sin_addr.s_addr; + nes_quad.TcpPorts[0] = cm_id->remote_addr.sin_port; + nes_quad.TcpPorts[1] = cm_id->local_addr.sin_port; + + /* Produce hash key */ + nesqp->hte_index = nes_crc32(1, NES_HASH_CRC_INITAL_VALUE, + NES_HASH_CRC_FINAL_XOR, sizeof(nes_quad), + 
(u8 *)&nes_quad, ORDER, REFIN, REFOUT); + + nes_debug(NES_DBG_CM, "HTE Index = 0x%08X, CRC = 0x%08X\n", + nesqp->hte_index, nesqp->hte_index & adapter->hte_index_mask); + + nesqp->hte_index &= adapter->hte_index_mask; + nesqp->nesqp_context->hte_index = cpu_to_le32(nesqp->hte_index); + + cm_node->cm_core->api->accelerated(cm_node->cm_core, cm_node); + + nes_debug(NES_DBG_CM, "QP%u, Destination IP = 0x%08X:0x%04X, local = 0x%08X:0x%04X," + " rcv_nxt=0x%08X, snd_nxt=0x%08X, mpa + private data length=%lu.\n", + nesqp->hwqp.qp_id, + ntohl(cm_id->remote_addr.sin_addr.s_addr), + ntohs(cm_id->remote_addr.sin_port), + ntohl(cm_id->local_addr.sin_addr.s_addr), + ntohs(cm_id->local_addr.sin_port), + le32_to_cpu(nesqp->nesqp_context->rcv_nxt), + le32_to_cpu(nesqp->nesqp_context->snd_nxt), + conn_param->private_data_len+sizeof(struct ietf_mpa_frame)); + + attr.qp_state = IB_QPS_RTS; + nes_modify_qp(&nesqp->ibqp, &attr, IB_QP_STATE, NULL ); + + /* notify OF layer that accept event was successful */ + cm_id->add_ref(cm_id); + + cm_event.event = IW_CM_EVENT_ESTABLISHED; + cm_event.status = IW_CM_EVENT_STATUS_ACCEPTED; + cm_event.provider_data = (void *)nesqp; + cm_event.local_addr = cm_id->local_addr; + cm_event.remote_addr = cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + ret = cm_id->event_handler(cm_id, &cm_event); + nes_debug(NES_DBG_CM, "OFA CM event_handler returned, ret=%d\n", ret); + if (ret) + printk("%s[%u] OFA CM event_handler returned, ret=%d\n", + __FUNCTION__, __LINE__, ret); + + return 0; +} + + +/** + * nes_reject + */ +int nes_reject(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + struct nes_cm_node *cm_node; + struct nes_cm_core *cm_core; + + atomic_inc(&cm_rejects); + cm_node = (struct nes_cm_node *) cm_id->provider_data; + cm_core = cm_node->cm_core; + cm_node->mpa_frame_size = sizeof(struct ietf_mpa_frame) + pdata_len; + + memcpy(&cm_node->mpa_frame.key[0], IEFT_MPA_KEY_REP, IETF_MPA_KEY_SIZE); + 
memcpy(&cm_node->mpa_frame.priv_data, pdata, pdata_len); + + cm_node->mpa_frame.priv_data_len = cpu_to_be16(pdata_len); + cm_node->mpa_frame.rev = mpa_version; + cm_node->mpa_frame.flags = IETF_MPA_FLAGS_CRC | IETF_MPA_FLAGS_REJECT; + + cm_core->api->reject(cm_core, &cm_node->mpa_frame, cm_node); + + return 0; +} + + +/** + * nes_connect + * setup and launch cm connect node + */ +int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + struct ib_qp *ibqp; + struct nes_qp *nesqp; + struct nes_vnic *nesvnic; + struct nes_device *nesdev; + struct nes_cm_node *cm_node; + struct nes_cm_info cm_info; + + ibqp = nes_get_qp(cm_id->device, conn_param->qpn); + if (!ibqp) + return -EINVAL; + nesqp = to_nesqp(ibqp); + if (!nesqp) + return -EINVAL; + nesvnic = to_nesvnic(nesqp->ibqp.device); + if (!nesvnic) + return -EINVAL; + nesdev = nesvnic->nesdev; + if (!nesdev) + return -EINVAL; + + atomic_inc(&cm_connects); + + nesqp->ietf_frame = kzalloc(sizeof(struct ietf_mpa_frame) + + conn_param->private_data_len, GFP_KERNEL); + if (!nesqp->ietf_frame) + return -ENOMEM; + + /* set qp as having an active connection */ + nesqp->active_conn = 1; + + nes_debug(NES_DBG_CM, "QP%u, Destination IP = 0x%08X:0x%04X, local = 0x%08X:0x%04X.\n", + nesqp->hwqp.qp_id, + ntohl(cm_id->remote_addr.sin_addr.s_addr), + ntohs(cm_id->remote_addr.sin_port), + ntohl(cm_id->local_addr.sin_addr.s_addr), + ntohs(cm_id->local_addr.sin_port)); + + /* cache the cm_id in the qp */ + nesqp->cm_id = cm_id; + + cm_id->provider_data = nesqp; + + /* copy the private data */ + if (conn_param->private_data_len) { + memcpy(nesqp->ietf_frame->priv_data, conn_param->private_data, + conn_param->private_data_len); + } + + nesqp->private_data_len = conn_param->private_data_len; + nesqp->nesqp_context->ird_ord_sizes |= cpu_to_le32((u32)conn_param->ord); + nes_debug(NES_DBG_CM, "requested ord = 0x%08X.\n", (u32)conn_param->ord); + nes_debug(NES_DBG_CM, "mpa private data len =%u\n", 
conn_param->private_data_len); + + strcpy(&nesqp->ietf_frame->key[0], IEFT_MPA_KEY_REQ); + nesqp->ietf_frame->flags = IETF_MPA_FLAGS_CRC; + nesqp->ietf_frame->rev = IETF_MPA_VERSION; + nesqp->ietf_frame->priv_data_len = htons(conn_param->private_data_len); + + nes_manage_apbvt(nesvnic, ntohs(cm_id->local_addr.sin_port), + PCI_FUNC(nesdev->pcidev->devfn), NES_MANAGE_APBVT_ADD); + + /* set up the connection params for the node */ + cm_info.loc_addr = (cm_id->local_addr.sin_addr.s_addr); + cm_info.loc_port = (cm_id->local_addr.sin_port); + cm_info.rem_addr = (cm_id->remote_addr.sin_addr.s_addr); + cm_info.rem_port = (cm_id->remote_addr.sin_port); + cm_info.cm_id = cm_id; + cm_info.conn_type = NES_CM_IWARP_CONN_TYPE; + + cm_id->add_ref(cm_id); + nes_add_ref(&nesqp->ibqp); + + /* create a connect CM node connection */ + cm_node = g_cm_core->api->connect(g_cm_core, nesvnic, nesqp->ietf_frame, &cm_info); + if (!cm_node) { + nes_manage_apbvt(nesvnic, ntohs(cm_id->local_addr.sin_port), + PCI_FUNC(nesdev->pcidev->devfn), NES_MANAGE_APBVT_DEL); + nes_rem_ref(&nesqp->ibqp); + kfree(nesqp->ietf_frame); + nesqp->ietf_frame = NULL; + cm_id->rem_ref(cm_id); + return -ENOMEM; + } + + cm_node->apbvt_set = 1; + nesqp->cm_node = cm_node; + + return 0; +} + + +/** + * nes_create_listen + */ +int nes_create_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct nes_vnic *nesvnic; + struct nes_cm_listener *cm_node; + struct nes_cm_info cm_info; + struct nes_adapter *adapter; + int err; + + + nes_debug(NES_DBG_CM, "cm_id = %p, local port = 0x%04X.\n", + cm_id, ntohs(cm_id->local_addr.sin_port)); + + nesvnic = to_nesvnic(cm_id->device); + if (!nesvnic) + return -EINVAL; + adapter = nesvnic->nesdev->nesadapter; + + /* setup listen params in our api call struct */ + cm_info.loc_addr = cm_id->local_addr.sin_addr.s_addr; + cm_info.loc_port = cm_id->local_addr.sin_port; + cm_info.backlog = backlog; + cm_info.cm_id = cm_id; + + cm_info.conn_type = NES_CM_IWARP_CONN_TYPE; + + + cm_node = 
g_cm_core->api->listen(g_cm_core, nesvnic, &cm_info); + if (!cm_node) { + printk("%s[%u] Error returned from listen API call\n", + __FUNCTION__, __LINE__); + return -ENOMEM; + } + + cm_id->provider_data = cm_node; + + if (!cm_node->reused_node) { + err = nes_manage_apbvt(nesvnic, ntohs(cm_id->local_addr.sin_port), + PCI_FUNC(nesvnic->nesdev->pcidev->devfn), NES_MANAGE_APBVT_ADD); + if (err) { + printk("nes_manage_apbvt call returned %d.\n", err); + g_cm_core->api->stop_listener(g_cm_core, (void *)cm_node); + return err; + } + cm_listens_created++; + } + + cm_id->add_ref(cm_id); + cm_id->provider_data = (void *)cm_node; + + + return 0; +} + + +/** + * nes_destroy_listen + */ +int nes_destroy_listen(struct iw_cm_id *cm_id) +{ + if (cm_id->provider_data) + g_cm_core->api->stop_listener(g_cm_core, cm_id->provider_data); + else + nes_debug(NES_DBG_CM, "cm_id->provider_data was NULL\n"); + + cm_id->rem_ref(cm_id); + + return 0; +} + + +/** + * nes_cm_recv + */ +int nes_cm_recv(struct sk_buff *skb, struct net_device *netdevice) +{ + cm_packets_received++; + if ((g_cm_core)&&(g_cm_core->api)) { + g_cm_core->api->recv_pkt(g_cm_core, netdev_priv(netdevice), skb); + } else { + nes_debug(NES_DBG_CM, "Unable to process packet for CM," + " cm is not setup properly.\n"); + } + + return 0; +} + + +/** + * nes_cm_start + * Start and init a cm core module + */ +int nes_cm_start(void) +{ + nes_debug(NES_DBG_CM, "\n"); + /* create the primary CM core, pass this handle to subsequent core inits */ + g_cm_core = nes_cm_alloc_core(); + if (g_cm_core) { + return 0; + } else { + return -ENOMEM; + } +} + + +/** + * nes_cm_stop + * stop and dealloc all cm core instances + */ +int nes_cm_stop(void) +{ + g_cm_core->api->destroy_cm_core(g_cm_core); + return 0; +} + + +/** + * cm_event_connected + * handle a connected event, setup QPs and HW + */ +void cm_event_connected(struct nes_cm_event *event) +{ + u64 u64temp; + struct nes_qp *nesqp; + struct nes_vnic *nesvnic; + struct nes_device *nesdev; 
+ struct nes_cm_node *cm_node; + struct nes_adapter *nesadapter; + struct ib_qp_attr attr; + struct iw_cm_id *cm_id; + struct iw_cm_event cm_event; + struct nes_hw_qp_wqe *wqe; + struct nes_v4_quad nes_quad; + int ret; + + /* get all our handles */ + cm_node = event->cm_node; + cm_id = cm_node->cm_id; + nes_debug(NES_DBG_CM, "cm_event_connected - %p - cm_id = %p\n", cm_node, cm_id); + nesqp = (struct nes_qp *)cm_id->provider_data; + nesvnic = to_nesvnic(nesqp->ibqp.device); + nesdev = nesvnic->nesdev; + nesadapter = nesdev->nesadapter; + + if (nesqp->destroyed) { + return; + } + atomic_inc(&cm_connecteds); + nes_debug(NES_DBG_CM, "QP%u attempting to connect to 0x%08X:0x%04X on" + " local port 0x%04X. jiffies = %lu.\n", + nesqp->hwqp.qp_id, + ntohl(cm_id->remote_addr.sin_addr.s_addr), + ntohs(cm_id->remote_addr.sin_port), + ntohs(cm_id->local_addr.sin_port), + jiffies); + + nes_cm_init_tsa_conn(nesqp, cm_node); + + /* set the QP tsa context */ + nesqp->nesqp_context->tcpPorts[0] = cpu_to_le16(ntohs(cm_id->local_addr.sin_port)); + nesqp->nesqp_context->tcpPorts[1] = cpu_to_le16(ntohs(cm_id->remote_addr.sin_port)); + nesqp->nesqp_context->ip0 = cpu_to_le32(ntohl(cm_id->remote_addr.sin_addr.s_addr)); + + nesqp->nesqp_context->misc2 |= cpu_to_le32( + (u32)PCI_FUNC(nesdev->pcidev->devfn) << NES_QPCONTEXT_MISC2_SRC_IP_SHIFT); + nesqp->nesqp_context->arp_index_vlan |= cpu_to_le32( + nes_arp_table(nesdev, le32_to_cpu(nesqp->nesqp_context->ip0), + NULL, NES_ARP_RESOLVE) << 16); + nesqp->nesqp_context->ts_val_delta = cpu_to_le32( + jiffies - nes_read_indexed(nesdev, NES_IDX_TCP_NOW)); + nesqp->nesqp_context->ird_index = cpu_to_le32(nesqp->hwqp.qp_id); + nesqp->nesqp_context->ird_ord_sizes |= + cpu_to_le32((u32)1 << NES_QPCONTEXT_ORDIRD_IWARP_MODE_SHIFT); + + /* Adjust tail for not having a LSMM */ + nesqp->hwqp.sq_tail = 1; + +#if defined(NES_SEND_FIRST_WRITE) + if (cm_node->send_write0) { + nes_debug(NES_DBG_CM, "Sending first write.\n"); + wqe = &nesqp->hwqp.sq_vbase[0]; + 
u64temp = (u64)nesqp; + u64temp |= NES_SW_CONTEXT_ALIGN>>1; + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)(u64temp)); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(u64temp>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] = cpu_to_le32(NES_IWARP_SQ_OP_RDMAW); + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = 0; + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_LOW_IDX] = 0; + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX] = 0; + wqe->wqe_words[NES_IWARP_SQ_WQE_LENGTH0_IDX] = 0; + wqe->wqe_words[NES_IWARP_SQ_WQE_STAG0_IDX] = 0; + + /* use the reserved spot on the WQ for the extra first WQE */ + nesqp->nesqp_context->ird_ord_sizes &= cpu_to_le32(~(NES_QPCONTEXT_ORDIRD_LSMM_PRESENT | + NES_QPCONTEXT_ORDIRD_WRPDU | NES_QPCONTEXT_ORDIRD_ALSMM)); + nesqp->skip_lsmm = 1; + nesqp->hwqp.sq_tail = 0; + nes_write32(nesdev->regs + NES_WQE_ALLOC, + (1 << 24) | 0x00800000 | nesqp->hwqp.qp_id); + } +#endif + + memset(&nes_quad, 0, sizeof(nes_quad)); + + nes_quad.DstIpAdrIndex = (u32)PCI_FUNC(nesdev->pcidev->devfn) << 27; + nes_quad.SrcIpadr = cm_id->remote_addr.sin_addr.s_addr; + nes_quad.TcpPorts[0] = cm_id->remote_addr.sin_port; + nes_quad.TcpPorts[1] = cm_id->local_addr.sin_port; + + nesqp->hte_index = nes_crc32( 1, NES_HASH_CRC_INITAL_VALUE, + NES_HASH_CRC_FINAL_XOR, sizeof(nes_quad), (u8 *)&nes_quad, + ORDER, REFIN, REFOUT); + + nes_debug(NES_DBG_CM, "HTE Index = 0x%08X, After CRC = 0x%08X, TcpPorts = 0x%08X\n", + nesqp->hte_index, nesqp->hte_index & nesadapter->hte_index_mask, + le32_to_cpu(nes_quad.TcpPorts)); + + nesqp->hte_index &= nesadapter->hte_index_mask; + nesqp->nesqp_context->hte_index = cpu_to_le32(nesqp->hte_index); + + nesqp->ietf_frame = &cm_node->mpa_frame; + nesqp->private_data_len = (u8) cm_node->mpa_frame_size; + cm_node->cm_core->api->accelerated(cm_node->cm_core, cm_node); + + /* modify QP state to rts */ + attr.qp_state = IB_QPS_RTS; + nes_modify_qp(&nesqp->ibqp, &attr, IB_QP_STATE, NULL); + + /* notify 
OF layer we successfully created the requested connection */ + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.status = IW_CM_EVENT_STATUS_ACCEPTED; + cm_event.provider_data = cm_id->provider_data; + cm_event.local_addr.sin_family = AF_INET; + cm_event.local_addr.sin_port = cm_id->local_addr.sin_port; + cm_event.remote_addr = cm_id->remote_addr; + + cm_event.private_data = (void *)event->cm_node->mpa_frame_buf; + cm_event.private_data_len = (u8) event->cm_node->mpa_frame_size; + + cm_event.local_addr.sin_addr.s_addr = event->cm_info.rem_addr; + ret = cm_id->event_handler(cm_id, &cm_event); + nes_debug(NES_DBG_CM, "OFA CM event_handler returned, ret=%d\n", ret); + + if (ret) + printk("%s[%u] OFA CM event_handler returned, ret=%d\n", + __FUNCTION__, __LINE__, ret); + nes_debug(NES_DBG_CM, "Exiting connect thread for QP%u. jiffies = %lu\n", + nesqp->hwqp.qp_id, jiffies ); + + nes_rem_ref(&nesqp->ibqp); + + return; +} + + +/** + * cm_event_connect_error + */ +void cm_event_connect_error(struct nes_cm_event *event) +{ + struct nes_qp *nesqp; + struct iw_cm_id *cm_id; + struct iw_cm_event cm_event; + /* struct nes_cm_info cm_info; */ + int ret; + + if (!event->cm_node) + return; + + cm_id = event->cm_node->cm_id; + if (!cm_id) { + return; + } + + nes_debug(NES_DBG_CM, "cm_node=%p, cm_id=%p\n", event->cm_node, cm_id); + nesqp = cm_id->provider_data; + + if (!nesqp) { + return; + } + + /* notify OF layer about this connection error event */ + /* cm_id->rem_ref(cm_id); */ + nesqp->cm_id = NULL; + cm_id->provider_data = NULL; + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.status = IW_CM_EVENT_STATUS_REJECTED; + cm_event.provider_data = cm_id->provider_data; + cm_event.local_addr = cm_id->local_addr; + cm_event.remote_addr = cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + + nes_debug(NES_DBG_CM, "call CM_EVENT REJECTED, local_addr=%08x, remote_addr=%08x\n", + cm_event.local_addr.sin_addr.s_addr, 
cm_event.remote_addr.sin_addr.s_addr); + + ret = cm_id->event_handler(cm_id, &cm_event); + nes_debug(NES_DBG_CM, "OFA CM event_handler returned, ret=%d\n", ret); + if (ret) + printk("%s[%u] OFA CM event_handler returned, ret=%d\n", + __FUNCTION__, __LINE__, ret); + nes_rem_ref(&nesqp->ibqp); + cm_id->rem_ref(cm_id); + + return; +} + + +/** + * cm_event_reset + */ +void cm_event_reset(struct nes_cm_event *event) +{ + struct nes_qp *nesqp; + struct iw_cm_id *cm_id; + struct iw_cm_event cm_event; + /* struct nes_cm_info cm_info; */ + int ret; + + if (!event->cm_node) + return; + + cm_id = event->cm_node->cm_id; + + nes_debug(NES_DBG_CM, "%p - cm_id = %p\n", event->cm_node, cm_id); + nesqp = cm_id->provider_data; + + nesqp->cm_id = NULL; + /* cm_id->provider_data = NULL; */ + cm_event.event = IW_CM_EVENT_DISCONNECT; + cm_event.status = IW_CM_EVENT_STATUS_RESET; + cm_event.provider_data = cm_id->provider_data; + cm_event.local_addr = cm_id->local_addr; + cm_event.remote_addr = cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + + ret = cm_id->event_handler(cm_id, &cm_event); + nes_debug(NES_DBG_CM, "OFA CM event_handler returned, ret=%d\n", ret); + + /* notify OF layer about this connection error event */ + cm_id->rem_ref(cm_id); + + return; +} + + +/** + * cm_event_mpa_req + */ +void cm_event_mpa_req(struct nes_cm_event *event) +{ + struct iw_cm_id *cm_id; + struct iw_cm_event cm_event; + int ret; + struct nes_cm_node *cm_node; + + cm_node = event->cm_node; + if (!cm_node) + return; + cm_id = cm_node->cm_id; + + atomic_inc(&cm_connect_reqs); + nes_debug(NES_DBG_CM, "cm_node = %p - cm_id = %p, jiffies = %lu\n", + cm_node, cm_id, jiffies); + + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; + cm_event.status = IW_CM_EVENT_STATUS_OK; + cm_event.provider_data = (void *)cm_node; + + cm_event.local_addr.sin_family = AF_INET; + cm_event.local_addr.sin_port = htons(event->cm_info.loc_port); + cm_event.local_addr.sin_addr.s_addr = 
htonl(event->cm_info.loc_addr); + + cm_event.remote_addr.sin_family = AF_INET; + cm_event.remote_addr.sin_port = htons(event->cm_info.rem_port); + cm_event.remote_addr.sin_addr.s_addr = htonl(event->cm_info.rem_addr); + + cm_event.private_data = cm_node->mpa_frame_buf; + cm_event.private_data_len = (u8) cm_node->mpa_frame_size; + + ret = cm_id->event_handler(cm_id, &cm_event); + if (ret) + printk("%s[%u] OFA CM event_handler returned, ret=%d\n", + __FUNCTION__, __LINE__, ret); + + return; +} + + +static void nes_cm_event_handler(void *parm); + +/** + * nes_cm_post_event + * post an event to the cm event handler + */ +int nes_cm_post_event(struct nes_cm_event *event) +{ + atomic_inc(&event->cm_node->cm_core->events_posted); + add_ref_cm_node(event->cm_node); + event->cm_info.cm_id->add_ref(event->cm_info.cm_id); + /* NES_INIT_WORK(&event->event_work, nes_cm_event_handler, (void *)event); */ + INIT_WORK(&event->event_work, nes_cm_event_handler); + nes_debug(NES_DBG_CM, "queue_work, event=%p\n", event); + + queue_work(event->cm_node->cm_core->event_wq, &event->event_work); + + nes_debug(NES_DBG_CM, "Exit\n"); + return 0; +} + + +/** + * nes_cm_event_handler + * worker function to handle cm events + * will free instance of nes_cm_event + */ +static void nes_cm_event_handler(void *parm) +{ + struct work_struct *work = parm; + struct nes_cm_event *event = container_of(work, struct nes_cm_event, event_work); + struct nes_cm_core *cm_core; + + if ((!event) || (!event->cm_node) || (!event->cm_node->cm_core)) { + return; + } + cm_core = event->cm_node->cm_core; + nes_debug(NES_DBG_CM, "event=%p, event->type=%u, events posted=%u\n", + event, event->type, atomic_read(&cm_core->events_posted)); + + switch (event->type) { + case NES_CM_EVENT_MPA_REQ: + cm_event_mpa_req(event); + nes_debug(NES_DBG_CM, "CM Event: MPA REQUEST\n"); + break; + case NES_CM_EVENT_RESET: + nes_debug(NES_DBG_CM, "CM Event: RESET\n"); + cm_event_reset(event); + break; + case NES_CM_EVENT_CONNECTED: + if 
((!event->cm_node->cm_id) || (event->cm_node->state != NES_CM_STATE_TSA)) { + break; + } + cm_event_connected(event); + nes_debug(NES_DBG_CM, "CM Event: CONNECTED\n"); + break; + case NES_CM_EVENT_ABORTED: + if ((!event->cm_node->cm_id) || (event->cm_node->state == NES_CM_STATE_TSA)) { + break; + } + cm_event_connect_error(event); + nes_debug(NES_DBG_CM, "CM Event: ABORTED\n"); + break; + case NES_CM_EVENT_DROPPED_PKT: + nes_debug(NES_DBG_CM, "CM Event: DROPPED PKT\n"); + break; + default: + nes_debug(NES_DBG_CM, "CM Event: UNKNOWN EVENT TYPE\n"); + break; + } + + atomic_dec(&cm_core->events_posted); + event->cm_info.cm_id->rem_ref(event->cm_info.cm_id); + rem_ref_cm_node(cm_core, event->cm_node); + kfree(event); + + return; +} From ggrundstrom at neteffect.com Fri Oct 19 13:10:05 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:10:05 -0500 Subject: [ofa-general] [PATCH 4/14 v2] nes: connection manager structures and defines Message-ID: <200710192010.l9JKA5ca021740@neteffect.com> NetEffect connection manager includes, structures and defines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_cm.h 2007-10-19 09:43:32.000000000 -0500 @@ -0,0 +1,433 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef NES_CM_H +#define NES_CM_H + +#define QUEUE_EVENTS + +#define NES_MANAGE_APBVT_DEL 0 +#define NES_MANAGE_APBVT_ADD 1 + +/* IETF MPA -- defines, enums, structs */ +#define IEFT_MPA_KEY_REQ "MPA ID Req Frame" +#define IEFT_MPA_KEY_REP "MPA ID Rep Frame" +#define IETF_MPA_KEY_SIZE 16 +#define IETF_MPA_VERSION 1 + +enum ietf_mpa_flags { + IETF_MPA_FLAGS_MARKERS = 0x80, /* receive Markers */ + IETF_MPA_FLAGS_CRC = 0x40, /* CRC usage */ + IETF_MPA_FLAGS_REJECT = 0x20, /* Reject */ +}; + +struct ietf_mpa_frame { + u8 key[IETF_MPA_KEY_SIZE]; + u8 flags; + u8 rev; + u16 priv_data_len; + u8 priv_data[0]; +}; + +#define ietf_mpa_req_resp_frame ietf_mpa_frame + +struct nes_v4_quad { + u32 rsvd0; + u32 DstIpAdrIndex; /* Only most significant 5 bits are valid */ + u32 SrcIpadr; + u16 TcpPorts[2]; /* src is low, dest is high */ +}; + +struct nes_cm_node; +enum nes_timer_type { + NES_TIMER_TYPE_SEND, + NES_TIMER_TYPE_RECV, + NES_TIMER_NODE_CLEANUP, + NES_TIMER_TYPE_CLOSE, +}; + +#define MAX_NES_IFS 4 + +#define SET_ACK 1 +#define SET_SYN 2 +#define SET_FIN 4 +#define SET_RST 8 + +struct option_base { + u8 optionnum; + u8 length; +}; + +enum option_numbers { + OPTION_NUMBER_END, + OPTION_NUMBER_NONE, + OPTION_NUMBER_MSS, + 
OPTION_NUMBER_WINDOW_SCALE, + OPTION_NUMBER_SACK_PERM, + OPTION_NUMBER_SACK, + OPTION_NUMBER_WRITE0 = 0xbc +}; + +struct option_mss { + u8 optionnum; + u8 length; + u16 mss; +}; + +struct option_windowscale { + u8 optionnum; + u8 length; + u8 shiftcount; +}; + +union all_known_options { + char as_end; + struct option_base as_base; + struct option_mss as_mss; + struct option_windowscale as_windowscale; +}; + +struct nes_timer_entry { + struct list_head list; + unsigned long timetosend; /* jiffies */ + struct sk_buff *skb; + u32 type; + u32 retrycount; + u32 retranscount; + u32 context; + u32 seq_num; + u32 send_retrans; + int close_when_complete; + struct net_device *netdev; +}; + +#define NES_DEFAULT_RETRYS 64 +#define NES_DEFAULT_RETRANS 8 +#ifdef CONFIG_INFINIBAND_NES_DEBUG +#define NES_RETRY_TIMEOUT (1000*HZ/1000) +#else +#define NES_RETRY_TIMEOUT (1000*HZ/10000) +#endif +#define NES_SHORT_TIME (10) +#define NES_LONG_TIME (2000*HZ/1000) + +#define NES_CM_HASHTABLE_SIZE 1024 +#define NES_CM_TCP_TIMER_INTERVAL 3000 +#define NES_CM_DEFAULT_MTU 1540 +#define NES_CM_DEFAULT_FRAME_CNT 10 +#define NES_CM_THREAD_STACK_SIZE 256 +#define NES_CM_DEFAULT_RCV_WND 64240 // before we know that window scaling is allowed +#define NES_CM_DEFAULT_RCV_WND_SCALED 256960 // after we know that window scaling is allowed +#define NES_CM_DEFAULT_RCV_WND_SCALE 2 +#define NES_CM_DEFAULT_FREE_PKTS 0x000A +#define NES_CM_FREE_PKT_LO_WATERMARK 2 + +#define NES_CM_DEF_SEQ 0x159bf75f +#define NES_CM_DEF_LOCAL_ID 0x3b47 + +#define NES_CM_DEF_SEQ2 0x18ed5740 +#define NES_CM_DEF_LOCAL_ID2 0xb807 + +typedef u32 nes_addr_t; + +#define nes_cm_tsa_context nes_qp_context + +struct nes_qp; + +/* cm node transition states */ +enum nes_cm_node_state { + NES_CM_STATE_UNKNOWN, + NES_CM_STATE_INITED, + NES_CM_STATE_LISTENING, + NES_CM_STATE_SYN_RCVD, + NES_CM_STATE_SYN_SENT, + NES_CM_STATE_ONE_SIDE_ESTABLISHED, + NES_CM_STATE_ESTABLISHED, + NES_CM_STATE_ACCEPTING, + NES_CM_STATE_MPAREQ_SENT, + 
NES_CM_STATE_TSA, + NES_CM_STATE_FIN_WAIT1, + NES_CM_STATE_FIN_WAIT2, + NES_CM_STATE_CLOSE_WAIT, + NES_CM_STATE_TIME_WAIT, + NES_CM_STATE_LAST_ACK, + NES_CM_STATE_CLOSING, + NES_CM_STATE_CLOSED +}; + +/* type of nes connection */ +enum nes_cm_conn_type { + NES_CM_IWARP_CONN_TYPE, +}; + +/* CM context params */ +struct nes_cm_tcp_context { + u8 client; + + u32 loc_seq_num; + u32 loc_ack_num; + u32 rem_ack_num; + u32 rcv_nxt; + + u32 loc_id; + u32 rem_id; + + u32 snd_wnd; + u32 max_snd_wnd; + + u32 rcv_wnd; + u32 mss; + u8 snd_wscale; + u8 rcv_wscale; + + struct nes_cm_tsa_context tsa_cntxt; + struct timeval sent_ts; +}; + + +enum nes_cm_listener_state { + NES_CM_LISTENER_PASSIVE_STATE=1, + NES_CM_LISTENER_ACTIVE_STATE=2, + NES_CM_LISTENER_EITHER_STATE=3 +}; + +struct nes_cm_listener { + struct list_head list; + u64 session_id; + struct nes_cm_core *cm_core; + u8 loc_mac[ETH_ALEN]; + nes_addr_t loc_addr; + u16 loc_port; + struct iw_cm_id *cm_id; + enum nes_cm_conn_type conn_type; + atomic_t ref_count; + struct nes_vnic *nesvnic; + atomic_t pend_accepts_cnt; + int backlog; + enum nes_cm_listener_state listener_state; + u32 reused_node; +}; + +/* per connection node and node state information */ +struct nes_cm_node { + u64 session_id; + u32 hashkey; + + nes_addr_t loc_addr, rem_addr; + u16 loc_port, rem_port; + + + u8 loc_mac[ETH_ALEN]; + u8 rem_mac[ETH_ALEN]; + + enum nes_cm_node_state state; + struct nes_cm_tcp_context tcp_cntxt; + struct nes_cm_core *cm_core; + struct sk_buff_head resend_list; + atomic_t ref_count; + struct net_device *netdev; + + struct nes_cm_node *loopbackpartner ; + struct list_head retrans_list; + spinlock_t retrans_list_lock; + struct list_head recv_list; + spinlock_t recv_list_lock; + + int send_write0; + union { + struct ietf_mpa_frame mpa_frame; + u8 mpa_frame_buf[NES_CM_DEFAULT_MTU]; + }; + u16 mpa_frame_size; + struct iw_cm_id *cm_id; + struct list_head list; + int accelerated; + struct nes_cm_listener *listener; + enum nes_cm_conn_type 
conn_type; + struct nes_vnic *nesvnic; + int apbvt_set; + int accept_pend; +}; + +/* structure for client or CM to fill when making CM api calls. */ +/* - only need to set relevant data, based on op. */ +struct nes_cm_info { + union { + struct iw_cm_id *cm_id; + struct net_device *netdev; + }; + + u16 loc_port; + u16 rem_port; + nes_addr_t loc_addr; + nes_addr_t rem_addr; + + enum nes_cm_conn_type conn_type; + int backlog; +}; + +/* CM event codes */ +enum nes_cm_event_type { + NES_CM_EVENT_UNKNOWN, + NES_CM_EVENT_ESTABLISHED, + NES_CM_EVENT_MPA_REQ, + NES_CM_EVENT_MPA_CONNECT, + NES_CM_EVENT_MPA_ACCEPT, + NES_CM_EVENT_MPA_ESTABLISHED, + NES_CM_EVENT_CONNECTED, + NES_CM_EVENT_CLOSED, + NES_CM_EVENT_RESET, + NES_CM_EVENT_DROPPED_PKT, + NES_CM_EVENT_CLOSE_IMMED, + NES_CM_EVENT_CLOSE_HARD, + NES_CM_EVENT_CLOSE_CLEAN, + NES_CM_EVENT_ABORTED, + NES_CM_EVENT_SEND_FIRST +}; + +/* event to post to CM event handler */ +struct nes_cm_event { + enum nes_cm_event_type type; + + struct nes_cm_info cm_info; + struct work_struct event_work; + struct nes_cm_node *cm_node; +}; + +struct nes_cm_core { + enum nes_cm_node_state state; + atomic_t session_id; + + atomic_t listen_node_cnt; + struct nes_cm_node listen_list; + spinlock_t listen_list_lock; + + u32 mtu; + u32 free_tx_pkt_max; + u32 rx_pkt_posted; + struct sk_buff_head tx_free_list; + atomic_t ht_node_cnt; + struct list_head connected_nodes; + /* struct list_head hashtable[NES_CM_HASHTABLE_SIZE]; */ + spinlock_t ht_lock; + + struct timer_list tcp_timer; + + struct nes_cm_ops *api; + + int (*post_event)(struct nes_cm_event *event); + atomic_t events_posted; + struct workqueue_struct *event_wq; + struct workqueue_struct *disconn_wq; + + atomic_t node_cnt; + u64 aborted_connects; + u32 options; + + struct nes_cm_node *current_listen_node; +}; + + +#define NES_CM_SET_PKT_SIZE (1 << 1) +#define NES_CM_SET_FREE_PKT_Q_SIZE (1 << 2) + +/* CM ops/API for client interface */ +struct nes_cm_ops { + int (*accelerated)(struct nes_cm_core 
*, struct nes_cm_node *); + struct nes_cm_listener * (*listen)(struct nes_cm_core *, struct nes_vnic *, + struct nes_cm_info *); + int (*stop_listener)(struct nes_cm_core *, struct nes_cm_listener *); + struct nes_cm_node * (*connect)(struct nes_cm_core *, + struct nes_vnic *, struct ietf_mpa_frame *, + struct nes_cm_info *); + int (*close)(struct nes_cm_core *, struct nes_cm_node *); + int (*accept)(struct nes_cm_core *, struct ietf_mpa_frame *, + struct nes_cm_node *); + int (*reject)(struct nes_cm_core *, struct ietf_mpa_frame *, + struct nes_cm_node *); + int (*recv_pkt)(struct nes_cm_core *, struct nes_vnic *, + struct sk_buff *); + int (*destroy_cm_core)(struct nes_cm_core *); + int (*get)(struct nes_cm_core *); + int (*set)(struct nes_cm_core *, u32, u32); +}; + + +int send_mpa_request(struct nes_cm_node *); +struct sk_buff *form_cm_frame(struct sk_buff *, struct nes_cm_node *, + void *, u32, void *, u32, u8); +int schedule_nes_timer(struct nes_cm_node *, struct sk_buff *, + enum nes_timer_type, int, int); +void nes_cm_timer_tick(unsigned long); +int send_syn(struct nes_cm_node *, u32); +int send_reset(struct nes_cm_node *); +int send_ack(struct nes_cm_node *); +int send_fin(struct nes_cm_node *, struct sk_buff *); +struct sk_buff *get_free_pkt(struct nes_cm_node *); +int process_packet(struct nes_cm_node *, struct sk_buff *, struct nes_cm_core *); + +struct nes_cm_node * mini_cm_connect(struct nes_cm_core *, + struct nes_vnic *, struct ietf_mpa_frame *, struct nes_cm_info *); +int mini_cm_accept(struct nes_cm_core *, struct ietf_mpa_frame *, struct nes_cm_node *); +int mini_cm_reject(struct nes_cm_core *, struct ietf_mpa_frame *, struct nes_cm_node *); +int mini_cm_close(struct nes_cm_core *, struct nes_cm_node *); +int mini_cm_recv_pkt(struct nes_cm_core *, struct nes_vnic *, struct sk_buff *); +struct nes_cm_core *mini_cm_alloc_core(struct nes_cm_info *); +int mini_cm_dealloc_core(struct nes_cm_core *); +int mini_cm_get(struct nes_cm_core *); +int 
mini_cm_set(struct nes_cm_core *, u32, u32); + +int nes_cm_disconn(struct nes_qp *); +void nes_disconnect_worker(void *); +int nes_cm_disconn_true(struct nes_qp *); +int nes_disconnect(struct nes_qp *, int); + +int nes_accept(struct iw_cm_id *, struct iw_cm_conn_param *); +int nes_reject(struct iw_cm_id *, const void *, u8); +int nes_connect(struct iw_cm_id *, struct iw_cm_conn_param *); +int nes_create_listen(struct iw_cm_id *, int); +int nes_destroy_listen(struct iw_cm_id *); + +int nes_cm_recv(struct sk_buff *, struct net_device *); +int nes_cm_start(void); +int nes_cm_stop(void); + +/* CM event handler functions */ +void cm_event_connected(struct nes_cm_event *); +void cm_event_connect_error(struct nes_cm_event *); +void cm_event_reset(struct nes_cm_event *); +void cm_event_mpa_req(struct nes_cm_event *); +int nes_cm_post_event(struct nes_cm_event *); + +#endif /* NES_CM_H */ + From ggrundstrom at neteffect.com Fri Oct 19 13:12:14 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:12:14 -0500 Subject: [ofa-general] [PATCH 5/14 v2] nes: context structures and defines Message-ID: <200710192012.l9JKCEat021753@neteffect.com> QP context structures and defines Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_context.h 2007-10-19 09:43:32.000000000 -0500 @@ -0,0 +1,193 @@ +/* + * Copyright (c) 2006 NetEffect, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef NES_CONTEXT_H +#define NES_CONTEXT_H + +struct nes_qp_context { + u32 misc; + u32 cqs; + u32 sq_addr_low; + u32 sq_addr_high; + u32 rq_addr_low; + u32 rq_addr_high; + u32 misc2; + u16 tcpPorts[2]; + u32 ip0; + u32 ip1; + u32 ip2; + u32 ip3; + u32 mss; + u32 arp_index_vlan; + u32 tcp_state_flow_label; + u32 pd_index_wscale; + u32 keepalive; + u32 ts_recent; + u32 ts_age; + u32 snd_nxt; + u32 snd_wnd; + u32 rcv_nxt; + u32 rcv_wnd; + u32 snd_max; + u32 snd_una; + u32 srtt; + u32 rttvar; + u32 ssthresh; + u32 cwnd; + u32 snd_wl1; + u32 snd_wl2; + u32 max_snd_wnd; + u32 ts_val_delta; + u32 retransmit; + u32 probe_cnt; + u32 hte_index; + u32 q2_addr_low; + u32 q2_addr_high; + u32 ird_index; + u32 Rsvd3; + u32 ird_ord_sizes; + u32 mrkr_offset; + u32 aeq_token_low; + u32 aeq_token_high; +}; + +/* QP Context Misc Field */ + +#define NES_QPCONTEXT_MISC_IWARP_VER_MASK 0x00000003 +#define NES_QPCONTEXT_MISC_IWARP_VER_SHIFT 0 +#define NES_QPCONTEXT_MISC_EFB_SIZE_MASK 0x000000C0 +#define NES_QPCONTEXT_MISC_EFB_SIZE_SHIFT 6 +#define NES_QPCONTEXT_MISC_RQ_SIZE_MASK 0x00000300 +#define NES_QPCONTEXT_MISC_RQ_SIZE_SHIFT 8 +#define NES_QPCONTEXT_MISC_SQ_SIZE_MASK 0x00000c00 +#define NES_QPCONTEXT_MISC_SQ_SIZE_SHIFT 10 +#define NES_QPCONTEXT_MISC_PCI_FCN_MASK 0x00007000 +#define NES_QPCONTEXT_MISC_PCI_FCN_SHIFT 12 +#define NES_QPCONTEXT_MISC_DUP_ACKS_MASK 0x00070000 +#define NES_QPCONTEXT_MISC_DUP_ACKS_SHIFT 16 + +enum nes_qp_context_misc_bits { + NES_QPCONTEXT_MISC_RX_WQE_SIZE = 0x00000004, + NES_QPCONTEXT_MISC_IPV4 = 0x00000008, + NES_QPCONTEXT_MISC_DO_NOT_FRAG = 0x00000010, + NES_QPCONTEXT_MISC_INSERT_VLAN = 0x00000020, + NES_QPCONTEXT_MISC_DROS = 0x00008000, + NES_QPCONTEXT_MISC_WSCALE = 0x00080000, + NES_QPCONTEXT_MISC_KEEPALIVE = 0x00100000, + NES_QPCONTEXT_MISC_TIMESTAMP = 0x00200000, + NES_QPCONTEXT_MISC_SACK = 0x00400000, + NES_QPCONTEXT_MISC_RDMA_WRITE_EN = 0x00800000, + NES_QPCONTEXT_MISC_RDMA_READ_EN = 0x01000000, + NES_QPCONTEXT_MISC_WBIND_EN = 0x10000000, 
+ NES_QPCONTEXT_MISC_FAST_REGISTER_EN = 0x20000000, + NES_QPCONTEXT_MISC_PRIV_EN = 0x40000000, + NES_QPCONTEXT_MISC_NO_NAGLE = 0x80000000 +}; + +enum nes_qp_acc_wq_sizes { + HCONTEXT_TSA_WQ_SIZE_4 = 0, + HCONTEXT_TSA_WQ_SIZE_32 = 1, + HCONTEXT_TSA_WQ_SIZE_128 = 2, + HCONTEXT_TSA_WQ_SIZE_512 = 3 +}; + +/* QP Context Misc2 Fields */ +#define NES_QPCONTEXT_MISC2_TTL_MASK 0x000000ff +#define NES_QPCONTEXT_MISC2_TTL_SHIFT 0 +#define NES_QPCONTEXT_MISC2_HOP_LIMIT_MASK 0x000000ff +#define NES_QPCONTEXT_MISC2_HOP_LIMIT_SHIFT 0 +#define NES_QPCONTEXT_MISC2_LIMIT_MASK 0x00000300 +#define NES_QPCONTEXT_MISC2_LIMIT_SHIFT 8 +#define NES_QPCONTEXT_MISC2_NIC_INDEX_MASK 0x0000fc00 +#define NES_QPCONTEXT_MISC2_NIC_INDEX_SHIFT 10 +#define NES_QPCONTEXT_MISC2_SRC_IP_MASK 0x001f0000 +#define NES_QPCONTEXT_MISC2_SRC_IP_SHIFT 16 +#define NES_QPCONTEXT_MISC2_TOS_MASK 0xff000000 +#define NES_QPCONTEXT_MISC2_TOS_SHIFT 24 +#define NES_QPCONTEXT_MISC2_TRAFFIC_CLASS_MASK 0xff000000 +#define NES_QPCONTEXT_MISC2_TRAFFIC_CLASS_SHIFT 24 + +/* QP Context Tcp State/Flow Label Fields */ +#define NES_QPCONTEXT_TCPFLOW_FLOW_LABEL_MASK 0x000fffff +#define NES_QPCONTEXT_TCPFLOW_FLOW_LABEL_SHIFT 0 +#define NES_QPCONTEXT_TCPFLOW_TCP_STATE_MASK 0xf0000000 +#define NES_QPCONTEXT_TCPFLOW_TCP_STATE_SHIFT 28 + +enum nes_qp_tcp_state { + NES_QPCONTEXT_TCPSTATE_CLOSED = 1, + NES_QPCONTEXT_TCPSTATE_EST = 5, + NES_QPCONTEXT_TCPSTATE_TIME_WAIT = 11, +}; + +/* QP Context PD Index/wscale Fields */ +#define NES_QPCONTEXT_PDWSCALE_RCV_WSCALE_MASK 0x0000000f +#define NES_QPCONTEXT_PDWSCALE_RCV_WSCALE_SHIFT 0 +#define NES_QPCONTEXT_PDWSCALE_SND_WSCALE_MASK 0x00000f00 +#define NES_QPCONTEXT_PDWSCALE_SND_WSCALE_SHIFT 8 +#define NES_QPCONTEXT_PDWSCALE_PDINDEX_MASK 0xffff0000 +#define NES_QPCONTEXT_PDWSCALE_PDINDEX_SHIFT 16 + +/* QP Context Keepalive Fields */ +#define NES_QPCONTEXT_KEEPALIVE_DELTA_MASK 0x0000ffff +#define NES_QPCONTEXT_KEEPALIVE_DELTA_SHIFT 0 +#define NES_QPCONTEXT_KEEPALIVE_PROBE_CNT_MASK 0x00ff0000 
+#define NES_QPCONTEXT_KEEPALIVE_PROBE_CNT_SHIFT 16 +#define NES_QPCONTEXT_KEEPALIVE_INTV_MASK 0xff000000 +#define NES_QPCONTEXT_KEEPALIVE_INTV_SHIFT 24 + +/* QP Context ORD/IRD Fields */ +#define NES_QPCONTEXT_ORDIRD_ORDSIZE_MASK 0x0000007f +#define NES_QPCONTEXT_ORDIRD_ORDSIZE_SHIFT 0 +#define NES_QPCONTEXT_ORDIRD_IRDSIZE_MASK 0x00030000 +#define NES_QPCONTEXT_ORDIRD_IRDSIZE_SHIFT 16 +#define NES_QPCONTEXT_ORDIRD_IWARP_MODE_MASK 0x30000000 +#define NES_QPCONTEXT_ORDIRD_IWARP_MODE_SHIFT 28 + +enum nes_ord_ird_bits { + NES_QPCONTEXT_ORDIRD_WRPDU = 0x02000000, + NES_QPCONTEXT_ORDIRD_LSMM_PRESENT = 0x04000000, + NES_QPCONTEXT_ORDIRD_ALSMM = 0x08000000, + NES_QPCONTEXT_ORDIRD_AAH = 0x40000000, + NES_QPCONTEXT_ORDIRD_RNMC = 0x80000000 +}; + +enum nes_iwarp_qp_state { + NES_QPCONTEXT_IWARP_STATE_NONEXIST = 0, + NES_QPCONTEXT_IWARP_STATE_IDLE = 1, + NES_QPCONTEXT_IWARP_STATE_RTS = 2, + NES_QPCONTEXT_IWARP_STATE_CLOSING = 3, + NES_QPCONTEXT_IWARP_STATE_TERMINATE = 5, + NES_QPCONTEXT_IWARP_STATE_ERROR = 6 +}; + + +#endif /* NES_CONTEXT_H */ From ggrundstrom at neteffect.com Fri Oct 19 13:14:04 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:14:04 -0500 Subject: [ofa-general] [PATCH 6/14 v2] nes: hardware init Message-ID: <200710192014.l9JKE4gP021766@neteffect.com> Hardware initialization and interrupt processing. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_hw.c 2007-10-19 09:43:32.000000000 -0500 @@ -0,0 +1,2758 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. +* + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "nes.h" + +u32 crit_err_count = 0; + +#include "nes_cm.h" + + +#ifdef CONFIG_INFINIBAND_NES_DEBUG +static unsigned char *nes_iwarp_state_str[] = { + "Non-Existent", + "Idle", + "RTS", + "Closing", + "RSVD1", + "Terminate", + "Error", + "RSVD2", +}; + +static unsigned char *nes_tcp_state_str[] = { + "Non-Existent", + "Closed", + "Listen", + "SYN Sent", + "SYN Rcvd", + "Established", + "Close Wait", + "FIN Wait 1", + "Closing", + "Last Ack", + "FIN Wait 2", + "Time Wait", + "RSVD1", + "RSVD2", + "RSVD3", + "RSVD4", +}; +#endif + + +/** + * nes_init_adapter - initialize adapter + */ +struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) { + struct nes_adapter *nesadapter = NULL; + unsigned long num_pds; + u32 u32temp; + u32 port_count; + u16 max_rq_wrs; + u16 max_sq_wrs; + u32 max_mr; + u32 max_256pbl; + u32 max_4kpbl; + u32 max_qp; + u32 max_irrq; + u32 max_cq; + u32 hte_index_mask; + u32 adapter_size; + u32 arp_table_size; + u8 OneG_Mode; + + /* search the list of existing adapters */ + list_for_each_entry(nesadapter, &nes_adapter_list, list) { + nes_debug(NES_DBG_INIT, "Searching Adapter list for PCI devfn = 0x%X," + " adapter PCI slot/bus = %u/%u, pci device PCI slot/bus = %u/%u.\n", + nesdev->pcidev->devfn, + PCI_SLOT(nesadapter->devfn), + nesadapter->bus_number, + PCI_SLOT(nesdev->pcidev->devfn), + nesdev->pcidev->bus->number ); + if ((PCI_SLOT(nesadapter->devfn) == PCI_SLOT(nesdev->pcidev->devfn)) && + (nesadapter->bus_number == nesdev->pcidev->bus->number)) { + nesadapter->ref_count++; + return nesadapter; + } + } + + /* no adapter found */ + num_pds = pci_resource_len(nesdev->pcidev, BAR_1) / 4096; + if ((hw_rev != NE020_REV) && (hw_rev != NE020_REV1)) { + nes_debug(NES_DBG_INIT, "NE020 driver detected unknown hardware revision 0x%x\n", + hw_rev); + return NULL; + } + + nes_debug(NES_DBG_INIT, "Determine Soft Reset, QP_control=0x%x, CPU0=0x%x, 
CPU1=0x%x, CPU2=0x%x\n", + nes_read_indexed(nesdev, NES_IDX_QP_CONTROL + PCI_FUNC(nesdev->pcidev->devfn) * 8), + nes_read_indexed(nesdev, NES_IDX_INT_CPU_STATUS), + nes_read_indexed(nesdev, NES_IDX_INT_CPU_STATUS + 4), + nes_read_indexed(nesdev, NES_IDX_INT_CPU_STATUS + 8)); + + nes_debug(NES_DBG_INIT, "Reset and init NE020\n"); + if ((port_count = nes_reset_adapter_ne020(nesdev, &OneG_Mode)) == 0) { + return NULL; + } + if (nes_init_serdes(nesdev, hw_rev, port_count, OneG_Mode)) { + return NULL; + } + nes_init_csr_ne020(nesdev, hw_rev, port_count); + + /* Setup and enable the periodic timer */ + nesdev->et_rx_coalesce_usecs_irq = interrupt_mod_interval; + if (nesdev->et_rx_coalesce_usecs_irq) { + nes_write32(nesdev->regs+NES_PERIODIC_CONTROL, 0x80000000 | + ((u32)(nesdev->et_rx_coalesce_usecs_irq * 8))); + } else { + nes_write32(nesdev->regs+NES_PERIODIC_CONTROL, 0x00000000); + } + + max_qp = nes_read_indexed(nesdev, NES_IDX_QP_CTX_SIZE); + nes_debug(NES_DBG_INIT, "%s: QP_CTX_SIZE=%u\n", __FUNCTION__, max_qp); + + u32temp = nes_read_indexed(nesdev, NES_IDX_QUAD_HASH_TABLE_SIZE); + if (max_qp > ((u32)1 << (u32temp & 0x001f))) { + nes_debug(NES_DBG_INIT, "Reducing Max QPs to %u due to hash table size = 0x%08X\n", + max_qp, u32temp); + max_qp = (u32)1 << (u32temp & 0x001f); + } + + hte_index_mask = ((u32)1 << ((u32temp & 0x001f)+1))-1; + nes_debug(NES_DBG_INIT, "Max QP = %u, hte_index_mask = 0x%08X.\n", + max_qp, hte_index_mask); + + u32temp = nes_read_indexed(nesdev, NES_IDX_IRRQ_COUNT); + + max_irrq = 1 << (u32temp & 0x001f); + + if (max_qp > max_irrq) { + max_qp = max_irrq; + nes_debug(NES_DBG_INIT, "Reducing Max QPs to %u due to Available Q1s.\n", + max_qp); + } + + /* there should be no reason to allocate more pds than qps */ + if (num_pds > max_qp) + num_pds = max_qp; + + u32temp = nes_read_indexed(nesdev, NES_IDX_MRT_SIZE); + max_mr = (u32)8192 << (u32temp & 0x7); + + u32temp = nes_read_indexed(nesdev, NES_IDX_PBL_REGION_SIZE); + max_256pbl = (u32)1 << 
(u32temp & 0x0000001f); + max_4kpbl = (u32)1 << ((u32temp >> 16) & 0x0000001f); + max_cq = nes_read_indexed(nesdev, NES_IDX_CQ_CTX_SIZE); + + u32temp = nes_read_indexed(nesdev, NES_IDX_ARP_CACHE_SIZE); + arp_table_size = 1 << u32temp; + + adapter_size = (sizeof(struct nes_adapter) + + (sizeof(unsigned long)-1)) & (~(sizeof(unsigned long)-1)); + adapter_size += sizeof(unsigned long) * BITS_TO_LONGS(max_qp); + adapter_size += sizeof(unsigned long) * BITS_TO_LONGS(max_mr); + adapter_size += sizeof(unsigned long) * BITS_TO_LONGS(max_cq); + adapter_size += sizeof(unsigned long) * BITS_TO_LONGS(num_pds); + adapter_size += sizeof(unsigned long) * BITS_TO_LONGS(arp_table_size); + adapter_size += sizeof(struct nes_qp **) * max_qp; + + /* allocate a new adapter struct */ + nesadapter = kmalloc(adapter_size, GFP_KERNEL); + if (nesadapter == NULL) { + return NULL; + } + memset(nesadapter, 0, adapter_size); + nes_debug(NES_DBG_INIT, "Allocating new nesadapter @ %p, size = %u (actual size = %u).\n", + nesadapter, (u32)sizeof(struct nes_adapter), adapter_size); + + /* populate the new nesadapter */ + nesadapter->devfn = nesdev->pcidev->devfn; + nesadapter->bus_number = nesdev->pcidev->bus->number; + nesadapter->ref_count = 1; + nesadapter->timer_int_req = 0xffff0000; + nesadapter->OneG_Mode = OneG_Mode; + + /* nesadapter->tick_delta = clk_divisor; */ + nesadapter->hw_rev = hw_rev; + nesadapter->port_count = port_count; + + nesadapter->max_qp = max_qp; + nesadapter->hte_index_mask = hte_index_mask; + nesadapter->max_irrq = max_irrq; + nesadapter->max_mr = max_mr; + nesadapter->max_256pbl = max_256pbl - 1; + nesadapter->max_4kpbl = max_4kpbl - 1; + nesadapter->max_cq = max_cq; + nesadapter->free_256pbl = max_256pbl - 1; + nesadapter->free_4kpbl = max_4kpbl - 1; + nesadapter->max_pd = num_pds; + nesadapter->arp_table_size = arp_table_size; + nesadapter->base_pd = 1; + + nesadapter->device_cap_flags = + IB_DEVICE_ZERO_STAG | IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW; + + 
nesadapter->allocated_qps = (unsigned long *)&(((unsigned char *)nesadapter) + [(sizeof(struct nes_adapter)+(sizeof(unsigned long)-1))&(~(sizeof(unsigned long)-1))]); + nesadapter->allocated_cqs = &nesadapter->allocated_qps[BITS_TO_LONGS(max_qp)]; + nesadapter->allocated_mrs = &nesadapter->allocated_cqs[BITS_TO_LONGS(max_cq)]; + nesadapter->allocated_pds = &nesadapter->allocated_mrs[BITS_TO_LONGS(max_mr)]; + nesadapter->allocated_arps = &nesadapter->allocated_pds[BITS_TO_LONGS(num_pds)]; + nesadapter->qp_table = (struct nes_qp **)(&nesadapter->allocated_arps[BITS_TO_LONGS(arp_table_size)]); + + + /* mark the usual suspect QPs and CQs as in use */ + for (u32temp = 0; u32temp < NES_FIRST_QPN; u32temp++) { + set_bit(u32temp, nesadapter->allocated_qps); + set_bit(u32temp, nesadapter->allocated_cqs); + } + + u32temp = nes_read_indexed(nesdev, NES_IDX_QP_MAX_CFG_SIZES); + + max_rq_wrs = ((u32temp >> 8) & 3); + switch (max_rq_wrs) { + case 0: + max_rq_wrs = 4; + break; + case 1: + max_rq_wrs = 16; + break; + case 2: + max_rq_wrs = 32; + break; + case 3: + max_rq_wrs = 512; + break; + } + + max_sq_wrs = (u32temp & 3); + switch (max_sq_wrs) { + case 0: + max_sq_wrs = 4; + break; + case 1: + max_sq_wrs = 16; + break; + case 2: + max_sq_wrs = 32; + break; + case 3: + max_sq_wrs = 512; + break; + } + nesadapter->max_qp_wr = min(max_rq_wrs, max_sq_wrs); + + nesadapter->max_irrq_wr = (u32temp >> 16) & 3; + + nesadapter->max_sge = 4; + nesadapter->max_cqe = 32767; + + if (nes_read_eeprom_values(nesdev, nesadapter)) { + printk(KERN_ERR PFX "Unable to read EEPROM data.\n"); + kfree(nesadapter); + return NULL; + } + + u32temp = nes_read_indexed(nesdev, NES_IDX_TCP_TIMER_CONFIG); + nes_write_indexed(nesdev, NES_IDX_TCP_TIMER_CONFIG, + (u32temp & 0xff000000) | (nesadapter->tcp_timer_core_clk_divisor & 0x00ffffff)); + + /* setup port configuration */ + if (nesadapter->port_count == 1) { + u32temp = 0x00000000; + if (nes_drv_opt & NES_DRV_OPT_DUAL_LOGICAL_PORT) { + 
nes_write_indexed(nesdev, NES_IDX_TX_POOL_SIZE, 0x00000002); + } else { + nes_write_indexed(nesdev, NES_IDX_TX_POOL_SIZE, 0x00000003); + } + } else { + if (nesadapter->port_count == 2) { + u32temp = 0x00000044; + } else { + u32temp = 0x000000e4; + } + nes_write_indexed(nesdev, NES_IDX_TX_POOL_SIZE, 0x00000003); + } + + nes_write_indexed(nesdev, NES_IDX_NIC_LOGPORT_TO_PHYPORT, u32temp); + nes_debug(NES_DBG_INIT, "Probe time, LOG2PHY=%u\n", + nes_read_indexed(nesdev, NES_IDX_NIC_LOGPORT_TO_PHYPORT)); + + spin_lock_init(&nesadapter->resource_lock); + spin_lock_init(&nesadapter->phy_lock); + + INIT_LIST_HEAD(&nesadapter->nesvnic_list[0]); + INIT_LIST_HEAD(&nesadapter->nesvnic_list[1]); + INIT_LIST_HEAD(&nesadapter->nesvnic_list[2]); + INIT_LIST_HEAD(&nesadapter->nesvnic_list[3]); + + if (nesadapter->hw_rev == NE020_REV) { + init_timer(&nesadapter->mh_timer); + nesadapter->mh_timer.function = nes_mh_fix; + nesadapter->mh_timer.expires = jiffies + (HZ/5); /* 1/5 second */ + nesadapter->mh_timer.data = (unsigned long)nesdev; + add_timer(&nesadapter->mh_timer); + } else { + nes_write32(nesdev->regs+NES_INTF_INT_STAT, 0x0f000000); + } + + list_add_tail(&nesadapter->list, &nes_adapter_list); + + return nesadapter; +} + + +/** + * nes_reset_adapter_ne020 + */ +unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_Mode) +{ + u32 port_count; + u32 u32temp; + u32 i; + + u32temp = nes_read32(nesdev->regs+NES_SOFTWARE_RESET); + port_count = ((u32temp & 0x00000300) >> 8) + 1; + /* TODO: assuming that both SERDES are set the same for now */ + *OneG_Mode = (u32temp & 0x00003c00) ? 
0 : 1; + nes_debug(NES_DBG_INIT, "Initial Software Reset = 0x%08X, port_count=%u\n", + u32temp, port_count); + if (*OneG_Mode) { + nes_debug(NES_DBG_INIT, "Running in 1G mode.\n"); + } + u32temp &= 0xff00ffc0; + switch (port_count) { + case 1: + u32temp |= 0x00ee0000; + break; + case 2: + u32temp |= 0x00cc0000; + break; + case 4: + u32temp |= 0x00000000; + break; + default: + return 0; + break; + } + + /* check and do full reset if needed */ + if (nes_read_indexed(nesdev, NES_IDX_QP_CONTROL+(PCI_FUNC(nesdev->pcidev->devfn)*8))) { + nes_debug(NES_DBG_INIT, "Issuing Full Soft reset = 0x%08X\n", u32temp | 0xd); + nes_write32(nesdev->regs+NES_SOFTWARE_RESET, u32temp | 0xd); + + i = 0; + while (((nes_read32(nesdev->regs+NES_SOFTWARE_RESET) & 0x00000040) == 0) && i++ < 10000) { + mdelay(1); + } + if (i >= 10000) { + nes_debug(NES_DBG_INIT, "Did not see full soft reset done.\n"); + return 0; + } + } + + /* port reset */ + switch (port_count) { + case 1: + u32temp |= 0x00ee0010; + break; + case 2: + u32temp |= 0x00cc0030; + break; + case 4: + u32temp |= 0x00000030; + break; + } + + nes_debug(NES_DBG_INIT, "Issuing Port Soft reset = 0x%08X\n", u32temp | 0xd); + nes_write32(nesdev->regs+NES_SOFTWARE_RESET, u32temp | 0xd); + + i = 0; + while (((nes_read32(nesdev->regs+NES_SOFTWARE_RESET) & 0x00000040) == 0) && i++ < 10000) { + mdelay(1); + } + if (i >= 10000) { + nes_debug(NES_DBG_INIT, "Did not see port soft reset done.\n"); + return 0; + } + + /* serdes 0 */ + i = 0; + while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS0) + & 0x0000000f)) != 0x0000000f) && i++ < 5000) { + mdelay(1); + } + if (i >= 5000) { + nes_debug(NES_DBG_INIT, "Serdes 0 not ready, status=%x\n", u32temp); + return 0; + } + + /* serdes 1 */ + if (port_count > 1) { + i = 0; + while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS1) + & 0x0000000f)) != 0x0000000f) && i++ < 5000) { + mdelay(1); + } + if (i >= 5000) { + nes_debug(NES_DBG_INIT, "Serdes 1 not 
ready, status=%x\n", u32temp); + return 0; + } + } + + i = 0; + while ((nes_read_indexed(nesdev, NES_IDX_INT_CPU_STATUS) != 0x80) && i++ < 10000) { + mdelay(1); + } + if (i >= 10000) { + printk(KERN_ERR PFX "Internal CPU not ready, status = %02X\n", + nes_read_indexed(nesdev, NES_IDX_INT_CPU_STATUS)); + return 0; + } + + return port_count; +} + + +/** + * nes_init_serdes + */ +int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count, u8 OneG_Mode) +{ + int i; + u32 u32temp; + + if (hw_rev != NE020_REV) { + /* init serdes 0 */ + + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL0, 0x000000FF); + if (!OneG_Mode) { + + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE0, 0x11110000); + } + if (port_count > 1) { + /* init serdes 1 */ + + // nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, 0x0000F008); + + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000000FF); + if (!OneG_Mode) { + + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE1, 0x11110000); + } + } + } else { + /* init serdes 0 */ + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0, 0x00000008); + i = 0; + while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS0) + & 0x0000000f)) != 0x0000000f) && i++ < 5000) { + mdelay(1); + } + if (i >= 5000) { + nes_debug(NES_DBG_PHY, "Init: serdes 0 not ready, status=%x\n", u32temp); + return 1; + } + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP0, 0x000bdef7); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_DRIVE0, 0x9ce73000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_MODE0, 0x0ff00000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_SIGDET0, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_BYPASS0, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_LOOPBACK_CONTROL0, 0x00000000); + if (OneG_Mode) { + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_EQ_CONTROL0, 0xf0182222); + } else { + nes_write_indexed(nesdev, 
NES_IDX_ETH_SERDES_RX_EQ_CONTROL0, 0xf0042222); + } + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL0, 0x000000ff); + + if (port_count > 1) { + /* init serdes 1 */ + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, 0x00000048); + i = 0; + while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS1) & 0x0000000f)) != 0x0000000f) && + (i++ < 5000)) { + mdelay(1); + } + if (i >= 5000) { + printk("%s: Init: serdes 1 not ready, status=%x\n", __FUNCTION__, u32temp); + /* return 1; */ + } + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP1, 0x000bdef7); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_DRIVE1, 0x9ce73000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_MODE1, 0x0ff00000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_SIGDET1, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_BYPASS1, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_LOOPBACK_CONTROL1, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_EQ_CONTROL1, 0xf0002222); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000000ff); + } + } + return 0; +} + + +/** + * nes_init_csr_ne020 + * Initialize registers for ne020 hardware + */ +void nes_init_csr_ne020(struct nes_device *nesdev, u8 hw_rev, u8 port_count) +{ + u32 u32temp; + + nes_write_indexed(nesdev, 0x000001E4, 0x00000007); + /* nes_write_indexed(nesdev, 0x000001E8, 0x000208C4); */ + nes_write_indexed(nesdev, 0x000001E8, 0x00020844); + nes_write_indexed(nesdev, 0x000001D8, 0x00048002); + /* nes_write_indexed(nesdev, 0x000001D8, 0x0004B002); */ + nes_write_indexed(nesdev, 0x000001FC, 0x00050005); + nes_write_indexed(nesdev, 0x00000600, 0x55555555); + nes_write_indexed(nesdev, 0x00000604, 0x55555555); + + /* TODO: move these MAC register settings to NIC bringup */ + nes_write_indexed(nesdev, 0x00002000, 0x00000001); + nes_write_indexed(nesdev, 0x00002004, 0x00000001); + nes_write_indexed(nesdev, 0x00002008, 0x0000FFFF); + nes_write_indexed(nesdev, 
0x0000200C, 0x00000001); + nes_write_indexed(nesdev, 0x00002010, 0x000003c1); + nes_write_indexed(nesdev, 0x0000201C, 0x75345678); + if (port_count > 1) { + nes_write_indexed(nesdev, 0x00002200, 0x00000001); + nes_write_indexed(nesdev, 0x00002204, 0x00000001); + nes_write_indexed(nesdev, 0x00002208, 0x0000FFFF); + nes_write_indexed(nesdev, 0x0000220C, 0x00000001); + nes_write_indexed(nesdev, 0x00002210, 0x000003c1); + nes_write_indexed(nesdev, 0x0000221C, 0x75345678); + } + if (port_count > 2) { + nes_write_indexed(nesdev, 0x00002400, 0x00000001); + nes_write_indexed(nesdev, 0x00002404, 0x00000001); + nes_write_indexed(nesdev, 0x00002408, 0x0000FFFF); + nes_write_indexed(nesdev, 0x0000240C, 0x00000001); + nes_write_indexed(nesdev, 0x00002410, 0x000003c1); + nes_write_indexed(nesdev, 0x0000241C, 0x75345678); + + nes_write_indexed(nesdev, 0x00002600, 0x00000001); + nes_write_indexed(nesdev, 0x00002604, 0x00000001); + nes_write_indexed(nesdev, 0x00002608, 0x0000FFFF); + nes_write_indexed(nesdev, 0x0000260C, 0x00000001); + nes_write_indexed(nesdev, 0x00002610, 0x000003c1); + nes_write_indexed(nesdev, 0x0000261C, 0x75345678); + } + + nes_write_indexed(nesdev, 0x00005000, 0x00018000); + /* nes_write_indexed(nesdev, 0x00005000, 0x00010000); */ + nes_write_indexed(nesdev, 0x00005004, 0x00020001); + nes_write_indexed(nesdev, 0x00005008, 0x1F1F1F1F); + nes_write_indexed(nesdev, 0x00005010, 0x1F1F1F1F); + nes_write_indexed(nesdev, 0x00005018, 0x1F1F1F1F); + nes_write_indexed(nesdev, 0x00005020, 0x1F1F1F1F); + nes_write_indexed(nesdev, 0x00006090, 0xFFFFFFFF); + + /* TODO: move this to code, get from EEPROM */ + nes_write_indexed(nesdev, 0x00000900, 0x20000001); + nes_write_indexed(nesdev, 0x000060C0, 0x0000028e); + nes_write_indexed(nesdev, 0x000060C8, 0x00000020); + + nes_write_indexed(nesdev, 0x000001EC, 0x5b2625a0); + /* nes_write_indexed(nesdev, 0x000001EC, 0x5f2625a0); */ + + if (hw_rev != NE020_REV) { + u32temp = nes_read_indexed(nesdev, 0x000008e8); + u32temp |= 
0x80000000; + nes_write_indexed(nesdev, 0x000008e8, u32temp); + } +} + + +/** + * nes_destroy_adapter - destroy the adapter structure + */ +void nes_destroy_adapter(struct nes_adapter *nesadapter) +{ + struct nes_adapter *tmp_adapter; + + list_for_each_entry(tmp_adapter, &nes_adapter_list, list) { + nes_debug(NES_DBG_SHUTDOWN, "Nes Adapter list entry = 0x%p.\n", + tmp_adapter); + } + + nesadapter->ref_count--; + if (!nesadapter->ref_count) { + if (nesadapter->hw_rev == NE020_REV) { + del_timer(&nesadapter->mh_timer); + } + + list_del(&nesadapter->list); + kfree(nesadapter); + } +} + + +/** + * nes_init_cqp + */ +int nes_init_cqp(struct nes_device *nesdev) +{ + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_hw_cqp_qp_context *cqp_qp_context; + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_hw_ceq *ceq; + struct nes_hw_ceq *nic_ceq; + struct nes_hw_aeq *aeq; + void *vmem; + dma_addr_t pmem; + u32 count=0; + u32 cqp_head; + u64 u64temp; + u32 u32temp; + +#define NES_NIC_CEQ_SIZE 8 +/* NICs will be on a separate CQ */ +#define NES_CCEQ_SIZE ((nesadapter->max_cq / nesadapter->port_count) - 32) + + /* allocate CQP memory */ + /* Need to add max_cq to the aeq size once cq overflow checking is added back */ + /* SQ is 512 byte aligned, others are 256 byte aligned */ + nesdev->cqp_mem_size = 512 + + (sizeof(struct nes_hw_cqp_wqe) * NES_CQP_SQ_SIZE) + + (sizeof(struct nes_hw_cqe) * NES_CCQ_SIZE) + + max(((u32)sizeof(struct nes_hw_ceqe) * NES_CCEQ_SIZE), (u32)256) + + max(((u32)sizeof(struct nes_hw_ceqe) * NES_NIC_CEQ_SIZE), (u32)256) + + (sizeof(struct nes_hw_aeqe) * nesadapter->max_qp) + + sizeof(struct nes_hw_cqp_qp_context); + + nesdev->cqp_vbase = pci_alloc_consistent(nesdev->pcidev, nesdev->cqp_mem_size, + &nesdev->cqp_pbase); + if (!nesdev->cqp_vbase) { + nes_debug(NES_DBG_INIT, "Unable to allocate memory for host descriptor rings\n"); + return -ENOMEM; + } + memset(nesdev->cqp_vbase, 0, nesdev->cqp_mem_size); + + /* Allocate a twice the number of CQP 
requests as the SQ size */ + nesdev->nes_cqp_requests = kmalloc(sizeof(struct nes_cqp_request) * + 2 * NES_CQP_SQ_SIZE, GFP_KERNEL); + if (NULL == nesdev->nes_cqp_requests) { + nes_debug(NES_DBG_INIT, "Unable to allocate memory for CQP request entries.\n"); + /* cqp.sq_vbase/sq_pbase are not set yet; free the original allocation */ + pci_free_consistent(nesdev->pcidev, nesdev->cqp_mem_size, nesdev->cqp_vbase, + nesdev->cqp_pbase); + return -ENOMEM; + } + memset(nesdev->nes_cqp_requests, 0, sizeof(struct nes_cqp_request) * + 2 * NES_CQP_SQ_SIZE); + nes_debug(NES_DBG_INIT, "Allocated CQP structures at %p (phys = %016lX), size = %u.\n", + nesdev->cqp_vbase, (unsigned long)nesdev->cqp_pbase, nesdev->cqp_mem_size); + + spin_lock_init(&nesdev->cqp.lock); + init_waitqueue_head(&nesdev->cqp.waitq); + + /* Setup Various Structures */ + vmem = (void *)(((unsigned long long)nesdev->cqp_vbase + (512 - 1)) & + ~(unsigned long long)(512 - 1)); + pmem = (dma_addr_t)(((unsigned long long)nesdev->cqp_pbase + (512 - 1)) & + ~(unsigned long long)(512 - 1)); + + nesdev->cqp.sq_vbase = vmem; + nesdev->cqp.sq_pbase = pmem; + nesdev->cqp.sq_size = NES_CQP_SQ_SIZE; + nesdev->cqp.sq_head = 0; + nesdev->cqp.sq_tail = 0; + nesdev->cqp.qp_id = PCI_FUNC(nesdev->pcidev->devfn); + + vmem += (sizeof(struct nes_hw_cqp_wqe) * nesdev->cqp.sq_size); + pmem += (sizeof(struct nes_hw_cqp_wqe) * nesdev->cqp.sq_size); + + nesdev->ccq.cq_vbase = vmem; + nesdev->ccq.cq_pbase = pmem; + nesdev->ccq.cq_size = NES_CCQ_SIZE; + nesdev->ccq.cq_head = 0; + nesdev->ccq.ce_handler = nes_cqp_ce_handler; + nesdev->ccq.cq_number = PCI_FUNC(nesdev->pcidev->devfn); + + vmem += (sizeof(struct nes_hw_cqe) * nesdev->ccq.cq_size); + pmem += (sizeof(struct nes_hw_cqe) * nesdev->ccq.cq_size); + + nesdev->ceq_index = PCI_FUNC(nesdev->pcidev->devfn); + ceq = &nesadapter->ceq[nesdev->ceq_index]; + ceq->ceq_vbase = vmem; + ceq->ceq_pbase = pmem; + ceq->ceq_size = NES_CCEQ_SIZE; + ceq->ceq_head = 0; + + vmem += max(((u32)sizeof(struct nes_hw_ceqe) * ceq->ceq_size), (u32)256); + pmem += 
max(((u32)sizeof(struct nes_hw_ceqe) * ceq->ceq_size), (u32)256); + + nesdev->nic_ceq_index = PCI_FUNC(nesdev->pcidev->devfn) + 8; + nic_ceq = &nesadapter->ceq[nesdev->nic_ceq_index]; + nic_ceq->ceq_vbase = vmem; + nic_ceq->ceq_pbase = pmem; + nic_ceq->ceq_size = NES_NIC_CEQ_SIZE; + nic_ceq->ceq_head = 0; + + vmem += max(((u32)sizeof(struct nes_hw_ceqe) * nic_ceq->ceq_size), (u32)256); + pmem += max(((u32)sizeof(struct nes_hw_ceqe) * nic_ceq->ceq_size), (u32)256); + + aeq = &nesadapter->aeq[PCI_FUNC(nesdev->pcidev->devfn)]; + aeq->aeq_vbase = vmem; + aeq->aeq_pbase = pmem; + aeq->aeq_size = nesadapter->max_qp; + aeq->aeq_head = 0; + + /* Setup QP Context */ + vmem += (sizeof(struct nes_hw_aeqe) * aeq->aeq_size); + pmem += (sizeof(struct nes_hw_aeqe) * aeq->aeq_size); + + cqp_qp_context = vmem; + cqp_qp_context->context_words[0] = cpu_to_le32((PCI_FUNC(nesdev->pcidev->devfn) << 12) + (2 << 10)); + cqp_qp_context->context_words[1] = 0; + cqp_qp_context->context_words[2] = cpu_to_le32((u32)nesdev->cqp.sq_pbase); + cqp_qp_context->context_words[3] = cpu_to_le32(((u64)nesdev->cqp.sq_pbase) >> 32); + + + /* Write the address to Create CQP */ + if ((sizeof(dma_addr_t) > 4)) { + nes_write_indexed(nesdev, + NES_IDX_CREATE_CQP_HIGH + (PCI_FUNC(nesdev->pcidev->devfn) * 8), + ((u64)pmem) >> 32); + } else { + nes_write_indexed(nesdev, + NES_IDX_CREATE_CQP_HIGH + (PCI_FUNC(nesdev->pcidev->devfn) * 8), 0); + } + nes_write_indexed(nesdev, + NES_IDX_CREATE_CQP_LOW + (PCI_FUNC(nesdev->pcidev->devfn) * 8), + (u32)pmem); + + nes_debug(NES_DBG_INIT, "Address of CQP SQ = %p.\n", nesdev->cqp.sq_vbase); + + INIT_LIST_HEAD(&nesdev->cqp_avail_reqs); + INIT_LIST_HEAD(&nesdev->cqp_pending_reqs); + + for (count=0; count<2*NES_CQP_SQ_SIZE; count++) { + init_waitqueue_head(&nesdev->nes_cqp_requests[count].waitq); + list_add_tail(&nesdev->nes_cqp_requests[count].list, &nesdev->cqp_avail_reqs); + } + + /* Write Create CCQ WQE */ + cqp_head = nesdev->cqp.sq_head++; + cqp_wqe = 
&nesdev->cqp.sq_vbase[cqp_head]; + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_CREATE_CQ | NES_CQP_CQ_CEQ_VALID | + NES_CQP_CQ_CHK_OVERFLOW | ((u32)nesdev->ccq.cq_size << 16)); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesdev->ccq.cq_number | + ((u32)nesdev->ceq_index<<16)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + u64temp = (u64)nesdev->ccq.cq_pbase; + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_PBL_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_PBL_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_HIGH_IDX] = 0; + u64temp = (u64)&nesdev->ccq; + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_LOW_IDX] = cpu_to_le32((u32)(u64temp>>1)); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_HIGH_IDX] = cpu_to_le32(((u32)((u64temp)>>33))&0x7FFFFFFF); + nes_debug(NES_DBG_INIT, "CQ%u context = 0x%08X:0x%08X.\n", + nesdev->ccq.cq_number, + le32_to_cpu(cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_HIGH_IDX]), + le32_to_cpu(cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_LOW_IDX])); + + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_DOORBELL_INDEX_HIGH_IDX] = 0; + + /* Write Create CEQ WQE */ + cqp_head = nesdev->cqp.sq_head++; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_CREATE_CEQ + + ((u32)nesdev->ceq_index << 8)); + cqp_wqe->wqe_words[NES_CQP_CEQ_WQE_ELEMENT_COUNT_IDX] = cpu_to_le32(ceq->ceq_size); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + 
cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + u64temp = (u64)ceq->ceq_pbase; + cqp_wqe->wqe_words[NES_CQP_CEQ_WQE_PBL_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_CEQ_WQE_PBL_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + + /* Write Create AEQ WQE */ + cqp_head = nesdev->cqp.sq_head++; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_CREATE_AEQ + + ((u32)PCI_FUNC(nesdev->pcidev->devfn) << 8)); + cqp_wqe->wqe_words[NES_CQP_AEQ_WQE_ELEMENT_COUNT_IDX] = cpu_to_le32(aeq->aeq_size); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + u64temp = (u64)aeq->aeq_pbase; + cqp_wqe->wqe_words[NES_CQP_AEQ_WQE_PBL_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_AEQ_WQE_PBL_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + + /* Write Create CEQ WQE */ + cqp_head = nesdev->cqp.sq_head++; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_CREATE_CEQ + + ((u32)nesdev->nic_ceq_index << 8)); + cqp_wqe->wqe_words[NES_CQP_CEQ_WQE_ELEMENT_COUNT_IDX] = cpu_to_le32(nic_ceq->ceq_size); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + u64temp = (u64)nic_ceq->ceq_pbase; + cqp_wqe->wqe_words[NES_CQP_CEQ_WQE_PBL_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_CEQ_WQE_PBL_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + + /* Poll until CCQP done */ + count = 0; + do { + if 
(count++ > 1000) { + printk(KERN_ERR PFX "Error creating CQP\n"); + pci_free_consistent(nesdev->pcidev, nesdev->cqp_mem_size, + nesdev->cqp_vbase, nesdev->cqp_pbase); + return -1; + } + udelay(10); + } while (!(nes_read_indexed(nesdev, + NES_IDX_QP_CONTROL + (PCI_FUNC(nesdev->pcidev->devfn) * 8)) & (1 << 8))); + + nes_debug(NES_DBG_INIT, "CQP Status = 0x%08X\n", nes_read_indexed(nesdev, + NES_IDX_QP_CONTROL+(PCI_FUNC(nesdev->pcidev->devfn)*8))); + + u32temp = 0x04800000; + nes_write32(nesdev->regs+NES_WQE_ALLOC, u32temp | nesdev->cqp.qp_id); + + /* wait for the CCQ, CEQ, and AEQ to get created */ + count = 0; + do { + if (count++ > 1000) { + printk(KERN_ERR PFX "Error creating CCQ, CEQ, and AEQ\n"); + pci_free_consistent(nesdev->pcidev, nesdev->cqp_mem_size, + nesdev->cqp_vbase, nesdev->cqp_pbase); + return -1; + } + udelay(10); + } while (((nes_read_indexed(nesdev, + NES_IDX_QP_CONTROL + (PCI_FUNC(nesdev->pcidev->devfn)*8)) & (15<<8)) != (15<<8))); + + /* dump the QP status value */ + nes_debug(NES_DBG_INIT, "QP Status = 0x%08X\n", nes_read_indexed(nesdev, + NES_IDX_QP_CONTROL+(PCI_FUNC(nesdev->pcidev->devfn)*8))); + + nesdev->cqp.sq_tail++; + + return 0; +} + + +/** + * nes_destroy_cqp + */ +int nes_destroy_cqp(struct nes_device *nesdev) +{ + struct nes_hw_cqp_wqe *cqp_wqe; + u32 count=0; + u32 cqp_head; + unsigned long flags; + + nes_debug(NES_DBG_SHUTDOWN, "Waiting for CQP work to complete.\n"); + do { + if (count++ > 1000) break; + udelay(10); + } while (!(nesdev->cqp.sq_head == nesdev->cqp.sq_tail)); + + /* Reset CCQ */ + nes_write32(nesdev->regs+NES_CQE_ALLOC, NES_CQE_ALLOC_RESET | + nesdev->ccq.cq_number); + + /* Disable device interrupts */ + nes_write32(nesdev->regs+NES_INT_MASK, 0x7fffffff); + /* Destroy the AEQ */ + spin_lock_irqsave(&nesdev->cqp.lock, flags); + cqp_head = nesdev->cqp.sq_head++; + nesdev->cqp.sq_head &= nesdev->cqp.sq_size-1; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = 
cpu_to_le32(NES_CQP_DESTROY_AEQ | + ((u32)PCI_FUNC(nesdev->pcidev->devfn)<<8)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = 0; + /* Destroy the NIC CEQ */ + cqp_head = nesdev->cqp.sq_head++; + nesdev->cqp.sq_head &= nesdev->cqp.sq_size-1; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_DESTROY_CEQ | + ((u32)nesdev->nic_ceq_index<<8)); + /* Destroy the CEQ */ + cqp_head = nesdev->cqp.sq_head++; + nesdev->cqp.sq_head &= nesdev->cqp.sq_size-1; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_DESTROY_CEQ | + (nesdev->ceq_index<<8)); + /* Destroy the CCQ */ + cqp_head = nesdev->cqp.sq_head++; + nesdev->cqp.sq_head &= nesdev->cqp.sq_size-1; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_DESTROY_CQ); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesdev->ccq.cq_number | + ((u32)nesdev->ceq_index<<16)); + /* Destroy CQP */ + cqp_head = nesdev->cqp.sq_head++; + nesdev->cqp.sq_head &= nesdev->cqp.sq_size-1; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_DESTROY_QP | + NES_CQP_QP_TYPE_CQP); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesdev->cqp.qp_id); + + barrier(); + /* Ring doorbell (5 WQEs) */ + nes_write32(nesdev->regs+NES_WQE_ALLOC, 0x05800000 | nesdev->cqp.qp_id); + + /* Wait for the destroy to complete */ + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + + /* wait for the CCQ, CEQ, and AEQ to get destroyed */ + count = 0; + do { + if (count++ > 1000) { + printk(KERN_ERR PFX "Function%d: Error destroying CCQ, CEQ, and AEQ\n", + PCI_FUNC(nesdev->pcidev->devfn)); + break; + } + udelay(10); + } while (((nes_read_indexed(nesdev, + NES_IDX_QP_CONTROL + (PCI_FUNC(nesdev->pcidev->devfn)*8)) & (15<<8)) != 0)); + + /* dump the QP status value */ + nes_debug(NES_DBG_SHUTDOWN, "Function%d: QP 
Status = 0x%08X\n", + PCI_FUNC(nesdev->pcidev->devfn), + nes_read_indexed(nesdev, + NES_IDX_QP_CONTROL+(PCI_FUNC(nesdev->pcidev->devfn)*8))); + + kfree(nesdev->nes_cqp_requests); + + /* Free the control structures using the addresses returned by pci_alloc_consistent */ + pci_free_consistent(nesdev->pcidev, nesdev->cqp_mem_size, nesdev->cqp_vbase, + nesdev->cqp_pbase); + + return 0; +} + + +/** + * nes_init_phy + */ +int nes_init_phy(struct nes_device *nesdev) +{ + struct nes_adapter *nesadapter = nesdev->nesadapter; + u32 counter = 0; + u32 mac_index = nesdev->mac_index; + u16 phy_data; + + if (nesadapter->OneG_Mode) { + nes_debug(NES_DBG_PHY, "1G PHY, mac_index = %d.\n", mac_index); + nes_read_1G_phy_reg(nesdev, 1, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 1 phy address %u = 0x%X.\n", + nesadapter->phy_index[mac_index], phy_data); + + nes_write_1G_phy_reg(nesdev, 23, nesadapter->phy_index[mac_index], 0xb000); + + /* Reset the PHY */ + nes_write_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], 0x8000); + udelay(100); + counter = 0; + do { + nes_read_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0 = 0x%X.\n", phy_data); + if (counter++ > 100) break; + } while (phy_data & 0x8000); + + /* Setting no phy loopback */ + phy_data &= 0xbfff; + phy_data |= 0x1140; + nes_write_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], phy_data); + nes_read_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0 = 0x%X.\n", phy_data); + + nes_read_1G_phy_reg(nesdev, 0x17, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0x17 = 0x%X.\n", phy_data); + + nes_read_1G_phy_reg(nesdev, 0x1e, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0x1e = 0x%X.\n", phy_data); + + /* Setting the interrupt mask */ + nes_read_1G_phy_reg(nesdev, 0x19, 
nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0x19 = 0x%X.\n", phy_data); + nes_write_1G_phy_reg(nesdev, 0x19, nesadapter->phy_index[mac_index], 0xffee); + + nes_read_1G_phy_reg(nesdev, 0x19, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0x19 = 0x%X.\n", phy_data); + + /* turning on flow control */ + nes_read_1G_phy_reg(nesdev, 4, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0x4 = 0x%X.\n", phy_data); + nes_write_1G_phy_reg(nesdev, 4, nesadapter->phy_index[mac_index], + (phy_data & ~(0x03E0)) | 0xc00); + /* nes_write_1G_phy_reg(nesdev, 4, nesadapter->phy_index[mac_index], + phy_data | 0xc00); */ + nes_read_1G_phy_reg(nesdev, 4, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0x4 = 0x%X.\n", phy_data); + + nes_read_1G_phy_reg(nesdev, 9, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0x9 = 0x%X.\n", phy_data); + /* Clear Half duplex */ + nes_write_1G_phy_reg(nesdev, 9, nesadapter->phy_index[mac_index], + phy_data & ~(0x0100)); + nes_read_1G_phy_reg(nesdev, 9, nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy data from register 0x9 = 0x%X.\n", phy_data); + + nes_read_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], &phy_data); + nes_write_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], phy_data | 0x0300); + } + return 0; +} + + +/** + * nes_replenish_nic_rq + */ +static void nes_replenish_nic_rq(struct nes_vnic *nesvnic) +{ + unsigned long flags; + dma_addr_t bus_address; + struct sk_buff *skb; + struct nes_hw_nic_rq_wqe *nic_rqe; + struct nes_hw_nic *nesnic; + struct nes_device *nesdev; + u32 rx_wqes_posted = 0; + + nesnic = &nesvnic->nic; + nesdev = nesvnic->nesdev; + spin_lock_irqsave(&nesnic->rq_lock, flags); + do { + skb = dev_alloc_skb(nesvnic->max_frame_size); + if (skb) { + 
skb->dev = nesvnic->netdev; + + bus_address = pci_map_single(nesdev->pcidev, + skb->data, nesvnic->max_frame_size, PCI_DMA_FROMDEVICE); + + nic_rqe = &nesnic->rq_vbase[nesvnic->nic.rq_head]; + nic_rqe->wqe_words[NES_NIC_RQ_WQE_LENGTH_1_0_IDX] = + cpu_to_le32(nesvnic->max_frame_size); + nic_rqe->wqe_words[NES_NIC_RQ_WQE_LENGTH_3_2_IDX] = 0; + nic_rqe->wqe_words[NES_NIC_RQ_WQE_FRAG0_LOW_IDX] = + cpu_to_le32((u32)bus_address); + nic_rqe->wqe_words[NES_NIC_RQ_WQE_FRAG0_HIGH_IDX] = + cpu_to_le32((u32)((u64)bus_address >> 32)); + nesnic->rx_skb[nesnic->rq_head] = skb; + nesnic->rq_head++; + nesnic->rq_head &= nesnic->rq_size - 1; + atomic_dec(&nesvnic->rx_skbs_needed); + barrier(); + if (++rx_wqes_posted==255) { + nes_write32(nesdev->regs+NES_WQE_ALLOC, (rx_wqes_posted << 24) | nesnic->qp_id); + rx_wqes_posted = 0; + } + } else { + printk("%s[%u] alloc_skb failed! %u wqes still needed.\n", + __FUNCTION__, __LINE__, + atomic_read(&nesvnic->rx_skbs_needed)); + if (((nesnic->rq_size-1) == atomic_read(&nesvnic->rx_skbs_needed)) && + (0 == atomic_read(&nesvnic->rx_skb_timer_running))) { + printk("%s[%u] Starting Timer.\n", __FUNCTION__, __LINE__); + atomic_set(&nesvnic->rx_skb_timer_running, 1); + nesvnic->rq_wqes_timer.expires = jiffies + (HZ/2); /* 1/2 second */ + add_timer(&nesvnic->rq_wqes_timer); + } + break; + } + } while (atomic_read(&nesvnic->rx_skbs_needed)); + barrier(); + if (rx_wqes_posted) { + nes_write32(nesdev->regs+NES_WQE_ALLOC, (rx_wqes_posted << 24) | nesnic->qp_id); + } + spin_unlock_irqrestore(&nesnic->rq_lock, flags); +} + + +/** + * nes_rq_wqes_timeout + */ +static void nes_rq_wqes_timeout(unsigned long parm) +{ + struct nes_vnic *nesvnic = (struct nes_vnic *)parm; + printk("%s: Timer fired.\n", __FUNCTION__); + atomic_set(&nesvnic->rx_skb_timer_running, 0); + if (atomic_read(&nesvnic->rx_skbs_needed)) + nes_replenish_nic_rq(nesvnic); +} + + +/** + * nes_init_nic_qp + */ +int nes_init_nic_qp(struct nes_device *nesdev, struct net_device *netdev) +{ + 
struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_hw_nic_sq_wqe *nic_sqe; + struct nes_hw_nic_qp_context *nic_context; + struct sk_buff *skb; + struct nes_hw_nic_rq_wqe *nic_rqe; + struct nes_vnic *nesvnic = netdev_priv(netdev); + unsigned long flags; + void *vmem; + dma_addr_t pmem; + u64 u64temp; + int ret; + u32 cqp_head; + u32 counter; + u32 wqe_count; + + /* Allocate fragment, SQ, RQ, and CQ; Reuse CEQ based on the PCI function */ + nesvnic->nic_mem_size = 256 + + (NES_NIC_WQ_SIZE * sizeof(struct nes_first_frag)) + + (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_sq_wqe)) + + (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_rq_wqe)) + + (NES_NIC_WQ_SIZE * 2 * sizeof(struct nes_hw_nic_cqe)) + + sizeof(struct nes_hw_nic_qp_context); + + nesvnic->nic_vbase = pci_alloc_consistent(nesdev->pcidev, nesvnic->nic_mem_size, + &nesvnic->nic_pbase); + if (!nesvnic->nic_vbase) { + nes_debug(NES_DBG_INIT, "Unable to allocate memory for NIC host descriptor rings\n"); + return -ENOMEM; + } + memset(nesvnic->nic_vbase, 0, nesvnic->nic_mem_size); + nes_debug(NES_DBG_INIT, "Allocated NIC QP structures at %p (phys = %016lX), size = %u.\n", + nesvnic->nic_vbase, (unsigned long)nesvnic->nic_pbase, nesvnic->nic_mem_size); + + vmem = (void *)(((unsigned long long)nesvnic->nic_vbase + (256 - 1)) & + ~(unsigned long long)(256 - 1)); + pmem = (dma_addr_t)(((unsigned long long)nesvnic->nic_pbase + (256 - 1)) & + ~(unsigned long long)(256 - 1)); + + /* Setup the first Fragment buffers */ + nesvnic->nic.first_frag_vbase = vmem; + + for (counter = 0; counter < NES_NIC_WQ_SIZE; counter++) { + nesvnic->nic.frag_paddr[counter] = pmem; + pmem += sizeof(struct nes_first_frag); + } + + /* setup the SQ */ + vmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_first_frag)); + + nesvnic->nic.sq_vbase = (void *)vmem; + nesvnic->nic.sq_pbase = pmem; + nesvnic->nic.sq_head = 0; + nesvnic->nic.sq_tail = 0; + nesvnic->nic.sq_size = NES_NIC_WQ_SIZE; + for (counter = 0; counter < NES_NIC_WQ_SIZE; counter++) { + nic_sqe = 
&nesvnic->nic.sq_vbase[counter]; + nic_sqe->wqe_words[NES_NIC_SQ_WQE_MISC_IDX] = + cpu_to_le32(NES_NIC_SQ_WQE_DISABLE_CHKSUM | + NES_NIC_SQ_WQE_COMPLETION); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_LENGTH_0_TAG_IDX] = + cpu_to_le32((u32)NES_FIRST_FRAG_SIZE << 16); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_LOW_IDX] = + cpu_to_le32((u32)nesvnic->nic.frag_paddr[counter]); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_HIGH_IDX] = + cpu_to_le32((u32)((u64)nesvnic->nic.frag_paddr[counter] >> 32)); + } + + spin_lock_init(&nesvnic->nic.sq_lock); + spin_lock_init(&nesvnic->nic.rq_lock); + + /* setup the RQ */ + vmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_sq_wqe)); + pmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_sq_wqe)); + + + nesvnic->nic.rq_vbase = vmem; + nesvnic->nic.rq_pbase = pmem; + nesvnic->nic.rq_head = 0; + nesvnic->nic.rq_tail = 0; + nesvnic->nic.rq_size = NES_NIC_WQ_SIZE; + + /* setup the CQ */ + vmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_rq_wqe)); + pmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_rq_wqe)); + + nesvnic->nic_cq.cq_vbase = vmem; + nesvnic->nic_cq.cq_pbase = pmem; + nesvnic->nic_cq.cq_head = 0; + nesvnic->nic_cq.cq_size = NES_NIC_WQ_SIZE * 2; +#ifdef NES_NAPI + nesvnic->nic_cq.ce_handler = nes_nic_napi_ce_handler; +#else + nesvnic->nic_cq.ce_handler = nes_nic_ce_handler; +#endif + + /* Send CreateCQ request to CQP */ + spin_lock_irqsave(&nesdev->cqp.lock, flags); + cqp_head = nesdev->cqp.sq_head; + nes_debug(NES_DBG_INIT, "Before filling out cqp_wqe, cqp=%p, sq_head=%u," + " sq_tail=%u, cqp_head=%u\n", + &nesdev->cqp, nesdev->cqp.sq_head, nesdev->cqp.sq_tail, cqp_head); + + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32( + NES_CQP_CREATE_CQ | NES_CQP_CQ_CEQ_VALID | + ((u32)nesvnic->nic_cq.cq_size << 16)); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32( + nesvnic->nic_cq.cq_number | ((u32)nesdev->nic_ceq_index << 16)); + 
cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + u64temp = (u64)nesvnic->nic_cq.cq_pbase; + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_PBL_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_PBL_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_HIGH_IDX] = 0; + u64temp = (u64)&nesvnic->nic_cq; + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_LOW_IDX] = cpu_to_le32((u32)(u64temp>>1)); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_HIGH_IDX] = + cpu_to_le32(((u32)((u64temp)>>33))&0x7FFFFFFF); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_DOORBELL_INDEX_HIGH_IDX] = 0; + if (++cqp_head >= nesdev->cqp.sq_size) + cqp_head = 0; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + + /* Send CreateQP request to CQP */ + nic_context = (void *)(&nesvnic->nic_cq.cq_vbase[nesvnic->nic_cq.cq_size]); + nic_context->context_words[NES_NIC_CTX_MISC_IDX] = + cpu_to_le32((u32)NES_NIC_CTX_SIZE | + ((u32)PCI_FUNC(nesdev->pcidev->devfn) << 12)); + nes_debug(NES_DBG_INIT, "RX_WINDOW_BUFFER_PAGE_TABLE_SIZE = 0x%08X, RX_WINDOW_BUFFER_SIZE = 0x%08X\n", + nes_read_indexed(nesdev, NES_IDX_RX_WINDOW_BUFFER_PAGE_TABLE_SIZE), + nes_read_indexed(nesdev, NES_IDX_RX_WINDOW_BUFFER_SIZE)); + if (0!= nes_read_indexed(nesdev, NES_IDX_RX_WINDOW_BUFFER_SIZE)) { + nic_context->context_words[NES_NIC_CTX_MISC_IDX] |= cpu_to_le32(NES_NIC_BACK_STORE); + } + + u64temp = (u64)nesvnic->nic.sq_pbase; + nic_context->context_words[NES_NIC_CTX_SQ_LOW_IDX] = cpu_to_le32((u32)u64temp); + nic_context->context_words[NES_NIC_CTX_SQ_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + u64temp = (u64)nesvnic->nic.rq_pbase; + nic_context->context_words[NES_NIC_CTX_RQ_LOW_IDX] = cpu_to_le32((u32)u64temp); + 
nic_context->context_words[NES_NIC_CTX_RQ_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_CREATE_QP | + NES_CQP_QP_TYPE_NIC); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesvnic->nic.qp_id); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = + cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = + cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + u64temp = (u64)nesvnic->nic_cq.cq_pbase + + (nesvnic->nic_cq.cq_size * sizeof(struct nes_hw_nic_cqe)); + cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + + if (++cqp_head >= nesdev->cqp.sq_size) + cqp_head = 0; + nesdev->cqp.sq_head = cqp_head; + + barrier(); + + /* Ring doorbell (2 WQEs) */ + nes_write32(nesdev->regs+NES_WQE_ALLOC, 0x02800000 | nesdev->cqp.qp_id); + + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + nes_debug(NES_DBG_INIT, "Waiting for create NIC QP%u to complete.\n", + nesvnic->nic.qp_id); + + ret = wait_event_timeout(nesdev->cqp.waitq, (nesdev->cqp.sq_tail == cqp_head), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_INIT, "Create NIC QP%u completed, wait_event_timeout ret = %u.\n", + nesvnic->nic.qp_id, ret); + if (!ret) { + nes_debug(NES_DBG_INIT, "NIC QP%u create timeout expired\n", nesvnic->nic.qp_id); + pci_free_consistent(nesdev->pcidev, nesvnic->nic_mem_size, nesvnic->nic_vbase, + nesvnic->nic_pbase); + return -EIO; + } + + /* Populate the RQ */ + for (counter = 0; counter < (NES_NIC_WQ_SIZE - 1); counter++) { + skb = dev_alloc_skb(nesvnic->max_frame_size); + if (!skb) { + nes_debug(NES_DBG_INIT, "%s: out of memory for receive skb\n", netdev->name); + + nes_destroy_nic_qp(nesvnic); + return -ENOMEM; + } + + skb->dev = netdev; + + pmem = 
pci_map_single(nesdev->pcidev, skb->data, + nesvnic->max_frame_size, PCI_DMA_FROMDEVICE); + + nic_rqe = &nesvnic->nic.rq_vbase[counter]; + nic_rqe->wqe_words[NES_NIC_RQ_WQE_LENGTH_1_0_IDX] = cpu_to_le32(nesvnic->max_frame_size); + nic_rqe->wqe_words[NES_NIC_RQ_WQE_LENGTH_3_2_IDX] = 0; + nic_rqe->wqe_words[NES_NIC_RQ_WQE_FRAG0_LOW_IDX] = cpu_to_le32((u32)pmem); + nic_rqe->wqe_words[NES_NIC_RQ_WQE_FRAG0_HIGH_IDX] = cpu_to_le32((u32)((u64)pmem >> 32)); + nesvnic->nic.rx_skb[counter] = skb; + } + + wqe_count = NES_NIC_WQ_SIZE - 1; + nesvnic->nic.rq_head = wqe_count; + barrier(); + do { + counter = min(wqe_count, ((u32)255)); + wqe_count -= counter; + nes_write32(nesdev->regs+NES_WQE_ALLOC, (counter << 24) | nesvnic->nic.qp_id); + } while (wqe_count); + init_timer(&nesvnic->rq_wqes_timer); + nesvnic->rq_wqes_timer.function = nes_rq_wqes_timeout; + nesvnic->rq_wqes_timer.data = (unsigned long)nesvnic; +#ifdef NES_INT_MODERATE + nes_debug(NES_DBG_INIT, "Default Interrupt Moderation Enabled\n"); +#endif +#ifdef NES_NAPI + nes_debug(NES_DBG_INIT, "NAPI support Enabled\n"); +#endif + + return 0; +} + + +/** + * nes_destroy_nic_qp + */ +void nes_destroy_nic_qp(struct nes_vnic *nesvnic) +{ + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_hw_nic_rq_wqe *nic_rqe; + u64 wqe_frag; + u32 cqp_head; + unsigned long flags; + int ret; + + /* Free remaining NIC receive buffers */ + while (nesvnic->nic.rq_head != nesvnic->nic.rq_tail) { + nic_rqe = &nesvnic->nic.rq_vbase[nesvnic->nic.rq_tail]; + wqe_frag = (u64)le32_to_cpu(nic_rqe->wqe_words[NES_NIC_RQ_WQE_FRAG0_LOW_IDX]); + wqe_frag |= ((u64)le32_to_cpu(nic_rqe->wqe_words[NES_NIC_RQ_WQE_FRAG0_HIGH_IDX])) << 32; + pci_unmap_single(nesdev->pcidev, (dma_addr_t)wqe_frag, + nesvnic->max_frame_size, PCI_DMA_FROMDEVICE); + dev_kfree_skb(nesvnic->nic.rx_skb[nesvnic->nic.rq_tail++]); + nesvnic->nic.rq_tail &= (nesvnic->nic.rq_size - 1); + } + + /* Destroy NIC QP */ + 
spin_lock_irqsave(&nesdev->cqp.lock, flags); + cqp_head = nesdev->cqp.sq_head; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_DESTROY_QP | NES_CQP_QP_TYPE_NIC); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesvnic->nic.qp_id); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + + if (++cqp_head >= nesdev->cqp.sq_size) + cqp_head = 0; + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; + + /* Destroy NIC CQ */ + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_DESTROY_CQ | + ((u32)nesvnic->nic_cq.cq_size << 16)); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesvnic->nic_cq.cq_number | + ((u32)nesdev->nic_ceq_index << 16)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + + if (++cqp_head >= nesdev->cqp.sq_size) + cqp_head = 0; + + nesdev->cqp.sq_head = cqp_head; + barrier(); + + /* Ring doorbell (2 WQEs) */ + nes_write32(nesdev->regs+NES_WQE_ALLOC, 0x02800000 | nesdev->cqp.qp_id); + + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + nes_debug(NES_DBG_SHUTDOWN, "Waiting for CQP, cqp_head=%u, cqp.sq_head=%u," + " cqp.sq_tail=%u, cqp.sq_size=%u\n", + cqp_head, nesdev->cqp.sq_head, + nesdev->cqp.sq_tail, nesdev->cqp.sq_size); + + ret = wait_event_timeout(nesdev->cqp.waitq, (nesdev->cqp.sq_tail == cqp_head), + NES_EVENT_TIMEOUT); + + nes_debug(NES_DBG_SHUTDOWN, "Destroy NIC QP returned, wait_event_timeout ret = %u, cqp_head=%u," + " 
cqp.sq_head=%u, cqp.sq_tail=%u\n", + ret, cqp_head, nesdev->cqp.sq_head, nesdev->cqp.sq_tail); + if (!ret) { + nes_debug(NES_DBG_SHUTDOWN, "NIC QP%u destroy timeout expired\n", + nesvnic->nic.qp_id); + } + + pci_free_consistent(nesdev->pcidev, nesvnic->nic_mem_size, nesvnic->nic_vbase, + nesvnic->nic_pbase); +} + + +#ifdef NES_NAPI +/** + * nes_napi_isr + */ +int nes_napi_isr(struct nes_device *nesdev) +{ + u32 int_stat; + + if (nesdev->napi_isr_ran) { + /* interrupt status has already been read in ISR */ + int_stat = nesdev->int_stat; + } else { + int_stat = nes_read32(nesdev->regs + NES_INT_STAT); + nesdev->int_stat = int_stat; + nesdev->napi_isr_ran = 1; + } + + int_stat &= nesdev->int_req; + /* nes_debug(NES_DBG_ISR, "Interrupt Status (postfilter) = 0x%08X\n", int_stat ); */ + /* iff NIC, process here, else wait for DPC */ + if ((int_stat) && ((int_stat & 0x0000ff00) == int_stat)) { + nesdev->napi_isr_ran = 0; + nes_write32(nesdev->regs+NES_INT_STAT, + (int_stat & + ~(NES_INT_INTF|NES_INT_TIMER|NES_INT_MAC0|NES_INT_MAC1|NES_INT_MAC2|NES_INT_MAC3))); + + /* Process the CEQs */ + nes_process_ceq(nesdev, &nesdev->nesadapter->ceq[nesdev->nic_ceq_index]); + + if (nesdev->et_rx_coalesce_usecs_irq) { + if ((nesdev->int_req & NES_INT_TIMER) == 0) { + /* Enable Periodic timer interrupts */ + nesdev->int_req |= NES_INT_TIMER; + /* ack any pending periodic timer interrupts so we don't get an immediate interrupt */ + /* TODO: need to also ack other unused periodic timer values, get from nesadapter */ + nes_write32(nesdev->regs+NES_TIMER_STAT, + nesdev->timer_int_req | ~(nesdev->nesadapter->timer_int_req)); + nes_write32(nesdev->regs+NES_INTF_INT_MASK, + ~(nesdev->intf_int_req | NES_INTF_PERIODIC_TIMER)); + } + /* Enable interrupts, except CEQs */ + nes_write32(nesdev->regs+NES_INT_MASK, 0x0000ffff | (~nesdev->int_req)); + } else { + /* Enable interrupts, make sure timer is off */ + nesdev->int_req &= ~NES_INT_TIMER; + nes_write32(nesdev->regs+NES_INTF_INT_MASK, 
~(nesdev->intf_int_req)); + nes_write32(nesdev->regs+NES_INT_MASK, ~nesdev->int_req); + } + + return 1; + } else { + return 0; + } +} +#endif + + +/** + * nes_dpc + */ +void nes_dpc(unsigned long param) +{ + struct nes_device *nesdev = (struct nes_device *)param; + struct nes_adapter *nesadapter = nesdev->nesadapter; + u32 counter; + u32 loop_counter = 0; + u32 int_status_bit; + u32 int_stat; + u32 timer_stat; + u32 temp_int_stat; + u32 intf_int_stat; + u32 debug_error; + u32 processed_intf_int = 0; + u16 processed_timer_int = 0; + u16 completion_ints = 0; + u16 timer_ints = 0; + + /* nes_debug(NES_DBG_ISR, "\n"); */ + + do { + timer_stat = 0; + if (nesdev->napi_isr_ran) { + nesdev->napi_isr_ran = 0; + int_stat = nesdev->int_stat; + } else + int_stat = nes_read32(nesdev->regs+NES_INT_STAT); + if (0 != processed_intf_int) { + int_stat &= nesdev->int_req & ~NES_INT_INTF; + } else { + int_stat &= nesdev->int_req; + } + if (0 == processed_timer_int) { + processed_timer_int = 1; + if (int_stat & NES_INT_TIMER) { + timer_stat = nes_read32(nesdev->regs + NES_TIMER_STAT); + if ((timer_stat & nesdev->timer_int_req) == 0) { + int_stat &= ~NES_INT_TIMER; + } + } + } else { + int_stat &= ~NES_INT_TIMER; + } + + if (int_stat) { + if (int_stat & ~(NES_INT_INTF|NES_INT_TIMER|NES_INT_MAC0| + NES_INT_MAC1|NES_INT_MAC2|NES_INT_MAC3)) { + /* Ack the interrupts */ + nes_write32(nesdev->regs+NES_INT_STAT, + (int_stat & ~(NES_INT_INTF|NES_INT_TIMER|NES_INT_MAC0| + NES_INT_MAC1|NES_INT_MAC2|NES_INT_MAC3))); + } + + temp_int_stat = int_stat; + for (counter = 0, int_status_bit = 1; counter < 16; counter++) { + if (int_stat & int_status_bit) { + nes_process_ceq(nesdev, &nesadapter->ceq[counter]); + temp_int_stat &= ~int_status_bit; + completion_ints = 1; + } + if (!(temp_int_stat & 0x0000ffff)) + break; + int_status_bit <<= 1; + } + + /* Process the AEQ for this pci function */ + int_status_bit = 1 << (16 + PCI_FUNC(nesdev->pcidev->devfn)); + if (int_stat & int_status_bit) { + 
nes_process_aeq(nesdev, &nesadapter->aeq[PCI_FUNC(nesdev->pcidev->devfn)]); + } + + /* Process the MAC interrupt for this pci function */ + int_status_bit = 1 << (24 + nesdev->mac_index); + if (int_stat & int_status_bit) { + nes_process_mac_intr(nesdev, nesdev->mac_index); + } + + if (int_stat & NES_INT_TIMER) { + if (timer_stat & nesdev->timer_int_req) { + nes_write32(nesdev->regs + NES_TIMER_STAT, + (timer_stat & nesdev->timer_int_req) | + ~(nesdev->nesadapter->timer_int_req)); + timer_ints = 1; + } + } + + if (int_stat & NES_INT_INTF) { + processed_intf_int = 1; + intf_int_stat = nes_read32(nesdev->regs+NES_INTF_INT_STAT); + intf_int_stat &= nesdev->intf_int_req; + if (NES_INTF_INT_CRITERR & intf_int_stat) { + debug_error = nes_read_indexed(nesdev, NES_IDX_DEBUG_ERROR_CONTROL_STATUS); + printk(KERN_ERR PFX "Critical Error reported by device!!! 0x%02X\n", + (u16)debug_error); + nes_write_indexed(nesdev, NES_IDX_DEBUG_ERROR_CONTROL_STATUS, + 0x01010000 | (debug_error & 0x0000ffff)); + /* BUG(); */ + if (crit_err_count++ > 10) + nes_write_indexed(nesdev, NES_IDX_DEBUG_ERROR_MASKS1, 1 << 0x17); + } + if (NES_INTF_INT_PCIERR & intf_int_stat) { + printk(KERN_ERR PFX "PCI Error reported by device!!!\n"); + BUG(); + } + if (NES_INTF_INT_AEQ_OFLOW & intf_int_stat) { + printk(KERN_ERR PFX "AEQ Overflow reported by device!!!\n"); + BUG(); + } + nes_write32(nesdev->regs+NES_INTF_INT_STAT, intf_int_stat); + } + + if (int_stat & NES_INT_TSW) { + } + } + /* Don't let the interface interrupt bit keep us in the loop */ + int_stat &= ~NES_INT_INTF|NES_INT_TIMER|NES_INT_MAC0| + NES_INT_MAC1|NES_INT_MAC2|NES_INT_MAC3; + } while ((int_stat != 0) && (loop_counter++ < MAX_DPC_ITERATIONS)); + + if (1 == timer_ints) { + if (nesdev->et_rx_coalesce_usecs_irq) { + if (0 == completion_ints) { + nesdev->timer_only_int_count++; + if (nesdev->timer_only_int_count>=NES_TIMER_INT_LIMIT) { + nesdev->timer_only_int_count = 0; + nesdev->int_req &= ~NES_INT_TIMER; + nes_write32(nesdev->regs + 
NES_INTF_INT_MASK, ~(nesdev->intf_int_req)); + nes_write32(nesdev->regs+NES_INT_MASK, ~nesdev->int_req); + } else { + nes_write32(nesdev->regs+NES_INT_MASK, 0x0000ffff|(~nesdev->int_req)); + } + } else { + nesdev->timer_only_int_count = 0; + nes_write32(nesdev->regs+NES_INT_MASK, 0x0000ffff|(~nesdev->int_req)); + } + } else { + nesdev->timer_only_int_count = 0; + nesdev->int_req &= ~NES_INT_TIMER; + nes_write32(nesdev->regs+NES_INTF_INT_MASK, ~(nesdev->intf_int_req)); + nes_write32(nesdev->regs+NES_TIMER_STAT, + nesdev->timer_int_req | ~(nesdev->nesadapter->timer_int_req)); + nes_write32(nesdev->regs+NES_INT_MASK, ~nesdev->int_req); + } + } else { + if ((1 == completion_ints) && (nesdev->et_rx_coalesce_usecs_irq)) { + /* nes_debug(NES_DBG_ISR, "Enabling periodic timer interrupt.\n" ); */ + nesdev->timer_only_int_count = 0; + nesdev->int_req |= NES_INT_TIMER; + nes_write32(nesdev->regs+NES_TIMER_STAT, + nesdev->timer_int_req | ~(nesdev->nesadapter->timer_int_req)); + nes_write32(nesdev->regs+NES_INTF_INT_MASK, + ~(nesdev->intf_int_req | NES_INTF_PERIODIC_TIMER)); + nes_write32(nesdev->regs+NES_INT_MASK, 0x0000ffff | (~nesdev->int_req)); + } else { + nes_write32(nesdev->regs+NES_INT_MASK, ~nesdev->int_req); + } + } +} + + +/** + * nes_process_ceq + */ +void nes_process_ceq(struct nes_device *nesdev, struct nes_hw_ceq *ceq) +{ + u64 u64temp; + struct nes_hw_cq *cq; + u32 head; + u32 ceq_size; + + /* nes_debug(NES_DBG_CQ, "\n"); */ + head = ceq->ceq_head; + ceq_size = ceq->ceq_size; + + do { + if (le32_to_cpu(ceq->ceq_vbase[head].ceqe_words[NES_CEQE_CQ_CTX_HIGH_IDX]) & + NES_CEQE_VALID) { + u64temp = (((u64)(le32_to_cpu(ceq->ceq_vbase[head].ceqe_words[NES_CEQE_CQ_CTX_HIGH_IDX])))<<32) | + ((u64)(le32_to_cpu(ceq->ceq_vbase[head].ceqe_words[NES_CEQE_CQ_CTX_LOW_IDX]))); + u64temp <<= 1; + cq = *((struct nes_hw_cq **)&u64temp); + /* nes_debug(NES_DBG_CQ, "pCQ = %p\n", cq); */ + barrier(); + ceq->ceq_vbase[head].ceqe_words[NES_CEQE_CQ_CTX_HIGH_IDX] = 0; + + /* call the 
event handler */ + cq->ce_handler(nesdev, cq); + + if (++head >= ceq_size) + head = 0; + } else { + break; + } + } while (1); + + ceq->ceq_head = head; +} + + +/** + * nes_process_aeq + */ +void nes_process_aeq(struct nes_device *nesdev, struct nes_hw_aeq *aeq) +{ +// u64 u64temp; + u32 head; + u32 aeq_size; + u32 aeqe_misc; + u32 aeqe_cq_id; + struct nes_hw_aeqe volatile *aeqe; + + head = aeq->aeq_head; + aeq_size = aeq->aeq_size; + + do { + aeqe = &aeq->aeq_vbase[head]; + if ((le32_to_cpu(aeqe->aeqe_words[NES_AEQE_MISC_IDX]) & NES_AEQE_VALID) == 0) + break; + aeqe_misc = le32_to_cpu(aeqe->aeqe_words[NES_AEQE_MISC_IDX]); + aeqe_cq_id = le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_QP_CQ_ID_IDX]); + if (aeqe_misc & (NES_AEQE_QP|NES_AEQE_CQ)) { + if (aeqe_cq_id >= NES_FIRST_QPN) { + /* dealing with an accelerated QP related AE */ +// u64temp = (((u64)(le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_CTXT_HIGH_IDX])))<<32) | +// ((u64)(le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_CTXT_LOW_IDX]))); + nes_process_iwarp_aeqe(nesdev, (struct nes_hw_aeqe *)aeqe); + } else { + /* TODO: dealing with a CQP related AE */ + nes_debug(NES_DBG_AEQ, "Processing CQP related AE, misc = 0x%04X\n", + (u16)(aeqe_misc >> 16)); + } + } + + aeqe->aeqe_words[NES_AEQE_MISC_IDX] = 0; + + if (++head >= aeq_size) + head = 0; + } + while (1); + aeq->aeq_head = head; +} + + +/** + * nes_process_mac_intr + */ +void nes_process_mac_intr(struct nes_device *nesdev, u32 mac_number) +{ + unsigned long flags; + u32 pcs_control_status; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_vnic *nesvnic; + u32 mac_status; + u32 mac_index = nesdev->mac_index; + u32 u32temp; + u16 phy_data; + u16 temp_phy_data; + + spin_lock_irqsave(&nesadapter->phy_lock, flags); + if (nesadapter->mac_sw_state[mac_number] != NES_MAC_SW_IDLE) { + spin_unlock_irqrestore(&nesadapter->phy_lock, flags); + return; + } + nesadapter->mac_sw_state[mac_number] = NES_MAC_SW_INTERRUPT; + 
spin_unlock_irqrestore(&nesadapter->phy_lock, flags); + + /* ack the MAC interrupt */ + mac_status = nes_read_indexed(nesdev, NES_IDX_MAC_INT_STATUS + (mac_index * 0x200)); + /* Clear the interrupt */ + nes_write_indexed(nesdev, NES_IDX_MAC_INT_STATUS + (mac_index * 0x200), mac_status); + + nes_debug(NES_DBG_PHY, "MAC%u interrupt status = 0x%X.\n", mac_number, mac_status); + + if (mac_status & (NES_MAC_INT_LINK_STAT_CHG | NES_MAC_INT_XGMII_EXT)) { + nesdev->link_status_interrupts++; + /* read the PHY interrupt status register */ + if (nesadapter->OneG_Mode) { + do { + nes_read_1G_phy_reg(nesdev, 0x1a, + nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy%d data from register 0x1a = 0x%X.\n", + nesadapter->phy_index[mac_index], phy_data); + } while (phy_data&0x8000); + + temp_phy_data = 0; + do { + nes_read_1G_phy_reg(nesdev, 0x11, + nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy%d data from register 0x11 = 0x%X.\n", + nesadapter->phy_index[mac_index], phy_data); + if (temp_phy_data == phy_data) + break; + temp_phy_data = phy_data; + } while (1); + + nes_read_1G_phy_reg(nesdev, 0x1e, + nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "Phy%d data from register 0x1e = 0x%X.\n", + nesadapter->phy_index[mac_index], phy_data); + + nes_read_1G_phy_reg(nesdev, 1, + nesadapter->phy_index[mac_index], &phy_data); + nes_debug(NES_DBG_PHY, "1G phy%u data from register 1 = 0x%X\n", + nesadapter->phy_index[mac_index], phy_data); + + if (temp_phy_data & 0x1000) { + nes_debug(NES_DBG_PHY, "The Link is up according to the PHY\n"); + phy_data = 4; + } else { + nes_debug(NES_DBG_PHY, "The Link is down according to the PHY\n"); + } + } + nes_debug(NES_DBG_PHY, "Eth SERDES Common Status: 0=0x%08X, 1=0x%08X\n", + nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS0), + nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS0+0x200)); + pcs_control_status = nes_read_indexed(nesdev, + 
NES_IDX_PHY_PCS_CONTROL_STATUS0 + ((mac_index&1)*0x200)); + pcs_control_status = nes_read_indexed(nesdev, + NES_IDX_PHY_PCS_CONTROL_STATUS0 + ((mac_index&1)*0x200)); + nes_debug(NES_DBG_PHY, "PCS PHY Control/Status%u: 0x%08X\n", + mac_index, pcs_control_status); + if (nesadapter->OneG_Mode) { + u32temp = 0x01010000; + if (nesadapter->port_count > 2) { + u32temp |= 0x02020000; + } + if ((pcs_control_status & u32temp)!= u32temp) { + phy_data = 0; + nes_debug(NES_DBG_PHY, "PCS says the link is down\n"); + } + } else { + phy_data = (0x0f0f0000 == (pcs_control_status & 0x0f1f0000)) ? 4 : 0; + } + + if (phy_data & 0x0004) { + nesadapter->mac_link_down[mac_index] = 0; + list_for_each_entry(nesvnic, &nesadapter->nesvnic_list[mac_index], list) { + nes_debug(NES_DBG_PHY, "The Link is UP!!. linkup was %d\n", + nesvnic->linkup); + if (nesvnic->linkup == 0) { + printk(PFX "The Link is now up for port %u, netdev %p.\n", + mac_index, nesvnic->netdev); + if (netif_queue_stopped(nesvnic->netdev)) + netif_start_queue(nesvnic->netdev); + nesvnic->linkup = 1; + netif_carrier_on(nesvnic->netdev); + } + } + } else { + nesadapter->mac_link_down[mac_index] = 1; + list_for_each_entry(nesvnic, &nesadapter->nesvnic_list[mac_index], list) { + nes_debug(NES_DBG_PHY, "The Link is Down!!. 
linkup was %d\n", + nesvnic->linkup); + if (nesvnic->linkup == 1) { + printk(PFX "The Link is now down for port %u, netdev %p.\n", + mac_index, nesvnic->netdev); + if (!(netif_queue_stopped(nesvnic->netdev))) + netif_stop_queue(nesvnic->netdev); + nesvnic->linkup = 0; + netif_carrier_off(nesvnic->netdev); + } + } + } + } + + nesadapter->mac_sw_state[mac_number] = NES_MAC_SW_IDLE; +} + + +#ifdef NES_NAPI +/** + * nes_nic_napi_ce_handler + */ +void nes_nic_napi_ce_handler(struct nes_device *nesdev, struct nes_hw_nic_cq *cq) +{ + struct nes_vnic *nesvnic = container_of(cq, struct nes_vnic, nic_cq); + + netif_rx_schedule(nesdev->netdev[nesvnic->netdev_index]); +} +#endif + +// MAX_RQES_TO_PROCESS defines the maximum number of receive queue entries (RQEs) to complete before +// getting out of nes_nic_ce_handler +// +#define MAX_RQES_TO_PROCESS 384 + +/** + * nes_nic_ce_handler + */ +void nes_nic_ce_handler(struct nes_device *nesdev, struct nes_hw_nic_cq *cq) +{ + u64 u64temp; + dma_addr_t bus_address; + struct nes_hw_nic *nesnic; + struct nes_vnic *nesvnic = container_of(cq, struct nes_vnic, nic_cq); + struct nes_hw_nic_rq_wqe *nic_rqe; + struct nes_hw_nic_sq_wqe *nic_sqe; + struct sk_buff *skb; + struct sk_buff *rx_skb; + u16 *wqe_fragment_length; + unsigned long flags; + u32 head; + u32 cq_size; + u32 rx_pkt_size; + u32 cqe_count=0; + u32 cqe_errv; + u32 cqe_misc; + u16 wqe_fragment_index = 1; /* first fragment (0) is used by copy buffer */ + u16 vlan_tag; + u16 pkt_type; + u16 rqes_processed = 0; + + head = cq->cq_head; + cq_size = cq->cq_size; +#ifdef NES_NAPI + nesvnic->cqes_pending = 1; +#endif + do { + if (le32_to_cpu(cq->cq_vbase[head].cqe_words[NES_NIC_CQE_MISC_IDX]) & + NES_NIC_CQE_VALID) { + nesnic = &nesvnic->nic; + cqe_misc = le32_to_cpu(cq->cq_vbase[head].cqe_words[NES_NIC_CQE_MISC_IDX]); + if (cqe_misc & NES_NIC_CQE_SQ) { + + wqe_fragment_index = 1; + nic_sqe = &nesnic->sq_vbase[nesnic->sq_tail]; + skb = nesnic->tx_skb[nesnic->sq_tail]; + wqe_fragment_length = (u16 
*)&nic_sqe->wqe_words[NES_NIC_SQ_WQE_LENGTH_0_TAG_IDX]; + /* bump past the vlan tag */ + wqe_fragment_length++; + if (le16_to_cpu(wqe_fragment_length[wqe_fragment_index]) != 0) { + u64temp = (u64) le32_to_cpu(nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_LOW_IDX+wqe_fragment_index*2]); + u64temp += ((u64)le32_to_cpu(nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_HIGH_IDX+wqe_fragment_index*2]))<<32; + bus_address = (dma_addr_t)u64temp; + if ((skb) && (skb_headlen(skb) > NES_FIRST_FRAG_SIZE)) { + pci_unmap_single(nesdev->pcidev, + bus_address, + le16_to_cpu(wqe_fragment_length[wqe_fragment_index++]), + PCI_DMA_TODEVICE); + } + for (; wqe_fragment_index < 5; wqe_fragment_index++) { + if (wqe_fragment_length[wqe_fragment_index]) { + u64temp = le32_to_cpu(nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_LOW_IDX+wqe_fragment_index*2]); + u64temp += ((u64)le32_to_cpu(nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_HIGH_IDX+wqe_fragment_index*2]))<<32; + bus_address = (dma_addr_t)u64temp; + pci_unmap_page(nesdev->pcidev, + bus_address, + le16_to_cpu(wqe_fragment_length[wqe_fragment_index]), + PCI_DMA_TODEVICE); + } else + break; + } + if (skb) + dev_kfree_skb_any(skb); + } + spin_lock_irqsave(&nesnic->sq_lock, flags); + nesnic->sq_tail++; + nesnic->sq_tail &= nesnic->sq_size-1; + /* restart the queue if it had been stopped */ + if (netif_queue_stopped(nesvnic->netdev)) + netif_wake_queue(nesvnic->netdev); + spin_unlock_irqrestore(&nesnic->sq_lock, flags); + } else { + rqes_processed ++; +#ifdef NES_NAPI + nesvnic->rx_cqes_completed++; +#endif + rx_pkt_size = cqe_misc & 0x0000ffff; + nic_rqe = &nesnic->rq_vbase[nesnic->rq_tail]; + /* Get the skb */ + rx_skb = nesnic->rx_skb[nesnic->rq_tail]; + nic_rqe = &nesnic->rq_vbase[nesvnic->nic.rq_tail]; + bus_address = (dma_addr_t)le32_to_cpu(nic_rqe->wqe_words[NES_NIC_RQ_WQE_FRAG0_LOW_IDX]); + bus_address += ((u64)le32_to_cpu(nic_rqe->wqe_words[NES_NIC_RQ_WQE_FRAG0_HIGH_IDX])) << 32; + pci_unmap_single(nesdev->pcidev, bus_address, + nesvnic->max_frame_size, 
PCI_DMA_FROMDEVICE); + /* rx_skb->tail = rx_skb->data + rx_pkt_size; */ + /* rx_skb->len = rx_pkt_size; */ + rx_skb->len = 0; /* TODO: see if this is necessary */ + skb_put(rx_skb, rx_pkt_size); + rx_skb->protocol = eth_type_trans(rx_skb, nesvnic->netdev); + nesnic->rq_tail++; + nesnic->rq_tail &= nesnic->rq_size - 1; + + atomic_inc(&nesvnic->rx_skbs_needed); + if (atomic_read(&nesvnic->rx_skbs_needed) > (nesvnic->nic.rq_size>>1)) { + nes_write32(nesdev->regs+NES_CQE_ALLOC, + cq->cq_number | (cqe_count << 16)); + cqe_count = 0; + nes_replenish_nic_rq(nesvnic); + } + pkt_type = (u16)(le32_to_cpu(cq->cq_vbase[head].cqe_words[NES_NIC_CQE_TAG_PKT_TYPE_IDX])); + cqe_errv = (cqe_misc & NES_NIC_CQE_ERRV_MASK) >> NES_NIC_CQE_ERRV_SHIFT; + rx_skb->ip_summed = CHECKSUM_NONE; + + if ((NES_PKT_TYPE_TCPV4_BITS == (pkt_type & NES_PKT_TYPE_TCPV4_MASK)) || + (NES_PKT_TYPE_UDPV4_BITS == (pkt_type & NES_PKT_TYPE_UDPV4_MASK))) { + if (0 == (cqe_errv & + (NES_NIC_ERRV_BITS_IPV4_CSUM_ERR | + NES_NIC_ERRV_BITS_TCPUDP_CSUM_ERR | + NES_NIC_ERRV_BITS_IPH_ERR | + NES_NIC_ERRV_BITS_WQE_OVERRUN))) { + if (0 == nesvnic->rx_checksum_disabled) { + rx_skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } else { + nes_debug(NES_DBG_CQ, "%s: unsuccessfully checksummed TCP or UDP packet." + " errv = 0x%X, pkt_type = 0x%X.\n", + nesvnic->netdev->name, cqe_errv, pkt_type); + } + } else if (NES_PKT_TYPE_IPV4_BITS == (pkt_type & NES_PKT_TYPE_IPV4_MASK)) { + if (0 == (cqe_errv & + (NES_NIC_ERRV_BITS_IPV4_CSUM_ERR | + NES_NIC_ERRV_BITS_IPH_ERR | + NES_NIC_ERRV_BITS_WQE_OVERRUN))) { + if (0 == nesvnic->rx_checksum_disabled) { + rx_skb->ip_summed = CHECKSUM_UNNECESSARY; + /* nes_debug(NES_DBG_CQ, "%s: Reporting successfully checksummed IPv4 packet.\n", + nesvnic->netdev->name); */ + } + } else { + nes_debug(NES_DBG_CQ, "%s: unsuccessfully checksummed TCP or UDP packet." 
+ " errv = 0x%X, pkt_type = 0x%X.\n", + nesvnic->netdev->name, cqe_errv, pkt_type); + } + } + /* nes_debug(NES_DBG_CQ, "pkt_type=%x, APBVT_MASK=%x\n", + pkt_type, (pkt_type & NES_PKT_TYPE_APBVT_MASK)); */ + + if (NES_PKT_TYPE_APBVT_BITS == (pkt_type & NES_PKT_TYPE_APBVT_MASK)) { + /* nes_debug(NES_DBG_CQ, "APBVT bit set; Send up NES; nesif_rx\n"); */ + nes_cm_recv(rx_skb, nesvnic->netdev); + } else { + if (cqe_misc & NES_NIC_CQE_TAG_VALID) { + vlan_tag = (u16)(le32_to_cpu( + cq->cq_vbase[head].cqe_words[NES_NIC_CQE_TAG_PKT_TYPE_IDX]) + >> 16); + nes_debug(NES_DBG_CQ, "%s: Reporting stripped VLAN packet. Tag = 0x%04X\n", + nesvnic->netdev->name, vlan_tag); + +#ifdef NES_NAPI + vlan_hwaccel_receive_skb(rx_skb, nesvnic->vlan_grp, vlan_tag); +#else + vlan_hwaccel_rx(rx_skb, nesvnic->vlan_grp, vlan_tag); +#endif + } else { +#ifdef NES_NAPI + netif_receive_skb(rx_skb); +#else + netif_rx(rx_skb); +#endif + } + } + + nesvnic->netdev->last_rx = jiffies; + /* nesvnic->netstats.rx_packets++; */ + /* nesvnic->netstats.rx_bytes += rx_pkt_size; */ + } + + cq->cq_vbase[head].cqe_words[NES_NIC_CQE_MISC_IDX] = 0; + /* Accounting... 
*/ + cqe_count++; + if (++head >= cq_size) + head = 0; + if (cqe_count == 255) { + /* Replenish Nic CQ */ + nes_write32(nesdev->regs+NES_CQE_ALLOC, + cq->cq_number | (cqe_count << 16)); + cqe_count = 0; + } +#ifdef NES_NAPI + if (nesvnic->rx_cqes_completed >= nesvnic->budget) + break; +#endif + } else { + nesvnic->cqes_pending = 0; + break; + } + if (rqes_processed > MAX_RQES_TO_PROCESS) { + break; + } + } while (1); + + cq->cq_head = head; + /* nes_debug(NES_DBG_CQ, "CQ%u Processed = %u cqes, new head = %u.\n", + cq->cq_number, cqe_count, cq->cq_head); */ +#ifdef NES_NAPI + nesvnic->cqe_allocs_pending = cqe_count; +#else + /* Arm the CCQ */ + nes_write32(nesdev->regs+NES_CQE_ALLOC, NES_CQE_ALLOC_NOTIFY_NEXT | + cq->cq_number | (cqe_count << 16)); + nes_read32(nesdev->regs+NES_CQE_ALLOC); +#endif + if (atomic_read(&nesvnic->rx_skbs_needed)) { + nes_replenish_nic_rq(nesvnic); + } +} + + +/** + * nes_cqp_ce_handler + */ +void nes_cqp_ce_handler(struct nes_device *nesdev, struct nes_hw_cq *cq) +{ + u64 u64temp; + unsigned long flags; + struct nes_hw_cqp *cqp = NULL; + struct nes_cqp_request *cqp_request; + struct nes_hw_cqp_wqe *cqp_wqe; + u32 head; + u32 cq_size; + u32 cqe_count=0; + u32 error_code; + /* u32 counter; */ + + head = cq->cq_head; + cq_size = cq->cq_size; + + do { + /* process the CQE */ + /* nes_debug(NES_DBG_CQP, "head=%u cqe_words=%08X\n", head, + le32_to_cpu(cq->cq_vbase[head].cqe_words[NES_CQE_OPCODE_IDX])); */ + + if (le32_to_cpu(cq->cq_vbase[head].cqe_words[NES_CQE_OPCODE_IDX]) & NES_CQE_VALID) { + u64temp = (((u64)(le32_to_cpu(cq->cq_vbase[head].cqe_words[NES_CQE_COMP_COMP_CTX_HIGH_IDX])))<<32) | + ((u64)(le32_to_cpu(cq->cq_vbase[head].cqe_words[NES_CQE_COMP_COMP_CTX_LOW_IDX]))); + cqp = *((struct nes_hw_cqp **)&u64temp); + + error_code = le32_to_cpu(cq->cq_vbase[head].cqe_words[NES_CQE_ERROR_CODE_IDX]); + if (error_code) { + nes_debug(NES_DBG_CQP, "Bad Completion code for opcode 0x%02X from CQP," + " Major/Minor codes = 0x%04X:%04X.\n", + 
le32_to_cpu(cq->cq_vbase[head].cqe_words[NES_CQE_OPCODE_IDX])&0x3f, + (u16)(error_code >> 16), + (u16)error_code); + nes_debug(NES_DBG_CQP, "cqp: qp_id=%u, sq_head=%u, sq_tail=%u\n", + cqp->qp_id, cqp->sq_head, cqp->sq_tail); + } + + u64temp = (((u64)(le32_to_cpu(nesdev->cqp.sq_vbase[cqp->sq_tail].wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX])))<<32) | + ((u64)(le32_to_cpu(nesdev->cqp.sq_vbase[cqp->sq_tail].wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX]))); + cqp_request = *((struct nes_cqp_request **)&u64temp); + if (cqp_request) { + if (cqp_request->waiting) { + /* nes_debug(NES_DBG_CQP, "%s: Waking up requestor\n"); */ + cqp_request->major_code = (u16)(error_code >> 16); + cqp_request->minor_code = (u16)error_code; + barrier(); + cqp_request->request_done = 1; + wake_up(&cqp_request->waitq); + if (atomic_dec_and_test(&cqp_request->refcount)) { + nes_debug(NES_DBG_CQP, "CQP request %p (opcode 0x%02X) freed.\n", + cqp_request, + le32_to_cpu(cqp_request->cqp_wqe.wqe_words[NES_CQP_WQE_OPCODE_IDX])&0x3f); + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + } else { + nes_debug(NES_DBG_CQP, "CQP request %p (opcode 0x%02X) freed.\n", + cqp_request, + le32_to_cpu(cqp_request->cqp_wqe.wqe_words[NES_CQP_WQE_OPCODE_IDX])&0x3f); + if (cqp_request->dynamic) { + kfree(cqp_request); + atomic_inc(&cqp_reqs_dynfreed); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + } else { + wake_up(&nesdev->cqp.waitq); + } + + cq->cq_vbase[head].cqe_words[NES_CQE_OPCODE_IDX] = 0; + nes_write32(nesdev->regs+NES_CQE_ALLOC, cq->cq_number | (1 << 16)); + if (++cqp->sq_tail >= cqp->sq_size) + 
cqp->sq_tail = 0; + + /* Accounting... */ + cqe_count++; + if (++head >= cq_size) + head = 0; + } else { + break; + } + } while (1); + cq->cq_head = head; + + spin_lock_irqsave(&nesdev->cqp.lock, flags); + while ((!list_empty(&nesdev->cqp_pending_reqs)) && + ((((nesdev->cqp.sq_tail+nesdev->cqp.sq_size)-nesdev->cqp.sq_head) & + (nesdev->cqp.sq_size - 1)) != 1)) { + atomic_inc(&cqp_reqs_redriven); + cqp_request = list_entry(nesdev->cqp_pending_reqs.next, + struct nes_cqp_request, list); + list_del_init(&cqp_request->list); + head = nesdev->cqp.sq_head++; + nesdev->cqp.sq_head &= nesdev->cqp.sq_size-1; + cqp_wqe = &nesdev->cqp.sq_vbase[head]; + memcpy(cqp_wqe, &cqp_request->cqp_wqe, sizeof(*cqp_wqe)); + barrier(); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = cpu_to_le32((u32)((u64)cqp_request)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = cpu_to_le32((u32)(((u64)cqp_request)>>32)); + nes_debug(NES_DBG_CQP, "CQP request %p (opcode 0x%02X) put on CQPs SQ wqe%u.\n", + cqp_request, le32_to_cpu(cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX])&0x3f, head); + /* Ring doorbell (1 WQEs) */ + barrier(); + nes_write32(nesdev->regs+NES_WQE_ALLOC, 0x01800000 | nesdev->cqp.qp_id); + } + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + + /* Arm the CCQ */ + nes_write32(nesdev->regs+NES_CQE_ALLOC, NES_CQE_ALLOC_NOTIFY_NEXT | + cq->cq_number); + nes_read32(nesdev->regs+NES_CQE_ALLOC); +} + + +/** + * nes_process_iwarp_aeqe + */ +void nes_process_iwarp_aeqe(struct nes_device *nesdev, struct nes_hw_aeqe *aeqe) +{ + u64 context; + u64 aeqe_context = 0; + unsigned long flags; + struct nes_qp *nesqp; + int resource_allocated; + /* struct iw_cm_id *cm_id; */ + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct ib_event ibevent; + /* struct iw_cm_event cm_event; */ + u32 aeq_info; + u32 next_iwarp_state = 0; + u16 async_event_id; + u8 tcp_state; + u8 iwarp_state; + + nes_debug(NES_DBG_AEQ, "\n"); + aeq_info = 
le32_to_cpu(aeqe->aeqe_words[NES_AEQE_MISC_IDX]); + if ((NES_AEQE_INBOUND_RDMA&aeq_info) || (!(NES_AEQE_QP&aeq_info))) { + context = le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_CTXT_LOW_IDX]); + context += ((u64)le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_CTXT_HIGH_IDX])) << 32; + } else { + aeqe_context = le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_CTXT_LOW_IDX]); + aeqe_context += ((u64)le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_CTXT_HIGH_IDX])) << 32; + context = (u64)nesadapter->qp_table[le32_to_cpu( + aeqe->aeqe_words[NES_AEQE_COMP_QP_CQ_ID_IDX])-NES_FIRST_QPN]; + BUG_ON(!context); + } + + async_event_id = (u16)aeq_info; + tcp_state = (aeq_info & NES_AEQE_TCP_STATE_MASK) >> NES_AEQE_TCP_STATE_SHIFT; + iwarp_state = (aeq_info & NES_AEQE_IWARP_STATE_MASK) >> NES_AEQE_IWARP_STATE_SHIFT; + nes_debug(NES_DBG_AEQ, "aeid = 0x%04X, qp-cq id = %d, aeqe = %p, Tcp state = %d, iWARP state = %d\n", + async_event_id, + le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_QP_CQ_ID_IDX]), aeqe, + tcp_state, iwarp_state); + /* nes_tcp_state_str[tcp_state], + nes_iwarp_state_str[iwarp_state]); */ + + + switch (async_event_id) { + case NES_AEQE_AEID_LLP_FIN_RECEIVED: + nesqp = *((struct nes_qp **)&context); + if (atomic_inc_return(&nesqp->close_timer_started)==1) { + nesqp->cm_id->add_ref(nesqp->cm_id); + nes_add_ref(&nesqp->ibqp); + schedule_nes_timer(nesqp->cm_node, (struct sk_buff *)nesqp, + NES_TIMER_TYPE_CLOSE, 1, 0); + nes_debug(NES_DBG_AEQ, "QP%u Not decrementing QP refcount (%d)," + " need ae to finish up, original_last_aeq = 0x%04X." + " last_aeq = 0x%04X, scheduling timer. 
TCP state = %d\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount), + async_event_id, nesqp->last_aeq, tcp_state); + } + if ((tcp_state != NES_AEQE_TCP_STATE_CLOSE_WAIT) || + (nesqp->ibqp_state != IB_QPS_RTS)) { + /* FIN Received but tcp state or IB state moved on, + should expect a close complete */ + return; + } + case NES_AEQE_AEID_LLP_CLOSE_COMPLETE: + case NES_AEQE_AEID_LLP_CONNECTION_RESET: + case NES_AEQE_AEID_TERMINATE_SENT: + case NES_AEQE_AEID_RDMAP_ROE_BAD_LLP_CLOSE: + case NES_AEQE_AEID_RESET_SENT: + nesqp = *((struct nes_qp **)&context); + if (async_event_id == NES_AEQE_AEID_RESET_SENT) { + tcp_state = NES_AEQE_TCP_STATE_CLOSED; + } + nes_add_ref(&nesqp->ibqp); + spin_lock_irqsave(&nesqp->lock, flags); + nesqp->hw_iwarp_state = iwarp_state; + nesqp->hw_tcp_state = tcp_state; + nesqp->last_aeq = async_event_id; + + if ((tcp_state == NES_AEQE_TCP_STATE_CLOSED) || + (tcp_state == NES_AEQE_TCP_STATE_TIME_WAIT)) { + nesqp->hte_added = 0; + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_debug(NES_DBG_AEQ, "issuing hw modifyqp for QP%u to remove hte\n", + nesqp->hwqp.qp_id); + nes_hw_modify_qp(nesdev, nesqp, + NES_CQP_QP_IWARP_STATE_ERROR | NES_CQP_QP_DEL_HTE, 0); + spin_lock_irqsave(&nesqp->lock, flags); + } + + if ((nesqp->ibqp_state == IB_QPS_RTS) && + ((tcp_state == NES_AEQE_TCP_STATE_CLOSE_WAIT) || + (async_event_id==NES_AEQE_AEID_LLP_CONNECTION_RESET))) { + switch (nesqp->hw_iwarp_state) { + case NES_AEQE_IWARP_STATE_RTS: + next_iwarp_state = NES_CQP_QP_IWARP_STATE_CLOSING; + nesqp->hw_iwarp_state = NES_AEQE_IWARP_STATE_CLOSING; + break; + case NES_AEQE_IWARP_STATE_TERMINATE: + next_iwarp_state = NES_CQP_QP_IWARP_STATE_TERMINATE; + nesqp->hw_iwarp_state = NES_AEQE_IWARP_STATE_TERMINATE; + if (async_event_id == NES_AEQE_AEID_RDMAP_ROE_BAD_LLP_CLOSE) { + next_iwarp_state |= 0x02000000; + nesqp->hw_tcp_state = NES_AEQE_TCP_STATE_CLOSED; + } + break; + default: + next_iwarp_state = 0; + } + spin_unlock_irqrestore(&nesqp->lock, flags); + if 
(next_iwarp_state) { + nes_add_ref(&nesqp->ibqp); + nes_debug(NES_DBG_AEQ, "issuing hw modifyqp for QP%u. next state = 0x%08X," + " also added another reference\n", + nesqp->hwqp.qp_id, next_iwarp_state); + nes_hw_modify_qp(nesdev, nesqp, next_iwarp_state, 0); + } + nes_cm_disconn(nesqp); + } else { + if (async_event_id == NES_AEQE_AEID_LLP_FIN_RECEIVED) { + /* FIN Received but ib state not RTS, + close complete will be on its way */ + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_rem_ref(&nesqp->ibqp); + return; + } + spin_unlock_irqrestore(&nesqp->lock, flags); + if (async_event_id==NES_AEQE_AEID_RDMAP_ROE_BAD_LLP_CLOSE) { + next_iwarp_state = NES_CQP_QP_IWARP_STATE_TERMINATE | 0x02000000; + nesqp->hw_tcp_state = NES_AEQE_TCP_STATE_CLOSED; + nes_debug(NES_DBG_AEQ, "issuing hw modifyqp for QP%u. next state = 0x%08X," + " also added another reference\n", + nesqp->hwqp.qp_id, next_iwarp_state); + nes_hw_modify_qp(nesdev, nesqp, next_iwarp_state, 0); + } + nes_cm_disconn(nesqp); + } + break; + case NES_AEQE_AEID_LLP_TERMINATE_RECEIVED: + nesqp = *((struct nes_qp **)&context); + spin_lock_irqsave(&nesqp->lock, flags); + nesqp->hw_iwarp_state = iwarp_state; + nesqp->hw_tcp_state = tcp_state; + nesqp->last_aeq = async_event_id; + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_LLP_TERMINATE_RECEIVED" + " event on QP%u \n Q2 Data:\n", + nesqp->hwqp.qp_id); + if (nesqp->ibqp.event_handler) { + ibevent.device = nesqp->ibqp.device; + ibevent.element.qp = &nesqp->ibqp; + ibevent.event = IB_EVENT_QP_FATAL; + nesqp->ibqp.event_handler(&ibevent, nesqp->ibqp.qp_context); + } + if ((tcp_state == NES_AEQE_TCP_STATE_CLOSE_WAIT) || + ((nesqp->ibqp_state == IB_QPS_RTS)&& + (async_event_id==NES_AEQE_AEID_LLP_CONNECTION_RESET))) { + nes_add_ref(&nesqp->ibqp); + nes_cm_disconn(nesqp); + } else { + nesqp->in_disconnect = 0; + wake_up(&nesqp->kick_waitq); + } + break; + case NES_AEQE_AEID_LLP_TOO_MANY_RETRIES: + nesqp = *((struct 
nes_qp **)&context); + nes_add_ref(&nesqp->ibqp); + spin_lock_irqsave(&nesqp->lock, flags); + nesqp->hw_iwarp_state = NES_AEQE_IWARP_STATE_ERROR; + nesqp->hw_tcp_state = NES_AEQE_TCP_STATE_CLOSED; + nesqp->last_aeq = async_event_id; + if (nesqp->cm_id) { + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_LLP_TOO_MANY_RETRIES" + " event on QP%u, remote IP = 0x%08X \n", + nesqp->hwqp.qp_id, + ntohl(nesqp->cm_id->remote_addr.sin_addr.s_addr)); + } else { + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_LLP_TOO_MANY_RETRIES" + " event on QP%u \n", + nesqp->hwqp.qp_id); + } + spin_unlock_irqrestore(&nesqp->lock, flags); + next_iwarp_state = NES_CQP_QP_IWARP_STATE_ERROR | NES_CQP_QP_RESET; + nes_hw_modify_qp(nesdev, nesqp, next_iwarp_state, 0); + if (nesqp->ibqp.event_handler) { + ibevent.device = nesqp->ibqp.device; + ibevent.element.qp = &nesqp->ibqp; + ibevent.event = IB_EVENT_QP_FATAL; + nesqp->ibqp.event_handler(&ibevent, nesqp->ibqp.qp_context); + } + break; + case NES_AEQE_AEID_AMP_BAD_STAG_INDEX: + if (NES_AEQE_INBOUND_RDMA&aeq_info) { + nesqp = nesadapter->qp_table[le32_to_cpu( + aeqe->aeqe_words[NES_AEQE_COMP_QP_CQ_ID_IDX])-NES_FIRST_QPN]; + } else { + /* TODO: get the actual WQE and mask off wqe index */ + context &= ~((u64)511); + nesqp = *((struct nes_qp **)&context); + } + spin_lock_irqsave(&nesqp->lock, flags); + nesqp->hw_iwarp_state = iwarp_state; + nesqp->hw_tcp_state = tcp_state; + nesqp->last_aeq = async_event_id; + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_AMP_BAD_STAG_INDEX event on QP%u\n", + nesqp->hwqp.qp_id); + if (nesqp->ibqp.event_handler) { + ibevent.device = nesqp->ibqp.device; + ibevent.element.qp = &nesqp->ibqp; + ibevent.event = IB_EVENT_QP_ACCESS_ERR; + nesqp->ibqp.event_handler(&ibevent, nesqp->ibqp.qp_context); + } + break; + case NES_AEQE_AEID_AMP_UNALLOCATED_STAG: + nesqp = *((struct nes_qp **)&context); + spin_lock_irqsave(&nesqp->lock, flags); + nesqp->hw_iwarp_state 
= iwarp_state; + nesqp->hw_tcp_state = tcp_state; + nesqp->last_aeq = async_event_id; + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_AMP_UNALLOCATED_STAG event on QP%u\n", + nesqp->hwqp.qp_id); + if (nesqp->ibqp.event_handler) { + ibevent.device = nesqp->ibqp.device; + ibevent.element.qp = &nesqp->ibqp; + ibevent.event = IB_EVENT_QP_ACCESS_ERR; + nesqp->ibqp.event_handler(&ibevent, nesqp->ibqp.qp_context); + } + break; + case NES_AEQE_AEID_PRIV_OPERATION_DENIED: + nesqp = nesadapter->qp_table[le32_to_cpu(aeqe->aeqe_words + [NES_AEQE_COMP_QP_CQ_ID_IDX])-NES_FIRST_QPN]; + spin_lock_irqsave(&nesqp->lock, flags); + nesqp->hw_iwarp_state = iwarp_state; + nesqp->hw_tcp_state = tcp_state; + nesqp->last_aeq = async_event_id; + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_PRIV_OPERATION_DENIED event on QP%u," + " nesqp = %p, AE reported %p\n", + nesqp->hwqp.qp_id, nesqp, *((struct nes_qp **)&context)); + if (nesqp->ibqp.event_handler) { + ibevent.device = nesqp->ibqp.device; + ibevent.element.qp = &nesqp->ibqp; + ibevent.event = IB_EVENT_QP_ACCESS_ERR; + nesqp->ibqp.event_handler(&ibevent, nesqp->ibqp.qp_context); + } + break; + case NES_AEQE_AEID_CQ_OPERATION_ERROR: + context <<= 1; + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_CQ_OPERATION_ERROR event on CQ%u, %p\n", + le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_QP_CQ_ID_IDX]), (void *)context); + resource_allocated = nes_is_resource_allocated(nesadapter, nesadapter->allocated_cqs, + le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_QP_CQ_ID_IDX])); + if (resource_allocated) { + printk(KERN_ERR PFX "%s: Processing an NES_AEQE_AEID_CQ_OPERATION_ERROR event on CQ%u\n", + __FUNCTION__, le32_to_cpu(aeqe->aeqe_words[NES_AEQE_COMP_QP_CQ_ID_IDX])); + } + break; + case NES_AEQE_AEID_DDP_UBE_DDP_MESSAGE_TOO_LONG_FOR_AVAILABLE_BUFFER: + nesqp = nesadapter->qp_table[le32_to_cpu( + 
aeqe->aeqe_words[NES_AEQE_COMP_QP_CQ_ID_IDX])-NES_FIRST_QPN]; + spin_lock_irqsave(&nesqp->lock, flags); + nesqp->hw_iwarp_state = iwarp_state; + nesqp->hw_tcp_state = tcp_state; + nesqp->last_aeq = async_event_id; + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_DDP_UBE_DDP_MESSAGE_TOO_LONG" + "_FOR_AVAILABLE_BUFFER event on QP%u\n", + nesqp->hwqp.qp_id); + if (nesqp->ibqp.event_handler) { + ibevent.device = nesqp->ibqp.device; + ibevent.element.qp = &nesqp->ibqp; + ibevent.event = IB_EVENT_QP_ACCESS_ERR; + nesqp->ibqp.event_handler(&ibevent, nesqp->ibqp.qp_context); + } + /* tell cm to disconnect, cm will queue work to thread */ + nes_add_ref(&nesqp->ibqp); + nes_cm_disconn(nesqp); + break; + case NES_AEQE_AEID_DDP_UBE_INVALID_MSN_NO_BUFFER_AVAILABLE: + nesqp = *((struct nes_qp **)&context); + spin_lock_irqsave(&nesqp->lock, flags); + nesqp->hw_iwarp_state = iwarp_state; + nesqp->hw_tcp_state = tcp_state; + nesqp->last_aeq = async_event_id; + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_DDP_UBE_INVALID_MSN" + "_NO_BUFFER_AVAILABLE event on QP%u\n", + nesqp->hwqp.qp_id); + if (nesqp->ibqp.event_handler) { + ibevent.device = nesqp->ibqp.device; + ibevent.element.qp = &nesqp->ibqp; + ibevent.event = IB_EVENT_QP_FATAL; + nesqp->ibqp.event_handler(&ibevent, nesqp->ibqp.qp_context); + } + /* tell cm to disconnect, cm will queue work to thread */ + nes_add_ref(&nesqp->ibqp); + nes_cm_disconn(nesqp); + break; + case NES_AEQE_AEID_LLP_RECEIVED_MPA_CRC_ERROR: + nesqp = *((struct nes_qp **)&context); + spin_lock_irqsave(&nesqp->lock, flags); + nesqp->hw_iwarp_state = iwarp_state; + nesqp->hw_tcp_state = tcp_state; + nesqp->last_aeq = async_event_id; + spin_unlock_irqrestore(&nesqp->lock, flags); + nes_debug(NES_DBG_AEQ, "Processing an NES_AEQE_AEID_LLP_RECEIVED_MPA_CRC_ERROR" + " event on QP%u \n Q2 Data:\n", + nesqp->hwqp.qp_id); + if (nesqp->ibqp.event_handler) { + 
ibevent.device = nesqp->ibqp.device; + ibevent.element.qp = &nesqp->ibqp; + ibevent.event = IB_EVENT_QP_FATAL; + nesqp->ibqp.event_handler(&ibevent, nesqp->ibqp.qp_context); + } + /* tell cm to disconnect, cm will queue work to thread */ + nes_add_ref(&nesqp->ibqp); + nes_cm_disconn(nesqp); + break; + /* TODO: additional AEs need to be here */ + default: + nes_debug(NES_DBG_AEQ, "Processing an iWARP related AE for QP, misc = 0x%04X\n", + async_event_id); + break; + } + +} + + +/** + * nes_iwarp_ce_handler + */ +void nes_iwarp_ce_handler(struct nes_device *nesdev, struct nes_hw_cq *hw_cq) +{ + struct nes_cq *nescq = container_of(hw_cq, struct nes_cq, hw_cq); + + /* nes_debug(NES_DBG_CQ, "Processing completion event for iWARP CQ%u.\n", + nescq->hw_cq.cq_number); */ + nes_write32(nesdev->regs+NES_CQ_ACK, nescq->hw_cq.cq_number); + + if (nescq->ibcq.comp_handler) + nescq->ibcq.comp_handler(&nescq->ibcq, nescq->ibcq.cq_context); + + return; +} + +/** + * nes_manage_apbvt() + */ +int nes_manage_apbvt(struct nes_vnic *nesvnic, u32 accel_local_port, + u32 nic_index, u32 add_port) +{ + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_hw_cqp_wqe *cqp_wqe; + unsigned long flags; + struct nes_cqp_request *cqp_request; + int ret = 0; + u16 major_code; + + /* Send manage APBVT request to CQP */ + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_QP, "Failed to get a cqp_request.\n"); + return -ENOMEM; + } + cqp_request->waiting = 1; + cqp_wqe = &cqp_request->cqp_wqe; + + nes_debug(NES_DBG_QP, "%s APBV for local port=%u(0x%04x), nic_index=%u\n", + (add_port == NES_MANAGE_APBVT_ADD) ? "ADD" : "DEL", + accel_local_port, accel_local_port, nic_index); + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_MANAGE_APBVT | + ((add_port==NES_MANAGE_APBVT_ADD) ? 
NES_CQP_APBVT_ADD : 0)); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = + cpu_to_le32((nic_index << NES_CQP_APBVT_NIC_SHIFT) | accel_local_port); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + + nes_debug(NES_DBG_QP, "Waiting for CQP completion for APBVT.\n"); + + atomic_set(&cqp_request->refcount, 2); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + if (add_port==NES_MANAGE_APBVT_ADD) + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_QP, "Completed, ret=%u, CQP Major:Minor codes = 0x%04X:0x%04X\n", + ret, cqp_request->major_code, cqp_request->minor_code); + major_code = cqp_request->major_code; + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + if (!ret) + return -ETIME; + else if (major_code) + return -EIO; + else + return 0; +} + + +/** + * nes_manage_arp_cache + */ +void nes_manage_arp_cache(struct net_device *netdev, unsigned char *mac_addr, + u32 ip_addr, u32 action) +{ + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev; + struct nes_cqp_request *cqp_request; + int arp_index; + + nesdev = nesvnic->nesdev; + arp_index = nes_arp_table(nesdev, ip_addr, mac_addr, action); + if (arp_index == -1) { + /* nes_debug(NES_DBG_NETDEV, "nes_arp_table call returned -1\n"); */ + return; + } + + /* 
nes_debug(NES_DBG_NETDEV, "Update the ARP entry, arp_index=%d\n", arp_index); */ + + /* update the ARP entry */ + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_NETDEV, "Failed to get a cqp_request.\n"); + return; + } + cqp_request->waiting = 0; + cqp_wqe = &cqp_request->cqp_wqe; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32( + NES_CQP_MANAGE_ARP_CACHE | NES_CQP_ARP_PERM); + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32( + (u32)PCI_FUNC(nesdev->pcidev->devfn) << NES_CQP_ARP_AEQ_INDEX_SHIFT); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(arp_index); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + + if (action == NES_ARP_ADD) { + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32(NES_CQP_ARP_VALID); + cqp_wqe->wqe_words[NES_CQP_ARP_WQE_MAC_ADDR_LOW_IDX] = cpu_to_le32( + (((u32)mac_addr[2]) << 24) | (((u32)mac_addr[3]) << 16) | + (((u32)mac_addr[4]) << 8) | (u32)mac_addr[5]); + cqp_wqe->wqe_words[NES_CQP_ARP_WQE_MAC_HIGH_IDX] = cpu_to_le32( + (((u32)mac_addr[0]) << 16) | (u32)mac_addr[1]); + } else { + cqp_wqe->wqe_words[NES_CQP_ARP_WQE_MAC_ADDR_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_ARP_WQE_MAC_HIGH_IDX] = 0; + } + + nes_debug(NES_DBG_NETDEV, "Not waiting for CQP, cqp.sq_head=%u, cqp.sq_tail=%u\n", + nesdev->cqp.sq_head, nesdev->cqp.sq_tail); + + atomic_set(&cqp_request->refcount, 1); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); +} + + +/** + * flush_wqes + */ +void flush_wqes(struct nes_device *nesdev, struct nes_qp *nesqp, + u32 which_wq, u32 wait_completion) +{ + unsigned long flags; + struct nes_cqp_request 
*cqp_request; + struct nes_hw_cqp_wqe *cqp_wqe; + int ret; + + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_QP, "Failed to get a cqp_request.\n"); + return; + } + if (wait_completion) { + cqp_request->waiting = 1; + atomic_set(&cqp_request->refcount, 2); + } else { + cqp_request->waiting = 0; + } + cqp_wqe = &cqp_request->cqp_wqe; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = + cpu_to_le32(NES_CQP_FLUSH_WQES | which_wq); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesqp->hwqp.qp_id); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + if (wait_completion) { + /* Wait for CQP */ + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_QP, "Flush SQ QP WQEs completed, ret=%u," + " CQP Major:Minor codes = 0x%04X:0x%04X\n", + ret, cqp_request->major_code, cqp_request->minor_code); + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + } +} + From ggrundstrom at neteffect.com Fri Oct 19 13:15:42 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:15:42 -0500 Subject: [ofa-general] [PATCH 7/14 v2] nes: hardware specific includes Message-ID: <200710192015.l9JKFgbO021777@neteffect.com> Hardware structures 
and defines.

Signed-off-by: Glenn Grundstrom
---
--- NULL 1969-12-31 18:00:00.000000000 -0600
+++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_hw.h 2007-10-19 09:43:32.000000000 -0500
@@ -0,0 +1,1124 @@
+/*
+* Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses. You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenIB.org BSD license below:
+*
+* Redistribution and use in source and binary forms, with or
+* without modification, are permitted provided that the following
+* conditions are met:
+*
+* - Redistributions of source code must retain the above
+* copyright notice, this list of conditions and the following
+* disclaimer.
+*
+* - Redistributions in binary form must reproduce the above
+* copyright notice, this list of conditions and the following
+* disclaimer in the documentation and/or other materials
+* provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*/ + +#ifndef __NES_HW_H +#define __NES_HW_H + +enum pci_regs { + NES_INT_STAT = 0x0000, + NES_INT_MASK = 0x0004, + NES_INT_PENDING = 0x0008, + NES_INTF_INT_STAT = 0x000C, + NES_INTF_INT_MASK = 0x0010, + NES_TIMER_STAT = 0x0014, + NES_PERIODIC_CONTROL = 0x0018, + NES_ONE_SHOT_CONTROL = 0x001C, + NES_EEPROM_COMMAND = 0x0020, + NES_EEPROM_DATA = 0x0024, + NES_SOFTWARE_RESET = 0x0030, + NES_CQ_ACK = 0x0034, + NES_WQE_ALLOC = 0x0040, + NES_CQE_ALLOC = 0x0044, +}; + +enum indexed_regs { + NES_IDX_CREATE_CQP_LOW = 0x0000, + NES_IDX_CREATE_CQP_HIGH = 0x0004, + NES_IDX_QP_CONTROL = 0x0040, + NES_IDX_FLM_CONTROL = 0x0080, + NES_IDX_INT_CPU_STATUS = 0x00a0, + NES_IDX_GPIO_CONTROL = 0x00f0, + NES_IDX_GPIO_DATA = 0x00f4, + NES_IDX_TCP_CONFIG0 = 0x01e4, + NES_IDX_TCP_TIMER_CONFIG = 0x01ec, + NES_IDX_TCP_NOW = 0x01f0, + NES_IDX_QP_MAX_CFG_SIZES = 0x0200, + NES_IDX_QP_CTX_SIZE = 0x0218, + NES_IDX_TCP_TIMER_SIZE0 = 0x0238, + NES_IDX_TCP_TIMER_SIZE1 = 0x0240, + NES_IDX_ARP_CACHE_SIZE = 0x0258, + NES_IDX_CQ_CTX_SIZE = 0x0260, + NES_IDX_MRT_SIZE = 0x0278, + NES_IDX_PBL_REGION_SIZE = 0x0280, + NES_IDX_IRRQ_COUNT = 0x02b0, + NES_IDX_RX_WINDOW_BUFFER_PAGE_TABLE_SIZE = 0x02f0, + NES_IDX_RX_WINDOW_BUFFER_SIZE = 0x0300, + NES_IDX_DST_IP_ADDR = 0x0400, + NES_IDX_PCIX_DIAG = 0x08e8, + NES_IDX_MPP_DEBUG = 0x0a00, + NES_IDX_MPP_LB_DEBUG = 0x0b00, + NES_IDX_DENALI_CTL_22 = 0x1058, + NES_IDX_MAC_TX_CONTROL = 0x2000, + NES_IDX_MAC_TX_CONFIG = 0x2004, + NES_IDX_MAC_TX_PAUSE_QUANTA = 0x2008, + NES_IDX_MAC_RX_CONTROL = 0x200c, + NES_IDX_MAC_RX_CONFIG = 0x2010, + NES_IDX_MAC_EXACT_MATCH_BOTTOM = 0x201c, + NES_IDX_MAC_MDIO_CONTROL = 0x2084, + NES_IDX_MAC_TX_OCTETS_LOW = 0x2100, + NES_IDX_MAC_TX_OCTETS_HIGH = 0x2104, + NES_IDX_MAC_TX_FRAMES_LOW = 0x2108, + NES_IDX_MAC_TX_FRAMES_HIGH = 0x210c, + NES_IDX_MAC_TX_PAUSE_FRAMES = 0x2118, + NES_IDX_MAC_TX_ERRORS = 0x2138, + NES_IDX_MAC_RX_OCTETS_LOW = 0x213c, + NES_IDX_MAC_RX_OCTETS_HIGH = 0x2140, + NES_IDX_MAC_RX_FRAMES_LOW = 0x2144, + 
NES_IDX_MAC_RX_FRAMES_HIGH = 0x2148, + NES_IDX_MAC_RX_BC_FRAMES_LOW = 0x214c, + NES_IDX_MAC_RX_MC_FRAMES_HIGH = 0x2150, + NES_IDX_MAC_RX_PAUSE_FRAMES = 0x2154, + NES_IDX_MAC_RX_SHORT_FRAMES = 0x2174, + NES_IDX_MAC_RX_OVERSIZED_FRAMES = 0x2178, + NES_IDX_MAC_RX_JABBER_FRAMES = 0x217c, + NES_IDX_MAC_RX_CRC_ERR_FRAMES = 0x2180, + NES_IDX_MAC_RX_LENGTH_ERR_FRAMES = 0x2184, + NES_IDX_MAC_RX_SYMBOL_ERR_FRAMES = 0x2188, + NES_IDX_MAC_INT_STATUS = 0x21f0, + NES_IDX_MAC_INT_MASK = 0x21f4, + NES_IDX_PHY_PCS_CONTROL_STATUS0 = 0x2800, + NES_IDX_PHY_PCS_CONTROL_STATUS1 = 0x2a00, + NES_IDX_ETH_SERDES_COMMON_CONTROL0 = 0x2808, + NES_IDX_ETH_SERDES_COMMON_CONTROL1 = 0x2a08, + NES_IDX_ETH_SERDES_COMMON_STATUS0 = 0x280c, + NES_IDX_ETH_SERDES_COMMON_STATUS1 = 0x2a0c, + NES_IDX_ETH_SERDES_TX_EMP0 = 0x2810, + NES_IDX_ETH_SERDES_TX_EMP1 = 0x2a10, + NES_IDX_ETH_SERDES_TX_DRIVE0 = 0x2814, + NES_IDX_ETH_SERDES_TX_DRIVE1 = 0x2a14, + NES_IDX_ETH_SERDES_RX_MODE0 = 0x2818, + NES_IDX_ETH_SERDES_RX_MODE1 = 0x2a18, + NES_IDX_ETH_SERDES_RX_SIGDET0 = 0x281c, + NES_IDX_ETH_SERDES_RX_SIGDET1 = 0x2a1c, + NES_IDX_ETH_SERDES_BYPASS0 = 0x2820, + NES_IDX_ETH_SERDES_BYPASS1 = 0x2a20, + NES_IDX_ETH_SERDES_LOOPBACK_CONTROL0 = 0x2824, + NES_IDX_ETH_SERDES_LOOPBACK_CONTROL1 = 0x2a24, + NES_IDX_ETH_SERDES_RX_EQ_CONTROL0 = 0x2828, + NES_IDX_ETH_SERDES_RX_EQ_CONTROL1 = 0x2a28, + NES_IDX_ETH_SERDES_RX_EQ_STATUS0 = 0x282c, + NES_IDX_ETH_SERDES_RX_EQ_STATUS1 = 0x2a2c, + NES_IDX_ETH_SERDES_CDR_RESET0 = 0x2830, + NES_IDX_ETH_SERDES_CDR_RESET1 = 0x2a30, + NES_IDX_ETH_SERDES_CDR_CONTROL0 = 0x2834, + NES_IDX_ETH_SERDES_CDR_CONTROL1 = 0x2a34, + NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE0 = 0x2838, + NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE1 = 0x2a38, + NES_IDX_ENDNODE0_NSTAT_RX_DISCARD = 0x3080, + NES_IDX_ENDNODE0_NSTAT_RX_OCTETS_LO = 0x3000, + NES_IDX_ENDNODE0_NSTAT_RX_OCTETS_HI = 0x3004, + NES_IDX_ENDNODE0_NSTAT_RX_FRAMES_LO = 0x3008, + NES_IDX_ENDNODE0_NSTAT_RX_FRAMES_HI = 0x300c, + NES_IDX_ENDNODE0_NSTAT_TX_OCTETS_LO = 
0x7000, + NES_IDX_ENDNODE0_NSTAT_TX_OCTETS_HI = 0x7004, + NES_IDX_ENDNODE0_NSTAT_TX_FRAMES_LO = 0x7008, + NES_IDX_ENDNODE0_NSTAT_TX_FRAMES_HI = 0x700c, + NES_IDX_CM_CONFIG = 0x5100, + NES_IDX_NIC_LOGPORT_TO_PHYPORT = 0x6000, + NES_IDX_NIC_PHYPORT_TO_USW = 0x6008, + NES_IDX_NIC_ACTIVE = 0x6010, + NES_IDX_NIC_UNICAST_ALL = 0x6018, + NES_IDX_NIC_MULTICAST_ALL = 0x6020, + NES_IDX_NIC_MULTICAST_ENABLE = 0x6028, + NES_IDX_NIC_BROADCAST_ON = 0x6030, + NES_IDX_USED_CHUNKS_TX = 0x60b0, + NES_IDX_TX_POOL_SIZE = 0x60b8, + NES_IDX_QUAD_HASH_TABLE_SIZE = 0x6148, + NES_IDX_PERFECT_FILTER_LOW = 0x6200, + NES_IDX_PERFECT_FILTER_HIGH = 0x6204, + NES_IDX_IPV4_TCP_REXMITS = 0x7080, + NES_IDX_DEBUG_ERROR_CONTROL_STATUS = 0x913c, + NES_IDX_DEBUG_ERROR_MASKS0 = 0x9140, + NES_IDX_DEBUG_ERROR_MASKS1 = 0x9144, + NES_IDX_DEBUG_ERROR_MASKS2 = 0x9148, + NES_IDX_DEBUG_ERROR_MASKS3 = 0x914c, + NES_IDX_DEBUG_ERROR_MASKS4 = 0x9150, + NES_IDX_DEBUG_ERROR_MASKS5 = 0x9154, +}; + +#define NES_IDX_MAC_TX_CONFIG_ENABLE_PAUSE 1 +#define NES_IDX_MPP_DEBUG_PORT_DISABLE_PAUSE (1 << 17) + +enum nes_cqp_opcodes { + NES_CQP_CREATE_QP = 0x00, + NES_CQP_MODIFY_QP = 0x01, + NES_CQP_DESTROY_QP = 0x02, + NES_CQP_CREATE_CQ = 0x03, + NES_CQP_MODIFY_CQ = 0x04, + NES_CQP_DESTROY_CQ = 0x05, + NES_CQP_ALLOCATE_STAG = 0x09, + NES_CQP_REGISTER_STAG = 0x0a, + NES_CQP_QUERY_STAG = 0x0b, + NES_CQP_REGISTER_SHARED_STAG = 0x0c, + NES_CQP_DEALLOCATE_STAG = 0x0d, + NES_CQP_MANAGE_ARP_CACHE = 0x0f, + NES_CQP_SUSPEND_QPS = 0x11, + NES_CQP_UPLOAD_CONTEXT = 0x13, + NES_CQP_CREATE_CEQ = 0x16, + NES_CQP_DESTROY_CEQ = 0x18, + NES_CQP_CREATE_AEQ = 0x19, + NES_CQP_DESTROY_AEQ = 0x1b, + NES_CQP_LMI_ACCESS = 0x20, + NES_CQP_FLUSH_WQES = 0x22, + NES_CQP_MANAGE_APBVT = 0x23 +}; + +enum nes_cqp_wqe_word_idx { + NES_CQP_WQE_OPCODE_IDX = 0, + NES_CQP_WQE_ID_IDX = 1, + NES_CQP_WQE_COMP_CTX_LOW_IDX = 2, + NES_CQP_WQE_COMP_CTX_HIGH_IDX = 3, + NES_CQP_WQE_COMP_SCRATCH_LOW_IDX = 4, + NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX = 5, +}; + +enum 
nes_cqp_cq_wqeword_idx { + NES_CQP_CQ_WQE_PBL_LOW_IDX = 6, + NES_CQP_CQ_WQE_PBL_HIGH_IDX = 7, + NES_CQP_CQ_WQE_CQ_CONTEXT_LOW_IDX = 8, + NES_CQP_CQ_WQE_CQ_CONTEXT_HIGH_IDX = 9, + NES_CQP_CQ_WQE_DOORBELL_INDEX_HIGH_IDX = 10, +}; + +enum nes_cqp_stag_wqeword_idx { + NES_CQP_STAG_WQE_PBL_BLK_COUNT_IDX = 1, + NES_CQP_STAG_WQE_LEN_HIGH_PD_IDX = 6, + NES_CQP_STAG_WQE_LEN_LOW_IDX = 7, + NES_CQP_STAG_WQE_STAG_IDX = 8, + NES_CQP_STAG_WQE_VA_LOW_IDX = 10, + NES_CQP_STAG_WQE_VA_HIGH_IDX = 11, + NES_CQP_STAG_WQE_PA_LOW_IDX = 12, + NES_CQP_STAG_WQE_PA_HIGH_IDX = 13, + NES_CQP_STAG_WQE_PBL_LEN_IDX = 14 +}; + +#define NES_CQP_OP_IWARP_STATE_SHIFT 28 + +enum nes_cqp_qp_bits { + NES_CQP_QP_ARP_VALID = (1<<8), + NES_CQP_QP_WINBUF_VALID = (1<<9), + NES_CQP_QP_CONTEXT_VALID = (1<<10), + NES_CQP_QP_ORD_VALID = (1<<11), + NES_CQP_QP_WINBUF_DATAIND_EN = (1<<12), + NES_CQP_QP_VIRT_WQS = (1<<13), + NES_CQP_QP_DEL_HTE = (1<<14), + NES_CQP_QP_CQS_VALID = (1<<15), + NES_CQP_QP_TYPE_TSA = 0, + NES_CQP_QP_TYPE_IWARP = (1<<16), + NES_CQP_QP_TYPE_CQP = (4<<16), + NES_CQP_QP_TYPE_NIC = (5<<16), + NES_CQP_QP_MSS_CHG = (1<<20), + NES_CQP_QP_STATIC_RESOURCES = (1<<21), + NES_CQP_QP_IGNORE_MW_BOUND = (1<<22), + NES_CQP_QP_VWQ_USE_LMI = (1<<23), + NES_CQP_QP_IWARP_STATE_IDLE = (1<netdev */ + u8 perfect_filter_index; + u8 nic_index; + u8 qp_nic_index[4]; + u8 next_qp_nic_index; + u8 of_device_registered; + u8 rdma_enabled; + u8 cqes_pending; + u8 rx_checksum_disabled; +}; + +struct nes_ib_device { + struct ib_device ibdev; + struct nes_vnic *nesvnic; + + /* Virtual RNIC Limits */ + u32 max_mr; + u32 max_qp; + u32 max_cq; + u32 max_pd; + u32 num_mr; + u32 num_qp; + u32 num_cq; + u32 num_pd; +}; + +#endif /* __NES_HW_H */ + From ggrundstrom at neteffect.com Fri Oct 19 13:17:22 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:17:22 -0500 Subject: [ofa-general] [PATCH 8/14 v2] nes: nic device routines Message-ID: <200710192017.l9JKHMA9021789@neteffect.com> NIC 
device routines.

Signed-off-by: Glenn Grundstrom
---
--- NULL 1969-12-31 18:00:00.000000000 -0600
+++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_nic.c 2007-10-19 09:43:32.000000000 -0500
@@ -0,0 +1,1517 @@
+/*
+ * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "nes.h" + +static struct nic_qp_map nic_qp_mapping_0[] = { + {16,0,0,1},{24,4,0,0},{28,8,0,0},{32,12,0,0}, + {20,2,2,1},{26,6,2,0},{30,10,2,0},{34,14,2,0}, + {18,1,1,1},{25,5,1,0},{29,9,1,0},{33,13,1,0}, + {22,3,3,1},{27,7,3,0},{31,11,3,0},{35,15,3,0} +}; + +static struct nic_qp_map nic_qp_mapping_1[] = { + {18,1,1,1},{25,5,1,0},{29,9,1,0},{33,13,1,0}, + {22,3,3,1},{27,7,3,0},{31,11,3,0},{35,15,3,0} +}; + +static struct nic_qp_map nic_qp_mapping_2[] = { + {20,2,2,1},{26,6,2,0},{30,10,2,0},{34,14,2,0} +}; + +static struct nic_qp_map nic_qp_mapping_3[] = { + {22,3,3,1},{27,7,3,0},{31,11,3,0},{35,15,3,0} +}; + +static struct nic_qp_map nic_qp_mapping_4[] = { + {28,8,0,0},{32,12,0,0} +}; + +static struct nic_qp_map nic_qp_mapping_5[] = { + {29,9,1,0},{33,13,1,0} +}; + +static struct nic_qp_map nic_qp_mapping_6[] = { + {30,10,2,0},{34,14,2,0} +}; + +static struct nic_qp_map nic_qp_mapping_7[] = { + {31,11,3,0},{35,15,3,0} +}; + +static struct nic_qp_map *nic_qp_mapping_per_function[] = { + nic_qp_mapping_0, nic_qp_mapping_1, nic_qp_mapping_2, nic_qp_mapping_3, + nic_qp_mapping_4, nic_qp_mapping_5, nic_qp_mapping_6, nic_qp_mapping_7 +}; + +static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK + | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN; +static int debug = -1; + +static int rdma_enabled = 0; + +static int nes_netdev_open(struct net_device *); +static int nes_netdev_stop(struct net_device *); +static int nes_netdev_start_xmit(struct sk_buff *, struct net_device *); +static struct net_device_stats *nes_netdev_get_stats(struct net_device *); +static void nes_netdev_tx_timeout(struct net_device *); +static int nes_netdev_set_mac_address(struct net_device *, void *); +static int nes_netdev_change_mtu(struct net_device *, int); + +#ifdef NES_NAPI +/** + * nes_netdev_poll + */ +static int nes_netdev_poll(struct 
net_device* netdev, int* budget) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_hw_nic_cq *nescq = &nesvnic->nic_cq; + + nesvnic->budget = *budget; + nesvnic->cqes_pending = 0; + nesvnic->rx_cqes_completed = 0; + nesvnic->cqe_allocs_pending = 0; + + nes_nic_ce_handler(nesdev, nescq); + + netdev->quota -= nesvnic->rx_cqes_completed; + *budget -= nesvnic->rx_cqes_completed; + + if (0 == nesvnic->cqes_pending) { + netif_rx_complete(netdev); + /* clear out completed cqes and arm */ + nes_write32(nesdev->regs+NES_CQE_ALLOC, NES_CQE_ALLOC_NOTIFY_NEXT | + nescq->cq_number | (nesvnic->cqe_allocs_pending << 16)); + nes_read32(nesdev->regs+NES_CQE_ALLOC); + } else { + /* clear out completed cqes but don't arm */ + nes_write32(nesdev->regs+NES_CQE_ALLOC, + nescq->cq_number | (nesvnic->cqe_allocs_pending << 16)); + nes_debug(NES_DBG_NETDEV, "%s: exiting with work pending\n", + nesvnic->netdev->name); + } + + return((0 == nesvnic->cqes_pending) ? 0 : 1); +} +#endif + + +/** + * nes_netdev_open - Activate the network interface; ifconfig + * ethx up. 
+ */ +static int nes_netdev_open(struct net_device *netdev) +{ + u32 macaddr_low; + u16 macaddr_high; + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + int ret; + int i; + struct nes_vnic *first_nesvnic; + u32 nic_active_bit; + u32 nic_active; + + assert(nesdev != NULL); + + first_nesvnic = list_entry(nesdev->nesadapter->nesvnic_list[nesdev->mac_index].next, + struct nes_vnic, list); + + if (netif_msg_ifup(nesvnic)) + printk(KERN_INFO PFX "%s: enabling interface\n", netdev->name); + + ret = nes_init_nic_qp(nesdev, netdev); + if (ret) { + return ret; + } + + netif_stop_queue(netdev); + + if ((!nesvnic->of_device_registered) && (nesvnic->rdma_enabled)) { + nesvnic->nesibdev = nes_init_ofa_device(netdev); + if (nesvnic->nesibdev == NULL) { + printk(KERN_ERR PFX "%s: nesvnic->nesibdev alloc failed", netdev->name); + } else { + nesvnic->nesibdev->nesvnic = nesvnic; + ret = nes_register_ofa_device(nesvnic->nesibdev); + if (ret) { + printk(KERN_ERR PFX "%s: Unable to register RDMA device, ret = %d\n", + netdev->name, ret); + } + } + } + /* Set packet filters */ + nic_active_bit = 1 << nesvnic->nic_index; + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_ACTIVE); + nic_active |= nic_active_bit; + nes_write_indexed(nesdev, NES_IDX_NIC_ACTIVE, nic_active); + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL); + nic_active |= nic_active_bit; + nes_write_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL, nic_active); + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_BROADCAST_ON); + nic_active |= nic_active_bit; + nes_write_indexed(nesdev, NES_IDX_NIC_BROADCAST_ON, nic_active); + + macaddr_high = ((u16)netdev->dev_addr[0]) << 8; + macaddr_high += (u16)netdev->dev_addr[1]; + macaddr_low = ((u32)netdev->dev_addr[2]) << 24; + macaddr_low += ((u32)netdev->dev_addr[3]) << 16; + macaddr_low += ((u32)netdev->dev_addr[4]) << 8; + macaddr_low += (u32)netdev->dev_addr[5]; + +#define NES_MAX_PORT_COUNT 4 + /* Program the various 
MAC regs */ + for (i = 0; i < NES_MAX_PORT_COUNT; i++) { + if (nesvnic->qp_nic_index[i] == 0xf) { + break; + } + nes_debug(NES_DBG_NETDEV, "i=%d, perfect filter table index= %d, PERF FILTER LOW" + " (Addr:%08X) = %08X, HIGH = %08X.\n", + i, nesvnic->qp_nic_index[i], + NES_IDX_PERFECT_FILTER_LOW+((nesvnic->perfect_filter_index + i) * 8), + macaddr_low, + (u32)macaddr_high | NES_MAC_ADDR_VALID | + ((((u32)nesvnic->nic_index) << 16))); + nes_write_indexed(nesdev, + NES_IDX_PERFECT_FILTER_LOW + (nesvnic->qp_nic_index[i] * 8), + macaddr_low); + nes_write_indexed(nesdev, + NES_IDX_PERFECT_FILTER_HIGH + (nesvnic->qp_nic_index[i] * 8), + (u32)macaddr_high | NES_MAC_ADDR_VALID | + ((((u32)nesvnic->nic_index) << 16))); + } + + + if (netdev->ip_ptr) { + struct in_device *ip = netdev->ip_ptr; + struct in_ifaddr *in = NULL; + if (ip && ip->ifa_list) { + in = ip->ifa_list; + nes_manage_arp_cache(nesvnic->netdev, netdev->dev_addr, + ntohl(in->ifa_address), NES_ARP_ADD); + } + } + + nes_write32(nesdev->regs+NES_CQE_ALLOC, NES_CQE_ALLOC_NOTIFY_NEXT | + nesvnic->nic_cq.cq_number); + nes_read32(nesdev->regs+NES_CQE_ALLOC); + + if (first_nesvnic->linkup) { + /* Enable network packets */ + nesvnic->linkup = 1; + netif_start_queue(netdev); + } else { + netif_carrier_off(netdev); + } + nesvnic->netdev_open = 1; + + return 0; +} + + +/** + * nes_netdev_stop + */ +static int nes_netdev_stop(struct net_device *netdev) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + u32 nic_active_mask; + u32 nic_active; + + nes_debug(NES_DBG_SHUTDOWN, "\n"); + if (0 == nesvnic->netdev_open) + return 0; + + if (netif_msg_ifdown(nesvnic)) + printk(KERN_INFO PFX "%s: disabling interface\n", netdev->name); + + /* Disable network packets */ + netif_stop_queue(netdev); + if ((nesdev->netdev[0] == netdev)&(nesvnic->logical_port == nesdev->mac_index)) { + nes_write_indexed(nesdev, + NES_IDX_MAC_INT_MASK+(0x200*nesdev->mac_index), 0xffffffff); + } + + 
nic_active_mask = ~((u32)(1 << nesvnic->nic_index)); + nes_write_indexed(nesdev, NES_IDX_PERFECT_FILTER_HIGH+ + (nesvnic->perfect_filter_index*8), 0); + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_ACTIVE); + nic_active &= nic_active_mask; + nes_write_indexed(nesdev, NES_IDX_NIC_ACTIVE, nic_active); + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL); + nic_active &= nic_active_mask; + nes_write_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL, nic_active); + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_BROADCAST_ON); + nic_active &= nic_active_mask; + nes_write_indexed(nesdev, NES_IDX_NIC_BROADCAST_ON, nic_active); + + + if (nesvnic->of_device_registered) { + nes_destroy_ofa_device(nesvnic->nesibdev); + nesvnic->nesibdev = NULL; + nesvnic->of_device_registered = 0; + rdma_enabled = 0; + } + nes_destroy_nic_qp(nesvnic); + + nesvnic->netdev_open = 0; + + return 0; +} + + +/** + * nes_nic_send + */ +static int nes_nic_send(struct sk_buff *skb, struct net_device *netdev) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_hw_nic *nesnic = &nesvnic->nic; + struct nes_hw_nic_sq_wqe *nic_sqe; +#ifdef NETIF_F_TSO + struct tcphdr *tcph; + /* struct udphdr *udph; */ +#endif +// u64 *wqe_fragment_address; + u16 *wqe_fragment_length; + u32 wqe_misc; + u16 wqe_fragment_index = 1; /* first fragment (0) is used by copy buffer */ + u16 skb_fragment_index; + dma_addr_t bus_address; + + nic_sqe = &nesnic->sq_vbase[nesnic->sq_head]; + wqe_fragment_length = (u16 *)&nic_sqe->wqe_words[NES_NIC_SQ_WQE_LENGTH_0_TAG_IDX]; + + /* setup the VLAN tag if present */ + if (vlan_tx_tag_present(skb)) { + nes_debug(NES_DBG_NIC_TX, "%s: VLAN packet to send... 
VLAN = %08X\n", + netdev->name, vlan_tx_tag_get(skb)); + wqe_misc = NES_NIC_SQ_WQE_TAGVALUE_ENABLE; + wqe_fragment_length[0] = vlan_tx_tag_get(skb); + } else + wqe_misc = 0; + + /* bump past the vlan tag */ + wqe_fragment_length++; + /* wqe_fragment_address = (u64 *)&nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_LOW_IDX]; */ + + if (skb->ip_summed == CHECKSUM_PARTIAL) { +#ifdef OFED_1_2 + tcph = skb->h.th; +#else + tcph = tcp_hdr(skb); +#endif + if (1) { +#ifdef NETIF_F_TSO + if (nes_skb_is_gso(skb)) { + /* nes_debug(NES_DBG_NIC_TX, "%s: TSO request... seg size = %u\n", + netdev->name, nes_skb_is_gso(skb)); */ + wqe_misc |= NES_NIC_SQ_WQE_LSO_ENABLE | + NES_NIC_SQ_WQE_COMPLETION | (u16)nes_skb_is_gso(skb); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_LSO_INFO_IDX] = + cpu_to_le32(((u32)tcph->doff) | + (((u32)(((unsigned char *)tcph) - skb->data)) << 4)); + } else { +#endif + wqe_misc |= NES_NIC_SQ_WQE_COMPLETION; +#ifdef NETIF_F_TSO + } +#endif + } + } else { /* CHECKSUM_HW */ + wqe_misc |= NES_NIC_SQ_WQE_DISABLE_CHKSUM | NES_NIC_SQ_WQE_COMPLETION; + } + + nic_sqe->wqe_words[NES_NIC_SQ_WQE_TOTAL_LENGTH_IDX] = cpu_to_le32(skb->len); + memcpy(&nesnic->first_frag_vbase[nesnic->sq_head].buffer, + skb->data, min(((unsigned int)NES_FIRST_FRAG_SIZE), skb_headlen(skb))); + wqe_fragment_length[0] = cpu_to_le16(min(((unsigned int)NES_FIRST_FRAG_SIZE), + skb_headlen(skb))); + wqe_fragment_length[1] = 0; + if (skb_headlen(skb) > NES_FIRST_FRAG_SIZE) { + if ((skb_shinfo(skb)->nr_frags + 1) > 4) { + nes_debug(NES_DBG_NIC_TX, "%s: Packet with %u fragments not sent, skb_headlen=%u\n", + netdev->name, skb_shinfo(skb)->nr_frags + 2, skb_headlen(skb)); + kfree_skb(skb); + nesvnic->tx_sw_dropped++; + return NETDEV_TX_LOCKED; + } + bus_address = pci_map_single(nesdev->pcidev, skb->data + NES_FIRST_FRAG_SIZE, + skb_headlen(skb) - NES_FIRST_FRAG_SIZE, PCI_DMA_TODEVICE); + wqe_fragment_length[wqe_fragment_index++] = + cpu_to_le16(skb_headlen(skb) - NES_FIRST_FRAG_SIZE); + 
wqe_fragment_length[wqe_fragment_index] = 0; + nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG1_LOW_IDX] = cpu_to_le32((u32)((u64)(bus_address))); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG1_HIGH_IDX] = cpu_to_le32((u32)(((u64)(bus_address))>>32)); + nesnic->tx_skb[nesnic->sq_head] = skb; + } + + if (skb_headlen(skb) == skb->len) { + if (skb_headlen(skb) <= NES_FIRST_FRAG_SIZE) { + nic_sqe->wqe_words[NES_NIC_SQ_WQE_LENGTH_2_1_IDX] = 0; + nesnic->tx_skb[nesnic->sq_head] = NULL; + dev_kfree_skb(skb); + } + } else { + /* Deal with Fragments */ + nesnic->tx_skb[nesnic->sq_head] = skb; + for (skb_fragment_index = 0; skb_fragment_index < skb_shinfo(skb)->nr_frags; + skb_fragment_index++) { + bus_address = pci_map_page( nesdev->pcidev, + skb_shinfo(skb)->frags[skb_fragment_index].page, + skb_shinfo(skb)->frags[skb_fragment_index].page_offset, + skb_shinfo(skb)->frags[skb_fragment_index].size, + PCI_DMA_TODEVICE); + wqe_fragment_length[wqe_fragment_index] = + cpu_to_le16(skb_shinfo(skb)->frags[skb_fragment_index].size); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_LOW_IDX+(2*wqe_fragment_index)] = + cpu_to_le32((u32)((u64)(bus_address))); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_HIGH_IDX+(2*wqe_fragment_index)] = + cpu_to_le32((u32)(((u64)(bus_address))>>32)); + wqe_fragment_index++; + if (wqe_fragment_index < 5) + wqe_fragment_length[wqe_fragment_index] = 0; + } + } + + nic_sqe->wqe_words[NES_NIC_SQ_WQE_MISC_IDX] = cpu_to_le32(wqe_misc); + nesnic->sq_head++; + nesnic->sq_head &= nesnic->sq_size - 1; + + return NETDEV_TX_OK; +} + + +/** + * nes_netdev_start_xmit + */ +static int nes_netdev_start_xmit(struct sk_buff *skb, struct net_device *netdev) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_hw_nic *nesnic = &nesvnic->nic; + struct nes_hw_nic_sq_wqe *nic_sqe; +#ifdef NETIF_F_TSO + struct tcphdr *tcph; + /* struct udphdr *udph; */ +#define NES_MAX_TSO_FRAGS 18 + /* 64K segment plus overflow on each side */ + dma_addr_t 
tso_bus_address[NES_MAX_TSO_FRAGS]; + u32 tso_frag_index; + u32 tso_frag_count; + u32 tso_wqe_length; + u32 curr_tcp_seq; +#endif + u32 wqe_count=1; + u32 send_rc; + struct iphdr *iph; + unsigned long flags; + u16 *wqe_fragment_length; +// u64 *wqe_fragment_address; + /* first fragment (0) is used by copy buffer */ + u16 wqe_fragment_index=1; + u16 hoffset; + u16 nhoffset; +#ifdef NETIF_F_TSO + u16 wqes_needed; + u16 wqes_available; +#endif + u32 old_head; + u32 wqe_misc; + + if (nes_debug_level & NES_DBG_NIC_TX) { + nes_debug(NES_DBG_NIC_TX, "%s Request to tx NIC packet length %u, headlen %u," + " (%u frags), tso_size=%u\n", + netdev->name, skb->len, skb_headlen(skb), + skb_shinfo(skb)->nr_frags, nes_skb_is_gso(skb)); + } + local_irq_save(flags); + if (!spin_trylock(&nesnic->sq_lock)) { + local_irq_restore(flags); + nesvnic->sq_locked++; + return NETDEV_TX_LOCKED; + } + + /* Check if SQ is full */ + if ((((nesnic->sq_tail+(nesnic->sq_size*2))-nesnic->sq_head) & (nesnic->sq_size - 1)) == 1) { + netif_stop_queue(netdev); + spin_unlock_irqrestore(&nesnic->sq_lock, flags); + nesvnic->sq_full++; + return NETDEV_TX_BUSY; + } + + /* Check if too many fragments */ + if (unlikely((skb_shinfo(skb)->nr_frags) > 4)) { +#ifdef NETIF_F_TSO + if (nes_skb_is_gso(skb) && (skb_headlen(skb) <= NES_FIRST_FRAG_SIZE)) { + nesvnic->segmented_tso_requests++; + nesvnic->tso_requests++; + old_head = nesnic->sq_head; + /* Basically 4 fragments available per WQE with extended fragments */ + wqes_needed = skb_shinfo(skb)->nr_frags >> 2; + wqes_needed += (skb_shinfo(skb)->nr_frags&3)?1:0; + wqes_available = (((nesnic->sq_tail+nesnic->sq_size)-nesnic->sq_head) - 1) & + (nesnic->sq_size - 1); + + if (unlikely(wqes_needed > wqes_available)) { + netif_stop_queue(netdev); + spin_unlock_irqrestore(&nesnic->sq_lock, flags); + nes_debug(NES_DBG_NIC_TX, "%s: HNIC SQ full- TSO request has too many frags!\n", + netdev->name); + nesvnic->sq_full++; + return NETDEV_TX_BUSY; + } + /* Map all the buffers */ 
+ for (tso_frag_count=0; tso_frag_count < skb_shinfo(skb)->nr_frags; + tso_frag_count++) { + tso_bus_address[tso_frag_count] = pci_map_page( nesdev->pcidev, + skb_shinfo(skb)->frags[tso_frag_count].page, + skb_shinfo(skb)->frags[tso_frag_count].page_offset, + skb_shinfo(skb)->frags[tso_frag_count].size, + PCI_DMA_TODEVICE); + } + + tso_frag_index = 0; +#ifdef OFED_1_2 + curr_tcp_seq = ntohl(skb->h.th->seq); +#else + curr_tcp_seq = ntohl(tcp_hdr(skb)->seq); +#endif +#ifdef OFED_1_2 + hoffset = skb->h.raw - skb->data; + nhoffset = skb->nh.raw - skb->data; +#else + hoffset = skb_transport_header(skb) - skb->data; + nhoffset = skb_network_header(skb) - skb->data; +#endif + + for (wqe_count=0; wqe_count<((u32)wqes_needed); wqe_count++) { + tso_wqe_length = 0; + nic_sqe = &nesnic->sq_vbase[nesnic->sq_head]; + wqe_fragment_length = + (u16 *)&nic_sqe->wqe_words[NES_NIC_SQ_WQE_LENGTH_0_TAG_IDX]; + /* setup the VLAN tag if present */ + if (vlan_tx_tag_present(skb)) { + nes_debug(NES_DBG_NIC_TX, "%s: VLAN packet to send... 
VLAN = %08X\n", + netdev->name, vlan_tx_tag_get(skb) ); + wqe_misc = NES_NIC_SQ_WQE_TAGVALUE_ENABLE; + wqe_fragment_length[0] = vlan_tx_tag_get(skb); + } else + wqe_misc = 0; + + /* bump past the vlan tag */ + wqe_fragment_length++; +// wqe_fragment_address = +// (u64 *)&nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_LOW_IDX]; + + /* Assumes header totally fits in allocated buffer and is in first fragment */ + if (skb_headlen(skb) > NES_FIRST_FRAG_SIZE) { + nes_debug(NES_DBG_NIC_TX, "ERROR: SKB header too big, skb_headlen=%u, FIRST_FRAG_SIZE=%u\n", + skb_headlen(skb), NES_FIRST_FRAG_SIZE); + nes_debug(NES_DBG_NIC_TX, "%s Request to tx NIC packet length %u, headlen %u," + " (%u frags), tso_size=%u\n", + netdev->name, + skb->len, skb_headlen(skb), + skb_shinfo(skb)->nr_frags, nes_skb_is_gso(skb)); + } + memcpy(&nesnic->first_frag_vbase[nesnic->sq_head].buffer, + skb->data, min(((unsigned int)NES_FIRST_FRAG_SIZE), + skb_headlen(skb))); + iph = (struct iphdr *) + (&nesnic->first_frag_vbase[nesnic->sq_head].buffer[nhoffset]); + tcph = (struct tcphdr *) + (&nesnic->first_frag_vbase[nesnic->sq_head].buffer[hoffset]); + if ((wqe_count+1)!=(u32)wqes_needed) { + tcph->fin = 0; + tcph->psh = 0; + tcph->rst = 0; + tcph->urg = 0; + } + if (wqe_count) { + tcph->syn = 0; + } + tcph->seq = htonl(curr_tcp_seq); + wqe_fragment_length[0] = cpu_to_le16(min(((unsigned int)NES_FIRST_FRAG_SIZE), + skb_headlen(skb))); + + for (wqe_fragment_index = 1; wqe_fragment_index < 5;) { + wqe_fragment_length[wqe_fragment_index] = + cpu_to_le16(skb_shinfo(skb)->frags[tso_frag_index].size); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_LOW_IDX+(2*wqe_fragment_index)] = + cpu_to_le32((u32)((u64)tso_bus_address[tso_frag_index])); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_HIGH_IDX+(2*wqe_fragment_index)] = + cpu_to_le32((u32)(((u64)tso_bus_address[tso_frag_index])>>32)); + wqe_fragment_index++; + tso_wqe_length += skb_shinfo(skb)->frags[tso_frag_index++].size; + if (wqe_fragment_index < 5) + 
wqe_fragment_length[wqe_fragment_index] = 0; + if (tso_frag_index == tso_frag_count) + break; + } + if ((wqe_count+1) == (u32)wqes_needed) { + nesnic->tx_skb[nesnic->sq_head] = skb; + } else { + nesnic->tx_skb[nesnic->sq_head] = NULL; + } + wqe_misc |= NES_NIC_SQ_WQE_COMPLETION | (u16)nes_skb_is_gso(skb); + if ((tso_wqe_length + skb_headlen(skb)) > nes_skb_is_gso(skb)) { + wqe_misc |= NES_NIC_SQ_WQE_LSO_ENABLE; + } else { + iph->tot_len = htons(tso_wqe_length + skb_headlen(skb) - nhoffset); + } + + nic_sqe->wqe_words[NES_NIC_SQ_WQE_MISC_IDX] = cpu_to_le32(wqe_misc); + nic_sqe->wqe_words[NES_NIC_SQ_WQE_LSO_INFO_IDX] = + cpu_to_le32(((u32)tcph->doff) | (((u32)hoffset) << 4)); + + nic_sqe->wqe_words[NES_NIC_SQ_WQE_TOTAL_LENGTH_IDX] = + cpu_to_le32(tso_wqe_length+skb_headlen(skb)); + curr_tcp_seq += tso_wqe_length; + nesnic->sq_head++; + nesnic->sq_head &= nesnic->sq_size-1; + } + } else { +#endif + nesvnic->linearized_skbs++; +#ifdef OFED_1_2 + hoffset = skb->h.raw - skb->data; + nhoffset = skb->nh.raw - skb->data; +#else + hoffset = skb_transport_header(skb) - skb->data; + nhoffset = skb_network_header(skb) - skb->data; +#endif + nes_skb_linearize(skb); +#ifdef OFED_1_2 + skb->h.raw = skb->data + hoffset; + skb->nh.raw = skb->data + nhoffset; +#else + skb_set_transport_header(skb, hoffset); + skb_set_network_header(skb, nhoffset); +#endif + send_rc = nes_nic_send(skb, netdev); + if (send_rc != NETDEV_TX_OK) { + spin_unlock_irqrestore(&nesnic->sq_lock, flags); + return NETDEV_TX_OK; + } +#ifdef NETIF_F_TSO + } +#endif + } else { + send_rc = nes_nic_send(skb, netdev); + if (send_rc != NETDEV_TX_OK) { + spin_unlock_irqrestore(&nesnic->sq_lock, flags); + return NETDEV_TX_OK; + } + } + + barrier(); + + if (wqe_count) + nes_write32(nesdev->regs+NES_WQE_ALLOC, + (wqe_count << 24) | (1 << 23) | nesvnic->nic.qp_id); + + netdev->trans_start = jiffies; + spin_unlock_irqrestore(&nesnic->sq_lock, flags); + + return NETDEV_TX_OK; +} + + +/** + * nes_netdev_get_stats + */ +static 
struct net_device_stats *nes_netdev_get_stats(struct net_device *netdev) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + u64 u64temp; + u32 u32temp; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_DISCARD + (nesvnic->nic_index*0x200)); + nesvnic->netstats.rx_dropped += u32temp; + nesvnic->endnode_nstat_rx_discard += u32temp; + + u64temp = (u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_OCTETS_LO + (nesvnic->nic_index*0x200)); + u64temp += ((u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_OCTETS_HI + (nesvnic->nic_index*0x200))) << 32; + + nesvnic->endnode_nstat_rx_octets += u64temp; + nesvnic->netstats.rx_bytes += u64temp; + + u64temp = (u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_FRAMES_LO + (nesvnic->nic_index*0x200)); + u64temp += ((u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_FRAMES_HI + (nesvnic->nic_index*0x200))) << 32; + + nesvnic->endnode_nstat_rx_frames += u64temp; + nesvnic->netstats.rx_packets += u64temp; + + u64temp = (u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_TX_OCTETS_LO + (nesvnic->nic_index*0x200)); + u64temp += ((u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_TX_OCTETS_HI + (nesvnic->nic_index*0x200))) << 32; + + nesvnic->endnode_nstat_tx_octets += u64temp; + nesvnic->netstats.tx_bytes += u64temp; + + u64temp = (u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_TX_FRAMES_LO + (nesvnic->nic_index*0x200)); + u64temp += ((u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_TX_FRAMES_HI + (nesvnic->nic_index*0x200))) << 32; + + nesvnic->endnode_nstat_tx_frames += u64temp; + nesvnic->netstats.tx_packets += u64temp; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_RX_SHORT_FRAMES + (nesvnic->nesdev->mac_index*0x200)); + nesvnic->netstats.rx_dropped += u32temp; + nesvnic->nesdev->mac_rx_errors += u32temp; + nesvnic->nesdev->mac_rx_short_frames += u32temp; + + u32temp = nes_read_indexed(nesdev, + 
NES_IDX_MAC_RX_OVERSIZED_FRAMES + (nesvnic->nesdev->mac_index*0x200)); + nesvnic->netstats.rx_dropped += u32temp; + nesvnic->nesdev->mac_rx_errors += u32temp; + nesvnic->nesdev->mac_rx_oversized_frames += u32temp; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_RX_JABBER_FRAMES + (nesvnic->nesdev->mac_index*0x200)); + nesvnic->netstats.rx_dropped += u32temp; + nesvnic->nesdev->mac_rx_errors += u32temp; + nesvnic->nesdev->mac_rx_jabber_frames += u32temp; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_RX_SYMBOL_ERR_FRAMES + (nesvnic->nesdev->mac_index*0x200)); + nesvnic->netstats.rx_dropped += u32temp; + nesvnic->nesdev->mac_rx_errors += u32temp; + nesvnic->nesdev->mac_rx_symbol_err_frames += u32temp; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_RX_LENGTH_ERR_FRAMES + (nesvnic->nesdev->mac_index*0x200)); + nesvnic->netstats.rx_length_errors += u32temp; + nesvnic->nesdev->mac_rx_errors += u32temp; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_RX_CRC_ERR_FRAMES + (nesvnic->nesdev->mac_index*0x200)); + nesvnic->nesdev->mac_rx_errors += u32temp; + nesvnic->nesdev->mac_rx_crc_errors += u32temp; + nesvnic->netstats.rx_crc_errors += u32temp; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_TX_ERRORS + (nesvnic->nesdev->mac_index*0x200)); + nesvnic->nesdev->mac_tx_errors += u32temp; + nesvnic->netstats.tx_errors += u32temp; + + return &nesvnic->netstats; +} + + +/** + * nes_netdev_tx_timeout + */ +static void nes_netdev_tx_timeout(struct net_device *netdev) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + + if (netif_msg_timer(nesvnic)) + nes_debug(NES_DBG_NIC_TX, "%s: tx timeout\n", netdev->name); +} + + +/** + * nes_netdev_set_mac_address + */ +static int nes_netdev_set_mac_address(struct net_device *netdev, void *p) +{ + return -1; +} + + +/** + * nes_netdev_change_mtu + */ +static int nes_netdev_change_mtu(struct net_device *netdev, int new_mtu) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + int ret = 0; + + if ((new_mtu < 
ETH_ZLEN) || (new_mtu > max_mtu)) + return -EINVAL; + + netdev->mtu = new_mtu; + nesvnic->max_frame_size = new_mtu+ETH_HLEN; + + if (netif_running(netdev)) { + nes_netdev_stop(netdev); + nes_netdev_open(netdev); + } + + return ret; +} + + +/** + * nes_netdev_exit - destroy network device + */ +void nes_netdev_exit(struct nes_vnic *nesvnic) +{ + struct net_device *netdev = nesvnic->netdev; + struct nes_ib_device *nesibdev = nesvnic->nesibdev; + + nes_debug(NES_DBG_SHUTDOWN, "\n"); + + // destroy the ibdevice if RDMA enabled + if ((nesvnic->rdma_enabled)&&(nesvnic->of_device_registered)) { + nes_destroy_ofa_device( nesibdev ); + nesvnic->of_device_registered = 0; + rdma_enabled = 0; + nesvnic->nesibdev = NULL; + } + unregister_netdev(netdev); + nes_debug(NES_DBG_SHUTDOWN, "\n"); +} + + +#define NES_ETHTOOL_STAT_COUNT 52 +static const char nes_ethtool_stringset[NES_ETHTOOL_STAT_COUNT][ETH_GSTRING_LEN] = { + "Link Change Interrupts", + "Linearized SKBs", + "T/GSO Requests", + "Pause Frames Sent", + "Pause Frames Received", + "Internal Routing Errors", + "SQ SW Dropped SKBs", + "SQ Locked", + "SQ Full", + "Segmented TSO Requests", + "Rx Symbol Errors", + "Rx Jabber Errors", + "Rx Oversized Frames", + "Rx Short Frames", + "Endnode Rx Discards", + "Endnode Rx Octets", + "Endnode Rx Frames", + "Endnode Tx Octets", + "Endnode Tx Frames", + "mh detected", + "mh pauses", + "Retransmission Count", + "CM Connects", + "CM Accepts", + "Disconnects", + "Connected Events", + "Connect Requests", + "CM Rejects", + "ModifyQP Timeouts", + "CreateQPs", + "SW DestroyQPs", + "DestroyQPs", + "CM Closes", + "CM Packets Sent", + "CM Packets Bounced", + "CM Packets Created", + "CM Packets Rcvd", + "CM Packets Dropped", + "CM Packets Retrans", + "CM Listens Created", + "CM Listens Destroyed", + "CM Backlog Drops", + "CM Nodes Created", + "CM Nodes Destroyed", + "CM Accel Drops", + "CM Resets Received", + "CQP Req Allocs", + "CQP Req Deallocs", + "CQP Req Dynamic Allocs", + "CQP Req Dynamic 
Deallocs", + "CQP Req Queues", + "CQP Req Redrives", +}; + + +/** + * nes_netdev_get_rx_csum + */ +static u32 nes_netdev_get_rx_csum (struct net_device *netdev) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + + if (nesvnic->rx_checksum_disabled) + return 0; + else + return 1; +} + + +/** + * nes_netdev_set_rx_csum + */ +static int nes_netdev_set_rx_csum(struct net_device *netdev, u32 enable) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + + if (enable) + nesvnic->rx_checksum_disabled = 0; + else + nesvnic->rx_checksum_disabled = 1; + return 0; +} + + +/** + * nes_netdev_get_stats_count + */ +static int nes_netdev_get_stats_count(struct net_device *netdev) +{ + return NES_ETHTOOL_STAT_COUNT; +} + + +/** + * nes_netdev_get_strings + */ +static void nes_netdev_get_strings(struct net_device *netdev, u32 stringset, + u8 *ethtool_strings) +{ + if (stringset == ETH_SS_STATS) + memcpy(ethtool_strings, + &nes_ethtool_stringset, + sizeof(nes_ethtool_stringset)); +} + + +/** + * nes_netdev_get_ethtool_stats + */ +static void nes_netdev_get_ethtool_stats(struct net_device *netdev, + struct ethtool_stats *target_ethtool_stats, u64 *target_stat_values) +{ + u64 u64temp; + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + u32 nic_count; + u32 u32temp; + + target_ethtool_stats->n_stats = NES_ETHTOOL_STAT_COUNT; + target_stat_values[0] = nesvnic->nesdev->link_status_interrupts; + target_stat_values[1] = nesvnic->linearized_skbs; + target_stat_values[2] = nesvnic->tso_requests; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_TX_PAUSE_FRAMES + (nesvnic->nesdev->mac_index*0x200)); + nesvnic->nesdev->mac_pause_frames_sent += u32temp; + target_stat_values[3] = nesvnic->nesdev->mac_pause_frames_sent; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_RX_PAUSE_FRAMES + (nesvnic->nesdev->mac_index*0x200)); + nesvnic->nesdev->mac_pause_frames_received += u32temp; + + for (nic_count = 0; nic_count < NES_MAX_PORT_COUNT; 
nic_count++) { + if (nesvnic->qp_nic_index[nic_count] == 0xf) + break; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_DISCARD + + (nesvnic->qp_nic_index[nic_count]*0x200)); + nesvnic->netstats.rx_dropped += u32temp; + nesvnic->endnode_nstat_rx_discard += u32temp; + + u64temp = (u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_OCTETS_LO + + (nesvnic->qp_nic_index[nic_count]*0x200)); + u64temp += ((u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_OCTETS_HI + + (nesvnic->qp_nic_index[nic_count]*0x200))) << 32; + + nesvnic->endnode_nstat_rx_octets += u64temp; + nesvnic->netstats.rx_bytes += u64temp; + + u64temp = (u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_FRAMES_LO + + (nesvnic->qp_nic_index[nic_count]*0x200)); + u64temp += ((u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_RX_FRAMES_HI + + (nesvnic->qp_nic_index[nic_count]*0x200))) << 32; + + nesvnic->endnode_nstat_rx_frames += u64temp; + nesvnic->netstats.rx_packets += u64temp; + + u64temp = (u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_TX_OCTETS_LO + + (nesvnic->qp_nic_index[nic_count]*0x200)); + u64temp += ((u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_TX_OCTETS_HI + + (nesvnic->qp_nic_index[nic_count]*0x200))) << 32; + + nesvnic->endnode_nstat_tx_octets += u64temp; + nesvnic->netstats.tx_bytes += u64temp; + + u64temp = (u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_TX_FRAMES_LO + + (nesvnic->qp_nic_index[nic_count]*0x200)); + u64temp += ((u64)nes_read_indexed(nesdev, + NES_IDX_ENDNODE0_NSTAT_TX_FRAMES_HI + + (nesvnic->qp_nic_index[nic_count]*0x200))) << 32; + + nesvnic->endnode_nstat_tx_frames += u64temp; + nesvnic->netstats.tx_packets += u64temp; + + u32temp = nes_read_indexed(nesdev, + NES_IDX_IPV4_TCP_REXMITS + (nesvnic->qp_nic_index[nic_count]*0x200)); + nesvnic->endnode_ipv4_tcp_retransmits += u32temp; + } + + target_stat_values[4] = nesvnic->nesdev->mac_pause_frames_received; + target_stat_values[5] = 
nesdev->nesadapter->nic_rx_eth_route_err; + target_stat_values[6] = nesvnic->tx_sw_dropped; + target_stat_values[7] = nesvnic->sq_locked; + target_stat_values[8] = nesvnic->sq_full; + target_stat_values[9] = nesvnic->segmented_tso_requests; + target_stat_values[10] = nesvnic->nesdev->mac_rx_symbol_err_frames; + target_stat_values[11] = nesvnic->nesdev->mac_rx_jabber_frames; + target_stat_values[12] = nesvnic->nesdev->mac_rx_oversized_frames; + target_stat_values[13] = nesvnic->nesdev->mac_rx_short_frames; + target_stat_values[14] = nesvnic->endnode_nstat_rx_discard; + target_stat_values[15] = nesvnic->endnode_nstat_rx_octets; + target_stat_values[16] = nesvnic->endnode_nstat_rx_frames; + target_stat_values[17] = nesvnic->endnode_nstat_tx_octets; + target_stat_values[18] = nesvnic->endnode_nstat_tx_frames; + target_stat_values[19] = mh_detected; + target_stat_values[20] = mh_pauses_sent; + target_stat_values[21] = nesvnic->endnode_ipv4_tcp_retransmits; + target_stat_values[22] = atomic_read(&cm_connects); + target_stat_values[23] = atomic_read(&cm_accepts); + target_stat_values[24] = atomic_read(&cm_disconnects); + target_stat_values[25] = atomic_read(&cm_connecteds); + target_stat_values[26] = atomic_read(&cm_connect_reqs); + target_stat_values[27] = atomic_read(&cm_rejects); + target_stat_values[28] = atomic_read(&mod_qp_timouts); + target_stat_values[29] = atomic_read(&qps_created); + target_stat_values[30] = atomic_read(&sw_qps_destroyed); + target_stat_values[31] = atomic_read(&qps_destroyed); + target_stat_values[32] = atomic_read(&cm_closes); + target_stat_values[33] = cm_packets_sent; + target_stat_values[34] = cm_packets_bounced; + target_stat_values[35] = cm_packets_created; + target_stat_values[36] = cm_packets_received; + target_stat_values[37] = cm_packets_dropped; + target_stat_values[38] = cm_packets_retrans; + target_stat_values[39] = cm_listens_created; + target_stat_values[40] = cm_listens_destroyed; + target_stat_values[41] = cm_backlog_drops; + 
target_stat_values[42] = atomic_read(&cm_nodes_created); + target_stat_values[43] = atomic_read(&cm_nodes_destroyed); + target_stat_values[44] = atomic_read(&cm_accel_dropped_pkts); + target_stat_values[45] = atomic_read(&cm_resets_recvd); + target_stat_values[46] = atomic_read(&cqp_reqs_allocated); + target_stat_values[47] = atomic_read(&cqp_reqs_freed); + target_stat_values[48] = atomic_read(&cqp_reqs_dynallocated); + target_stat_values[49] = atomic_read(&cqp_reqs_dynfreed); + target_stat_values[50] = atomic_read(&cqp_reqs_queued); + target_stat_values[51] = atomic_read(&cqp_reqs_redriven); + +} + + +/** + * nes_netdev_get_drvinfo + */ +static void nes_netdev_get_drvinfo(struct net_device *netdev, + struct ethtool_drvinfo *drvinfo) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + + strcpy(drvinfo->driver, DRV_NAME); + strcpy(drvinfo->bus_info, pci_name(nesvnic->nesdev->pcidev)); + strcpy(drvinfo->fw_version, "TBD"); + strcpy(drvinfo->version, DRV_VERSION); + drvinfo->n_stats = nes_netdev_get_stats_count(netdev); + drvinfo->testinfo_len = 0; + drvinfo->eedump_len = 0; + drvinfo->regdump_len = 0; +} + + +/** + * nes_netdev_set_coalesce + */ +static int nes_netdev_set_coalesce(struct net_device *netdev, + struct ethtool_coalesce *et_coalesce) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + + /* using this to drive total interrupt moderation */ + nesvnic->nesdev->et_rx_coalesce_usecs_irq = et_coalesce->rx_coalesce_usecs_irq; + if (nesdev->et_rx_coalesce_usecs_irq) { + nes_write32(nesdev->regs+NES_PERIODIC_CONTROL, + 0x80000000 | ((u32)(nesdev->et_rx_coalesce_usecs_irq*8))); + } + return 0; +} + + +/** + * nes_netdev_get_coalesce + */ +static int nes_netdev_get_coalesce(struct net_device *netdev, + struct ethtool_coalesce *et_coalesce) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct ethtool_coalesce temp_et_coalesce; + + memset(&temp_et_coalesce, 0, sizeof(temp_et_coalesce)); + 
temp_et_coalesce.rx_coalesce_usecs_irq = nesvnic->nesdev->et_rx_coalesce_usecs_irq; + memcpy(et_coalesce, &temp_et_coalesce, sizeof(*et_coalesce)); + return 0; +} + + +/** + * nes_netdev_get_pauseparam + */ +static void nes_netdev_get_pauseparam(struct net_device *netdev, + struct ethtool_pauseparam *et_pauseparam) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + + et_pauseparam->autoneg = 0; + et_pauseparam->rx_pause = (nesvnic->nesdev->disable_rx_flow_control==0)?1:0; + et_pauseparam->tx_pause = (nesvnic->nesdev->disable_tx_flow_control==0)?1:0; +} + + +/** + * nes_netdev_set_pauseparam + */ +static int nes_netdev_set_pauseparam(struct net_device *netdev, + struct ethtool_pauseparam *et_pauseparam) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + u32 u32temp; + + if (et_pauseparam->autoneg) { + /* TODO: should return unsupported */ + return 0; + } + if ((et_pauseparam->tx_pause==1) && (nesdev->disable_tx_flow_control==1)) { + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_TX_CONFIG + (nesdev->mac_index*0x200)); + u32temp |= NES_IDX_MAC_TX_CONFIG_ENABLE_PAUSE; + nes_write_indexed(nesdev, + NES_IDX_MAC_TX_CONFIG + (nesdev->mac_index*0x200), u32temp); + nesdev->disable_tx_flow_control = 0; + } else if ((et_pauseparam->tx_pause==0) && (nesdev->disable_tx_flow_control==0)) { + u32temp = nes_read_indexed(nesdev, + NES_IDX_MAC_TX_CONFIG + (nesdev->mac_index*0x200)); + u32temp &= ~NES_IDX_MAC_TX_CONFIG_ENABLE_PAUSE; + nes_write_indexed(nesdev, + NES_IDX_MAC_TX_CONFIG + (nesdev->mac_index*0x200), u32temp); + nesdev->disable_tx_flow_control = 1; + } + if ((et_pauseparam->rx_pause==1) && (nesdev->disable_rx_flow_control==1)) { + u32temp = nes_read_indexed(nesdev, + NES_IDX_MPP_DEBUG + (nesdev->mac_index*0x40)); + u32temp &= ~NES_IDX_MPP_DEBUG_PORT_DISABLE_PAUSE; + nes_write_indexed(nesdev, + NES_IDX_MPP_DEBUG + (nesdev->mac_index*0x40), u32temp); + nesdev->disable_rx_flow_control = 0; + }
else if ((et_pauseparam->rx_pause==0) && (nesdev->disable_rx_flow_control==0)) { + u32temp = nes_read_indexed(nesdev, + NES_IDX_MPP_DEBUG + (nesdev->mac_index*0x40)); + u32temp |= NES_IDX_MPP_DEBUG_PORT_DISABLE_PAUSE; + nes_write_indexed(nesdev, + NES_IDX_MPP_DEBUG + (nesdev->mac_index*0x40), u32temp); + nesdev->disable_rx_flow_control = 1; + } + + return 0; +} + + +/** + * nes_netdev_get_settings + */ +static int nes_netdev_get_settings(struct net_device *netdev, struct ethtool_cmd *et_cmd) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + u16 phy_data; + + et_cmd->duplex = DUPLEX_FULL; + if (nesadapter->OneG_Mode) { + et_cmd->supported = SUPPORTED_1000baseT_Full|SUPPORTED_Autoneg; + et_cmd->advertising = ADVERTISED_1000baseT_Full|ADVERTISED_Autoneg; + et_cmd->speed = SPEED_1000; + nes_read_1G_phy_reg(nesdev, 0, nesadapter->phy_index[nesdev->mac_index], + &phy_data); + if (phy_data&0x1000) { + et_cmd->autoneg = AUTONEG_ENABLE; + } else { + et_cmd->autoneg = AUTONEG_DISABLE; + } + et_cmd->transceiver = XCVR_EXTERNAL; + et_cmd->phy_address = nesadapter->phy_index[nesdev->mac_index]; + } else { + et_cmd->supported = SUPPORTED_10000baseT_Full; + et_cmd->advertising = ADVERTISED_10000baseT_Full; + et_cmd->speed = SPEED_10000; + et_cmd->autoneg = AUTONEG_DISABLE; + et_cmd->transceiver = XCVR_INTERNAL; + et_cmd->phy_address = nesdev->mac_index; + } + et_cmd->port = PORT_MII; + et_cmd->maxtxpkt = 511; + et_cmd->maxrxpkt = 511; + return 0; +} + + +/** + * nes_netdev_set_settings + */ +static int nes_netdev_set_settings(struct net_device *netdev, struct ethtool_cmd *et_cmd) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + u16 phy_data; + + if (nesadapter->OneG_Mode) { + nes_read_1G_phy_reg(nesdev, 0, nesadapter->phy_index[nesdev->mac_index], + &phy_data); + if 
(et_cmd->autoneg) { + /* Turn on Full duplex, Autoneg, and restart autonegotiation */ + phy_data |= 0x1300; + } else { + // Turn off autoneg + phy_data &= ~0x1000; + } + nes_write_1G_phy_reg(nesdev, 0, nesadapter->phy_index[nesdev->mac_index], + phy_data); + } + + return 0; +} + + +/** + * nes_netdev_get_msglevel + */ +static u32 nes_netdev_get_msglevel(struct net_device *netdev) +{ + return nes_debug_level; +} + + +/** + * nes_netdev_set_msglevel + */ +static void nes_netdev_set_msglevel(struct net_device *netdev, u32 level) +{ + nes_debug(NES_DBG_NETDEV, "Setting message level to: %u\n", level); + nes_debug_level = level; +} + + +static struct ethtool_ops nes_ethtool_ops = { + .get_link = ethtool_op_get_link, + .get_settings = nes_netdev_get_settings, + .set_settings = nes_netdev_set_settings, + .get_tx_csum = ethtool_op_get_tx_csum, + .get_rx_csum = nes_netdev_get_rx_csum, + .get_sg = ethtool_op_get_sg, + .get_strings = nes_netdev_get_strings, + .get_stats_count = nes_netdev_get_stats_count, + .get_ethtool_stats = nes_netdev_get_ethtool_stats, + .get_drvinfo = nes_netdev_get_drvinfo, + .get_coalesce = nes_netdev_get_coalesce, + .set_coalesce = nes_netdev_set_coalesce, + .get_pauseparam = nes_netdev_get_pauseparam, + .set_pauseparam = nes_netdev_set_pauseparam, + .get_msglevel = nes_netdev_get_msglevel, + .set_msglevel = nes_netdev_set_msglevel, + .set_tx_csum = ethtool_op_set_tx_csum, + .set_rx_csum = nes_netdev_set_rx_csum, + .set_sg = ethtool_op_set_sg, +#ifdef NETIF_F_TSO + .get_tso = ethtool_op_get_tso, + .set_tso = ethtool_op_set_tso, +#endif +}; + + +#ifdef NETIF_F_HW_VLAN_TX +static void nes_netdev_vlan_rx_register(struct net_device *netdev, struct vlan_group *grp) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + u32 u32temp; + + nesvnic->vlan_grp = grp; + + /* Enable/Disable VLAN Stripping */ + u32temp = nes_read_indexed(nesdev, NES_IDX_PCIX_DIAG); + if (grp) + u32temp &= 0xfdffffff; + else + u32temp 
|= 0x02000000; + + nes_write_indexed(nesdev, NES_IDX_PCIX_DIAG, u32temp); +} +#endif + + +/** + * nes_netdev_init - initialize network device + */ +struct net_device *nes_netdev_init(struct nes_device *nesdev, + void __iomem *mmio_addr) +{ + u64 u64temp; + struct nes_vnic *nesvnic = NULL; + struct net_device *netdev; + struct nic_qp_map *curr_qp_map; + u32 u32temp; + + netdev = alloc_etherdev(sizeof(struct nes_vnic)); + if (!netdev) { + printk(KERN_ERR PFX "nesvnic etherdev alloc failed\n"); + return NULL; + } + + nes_debug(NES_DBG_INIT, "netdev = %p.\n", netdev); + + SET_MODULE_OWNER(netdev); + SET_NETDEV_DEV(netdev, &nesdev->pcidev->dev); + + netdev->open = nes_netdev_open; + netdev->stop = nes_netdev_stop; + netdev->hard_start_xmit = nes_netdev_start_xmit; + netdev->get_stats = nes_netdev_get_stats; + netdev->tx_timeout = nes_netdev_tx_timeout; + netdev->set_mac_address = nes_netdev_set_mac_address; + netdev->change_mtu = nes_netdev_change_mtu; + netdev->watchdog_timeo = NES_TX_TIMEOUT; + netdev->irq = nesdev->pcidev->irq; + netdev->mtu = ETH_DATA_LEN; + netdev->hard_header_len = ETH_HLEN; + netdev->addr_len = ETH_ALEN; + netdev->type = ARPHRD_ETHER; + netdev->features = NETIF_F_HIGHDMA; + netdev->ethtool_ops = &nes_ethtool_ops; +#ifdef NES_NAPI + netdev->poll = nes_netdev_poll; + netdev->weight = 128; +#endif +#ifdef NETIF_F_HW_VLAN_TX + nes_debug(NES_DBG_INIT, "Enabling VLAN Insert/Delete.\n"); + netdev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX; + netdev->vlan_rx_register = nes_netdev_vlan_rx_register; +#endif +#ifdef NETIF_F_LLTX + netdev->features |= NETIF_F_LLTX; +#endif + + /* Fill in the port structure */ + nesvnic = netdev_priv(netdev); + + memset(nesvnic, 0, sizeof(*nesvnic)); + nesvnic->netdev = netdev; + nesvnic->nesdev = nesdev; + nesvnic->msg_enable = netif_msg_init(debug, default_msg); + nesvnic->netdev_index = nesdev->netdev_count; + nesvnic->perfect_filter_index = nesdev->nesadapter->netdev_count; + nesvnic->max_frame_size =
netdev->mtu+netdev->hard_header_len; + + curr_qp_map = nic_qp_mapping_per_function[PCI_FUNC(nesdev->pcidev->devfn)]; + nesvnic->nic.qp_id = curr_qp_map[nesdev->netdev_count].qpid; + nesvnic->nic_index = curr_qp_map[nesdev->netdev_count].nic_index; + nesvnic->logical_port = curr_qp_map[nesdev->netdev_count].logical_port; + + /* Setup the burned in MAC address */ + u64temp = (u64)nesdev->nesadapter->mac_addr_low; + u64temp += ((u64)nesdev->nesadapter->mac_addr_high) << 32; + u64temp += nesvnic->nic_index; + netdev->dev_addr[0] = (u8)(u64temp>>40); + netdev->dev_addr[1] = (u8)(u64temp>>32); + netdev->dev_addr[2] = (u8)(u64temp>>24); + netdev->dev_addr[3] = (u8)(u64temp>>16); + netdev->dev_addr[4] = (u8)(u64temp>>8); + netdev->dev_addr[5] = (u8)u64temp; + + if ((nesvnic->logical_port < 2) || (nesdev->nesadapter->hw_rev != NE020_REV)) { +#ifdef NETIF_F_TSO + netdev->features |= NETIF_F_TSO | NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_IP_CSUM; +#endif +#ifdef NETIF_F_GSO + netdev->features |= NETIF_F_GSO | NETIF_F_TSO | NETIF_F_SG | NETIF_F_IP_CSUM; +#endif + } else { + netdev->features |= NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_IP_CSUM; + } + + nes_debug(NES_DBG_INIT, "nesvnic = %p, reported features = 0x%lX, QPid = %d," + " nic_index = %d, logical_port = %d, mac_index = %d.\n", + nesvnic, (unsigned long)netdev->features, nesvnic->nic.qp_id, + nesvnic->nic_index, nesvnic->logical_port, nesdev->mac_index); + + if (nesvnic->nesdev->nesadapter->port_count == 1) { + nesvnic->qp_nic_index[0] = nesvnic->nic_index; + nesvnic->qp_nic_index[1] = nesvnic->nic_index + 1; + if (nes_drv_opt & NES_DRV_OPT_DUAL_LOGICAL_PORT) { + nesvnic->qp_nic_index[2] = 0xf; + nesvnic->qp_nic_index[3] = 0xf; + } else { + nesvnic->qp_nic_index[2] = nesvnic->nic_index + 2; + nesvnic->qp_nic_index[3] = nesvnic->nic_index + 3; + } + } else { + if (nesvnic->nesdev->nesadapter->port_count == 2) { + nesvnic->qp_nic_index[0] = nesvnic->nic_index; + nesvnic->qp_nic_index[1] = nesvnic->nic_index + 2; + 
nesvnic->qp_nic_index[2] = 0xf; + nesvnic->qp_nic_index[3] = 0xf; + } else { + nesvnic->qp_nic_index[0] = nesvnic->nic_index; + nesvnic->qp_nic_index[1] = 0xf; + nesvnic->qp_nic_index[2] = 0xf; + nesvnic->qp_nic_index[3] = 0xf; + } + } + nesvnic->next_qp_nic_index = 0; + + if (0 == nesdev->netdev_count) { + if (rdma_enabled == 0) { + rdma_enabled = 1; + nesvnic->rdma_enabled = 1; + } + } else { + nesvnic->rdma_enabled = 0; + } + nesvnic->nic_cq.cq_number = nesvnic->nic.qp_id; + spin_lock_init(&nesvnic->tx_lock); + nesdev->netdev[nesdev->netdev_count] = netdev; + + nes_debug(NES_DBG_INIT, "Adding nesvnic (%p) to the adapters nesvnic_list for MAC%d.\n", + nesvnic, nesdev->mac_index); + list_add_tail(&nesvnic->list, &nesdev->nesadapter->nesvnic_list[nesdev->mac_index]); + + if ((0 == nesdev->netdev_count) && + (PCI_FUNC(nesdev->pcidev->devfn) == nesdev->mac_index)) { + nes_debug(NES_DBG_INIT, "Setting up PHY interrupt mask. Using register index 0x%04X\n", + NES_IDX_PHY_PCS_CONTROL_STATUS0+(0x200*(nesvnic->logical_port&1))); + u32temp = nes_read_indexed(nesdev, NES_IDX_PHY_PCS_CONTROL_STATUS0 + + (0x200*(nesvnic->logical_port&1))); + u32temp |= 0x00200000; + nes_write_indexed(nesdev, NES_IDX_PHY_PCS_CONTROL_STATUS0 + + (0x200*(nesvnic->logical_port&1)), u32temp); + u32temp = nes_read_indexed(nesdev, NES_IDX_PHY_PCS_CONTROL_STATUS0 + + (0x200*(nesvnic->logical_port&1)) ); + if (0x0f0f0000 == (u32temp&0x0f1f0000)) { + nes_debug(NES_DBG_INIT, "The Link is UP!!.\n"); + nesvnic->linkup = 1; + } + nes_debug(NES_DBG_INIT, "Setting up MAC interrupt mask.\n"); + /* clear the MAC interrupt status, assumes direct logical to physical mapping */ + u32temp = nes_read_indexed(nesdev, NES_IDX_MAC_INT_STATUS+(0x200*nesvnic->logical_port)); + nes_debug(NES_DBG_INIT, "Phy interrupt status = 0x%X.\n", u32temp); + nes_write_indexed(nesdev, NES_IDX_MAC_INT_STATUS+(0x200*nesvnic->logical_port), u32temp); + + nes_init_phy(nesdev); + nes_write_indexed(nesdev, 
NES_IDX_MAC_INT_MASK+(0x200*nesvnic->logical_port), + ~(NES_MAC_INT_LINK_STAT_CHG | NES_MAC_INT_XGMII_EXT | + NES_MAC_INT_TX_UNDERFLOW | NES_MAC_INT_TX_ERROR)); + } + + return netdev; +} + + +/** + * nes_netdev_destroy - destroy network device structure + */ +void nes_netdev_destroy(struct net_device *netdev) +{ + struct nes_vnic *nesvnic = netdev_priv(netdev); + + /* make sure 'stop' method is called by Linux stack */ + /* nes_netdev_stop(netdev); */ + + list_del(&nesvnic->list); + + if (nesvnic->of_device_registered) { + nes_destroy_ofa_device(nesvnic->nesibdev); + } + + free_netdev(netdev); +} + + +/** + * nes_nic_cm_xmit -- CM calls this to send out pkts + */ +int nes_nic_cm_xmit(struct sk_buff *skb, struct net_device *netdev) +{ + int ret; + + skb->dev = netdev; + ret = dev_queue_xmit(skb); + if (ret) { + nes_debug(NES_DBG_CM, "Bad return code from dev_queue_xmit %d\n", ret); + } + + return ret; +} + From ggrundstrom at neteffect.com Fri Oct 19 13:19:17 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:19:17 -0500 Subject: [ofa-general] [PATCH 9/14 v2] nes: kernel to userspace structures Message-ID: <200710192019.l9JKJHJG021802@neteffect.com> Kernel to userspace includes, structures and defines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_user.h 2007-10-19 09:43:32.000000000 -0500 @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect. All rights reserved. + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef NES_USER_H +#define NES_USER_H + +#include + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. 
+ */ + +struct nes_alloc_ucontext_resp { + __u32 max_pds; /* maximum pds allowed for this user process */ + __u32 max_qps; /* maximum qps allowed for this user process */ + __u32 wq_size; /* size of the WQs (sq+rq) allocated to the mmaped area */ + __u32 reserved; +}; + +struct nes_alloc_pd_resp { + __u32 pd_id; + __u32 mmap_db_index; +}; + +struct nes_create_cq_req { + __u64 user_cq_buffer; +}; + +enum iwnes_memreg_type { + IWNES_MEMREG_TYPE_MEM = 0x0000, + IWNES_MEMREG_TYPE_QP = 0x0001, + IWNES_MEMREG_TYPE_CQ = 0x0002, + IWNES_MEMREG_TYPE_MW = 0x0003, + IWNES_MEMREG_TYPE_FMR = 0x0004, +}; + +struct nes_mem_reg_req { + __u32 reg_type; /* indicates if id is memory, QP or CQ */ + __u32 reserved; +}; + +struct nes_create_cq_resp { + __u32 cq_id; + __u32 cq_size; + __u32 mmap_db_index; + __u32 reserved; +}; + +struct nes_create_qp_resp { + __u32 qp_id; + __u32 actual_sq_size; + __u32 actual_rq_size; + __u32 mmap_sq_db_index; + __u32 mmap_rq_db_index; + __u32 nes_drv_opt; +}; + +#endif /* NES_USER_H */ From ggrundstrom at neteffect.com Fri Oct 19 13:21:16 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:21:16 -0500 Subject: [ofa-general] [PATCH 10/14 v2] nes: eeprom and phy routines Message-ID: <200710192021.l9JKLGFU021817@neteffect.com> Misc eeprom, phy, and debug routines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_utils.c 2007-10-19 09:43:32.000000000 -0500 @@ -0,0 +1,873 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "nes.h" + +#define BITMASK(X) (1L << (X)) +#define NES_CRC_WID 32 + +static u16 nes_read16_eeprom(void __iomem *addr, u16 offset); + +static u32 nesCRCTable[256]; +static u32 nesCRCInitialized = 0; + +static u32 nesCRCWidMask(u32); +static u32 nes_crc_table_gen(u32 *, u32, u32, u32); +static u32 reflect(u32, u32); +static u32 byte_swap(u32, u32); + +u32 mh_detected; +u32 mh_pauses_sent; + +/** + * nes_read_eeprom_values - + */ +int nes_read_eeprom_values(struct nes_device *nesdev, struct nes_adapter *nesadapter) +{ + u32 mac_addr_low; + u16 mac_addr_high; + u16 eeprom_data; + u16 eeprom_offset; + u16 next_section_address; + u32 index; + + /* TODO: deal with EEPROM endian issues */ + if (nesadapter->firmware_eeprom_offset == 0) { + /* Read the EEPROM Parameters */ + eeprom_data = nes_read16_eeprom(nesdev->regs, 0); + nes_debug(NES_DBG_HW, "EEPROM Offset 0 = 0x%04X\n", eeprom_data); + eeprom_offset = 2 + (((eeprom_data & 0x007f) << 3) << + ((eeprom_data & 0x0080) >> 7)); + nes_debug(NES_DBG_HW, "Firmware Offset = 0x%04X\n", eeprom_offset); + nesadapter->firmware_eeprom_offset = eeprom_offset; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset + 4); + if (eeprom_data != 0x5746) { + nes_debug(NES_DBG_HW, "Not a valid Firmware Image = 0x%04X\n", eeprom_data); + return -1; + } + + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset + 2); + nes_debug(NES_DBG_HW, "EEPROM Offset %u = 0x%04X\n", + eeprom_offset + 2, eeprom_data); + eeprom_offset += ((eeprom_data & 0x00ff) << 3) << ((eeprom_data & 0x0100) >> 8); + nes_debug(NES_DBG_HW, "Software Offset = 0x%04X\n", eeprom_offset); + nesadapter->software_eeprom_offset = eeprom_offset; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset + 4); + if (eeprom_data != 0x5753) { + printk("Not a valid Software Image = 0x%04X\n", 
eeprom_data); + return -1; + } + + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset + 2); + nes_debug(NES_DBG_HW, "EEPROM Offset %u (next section) = 0x%04X\n", + eeprom_offset + 2, eeprom_data); + next_section_address = eeprom_offset + (((eeprom_data & 0x00ff) << 3) << + ((eeprom_data & 0x0100) >> 8)); + eeprom_data = nes_read16_eeprom(nesdev->regs, next_section_address + 4); + if (eeprom_data != 0x414d) { + nes_debug(NES_DBG_HW, "EEPROM Changed offset should be 0x414d but was 0x%04X\n", + eeprom_data); + goto no_fw_rev; + } + eeprom_offset = next_section_address; + + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset + 2); + nes_debug(NES_DBG_HW, "EEPROM Offset %u (next section) = 0x%04X\n", + eeprom_offset + 2, eeprom_data); + next_section_address = eeprom_offset + (((eeprom_data & 0x00ff) << 3) << + ((eeprom_data & 0x0100) >> 8)); + eeprom_data = nes_read16_eeprom(nesdev->regs, next_section_address + 4); + if (eeprom_data != 0x4f52) { + nes_debug(NES_DBG_HW, "EEPROM Changed offset should be 0x4f52 but was 0x%04X\n", + eeprom_data); + goto no_fw_rev; + } + eeprom_offset = next_section_address; + + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset + 2); + nes_debug(NES_DBG_HW, "EEPROM Offset %u (next section) = 0x%04X\n", + eeprom_offset + 2, eeprom_data); + next_section_address = eeprom_offset + ((eeprom_data & 0x00ff) << 3); + eeprom_data = nes_read16_eeprom(nesdev->regs, next_section_address + 4); + if (eeprom_data != 0x5746) { + nes_debug(NES_DBG_HW, "EEPROM Changed offset should be 0x5746 but was 0x%04X\n", + eeprom_data); + goto no_fw_rev; + } + eeprom_offset = next_section_address; + + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset + 2); + nes_debug(NES_DBG_HW, "EEPROM Offset %u (next section) = 0x%04X\n", + eeprom_offset + 2, eeprom_data); + next_section_address = eeprom_offset + ((eeprom_data & 0x00ff) << 3); + eeprom_data = nes_read16_eeprom(nesdev->regs, next_section_address + 4); + if (eeprom_data != 0x5753) { 
+ nes_debug(NES_DBG_HW, "EEPROM Changed offset should be 0x5753 but was 0x%04X\n", + eeprom_data); + goto no_fw_rev; + } + eeprom_offset = next_section_address; + + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset + 2); + nes_debug(NES_DBG_HW, "EEPROM Offset %u (next section) = 0x%04X\n", + eeprom_offset + 2, eeprom_data); + next_section_address = eeprom_offset + ((eeprom_data & 0x00ff) << 3); + eeprom_data = nes_read16_eeprom(nesdev->regs, next_section_address + 4); + if (eeprom_data != 0x414d) { + nes_debug(NES_DBG_HW, "EEPROM Changed offset should be 0x414d but was 0x%04X\n", + eeprom_data); + goto no_fw_rev; + } + eeprom_offset = next_section_address; + + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset + 2); + nes_debug(NES_DBG_HW, "EEPROM Offset %u (next section) = 0x%04X\n", + eeprom_offset + 2, eeprom_data); + next_section_address = eeprom_offset + ((eeprom_data & 0x00ff) << 3); + eeprom_data = nes_read16_eeprom(nesdev->regs, next_section_address + 4); + if (eeprom_data != 0x464e) { + nes_debug(NES_DBG_HW, "EEPROM Changed offset should be 0x464e but was 0x%04X\n", + eeprom_data); + goto no_fw_rev; + } + eeprom_data = nes_read16_eeprom(nesdev->regs, next_section_address + 8); + printk(PFX "Firmware version %u.%u\n", (u8)(eeprom_data>>8), (u8)eeprom_data); + + nesadapter->firmware_version = (((u32)(u8)(eeprom_data>>8)) << 16) + + (u32)((u8)eeprom_data); + +no_fw_rev: + /* eeprom is valid */ + eeprom_offset = nesadapter->software_eeprom_offset; + eeprom_offset += 8; + nesadapter->netdev_max = (u8)nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + mac_addr_high = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + mac_addr_low = (u32)nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + mac_addr_low <<= 16; + mac_addr_low += (u32)nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "Base MAC Address = 0x%04X%08X\n", + mac_addr_high, mac_addr_low); + 
nes_debug(NES_DBG_HW, "MAC Address count = %u\n", nesadapter->netdev_max); + + nesadapter->mac_addr_low = mac_addr_low; + nesadapter->mac_addr_high = mac_addr_high; + + /* Read the Phy Type array */ + eeprom_offset += 10; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "PhyType: 0x%04x\n", eeprom_data); + + /* Read the port array */ + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + /* port_count is set by soft reset reg */ + for (index = 0; index < 4; index++) { + nesadapter->ports[index] = eeprom_data & 0x000f; + eeprom_data >>= 4; + } + nes_debug(NES_DBG_HW, "port_count = %u, port 0 -> %u, port 1 -> %u, port 2 -> %u, port 3 -> %u\n", + nesadapter->port_count, + nesadapter->ports[0], nesadapter->ports[1], + nesadapter->ports[2], nesadapter->ports[3]); + + eeprom_offset += 46; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->rx_pool_size = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "rx_pool_size = 0x%08X\n", nesadapter->rx_pool_size); + + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->tx_pool_size = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "tx_pool_size = 0x%08X\n", nesadapter->tx_pool_size); + + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->rx_threshold = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "rx_threshold = 0x%08X\n", nesadapter->rx_threshold); + + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->tcp_timer_core_clk_divisor = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, 
"tcp_timer_core_clk_divisor = 0x%08X\n", + nesadapter->tcp_timer_core_clk_divisor); + + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->iwarp_config = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "iwarp_config = 0x%08X\n", nesadapter->iwarp_config); + + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->cm_config = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "cm_config = 0x%08X\n", nesadapter->cm_config); + + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->sws_timer_config = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "sws_timer_config = 0x%08X\n", nesadapter->sws_timer_config); + + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->tcp_config1 = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "tcp_config1 = 0x%08X\n", nesadapter->tcp_config1); + + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->wqm_wat = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "wqm_wat = 0x%08X\n", nesadapter->wqm_wat); + + eeprom_offset += 2; + eeprom_data = nes_read16_eeprom(nesdev->regs, eeprom_offset); + eeprom_offset += 2; + nesadapter->core_clock = (((u32)eeprom_data) << 16) + + nes_read16_eeprom(nesdev->regs, eeprom_offset); + nes_debug(NES_DBG_HW, "core_clock = 0x%08X\n", nesadapter->core_clock); + } + + nesadapter->phy_index[0] = 4; + nesadapter->phy_index[1] = 5; + nesadapter->phy_index[2] = 6; + nesadapter->phy_index[3] = 7; + + /* TODO: get this from 
EEPROM */ + nesdev->base_doorbell_index = 1; + + return 0; +} + + +/** + * nes_read16_eeprom + */ +static u16 nes_read16_eeprom(void __iomem *addr, u16 offset) +{ + writel(NES_EEPROM_READ_REQUEST + (offset >> 1), + (void __iomem *)addr + NES_EEPROM_COMMAND); + + do { + } while (readl((void __iomem *)addr + NES_EEPROM_COMMAND) & + NES_EEPROM_READ_REQUEST); + + return(readw((void __iomem *)addr + NES_EEPROM_DATA)); +} + + +/** + * nes_write_1G_phy_reg + */ +void nes_write_1G_phy_reg(struct nes_device *nesdev, u8 phy_reg, u8 phy_addr, u16 data) +{ + struct nes_adapter *nesadapter = nesdev->nesadapter; + u32 u32temp; + u32 counter; + unsigned long flags; + + spin_lock_irqsave(&nesadapter->phy_lock, flags); + + nes_write_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL, + 0x50020000 | data | ((u32)phy_reg << 18) | ((u32)phy_addr << 23)); + for (counter = 0; counter < 100 ; counter++) { + udelay(30); + u32temp = nes_read_indexed(nesdev, NES_IDX_MAC_INT_STATUS); + if (u32temp & 1) { + /* nes_debug(NES_DBG_PHY, "Phy interrupt status = 0x%X.\n", u32temp); */ + nes_write_indexed(nesdev, NES_IDX_MAC_INT_STATUS, 1); + break; + } + } + if (!(u32temp & 1)) + nes_debug(NES_DBG_PHY, "Phy is not responding. interrupt status = 0x%X.\n", + u32temp); + + spin_unlock_irqrestore(&nesadapter->phy_lock, flags); +} + + +/** + * nes_read_1G_phy_reg + * This routine only issues the read, the data must be read + * separately. 
+ */ +void nes_read_1G_phy_reg(struct nes_device *nesdev, u8 phy_reg, u8 phy_addr, u16 *data) +{ + struct nes_adapter *nesadapter = nesdev->nesadapter; + u32 u32temp; + u32 counter; + unsigned long flags; + + /* nes_debug(NES_DBG_PHY, "%s: phy addr = %d, mac_index = %d\n", + __FUNCTION__, phy_addr, nesdev->mac_index); */ + spin_lock_irqsave(&nesadapter->phy_lock, flags); + + nes_write_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL, + 0x60020000 | ((u32)phy_reg << 18) | ((u32)phy_addr << 23)); + for (counter = 0; counter < 100 ; counter++) { + udelay(30); + u32temp = nes_read_indexed(nesdev, NES_IDX_MAC_INT_STATUS); + if (u32temp & 1) { + /* nes_debug(NES_DBG_PHY, "Phy interrupt status = 0x%X.\n", u32temp); */ + nes_write_indexed(nesdev, NES_IDX_MAC_INT_STATUS, 1); + break; + } + } + if (!(u32temp & 1)) { + nes_debug(NES_DBG_PHY, "Phy is not responding. interrupt status = 0x%X.\n", + u32temp); + *data = 0xffff; + } else { + *data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); + } + spin_unlock_irqrestore(&nesadapter->phy_lock, flags); +} + + +/** + * nes_write_10G_phy_reg + */ +void nes_write_10G_phy_reg(struct nes_device *nesdev, u16 phy_reg, + u8 phy_addr, u16 data) +{ + u32 dev_addr; + u32 port_addr; + u32 u32temp; + u32 counter; + + dev_addr = 5; + port_addr = 0; + + /* set address */ + nes_write_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL, + 0x00020000 | phy_reg | (dev_addr << 18) | (port_addr << 23)); + for (counter = 0; counter < 100 ; counter++) { + udelay(30); + u32temp = nes_read_indexed(nesdev, NES_IDX_MAC_INT_STATUS); + if (u32temp & 1) { + nes_write_indexed(nesdev, NES_IDX_MAC_INT_STATUS, 1); + break; + } + } + if (!(u32temp & 1)) + nes_debug(NES_DBG_PHY, "Phy is not responding. 
interrupt status = 0x%X.\n", + u32temp); + + /* set data */ + nes_write_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL, + 0x10020000 | data | (dev_addr << 18) | (port_addr << 23)); + for (counter = 0; counter < 100 ; counter++) { + udelay(30); + u32temp = nes_read_indexed(nesdev, NES_IDX_MAC_INT_STATUS); + if (u32temp & 1) { + nes_write_indexed(nesdev, NES_IDX_MAC_INT_STATUS, 1); + break; + } + } + if (!(u32temp & 1)) + nes_debug(NES_DBG_PHY, "Phy is not responding. interrupt status = 0x%X.\n", + u32temp); +} + + +/** + * nes_read_10G_phy_reg + * This routine only issues the read, the data must be read + * separately. + */ +void nes_read_10G_phy_reg(struct nes_device *nesdev, u16 phy_reg, u8 phy_addr) +{ + u32 dev_addr; + u32 port_addr; + u32 u32temp; + u32 counter; + + dev_addr = 5; + port_addr = 0; + + /* set address */ + nes_write_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL, + 0x00020000 | phy_reg | (dev_addr << 18) | (port_addr << 23)); + for (counter = 0; counter < 100 ; counter++) { + udelay(30); + u32temp = nes_read_indexed(nesdev, NES_IDX_MAC_INT_STATUS); + if (u32temp & 1) { + nes_write_indexed(nesdev, NES_IDX_MAC_INT_STATUS, 1); + break; + } + } + if (!(u32temp & 1)) + nes_debug(NES_DBG_PHY, "Phy is not responding. interrupt status = 0x%X.\n", + u32temp); + + /* issue read */ + nes_write_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL, + 0x30020000 | (dev_addr << 18) | (port_addr << 23)); + for (counter = 0; counter < 100 ; counter++) { + udelay(30); + u32temp = nes_read_indexed(nesdev, NES_IDX_MAC_INT_STATUS); + if (u32temp & 1) { + nes_write_indexed(nesdev, NES_IDX_MAC_INT_STATUS, 1); + break; + } + } + if (!(u32temp & 1)) + nes_debug(NES_DBG_PHY, "Phy is not responding. 
interrupt status = 0x%X.\n", + u32temp); +} + + +/** + * nes_arp_table + */ +int nes_arp_table(struct nes_device *nesdev, u32 ip_addr, u8 *mac_addr, u32 action) +{ + struct nes_adapter *nesadapter = nesdev->nesadapter; + int arp_index; + int err = 0; + + for (arp_index = 0; (u32) arp_index < nesadapter->arp_table_size; arp_index++) { + if (nesadapter->arp_table[arp_index].ip_addr == ip_addr) + break; + } + + if (action == NES_ARP_ADD) { + if (arp_index != nesadapter->arp_table_size) { + return -1; + } + + arp_index = 0; + err = nes_alloc_resource(nesadapter, nesadapter->allocated_arps, + nesadapter->arp_table_size, &arp_index, &nesadapter->next_arp_index); + if (err) { + nes_debug(NES_DBG_NETDEV, "nes_alloc_resource returned error = %u\n", err); + return err; + } + nes_debug(NES_DBG_NETDEV, "ADD, arp_index=%d\n", arp_index); + + nesadapter->arp_table[arp_index].ip_addr = ip_addr; + memcpy(nesadapter->arp_table[arp_index].mac_addr, mac_addr, ETH_ALEN); + return arp_index; + } + + /* DELETE or RESOLVE */ + if (arp_index == nesadapter->arp_table_size) { + nes_debug(NES_DBG_NETDEV, "mac address not in ARP table - cannot delete or resolve\n"); + return -1; + } + + if (action == NES_ARP_RESOLVE) { + nes_debug(NES_DBG_NETDEV, "RESOLVE, arp_index=%d\n", arp_index); + return arp_index; + } + + if (action == NES_ARP_DELETE) { + nes_debug(NES_DBG_NETDEV, "DELETE, arp_index=%d\n", arp_index); + nesadapter->arp_table[arp_index].ip_addr = 0; + memset(nesadapter->arp_table[arp_index].mac_addr, 0x00, ETH_ALEN); + nes_free_resource(nesadapter, nesadapter->allocated_arps, arp_index); + return arp_index; + } + + return -1; +} + + +/** + * nes_mh_fix + */ +void nes_mh_fix(unsigned long parm) +{ + unsigned long flags; + struct nes_device *nesdev = (struct nes_device *)parm; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_vnic *nesvnic; + u32 used_chunks_tx; + u32 temp_used_chunks_tx; + u32 temp_last_used_chunks_tx; + u32 used_chunks_mask; + u32 mac_tx_frames_low; + 
u32 mac_tx_frames_high; + u32 mac_tx_pauses; + u32 serdes_status; + u32 reset_value; + u32 tx_control; + u32 tx_config; + u32 tx_pause_quanta; + u32 rx_control; + u32 rx_config; + u32 mac_exact_match; + u32 mpp_debug; + u32 i=0; + u32 chunks_tx_progress = 0; + + spin_lock_irqsave(&nesadapter->phy_lock, flags); + if ((nesadapter->mac_sw_state[0] != NES_MAC_SW_IDLE) || (nesadapter->mac_link_down[0])) { + spin_unlock_irqrestore(&nesadapter->phy_lock, flags); + goto no_mh_work; + } + nesadapter->mac_sw_state[0] = NES_MAC_SW_MH; + spin_unlock_irqrestore(&nesadapter->phy_lock, flags); + do { + mac_tx_frames_low = nes_read_indexed(nesdev, NES_IDX_MAC_TX_FRAMES_LOW); + mac_tx_frames_high = nes_read_indexed(nesdev, NES_IDX_MAC_TX_FRAMES_HIGH); + mac_tx_pauses = nes_read_indexed(nesdev, NES_IDX_MAC_TX_PAUSE_FRAMES); + used_chunks_tx = nes_read_indexed(nesdev, NES_IDX_USED_CHUNKS_TX); + nesdev->mac_pause_frames_sent += mac_tx_pauses; + used_chunks_mask = 0; + temp_used_chunks_tx = used_chunks_tx; + temp_last_used_chunks_tx = nesdev->last_used_chunks_tx; + + if (nesdev->netdev[0]) { + nesvnic = netdev_priv(nesdev->netdev[0]); + } else { + break; + } + + for (i=0; i<4; i++) { + used_chunks_mask <<= 8; + if (nesvnic->qp_nic_index[i] != 0xff) { + used_chunks_mask |= 0xff; + if ((temp_used_chunks_tx&0xff)<(temp_last_used_chunks_tx&0xff)) { + chunks_tx_progress = 1; + } + } + temp_used_chunks_tx >>= 8; + temp_last_used_chunks_tx >>= 8; + } + if ((mac_tx_frames_low) || (mac_tx_frames_high) || + (!(used_chunks_tx&used_chunks_mask)) || + (!(nesdev->last_used_chunks_tx&used_chunks_mask)) || + (chunks_tx_progress) ) { + nesdev->last_used_chunks_tx = used_chunks_tx; + break; + } + nesdev->last_used_chunks_tx = used_chunks_tx; + barrier(); + + nes_write_indexed(nesdev, NES_IDX_MAC_TX_CONTROL, 0x00000005); + mh_pauses_sent++; + mac_tx_pauses = nes_read_indexed(nesdev, NES_IDX_MAC_TX_PAUSE_FRAMES); + if (mac_tx_pauses) { + nesdev->mac_pause_frames_sent += mac_tx_pauses; + break; + } + + 
tx_control = nes_read_indexed(nesdev, NES_IDX_MAC_TX_CONTROL); + tx_config = nes_read_indexed(nesdev, NES_IDX_MAC_TX_CONFIG); + tx_pause_quanta = nes_read_indexed(nesdev, NES_IDX_MAC_TX_PAUSE_QUANTA); + rx_control = nes_read_indexed(nesdev, NES_IDX_MAC_RX_CONTROL); + rx_config = nes_read_indexed(nesdev, NES_IDX_MAC_RX_CONFIG); + mac_exact_match = nes_read_indexed(nesdev, NES_IDX_MAC_EXACT_MATCH_BOTTOM); + mpp_debug = nes_read_indexed(nesdev, NES_IDX_MPP_DEBUG); + + /* one last ditch effort to avoid a false positive */ + mac_tx_pauses = nes_read_indexed(nesdev, NES_IDX_MAC_TX_PAUSE_FRAMES); + if (mac_tx_pauses) { + nesdev->last_mac_tx_pauses = nesdev->mac_pause_frames_sent; + nes_debug(NES_DBG_HW, "failsafe caught slow outbound pause\n"); + break; + } + mh_detected++; + + nes_write_indexed(nesdev, NES_IDX_MAC_TX_CONTROL, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_MAC_TX_CONFIG, 0x00000000); + reset_value = nes_read32(nesdev->regs+NES_SOFTWARE_RESET); + + nes_write32(nesdev->regs+NES_SOFTWARE_RESET, reset_value | 0x0000001d); + + while (((nes_read32(nesdev->regs+NES_SOFTWARE_RESET) + & 0x00000040) != 0x00000040) && (i++ < 5000)) { + /* mdelay(1); */ + } + + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0, 0x00000008); + serdes_status = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS0); + + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP0, 0x000bdef7); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_DRIVE0, 0x9ce73000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_MODE0, 0x0ff00000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_SIGDET0, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_BYPASS0, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_LOOPBACK_CONTROL0, 0x00000000); + if (nesadapter->OneG_Mode) { + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_EQ_CONTROL0, 0xf0182222); + } else { + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_RX_EQ_CONTROL0, 0xf0042222); + } + serdes_status = 
nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_RX_EQ_STATUS0); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL0, 0x000000ff); + + nes_write_indexed(nesdev, NES_IDX_MAC_TX_CONTROL, tx_control); + nes_write_indexed(nesdev, NES_IDX_MAC_TX_CONFIG, tx_config); + nes_write_indexed(nesdev, NES_IDX_MAC_TX_PAUSE_QUANTA, tx_pause_quanta); + nes_write_indexed(nesdev, NES_IDX_MAC_RX_CONTROL, rx_control); + nes_write_indexed(nesdev, NES_IDX_MAC_RX_CONFIG, rx_config); + nes_write_indexed(nesdev, NES_IDX_MAC_EXACT_MATCH_BOTTOM, mac_exact_match); + nes_write_indexed(nesdev, NES_IDX_MPP_DEBUG, mpp_debug); + + } while (0); + + nesadapter->mac_sw_state[0] = NES_MAC_SW_IDLE; +no_mh_work: + nesdev->nesadapter->mh_timer.expires = jiffies + (HZ/5); + add_timer(&nesdev->nesadapter->mh_timer); +} + + +/* +"Everything you wanted to know about CRC algorithms, but were afraid to ask + for fear that errors in your understanding might be detected." Version : 3. +Date : 19 August 1993. +Author : Ross N. Williams. +Net : ross at guest.adelaide.edu.au. +FTP : ftp.adelaide.edu.au/pub/rocksoft/crc_v3.txt +Company : Rocksoft Pty Ltd. +Snail : 16 Lerwick Avenue, Hazelwood Park 5066, Australia. +Fax : +61 8 373-4911 (c/- Internode Systems Pty Ltd). +Phone : +61 8 379-9217 (10am to 10pm Adelaide Australia time). +Note : "Rocksoft" is a trademark of Rocksoft Pty Ltd, Australia. +Status : Copyright (C) Ross Williams, 1993. However, permission is granted to + make and distribute verbatim copies of this document provided that this information + block and copyright notice is included. Also, the C code modules included in this + document are fully public domain. + +Thanks : Thanks to Jean-loup Gailly (jloup at chorus.fr) and Mark Adler + (me at quest.jpl.nasa.gov) who both proof read this document and picked + out lots of nits as well as some big fat bugs. + +The current web page for this seems to be http://www.ross.net/crc/crcpaper.html. 
+ +*/ + +/****************************************************************************/ +/* Generate width mask */ +/****************************************************************************/ +/* */ +/* Returns a longword whose value is (2^p_cm->cm_width)-1. */ +/* The trick is to do this portably (e.g. without doing <<32). */ +/* */ +/* Author: Tristan Gross */ +/* Source: "A PAINLESS GUIDE TO CRC ERROR DETECTION ALGORITHMS" */ +/* Ross N. Williams */ +/* http://www.rocksoft.com */ +/* */ +/****************************************************************************/ + +static u32 nesCRCWidMask (u32 width) +{ + return(((1L<<(((u32)width)-1))-1L)<<1)|1L; +} + + +/****************************************************************************/ +/* Generate CRC table */ +/****************************************************************************/ +/* */ +/* Source: "A PAINLESS GUIDE TO CRC ERROR DETECTION ALGORITHMS" */ +/* Ross N. Williams */ +/* http://www.rocksoft.com */ +/* */ +/****************************************************************************/ +static u32 nes_crc_table_gen ( u32 *pCRCTable, + u32 poly, + u32 order, + u32 reflectIn) +{ + u32 i; + u32 reg; + u32 byte; + u32 topbit = BITMASK(NES_CRC_WID-1); + u32 tmp; + + for (byte=0;byte<256;byte++) { + + // If we need to create a reflected table we must reflect the index (byte) and + // reflect the final reg + tmp = (reflectIn) ? reflect(byte,8): byte; + + reg = tmp << (NES_CRC_WID-8); + + for (i=0; i<8; i++) { + if (reg & topbit) { + reg = (reg << 1) ^ poly; + } else { + reg <<= 1; + } + } + + reg = (reflectIn) ? reflect(reg,order): reg; + pCRCTable[byte] = reg & nesCRCWidMask(NES_CRC_WID); + } + + return 0; +} + + +/****************************************************************************/ +/* Perform 32 bit based CRC calculation */ +/****************************************************************************/ +/* */ +/* Source: "A PAINLESS GUIDE TO CRC ERROR DETECTION ALGORITHMS" */ +/* Ross N. 
Williams */ +/* http://www.rocksoft.com */ +/* */ +/* This performs a standard 32 bit crc on an array of arbitrary length */ +/* with an arbitrary initial value and passed generator polynomial */ +/* in the form of a crc table. */ +/* */ +/****************************************************************************/ +static u32 reflect (u32 data, u32 num) +{ + /* Reflects the lower num bits in 'data' around their center point. */ + u32 i; + u32 j = 1; + u32 result = 0; + + for (i=(u32)1<<(num-1); i; i>>=1) { + if (data & i) result|=j; + j <<= 1; + } + return result; +} + + +/** + * byte_swap + */ +static u32 byte_swap (u32 data, u32 num) +{ + u32 i; + u32 result = 0; + + if (num%16) { + dprintk("\nbyte_swap: ERROR: num is not an even number of bytes\n"); + /* ASSERT(0); */ + } + + for (i = 0; i < num; i += 8) { + result |= (0xFF & (data >> i)) << (num-8-i); + } + + return result; +} + + +/** + * nes_crc32 - + * This is a reflected table algorithm. ReflectIn basically + * means to reflect each incoming byte of the data. But to make + * things more complicated, we can instead reflect the initial + * value, the final crc, and shift data to the right using a + * reflected pCRCTable. CRC is FUN!! + */ +u32 nes_crc32 ( u32 reverse, + u32 initialValue, + u32 finalXOR, + u32 messageLength, + u8 *pMessage, + u32 order, + u32 reflectIn, + u32 reflectOut) + +{ + u8 *pBlockAddr = pMessage; + u32 mlen = messageLength; + u32 crc; + + if (0 == nesCRCInitialized) { + nes_crc_table_gen( &nesCRCTable[0], CRC32C_POLY, ORDER, REFIN); + nesCRCInitialized = 1; + } + + crc = (reflectIn) ? 
reflect(initialValue,order): initialValue; + + while (mlen--) { + /* printf("byte = %x, index = %u, crctable[index] = %x\n", + *pBlockAddr, (crc & 0xffL) ^ *pBlockAddr, + nesCRCTable[(crc & 0xffL) ^ *pBlockAddr]); + */ + if (reflectIn) { + crc = nesCRCTable[(crc & 0xffL ) ^ *pBlockAddr++] ^ (crc >> 8); + } else { + crc = nesCRCTable[((crc>>24) ^ *pBlockAddr++) & 0xFFL] ^ (crc << 8); + } + } + + /* if reflectOut and reflectIn are both set, we don't */ + /* do anything since reflecting twice effectively does nothing. */ + crc = ((reflectIn)^(reflectOut)) ? reflect(crc,order): crc; + + crc = crc^finalXOR; + + /* We don't really use this, but it is here for completeness */ + crc = (reverse) ? byte_swap(crc,32): crc; + + return crc; +} + From ggrundstrom at neteffect.com Fri Oct 19 13:23:15 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:23:15 -0500 Subject: [ofa-general] [PATCH 11/14 v2] nes: OpenFabrics kernel verbs Message-ID: <200710192023.l9JKNFov021830@neteffect.com> OpenFabrics kernel verbs provider routines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_verbs.c 2007-10-19 10:07:14.000000000 -0500 @@ -0,0 +1,3836 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#include +#include +#include +#include + +#include +#include +#include +#include "nes.h" +#ifndef OFED_1_2 +#include +#endif + +atomic_t mod_qp_timouts; +atomic_t qps_created; +atomic_t sw_qps_destroyed; + + +/** + * nes_alloc_mw + */ +static struct ib_mw *nes_alloc_mw(struct ib_pd *ibpd) { + unsigned long flags; + struct nes_pd *nespd = to_nespd(ibpd); + struct nes_vnic *nesvnic = to_nesvnic(ibpd->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_cqp_request *cqp_request; + struct nes_mr *nesmr; + struct ib_mw *ibmw; + struct nes_hw_cqp_wqe *cqp_wqe; + int ret; + u32 stag; + u32 stag_index = 0; + u32 next_stag_index = 0; + u32 driver_key = 0; + u8 stag_key = 0; + + get_random_bytes(&next_stag_index, sizeof(next_stag_index)); + stag_key = (u8)next_stag_index; + + driver_key = 0; + + next_stag_index >>= 8; + next_stag_index %= nesadapter->max_mr; + + ret = nes_alloc_resource(nesadapter, nesadapter->allocated_mrs, + nesadapter->max_mr, &stag_index, &next_stag_index); + if (ret) { + return ERR_PTR(ret); + } + + nesmr = kmalloc(sizeof(*nesmr), GFP_KERNEL); + if (!nesmr) { + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + return ERR_PTR(-ENOMEM); + } + + 
stag = stag_index << 8; + stag |= driver_key; + stag += (u32)stag_key; + + nes_debug(NES_DBG_MR, "Registering STag 0x%08X, index = 0x%08X\n", + stag, stag_index); + + /* Register the region with the adapter */ + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + kfree(nesmr); + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + return ERR_PTR(-ENOMEM); + } + cqp_request->waiting = 1; + cqp_wqe = &cqp_request->cqp_wqe; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = + cpu_to_le32( NES_CQP_ALLOCATE_STAG | NES_CQP_STAG_RIGHTS_REMOTE_READ | + NES_CQP_STAG_RIGHTS_REMOTE_WRITE | NES_CQP_STAG_VA_TO | + NES_CQP_STAG_REM_ACC_EN); + + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = + cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = + cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_LEN_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_LEN_HIGH_PD_IDX] = + cpu_to_le32(nespd->pd_id&0x00007fff); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_STAG_IDX] = cpu_to_le32(stag); + + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PA_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PA_HIGH_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_BLK_COUNT_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_LEN_IDX] = 0; + + atomic_set(&cqp_request->refcount, 2); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + /* Wait for CQP */ + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_MR, "Register STag 0x%08X completed, wait_event_timeout ret = %u," + " CQP Major:Minor codes = 0x%04X:0x%04X.\n", + stag, ret, cqp_request->major_code, cqp_request->minor_code); + if ((!ret) || 
(cqp_request->major_code)) { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + kfree(nesmr); + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + if (!ret) { + return ERR_PTR(-ETIME); + } else { + return ERR_PTR(-ENOMEM); + } + } else { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + } + + nesmr->ibmw.rkey = stag; + nesmr->mode = IWNES_MEMREG_TYPE_MW; + ibmw = &nesmr->ibmw; + nesmr->pbl_4k = 0; + nesmr->pbls_used = 0; + + return ibmw; +} + + +/** + * nes_dealloc_mw + */ +static int nes_dealloc_mw(struct ib_mw *ibmw) +{ + struct nes_mr *nesmr = to_nesmw(ibmw); + struct nes_vnic *nesvnic = to_nesvnic(ibmw->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_cqp_request *cqp_request; + int err = 0; + unsigned long flags; + int ret; + + /* Deallocate the window with the adapter */ + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_MR, "Failed to get a cqp_request.\n"); + return -ENOMEM; + } + cqp_request->waiting = 1; + cqp_wqe = &cqp_request->cqp_wqe; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32(NES_CQP_DEALLOCATE_STAG); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = + cpu_to_le32((u32)((u64)(&nesdev->cqp))); + 
cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = + cpu_to_le32((u32)(((u64)(&nesdev->cqp)) >> 32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_BLK_COUNT_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_LEN_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_STAG_IDX] = cpu_to_le32(ibmw->rkey); + + atomic_set(&cqp_request->refcount, 2); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + /* Wait for CQP */ + nes_debug(NES_DBG_MR, "Waiting for deallocate STag 0x%08X to complete.\n", + ibmw->rkey); + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_MR, "Deallocate STag completed, wait_event_timeout ret = %u," + " CQP Major:Minor codes = 0x%04X:0x%04X.\n", + ret, cqp_request->major_code, cqp_request->minor_code); + if ((!ret) || (cqp_request->major_code)) { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + if (!ret) { + err = -ETIME; + } else { + err = -EIO; + } + } else { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + } + + nes_free_resource(nesadapter, nesadapter->allocated_mrs, + (ibmw->rkey&0x0fffff00) >> 8); + kfree(nesmr); + + return err; +} + + +/** + * nes_bind_mw + */ +static int nes_bind_mw(struct ib_qp *ibqp, 
struct ib_mw *ibmw, + struct ib_mw_bind *ibmw_bind) +{ + u64 u64temp; + struct nes_vnic *nesvnic = to_nesvnic(ibqp->device); + struct nes_device *nesdev = nesvnic->nesdev; + /* struct nes_mr *nesmr = to_nesmw(ibmw); */ + struct nes_qp *nesqp = to_nesqp(ibqp); + struct nes_hw_qp_wqe *wqe; + unsigned long flags = 0; + u32 head; + u32 wqe_misc = 0; + u32 qsize; + + if (nesqp->ibqp_state > IB_QPS_RTS) + return -EINVAL; + + spin_lock_irqsave(&nesqp->lock, flags); + + head = nesqp->hwqp.sq_head; + qsize = nesqp->hwqp.sq_tail; + + /* Check for SQ overflow */ + if (((head + (2 * qsize) - nesqp->hwqp.sq_tail) % qsize) == (qsize - 1)) { + return -EINVAL; + } + + wqe = &nesqp->hwqp.sq_vbase[head]; + /* nes_debug(NES_DBG_MR, "processing sq wqe at %p, head = %u.\n", wqe, head); */ + u64temp = (u64)ibmw_bind->wr_id; + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_LOW_IDX] = cpu_to_le32((u32)u64temp); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_HIGH_IDX] = cpu_to_le32((u32)((u64temp)>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)nesqp)>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] = (u32)((u64)nesqp); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] |= head; + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] = + cpu_to_le32(wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX]); + wqe_misc = NES_IWARP_SQ_OP_BIND; + + wqe_misc |= NES_IWARP_SQ_WQE_LOCAL_FENCE; + + if (ibmw_bind->send_flags & IB_SEND_SIGNALED) + wqe_misc |= NES_IWARP_SQ_WQE_SIGNALED_COMPL; + + if (ibmw_bind->mw_access_flags & IB_ACCESS_REMOTE_WRITE) { + wqe_misc |= NES_CQP_STAG_RIGHTS_REMOTE_WRITE; + } + if (ibmw_bind->mw_access_flags & IB_ACCESS_REMOTE_READ) { + wqe_misc |= NES_CQP_STAG_RIGHTS_REMOTE_READ; + } + + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] = cpu_to_le32(wqe_misc); + wqe->wqe_words[NES_IWARP_SQ_BIND_WQE_MR_IDX] = cpu_to_le32(ibmw_bind->mr->lkey); + wqe->wqe_words[NES_IWARP_SQ_BIND_WQE_MW_IDX] = cpu_to_le32(ibmw->rkey); + 
wqe->wqe_words[NES_IWARP_SQ_BIND_WQE_LENGTH_LOW_IDX] = + cpu_to_le32(ibmw_bind->length); + wqe->wqe_words[NES_IWARP_SQ_BIND_WQE_LENGTH_HIGH_IDX] = 0; + u64temp = (u64)ibmw_bind->addr; + wqe->wqe_words[NES_IWARP_SQ_BIND_WQE_VA_FBO_LOW_IDX] = cpu_to_le32((u32)u64temp); + wqe->wqe_words[NES_IWARP_SQ_BIND_WQE_VA_FBO_HIGH_IDX] = cpu_to_le32((u32)(u64temp>>32)); + + head++; + if (head >= qsize) + head = 0; + + nesqp->hwqp.sq_head = head; + barrier(); + + nes_write32(nesdev->regs + NES_WQE_ALLOC, + (1 << 24) | 0x00800000 | nesqp->hwqp.qp_id); + + spin_unlock_irqrestore(&nesqp->lock, flags); + + return 0; +} + + +/** + * nes_alloc_fmr + */ +static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, + int ibmr_access_flags, + struct ib_fmr_attr *ibfmr_attr) +{ + unsigned long flags; + struct nes_pd *nespd = to_nespd(ibpd); + struct nes_vnic *nesvnic = to_nesvnic(ibpd->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_fmr *nesfmr; + struct nes_cqp_request *cqp_request; + struct nes_hw_cqp_wqe *cqp_wqe; + int ret; + u32 stag; + u32 stag_index = 0; + u32 next_stag_index = 0; + u32 driver_key = 0; + u8 stag_key = 0; + int i=0; + struct nes_vpbl vpbl; + + get_random_bytes(&next_stag_index, sizeof(next_stag_index)); + stag_key = (u8)next_stag_index; + + driver_key = 0; + + next_stag_index >>= 8; + next_stag_index %= nesadapter->max_mr; + + ret = nes_alloc_resource(nesadapter, nesadapter->allocated_mrs, + nesadapter->max_mr, &stag_index, &next_stag_index); + if (ret) { + goto failed_resource_alloc; + } + + nesfmr = kmalloc(sizeof(*nesfmr), GFP_KERNEL); + if (!nesfmr) { + ret = -ENOMEM; + goto failed_fmr_alloc; + } + + nesfmr->nesmr.mode = IWNES_MEMREG_TYPE_FMR; + if (ibfmr_attr->max_pages == 1) { + /* use zero length PBL */ + nesfmr->nesmr.pbl_4k = 0; + nesfmr->nesmr.pbls_used = 0; + } else if (ibfmr_attr->max_pages <= 32) { + /* use PBL 256 */ + nesfmr->nesmr.pbl_4k = 0; + nesfmr->nesmr.pbls_used = 1; + } else 
if (ibfmr_attr->max_pages <= 512) { + /* use 4K PBLs */ + nesfmr->nesmr.pbl_4k = 1; + nesfmr->nesmr.pbls_used = 1; + } else { + /* use two level 4K PBLs */ + /* add support for two level 256B PBLs */ + nesfmr->nesmr.pbl_4k = 1; + nesfmr->nesmr.pbls_used = 1 + (ibfmr_attr->max_pages>>9) + + ((ibfmr_attr->max_pages&511)?1:0); + } + /* Register the region with the adapter */ + spin_lock_irqsave(&nesdev->cqp.lock, flags); + + /* track PBL resources */ + if (nesfmr->nesmr.pbls_used != 0) { + if (nesfmr->nesmr.pbl_4k) { + if (nesfmr->nesmr.pbls_used > nesadapter->free_4kpbl) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + ret = -ENOMEM; + goto failed_vpbl_alloc; + } else { + nesadapter->free_4kpbl -= nesfmr->nesmr.pbls_used; + } + } else { + if (nesfmr->nesmr.pbls_used > nesadapter->free_256pbl) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + ret = -ENOMEM; + goto failed_vpbl_alloc; + } else { + nesadapter->free_256pbl -= nesfmr->nesmr.pbls_used; + } + } + } + + /* one level pbl */ + if (nesfmr->nesmr.pbls_used == 0) { + nesfmr->root_vpbl.pbl_vbase = NULL; + nes_debug(NES_DBG_MR, "zero level pbl \n"); + } else if (nesfmr->nesmr.pbls_used == 1) { + /* can change it to kmalloc & dma_map_single */ + nesfmr->root_vpbl.pbl_vbase = pci_alloc_consistent(nesdev->pcidev, 4096, + &nesfmr->root_vpbl.pbl_pbase); + if (!nesfmr->root_vpbl.pbl_vbase) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + ret = -ENOMEM; + goto failed_vpbl_alloc; + } + nesfmr->leaf_pbl_cnt = 0; + nes_debug(NES_DBG_MR, "one level pbl, root_vpbl.pbl_vbase=%p \n", + nesfmr->root_vpbl.pbl_vbase); + } + /* two level pbl */ + else { + nesfmr->root_vpbl.pbl_vbase = pci_alloc_consistent(nesdev->pcidev, 8192, + &nesfmr->root_vpbl.pbl_pbase); + if (!nesfmr->root_vpbl.pbl_vbase) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + ret = -ENOMEM; + goto failed_vpbl_alloc; + } + + nesfmr->root_vpbl.leaf_vpbl = kmalloc(sizeof(*nesfmr->root_vpbl.leaf_vpbl)*1024, GFP_KERNEL); + if 
(!nesfmr->root_vpbl.leaf_vpbl) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + ret = -ENOMEM; + goto failed_leaf_vpbl_alloc; + } + + nesfmr->leaf_pbl_cnt = nesfmr->nesmr.pbls_used-1; + nes_debug(NES_DBG_MR, "two level pbl, root_vpbl.pbl_vbase=%p" + " leaf_pbl_cnt=%d root_vpbl.leaf_vpbl=%p\n", + nesfmr->root_vpbl.pbl_vbase, nesfmr->leaf_pbl_cnt, nesfmr->root_vpbl.leaf_vpbl); + + for (i=0; i < nesfmr->leaf_pbl_cnt; i++) + nesfmr->root_vpbl.leaf_vpbl[i].pbl_vbase = NULL; + + for (i=0; i < nesfmr->leaf_pbl_cnt; i++) { + vpbl.pbl_vbase = pci_alloc_consistent(nesdev->pcidev, 4096, + &vpbl.pbl_pbase); + + if (!vpbl.pbl_vbase) { + ret = -ENOMEM; + goto failed_leaf_vpbl_pages_alloc; + } + + nesfmr->root_vpbl.pbl_vbase[i].pa_low = cpu_to_le32((u32)vpbl.pbl_pbase); + nesfmr->root_vpbl.pbl_vbase[i].pa_high = cpu_to_le32((u32)((((u64)vpbl.pbl_pbase)>>32))); + nesfmr->root_vpbl.leaf_vpbl[i] = vpbl; + + nes_debug(NES_DBG_MR, "pbase_low=0x%x, pbase_high=0x%x, vpbl=%p\n", + nesfmr->root_vpbl.pbl_vbase[i].pa_low, + nesfmr->root_vpbl.pbl_vbase[i].pa_high, + &nesfmr->root_vpbl.leaf_vpbl[i]); + } + } + nesfmr->ib_qp = NULL; + nesfmr->access_rights =0; + + stag = stag_index << 8; + stag |= driver_key; + stag += (u32)stag_key; + + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_MR, "Failed to get a cqp_request.\n"); + ret = -ENOMEM; + goto failed_leaf_vpbl_pages_alloc; + } + cqp_request->waiting = 1; + cqp_wqe = &cqp_request->cqp_wqe; + + nes_debug(NES_DBG_MR, "Registering STag 0x%08X, index = 0x%08X\n", + stag, stag_index); + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32( + NES_CQP_ALLOCATE_STAG | + NES_CQP_STAG_VA_TO | + NES_CQP_STAG_MR); + + if (nesfmr->nesmr.pbl_4k == 1) + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32(NES_CQP_STAG_PBL_BLK_SIZE); + + if (ibmr_access_flags & IB_ACCESS_REMOTE_WRITE) { + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= 
+ cpu_to_le32(NES_CQP_STAG_RIGHTS_REMOTE_WRITE | + NES_CQP_STAG_RIGHTS_LOCAL_WRITE | NES_CQP_STAG_REM_ACC_EN); + nesfmr->access_rights |= + NES_CQP_STAG_RIGHTS_REMOTE_WRITE | NES_CQP_STAG_RIGHTS_LOCAL_WRITE | + NES_CQP_STAG_REM_ACC_EN; + } + + if (ibmr_access_flags & IB_ACCESS_REMOTE_READ) { + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= + cpu_to_le32( NES_CQP_STAG_RIGHTS_REMOTE_READ | + NES_CQP_STAG_RIGHTS_LOCAL_READ | NES_CQP_STAG_REM_ACC_EN); + nesfmr->access_rights |= + NES_CQP_STAG_RIGHTS_REMOTE_READ | NES_CQP_STAG_RIGHTS_LOCAL_READ | + NES_CQP_STAG_REM_ACC_EN; + } + + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = + cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = + cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_LEN_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_LEN_HIGH_PD_IDX] = + cpu_to_le32(nespd->pd_id & 0x00007fff); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_STAG_IDX] = cpu_to_le32(stag); + + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PA_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PA_HIGH_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_BLK_COUNT_IDX] = + cpu_to_le32((nesfmr->nesmr.pbls_used>1) ? 
+ (nesfmr->nesmr.pbls_used-1) : nesfmr->nesmr.pbls_used); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_LEN_IDX] = 0; + + atomic_set(&cqp_request->refcount, 2); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + /* Wait for CQP */ + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_MR, "Register STag 0x%08X completed, wait_event_timeout ret = %u," + " CQP Major:Minor codes = 0x%04X:0x%04X.\n", + stag, ret, cqp_request->major_code, cqp_request->minor_code); + + if ((!ret) || (cqp_request->major_code)) { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + ret = (!ret) ? 
-ETIME : -EIO; + goto failed_leaf_vpbl_pages_alloc; + } else { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + } + + nesfmr->nesmr.ibfmr.lkey = stag; + nesfmr->nesmr.ibfmr.rkey = stag; + nesfmr->attr = *ibfmr_attr; + + return &nesfmr->nesmr.ibfmr; + failed_leaf_vpbl_pages_alloc: + /* unroll all allocated pages */ + for (i=0; i < nesfmr->leaf_pbl_cnt; i++) { + if (nesfmr->root_vpbl.leaf_vpbl[i].pbl_vbase) { + pci_free_consistent(nesdev->pcidev, 4096, nesfmr->root_vpbl.leaf_vpbl[i].pbl_vbase, + nesfmr->root_vpbl.leaf_vpbl[i].pbl_pbase); + } + } + if (nesfmr->root_vpbl.leaf_vpbl) + kfree( nesfmr->root_vpbl.leaf_vpbl ); + failed_leaf_vpbl_alloc: + if (nesfmr->leaf_pbl_cnt == 0) { + if (nesfmr->root_vpbl.pbl_vbase) + pci_free_consistent(nesdev->pcidev, 4096, nesfmr->root_vpbl.pbl_vbase, + nesfmr->root_vpbl.pbl_pbase); + } else + pci_free_consistent(nesdev->pcidev, 8192, nesfmr->root_vpbl.pbl_vbase, + nesfmr->root_vpbl.pbl_pbase); + failed_vpbl_alloc: + kfree(nesfmr); + failed_fmr_alloc: + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + failed_resource_alloc: + return ERR_PTR(ret); +} + + +/** + * nes_dealloc_fmr + */ +static int nes_dealloc_fmr(struct ib_fmr *ibfmr) +{ + struct nes_mr *nesmr = to_nesmr_from_ibfmr(ibfmr); + struct nes_fmr *nesfmr = to_nesfmr(nesmr); + struct nes_vnic *nesvnic = to_nesvnic(ibfmr->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_mr temp_nesmr = *nesmr; + int i = 0; + + temp_nesmr.ibmw.device = ibfmr->device; + temp_nesmr.ibmw.pd = ibfmr->pd; + temp_nesmr.ibmw.rkey = ibfmr->rkey; + temp_nesmr.ibmw.uobject = NULL; + + /* free the resources */ + if (nesfmr->leaf_pbl_cnt == 0) { + /* single PBL case */ + if 
(nesfmr->root_vpbl.pbl_vbase) + pci_free_consistent(nesdev->pcidev, 4096, nesfmr->root_vpbl.pbl_vbase, + nesfmr->root_vpbl.pbl_pbase); + } else { + for (i=0; i<nesfmr->leaf_pbl_cnt; i++) { + pci_free_consistent(nesdev->pcidev, 4096, nesfmr->root_vpbl.leaf_vpbl[i].pbl_vbase, + nesfmr->root_vpbl.leaf_vpbl[i].pbl_pbase); + } + kfree(nesfmr->root_vpbl.leaf_vpbl); + pci_free_consistent(nesdev->pcidev, 8192, nesfmr->root_vpbl.pbl_vbase, + nesfmr->root_vpbl.pbl_pbase); + } + + return nes_dealloc_mw(&temp_nesmr.ibmw); +} + + +/** + * nes_map_phys_fmr + */ +static int nes_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova) +{ + return 0; +} + + +/** + * nes_unmap_fmr + */ +static int nes_unmap_fmr(struct list_head *ibfmr_list) +{ + return 0; +} + + + +/** + * nes_query_device + */ +static int nes_query_device(struct ib_device *ibdev, struct ib_device_attr *props) +{ + struct nes_vnic *nesvnic = to_nesvnic(ibdev); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_ib_device *nesibdev = nesvnic->nesibdev; + + memset(props, 0, sizeof(*props)); + memcpy(&props->sys_image_guid, nesvnic->netdev->dev_addr, 6); + + props->fw_ver = nesdev->nesadapter->fw_ver; + props->device_cap_flags = nesdev->nesadapter->device_cap_flags; + props->vendor_id = nesdev->nesadapter->vendor_id; + props->vendor_part_id = nesdev->nesadapter->vendor_part_id; + props->hw_ver = nesdev->nesadapter->hw_rev; + props->max_mr_size = 0x80000000; + props->max_qp = nesibdev->max_qp; + props->max_qp_wr = nesdev->nesadapter->max_qp_wr - 2; + props->max_sge = nesdev->nesadapter->max_sge; + props->max_cq = nesibdev->max_cq; + props->max_cqe = nesdev->nesadapter->max_cqe - 1; + props->max_mr = nesibdev->max_mr; + props->max_mw = nesibdev->max_mr; + props->max_pd = nesibdev->max_pd; + props->max_sge_rd = 1; + switch (nesdev->nesadapter->max_irrq_wr) { + case 0: + props->max_qp_rd_atom = 1; + break; + case 1: + props->max_qp_rd_atom = 4; + break; + case 2: + props->max_qp_rd_atom = 16; + break; + 
case 3: + props->max_qp_rd_atom = 32; + break; + default: + props->max_qp_rd_atom = 0; + } + props->max_qp_init_rd_atom = props->max_qp_wr; + props->atomic_cap = IB_ATOMIC_NONE; + props->max_map_per_fmr = 1; + + return 0; +} + + +/** + * nes_query_port + */ +static int nes_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr *props) +{ + memset(props, 0, sizeof(*props)); + + props->max_mtu = IB_MTU_2048; + props->active_mtu = IB_MTU_2048; + props->lid = 1; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = IB_PORT_CM_SUP | IB_PORT_REINIT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = IB_WIDTH_4X; + props->active_speed = 1; + props->max_msg_sz = 0x80000000; + + return 0; +} + + +/** + * nes_modify_port + */ +static int nes_modify_port(struct ib_device *ibdev, u8 port, + int port_modify_mask, struct ib_port_modify *props) +{ + return 0; +} + + +/** + * nes_query_pkey + */ +static int nes_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 * pkey) +{ + *pkey = 0; + return 0; +} + + +/** + * nes_query_gid + */ +static int nes_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct nes_vnic *nesvnic = to_nesvnic(ibdev); + + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), nesvnic->netdev->dev_addr, 6); + + return 0; +} + + +/** + * nes_alloc_ucontext - Allocate the user context data structure. This keeps track + * of all objects associated with a particular user-mode client. 
+ */ +static struct ib_ucontext *nes_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct nes_vnic *nesvnic = to_nesvnic(ibdev); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_alloc_ucontext_resp uresp; + struct nes_ucontext *nes_ucontext; + struct nes_ib_device *nesibdev = nesvnic->nesibdev; + + memset(&uresp, 0, sizeof uresp); + + uresp.max_qps = nesibdev->max_qp; + uresp.max_pds = nesibdev->max_pd; + uresp.wq_size = nesdev->nesadapter->max_qp_wr*2; + + nes_ucontext = kmalloc(sizeof *nes_ucontext, GFP_KERNEL); + if (!nes_ucontext) + return ERR_PTR(-ENOMEM); + + memset(nes_ucontext, 0, sizeof(struct nes_ucontext)); + + nes_ucontext->nesdev = nesdev; + nes_ucontext->mmap_wq_offset = ((uresp.max_pds * 4096) + PAGE_SIZE-1) / PAGE_SIZE; + nes_ucontext->mmap_cq_offset = nes_ucontext->mmap_wq_offset + + ((sizeof(struct nes_hw_qp_wqe) * uresp.max_qps * 2) + PAGE_SIZE-1) / + PAGE_SIZE; + + if (ib_copy_to_udata(udata, &uresp, sizeof uresp)) { + kfree(nes_ucontext); + return ERR_PTR(-EFAULT); + } + + INIT_LIST_HEAD(&nes_ucontext->cq_reg_mem_list); + return &nes_ucontext->ibucontext; +} + + +/** + * nes_dealloc_ucontext + */ +static int nes_dealloc_ucontext(struct ib_ucontext *context) +{ + /* struct nes_vnic *nesvnic = to_nesvnic(context->device); */ + /* struct nes_device *nesdev = nesvnic->nesdev; */ + struct nes_ucontext *nes_ucontext = to_nesucontext(context); + + kfree(nes_ucontext); + return 0; +} + + +/** + * nes_mmap + */ +static int nes_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + unsigned long index; + struct nes_vnic *nesvnic = to_nesvnic(context->device); + struct nes_device *nesdev = nesvnic->nesdev; + /* struct nes_adapter *nesadapter = nesdev->nesadapter; */ + struct nes_ucontext *nes_ucontext; + struct nes_qp *nesqp; + + nes_ucontext = to_nesucontext(context); + + + if (vma->vm_pgoff >= nes_ucontext->mmap_wq_offset) { + index = (vma->vm_pgoff - nes_ucontext->mmap_wq_offset) * PAGE_SIZE; + index /= 
((sizeof(struct nes_hw_qp_wqe) * nesdev->nesadapter->max_qp_wr * 2) + + PAGE_SIZE-1) & (~(PAGE_SIZE-1)); + if (!test_bit(index, nes_ucontext->allocated_wqs)) { + nes_debug(NES_DBG_MMAP, "wq %lu not allocated\n", index); + return -EFAULT; + } + nesqp = nes_ucontext->mmap_nesqp[index]; + if (NULL == nesqp) { + nes_debug(NES_DBG_MMAP, "wq %lu has a NULL QP base.\n", index); + return -EFAULT; + } + if (remap_pfn_range(vma, vma->vm_start, + virt_to_phys(nesqp->hwqp.sq_vbase) >> PAGE_SHIFT, + vma->vm_end - vma->vm_start, + vma->vm_page_prot)) { + nes_debug(NES_DBG_MMAP, "remap_pfn_range failed.\n"); + return -EAGAIN; + } + vma->vm_private_data = nesqp; + return 0; + } else { + index = vma->vm_pgoff; + if (!test_bit(index, nes_ucontext->allocated_doorbells)) + return -EFAULT; + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + if (io_remap_pfn_range(vma, vma->vm_start, + (nesdev->doorbell_start + + ((nes_ucontext->mmap_db_index[index]-nesdev->base_doorbell_index) * 4096)) + >> PAGE_SHIFT, PAGE_SIZE, vma->vm_page_prot)) + return -EAGAIN; + vma->vm_private_data = nes_ucontext; + return 0; + } + + return -ENOSYS; +} + + +/** + * nes_alloc_pd + */ +static struct ib_pd *nes_alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, struct ib_udata *udata) +{ + struct nes_pd *nespd; + struct nes_vnic *nesvnic = to_nesvnic(ibdev); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_ucontext *nesucontext; + struct nes_alloc_pd_resp uresp; + u32 pd_num = 0; + int err; + + nes_debug(NES_DBG_PD, "netdev refcnt=%u\n", + atomic_read(&nesvnic->netdev->refcnt)); + + err = nes_alloc_resource(nesadapter, nesadapter->allocated_pds, + nesadapter->max_pd, &pd_num, &nesadapter->next_pd); + if (err) { + return ERR_PTR(err); + } + + nespd = kmalloc(sizeof (struct nes_pd), GFP_KERNEL); + if (!nespd) { + nes_free_resource(nesadapter, nesadapter->allocated_pds, pd_num); + return ERR_PTR(-ENOMEM); + } + memset(nespd, 0, 
sizeof(struct nes_pd)); + nes_debug(NES_DBG_PD, "Allocating PD (%p) for ib device %s\n", + nespd, nesvnic->nesibdev->ibdev.name); + + nespd->pd_id = pd_num + nesadapter->base_pd; + + if (context) { + nesucontext = to_nesucontext(context); + nespd->mmap_db_index = find_next_zero_bit(nesucontext->allocated_doorbells, + NES_MAX_USER_DB_REGIONS, nesucontext->first_free_db); + nes_debug(NES_DBG_PD, "find_next_zero_bit on doorbells returned %u, mapping pd_id %u.\n", + nespd->mmap_db_index, nespd->pd_id); + if (nespd->mmap_db_index >= NES_MAX_USER_DB_REGIONS) { + nes_debug(NES_DBG_PD, "mmap_db_index >= MAX\n"); + nes_free_resource(nesadapter, nesadapter->allocated_pds, pd_num); + kfree(nespd); + return ERR_PTR(-ENOMEM); + } + + uresp.pd_id = nespd->pd_id; + uresp.mmap_db_index = nespd->mmap_db_index; + if (ib_copy_to_udata(udata, &uresp, sizeof (struct nes_alloc_pd_resp))) { + nes_free_resource(nesadapter, nesadapter->allocated_pds, pd_num); + kfree(nespd); + return ERR_PTR(-EFAULT); + } + + set_bit(nespd->mmap_db_index, nesucontext->allocated_doorbells); + nesucontext->mmap_db_index[nespd->mmap_db_index] = nespd->pd_id; + nesucontext->first_free_db = nespd->mmap_db_index + 1; + } + + nes_debug(NES_DBG_PD, "PD%u structure located @%p.\n", nespd->pd_id, nespd); + return &nespd->ibpd; +} + + +/** + * nes_dealloc_pd + */ +static int nes_dealloc_pd(struct ib_pd *ibpd) +{ + struct nes_ucontext *nesucontext; + struct nes_pd *nespd = to_nespd(ibpd); + struct nes_vnic *nesvnic = to_nesvnic(ibpd->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + + if ((ibpd->uobject) && (ibpd->uobject->context)) { + nesucontext = to_nesucontext(ibpd->uobject->context); + nes_debug(NES_DBG_PD, "Clearing bit %u from allocated doorbells\n", + nespd->mmap_db_index); + clear_bit(nespd->mmap_db_index, nesucontext->allocated_doorbells); + nesucontext->mmap_db_index[nespd->mmap_db_index] = 0; + if (nesucontext->first_free_db > 
nespd->mmap_db_index) { + nesucontext->first_free_db = nespd->mmap_db_index; + } + } + + nes_debug(NES_DBG_PD, "Deallocating PD%u structure located @%p.\n", + nespd->pd_id, nespd); + nes_free_resource(nesadapter, nesadapter->allocated_pds, + nespd->pd_id-nesadapter->base_pd); + kfree(nespd); + + return 0; +} + + +/** + * nes_create_ah + */ +static struct ib_ah *nes_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + return ERR_PTR(-ENOSYS); +} + + +/** + * nes_destroy_ah + */ +static int nes_destroy_ah(struct ib_ah *ah) +{ + return -ENOSYS; +} + + +/** + * nes_create_qp + */ +static struct ib_qp *nes_create_qp(struct ib_pd *ibpd, + struct ib_qp_init_attr *init_attr, struct ib_udata *udata) +{ + u64 u64temp= 0; + u64 u64nesqp = 0; + struct nes_pd *nespd = to_nespd(ibpd); + struct nes_vnic *nesvnic = to_nesvnic(ibpd->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_qp *nesqp; + struct nes_cq *nescq; + struct nes_ucontext *nes_ucontext; + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_cqp_request *cqp_request; + struct nes_create_qp_resp uresp; + u32 qp_num = 0; + /* u32 counter = 0; */ + void *mem; + unsigned long flags; + int ret; + int sq_size; + int rq_size; + u8 sq_encoded_size; + u8 rq_encoded_size; + /* int counter; */ + + atomic_inc(&qps_created); + switch (init_attr->qp_type) { + case IB_QPT_RC: + if (nes_drv_opt & NES_DRV_OPT_NO_INLINE_DATA) { + init_attr->cap.max_inline_data = 0; + } else { + init_attr->cap.max_inline_data = 64; + } + + if (init_attr->cap.max_send_wr < 32) { + sq_size = 32; + sq_encoded_size = 1; + } else if (init_attr->cap.max_send_wr < 128) { + sq_size = 128; + sq_encoded_size = 2; + } else if (init_attr->cap.max_send_wr < 512) { + sq_size = 512; + sq_encoded_size = 3; + } else { + printk(KERN_ERR PFX "%s: SQ size (%u) too large.\n", + __FUNCTION__, init_attr->cap.max_send_wr); + return ERR_PTR(-EINVAL); + } + init_attr->cap.max_send_wr = sq_size - 2; + if 
(init_attr->cap.max_recv_wr < 32) { + rq_size = 32; + rq_encoded_size = 1; + } else if (init_attr->cap.max_recv_wr < 128) { + rq_size = 128; + rq_encoded_size = 2; + } else if (init_attr->cap.max_recv_wr < 512) { + rq_size = 512; + rq_encoded_size = 3; + } else { + printk(KERN_ERR PFX "%s: RQ size (%u) too large.\n", + __FUNCTION__, init_attr->cap.max_recv_wr); + return ERR_PTR(-EINVAL); + } + init_attr->cap.max_recv_wr = rq_size -1; + nes_debug(NES_DBG_QP, "RQ size=%u, SQ Size=%u\n", rq_size, sq_size); + + ret = nes_alloc_resource(nesadapter, nesadapter->allocated_qps, + nesadapter->max_qp, &qp_num, &nesadapter->next_qp); + if (ret) { + return ERR_PTR(ret); + } + + /* Need 512 (actually now 1024) byte alignment on this structure */ + mem = kzalloc(sizeof(*nesqp)+NES_SW_CONTEXT_ALIGN-1, GFP_KERNEL); + if (!mem) { + nes_free_resource(nesadapter, nesadapter->allocated_qps, qp_num); + nes_debug(NES_DBG_QP, "Unable to allocate QP\n"); + return ERR_PTR(-ENOMEM); + } + u64nesqp = (u64)mem; /* u64nesqp = (u64)((uint)mem); */ + u64nesqp += ((u64)NES_SW_CONTEXT_ALIGN) - 1; + u64temp = ((u64)NES_SW_CONTEXT_ALIGN) - 1; + u64nesqp &= ~u64temp; + nesqp = (struct nes_qp *)u64nesqp; + /* nes_debug(NES_DBG_QP, "nesqp=%p, allocated buffer=%p. 
Rounded to closest %u\n", + nesqp, mem, NES_SW_CONTEXT_ALIGN); */ + nesqp->allocated_buffer = mem; + + if (udata) { + if ((ibpd->uobject) && (ibpd->uobject->context)) { + nesqp->user_mode = 1; + nes_ucontext = to_nesucontext(ibpd->uobject->context); + nesqp->mmap_sq_db_index = + find_next_zero_bit(nes_ucontext->allocated_wqs, + NES_MAX_USER_WQ_REGIONS, nes_ucontext->first_free_wq); + /* nes_debug(NES_DBG_QP, "find_next_zero_bit on wqs returned %u\n", + nespd->mmap_db_index); */ + if (nesqp->mmap_sq_db_index >= NES_MAX_USER_WQ_REGIONS) { + nes_debug(NES_DBG_QP, + "db index >= max user regions, failing create QP\n"); + nes_free_resource(nesadapter, nesadapter->allocated_qps, qp_num); + kfree(nesqp->allocated_buffer); + return ERR_PTR(-ENOMEM); + } + set_bit(nesqp->mmap_sq_db_index, nes_ucontext->allocated_wqs); + nes_ucontext->mmap_nesqp[nesqp->mmap_sq_db_index] = nesqp; + nes_ucontext->first_free_wq = nesqp->mmap_sq_db_index + 1; + } else { + nes_free_resource(nesadapter, nesadapter->allocated_qps, qp_num); + kfree(nesqp->allocated_buffer); + return ERR_PTR(-EFAULT); + } + } + + nesqp->qp_mem_size = (sizeof(struct nes_hw_qp_wqe) * sq_size) + + (sizeof(struct nes_hw_qp_wqe) * rq_size) + + max((u32)sizeof(struct nes_qp_context), ((u32)256)) + + 256; /* this is Q2 */ + /* Round up to a multiple of a page */ + nesqp->qp_mem_size += PAGE_SIZE - 1; + nesqp->qp_mem_size &= ~(PAGE_SIZE - 1); + + mem = pci_alloc_consistent(nesdev->pcidev, nesqp->qp_mem_size, + &nesqp->hwqp.sq_pbase); + if (!mem) { + nes_free_resource(nesadapter, nesadapter->allocated_qps, qp_num); + nes_debug(NES_DBG_QP, + "Unable to allocate memory for host descriptor rings\n"); + kfree(nesqp->allocated_buffer); + return ERR_PTR(-ENOMEM); + } + nes_debug(NES_DBG_QP, "PCI consistent memory for " + "host descriptor rings located @ %p (pa = 0x%08lX.) 
size = %u.\n", + mem, (unsigned long)nesqp->hwqp.sq_pbase, nesqp->qp_mem_size); + + memset(mem, 0, nesqp->qp_mem_size); + + nesqp->hwqp.sq_vbase = mem; + nesqp->hwqp.sq_size = sq_size; + nesqp->hwqp.sq_encoded_size = sq_encoded_size; + nesqp->hwqp.sq_head = 1; + mem += sizeof(struct nes_hw_qp_wqe) * sq_size; + + nesqp->hwqp.rq_vbase = mem; + nesqp->hwqp.rq_size = rq_size; + nesqp->hwqp.rq_encoded_size = rq_encoded_size; + nesqp->hwqp.rq_pbase = nesqp->hwqp.sq_pbase + + sizeof(struct nes_hw_qp_wqe) * sq_size; + mem += sizeof(struct nes_hw_qp_wqe)*rq_size; + + nesqp->hwqp.q2_vbase = mem; + nesqp->hwqp.q2_pbase = nesqp->hwqp.rq_pbase + + sizeof(struct nes_hw_qp_wqe) * rq_size; + mem += 256; + memset(nesqp->hwqp.q2_vbase, 0, 256); + + nesqp->nesqp_context = mem; + nesqp->nesqp_context_pbase = nesqp->hwqp.q2_pbase + 256; + memset(nesqp->nesqp_context, 0, sizeof(*nesqp->nesqp_context)); + + /* nes_debug(NES_DBG_QP, "nesqp->nesqp_context_pbase = %p\n", + (void *)nesqp->nesqp_context_pbase); + */ + nesqp->hwqp.qp_id = qp_num; + nesqp->ibqp.qp_num = nesqp->hwqp.qp_id; + nesqp->nespd = nespd; + + nescq = to_nescq(init_attr->send_cq); + nesqp->nesscq = nescq; + nescq = to_nescq(init_attr->recv_cq); + nesqp->nesrcq = nescq; + + nesqp->nesqp_context->misc |= cpu_to_le32((u32)PCI_FUNC(nesdev->pcidev->devfn) << + NES_QPCONTEXT_MISC_PCI_FCN_SHIFT); + nesqp->nesqp_context->misc |= cpu_to_le32((u32)nesqp->hwqp.rq_encoded_size << + NES_QPCONTEXT_MISC_RQ_SIZE_SHIFT); + nesqp->nesqp_context->misc |= cpu_to_le32((u32)nesqp->hwqp.sq_encoded_size << + NES_QPCONTEXT_MISC_SQ_SIZE_SHIFT); + nesqp->nesqp_context->misc |= cpu_to_le32(NES_QPCONTEXT_MISC_PRIV_EN); + nesqp->nesqp_context->misc |= cpu_to_le32(NES_QPCONTEXT_MISC_FAST_REGISTER_EN); + nesqp->nesqp_context->cqs = cpu_to_le32(nesqp->nesscq->hw_cq.cq_number + + ((u32)nesqp->nesrcq->hw_cq.cq_number << 16)); + u64temp = (u64)nesqp->hwqp.sq_pbase; + nesqp->nesqp_context->sq_addr_low = cpu_to_le32((u32)u64temp); + 
nesqp->nesqp_context->sq_addr_high = cpu_to_le32((u32)(u64temp >> 32)); + u64temp = (u64)nesqp->hwqp.rq_pbase; + nesqp->nesqp_context->rq_addr_low = cpu_to_le32((u32)u64temp); + nesqp->nesqp_context->rq_addr_high = cpu_to_le32((u32)(u64temp >> 32)); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + /* nes_debug(NES_DBG_QP, "next_qp_nic_index=%u, using nic_index=%d\n", + nesvnic->next_qp_nic_index, + nesvnic->qp_nic_index[nesvnic->next_qp_nic_index]); */ + nesqp->nesqp_context->misc2 |= cpu_to_le32( + (u32)nesvnic->qp_nic_index[nesvnic->next_qp_nic_index] << + NES_QPCONTEXT_MISC2_NIC_INDEX_SHIFT); + nesvnic->next_qp_nic_index++; + if ((nesvnic->next_qp_nic_index > 3) || + (nesvnic->qp_nic_index[nesvnic->next_qp_nic_index] == 0xf)) { + nesvnic->next_qp_nic_index = 0; + } + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + + nesqp->nesqp_context->pd_index_wscale |= cpu_to_le32((u32)nesqp->nespd->pd_id << 16); + u64temp = (u64)nesqp->hwqp.q2_pbase; + nesqp->nesqp_context->q2_addr_low = cpu_to_le32((u32)u64temp); + nesqp->nesqp_context->q2_addr_high = cpu_to_le32((u32)(u64temp>>32)); + nesqp->nesqp_context->aeq_token_low = cpu_to_le32((u32)((u64)(nesqp))); + nesqp->nesqp_context->aeq_token_high = cpu_to_le32((u32)(((u64)(nesqp))>>32)); + nesqp->nesqp_context->ird_ord_sizes = cpu_to_le32(NES_QPCONTEXT_ORDIRD_ALSMM | + ((((u32)nesadapter->max_irrq_wr) << + NES_QPCONTEXT_ORDIRD_IRDSIZE_SHIFT) & NES_QPCONTEXT_ORDIRD_IRDSIZE_MASK)); + if (disable_mpa_crc) { + nes_debug(NES_DBG_QP, "Disabling MPA crc checking due to module option.\n"); + nesqp->nesqp_context->ird_ord_sizes |= cpu_to_le32(NES_QPCONTEXT_ORDIRD_RNMC); + } + + + /* Create the QP */ + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_QP, "Failed to get a cqp_request\n"); + nes_free_resource(nesadapter, nesadapter->allocated_qps, qp_num); + pci_free_consistent(nesdev->pcidev, nesqp->qp_mem_size, + nesqp->hwqp.sq_vbase, 
nesqp->hwqp.sq_pbase); + kfree(nesqp->allocated_buffer); + return ERR_PTR(-ENOMEM); + } + cqp_request->waiting = 1; + cqp_wqe = &cqp_request->cqp_wqe; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32( + NES_CQP_CREATE_QP | NES_CQP_QP_TYPE_IWARP | + NES_CQP_QP_IWARP_STATE_IDLE); + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32(NES_CQP_QP_CQS_VALID); + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesqp->hwqp.qp_id); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = + cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = + cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + u64temp = (u64)nesqp->nesqp_context_pbase; + cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_HIGH_IDX] = + cpu_to_le32((u32)(u64temp >> 32)); + + atomic_set(&cqp_request->refcount, 2); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + /* Wait for CQP */ + nes_debug(NES_DBG_QP, "Waiting for create iWARP QP%u to complete.\n", + nesqp->hwqp.qp_id); + ret = wait_event_timeout(cqp_request->waitq, + (0 != cqp_request->request_done), NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_QP, "Create iwarp QP%u completed, wait_event_timeout ret=%u," + " nesdev->cqp_head = %u, nesdev->cqp.sq_tail = %u," + " CQP Major:Minor codes = 0x%04X:0x%04X.\n", + nesqp->hwqp.qp_id, ret, nesdev->cqp.sq_head, nesdev->cqp.sq_tail, + cqp_request->major_code, cqp_request->minor_code); + if ((!ret) || (cqp_request->major_code)) { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, 
&nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + nes_free_resource(nesadapter, nesadapter->allocated_qps, qp_num); + pci_free_consistent(nesdev->pcidev, nesqp->qp_mem_size, + nesqp->hwqp.sq_vbase, nesqp->hwqp.sq_pbase); + kfree(nesqp->allocated_buffer); + if (!ret) { + return ERR_PTR(-ETIME); + } else { + return ERR_PTR(-EIO); + } + } else { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + } + + if (ibpd->uobject) { + uresp.mmap_sq_db_index = nesqp->mmap_sq_db_index; + uresp.actual_sq_size = sq_size; + uresp.actual_rq_size = rq_size; + uresp.qp_id = nesqp->hwqp.qp_id; + uresp.nes_drv_opt = nes_drv_opt; + if (ib_copy_to_udata(udata, &uresp, sizeof uresp)) { + pci_free_consistent(nesdev->pcidev, nesqp->qp_mem_size, + nesqp->hwqp.sq_vbase, nesqp->hwqp.sq_pbase); + nes_free_resource(nesadapter, nesadapter->allocated_qps, qp_num); + kfree(nesqp->allocated_buffer); + return ERR_PTR(-EFAULT); + } + } + + nes_debug(NES_DBG_QP, "QP%u structure located @%p.Size = %u.\n", + nesqp->hwqp.qp_id, nesqp, (u32)sizeof(*nesqp)); + spin_lock_init(&nesqp->lock); + init_waitqueue_head(&nesqp->state_waitq); + init_waitqueue_head(&nesqp->kick_waitq); + nes_add_ref(&nesqp->ibqp); + break; + default: + nes_debug(NES_DBG_QP, "Invalid QP type: %d\n", init_attr->qp_type); + return ERR_PTR(-EINVAL); + break; + } + + /* update the QP table */ + nesdev->nesadapter->qp_table[nesqp->hwqp.qp_id-NES_FIRST_QPN] = nesqp; + nes_debug(NES_DBG_QP, "netdev refcnt=%u\n", + atomic_read(&nesvnic->netdev->refcnt)); + + return &nesqp->ibqp; +} + + +/** + * nes_destroy_qp + */ +static int nes_destroy_qp(struct ib_qp *ibqp) +{ + struct nes_qp *nesqp = to_nesqp(ibqp); + /* 
struct nes_vnic *nesvnic = to_nesvnic(ibqp->device); */ + struct nes_ucontext *nes_ucontext; + struct ib_qp_attr attr; + struct iw_cm_id *cm_id; + struct iw_cm_event cm_event; + int ret; + + atomic_inc(&sw_qps_destroyed); + nesqp->destroyed = 1; + + /* Blow away the connection if it exists. */ + if (nesqp->ibqp_state >= IB_QPS_INIT && nesqp->ibqp_state <= IB_QPS_RTS) { + /* if (nesqp->ibqp_state == IB_QPS_RTS) { */ + attr.qp_state = IB_QPS_ERR; + nes_modify_qp(&nesqp->ibqp, &attr, IB_QP_STATE, NULL); + } + + if (((nesqp->ibqp_state == IB_QPS_INIT) || + (nesqp->ibqp_state == IB_QPS_RTR)) && (nesqp->cm_id)) { + cm_id = nesqp->cm_id; + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.status = IW_CM_EVENT_STATUS_TIMEOUT; + cm_event.local_addr = cm_id->local_addr; + cm_event.remote_addr = cm_id->remote_addr; + cm_event.private_data = NULL; + cm_event.private_data_len = 0; + + nes_debug(NES_DBG_QP, "Generating a CM Timeout Event for " + "QP%u. cm_id = %p, refcount = %u. \n", + nesqp->hwqp.qp_id, cm_id, atomic_read(&nesqp->refcount)); + + cm_id->rem_ref(cm_id); + ret = cm_id->event_handler(cm_id, &cm_event); + if (ret) + nes_debug(NES_DBG_QP, "OFA CM event_handler returned, ret=%d\n", ret); + } + + + if (nesqp->user_mode) { + if ((ibqp->uobject)&&(ibqp->uobject->context)) { + nes_ucontext = to_nesucontext(ibqp->uobject->context); + clear_bit(nesqp->mmap_sq_db_index, nes_ucontext->allocated_wqs); + nes_ucontext->mmap_nesqp[nesqp->mmap_sq_db_index] = NULL; + if (nes_ucontext->first_free_wq > nesqp->mmap_sq_db_index) { + nes_ucontext->first_free_wq = nesqp->mmap_sq_db_index; + } + } + } + + nes_rem_ref(&nesqp->ibqp); + return 0; +} + + +/** + * nes_create_cq + */ +static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, +#ifndef OFED_1_2 + int comp_vector, +#endif + struct ib_ucontext *context, struct ib_udata *udata) +{ + u64 u64temp; + struct nes_vnic *nesvnic = to_nesvnic(ibdev); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter 
*nesadapter = nesdev->nesadapter; + struct nes_cq *nescq; + struct nes_ucontext *nes_ucontext = NULL; + struct nes_cqp_request *cqp_request; + void *mem = NULL; + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_pbl *nespbl = NULL; + struct nes_create_cq_req req; + struct nes_create_cq_resp resp; + u32 cq_num = 0; + u32 pbl_entries = 1; + int err; + unsigned long flags; + int ret; + + err = nes_alloc_resource(nesadapter, nesadapter->allocated_cqs, + nesadapter->max_cq, &cq_num, &nesadapter->next_cq); + if (err) { + return ERR_PTR(err); + } + + nescq = kmalloc(sizeof(struct nes_cq), GFP_KERNEL); + if (!nescq) { + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); + nes_debug(NES_DBG_CQ, "Unable to allocate nes_cq struct\n"); + return ERR_PTR(-ENOMEM); + } + memset(nescq, 0, sizeof(struct nes_cq)); + + nescq->hw_cq.cq_size = max(entries + 1, 5); + nescq->hw_cq.cq_number = cq_num; + nescq->ibcq.cqe = nescq->hw_cq.cq_size - 1; + + if (context) { + nes_ucontext = to_nesucontext(context); + if (ib_copy_from_udata(&req, udata, sizeof (struct nes_create_cq_req))) { + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); + kfree(nescq); + return ERR_PTR(-EFAULT); + } + nes_debug(NES_DBG_CQ, "CQ Virtual Address = %08lX, size = %u.\n", + (unsigned long)req.user_cq_buffer, entries); + list_for_each_entry(nespbl, &nes_ucontext->cq_reg_mem_list, list) { + if (nespbl->user_base == (unsigned long )req.user_cq_buffer) { + list_del(&nespbl->list); + err = 0; + nes_debug(NES_DBG_CQ, "Found PBL for virtual CQ. 
nespbl=%p.\n", + nespbl); + break; + } + } + if (err) { + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); + kfree(nescq); + return ERR_PTR(err); + } + pbl_entries = nespbl->pbl_size >> 3; + nescq->cq_mem_size = 0; + } else { + nescq->cq_mem_size = nescq->hw_cq.cq_size * sizeof(struct nes_hw_cqe); + nes_debug(NES_DBG_CQ, "Attempting to allocate pci memory (%u entries, %u bytes) for CQ%u.\n", + entries, nescq->cq_mem_size, nescq->hw_cq.cq_number); + + /* allocate the physical buffer space */ + mem = pci_alloc_consistent(nesdev->pcidev, nescq->cq_mem_size, + &nescq->hw_cq.cq_pbase); + if (!mem) { + printk(KERN_ERR PFX "Unable to allocate pci memory for cq\n"); + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); + kfree(nescq); + return ERR_PTR(-ENOMEM); + } + + memset(mem, 0, nescq->cq_mem_size); + nescq->hw_cq.cq_vbase = mem; + nescq->hw_cq.cq_head = 0; + nes_debug(NES_DBG_CQ, "CQ%u virtual address @ %p, phys = 0x%08X\n", + nescq->hw_cq.cq_number, nescq->hw_cq.cq_vbase, + (u32)nescq->hw_cq.cq_pbase); + } + + nescq->hw_cq.ce_handler = nes_iwarp_ce_handler; + spin_lock_init(&nescq->lock); + + /* send CreateCQ request to CQP */ + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_CQ, "Failed to get a cqp_request.\n"); + if (!context) + pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, + nescq->hw_cq.cq_pbase); + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); + kfree(nescq); + return ERR_PTR(-ENOMEM); + } + cqp_request->waiting = 1; + cqp_wqe = &cqp_request->cqp_wqe; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32( + NES_CQP_CREATE_CQ | NES_CQP_CQ_CEQ_VALID | + NES_CQP_CQ_CHK_OVERFLOW | + NES_CQP_CQ_CEQE_MASK |((u32)nescq->hw_cq.cq_size << 16)); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + if (1 != pbl_entries) { + if (pbl_entries > 32) { + /* use 4k pbl */ + nes_debug(NES_DBG_CQ, "pbl_entries=%u, use a 4k PBL\n", 
pbl_entries); + if (0 == nesadapter->free_4kpbl) { + if (cqp_request->dynamic) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + if (!context) + pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, + nescq->hw_cq.cq_pbase); + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); + kfree(nescq); + return ERR_PTR(-ENOMEM); + } else { + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32( + (NES_CQP_CQ_VIRT | NES_CQP_CQ_4KB_CHUNK)); + nescq->virtual_cq = 2; + nesadapter->free_4kpbl--; + } + } else { + /* use 256 byte pbl */ + nes_debug(NES_DBG_CQ, "pbl_entries=%u, use a 256 byte PBL\n", pbl_entries); + if (0 == nesadapter->free_256pbl) { + if (cqp_request->dynamic) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + if (!context) + pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, + nescq->hw_cq.cq_pbase); + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); + kfree(nescq); + return ERR_PTR(-ENOMEM); + } else { + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32(NES_CQP_CQ_VIRT); + nescq->virtual_cq = 1; + nesadapter->free_256pbl--; + } + } + } + + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = + cpu_to_le32(nescq->hw_cq.cq_number | ((u32)nesdev->ceq_index << 16)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + if (context) { + if (1 != pbl_entries) + u64temp = (u64)nespbl->pbl_pbase; + else + u64temp = 
le64_to_cpu(nespbl->pbl_vbase[0]); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_DOORBELL_INDEX_HIGH_IDX] = + cpu_to_le32(nes_ucontext->mmap_db_index[0]); + } else { + u64temp = (u64)nescq->hw_cq.cq_pbase; + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_DOORBELL_INDEX_HIGH_IDX] = 0; + } + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_PBL_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_PBL_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_HIGH_IDX] = 0; + u64temp = (u64)&nescq->hw_cq; + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_LOW_IDX] = cpu_to_le32((u32)(u64temp>>1)); + cqp_wqe->wqe_words[NES_CQP_CQ_WQE_CQ_CONTEXT_HIGH_IDX] = cpu_to_le32(((u32)((u64temp)>>33))&0x7FFFFFFF); + + atomic_set(&cqp_request->refcount, 2); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + + /* Wait for CQP */ + nes_debug(NES_DBG_CQ, "Waiting for create iWARP CQ%u to complete.\n", + nescq->hw_cq.cq_number); + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT * 2); + nes_debug(NES_DBG_CQ, "Create iWARP CQ%u completed, wait_event_timeout ret = %d.\n", + nescq->hw_cq.cq_number, ret); + if ((!ret) || (cqp_request->major_code)) { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + nes_debug(NES_DBG_CQ, "iWARP CQ%u create timeout expired, major code = 0x%04X," + " minor code = 0x%04X\n", + nescq->hw_cq.cq_number, cqp_request->major_code, cqp_request->minor_code); + if (!context) + pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, + nescq->hw_cq.cq_pbase); + nes_free_resource(nesadapter, 
nesadapter->allocated_cqs, cq_num); + kfree(nescq); + return ERR_PTR(-EIO); + } else { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + } + + if (context) { + /* free the nespbl */ + pci_free_consistent(nesdev->pcidev, nespbl->pbl_size, nespbl->pbl_vbase, + nespbl->pbl_pbase); + kfree(nespbl); + resp.cq_id = nescq->hw_cq.cq_number; + resp.cq_size = nescq->hw_cq.cq_size; + resp.mmap_db_index = 0; + if (ib_copy_to_udata(udata, &resp, sizeof resp)) { + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); + kfree(nescq); + return ERR_PTR(-EFAULT); + } + } + + return &nescq->ibcq; +} + + +/** + * nes_destroy_cq + */ +static int nes_destroy_cq(struct ib_cq *ib_cq) +{ + struct nes_cq *nescq; + struct nes_device *nesdev; + struct nes_vnic *nesvnic; + struct nes_adapter *nesadapter; + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_cqp_request *cqp_request; + unsigned long flags; + int ret; + + if (ib_cq == NULL) + return 0; + + nescq = to_nescq(ib_cq); + nesvnic = to_nesvnic(ib_cq->device); + nesdev = nesvnic->nesdev; + nesadapter = nesdev->nesadapter; + + nes_debug(NES_DBG_CQ, "Destroy CQ%u\n", nescq->hw_cq.cq_number); + + /* Send DestroyCQ request to CQP */ + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_CQ, "Failed to get a cqp_request.\n"); + return -ENOMEM; + } + cqp_request->waiting = 1; + cqp_wqe = &cqp_request->cqp_wqe; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32( + NES_CQP_DESTROY_CQ | (nescq->hw_cq.cq_size << 16)); + + spin_lock_irqsave(&nesdev->cqp.lock, flags); + if (nescq->virtual_cq == 1) { + nesadapter->free_256pbl++; + if (nesadapter->free_256pbl > 
nesadapter->max_256pbl) { + printk(KERN_ERR PFX "%s: free 256B PBLs(%u) has exceeded the max(%u)\n", + __FUNCTION__, nesadapter->free_256pbl, nesadapter->max_256pbl); + } + } else if (nescq->virtual_cq == 2) { + nesadapter->free_4kpbl++; + if (nesadapter->free_4kpbl > nesadapter->max_4kpbl) { + printk(KERN_ERR PFX "%s: free 4K PBLs(%u) has exceeded the max(%u)\n", + __FUNCTION__, nesadapter->free_4kpbl, nesadapter->max_4kpbl); + } + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32(NES_CQP_CQ_4KB_CHUNK); + } + + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32( + nescq->hw_cq.cq_number | ((u32)PCI_FUNC(nesdev->pcidev->devfn) << 16)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + + atomic_set(&cqp_request->refcount, 2); + nes_free_resource(nesadapter, nesadapter->allocated_cqs, nescq->hw_cq.cq_number); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + + /* Wait for CQP */ + nes_debug(NES_DBG_CQ, "Waiting for destroy iWARP CQ%u to complete.\n", + nescq->hw_cq.cq_number); + /* cqp_head = (cqp_head+1)&(nesdev->cqp.sq_size-1); */ + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_CQ, "Destroy iWARP CQ%u completed, wait_event_timeout ret = %u," + " CQP Major:Minor codes = 0x%04X:0x%04X.\n", + nescq->hw_cq.cq_number, ret, cqp_request->major_code, + cqp_request->minor_code); + if ((!ret) || (cqp_request->major_code)) { + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + 
spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + if (!ret) { + nes_debug(NES_DBG_CQ, "iWARP CQ%u destroy timeout expired\n", + nescq->hw_cq.cq_number); + ret = -ETIME; + } else { + nes_debug(NES_DBG_CQ, "iWARP CQ%u destroy failed\n", + nescq->hw_cq.cq_number); + ret = -EIO; + } + } else { + ret = 0; + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + } + + if (nescq->cq_mem_size) + pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, + (void *)nescq->hw_cq.cq_vbase, nescq->hw_cq.cq_pbase); + kfree(nescq); + + return ret; +} + + +/** + * nes_reg_mr + */ +static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, + u32 stag, u64 region_length, struct nes_root_vpbl *root_vpbl, + dma_addr_t single_buffer, u16 pbl_count, u16 residual_page_count, + int acc, u64 *iova_start) +{ + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_cqp_request *cqp_request; + unsigned long flags; + int ret; + struct nes_adapter *nesadapter = nesdev->nesadapter; + /* int count; */ + u16 major_code; + + /* Register the region with the adapter */ + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_MR, "Failed to get a cqp_request.\n"); + return -ENOMEM; + } + cqp_request->waiting = 1; + cqp_wqe = &cqp_request->cqp_wqe; + + spin_lock_irqsave(&nesdev->cqp.lock, flags); + /* track PBL resources */ + if (pbl_count != 0) { + if (pbl_count > 1) { + /* Two level PBL */ + if ((pbl_count+1) > nesadapter->free_4kpbl) { + nes_debug(NES_DBG_MR, "Out of 4KB Pbls for two level request.\n"); + if 
(cqp_request->dynamic) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + return -ENOMEM; + } else { + nesadapter->free_4kpbl -= pbl_count+1; + } + } else if (residual_page_count > 32) { + if (pbl_count > nesadapter->free_4kpbl) { + nes_debug(NES_DBG_MR, "Out of 4KB Pbls.\n"); + if (cqp_request->dynamic) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + return -ENOMEM; + } else { + nesadapter->free_4kpbl -= pbl_count; + } + } else { + if (pbl_count > nesadapter->free_256pbl) { + nes_debug(NES_DBG_MR, "Out of 256B Pbls.\n"); + if (cqp_request->dynamic) { + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + return -ENOMEM; + } else { + nesadapter->free_256pbl -= pbl_count; + } + } + } + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32( + NES_CQP_REGISTER_STAG | NES_CQP_STAG_RIGHTS_LOCAL_READ); + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32( + NES_CQP_STAG_VA_TO | NES_CQP_STAG_MR); + if (acc & IB_ACCESS_LOCAL_WRITE) { + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32(NES_CQP_STAG_RIGHTS_LOCAL_WRITE); + } + if (acc & IB_ACCESS_REMOTE_WRITE) { + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32( + NES_CQP_STAG_RIGHTS_REMOTE_WRITE | NES_CQP_STAG_REM_ACC_EN); + } + if (acc & IB_ACCESS_REMOTE_READ) { + 
cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32( + NES_CQP_STAG_RIGHTS_REMOTE_READ | NES_CQP_STAG_REM_ACC_EN); + } + if (acc & IB_ACCESS_MW_BIND) { + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32( + NES_CQP_STAG_RIGHTS_WINDOW_BIND | NES_CQP_STAG_REM_ACC_EN); + } + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_VA_LOW_IDX] = cpu_to_le32((u32)*iova_start); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_VA_HIGH_IDX] = + cpu_to_le32((u32)((((u64)*iova_start) >> 32))); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_LEN_LOW_IDX] = cpu_to_le32((u32)region_length); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_LEN_HIGH_PD_IDX] = + cpu_to_le32((u32)(region_length >> 8) & 0xff000000); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_LEN_HIGH_PD_IDX] |= + cpu_to_le32(nespd->pd_id & 0x00007fff); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_STAG_IDX] = cpu_to_le32(stag); + + if (pbl_count == 0) { + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PA_LOW_IDX] = + cpu_to_le32((u32)single_buffer); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PA_HIGH_IDX] = + cpu_to_le32((u32)((((u64)single_buffer) >> 32))); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_BLK_COUNT_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_LEN_IDX] = 0; + } else { + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PA_LOW_IDX] = + cpu_to_le32((u32)root_vpbl->pbl_pbase); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PA_HIGH_IDX] = + cpu_to_le32((u32)((((u64)root_vpbl->pbl_pbase) >> 32))); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_BLK_COUNT_IDX] = cpu_to_le32(pbl_count); + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_LEN_IDX] = + cpu_to_le32(((pbl_count-1) * 4096) + (residual_page_count*8)); + if ((pbl_count > 1) || (residual_page_count > 32)) { + 
cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32( NES_CQP_STAG_PBL_BLK_SIZE ); + } + } + barrier(); + + atomic_set(&cqp_request->refcount, 2); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + /* Wait for CQP */ + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_MR, "Register STag 0x%08X completed, wait_event_timeout ret = %u," + " CQP Major:Minor codes = 0x%04X:0x%04X.\n", + stag, ret, cqp_request->major_code, cqp_request->minor_code); + major_code = cqp_request->major_code; + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + if (!ret) + return -ETIME; + else if (major_code) + return -EIO; + else + return 0; + + return 0; +} + + +/** + * nes_reg_phys_mr + */ +static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, + struct ib_phys_buf *buffer_list, int num_phys_buf, int acc, + u64 * iova_start) { + u64 region_length; + struct nes_pd *nespd = to_nespd(ib_pd); + struct nes_vnic *nesvnic = to_nesvnic(ib_pd->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_mr *nesmr; + struct ib_mr *ibmr; + struct nes_vpbl vpbl; + struct nes_root_vpbl root_vpbl; + u32 stag; + u32 i; + u32 stag_index = 0; + u32 next_stag_index = 0; + u32 driver_key = 0; + u32 root_pbl_index = 0; + u32 cur_pbl_index = 0; + int err = 0, pbl_depth = 0; + int ret = 0; + u16 pbl_count = 0; + u8 single_page = 1; + u8 stag_key = 0; + + pbl_depth = 0; + region_length = 0; + vpbl.pbl_vbase = NULL; + root_vpbl.pbl_vbase = NULL; + root_vpbl.pbl_pbase = 0; + + get_random_bytes(&next_stag_index, 
sizeof(next_stag_index)); + stag_key = (u8)next_stag_index; + + driver_key = 0; + + next_stag_index >>= 8; + next_stag_index %= nesadapter->max_mr; + if (num_phys_buf > (1024*512)) { + return ERR_PTR(-E2BIG); + } + + err = nes_alloc_resource(nesadapter, nesadapter->allocated_mrs, nesadapter->max_mr, + &stag_index, &next_stag_index); + if (err) { + return ERR_PTR(err); + } + + nesmr = kmalloc(sizeof(*nesmr), GFP_KERNEL); + if (!nesmr) { + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + return ERR_PTR(-ENOMEM); + } + + for (i = 0; i < num_phys_buf; i++) { + + if ((i & 0x01FF) == 0) { + if (1 == root_pbl_index) { + /* Allocate the root PBL */ + root_vpbl.pbl_vbase = pci_alloc_consistent(nesdev->pcidev, 8192, + &root_vpbl.pbl_pbase); + nes_debug(NES_DBG_MR, "Allocating root PBL, va = %p, pa = 0x%08X\n", + root_vpbl.pbl_vbase, (unsigned int)root_vpbl.pbl_pbase); + if (!root_vpbl.pbl_vbase) { + pci_free_consistent(nesdev->pcidev, 4096, vpbl.pbl_vbase, + vpbl.pbl_pbase); + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + kfree(nesmr); + return ERR_PTR(-ENOMEM); + } + root_vpbl.leaf_vpbl = kmalloc(sizeof(*root_vpbl.leaf_vpbl)*1024, GFP_KERNEL); + if (!root_vpbl.leaf_vpbl) { + pci_free_consistent(nesdev->pcidev, 8192, root_vpbl.pbl_vbase, + root_vpbl.pbl_pbase); + pci_free_consistent(nesdev->pcidev, 4096, vpbl.pbl_vbase, + vpbl.pbl_pbase); + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + kfree(nesmr); + return ERR_PTR(-ENOMEM); + } + root_vpbl.pbl_vbase[0].pa_low = cpu_to_le32((u32)vpbl.pbl_pbase); + root_vpbl.pbl_vbase[0].pa_high = + cpu_to_le32((u32)((((u64)vpbl.pbl_pbase) >> 32))); + root_vpbl.leaf_vpbl[0] = vpbl; + } + /* Allocate a 4K buffer for the PBL */ + vpbl.pbl_vbase = pci_alloc_consistent(nesdev->pcidev, 4096, + &vpbl.pbl_pbase); + nes_debug(NES_DBG_MR, "Allocating leaf PBL, va = %p, pa = 0x%016lX\n", + vpbl.pbl_vbase, (unsigned long)vpbl.pbl_pbase); + if (!vpbl.pbl_vbase) { + 
nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + ibmr = ERR_PTR(-ENOMEM); + kfree(nesmr); + goto reg_phys_err; + } + /* Fill in the root table */ + if (1 <= root_pbl_index) { + root_vpbl.pbl_vbase[root_pbl_index].pa_low = + cpu_to_le32((u32)vpbl.pbl_pbase); + root_vpbl.pbl_vbase[root_pbl_index].pa_high = + cpu_to_le32((u32)((((u64)vpbl.pbl_pbase) >> 32))); + root_vpbl.leaf_vpbl[root_pbl_index] = vpbl; + } + root_pbl_index++; + cur_pbl_index = 0; + } + if (buffer_list[i].addr & ~PAGE_MASK) { + /* TODO: Unwind allocated buffers */ + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + nes_debug(NES_DBG_MR, "Unaligned Memory Buffer: 0x%x\n", + (unsigned int) buffer_list[i].addr); + ibmr = ERR_PTR(-EINVAL); + kfree(nesmr); + goto reg_phys_err; + } + + if (!buffer_list[i].size) { + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + nes_debug(NES_DBG_MR, "Invalid Buffer Size\n"); + ibmr = ERR_PTR(-EINVAL); + kfree(nesmr); + goto reg_phys_err; + } + + region_length += buffer_list[i].size; + if ((i != 0) && (single_page)) { + if ((buffer_list[i-1].addr+PAGE_SIZE) != buffer_list[i].addr) + single_page = 0; + } + vpbl.pbl_vbase[cur_pbl_index].pa_low = cpu_to_le32((u32)buffer_list[i].addr); + vpbl.pbl_vbase[cur_pbl_index++].pa_high = + cpu_to_le32((u32)((((u64)buffer_list[i].addr) >> 32))); + } + + stag = stag_index << 8; + stag |= driver_key; + stag += (u32)stag_key; + + nes_debug(NES_DBG_MR, "Registering STag 0x%08X, VA = 0x%016lX," + " length = 0x%016lX, index = 0x%08X\n", + stag, (unsigned long)*iova_start, (unsigned long)region_length, stag_index); + + region_length -= (*iova_start)&PAGE_MASK; + + /* Make the leaf PBL the root if only one PBL */ + if (root_pbl_index == 1) { + root_vpbl.pbl_pbase = vpbl.pbl_pbase; + } + + if (single_page) { + pbl_count = 0; + } else { + pbl_count = root_pbl_index; + } + ret = nes_reg_mr(nesdev, nespd, stag, region_length, &root_vpbl, + buffer_list[0].addr, pbl_count, 
(u16)cur_pbl_index, acc, iova_start); + + if (ret == 0) { + nesmr->ibmr.rkey = stag; + nesmr->ibmr.lkey = stag; + nesmr->mode = IWNES_MEMREG_TYPE_MEM; + ibmr = &nesmr->ibmr; + nesmr->pbl_4k = ((pbl_count > 1) || (cur_pbl_index > 32)) ? 1 : 0; + nesmr->pbls_used = pbl_count; + if (pbl_count > 1) { + nesmr->pbls_used++; + } + } else { + kfree(nesmr); + ibmr = ERR_PTR(-ENOMEM); + } + + reg_phys_err: + /* free the resources */ + if (root_pbl_index == 1) { + /* single PBL case */ + pci_free_consistent(nesdev->pcidev, 4096, vpbl.pbl_vbase, vpbl.pbl_pbase); + } else { + for (i=0; i<root_pbl_index; i++) { + pci_free_consistent(nesdev->pcidev, 4096, root_vpbl.leaf_vpbl[i].pbl_vbase, + root_vpbl.leaf_vpbl[i].pbl_pbase); + } + kfree(root_vpbl.leaf_vpbl); + pci_free_consistent(nesdev->pcidev, 8192, root_vpbl.pbl_vbase, + root_vpbl.pbl_pbase); + } + + return ibmr; +} + + +/** + * nes_get_dma_mr + */ +static struct ib_mr *nes_get_dma_mr(struct ib_pd *pd, int acc) { + struct ib_phys_buf bl; + u64 kva = 0; + + nes_debug(NES_DBG_MR, "\n"); + + bl.size = (u64)0xffffffffffULL; + bl.addr = 0; + return nes_reg_phys_mr(pd, &bl, 1, acc, &kva); +} + + +/** + * nes_reg_user_mr + */ +#ifdef OFED_1_2 +static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +#else +static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int acc, struct ib_udata *udata) +#endif + { + u64 iova_start; + u64 *pbl; + u64 region_length; + dma_addr_t last_dma_addr = 0; + dma_addr_t first_dma_addr = 0; + struct nes_pd *nespd = to_nespd(pd); + struct nes_vnic *nesvnic = to_nesvnic(pd->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct ib_mr *ibmr; + struct ib_umem_chunk *chunk; + struct nes_ucontext *nes_ucontext; + struct nes_pbl *nespbl; + struct nes_mr *nesmr; +#ifndef OFED_1_2 + struct ib_umem *region; +#endif + struct nes_mem_reg_req req; + struct nes_vpbl vpbl; + struct nes_root_vpbl root_vpbl; + int j; + 
int page_count = 0; + int err, pbl_depth = 0; + int ret; + u32 stag; + u32 stag_index = 0; + u32 next_stag_index; + u32 driver_key; + u32 root_pbl_index = 0; + u32 cur_pbl_index = 0; + u16 pbl_count; + u8 single_page = 1; + u8 stag_key; + + + nes_debug(NES_DBG_MR, "\n"); + +#ifdef OFED_1_2 + nes_debug(NES_DBG_MR, "User base = 0x%lX, Virt base = 0x%lX, length = %u," + " offset = %u, page size = %u.\n", + region->user_base, region->virt_base, (u32)region->length, + region->offset, region->page_size); +#else + region = ib_umem_get(pd->uobject->context, start, length, acc); + if (IS_ERR(region)) { + return (struct ib_mr *)region; + } + + nes_debug(NES_DBG_MR, "User base = 0x%lX, Virt base = 0x%lX, length = %u\n", + (unsigned long int)start, (unsigned long int)virt, (u32)length); +#endif + + if (ib_copy_from_udata(&req, udata, sizeof(req))) + return ERR_PTR(-EFAULT); + nes_debug(NES_DBG_MR, "Memory Registration type = %08X.\n", req.reg_type); + + switch (req.reg_type) { + case IWNES_MEMREG_TYPE_MEM: + pbl_depth = 0; + region_length = 0; + vpbl.pbl_vbase = NULL; + root_vpbl.pbl_vbase = NULL; + root_vpbl.pbl_pbase = 0; + + get_random_bytes(&next_stag_index, sizeof(next_stag_index)); + stag_key = (u8)next_stag_index; + + driver_key = 0; + + next_stag_index >>= 8; + next_stag_index %= nesadapter->max_mr; + + err = nes_alloc_resource(nesadapter, nesadapter->allocated_mrs, + nesadapter->max_mr, &stag_index, &next_stag_index); + if (err) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + return ERR_PTR(err); + } + + nesmr = kmalloc(sizeof(*nesmr), GFP_KERNEL); + if (!nesmr) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + return ERR_PTR(-ENOMEM); + } +#ifndef OFED_1_2 + nesmr->region = region; +#endif + + list_for_each_entry(chunk, &region->chunk_list, list) { + nes_debug(NES_DBG_MR, "Chunk: nents = %u, nmap = %u .\n", + chunk->nents, chunk->nmap); + for (j = 0; j < chunk->nmap; ++j) { + if 
((page_count&0x01FF) == 0) { + if (page_count>(1024*512)) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + pci_free_consistent(nesdev->pcidev, 4096, vpbl.pbl_vbase, + vpbl.pbl_pbase); + nes_free_resource(nesadapter, + nesadapter->allocated_mrs, stag_index); + kfree(nesmr); + return ERR_PTR(-E2BIG); + } + if (1 == root_pbl_index) { + root_vpbl.pbl_vbase = pci_alloc_consistent(nesdev->pcidev, + 8192, &root_vpbl.pbl_pbase); + nes_debug(NES_DBG_MR, "Allocating root PBL, va = %p, pa = 0x%08X\n", + root_vpbl.pbl_vbase, (unsigned int)root_vpbl.pbl_pbase); + if (!root_vpbl.pbl_vbase) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + pci_free_consistent(nesdev->pcidev, 4096, vpbl.pbl_vbase, + vpbl.pbl_pbase); + nes_free_resource(nesadapter, nesadapter->allocated_mrs, + stag_index); + kfree(nesmr); + return ERR_PTR(-ENOMEM); + } + root_vpbl.leaf_vpbl = kmalloc(sizeof(*root_vpbl.leaf_vpbl)*1024, + GFP_KERNEL); + if (!root_vpbl.leaf_vpbl) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + pci_free_consistent(nesdev->pcidev, 8192, root_vpbl.pbl_vbase, + root_vpbl.pbl_pbase); + pci_free_consistent(nesdev->pcidev, 4096, vpbl.pbl_vbase, + vpbl.pbl_pbase); + nes_free_resource(nesadapter, nesadapter->allocated_mrs, + stag_index); + kfree(nesmr); + return ERR_PTR(-ENOMEM); + } + root_vpbl.pbl_vbase[0].pa_low = + cpu_to_le32((u32)vpbl.pbl_pbase); + root_vpbl.pbl_vbase[0].pa_high = + cpu_to_le32((u32)((((u64)vpbl.pbl_pbase) >> 32))); + root_vpbl.leaf_vpbl[0] = vpbl; + } + vpbl.pbl_vbase = pci_alloc_consistent(nesdev->pcidev, 4096, + &vpbl.pbl_pbase); + nes_debug(NES_DBG_MR, "Allocating leaf PBL, va = %p, pa = 0x%08X\n", + vpbl.pbl_vbase, (unsigned int)vpbl.pbl_pbase); + if (!vpbl.pbl_vbase) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + ibmr = ERR_PTR(-ENOMEM); + kfree(nesmr); + goto reg_user_mr_err; + } + if (1 <= root_pbl_index) { + root_vpbl.pbl_vbase[root_pbl_index].pa_low = + 
cpu_to_le32((u32)vpbl.pbl_pbase); + root_vpbl.pbl_vbase[root_pbl_index].pa_high = + cpu_to_le32((u32)((((u64)vpbl.pbl_pbase)>>32))); + root_vpbl.leaf_vpbl[root_pbl_index] = vpbl; + } + root_pbl_index++; + cur_pbl_index = 0; + } + if (sg_dma_address(&chunk->page_list[j]) & ~PAGE_MASK) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); + nes_debug(NES_DBG_MR, "Unaligned Memory Buffer: 0x%x\n", + (unsigned int) sg_dma_address(&chunk->page_list[j])); + ibmr = ERR_PTR(-EINVAL); + kfree(nesmr); + goto reg_user_mr_err; + } + + if (!sg_dma_len(&chunk->page_list[j])) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + nes_free_resource(nesadapter, nesadapter->allocated_mrs, + stag_index); + nes_debug(NES_DBG_MR, "Invalid Buffer Size\n"); + ibmr = ERR_PTR(-EINVAL); + kfree(nesmr); + goto reg_user_mr_err; + } + + region_length += sg_dma_len(&chunk->page_list[j]); + if (single_page) { + if (page_count != 0) { + if ((last_dma_addr+PAGE_SIZE) != + sg_dma_address(&chunk->page_list[j])) + single_page = 0; + last_dma_addr = sg_dma_address(&chunk->page_list[j]); + } else { + first_dma_addr = sg_dma_address(&chunk->page_list[j]); + last_dma_addr = first_dma_addr; + } + } + + vpbl.pbl_vbase[cur_pbl_index].pa_low = + cpu_to_le32((u32)sg_dma_address(&chunk->page_list[j])); + vpbl.pbl_vbase[cur_pbl_index].pa_high = + cpu_to_le32((u32)((((u64)sg_dma_address(&chunk->page_list[j]))>>32))); + cur_pbl_index++; + page_count++; + } + } + + nes_debug(NES_DBG_MR, "calculating stag, stag_index=0x%08x, driver_key=0x%08x," + " stag_key=0x%08x\n", + stag_index, driver_key, stag_key); + stag = stag_index << 8; + stag |= driver_key; + stag += (u32)stag_key; + if (stag == 0) { + stag = 1; + } + +#ifdef OFED_1_2 + iova_start = (u64)region->virt_base; +#else + iova_start = virt; +#endif + nes_debug(NES_DBG_MR, "Registering STag 0x%08X, VA = 0x%08X, length = 0x%08X," + " index = 0x%08X, region->length=0x%08llx\n", + stag, 
(unsigned int)iova_start, + (unsigned int)region_length, stag_index, + (unsigned long long)region->length); + + /* Make the leaf PBL the root if only one PBL */ + if (root_pbl_index == 1) { + root_vpbl.pbl_pbase = vpbl.pbl_pbase; + } + + if (single_page) { + pbl_count = 0; + } else { + pbl_count = root_pbl_index; + first_dma_addr = 0; + } + ret = nes_reg_mr( nesdev, nespd, stag, region->length, &root_vpbl, + first_dma_addr, pbl_count, (u16)cur_pbl_index, acc, &iova_start); + + nes_debug(NES_DBG_MR, "ret=%d\n", ret); + + if (ret == 0) { + nesmr->ibmr.rkey = stag; + nesmr->ibmr.lkey = stag; + nesmr->mode = IWNES_MEMREG_TYPE_MEM; + ibmr = &nesmr->ibmr; + nesmr->pbl_4k = ((pbl_count > 1) || (cur_pbl_index > 32)) ? 1 : 0; + nesmr->pbls_used = pbl_count; + if (pbl_count > 1) { + nesmr->pbls_used++; + } + } else { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + kfree(nesmr); + ibmr = ERR_PTR(-ENOMEM); + } + + reg_user_mr_err: + /* free the resources */ + if (root_pbl_index == 1) { + pci_free_consistent(nesdev->pcidev, 4096, vpbl.pbl_vbase, + vpbl.pbl_pbase); + } else { + for (j=0; j<root_pbl_index; j++) { + pci_free_consistent(nesdev->pcidev, 4096, + root_vpbl.leaf_vpbl[j].pbl_vbase, + root_vpbl.leaf_vpbl[j].pbl_pbase); + } + kfree(root_vpbl.leaf_vpbl); + pci_free_consistent(nesdev->pcidev, 8192, root_vpbl.pbl_vbase, + root_vpbl.pbl_pbase); + } + + nes_debug(NES_DBG_MR, "Leaving, ibmr=%p", ibmr); + + return ibmr; + break; + case IWNES_MEMREG_TYPE_QP: +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + return ERR_PTR(-ENOSYS); + break; + case IWNES_MEMREG_TYPE_CQ: + nespbl = kmalloc(sizeof(*nespbl), GFP_KERNEL); + if (!nespbl) { + nes_debug(NES_DBG_MR, "Unable to allocate PBL\n"); +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + return ERR_PTR(-ENOMEM); + } + memset(nespbl, 0, sizeof(*nespbl)); + nesmr = kmalloc(sizeof(*nesmr), GFP_KERNEL); + if (!nesmr) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + kfree(nespbl); + nes_debug(NES_DBG_MR, "Unable to allocate nesmr\n"); + return ERR_PTR(-ENOMEM); + } 
+ memset(nesmr, 0, sizeof(*nesmr)); +#ifndef OFED_1_2 + nesmr->region = region; +#endif + nes_ucontext = to_nesucontext(pd->uobject->context); + pbl_depth = region->length >> PAGE_SHIFT; + pbl_depth += (region->length & ~PAGE_MASK) ? 1 : 0; + nespbl->pbl_size = pbl_depth*sizeof(u64); + nes_debug(NES_DBG_MR, "Attempting to allocate CQ PBL memory, %u bytes, %u entries.\n", + nespbl->pbl_size, pbl_depth); + pbl = pci_alloc_consistent(nesdev->pcidev, nespbl->pbl_size, + &nespbl->pbl_pbase); + if (!pbl) { +#ifndef OFED_1_2 + ib_umem_release(region); +#endif + kfree(nesmr); + kfree(nespbl); + nes_debug(NES_DBG_MR, "Unable to allocate cq PBL memory\n"); + return ERR_PTR(-ENOMEM); + } + + nespbl->pbl_vbase = pbl; +#ifdef OFED_1_2 + nespbl->user_base = region->user_base; +#else + nespbl->user_base = start; +#endif + nes_debug(NES_DBG_MR, "Allocated CQ PBL memory, %u bytes, pbl_pbase=%p," + " pbl_vbase=%p user_base=0x%lx\n", + nespbl->pbl_size, (void *)nespbl->pbl_pbase, + (void*)nespbl->pbl_vbase, nespbl->user_base); + + list_for_each_entry(chunk, &region->chunk_list, list) { + for (j = 0; j < chunk->nmap; ++j) { + ((u32 *)pbl)[0] = cpu_to_le32((u32)sg_dma_address(&chunk->page_list[j])); + ((u32 *)pbl)[1] = cpu_to_le32(((u64)sg_dma_address(&chunk->page_list[j]))>>32); + nes_debug(NES_DBG_MR, "pbl=%p, *pbl=0x%016llx, 0x%08x%08x\n", pbl, *pbl, le32_to_cpu(((u32 *)pbl)[1]), le32_to_cpu(((u32 *)pbl)[0])); + pbl++; + } + } + list_add_tail(&nespbl->list, &nes_ucontext->cq_reg_mem_list); + nesmr->ibmr.rkey = -1; + nesmr->ibmr.lkey = -1; + nesmr->mode = IWNES_MEMREG_TYPE_CQ; + return &nesmr->ibmr; + break; + } + + return ERR_PTR(-ENOSYS); +} + + +/** + * nes_dereg_mr + */ +static int nes_dereg_mr(struct ib_mr *ib_mr) +{ + struct nes_mr *nesmr = to_nesmr(ib_mr); + struct nes_vnic *nesvnic = to_nesvnic(ib_mr->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + struct nes_hw_cqp_wqe *cqp_wqe; + struct nes_cqp_request 
*cqp_request; + unsigned long flags; + int ret; + u16 major_code; + u16 minor_code; + +#ifndef OFED_1_2 + if (nesmr->region) { + ib_umem_release(nesmr->region); + } +#endif + if (nesmr->mode != IWNES_MEMREG_TYPE_MEM) { + kfree(nesmr); + return 0; + } + + /* Deallocate the region with the adapter */ + + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_MR, "Failed to get a cqp_request.\n"); + return -ENOMEM; + } + cqp_request->waiting = 1; + cqp_wqe = &cqp_request->cqp_wqe; + + spin_lock_irqsave(&nesdev->cqp.lock, flags); + if (0 != nesmr->pbls_used) { + if (nesmr->pbl_4k) { + nesadapter->free_4kpbl += nesmr->pbls_used; + if (nesadapter->free_4kpbl > nesadapter->max_4kpbl) { + printk(KERN_ERR PFX "free 4KB PBLs(%u) has exceeded the max(%u)\n", + nesadapter->free_4kpbl, nesadapter->max_4kpbl); + } + } else { + nesadapter->free_256pbl += nesmr->pbls_used; + if (nesadapter->free_256pbl > nesadapter->max_256pbl) { + printk(KERN_ERR PFX "free 256B PBLs(%u) has exceeded the max(%u)\n", + nesadapter->free_256pbl, nesadapter->max_256pbl); + } + } + } + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32( + NES_CQP_DEALLOCATE_STAG | NES_CQP_STAG_VA_TO | + NES_CQP_STAG_DEALLOC_PBLS | NES_CQP_STAG_MR); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_BLK_COUNT_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_PBL_LEN_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_STAG_WQE_STAG_IDX] = cpu_to_le32(ib_mr->rkey); + + atomic_set(&cqp_request->refcount, 2); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + /* Wait for CQP */ + 
spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + nes_debug(NES_DBG_MR, "Waiting for deallocate STag 0x%08X completed\n", ib_mr->rkey); + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_MR, "Deallocate STag 0x%08X completed, wait_event_timeout ret = %u," + " CQP Major:Minor codes = 0x%04X:0x%04X\n", + ib_mr->rkey, ret, cqp_request->major_code, cqp_request->minor_code); + + nes_free_resource(nesadapter, nesadapter->allocated_mrs, + (ib_mr->rkey & 0x0fffff00) >> 8); + + kfree(nesmr); + + major_code = cqp_request->major_code; + minor_code = cqp_request->minor_code; + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + if (!ret) { + nes_debug(NES_DBG_MR, "Timeout waiting to destroy STag," + " ib_mr=%p, rkey = 0x%08X\n", + ib_mr, ib_mr->rkey); + return -ETIME; + } else if (major_code) { + nes_debug(NES_DBG_MR, "Error (0x%04X:0x%04X) while attempting" + " to destroy STag, ib_mr=%p, rkey = 0x%08X\n", + major_code, minor_code, ib_mr, ib_mr->rkey); + return -EIO; + } else + return 0; +} + + +/** + * show_rev + */ +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct nes_ib_device *nesibdev = + container_of(cdev, struct nes_ib_device, ibdev.class_dev); + struct nes_vnic *nesvnic = nesibdev->nesvnic; + + nes_debug(NES_DBG_INIT, "\n"); + return sprintf(buf, "%x\n", nesvnic->nesdev->nesadapter->hw_rev); +} + + +/** + * show_fw_ver + */ +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct nes_ib_device *nesibdev = + container_of(cdev, struct nes_ib_device, ibdev.class_dev); + struct nes_vnic *nesvnic = nesibdev->nesvnic; + + nes_debug(NES_DBG_INIT, "\n"); + return 
sprintf(buf, "%x.%x.%x\n", + (int)(nesvnic->nesdev->nesadapter->fw_ver >> 32), + (int)(nesvnic->nesdev->nesadapter->fw_ver >> 16) & 0xffff, + (int)(nesvnic->nesdev->nesadapter->fw_ver & 0xffff)); +} + + +/** + * show_hca + */ +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + nes_debug(NES_DBG_INIT, "\n"); + return sprintf(buf, "NES020\n"); +} + + +/** + * show_board + */ +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + nes_debug(NES_DBG_INIT, "\n"); + return sprintf(buf, "%.*s\n", 32, "NES020 Board ID"); +} + + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *nes_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + + +/** + * nes_query_qp + */ +static int nes_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_qp_init_attr *init_attr) +{ + struct nes_qp *nesqp = to_nesqp(ibqp); + + nes_debug(NES_DBG_QP, "\n"); + + attr->qp_access_flags = 0; + attr->cap.max_send_wr = nesqp->hwqp.sq_size; + attr->cap.max_recv_wr = nesqp->hwqp.rq_size; + attr->cap.max_recv_sge = 1; + if (nes_drv_opt & NES_DRV_OPT_NO_INLINE_DATA) { + init_attr->cap.max_inline_data = 0; + } else { + init_attr->cap.max_inline_data = 64; + } + + init_attr->event_handler = nesqp->ibqp.event_handler; + init_attr->qp_context = nesqp->ibqp.qp_context; + init_attr->send_cq = nesqp->ibqp.send_cq; + init_attr->recv_cq = nesqp->ibqp.recv_cq; + init_attr->srq = nesqp->ibqp.srq = nesqp->ibqp.srq; + init_attr->cap = attr->cap; + + return 0; +} + + +/** + * nes_hw_modify_qp + */ +int nes_hw_modify_qp(struct nes_device *nesdev, struct nes_qp *nesqp, u32 next_iwarp_state, u32 wait_completion) +{ + u64 u64temp; + struct 
nes_hw_cqp_wqe *cqp_wqe; + /* struct iw_cm_id *cm_id = nesqp->cm_id; */ + /* struct iw_cm_event cm_event; */ + struct nes_cqp_request *cqp_request; + unsigned long flags; + int ret; + u16 major_code; + + nes_debug(NES_DBG_MOD_QP, "QP%u, refcount=%d\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount)); + + cqp_request = nes_get_cqp_request(nesdev, NES_CQP_REQUEST_NOT_HOLDING_LOCK); + if (NULL == cqp_request) { + nes_debug(NES_DBG_MOD_QP, "Failed to get a cqp_request.\n"); + return -ENOMEM; + } + if (wait_completion) { + cqp_request->waiting = 1; + } else { + cqp_request->waiting = 0; + } + cqp_wqe = &cqp_request->cqp_wqe; + + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] = cpu_to_le32( + NES_CQP_MODIFY_QP | NES_CQP_QP_TYPE_IWARP | next_iwarp_state); + nes_debug(NES_DBG_MOD_QP, "using next_iwarp_state=%08x, wqe_words=%08x\n", + next_iwarp_state, le32_to_cpu(cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX])); + + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(nesqp->hwqp.qp_id); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(&nesdev->cqp))); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32)); + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0; + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0; + u64temp = (u64)nesqp->nesqp_context_pbase; + cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_LOW_IDX] = cpu_to_le32((u32)u64temp); + cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32)); + + atomic_set(&cqp_request->refcount, 2); + nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_NOT_HOLDING_LOCK, + NES_CQP_REQUEST_RING_DOORBELL); + + /* Wait for CQP */ + if (wait_completion) { + /* nes_debug(NES_DBG_MOD_QP, "Waiting for modify iWARP QP%u to complete.\n", + nesqp->hwqp.qp_id); */ + ret = wait_event_timeout(cqp_request->waitq, (0 != cqp_request->request_done), + NES_EVENT_TIMEOUT); + nes_debug(NES_DBG_MOD_QP, "Modify iwarp QP%u completed, 
wait_event_timeout ret=%u, " + "CQP Major:Minor codes = 0x%04X:0x%04X.\n", + nesqp->hwqp.qp_id, ret, cqp_request->major_code, cqp_request->minor_code); + major_code = cqp_request->major_code; + if (major_code) { + nes_debug(NES_DBG_MOD_QP, "Modify iwarp QP%u failed" + "CQP Major:Minor codes = 0x%04X:0x%04X, intended next state = 0x%08X.\n", + nesqp->hwqp.qp_id, cqp_request->major_code, + cqp_request->minor_code, next_iwarp_state); + } + if (atomic_dec_and_test(&cqp_request->refcount)) { + if (cqp_request->dynamic) { + atomic_inc(&cqp_reqs_dynfreed); + kfree(cqp_request); + } else { + atomic_inc(&cqp_reqs_freed); + spin_lock_irqsave(&nesdev->cqp.lock, flags); + list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs); + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); + } + } + if (!ret) + return -ETIME; + else if (major_code) + return -EIO; + else + return 0; + } else { + return 0; + } +} + + +/** + * nes_modify_qp + */ +int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_udata *udata) +{ + struct nes_qp *nesqp = to_nesqp(ibqp); + struct nes_vnic *nesvnic = to_nesvnic(ibqp->device); + struct nes_device *nesdev = nesvnic->nesdev; + /* u32 cqp_head; */ + /* u32 counter; */ + u32 next_iwarp_state = 0; + int err; + unsigned long qplockflags; + int ret; + u16 original_last_aeq; + u8 issue_modify_qp = 0; + u8 issue_disconnect = 0; + u8 dont_wait = 0; + + nes_debug(NES_DBG_MOD_QP, "QP%u: QP State=%u, cur QP State=%u," + " iwarp_state=0x%X, refcount=%d\n", + nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state, + nesqp->iwarp_state, atomic_read(&nesqp->refcount)); + + nes_add_ref(&nesqp->ibqp); + spin_lock_irqsave(&nesqp->lock, qplockflags); + + nes_debug(NES_DBG_MOD_QP, "QP%u: hw_iwarp_state=0x%X, hw_tcp_state=0x%X," + " QP Access Flags=0x%X, attr_mask = 0x%0x\n", + nesqp->hwqp.qp_id, nesqp->hw_iwarp_state, + nesqp->hw_tcp_state, attr->qp_access_flags, attr_mask); + + if (attr_mask & IB_QP_STATE) { + switch (attr->qp_state) { + 
case IB_QPS_INIT: + nes_debug(NES_DBG_MOD_QP, "QP%u: new state = init\n", + nesqp->hwqp.qp_id); + if (nesqp->iwarp_state > (u32)NES_CQP_QP_IWARP_STATE_IDLE) { + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_rem_ref(&nesqp->ibqp); + return -EINVAL; + } + next_iwarp_state = NES_CQP_QP_IWARP_STATE_IDLE; + issue_modify_qp = 1; + break; + case IB_QPS_RTR: + nes_debug(NES_DBG_MOD_QP, "QP%u: new state = rtr\n", + nesqp->hwqp.qp_id); + if (nesqp->iwarp_state>(u32)NES_CQP_QP_IWARP_STATE_IDLE) { + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_rem_ref(&nesqp->ibqp); + return -EINVAL; + } + next_iwarp_state = NES_CQP_QP_IWARP_STATE_IDLE; + issue_modify_qp = 1; + break; + case IB_QPS_RTS: + nes_debug(NES_DBG_MOD_QP, "QP%u: new state = rts\n", + nesqp->hwqp.qp_id); + if (nesqp->iwarp_state>(u32)NES_CQP_QP_IWARP_STATE_RTS) { + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_rem_ref(&nesqp->ibqp); + return -EINVAL; + } + if (nesqp->cm_id == NULL) { + nes_debug(NES_DBG_MOD_QP, "QP%u: Failing attempt to move QP to RTS without a CM_ID. \n", + nesqp->hwqp.qp_id ); + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_rem_ref(&nesqp->ibqp); + return -EINVAL; + } + next_iwarp_state = NES_CQP_QP_IWARP_STATE_RTS; + if (nesqp->iwarp_state != NES_CQP_QP_IWARP_STATE_RTS) + next_iwarp_state |= NES_CQP_QP_CONTEXT_VALID | + NES_CQP_QP_ARP_VALID | NES_CQP_QP_ORD_VALID; + issue_modify_qp = 1; + nesqp->hw_tcp_state = NES_AEQE_TCP_STATE_ESTABLISHED; + nesqp->hw_iwarp_state = NES_AEQE_IWARP_STATE_RTS; + nesqp->hte_added = 1; + break; + case IB_QPS_SQD: + issue_modify_qp = 1; + nes_debug(NES_DBG_MOD_QP, "QP%u: new state=closing. 
SQ head=%u, SQ tail=%u\n", + nesqp->hwqp.qp_id, nesqp->hwqp.sq_head, nesqp->hwqp.sq_tail); + if (nesqp->iwarp_state==(u32)NES_CQP_QP_IWARP_STATE_CLOSING) { + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_rem_ref(&nesqp->ibqp); + return 0; + } else { + if (nesqp->iwarp_state > (u32)NES_CQP_QP_IWARP_STATE_CLOSING) { + nes_debug(NES_DBG_MOD_QP, "QP%u: State change to closing" + " ignored due to current iWARP state\n", + nesqp->hwqp.qp_id); + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_rem_ref(&nesqp->ibqp); + return -EINVAL; + } + if (nesqp->hw_iwarp_state != NES_AEQE_IWARP_STATE_RTS) { + nes_debug(NES_DBG_MOD_QP, "QP%u: State change to closing" + " already done based on hw state.\n", + nesqp->hwqp.qp_id); + issue_modify_qp = 0; + nesqp->in_disconnect = 0; + } + switch (nesqp->hw_iwarp_state) { + case NES_AEQE_IWARP_STATE_CLOSING: + next_iwarp_state = NES_CQP_QP_IWARP_STATE_CLOSING; + case NES_AEQE_IWARP_STATE_TERMINATE: + next_iwarp_state = NES_CQP_QP_IWARP_STATE_TERMINATE; + break; + case NES_AEQE_IWARP_STATE_ERROR: + next_iwarp_state = NES_CQP_QP_IWARP_STATE_ERROR; + break; + default: + next_iwarp_state = NES_CQP_QP_IWARP_STATE_CLOSING; + nesqp->in_disconnect = 1; + nesqp->hw_iwarp_state = NES_AEQE_IWARP_STATE_CLOSING; + break; + } + } + break; + case IB_QPS_SQE: + nes_debug(NES_DBG_MOD_QP, "QP%u: new state = terminate\n", + nesqp->hwqp.qp_id); + if (nesqp->iwarp_state>=(u32)NES_CQP_QP_IWARP_STATE_TERMINATE) { + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_rem_ref(&nesqp->ibqp); + return -EINVAL; + } + /* next_iwarp_state = (NES_CQP_QP_IWARP_STATE_TERMINATE | 0x02000000); */ + next_iwarp_state = NES_CQP_QP_IWARP_STATE_TERMINATE; + nesqp->hw_iwarp_state = NES_AEQE_IWARP_STATE_TERMINATE; + issue_modify_qp = 1; + nesqp->in_disconnect = 1; + break; + case IB_QPS_ERR: + case IB_QPS_RESET: + if (nesqp->iwarp_state==(u32)NES_CQP_QP_IWARP_STATE_ERROR) { + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_rem_ref(&nesqp->ibqp); + 
return -EINVAL; + } + nes_debug(NES_DBG_MOD_QP, "QP%u: new state = error\n", + nesqp->hwqp.qp_id); + next_iwarp_state = NES_CQP_QP_IWARP_STATE_ERROR; + /* next_iwarp_state = (NES_CQP_QP_IWARP_STATE_TERMINATE | 0x02000000); */ + if (nesqp->hte_added) { + nes_debug(NES_DBG_MOD_QP, "set CQP_QP_DEL_HTE\n"); + next_iwarp_state |= NES_CQP_QP_DEL_HTE; + nesqp->hte_added = 0; + } + if ((nesqp->hw_tcp_state > NES_AEQE_TCP_STATE_CLOSED) && + (nesqp->hw_tcp_state != NES_AEQE_TCP_STATE_TIME_WAIT)) { + next_iwarp_state |= NES_CQP_QP_RESET; + nesqp->in_disconnect = 1; + } else { + nes_debug(NES_DBG_MOD_QP, "QP%u NOT setting NES_CQP_QP_RESET since TCP state = %u\n", + nesqp->hwqp.qp_id, nesqp->hw_tcp_state); + dont_wait = 1; + } + issue_modify_qp = 1; + nesqp->hw_iwarp_state = NES_AEQE_IWARP_STATE_ERROR; + break; + default: + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_rem_ref(&nesqp->ibqp); + return -EINVAL; + break; + } + + nesqp->ibqp_state = attr->qp_state; + if (((nesqp->iwarp_state & NES_CQP_QP_IWARP_STATE_MASK) == + (u32)NES_CQP_QP_IWARP_STATE_RTS) && + ((next_iwarp_state & NES_CQP_QP_IWARP_STATE_MASK) > + (u32)NES_CQP_QP_IWARP_STATE_RTS)) { + nesqp->iwarp_state = next_iwarp_state & NES_CQP_QP_IWARP_STATE_MASK; + nes_debug(NES_DBG_MOD_QP, "Change nesqp->iwarp_state=%08x\n", + nesqp->iwarp_state); + issue_disconnect = 1; + } else { + nesqp->iwarp_state = next_iwarp_state & NES_CQP_QP_IWARP_STATE_MASK; + nes_debug(NES_DBG_MOD_QP, "Change nesqp->iwarp_state=%08x\n", + nesqp->iwarp_state); + } + } + + if (attr_mask & IB_QP_ACCESS_FLAGS) { + if (attr->qp_access_flags & IB_ACCESS_LOCAL_WRITE) { + nesqp->nesqp_context->misc |= cpu_to_le32(NES_QPCONTEXT_MISC_RDMA_WRITE_EN | + NES_QPCONTEXT_MISC_RDMA_READ_EN); + issue_modify_qp = 1; + } + if (attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE) { + nesqp->nesqp_context->misc |= cpu_to_le32(NES_QPCONTEXT_MISC_RDMA_WRITE_EN); + issue_modify_qp = 1; + } + if (attr->qp_access_flags & IB_ACCESS_REMOTE_READ) { + 
nesqp->nesqp_context->misc |= cpu_to_le32(NES_QPCONTEXT_MISC_RDMA_READ_EN); + issue_modify_qp = 1; + } + if (attr->qp_access_flags & IB_ACCESS_MW_BIND) { + nesqp->nesqp_context->misc |= cpu_to_le32(NES_QPCONTEXT_MISC_WBIND_EN); + issue_modify_qp = 1; + } + + if (nesqp->user_mode) { + nesqp->nesqp_context->misc |= cpu_to_le32(NES_QPCONTEXT_MISC_RDMA_WRITE_EN | + NES_QPCONTEXT_MISC_RDMA_READ_EN); + issue_modify_qp = 1; + } + } + + original_last_aeq = nesqp->last_aeq; + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + + nes_debug(NES_DBG_MOD_QP, "issue_modify_qp=%u\n", issue_modify_qp); + + ret = 0; + + + if (issue_modify_qp) { + nes_debug(NES_DBG_MOD_QP, "call nes_hw_modify_qp\n"); + ret = nes_hw_modify_qp(nesdev, nesqp, next_iwarp_state, 1); + if (ret) + nes_debug(NES_DBG_MOD_QP, "nes_hw_modify_qp (next_iwarp_state = 0x%08X)" + " failed for QP%u.\n", + next_iwarp_state, nesqp->hwqp.qp_id); + + } + + if ((issue_modify_qp) && (nesqp->ibqp_state > IB_QPS_RTS)) { + nes_debug(NES_DBG_MOD_QP, "QP%u Issued ModifyQP refcount (%d)," + " original_last_aeq = 0x%04X. last_aeq = 0x%04X.\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount), + original_last_aeq, nesqp->last_aeq); + if ((!ret) || + ((original_last_aeq != NES_AEQE_AEID_RDMAP_ROE_BAD_LLP_CLOSE) && + (ret))) { + if (dont_wait) { + if (nesqp->cm_id && nesqp->hw_tcp_state != 0) { + nes_debug(NES_DBG_MOD_QP, "QP%u Queuing fake disconnect for QP refcount (%d)," + " original_last_aeq = 0x%04X. 
last_aeq = 0x%04X.\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount), + original_last_aeq, nesqp->last_aeq); + /* this one is for the cm_disconnect thread */ + nes_add_ref(&nesqp->ibqp); + spin_lock_irqsave(&nesqp->lock, qplockflags); + nesqp->hw_tcp_state = NES_AEQE_TCP_STATE_CLOSED; + nesqp->last_aeq = NES_AEQE_AEID_RESET_SENT; + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_cm_disconn(nesqp); + } else { + nes_debug(NES_DBG_MOD_QP, "QP%u No fake disconnect, QP refcount=%d\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount)); + nes_rem_ref(&nesqp->ibqp); + } + } else { + spin_lock_irqsave(&nesqp->lock, qplockflags); + if (nesqp->cm_id) { + /* These two are for the timer thread */ + if (atomic_inc_return(&nesqp->close_timer_started)==1) { + nes_add_ref(&nesqp->ibqp); + nesqp->cm_id->add_ref(nesqp->cm_id); + nes_debug(NES_DBG_MOD_QP, "QP%u Not decrementing QP refcount (%d)," + " need ae to finish up, original_last_aeq = 0x%04X." + " last_aeq = 0x%04X, scheduling timer.\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount), + original_last_aeq, nesqp->last_aeq); + schedule_nes_timer(nesqp->cm_node, (struct sk_buff *) nesqp, NES_TIMER_TYPE_CLOSE, 1, 0); + } + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + } else { + spin_unlock_irqrestore(&nesqp->lock, qplockflags); + nes_debug(NES_DBG_MOD_QP, "QP%u Not decrementing QP refcount (%d)," + " need ae to finish up, original_last_aeq = 0x%04X." + " last_aeq = 0x%04X.\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount), + original_last_aeq, nesqp->last_aeq); + } + } + } else { + nes_debug(NES_DBG_MOD_QP, "QP%u Decrementing QP refcount (%d), No ae to finish up," + " original_last_aeq = 0x%04X. last_aeq = 0x%04X.\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount), + original_last_aeq, nesqp->last_aeq); + nes_rem_ref(&nesqp->ibqp); + } + } else { + nes_debug(NES_DBG_MOD_QP, "QP%u Decrementing QP refcount (%d), No ae to finish up," + " original_last_aeq = 0x%04X. 
last_aeq = 0x%04X.\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount), + original_last_aeq, nesqp->last_aeq); + nes_rem_ref(&nesqp->ibqp); + } + + err = 0; + + nes_debug(NES_DBG_MOD_QP, "QP%u Leaving, refcount=%d\n", + nesqp->hwqp.qp_id, atomic_read(&nesqp->refcount)); + + return err; +} + + +/** + * nes_multicast_attach + */ +static int nes_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + nes_debug(NES_DBG_INIT, "\n"); + return -ENOSYS; +} + + +/** + * nes_multicast_detach + */ +static int nes_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + nes_debug(NES_DBG_INIT, "\n"); + return -ENOSYS; +} + + +/** + * nes_process_mad + */ +static int nes_process_mad(struct ib_device *ibdev, int mad_flags, + u8 port_num, struct ib_wc *in_wc, struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + nes_debug(NES_DBG_INIT, "\n"); + return -ENOSYS; +} + + +/** + * nes_post_send + */ +static int nes_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, + struct ib_send_wr **bad_wr) +{ + u64 u64temp; + unsigned long flags = 0; + struct nes_vnic *nesvnic = to_nesvnic(ibqp->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_qp *nesqp = to_nesqp(ibqp); + struct nes_hw_qp_wqe *wqe; + int err; + int sge_index; + u32 qsize = nesqp->hwqp.sq_size; + u32 head; + u32 wqe_misc; + u32 wqe_count; + u32 counter; + u32 total_payload_length; + + err = 0; + wqe_misc = 0; + wqe_count = 0; + total_payload_length = 0; + + nes_debug(NES_DBG_IW_TX, "\n"); + if (nesqp->ibqp_state > IB_QPS_RTS) + return -EINVAL; + + spin_lock_irqsave(&nesqp->lock, flags); + + head = nesqp->hwqp.sq_head; + + while (ib_wr) { + /* Check for SQ overflow */ + if (((head + (2 * qsize) - nesqp->hwqp.sq_tail) % qsize) == (qsize - 1)) { + err = -EINVAL; + break; + } + + wqe = &nesqp->hwqp.sq_vbase[head]; + /* nes_debug(NES_DBG_IW_TX, "processing sq wqe for QP%u at %p, head = %u.\n", + nesqp->hwqp.qp_id, wqe, head); */ + u64temp = 
(u64)(ib_wr->wr_id); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_LOW_IDX] = cpu_to_le32((u32)u64temp); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_HIGH_IDX] = cpu_to_le32((u32)((u64temp)>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(nesqp))); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(nesqp))>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] |= cpu_to_le32(head); + + switch (ib_wr->opcode) { + case IB_WR_SEND: + if (ib_wr->send_flags & IB_SEND_SOLICITED) { + wqe_misc = NES_IWARP_SQ_OP_SENDSE; + } else { + wqe_misc = NES_IWARP_SQ_OP_SEND; + } + if (ib_wr->num_sge > nesdev->nesadapter->max_sge) { + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + wqe_misc |= NES_IWARP_SQ_WQE_LOCAL_FENCE; + } + if ((ib_wr->send_flags & IB_SEND_INLINE) && + (0 == (nes_drv_opt & NES_DRV_OPT_NO_INLINE_DATA)) && + (ib_wr->sg_list[0].length <= 64)) { + memcpy(&wqe->wqe_words[NES_IWARP_SQ_WQE_IMM_DATA_START_IDX], + (void *)ib_wr->sg_list[0].addr, ib_wr->sg_list[0].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = cpu_to_le32( + ib_wr->sg_list[0].length); + wqe_misc |= NES_IWARP_SQ_WQE_IMM_DATA; + } else { + total_payload_length = 0; + for (sge_index=0; sge_index < ib_wr->num_sge; sge_index++) { + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_LOW_IDX+(sge_index*4)] = + cpu_to_le32((u32)ib_wr->sg_list[sge_index].addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX+(sge_index*4)] = + cpu_to_le32((u32)(ib_wr->sg_list[sge_index].addr>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_LENGTH0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_STAG0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].lkey); + total_payload_length += ib_wr->sg_list[sge_index].length; + } + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = + cpu_to_le32(total_payload_length); + } + + break; + case IB_WR_RDMA_WRITE: + wqe_misc = 
NES_IWARP_SQ_OP_RDMAW; + if (ib_wr->num_sge > nesdev->nesadapter->max_sge) { + nes_debug(NES_DBG_IW_TX, "Exceeded max sge, ib_wr=%u, max=%u\n", + ib_wr->num_sge, + nesdev->nesadapter->max_sge); + err = -EINVAL; + break; + } + if (ib_wr->send_flags & IB_SEND_FENCE) { + wqe_misc |= NES_IWARP_SQ_WQE_LOCAL_FENCE; + } + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_STAG_IDX] = + cpu_to_le32(ib_wr->wr.rdma.rkey); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_TO_LOW_IDX] = + cpu_to_le32(ib_wr->wr.rdma.remote_addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_TO_HIGH_IDX] = + cpu_to_le32((u32)(ib_wr->wr.rdma.remote_addr >> 32)); + + if ((ib_wr->send_flags & IB_SEND_INLINE) && + (0 == (nes_drv_opt & NES_DRV_OPT_NO_INLINE_DATA)) && + (ib_wr->sg_list[0].length <= 64)) { + memcpy(&wqe->wqe_words[NES_IWARP_SQ_WQE_IMM_DATA_START_IDX], + (void *)ib_wr->sg_list[0].addr, ib_wr->sg_list[0].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = cpu_to_le32( + ib_wr->sg_list[0].length); + wqe_misc |= NES_IWARP_SQ_WQE_IMM_DATA; + } else { + total_payload_length = 0; + for (sge_index=0; sge_index < ib_wr->num_sge; sge_index++) { + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_LOW_IDX+(sge_index*4)] = + cpu_to_le32((u32)ib_wr->sg_list[sge_index].addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX+(sge_index*4)] = + cpu_to_le32((u32)(ib_wr->sg_list[sge_index].addr>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_LENGTH0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_STAG0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].lkey); + total_payload_length += ib_wr->sg_list[sge_index].length; + } + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = + cpu_to_le32(total_payload_length); + } + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_LENGTH_IDX] = + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX]; + break; + case IB_WR_RDMA_READ: + /* iWARP only supports 1 sge for RDMA reads */ + if (ib_wr->num_sge > 1) { + nes_debug(NES_DBG_IW_TX, "Exceeded max sge, 
ib_wr=%u, max=1\n", + ib_wr->num_sge); + err = -EINVAL; + break; + } + wqe_misc = NES_IWARP_SQ_OP_RDMAR; + + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_TO_LOW_IDX] = + cpu_to_le32(ib_wr->wr.rdma.remote_addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_TO_HIGH_IDX] = + cpu_to_le32((u32)(ib_wr->wr.rdma.remote_addr >> 32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_STAG_IDX] = + cpu_to_le32(ib_wr->wr.rdma.rkey); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_LENGTH_IDX] = + cpu_to_le32(ib_wr->sg_list->length); + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_LOW_IDX] = + cpu_to_le32(ib_wr->sg_list->addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX] = + cpu_to_le32((u32)(ib_wr->sg_list->addr >> 32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_STAG0_IDX] = + cpu_to_le32(ib_wr->sg_list->lkey); + break; + default: + /* error */ + err = -EINVAL; + break; + } + + if (ib_wr->send_flags & IB_SEND_SIGNALED) { + wqe_misc |= NES_IWARP_SQ_WQE_SIGNALED_COMPL; + } + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] = cpu_to_le32(wqe_misc); + + ib_wr = ib_wr->next; + head++; + wqe_count++; + if (head >= qsize) + head = 0; + + } + + nesqp->hwqp.sq_head = head; + barrier(); + while (wqe_count) { + counter = min(wqe_count, ((u32)255)); + wqe_count -= counter; + nes_write32(nesdev->regs + NES_WQE_ALLOC, + (counter << 24) | 0x00800000 | nesqp->hwqp.qp_id); + } + + spin_unlock_irqrestore(&nesqp->lock, flags); + + if (err) + *bad_wr = ib_wr; + return err; +} + + +/** + * nes_post_recv + */ +static int nes_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, + struct ib_recv_wr **bad_wr) +{ + u64 u64temp; + unsigned long flags = 0; + struct nes_vnic *nesvnic = to_nesvnic(ibqp->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_qp *nesqp = to_nesqp(ibqp); + struct nes_hw_qp_wqe *wqe; + int err = 0; + int sge_index; + u32 qsize = nesqp->hwqp.rq_size; + u32 head; + u32 wqe_count = 0; + u32 counter; + u32 total_payload_length; + + nes_debug(NES_DBG_IW_RX, "\n"); + if (nesqp->ibqp_state > IB_QPS_RTS) + return 
-EINVAL; + + spin_lock_irqsave(&nesqp->lock, flags); + + head = nesqp->hwqp.rq_head; + + while (ib_wr) { + if (ib_wr->num_sge > nesdev->nesadapter->max_sge) { + err = -EINVAL; + break; + } + /* Check for RQ overflow */ + if (((head + (2 * qsize) - nesqp->hwqp.rq_tail) % qsize) == (qsize - 1)) { + err = -EINVAL; + break; + } + + nes_debug(NES_DBG_IW_RX, "ibwr sge count = %u.\n", ib_wr->num_sge); + wqe = &nesqp->hwqp.rq_vbase[head]; + + /* nes_debug(NES_DBG_IW_RX, "QP%u:processing rq wqe at %p, head = %u.\n", + nesqp->hwqp.qp_id, wqe, head); */ + u64temp = (u64)(ib_wr->wr_id); + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_SCRATCH_LOW_IDX] = cpu_to_le32((u32)(u64temp)); + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_SCRATCH_HIGH_IDX] = cpu_to_le32((u32)((u64temp)>>32)); + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((u32)((u64)(nesqp))); + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((u32)(((u64)(nesqp))>>32)); + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_CTX_LOW_IDX] |= cpu_to_le32(head); + + total_payload_length = 0; + for (sge_index=0; sge_index < ib_wr->num_sge; sge_index++) { + wqe->wqe_words[NES_IWARP_RQ_WQE_FRAG0_LOW_IDX+(sge_index*4)] = + cpu_to_le32((u32)ib_wr->sg_list[sge_index].addr); + wqe->wqe_words[NES_IWARP_RQ_WQE_FRAG0_HIGH_IDX+(sge_index*4)] = + cpu_to_le32((u32)(ib_wr->sg_list[sge_index].addr >> 32)); + wqe->wqe_words[NES_IWARP_RQ_WQE_LENGTH0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].length); + wqe->wqe_words[NES_IWARP_RQ_WQE_STAG0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].lkey); + total_payload_length += ib_wr->sg_list[sge_index].length; + } + wqe->wqe_words[NES_IWARP_RQ_WQE_TOTAL_PAYLOAD_IDX] = + cpu_to_le32(total_payload_length); + + ib_wr = ib_wr->next; + head++; + wqe_count++; + if (head >= qsize) + head = 0; + } + + nesqp->hwqp.rq_head = head; + barrier(); + while (wqe_count) { + counter = min(wqe_count, ((u32)255)); + wqe_count -= counter; + nes_write32(nesdev->regs+NES_WQE_ALLOC, 
(counter<<24) | nesqp->hwqp.qp_id); + } + + spin_unlock_irqrestore(&nesqp->lock, flags); + + if (err) + *bad_wr = ib_wr; + return err; +} + + +/** + * nes_poll_cq + */ +static int nes_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) +{ + u64 u64temp; + u64 wrid; + /* u64 u64temp; */ + unsigned long flags = 0; + struct nes_vnic *nesvnic = to_nesvnic(ibcq->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_cq *nescq = to_nescq(ibcq); + struct nes_qp *nesqp; + struct nes_hw_cqe cqe; + u32 head; + u32 wq_tail; + u32 cq_size; + u32 cqe_count=0; + u32 wqe_index; + u32 u32temp; + /* u32 counter; */ + + nes_debug(NES_DBG_CQ, "\n"); + + spin_lock_irqsave(&nescq->lock, flags); + + head = nescq->hw_cq.cq_head; + cq_size = nescq->hw_cq.cq_size; + + while (cqe_count<num_entries) { + if (le32_to_cpu(nescq->hw_cq.cq_vbase[head].cqe_words[NES_CQE_OPCODE_IDX]) & + NES_CQE_VALID) { + cqe = nescq->hw_cq.cq_vbase[head]; + nescq->hw_cq.cq_vbase[head].cqe_words[NES_CQE_OPCODE_IDX] = 0; + u32temp = le32_to_cpu(cqe.cqe_words[NES_CQE_COMP_COMP_CTX_LOW_IDX]); + wqe_index = u32temp & + (nesdev->nesadapter->max_qp_wr - 1); + u32temp &= ~(NES_SW_CONTEXT_ALIGN-1); + /* parse CQE, get completion context from WQE (either rq or sq */ + u64temp = (((u64)(le32_to_cpu(cqe.cqe_words[NES_CQE_COMP_COMP_CTX_HIGH_IDX])))<<32) | + ((u64)u32temp); + nesqp = *((struct nes_qp **)&u64temp); + memset(entry, 0, sizeof *entry); + if (0 == cqe.cqe_words[NES_CQE_ERROR_CODE_IDX]) { + entry->status = IB_WC_SUCCESS; + } else { + entry->status = IB_WC_WR_FLUSH_ERR; + } + + entry->qp = &nesqp->ibqp; + entry->src_qp = nesqp->hwqp.qp_id; + + if (le32_to_cpu(cqe.cqe_words[NES_CQE_OPCODE_IDX]) & NES_CQE_SQ) { + if (nesqp->skip_lsmm) { + nesqp->skip_lsmm = 0; + wq_tail = nesqp->hwqp.sq_tail++; + } + + /* Working on a SQ Completion*/ + wq_tail = wqe_index; + nesqp->hwqp.sq_tail = (wqe_index+1)&(nesqp->hwqp.sq_size - 1); + wrid = 
(((u64)(cpu_to_le32((u32)nesqp->hwqp.sq_vbase[wq_tail].wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_HIGH_IDX])))<<32) | + ((u64)(cpu_to_le32((u32)nesqp->hwqp.sq_vbase[wq_tail].wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_LOW_IDX]))); + entry->byte_len = le32_to_cpu(nesqp->hwqp.sq_vbase[wq_tail]. + wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX]); + + switch (le32_to_cpu(nesqp->hwqp.sq_vbase[wq_tail]. + wqe_words[NES_IWARP_SQ_WQE_MISC_IDX]) & 0x3f) { + case NES_IWARP_SQ_OP_RDMAW: + nes_debug(NES_DBG_CQ, "Operation = RDMA WRITE.\n"); + entry->opcode = IB_WC_RDMA_WRITE; + break; + case NES_IWARP_SQ_OP_RDMAR: + nes_debug(NES_DBG_CQ, "Operation = RDMA READ.\n"); + entry->opcode = IB_WC_RDMA_READ; + entry->byte_len = le32_to_cpu(nesqp->hwqp.sq_vbase[wq_tail]. + wqe_words[NES_IWARP_SQ_WQE_RDMA_LENGTH_IDX]); + break; + case NES_IWARP_SQ_OP_SENDINV: + case NES_IWARP_SQ_OP_SENDSEINV: + case NES_IWARP_SQ_OP_SEND: + case NES_IWARP_SQ_OP_SENDSE: + nes_debug(NES_DBG_CQ, "Operation = Send.\n"); + entry->opcode = IB_WC_SEND; + break; + } + } else { + /* Working on a RQ Completion*/ + wq_tail = wqe_index; + nesqp->hwqp.rq_tail = (wqe_index+1)&(nesqp->hwqp.rq_size - 1); + entry->byte_len = le32_to_cpu(cqe.cqe_words[NES_CQE_PAYLOAD_LENGTH_IDX]); + wrid = ((u64)(le32_to_cpu(nesqp->hwqp.rq_vbase[wq_tail].wqe_words[NES_IWARP_RQ_WQE_COMP_SCRATCH_LOW_IDX]))) | + ((u64)(le32_to_cpu(nesqp->hwqp.rq_vbase[wq_tail].wqe_words[NES_IWARP_RQ_WQE_COMP_SCRATCH_HIGH_IDX]))<<32); + entry->opcode = IB_WC_RECV; + } + entry->wr_id = wrid; + + if (++head >= cq_size) + head = 0; + cqe_count++; + nescq->polled_completions++; + if ((nescq->polled_completions > (cq_size/2)) || + (nescq->polled_completions == 255)) { + nes_debug(NES_DBG_CQ, "CQ%u Issuing CQE Allocate since more than half of cqes" + " are pending %u of %u.\n", + nescq->hw_cq.cq_number, nescq->polled_completions, cq_size); + nes_write32(nesdev->regs+NES_CQE_ALLOC, + nescq->hw_cq.cq_number | (nescq->polled_completions << 16)); + nescq->polled_completions = 
0; + } + entry++; + } else + break; + } + + if (nescq->polled_completions) { + nes_write32(nesdev->regs+NES_CQE_ALLOC, + nescq->hw_cq.cq_number | (nescq->polled_completions << 16)); + nescq->polled_completions = 0; + } + + nescq->hw_cq.cq_head = head; + nes_debug(NES_DBG_CQ, "Reporting %u completions for CQ%u.\n", + cqe_count, nescq->hw_cq.cq_number); + + spin_unlock_irqrestore(&nescq->lock, flags); + + return cqe_count; +} + + +/** + * nes_req_notify_cq + */ +#ifdef OFED_1_2 +static int nes_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +#else +static int nes_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags notify_flags) +#endif + { + struct nes_vnic *nesvnic = to_nesvnic(ibcq->device); + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_cq *nescq = to_nescq(ibcq); + u32 cq_arm; + + nes_debug(NES_DBG_CQ, "Requesting notification for CQ%u.\n", + nescq->hw_cq.cq_number); + + cq_arm = nescq->hw_cq.cq_number; +#ifdef OFED_1_2 + if (notify == IB_CQ_NEXT_COMP) + cq_arm |= NES_CQE_ALLOC_NOTIFY_NEXT; + else if (notify == IB_CQ_SOLICITED) + cq_arm |= NES_CQE_ALLOC_NOTIFY_SE; +#else + if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_NEXT_COMP) + cq_arm |= NES_CQE_ALLOC_NOTIFY_NEXT; + else if ((notify_flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED) + cq_arm |= NES_CQE_ALLOC_NOTIFY_SE; +#endif + else + return -EINVAL; + + nes_write32(nesdev->regs+NES_CQE_ALLOC, cq_arm); + nes_read32(nesdev->regs+NES_CQE_ALLOC); + + return 0; +} + + +/** + * nes_init_ofa_device + */ +struct nes_ib_device *nes_init_ofa_device(struct net_device *netdev) { + struct nes_ib_device *nesibdev; + struct nes_vnic *nesvnic = netdev_priv(netdev); + struct nes_device *nesdev = nesvnic->nesdev; + + nesibdev = (struct nes_ib_device *)ib_alloc_device(sizeof(struct nes_ib_device)); + if (nesibdev == NULL) { + return NULL; + } + strlcpy(nesibdev->ibdev.name, "nes%d", IB_DEVICE_NAME_MAX); + nesibdev->ibdev.owner = THIS_MODULE; + + nesibdev->ibdev.node_type = RDMA_NODE_RNIC; + 
memset(&nesibdev->ibdev.node_guid, 0, sizeof(nesibdev->ibdev.node_guid)); + memcpy(&nesibdev->ibdev.node_guid, netdev->dev_addr, 6); + + nesibdev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_AH) | + (1ull << IB_USER_VERBS_CMD_DESTROY_AH) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_ALLOC_MW) | + (1ull << IB_USER_VERBS_CMD_BIND_MW) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_MW) | + (1ull << IB_USER_VERBS_CMD_POST_RECV) | + (1ull << IB_USER_VERBS_CMD_POST_SEND); + + nesibdev->ibdev.phys_port_cnt = 1; +#ifndef OFED_1_2 + nesibdev->ibdev.num_comp_vectors = 1; +#endif + nesibdev->ibdev.dma_device = &nesdev->pcidev->dev; + nesibdev->ibdev.class_dev.dev = &nesdev->pcidev->dev; + nesibdev->ibdev.query_device = nes_query_device; + nesibdev->ibdev.query_port = nes_query_port; + nesibdev->ibdev.modify_port = nes_modify_port; + nesibdev->ibdev.query_pkey = nes_query_pkey; + nesibdev->ibdev.query_gid = nes_query_gid; + nesibdev->ibdev.alloc_ucontext = nes_alloc_ucontext; + nesibdev->ibdev.dealloc_ucontext = nes_dealloc_ucontext; + nesibdev->ibdev.mmap = nes_mmap; + nesibdev->ibdev.alloc_pd = nes_alloc_pd; + nesibdev->ibdev.dealloc_pd = nes_dealloc_pd; + nesibdev->ibdev.create_ah = nes_create_ah; + nesibdev->ibdev.destroy_ah = nes_destroy_ah; + nesibdev->ibdev.create_qp = nes_create_qp; + nesibdev->ibdev.modify_qp = nes_modify_qp; + 
nesibdev->ibdev.query_qp = nes_query_qp; + nesibdev->ibdev.destroy_qp = nes_destroy_qp; + nesibdev->ibdev.create_cq = nes_create_cq; + nesibdev->ibdev.destroy_cq = nes_destroy_cq; + nesibdev->ibdev.poll_cq = nes_poll_cq; + nesibdev->ibdev.get_dma_mr = nes_get_dma_mr; + nesibdev->ibdev.reg_phys_mr = nes_reg_phys_mr; + nesibdev->ibdev.reg_user_mr = nes_reg_user_mr; + nesibdev->ibdev.dereg_mr = nes_dereg_mr; + nesibdev->ibdev.alloc_mw = nes_alloc_mw; + nesibdev->ibdev.dealloc_mw = nes_dealloc_mw; + nesibdev->ibdev.bind_mw = nes_bind_mw; + + nesibdev->ibdev.alloc_fmr = nes_alloc_fmr; + nesibdev->ibdev.unmap_fmr = nes_unmap_fmr; + nesibdev->ibdev.dealloc_fmr = nes_dealloc_fmr; + nesibdev->ibdev.map_phys_fmr = nes_map_phys_fmr; + + nesibdev->ibdev.attach_mcast = nes_multicast_attach; + nesibdev->ibdev.detach_mcast = nes_multicast_detach; + nesibdev->ibdev.process_mad = nes_process_mad; + + nesibdev->ibdev.req_notify_cq = nes_req_notify_cq; + nesibdev->ibdev.post_send = nes_post_send; + nesibdev->ibdev.post_recv = nes_post_recv; + + nesibdev->ibdev.iwcm = kmalloc(sizeof(*nesibdev->ibdev.iwcm), GFP_KERNEL); + if (nesibdev->ibdev.iwcm == NULL) { + ib_dealloc_device(&nesibdev->ibdev); + return NULL; + } + nesibdev->ibdev.iwcm->add_ref = nes_add_ref; + nesibdev->ibdev.iwcm->rem_ref = nes_rem_ref; + nesibdev->ibdev.iwcm->get_qp = nes_get_qp; + nesibdev->ibdev.iwcm->connect = nes_connect; + nesibdev->ibdev.iwcm->accept = nes_accept; + nesibdev->ibdev.iwcm->reject = nes_reject; + nesibdev->ibdev.iwcm->create_listen = nes_create_listen; + nesibdev->ibdev.iwcm->destroy_listen = nes_destroy_listen; + + return nesibdev; +} + + +/** + * nes_destroy_ofa_device + */ +void nes_destroy_ofa_device(struct nes_ib_device *nesibdev) +{ + if (NULL == nesibdev) + return; + + nes_unregister_ofa_device(nesibdev); + + kfree(nesibdev->ibdev.iwcm); + ib_dealloc_device(&nesibdev->ibdev); + + nes_debug(NES_DBG_SHUTDOWN, "\n"); +} + + +/** + * nes_register_ofa_device + */ +int 
nes_register_ofa_device(struct nes_ib_device *nesibdev) +{ + struct nes_vnic *nesvnic = nesibdev->nesvnic; + struct nes_device *nesdev = nesvnic->nesdev; + struct nes_adapter *nesadapter = nesdev->nesadapter; + int i, ret; + + ret = ib_register_device(&nesvnic->nesibdev->ibdev); + if (ret) { + nes_debug(NES_DBG_INIT, "\n"); + return ret; + } + + /* Get the resources allocated to this device */ + nesibdev->max_cq = (nesadapter->max_cq-NES_FIRST_QPN) / nesadapter->port_count; + nesibdev->max_mr = nesadapter->max_mr / nesadapter->port_count; + nesibdev->max_qp = (nesadapter->max_qp-NES_FIRST_QPN) / nesadapter->port_count; + nesibdev->max_pd = nesadapter->max_pd / nesadapter->port_count; + + for (i = 0; i < ARRAY_SIZE(nes_class_attributes); ++i) { + nes_debug(NES_DBG_INIT, "call class_device_create_file\n"); + ret = class_device_create_file(&nesibdev->ibdev.class_dev, nes_class_attributes[i]); + if (ret) { + while (i > 0) { + i--; + class_device_remove_file(&nesibdev->ibdev.class_dev, + nes_class_attributes[i]); + } + ib_unregister_device(&nesibdev->ibdev); + return ret; + } + } + + nesvnic->of_device_registered = 1; + + return 0; +} + + +/** + * nes_unregister_ofa_device + */ +void nes_unregister_ofa_device(struct nes_ib_device *nesibdev) +{ + struct nes_vnic *nesvnic = nesibdev->nesvnic; + int i; + + if (NULL == nesibdev) + return; + + for (i = 0; i < ARRAY_SIZE(nes_class_attributes); ++i) { + class_device_remove_file(&nesibdev->ibdev.class_dev, nes_class_attributes[i]); + } + + if (nesvnic->of_device_registered) { + nes_debug(NES_DBG_SHUTDOWN, "call ib_unregister_device()\n"); + ib_unregister_device(&nesibdev->ibdev); + } + + nesvnic->of_device_registered = 0; + +} + From ggrundstrom at neteffect.com Fri Oct 19 13:25:13 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:25:13 -0500 Subject: [ofa-general] [PATCH 12/14 v2] nes: OpenFabrics kernel verbs includes Message-ID: <200710192025.l9JKPDoH021842@neteffect.com> 
OpenFabrics kernel verbs provider structures and defines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/nes_verbs.h 2007-10-19 09:43:33.000000000 -0500 @@ -0,0 +1,165 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#ifndef NES_VERBS_H +#define NES_VERBS_H + +struct nes_device; + +#define NES_MAX_USER_DB_REGIONS 4096 +#define NES_MAX_USER_WQ_REGIONS 4096 + +struct nes_ucontext { + struct ib_ucontext ibucontext; + struct nes_device *nesdev; + unsigned long mmap_wq_offset; + unsigned long mmap_cq_offset; /* to be removed */ + int index; /* rnic index (minor) */ + unsigned long allocated_doorbells[BITS_TO_LONGS(NES_MAX_USER_DB_REGIONS)]; + u16 mmap_db_index[NES_MAX_USER_DB_REGIONS]; + u16 first_free_db; + unsigned long allocated_wqs[BITS_TO_LONGS(NES_MAX_USER_WQ_REGIONS)]; + struct nes_qp * mmap_nesqp[NES_MAX_USER_WQ_REGIONS]; + u16 first_free_wq; + struct list_head cq_reg_mem_list; +}; + +struct nes_pd { + struct ib_pd ibpd; + u16 pd_id; + atomic_t sqp_count; + u16 mmap_db_index; +}; + +struct nes_mr { + union { + struct ib_mr ibmr; + struct ib_mw ibmw; + struct ib_fmr ibfmr; + }; +#ifndef OFED_1_2 + struct ib_umem *region; +#endif + u16 pbls_used; + u8 mode; + u8 pbl_4k; +}; + +struct nes_hw_pb { + u32 pa_low; + u32 pa_high; +}; + +struct nes_vpbl { + dma_addr_t pbl_pbase; + struct nes_hw_pb *pbl_vbase; +}; + +struct nes_root_vpbl { + dma_addr_t pbl_pbase; + struct nes_hw_pb *pbl_vbase; + struct nes_vpbl *leaf_vpbl; +}; + +struct nes_fmr { + struct nes_mr nesmr; + u32 leaf_pbl_cnt; + struct nes_root_vpbl root_vpbl; + struct ib_qp* ib_qp; + int access_rights; + struct ib_fmr_attr attr; +}; + +struct nes_av; + +struct nes_cq { + struct ib_cq ibcq; + struct nes_hw_cq hw_cq; + u32 polled_completions; + u32 cq_mem_size; + spinlock_t lock; + u8 virtual_cq; + u8 pad[3]; +}; + +struct nes_wq { + spinlock_t lock; +}; + +struct iw_cm_id; +struct ietf_mpa_frame; + +struct nes_qp { + struct ib_qp ibqp; + void * allocated_buffer; + struct iw_cm_id *cm_id; + struct workqueue_struct *wq; + struct work_struct disconn_work; + struct nes_cq *nesscq; + struct nes_cq *nesrcq; + struct nes_pd *nespd; + void *cm_node; /* handle of the node this QP is associated with */ + struct 
ietf_mpa_frame *ietf_frame; + dma_addr_t ietf_frame_pbase; + wait_queue_head_t state_waitq; + unsigned long socket; + struct nes_hw_qp hwqp; + struct work_struct work; + struct work_struct ae_work; + enum ib_qp_state ibqp_state; + u32 iwarp_state; + u32 hte_index; + u32 last_aeq; + u32 qp_mem_size; + atomic_t refcount; + atomic_t close_timer_started; + u32 mmap_sq_db_index; + u32 mmap_rq_db_index; + spinlock_t lock; + struct nes_qp_context *nesqp_context; + dma_addr_t nesqp_context_pbase; + wait_queue_head_t kick_waitq; + u16 in_disconnect; + u16 private_data_len; + u8 active_conn; + u8 skip_lsmm; + u8 user_mode; + u8 hte_added; + u8 hw_iwarp_state; + u8 flush_issued; + u8 hw_tcp_state; + u8 disconn_pending; + u8 destroyed; +}; +#endif /* NES_VERBS_H */ From ggrundstrom at neteffect.com Fri Oct 19 13:27:01 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:27:01 -0500 Subject: [ofa-general] [PATCH 13/14 v2] nes: kernel build infrastructure Message-ID: <200710192027.l9JKR197021855@neteffect.com> Kconfig kernel build file. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/Kconfig 2007-10-19 09:43:32.000000000 -0500 @@ -0,0 +1,15 @@ +config INFINIBAND_NES + tristate "NetEffect RNIC Driver" + depends on PCI && INET && INFINIBAND + ---help--- + This is a low-level driver for NetEffect RDMA enabled + Network Interface Cards (RNIC). + +config INFINIBAND_NES_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_NES + default n + ---help--- + This option causes the NetEffect RNIC driver to produce debug + messages. Select this if you are developing the driver + or trying to diagnose a problem. 
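[Archive note, not part of the patch] A minimal sketch of a .config fragment that would enable the two options defined above, assuming the usual Kconfig convention that each symbol appears in .config with a CONFIG_ prefix. Building as a module (=m) is only an illustrative choice. Because INFINIBAND_NES is a tristate that depends on PCI && INET && INFINIBAND, it cannot be set higher than the weakest of those dependencies. INFINIBAND_NES_DEBUG is a bool with "default n", so it has to be switched on explicitly.

```
# Dependencies named in the "depends on" line of INFINIBAND_NES
CONFIG_PCI=y
CONFIG_INET=y
CONFIG_INFINIBAND=m
# Options added by this patch
CONFIG_INFINIBAND_NES=m
CONFIG_INFINIBAND_NES_DEBUG=y
```

With CONFIG_INFINIBAND_NES=m, the "obj-$(CONFIG_INFINIBAND_NES) += iw_nes.o" rule in the Makefile from patch 14/14 would build the driver as the iw_nes module.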
From ralph.campbell at qlogic.com Fri Oct 19 13:33:46 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 19 Oct 2007 13:33:46 -0700 Subject: [ofa-general] [PATCH] IB/ipath - Enable loopback of DR SMP responses from userspace Message-ID: <1192826026.6112.43.camel@brick.pathscale.com> This patch is in response to reviewing a patch to the core MAD processing which fixes loopback of directed route packets to/from user level MAD agents. This change enables the core code to work for ib_ipath. Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index 3d1432d..1978c34 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -1434,7 +1434,7 @@ static int process_subn(struct ib_device *ibdev, int mad_flags, * before checking for other consumers. * Just tell the caller to process it normally. */ - ret = IB_MAD_RESULT_FAILURE; + ret = IB_MAD_RESULT_SUCCESS; goto bail; default: smp->status |= IB_SMP_UNSUP_METHOD; @@ -1516,7 +1516,7 @@ static int process_perf(struct ib_device *ibdev, u8 port_num, * before checking for other consumers. * Just tell the caller to process it normally. */ - ret = IB_MAD_RESULT_FAILURE; + ret = IB_MAD_RESULT_SUCCESS; goto bail; default: pmp->status |= IB_SMP_UNSUP_METHOD; From ggrundstrom at neteffect.com Fri Oct 19 13:28:49 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:28:49 -0500 Subject: [ofa-general] [PATCH 14/14 v2] nes: kernel build infrastructure Message-ID: <200710192028.l9JKSnZG021867@neteffect.com> Makefile kernel build file. 
Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ ofa_kernel-1.3/drivers/infiniband/hw/nes/Makefile 2007-10-19 09:43:32.000000000 -0500 @@ -0,0 +1,7 @@ + +EXTRA_CFLAGS += -DNES_MINICM + +obj-$(CONFIG_INFINIBAND_NES) += iw_nes.o + +iw_nes-objs := nes.o nes_hw.o nes_nic.o nes_utils.o nes_verbs.o nes_cm.o + From ggrundstrom at neteffect.com Fri Oct 19 13:31:00 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:31:00 -0500 Subject: [ofa-general] [PATCH 1/5 v2] libnes: library init entry points Message-ID: <200710192031.l9JKV0xk021882@neteffect.com> Userspace library initialization entry points. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ libnes/src/nes.map 2007-10-19 11:06:52.000000000 -0500 @@ -0,0 +1,6 @@ +{ + global: + ibv_driver_init; + openib_driver_init; + local: *; +}; From ggrundstrom at neteffect.com Fri Oct 19 13:32:45 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:32:45 -0500 Subject: [ofa-general] [PATCH 2/5 v2] libnes: library initialization Message-ID: <200710192032.l9JKWjlb021895@neteffect.com> Main userspace library initialization routines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ libnes/src/nes_umain.c 2007-10-19 11:07:03.000000000 -0500 @@ -0,0 +1,228 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#if HAVE_CONFIG_H +#include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include + +#include "nes_umain.h" +#include "nes-abi.h" + +long int page_size; + +#include +#include +#include + +#ifndef PCI_VENDOR_ID_NETEFFECT +#define PCI_VENDOR_ID_NETEFFECT 0x1678 +#endif + +#ifndef PCI_DEVICE_ID_NETEFFECT_nes +#define PCI_DEVICE_ID_NETEFFECT_nes 0x0100 +#endif + +#define HCA(v, d, t) \ + { .vendor = PCI_VENDOR_ID_##v, \ + .device = PCI_DEVICE_ID_NETEFFECT_##d, \ + .type = NETEFFECT_##t } + +struct { + unsigned vendor; + unsigned device; + enum nes_uhca_type type; +} hca_table[] = { + HCA(NETEFFECT, nes, nes),}; + +static struct ibv_context *nes_ualloc_context(struct ibv_device *, int); +static void nes_ufree_context(struct ibv_context *); + +static struct ibv_context_ops nes_uctx_ops = { + .query_device = nes_uquery_device, + .query_port = nes_uquery_port, + .alloc_pd = nes_ualloc_pd, + .dealloc_pd = nes_ufree_pd, + .reg_mr = nes_ureg_mr, + .dereg_mr = nes_udereg_mr, + .create_cq = nes_ucreate_cq, + .poll_cq = nes_upoll_cq, + .req_notify_cq = nes_uarm_cq, + .cq_event = NULL, + .resize_cq = nes_uresize_cq, + .destroy_cq = nes_udestroy_cq, + .create_srq = NULL, + .modify_srq = NULL, + .query_srq = NULL, + .destroy_srq = NULL, + .post_srq_recv = NULL, + .create_qp = nes_ucreate_qp, + .query_qp = NULL, + .modify_qp = nes_umodify_qp, + .destroy_qp = nes_udestroy_qp, + .post_send = nes_upost_send, + .post_recv = nes_upost_recv, + .create_ah = nes_ucreate_ah, + .destroy_ah = nes_udestroy_ah, + .attach_mcast = nes_uattach_mcast, + .detach_mcast = nes_udetach_mcast, + .async_event = NULL +}; + + +/** + * nes_ualloc_context + */ +static struct ibv_context *nes_ualloc_context(struct ibv_device *ibdev, int cmd_fd) +{ + struct ibv_pd *ibv_pd; + struct nes_uvcontext *nesvctx; + struct ibv_get_context cmd; + struct nes_ualloc_ucontext_resp resp; + + page_size = sysconf(_SC_PAGESIZE); + + nesvctx = malloc(sizeof *nesvctx); + if (!nesvctx) + 
return NULL; + + nesvctx->ibv_ctx.cmd_fd = cmd_fd; + + if (ibv_cmd_get_context(&nesvctx->ibv_ctx, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof(resp))) + goto err_free; + + nesvctx->ibv_ctx.device = ibdev; + nesvctx->ibv_ctx.ops = nes_uctx_ops; + nesvctx->max_pds = resp.max_pds; + nesvctx->max_qps = resp.max_qps; + nesvctx->wq_size = resp.wq_size; + + /* Get a doorbell region for the CQs */ + ibv_pd = nes_ualloc_pd(&nesvctx->ibv_ctx); + if (!ibv_pd) + goto err_free; + ibv_pd->context = &nesvctx->ibv_ctx; + nesvctx->nesupd = to_nes_upd(ibv_pd); + + return &nesvctx->ibv_ctx; + +err_free: + fprintf(stderr, PFX "%s: Failed to allocate context for device.\n", __FUNCTION__); + free(nesvctx); + + return NULL; +} + + +/** + * nes_ufree_context + */ +static void nes_ufree_context(struct ibv_context *ibctx) +{ + struct nes_uvcontext *nesvctx = to_nes_uctx(ibctx); + nes_ufree_pd(&nesvctx->nesupd->ibv_pd); + + free(nesvctx); +} + + +static struct ibv_device_ops nes_udev_ops = { + .alloc_context = nes_ualloc_context, + .free_context = nes_ufree_context +}; + + +/** + * nes_driver_init + */ +struct ibv_device *nes_driver_init(const char *uverbs_sys_path, int abi_version) +{ + char value[16]; + struct nes_udevice *dev; + unsigned vendor, device; + int i; + + /* fprintf(stderr, PFX "called ibv_driver_init()\n"); */ + + if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", + value, sizeof(value)) < 0) { + return NULL; + } + + sscanf(value, "%i", &vendor); + + if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", + value, sizeof(value)) < 0) { + return NULL; + } + sscanf(value, "%i", &device); + + for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; ++i) + if (vendor == hca_table[i].vendor && + device == hca_table[i].device) + goto found; + + return NULL; + +found: + dev = malloc(sizeof *dev); + if (!dev) { + fprintf(stderr, PFX "Fatal: couldn't allocate device for libnes\n"); + return NULL; + } + + dev->ibv_dev.ops = nes_udev_ops; + dev->hca_type = hca_table[i].type; + 
dev->page_size = sysconf(_SC_PAGESIZE); + + return &dev->ibv_dev; +} + + +/** + * nes_register_driver + */ +static __attribute__((constructor)) void nes_register_driver(void) +{ + /* fprintf(stderr, PFX "nes_register_driver: call ibv_register_driver()\n"); */ + + ibv_register_driver("nes", nes_driver_init); +} + From ggrundstrom at neteffect.com Fri Oct 19 13:34:41 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:34:41 -0500 Subject: [ofa-general] [PATCH 3/5 v2] libnes: library structures and defines Message-ID: <200710192034.l9JKYffd021907@neteffect.com> Main userspace library structures and defines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ libnes/src/nes_umain.h 2007-10-19 11:07:09.000000000 -0500 @@ -0,0 +1,295 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef nes_umain_H +#define nes_umain_H + +#include +#include +#include + +#include +#include + +#ifndef likely +#define likely(x) __builtin_expect((x),1) +#endif +#ifndef unlikely +#define unlikely(x) __builtin_expect((x),0) +#endif + +#define HIDDEN __attribute__((visibility ("hidden"))) + +#define PFX "nes: " + +#define NES_DRV_OPT_NO_INLINE_DATA 0x00000080 + +enum nes_cqe_opcode_bits { + NES_CQE_STAG_VALID = (1<<6), + NES_CQE_ERROR = (1<<7), + NES_CQE_SQ = (1<<8), + NES_CQE_SE = (1<<9), + NES_CQE_PSH = (1<<29), + NES_CQE_FIN = (1<<30), + NES_CQE_VALID = (1<<31), +}; + +enum nes_cqe_word_idx { + NES_CQE_PAYLOAD_LENGTH_IDX = 0, + NES_CQE_COMP_COMP_CTX_LOW_IDX = 2, + NES_CQE_COMP_COMP_CTX_HIGH_IDX = 3, + NES_CQE_INV_STAG_IDX = 4, + NES_CQE_QP_ID_IDX = 5, + NES_CQE_ERROR_CODE_IDX = 6, + NES_CQE_OPCODE_IDX = 7, +}; + +enum nes_cqe_allocate_bits { + NES_CQE_ALLOC_INC_SELECT = (1<<28), + NES_CQE_ALLOC_NOTIFY_NEXT = (1<<29), + NES_CQE_ALLOC_NOTIFY_SE = (1<<30), + NES_CQE_ALLOC_RESET = (1<<31), +}; + +enum nes_iwarp_sq_wqe_word_idx { + NES_IWARP_SQ_WQE_MISC_IDX = 0, + NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX = 1, + NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX = 2, + NES_IWARP_SQ_WQE_COMP_CTX_HIGH_IDX = 3, + NES_IWARP_SQ_WQE_COMP_SCRATCH_LOW_IDX = 4, + NES_IWARP_SQ_WQE_COMP_SCRATCH_HIGH_IDX = 5, + NES_IWARP_SQ_WQE_INV_STAG_LOW_IDX = 7, + NES_IWARP_SQ_WQE_RDMA_TO_LOW_IDX = 8, + NES_IWARP_SQ_WQE_RDMA_TO_HIGH_IDX = 9, + NES_IWARP_SQ_WQE_RDMA_LENGTH_IDX = 10, + NES_IWARP_SQ_WQE_RDMA_STAG_IDX = 
11, + NES_IWARP_SQ_WQE_IMM_DATA_START_IDX = 12, + NES_IWARP_SQ_WQE_FRAG0_LOW_IDX = 16, + NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX = 17, + NES_IWARP_SQ_WQE_LENGTH0_IDX = 18, + NES_IWARP_SQ_WQE_STAG0_IDX = 19, + NES_IWARP_SQ_WQE_FRAG1_LOW_IDX = 20, + NES_IWARP_SQ_WQE_FRAG1_HIGH_IDX = 21, + NES_IWARP_SQ_WQE_LENGTH1_IDX = 22, + NES_IWARP_SQ_WQE_STAG1_IDX = 23, + NES_IWARP_SQ_WQE_FRAG2_LOW_IDX = 24, + NES_IWARP_SQ_WQE_FRAG2_HIGH_IDX = 25, + NES_IWARP_SQ_WQE_LENGTH2_IDX = 26, + NES_IWARP_SQ_WQE_STAG2_IDX = 27, + NES_IWARP_SQ_WQE_FRAG3_LOW_IDX = 28, + NES_IWARP_SQ_WQE_FRAG3_HIGH_IDX = 29, + NES_IWARP_SQ_WQE_LENGTH3_IDX = 30, + NES_IWARP_SQ_WQE_STAG3_IDX = 31, +}; + +enum nes_iwarp_rq_wqe_word_idx { + NES_IWARP_RQ_WQE_TOTAL_PAYLOAD_IDX = 1, + NES_IWARP_RQ_WQE_COMP_CTX_LOW_IDX = 2, + NES_IWARP_RQ_WQE_COMP_CTX_HIGH_IDX = 3, + NES_IWARP_RQ_WQE_COMP_SCRATCH_LOW_IDX = 4, + NES_IWARP_RQ_WQE_COMP_SCRATCH_HIGH_IDX = 5, + NES_IWARP_RQ_WQE_FRAG0_LOW_IDX = 8, + NES_IWARP_RQ_WQE_FRAG0_HIGH_IDX = 9, + NES_IWARP_RQ_WQE_LENGTH0_IDX = 10, + NES_IWARP_RQ_WQE_STAG0_IDX = 11, + NES_IWARP_RQ_WQE_FRAG1_LOW_IDX = 12, + NES_IWARP_RQ_WQE_FRAG1_HIGH_IDX = 13, + NES_IWARP_RQ_WQE_LENGTH1_IDX = 14, + NES_IWARP_RQ_WQE_STAG1_IDX = 15, + NES_IWARP_RQ_WQE_FRAG2_LOW_IDX = 16, + NES_IWARP_RQ_WQE_FRAG2_HIGH_IDX = 17, + NES_IWARP_RQ_WQE_LENGTH2_IDX = 18, + NES_IWARP_RQ_WQE_STAG2_IDX = 19, + NES_IWARP_RQ_WQE_FRAG3_LOW_IDX = 20, + NES_IWARP_RQ_WQE_FRAG3_HIGH_IDX = 21, + NES_IWARP_RQ_WQE_LENGTH3_IDX = 22, + NES_IWARP_RQ_WQE_STAG3_IDX = 23, +}; + +enum nes_iwarp_sq_opcodes { + NES_IWARP_SQ_WQE_STREAMING = (1<<23), + NES_IWARP_SQ_WQE_IMM_DATA = (1<<28), + NES_IWARP_SQ_WQE_READ_FENCE = (1<<29), + NES_IWARP_SQ_WQE_LOCAL_FENCE = (1<<30), + NES_IWARP_SQ_WQE_SIGNALED_COMPL = (1<<31), +}; + +enum nes_iwarp_sq_wqe_bits { + NES_IWARP_SQ_OP_RDMAW = 0, + NES_IWARP_SQ_OP_RDMAR = 1, + NES_IWARP_SQ_OP_SEND = 3, + NES_IWARP_SQ_OP_SENDINV = 4, + NES_IWARP_SQ_OP_SENDSE = 5, + NES_IWARP_SQ_OP_SENDSEINV = 6, + NES_IWARP_SQ_OP_BIND = 8, + 
NES_IWARP_SQ_OP_FAST_REG = 9, + NES_IWARP_SQ_OP_LOCINV = 10, + NES_IWARP_SQ_OP_RDMAR_LOCINV = 11, + NES_IWARP_SQ_OP_NOP = 12, +}; + +struct nes_hw_qp_wqe { + uint32_t wqe_words[32]; +}; + +struct nes_hw_cqe { + uint32_t cqe_words[8]; +}; + +enum nes_uhca_type { + NETEFFECT_nes +}; + +struct nes_user_doorbell { + uint32_t wqe_alloc; + uint32_t reserved[3]; + uint32_t cqe_alloc; +}; + +struct nes_udevice { + struct ibv_device ibv_dev; + enum nes_uhca_type hca_type; + int page_size; +}; + +struct nes_upd { + struct ibv_pd ibv_pd; + struct nes_user_doorbell volatile *udoorbell; + uint32_t pd_id; + uint32_t db_index; +}; + +struct nes_uvcontext { + struct ibv_context ibv_ctx; + struct nes_upd *nesupd; + uint32_t max_pds; /* maximum pds allowed for this user process */ + uint32_t max_qps; /* maximum qps allowed for this user process */ + uint32_t wq_size; /* size of the WQs (sq+rq) allocated to the mmaped area */ +}; + +struct nes_ucq { + struct ibv_cq ibv_cq; + struct nes_hw_cqe volatile *cqes; + struct ibv_mr mr; + pthread_spinlock_t lock; + uint32_t cq_id; + uint16_t size; + uint16_t head; + uint16_t polled_completions; +}; + +struct nes_uqp { + struct ibv_qp ibv_qp; + struct nes_hw_qp_wqe volatile *sq_vbase; + struct nes_hw_qp_wqe volatile *rq_vbase; + uint32_t qp_id; + uint32_t nes_drv_opt; + pthread_spinlock_t lock; + uint16_t sq_db_index; + uint16_t sq_head; + uint16_t sq_tail; + uint16_t sq_size; + uint16_t rq_db_index; + uint16_t rq_head; + uint16_t rq_tail; + uint16_t rq_size; +}; + +#define to_nes_uxxx(xxx, type) \ + ((struct nes_u##type *) \ + ((void *) ib##xxx - offsetof(struct nes_u##type, ibv_##xxx))) + +static inline struct nes_udevice *to_nes_udev(struct ibv_device *ibdev) +{ + return to_nes_uxxx(dev, device); +} + +static inline struct nes_uvcontext *to_nes_uctx(struct ibv_context *ibctx) +{ + return to_nes_uxxx(ctx, vcontext); +} + +static inline struct nes_upd *to_nes_upd(struct ibv_pd *ibpd) +{ + return to_nes_uxxx(pd, pd); +} + +static inline struct 
nes_ucq *to_nes_ucq(struct ibv_cq *ibcq) +{ + return to_nes_uxxx(cq, cq); +} + +static inline struct nes_uqp *to_nes_uqp(struct ibv_qp *ibqp) +{ + return to_nes_uxxx(qp, qp); +} + + +/* nes_umain.c */ +struct ibv_device *ibv_driver_init(const char *, int); + +/* nes_uverbs.c */ +int nes_uquery_device(struct ibv_context *, struct ibv_device_attr *); +int nes_uquery_port(struct ibv_context *, uint8_t, struct ibv_port_attr *); +struct ibv_pd *nes_ualloc_pd(struct ibv_context *); +int nes_ufree_pd(struct ibv_pd *); +struct ibv_mr *nes_ureg_mr(struct ibv_pd *, void *, size_t, enum ibv_access_flags); +int nes_udereg_mr(struct ibv_mr *); +struct ibv_cq *nes_ucreate_cq(struct ibv_context *, int, struct ibv_comp_channel *, int); +int nes_uresize_cq(struct ibv_cq *, int); +int nes_udestroy_cq(struct ibv_cq *); +int nes_upoll_cq(struct ibv_cq *, int, struct ibv_wc *); +int nes_uarm_cq(struct ibv_cq *, int); +struct ibv_srq *nes_ucreate_srq(struct ibv_pd *, struct ibv_srq_init_attr *); +int nes_umodify_srq(struct ibv_srq *, struct ibv_srq_attr *, enum ibv_srq_attr_mask); +int nes_udestroy_srq(struct ibv_srq *); +int nes_upost_srq_recv(struct ibv_srq *, struct ibv_recv_wr *, struct ibv_recv_wr **); +struct ibv_qp *nes_ucreate_qp(struct ibv_pd *, struct ibv_qp_init_attr *); +int nes_umodify_qp(struct ibv_qp *, struct ibv_qp_attr *, enum ibv_qp_attr_mask); +int nes_udestroy_qp(struct ibv_qp *); +int nes_upost_send(struct ibv_qp *, struct ibv_send_wr *, struct ibv_send_wr **); +int nes_upost_recv(struct ibv_qp *, struct ibv_recv_wr *, struct ibv_recv_wr **); +struct ibv_ah *nes_ucreate_ah(struct ibv_pd *, struct ibv_ah_attr *); +int nes_udestroy_ah(struct ibv_ah *); +int nes_uattach_mcast(struct ibv_qp *, union ibv_gid *, uint16_t); +int nes_udetach_mcast(struct ibv_qp *, union ibv_gid *, uint16_t); + +#if __BYTE_ORDER == __LITTLE_ENDIAN +static inline uint32_t cpu_to_le32(uint32_t x) { return x; } +static inline uint32_t le32_to_cpu(uint32_t x) { return x; } +#else +static inline 
uint32_t cpu_to_le32(uint32_t x) { return (((x&0xFF000000)>>24) | ((x&0x00FF0000)>>8) | ((x&0x0000FF00)<<8) | ((x&0x000000FF)<<24)); } +static inline uint32_t le32_to_cpu(uint32_t x) { return (((x&0xFF000000)>>24) | ((x&0x00FF0000)>>8) | ((x&0x0000FF00)<<8) | ((x&0x000000FF)<<24)); } +#endif + +#endif /* nes_umain_H */ From ggrundstrom at neteffect.com Fri Oct 19 13:36:24 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:36:24 -0500 Subject: [ofa-general] [PATCH 4/5 v2] libnes: OpenFabrics userspace verbs Message-ID: <200710192036.l9JKaOTo021921@neteffect.com> OpenFabrics userspace verbs provider routines. Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ libnes/src/nes_uverbs.c 2007-10-19 11:07:15.000000000 -0500 @@ -0,0 +1,918 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#if HAVE_CONFIG_H +#include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "nes_umain.h" +#include "nes-abi.h" + +extern long int page_size; + + +/** + * nes_uquery_device + */ +int nes_uquery_device(struct ibv_context *context, struct ibv_device_attr *attr) +{ + struct ibv_query_device cmd; + uint64_t reserved; + int ret; + + ret = ibv_cmd_query_device(context, attr, &reserved, &cmd, sizeof cmd); + return ret; +} + + +/** + * nes_uquery_port + */ +int nes_uquery_port(struct ibv_context *context, uint8_t port, + struct ibv_port_attr *attr) +{ + struct ibv_query_port cmd; + + return ibv_cmd_query_port(context, port, attr, &cmd, sizeof cmd); +} + + +/** + * nes_ualloc_pd + */ +struct ibv_pd *nes_ualloc_pd(struct ibv_context *context) +{ + struct ibv_alloc_pd cmd; + struct nes_ualloc_pd_resp resp; + struct nes_upd *nesupd; + + nesupd = malloc(sizeof *nesupd); + if (!nesupd) + return NULL; + + if (ibv_cmd_alloc_pd(context, &nesupd->ibv_pd, &cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp)) { + free(nesupd); + return NULL; + } + nesupd->pd_id = resp.pd_id; + nesupd->db_index = resp.db_index; + + nesupd->udoorbell = mmap(NULL, page_size, PROT_WRITE | PROT_READ, MAP_SHARED, + context->cmd_fd, nesupd->db_index * page_size); + + if (((void *)-1) == nesupd->udoorbell) { + free(nesupd); + return NULL; + } + + return &nesupd->ibv_pd; +} + + +/** + * nes_ufree_pd + */ +int nes_ufree_pd(struct ibv_pd *pd) +{ + int ret; + struct nes_upd *nesupd; + + nesupd = to_nes_upd(pd); + + ret = ibv_cmd_dealloc_pd(pd); + if (ret) + return ret; + + munmap((void *)nesupd->udoorbell, page_size); + free(nesupd); + + return 0; 
+} + + +/** + * nes_ureg_mr + */ +struct ibv_mr *nes_ureg_mr(struct ibv_pd *pd, void *addr, + size_t length, enum ibv_access_flags access) +{ + struct ibv_mr *mr; + struct nes_ureg_mr cmd; +#ifdef IBV_CMD_REG_MR_HAS_RESP_PARAMS + struct ibv_reg_mr_resp resp; +#endif + + mr = malloc(sizeof *mr); + if (!mr) + return NULL; + + cmd.reg_type = NES_UMEMREG_TYPE_MEM; +#ifdef IBV_CMD_REG_MR_HAS_RESP_PARAMS + if (ibv_cmd_reg_mr(pd, addr, length, (uintptr_t) addr, + access, mr, &cmd.ibv_cmd, sizeof cmd, + &resp, sizeof resp)) { +#else + if (ibv_cmd_reg_mr(pd, addr, length, (uintptr_t) addr, + access, mr, &cmd.ibv_cmd, sizeof cmd)) { +#endif + free(mr); + + return NULL; + } + + return mr; +} + + +/** + * nes_udereg_mr + */ +int nes_udereg_mr(struct ibv_mr *mr) +{ + int ret; + + ret = ibv_cmd_dereg_mr(mr); + if (ret) + return ret; + + free(mr); + return 0; +} + + +/** + * nes_ucreate_cq + */ +struct ibv_cq *nes_ucreate_cq(struct ibv_context *context, int cqe, + struct ibv_comp_channel *channel, int comp_vector) +{ + struct nes_ucq *nesucq; + struct nes_ureg_mr reg_mr_cmd; +#ifdef IBV_CMD_REG_MR_HAS_RESP_PARAMS + struct ibv_reg_mr_resp reg_mr_resp; +#endif + struct nes_ucreate_cq cmd; + struct nes_ucreate_cq_resp resp; + int ret; + struct nes_uvcontext *nesvctx = to_nes_uctx(context); + + nesucq = malloc(sizeof *nesucq); + if (!nesucq) { + return NULL; + } + memset(nesucq, 0, sizeof(*nesucq)); + + if (pthread_spin_init(&nesucq->lock, PTHREAD_PROCESS_PRIVATE)) { + free(nesucq); + return NULL; + } + + if (cqe < 4) /* a reasonable minimum */ + cqe = 4; + nesucq->size = cqe + 1; + + nesucq->cqes = memalign(page_size, nesucq->size*sizeof(struct nes_hw_cqe)); + if (!nesucq->cqes) + goto err; + + /* Register the memory for the CQ */ + reg_mr_cmd.reg_type = NES_UMEMREG_TYPE_CQ; + +#ifdef IBV_CMD_REG_MR_HAS_RESP_PARAMS + ret = ibv_cmd_reg_mr(&nesvctx->nesupd->ibv_pd, (void *)nesucq->cqes, + (nesucq->size*sizeof(struct nes_hw_cqe)), + (uintptr_t)nesucq->cqes, IBV_ACCESS_LOCAL_WRITE, 
&nesucq->mr, + &reg_mr_cmd.ibv_cmd, sizeof reg_mr_cmd, + &reg_mr_resp, sizeof reg_mr_resp); +#else + ret = ibv_cmd_reg_mr(&nesvctx->nesupd->ibv_pd, (void *)nesucq->cqes, + (nesucq->size*sizeof(struct nes_hw_cqe)), + (uintptr_t)nesucq->cqes, IBV_ACCESS_LOCAL_WRITE, &nesucq->mr, + &reg_mr_cmd.ibv_cmd, sizeof reg_mr_cmd); +#endif + if (ret) { + /* fprintf(stderr, "ibv_cmd_reg_mr failed (ret = %d).\n", ret); */ + free((struct nes_hw_cqe *)nesucq->cqes); + goto err; + } + + /* Create the CQ */ + memset(&cmd, 0, sizeof(cmd)); + cmd.user_cq_buffer = (__u64)((uintptr_t)nesucq->cqes); + + ret = ibv_cmd_create_cq(context, nesucq->size-1, channel, comp_vector, + &nesucq->ibv_cq, &cmd.ibv_cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) + goto err; + + nesucq->cq_id = (uint16_t)resp.cq_id; + + /* Zero out the CQ */ + memset((struct nes_hw_cqe *)nesucq->cqes, 0, nesucq->size*sizeof(struct nes_hw_cqe)); + + return &nesucq->ibv_cq; + +err: + /* fprintf(stderr, PFX "%s: Error Creating CQ.\n", __FUNCTION__); */ + pthread_spin_destroy(&nesucq->lock); + free(nesucq); + + return NULL; +} + + +/** + * nes_uresize_cq + */ +int nes_uresize_cq(struct ibv_cq *cq, int cqe) +{ + /* fprintf(stderr, PFX "%s\n", __FUNCTION__); */ + return -ENOSYS; +} + + +/** + * nes_udestroy_cq + */ +int nes_udestroy_cq(struct ibv_cq *cq) +{ + struct nes_ucq *nesucq = to_nes_ucq(cq); + int ret; + + ret = ibv_cmd_destroy_cq(cq); + if (ret) + return ret; + + /* Free CQ the memory */ + free((struct nes_hw_cqe *)nesucq->cqes); + pthread_spin_destroy(&nesucq->lock); + free(nesucq); + + return 0; +} + + +/** + * nes_upoll_cq + */ +int nes_upoll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *entry) +{ + uint64_t wrid; + struct nes_ucq *nesucq; + struct nes_uvcontext *nesvctx = NULL; + struct nes_uqp *nesuqp; + int cqe_count=0; + uint32_t head; + uint32_t wq_tail; + uint32_t cq_size; + uint32_t wqe_index; + struct nes_hw_cqe cqe; + uint32_t tmp; + unsigned long u64temp; + + nesucq = to_nes_ucq(cq); + nesvctx 
= to_nes_uctx(cq->context); + + pthread_spin_lock(&nesucq->lock); + + head = nesucq->head; + cq_size = nesucq->size; + + while (cqe_count<num_entries) { + if (le32_to_cpu(nesucq->cqes[head].cqe_words[NES_CQE_OPCODE_IDX]) & NES_CQE_VALID) { + cqe = (volatile struct nes_hw_cqe)nesucq->cqes[head]; + + memset(entry, 0, sizeof *entry); + /* this is for both the cqe copy and the zeroing of entry */ + asm __volatile__("": : :"memory"); + + nesucq->cqes[head].cqe_words[NES_CQE_OPCODE_IDX] = 0; + + /* parse CQE, get completion context from WQE (either rq or sq) */ + wqe_index = le32_to_cpu(cqe.cqe_words[NES_CQE_COMP_COMP_CTX_LOW_IDX]) & 511; + u64temp = ((uint64_t) (le32_to_cpu(cqe.cqe_words[NES_CQE_COMP_COMP_CTX_LOW_IDX]))) | + (((uint64_t) (le32_to_cpu(cqe.cqe_words[NES_CQE_COMP_COMP_CTX_HIGH_IDX])))<<32); + nesuqp = *((struct nes_uqp **)&u64temp); + nesuqp = (struct nes_uqp *)((uintptr_t)nesuqp & (~1023)); + if (0 == le32_to_cpu(cqe.cqe_words[NES_CQE_ERROR_CODE_IDX])) { + entry->status = IBV_WC_SUCCESS; + } else { + /* TODO: other errors? */ + entry->status = IBV_WC_WR_FLUSH_ERR; + } + entry->qp_num = nesuqp->qp_id; + entry->src_qp = nesuqp->qp_id; + + if (le32_to_cpu(cqe.cqe_words[NES_CQE_OPCODE_IDX]) & NES_CQE_SQ) { + /* Working on a SQ Completion*/ + wq_tail = wqe_index; + nesuqp->sq_tail = (wqe_index+1)&(nesuqp->sq_size - 1); + wrid = ((uint64_t) le32_to_cpu(nesuqp->sq_vbase[wq_tail].wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_LOW_IDX])) | + (((uint64_t) le32_to_cpu(nesuqp->sq_vbase[wq_tail].wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_HIGH_IDX]))<<32); + entry->byte_len = le32_to_cpu(nesuqp->sq_vbase[wq_tail].wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX]); + + switch (le32_to_cpu(nesuqp->sq_vbase[wq_tail].
+ wqe_words[NES_IWARP_SQ_WQE_MISC_IDX]) & 0x3f) { + case NES_IWARP_SQ_OP_RDMAW: + /* fprintf(stderr, PFX "%s: Operation = RDMA WRITE.\n", + __FUNCTION__ ); */ + entry->opcode = IBV_WC_RDMA_WRITE; + break; + case NES_IWARP_SQ_OP_RDMAR: + /* fprintf(stderr, PFX "%s: Operation = RDMA READ.\n", + __FUNCTION__ ); */ + entry->opcode = IBV_WC_RDMA_READ; + entry->byte_len = le32_to_cpu(nesuqp->sq_vbase[wq_tail]. + wqe_words[NES_IWARP_SQ_WQE_RDMA_LENGTH_IDX]); + break; + case NES_IWARP_SQ_OP_SENDINV: + case NES_IWARP_SQ_OP_SENDSEINV: + case NES_IWARP_SQ_OP_SEND: + case NES_IWARP_SQ_OP_SENDSE: + /* fprintf(stderr, PFX "%s: Operation = Send.\n", + __FUNCTION__ ); */ + entry->opcode = IBV_WC_SEND; + break; + } + } else { + /* Working on a RQ Completion*/ + wq_tail = wqe_index; + nesuqp->rq_tail = (wqe_index+1)&(nesuqp->rq_size - 1); + entry->byte_len = le32_to_cpu(cqe.cqe_words[NES_CQE_PAYLOAD_LENGTH_IDX]); + wrid = ((uint64_t) le32_to_cpu(nesuqp->rq_vbase[wq_tail].wqe_words[NES_IWARP_RQ_WQE_COMP_SCRATCH_LOW_IDX])) | + (((uint64_t) le32_to_cpu(nesuqp->rq_vbase[wq_tail].wqe_words[NES_IWARP_RQ_WQE_COMP_SCRATCH_HIGH_IDX]))<<32); + entry->opcode = IBV_WC_RECV; + } + entry->wr_id = wrid; + + if (++head >= cq_size) + head = 0; + cqe_count++; + nesucq->polled_completions++; + + /* TODO: find a better number...if there is one */ + if ((nesucq->polled_completions > (cq_size/2)) || + (nesucq->polled_completions == 255)) { + if (NULL == nesvctx) + nesvctx = to_nes_uctx(cq->context); + nesvctx->nesupd->udoorbell->cqe_alloc = cpu_to_le32(nesucq->cq_id | + (nesucq->polled_completions << 16)); + tmp = nesvctx->nesupd->udoorbell->cqe_alloc; + nesucq->polled_completions = 0; + } + entry++; + } else + break; + } + + if (nesucq->polled_completions) { + if (NULL == nesvctx) + nesvctx = to_nes_uctx(cq->context); + nesvctx->nesupd->udoorbell->cqe_alloc = cpu_to_le32(nesucq->cq_id | + (nesucq->polled_completions << 16)); + tmp = nesvctx->nesupd->udoorbell->cqe_alloc; + nesucq->polled_completions = 
0; + } + nesucq->head = head; + + pthread_spin_unlock(&nesucq->lock); + return cqe_count; +} + + +/** + * nes_uarm_cq + */ +int nes_uarm_cq(struct ibv_cq *cq, int solicited) +{ + struct nes_ucq *nesucq; + struct nes_uvcontext *nesvctx; + uint32_t cq_arm; + uint32_t tmp; + + nesucq = to_nes_ucq(cq); + nesvctx = to_nes_uctx(cq->context); + + cq_arm = nesucq->cq_id; + + if (solicited) + cq_arm |= NES_CQE_ALLOC_NOTIFY_SE; + else + cq_arm |= NES_CQE_ALLOC_NOTIFY_NEXT; + + nesvctx->nesupd->udoorbell->cqe_alloc = cpu_to_le32(cq_arm); + tmp = nesvctx->nesupd->udoorbell->cqe_alloc; + + return 0; +} + + +/** + * nes_ucreate_srq + */ +struct ibv_srq *nes_ucreate_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *attr) +{ + /* fprintf(stderr, PFX "%s\n", __FUNCTION__); */ + return (void *)-ENOSYS; +} + + +/** + * nes_umodify_srq + */ +int nes_umodify_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr, + enum ibv_srq_attr_mask attr_mask) +{ + /* fprintf(stderr, PFX "%s\n", __FUNCTION__); */ + return -ENOSYS; +} + + +/** + * nes_udestroy_srq + */ +int nes_udestroy_srq(struct ibv_srq *srq) +{ + /* fprintf(stderr, PFX "%s\n", __FUNCTION__); */ + return -ENOSYS; +} + + +/** + * nes_upost_srq_recv + */ +int nes_upost_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr) +{ + /* fprintf(stderr, PFX "%s\n", __FUNCTION__); */ + return -ENOSYS; +} + + +/** + * nes_ucreate_qp + */ +struct ibv_qp *nes_ucreate_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) +{ + struct nes_uqp *nesuqp; + struct nes_uvcontext *nesvctx = to_nes_uctx(pd->context); + struct nes_ucreate_qp cmd; + struct nes_ucreate_qp_resp resp; + unsigned long mmap_offset; + int ret; + + /* Sanity check QP size before proceeding */ + if (attr->cap.max_send_wr > 510 || + attr->cap.max_recv_wr > 510 || + attr->cap.max_send_sge > 4 || + attr->cap.max_recv_sge > 4 ) + return NULL; + + nesuqp = memalign(1024, sizeof(*nesuqp)); + if (!nesuqp) + return NULL; + memset(nesuqp, 0, sizeof(*nesuqp)); + 
+ if (pthread_spin_init(&nesuqp->lock, PTHREAD_PROCESS_PRIVATE)) { + free(nesuqp); + return NULL; + } + + ret = ibv_cmd_create_qp(pd, &nesuqp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, + &resp.ibv_resp, sizeof resp); + if (ret) { + pthread_spin_destroy(&nesuqp->lock); + free(nesuqp); + return NULL; + } + + nesuqp->qp_id = resp.qp_id; + nesuqp->sq_db_index = resp.mmap_sq_db_index; + nesuqp->rq_db_index = resp.mmap_rq_db_index; + nesuqp->sq_size = resp.actual_sq_size; + nesuqp->rq_size = resp.actual_rq_size; + nesuqp->nes_drv_opt = resp.nes_drv_opt; + /* Account for LSMM, in theory, could get overrun if app preposts to SQ */ + nesuqp->sq_head = 1; + nesuqp->sq_tail = 1; + + /* Map the SQ/RQ buffers */ + mmap_offset = ((nesvctx->max_pds*4096) + page_size-1) & (~(page_size-1)); + mmap_offset += (((sizeof(struct nes_hw_qp_wqe) * nesvctx->wq_size) + page_size-1) & + (~(page_size-1)))*nesuqp->sq_db_index; + + nesuqp->sq_vbase = mmap(NULL, (nesuqp->sq_size+nesuqp->rq_size) * + sizeof(struct nes_hw_qp_wqe), PROT_WRITE | PROT_READ, + MAP_SHARED, pd->context->cmd_fd, mmap_offset); + + if (((void *)-1) == nesuqp->sq_vbase) { + pthread_spin_destroy(&nesuqp->lock); + free(nesuqp); + return NULL; + } + nesuqp->rq_vbase = (struct nes_hw_qp_wqe *)(((char *)nesuqp->sq_vbase) + + (nesuqp->sq_size*sizeof(struct nes_hw_qp_wqe))); + *((unsigned int *)nesuqp->sq_vbase) = 0; + + return &nesuqp->ibv_qp; +} + + +/** + * nes_uquery_qp + */ +int nes_uquery_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask, struct ibv_qp_init_attr *init_attr) +{ + struct ibv_query_qp cmd; + + /* fprintf(stderr, PFX "nes_uquery_qp: calling ibv_cmd_query_qp\n"); */ + + return ibv_cmd_query_qp(qp, attr, attr_mask, init_attr, &cmd, sizeof(cmd)); +} + + +/** + * nes_umodify_qp + */ +int nes_umodify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + enum ibv_qp_attr_mask attr_mask) +{ + struct ibv_modify_qp cmd; + + /* fprintf(stderr, PFX "%s, QP State = %u, attr_mask = 0x%X.\n", 
__FUNCTION__, + (unsigned int)attr->qp_state, (unsigned int)attr_mask ); */ + return ibv_cmd_modify_qp(qp, attr, attr_mask, &cmd, sizeof cmd); +} + + +/** + * nes_udestroy_qp + */ +int nes_udestroy_qp(struct ibv_qp *qp) +{ + struct nes_uqp *nesuqp = to_nes_uqp(qp); + int ret; + + munmap((void *) nesuqp->sq_vbase, (nesuqp->sq_size+nesuqp->rq_size) * + sizeof(struct nes_hw_qp_wqe)); + + ret = ibv_cmd_destroy_qp(qp); + if (ret) + return ret; + + pthread_spin_destroy(&nesuqp->lock); + free(nesuqp); + + return 0; +} + + +/** + * nes_upost_send + */ +int nes_upost_send(struct ibv_qp *ib_qp, struct ibv_send_wr *ib_wr, + struct ibv_send_wr **bad_wr) +{ + uint64_t u64temp; + struct nes_uqp *nesuqp = to_nes_uqp(ib_qp); + struct nes_upd *nesupd = to_nes_upd(ib_qp->pd); + struct nes_hw_qp_wqe volatile *wqe; + uint32_t head = nesuqp->sq_head; + uint32_t qsize = nesuqp->sq_size; + uint32_t counter; + uint32_t err = 0; + uint32_t wqe_count = 0; + uint32_t outstanding_wqes; + uint32_t total_payload_length = 0; + int sge_index; + + pthread_spin_lock(&nesuqp->lock); + + while (ib_wr) { + /* Check for SQ overflow */ + outstanding_wqes = head + (2 * qsize) - nesuqp->sq_tail; + outstanding_wqes &= qsize - 1; + if (unlikely(outstanding_wqes == (qsize - 1))) { + err = -EINVAL; + break; + } + if (unlikely(ib_wr->num_sge > 4)) { + err = -EINVAL; + break; + } + + wqe = (struct nes_hw_qp_wqe *)&nesuqp->sq_vbase[head]; + /* fprintf(stderr, PFX "%s: QP%u: processing sq wqe at %p, head = %u.\n", + __FUNCTION__, nesuqp->qp_id, wqe, head); */ + u64temp = (uint64_t) ib_wr->wr_id; + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_LOW_IDX] = cpu_to_le32((uint32_t)u64temp); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_SCRATCH_HIGH_IDX] = cpu_to_le32((uint32_t)(u64temp>>32)); + u64temp = (uint64_t)((uintptr_t)nesuqp); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] = cpu_to_le32((uint32_t)u64temp); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_HIGH_IDX] = cpu_to_le32((uint32_t)(u64temp>>32)); + asm 
__volatile__("": : :"memory"); + wqe->wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX] |= cpu_to_le32(head); + + switch (ib_wr->opcode) { + case IBV_WR_SEND: + case IBV_WR_SEND_WITH_IMM: + /* fprintf(stderr, PFX "%s: QP%u: processing sq wqe%u. Opcode = %s\n", + __FUNCTION__, nesuqp->qp_id, head, "Send"); */ + if (ib_wr->send_flags & IBV_SEND_SOLICITED) { + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] = cpu_to_le32(NES_IWARP_SQ_OP_SENDSE); + } else { + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] = cpu_to_le32(NES_IWARP_SQ_OP_SEND); + } + + if (ib_wr->send_flags & IBV_SEND_FENCE) { + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] |= cpu_to_le32(NES_IWARP_SQ_WQE_LOCAL_FENCE); + } + + /* if (ib_wr->send_flags & IBV_SEND_INLINE) { + fprintf(stderr, PFX "%s: Send SEND_INLINE, length=%d\n", + __FUNCTION__, ib_wr->sg_list[0].length); + } */ + if ((ib_wr->send_flags & IBV_SEND_INLINE) && (ib_wr->sg_list[0].length <= 64) && + (0 == (nesuqp->nes_drv_opt & NES_DRV_OPT_NO_INLINE_DATA)) && + (ib_wr->num_sge == 1)) { + memcpy((void *)&wqe->wqe_words[NES_IWARP_SQ_WQE_IMM_DATA_START_IDX], + (void *)ib_wr->sg_list[0].addr, ib_wr->sg_list[0].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = cpu_to_le32(ib_wr->sg_list[0].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] |= cpu_to_le32(NES_IWARP_SQ_WQE_IMM_DATA); + } else { + total_payload_length = 0; + for (sge_index=0; sge_index < ib_wr->num_sge; sge_index++) { + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_LOW_IDX+(sge_index*4)] = + cpu_to_le32((uint32_t)ib_wr->sg_list[sge_index].addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX+(sge_index*4)] = + cpu_to_le32((uint32_t)(ib_wr->sg_list[sge_index].addr>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_LENGTH0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_STAG0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].lkey); + total_payload_length += ib_wr->sg_list[sge_index].length; + } + 
wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = + cpu_to_le32(total_payload_length); + } + + break; + case IBV_WR_RDMA_WRITE: + case IBV_WR_RDMA_WRITE_WITH_IMM: + /* fprintf(stderr, PFX "%s:QP%u: processing sq wqe%u. Opcode = %s\n", + __FUNCTION__, nesuqp->qp_id, head, "Write"); */ + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] = cpu_to_le32(NES_IWARP_SQ_OP_RDMAW); + + if (ib_wr->send_flags & IBV_SEND_FENCE) { + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] |= cpu_to_le32(NES_IWARP_SQ_WQE_LOCAL_FENCE); + } + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_STAG_IDX] = cpu_to_le32(ib_wr->wr.rdma.rkey); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_TO_LOW_IDX] = cpu_to_le32( + (uint32_t)ib_wr->wr.rdma.remote_addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_TO_HIGH_IDX] = cpu_to_le32( + (uint32_t)(ib_wr->wr.rdma.remote_addr>>32)); + + /* if (ib_wr->send_flags & IBV_SEND_INLINE) { + fprintf(stderr, PFX "%s: Write SEND_INLINE, length=%d\n", + __FUNCTION__, ib_wr->sg_list[0].length); + } */ + if ((ib_wr->send_flags & IBV_SEND_INLINE) && (ib_wr->sg_list[0].length <= 64) && + (0 == (nesuqp->nes_drv_opt & NES_DRV_OPT_NO_INLINE_DATA)) && + (ib_wr->num_sge == 1)) { + memcpy((void *)&wqe->wqe_words[NES_IWARP_SQ_WQE_IMM_DATA_START_IDX], + (void *)ib_wr->sg_list[0].addr, ib_wr->sg_list[0].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = cpu_to_le32(ib_wr->sg_list[0].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] |= cpu_to_le32(NES_IWARP_SQ_WQE_IMM_DATA); + } else { + total_payload_length = 0; + for (sge_index=0; sge_index < ib_wr->num_sge; sge_index++) { + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_LOW_IDX+(sge_index*4)] = cpu_to_le32( + (uint32_t)ib_wr->sg_list[sge_index].addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX+(sge_index*4)] = cpu_to_le32( + (uint32_t)(ib_wr->sg_list[sge_index].addr>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_LENGTH0_IDX+(sge_index*4)] = cpu_to_le32( + ib_wr->sg_list[sge_index].length); + wqe->wqe_words[NES_IWARP_SQ_WQE_STAG0_IDX+(sge_index*4)] = 
cpu_to_le32( + ib_wr->sg_list[sge_index].lkey); + total_payload_length += ib_wr->sg_list[sge_index].length; + } + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] = cpu_to_le32(total_payload_length); + } + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_LENGTH_IDX] = + wqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX]; + break; + case IBV_WR_RDMA_READ: + /* fprintf(stderr, PFX "%s:QP%u:processing sq wqe%u. Opcode = %s\n", + __FUNCTION__, nesuqp->qp_id, head, "Read"); */ + /* IWarp only supports 1 sge for RDMA reads */ + if (ib_wr->num_sge > 1) { + err = -EINVAL; + break; + } + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] = cpu_to_le32(NES_IWARP_SQ_OP_RDMAR); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_TO_LOW_IDX] = cpu_to_le32((uint32_t)ib_wr->wr.rdma.remote_addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_TO_HIGH_IDX] = cpu_to_le32((uint32_t)(ib_wr->wr.rdma.remote_addr>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_STAG_IDX] = cpu_to_le32(ib_wr->wr.rdma.rkey); + wqe->wqe_words[NES_IWARP_SQ_WQE_RDMA_LENGTH_IDX] = cpu_to_le32(ib_wr->sg_list->length); + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_LOW_IDX] = cpu_to_le32((uint32_t)ib_wr->sg_list->addr); + wqe->wqe_words[NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX] = cpu_to_le32((uint32_t)(ib_wr->sg_list->addr>>32)); + wqe->wqe_words[NES_IWARP_SQ_WQE_STAG0_IDX] = cpu_to_le32(ib_wr->sg_list->lkey); + break; + default: + /* error */ + err = -EINVAL; + break; + } + + if (ib_wr->send_flags & IBV_SEND_SIGNALED) { + /* fprintf(stderr, PFX "%s:sq wqe%u is signalled\n", __FUNCTION__, head); */ + wqe->wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] |= cpu_to_le32(NES_IWARP_SQ_WQE_SIGNALED_COMPL); + } + ib_wr = ib_wr->next; + head++; + wqe_count++; + if (head >= qsize) + head = 0; + } + + nesuqp->sq_head = head; + asm __volatile__("": : :"memory"); + while (wqe_count) { + counter = (wqe_count<(uint32_t)255) ? 
wqe_count : 255; + wqe_count -= counter; + nesupd->udoorbell->wqe_alloc = cpu_to_le32((counter<<24) | 0x00800000 | nesuqp->qp_id); + } + + if (err) + *bad_wr = ib_wr; + + pthread_spin_unlock(&nesuqp->lock); + + return err; +} + + +/** + * nes_upost_recv + */ +int nes_upost_recv(struct ibv_qp *ib_qp, struct ibv_recv_wr *ib_wr, + struct ibv_recv_wr **bad_wr) +{ + uint64_t u64temp; + struct nes_uqp *nesuqp = to_nes_uqp(ib_qp); + struct nes_upd *nesupd = to_nes_upd(ib_qp->pd); + struct nes_hw_qp_wqe *wqe; + uint32_t head = nesuqp->rq_head; + uint32_t qsize = nesuqp->rq_size; + uint32_t counter; + uint32_t err = 0; + uint32_t wqe_count = 0; + uint32_t outstanding_wqes; + uint32_t total_payload_length; + int sge_index; + + pthread_spin_lock(&nesuqp->lock); + + while (ib_wr) { + /* Check for RQ overflow */ + outstanding_wqes = head + (2 * qsize) - nesuqp->rq_tail; + outstanding_wqes &= qsize - 1; + if (unlikely(outstanding_wqes == (qsize - 1))) { + err = -EINVAL; + break; + } + + wqe = (struct nes_hw_qp_wqe *)&nesuqp->rq_vbase[head]; + u64temp = ib_wr->wr_id; + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_SCRATCH_LOW_IDX] = + cpu_to_le32((uint32_t)u64temp); + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_SCRATCH_HIGH_IDX] = + cpu_to_le32((uint32_t)(u64temp >> 32)); + u64temp = (uint64_t)((uintptr_t)nesuqp); + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_CTX_LOW_IDX] = + cpu_to_le32((uint32_t)u64temp); + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_CTX_HIGH_IDX] = + cpu_to_le32((uint32_t)(u64temp >> 32)); + asm __volatile__("": : :"memory"); + wqe->wqe_words[NES_IWARP_RQ_WQE_COMP_CTX_LOW_IDX] |= cpu_to_le32(head); + + total_payload_length = 0; + for (sge_index=0; sge_index < ib_wr->num_sge; sge_index++) { + wqe->wqe_words[NES_IWARP_RQ_WQE_FRAG0_LOW_IDX+(sge_index*4)] = + cpu_to_le32((uint32_t)ib_wr->sg_list[sge_index].addr); + wqe->wqe_words[NES_IWARP_RQ_WQE_FRAG0_HIGH_IDX+(sge_index*4)] = + cpu_to_le32((uint32_t)(ib_wr->sg_list[sge_index].addr>>32)); + 
wqe->wqe_words[NES_IWARP_RQ_WQE_LENGTH0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].length); + wqe->wqe_words[NES_IWARP_RQ_WQE_STAG0_IDX+(sge_index*4)] = + cpu_to_le32(ib_wr->sg_list[sge_index].lkey); + total_payload_length += ib_wr->sg_list[sge_index].length; + } + wqe->wqe_words[NES_IWARP_RQ_WQE_TOTAL_PAYLOAD_IDX] = cpu_to_le32(total_payload_length); + + ib_wr = ib_wr->next; + head++; + wqe_count++; + if (head >= qsize) + head = 0; + } + + nesuqp->rq_head = head; + asm __volatile__("": : :"memory"); + while (wqe_count) { + counter = (wqe_count<(uint32_t)255) ? wqe_count : 255; + wqe_count -= counter; + nesupd->udoorbell->wqe_alloc = cpu_to_le32((counter << 24) | nesuqp->qp_id); + } + + if (err) + *bad_wr = ib_wr; + + pthread_spin_unlock(&nesuqp->lock); + + return err; +} + + +/** + * nes_ucreate_ah + */ +struct ibv_ah *nes_ucreate_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr) +{ + /* fprintf(stderr, PFX "%s\n", __FUNCTION__); */ + return (void *)-ENOSYS; +} + + +/** + * nes_udestroy_ah + */ +int nes_udestroy_ah(struct ibv_ah *ah) +{ + /* fprintf(stderr, PFX "%s\n", __FUNCTION__); */ + return -ENOSYS; +} + + +/** + * nes_uattach_mcast + */ +int nes_uattach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +{ + /* fprintf(stderr, PFX "%s\n", __FUNCTION__); */ + return -ENOSYS; +} + + +/** + * nes_udetach_mcast + */ +int nes_udetach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +{ + /* fprintf(stderr, PFX "%s\n", __FUNCTION__); */ + return -ENOSYS; +} + From ggrundstrom at neteffect.com Fri Oct 19 13:39:23 2007 From: ggrundstrom at neteffect.com (ggrundstrom at neteffect.com) Date: Fri, 19 Oct 2007 15:39:23 -0500 Subject: [ofa-general] [PATCH 5/5 v2] libnes: userspace structures and defines Message-ID: <200710192039.l9JKdNEg021939@neteffect.com> Userspace library structures and defines.
Signed-off-by: Glenn Grundstrom --- --- NULL 1969-12-31 18:00:00.000000000 -0600 +++ libnes/src/nes_umain.h 2007-10-19 11:07:09.000000000 -0500 @@ -0,0 +1,295 @@ +/* + * Copyright (c) 2006 - 2007 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef nes_umain_H +#define nes_umain_H + +#include +#include +#include + +#include +#include + +#ifndef likely +#define likely(x) __builtin_expect((x),1) +#endif +#ifndef unlikely +#define unlikely(x) __builtin_expect((x),0) +#endif + +#define HIDDEN __attribute__((visibility ("hidden"))) + +#define PFX "nes: " + +#define NES_DRV_OPT_NO_INLINE_DATA 0x00000080 + +enum nes_cqe_opcode_bits { + NES_CQE_STAG_VALID = (1<<6), + NES_CQE_ERROR = (1<<7), + NES_CQE_SQ = (1<<8), + NES_CQE_SE = (1<<9), + NES_CQE_PSH = (1<<29), + NES_CQE_FIN = (1<<30), + NES_CQE_VALID = (1<<31), +}; + +enum nes_cqe_word_idx { + NES_CQE_PAYLOAD_LENGTH_IDX = 0, + NES_CQE_COMP_COMP_CTX_LOW_IDX = 2, + NES_CQE_COMP_COMP_CTX_HIGH_IDX = 3, + NES_CQE_INV_STAG_IDX = 4, + NES_CQE_QP_ID_IDX = 5, + NES_CQE_ERROR_CODE_IDX = 6, + NES_CQE_OPCODE_IDX = 7, +}; + +enum nes_cqe_allocate_bits { + NES_CQE_ALLOC_INC_SELECT = (1<<28), + NES_CQE_ALLOC_NOTIFY_NEXT = (1<<29), + NES_CQE_ALLOC_NOTIFY_SE = (1<<30), + NES_CQE_ALLOC_RESET = (1<<31), +}; + +enum nes_iwarp_sq_wqe_word_idx { + NES_IWARP_SQ_WQE_MISC_IDX = 0, + NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX = 1, + NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX = 2, + NES_IWARP_SQ_WQE_COMP_CTX_HIGH_IDX = 3, + NES_IWARP_SQ_WQE_COMP_SCRATCH_LOW_IDX = 4, + NES_IWARP_SQ_WQE_COMP_SCRATCH_HIGH_IDX = 5, + NES_IWARP_SQ_WQE_INV_STAG_LOW_IDX = 7, + NES_IWARP_SQ_WQE_RDMA_TO_LOW_IDX = 8, + NES_IWARP_SQ_WQE_RDMA_TO_HIGH_IDX = 9, + NES_IWARP_SQ_WQE_RDMA_LENGTH_IDX = 10, + NES_IWARP_SQ_WQE_RDMA_STAG_IDX = 11, + NES_IWARP_SQ_WQE_IMM_DATA_START_IDX = 12, + NES_IWARP_SQ_WQE_FRAG0_LOW_IDX = 16, + NES_IWARP_SQ_WQE_FRAG0_HIGH_IDX = 17, + NES_IWARP_SQ_WQE_LENGTH0_IDX = 18, + NES_IWARP_SQ_WQE_STAG0_IDX = 19, + NES_IWARP_SQ_WQE_FRAG1_LOW_IDX = 20, + NES_IWARP_SQ_WQE_FRAG1_HIGH_IDX = 21, + NES_IWARP_SQ_WQE_LENGTH1_IDX = 22, + NES_IWARP_SQ_WQE_STAG1_IDX = 23, + NES_IWARP_SQ_WQE_FRAG2_LOW_IDX = 24, + NES_IWARP_SQ_WQE_FRAG2_HIGH_IDX = 25, + NES_IWARP_SQ_WQE_LENGTH2_IDX = 26, + NES_IWARP_SQ_WQE_STAG2_IDX = 
27, + NES_IWARP_SQ_WQE_FRAG3_LOW_IDX = 28, + NES_IWARP_SQ_WQE_FRAG3_HIGH_IDX = 29, + NES_IWARP_SQ_WQE_LENGTH3_IDX = 30, + NES_IWARP_SQ_WQE_STAG3_IDX = 31, +}; + +enum nes_iwarp_rq_wqe_word_idx { + NES_IWARP_RQ_WQE_TOTAL_PAYLOAD_IDX = 1, + NES_IWARP_RQ_WQE_COMP_CTX_LOW_IDX = 2, + NES_IWARP_RQ_WQE_COMP_CTX_HIGH_IDX = 3, + NES_IWARP_RQ_WQE_COMP_SCRATCH_LOW_IDX = 4, + NES_IWARP_RQ_WQE_COMP_SCRATCH_HIGH_IDX = 5, + NES_IWARP_RQ_WQE_FRAG0_LOW_IDX = 8, + NES_IWARP_RQ_WQE_FRAG0_HIGH_IDX = 9, + NES_IWARP_RQ_WQE_LENGTH0_IDX = 10, + NES_IWARP_RQ_WQE_STAG0_IDX = 11, + NES_IWARP_RQ_WQE_FRAG1_LOW_IDX = 12, + NES_IWARP_RQ_WQE_FRAG1_HIGH_IDX = 13, + NES_IWARP_RQ_WQE_LENGTH1_IDX = 14, + NES_IWARP_RQ_WQE_STAG1_IDX = 15, + NES_IWARP_RQ_WQE_FRAG2_LOW_IDX = 16, + NES_IWARP_RQ_WQE_FRAG2_HIGH_IDX = 17, + NES_IWARP_RQ_WQE_LENGTH2_IDX = 18, + NES_IWARP_RQ_WQE_STAG2_IDX = 19, + NES_IWARP_RQ_WQE_FRAG3_LOW_IDX = 20, + NES_IWARP_RQ_WQE_FRAG3_HIGH_IDX = 21, + NES_IWARP_RQ_WQE_LENGTH3_IDX = 22, + NES_IWARP_RQ_WQE_STAG3_IDX = 23, +}; + +enum nes_iwarp_sq_opcodes { + NES_IWARP_SQ_WQE_STREAMING = (1<<23), + NES_IWARP_SQ_WQE_IMM_DATA = (1<<28), + NES_IWARP_SQ_WQE_READ_FENCE = (1<<29), + NES_IWARP_SQ_WQE_LOCAL_FENCE = (1<<30), + NES_IWARP_SQ_WQE_SIGNALED_COMPL = (1<<31), +}; + +enum nes_iwarp_sq_wqe_bits { + NES_IWARP_SQ_OP_RDMAW = 0, + NES_IWARP_SQ_OP_RDMAR = 1, + NES_IWARP_SQ_OP_SEND = 3, + NES_IWARP_SQ_OP_SENDINV = 4, + NES_IWARP_SQ_OP_SENDSE = 5, + NES_IWARP_SQ_OP_SENDSEINV = 6, + NES_IWARP_SQ_OP_BIND = 8, + NES_IWARP_SQ_OP_FAST_REG = 9, + NES_IWARP_SQ_OP_LOCINV = 10, + NES_IWARP_SQ_OP_RDMAR_LOCINV = 11, + NES_IWARP_SQ_OP_NOP = 12, +}; + +struct nes_hw_qp_wqe { + uint32_t wqe_words[32]; +}; + +struct nes_hw_cqe { + uint32_t cqe_words[8]; +}; + +enum nes_uhca_type { + NETEFFECT_nes +}; + +struct nes_user_doorbell { + uint32_t wqe_alloc; + uint32_t reserved[3]; + uint32_t cqe_alloc; +}; + +struct nes_udevice { + struct ibv_device ibv_dev; + enum nes_uhca_type hca_type; + int page_size; +}; + +struct 
nes_upd { + struct ibv_pd ibv_pd; + struct nes_user_doorbell volatile *udoorbell; + uint32_t pd_id; + uint32_t db_index; +}; + +struct nes_uvcontext { + struct ibv_context ibv_ctx; + struct nes_upd *nesupd; + uint32_t max_pds; /* maximum pds allowed for this user process */ + uint32_t max_qps; /* maximum qps allowed for this user process */ + uint32_t wq_size; /* size of the WQs (sq+rq) allocated to the mmaped area */ +}; + +struct nes_ucq { + struct ibv_cq ibv_cq; + struct nes_hw_cqe volatile *cqes; + struct ibv_mr mr; + pthread_spinlock_t lock; + uint32_t cq_id; + uint16_t size; + uint16_t head; + uint16_t polled_completions; +}; + +struct nes_uqp { + struct ibv_qp ibv_qp; + struct nes_hw_qp_wqe volatile *sq_vbase; + struct nes_hw_qp_wqe volatile *rq_vbase; + uint32_t qp_id; + uint32_t nes_drv_opt; + pthread_spinlock_t lock; + uint16_t sq_db_index; + uint16_t sq_head; + uint16_t sq_tail; + uint16_t sq_size; + uint16_t rq_db_index; + uint16_t rq_head; + uint16_t rq_tail; + uint16_t rq_size; +}; + +#define to_nes_uxxx(xxx, type) \ + ((struct nes_u##type *) \ + ((void *) ib##xxx - offsetof(struct nes_u##type, ibv_##xxx))) + +static inline struct nes_udevice *to_nes_udev(struct ibv_device *ibdev) +{ + return to_nes_uxxx(dev, device); +} + +static inline struct nes_uvcontext *to_nes_uctx(struct ibv_context *ibctx) +{ + return to_nes_uxxx(ctx, vcontext); +} + +static inline struct nes_upd *to_nes_upd(struct ibv_pd *ibpd) +{ + return to_nes_uxxx(pd, pd); +} + +static inline struct nes_ucq *to_nes_ucq(struct ibv_cq *ibcq) +{ + return to_nes_uxxx(cq, cq); +} + +static inline struct nes_uqp *to_nes_uqp(struct ibv_qp *ibqp) +{ + return to_nes_uxxx(qp, qp); +} + + +/* nes_umain.c */ +struct ibv_device *ibv_driver_init(const char *, int); + +/* nes_uverbs.c */ +int nes_uquery_device(struct ibv_context *, struct ibv_device_attr *); +int nes_uquery_port(struct ibv_context *, uint8_t, struct ibv_port_attr *); +struct ibv_pd *nes_ualloc_pd(struct ibv_context *); +int 
nes_ufree_pd(struct ibv_pd *); +struct ibv_mr *nes_ureg_mr(struct ibv_pd *, void *, size_t, enum ibv_access_flags); +int nes_udereg_mr(struct ibv_mr *); +struct ibv_cq *nes_ucreate_cq(struct ibv_context *, int, struct ibv_comp_channel *, int); +int nes_uresize_cq(struct ibv_cq *, int); +int nes_udestroy_cq(struct ibv_cq *); +int nes_upoll_cq(struct ibv_cq *, int, struct ibv_wc *); +int nes_uarm_cq(struct ibv_cq *, int); +struct ibv_srq *nes_ucreate_srq(struct ibv_pd *, struct ibv_srq_init_attr *); +int nes_umodify_srq(struct ibv_srq *, struct ibv_srq_attr *, enum ibv_srq_attr_mask); +int nes_udestroy_srq(struct ibv_srq *); +int nes_upost_srq_recv(struct ibv_srq *, struct ibv_recv_wr *, struct ibv_recv_wr **); +struct ibv_qp *nes_ucreate_qp(struct ibv_pd *, struct ibv_qp_init_attr *); +int nes_umodify_qp(struct ibv_qp *, struct ibv_qp_attr *, enum ibv_qp_attr_mask); +int nes_udestroy_qp(struct ibv_qp *); +int nes_upost_send(struct ibv_qp *, struct ibv_send_wr *, struct ibv_send_wr **); +int nes_upost_recv(struct ibv_qp *, struct ibv_recv_wr *, struct ibv_recv_wr **); +struct ibv_ah *nes_ucreate_ah(struct ibv_pd *, struct ibv_ah_attr *); +int nes_udestroy_ah(struct ibv_ah *); +int nes_uattach_mcast(struct ibv_qp *, union ibv_gid *, uint16_t); +int nes_udetach_mcast(struct ibv_qp *, union ibv_gid *, uint16_t); + +#if __BYTE_ORDER == __LITTLE_ENDIAN +static inline uint32_t cpu_to_le32(uint32_t x) { return x; } +static inline uint32_t le32_to_cpu(uint32_t x) { return x; } +#else +static inline uint32_t cpu_to_le32(uint32_t x) { return (((x&0xFF000000)>>24) | ((x&0x00FF0000)>>8) | ((x&0x0000FF00)<<8) | ((x&0x000000FF)<<24)); } +static inline uint32_t le32_to_cpu(uint32_t x) { return (((x&0xFF000000)>>24) | ((x&0x00FF0000)>>8) | ((x&0x0000FF00)<<8) | ((x&0x000000FF)<<24)); } +#endif + +#endif /* nes_umain_H */ From kilian at stanford.edu Fri Oct 19 15:56:18 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Fri, 19 Oct 2007 15:56:18 -0700 Subject: [ofa-general] 
ibdiagui: unknown color name "efefef" Message-ID: <200710191556.18958.kilian@stanford.edu> Hi all, I'm trying to use ibdiagui on a RHEL4 machine using OFED 1.2, and I'm getting the following error about an unknown color code: ---------------------------------------------------------------------- Loading IBDIAGUI from: /usr/lib64/ibdiagui1.2 Loading IBDM from: /usr/lib64/ibdm1.2 -W- Topology file is not specified. Reports regarding cluster links will use direct routes. -I- Using port 1 as the local port. -I- Found 0 errors 0 warnings 0 infos -I- Found 0 names 0 LIDS 0 GUIDs 0 Directed-Routes -I- Found 0 errors 2 warnings 14 infos -I- Found 0 names 1 LIDS 1 GUIDs 0 Directed-Routes -I- Parsing subnet lst: /tmp/ibdiagnet.lst -I- Parsing Subnet file:/tmp/ibdiagnet.lst -I- Defined 318/324 systems/nodes Warning: Illegal value hier for attribute "mode" in graph graph0 - ignored -I- Marked 318 systems 0 nodes 1152 ports -E- unknown color name "efefef" unknown color name "efefef" (processing "-fill" option) invoked from within "$c create polygon -1 612 -1 -1 772 -1 772 612 -fill efefef -outline efefef -tags 1graph0" ("eval" body line 10) invoked from within "eval $newCode" (procedure "drawFabric" line 37) invoked from within "drawFabric $gFabric $C" (procedure "GraphUpdate" line 37) invoked from within "GraphUpdate $lstFile" (procedure "DiagNet" line 27) invoked from within "DiagNet" ---------------------------------------------------------------------- Everything looks ok but this "efefef" color. I didn't find any reference to this color code in the *.tcl files, so I'm not sure where it comes from. [root at frontend2 ~]# rpm -qf $(which ibdiagui) ibutils-1.2-0.x86_64 [root at frontend2 ~]# rpm -q graphviz-tcl graphviz-tcl-2.2-1.2.el4.rf.x86_64 Thanks! 
-- Kilian From changquing.tang at hp.com Fri Oct 19 18:40:17 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Sat, 20 Oct 2007 01:40:17 -0000 Subject: [ofa-general] Behavior on dropped message Message-ID: <349DCDA352EACF42A0C49FA6DCEA84030292363B@G3W0634.americas.hpqcorp.net> Hi, I have a question about dropped messages. During QP connection setup, both QPs are in the INIT state. After exchanging the qp_num, one end moves INIT-->RTR-->RTS while the other end is still in the INIT state. Then the side in the RTS state sends a message. According to the standard, the message is silently dropped on the receiving side because it is still in the INIT state. What is the behavior on the sending side? Do I get a completion error, or never get a completion? On a Mellanox HCA, I got a completion error. But with the new ConnectX card, I never get a completion event, and my code hangs there. Thanks for any explanation. --CQ From rdreier at cisco.com Fri Oct 19 19:40:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 19 Oct 2007 19:40:44 -0700 Subject: [ofa-general] [PATCH] fix horrible hole in uverbs Message-ID: Not sure how we missed this for so long... unless I'm very confused it was possible for different contexts to stomp on each other since June of last year! commit cc81b99d8ef91e3692eb920f6a300453e2988114 Author: Roland Dreier Date: Fri Oct 19 19:39:23 2007 -0700 IB/uverbs: Fix checking of userspace object ownership Commit 9ead190b ("IB/uverbs: Don't serialize with ib_uverbs_idr_mutex") rewrote how userspace objects are looked up in the uverbs module's idrs, and introduced a severe bug in the process: there is no checking that an operation is being performed by the right process any more. Fix this by adding the missing check of uobj->context in __idr_get_uobj(). Apparently everyone is being very careful to only touch their own objects, because this bug was introduced in June 2006 in 2.6.18, and has gone undetected until now.
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 01d7008..495c803 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -147,8 +147,12 @@ static struct ib_uobject *__idr_get_uobj(struct idr *idr, int id, spin_lock(&ib_uverbs_idr_lock); uobj = idr_find(idr, id); - if (uobj) - kref_get(&uobj->ref); + if (uobj) { + if (uobj->context == context) + kref_get(&uobj->ref); + else + uobj = NULL; + } spin_unlock(&ib_uverbs_idr_lock); return uobj; From rdreier at cisco.com Fri Oct 19 20:12:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 19 Oct 2007 20:12:39 -0700 Subject: [ofa-general] Re: [PATCH] ipoib/cm: use common CQ for all TX QPs In-Reply-To: <20070816123616.GI5684@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 16 Aug 2007 15:36:16 +0300") References: <20070816123616.GI5684@mellanox.co.il> Message-ID: Applied at long last... - R. From rdreier at cisco.com Fri Oct 19 20:14:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 19 Oct 2007 20:14:49 -0700 Subject: [ofa-general] Re: [PATCH 1/14 v2] nes: module and device initialization In-Reply-To: <200710192001.l9JK1U8O021689@neteffect.com> (ggrundstrom@neteffect.com's message of "Fri, 19 Oct 2007 15:01:30 -0500") References: <200710192001.l9JK1U8O021689@neteffect.com> Message-ID: Thanks... I am kind of overloaded trying to handle the last few things for the 2.6.24 merge window, but I will look at this next week, and I expect we should be able to merge the driver for 2.6.25 unless there are unexpected hangups. 
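For readers following the IB/uverbs ownership fix posted above: the added uobj->context check amounts to refusing to hand out a reference when the looked-up object belongs to a different context. A minimal user-space sketch of that idea follows. The flat array and the names sketch_context, sketch_uobject and sketch_get_uobj() are illustrative assumptions only; the real __idr_get_uobj() uses an idr, a kref and ib_uverbs_idr_lock.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real types. */
struct sketch_context { int id; };

struct sketch_uobject {
    int id;                          /* handle passed in from userspace */
    struct sketch_context *context;  /* owner of this object */
    int refcount;                    /* stands in for the kref */
};

/*
 * Look up `id` and take a reference only if `ctx` owns the object.
 * A handle that exists but belongs to another context is treated
 * exactly like a handle that does not exist.
 */
static struct sketch_uobject *sketch_get_uobj(struct sketch_uobject *table,
                                              size_t n, int id,
                                              struct sketch_context *ctx)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].id != id)
            continue;
        if (table[i].context != ctx)
            return NULL;             /* wrong owner: no reference taken */
        table[i].refcount++;
        return &table[i];
    }
    return NULL;
}
```

The point of returning NULL for a foreign object, rather than a distinct error, is that the caller cannot tell whether another context holds that handle.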
From shemminger at linux-foundation.org Fri Oct 19 22:00:15 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Fri, 19 Oct 2007 22:00:15 -0700 Subject: [ofa-general] Re: [PATCH 2/14 v2] nes: device structures and defines In-Reply-To: <200710192004.l9JK48dm021704@neteffect.com> References: <200710192004.l9JK48dm021704@neteffect.com> Message-ID: <20071019220015.3faa9bbb@freepuppy.rosehill> On Fri, 19 Oct 2007 15:04:08 -0500 ggrundstrom at neteffect.com wrote: > Main include file for device structures and defines. > > Signed-off-by: Glenn Grundstrom You are starting off on the wrong foot. > +#ifdef CONFIG_INFINIBAND_NES_DEBUG > +#define assert(expr) \ > +if(!(expr)) { \ > + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n", \ > + #expr, __FILE__, __FUNCTION__, __LINE__); \ > +} Use BUG_ON > +#define nes_debug(level, fmt, args...) \ > + if (level & nes_debug_level) \ > + printk(KERN_ERR PFX "%s[%u]: " fmt, __FUNCTION__, __LINE__, ##args) > + > +#ifndef dprintk > +#define dprintk(fmt, args...) do { printk(KERN_ERR PFX fmt, ##args); } while (0) > +#endif Use pr_debug() or dev_dbg() > +#define NES_EVENT_TIMEOUT 1200000 > +/* #define NES_EVENT_TIMEOUT 1200 */ > +#else > +#define assert(expr) do {} while (0) > +#define nes_debug(level, fmt, args...) > +#define dprintk(fmt, args...) do {} while (0) > + > +#define NES_EVENT_TIMEOUT 100000 > +#endif > + > +#include "nes_hw.h" > +#include "nes_verbs.h" > +#include "nes_context.h" > +#include "nes_user.h" > +#include "nes_cm.h" > + > +extern int max_mtu; > +extern int nics_per_function; > +#define max_frame_len (max_mtu+ETH_HLEN) > +extern int interrupt_mod_interval; > +extern int nes_if_count; > +extern int mpa_version; > +extern int disable_mpa_crc; > +extern unsigned int send_first; > +extern unsigned int nes_drv_opt; > +extern unsigned int nes_debug_level; Lots of GLOBAL symbols that should be local to the driver. Also you want to be able to set them per board, not for the whole driver. 
> + > +static inline int nes_skb_is_gso(const struct sk_buff *skb) > +{ > + return skb_shinfo(skb)->gso_size; > +} > + > +#define nes_skb_linearize(_skb) skb_linearize(_skb) > + Why the silly wrappers? > +/* Read from memory-mapped device */ > +static inline u32 nes_read_indexed(struct nes_device *nesdev, u32 reg_index) > +{ > + unsigned long flags; > + void __iomem *addr = nesdev->index_reg; > + u32 value; > + > + spin_lock_irqsave(&nesdev->indexed_regs_lock, flags); > + > + writel(reg_index, addr); > + value = readl((void __iomem *)addr + 4); > + > + spin_unlock_irqrestore(&nesdev->indexed_regs_lock, flags); > + return value; > +} Bad feeling, I smell bad locking coming. > +static inline u32 nes_read32(const void __iomem* addr) > +{ > + return readl(addr); > +} > + > +static inline u16 nes_read16(const void __iomem* addr) > +{ > + return readw(addr); > +} > + > +static inline u8 nes_read8(const void __iomem* addr) > +{ > + return readb(addr); > +} More silly wrappers. > +/* Write to memory-mapped device */ > +static inline void nes_write_indexed(struct nes_device *nesdev, u32 reg_index, u32 val) > +{ > + unsigned long flags; > + void __iomem *addr = nesdev->index_reg; > + > + spin_lock_irqsave(&nesdev->indexed_regs_lock, flags); > + > + writel(reg_index, addr); > + writel(val, (void __iomem *)addr + 4); > + > + spin_unlock_irqrestore(&nesdev->indexed_regs_lock, flags); > +} > +static inline void nes_write32(void __iomem *addr, u32 val) > +{ > + writel(val, addr); > +} > + > +static inline void nes_write16(void __iomem *addr, u16 val) > +{ > + writew(val, addr); > +} > + > +static inline void nes_write8(void __iomem *addr, u8 val) > +{ > + writeb(val, addr); > +} > + > + > + > +static inline int nes_alloc_resource(struct nes_adapter *nesadapter, > + unsigned long *resource_array, u32 max_resources, > + u32 *req_resource_num, u32 *next) > +{ > + unsigned long flags; > + u32 resource_num; > + > + spin_lock_irqsave(&nesadapter->resource_lock, flags); > + > + 
resource_num = find_next_zero_bit(resource_array, max_resources, *next); > + if (resource_num >= max_resources) { > + resource_num = find_first_zero_bit(resource_array, max_resources); > + if (resource_num >= max_resources) { > + printk(KERN_ERR PFX "%s: No available resourcess.\n", __FUNCTION__); > + spin_unlock_irqrestore(&nesadapter->resource_lock, flags); > + return -EMFILE; > + } > + } > + nes_debug(NES_DBG_HW, "find_next_zero_bit returned = %u (max = %u).\n", > + resource_num, max_resources); > + set_bit(resource_num, resource_array); > + *next = resource_num+1; > + if (*next == max_resources) { > + *next = 0; > + } > + spin_unlock_irqrestore(&nesadapter->resource_lock, flags); > + *req_resource_num = resource_num; > + > + return 0; > +} Big fat initialization routine that shouldn't be as device inline. > +static inline int nes_is_resource_allocated(struct nes_adapter *nesadapter, > + unsigned long *resource_array, u32 resource_num) > +{ > + unsigned long flags; > + int bit_is_set; > + > + spin_lock_irqsave(&nesadapter->resource_lock, flags); > + > + bit_is_set = test_bit(resource_num, resource_array); > + nes_debug(NES_DBG_HW, "resource_num %u is%s allocated.\n", > + resource_num, (bit_is_set ? "": " not")); > + spin_unlock_irqrestore(&nesadapter->resource_lock, flags); > + > + return bit_is_set; > +} What resource, how about a comment? 
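On the "what resource?" question: as far as the quoted code shows, resource_array is a bitmap with one bit per resource index, and nes_alloc_resource() does a next-fit search that starts at *next and wraps around once before giving up. A minimal user-space sketch of that scheme, with the kind of comments being asked for, might look like the following. The 8-entry pool held in a single word and the names sketch_pool, sketch_pool_alloc() and friends are assumptions for illustration, and the spinlock the driver takes is left out.

```c
#include <stdint.h>

#define SKETCH_MAX_RESOURCES 8u   /* hypothetical pool size */

struct sketch_pool {
    uint32_t used;   /* bit i set => resource index i is allocated */
    uint32_t next;   /* index at which the next search starts */
};

/*
 * Next-fit allocation: scan from pool->next, wrapping to index 0 once.
 * Returns the allocated index, or -1 if every index is in use.
 */
static int sketch_pool_alloc(struct sketch_pool *pool)
{
    for (uint32_t i = 0; i < SKETCH_MAX_RESOURCES; i++) {
        uint32_t idx = (pool->next + i) % SKETCH_MAX_RESOURCES;

        if (!(pool->used & (1u << idx))) {
            pool->used |= 1u << idx;
            pool->next = (idx + 1) % SKETCH_MAX_RESOURCES;
            return (int)idx;
        }
    }
    return -1;
}

/* Nonzero if resource index idx is currently allocated. */
static int sketch_pool_is_allocated(const struct sketch_pool *pool, uint32_t idx)
{
    return (int)((pool->used >> idx) & 1u);
}

/* Return resource index idx to the pool. */
static void sketch_pool_free(struct sketch_pool *pool, uint32_t idx)
{
    pool->used &= ~(1u << idx);
}
```

Starting each search after the most recently allocated index means a freed index is not reused immediately, which is presumably why the driver tracks *next at all.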
> +static inline void nes_free_resource(struct nes_adapter *nesadapter, > + unsigned long *resource_array, u32 resource_num) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&nesadapter->resource_lock, flags); > + clear_bit(resource_num, resource_array); > + spin_unlock_irqrestore(&nesadapter->resource_lock, flags); > +} > + > +static inline struct nes_vnic *to_nesvnic(struct ib_device *ibdev) { > + return(container_of(ibdev, struct nes_ib_device, ibdev)->nesvnic); return container_of(ibdev, struct nes_ib_device, ibdev)->nesvnic; > +static inline struct nes_pd *to_nespd(struct ib_pd *ibpd) { > + return(container_of(ibpd, struct nes_pd, ibpd)); > +} > + > +static inline struct nes_ucontext *to_nesucontext(struct ib_ucontext *ibucontext) { > + return(container_of(ibucontext, struct nes_ucontext, ibucontext)); > +} > + > +static inline struct nes_mr *to_nesmr(struct ib_mr *ibmr) { > + return(container_of(ibmr, struct nes_mr, ibmr)); > +} > + > +static inline struct nes_mr *to_nesmr_from_ibfmr(struct ib_fmr *ibfmr) { > + return(container_of(ibfmr, struct nes_mr, ibfmr)); > +} > + > +static inline struct nes_mr *to_nesmw(struct ib_mw *ibmw) { > + return(container_of(ibmw, struct nes_mr, ibmw)); > +} > + > +static inline struct nes_fmr *to_nesfmr(struct nes_mr *nesmr) { > + return(container_of(nesmr, struct nes_fmr, nesmr)); > +} > + > +static inline struct nes_cq *to_nescq(struct ib_cq *ibcq) { > + return(container_of(ibcq, struct nes_cq, ibcq)); > +} > + > +static inline struct nes_qp *to_nesqp(struct ib_qp *ibqp) { > + return(container_of(ibqp, struct nes_qp, ibqp)); > +} > + > + > +#define NES_CQP_REQUEST_NOT_HOLDING_LOCK 0 > +#define NES_CQP_REQUEST_HOLDING_LOCK 1 > +#define NES_CQP_REQUEST_NO_DOORBELL_RING 0 > +#define NES_CQP_REQUEST_RING_DOORBELL 1 > + > +static inline struct nes_cqp_request > + *nes_get_cqp_request(struct nes_device *nesdev, int holding_lock) { Any code like that has conditional locking is indication of poor design. 
It also makes static analysis tools harder. > + unsigned long flags; > + struct nes_cqp_request *cqp_request = NULL; > + > + if (!holding_lock) { > + spin_lock_irqsave(&nesdev->cqp.lock, flags); > + } > + if (!list_empty(&nesdev->cqp_avail_reqs)) { > + cqp_request = list_entry(nesdev->cqp_avail_reqs.next, > + struct nes_cqp_request, list); > + atomic_inc(&cqp_reqs_allocated); > + list_del_init(&cqp_request->list); > + } else if (!holding_lock) { > + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); > + cqp_request = kzalloc(sizeof(struct nes_cqp_request), > + GFP_KERNEL); > + if (cqp_request) { > + cqp_request->dynamic = 1; > + INIT_LIST_HEAD(&cqp_request->list); > + atomic_inc(&cqp_reqs_dynallocated); > + } > + spin_lock_irqsave(&nesdev->cqp.lock, flags); > + } > + if (!holding_lock) { > + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); > + } > + > + if (cqp_request) { > + init_waitqueue_head(&cqp_request->waitq); > + cqp_request->waiting = 0; > + cqp_request->request_done = 0; > + init_waitqueue_head(&cqp_request->waitq); > + nes_debug(NES_DBG_CQP, "Got cqp request %p from the available list \n", > + cqp_request); > + } else > + printk(KERN_ERR PFX "%s: Could not allocated a CQP request.\n", > + __FUNCTION__); > + > + return cqp_request; > +} > + > +static inline void nes_post_cqp_request(struct nes_device *nesdev, > + struct nes_cqp_request *cqp_request, int holding_lock, int ring_doorbell) > +{ > + /* caller must be holding CQP lock */ > + struct nes_hw_cqp_wqe *cqp_wqe; > + unsigned long flags; > + u32 cqp_head; > + > + if (!holding_lock) { > + spin_lock_irqsave(&nesdev->cqp.lock, flags); > + } > + > + if (((((nesdev->cqp.sq_tail+(nesdev->cqp.sq_size*2))-nesdev->cqp.sq_head) & > + (nesdev->cqp.sq_size - 1)) != 1) > + && (list_empty(&nesdev->cqp_pending_reqs))) { > + cqp_head = nesdev->cqp.sq_head++; > + nesdev->cqp.sq_head &= nesdev->cqp.sq_size-1; > + cqp_wqe = &nesdev->cqp.sq_vbase[cqp_head]; > + memcpy(cqp_wqe, &cqp_request->cqp_wqe, sizeof(*cqp_wqe)); > 
+ barrier(); > + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = cpu_to_le32((u32)((u64)(cqp_request))); > + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = cpu_to_le32((u32)(((u64)(cqp_request))>>32)); > + nes_debug(NES_DBG_CQP, "CQP request (opcode 0x%02X), line 1 = 0x%08X put on CQPs SQ," > + " request = %p, cqp_head = %u, cqp_tail = %u, cqp_size = %u," > + " waiting = %d, refcount = %d.\n", > + le32_to_cpu(cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX])&0x3f, > + le32_to_cpu(cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX]), cqp_request, > + nesdev->cqp.sq_head, nesdev->cqp.sq_tail, nesdev->cqp.sq_size, > + cqp_request->waiting, atomic_read(&cqp_request->refcount)); > + barrier(); > + if (ring_doorbell) { > + /* Ring doorbell (1 WQEs) */ > + nes_write32(nesdev->regs+NES_WQE_ALLOC, 0x01800000 | nesdev->cqp.qp_id); > + } > + > + barrier(); > + } else { > + atomic_inc(&cqp_reqs_queued); > + nes_debug(NES_DBG_CQP, "CQP request %p (opcode 0x%02X), line 1 = 0x%08X" > + " put on the pending queue.\n", > + cqp_request, > + cqp_request->cqp_wqe.wqe_words[NES_CQP_WQE_OPCODE_IDX]&0x3f, > + cqp_request->cqp_wqe.wqe_words[NES_CQP_WQE_ID_IDX]); > + list_add_tail(&cqp_request->list, &nesdev->cqp_pending_reqs); > + } > + > + if (!holding_lock) { > + spin_unlock_irqrestore(&nesdev->cqp.lock, flags); > + } > + > + return; > +} > + You really think that you need to have a function this big inline in the header file. > + > +/* Utils */ > +#define CRC32C_POLY 0x1EDC6F41 Linux has a perfectly good crc32 library routine, use it! 
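For context on the CRC remark: 0x1EDC6F41 is the Castagnoli (CRC-32C) polynomial, and the kernel already ships a crc32c() helper declared in <linux/crc32c.h>, so a driver would normally call that instead of carrying its own table. The user-space sketch below only illustrates what the checksum computes. It is a plain bitwise version using the bit-reflected form of the polynomial; the name sketch_crc32c() is hypothetical, and how the nes driver seeds or finalizes its CRC is not shown in the quoted patch, so the conventional 0xFFFFFFFF initial value and final XOR are assumed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define SKETCH_CRC32C_POLY_REFLECTED 0x82F63B78u

/*
 * Bitwise CRC-32C over `len` bytes at `data`.
 * Assumes the usual parameters: initial value 0xFFFFFFFF,
 * reflected input/output, final XOR with 0xFFFFFFFF.
 */
static uint32_t sketch_crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (SKETCH_CRC32C_POLY_REFLECTED & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

With these parameters the standard check string "123456789" yields 0xE3069283, which is a quick way to compare any private implementation against a library routine.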
-- Stephen Hemminger From rdreier at cisco.com Fri Oct 19 22:22:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 19 Oct 2007 22:22:44 -0700 Subject: [ofa-general] Re: [PATCH 2/14 v2] nes: device structures and defines In-Reply-To: <20071019220015.3faa9bbb@freepuppy.rosehill> (Stephen Hemminger's message of "Fri, 19 Oct 2007 22:00:15 -0700") References: <200710192004.l9JK48dm021704@neteffect.com> <20071019220015.3faa9bbb@freepuppy.rosehill> Message-ID: > You are starting off on the wrong foot. ??? > > +if(!(expr)) { \ > > + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n", \ > > + #expr, __FILE__, __FUNCTION__, __LINE__); \ > > +} > > Use BUG_ON I agree that there's no need to invent a driver-private assertion macro, but (to first order at least) drivers should never use BUG_ON. I don't want some glitch in a network driver that the system could probably survive to be turned into a panic by BUG_ON -- WARN_ON seems infinitely preferable. - R. From shemminger at linux-foundation.org Fri Oct 19 22:26:20 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Fri, 19 Oct 2007 22:26:20 -0700 Subject: [ofa-general] Re: [PATCH 2/14 v2] nes: device structures and defines In-Reply-To: References: <200710192004.l9JK48dm021704@neteffect.com> <20071019220015.3faa9bbb@freepuppy.rosehill> Message-ID: <20071019222620.1ae322c9@freepuppy.rosehill> On Fri, 19 Oct 2007 22:22:44 -0700 Roland Dreier wrote: > > You are starting off on the wrong foot. > > ??? That was a introductory comment because even in reviewing the first file (which had almost no code), I saw so many style issues. > > > +if(!(expr)) { \ > > > + printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n", \ > > > + #expr, __FILE__, __FUNCTION__, __LINE__); \ > > > +} > > > > Use BUG_ON > > I agree that there's no need to invent a driver-private assertion > macro, but (to first order at least) drivers should never use BUG_ON. 
> I don't want some glitch in a network driver that the system could > probably survive to be turned into a panic by BUG_ON -- WARN_ON seems > infinitely preferable. > > - R. -- Stephen Hemminger From vlad at lists.openfabrics.org Sat Oct 20 02:53:35 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 20 Oct 2007 02:53:35 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071020-0200 daily build status Message-ID: <20071020095335.58880E608FB@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on x86_64 with 
linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: 
From cap at nsc.liu.se Sat Oct 20 14:33:53 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Sat, 20 Oct 2007 23:33:53 +0200 Subject: [ofa-general] Expected RDMA performance In-Reply-To: <6.2.0.14.2.20071019090759.02e24a00@esmail.cup.hp.com> References: <200710191720.58526.cap@nsc.liu.se> <6.2.0.14.2.20071019090759.02e24a00@esmail.cup.hp.com> Message-ID: <200710202333.56962.cap@nsc.liu.se> On Friday 19 October 2007, Michael Krause wrote: > At 08:20 AM 10/19/2007, Peter Kjellstrom wrote: > >On Thursday 18 October 2007, Chuck Hartley wrote: ... > > > What is the maximum theoretical BW for > > > DDR IB - 1525MB/sec? > > > >No, it's 20 Gbps on the wire and 8/10 encoded so 16 Gbps effective which > > is 2000 MB/s (10-base) and 1907 MiB/s (2-base). > > There is also IB protocol overhead combined with driver / device control > traffic overhead (consumes device as well as PCI resources / bandwidth), > end-to-end control traffic which is also a function of how the application > is constructed. In general, hitting about 80-85% of the theoretical > maximum is possible. IB can do much better than that. On an SDR system I typically get 950 MB/s (10-base), 95%. This on 8x pci-express so the limitations of pci-e above does not bite. If IB DDR could strech it's legs (if we had faster pci-e, say pci-e-2.0...) then maybe we would see 95% there too :-). /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. 
URL: From cap at nsc.liu.se Sat Oct 20 14:35:29 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Sat, 20 Oct 2007 23:35:29 +0200 Subject: [ofa-general] Expected RDMA performance In-Reply-To: <4718FF72.801@oracle.com> References: <200710191720.58526.cap@nsc.liu.se> <4718FF72.801@oracle.com> Message-ID: <200710202335.29585.cap@nsc.liu.se> On Friday 19 October 2007, Richard Frank wrote: > Does it follow then that it's possible to get 1400 mbytes / sec out + > 1400 mbytes / sec in for total of 2800 mbytes rdma write + rdma read ? It doesn't logically follow from the information in my previous post, but yes, systems managing 1400 one way usually do ~2800 both ways. /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From dotanb at dev.mellanox.co.il Sat Oct 20 23:06:21 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 21 Oct 2007 08:06:21 +0200 Subject: [ofa-general] Behavior on dropped message In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA84030292363B@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA84030292363B@G3W0634.americas.hpqcorp.net> Message-ID: <471AEC5D.8040802@dev.mellanox.co.il> Hi. Tang, Changqing wrote: > Hi, > I have a question about dropped messages. During QP connection setup, > QPs are in INIT state, after exchanging the qp_num, > one end is moving INIT-->RTR-->RTS, the other end is still in INIT > state. > > Then the side in RTS state sends a message. From the standard, > the message is silently dropped on the receiving > side because it is still in INIT state. What is the behavior on the > sending side? Do I get a completion error, or never > get a completion? > > On a Mellanox HCA, I got a completion error. But for the new > ConnectX card, I never get a completion event, and my > code is hanging there. 
> The sending QP don't get any response from the remote QP (because it is in INIT state and its SQ is not enabled yet), so the (retry) timeout of the sender will be expired for retry_cnt times and you should get retry exceeded completion. You should get this behavior for all of the HCAs. (unless you are using timeout = 0 which means infinite timeout) Which FW are you using for the connectX? thanks Dotan From sweitzen at cisco.com Sun Oct 21 00:02:53 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Sun, 21 Oct 2007 00:02:53 -0700 Subject: [ofa-general] RE: bugzilla ipoib bugs In-Reply-To: <1192698027.16927.6.camel@mtls03> References: <1192698027.16927.6.camel@mtls03> Message-ID: Done. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Eli Cohen [mailto:eli at mellanox.co.il] > Sent: Thursday, October 18, 2007 2:00 AM > To: Scott Weitzenkamp (sweitzen) > Subject: bugzilla ipoib bugs > > Hi Scott, > > could you please have me get bugs regarding IPOIB? > > thanks. > From kliteyn at dev.mellanox.co.il Sun Oct 21 01:00:47 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 21 Oct 2007 10:00:47 +0200 Subject: [ofa-general] Re: [PATCH v3] osm: QoS - parsing port names In-Reply-To: <20071017225709.GQ6945@sashak.voltaire.com> References: <4714C6A6.7050300@dev.mellanox.co.il> <20071017225709.GQ6945@sashak.voltaire.com> Message-ID: <471B072F.1060808@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 16:11 Tue 16 Oct , Yevgeny Kliteynik wrote: >> Added node-by-name hash to the QoS policy object and >> as port names are parsed they use this hash to locate >> that actual port that the name refers to. >> For now I prefer to keep this hash local, so it's part >> of QoS policy object. >> When the same parser will be used for partitions too, >> this hash will be moved to be part of the subnet object. >> >> V3 changes (vs. 
V2): >> - node-by-name instead of ca-by-name >> - removed any constraints on the format of node name >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/include/opensm/osm_qos_policy.h | 3 +- >> opensm/opensm/osm_qos_parser.y | 64 ++++++++++++++++++++++++++------ >> opensm/opensm/osm_qos_policy.c | 38 ++++++++++++++++--- >> 3 files changed, 86 insertions(+), 19 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h >> index 30c2e6d..61fc325 100644 >> --- a/opensm/include/opensm/osm_qos_policy.h >> +++ b/opensm/include/opensm/osm_qos_policy.h >> @@ -49,6 +49,7 @@ >> >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t { >> typedef struct _osm_qos_port_group_t { >> char *name; /* single string (this port group name) */ >> char *use; /* single string (description) */ >> - cl_list_t port_name_list; /* list of port names (.../.../...) */ >> uint8_t node_types; /* node types bitmask */ >> cl_qmap_t port_map; >> } osm_qos_port_group_t; >> @@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t { >> cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ >> osm_qos_level_t *p_default_qos_level; /* default QoS level */ >> osm_subn_t *p_subn; /* osm subnet object */ >> + st_table * p_node_hash; /* node by name hash */ >> } osm_qos_policy_t; >> >> /***************************************************/ >> diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y >> index d2917d3..5a6e0c9 100644 >> --- a/opensm/opensm/osm_qos_parser.y >> +++ b/opensm/opensm/osm_qos_parser.y >> @@ -245,7 +245,8 @@ qos_policy_entry: port_groups_section >> * use: our SRP storage targets >> * port-guid: 0x1000000000000001,0x1000000000000002 >> * ... >> - * port-name: vs1/HCA-1/P1 >> + * port-name: vs1 HCA-1/P1 >> + * port-name: node_and_HCA_name/P2 > > Maybe node_desc is cleaner instead of node_and_HCA_name. > >> * ... 
>> * pkey: 0x00FF-0x0FFF >> * ... >> @@ -602,21 +603,60 @@ port_group_use_start: TK_USE { >> >> port_group_port_name: port_group_port_name_start string_list { >> /* 'port-name' in 'port-group' - any num of instances */ >> - cl_list_iterator_t list_iterator; >> - char * tmp_str; >> - >> - list_iterator = cl_list_head(&tmp_parser_struct.str_list); >> - while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) >> + cl_list_iterator_t list_iterator; >> + osm_node_t * p_node; >> + osm_physp_t * p_physp; >> + unsigned port_num; >> + char * tmp_str; >> + char * port_str; >> + >> + /* parsing port name strings */ >> + for (list_iterator = cl_list_head(&tmp_parser_struct.str_list); >> + list_iterator != cl_list_end(&tmp_parser_struct.str_list); >> + list_iterator = cl_list_next(list_iterator)) >> { >> tmp_str = (char*)cl_list_obj(list_iterator); >> + if (tmp_str) >> + { >> + /* last slash in port name string is a separator >> + between node name and port number */ >> + port_str = strrchr(tmp_str, '/'); >> + if (!port_str || (strlen(port_str) < 3) || > > If port number is not specified it could be nice wildcarding - all > ports for this node. There is no wild card expansion with multiple ports > mapping in this patch, so this comment is just idea for future use, no > need to change yet. 
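The port-name convention this hunk introduces (everything before the last '/' is the node description, and what follows must be 'P' or 'p' plus a port number) can be tried out in isolation with a small user-space sketch. The helper name sketch_split_port_name() is hypothetical, and unlike the patch, which calls strtoul() with a NULL end pointer, this sketch also rejects trailing characters after the number; that stricter check is an assumption, not something the patch does.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split "node description/P<num>" in place at the last '/'.
 * On success, NUL-terminates the node name and returns the port
 * number (always > 0). Returns 0 if the string is not a valid
 * port name, in which case the string is left unmodified.
 */
static unsigned long sketch_split_port_name(char *name)
{
    char *port_str = strrchr(name, '/');
    char *end;
    unsigned long port_num;

    /* need at least "/P" followed by one more character */
    if (!port_str || strlen(port_str) < 3 ||
        (port_str[1] != 'p' && port_str[1] != 'P'))
        return 0;

    port_num = strtoul(&port_str[2], &end, 0);
    if (port_num == 0 || *end != '\0')
        return 0;

    /* separate node name from port number */
    *port_str = '\0';
    return port_num;
}
```

Because the split happens at the last slash, a node description that itself contains slashes, such as "vs1/HCA-1/P1", still parses, with "vs1/HCA-1" as the node name. A wildcard form like ".../P*" fails the numeric check here and would need separate handling.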
I prefer to have an 'explicit' wildcarding: ..../P* or ..../P[2-8] >> + (port_str[1] != 'p' && port_str[1] != 'P')) { >> + yyerror("illegal port name"); >> + free(tmp_str); >> + cl_list_remove_all(&tmp_parser_struct.str_list); >> + return 1; >> + } >> >> - /* >> - * TODO: parse port name strings >> - */ >> + if (!(port_num = strtoul(&port_str[2],NULL,0))) { >> + yyerror("illegal port number in port name"); >> + free(tmp_str); >> + cl_list_remove_all(&tmp_parser_struct.str_list); >> + return 1; >> + } >> >> - if (tmp_str) >> - cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); >> - list_iterator = cl_list_next(list_iterator); >> + /* separate node name from port number */ >> + port_str[0] = '\0'; >> + >> + if (st_lookup(p_qos_policy->p_node_hash, >> + (st_data_t)tmp_str, >> + (st_data_t*)&p_node)) >> + { >> + /* we found the node, now get the right port */ >> + p_physp = osm_node_get_physp_ptr(p_node, port_num); >> + if (!p_physp) { >> + yyerror("port number out of range in port name"); >> + free(tmp_str); >> + cl_list_remove_all(&tmp_parser_struct.str_list); >> + return 1; >> + } >> + /* we found the port, now add it to guid table */ >> + __parser_add_port_to_port_map(&p_current_port_group->port_map, >> + p_physp); >> + } >> + free(tmp_str); >> + } >> } >> cl_list_remove_all(&tmp_parser_struct.str_list); >> } >> diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c >> index 51dd7b9..1207295 100644 >> --- a/opensm/opensm/osm_qos_policy.c >> +++ b/opensm/opensm/osm_qos_policy.c >> @@ -59,6 +59,33 @@ >> /*************************************************** >> ***************************************************/ >> >> +static void >> +__build_nodebyname_hash(osm_qos_policy_t * p_qos_policy) >> +{ >> + osm_node_t * p_node; >> + cl_qmap_t * p_node_guid_tbl = &p_qos_policy->p_subn->node_guid_tbl; >> + >> + p_qos_policy->p_node_hash = st_init_strtable(); >> + CL_ASSERT(p_qos_policy->p_node_hash); >> + >> + if (!p_node_guid_tbl || 
!cl_qmap_count(p_node_guid_tbl)) >> + return; >> + >> + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); >> + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); >> + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { >> + if (!st_lookup(p_qos_policy->p_node_hash, >> + (st_data_t)p_node->print_desc, >> + (st_data_t*)&p_node)) >> + st_insert(p_qos_policy->p_node_hash, >> + (st_data_t)p_node->print_desc, >> + (st_data_t)p_node); > > st_lookup() is not needed? st_insert() replace entry if it exists. In > case of identical node_desc last will appear. Whether the first or the last appearance will remain in the hash, it's bad either way, but if the value in the hash will be replaced, it will create memory leak, since the previous value won't be freed. -- Yevgeny > Sasha > >> + } >> +} >> + >> +/*************************************************** >> + ***************************************************/ >> + >> static boolean_t >> __is_num_in_range_arr(uint64_t ** range_arr, >> unsigned range_arr_len, uint64_t num) >> @@ -127,8 +154,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() >> return NULL; >> >> memset(p, 0, sizeof(osm_qos_port_group_t)); >> - >> - cl_list_init(&p->port_name_list, 10); >> cl_qmap_init(&p->port_map); >> >> return p; >> @@ -150,10 +175,6 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) >> if (p->use) >> free(p->use); >> >> - cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); >> - cl_list_remove_all(&p->port_name_list); >> - cl_list_destroy(&p->port_name_list); >> - >> p_port = (osm_qos_port_t *) cl_qmap_head(&p->port_map); >> while (p_port != (osm_qos_port_t *) cl_qmap_end(&p->port_map)) >> { >> @@ -423,6 +444,8 @@ osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) >> cl_list_init(&p_qos_policy->qos_match_rules, 10); >> >> p_qos_policy->p_subn = p_subn; >> + __build_nodebyname_hash(p_qos_policy); >> + >> return p_qos_policy; >> } >> >> @@ -495,6 +518,9 @@ void 
osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy) >> cl_list_remove_all(&p_qos_policy->qos_match_rules); >> cl_list_destroy(&p_qos_policy->qos_match_rules); >> >> + if (p_qos_policy->p_node_hash) >> + st_free_table(p_qos_policy->p_node_hash); >> + >> free(p_qos_policy); >> >> p_qos_policy = NULL; >> -- >> 1.5.1.4 >> > From kliteyn at dev.mellanox.co.il Sun Oct 21 01:04:22 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 21 Oct 2007 10:04:22 +0200 Subject: [ofa-general] [PATCH v4] osm: QoS - parsing port names Message-ID: <471B0806.6010604@dev.mellanox.co.il> Added node-by-name hash to the QoS policy object and as port names are parsed they use this hash to locate that actual port that the name refers to. For now I prefer to keep this hash local, so it's part of QoS policy object. When the same parser will be used for partitions too, this hash will be moved to be part of the subnet object. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_qos_policy.h | 3 +- opensm/opensm/osm_qos_parser.y | 64 ++++++++++++++++++++++++++------ opensm/opensm/osm_qos_policy.c | 38 ++++++++++++++++--- 3 files changed, 86 insertions(+), 19 deletions(-) diff --git a/opensm/include/opensm/osm_qos_policy.h b/opensm/include/opensm/osm_qos_policy.h index 30c2e6d..61fc325 100644 --- a/opensm/include/opensm/osm_qos_policy.h +++ b/opensm/include/opensm/osm_qos_policy.h @@ -49,6 +49,7 @@ #include #include +#include #include #include #include @@ -72,7 +73,6 @@ typedef struct _osm_qos_port_t { typedef struct _osm_qos_port_group_t { char *name; /* single string (this port group name) */ char *use; /* single string (description) */ - cl_list_t port_name_list; /* list of port names (.../.../...) 
*/ uint8_t node_types; /* node types bitmask */ cl_qmap_t port_map; } osm_qos_port_group_t; @@ -148,6 +148,7 @@ typedef struct _osm_qos_policy_t { cl_list_t qos_match_rules; /* list of osm_qos_match_rule_t */ osm_qos_level_t *p_default_qos_level; /* default QoS level */ osm_subn_t *p_subn; /* osm subnet object */ + st_table * p_node_hash; /* node by name hash */ } osm_qos_policy_t; /***************************************************/ diff --git a/opensm/opensm/osm_qos_parser.y b/opensm/opensm/osm_qos_parser.y index d2917d3..77a49c3 100644 --- a/opensm/opensm/osm_qos_parser.y +++ b/opensm/opensm/osm_qos_parser.y @@ -245,7 +245,8 @@ qos_policy_entry: port_groups_section * use: our SRP storage targets * port-guid: 0x1000000000000001,0x1000000000000002 * ... - * port-name: vs1/HCA-1/P1 + * port-name: vs1 HCA-1/P1 + * port-name: node_description/P2 * ... * pkey: 0x00FF-0x0FFF * ... @@ -602,21 +603,60 @@ port_group_use_start: TK_USE { port_group_port_name: port_group_port_name_start string_list { /* 'port-name' in 'port-group' - any num of instances */ - cl_list_iterator_t list_iterator; - char * tmp_str; - - list_iterator = cl_list_head(&tmp_parser_struct.str_list); - while( list_iterator != cl_list_end(&tmp_parser_struct.str_list) ) + cl_list_iterator_t list_iterator; + osm_node_t * p_node; + osm_physp_t * p_physp; + unsigned port_num; + char * tmp_str; + char * port_str; + + /* parsing port name strings */ + for (list_iterator = cl_list_head(&tmp_parser_struct.str_list); + list_iterator != cl_list_end(&tmp_parser_struct.str_list); + list_iterator = cl_list_next(list_iterator)) { tmp_str = (char*)cl_list_obj(list_iterator); + if (tmp_str) + { + /* last slash in port name string is a separator + between node name and port number */ + port_str = strrchr(tmp_str, '/'); + if (!port_str || (strlen(port_str) < 3) || + (port_str[1] != 'p' && port_str[1] != 'P')) { + yyerror("illegal port name"); + free(tmp_str); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; 
+ } - /* - * TODO: parse port name strings - */ + if (!(port_num = strtoul(&port_str[2],NULL,0))) { + yyerror("illegal port number in port name"); + free(tmp_str); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } - if (tmp_str) - cl_list_insert_tail(&p_current_port_group->port_name_list,tmp_str); - list_iterator = cl_list_next(list_iterator); + /* separate node name from port number */ + port_str[0] = '\0'; + + if (st_lookup(p_qos_policy->p_node_hash, + (st_data_t)tmp_str, + (st_data_t*)&p_node)) + { + /* we found the node, now get the right port */ + p_physp = osm_node_get_physp_ptr(p_node, port_num); + if (!p_physp) { + yyerror("port number out of range in port name"); + free(tmp_str); + cl_list_remove_all(&tmp_parser_struct.str_list); + return 1; + } + /* we found the port, now add it to guid table */ + __parser_add_port_to_port_map(&p_current_port_group->port_map, + p_physp); + } + free(tmp_str); + } } cl_list_remove_all(&tmp_parser_struct.str_list); } diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 51dd7b9..1207295 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -59,6 +59,33 @@ /*************************************************** ***************************************************/ +static void +__build_nodebyname_hash(osm_qos_policy_t * p_qos_policy) +{ + osm_node_t * p_node; + cl_qmap_t * p_node_guid_tbl = &p_qos_policy->p_subn->node_guid_tbl; + + p_qos_policy->p_node_hash = st_init_strtable(); + CL_ASSERT(p_qos_policy->p_node_hash); + + if (!p_node_guid_tbl || !cl_qmap_count(p_node_guid_tbl)) + return; + + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { + if (!st_lookup(p_qos_policy->p_node_hash, + (st_data_t)p_node->print_desc, + (st_data_t*)&p_node)) + st_insert(p_qos_policy->p_node_hash, + (st_data_t)p_node->print_desc, + 
(st_data_t)p_node); + } +} + +/*************************************************** + ***************************************************/ + static boolean_t __is_num_in_range_arr(uint64_t ** range_arr, unsigned range_arr_len, uint64_t num) @@ -127,8 +154,6 @@ osm_qos_port_group_t *osm_qos_policy_port_group_create() return NULL; memset(p, 0, sizeof(osm_qos_port_group_t)); - - cl_list_init(&p->port_name_list, 10); cl_qmap_init(&p->port_map); return p; @@ -150,10 +175,6 @@ void osm_qos_policy_port_group_destroy(osm_qos_port_group_t * p) if (p->use) free(p->use); - cl_list_apply_func(&p->port_name_list, __free_single_element, NULL); - cl_list_remove_all(&p->port_name_list); - cl_list_destroy(&p->port_name_list); - p_port = (osm_qos_port_t *) cl_qmap_head(&p->port_map); while (p_port != (osm_qos_port_t *) cl_qmap_end(&p->port_map)) { @@ -423,6 +444,8 @@ osm_qos_policy_t * osm_qos_policy_create(osm_subn_t * p_subn) cl_list_init(&p_qos_policy->qos_match_rules, 10); p_qos_policy->p_subn = p_subn; + __build_nodebyname_hash(p_qos_policy); + return p_qos_policy; } @@ -495,6 +518,9 @@ void osm_qos_policy_destroy(osm_qos_policy_t * p_qos_policy) cl_list_remove_all(&p_qos_policy->qos_match_rules); cl_list_destroy(&p_qos_policy->qos_match_rules); + if (p_qos_policy->p_node_hash) + st_free_table(p_qos_policy->p_node_hash); + free(p_qos_policy); p_qos_policy = NULL; -- 1.5.1.4 From tziporet at dev.mellanox.co.il Sun Oct 21 01:22:52 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 21 Oct 2007 10:22:52 +0200 Subject: [ofa-general] Re: [ewg] [PATCH 0/14 v2] nes: NetEffect 10Gb RNIC Driver In-Reply-To: <200710191957.l9JJvAgC021662@neteffect.com> References: <200710191957.l9JJvAgC021662@neteffect.com> Message-ID: <471B0C5C.5030107@mellanox.co.il> ggrundstrom at neteffect.com wrote: > This is the second posting for the series of patches containing the source code > for the NetEffect 10Gb RNIC adapter. 
The driver is split into two components - a > kernel driver module and a userspace library. > > The code can also be found in the following git trees. > > git.openfabrics.org/~glenn/libnes.git > git.openfabrics.org/~glenn/linux-2.6.git > > Thanks, > Glenn. > Can you review the release notes of OFED 1.3 and send me update regarding name of cards you support or anything that should be documented Thanks, Tziporet From kliteyn at dev.mellanox.co.il Sun Oct 21 02:44:07 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 21 Oct 2007 11:44:07 +0200 Subject: [ofa-general] Re: [PATCH V2] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <20071017221322.GN6945@sashak.voltaire.com> References: <4714C9A1.5010304@dev.mellanox.co.il> <20071017221322.GN6945@sashak.voltaire.com> Message-ID: <471B1F67.60904@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 16:24 Tue 16 Oct , Yevgeny Kliteynik wrote: >> Adding ClassPortInfo:CapabilityMask2 field and turning >> on OSM QoS capabiliry bit (OSM_CAP2_IS_QOS_SUPPORTED). 
> ^^^^^^^^^^ > capability Right >> Signed-off-by: Yevgeny Kliteynik >> --- >> infiniband-diags/src/saquery.c | 6 +- >> opensm/include/iba/ib_types.h | 137 +++++++++++++++++++++++++++++++- >> opensm/include/opensm/osm_base.h | 12 +++ >> opensm/opensm/osm_sa_class_port_info.c | 4 +- >> opensm/osmtest/osmtest.c | 13 +++- >> 5 files changed, 162 insertions(+), 10 deletions(-) >> >> diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c >> index a9a8da4..e17ec5a 100644 >> --- a/infiniband-diags/src/saquery.c >> +++ b/infiniband-diags/src/saquery.c >> @@ -262,7 +262,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) >> "\t\tBase version.............%d\n" >> "\t\tClass version............%d\n" >> "\t\tCapability mask..........0x%04X\n" >> - "\t\tResponse time value......0x%08X\n" >> + "\t\tCapability mask 2........0x%08X\n" >> + "\t\tResponse time value......0x%02X\n" >> "\t\tRedirect GID.............0x%s\n" >> "\t\tRedirect TC/SL/FL........0x%08X\n" >> "\t\tRedirect LID.............0x%04X\n" >> @@ -279,7 +280,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) >> class_port_info->base_ver, >> class_port_info->class_ver, >> cl_ntoh16(class_port_info->cap_mask), >> - class_port_info->resp_time_val, >> + ib_class_cap_mask2(class_port_info), >> + ib_class_resp_time_val(class_port_info), >> sprint_gid(&(class_port_info->redir_gid), gid_str, GID_STR_LEN), >> cl_ntoh32(class_port_info->redir_tc_sl_fl), >> cl_ntoh16(class_port_info->redir_lid), >> diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h >> index 0969755..3685007 100644 >> --- a/opensm/include/iba/ib_types.h >> +++ b/opensm/include/iba/ib_types.h >> @@ -3247,8 +3247,7 @@ typedef struct _ib_class_port_info { >> uint8_t base_ver; >> uint8_t class_ver; >> ib_net16_t cap_mask; >> - uint8_t reserved[3]; >> - uint8_t resp_time_val; >> + ib_net32_t cap_mask2_resp_time; >> ib_gid_t redir_gid; >> ib_net32_t redir_tc_sl_fl; >> ib_net16_t redir_lid; >> @@ 
-3275,8 +3274,9 @@ typedef struct _ib_class_port_info { >> * cap_mask >> * Supported capabilities of this management class. >> * >> -* resp_time_value >> -* Maximum expected response time. >> +* cap_mask2_resp_time >> +* Maximum expected response time and additional >> +* supported capabilities of this management class. >> * >> * redr_gid >> * GID to use for redirection, or zero >> @@ -3322,6 +3322,135 @@ typedef struct _ib_class_port_info { >> * >> *********/ >> >> +/****f* IBA Base: Types/ib_class_set_resp_time_val >> +* NAME >> +* ib_class_set_resp_time_val >> +* >> +* DESCRIPTION >> +* Set maximum expected response time. >> +* >> +* SYNOPSIS >> +*/ >> +static inline void OSM_API >> +ib_class_set_resp_time_val(IN ib_class_port_info_t * const p_cpi, >> + IN const uint8_t val) >> +{ >> + p_cpi->cap_mask2_resp_time = >> + (p_cpi->cap_mask2_resp_time & CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | > > Souldn't be ~IB_CLASS_RESP_TIME_MASK? Good catch!!! Thanks. > >> + cl_hton32(val & IB_CLASS_RESP_TIME_MASK); >> +} >> + >> +/* >> +* PARAMETERS >> +* p_cpi >> +* [in] Pointer to the class port info object. >> +* >> +* val >> +* [in] Response time value to set. >> +* >> +* RETURN VALUES >> +* None >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_class_port_info_t >> +*********/ >> + >> +/****f* IBA Base: Types/ib_class_resp_time_val >> +* NAME >> +* ib_class_resp_time_val >> +* >> +* DESCRIPTION >> +* Get response time value. >> +* >> +* SYNOPSIS >> +*/ >> +static inline uint8_t OSM_API >> +ib_class_resp_time_val(IN ib_class_port_info_t * const p_cpi) >> +{ >> + return (uint8_t)(cl_ntoh32(p_cpi->cap_mask2_resp_time) & >> + IB_CLASS_RESP_TIME_MASK); >> +} >> + >> +/* >> +* PARAMETERS >> +* p_cpi >> +* [in] Pointer to the class port info object. >> +* >> +* RETURN VALUES >> +* Response time value. 
>> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_class_port_info_t >> +*********/ >> + >> +/****f* IBA Base: Types/ib_class_set_cap_mask2 >> +* NAME >> +* ib_class_set_cap_mask2 >> +* >> +* DESCRIPTION >> +* Set ClassPortInfo:CapabilityMask2. >> +* >> +* SYNOPSIS >> +*/ >> +static inline void OSM_API >> +ib_class_set_cap_mask2(IN ib_class_port_info_t * const p_cpi, >> + IN const uint32_t cap_mask2) >> +{ >> + p_cpi->cap_mask2_resp_time = (p_cpi->cap_mask2_resp_time & >> + CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | >> + cl_hton32(cap_mask2 << 5); >> +} >> + >> +/* >> +* PARAMETERS >> +* p_cpi >> +* [in] Pointer to the class port info object. >> +* >> +* cap_mask2 >> +* [in] CapabilityMask2 value to set. >> +* >> +* RETURN VALUES >> +* None >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_class_port_info_t >> +*********/ >> + >> +/****f* IBA Base: Types/ib_class_cap_mask2 >> +* NAME >> +* ib_class_cap_mask2 >> +* >> +* DESCRIPTION >> +* Get ClassPortInfo:CapabilityMask2. >> +* >> +* SYNOPSIS >> +*/ >> +static inline uint32_t OSM_API >> +ib_class_cap_mask2(IN const ib_class_port_info_t * const p_cpi) >> +{ >> + return (cl_ntoh32(p_cpi->cap_mask2_resp_time) >> 5); >> +} >> + >> +/* >> +* PARAMETERS >> +* p_cpi >> +* [in] Pointer to the class port info object. >> +* >> +* RETURN VALUES >> +* CapabilityMask2 of the ClassPortInfo. 
>> +* >> +* NOTES >> +* >> +* SEE ALSO >> +* ib_class_port_info_t >> +*********/ >> + >> /****s* IBA Base: Types/ib_sm_info_t >> * NAME >> * ib_sm_info_t >> diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h >> index e635dcb..26ef067 100644 >> --- a/opensm/include/opensm/osm_base.h >> +++ b/opensm/include/opensm/osm_base.h >> @@ -661,6 +661,18 @@ typedef enum _osm_thread_state { >> #define OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED (1 << 13) >> /***********/ >> >> +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED >> +* Name >> +* OSM_CAP2_IS_QOS_SUPPORTED >> +* >> +* DESCRIPTION >> +* QoS is supported >> +* >> +* SYNOPSIS >> +*/ >> +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) > > This one is IB specific. I guess it should be somewhere in ib_types.h. Not sure I'm following. How is it different from other capability bits here? For instance, why is it more "IB specific" than OSM_CAP_IS_MULTIPATH_SUP? Just to make sure there's no misunderstanding here: OSM_CAP2_IS_QOS_SUPPORTED doesn't say whether or not QoS on fabric is supported. It just denotes that SM can handle Service-ID and QoS-Class fields of the PR/MPR. 
> >> +/***********/ >> + >> /****d* OpenSM: Base/osm_sm_state_t >> * NAME >> * osm_sm_state_t >> diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c >> index d5c9f82..96d8898 100644 >> --- a/opensm/opensm/osm_sa_class_port_info.c >> +++ b/opensm/opensm/osm_sa_class_port_info.c >> @@ -170,7 +170,7 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, >> } >> } >> rtv += 8; >> - p_resp_cpi->resp_time_val = rtv; >> + ib_class_set_resp_time_val(p_resp_cpi, rtv); >> p_resp_cpi->redir_gid = zero_gid; >> p_resp_cpi->redir_tc_sl_fl = 0; >> p_resp_cpi->redir_lid = 0; >> @@ -209,6 +209,8 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, >> p_resp_cpi->cap_mask = OSM_CAP_IS_SUBN_GET_SET_NOTICE_SUP | >> OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED; >> #endif >> + ib_class_set_cap_mask2(p_resp_cpi, OSM_CAP2_IS_QOS_SUPPORTED); > > Shouldn't it check subn->opts.qos? Good idea. -- Yevgeny > Sasha > >> + >> if (p_rcv->p_subn->opt.no_multicast_option != TRUE) >> p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; >> p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); >> diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c >> index 73933a3..de54f2d 100644 >> --- a/opensm/osmtest/osmtest.c >> +++ b/opensm/osmtest/osmtest.c >> @@ -713,10 +713,17 @@ ib_api_status_t osmtest_validate_sa_class_port_info(IN osmtest_t * const p_osmt) >> (ib_class_port_info_t *) ib_sa_mad_get_payload_ptr(p_resp_sa_madp); >> >> osm_log(&p_osmt->log, OSM_LOG_INFO, >> - "osmtest_validate_sa_class_port_info:\n-----------------------------\nSA Class Port Info:\n" >> - " base_ver:%u\n class_ver:%u\n cap_mask:0x%X\n resp_time_val:0x%X\n-----------------------------\n", >> + "osmtest_validate_sa_class_port_info:\n" >> + "-----------------------------\n" >> + "SA Class Port Info:\n" >> + " base_ver:%u\n" >> + " class_ver:%u\n" >> + " cap_mask:0x%X\n" >> + " cap_mask2:0x%X\n" >> + " resp_time_val:0x%X\n" >> + "-----------------------------\n", >> 
p_cpi->base_ver, p_cpi->class_ver, cl_ntoh16(p_cpi->cap_mask), >> - p_cpi->resp_time_val); >> + ib_class_cap_mask2(p_cpi), ib_class_resp_time_val(p_cpi)); >> >> Exit: >> #if 0 >> -- >> 1.5.1.4 >> > From jackm at dev.mellanox.co.il Sun Oct 21 03:03:01 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 21 Oct 2007 12:03:01 +0200 Subject: [ofa-general] [PATCH] mlx4/core: increase command timeout for INIT_HCA to 10 seconds Message-ID: <200710211203.02021.jackm@dev.mellanox.co.il> Increase the timeout for the INIT_HCA command, to bring it into line with all the other command timeouts (which are 10 seconds). Signed-off-by: Jack Morgenstein --- Roland, The ConnectX PRM (section 19.3.4 -- Init HCA) states: "Execution time depends on the amount of resources (QPs, CQs, etc) which are allocated and initialized by the HCA" The current timeout value of 1 second is sufficient for the resources currently allocated. However, if more resources are allocated (via module parameters, per my previous patch), this 1-second timeout will not be sufficiently long. The INIT_HCA timeout should be increased to 10 seconds, to bring it into line with all the other HCA commands. 
Jack diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index 6471d33..5064873 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -736,7 +736,7 @@ int mlx4_INIT_HCA(struct mlx4_dev *dev, struct mlx4_init_hca_param *param) MLX4_PUT(inbox, (u8) (PAGE_SHIFT - 12), INIT_HCA_UAR_PAGE_SZ_OFFSET); MLX4_PUT(inbox, param->log_uar_sz, INIT_HCA_LOG_UAR_SZ_OFFSET); - err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_INIT_HCA, 1000); + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_INIT_HCA, 10000); if (err) mlx4_err(dev, "INIT_HCA returns %d\n", err); From vlad at lists.openfabrics.org Sun Oct 21 02:58:08 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 21 Oct 2007 02:58:08 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071021-0200 daily build status Message-ID: <20071021095808.144A2E60846@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ppc64 with 
linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.22 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Sun Oct 21 04:53:27 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 21 Oct 2007 13:53:27 +0200 Subject: [ofa-general] Re: [PATCH v3] osm: QoS - parsing port names In-Reply-To: <471B072F.1060808@dev.mellanox.co.il> References: <4714C6A6.7050300@dev.mellanox.co.il> <20071017225709.GQ6945@sashak.voltaire.com> <471B072F.1060808@dev.mellanox.co.il> Message-ID: <20071021115327.GA21849@sashak.voltaire.com> On 10:00 Sun 21 Oct , Yevgeny Kliteynik wrote: > >> + if (!st_lookup(p_qos_policy->p_node_hash, > >> + (st_data_t)p_node->print_desc, > >> + (st_data_t*)&p_node)) > >> + st_insert(p_qos_policy->p_node_hash, > >> + (st_data_t)p_node->print_desc, > >> + (st_data_t)p_node); > > st_lookup() is not needed? 
st_insert() replace entry if it exists. In > > case of identical node_desc last will appear. > > Whether the first or the last appearance will remain in the hash, > it's bad either way, but if the value in the hash will be replaced, > it will create memory leak, since the previous value won't be freed. Hmm, right, st_* does internal allocation - I missed this. Sasha From sashak at voltaire.com Sun Oct 21 04:59:36 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 21 Oct 2007 13:59:36 +0200 Subject: [ofa-general] Re: [PATCH V2] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <471B1F67.60904@dev.mellanox.co.il> References: <4714C9A1.5010304@dev.mellanox.co.il> <20071017221322.GN6945@sashak.voltaire.com> <471B1F67.60904@dev.mellanox.co.il> Message-ID: <20071021115936.GB21849@sashak.voltaire.com> On 11:44 Sun 21 Oct , Yevgeny Kliteynik wrote: > >> > >> +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED > >> +* Name > >> +* OSM_CAP2_IS_QOS_SUPPORTED > >> +* > >> +* DESCRIPTION > >> +* QoS is supported > >> +* > >> +* SYNOPSIS > >> +*/ > >> +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) > > This one is IB specific. I guess it should be somewhere in ib_types.h. > > Not sure I'm following. > How is it different from other capability bits here? > For instance, why is it more "IB specific" than OSM_CAP_IS_MULTIPATH_SUP? Good point, it is not more. I think all other IB spec constants should go to ib_types.h too. 
Sasha From sashak at voltaire.com Sun Oct 21 05:00:52 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 21 Oct 2007 14:00:52 +0200 Subject: [ofa-general] Re: [PATCH v4] osm: QoS - parsing port names In-Reply-To: <471B0806.6010604@dev.mellanox.co.il> References: <471B0806.6010604@dev.mellanox.co.il> Message-ID: <20071021120052.GC21849@sashak.voltaire.com> On 10:04 Sun 21 Oct , Yevgeny Kliteynik wrote: > Added node-by-name hash to the QoS policy object and > as port names are parsed they use this hash to locate > that actual port that the name refers to. > For now I prefer to keep this hash local, so it's part > of QoS policy object. > When the same parser will be used for partitions too, > this hash will be moved to be part of the subnet object. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From sashak at voltaire.com Sun Oct 21 05:01:16 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 21 Oct 2007 14:01:16 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags: Formalize BuildRequires for rpmbuild In-Reply-To: <20071019111136.48518c07.weiny2@llnl.gov> References: <20071019111136.48518c07.weiny2@llnl.gov> Message-ID: <20071021120116.GD21849@sashak.voltaire.com> On 11:11 Fri 19 Oct , Ira Weiny wrote: > From 33d2c9cca44ce13aa8f35b2228369a33f7a45a70 Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Wed, 17 Oct 2007 15:23:55 -0700 > Subject: [PATCH] Formalize BuildRequires for rpmbuild > > the mock build tool in particular requires specific build requires > > Signed-off-by: Ira K. Weiny Applied. Thanks. 
Sasha From sashak at voltaire.com Sun Oct 21 05:04:44 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 21 Oct 2007 14:04:44 +0200 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/perfquery.c: Support PMAs which don't support AllPortSelect option In-Reply-To: <1192577137.5921.176.camel@hrosenstock-ws.xsigo.com> References: <1192577137.5921.176.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071021120444.GE21849@sashak.voltaire.com> On 16:25 Tue 16 Oct , Hal Rosenstock wrote: > infiniband-diags/perfquery.c: Support PMAs which don't support > AllPortSelect option > > Currently only support single port HCAs are supported in this mode but > this can be extended for other devices if needed > > Tested-by: Greg Kurtzer > Signed-off-by: Hal Rosenstock Applied. Thanks. However I have a question below. > + if (allports == 1) { > + > + /* > + * Simulate all ports support in PMA > + * Determine node type, number of (physical) ports, > + * and, if switch, whether SP0 is enhanced > + * to determine first and last port to query > + */ > + > + /* For now, support single port CAs */ > + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) > + IBERROR("smp query nodeinfo failed"); > + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); > + if (node_type != IB_NODE_CA) /* NodeType other than CA ? */ > + IBERROR("smp query nodeinfo: Node type not CA"); Not supporting switches and routers is temporary limitation (like all port simulation for single port CAs only), right? 
Sasha From sashak at voltaire.com Sun Oct 21 05:05:01 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 21 Oct 2007 14:05:01 +0200 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/scripts: Updated for perfquery support of no AllPortSelect option In-Reply-To: <1192577457.5921.179.camel@hrosenstock-ws.xsigo.com> References: <1192577457.5921.179.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071021120501.GF21849@sashak.voltaire.com> On 16:30 Tue 16 Oct , Hal Rosenstock wrote: > infiniband-diags/scripts: Updated for perfquery support of no > AllPortSelect option > > Eliminate new ibwarn message added to perfquery to let user know > AllPortSelect option is not supported by specified PMA > > Tested-by: Greg Kurtzer > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From hrosenstock at xsigo.com Sun Oct 21 05:53:09 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 21 Oct 2007 05:53:09 -0700 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/perfquery.c: Support PMAs which don't support AllPortSelect option In-Reply-To: <20071021120444.GE21849@sashak.voltaire.com> References: <1192577137.5921.176.camel@hrosenstock-ws.xsigo.com> <20071021120444.GE21849@sashak.voltaire.com> Message-ID: <1192971189.23494.382.camel@hrosenstock-ws.xsigo.com> On Sun, 2007-10-21 at 14:04 +0200, Sasha Khapyorsky wrote: > On 16:25 Tue 16 Oct , Hal Rosenstock wrote: > > infiniband-diags/perfquery.c: Support PMAs which don't support > > AllPortSelect option > > > > Currently only support single port HCAs are supported in this mode but > > this can be extended for other devices if needed > > > > Tested-by: Greg Kurtzer > > Signed-off-by: Hal Rosenstock > > Applied. Thanks. > > However I have a question below. 
> > > + if (allports == 1) { > > + > > + /* > > + * Simulate all ports support in PMA > > + * Determine node type, number of (physical) ports, > > + * and, if switch, whether SP0 is enhanced > > + * to determine first and last port to query > > + */ > > + > > + /* For now, support single port CAs */ > > + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) > > + IBERROR("smp query nodeinfo failed"); > > + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); > > + if (node_type != IB_NODE_CA) /* NodeType other than CA ? */ > > + IBERROR("smp query nodeinfo: Node type not CA"); > > Not supporting switches and routers and multiport CAs > is temporary limitation (like > all port simulation for single port CAs only), right? Temporary in the sense that it can be fixed but I have no current plan to do so as I am unaware of any practical need for this. Are you aware of such a need ? -- Hal > Sasha From sashak at voltaire.com Sun Oct 21 06:52:35 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 21 Oct 2007 15:52:35 +0200 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/perfquery.c: Support PMAs which don't support AllPortSelect option In-Reply-To: <1192971189.23494.382.camel@hrosenstock-ws.xsigo.com> References: <1192577137.5921.176.camel@hrosenstock-ws.xsigo.com> <20071021120444.GE21849@sashak.voltaire.com> <1192971189.23494.382.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071021135235.GL21849@sashak.voltaire.com> On 05:53 Sun 21 Oct , Hal Rosenstock wrote: > On Sun, 2007-10-21 at 14:04 +0200, Sasha Khapyorsky wrote: > > On 16:25 Tue 16 Oct , Hal Rosenstock wrote: > > > infiniband-diags/perfquery.c: Support PMAs which don't support > > > AllPortSelect option > > > > > > Currently only support 
single port HCAs are supported in this mode but > > > this can be extended for other devices if needed > > > > > > Tested-by: Greg Kurtzer > > > Signed-off-by: Hal Rosenstock > > > > Applied. Thanks. > > > > However I have a question below. > > > > > + if (allports == 1) { > > > + > > > + /* > > > + * Simulate all ports support in PMA > > > + * Determine node type, number of (physical) ports, > > > + * and, if switch, whether SP0 is enhanced > > > + * to determine first and last port to query > > > + */ > > > + > > > + /* For now, support single port CAs */ > > > + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) > > > + IBERROR("smp query nodeinfo failed"); > > > + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); > > > + if (node_type != IB_NODE_CA) /* NodeType other than CA ? */ > > > + IBERROR("smp query nodeinfo: Node type not CA"); > > > > Not supporting switches and routers > > and multiport CAs > > > is temporary limitation (like > > all port simulation for single port CAs only), right? > > Temporary in the sense that it can be fixed but I have no current plan > to do so as I am unaware of any practical need for this. Probably it would be better to have cleaner error message then - something like "non-CA nodes are not supported yet." > Are you aware > of such a need ? Yes, assuming "all ports" queries are useful. 
Sasha From kliteyn at dev.mellanox.co.il Sun Oct 21 07:25:22 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 21 Oct 2007 16:25:22 +0200 Subject: [ofa-general] Re: [PATCH V2] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <20071021115936.GB21849@sashak.voltaire.com> References: <4714C9A1.5010304@dev.mellanox.co.il> <20071017221322.GN6945@sashak.voltaire.com> <471B1F67.60904@dev.mellanox.co.il> <20071021115936.GB21849@sashak.voltaire.com> Message-ID: <471B6152.2060402@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 11:44 Sun 21 Oct , Yevgeny Kliteynik wrote: >>>> +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED >>>> +* Name >>>> +* OSM_CAP2_IS_QOS_SUPPORTED >>>> +* >>>> +* DESCRIPTION >>>> +* QoS is supported >>>> +* >>>> +* SYNOPSIS >>>> +*/ >>>> +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) >>> This one is IB specific. I guess it should be somewhere in ib_types.h. >> Not sure I'm following. >> How is it different from other capability bits here? >> For instance, why is it more "IB specific" than OSM_CAP_IS_MULTIPATH_SUP? > > Good point, it is not more. I think all other IB spec constants should go > to ib_types.h too. OK, then we should move all these capability bits to ib_types in a separate patch. 
-- Yevgeny > Sasha > From hrosenstock at xsigo.com Sun Oct 21 07:37:56 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 21 Oct 2007 07:37:56 -0700 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/perfquery.c: Support PMAs which don't support AllPortSelect option In-Reply-To: <20071021135235.GL21849@sashak.voltaire.com> References: <1192577137.5921.176.camel@hrosenstock-ws.xsigo.com> <20071021120444.GE21849@sashak.voltaire.com> <1192971189.23494.382.camel@hrosenstock-ws.xsigo.com> <20071021135235.GL21849@sashak.voltaire.com> Message-ID: <1192977476.23494.392.camel@hrosenstock-ws.xsigo.com> On Sun, 2007-10-21 at 15:52 +0200, Sasha Khapyorsky wrote: > On 05:53 Sun 21 Oct , Hal Rosenstock wrote: > > On Sun, 2007-10-21 at 14:04 +0200, Sasha Khapyorsky wrote: > > > On 16:25 Tue 16 Oct , Hal Rosenstock wrote: > > > > infiniband-diags/perfquery.c: Support PMAs which don't support > > > > AllPortSelect option > > > > > > > > Currently only support single port HCAs are supported in this mode but > > > > this can be extended for other devices if needed > > > > > > > > Tested-by: Greg Kurtzer > > > > Signed-off-by: Hal Rosenstock > > > > > > Applied. Thanks. > > > > > > However I have a question below. > > > > > > > + if (allports == 1) { > > > > + > > > > + /* > > > > + * Simulate all ports support in PMA > > > > + * Determine node type, number of (physical) ports, > > > > + * and, if switch, whether SP0 is enhanced > > > > + * to determine first and last port to query > > > > + */ > > > > + > > > > + /* For now, support single port CAs */ > > > > + if (smp_query(data, &portid, IB_ATTR_NODE_INFO, 0, 0) < 0) > > > > + IBERROR("smp query nodeinfo failed"); > > > > + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); > > > > + if (node_type != IB_NODE_CA) /* NodeType other than CA ? 
*/ > > > > + IBERROR("smp query nodeinfo: Node type not CA"); > > > > > > Not supporting switches and routers > > > > and multiport CAs > > > > > is temporary limitation (like > > > all port simulation for single port CAs only), right? > > > > Temporary in the sense that it can be fixed but I have no current plan > > to do so as I am unaware of any practical need for this. > > Probably it would be better to have cleaner error message then - > something like "non-CA nodes are not supported yet." To be more precise, it would be "Only single port CAs currently supported". Feel free to change the message if you want. > > Are you aware of such a need ? > > Yes, assuming "all ports" queries are useful. Can you be more specific as to where ? Greg's email was the only indication of where this was needed. -- Hal > Sasha From tziporet at dev.mellanox.co.il Sun Oct 21 09:01:50 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 21 Oct 2007 18:01:50 +0200 Subject: [ewg] RE: [ofa-general] OFED 1.3 Alpha release is available In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901563FA5@mtlexch01.mtl.com> <471721B9.3090306@mellanox.co.il> Message-ID: <471B77EE.6070707@mellanox.co.il> Woodruff, Robert J wrote: > > Not sure I buy that argument. > > I think in the past we have had some features in OFED that were only > availible > on certain kernels/distros. If I recall, for example, I think that for > a while iser was not available for all kernels until the backport > patches were developed. > Not a problem that a ULP will not support all kernels (its even the status now of some drivers/ulps) But every ULP must have a maintainer with git tree on ofa server, supporting few of the relevant kernels. Also the maintainer must run some unit level testing to make sure its working. > Are you proposing we remove something that is in an upstream kernel > from OFED ? 
We never remove anything that is coming from Linux kernel > as I thought that generally all features that are upstream get included > into OFED and then perhaps some features that are not yet upstream > are added on. That seems to be the process that we have followed > in the past anyway. > > I agree - this is the way things are working Tziporet From sashak at voltaire.com Sun Oct 21 11:07:32 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 21 Oct 2007 20:07:32 +0200 Subject: [ofa-general] Re: [PATCH V2] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <471B6152.2060402@dev.mellanox.co.il> References: <4714C9A1.5010304@dev.mellanox.co.il> <20071017221322.GN6945@sashak.voltaire.com> <471B1F67.60904@dev.mellanox.co.il> <20071021115936.GB21849@sashak.voltaire.com> <471B6152.2060402@dev.mellanox.co.il> Message-ID: <20071021180732.GP21849@sashak.voltaire.com> On 16:25 Sun 21 Oct , Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: > > On 11:44 Sun 21 Oct , Yevgeny Kliteynik wrote: > >>>> +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED > >>>> +* Name > >>>> +* OSM_CAP2_IS_QOS_SUPPORTED > >>>> +* > >>>> +* DESCRIPTION > >>>> +* QoS is supported > >>>> +* > >>>> +* SYNOPSIS > >>>> +*/ > >>>> +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) > >>> This one is IB specific. I guess it should be somewhere in ib_types.h. > >> Not sure I'm following. > >> How is it different from other capability bits here? > >> For instance, why is it more "IB specific" than OSM_CAP_IS_MULTIPATH_SUP? > > Good point, it is not more. I think all other IB spec constants should go > > to ib_types.h too. > > OK, then we should move all these capability bits to > ib_types in a separate patch. Agreed. 
Sasha From sashak at voltaire.com Sun Oct 21 11:25:29 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 21 Oct 2007 20:25:29 +0200 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/perfquery.c: Support PMAs which don't support AllPortSelect option In-Reply-To: <1192977476.23494.392.camel@hrosenstock-ws.xsigo.com> References: <1192577137.5921.176.camel@hrosenstock-ws.xsigo.com> <20071021120444.GE21849@sashak.voltaire.com> <1192971189.23494.382.camel@hrosenstock-ws.xsigo.com> <20071021135235.GL21849@sashak.voltaire.com> <1192977476.23494.392.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071021182529.GQ21849@sashak.voltaire.com> On 07:37 Sun 21 Oct , Hal Rosenstock wrote: > > > > > > > is temporary limitation (like > > > > all port simulation for single port CAs only), right? > > > > > > Temporary in the sense that it can be fixed but I have no current plan > > > to do so as I am unaware of any practical need for this. > > > > Probably it would be better to have cleaner error message then - > > something like "non-CA nodes are not supported yet." > > To be more precise, it would be "Only single port CAs currently > supported". Feel free to change the message if you want. > > > > Are you aware of such a need ? > > > > Yes, assuming "all ports" queries are useful. > > Can you be more specific as to where ? Greg's email was the only > indication of where this was needed. perfquery is a generic utility, isn't it? Assuming it is, we should cover the general use case and not just specific ones.
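A minimal sketch of what covering the general case might involve, following the comment in the patch quoted earlier in this thread (determine the node type, the number of physical ports and, for a switch, whether SP0 is enhanced, then derive the first and last port to query). The names all_ports_range, struct port_range and the NODE_* values are hypothetical and are not taken from perfquery.c or libibmad. Treating routers like CAs, and starting at port 0 only for an enhanced SP0, are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node types for this sketch; not the libibmad constants. */
enum node_type { NODE_CA = 1, NODE_SWITCH = 2, NODE_ROUTER = 3 };

/* Inclusive range of port numbers to query one at a time. */
struct port_range {
	int first;
	int last;
};

/*
 * Pick the ports to walk when the PMA does not accept AllPortSelect.
 * Returns 0 on success, -1 if the node description is not usable.
 */
int all_ports_range(enum node_type type, int num_ports, int enhanced_sp0,
		    struct port_range *r)
{
	assert(r != NULL);

	if (num_ports < 1)
		return -1;

	switch (type) {
	case NODE_SWITCH:
		/* Assumption: only an enhanced switch port 0 is worth querying. */
		r->first = enhanced_sp0 ? 0 : 1;
		break;
	case NODE_CA:
	case NODE_ROUTER:
		/* Assumption: external ports are numbered from 1. */
		r->first = 1;
		break;
	default:
		return -1;
	}
	r->last = num_ports;
	return 0;
}
```

A caller would then loop from first to last, issue one per-port query for each, and aggregate the counters itself.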
Sasha From hrosenstock at xsigo.com Sun Oct 21 14:44:04 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 21 Oct 2007 14:44:04 -0700 Subject: [ofa-general] Re: [PATCHv2] infiniband-diags/perfquery.c: Support PMAs which don't support AllPortSelect option In-Reply-To: <20071021182529.GQ21849@sashak.voltaire.com> References: <1192577137.5921.176.camel@hrosenstock-ws.xsigo.com> <20071021120444.GE21849@sashak.voltaire.com> <1192971189.23494.382.camel@hrosenstock-ws.xsigo.com> <20071021135235.GL21849@sashak.voltaire.com> <1192977476.23494.392.camel@hrosenstock-ws.xsigo.com> <20071021182529.GQ21849@sashak.voltaire.com> Message-ID: <1193003044.23494.397.camel@hrosenstock-ws.xsigo.com> On Sun, 2007-10-21 at 20:25 +0200, Sasha Khapyorsky wrote: > On 07:37 Sun 21 Oct , Hal Rosenstock wrote: > > > > > > > > > is temporary limitation (like > > > > > all port simulation for single port CAs only), right? > > > > > > > > Temporary in the sense that it can be fixed but I have no current plan > > > > to do so as I am unaware of any practical need for this. > > > > > > Probably it would be better to have cleaner error message then - > > > something like "non-CA nodes are not supported yet." > > > > To be more precise, it would be "Only single port CAs currently > > supported". Feel free to change the message if you want. > > > > > > Are you aware of such a need ? > > > > > > Yes, assuming "all ports" queries are useful. > > > > Can you be more specific as to where ? Greg's email was the only > > indication of where this was needed. > > perfquery is generic utility. Isn't it? Assumig it is we should cover > general usage case and not just specifics. As I said, I have no current plan for this. You are welcome to fix the general case if you think it is important now. 
-- Hal > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Sun Oct 21 15:06:37 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 21 Oct 2007 15:06:37 -0700 Subject: [ofa-general] Re: [PATCH] mlx4/core: increase command timeout for INIT_HCA to 10 seconds In-Reply-To: <200710211203.02021.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 21 Oct 2007 12:03:01 +0200") References: <200710211203.02021.jackm@dev.mellanox.co.il> Message-ID: thanks, applied. From rdreier at cisco.com Sun Oct 21 19:22:27 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 21 Oct 2007 19:22:27 -0700 Subject: [ofa-general] [PATCH 14/14 v2] nes: kernel build infrastructure In-Reply-To: <200710192028.l9JKSnZG021867@neteffect.com> (ggrundstrom@neteffect.com's message of "Fri, 19 Oct 2007 15:28:49 -0500") References: <200710192028.l9JKSnZG021867@neteffect.com> Message-ID: > + > +EXTRA_CFLAGS += -DNES_MINICM I don't see anyplace NES_MINICM is used. Delete this line? > + > +obj-$(CONFIG_INFINIBAND_NES) += iw_nes.o > + > +iw_nes-objs := nes.o nes_hw.o nes_nic.o nes_utils.o nes_verbs.o nes_cm.o > + Also the file has an extra blank line at the beginning and end. Might as well kill them.
From rdreier at cisco.com Sun Oct 21 19:23:50 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 21 Oct 2007 19:23:50 -0700 Subject: [ofa-general] Re: [PATCH 13/14 v2] nes: kernel build infrastructure In-Reply-To: <200710192027.l9JKR197021855@neteffect.com> (ggrundstrom@neteffect.com's message of "Fri, 19 Oct 2007 15:27:01 -0500") References: <200710192027.l9JKR197021855@neteffect.com> Message-ID: > +config INFINIBAND_NES_DEBUG > + bool "Verbose debugging output" > + depends on INFINIBAND_NES > + default n > + ---help--- > + This option causes the NetEffect RNIC driver to produce debug > + messages. Select this if you are developing the driver > + or trying to diagnose a problem. If you make this default n then no distro will have it enabled and you'll have to rebuild to debug anything. Better to have the default be enabled and make it controllable at runtime too with a module parameter. (you can look at what mthca does for an example of what I mean)
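Roland's suggestion for INFINIBAND_NES_DEBUG (compile the debug code in by default, then gate it at runtime) can be sketched in plain userspace C as below. This is only an illustration of the pattern, not the nes or mthca code: the DBG_* categories, set_debug_level() and dbg() are hypothetical names, and the setter stands in for what a kernel driver would normally expose through module_param().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical debug categories; the names are illustrative only. */
enum {
	DBG_INIT = 1u << 0,
	DBG_CM   = 1u << 1,
	DBG_HW   = 1u << 2
};

/* Stand-in for a writable module parameter: 0 means quiet. */
static unsigned int debug_level;

/* Userspace substitute for changing the parameter at runtime. */
void set_debug_level(unsigned int mask)
{
	debug_level = mask;
}

/*
 * Print a debug message only if its category is enabled.
 * Returns the number of characters written, or 0 when suppressed.
 */
int dbg(unsigned int category, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(fmt != NULL);
	if (!(debug_level & category))
		return 0;

	va_start(ap, fmt);
	n = vfprintf(stderr, fmt, ap);
	va_end(ap);
	return n;
}
```

With this shape the messages cost one mask test when disabled, so a distro can ship the driver with debugging built in and users can turn on only the categories they need without rebuilding.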
From jackm at dev.mellanox.co.il Mon Oct 22 01:13:39 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 22 Oct 2007 10:13:39 +0200 Subject: [ofa-general] __always_inline macro usage Message-ID: <200710221013.40112.jackm@dev.mellanox.co.il> I noticed in libmlx4 that in your commit 338a180f3ca81d12dbc5b6587433d557769ee098 (factor out setting WQE segment entries), you introduced using the __always_inline macro. Several of the GCC compilers (I tried out gcc 4.1.1 on Red Hat Enterprise Linux 5, gcc 4.1.0 on SuSE SLES 10, and gcc 3.4.6 on RHEL4 update 5) do not recognize this macro (and consequently emit a compilation error). However, they all accepted: __attribute__ ((always_inline)) (and, for "-O2", they all behaved nicely -- not emitting redundant "non-inline" copies of the function). How about changing your instances of "__always_inline" to "__attribute__ ((always_inline))"? (I notice that you completely eliminated use of __always_inline for set_data_seg() in a subsequent patch. However, the following prototype remains: static __always_inline void set_raddr_seg(struct mlx4_wqe_raddr_seg *rseg, uint64_t remote_addr, uint32_t rkey) ) - Jack From koen.segers at vrt.be Mon Oct 22 02:06:07 2007 From: koen.segers at vrt.be (Koen Segers) Date: Mon, 22 Oct 2007 11:06:07 +0200 Subject: [ofa-general] Expected RDMA performance In-Reply-To: <6.2.0.14.2.20071019090759.02e24a00@esmail.cup.hp.com> References: <20 0710191720.58526.cap@nsc.liu.se> <6.2.0.14.2.20071019090759.02e24a00@esmail.cup.hp.com> Message-ID: <1193043967.6395.22.camel@koenVRT> On Fri, 2007-10-19 at 09:09 -0700, Michael Krause wrote: > At 08:20 AM 10/19/2007, Peter Kjellstrom wrote: > > On Thursday 18 October 2007, Chuck Hartley wrote: > > ... > > > 8388608 5000 1342.12 1342.12 > > > ------------------------------------------------------------------ > > > > > > Is this typical RDMA performance? > > > > It's close to what I've seen on similar hw. 
~1400 is what you can > > push through > > the 8x pci-e of the intel 5000 chipset (confirmed by trying 4x pci-e > > which > > has shown ~700). > > > > > What is the maximum theoretical BW for > > > DDR IB - 1525MB/sec? > > > > No, it's 20 Gbps on the wire and 8/10 encoded so 16 Gbps effective > > which is > > 2000 MB/s (10-base) and 1907 MiB/s (2-base). > > There is also IB protocol overhead combined with driver / device > control traffic overhead (consumes device as well as PCI resources / > bandwidth), end-to-end control traffic which is also a function of > how the application is constructed. In general, hitting about 80-85% > of the theoretical maximum is possible. I'm very interested in this result. Can you elaborate this a bit more? Has anyone documented the ib traffic control mechanism? Regards, Koen Segers > > > On our system (with a different HCA) we see quite a difference with > > snoop-filter off (bios option). With snoop off (our) application > > performance > > goes up (not very suprising) but IB performance goes down (latency > > 0.4us > > worse and bw ~1400->1200). 
> > Mike > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general *** Disclaimer *** Vlaamse Radio- en Televisieomroep Auguste Reyerslaan 52, 1043 Brussel nv van publiek recht BTW BE 0244.142.664 RPR Brussel http://www.vrt.be/disclaimer From vlad at lists.openfabrics.org Mon Oct 22 02:56:16 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 22 Oct 2007 02:56:16 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071022-0200 daily build status Message-ID: <20071022095616.2BA2DE60842@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on powerpc with 
linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.23 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Mon Oct 22 03:30:00 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 22 Oct 2007 12:30:00 +0200 Subject: [ofa-general] [PATCH] opensm: DOR (Dimension Order Routing) routing engine Message-ID: <20071022103000.GR21849@sashak.voltaire.com> From: Dale Purdy The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. 
When there are multiple links between any two switches, they still represent only one dimension and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable, and the other port on the other end, continuing along the mesh dimension. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_switch.h | 4 ++++ opensm/man/opensm.8 | 32 ++++++++++++++++++++++++++++++-- opensm/opensm/main.c | 2 +- opensm/opensm/osm_dump.c | 11 ++++++++++- opensm/opensm/osm_opensm.c | 1 + opensm/opensm/osm_switch.c | 15 +++++++++++++++ opensm/opensm/osm_ucast_mgr.c | 10 ++++++++-- 7 files changed, 69 insertions(+), 6 deletions(-) diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index e294527..e2fe86d 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -958,6 +958,7 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw, IN osm_port_t * p_port, IN const uint16_t lid_ho, IN const boolean_t ignore_existing, + IN const boolean_t dor, IN OUT uint64_t * remote_sys_guids, IN OUT uint16_t * p_num_used_sys, IN OUT uint64_t * remote_node_guids, @@ -980,6 +981,9 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw, * If false, the switch will choose an existing route if one * exists, otherwise will choose the optimal route. * +* dor +* [in] If TRUE, Dimension Order Routing will be done. +* * remote_sys_guids * [in out] The array of remote system guids already used to * route the other lids of the same target port (if LMC > 0). 
diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8 index a0178df..2bdea8e 100644 --- a/opensm/man/opensm.8 +++ b/opensm/man/opensm.8 @@ -92,7 +92,7 @@ LID assignments resolving multiple use of same LID. \fB\-R\fR, \fB\-\-routing_engine\fR This option chooses routing engine instead of Min Hop algorithm (default). -Supported engines: updn, file, ftree, lash +Supported engines: updn, file, ftree, lash, dor .TP \fB\-z\fR, \fB\-\-connect_roots\fR This option enforces a routing engine (currently up/down @@ -452,7 +452,7 @@ Examples: .SH ROUTING .PP -OpenSM now offers four routing engines: +OpenSM now offers five routing engines: 1. Min Hop Algorithm - based on the minimum hops to each node where the path length is optimized. @@ -474,6 +474,12 @@ distributing the paths between layers. LASH is an alternative deadlock-free topology-agnostic routing algorithm to the non-minimal UPDN algorithm avoiding the use of a potentially congested root node. +5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but +avoids port equalization except for redundant links between the same +two switches. This provides deadlock free routes for hypercubes when +the fabric is cabled as a hypercube and for meshes when cabled as a +mesh (see details below). + OpenSM also supports a file method which can load routes from a table. See \'Modular Routing Engine\' for more information on this. @@ -742,6 +748,28 @@ Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead. +DOR Routing Algorithm + +The Dimension Order Routing algorithm is based on the Min Hop +algorithm and so uses shortest paths. Instead of spreading traffic +out across different paths with the same shortest distance, it chooses +among the available shortest paths based on an ordering of dimensions. +Each port must be consistently cabled to represent a hypercube +dimension or a mesh dimension. 
Paths are grown from a destination +back to a source using the lowest dimension (port) of available paths +at each step. This provides the ordering necessary to avoid deadlock. +When there are multiple links between any two switches, they still +represent only one dimension and traffic is balanced across them +unless port equalization is turned off. In the case of hypercubes, +the same port must be used throughout the fabric to represent the +hypercube dimension and match on both ends of the cable. In the case +of meshes, the dimension should consistently use the same pair of +ports, one port on one end of the cable, and the other port on the +other end, continuing along the mesh dimension. + +Use '-R dor' option to activate the DOR algorithm. + + Routing References To learn more about deadlock-free routing, see the article diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 099a8d1..5771e9e 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -174,7 +174,7 @@ void show_usage(void) "--routing_engine \n" " This option chooses routing engine instead of Min Hop\n" " algorithm (default).\n" - " Supported engines: updn, file, ftree, lash\n\n"); + " Supported engines: updn, file, ftree, lash, dor\n\n"); printf("-z\n" "--connect_roots\n" " This option enforces a routing engine (currently\n" diff --git a/opensm/opensm/osm_dump.c b/opensm/opensm/osm_dump.c index b7d99b2..fa07f83 100644 --- a/opensm/opensm/osm_dump.c +++ b/opensm/opensm/osm_dump.c @@ -136,6 +136,7 @@ static void dump_ucast_routes(cl_map_item_t * p_map_item, void *cxt) uint16_t max_lid_ho; uint16_t lid_ho, base_lid; boolean_t direct_route_exists = FALSE; + boolean_t dor; osm_switch_t *p_sw = (osm_switch_t *) p_map_item; osm_opensm_t *p_osm = ((struct dump_context *)cxt)->p_osm; FILE *file = ((struct dump_context *)cxt)->file; @@ -148,6 +149,10 @@ static void dump_ucast_routes(cl_map_item_t * p_map_item, void *cxt) "Switch 0x%016" PRIx64 "\n" "LID : Port : Hops : Optimal\n", 
cl_ntoh64(osm_node_get_node_guid(p_node))); + + dor = (p_osm->routing_engine.name && + (strcmp(p_osm->routing_engine.name, "dor") == 0)); + for (lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++) { fprintf(file, "0x%04X : ", lid_ho); @@ -228,7 +233,11 @@ static void dump_ucast_routes(cl_map_item_t * p_map_item, void *cxt) if (best_hops == num_hops) fprintf(file, "yes"); else { - best_port = osm_switch_recommend_path(p_sw, p_port, lid_ho, TRUE, NULL, NULL, NULL, NULL); /* No LMC Optimization */ + /* No LMC Optimization */ + best_port = osm_switch_recommend_path(p_sw, p_port, + lid_ho, TRUE, dor, + NULL, NULL, + NULL, NULL); fprintf(file, "No %u hop path possible via port %u!", best_hops, best_port); } diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c index 329305e..5b45401 100644 --- a/opensm/opensm/osm_opensm.c +++ b/opensm/opensm/osm_opensm.c @@ -81,6 +81,7 @@ const static struct routing_engine_module routing_modules[] = { {"file", osm_ucast_file_setup}, {"ftree", osm_ucast_ftree_setup}, {"lash", osm_ucast_lash_setup}, + {"dor", osm_ucast_null_setup }, {NULL, NULL} }; diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index 5a636a2..bf686ad 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -224,6 +224,7 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw, IN osm_port_t * p_port, IN const uint16_t lid_ho, IN const boolean_t ignore_existing, + IN const boolean_t dor, IN OUT uint64_t * remote_sys_guids, IN OUT uint16_t * p_num_used_sys, IN OUT uint64_t * remote_node_guids, @@ -267,6 +268,7 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw, osm_physp_t *p_physp; osm_physp_t *p_rem_physp; osm_node_t *p_rem_node; + osm_node_t *p_rem_node_first = NULL; CL_ASSERT(lid_ho > 0); @@ -430,6 +432,19 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw, the count is min but also lower then the max subscribed */ if (check_count < least_paths) { + if (dor) { + /* Get the Remote Node */ + 
p_rem_physp = osm_physp_get_remote(p_physp); + p_rem_node = osm_physp_get_node_ptr(p_rem_physp); + /* use the first dimension, but spread + * traffic out among the group of ports + * representing that dimension */ + if (port_found) { + if (p_rem_node != p_rem_node_first) + continue; + } else + p_rem_node_first = p_rem_node; + } port_found = TRUE; best_port = port_num; least_paths = check_count; diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 2a5fe88..43c2647 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -211,6 +211,8 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, uint8_t port; boolean_t is_ignored_by_port_prof; ib_net64_t node_guid; + struct osm_routing_engine *p_routing_eng; + boolean_t dor; /* The following are temporary structures that will aid in providing better routing in LMC > 0 situations @@ -274,6 +276,9 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, node_guid = osm_node_get_node_guid(p_sw->p_node); + p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; + dor = p_routing_eng->name && (strcmp(p_routing_eng->name, "dor") == 0); + /* The lid matrix contains the number of hops to each lid from each port. From this information we determine @@ -286,6 +291,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, port = osm_switch_recommend_path(p_sw, p_port, lid_ho, p_mgr->p_subn-> ignore_existing_lfts, + dor, remote_sys_guids, &num_used_sys, remote_node_guids, @@ -294,6 +300,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, port = osm_switch_recommend_path(p_sw, p_port, lid_ho, p_mgr->p_subn-> ignore_existing_lfts, + dor, NULL, NULL, NULL, NULL); @@ -306,8 +313,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, /* Up/Down routing can cause unreachable routes between some switches so we do not report that as an error in that case */ - if (!p_mgr->p_subn->p_osm->routing_engine. 
- build_lid_matrices) { + if (!p_routing_eng->build_lid_matrices) { osm_log(p_mgr->p_log, OSM_LOG_ERROR, "__osm_ucast_mgr_process_port: ERR 3A08: " "No path to get to LID 0x%X from switch 0x%" -- 1.5.3.4.206.g58ba4 From sashak at voltaire.com Mon Oct 22 03:38:48 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 22 Oct 2007 12:38:48 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_mgr: trivial improvements In-Reply-To: <20071022103000.GR21849@sashak.voltaire.com> References: <20071022103000.GR21849@sashak.voltaire.com> Message-ID: <20071022103848.GS21849@sashak.voltaire.com> Some trivial improvements: make the is_dor boolean a member of the osm_ucast_mgr object and remove the unused default_routing variable. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_ucast_mgr.h | 4 ++++ opensm/opensm/osm_ucast_mgr.c | 17 +++++++---------- 2 files changed, 11 insertions(+), 10 deletions(-) diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index c3c26e4..88d8cca 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -100,6 +100,7 @@ typedef struct _osm_ucast_mgr { osm_req_t *p_req; osm_log_t *p_log; cl_plock_t *p_lock; + boolean_t is_dor; boolean_t any_change; boolean_t some_hop_count_set; uint8_t *lft_buf; @@ -118,6 +119,9 @@ typedef struct _osm_ucast_mgr { * p_lock * Pointer to the serializing lock.
* +* is_dor +* Dimension Order Routing (DOR) will be done +* * any_change * Initialized to FALSE at the beginning of the algorithm, * set to TRUE by osm_ucast_mgr_set_fwd_table() if any mad diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 43c2647..e708508 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -212,7 +212,6 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, boolean_t is_ignored_by_port_prof; ib_net64_t node_guid; struct osm_routing_engine *p_routing_eng; - boolean_t dor; /* The following are temporary structures that will aid in providing better routing in LMC > 0 situations @@ -277,7 +276,6 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, node_guid = osm_node_get_node_guid(p_sw->p_node); p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; - dor = p_routing_eng->name && (strcmp(p_routing_eng->name, "dor") == 0); /* The lid matrix contains the number of hops to each @@ -291,7 +289,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, port = osm_switch_recommend_path(p_sw, p_port, lid_ho, p_mgr->p_subn-> ignore_existing_lfts, - dor, + p_mgr->is_dor, remote_sys_guids, &num_used_sys, remote_node_guids, @@ -300,7 +298,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, port = osm_switch_recommend_path(p_sw, p_port, lid_ho, p_mgr->p_subn-> ignore_existing_lfts, - dor, + p_mgr->is_dor, NULL, NULL, NULL, NULL); @@ -772,13 +770,15 @@ osm_signal_t osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) struct osm_routing_engine *p_routing_eng; osm_signal_t signal = OSM_SIGNAL_DONE; cl_qmap_t *p_sw_guid_tbl; - boolean_t default_routing = TRUE; OSM_LOG_ENTER(p_mgr->p_log, osm_ucast_mgr_process); p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl; p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; + p_mgr->is_dor = p_routing_eng->name + && (strcmp(p_routing_eng->name, "dor") == 0); + CL_PLOCK_EXCL_ACQUIRE(p_mgr->p_lock); /* @@ -803,11 +803,8 
@@ osm_signal_t osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) Now that the lid matrices have been built, we can build and download the switch forwarding tables. */ - if (p_routing_eng->ucast_build_fwd_tables && - (p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == - 0)) - default_routing = FALSE; - else + if (!p_routing_eng->ucast_build_fwd_tables || + !p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context)) cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr); -- 1.5.3.4.206.g58ba4 From jackm at dev.mellanox.co.il Mon Oct 22 06:30:39 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 22 Oct 2007 15:30:39 +0200 Subject: [ofa-general] [PATCH] libmlx4: fix thinko in headroom marking order commit Message-ID: <200710221530.39454.jackm@dev.mellanox.co.il> Fix thinko bug in commit c45efd89ef667b30b84e4f63d8c712d1ebcabde2, wherein s/g entries were written in forward (rather than reverse) order. Signed-off-by: Jack Morgenstein diff --git a/src/qp.c b/src/qp.c index 8213533..b82029c 100644 --- a/src/qp.c +++ b/src/qp.c @@ -344,7 +344,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, } else { struct mlx4_wqe_data_seg *seg = wqe; - for (i = 0; i < wr->num_sge; ++i) + for (i = wr->num_sge - 1; i >= 0 ; --i) set_data_seg(seg + i, wr->sg_list + i); size += wr->num_sge * (sizeof *seg / 16); From tziporet at mellanox.co.il Mon Oct 22 06:49:09 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 22 Oct 2007 15:49:09 +0200 Subject: [ofa-general] Agenda for OFED meeting today (Oct 22) Message-ID: <6C2C79E72C305246B504CBA17B5500C90156407C@mtlexch01.mtl.com> This is the agenda for the OFED meeting today 1. Alpha release status: * Each company that conducts testing should report its progress 2. MPI status: * MVAPICH - when will the new 1.0 package be ready for integration in OFED - DK * Open MPI - is the new version the final version - Jeff 3.
Review the tasks that should be completed for the beta: 1. Integrate all SDP features - Jim (Mellanox) 2. Complete RDS work - Vlad (Mellanox) 3. Apply patches that fix warnings in backport patches - Vlad (Mellanox) 4. Fix compilation problems on PPC - Vlad (Mellanox) 5. Add qperf test from Qlogic - Johann (Qlogic) 6. Rebase kernel code on 2.6.24 rc1 (depending on its availability) 7. Support RHEL 5 up1 - Woody & Vlad - done 8. SPEC files should be part of each user space package - each owner should take the spec file * Any other task that must be completed for the beta? 4. I suggest we start having weekly meetings to track the release progress. Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 From vlad at dev.mellanox.co.il Mon Oct 22 09:23:38 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 22 Oct 2007 18:23:38 +0200 Subject: [ofa-general] Agenda for OFED meeting today (Oct 22) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90156407C@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C90156407C@mtlexch01.mtl.com> Message-ID: <471CCE8A.3000004@dev.mellanox.co.il> Tziporet Koren wrote: > This is the agenda for the PFED meeting today > > 1. Alpha release status: > > o Each company that conduct testing should report its progress > > 2. MPI status: > > * MVAPICH - When the new 1.0 package will be ready for integration > in OFED - DK > * Open MPI - does the new version is the final version - Jeff > > 3. Review the tasks that should completed for the beta: > > 1. Integrate all SDP features - Jim (Mellanox) > 2. Complete RDS work - Vlad (Mellanox) Coding done. Under test. > 3. Apply patches that fix warning of backport patches - Vlad > (Mellanox) In progress. > 4. Fix compilation problems on PPC - Vlad (Mellanox) In progress. > 5.
Add qperf test from Qlogic - Johann (Qlogic) Waiting to get the qperf package from Johann > 6. Rebase kernel code on 2.6.24 rc1 (depending it's availability) > 7. Support RHEL 5 up1 - Woody & Vlad - done > 8. SPEC files should be part of each user space package - each > owner should take the spec file > > * Any other task that must be completed for the beta? > > 4. I suggest to start having a weekly meetings to track the release > progress. > > > Tziporet Koren > Software Director > Mellanox Technologies > mailto: _tziporet at mellanox.co.il_ > Tel +972-4-9097200, ext 380 > Regards, Vladimir From krause at cup.hp.com Mon Oct 22 12:52:56 2007 From: krause at cup.hp.com (Michael Krause) Date: Mon, 22 Oct 2007 12:52:56 -0700 Subject: [ofa-general] Expected RDMA performance In-Reply-To: <200710202333.56962.cap@nsc.liu.se> References: <200710191720.58526.cap@nsc.liu.se> <6.2.0.14.2.20071019090759.02e24a00@esmail.cup.hp.com> <200710202333.56962.cap@nsc.liu.se> Message-ID: <6.2.0.14.2.20071022124627.02a87a10@esmail.cup.hp.com> At 02:33 PM 10/20/2007, Peter Kjellstrom wrote: >On Friday 19 October 2007, Michael Krause wrote: > > At 08:20 AM 10/19/2007, Peter Kjellstrom wrote: > > >On Thursday 18 October 2007, Chuck Hartley wrote: >... > > > > What is the maximum theoretical BW for > > > > DDR IB - 1525MB/sec? > > > > > >No, it's 20 Gbps on the wire and 8/10 encoded so 16 Gbps effective which > > > is 2000 MB/s (10-base) and 1907 MiB/s (2-base). > > > > There is also IB protocol overhead combined with driver / device control > > traffic overhead (consumes device as well as PCI resources / bandwidth), > > end-to-end control traffic which is also a function of how the application > > is constructed. In general, hitting about 80-85% of the theoretical > > maximum is possible. > >IB can do much better than that. On an SDR system I typically get 950 MB/s >(10-base), 95%. This on 8x pci-express so the limitations of pci-e above does >not bite. 
If IB DDR could strech it's legs (if we had faster pci-e, say >pci-e-2.0...) then maybe we would see 95% there too :-). While there are certainly marketing workloads that can hit such high efficiencies, the number of real world workloads is rather small. There was one interconnect provider a few years back who used to demonstrate 95+% efficiency post 8b/10b encoding overhead by sending 1MB messages so the host interaction was to pull a single work request and then just issue DMA Read Requests. Just like that they were at link rate. However, most workloads are not single streams but a mix of streams with varying work request rates, sizes, etc. I don't doubt that one can hit higher rates than 80-85% but expect most workloads to rarely exceed this value. A couple years ago a reporter asked me about why some interconnects are at 50-60% efficiency when measured in real environments. We walked through the host / device as well as driver interactions, the ability of the platform to actually generate useful I/O work (some are processor / memory limited so improvements in the I/O subsystems has no real ROI), the protocol overheads, etc. He was trying ascertain whether there was a story here about vendors basically hyping their technology using the various marketing numbers when in reality they could not actually deliver the performance under more than a contrived or limited set of workloads. I convinced him there was no story here but it did illustrate my earlier industry talks about marketing hype vs. reality and how marketing does a great deal of harm due to lost credibility when it comes to running real world applications. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From krause at cup.hp.com Mon Oct 22 12:57:59 2007 From: krause at cup.hp.com (Michael Krause) Date: Mon, 22 Oct 2007 12:57:59 -0700 Subject: [ofa-general] Expected RDMA performance In-Reply-To: <1193043967.6395.22.camel@koenVRT> References: <200710191720.58526.cap@nsc.liu.se> <6.2.0.14.2.20071019090759.02e24a00@esmail.cup.hp.com> <1193043967.6395.22.camel@koenVRT> Message-ID: <6.2.0.14.2.20071022125302.02257f00@esmail.cup.hp.com> At 02:06 AM 10/22/2007, Koen Segers wrote: >On Fri, 2007-10-19 at 09:09 -0700, Michael Krause wrote: > > At 08:20 AM 10/19/2007, Peter Kjellstrom wrote: > > > On Thursday 18 October 2007, Chuck Hartley wrote: > > > ... > > > > 8388608 5000 1342.12 1342.12 > > > > ------------------------------------------------------------------ > > > > > > > > Is this typical RDMA performance? > > > > > > It's close to what I've seen on similar hw. ~1400 is what you can > > > push through > > > the 8x pci-e of the intel 5000 chipset (confirmed by trying 4x pci-e > > > which > > > has shown ~700). > > > > > > > What is the maximum theoretical BW for > > > > DDR IB - 1525MB/sec? > > > > > > No, it's 20 Gbps on the wire and 8/10 encoded so 16 Gbps effective > > > which is > > > 2000 MB/s (10-base) and 1907 MiB/s (2-base). > > > > There is also IB protocol overhead combined with driver / device > > control traffic overhead (consumes device as well as PCI resources / > > bandwidth), end-to-end control traffic which is also a function of > > how the application is constructed. In general, hitting about 80-85% > > of the theoretical maximum is possible. > > >I'm very interested in this result. Can you elaborate this a bit more? In what regard? >Has anyone documented the ib traffic control mechanism? Driver-to-device interactions consume resources and contend for local I/O bandwidth / local device processing Application-to-device interactions have similar impacts ULP exchanges such as SEND operations to communicate protection keys, addresses, etc.
Host OS / application execution to generate work as well as schedule / process work and deal with any interrupts / polling mechanisms. This can lead from zero to significant delays resulting in burst style traffic patterns. Also many workloads may be small transaction dominate so their efficiency will be significantly lower than one that is large transaction dominant. And so forth. There are many variables and mileage will vary as a result. Some will do quite well while others will not. This is why a good range of benchmarks is required to evaluate whether a given solution is reasonable for the targeted workloads or problem space. Anyone can contrive something to do outstanding at one thing while completely biting it in other areas. Mike >Regards, > >Koen Segers > > > > > On our system (with a different HCA) we see quite a difference with > > > snoop-filter off (bios option). With snoop off (our) application > > > performance > > > goes up (not very suprising) but IB performance goes down (latency > > > 0.4us > > > worse and bw ~1400->1200). > > > > Mike > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general >*** Disclaimer *** > >Vlaamse Radio- en Televisieomroep >Auguste Reyerslaan 52, 1043 Brussel > >nv van publiek recht >BTW BE 0244.142.664 >RPR Brussel >http://www.vrt.be/disclaimer > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jim at mellanox.com Mon Oct 22 13:24:32 2007 From: jim at mellanox.com (Jim Mott) Date: Mon, 22 Oct 2007 13:24:32 -0700 Subject: [ofa-general] [PATCH 1/1] SDP - Bug644 fix (DisConn, ChRcvBuf and ChRcvBufAck are sent solicited) Message-ID: This patch brings the implementation into compliance with A4.11.2 which says that these messages are not to be sent as solicited. 
Signed-off-by: Jim Mott This patch was created by Ami Perlmuter on May 30, 2007 and attached to bug 644. I missed getting it into 1.2.5. --- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-10-10 09:04:36.000000000 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-10-10 15:37:57.000000000 -0500 @@ -163,8 +163,7 @@ void sdp_post_send(struct sdp_sock *ssk, ssk->tx_wr.num_sge = frags + 1; ssk->tx_wr.opcode = IB_WR_SEND; ssk->tx_wr.send_flags = IB_SEND_SIGNALED; - if (unlikely(mid != SDP_MID_DATA) || - unlikely(TCP_SKB_CB(skb)->flags & TCPCB_URG)) + if (unlikely(TCP_SKB_CB(skb)->flags & TCPCB_URG)) ssk->tx_wr.send_flags |= IB_SEND_SOLICITED; rc = ib_post_send(ssk->qp, &ssk->tx_wr, &bad_wr); ++ssk->tx_head; From jim at mellanox.com Mon Oct 22 13:33:07 2007 From: jim at mellanox.com (Jim Mott) Date: Mon, 22 Oct 2007 13:33:07 -0700 Subject: [ofa-general] [PATCH 1/1] SDP - Bug646 (DisConn message is sent even if there is 1 send credit and no credit update is sent) Message-ID: Compliance with CA4-82: If one credit is available, an implementation shall only send SDP messages that provide additional credits and also do not contain ULP payload. Signed-off-by: Jim Mott This patch was created by Ami Perlmuter on May 30, 2007 and attached to bug 646. I missed getting it into 1.2.5. 
--- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-10-10 15:32:02.000000000 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-10-10 15:37:51.000000000 -0500 @@ -476,7 +476,7 @@ void sdp_post_sends(struct sdp_sock *ssk if (unlikely((1 << ssk->isk.sk.sk_state) & (TCPF_FIN_WAIT1 | TCPF_LAST_ACK)) && !ssk->isk.sk.sk_send_head && - ssk->bufs) { + ssk->bufs > (ssk->remote_credits >= ssk->rx_head - ssk->rx_tail)) { skb = sk_stream_alloc_skb(&ssk->isk.sk, sizeof(struct sdp_bsdh), GFP_KERNEL); From cap at nsc.liu.se Mon Oct 22 13:49:39 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Mon, 22 Oct 2007 22:49:39 +0200 Subject: [ofa-general] Expected RDMA performance In-Reply-To: <6.2.0.14.2.20071022124627.02a87a10@esmail.cup.hp.com> References: <200710202333.56962.cap@nsc.liu.se> <6.2.0.14.2.20071022124627.02a87a10@esmail.cup.hp.com> Message-ID: <200710222249.39708.cap@nsc.liu.se> On Monday 22 October 2007, Michael Krause wrote: ... > > > There is also IB protocol overhead combined with driver / device > > > control traffic overhead (consumes device as well as PCI resources / > > > bandwidth), end-to-end control traffic which is also a function of how > > > the application is constructed. In general, hitting about 80-85% of > > > the theoretical maximum is possible. > > > >IB can do much better than that. On an SDR system I typically get 950 MB/s > >(10-base), 95%. This on 8x pci-express so the limitations of pci-e above > > does not bite. If IB DDR could strech it's legs (if we had faster pci-e, > > say pci-e-2.0...) then maybe we would see 95% there too :-). > > While there are certainly marketing workloads that can hit such high > efficiencies, the number of real world workloads is rather small. Wow, slow down. Who said real workloads? And what is a real workload anyway?
I don't care since that wasn't the issue here. The OP asked about ib_write_bw (or similar) _not_ what you could expect running some random application. My answer included comments on how his figures matched his hardware, a bit on why he didn't see 1907 MiB/s etc. I think that nicely ends this thread. /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From jim at mellanox.com Mon Oct 22 13:52:24 2007 From: jim at mellanox.com (Jim Mott) Date: Mon, 22 Oct 2007 13:52:24 -0700 Subject: [ofa-general] [PATCH 1/1] SDP - Bug647 (size recieved from ChRcvBuf is never checked to see if it is in acceptable range) Message-ID: Clean up the buffer resize code to comply with CA4-83: Upon receipt of ChRcvBuf message, the remote peer shall not change the buffer size in the direction opposite of that requested. Also add some comments and pretty up the code. Signed-off-by: Jim Mott This patch was created by Ami Perlmuter on May 30, 2007 and attached to bug 647 (and duplicate bug 640). I missed getting it into 1.2.5. 
--- Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp.h =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp.h 2007-10-10 15:36:46.000000000 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp.h 2007-10-10 15:37:35.000000000 -0500 @@ -148,13 +148,16 @@ struct sdp_sock { struct ib_send_wr tx_wr; /* SDP slow start */ - int rcvbuf_scale; - int sent_request; - int sent_request_head; - int recv_request_head; - int recv_request; - int recv_frags; - int send_frags; + int rcvbuf_scale; /* local recv buf scale for each socket */ + int sent_request_head; /* mark the tx_head of the last send resize + request */ + int sent_request; /* 0 - not sent yet, 1 - request pending + -1 - resize done succesfully */ + int recv_request_head; /* mark the rx_head when the resize request + was recieved */ + int recv_request; /* flag if request to resize was recieved */ + int recv_frags; /* max skb frags in recv packets */ + int send_frags; /* max skb frags in send packets */ struct ib_sge ibsge[SDP_MAX_SEND_SKB_FRAGS + 1]; struct ib_wc ibwc[SDP_NUM_WC]; @@ -227,9 +230,10 @@ struct sk_buff *sdp_recv_completion(stru struct sk_buff *sdp_send_completion(struct sdp_sock *ssk, int mseq); void sdp_urg(struct sdp_sock *ssk, struct sk_buff *skb); void sdp_add_sock(struct sdp_sock *ssk); +void sdp_remove_sock(struct sdp_sock *ssk); +void sdp_remove_large_sock(struct sdp_sock *ssk); +int sdp_resize_buffers(struct sdp_sock *ssk, u32 new_size); void sdp_post_keepalive(struct sdp_sock *ssk); void sdp_start_keepalive_timer(struct sock *sk); -void sdp_remove_sock(struct sdp_sock *ssk); -void sdp_remove_large_sock(void); #endif Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-10-10 15:36:46.000000000 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c 
2007-10-10 15:37:41.000000000 -0500 @@ -70,9 +70,13 @@ static int curr_large_sockets = 0; atomic_t sdp_current_mem_usage; spinlock_t sdp_large_sockets_lock; -static int sdp_can_resize(void) +static int sdp_get_large_socket(struct sdp_sock *ssk) { int count, ret; + + if (ssk->recv_request) + return 1; + spin_lock_irq(&sdp_large_sockets_lock); count = curr_large_sockets; ret = curr_large_sockets < max_large_sockets; @@ -83,11 +87,13 @@ static int sdp_can_resize(void) return ret; } -void sdp_remove_large_sock(void) +void sdp_remove_large_sock(struct sdp_sock *ssk) { - spin_lock_irq(&sdp_large_sockets_lock); - curr_large_sockets--; - spin_unlock_irq(&sdp_large_sockets_lock); + if (ssk->recv_frags) { + spin_lock_irq(&sdp_large_sockets_lock); + curr_large_sockets--; + spin_unlock_irq(&sdp_large_sockets_lock); + } } /* Like tcp_fin */ @@ -458,7 +464,7 @@ void sdp_post_sends(struct sdp_sock *ssk /* FIXME */ BUG_ON(!skb); resp_size = (struct sdp_chrecvbuf *)skb_put(skb, sizeof *resp_size); - resp_size->size = htons(ssk->recv_frags * PAGE_SIZE); + resp_size->size = htonl(ssk->recv_frags * PAGE_SIZE); sdp_post_send(ssk, skb, SDP_MID_CHRCVBUF_ACK); } @@ -485,7 +491,7 @@ void sdp_post_sends(struct sdp_sock *ssk ssk->sent_request = SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE; ssk->sent_request_head = ssk->tx_head; req_size = (struct sdp_chrecvbuf *)skb_put(skb, sizeof *req_size); - req_size->size = htons(ssk->sent_request); + req_size->size = htonl(ssk->sent_request); sdp_post_send(ssk, skb, SDP_MID_CHRCVBUF); } @@ -521,11 +527,42 @@ void sdp_post_sends(struct sdp_sock *ssk } } -static inline void sdp_resize(struct sdp_sock *ssk, u32 new_size) +int sdp_resize_buffers(struct sdp_sock *ssk, u32 new_size) +{ + u32 curr_size = SDP_HEAD_SIZE + ssk->recv_frags * PAGE_SIZE; + u32 max_size = SDP_HEAD_SIZE + SDP_MAX_SEND_SKB_FRAGS * PAGE_SIZE; + + if (new_size > curr_size && new_size <= max_size && + sdp_get_large_socket(ssk)) { + ssk->rcvbuf_scale = rcvbuf_scale; + ssk->recv_frags = 
PAGE_ALIGN(new_size - SDP_HEAD_SIZE) / PAGE_SIZE; + if (ssk->recv_frags > SDP_MAX_SEND_SKB_FRAGS) + ssk->recv_frags = SDP_MAX_SEND_SKB_FRAGS; + return 0; + } else + return -1; +} + +static void sdp_handle_resize_request(struct sdp_sock *ssk, struct sdp_chrecvbuf *buf) { - ssk->recv_frags = PAGE_ALIGN(new_size - SDP_HEAD_SIZE) / PAGE_SIZE; - if (ssk->recv_frags > SDP_MAX_SEND_SKB_FRAGS) - ssk->recv_frags = SDP_MAX_SEND_SKB_FRAGS; + if (sdp_resize_buffers(ssk, ntohl(buf->size)) == 0) + ssk->recv_request_head = ssk->rx_head + 1; + else + ssk->recv_request_head = ssk->rx_tail; + ssk->recv_request = 1; +} + +static void sdp_handle_resize_ack(struct sdp_sock *ssk, struct sdp_chrecvbuf *buf) +{ + u32 new_size = ntohl(buf->size); + + if (new_size > ssk->xmit_size_goal) { + ssk->sent_request = -1; + ssk->xmit_size_goal = new_size; + ssk->send_frags = + PAGE_ALIGN(ssk->xmit_size_goal) / PAGE_SIZE; + } else + ssk->sent_request = 0; } static void sdp_handle_wc(struct sdp_sock *ssk, struct ib_wc *wc) @@ -605,28 +642,10 @@ static void sdp_handle_wc(struct sdp_soc sdp_sock_queue_rcv_skb(&ssk->isk.sk, skb); sdp_fin(&ssk->isk.sk); } else if (h->mid == SDP_MID_CHRCVBUF) { - u32 new_size = *(u32 *)skb->data; - - if (ssk->recv_request || sdp_can_resize()) { - ssk->rcvbuf_scale = rcvbuf_scale; - sdp_resize(ssk, ntohs(new_size)); - ssk->recv_request_head = ssk->rx_head + 1; - } else - ssk->recv_request_head = ssk->rx_tail; - ssk->recv_request = 1; + sdp_handle_resize_request(ssk, (struct sdp_chrecvbuf *)skb->data); __kfree_skb(skb); } else if (h->mid == SDP_MID_CHRCVBUF_ACK) { - u32 new_size = *(u32 *)skb->data; - new_size = ntohs(new_size); - - if (new_size > ssk->xmit_size_goal) { - ssk->sent_request = -1; - ssk->xmit_size_goal = new_size; - ssk->send_frags = - PAGE_ALIGN(ssk->xmit_size_goal) / - PAGE_SIZE; - } else - ssk->sent_request = 0; + sdp_handle_resize_ack(ssk, (struct sdp_chrecvbuf *)skb->data); __kfree_skb(skb); } else { /* TODO: Handle other messages */ Index: 
ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_cma.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_cma.c 2007-10-10 15:36:46.000000000 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_cma.c 2007-10-10 15:36:59.000000000 -0500 @@ -241,6 +241,7 @@ int sdp_connect_handler(struct sock *sk, sizeof(struct sdp_bsdh); sdp_sk(child)->send_frags = PAGE_ALIGN(sdp_sk(child)->xmit_size_goal) / PAGE_SIZE; + sdp_resize_buffers(sdp_sk(child), ntohl(h->desremrcvsz)); sdp_dbg(child, "%s bufs %d xmit_size_goal %d\n", __func__, sdp_sk(child)->bufs, Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_main.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/sdp/sdp_main.c 2007-10-10 15:36:46.000000000 -0500 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/sdp/sdp_main.c 2007-10-10 15:37:35.000000000 -0500 @@ -204,8 +204,7 @@ static void sdp_destroy_qp(struct sdp_so if (pd) ib_dealloc_pd(pd); - if (ssk->recv_frags) - sdp_remove_large_sock(); + sdp_remove_large_sock(ssk); kfree(ssk->rx_ring); kfree(ssk->tx_ring); From rdreier at cisco.com Mon Oct 22 14:42:30 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 22 Oct 2007 14:42:30 -0700 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: <200710221013.40112.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 22 Oct 2007 10:13:39 +0200") References: <200710221013.40112.jackm@dev.mellanox.co.il> Message-ID: > Several of the GCC compilers (I tried out gcc 4.1.1 on Red Hat Enterprise Linux 5, > gcc 4.1.0 on SuSE SLES 10, and gcc 3.4.6 on RHEL4 update 5) do not recognize this macro > (and consequently emit a compilation error). Something is fishy with your compilation process. __always_inline should never be visible to gcc; it is a macro defined in . - R. 
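A minimal sketch of the point about __always_inline, assuming a stand-alone user-space file built without the kernel headers: because __always_inline is a preprocessor macro and not a compiler keyword, a file that wants to use it has to get a definition from somewhere. The fallback definition and the add_one() helper below are hypothetical illustrations, not code taken from the kernel tree or from this thread.

```c
/*
 * Hypothetical fallback: only define __always_inline when no header
 * (kernel or libc) has already provided it.  The expansion uses the
 * GCC always_inline function attribute.
 */
#ifndef __always_inline
#define __always_inline __inline__ __attribute__((always_inline))
#endif

/* Illustrative helper; the compiler is asked to inline every call. */
static __always_inline int add_one(int i)
{
	return i + 1;
}
```

With the guard in place the same source should build whether or not an included header already defines the macro; without any definition, gcc sees __always_inline as an unknown identifier and reports a syntax error at the function declaration.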
From rdreier at cisco.com Mon Oct 22 14:44:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 22 Oct 2007 14:44:08 -0700 Subject: [ofa-general] [PATCH] libmlx4: fix thinko in headroom marking order commit In-Reply-To: <200710221530.39454.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 22 Oct 2007 15:30:39 +0200") References: <200710221530.39454.jackm@dev.mellanox.co.il> Message-ID: thanks, applied. From sean.hefty at intel.com Mon Oct 22 15:03:00 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 22 Oct 2007 15:03:00 -0700 Subject: [ofa-general] [RFC/PATCH 2.6.24] ib/multicast: report errors on multicast groups if pkeys change Message-ID: <000201c814f7$500a2960$5acc180a@amr.corp.intel.com> Pkey changes can invalidate multicast groups. Report errors on any multicast group affected by a pkey change. Signed-off-by: Sean Hefty --- drivers/infiniband/core/multicast.c | 55 +++++++++++++++++++++++++++++------ 1 files changed, 45 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c index 15b4c4d..5d4a6f1 100644 --- a/drivers/infiniband/core/multicast.c +++ b/drivers/infiniband/core/multicast.c @@ -73,11 +73,20 @@ struct mcast_device { }; enum mcast_state { - MCAST_IDLE, MCAST_JOINING, MCAST_MEMBER, + MCAST_ERROR, +}; + +enum mcast_group_state { + MCAST_IDLE, MCAST_BUSY, - MCAST_ERROR + MCAST_GROUP_ERROR, + MCAST_PKEY_EVENT +}; + +enum { + MCAST_INVALID_PKEY_INDEX = 0xFFFF }; struct mcast_member; @@ -93,9 +102,10 @@ struct mcast_group { struct mcast_member *last_join; int members[3]; atomic_t refcount; - enum mcast_state state; + enum mcast_group_state state; struct ib_sa_query *query; int query_id; + u16 pkey_index; }; struct mcast_member { @@ -378,9 +388,19 @@ static int fail_join(struct mcast_group *group, struct mcast_member *member, static void process_group_error(struct mcast_group *group) { struct mcast_member *member; - int ret; + int ret = 0; + u16 pkey_index; + + if 
(group->state == MCAST_PKEY_EVENT) + ret = ib_find_pkey(group->port->dev->device, + group->port->port_num, + be16_to_cpu(group->rec.pkey), &pkey_index); spin_lock_irq(&group->lock); + if (group->state == MCAST_PKEY_EVENT && !ret && + group->pkey_index == pkey_index) + goto out; + while (!list_empty(&group->active_list)) { member = list_entry(group->active_list.next, struct mcast_member, list); @@ -399,6 +419,7 @@ static void process_group_error(struct mcast_group *group) } group->rec.join_state = 0; +out: group->state = MCAST_BUSY; spin_unlock_irq(&group->lock); } @@ -415,9 +436,9 @@ static void mcast_work_handler(struct work_struct *work) retest: spin_lock_irq(&group->lock); while (!list_empty(&group->pending_list) || - (group->state == MCAST_ERROR)) { + (group->state != MCAST_BUSY)) { - if (group->state == MCAST_ERROR) { + if (group->state != MCAST_BUSY) { spin_unlock_irq(&group->lock); process_group_error(group); goto retest; @@ -494,12 +515,19 @@ static void join_handler(int status, struct ib_sa_mcmember_rec *rec, void *context) { struct mcast_group *group = context; + u16 pkey_index = MCAST_INVALID_PKEY_INDEX; if (status) process_join_error(group, status); else { + ib_find_pkey(group->port->dev->device, group->port->port_num, + be16_to_cpu(rec->pkey), &pkey_index); + spin_lock_irq(&group->port->lock); group->rec = *rec; + if (group->state == MCAST_BUSY && + group->pkey_index == MCAST_INVALID_PKEY_INDEX) + group->pkey_index = pkey_index; if (!memcmp(&mgid0, &group->rec.mgid, sizeof mgid0)) { rb_erase(&group->node, &group->port->table); mcast_insert(group->port, group, 1); @@ -539,6 +567,7 @@ static struct mcast_group *acquire_group(struct mcast_port *port, group->port = port; group->rec.mgid = *mgid; + group->pkey_index = MCAST_INVALID_PKEY_INDEX; INIT_LIST_HEAD(&group->pending_list); INIT_LIST_HEAD(&group->active_list); INIT_WORK(&group->work, mcast_work_handler); @@ -707,7 +736,8 @@ int ib_init_ah_from_mcmember(struct ib_device *device, u8 port_num, } 
EXPORT_SYMBOL(ib_init_ah_from_mcmember); -static void mcast_groups_lost(struct mcast_port *port) +static void mcast_groups_event(struct mcast_port *port, + enum mcast_group_state state) { struct mcast_group *group; struct rb_node *node; @@ -721,7 +751,8 @@ static void mcast_groups_lost(struct mcast_port *port) atomic_inc(&group->refcount); queue_work(mcast_wq, &group->work); } - group->state = MCAST_ERROR; + if (group->state != MCAST_GROUP_ERROR) + group->state = state; spin_unlock(&group->lock); } spin_unlock_irqrestore(&port->lock, flags); @@ -731,16 +762,20 @@ static void mcast_event_handler(struct ib_event_handler *handler, struct ib_event *event) { struct mcast_device *dev; + int index; dev = container_of(handler, struct mcast_device, event_handler); + index = event->element.port_num - dev->start_port; switch (event->event) { case IB_EVENT_PORT_ERR: case IB_EVENT_LID_CHANGE: case IB_EVENT_SM_CHANGE: case IB_EVENT_CLIENT_REREGISTER: - mcast_groups_lost(&dev->port[event->element.port_num - - dev->start_port]); + mcast_groups_event(&dev->port[index], MCAST_GROUP_ERROR); + break; + case IB_EVENT_PKEY_CHANGE: + mcast_groups_event(&dev->port[index], MCAST_PKEY_EVENT); break; default: break; From rdreier at cisco.com Mon Oct 22 19:17:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 22 Oct 2007 19:17:44 -0700 Subject: [ofa-general] Re: [PATCH 1/5 v2] libnes: library init entry points In-Reply-To: <200710192031.l9JKV0xk021882@neteffect.com> (ggrundstrom@neteffect.com's message of "Fri, 19 Oct 2007 15:31:00 -0500") References: <200710192031.l9JKV0xk021882@neteffect.com> Message-ID: > + global: > + ibv_driver_init; There's no version of libibverbs that ever used the ibv_driver_init entry point, so you can just kill this. 
> + openib_driver_init; From jackm at dev.mellanox.co.il Tue Oct 23 00:22:09 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 23 Oct 2007 09:22:09 +0200 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: References: <200710221013.40112.jackm@dev.mellanox.co.il> Message-ID: <200710230922.09702.jackm@dev.mellanox.co.il> On Monday 22 October 2007 23:42, Roland Dreier wrote: > Something is fishy with your compilation process.  __always_inline > should never be visible to gcc; it is a macro defined in . > Might this be a gcc installation problem? I have the following file (foobar.c): ============== static __always_inline int foo(int i) {return (i+1); } int main(void) { return foo(5); } ============== If I just do: gcc foobar.c, I get the following output: foobar.c:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘int’ If I modify the file to: ============== static __attribute__((always_inline)) int foo(int i) {return (i+1); } int main(void) { return foo(5); } ============== and do gcc foobar.c, the compilation succeeds. I'm using gcc (GCC) 4.1.0 (SUSE Linux), on SLES 10. -- Jack From glebn at voltaire.com Tue Oct 23 00:23:20 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Tue, 23 Oct 2007 09:23:20 +0200 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: <200710230922.09702.jackm@dev.mellanox.co.il> References: <200710221013.40112.jackm@dev.mellanox.co.il> <200710230922.09702.jackm@dev.mellanox.co.il> Message-ID: <20071023072320.GA2667@minantech.com> On Tue, Oct 23, 2007 at 09:22:09AM +0200, Jack Morgenstein wrote: > On Monday 22 October 2007 23:42, Roland Dreier wrote: > > Something is fishy with your compilation process.  __always_inline > > should never be visible to gcc; it is a macro defined in . > > > > Might this be a gcc installation problem?
> > I have the following file (foobar.c): > > ============== > static __always_inline int foo(int i) {return (i+1); } > int main(void) { return foo(5); } > ============== And where is "#include " here? > > If I just do: gcc foobar.c, > I get the following output: > foobar.c:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘int’ > > If I modify the file to: > ============== > static __attribute__((always_inline)) int foo(int i) {return (i+1); } > int main(void) { return foo(5); } > ============== > > and do gcc foobar.c, > the compilation succeeds. > > I'm using gcc (GCC) 4.1.0 (SUSE Linux), on SLES 10. > > -- Jack > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Gleb. From moshek at voltaire.com Tue Oct 23 00:45:31 2007 From: moshek at voltaire.com (Moshe Kazir) Date: Tue, 23 Oct 2007 09:45:31 +0200 Subject: [ofa-general] ibutils.src.rpm do not pass compile on PPC64 -SLES 10 SP1 Message-ID: <39C75744D164D948A170E9792AF8E7CA4D2BA2@exil.voltaire.com> The attached patch enables ibutils to compile without errors on SLES 10 SP1 PPC64 JS21. I did not test it on other environments. When I tried to rpmbuild the newly created ibutils...src.rpm I faced a new problem ...
make[1]: Leaving directory `/usr/src/packages/BUILD/ibutils-1.2' + install -d /var/tmp/ibutils-1.2-0.4.ofed20070930-root-root/etc/profile.d + cat + cat + touch ibutils-files + install -d /var/tmp/ibutils-1.2-0.4.ofed20070930-root-root/etc/ld.so.conf.d + echo /usr/lib64 + case /usr in + /usr/lib/rpm/brp-lib64-linux sf at suse.de: if you find problems with this script, drop me a note RPATH /usr/lib/gcc/powerpc64-suse-linux/4.1.2/64:/usr/lib /var/tmp/ibutils-1.2-0.4.ofed20070930-root-root/usr/lib64/libibdm.so: rpath to 32bit lib error: Bad exit status from /var/tmp/rpm-tmp.12275 (%install) I'm starting to dig into it now. Patch explanation: In the configure files you'll find the line "$CC -print-search-dirs ....." On non-PPC64 platforms that's the right way to find the gcc & g++ library path, but it does not work on PPC64. The default for PPC64 is ELF 32-bit, so when you search for libraries for ELF 64-bit objects the right command is -> "$CC $CFLAGS $CPPFLAGS $LDFLAGS -print-search-dirs ....." The patch also fixes a spelling mistake that caused link errors -> in some places "CFLAGS=save_cfalgs" is corrected to "CFLAGS=save_cflags" Moshe Can someone help me fix this problem the right way, i.e. by changing the autoconf input files and not the configure / aclocal.m4 files? Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -------------- next part -------------- A non-text attachment was scrubbed...
Name: ibutils_sles10_sp1_ppc64.patch Type: application/octet-stream Size: 28751 bytes Desc: ibutils_sles10_sp1_ppc64.patch URL: From jackm at dev.mellanox.co.il Tue Oct 23 01:28:16 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 23 Oct 2007 10:28:16 +0200 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: <20071023072320.GA2667@minantech.com> References: <200710221013.40112.jackm@dev.mellanox.co.il> <200710230922.09702.jackm@dev.mellanox.co.il> <20071023072320.GA2667@minantech.com> Message-ID: <200710231028.16902.jackm@dev.mellanox.co.il> On Tuesday 23 October 2007 09:23, Gleb Natapov wrote: > And where is "#include " here? > Point taken. However, I checked on Red Hat Enterprise Linux 4 (update 5) distributions. the macro "__always_inline" is not present there (see below). They use "inline" or "__inline__" or "__inline" instead. How do we avoid "backports" for gcc?? ================================================================ /* Never include this file directly. Include instead. */ /* These definitions are for GCC v3.x. 
*/ #include #if __GNUC_MINOR__ >= 1 && __GNUC_MINOR__ < 4 # define inline __inline__ __attribute__((always_inline)) # define __inline__ __inline__ __attribute__((always_inline)) # define __inline __inline__ __attribute__((always_inline)) #endif #if __GNUC_MINOR__ > 0 # define __deprecated __attribute__((deprecated)) #endif #if __GNUC_MINOR__ >= 3 # define __attribute_used__ __attribute__((__used__)) #else # define __attribute_used__ __attribute__((__unused__)) #endif #define __attribute_pure__ __attribute__((pure)) #define __attribute_const__ __attribute__((__const__)) #if __GNUC_MINOR__ >= 1 #define noinline __attribute__((noinline)) #endif #if __GNUC_MINOR__ >= 4 #define __must_check __attribute__((warn_unused_result)) #endif #if __GNUC_MINOR__ >= 5 #define __compiler_offsetof(a,b) __builtin_offsetof(a,b) #endif ~ From jackm at dev.mellanox.co.il Tue Oct 23 01:43:16 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 23 Oct 2007 10:43:16 +0200 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: <200710231028.16902.jackm@dev.mellanox.co.il> References: <200710221013.40112.jackm@dev.mellanox.co.il> <20071023072320.GA2667@minantech.com> <200710231028.16902.jackm@dev.mellanox.co.il> Message-ID: <200710231043.16411.jackm@dev.mellanox.co.il> On Tuesday 23 October 2007 10:28, Jack Morgenstein wrote: > On Tuesday 23 October 2007 09:23, Gleb Natapov wrote: > > And where is "#include " here? > > > Point taken.  However, I checked on Red Hat Enterprise Linux 4 (update 5) > distributions.  the macro "__always_inline" is not present there (see below). > They use "inline" or "__inline__" or "__inline" instead. > Correction. In Kernel space, the __always_inline macro is present (in file /lib/modules//source/linux/compiler.h. 
However, in user space, the file used is: /usr/include/linux/compiler.h: #ifndef __LINUX_COMPILER_H #define __LINUX_COMPILER_H #define likely(x) __builtin_expect((x),1) #define unlikely(x) __builtin_expect((x),0) #endif /* __LINUX_COMPILER_H */ The __always_inline macro is not defined for userspace in RHEL4. Any ideas (other than just including a macro ourselves: #ifndef __always_inline #define __always_inline inline #endif )? - Jack From glebn at voltaire.com Tue Oct 23 01:40:05 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Tue, 23 Oct 2007 10:40:05 +0200 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: <200710231043.16411.jackm@dev.mellanox.co.il> References: <200710221013.40112.jackm@dev.mellanox.co.il> <20071023072320.GA2667@minantech.com> <200710231028.16902.jackm@dev.mellanox.co.il> <200710231043.16411.jackm@dev.mellanox.co.il> Message-ID: <20071023084004.GC2667@minantech.com> On Tue, Oct 23, 2007 at 10:43:16AM +0200, Jack Morgenstein wrote: > On Tuesday 23 October 2007 10:28, Jack Morgenstein wrote: > > On Tuesday 23 October 2007 09:23, Gleb Natapov wrote: > > > And where is "#include " here? > > > > > Point taken.  However, I checked on Red Hat Enterprise Linux 4 (update 5) > > distributions.  the macro "__always_inline" is not present there (see below). > > They use "inline" or "__inline__" or "__inline" instead. > > > Correction. > > In Kernel space, the __always_inline macro is present (in file /lib/modules//source/linux/compiler.h. > > However, in user space, the file used is: /usr/include/linux/compiler.h: > #ifndef __LINUX_COMPILER_H > #define __LINUX_COMPILER_H > > #define likely(x) __builtin_expect((x),1) > #define unlikely(x) __builtin_expect((x),0) > > #endif /* __LINUX_COMPILER_H */ > > The __always_inline macro is not defined for userspace in RHEL4. I am not sure those macros (and header) are meant to be used by userspace. 
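For illustration, the local-fallback idea raised above can be sketched as follows. This is only a sketch under stated assumptions, not what libmlx4 or any distribution header actually does: it assumes GCC or a compatible compiler for the attribute form (the form the earlier foobar.c test showed compiles), falls back to plain "inline" elsewhere, and reuses the toy function foo from the start of the thread.

```c
/*
 * Sketch of a userspace fallback for __always_inline.
 * Assumption: GCC or a compatible compiler; for anything else we fall
 * back to plain "inline", which is only a hint to the compiler.
 * The #ifndef guard leaves any definition already supplied by a system
 * header untouched.
 */
#ifndef __always_inline
# ifdef __GNUC__
#  define __always_inline __inline__ __attribute__((always_inline))
# else
#  define __always_inline inline
# endif
#endif

/* Toy function from the start of the thread. */
static __always_inline int foo(int i)
{
	return i + 1;
}
```

Because of the guard, the fallback only takes effect on systems where no header has already defined the macro, so the same source should build in both situations described in this thread.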
> > Any ideas (other than just including a macro ourselves: > #ifndef __always_inline > #define __always_inline inline > #endif > )? > > - Jack -- Gleb. From jackm at dev.mellanox.co.il Tue Oct 23 01:58:26 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 23 Oct 2007 10:58:26 +0200 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: <20071023084004.GC2667@minantech.com> References: <200710221013.40112.jackm@dev.mellanox.co.il> <200710231043.16411.jackm@dev.mellanox.co.il> <20071023084004.GC2667@minantech.com> Message-ID: <200710231058.26784.jackm@dev.mellanox.co.il> On Tuesday 23 October 2007 10:40, Gleb Natapov wrote: > > The __always_inline macro is not defined for userspace in RHEL4. > I am not sure those macros (and header) are meant to be used by userspace. > The macro is currently being used in libmlx4 (src/qp.c). That's what started the thread. - Jack From kliteyn at dev.mellanox.co.il Tue Oct 23 02:03:30 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 23 Oct 2007 11:03:30 +0200 Subject: [ofa-general] Re: [PATCH V2] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <20071017223422.GP6945@sashak.voltaire.com> References: <4714C9A1.5010304@dev.mellanox.co.il> <20071017221322.GN6945@sashak.voltaire.com> <20071017223422.GP6945@sashak.voltaire.com> Message-ID: <471DB8E2.1040001@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 00:13 Thu 18 Oct , Sasha Khapyorsky wrote: >>> diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h >>> index 0969755..3685007 100644 >>> --- a/opensm/include/iba/ib_types.h >>> +++ b/opensm/include/iba/ib_types.h >>> @@ -3247,8 +3247,7 @@ typedef struct _ib_class_port_info { >>> uint8_t base_ver; >>> uint8_t class_ver; >>> ib_net16_t cap_mask; >>> - uint8_t reserved[3]; >>> - uint8_t resp_time_val; >>> + ib_net32_t cap_mask2_resp_time; > > This will break ibutils. 
We are in OFED already, so I think the patch > for ibutils should be committed/pushed at same time. Patch for ibutils is ready and waiting for the opensm patch to be applied. -- Yevgeny > Sasha > From kliteyn at dev.mellanox.co.il Tue Oct 23 02:03:32 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 23 Oct 2007 11:03:32 +0200 Subject: [ofa-general] [PATCH v3] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit Message-ID: <471DB8E4.8030400@dev.mellanox.co.il> Adding ClassPortInfo:CapabilityMask2 field and turning on OSM QoS capability bit (OSM_CAP2_IS_QOS_SUPPORTED). Signed-off-by: Yevgeny Kliteynik --- infiniband-diags/src/saquery.c | 6 +- opensm/include/iba/ib_types.h | 137 +++++++++++++++++++++++++++++++- opensm/include/opensm/osm_base.h | 12 +++ opensm/opensm/osm_sa_class_port_info.c | 5 +- opensm/osmtest/osmtest.c | 13 +++- 5 files changed, 163 insertions(+), 10 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index a9a8da4..e17ec5a 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -262,7 +262,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) "\t\tBase version.............%d\n" "\t\tClass version............%d\n" "\t\tCapability mask..........0x%04X\n" - "\t\tResponse time value......0x%08X\n" + "\t\tCapability mask 2........0x%08X\n" + "\t\tResponse time value......0x%02X\n" "\t\tRedirect GID.............0x%s\n" "\t\tRedirect TC/SL/FL........0x%08X\n" "\t\tRedirect LID.............0x%04X\n" @@ -279,7 +280,8 @@ print_class_port_info(ib_class_port_info_t *class_port_info) class_port_info->base_ver, class_port_info->class_ver, cl_ntoh16(class_port_info->cap_mask), - class_port_info->resp_time_val, + ib_class_cap_mask2(class_port_info), + ib_class_resp_time_val(class_port_info), sprint_gid(&(class_port_info->redir_gid), gid_str, GID_STR_LEN), cl_ntoh32(class_port_info->redir_tc_sl_fl), cl_ntoh16(class_port_info->redir_lid), diff 
--git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index ce4bfb3..d904d9c 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -3249,8 +3249,7 @@ typedef struct _ib_class_port_info { uint8_t base_ver; uint8_t class_ver; ib_net16_t cap_mask; - uint8_t reserved[3]; - uint8_t resp_time_val; + ib_net32_t cap_mask2_resp_time; ib_gid_t redir_gid; ib_net32_t redir_tc_sl_fl; ib_net16_t redir_lid; @@ -3277,8 +3276,9 @@ typedef struct _ib_class_port_info { * cap_mask * Supported capabilities of this management class. * -* resp_time_value -* Maximum expected response time. +* cap_mask2_resp_time +* Maximum expected response time and additional +* supported capabilities of this management class. * * redr_gid * GID to use for redirection, or zero @@ -3324,6 +3324,135 @@ typedef struct _ib_class_port_info { * *********/ +/****f* IBA Base: Types/ib_class_set_resp_time_val +* NAME +* ib_class_set_resp_time_val +* +* DESCRIPTION +* Set maximum expected response time. +* +* SYNOPSIS +*/ +static inline void OSM_API +ib_class_set_resp_time_val(IN ib_class_port_info_t * const p_cpi, + IN const uint8_t val) +{ + p_cpi->cap_mask2_resp_time = + (p_cpi->cap_mask2_resp_time & CL_HTON32(~IB_CLASS_RESP_TIME_MASK)) | + cl_hton32(val & IB_CLASS_RESP_TIME_MASK); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* val +* [in] Response time value to set. +* +* RETURN VALUES +* None +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + +/****f* IBA Base: Types/ib_class_resp_time_val +* NAME +* ib_class_resp_time_val +* +* DESCRIPTION +* Get response time value. +* +* SYNOPSIS +*/ +static inline uint8_t OSM_API +ib_class_resp_time_val(IN ib_class_port_info_t * const p_cpi) +{ + return (uint8_t)(cl_ntoh32(p_cpi->cap_mask2_resp_time) & + IB_CLASS_RESP_TIME_MASK); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* RETURN VALUES +* Response time value. 
+* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + +/****f* IBA Base: Types/ib_class_set_cap_mask2 +* NAME +* ib_class_set_cap_mask2 +* +* DESCRIPTION +* Set ClassPortInfo:CapabilityMask2. +* +* SYNOPSIS +*/ +static inline void OSM_API +ib_class_set_cap_mask2(IN ib_class_port_info_t * const p_cpi, + IN const uint32_t cap_mask2) +{ + p_cpi->cap_mask2_resp_time = (p_cpi->cap_mask2_resp_time & + CL_HTON32(IB_CLASS_RESP_TIME_MASK)) | + cl_hton32(cap_mask2 << 5); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* cap_mask2 +* [in] CapabilityMask2 value to set. +* +* RETURN VALUES +* None +* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + +/****f* IBA Base: Types/ib_class_cap_mask2 +* NAME +* ib_class_cap_mask2 +* +* DESCRIPTION +* Get ClassPortInfo:CapabilityMask2. +* +* SYNOPSIS +*/ +static inline uint32_t OSM_API +ib_class_cap_mask2(IN const ib_class_port_info_t * const p_cpi) +{ + return (cl_ntoh32(p_cpi->cap_mask2_resp_time) >> 5); +} + +/* +* PARAMETERS +* p_cpi +* [in] Pointer to the class port info object. +* +* RETURN VALUES +* CapabilityMask2 of the ClassPortInfo. 
+* +* NOTES +* +* SEE ALSO +* ib_class_port_info_t +*********/ + /****s* IBA Base: Types/ib_sm_info_t * NAME * ib_sm_info_t diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index e635dcb..26ef067 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -661,6 +661,18 @@ typedef enum _osm_thread_state { #define OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED (1 << 13) /***********/ +/****d* OpenSM: Base/OSM_CAP2_IS_QOS_SUPPORTED +* Name +* OSM_CAP2_IS_QOS_SUPPORTED +* +* DESCRIPTION +* QoS is supported +* +* SYNOPSIS +*/ +#define OSM_CAP2_IS_QOS_SUPPORTED (1 << 1) +/***********/ + /****d* OpenSM: Base/osm_sm_state_t * NAME * osm_sm_state_t diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c index d5c9f82..8a49398 100644 --- a/opensm/opensm/osm_sa_class_port_info.c +++ b/opensm/opensm/osm_sa_class_port_info.c @@ -170,7 +170,7 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, } } rtv += 8; - p_resp_cpi->resp_time_val = rtv; + ib_class_set_resp_time_val(p_resp_cpi, rtv); p_resp_cpi->redir_gid = zero_gid; p_resp_cpi->redir_tc_sl_fl = 0; p_resp_cpi->redir_lid = 0; @@ -209,6 +209,9 @@ __osm_cpi_rcv_respond(IN osm_cpi_rcv_t * const p_rcv, p_resp_cpi->cap_mask = OSM_CAP_IS_SUBN_GET_SET_NOTICE_SUP | OSM_CAP_IS_PORT_INFO_CAPMASK_MATCH_SUPPORTED; #endif + if (p_rcv->p_subn->opt.qos) + ib_class_set_cap_mask2(p_resp_cpi, OSM_CAP2_IS_QOS_SUPPORTED); + if (p_rcv->p_subn->opt.no_multicast_option != TRUE) p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); diff --git a/opensm/osmtest/osmtest.c b/opensm/osmtest/osmtest.c index 73933a3..de54f2d 100644 --- a/opensm/osmtest/osmtest.c +++ b/opensm/osmtest/osmtest.c @@ -713,10 +713,17 @@ ib_api_status_t osmtest_validate_sa_class_port_info(IN osmtest_t * const p_osmt) (ib_class_port_info_t *) ib_sa_mad_get_payload_ptr(p_resp_sa_madp); osm_log(&p_osmt->log, OSM_LOG_INFO, - 
"osmtest_validate_sa_class_port_info:\n-----------------------------\nSA Class Port Info:\n" - " base_ver:%u\n class_ver:%u\n cap_mask:0x%X\n resp_time_val:0x%X\n-----------------------------\n", + "osmtest_validate_sa_class_port_info:\n" + "-----------------------------\n" + "SA Class Port Info:\n" + " base_ver:%u\n" + " class_ver:%u\n" + " cap_mask:0x%X\n" + " cap_mask2:0x%X\n" + " resp_time_val:0x%X\n" + "-----------------------------\n", p_cpi->base_ver, p_cpi->class_ver, cl_ntoh16(p_cpi->cap_mask), - p_cpi->resp_time_val); + ib_class_cap_mask2(p_cpi), ib_class_resp_time_val(p_cpi)); Exit: #if 0 -- 1.5.1.4 From He.Huang at Sun.COM Tue Oct 23 02:20:56 2007 From: He.Huang at Sun.COM (Isaac Huang) Date: Tue, 23 Oct 2007 17:20:56 +0800 Subject: [ofa-general] OFED version macro Message-ID: <20071023092056.GA537@sun.com> Hi, I'm looking for something similar to LINUX_VERSION_CODE so that my code can build against both OFED 1.1 and 1.2.x, since there's been some API changes. But grepping through the headers didn't reveal anything close to it. In particular, ib_create_cq prototype has changed to take an extra parameter, how can I detect this at preprocessing time? Also, is interoperability between OFED 1.1 and 1.2.x supported? 
Thanks, Isaac From vlad at lists.openfabrics.org Tue Oct 23 02:56:35 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 23 Oct 2007 02:56:35 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071023-0200 daily build status Message-ID: <20071023095636.32E24E60843@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.22 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with 
linux-2.6.13 Passed on ia64 with linux-2.6.23 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From glebn at voltaire.com Tue Oct 23 03:01:38 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Tue, 23 Oct 2007 12:01:38 +0200 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: <200710231058.26784.jackm@dev.mellanox.co.il> References: <200710221013.40112.jackm@dev.mellanox.co.il> <200710231043.16411.jackm@dev.mellanox.co.il> <20071023084004.GC2667@minantech.com> <200710231058.26784.jackm@dev.mellanox.co.il> Message-ID: <20071023100138.GD2667@minantech.com> On Tue, Oct 23, 2007 at 10:58:26AM +0200, Jack Morgenstein wrote: > On Tuesday 23 October 2007 10:40, Gleb Natapov wrote: > > > The __always_inline macro is not defined for userspace in RHEL4. > > I am not sure those macros (and header) are meant to be used by userspace. > > > The macro is currently being used in libmlx4 (src/qp.c). That's what started > the thread. > I see. The question is back to Roland then :) -- Gleb. From dotanb at dev.mellanox.co.il Tue Oct 23 05:39:25 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 23 Oct 2007 14:39:25 +0200 Subject: [ofa-general] [PATCH] libibverbs/examples: add command line parameter for SL Message-ID: <200710231439.25367.dotanb@dev.mellanox.co.il> Added command line parameter to support changing the SL of the QP/AH. 
(This is being used mainly in order to check the QoS feature) Signed-off-by: Dotan Barak --- diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c index 258eb8f..4a90498 100644 --- a/examples/rc_pingpong.c +++ b/examples/rc_pingpong.c @@ -76,7 +76,8 @@ struct pingpong_dest { }; static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, - enum ibv_mtu mtu, struct pingpong_dest *dest) + enum ibv_mtu mtu, int sl, + struct pingpong_dest *dest) { struct ibv_qp_attr attr = { .qp_state = IBV_QPS_RTR, @@ -88,7 +89,7 @@ static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, .ah_attr = { .is_global = 0, .dlid = dest->lid, - .sl = 0, + .sl = sl, .src_path_bits = 0, .port_num = port } @@ -192,7 +193,8 @@ out: } static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, - int ib_port, enum ibv_mtu mtu, int port, + int ib_port, enum ibv_mtu mtu, + int port, int sl, const struct pingpong_dest *my_dest) { struct addrinfo *res, *t; @@ -259,7 +261,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, sscanf(msg, "%x:%x:%x", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn); - if (pp_connect_ctx(ctx, ib_port, my_dest->psn, mtu, rem_dest)) { + if (pp_connect_ctx(ctx, ib_port, my_dest->psn, mtu, sl, rem_dest)) { fprintf(stderr, "Couldn't connect to remote QP\n"); free(rem_dest); rem_dest = NULL; @@ -473,6 +475,7 @@ static void usage(const char *argv0) printf(" -m, --mtu= path MTU (default 1024)\n"); printf(" -r, --rx-depth= number of receives to post at a time (default 500)\n"); printf(" -n, --iters= number of exchanges (default 1000)\n"); + printf(" -l, --sl= service level value\n"); printf(" -e, --events sleep on CQ events (default poll)\n"); } @@ -496,6 +499,7 @@ int main(int argc, char *argv[]) int routs; int rcnt, scnt; int num_cq_events = 0; + int sl = 0; srand48(getpid() * time(NULL)); @@ -510,11 +514,12 @@ int main(int argc, char *argv[]) { .name = "mtu", .has_arg = 1, .val = 'm' }, 
{ .name = "rx-depth", .has_arg = 1, .val = 'r' }, { .name = "iters", .has_arg = 1, .val = 'n' }, + { .name = "sl", .has_arg = 1, .val = 'l' }, { .name = "events", .has_arg = 0, .val = 'e' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:m:r:n:e", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:e", long_options, NULL); if (c == -1) break; @@ -558,6 +563,10 @@ int main(int argc, char *argv[]) iters = strtol(optarg, NULL, 0); break; + case 'l': + sl = strtol(optarg, NULL, 0); + break; + case 'e': ++use_event; break; @@ -631,7 +640,7 @@ int main(int argc, char *argv[]) if (servername) rem_dest = pp_client_exch_dest(servername, port, &my_dest); else - rem_dest = pp_server_exch_dest(ctx, ib_port, mtu, port, &my_dest); + rem_dest = pp_server_exch_dest(ctx, ib_port, mtu, port, sl, &my_dest); if (!rem_dest) return 1; @@ -640,7 +649,7 @@ int main(int argc, char *argv[]) rem_dest->lid, rem_dest->qpn, rem_dest->psn); if (servername) - if (pp_connect_ctx(ctx, ib_port, my_dest.psn, mtu, rem_dest)) + if (pp_connect_ctx(ctx, ib_port, my_dest.psn, mtu, sl, rem_dest)) return 1; ctx->pending = PINGPONG_RECV_WRID; diff --git a/examples/srq_pingpong.c b/examples/srq_pingpong.c index 490ad0a..1ff4668 100644 --- a/examples/srq_pingpong.c +++ b/examples/srq_pingpong.c @@ -80,7 +80,7 @@ struct pingpong_dest { }; static int pp_connect_ctx(struct pingpong_context *ctx, int port, enum ibv_mtu mtu, - const struct pingpong_dest *my_dest, + int sl, const struct pingpong_dest *my_dest, const struct pingpong_dest *dest) { int i; @@ -96,7 +96,7 @@ static int pp_connect_ctx(struct pingpong_context *ctx, int port, enum ibv_mtu m .ah_attr = { .is_global = 0, .dlid = dest[i].lid, - .sl = 0, + .sl = sl, .src_path_bits = 0, .port_num = port } @@ -214,7 +214,8 @@ out: } static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, - int ib_port, enum ibv_mtu mtu, int port, + int ib_port, enum ibv_mtu mtu, + int port, int sl, const struct pingpong_dest *my_dest) { 
struct addrinfo *res, *t; @@ -291,7 +292,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, &rem_dest[i].lid, &rem_dest[i].qpn, &rem_dest[i].psn); } - if (pp_connect_ctx(ctx, ib_port, mtu, my_dest, rem_dest)) { + if (pp_connect_ctx(ctx, ib_port, mtu, sl, my_dest, rem_dest)) { fprintf(stderr, "Couldn't connect to remote QP\n"); free(rem_dest); rem_dest = NULL; @@ -544,6 +545,7 @@ static void usage(const char *argv0) printf(" -q, --num-qp= number of QPs to use (default 16)\n"); printf(" -r, --rx-depth= number of receives to post at a time (default 500)\n"); printf(" -n, --iters= number of exchanges per QP(default 1000)\n"); + printf(" -l, --sl= service level value\n"); printf(" -e, --events sleep on CQ events (default poll)\n"); } @@ -571,6 +573,7 @@ int main(int argc, char *argv[]) int num_wc; int i; int num_cq_events = 0; + int sl = 0; srand48(getpid() * time(NULL)); @@ -586,11 +589,12 @@ int main(int argc, char *argv[]) { .name = "num-qp", .has_arg = 1, .val = 'q' }, { .name = "rx-depth", .has_arg = 1, .val = 'r' }, { .name = "iters", .has_arg = 1, .val = 'n' }, + { .name = "sl", .has_arg = 1, .val = 'l' }, { .name = "events", .has_arg = 0, .val = 'e' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:m:q:r:n:e", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:m:q:r:n:l:e", long_options, NULL); if (c == -1) break; @@ -639,6 +643,10 @@ int main(int argc, char *argv[]) iters = strtol(optarg, NULL, 0); break; + case 'l': + sl = strtol(optarg, NULL, 0); + break; + case 'e': ++use_event; break; @@ -726,7 +734,7 @@ int main(int argc, char *argv[]) if (servername) rem_dest = pp_client_exch_dest(servername, port, my_dest); else - rem_dest = pp_server_exch_dest(ctx, ib_port, mtu, port, my_dest); + rem_dest = pp_server_exch_dest(ctx, ib_port, mtu, port, sl, my_dest); if (!rem_dest) return 1; @@ -736,7 +744,7 @@ int main(int argc, char *argv[]) rem_dest[i].lid, rem_dest[i].qpn, rem_dest[i].psn); if (servername) - if 
(pp_connect_ctx(ctx, ib_port, mtu, my_dest, rem_dest)) + if (pp_connect_ctx(ctx, ib_port, mtu, sl, my_dest, rem_dest)) return 1; if (servername) diff --git a/examples/uc_pingpong.c b/examples/uc_pingpong.c index b6051c8..45be804 100644 --- a/examples/uc_pingpong.c +++ b/examples/uc_pingpong.c @@ -76,7 +76,8 @@ struct pingpong_dest { }; static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, - enum ibv_mtu mtu, struct pingpong_dest *dest) + enum ibv_mtu mtu, int sl, + struct pingpong_dest *dest) { struct ibv_qp_attr attr = { .qp_state = IBV_QPS_RTR, @@ -86,7 +87,7 @@ static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, .ah_attr = { .is_global = 0, .dlid = dest->lid, - .sl = 0, + .sl = sl, .src_path_bits = 0, .port_num = port } @@ -180,7 +181,8 @@ out: } static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, - int ib_port, enum ibv_mtu mtu, int port, + int ib_port, enum ibv_mtu mtu, + int port, int sl, const struct pingpong_dest *my_dest) { struct addrinfo *res, *t; @@ -247,7 +249,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, sscanf(msg, "%x:%x:%x", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn); - if (pp_connect_ctx(ctx, ib_port, my_dest->psn, mtu, rem_dest)) { + if (pp_connect_ctx(ctx, ib_port, my_dest->psn, mtu, sl, rem_dest)) { fprintf(stderr, "Couldn't connect to remote QP\n"); free(rem_dest); rem_dest = NULL; @@ -461,6 +463,7 @@ static void usage(const char *argv0) printf(" -m, --mtu= path MTU (default 1024)\n"); printf(" -r, --rx-depth= number of receives to post at a time (default 500)\n"); printf(" -n, --iters= number of exchanges (default 1000)\n"); + printf(" -l, --sl= service level value\n"); printf(" -e, --events sleep on CQ events (default poll)\n"); } @@ -484,6 +487,7 @@ int main(int argc, char *argv[]) int routs; int rcnt, scnt; int num_cq_events = 0; + int sl = 0; srand48(getpid() * time(NULL)); @@ -498,11 +502,12 @@ int main(int argc, char 
*argv[]) { .name = "mtu", .has_arg = 1, .val = 'm' }, { .name = "rx-depth", .has_arg = 1, .val = 'r' }, { .name = "iters", .has_arg = 1, .val = 'n' }, + { .name = "sl", .has_arg = 1, .val = 'l' }, { .name = "events", .has_arg = 0, .val = 'e' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:m:r:n:e", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:e", long_options, NULL); if (c == -1) break; @@ -546,6 +551,10 @@ int main(int argc, char *argv[]) iters = strtol(optarg, NULL, 0); break; + case 'l': + sl = strtol(optarg, NULL, 0); + break; + case 'e': ++use_event; break; @@ -619,7 +628,7 @@ int main(int argc, char *argv[]) if (servername) rem_dest = pp_client_exch_dest(servername, port, &my_dest); else - rem_dest = pp_server_exch_dest(ctx, ib_port, mtu, port, &my_dest); + rem_dest = pp_server_exch_dest(ctx, ib_port, mtu, port, sl, &my_dest); if (!rem_dest) return 1; @@ -628,7 +637,7 @@ int main(int argc, char *argv[]) rem_dest->lid, rem_dest->qpn, rem_dest->psn); if (servername) - if (pp_connect_ctx(ctx, ib_port, my_dest.psn, mtu, rem_dest)) + if (pp_connect_ctx(ctx, ib_port, my_dest.psn, mtu, sl, rem_dest)) return 1; ctx->pending = PINGPONG_RECV_WRID; diff --git a/examples/ud_pingpong.c b/examples/ud_pingpong.c index c631e25..981c503 100644 --- a/examples/ud_pingpong.c +++ b/examples/ud_pingpong.c @@ -77,13 +77,13 @@ struct pingpong_dest { }; static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn, - struct pingpong_dest *dest) + int sl, struct pingpong_dest *dest) { struct ibv_qp_attr attr; struct ibv_ah_attr ah_attr = { .is_global = 0, .dlid = dest->lid, - .sl = 0, + .sl = sl, .src_path_bits = 0, .port_num = port }; @@ -181,7 +181,7 @@ out: } static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, - int ib_port, int port, + int ib_port, int port, int sl, const struct pingpong_dest *my_dest) { struct addrinfo *res, *t; @@ -248,7 +248,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct 
pingpong_context *ctx, sscanf(msg, "%x:%x:%x", &rem_dest->lid, &rem_dest->qpn, &rem_dest->psn); - if (pp_connect_ctx(ctx, ib_port, my_dest->psn, rem_dest)) { + if (pp_connect_ctx(ctx, ib_port, my_dest->psn, sl, rem_dest)) { fprintf(stderr, "Couldn't connect to remote QP\n"); free(rem_dest); rem_dest = NULL; @@ -495,6 +495,7 @@ int main(int argc, char *argv[]) int routs; int rcnt, scnt; int num_cq_events = 0; + int sl = 0; srand48(getpid() * time(NULL)); @@ -508,11 +509,12 @@ int main(int argc, char *argv[]) { .name = "size", .has_arg = 1, .val = 's' }, { .name = "rx-depth", .has_arg = 1, .val = 'r' }, { .name = "iters", .has_arg = 1, .val = 'n' }, + { .name = "sl", .has_arg = 1, .val = 'l' }, { .name = "events", .has_arg = 0, .val = 'e' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:r:n:e", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:s:r:n:l:e", long_options, NULL); if (c == -1) break; @@ -549,6 +551,10 @@ int main(int argc, char *argv[]) iters = strtol(optarg, NULL, 0); break; + case 'l': + sl = strtol(optarg, NULL, 0); + break; + case 'e': ++use_event; break; @@ -622,7 +628,7 @@ int main(int argc, char *argv[]) if (servername) rem_dest = pp_client_exch_dest(servername, port, &my_dest); else - rem_dest = pp_server_exch_dest(ctx, ib_port, port, &my_dest); + rem_dest = pp_server_exch_dest(ctx, ib_port, port, sl, &my_dest); if (!rem_dest) return 1; @@ -631,7 +637,7 @@ int main(int argc, char *argv[]) rem_dest->lid, rem_dest->qpn, rem_dest->psn); if (servername) - if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) + if (pp_connect_ctx(ctx, ib_port, my_dest.psn, sl, rem_dest)) return 1; ctx->pending = PINGPONG_RECV_WRID;
From hrosenstock at xsigo.com Tue Oct 23 06:05:44 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 06:05:44 -0700 Subject: [ofa-general] [PATCH] infiniband-diags/scripts: Eliminate some duplicated messages Message-ID: <1193144744.18113.69.camel@hrosenstock-ws.xsigo.com> infiniband-diags/scripts: Eliminate some duplicated messages Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/scripts/ibcheckerrors.in b/infiniband-diags/scripts/ibcheckerrors.in index ebf44ec..cac2475 100644 --- a/infiniband-diags/scripts/ibcheckerrors.in +++ b/infiniband-diags/scripts/ibcheckerrors.in @@ -131,7 +131,7 @@ function check_node(lid) } } -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibcheckerrs.in b/infiniband-diags/scripts/ibcheckerrs.in index 1a2d228..305379a 100644 --- a/infiniband-diags/scripts/ibcheckerrs.in +++ b/infiniband-diags/scripts/ibcheckerrs.in @@ -188,7 +188,7 @@ BEGIN { /AllPortSelect/ {next} -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibchecknet.in b/infiniband-diags/scripts/ibchecknet.in index
a47ab8e..b6e0945 100644 --- a/infiniband-diags/scripts/ibchecknet.in +++ b/infiniband-diags/scripts/ibchecknet.in @@ -132,7 +132,7 @@ function check_node(lid) } } -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibcheckport.in b/infiniband-diags/scripts/ibcheckport.in index 94cfc6c..fa5e81e 100644 --- a/infiniband-diags/scripts/ibcheckport.in +++ b/infiniband-diags/scripts/ibcheckport.in @@ -116,7 +116,7 @@ function blue(s) #/^LocalPort/ { if ($2 != '$portnum') {err = err "#error: port " $2 " does not match query ('$portnum')\n"; exit -1}} -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibcheckportstate.in b/infiniband-diags/scripts/ibcheckportstate.in index 2931f06..47e72dc 100644 --- a/infiniband-diags/scripts/ibcheckportstate.in +++ b/infiniband-diags/scripts/ibcheckportstate.in @@ -108,7 +108,7 @@ function blue(s) /^LinkState/{ if ($2 != "Active") warn = warn "#warn: Logical link state is " $2 " lid '$lid' port '$portnum'\n"} -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibcheckportwidth.in b/infiniband-diags/scripts/ibcheckportwidth.in index 84f1ef7..32c5c5e 100644 --- a/infiniband-diags/scripts/ibcheckportwidth.in +++ b/infiniband-diags/scripts/ibcheckportwidth.in @@ -106,7 +106,7 @@ function blue(s) /^LinkWidthSupported/{ if ($2 != "1X") { next } } /^LinkWidthActive/{ if ($2 == "1X") warn = warn "#warn: Link configured as 1X lid '$lid' port '$portnum'\n"} -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibcheckstate.in b/infiniband-diags/scripts/ibcheckstate.in index 6ce0854..63551d5 100644 --- a/infiniband-diags/scripts/ibcheckstate.in +++ 
b/infiniband-diags/scripts/ibcheckstate.in @@ -122,7 +122,7 @@ function check_node(lid) } } -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibcheckwidth.in b/infiniband-diags/scripts/ibcheckwidth.in index f8f6a8b..6b723c5 100644 --- a/infiniband-diags/scripts/ibcheckwidth.in +++ b/infiniband-diags/scripts/ibcheckwidth.in @@ -122,7 +122,7 @@ function check_node(lid) } } -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibclearcounters.in b/infiniband-diags/scripts/ibclearcounters.in index 1818c42..0413d86 100644 --- a/infiniband-diags/scripts/ibclearcounters.in +++ b/infiniband-diags/scripts/ibclearcounters.in @@ -102,7 +102,7 @@ function clear_port_counters(lid, port) } } -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibclearerrors.in b/infiniband-diags/scripts/ibclearerrors.in index c63283a..930efa6 100644 --- a/infiniband-diags/scripts/ibclearerrors.in +++ b/infiniband-diags/scripts/ibclearerrors.in @@ -95,7 +95,7 @@ function clear_errors(lid, port) } } -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibdatacounters.in b/infiniband-diags/scripts/ibdatacounters.in index 902a865..7f0df1c 100644 --- a/infiniband-diags/scripts/ibdatacounters.in +++ b/infiniband-diags/scripts/ibdatacounters.in @@ -130,7 +130,7 @@ function check_node(lid) } } -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibdatacounts.in b/infiniband-diags/scripts/ibdatacounts.in index ccf9f34..4d8bfa1 100644 --- a/infiniband-diags/scripts/ibdatacounts.in +++ b/infiniband-diags/scripts/ibdatacounts.in @@ -132,7 +132,7 @@ 
function blue(s) /AllPortSelect/ {next} -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibhosts.in b/infiniband-diags/scripts/ibhosts.in index a287edf..92b2dff 100644 --- a/infiniband-diags/scripts/ibhosts.in +++ b/infiniband-diags/scripts/ibhosts.in @@ -52,7 +52,7 @@ rv=$? echo "$text" | awk ' /^Ca/ {print $1 "\t: 0x" substr($3, 4, 16) " ports " $2 " "\ substr($0, match($0, "#[ \t]*")+RLENGTH)} -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibrouters.in b/infiniband-diags/scripts/ibrouters.in index e053794..573ad0d 100644 --- a/infiniband-diags/scripts/ibrouters.in +++ b/infiniband-diags/scripts/ibrouters.in @@ -52,7 +52,7 @@ rv=$? echo "$text" | awk ' /^Rt/ {print $1 "\t: 0x" substr($3, 4, 16) " ports " $2 " "\ substr($0, match($0, "#[ \t]*")+RLENGTH)} -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0} diff --git a/infiniband-diags/scripts/ibswitches.in b/infiniband-diags/scripts/ibswitches.in index 0476d0e..59301f0 100644 --- a/infiniband-diags/scripts/ibswitches.in +++ b/infiniband-diags/scripts/ibswitches.in @@ -71,7 +71,7 @@ echo "$text" | awk ' else print $1 "\t: 0x" substr($3, 4, 16) " ports " $2 " "\ desc " " type " " pinfo} -/^ib/ {print $0} +/^ib/ {print $0; next} /ibpanic:/ {print $0} /ibwarn:/ {print $0} /iberror:/ {print $0}
From tziporet at mellanox.co.il Tue Oct 23 07:45:38 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 23 Oct 2007 16:45:38 +0200 Subject: [ofa-general] OFED October 22 meeting summary on OFED 1.3 alpha status and beta tasks Message-ID: <6C2C79E72C305246B504CBA17B5500C9015640A8@mtlexch01.mtl.com> OFED October 8 meeting summary on OFED 1.3 alpha status and beta tasks: Meeting summary: ================ 1. Alpha release status: * Cisco - tested only RHEL 4 & 5 on x86_64 systems * Qlogic - in general looks good * Intel - test Intel MPI on 16 nodes cluster on RHEL 5.1 * Voltaire - partial regression SLES10 SP1 Redhat 5; few on RHEL 4 up5 * Mellanox - regression tests pass on all HCAs. Tested SLES10, RHEL 4 up 4 & up5 , RHEL 5 * IBM - test mainly SLES10 SP1 on PPC; solve issues in ehca with their new HCA; see issues with 32 bits library 2.
MPI status: * MVAPICH - We wish to integrate the 1.0 code by the end of this week. In this way it will be ready for the OFED beta release next week - need DK approval * Open MPI - Open MPI v1.2.4 is sufficient for OFED 1.3. It contains a critical fix for ConnectX hardware that will greatly improve the point-to-point latency for small messages. Open MPI v1.3 is far enough away that we're not ready for anything newer to go into OFED 1.3. * MVAPICH2 - ready for the release. 3. Tasks that should be completed for the beta: 1. Integrate all SDP features - Jim (Mellanox) - will be completed this week 2. Complete RDS work - Vlad (Mellanox) - in progress 3. Apply patches that fix warnings from backport patches - Vlad (Mellanox) 4. Fix compilation problems on PPC with 32 bits - Vlad (Mellanox) - Nam please open a bug on this issue 5. Add qperf test from Qlogic - Johann (Qlogic) - in progress 6. Rebase kernel code on 2.6.24 rc1 (depending on its availability) Will require changes in the bonding module too. 7. Support RHEL 5 up1 - Woody & Vlad - done 8. SPEC files should be part of each user space package - each owner should take the spec file 9. Multiple uDAPL libs (1.0 & 2.0) 10. iSER - update the open iscsi package 11. nes - need to update some backport patches (Gleb) * Beta target date will be decided in next week's meeting based on the progress of the above 4. OFED meetings: * Starting next week we will move to weekly meetings to better track the release progress * Jeff S. will send the meeting schedule and details 5. NFS-RDMA: * NFS-RDMA will be part of Linux kernel 2.6.24 * Thus it will be in OFED 1.3 kernel, but we need backport patches to support the distros * Bill Boas will speak to OGC (Tom Tucker and Steve Wise) to see if this can be done for the beta, at least for the latest OSes (e.g. RHEL5, SLES10 SP1) 6. Developers' summit: * Everyone is requested to review the agenda Johann will publish and to make sure all subjects we wish to discuss are covered.
From HNGUYEN at de.ibm.com Tue Oct 23 08:01:33 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Tue, 23 Oct 2007 17:01:33 +0200 Subject: [ofa-general] Re: [ewg] OFED October 22 meeting summary on OFED 1.3 alpha status and beta tasks In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9015640A8@mtlexch01.mtl.com> Message-ID: Hi, > 3. Tasks that should completed for the beta: > 4. Fix compilation problems on PPC with 32 bits - Vlad > (Mellanox) - > Nam please open a bug on this issue Stefan has created #746 "Installation of 32-bit libibverbs failed". @Vlad, since we'll have rpm specs for user space with the beta, would it be better to tackle this with rpm specs? For libibverbs, is it libibverbs.spec.in we need to look at? Thanks Nam From vlad at dev.mellanox.co.il Tue Oct 23 08:10:55 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 23 Oct 2007 17:10:55 +0200 Subject: [ofa-general] Re: [ewg] OFED October 22 meeting summary on OFED 1.3 alpha status and beta tasks In-Reply-To: References: Message-ID: <471E0EFF.5040604@dev.mellanox.co.il> Hoang-Nam Nguyen wrote: > Hi, >> 3. Tasks that should completed for the beta: >> 4. Fix compilation problems on PPC with 32 bits - Vlad >> (Mellanox) - >> Nam please open a bug on this issue > Stefan has created #746 "Installation of 32-bit libibverbs failed". > @Vlad, since we'll have rpm spec for user space with beta, would it > better to tackle this with rpm specs? For libibverbs is it > libibverbs.spec.in we need to look at? > Thanks > Nam > Hi Nam, Please recheck this issue with the latest OFED-1.3 build. If this issue still exists, then it is probably an install.pl issue.
Please update me, Thanks, Vladimir From rdreier at cisco.com Tue Oct 23 09:02:06 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 23 Oct 2007 09:02:06 -0700 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: <200710231058.26784.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 23 Oct 2007 10:58:26 +0200") References: <200710221013.40112.jackm@dev.mellanox.co.il> <200710231043.16411.jackm@dev.mellanox.co.il> <20071023084004.GC2667@minantech.com> <200710231058.26784.jackm@dev.mellanox.co.il> Message-ID: > The macro is currently being used in libmlx4 (src/qp.c). That's what started > the thread. Oh, I misunderstood completely. __always_inline is a kernel-only thing. I'll fix up libmlx4. - R. From eli at mellanox.co.il Tue Oct 23 09:07:46 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 23 Oct 2007 18:07:46 +0200 Subject: [ofa-general] [PATCH] IB/ipoib: IPOIB CM rx use higher order fragments Message-ID: <1193155667.25235.4.camel@mtls03> IPOIB CM rx use higher order fragments In order to reduce the overhead of iterating the fragments of an SKB in the receive flow, we use fragments of higher order and thus reduce the number of iterations. This patch seems to improve receive throughput of small UDP messages.
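[Editorial illustration, not part of the original message.] To make the reduction concrete, the sketch below redoes the scatter/gather arithmetic from the enum changes in this patch. It assumes 4 KiB pages (PAGE_SIZE is architecture-dependent, so other systems will give different numbers); the EX_ macros and the helper functions are hypothetical names used only for this example and mirror, but are not, the kernel's IPOIB_ constants and ALIGN() macro.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. The remaining values are the ones quoted in the patch. */
#define EX_PAGE_SIZE    ((size_t)4096)
#define EX_FRAG_ORDER   2                                  /* IPOIB_FRAG_ORDER */
#define EX_FRAG_SIZE    (EX_PAGE_SIZE << EX_FRAG_ORDER)    /* 16 KiB fragments */
#define EX_ENCAP_LEN    ((size_t)4)                        /* IPOIB_ENCAP_LEN */
#define EX_CM_MTU       ((size_t)(0x10000 - 0x10))         /* IPOIB_CM_MTU */
#define EX_CM_BUF_SIZE  (EX_CM_MTU + EX_ENCAP_LEN)
#define EX_CM_HEAD_SIZE (EX_CM_BUF_SIZE % EX_PAGE_SIZE)

/* Round x up to a multiple of a; a must be a power of two (hypothetical helper). */
size_t align_up(size_t x, size_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* Scatter/gather entries per receive before the patch: one per page. */
size_t rx_sg_before(void)
{
    return align_up(EX_CM_BUF_SIZE, EX_PAGE_SIZE) / EX_PAGE_SIZE;
}

/* After the patch: one head entry plus order-2 fragments for the rest. */
size_t rx_sg_after(void)
{
    return 1 + align_up(EX_CM_BUF_SIZE - EX_CM_HEAD_SIZE, EX_FRAG_SIZE) / EX_FRAG_SIZE;
}

/* Fragments needed for a completion of byte_len bytes, before the patch. */
size_t frags_before(size_t byte_len)
{
    size_t head = byte_len < EX_CM_HEAD_SIZE ? byte_len : EX_CM_HEAD_SIZE;
    return align_up(byte_len - head, EX_PAGE_SIZE) / EX_PAGE_SIZE;
}

/* The same computation with the larger fragment size used after the patch. */
size_t frags_after(size_t byte_len)
{
    size_t head = byte_len < EX_CM_HEAD_SIZE ? byte_len : EX_CM_HEAD_SIZE;
    return align_up(byte_len - head, EX_FRAG_SIZE) / EX_FRAG_SIZE;
}
```

Under these assumptions the receive work request shrinks from 16 scatter/gather entries to 5, and a full-size message needs 4 fragments instead of 15; a message that fits in the head buffer needs no fragments in either scheme.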
Signed-off-by: Eli Cohen --- I used the following command line to see the improvement: netperf -H 12.4.3.175 -t UDP_STREAM -- -m 128 drivers/infiniband/ulp/ipoib/ipoib.h | 5 ++++- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 18 +++++++++--------- 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 0a00ea0..6cf14ff 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -57,6 +57,8 @@ enum { IPOIB_PACKET_SIZE = 2048, + IPOIB_FRAG_ORDER = 2, + IPOIB_FRAG_SIZE = PAGE_SIZE << IPOIB_FRAG_ORDER, IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, IPOIB_ENCAP_LEN = 4, @@ -64,7 +66,8 @@ enum { IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */ IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN, IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE, - IPOIB_CM_RX_SG = ALIGN(IPOIB_CM_BUF_SIZE, PAGE_SIZE) / PAGE_SIZE, + IPOIB_CM_RX_SG = 1 + ALIGN(IPOIB_CM_BUF_SIZE - IPOIB_CM_HEAD_SIZE, + IPOIB_FRAG_SIZE) / IPOIB_FRAG_SIZE, IPOIB_RX_RING_SIZE = 128, IPOIB_TX_RING_SIZE = 64, IPOIB_MAX_QUEUE_SIZE = 8192, diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 8761077..5fee3c6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -78,7 +78,7 @@ static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, int frags, ib_dma_unmap_single(priv->ca, mapping[0], IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); for (i = 0; i < frags; ++i) - ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); + ib_dma_unmap_single(priv->ca, mapping[i + 1], IPOIB_FRAG_SIZE, DMA_FROM_DEVICE); } static int ipoib_cm_post_receive(struct net_device *dev, int id) @@ -129,14 +129,14 @@ static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int } for (i = 0; i < frags; i++) { - struct page *page = alloc_page(GFP_ATOMIC); + struct page *page = 
alloc_pages(GFP_ATOMIC | __GFP_COMP, IPOIB_FRAG_ORDER); if (!page) goto partial_error; - skb_fill_page_desc(skb, i, page, 0, PAGE_SIZE); + skb_fill_page_desc(skb, i, page, 0, IPOIB_FRAG_SIZE); mapping[i + 1] = ib_dma_map_page(priv->ca, skb_shinfo(skb)->frags[i].page, - 0, PAGE_SIZE, DMA_FROM_DEVICE); + 0, IPOIB_FRAG_SIZE, DMA_FROM_DEVICE); if (unlikely(ib_dma_mapping_error(priv->ca, mapping[i + 1]))) goto partial_error; } @@ -384,10 +384,10 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, if (length == 0) { /* don't need this page */ - skb_fill_page_desc(toskb, i, frag->page, 0, PAGE_SIZE); + skb_fill_page_desc(toskb, i, frag->page, 0, IPOIB_FRAG_SIZE); --skb_shinfo(skb)->nr_frags; } else { - size = min(length, (unsigned) PAGE_SIZE); + size = min(length, (unsigned) IPOIB_FRAG_SIZE); frag->size = size; skb->data_len += size; @@ -447,8 +447,8 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) } } - frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, - (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + frags = ALIGN(wc->byte_len - min(wc->byte_len, + (unsigned)IPOIB_CM_HEAD_SIZE), IPOIB_FRAG_SIZE) / IPOIB_FRAG_SIZE; newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); if (unlikely(!newskb)) { @@ -1302,7 +1302,7 @@ int ipoib_cm_dev_init(struct net_device *dev) priv->cm.rx_sge[0].length = IPOIB_CM_HEAD_SIZE; for (i = 1; i < IPOIB_CM_RX_SG; ++i) - priv->cm.rx_sge[i].length = PAGE_SIZE; + priv->cm.rx_sge[i].length = IPOIB_FRAG_SIZE; priv->cm.rx_wr.next = NULL; priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; -- 1.5.3.4 From erezz at Voltaire.COM Tue Oct 23 09:17:22 2007 From: erezz at Voltaire.COM (Erez Zilber) Date: Tue, 23 Oct 2007 18:17:22 +0200 Subject: [ofa-general] OFED October 22 meeting summary on OFED 1.3 alphastatus and beta tasks In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9015640A8@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9015640A8@mtlexch01.mtl.com> 
Message-ID: <471E1E92.1050100@Voltaire.COM> Tziporet Koren wrote: > > OFED October 8 meeting summary on OFED 1.3 alpha status and beta tasks: > > Meeting summary: > ================ > 1. Alpha release status: > * Cisco - tested only RHEL 4 & 5 on x86_64 systems > * Qlogic - in general looks good > * Intel - test Intel MPI on 16 nodes cluster on RHEL 5.1 > * Voltaire - partial regression SLES10 SP1 Redhat 5; few on RHEL 4 up5 > * Mellanox - regression tests pass on all HCAs. Tested SLES10, RHEL 4 up > 4 & up5 , RHEL 5 > * IBM - test mainly SLES10 SP1 on PPC; solve issues in ehca with their > new HCA; > see issues with 32 bits library > > 2. MPI status: > * MVAPICH - We wish to integrate the 1.0 code by the end of this week. > In this way it will be ready for the OFED beta release next week - need > DK approval > * Open MPI - Open MPI v1.2.4 is sufficient for OFED 1.3. It contains a > critical fix for ConnectX hardware that will greatly improve the > point-to-point latency for small messages. > Open MPI v1.3 is far enough away that we're not ready for anything newer > to go into OFED 1.3. > * VMAPICH 2 - ready for the release. > > 3. Tasks that should completed for the beta: > 1. Integrate all SDP features - Jim (Mellanox) - will be > completed this week > 2. Complete RDS work - Vlad (Mellanox) - on work > 3. Apply patches that fix warning of backport patches - Vlad > (Mellanox) > 4. Fix compilation problems on PPC with 32 bits - Vlad > (Mellanox) - > Nam please open a bug on this issue > 5. Add qperf test from Qlogic - Johann (Qlogic) - on work > 6. Rebase kernel code on 2.6.24 rc1 (depending it's > availability) > Will require changes in the bonding module too. > 7. Support RHEL 5 up1 - Woody & Vlad - done > 8. SPEC files should be part of each user space package - each > owner should take the spec file > 9. Multiple uDAPL libs (1.0 & 2.0) > 10. iSER - update the open iscsi package > Done > 11. 
nes - need to update some backport patches (Gleb) > > * Beta target date will be decided in next week meeting based on the > progress of the above > > > 4. OFED meetings: > * Starting next week we will move to weekly meetings to have a better > tracking for the release progress > * Jeff S. will send the meeting schedule and details > > 5. NFS-RDMA: > * NFS-RDMA will be part of Linux kernel 2.6.24 > * Thus it will be in OFED 1.3 kernel, but we need backport patches to > support the distros > * Bill Boas will speak to OGC (Tom Tucker and Steve Wise) to see if this > can be done for the beta, at least for the latest OSes (e.g. RHEL5, > SLES10 SP1) > > 6. Developers' summit: > * All people requested to review the agenda Johann will publish and make > sure all subjects we wish to discuss are covered. From rdreier at cisco.com Tue Oct 23 09:30:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 23 Oct 2007 09:30:33 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus this will get some more fixes/changes for 2.6.24. I have one more IPoIB feature (support for CM without SRQs) I hope to send later today, but we'll see...
Anton Blanchard (1): IPoIB: Use round_jiffies() for ah_reap_task Jack Morgenstein (2): IB/mlx4: Sanity check userspace send queue sizes mlx4_core: Increase command timeout for INIT_HCA to 10 seconds Joachim Fenkes (5): IB/ehca: Supply QP token for SRQ base QPs IB/ehca: Fix masking error in {,re}reg_phys_mr() IB/ehca: Fix ehca_encode_hwpage_size() and alloc_fmr() IB/ehca: Change meaning of hca_cap_mr_pgsize IB/ehca: Enable large page MRs by default Michael S. Tsirkin (1): IPoIB/cm: Use common CQ for CM send completions Roland Dreier (4): mlx4_core: Kill mlx4_write64_raw() IB/mthca: Avoid alignment traps when writing doorbells IPoIB: Rewrite "if (!likely(...))" as "if (unlikely(!(...)))" IB/uverbs: Fix checking of userspace object ownership Sean Hefty (2): RDMA/cma: Add locking around QP accesses RDMA/cma: Fix deadlock destroying listen requests drivers/infiniband/core/cma.c | 160 +++++++++++++------------ drivers/infiniband/core/uverbs_cmd.c | 8 +- drivers/infiniband/hw/ehca/ehca_classes.h | 1 - drivers/infiniband/hw/ehca/ehca_hca.c | 1 + drivers/infiniband/hw/ehca/ehca_main.c | 20 +++- drivers/infiniband/hw/ehca/ehca_mrmw.c | 57 ++++----- drivers/infiniband/hw/ehca/ehca_qp.c | 4 +- drivers/infiniband/hw/mlx4/qp.c | 16 +++- drivers/infiniband/hw/mthca/mthca_cq.c | 53 +++------ drivers/infiniband/hw/mthca/mthca_doorbell.h | 13 ++- drivers/infiniband/hw/mthca/mthca_eq.c | 21 +--- drivers/infiniband/hw/mthca/mthca_qp.c | 45 +++----- drivers/infiniband/hw/mthca/mthca_srq.c | 11 +-- drivers/infiniband/ulp/ipoib/ipoib.h | 15 ++- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 114 ++++++++----------- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 52 +++++---- drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 +- drivers/net/mlx4/fw.c | 2 +- include/linux/mlx4/doorbell.h | 11 -- 19 files changed, 284 insertions(+), 324 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 93644f8..ee946cc 100644 --- a/drivers/infiniband/core/cma.c +++ 
b/drivers/infiniband/core/cma.c @@ -114,13 +114,16 @@ struct rdma_id_private { struct rdma_bind_list *bind_list; struct hlist_node node; - struct list_head list; - struct list_head listen_list; + struct list_head list; /* listen_any_list or cma_device.list */ + struct list_head listen_list; /* per device listens */ struct cma_device *cma_dev; struct list_head mc_list; + int internal_id; enum cma_state state; spinlock_t lock; + struct mutex qp_mutex; + struct completion comp; atomic_t refcount; wait_queue_head_t wait_remove; @@ -389,6 +392,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, id_priv->id.event_handler = event_handler; id_priv->id.ps = ps; spin_lock_init(&id_priv->lock); + mutex_init(&id_priv->qp_mutex); init_completion(&id_priv->comp); atomic_set(&id_priv->refcount, 1); init_waitqueue_head(&id_priv->wait_remove); @@ -474,61 +478,86 @@ EXPORT_SYMBOL(rdma_create_qp); void rdma_destroy_qp(struct rdma_cm_id *id) { - ib_destroy_qp(id->qp); + struct rdma_id_private *id_priv; + + id_priv = container_of(id, struct rdma_id_private, id); + mutex_lock(&id_priv->qp_mutex); + ib_destroy_qp(id_priv->id.qp); + id_priv->id.qp = NULL; + mutex_unlock(&id_priv->qp_mutex); } EXPORT_SYMBOL(rdma_destroy_qp); -static int cma_modify_qp_rtr(struct rdma_cm_id *id) +static int cma_modify_qp_rtr(struct rdma_id_private *id_priv) { struct ib_qp_attr qp_attr; int qp_attr_mask, ret; - if (!id->qp) - return 0; + mutex_lock(&id_priv->qp_mutex); + if (!id_priv->id.qp) { + ret = 0; + goto out; + } /* Need to update QP attributes from default values. 
*/ qp_attr.qp_state = IB_QPS_INIT; - ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); if (ret) - return ret; + goto out; - ret = ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); if (ret) - return ret; + goto out; qp_attr.qp_state = IB_QPS_RTR; - ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); if (ret) - return ret; + goto out; - return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); +out: + mutex_unlock(&id_priv->qp_mutex); + return ret; } -static int cma_modify_qp_rts(struct rdma_cm_id *id) +static int cma_modify_qp_rts(struct rdma_id_private *id_priv) { struct ib_qp_attr qp_attr; int qp_attr_mask, ret; - if (!id->qp) - return 0; + mutex_lock(&id_priv->qp_mutex); + if (!id_priv->id.qp) { + ret = 0; + goto out; + } qp_attr.qp_state = IB_QPS_RTS; - ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(&id_priv->id, &qp_attr, &qp_attr_mask); if (ret) - return ret; + goto out; - return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); +out: + mutex_unlock(&id_priv->qp_mutex); + return ret; } -static int cma_modify_qp_err(struct rdma_cm_id *id) +static int cma_modify_qp_err(struct rdma_id_private *id_priv) { struct ib_qp_attr qp_attr; + int ret; - if (!id->qp) - return 0; + mutex_lock(&id_priv->qp_mutex); + if (!id_priv->id.qp) { + ret = 0; + goto out; + } qp_attr.qp_state = IB_QPS_ERR; - return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, IB_QP_STATE); +out: + mutex_unlock(&id_priv->qp_mutex); + return ret; } static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv, @@ -717,50 +746,27 @@ static void cma_cancel_route(struct rdma_id_private *id_priv) } } -static inline int 
cma_internal_listen(struct rdma_id_private *id_priv) -{ - return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev && - cma_any_addr(&id_priv->id.route.addr.src_addr); -} - -static void cma_destroy_listen(struct rdma_id_private *id_priv) -{ - cma_exch(id_priv, CMA_DESTROYING); - - if (id_priv->cma_dev) { - switch (rdma_node_get_transport(id_priv->id.device->node_type)) { - case RDMA_TRANSPORT_IB: - if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) - ib_destroy_cm_id(id_priv->cm_id.ib); - break; - case RDMA_TRANSPORT_IWARP: - if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) - iw_destroy_cm_id(id_priv->cm_id.iw); - break; - default: - break; - } - cma_detach_from_dev(id_priv); - } - list_del(&id_priv->listen_list); - - cma_deref_id(id_priv); - wait_for_completion(&id_priv->comp); - - kfree(id_priv); -} - static void cma_cancel_listens(struct rdma_id_private *id_priv) { struct rdma_id_private *dev_id_priv; + /* + * Remove from listen_any_list to prevent added devices from spawning + * additional listen requests. 
+ */ mutex_lock(&lock); list_del(&id_priv->list); while (!list_empty(&id_priv->listen_list)) { dev_id_priv = list_entry(id_priv->listen_list.next, struct rdma_id_private, listen_list); - cma_destroy_listen(dev_id_priv); + /* sync with device removal to avoid duplicate destruction */ + list_del_init(&dev_id_priv->list); + list_del(&dev_id_priv->listen_list); + mutex_unlock(&lock); + + rdma_destroy_id(&dev_id_priv->id); + mutex_lock(&lock); } mutex_unlock(&lock); } @@ -848,6 +854,9 @@ void rdma_destroy_id(struct rdma_cm_id *id) cma_deref_id(id_priv); wait_for_completion(&id_priv->comp); + if (id_priv->internal_id) + cma_deref_id(id_priv->id.context); + kfree(id_priv->id.route.path_rec); kfree(id_priv); } @@ -857,11 +866,11 @@ static int cma_rep_recv(struct rdma_id_private *id_priv) { int ret; - ret = cma_modify_qp_rtr(&id_priv->id); + ret = cma_modify_qp_rtr(id_priv); if (ret) goto reject; - ret = cma_modify_qp_rts(&id_priv->id); + ret = cma_modify_qp_rts(id_priv); if (ret) goto reject; @@ -871,7 +880,7 @@ static int cma_rep_recv(struct rdma_id_private *id_priv) return 0; reject: - cma_modify_qp_err(&id_priv->id); + cma_modify_qp_err(id_priv); ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; @@ -947,7 +956,7 @@ static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) /* ignore event */ goto out; case IB_CM_REJ_RECEIVED: - cma_modify_qp_err(&id_priv->id); + cma_modify_qp_err(id_priv); event.status = ib_event->param.rej_rcvd.reason; event.event = RDMA_CM_EVENT_REJECTED; event.param.conn.private_data = ib_event->private_data; @@ -1404,14 +1413,13 @@ static void cma_listen_on_dev(struct rdma_id_private *id_priv, cma_attach_to_dev(dev_id_priv, cma_dev); list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list); + atomic_inc(&id_priv->refcount); + dev_id_priv->internal_id = 1; ret = rdma_listen(id, id_priv->backlog); if (ret) - goto err; - - return; -err: - cma_destroy_listen(dev_id_priv); + 
printk(KERN_WARNING "RDMA CMA: cma_listen_on_dev, error %d, "
+		       "listening on device %s", ret, cma_dev->device->name);
 }
 
 static void cma_listen_on_all(struct rdma_id_private *id_priv)
@@ -2264,7 +2272,7 @@ static int cma_connect_iw(struct rdma_id_private *id_priv,
 	sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr;
 	cm_id->remote_addr = *sin;
 
-	ret = cma_modify_qp_rtr(&id_priv->id);
+	ret = cma_modify_qp_rtr(id_priv);
 	if (ret)
 		goto out;
 
@@ -2331,7 +2339,7 @@ static int cma_accept_ib(struct rdma_id_private *id_priv,
 	int qp_attr_mask, ret;
 
 	if (id_priv->id.qp) {
-		ret = cma_modify_qp_rtr(&id_priv->id);
+		ret = cma_modify_qp_rtr(id_priv);
 		if (ret)
 			goto out;
 
@@ -2370,7 +2378,7 @@ static int cma_accept_iw(struct rdma_id_private *id_priv,
 	struct iw_cm_conn_param iw_param;
 	int ret;
 
-	ret = cma_modify_qp_rtr(&id_priv->id);
+	ret = cma_modify_qp_rtr(id_priv);
 	if (ret)
 		return ret;
 
@@ -2442,7 +2450,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 
 	return 0;
 reject:
-	cma_modify_qp_err(id);
+	cma_modify_qp_err(id_priv);
 	rdma_reject(id, NULL, 0);
 	return ret;
 }
@@ -2512,7 +2520,7 @@ int rdma_disconnect(struct rdma_cm_id *id)
 
 	switch (rdma_node_get_transport(id->device->node_type)) {
 	case RDMA_TRANSPORT_IB:
-		ret = cma_modify_qp_err(id);
+		ret = cma_modify_qp_err(id_priv);
 		if (ret)
 			goto out;
 		/* Initiate or respond to a disconnect. */
@@ -2543,9 +2551,11 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast)
 	    cma_disable_remove(id_priv, CMA_ADDR_RESOLVED))
 		return 0;
 
+	mutex_lock(&id_priv->qp_mutex);
 	if (!status && id_priv->id.qp)
 		status = ib_attach_mcast(id_priv->id.qp, &multicast->rec.mgid,
 					 multicast->rec.mlid);
+	mutex_unlock(&id_priv->qp_mutex);
 
 	memset(&event, 0, sizeof event);
 	event.status = status;
@@ -2757,16 +2767,12 @@ static void cma_process_remove(struct cma_device *cma_dev)
 		id_priv = list_entry(cma_dev->id_list.next,
 				     struct rdma_id_private, list);
 
-		if (cma_internal_listen(id_priv)) {
-			cma_destroy_listen(id_priv);
-			continue;
-		}
-
+		list_del(&id_priv->listen_list);
 		list_del_init(&id_priv->list);
 		atomic_inc(&id_priv->refcount);
 		mutex_unlock(&lock);
 
-		ret = cma_remove_id_dev(id_priv);
+		ret = id_priv->internal_id ? 1 : cma_remove_id_dev(id_priv);
 		cma_deref_id(id_priv);
 		if (ret)
 			rdma_destroy_id(&id_priv->id);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 01d7008..495c803 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -147,8 +147,12 @@ static struct ib_uobject *__idr_get_uobj(struct idr *idr, int id,
 
 	spin_lock(&ib_uverbs_idr_lock);
 	uobj = idr_find(idr, id);
-	if (uobj)
-		kref_get(&uobj->ref);
+	if (uobj) {
+		if (uobj->context == context)
+			kref_get(&uobj->ref);
+		else
+			uobj = NULL;
+	}
 	spin_unlock(&ib_uverbs_idr_lock);
 
 	return uobj;
diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 0f7a55d..365bc5d 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -323,7 +323,6 @@ extern int ehca_static_rate;
 extern int ehca_port_act_time;
 extern int ehca_use_hp_mr;
 extern int ehca_scaling_code;
-extern int ehca_mr_largepage;
 
 struct ipzu_queue_resp {
 	u32 qe_size; /* queue entry size */
diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index 4aa3ffa..15806d1 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -77,6 +77,7 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 	}
 
 	memset(props, 0, sizeof(struct ib_device_attr));
+	props->page_size_cap = shca->hca_cap_mr_pgsize;
 	props->fw_ver = rblock->hw_ver;
 	props->max_mr_size = rblock->max_mr_size;
 	props->vendor_id = rblock->vendor_id >> 8;
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 403467f..2f51c13 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -65,7 +65,7 @@ int ehca_port_act_time = 30;
 int ehca_poll_all_eqs = 1;
 int ehca_static_rate = -1;
 int ehca_scaling_code = 0;
-int ehca_mr_largepage = 0;
+int ehca_mr_largepage = 1;
 
 module_param_named(open_aqp1, ehca_open_aqp1, int, S_IRUGO);
 module_param_named(debug_level, ehca_debug_level, int, S_IRUGO);
@@ -260,13 +260,20 @@ static struct cap_descr {
 	{ HCA_CAP_MINI_QP, "HCA_CAP_MINI_QP" },
 };
 
-int ehca_sense_attributes(struct ehca_shca *shca)
+static int ehca_sense_attributes(struct ehca_shca *shca)
 {
 	int i, ret = 0;
 	u64 h_ret;
 	struct hipz_query_hca *rblock;
 	struct hipz_query_port *port;
 
+	static const u32 pgsize_map[] = {
+		HCA_CAP_MR_PGSIZE_4K, 0x1000,
+		HCA_CAP_MR_PGSIZE_64K, 0x10000,
+		HCA_CAP_MR_PGSIZE_1M, 0x100000,
+		HCA_CAP_MR_PGSIZE_16M, 0x1000000,
+	};
+
 	rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!rblock) {
 		ehca_gen_err("Cannot allocate rblock memory.");
@@ -329,8 +336,15 @@ int ehca_sense_attributes(struct ehca_shca *shca)
 		if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap))
 			ehca_gen_dbg(" %s", hca_cap_descr[i].descr);
 
-	shca->hca_cap_mr_pgsize = rblock->memory_page_size_supported;
+	/* translate supported MR page sizes; always support 4K */
+	shca->hca_cap_mr_pgsize = EHCA_PAGESIZE;
+	if (ehca_mr_largepage) { /* support extra sizes only if enabled */
+		for (i = 0; i < ARRAY_SIZE(pgsize_map); i += 2)
+			if (rblock->memory_page_size_supported & pgsize_map[i])
+				shca->hca_cap_mr_pgsize |= pgsize_map[i + 1];
+	}
 
+	/* query max MTU from first port -- it's the same for all ports */
 	port = (struct hipz_query_port *)rblock;
 	h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port);
 	if (h_ret != H_SUCCESS) {
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index da88738..bb97915 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -72,24 +72,14 @@ enum ehca_mr_pgsize {
 
 static u32 ehca_encode_hwpage_size(u32 pgsize)
 {
-	u32 idx = 0;
-	pgsize >>= 12;
-	/*
-	 * map mr page size into hw code:
-	 * 0, 1, 2, 3 for 4K, 64K, 1M, 64M
-	 */
-	while (!(pgsize & 1)) {
-		idx++;
-		pgsize >>= 4;
-	}
-	return idx;
+	int log = ilog2(pgsize);
+	WARN_ON(log < 12 || log > 24 || log & 3);
+	return (log - 12) / 4;
 }
 
 static u64 ehca_get_max_hwpage_size(struct ehca_shca *shca)
 {
-	if (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)
-		return EHCA_MR_PGSIZE16M;
-	return EHCA_MR_PGSIZE4K;
+	return 1UL << ilog2(shca->hca_cap_mr_pgsize);
 }
 
 static struct ehca_mr *ehca_mr_new(void)
@@ -259,7 +249,7 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd,
 		pginfo.u.phy.num_phys_buf = num_phys_buf;
 		pginfo.u.phy.phys_buf_array = phys_buf_array;
 		pginfo.next_hwpage =
-			((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize;
+			((u64)iova_start & ~PAGE_MASK) / hw_pgsize;
 
 		ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags,
 				  e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
@@ -296,7 +286,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		container_of(pd->device, struct ehca_shca, ib_device);
 	struct ehca_pd *e_pd = container_of(pd, struct ehca_pd, ib_pd);
 	struct ehca_mr_pginfo pginfo;
-	int ret;
+	int ret, page_shift;
 	u32 num_kpages;
 	u32 num_hwpages;
 	u64 hwpage_size;
@@ -351,19 +341,20 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	/* determine number of MR pages */
 	num_kpages = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE);
 	/* select proper hw_pgsize */
-	if (ehca_mr_largepage &&
-	    (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)) {
-		int page_shift = PAGE_SHIFT;
-		if (e_mr->umem->hugetlb) {
-			/* determine page_shift, clamp between 4K and 16M */
-			page_shift = (fls64(length - 1) + 3) & ~3;
-			page_shift = min(max(page_shift, EHCA_MR_PGSHIFT4K),
-					 EHCA_MR_PGSHIFT16M);
-		}
-		hwpage_size = 1UL << page_shift;
-	} else
-		hwpage_size = EHCA_MR_PGSIZE4K; /* ehca1 only supports 4k */
-	ehca_dbg(pd->device, "hwpage_size=%lx", hwpage_size);
+	page_shift = PAGE_SHIFT;
+	if (e_mr->umem->hugetlb) {
+		/* determine page_shift, clamp between 4K and 16M */
+		page_shift = (fls64(length - 1) + 3) & ~3;
+		page_shift = min(max(page_shift, EHCA_MR_PGSHIFT4K),
+				 EHCA_MR_PGSHIFT16M);
+	}
+	hwpage_size = 1UL << page_shift;
+
+	/* now that we have the desired page size, shift until it's
+	 * supported, too. 4K is always supported, so this terminates.
+	 */
+	while (!(hwpage_size & shca->hca_cap_mr_pgsize))
+		hwpage_size >>= 4;
 
 reg_user_mr_fallback:
 	num_hwpages = NUM_CHUNKS((virt % hwpage_size) + length, hwpage_size);
@@ -547,7 +538,7 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 		pginfo.u.phy.num_phys_buf = num_phys_buf;
 		pginfo.u.phy.phys_buf_array = phys_buf_array;
 		pginfo.next_hwpage =
-			((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize;
+			((u64)iova_start & ~PAGE_MASK) / hw_pgsize;
 	}
 	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
 		new_acl = mr_access_flags;
@@ -809,8 +800,9 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 		ib_fmr = ERR_PTR(-EINVAL);
 		goto alloc_fmr_exit0;
 	}
-	hw_pgsize = ehca_get_max_hwpage_size(shca);
-	if ((1 << fmr_attr->page_shift) != hw_pgsize) {
+
+	hw_pgsize = 1 << fmr_attr->page_shift;
+	if (!(hw_pgsize & shca->hca_cap_mr_pgsize)) {
 		ehca_err(pd->device, "unsupported fmr_attr->page_shift=%x",
 			 fmr_attr->page_shift);
 		ib_fmr = ERR_PTR(-EINVAL);
@@ -826,6 +818,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 
 	/* register MR on HCA */
 	memset(&pginfo, 0, sizeof(pginfo));
+	pginfo.hwpage_size = hw_pgsize;
 	/*
 	 * pginfo.num_hwpages==0, ie register_rpages() will not be called
 	 * but deferred to map_phys_fmr()
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index e2bd62b..de18264 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -451,7 +451,6 @@ static struct ehca_qp *internal_create_qp(
 		has_srq = 1;
 		parms.ext_type = EQPT_SRQBASE;
 		parms.srq_qpn = my_srq->real_qp_num;
-		parms.srq_token = my_srq->token;
 	}
 
 	if (is_llqp && has_srq) {
@@ -583,6 +582,9 @@ static struct ehca_qp *internal_create_qp(
 		goto create_qp_exit1;
 	}
 
+	if (has_srq)
+		parms.srq_token = my_qp->token;
+
 	parms.servicetype = ibqptype2servicetype(qp_type);
 	if (parms.servicetype < 0) {
 		ret = -EINVAL;
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 31a480e..6b33224 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -63,6 +63,10 @@ struct mlx4_ib_sqp {
 	u8 header_buf[MLX4_IB_UD_HEADER_SIZE];
 };
 
+enum {
+	MLX4_IB_MIN_SQ_STRIDE = 6
+};
+
 static const __be32 mlx4_ib_opcode[] = {
 	[IB_WR_SEND] = __constant_cpu_to_be32(MLX4_OPCODE_SEND),
 	[IB_WR_SEND_WITH_IMM] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_IMM),
@@ -285,9 +289,17 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
 	return 0;
 }
 
-static int set_user_sq_size(struct mlx4_ib_qp *qp,
+static int set_user_sq_size(struct mlx4_ib_dev *dev,
+			    struct mlx4_ib_qp *qp,
 			    struct mlx4_ib_create_qp *ucmd)
 {
+	/* Sanity check SQ size before proceeding */
+	if ((1 << ucmd->log_sq_bb_count) > dev->dev->caps.max_wqes ||
+	    ucmd->log_sq_stride >
+		ilog2(roundup_pow_of_two(dev->dev->caps.max_sq_desc_sz)) ||
+	    ucmd->log_sq_stride < MLX4_IB_MIN_SQ_STRIDE)
+		return -EINVAL;
+
 	qp->sq.wqe_cnt = 1 << ucmd->log_sq_bb_count;
 	qp->sq.wqe_shift = ucmd->log_sq_stride;
 
@@ -330,7 +342,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 
 		qp->sq_no_prefetch = ucmd.sq_no_prefetch;
 
-		err = set_user_sq_size(qp, &ucmd);
+		err = set_user_sq_size(dev, qp, &ucmd);
 		if (err)
 			goto err;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index be6e1e0..6bd9f13 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -204,16 +204,11 @@ static void dump_cqe(struct mthca_dev *dev, void *cqe_ptr)
 static inline void update_cons_index(struct mthca_dev *dev, struct mthca_cq *cq,
 				     int incr)
 {
-	__be32 doorbell[2];
-
 	if (mthca_is_memfree(dev)) {
 		*cq->set_ci_db = cpu_to_be32(cq->cons_index);
 		wmb();
 	} else {
-		doorbell[0] = cpu_to_be32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn);
-		doorbell[1] = cpu_to_be32(incr - 1);
-
-		mthca_write64(doorbell,
+		mthca_write64(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn, incr - 1,
 			      dev->kar + MTHCA_CQ_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		/*
@@ -731,17 +726,12 @@ repoll:
 
 int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags)
 {
-	__be32 doorbell[2];
+	u32 dbhi = ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ?
+		    MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL :
+		    MTHCA_TAVOR_CQ_DB_REQ_NOT) |
+		to_mcq(cq)->cqn;
 
-	doorbell[0] = cpu_to_be32(((flags & IB_CQ_SOLICITED_MASK) ==
-				   IB_CQ_SOLICITED ?
-				   MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL :
-				   MTHCA_TAVOR_CQ_DB_REQ_NOT) |
-				  to_mcq(cq)->cqn);
-	doorbell[1] = (__force __be32) 0xffffffff;
-
-	mthca_write64(doorbell,
-		      to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL,
+	mthca_write64(dbhi, 0xffffffff, to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&to_mdev(cq->device)->doorbell_lock));
 
 	return 0;
@@ -750,19 +740,16 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags)
 int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags)
 {
 	struct mthca_cq *cq = to_mcq(ibcq);
-	__be32 doorbell[2];
-	u32 sn;
-	__be32 ci;
-
-	sn = cq->arm_sn & 3;
-	ci = cpu_to_be32(cq->cons_index);
+	__be32 db_rec[2];
+	u32 dbhi;
+	u32 sn = cq->arm_sn & 3;
 
-	doorbell[0] = ci;
-	doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) |
-				  ((flags & IB_CQ_SOLICITED_MASK) ==
-				   IB_CQ_SOLICITED ? 1 : 2));
+	db_rec[0] = cpu_to_be32(cq->cons_index);
+	db_rec[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) |
+				((flags & IB_CQ_SOLICITED_MASK) ==
+				 IB_CQ_SOLICITED ? 1 : 2));
 
-	mthca_write_db_rec(doorbell, cq->arm_db);
+	mthca_write_db_rec(db_rec, cq->arm_db);
 
 	/*
 	 * Make sure that the doorbell record in host memory is
@@ -770,14 +757,12 @@ int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags)
 	 */
 	wmb();
 
-	doorbell[0] = cpu_to_be32((sn << 28) |
-				  ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ?
-				   MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL :
-				   MTHCA_ARBEL_CQ_DB_REQ_NOT) |
-				  cq->cqn);
-	doorbell[1] = ci;
+	dbhi = (sn << 28) |
+		((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED ?
+		 MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL :
+		 MTHCA_ARBEL_CQ_DB_REQ_NOT) | cq->cqn;
 
-	mthca_write64(doorbell,
+	mthca_write64(dbhi, cq->cons_index,
 		      to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->doorbell_lock));
 
diff --git a/drivers/infiniband/hw/mthca/mthca_doorbell.h b/drivers/infiniband/hw/mthca/mthca_doorbell.h
index dd9a44d..b374dc3 100644
--- a/drivers/infiniband/hw/mthca/mthca_doorbell.h
+++ b/drivers/infiniband/hw/mthca/mthca_doorbell.h
@@ -58,10 +58,10 @@ static inline void mthca_write64_raw(__be64 val, void __iomem *dest)
 	__raw_writeq((__force u64) val, dest);
 }
 
-static inline void mthca_write64(__be32 val[2], void __iomem *dest,
+static inline void mthca_write64(u32 hi, u32 lo, void __iomem *dest,
 				 spinlock_t *doorbell_lock)
 {
-	__raw_writeq(*(u64 *) val, dest);
+	__raw_writeq((__force u64) cpu_to_be64((u64) hi << 32 | lo), dest);
 }
 
 static inline void mthca_write_db_rec(__be32 val[2], __be32 *db)
@@ -87,14 +87,17 @@ static inline void mthca_write64_raw(__be64 val, void __iomem *dest)
 	__raw_writel(((__force u32 *) &val)[1], dest + 4);
 }
 
-static inline void mthca_write64(__be32 val[2], void __iomem *dest,
+static inline void mthca_write64(u32 hi, u32 lo, void __iomem *dest,
 				 spinlock_t *doorbell_lock)
 {
 	unsigned long flags;
 
+	hi = (__force u32) cpu_to_be32(hi);
+	lo = (__force u32) cpu_to_be32(lo);
+
 	spin_lock_irqsave(doorbell_lock, flags);
-	__raw_writel((__force u32) val[0], dest);
-	__raw_writel((__force u32) val[1], dest + 4);
+	__raw_writel(hi, dest);
+	__raw_writel(lo, dest + 4);
 	spin_unlock_irqrestore(doorbell_lock, flags);
 }
 
diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c
index 8592b26..b29de51 100644
--- a/drivers/infiniband/hw/mthca/mthca_eq.c
+++ b/drivers/infiniband/hw/mthca/mthca_eq.c
@@ -173,11 +173,6 @@ static inline u64 async_mask(struct mthca_dev *dev)
 
 static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci)
 {
-	__be32 doorbell[2];
-
-	doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn);
-	doorbell[1] = cpu_to_be32(ci & (eq->nent - 1));
-
 	/*
 	 * This barrier makes sure that all updates to ownership bits
 	 * done by set_eqe_hw() hit memory before the consumer index
@@ -187,7 +182,7 @@ static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u
 	 * having set_eqe_hw() overwrite the owner field.
 	 */
 	wmb();
-	mthca_write64(doorbell,
+	mthca_write64(MTHCA_EQ_DB_SET_CI | eq->eqn, ci & (eq->nent - 1),
 		      dev->kar + MTHCA_EQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 }
@@ -212,12 +207,7 @@ static inline void set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci)
 
 static inline void tavor_eq_req_not(struct mthca_dev *dev, int eqn)
 {
-	__be32 doorbell[2];
-
-	doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn);
-	doorbell[1] = 0;
-
-	mthca_write64(doorbell,
+	mthca_write64(MTHCA_EQ_DB_REQ_NOT | eqn, 0,
 		      dev->kar + MTHCA_EQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 }
@@ -230,12 +220,7 @@ static inline void arbel_eq_req_not(struct mthca_dev *dev, u32 eqn_mask)
 static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn)
 {
 	if (!mthca_is_memfree(dev)) {
-		__be32 doorbell[2];
-
-		doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn);
-		doorbell[1] = cpu_to_be32(cqn);
-
-		mthca_write64(doorbell,
+		mthca_write64(MTHCA_EQ_DB_DISARM_CQ | eqn, cqn,
 			      dev->kar + MTHCA_EQ_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index df01b20..0e5461c 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -1799,15 +1799,11 @@ int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 
 out:
 	if (likely(nreq)) {
-		__be32 doorbell[2];
-
-		doorbell[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) +
-					   qp->send_wqe_offset) | f0 | op0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
-
 		wmb();
 
-		mthca_write64(doorbell,
+		mthca_write64(((qp->sq.next_ind << qp->sq.wqe_shift) +
+			       qp->send_wqe_offset) | f0 | op0,
+			      (qp->qpn << 8) | size0,
 			      dev->kar + MTHCA_SEND_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		/*
@@ -1829,7 +1825,6 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	__be32 doorbell[2];
 	unsigned long flags;
 	int err = 0;
 	int nreq;
@@ -1907,13 +1902,10 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-			doorbell[1] = cpu_to_be32(qp->qpn << 8);
-
 			wmb();
 
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+			mthca_write64((qp->rq.next_ind << qp->rq.wqe_shift) | size0,
+				      qp->qpn << 8, dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
 			qp->rq.next_ind = ind;
@@ -1923,13 +1915,10 @@ int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 
 out:
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq);
-
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_RECEIVE_DOORBELL,
+		mthca_write64((qp->rq.next_ind << qp->rq.wqe_shift) | size0,
+			      qp->qpn << 8 | nreq, dev->kar + MTHCA_RECEIVE_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
@@ -1951,7 +1940,7 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	__be32 doorbell[2];
+	u32 dbhi;
 	void *wqe;
 	void *prev_wqe;
 	unsigned long flags;
@@ -1981,10 +1970,8 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		if (unlikely(nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
-						  ((qp->sq.head & 0xffff) << 8) |
-						  f0 | op0);
-			doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+			dbhi = (MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
+				((qp->sq.head & 0xffff) << 8) | f0 | op0;
 
 			qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB;
 
@@ -2000,7 +1987,8 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 			 * write MMIO send doorbell.
 			 */
 			wmb();
-			mthca_write64(doorbell,
+
+			mthca_write64(dbhi, (qp->qpn << 8) | size0,
 				      dev->kar + MTHCA_SEND_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		}
@@ -2154,10 +2142,7 @@ int mthca_arbel_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 
 out:
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32((nreq << 24) |
-					  ((qp->sq.head & 0xffff) << 8) |
-					  f0 | op0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+		dbhi = (nreq << 24) | ((qp->sq.head & 0xffff) << 8) | f0 | op0;
 
 		qp->sq.head += nreq;
 
@@ -2173,8 +2158,8 @@ out:
 		 * write MMIO send doorbell.
 		 */
 		wmb();
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_SEND_DOORBELL,
+
+		mthca_write64(dbhi, (qp->qpn << 8) | size0, dev->kar + MTHCA_SEND_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c
index 3f58c11..553d681 100644
--- a/drivers/infiniband/hw/mthca/mthca_srq.c
+++ b/drivers/infiniband/hw/mthca/mthca_srq.c
@@ -491,7 +491,6 @@ int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr,
 {
 	struct mthca_dev *dev = to_mdev(ibsrq->device);
 	struct mthca_srq *srq = to_msrq(ibsrq);
-	__be32 doorbell[2];
 	unsigned long flags;
 	int err = 0;
 	int first_ind;
@@ -563,16 +562,13 @@ int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr,
 		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
-			doorbell[1] = cpu_to_be32(srq->srqn << 8);
-
 			/*
 			 * Make sure that descriptors are written
 			 * before doorbell is rung.
 			 */
 			wmb();
 
-			mthca_write64(doorbell,
+			mthca_write64(first_ind << srq->wqe_shift, srq->srqn << 8,
 				      dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
@@ -581,16 +577,13 @@ int mthca_tavor_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr,
 	}
 
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
-		doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq);
-
 		/*
 		 * Make sure that descriptors are written before
 		 * doorbell is rung.
 		 */
 		wmb();
 
-		mthca_write64(doorbell,
+		mthca_write64(first_ind << srq->wqe_shift, (srq->srqn << 8) | nreq,
 			      dev->kar + MTHCA_RECEIVE_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 6545fa7..0a00ea0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -84,9 +84,8 @@ enum {
 	IPOIB_MCAST_RUN = 6,
 	IPOIB_STOP_REAPER = 7,
 	IPOIB_MCAST_STARTED = 8,
-	IPOIB_FLAG_NETIF_STOPPED = 9,
-	IPOIB_FLAG_ADMIN_CM = 10,
-	IPOIB_FLAG_UMCAST = 11,
+	IPOIB_FLAG_ADMIN_CM = 9,
+	IPOIB_FLAG_UMCAST = 10,
 
 	IPOIB_MAX_BACKOFF_SECONDS = 16,
 
@@ -98,9 +97,9 @@ enum {
 
 #define IPOIB_OP_RECV (1ul << 31)
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
-#define IPOIB_CM_OP_SRQ (1ul << 30)
+#define IPOIB_OP_CM (1ul << 30)
 #else
-#define IPOIB_CM_OP_SRQ (0)
+#define IPOIB_OP_CM (0)
 #endif
 
 /* structs */
@@ -197,7 +196,6 @@ struct ipoib_cm_rx {
 
 struct ipoib_cm_tx {
 	struct ib_cm_id *id;
-	struct ib_cq *cq;
 	struct ib_qp *qp;
 	struct list_head list;
 	struct net_device *dev;
@@ -294,6 +292,7 @@ struct ipoib_dev_priv {
 	unsigned tx_tail;
 	struct ib_sge tx_sge;
 	struct ib_send_wr tx_wr;
+	unsigned tx_outstanding;
 
 	struct ib_wc ibwc[IPOIB_NUM_WC];
 
@@ -502,6 +501,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx);
 void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb,
 			   unsigned int mtu);
 void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
+void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc);
 #else
 
 struct ipoib_cm_tx;
@@ -590,6 +590,9 @@ static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *w
 {
 }
 
+static inline void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
+{
+}
 #endif
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 0a0dcb8..8761077 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -87,7 +87,7 @@ static int ipoib_cm_post_receive(struct net_device *dev, int id)
 	struct ib_recv_wr *bad_wr;
 	int i, ret;
 
-	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ;
+	priv->cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
 
 	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
 		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
@@ -401,7 +401,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
 void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ;
+	unsigned int wr_id = wc->wr_id & ~(IPOIB_OP_CM | IPOIB_OP_RECV);
 	struct sk_buff *skb, *newskb;
 	struct ipoib_cm_rx *p;
 	unsigned long flags;
@@ -412,7 +412,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		       wr_id, wc->status);
 
 	if (unlikely(wr_id >= ipoib_recvq_size)) {
-		if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) {
+		if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~(IPOIB_OP_CM | IPOIB_OP_RECV))) {
 			spin_lock_irqsave(&priv->lock, flags);
 			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
 			ipoib_cm_start_rx_drain(priv);
@@ -434,7 +434,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		goto repost;
 	}
 
-	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
+	if (unlikely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) {
 		p = wc->qp->qp_context;
 		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
 			spin_lock_irqsave(&priv->lock, flags);
@@ -498,7 +498,7 @@ static inline int post_send(struct ipoib_dev_priv *priv,
 	priv->tx_sge.addr = addr;
 	priv->tx_sge.length = len;
 
-	priv->tx_wr.wr_id = wr_id;
+	priv->tx_wr.wr_id = wr_id | IPOIB_OP_CM;
 
 	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
 }
@@ -549,20 +549,19 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 		dev->trans_start = jiffies;
 		++tx->tx_head;
 
-		if (tx->tx_head - tx->tx_tail == ipoib_sendq_size) {
+		if (++priv->tx_outstanding == ipoib_sendq_size) {
 			ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
 				  tx->qp->qp_num);
 			netif_stop_queue(dev);
-			set_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags);
 		}
 	}
 }
 
-static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx,
-				  struct ib_wc *wc)
+void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	unsigned int wr_id = wc->wr_id;
+	struct ipoib_cm_tx *tx = wc->qp->qp_context;
+	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_CM;
 	struct ipoib_tx_buf *tx_req;
 	unsigned long flags;
 
@@ -587,11 +586,10 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
 	++tx->tx_tail;
-	if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags)) &&
-	    tx->tx_head - tx->tx_tail <= ipoib_sendq_size >> 1) {
-		clear_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags);
+	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
+	    netif_queue_stopped(dev) &&
+	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
 		netif_wake_queue(dev);
-	}
 
 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR) {
@@ -614,11 +612,6 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx
 			tx->neigh = NULL;
 		}
 
-		/* queue would be re-started anyway when TX is destroyed,
-		 * but it makes sense to do it ASAP here. */
-		if (test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags))
-			netif_wake_queue(dev);
-
 		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 			list_move(&tx->list, &priv->cm.reap_list);
 			queue_work(ipoib_workqueue, &priv->cm.reap_task);
@@ -632,19 +625,6 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 }
 
-static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
-{
-	struct ipoib_cm_tx *tx = tx_ptr;
-	int n, i;
-
-	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
-	do {
-		n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc);
-		for (i = 0; i < n; ++i)
-			ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i);
-	} while (n == IPOIB_NUM_WC);
-}
-
 int ipoib_cm_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -807,17 +787,18 @@ static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	return 0;
 }
 
-static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ib_cq *cq)
+static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_cm_tx *tx)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
-		.send_cq = cq,
+		.send_cq = priv->cq,
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
 		.cap.max_send_wr = ipoib_sendq_size,
 		.cap.max_send_sge = 1,
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
 		.qp_type = IB_QPT_RC,
+		.qp_context = tx
 	};
 
 	return ib_create_qp(priv->pd, &attr);
@@ -899,21 +880,7 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn,
 		goto err_tx;
 	}
 
-	p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p,
-			     ipoib_sendq_size + 1, 0);
-	if (IS_ERR(p->cq)) {
-		ret = PTR_ERR(p->cq);
-		ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret);
-		goto err_cq;
-	}
-
-	ret = ib_req_notify_cq(p->cq, IB_CQ_NEXT_COMP);
-	if (ret) {
-		ipoib_warn(priv, "failed to request completion notification: %d\n", ret);
-		goto err_req_notify;
-	}
-
-	p->qp = ipoib_cm_create_tx_qp(p->dev, p->cq);
+	p->qp = ipoib_cm_create_tx_qp(p->dev, p);
 	if (IS_ERR(p->qp)) {
 		ret = PTR_ERR(p->qp);
 		ipoib_warn(priv, "failed to allocate tx qp: %d\n", ret);
@@ -950,12 +917,8 @@ err_modify:
 err_id:
 	p->id = NULL;
 	ib_destroy_qp(p->qp);
-err_req_notify:
 err_qp:
 	p->qp = NULL;
-	ib_destroy_cq(p->cq);
-err_cq:
-	p->cq = NULL;
 err_tx:
 	return ret;
 }
@@ -964,6 +927,8 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
 	struct ipoib_tx_buf *tx_req;
+	unsigned long flags;
+	unsigned long begin;
 
 	ipoib_dbg(priv, "Destroy active connection 0x%x head 0x%x tail 0x%x\n",
 		  p->qp ? p->qp->qp_num : 0, p->tx_head, p->tx_tail);
@@ -971,27 +936,40 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
 	if (p->id)
 		ib_destroy_cm_id(p->id);
 
-	if (p->qp)
-		ib_destroy_qp(p->qp);
-
-	if (p->cq)
-		ib_destroy_cq(p->cq);
-
-	if (test_bit(IPOIB_FLAG_NETIF_STOPPED, &p->flags))
-		netif_wake_queue(p->dev);
-
 	if (p->tx_ring) {
+		/* Wait for all sends to complete */
+		begin = jiffies;
 		while ((int) p->tx_tail - (int) p->tx_head < 0) {
-			tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)];
-			ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len,
-					    DMA_TO_DEVICE);
-			dev_kfree_skb_any(tx_req->skb);
-			++p->tx_tail;
+			if (time_after(jiffies, begin + 5 * HZ)) {
+				ipoib_warn(priv, "timing out; %d sends not completed\n",
+					   p->tx_head - p->tx_tail);
+				goto timeout;
+			}
+
+			msleep(1);
 		}
+	}
 
-		kfree(p->tx_ring);
+timeout:
+
+	while ((int) p->tx_tail - (int) p->tx_head < 0) {
+		tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)];
+		ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len,
+				    DMA_TO_DEVICE);
+		dev_kfree_skb_any(tx_req->skb);
+		++p->tx_tail;
+		spin_lock_irqsave(&priv->tx_lock, flags);
+		if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
+		    netif_queue_stopped(p->dev) &&
+		    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			netif_wake_queue(p->dev);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
 	}
 
+	if (p->qp)
+		ib_destroy_qp(p->qp);
+
+	kfree(p->tx_ring);
 	kfree(p);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 1a77e79..5063dd5 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -267,11 +267,10 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
 	++priv->tx_tail;
-	if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
-	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
-		clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
+	    netif_queue_stopped(dev) &&
+	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
 		netif_wake_queue(dev);
-	}
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (wc->status != IB_WC_SUCCESS &&
@@ -301,14 +300,18 @@ poll_more:
 		for (i = 0; i < n; i++) {
 			struct ib_wc *wc = priv->ibwc + i;
 
-			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
-				++done;
-				ipoib_cm_handle_rx_wc(dev, wc);
-			} else if (wc->wr_id & IPOIB_OP_RECV) {
+			if (wc->wr_id & IPOIB_OP_RECV) {
 				++done;
-				ipoib_ib_handle_rx_wc(dev, wc);
-			} else
-				ipoib_ib_handle_tx_wc(dev, wc);
+				if (wc->wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_rx_wc(dev, wc);
+				else
+					ipoib_ib_handle_rx_wc(dev, wc);
+			} else {
+				if (wc->wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_tx_wc(dev, wc);
+				else
+					ipoib_ib_handle_tx_wc(dev, wc);
+			}
 		}
 
 		if (n != t)
@@ -401,10 +404,9 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		address->last_send = priv->tx_head;
 		++priv->tx_head;
 
-		if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
+		if (++priv->tx_outstanding == ipoib_sendq_size) {
 			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
 			netif_stop_queue(dev);
-			set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
 		}
 	}
 }
@@ -436,7 +438,8 @@ void ipoib_reap_ah(struct work_struct *work)
 	__ipoib_reap_ah(dev);
 
 	if (!test_bit(IPOIB_STOP_REAPER, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ);
+		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+				   round_jiffies_relative(HZ));
 }
 
 int ipoib_ib_dev_open(struct net_device *dev)
@@ -472,7 +475,8 @@ int ipoib_ib_dev_open(struct net_device *dev)
 	}
 
 	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ);
+	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+			   round_jiffies_relative(HZ));
 
 	set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags);
 
@@ -561,12 +565,17 @@ void ipoib_drain_cq(struct net_device *dev)
 			if (priv->ibwc[i].status == IB_WC_SUCCESS)
 				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
 
-			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
-				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
-			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
-				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
-			else
-				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
+			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
+				if (priv->ibwc[i].wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+				else
+					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+			} else {
+				if (priv->ibwc[i].wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_tx_wc(dev, priv->ibwc + i);
+				else
+					ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
+			}
 		}
 	} while (n == IPOIB_NUM_WC);
 }
@@ -612,6 +621,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 					    DMA_TO_DEVICE);
 			dev_kfree_skb_any(tx_req->skb);
 			++priv->tx_tail;
+			--priv->tx_outstanding;
 		}
 
 		for (i = 0; i < ipoib_recvq_size; ++i) {
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index e072f3c..ace2345 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -148,8 +148,6 @@ static int ipoib_stop(struct net_device *dev)
 
 	netif_stop_queue(dev);
 
-	clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
-
 	/*
 	 * Now flush workqueue to make sure a scheduled task doesn't
 	 * bring our internal state back up.
@@ -895,7 +893,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		goto out_rx_ring_cleanup;
 	}
 
-	/* priv->tx_head & tx_tail are already 0 */
+	/* priv->tx_head, tx_tail & tx_outstanding are already 0 */
 
 	if (ipoib_ib_dev_init(dev, ca, port))
 		goto out_tx_ring_cleanup;
diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c
index 6471d33..5064873 100644
--- a/drivers/net/mlx4/fw.c
+++ b/drivers/net/mlx4/fw.c
@@ -736,7 +736,7 @@ int mlx4_INIT_HCA(struct mlx4_dev *dev, struct mlx4_init_hca_param *param)
 	MLX4_PUT(inbox, (u8) (PAGE_SHIFT - 12), INIT_HCA_UAR_PAGE_SZ_OFFSET);
 	MLX4_PUT(inbox, param->log_uar_sz, INIT_HCA_LOG_UAR_SZ_OFFSET);
 
-	err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_INIT_HCA, 1000);
+	err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_INIT_HCA, 10000);
 
 	if (err)
 		mlx4_err(dev, "INIT_HCA returns %d\n", err);
diff --git a/include/linux/mlx4/doorbell.h b/include/linux/mlx4/doorbell.h
index 3f2da44..f31bba2 100644
--- a/include/linux/mlx4/doorbell.h
+++ b/include/linux/mlx4/doorbell.h
@@ -52,11 +52,6 @@
 #define MLX4_INIT_DOORBELL_LOCK(ptr) do { } while (0)
 #define MLX4_GET_DOORBELL_LOCK(ptr) (NULL)
 
-static inline void mlx4_write64_raw(__be64 val, void __iomem *dest)
-{
-	__raw_writeq((__force u64) val, dest);
-}
-
 static inline void mlx4_write64(__be32 val[2], void __iomem *dest,
 				spinlock_t *doorbell_lock)
 {
@@ -75,12 +70,6 @@ static inline void mlx4_write64(__be32 val[2], void __iomem *dest,
 #define MLX4_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr)
 #define MLX4_GET_DOORBELL_LOCK(ptr) (ptr)
 
-static inline void mlx4_write64_raw(__be64 val, void __iomem *dest)
-{
-	__raw_writel(((__force u32 *) &val)[0], dest);
-	__raw_writel(((__force u32 *) &val)[1], dest + 4);
-}
-
 static inline void mlx4_write64(__be32 val[2], void __iomem *dest,
 				spinlock_t *doorbell_lock)
 {

From rdreier at cisco.com Tue Oct 23 11:35:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 23 Oct 2007 11:35:26 -0700
Subject: [ofa-general] Re: [PATCH]
IB/ipoib: IPOIB CM rx use higher order fragments In-Reply-To: <1193155667.25235.4.camel@mtls03> (Eli Cohen's message of "Tue, 23 Oct 2007 18:07:46 +0200") References: <1193155667.25235.4.camel@mtls03> Message-ID: > In order to reduce the overhead of iterating the fragments of an > SKB in the receive flow, we use fragments of higher order and thus > reduce the number of iterations. This patch seams to improve receive > throughput of small UDP messages. I don't think we want to do this -- it may be good for benchmarks but it will hurt reliability, since systems often have highly fragmented memory so higher-order atomic allocations will fail. - R. From rdreier at cisco.com Tue Oct 23 11:36:09 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 23 Oct 2007 11:36:09 -0700 Subject: [ofa-general] Re: __always_inline macro usage In-Reply-To: (Roland Dreier's message of "Tue, 23 Oct 2007 09:02:06 -0700") References: <200710221013.40112.jackm@dev.mellanox.co.il> <200710231043.16411.jackm@dev.mellanox.co.il> <20071023084004.GC2667@minantech.com> <200710231058.26784.jackm@dev.mellanox.co.il> Message-ID: Actually just replacing __always_inline with inline seems to work OK for me. I think I'll do that for libmlx. From ardavis at ichips.intel.com Tue Oct 23 11:58:00 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 23 Oct 2007 11:58:00 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU Message-ID: <471E4438.6080300@ichips.intel.com> There has been much discussion on a private thread regarding bug #735 - "dapltest performance tests don't adhere to iWARP standard" that needs to move to the general list. iWARP, has a requirement that the active side of the connection MUST be the first to send the first FPDU (SEND or RDMA operation). This presents a problem with applications written for uDAPL and OFA verbs given that there is no such restriction. 
So, short of requiring every OFA application/ULP to adhere to this restriction, we need the iWARP vendors to come up with a standard method to remove the restriction. Can someone come up with a solution, possibly in iWARP CM, that will work and insure interoperability between iWARP devices? -arlin From mshefty at ichips.intel.com Tue Oct 23 12:11:47 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 23 Oct 2007 12:11:47 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471E4438.6080300@ichips.intel.com> References: <471E4438.6080300@ichips.intel.com> Message-ID: <471E4773.4040000@ichips.intel.com> > There has been much discussion on a private thread regarding bug #735 - > "dapltest performance tests don't adhere to iWARP standard" that needs > to move to the general list. This bug would be better titled "iWarp cannot support uDAPL API". :) Seriously, the iWarp and uDAPL specs conflict. One needs to change. > Can someone come up with a solution, possibly in iWARP CM, that will > work and insure interoperability between iWARP devices? I thought the restriction was there to support switching between streaming and rdma mode. If a connection only uses rdma mode, is the restriction really needed at all? - Sean From rdreier at cisco.com Tue Oct 23 12:37:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 23 Oct 2007 12:37:54 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471E4438.6080300@ichips.intel.com> (Arlin Davis's message of "Tue, 23 Oct 2007 11:58:00 -0700") References: <471E4438.6080300@ichips.intel.com> Message-ID: > iWARP, has a requirement that the active side of the connection MUST > be the first to send the first FPDU (SEND or RDMA operation). This > presents a problem with applications written for uDAPL and OFA verbs > given that there is no such restriction. 
So, short of requiring every > OFA application/ULP to adhere to this restriction, we need the iWARP > vendors to come up with a standard method to remove the restriction. I actually don't see a problem with just documenting that verbs consumers that want to work on top of iWARP must follow the documented iWARP rules. - R. From johann.george at qlogic.com Tue Oct 23 13:03:29 2007 From: johann.george at qlogic.com (Johann George) Date: Tue, 23 Oct 2007 13:03:29 -0700 Subject: [ofa-general] OpenFabrics Developer's Summit: tentative agenda Message-ID: <20071023200329.GA6368@cuprite.pathscale.com> Below is a tentative agenda for the upcoming OpenFabrics Developer's Summit. While most sessions are confirmed, at least a couple of speakers are attempting to resolve conflicts so they may attend. We have attempted to accommodate everyone but if you are speaking and see a conflict with the proposed time of your session, let me know. An updated agenda will be made available in the future. You can register by clicking on this link: http://www.acteva.com/booking.cfm?bevaid=143964 The Developer's Summit is being held during November 15-16, 2007 at the Boomtown Hotel near Reno. Dinner will be provided on Thursday as well as breakfast and lunch on Friday. Registration is $195 with a student rate of $95. Rooms are available at the Boomtown hotel starting at $59/night. Comments, suggestions and feedback on the agenda are also welcome.
Johann Thursday, November 15, 2007 --------------------------- 13:00 20m OFED: Feedback from Alexa Ekechi Nwokah, Alexa 13:20 20m OFED: Feedback from RedHat Doug Ledford, RedHat 13:40 20m OFED: Feedback from SuSE Moiz Kohari, Novell 14:00 20m The Journey of a Patch: from Submission to Distros Roland Drier, Cisco 14:20 30m OFED 1.3 Update and Procedure Review Tziporet Koren, Mellanox ------------------------------------ 14:50 20m Break ------------------------------------ 15:10 20m OpenFabrics Logo Program: Experience So Far Arkady Kanevsky, Network Appliance 15:30 30m Update on MVAPICH and MVAPICH2 DK Panda, Ohio State University 16:00 20m Update on OpenMPI Jeff Sqyures, Cisco 16:20 20m uDAPL 2.0 Arkady Kanevsky, Network Appliance 16:40 20m SA Caching Sean Hefty, Intel ------------------------------------ 17:00 60m Dinner ------------------------------------ 18:00 20m Update on NFSoRDMA James Lentini, Network Appliance 18:20 20m Lustre Eric Barton, Sun Microsystems 18:40 20m Bonding Or Gerlitz, Voltaire 19:00 20m iWARP update Bill Boas, System Fabric Works 19:20 40m iWARP discussion on issues Bill Boas, System Fabric Works; Gopal Hegde, Cisco; Glenn Grundstrom, NetEffect; Bruck Girmay, Chelsio Friday, November 16, 2007 ------------------------- 07:15 45m Breakfast ------------------------------------ 08:00 30m WinOF: Update and Futures Gilad Shainer, Mellanox 08:30 30m CCS Ve2 Preview Eric Lantz, Microsoft 09:00 30m OFED 1.4 Planned Features Tziporet Koren, Mellanox 09:30 20m OFED Management Tools Ira Weiny, Lawrence Livermore National Laboratories ------------------------------------ 09:50 20m Break ------------------------------------ 10:10 20m RDS with Zero Copy Rick Frank, Oracle 10:30 20m QoS Support Sean Hefty, Intel; Dror Goldenberg, Mellanox 10:50 20m InfiniBand Routing Update Jason Gunthorpe, Obsidian Research 11:10 20m IPoIB Stateless Offloads Liran Liss, Mellanox 11:30 20m Using XRC Dror Goldenberg, Mellanox and Dr. 
Panda, Ohio State University 11:50 20m Fibre Channel over InfiniBand Dror Goldenberg, Mellanox ------------------------------------ 12:10 60m Lunch ------------------------------------ From swise at opengridcomputing.com Tue Oct 23 13:17:03 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 23 Oct 2007 15:17:03 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471E4773.4040000@ichips.intel.com> References: <471E4438.6080300@ichips.intel.com> <471E4773.4040000@ichips.intel.com> Message-ID: <471E56BF.2080407@opengridcomputing.com> Sean Hefty wrote: >> There has been much discussion on a private thread regarding bug #735 >> - "dapltest performance tests don't adhere to iWARP standard" that >> needs to move to the general list. > > This bug would be better titled "iWarp cannot support uDAPL API". :) > > Seriously, the iWarp and uDAPL specs conflict. One needs to change. > >> Can someone come up with a solution, possibly in iWARP CM, that will >> work and insure interoperability between iWARP devices? > > I thought the restriction was there to support switching between > streaming and rdma mode. If a connection only uses rdma mode, is the > restriction really needed at all? > Yes because all iWARP connections start out as TCP streaming mode connections, and the MPA startup messages are sent in streaming mode. Then the connection is transitioned into FPDU (Framed PDU) mode using the MPA protocol. 
From sashak at voltaire.com Tue Oct 23 13:29:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 23 Oct 2007 22:29:51 +0200 Subject: [ofa-general] Re: [PATCH v3] osm: QoS - adding CPI:CapabilityMask2 and turning on QOS_SUPPORTED bit In-Reply-To: <471DB8E4.8030400@dev.mellanox.co.il> References: <471DB8E4.8030400@dev.mellanox.co.il> Message-ID: <20071023202951.GD7088@sashak.voltaire.com> On 11:03 Tue 23 Oct , Yevgeny Kliteynik wrote: > Adding ClassPortInfo:CapabilityMask2 field and turning > on OSM QoS capability bit (OSM_CAP2_IS_QOS_SUPPORTED). > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From sashak at voltaire.com Tue Oct 23 13:35:51 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 23 Oct 2007 22:35:51 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/scripts: Eliminate some duplicated messages In-Reply-To: <1193144744.18113.69.camel@hrosenstock-ws.xsigo.com> References: <1193144744.18113.69.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071023203551.GF7088@sashak.voltaire.com> On 06:05 Tue 23 Oct , Hal Rosenstock wrote: > infiniband-diags/scripts: Eliminate some duplicated messages > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From swise at opengridcomputing.com Tue Oct 23 13:24:32 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 23 Oct 2007 15:24:32 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: References: <471E4438.6080300@ichips.intel.com> Message-ID: <471E5880.1030100@opengridcomputing.com> Roland Dreier wrote: > > iWARP, has a requirement that the active side of the connection MUST > > be the first to send the first FPDU (SEND or RDMA operation). This > > presents a problem with applications written for uDAPL and OFA verbs > > given that there is no such restriction. 
So, short of requiring every > > OFA application/ULP to adhere to this restriction, we need the iWARP > > vendors to come up with a standard method to remove the restriction. > > I actually don't see a problem with just documenting that verbs > consumers that want to work on top of iWARP must follow the documented > iWARP rules. > That is what I've been trying to push. Both MVAPICH2 and OMPI have been open to adjusting their transports to adhere to this requirement. I wouldn't mind implementing something to enforce this in the IWCM or the iWARP drivers IF there was a clean way to do it. So far there hasn't been a clean way proposed. From or.gerlitz at gmail.com Tue Oct 23 13:48:41 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Tue, 23 Oct 2007 22:48:41 +0200 Subject: [ofa-general] OpenFabrics Developer's Summit: tentative agenda In-Reply-To: <20071023200329.GA6368@cuprite.pathscale.com> References: <20071023200329.GA6368@cuprite.pathscale.com> Message-ID: <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> On 10/23/07, Johann George wrote: > > Below is a tentative agenda for the upcoming OpenFabrics Developer's > Summit. Comments, suggestions and feedback on the agenda are also > welcome. Going strictly developer-wise, the issues that come to my mind, in order of priority, are: 1) IPoIB Stateless Offloads - checksum/LSO/LRO - design, open issues 2) the long time open SA caching thing - problem statement, possible solutions/designs 3) QoS update/status - past, present and future 4) RDS - status, update, upstream plans 5) InfiniBand Routing Update 6) XRC and also 7) Fibre Channel over InfiniBand 8) ConnectX Reliable Multicast I suggest that --none-- of this list (with the possible exception of 7 && 8) be left for Friday - maybe at the price of doing parallel sessions. And yes, some or most of them need way more than 20 minutes, specifically 1 && 2, where we will need at least 45 minutes each - this is my take.
For that end, I suggest that my session would be moved to the next day and same for all the sessions (except for the SA cache) planned to the first day after the 14:50 break. I will be availble to further argue on the matter starting on Sunday when back from vacation... Thursday, November 15, 2007 > --------------------------- > 13:00 20m OFED: Feedback from Alexa > Ekechi Nwokah, Alexa > 13:20 20m OFED: Feedback from RedHat > Doug Ledford, RedHat > 13:40 20m OFED: Feedback from SuSE > Moiz Kohari, Novell > 14:00 20m The Journey of a Patch: from Submission to Distros > Roland Drier, Cisco > 14:20 30m OFED 1.3 Update and Procedure Review > Tziporet Koren, Mellanox > ------------------------------------ > 14:50 20m Break > ------------------------------------ > 15:10 20m OpenFabrics Logo Program: Experience So Far > Arkady Kanevsky, Network Appliance > 15:30 30m Update on MVAPICH and MVAPICH2 > DK Panda, Ohio State University > 16:00 20m Update on OpenMPI > Jeff Sqyures, Cisco > 16:20 20m uDAPL 2.0 > Arkady Kanevsky, Network Appliance > 16:40 20m SA Caching > Sean Hefty, Intel > ------------------------------------ > 17:00 60m Dinner > ------------------------------------ > 18:00 20m Update on NFSoRDMA > James Lentini, Network Appliance > 18:20 20m Lustre > Eric Barton, Sun Microsystems > 18:40 20m Bonding > Or Gerlitz, Voltaire > 19:00 20m iWARP update > Bill Boas, System Fabric Works > 19:20 40m iWARP discussion on issues > Bill Boas, System Fabric Works; Gopal Hegde, Cisco; > Glenn > Grundstrom, NetEffect; Bruck Girmay, Chelsio > > > > Friday, November 16, 2007 > ------------------------- > 07:15 45m Breakfast > ------------------------------------ > 08:00 30m WinOF: Update and Futures > Gilad Shainer, Mellanox > 08:30 30m CCS Ve2 Preview > Eric Lantz, Microsoft > 09:00 30m OFED 1.4 Planned Features > Tziporet Koren, Mellanox > 09:30 20m OFED Management Tools > Ira Weiny, Lawrence Livermore National Laboratories > ------------------------------------ > 09:50 20m 
Break > ------------------------------------ > 10:10 20m RDS with Zero Copy > Rick Frank, Oracle > 10:30 20m QoS Support > Sean Hefty, Intel; Dror Goldenberg, Mellanox > 10:50 20m InfiniBand Routing Update > Jason Gunthorpe, Obsidian Research > 11:10 20m IPoIB Stateless Offloads > Liran Liss, Mellanox > 11:30 20m Using XRC > Dror Goldenberg, Mellanox and Dr. Panda, Ohio State > University > 11:50 20m Fibre Channel over InfiniBand > Dror Goldenberg, Mellanox > ------------------------------------ > 12:10 60m Lunch > ------------------------------------ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hrosenstock at xsigo.com Tue Oct 23 14:08:09 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 14:08:09 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] infiniband-diags/sminfo: Fix activity count display Message-ID: <1193173689.18113.220.camel@hrosenstock-ws.xsigo.com> infiniband-diags/sminfo: Fix activity count display Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c index 0cd63f9..87f09ac 100644 --- a/infiniband-diags/src/sminfo.c +++ b/infiniband-diags/src/sminfo.c @@ -42,7 +42,7 @@ #include #include -#define __BUILD_VERSION_TAG__ 1.2.1 +#define __BUILD_VERSION_TAG__ 1.2.2 #include #include #include @@ -89,7 +89,8 @@ main(int argc, char **argv) ib_portid_t portid = {0}; int timeout = 0; /* use default */ uint8_t *p; - int act = 0, prio = 0, state = SMINFO_STANDBY; + uint act = 0; + int prio = 0, state = SMINFO_STANDBY; uint64_t guid = 0, key = 0; extern int ibdebug; int dest_type = IB_DEST_LID; @@ -199,7 +200,7 @@ main(int argc, char **argv) mad_decode_field(sminfo, IB_SMINFO_PRIO_F, &prio); mad_decode_field(sminfo, IB_SMINFO_STATE_F, &state); - printf("sminfo: sm lid %d sm guid 0x%" PRIx64 ", activity count %d priority %d state %d %s\n", + printf("sminfo: sm lid %d sm guid 0x%" PRIx64 ", activity count %u priority %d state %d %s\n", portid.lid, guid, 
act, prio, state, STATESTR(state)); exit(0); From hrosenstock at xsigo.com Tue Oct 23 14:11:12 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 14:11:12 -0700 Subject: [ofa-general] Re: [PATCH V4] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <4718EC48.mail8WG11T65C@systemfabricworks.com> References: <4718EC48.mail8WG11T65C@systemfabricworks.com> Message-ID: <1193173872.18113.225.camel@hrosenstock-ws.xsigo.com> Steve, On Fri, 2007-10-19 at 12:41 -0500, swelch at systemfabricworks.com wrote: > > This patch [v4] replaces the [v3] patch; it's identicial other than > the patch description has been updated to put back in the detailed > patch description absent from the [v3] patch. > > The local loopback of an outgoing DR SMP response is limited to those > that originate at the driver specific SMA implementation during the > driver specific process_mad() function. This patch enables a > returning DR SMP originating in userspace (or elsewhere) to be > delivered to the local managment stack. In this specific case > the driver process_mad() function does not consume or process > the MAD, so a reponse mad has not be created and the original > MAD must manually be copied to the MAD buffer that is to be handed > off to the local agent. > > For consistent bahavior on top of iPath hardware, a subsequent patch > to be submitted by Ralph Campbell to update process_mad() return values > is required. Guess you didn't like my naming consistency change (which could be done separately). 
Acked-by: Hal Rosenstock but we ought to wait to hear from Sasha to be sure -- Hal > Thanks, Steve > > Signed-off-by: Steve Welch > --- > drivers/infiniband/core/mad.c | 6 +++--- > drivers/infiniband/core/smi.h | 18 +++++++++++++++++- > 2 files changed, 20 insertions(+), 4 deletions(-) > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 6f42877..98148d6 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > } > > /* Check to post send on QP or process locally */ > - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) > + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && > + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) > goto out; > > local = kmalloc(sizeof *local, GFP_ATOMIC); > @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > mad_agent_priv->agent.port_num); > if (port_priv) { > - mad_priv->mad.mad.mad_hdr.tid = > - ((struct ib_mad *)smp)->mad_hdr.tid; > + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); > recv_mad_agent = find_mad_agent(port_priv, > &mad_priv->mad.mad); > } > diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h > index 1cfc298..aff96ba 100644 > --- a/drivers/infiniband/core/smi.h > +++ b/drivers/infiniband/core/smi.h > @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, > u8 node_type, int port_num); > > /* > - * Return 1 if the SMP should be handled by the local SMA/SM via process_mad > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > */ > static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > struct ib_device *device) > @@ -71,4 +72,19 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, > 
(smp->hop_ptr == smp->hop_cnt + 1)) ? > IB_SMI_HANDLE : IB_SMI_DISCARD); > } > + > +/* > + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM > + * via process_mad > + */ > +static inline enum smi_action smi_check_local_returning_smp(struct ib_smp *smp, > + struct ib_device *device) > +{ > + /* C14-13:3 -- We're at the end of the DR segment of path */ > + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ > + return ((device->process_mad && > + ib_get_smp_direction(smp) && > + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); > +} > + > #endif /* __SMI_H_ */ From hrosenstock at xsigo.com Tue Oct 23 14:11:51 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 14:11:51 -0700 Subject: [ofa-general] [PATCH] IB/ipath - Enable loopback of DR SMP responses from userspace In-Reply-To: <1192826026.6112.43.camel@brick.pathscale.com> References: <1192826026.6112.43.camel@brick.pathscale.com> Message-ID: <1193173911.18113.227.camel@hrosenstock-ws.xsigo.com> On Fri, 2007-10-19 at 13:33 -0700, Ralph Campbell wrote: > This patch is in response to reviewing a patch to the core MAD processing > which fixes loopback of directed route packets to/from user level > MAD agents. This change enables the core code to work for ib_ipath. > > Signed-off-by: Ralph Campbell Acked-by: Hal Rosenstock > diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c > index 3d1432d..1978c34 100644 > --- a/drivers/infiniband/hw/ipath/ipath_mad.c > +++ b/drivers/infiniband/hw/ipath/ipath_mad.c > @@ -1434,7 +1434,7 @@ static int process_subn(struct ib_device *ibdev, int mad_flags, > * before checking for other consumers. > * Just tell the caller to process it normally. > */ > - ret = IB_MAD_RESULT_FAILURE; > + ret = IB_MAD_RESULT_SUCCESS; > goto bail; > default: > smp->status |= IB_SMP_UNSUP_METHOD; > @@ -1516,7 +1516,7 @@ static int process_perf(struct ib_device *ibdev, u8 port_num, > * before checking for other consumers. 
> * Just tell the caller to process it normally. > */ > - ret = IB_MAD_RESULT_FAILURE; > + ret = IB_MAD_RESULT_SUCCESS; > goto bail; > default: > pmp->status |= IB_SMP_UNSUP_METHOD; > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Tue Oct 23 14:38:04 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 23 Oct 2007 23:38:04 +0200 Subject: [ofa-general] Re: [PATCH V4] infiniband/core: Enable loopback of DR SMP responses from userspace In-Reply-To: <1193173872.18113.225.camel@hrosenstock-ws.xsigo.com> References: <4718EC48.mail8WG11T65C@systemfabricworks.com> <1193173872.18113.225.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071023213804.GK7088@sashak.voltaire.com> On 14:11 Tue 23 Oct , Hal Rosenstock wrote: > > but we ought to wait to hear from Sasha to be sure I didn't see any issues with this patch. All works as expected. Sasha From sashak at voltaire.com Tue Oct 23 14:47:33 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 23 Oct 2007 23:47:33 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] infiniband-diags/sminfo: Fix activity count display In-Reply-To: <1193173689.18113.220.camel@hrosenstock-ws.xsigo.com> References: <1193173689.18113.220.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071023214733.GL7088@sashak.voltaire.com> On 14:08 Tue 23 Oct , Hal Rosenstock wrote: > infiniband-diags/sminfo: Fix activity count display > > Signed-off-by: Hal Rosenstock Applied. Thanks.
Sasha From hrosenstock at xsigo.com Tue Oct 23 14:48:44 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 14:48:44 -0700 Subject: [ofa-general] Outstanding MAD/SMI related patches In-Reply-To: References: <471E4438.6080300@ichips.intel.com> Message-ID: <1193176124.18113.236.camel@hrosenstock-ws.xsigo.com> Hi Roland, AFAIK there are three outstanding patches now related to MAD and SMI which have all been acked and I believe are ready to go ahead: 1. [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler(): Ralph Campbell 10/17 2. [PATCH V4] infiniband/core: Enable loopback of DR SMP responses from userspace: Steve Welch 10/19 3. [PATCH] IB/ipath - Enable loopback of DR SMP responses from userspace: Ralph Campbell 10/19 Just wanted to be clear on their status as these have floated around for a while now. Thanks. -- Hal From rdreier at cisco.com Tue Oct 23 14:53:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 23 Oct 2007 14:53:42 -0700 Subject: [ofa-general] Outstanding MAD/SMI related patches In-Reply-To: <1193176124.18113.236.camel@hrosenstock-ws.xsigo.com> (Hal Rosenstock's message of "Tue, 23 Oct 2007 14:48:44 -0700") References: <471E4438.6080300@ichips.intel.com> <1193176124.18113.236.camel@hrosenstock-ws.xsigo.com> Message-ID: > AFAIK there are three outstanding patches now related to MAD and SMI > which have all been acked and I believe are ready to go ahead: Thanks. I kind of stopped following this and was waiting for exactly this kind of clear message about what everyone agreed on. If you have been tracking this, could you please send me the final versions of the patches with all the accumulated Acked-by lines so I can merge the right versions of everything? - R. 
From hrosenstock at xsigo.com Tue Oct 23 15:04:15 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 15:04:15 -0700 Subject: [Fwd: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler()] Message-ID: <1193177055.18113.245.camel@hrosenstock-ws.xsigo.com> Hi Roland, This is patch 1. Hope my mailer doesn't munge it. Acked-by: Hal Rosenstock -- Hal -------- Forwarded Message -------- From: Ralph Campbell To: openib Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler() Date: Wed, 17 Oct 2007 18:06:42 -0700 In ib_mad_recv_done_handler(), the response pointer is checked for NULL after allocating it. It is then checked again in the local process_mad() path but there is no possibility of it changing in between. Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..f82900d 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1931,15 +1931,6 @@ local: if (port_priv->device->process_mad) { int ret; - if (!response) { - printk(KERN_ERR PFX "No memory for response MAD\n"); - /* - * Is it better to assume that - * it wouldn't be processed ? 
- */ - goto out; - } - ret = port_priv->device->process_mad(port_priv->device, 0, port_priv->port_num, wc, &recv->grh, _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Tue Oct 23 15:06:10 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 15:06:10 -0700 Subject: [ofa-general] [Fwd: [PATCH V4] infiniband/core: Enable loopback of DR SMP responses from userspace] Message-ID: <1193177170.18113.248.camel@hrosenstock-ws.xsigo.com> Hi again Roland, This is patch 2. Acked-by: Hal Rosenstock -- Hal -------- Forwarded Message -------- From: swelch at systemfabricworks.com To: ralph.campbell at qlogic.com, hrosenstock at xsigo.com, sean.hefty at intel.com, rdreier at cisco.com, general at lists.openfabrics.org Subject: [PATCH V4] infiniband/core: Enable loopback of DR SMP responses from userspace Date: Fri, 19 Oct 2007 12:41:28 -0500 This patch [v4] replaces the [v3] patch; it's identical except that the patch description has been updated to restore the detailed description absent from the [v3] patch. The local loopback of an outgoing DR SMP response is limited to those that originate at the driver specific SMA implementation during the driver specific process_mad() function. This patch enables a returning DR SMP originating in userspace (or elsewhere) to be delivered to the local management stack. In this specific case the driver process_mad() function does not consume or process the MAD, so a response MAD has not been created and the original MAD must manually be copied to the MAD buffer that is to be handed off to the local agent. For consistent behavior on top of iPath hardware, a subsequent patch to be submitted by Ralph Campbell to update process_mad() return values is required.
Thanks, Steve Signed-off-by: Steve Welch --- drivers/infiniband/core/mad.c | 6 +++--- drivers/infiniband/core/smi.h | 18 +++++++++++++++++- 2 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..98148d6 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -701,7 +701,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, } /* Check to post send on QP or process locally */ - if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD) + if (smi_check_local_smp(smp, device) == IB_SMI_DISCARD && + smi_check_local_returning_smp(smp, device) == IB_SMI_DISCARD) goto out; local = kmalloc(sizeof *local, GFP_ATOMIC); @@ -752,8 +753,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, port_priv = ib_get_mad_port(mad_agent_priv->agent.device, mad_agent_priv->agent.port_num); if (port_priv) { - mad_priv->mad.mad.mad_hdr.tid = - ((struct ib_mad *)smp)->mad_hdr.tid; + memcpy(&mad_priv->mad.mad, smp, sizeof(struct ib_mad)); recv_mad_agent = find_mad_agent(port_priv, &mad_priv->mad.mad); } diff --git a/drivers/infiniband/core/smi.h b/drivers/infiniband/core/smi.h index 1cfc298..aff96ba 100644 --- a/drivers/infiniband/core/smi.h +++ b/drivers/infiniband/core/smi.h @@ -59,7 +59,8 @@ extern enum smi_action smi_handle_dr_smp_send(struct ib_smp *smp, u8 node_type, int port_num); /* - * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM + * via process_mad */ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, struct ib_device *device) @@ -71,4 +72,19 @@ static inline enum smi_action smi_check_local_smp(struct ib_smp *smp, (smp->hop_ptr == smp->hop_cnt + 1)) ? 
IB_SMI_HANDLE : IB_SMI_DISCARD); } + +/* + * Return IB_SMI_HANDLE if the SMP should be handled by the local SMA/SM + * via process_mad + */ +static inline enum smi_action smi_check_local_returning_smp(struct ib_smp *smp, + struct ib_device *device) +{ + /* C14-13:3 -- We're at the end of the DR segment of path */ + /* C14-13:4 -- Hop Pointer == 0 -> give to SM */ + return ((device->process_mad && + ib_get_smp_direction(smp) && + !smp->hop_ptr) ? IB_SMI_HANDLE : IB_SMI_DISCARD); +} + #endif /* __SMI_H_ */ From hrosenstock at xsigo.com Tue Oct 23 15:07:41 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 15:07:41 -0700 Subject: [Fwd: [ofa-general] [PATCH] IB/ipath - Enable loopback of DR SMP responses from userspace] Message-ID: <1193177261.18113.250.camel@hrosenstock-ws.xsigo.com> Hi Roland, This is patch 3. Acked-by: Hal Rosenstock -- Hal -------- Forwarded Message -------- From: Ralph Campbell To: openib Subject: [ofa-general] [PATCH] IB/ipath - Enable loopback of DR SMP responses from userspace Date: Fri, 19 Oct 2007 13:33:46 -0700 This patch is in response to reviewing a patch to the core MAD processing which fixes loopback of directed route packets to/from user level MAD agents. This change enables the core code to work for ib_ipath. Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c index 3d1432d..1978c34 100644 --- a/drivers/infiniband/hw/ipath/ipath_mad.c +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -1434,7 +1434,7 @@ static int process_subn(struct ib_device *ibdev, int mad_flags, * before checking for other consumers. * Just tell the caller to process it normally. */ - ret = IB_MAD_RESULT_FAILURE; + ret = IB_MAD_RESULT_SUCCESS; goto bail; default: smp->status |= IB_SMP_UNSUP_METHOD; @@ -1516,7 +1516,7 @@ static int process_perf(struct ib_device *ibdev, u8 port_num, * before checking for other consumers. * Just tell the caller to process it normally. 
*/ - ret = IB_MAD_RESULT_FAILURE; + ret = IB_MAD_RESULT_SUCCESS; goto bail; default: pmp->status |= IB_SMP_UNSUP_METHOD; _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From krause at cup.hp.com Tue Oct 23 15:23:45 2007 From: krause at cup.hp.com (Michael Krause) Date: Tue, 23 Oct 2007 15:23:45 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471E56BF.2080407@opengridcomputing.com> References: <471E4438.6080300@ichips.intel.com> <471E4773.4040000@ichips.intel.com> <471E56BF.2080407@opengridcomputing.com> Message-ID: <6.2.0.14.2.20071023152004.02d3bc40@esmail.cup.hp.com> At 01:17 PM 10/23/2007, Steve Wise wrote: >Sean Hefty wrote: >>>There has been much discussion on a private thread regarding bug #735 - >>>"dapltest performance tests don't adhere to iWARP standard" that needs >>>to move to the general list. >>This bug would be better titled "iWarp cannot support uDAPL API". :) >>Seriously, the iWarp and uDAPL specs conflict. One needs to change. >> >>>Can someone come up with a solution, possibly in iWARP CM, that will >>>work and insure interoperability between iWARP devices? >>I thought the restriction was there to support switching between >>streaming and rdma mode. If a connection only uses rdma mode, is the >>restriction really needed at all? > >Yes because all iWARP connections start out as TCP streaming mode >connections, and the MPA startup messages are sent in streaming mode. Then >the connection is transitioned into FPDU (Framed PDU) mode using the MPA >protocol. Correct. The IETF was very clear on these requirements (significant debate occurred over at least 12-18 months) and there is unlikely to be any traction in changing the iWARP specifications to provide another mechanism. 
Best to provide API that detect which semantics are required and then if the application cannot adjust, then it cannot use the iWARP semantics.

BTW, if one uses the SDP port mapper protocol (see the IETF SDP specification), one can detect from the start that RDMA is being used and one could start in RDMA mode sans the MPA requirement. The SDP port mapper protocol also enables one to apply various other policies such as determining whether the application / remote node session should be allowed to run over RDMA or not - simple point of control for management.

Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jsquyres at cisco.com Tue Oct 23 16:01:36 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 23 Oct 2007 19:01:36 -0400
Subject: [ofa-general] OpenFabrics Developer's Summit: tentative agenda
In-Reply-To: <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com>
References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com>
Message-ID: 

I, too, would suggest that sessions requiring active discussion among the OFED developers should be given higher priority in the schedule / longer sessions than those that are just presenting a status update (e.g., Open MPI and MVAPICH). I would think that both the MPI status updates could be shortened down to 15 minutes each and moved to early Friday morning, for example.

Is there any intent for HP MPI or Intel MPI to speak? I would be interested to hear what they have to say (e.g., feedback on the OFED stack vs. other network stacks and other status update kinds of things). That would be another reason to shorten the existing MPI sessions (so that all the MPI's can speak).

Just my $0.02...

On Oct 23, 2007, at 4:48 PM, Or Gerlitz wrote:

> On 10/23/07, Johann George wrote:
> Below is a tentative agenda for the upcoming OpenFabrics Developer's
> Summit. Comments, suggestions and feedback on the agenda are also
> welcome.
>
> Going strict develpers wise, the issues that come to my mind by
> order of priority are:
>
> 1) IPoIB Stateless Offloads - checksum/LSO/LRO - design, open issues
> 2) the long time open SA caching thing - problem statement,
> possible solutions/designs
> 3) QoS update/status - past, present and future
> 4) RDS - status, update, upstream plans
> 5) InfiniBand Routing Update
> 6) XRC
> and also
> 7) Fibre Channel over InfiniBand
> 8) ConnectX Reliable Multicast
>
> I suggest that --none-- of this list (with the possible exception
> of 7 && 8) would be left for Friday - maybe in the price of doing
> parallel sessions. And yes, some or most of them need way more then
> 20 minutes, specifically 1 && 2 where we will need at least 45
> minutes each - this is my take.
>
> For that end, I suggest that my session would be moved to the next
> day and same for all the sessions (except for the SA cache) planned
> to the first day after the 14:50 break.
>
> I will be availble to further argue on the matter starting on
> Sunday when back from vacation...
>
> Thursday, November 15, 2007
> ---------------------------
> 13:00  20m  OFED: Feedback from Alexa
>             Ekechi Nwokah, Alexa
> 13:20  20m  OFED: Feedback from RedHat
>             Doug Ledford, RedHat
> 13:40  20m  OFED: Feedback from SuSE
>             Moiz Kohari, Novell
> 14:00  20m  The Journey of a Patch: from Submission to Distros
>             Roland Drier, Cisco
> 14:20  30m  OFED 1.3 Update and Procedure Review
>             Tziporet Koren, Mellanox
> ------------------------------------
> 14:50  20m  Break
> ------------------------------------
> 15:10  20m  OpenFabrics Logo Program: Experience So Far
>             Arkady Kanevsky, Network Appliance
> 15:30  30m  Update on MVAPICH and MVAPICH2
>             DK Panda, Ohio State University
> 16:00  20m  Update on OpenMPI
>             Jeff Sqyures, Cisco
> 16:20  20m  uDAPL 2.0
>             Arkady Kanevsky, Network Appliance
> 16:40  20m  SA Caching
>             Sean Hefty, Intel
> ------------------------------------
> 17:00  60m  Dinner
> ------------------------------------
> 18:00  20m  Update on NFSoRDMA
>             James Lentini, Network Appliance
> 18:20  20m  Lustre
>             Eric Barton, Sun Microsystems
> 18:40  20m  Bonding
>             Or Gerlitz, Voltaire
> 19:00  20m  iWARP update
>             Bill Boas, System Fabric Works
> 19:20  40m  iWARP discussion on issues
>             Bill Boas, System Fabric Works; Gopal Hegde, Cisco; Glenn Grundstrom, NetEffect; Bruck Girmay, Chelsio
>
>
>
> Friday, November 16, 2007
> -------------------------
> 07:15  45m  Breakfast
> ------------------------------------
> 08:00  30m  WinOF: Update and Futures
>             Gilad Shainer, Mellanox
> 08:30  30m  CCS Ve2 Preview
>             Eric Lantz, Microsoft
> 09:00  30m  OFED 1.4 Planned Features
>             Tziporet Koren, Mellanox
> 09:30  20m  OFED Management Tools
>             Ira Weiny, Lawrence Livermore National Laboratories
> ------------------------------------
> 09:50  20m  Break
> ------------------------------------
> 10:10  20m  RDS with Zero Copy
>             Rick Frank, Oracle
> 10:30  20m  QoS Support
>             Sean Hefty, Intel; Dror Goldenberg, Mellanox
> 10:50  20m  InfiniBand Routing Update
>             Jason Gunthorpe, Obsidian Research
> 11:10  20m  IPoIB Stateless Offloads
>             Liran Liss, Mellanox
> 11:30  20m  Using XRC
>             Dror Goldenberg, Mellanox and Dr. Panda, Ohio State University
> 11:50  20m  Fibre Channel over InfiniBand
>             Dror Goldenberg, Mellanox
> ------------------------------------
> 12:10  60m  Lunch
> ------------------------------------
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

-- 
Jeff Squyres
Cisco Systems

From mshefty at ichips.intel.com Tue Oct 23 16:57:30 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 23 Oct 2007 16:57:30 -0700
Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU
In-Reply-To: <471E5880.1030100@opengridcomputing.com>
References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com>
Message-ID: <471E8A6A.2030207@ichips.intel.com>

> That is what I've been trying to push. Both MVAPICH2 and OMPI have been
> open to adjusting their transports to adhere to this requirement.
>
> I wouldn't mind implementing something to enforce this in the IWCM or
> the iWARP drivers IF there was a clean way to do it. So far there
> hasn't been a clean way proposed.

Why can't either uDAPL or iW CM always do a send from the active to passive side that gets stripped off? From the active side, the first send is always posted before any user sends, and if necessary, a user send can be queued by software to avoid a QP/CQ overrun. The completion can simply be eaten by software. On the passive side, you have a similar process for receiving the data.

(Yes this adds wire protocol, which requires both sides to support it.)
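Roughly, the bookkeeping could look like the sketch below. All names are made up for illustration (this is not existing uDAPL or iw_cm code), and it assumes that the hidden send occupies one slot of the send queue the user sized and that one wr_id value can be reserved for it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wr_id reserved for the hidden first send. */
#define HIDDEN_WR_ID UINT64_MAX

/* Active side: send queue accounting with the hidden send included. */
struct active_conn {
	unsigned sq_depth;       /* send queue slots the user sized the QP for */
	unsigned outstanding;    /* sends on the queue, hidden send included */
	bool hidden_pending;     /* hidden send has not completed yet */
	bool have_deferred;      /* one user send is held back in software */
	uint64_t deferred_wr_id;
};

/* Connection established: the hidden send is posted before any user send. */
static void active_connected(struct active_conn *c, unsigned sq_depth)
{
	assert(c != NULL && sq_depth > 0);
	c->sq_depth = sq_depth;
	c->outstanding = 1;
	c->hidden_pending = true;
	c->have_deferred = false;
	c->deferred_wr_id = 0;
}

/* User send: post if a slot is free; queue in software if only the hidden
 * send is in the way; otherwise report a genuine overrun with -1. */
static int user_post_send(struct active_conn *c, uint64_t wr_id)
{
	if (c->outstanding < c->sq_depth) {
		c->outstanding++;
		return 0;
	}
	if (c->hidden_pending && !c->have_deferred) {
		c->have_deferred = true;
		c->deferred_wr_id = wr_id;
		return 0;
	}
	return -1;
}

/* Send completion: returns true if the user should see it. The hidden
 * completion is eaten, and the deferred user send (if any) is posted. */
static bool send_completed(struct active_conn *c, uint64_t wr_id)
{
	c->outstanding--;
	if (wr_id != HIDDEN_WR_ID)
		return true;
	c->hidden_pending = false;
	if (c->have_deferred) {
		c->have_deferred = false;
		c->outstanding++;
	}
	return false;
}

/* Passive side: the first receive is the hidden message and is stripped. */
struct passive_conn {
	bool hidden_received;
};

static bool recv_completed(struct passive_conn *p)
{
	if (!p->hidden_received) {
		p->hidden_received = true;
		return false;
	}
	return true;
}
```

Because the hidden send takes exactly one slot, at most one user send ever needs to be held back, which is why a single deferred entry is enough in this model.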
- Sean From johann.george at qlogic.com Tue Oct 23 17:31:59 2007 From: johann.george at qlogic.com (Johann George) Date: Tue, 23 Oct 2007 17:31:59 -0700 Subject: [ofa-general] OpenFabrics Developer's Summit: tentative agenda In-Reply-To: <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> Message-ID: <20071024003159.GA10244@cuprite.pathscale.com> Or, Thanks for your feedback. We can certainly expand, shorten and change sessions around to maximize benefit for the attendees. > I suggest that --none-- of this list (with the possible exception of > 7 && 8) would be left for Friday We are hoping that most attendees will stay for the entire summit. Nevertheless, some people have early flights on Friday while others have commitments that prevent them from attending several of the sessions on Thursday. Hopefully both these groups are a minority. > maybe in the price of doing parallel sessions. We usually get negative feedback with parallel sessions since attendees inevitably run into conflicts. > And yes, some or most of them need way more then 20 minutes, > specifically 1 && 2 where we will need at least 45 minutes each - > this is my take. I do agree that we would benefit from more discussion but it is near impossible to pack all the topics in the time allotted. What I am hoping is that we can present the issues concisely during the sessions and facilitate continued discussions afterwards. Many members of the community are unfamiliar with the details of the issues. > I will be availble to further argue on the matter starting on Sunday > when back from vacation... I can't wait. :-) Seriously, thanks much for the feedback. If the majority feel we should move some sessions, we'll do our best to accommodate people's schedules. 
Johann From johann.george at qlogic.com Tue Oct 23 17:40:42 2007 From: johann.george at qlogic.com (Johann George) Date: Tue, 23 Oct 2007 17:40:42 -0700 Subject: [ofa-general] OpenFabrics Developer's Summit: tentative agenda In-Reply-To: References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> Message-ID: <20071024004042.GB10244@cuprite.pathscale.com> Jeff, > Is there any intent for HP MPI or Intel MPI to speak? I would be > interested to hear what they have to say (e.g., feedback on the OFED > stack vs. other network stacks and other status update kinds of > things). We considered it but given the time constraints, thought we should wait until Sonoma. Priority was given to OpenMPI and MVAPICH since they are being shipped as part of OFED. Still, as you point out, getting feedback on their view of OFED vs. other networking stacks could be valuable. Johann From ggrundstrom at NetEffect.com Tue Oct 23 18:02:06 2007 From: ggrundstrom at NetEffect.com (Glenn Grundstrom) Date: Tue, 23 Oct 2007 20:02:06 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471E8A6A.2030207@ichips.intel.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> > > That is what I've been trying to push. Both MVAPICH2 and > OMPI have been > > open to adjusting their transports to adhere to this requirement. > > > > I wouldn't mind implementing something to enforce this in > the IWCM or > > the iWARP drivers IF there was a clean way to do it. So far there > > hasn't been a clean way proposed. > > Why can't either uDAPL or iW CM always do a send from the active to > passive side that gets stripped off? 
From the active side, the first > send is always posted before any user sends, and if necessary, a user > send can be queued by software to avoid a QP/CQ overrun. The > completion > can simply be eaten by software. On the passive side, you have a > similar process for receiving the data. This is similar to an option in the NetEffect driver. A zero byte RDMA write is sent from the active side and accounted for on the passive side. This can be turned on and off by compile and module options for compatibility. I second Sean's question - why can't uDAPL or the iw_cm do this? > > (Yes this adds wire protocol, which requires both sides to > support it.) > > - Sean > From Arkady.Kanevsky at netapp.com Tue Oct 23 18:25:46 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 23 Oct 2007 21:25:46 -0400 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com><471E8A6A.2030207@ichips.intel.com> <5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> Message-ID: This is still a protocol and should be defined by IETF not OFA. But if we get agreement from all iWARP vendors this will be a good step. If we can not get agreement on it on reflector lets do it at SC'07 OFA dev. conference. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] > Sent: Tuesday, October 23, 2007 9:02 PM > To: Sean Hefty; Steve Wise > Cc: Roland Dreier; interop-wg at lists.openfabrics.org; > OpenFabrics General > Subject: RE: [ofa-general] [RFP] support for iWARP > requirement - activeconnect side MUST send first FPDU > > > > That is what I've been trying to push. 
Both MVAPICH2 and > > OMPI have been > > > open to adjusting their transports to adhere to this requirement. > > > > > > I wouldn't mind implementing something to enforce this in > > the IWCM or > > > the iWARP drivers IF there was a clean way to do it. So > far there > > > hasn't been a clean way proposed. > > > > Why can't either uDAPL or iW CM always do a send from the active to > > passive side that gets stripped off? From the active side, > the first > > send is always posted before any user sends, and if > necessary, a user > > send can be queued by software to avoid a QP/CQ overrun. The > > completion can simply be eaten by software. On the passive > side, you > > have a similar process for receiving the data. > > This is similar to an option in the NetEffect driver. A zero > byte RDMA write is sent from the active side and accounted > for on the passive side. This can be turned on and off by > compile and module options for compatibility. > > I second Sean's question - why can't uDAPL or the iw_cm do this? > > > > > (Yes this adds wire protocol, which requires both sides to support > > it.) 
> > > > - Sean > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From felix at chelsio.com Tue Oct 23 19:05:50 2007 From: felix at chelsio.com (Felix Marti) Date: Tue, 23 Oct 2007 19:05:50 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU References: <471E4438.6080300@ichips.intel.com><471E5880.1030100@opengridcomputing.com><471E8A6A.2030207@ichips.intel.com><5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> Message-ID: <8A71B368A89016469F72CD08050AD33401B7BAD1@maui.asicdesigners.com> > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general- > bounces at lists.openfabrics.org] On Behalf Of Kanevsky, Arkady > Sent: Tuesday, October 23, 2007 6:26 PM > To: Glenn Grundstrom; Sean Hefty; Steve Wise > Cc: Roland Dreier; interop-wg at lists.openfabrics.org; OpenFabrics > General > Subject: RE: [ofa-general] [RFP] support for iWARP requirement - > activeconnectside MUST send first FPDU > > This is still a protocol and should be defined by IETF not OFA. > But if we get agreement from all iWARP vendors this will be a good > step. [felix] This will not work with a Chelsio RNIC which follows the IETF specification by a) not issuing a 0B RDMA Write to the wire and b) silently consuming an incoming 0B write. Therefore 0B RDMA Writes cannot be 'abused' for such a synchronization mechanism. I believe that the mentioned apps adhering to the iWarp requirement do a 'send' from the active side and only have the passive side issue RDMA ops once the incoming send has been received. I would guess that following a similar model is the best way to go and supported by all iWarp vendors implementing the IETF spec. > If we can not get agreement on it on reflector lets do > it at SC'07 OFA dev. 
conference. > > Arkady Kanevsky email: arkady at netapp.com > Network Appliance Inc. phone: 781-768-5395 > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > Waltham, MA 02451 central phone: 781-768-5300 > > > > -----Original Message----- > > From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] > > Sent: Tuesday, October 23, 2007 9:02 PM > > To: Sean Hefty; Steve Wise > > Cc: Roland Dreier; interop-wg at lists.openfabrics.org; > > OpenFabrics General > > Subject: RE: [ofa-general] [RFP] support for iWARP > > requirement - activeconnect side MUST send first FPDU > > > > > > That is what I've been trying to push. Both MVAPICH2 and > > > OMPI have been > > > > open to adjusting their transports to adhere to this requirement. > > > > > > > > I wouldn't mind implementing something to enforce this in > > > the IWCM or > > > > the iWARP drivers IF there was a clean way to do it. So > > far there > > > > hasn't been a clean way proposed. > > > > > > Why can't either uDAPL or iW CM always do a send from the active to > > > passive side that gets stripped off? From the active side, > > the first > > > send is always posted before any user sends, and if > > necessary, a user > > > send can be queued by software to avoid a QP/CQ overrun. The > > > completion can simply be eaten by software. On the passive > > side, you > > > have a similar process for receiving the data. > > > > This is similar to an option in the NetEffect driver. A zero > > byte RDMA write is sent from the active side and accounted > > for on the passive side. This can be turned on and off by > > compile and module options for compatibility. > > > > I second Sean's question - why can't uDAPL or the iw_cm do this? > > > > > > > > (Yes this adds wire protocol, which requires both sides to support > > > it.)
> > > > > > - Sean > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general
From hrosenstock at xsigo.com Tue Oct 23 20:39:59 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 20:39:59 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] libibumad/umad_poll.3: Fix man page return value description Message-ID: <1193197199.22038.35.camel@hrosenstock-ws.xsigo.com> libibumad/umad_poll.3: Fix man page return value description Signed-off-by: Hal Rosenstock diff --git a/libibumad/man/umad_poll.3 b/libibumad/man/umad_poll.3 index 38f71b0..d724ed3 100644 --- a/libibumad/man/umad_poll.3 +++ b/libibumad/man/umad_poll.3 @@ -1,6 +1,6 @@ .\" -*- nroff -*- .\" -.TH UMAD_POLL 3 "May 11, 2007" "OpenIB" "OpenIB Programmer\'s Manual" +.TH UMAD_POLL 3 "October 23, 2007" "OpenIB" "OpenIB Programmer\'s Manual" .SH "NAME" umad_poll \- poll umad .SH "SYNOPSIS" @@ -28,8 +28,8 @@ parameter of zero to .B umad_recv() to ensure a non-blocking read. .SH "RETURN VALUE" -.B umad_recv() -returns non negative receiving agentid on success, and a negative value on error as follows: +.B umad_poll() +returns 0 on success, and a negative value on error as follows: -EINVAL invalid port handle or agentid -ETIMEDOUT poll operation timed out -EIO poll operation failed From hrosenstock at xsigo.com Tue Oct 23 20:42:54 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 23 Oct 2007 20:42:54 -0700 Subject: [Fwd: [Fwd: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler()]] Message-ID: <1193197374.22038.39.camel@hrosenstock-ws.xsigo.com> Actually to be complete, this one is: Acked-by: Ralph Campbell Acked-by: Hal Rosenstock -------- Forwarded Message -------- From: Hal Rosenstock To: Roland Dreier Cc: general at lists.openfabrics.org, Ralph Campbell Subject: [Fwd: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler()] Date: Tue, 23 Oct 2007 15:04:15 -0700 Hi Roland, This is patch 1. Hope my mailer doesn't munge it.
Acked-by: Hal Rosenstock -- Hal -------- Forwarded Message -------- From: Ralph Campbell To: openib Subject: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler() Date: Wed, 17 Oct 2007 18:06:42 -0700 In ib_mad_recv_done_handler(), the response pointer is checked for NULL after allocating it. It is then checked again in the local process_mad() path but there is no possibility of it changing in between. Signed-off-by: Ralph Campbell diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 6f42877..f82900d 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1931,15 +1931,6 @@ local: if (port_priv->device->process_mad) { int ret; - if (!response) { - printk(KERN_ERR PFX "No memory for response MAD\n"); - /* - * Is it better to assume that - * it wouldn't be processed ? - */ - goto out; - } - ret = port_priv->device->process_mad(port_priv->device, 0, port_priv->port_num, wc, &recv->grh, _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tom at opengridcomputing.com Tue Oct 23 21:16:55 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 23 Oct 2007 23:16:55 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <6.2.0.14.2.20071023152004.02d3bc40@esmail.cup.hp.com> References: <471E4438.6080300@ichips.intel.com> <471E4773.4040000@ichips.intel.com> <471E56BF.2080407@opengridcomputing.com> <6.2.0.14.2.20071023152004.02d3bc40@esmail.cup.hp.com> Message-ID: <471EC737.8050605@opengridcomputing.com> Michael Krause wrote: > At 01:17 PM 10/23/2007, Steve Wise wrote: > > >> Sean Hefty wrote: >>>> There has been much discussion on a private thread regarding bug >>>> #735 - "dapltest performance tests don't adhere 
to iWARP standard" >>>> that needs to move to the general list. >>> This bug would be better titled "iWarp cannot support uDAPL API". :) >>> Seriously, the iWarp and uDAPL specs conflict. One needs to change. >>> >>>> Can someone come up with a solution, possibly in iWARP CM, that >>>> will work and insure interoperability between iWARP devices? >>> I thought the restriction was there to support switching between >>> streaming and rdma mode. If a connection only uses rdma mode, is >>> the restriction really needed at all? >> >> Yes because all iWARP connections start out as TCP streaming mode >> connections, and the MPA startup messages are sent in streaming mode. >> Then the connection is transitioned into FPDU (Framed PDU) mode >> using the MPA protocol. > > Correct. The IETF was very clear on these requirements (significant > debate occurred over at least 12-18 months) and there is unlikely to > be any traction in changing the iWARP specifications to provide > another mechanism. Best to provide API that detect which semantics > are required and then if the application cannot adjust, then it cannot > use the iWARP semantics. First let me apologize in advance, but that is simply not a workable solution for the customer. I'm not taking anything away from the efforts of those involved with the definition of the MPA protocol, however, unfortunately that protracted debate occurred 2-3 years in advance of a deployed solution. The duration of the debate doesn't overcome the absence of practical perspective. There are now multiple implementations, the customers of which are complaining about the cost of the compromises made. We now have the benefit of hindsight and in my option should rev the MPA protocol. After all, that's why the number is there in the header -- right? It may be that those involved with the original debate have no interest in revisiting it, but IMO that is irrelevant. 
There are now new companies involved that implemented RDDP, have customers using it, and have a sustaining (both interpretations intended) interest in making RDDP better. I, for one, would encourage them to do so. Protocols are not immutable, unless they're dead. > > BTW, if one uses the SDP port mapper protocol (see the IETF SDP > specification), one can detect from the start that RDMA is being used > and one could start in RDMA mode sans the MPA requirement. The SDP > port mapper protocol also enables one to apply various other policies > such as determining whether the application / remote node session > should be allowed to run over RDMA or not - simple point of control > for management. > Really? What about CRC, Markers and Private Data? > Mike > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From tom at opengridcomputing.com Tue Oct 23 21:31:50 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 23 Oct 2007 23:31:50 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU In-Reply-To: <8A71B368A89016469F72CD08050AD33401B7BAD1@maui.asicdesigners.com> References: <471E4438.6080300@ichips.intel.com><471E5880.1030100@opengridcomputing.com><471E8A6A.2030207@ichips.intel.com><5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <8A71B368A89016469F72CD08050AD33401B7BAD1@maui.asicdesigners.com> Message-ID: <471ECAB6.9050606@opengridcomputing.com> Felix Marti wrote: > >> -----Original Message----- >> From: general-bounces at lists.openfabrics.org [mailto:general- >> bounces at lists.openfabrics.org] On Behalf Of Kanevsky, Arkady >> Sent: Tuesday, October 23, 2007 6:26 PM >> To: Glenn Grundstrom; Sean Hefty; Steve Wise >> Cc: Roland Dreier; interop-wg at lists.openfabrics.org; OpenFabrics >> General >> Subject: RE: [ofa-general] [RFP] support for iWARP requirement - >> activeconnectside MUST send first FPDU >> >> This is still a protocol and should be defined by IETF not OFA. >> But if we get agreement from all iWARP vendors this will be a good >> step. >> > [felix] This will not work with a Chelsio RNIC which follows the IETF > specification by a) not issuing a 0B RDMA Write to the wire and b) > silently consuming an incoming 0B write. Therefore 0B RDMA Writes cannot > be 'abused' for such a synchronization mechanism. I believe that the > mentioned apps adhering to the iWarp requirement do a 'send' from the > active side and only have the passive side issue RDMA ops once the > incoming send has been received. I would guess that following a similar > model is the best way to go and supported by all iWarp vendors > implementing the IETF spec. > > IMO, the iWARP vendors _must_ get together and work on MPA '2'.
Standardizing FPDU 'abuse' might be a good place to start, but it needs to be fixed to support peer-to-peer going forward. In the mean-time, imperfectly hiding the issue in the Firmware, uDAPL, the iWARP CM or anywhere else except the application seems to me to be the only customer friendly solution. > >> If we can not get agreement on it on reflector lets do >> it at SC'07 OFA dev. conference. >> >> Arkady Kanevsky email: arkady at netapp.com >> Network Appliance Inc. phone: 781-768-5395 >> 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 >> Waltham, MA 02451 central phone: 781-768-5300 >> >> >> >>> -----Original Message----- >>> From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] >>> Sent: Tuesday, October 23, 2007 9:02 PM >>> To: Sean Hefty; Steve Wise >>> Cc: Roland Dreier; interop-wg at lists.openfabrics.org; >>> OpenFabrics General >>> Subject: RE: [ofa-general] [RFP] support for iWARP >>> requirement - activeconnect side MUST send first FPDU >>> >>> >>>>> That is what I've been trying to push. Both MVAPICH2 and >>>>> >>>> OMPI have been >>>> >>>>> open to adjusting their transports to adhere to this >>>>> > requirement. > >>>>> I wouldn't mind implementing something to enforce this in >>>>> >>>> the IWCM or >>>> >>>>> the iWARP drivers IF there was a clean way to do it. So >>>>> >>> far there >>> >>>>> hasn't been a clean way proposed. >>>>> >>>> Why can't either uDAPL or iW CM always do a send from the active >>>> > to > >>>> passive side that gets stripped off? From the active side, >>>> >>> the first >>> >>>> send is always posted before any user sends, and if >>>> >>> necessary, a user >>> >>>> send can be queued by software to avoid a QP/CQ overrun. The >>>> completion can simply be eaten by software. On the passive >>>> >>> side, you >>> >>>> have a similar process for receiving the data. >>>> >>> This is similar to an option in the NetEffect driver. 
A zero >>> byte RDMA write is sent from the active side and accounted >>> for on the passive side. This can be turned on and off by >>> compile and module options for compatibility. >>> >>> I second Sean's question - why can't uDAPL or the iw_cm do this? >>> >>> >>>> (Yes this adds wire protocol, which requires both sides to support >>>> it.) >>>> >>>> - Sean >>>> >>>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >>> >>> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> > http://openib.org/mailman/listinfo/openib- > >> general >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From bill.boas at gmail.com Tue Oct 23 21:39:51 2007 From: bill.boas at gmail.com (Bill Boas) Date: Tue, 23 Oct 2007 21:39:51 -0700 Subject: [ofa-general] Re: [Interop-wg] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471E4438.6080300@ichips.intel.com> References: <471E4438.6080300@ichips.intel.com> Message-ID: <19a929370710232139laa1d052ie89498a710ef4a85@mail.gmail.com> All involved with iWARP and OFED contribution and maintenance: Before the developers' session in Reno about iWARP, could you, as a team working together, prepare a complete list of the iWARP technical issues that confront everyone, the differences you have with each other about the current state of the implementation, and the differences you have about resolving future issues you already have a
sense of? Please publish that list before SC, with the options for resolution that have been put forward so far. Arlin, are you willing to take the lead on accomplishing this? Is that a reasonable request so we can all understand and help with the resolutions, if possible? Bill. On 10/23/07, Arlin Davis wrote: > > > There has been much discussion on a private thread regarding bug #735 - > "dapltest performance tests don't adhere to iWARP standard" that needs > to move to the general list. > > iWARP has a requirement that the active side of the connection MUST be > the first to send the first FPDU (SEND or RDMA operation). This presents > a problem with applications written for uDAPL and OFA verbs given that > there is no such restriction. So, short of requiring every OFA > application/ULP to adhere to this restriction, we need the iWARP vendors > to come up with a standard method to remove the restriction. > > Can someone come up with a solution, possibly in iWARP CM, that will > work and ensure interoperability between iWARP devices? > > > -arlin > > > > > > > > _______________________________________________ > Interop-wg mailing list > Interop-wg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/interop-wg > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From felix at chelsio.com Tue Oct 23 22:30:55 2007 From: felix at chelsio.com (Felix Marti) Date: Tue, 23 Oct 2007 22:30:55 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU References: <471E4438.6080300@ichips.intel.com><471E5880.1030100@opengridcomputing.com><471E8A6A.2030207@ichips.intel.com><5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <8A71B368A89016469F72CD08050AD33401B7BAD1@maui.asicdesigners.com> <471ECAB6.9050606@opengridcomputing.com> Message-ID: <8A71B368A89016469F72CD08050AD33401B7BAF0@maui.asicdesigners.com> > -----Original Message----- > From: Tom Tucker [mailto:tom at opengridcomputing.com] > Sent: Tuesday, October 23, 2007 9:32 PM > To: Felix Marti > Cc: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise; Roland > Dreier; interop-wg at lists.openfabrics.org; OpenFabrics General > Subject: Re: [ofa-general] [RFP] support for iWARP requirement - > activeconnectside MUST send first FPDU > > Felix Marti wrote: > > > >> -----Original Message----- > >> From: general-bounces at lists.openfabrics.org [mailto:general- > >> bounces at lists.openfabrics.org] On Behalf Of Kanevsky, Arkady > >> Sent: Tuesday, October 23, 2007 6:26 PM > >> To: Glenn Grundstrom; Sean Hefty; Steve Wise > >> Cc: Roland Dreier; interop-wg at lists.openfabrics.org; OpenFabrics > >> General > >> Subject: RE: [ofa-general] [RFP] support for iWARP requirement - > >> activeconnectside MUST send first FPDU > >> > >> This is still a protocol and should be defined by IETF not OFA. > >> But if we get agreement from all iWARP vendors this will be a good > >> step. > >> > > [felix] This will not work with a Chelsio RNIC which follows the IETF > > specification by a) not issuing a 0B RDMA Write to the wire and b) > > silently consuming an incoming 0B write. Therefore 0B RDMA Writes > cannot > > be 'abused' for such a synchronization mechanism. 
I believe that the > > mentioned apps adhering to the iWarp requirement do a 'send' from the > > active side and only have the passive side issue RDMA ops once the > > incoming send has been received. I would guess that following a > similar > > model is the best way to go and supported by all iWarp vendors > > implementing the IETF spec. > > > > > IMO, the iWARP vendors _must_ get together and work on MPA '2'. > Standardizing FPDU 'abuse' might be a good place to start, but it needs > to be fixed to support peer-to-peer going forward. > > In the mean-time, imperfectly hiding the issue in the Firmware, uDAPL, > the iWARP CM or anywhere else except the application seems to me to be > the only customer friendly solution. [felix] While I'm not against trying to hide the connection migration details somewhere below the ULP, I'm not convinced that the issue is as severe as you make it to be and I would not press to have the issue resolved in a matter that requires a new MPA version. In fact, the different rdma transports (and maybe even different versions of the same transport (in the case of IB)) provide different features and I would assume that ULPs will eventually code to these features and must thus be aware of the underlying transport protocol. In that bigger picture, the connection migration issue at hand seems fairly trivial to solve even if it requires an ULP change... > > > >> If we can not get agreement on it on reflector lets do > >> it at SC'07 OFA dev. conference. > >> > >> Arkady Kanevsky email: arkady at netapp.com > >> Network Appliance Inc. phone: 781-768-5395 > >> 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 > >> Waltham, MA 02451 central phone: 781-768-5300 > >> > >> > >> > >>> -----Original Message----- > >>> From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] > >>> Sent: Tuesday, October 23, 2007 9:02 PM > >>> To: Sean Hefty; Steve Wise > >>> Cc: Roland Dreier; interop-wg at lists.openfabrics.org; > >>> OpenFabrics General > >>> Subject: RE: [ofa-general] [RFP] support for iWARP > >>> requirement - activeconnect side MUST send first FPDU > >>> > >>> > >>>>> That is what I've been trying to push. Both MVAPICH2 and > >>>>> > >>>> OMPI have been > >>>> > >>>>> open to adjusting their transports to adhere to this > >>>>> > > requirement. > > > >>>>> I wouldn't mind implementing something to enforce this in > >>>>> > >>>> the IWCM or > >>>> > >>>>> the iWARP drivers IF there was a clean way to do it. So > >>>>> > >>> far there > >>> > >>>>> hasn't been a clean way proposed. > >>>>> > >>>> Why can't either uDAPL or iW CM always do a send from the active > >>>> > > to > > > >>>> passive side that gets stripped off? From the active side, > >>>> > >>> the first > >>> > >>>> send is always posted before any user sends, and if > >>>> > >>> necessary, a user > >>> > >>>> send can be queued by software to avoid a QP/CQ overrun. The > >>>> completion can simply be eaten by software. On the passive > >>>> > >>> side, you > >>> > >>>> have a similar process for receiving the data. > >>>> > >>> This is similar to an option in the NetEffect driver. A zero > >>> byte RDMA write is sent from the active side and accounted > >>> for on the passive side. This can be turned on and off by > >>> compile and module options for compatibility. > >>> > >>> I second Sean's question - why can't uDAPL or the iw_cm do this? > >>> > >>> > >>>> (Yes this adds wire protocol, which requires both sides to support > >>>> it.) 
> >>>> > >>>> - Sean > >>>> > >>>> > >>> _______________________________________________ > >>> general mailing list > >>> general at lists.openfabrics.org > >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>> > >>> To unsubscribe, please visit > >>> http://openib.org/mailman/listinfo/openib-general > >>> > >>> > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit > >> > > http://openib.org/mailman/listinfo/openib- > > > >> general > >> > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > >
From eli at mellanox.co.il Wed Oct 24 02:38:26 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 24 Oct 2007 11:38:26 +0200 Subject: [ofa-general] Re: [PATCH] IB/ipoib: IPOIB CM rx use higher order fragments In-Reply-To: References: <1193155667.25235.4.camel@mtls03> Message-ID: <1193218707.25235.18.camel@mtls03> On Tue, 2007-10-23 at 11:35 -0700, Roland Dreier wrote: > > In order to reduce the overhead of iterating the fragments of an > > SKB in the receive flow, we use fragments of higher order and thus > > reduce the number of iterations. This patch seams to improve receive > > throughput of small UDP messages. > > I don't think we want to do this -- it may be good for benchmarks but > it will hurt reliability, since systems often have highly fragmented > memory so higher-order atomic allocations will fail. > > - R. Other drivers do similar allocations. For example, e1000 when working with jumbo frames does such large allocations. Also I did not notice allocation failures though my system was pretty much active but I can monitor for such possible failures.
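To make the trade-off in this exchange concrete, here is a rough standalone sketch of how buffer size maps to allocation order and fragment count. It is not code from the ipoib patch or from e1000: the helper names alloc_order() and frags_needed() are hypothetical, and the 4096-byte page size and 64 KiB receive buffer in the usage note are assumptions chosen only for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper: smallest order such that (page_size << order) >= len.
 * This follows the same idea as the kernel's get_order(), but it is a
 * userspace sketch and takes the page size as an explicit parameter.
 */
unsigned int alloc_order(size_t len, size_t page_size)
{
    unsigned int order = 0;
    size_t span = page_size;

    while (span < len) {
        span <<= 1;   /* each order doubles the contiguous span */
        order++;
    }
    return order;
}

/*
 * Hypothetical helper: number of fragments needed to cover len bytes when
 * each fragment is one allocation of 2^order contiguous pages.
 */
size_t frags_needed(size_t len, size_t page_size, unsigned int order)
{
    size_t frag = page_size << order;

    return (len + frag - 1) / frag;   /* round up */
}
```

With an assumed 4096-byte page, a 16384-byte buffer such as the E1000_RXBUFFER_16384 case needs one order-2 allocation, and an assumed 64 KiB receive buffer drops from 16 order-0 fragments to 4 order-2 fragments. Fewer fragments means fewer iterations in the receive path, while every order-2 allocation needs four physically contiguous pages, which is the fragmentation concern raised in the quoted reply.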
e1000_main.c line 3549: else if (max_frame <= E1000_RXBUFFER_16384) adapter->rx_buffer_len = E1000_RXBUFFER_16384; From vlad at lists.openfabrics.org Wed Oct 24 02:59:26 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 24 Oct 2007 02:59:26 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071024-0200 daily build status Message-ID: <20071024095926.E97D9E608AD@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on 
ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.22 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.23 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From Arkady.Kanevsky at netapp.com Wed Oct 24 05:33:58 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 24 Oct 2007 08:33:58 -0400 Subject: [ofa-general] Re: [Interop-wg] [RFP] support for iWARP requirement -active connect side MUST send first FPDU In-Reply-To: <19a929370710232139laa1d052ie89498a710ef4a85@mail.gmail.com> References: <471E4438.6080300@ichips.intel.com> <19a929370710232139laa1d052ie89498a710ef4a85@mail.gmail.com> Message-ID: Bill, we need to go wider than uDAPL. The right folks to ask are the UNH folks who run the interop event. Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 ________________________________ From: Bill Boas [mailto:bill.boas at gmail.com] Sent: Wednesday, October 24, 2007 12:40 AM To: Arlin Davis Cc: xwg at lists.openfabrics.org; ewg at lists.openfabrics.org; OpenFabrics General; interop-wg at lists.openfabrics.org Subject: [ofa-general] Re: [Interop-wg] [RFP] support for iWARP requirement -active connect side MUST send first FPDU All involved with iWARP and OFED contribution and maintenance: Before the developers' session in Reno about iWARP, could you, as a team working together, prepare a complete list of the iWARP technical issues that confront everyone, the differences you have with each other about the current state of the implementation, and the differences you have about resolving future issues you already have a sense of? Please publish that list before SC, with the options for resolution that have been put forward so far. Arlin, are you willing to take the lead on accomplishing this? Is that a reasonable request so we can all understand and help with the resolutions, if possible? Bill. On 10/23/07, Arlin Davis wrote: There has been much discussion on a private thread regarding bug #735 - "dapltest performance tests don't adhere to iWARP standard" that needs to move to the general list. iWARP has a requirement that the active side of the connection MUST be the first to send the first FPDU (SEND or RDMA operation). This presents a problem with applications written for uDAPL and OFA verbs given that there is no such restriction. So, short of requiring every OFA application/ULP to adhere to this restriction, we need the iWARP vendors to come up with a standard method to remove the restriction. Can someone come up with a solution, possibly in iWARP CM, that will work and ensure interoperability between iWARP devices?
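For readers following the thread, the restriction described above can be modelled as a small per-connection gate at the ULP level. This is only an illustrative sketch: the type and function names (conn_state, conn_may_transmit, conn_on_receive) are hypothetical and do not come from uDAPL, the iWARP CM, or any vendor driver.

```c
#include <stdbool.h>

enum conn_role { ROLE_ACTIVE, ROLE_PASSIVE };

/* Hypothetical per-connection state kept by the ULP. */
struct conn_state {
    enum conn_role role;
    bool first_fpdu_seen;   /* set once the active side's first message arrives */
};

/*
 * The active side may transmit at any time. The passive side must hold
 * back its SENDs and RDMA operations until it has received something.
 */
bool conn_may_transmit(const struct conn_state *c)
{
    return c->role == ROLE_ACTIVE || c->first_fpdu_seen;
}

/* Assumed to be called from the ULP's receive-completion path. */
void conn_on_receive(struct conn_state *c)
{
    c->first_fpdu_seen = true;
}
```

An application written this way adheres to the requirement on its own, at the cost of the per-application change that the thread is trying to avoid.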
-arlin _______________________________________________ Interop-wg mailing list Interop-wg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/interop-wg -------------- next part -------------- An HTML attachment was scrubbed... URL: From hrosenstock at xsigo.com Wed Oct 24 06:49:20 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 24 Oct 2007 06:49:20 -0700 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <20071015103918.GO12364@sashak.voltaire.com> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> Message-ID: <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> On Mon, 2007-10-15 at 12:39 +0200, Sasha Khapyorsky wrote: > On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote: > > >> Switches have the NodeDescription filled by FW, and it's usually the > > >> same string for all the switches. > > > It must not be same. Also I suppose that node description can be changed > > > at least for some managed switches even today. > > > > Come on, man... > > How many cluster administrators that you know will actually go and set > > NodeDescription on switches??? > > I know at least one asked for this. Perhaps switch_map can be used in conjunction with this like in the diags ? -- Hal > > I don't want to give user an easy way to make mistakes. > > If the user wants to include all the switches in the port group, there's an > > easy way to do it just by saying "node-type: SWITCH". > > If the user is so advanced that he wants to create port groups with a > > specific > > switches, it can be done by specifying guids. > > The same is true for CAs. So what is your point with "by name" > resolution then? > > > >> 3. 
If the admin would like to include num. ranges and asterisks in the > > >> port name, he has to make sure that the NodeDescription is created > > >> like it is created now by openibd. > > > Again, why this limitation is needed? What is wrong with wildcards like > > > "myname*", "hostname[1-3] *", etc.? > > > > In the policy file the user specifies *port* names, not *node* names. > > Sure, I meant only node's component here. Have it in 'node name' + 'port > number' form. What is easier? > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Wed Oct 24 07:22:14 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 24 Oct 2007 09:22:14 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471E8A6A.2030207@ichips.intel.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> Message-ID: <471F5516.2000801@opengridcomputing.com> Sean Hefty wrote: >> That is what I've been trying to push. Both MVAPICH2 and OMPI have >> been open to adjusting their transports to adhere to this requirement. >> >> I wouldn't mind implementing something to enforce this in the IWCM or >> the iWARP drivers IF there was a clean way to do it. So far there >> hasn't been a clean way proposed. > > Why can't either uDAPL or iW CM always do a send from the active to > passive side that gets stripped off? From the active side, the first > send is always posted before any user sends, and if necessary, a user > send can be queued by software to avoid a QP/CQ overrun. The completion > can simply be eaten by software. On the passive side, you have a > similar process for receiving the data. 
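The scheme in the quoted paragraph (an internal first send from the active side, one user send parked in software if the queue would otherwise overrun, and the internal completion eaten by software) might look roughly like the sketch below. It is a toy model under stated assumptions, not an implementation for uDAPL, the iw_cm, or any provider: posting to the QP is modelled by appending to an array, and HIDDEN_WR_ID, struct shim, and the shim_* functions are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HIDDEN_WR_ID UINT64_MAX   /* assumed reserved id for the internal send */
#define WIRE_CAP     16           /* capacity of the toy "wire" log */

struct shim {
    size_t   sq_depth;        /* send-queue slots on the QP */
    size_t   in_flight;       /* posted, not yet completed (hidden send included) */
    bool     have_held;       /* a user send is parked in software */
    uint64_t held;
    uint64_t wire[WIRE_CAP];  /* stand-in for posting to the real QP */
    size_t   nwire;
};

/* Model of a post-send call: record the work request id in posting order. */
static void shim_post(struct shim *s, uint64_t wr_id)
{
    if (s->nwire < WIRE_CAP)
        s->wire[s->nwire++] = wr_id;
}

/* Active side: the internal send is posted before any user send. */
void shim_connect(struct shim *s, size_t sq_depth)
{
    s->sq_depth = sq_depth;
    s->in_flight = 1;
    s->have_held = false;
    s->nwire = 0;
    shim_post(s, HIDDEN_WR_ID);
}

/* User send: post if a slot is free, otherwise park one send in software. */
int shim_post_send(struct shim *s, uint64_t wr_id)
{
    if (s->in_flight < s->sq_depth) {
        s->in_flight++;
        shim_post(s, wr_id);
        return 0;
    }
    if (s->have_held)
        return -1;            /* caller exceeded its own queue depth */
    s->held = wr_id;
    s->have_held = true;
    return 0;
}

/* Completion filter: returns true if the user should see this completion. */
bool shim_on_completion(struct shim *s, uint64_t wr_id)
{
    s->in_flight--;
    if (s->have_held) {       /* a slot just freed up: release the parked send */
        s->have_held = false;
        s->in_flight++;
        shim_post(s, s->held);
    }
    return wr_id != HIDDEN_WR_ID;   /* the internal completion is eaten */
}
```

The completion filter is the piece that has to sit in the poll path, which is why a change confined to the connection manager may not be enough. The passive side would need a matching step that consumes the internal message before handing receives to the user.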
> > (Yes this adds wire protocol, which requires both sides to support it.) > > - Sean I said "clean way to do it". ;-) Yes, this is the only "under the covers" solution I know of that will work with existing HW. However, I don't think it can be done totally within the rdmacm or iwcm. I think it involves providers "poll" function to deal with the send completion/error and the passive side recv completion/error. Steve. From tom at opengridcomputing.com Wed Oct 24 07:40:11 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 24 Oct 2007 09:40:11 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU In-Reply-To: <8A71B368A89016469F72CD08050AD33401B7BAF0@maui.asicdesigners.com> References: <471E4438.6080300@ichips.intel.com><471E5880.1030100@opengridcomputing.com><471E8A6A.2030207@ichips.intel.com><5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <8A71B368A89016469F72CD08050AD33401B7BAD1@maui.asicdesigners.com> <471ECAB6.9050606@opengridcomputing.com> <8A71B368A89016469F72CD08050AD33401B7BAF0@maui.asicdesigners.com> Message-ID: <471F594B.7000202@opengridcomputing.com> Felix Marti wrote: > >> -----Original Message----- >> From: Tom Tucker [mailto:tom at opengridcomputing.com] >> Sent: Tuesday, October 23, 2007 9:32 PM >> To: Felix Marti >> Cc: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise; Roland >> Dreier; interop-wg at lists.openfabrics.org; OpenFabrics General >> Subject: Re: [ofa-general] [RFP] support for iWARP requirement - >> activeconnectside MUST send first FPDU >> >> Felix Marti wrote: >> >>>> -----Original Message----- >>>> From: general-bounces at lists.openfabrics.org [mailto:general- >>>> bounces at lists.openfabrics.org] On Behalf Of Kanevsky, Arkady >>>> Sent: Tuesday, October 23, 2007 6:26 PM >>>> To: Glenn Grundstrom; Sean Hefty; Steve Wise >>>> Cc: Roland Dreier; interop-wg at lists.openfabrics.org; OpenFabrics >>>> General >>>> Subject: RE: [ofa-general] [RFP] support for 
iWARP requirement - >>>> activeconnectside MUST send first FPDU >>>> >>>> This is still a protocol and should be defined by IETF not OFA. >>>> But if we get agreement from all iWARP vendors this will be a good >>>> step. >>>> >>>> >>> [felix] This will not work with a Chelsio RNIC which follows the >>> > IETF > >>> specification by a) not issuing a 0B RDMA Write to the wire and b) >>> silently consuming an incoming 0B write. Therefore 0B RDMA Writes >>> >> cannot >> >>> be 'abused' for such a synchronization mechanism. I believe that the >>> mentioned apps adhering to the iWarp requirement do a 'send' from >>> > the > >>> active side and only have the passive side issue RDMA ops once the >>> incoming send has been received. I would guess that following a >>> >> similar >> >>> model is the best way to go and supported by all iWarp vendors >>> implementing the IETF spec. >>> >>> >>> >> IMO, the iWARP vendors _must_ get together and work on MPA '2'. >> Standardizing FPDU 'abuse' might be a good place to start, but it >> > needs > >> to be fixed to support peer-to-peer going forward. >> >> In the mean-time, imperfectly hiding the issue in the Firmware, uDAPL, >> the iWARP CM or anywhere else except the application seems to me to be >> the only customer friendly solution. >> > > [felix] While I'm not against trying to hide the connection migration > details somewhere below the ULP, I'm not convinced that the issue is as > severe as you make it to be and I would not press to have the issue > resolved in a matter that requires a new MPA version. In fact, the > different rdma transports (and maybe even different versions of the same > transport (in the case of IB)) provide different features and I would > assume that ULPs will eventually code to these features and must thus be > aware of the underlying transport protocol. In that bigger picture, the > connection migration issue at hand seems fairly trivial to solve even if > it requires an ULP change... 
> I didn't make an argument about severity. Qualifying the severity is in the customer's purview. I'm simply pointing out the following: a) the perspective that the restriction is trivial is how we got here, b) making the app change is putting a decision in the customer's hands that IMO an iWARP vendor would rather they didn't have to make "Do I or don't I support iWARP?", and c) you have the power to hide this behavior for most cases. Finally, I believe RFC means "Request for Comment". Well here's one last comment -- "Add an FPDU message at the end of MPA exchange and fix the problem in the protocol." > > >>>> If we can not get agreement on it on reflector lets do >>>> it at SC'07 OFA dev. conference. >>>> >>>> Arkady Kanevsky email: arkady at netapp.com >>>> Network Appliance Inc. phone: 781-768-5395 >>>> 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 >>>> Waltham, MA 02451 central phone: 781-768-5300 >>>> >>>> >>>> >>>> >>>>> -----Original Message----- >>>>> From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] >>>>> Sent: Tuesday, October 23, 2007 9:02 PM >>>>> To: Sean Hefty; Steve Wise >>>>> Cc: Roland Dreier; interop-wg at lists.openfabrics.org; >>>>> OpenFabrics General >>>>> Subject: RE: [ofa-general] [RFP] support for iWARP >>>>> requirement - activeconnect side MUST send first FPDU >>>>> >>>>> >>>>> >>>>>>> That is what I've been trying to push. Both MVAPICH2 and >>>>>>> >>>>>>> >>>>>> OMPI have been >>>>>> >>>>>> >>>>>>> open to adjusting their transports to adhere to this >>>>>>> >>>>>>> >>> requirement. >>> >>> >>>>>>> I wouldn't mind implementing something to enforce this in >>>>>>> >>>>>>> >>>>>> the IWCM or >>>>>> >>>>>> >>>>>>> the iWARP drivers IF there was a clean way to do it. So >>>>>>> >>>>>>> >>>>> far there >>>>> >>>>> >>>>>>> hasn't been a clean way proposed. >>>>>>> >>>>>>> >>>>>> Why can't either uDAPL or iW CM always do a send from the active >>>>>> >>>>>> >>> to >>> >>> >>>>>> passive side that gets stripped off? 
From the active side, >>>>>> >>>>>> >>>>> the first >>>>> >>>>> >>>>>> send is always posted before any user sends, and if >>>>>> >>>>>> >>>>> necessary, a user >>>>> >>>>> >>>>>> send can be queued by software to avoid a QP/CQ overrun. The >>>>>> completion can simply be eaten by software. On the passive >>>>>> >>>>>> >>>>> side, you >>>>> >>>>> >>>>>> have a similar process for receiving the data. >>>>>> >>>>>> >>>>> This is similar to an option in the NetEffect driver. A zero >>>>> byte RDMA write is sent from the active side and accounted >>>>> for on the passive side. This can be turned on and off by >>>>> compile and module options for compatibility. >>>>> >>>>> I second Sean's question - why can't uDAPL or the iw_cm do this? >>>>> >>>>> >>>>> >>>>>> (Yes this adds wire protocol, which requires both sides to >>>>>> > support > >>>>>> it.) >>>>>> >>>>>> - Sean >>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> general mailing list >>>>> general at lists.openfabrics.org >>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>> >>>>> To unsubscribe, please visit >>>>> http://openib.org/mailman/listinfo/openib-general >>>>> >>>>> >>>>> >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit >>>> >>>> >>> http://openib.org/mailman/listinfo/openib- >>> >>> >>>> general >>>> >>>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> >> http://openib.org/mailman/listinfo/openib-general >> From Arkady.Kanevsky at netapp.com Wed Oct 24 07:52:40 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 24 Oct 2007 10:52:40 -0400 Subject: [ofa-general] [RFP] support for iWARP 
requirement- activeconnectside MUST send first FPDU In-Reply-To: <471F594B.7000202@opengridcomputing.com> References: <471E4438.6080300@ichips.intel.com><471E5880.1030100@opengridcomputing.com><471E8A6A.2030207@ichips.intel.com><5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <8A71B368A89016469F72CD08050AD33401B7BAD1@maui.asicdesigners.com><471ECAB6.9050606@opengridcomputing.com><8A71B368A89016469F72CD08050AD33401B7BAF0@maui.asicdesigners.com> <471F594B.7000202@opengridcomputing.com> Message-ID: The bottom line is that we need a single solution which works for all vendors. This issue causes interoperability problems, so customers will stay on the sidelines until these types of issues are resolved. Hiding behind protocol holes is not going to help adoption. Will sending a 0-size send message from the initiator side work? Can the IWCM on the responder side squeeze in a 0-size buffer to receive this message and swallow it? I hope that no check needs to be done on all completions. Will it work for both interrupt and polling modes? I still believe that it will be simpler to add it to the MPA protocol. Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Tom Tucker [mailto:tom at opengridcomputing.com] > Sent: Wednesday, October 24, 2007 10:40 AM > To: Felix Marti > Cc: Kanevsky, Arkady; Roland Dreier; Glenn Grundstrom; > OpenFabrics General; interop-wg at lists.openfabrics.org > Subject: Re: [ofa-general] [RFP] support for iWARP > requirement- activeconnectside MUST send first FPDU > > Felix Marti wrote: > > > >> -----Original Message----- > >> From: Tom Tucker [mailto:tom at opengridcomputing.com] > >> Sent: Tuesday, October 23, 2007 9:32 PM > >> To: Felix Marti > >> Cc: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise; > >> Roland Dreier; interop-wg at lists.openfabrics.org; > OpenFabrics General > >> Subject: Re: [ofa-general] [RFP] support for iWARP requirement - > >> activeconnectside MUST send first FPDU > >> > >> Felix Marti wrote: > >> > >>>> -----Original Message----- > >>>> From: general-bounces at lists.openfabrics.org [mailto:general- > >>>> bounces at lists.openfabrics.org] On Behalf Of Kanevsky, Arkady > >>>> Sent: Tuesday, October 23, 2007 6:26 PM > >>>> To: Glenn Grundstrom; Sean Hefty; Steve Wise > >>>> Cc: Roland Dreier; interop-wg at lists.openfabrics.org; OpenFabrics > >>>> General > >>>> Subject: RE: [ofa-general] [RFP] support for iWARP requirement - > >>>> activeconnectside MUST send first FPDU > >>>> > >>>> This is still a protocol and should be defined by IETF not OFA. > >>>> But if we get agreement from all iWARP vendors this will > be a good > >>>> step. > >>>> > >>>> > >>> [felix] This will not work with a Chelsio RNIC which follows the > >>> > > IETF > > > >>> specification by a) not issuing a 0B RDMA Write to the > wire and b) > >>> silently consuming an incoming 0B write. Therefore 0B RDMA Writes > >>> > >> cannot > >> > >>> be 'abused' for such a synchronization mechanism. 
I > believe that the > >>> mentioned apps adhering to the iWarp requirement do a 'send' from > >>> > > the > > > >>> active side and only have the passive side issue RDMA ops > once the > >>> incoming send has been received. I would guess that following a > >>> > >> similar > >> > >>> model is the best way to go and supported by all iWarp vendors > >>> implementing the IETF spec. > >>> > >>> > >>> > >> IMO, the iWARP vendors _must_ get together and work on MPA '2'. > >> Standardizing FPDU 'abuse' might be a good place to start, but it > >> > > needs > > > >> to be fixed to support peer-to-peer going forward. > >> > >> In the mean-time, imperfectly hiding the issue in the Firmware, > >> uDAPL, the iWARP CM or anywhere else except the > application seems to > >> me to be the only customer friendly solution. > >> > > > > [felix] While I'm not against trying to hide the connection > migration > > details somewhere below the ULP, I'm not convinced that the > issue is > > as severe as you make it to be and I would not press to > have the issue > > resolved in a matter that requires a new MPA version. In fact, the > > different rdma transports (and maybe even different versions of the > > same transport (in the case of IB)) provide different > features and I > > would assume that ULPs will eventually code to these > features and must > > thus be aware of the underlying transport protocol. In that bigger > > picture, the connection migration issue at hand seems > fairly trivial > > to solve even if it requires an ULP change... > > > I didn't make an argument about severity. Qualifying the > severity is in the customer's purview. 
I'm simply pointing > out the following: a) the perspective that the restriction is > trivial is how we got here, b) making the app change is > putting a decision in the customer's hands that IMO an iWARP > vendor would rather they didn't have to make "Do I or don't I > support iWARP?", and c) you have the power to hide this > behavior for most cases. > > Finally, I believe RFC means "Request for Comment". Well > here's one last comment -- "Add an FPDU message at the end of > MPA exchange and fix the problem in the protocol." > > > > > > >>>> If we can not get agreement on it on reflector lets do > it at SC'07 > >>>> OFA dev. conference. > >>>> > >>>> Arkady Kanevsky email: arkady at netapp.com > >>>> Network Appliance Inc. phone: 781-768-5395 > >>>> 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > >>>> Waltham, MA 02451 central phone: 781-768-5300 > >>>> > >>>> > >>>> > >>>> > >>>>> -----Original Message----- > >>>>> From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] > >>>>> Sent: Tuesday, October 23, 2007 9:02 PM > >>>>> To: Sean Hefty; Steve Wise > >>>>> Cc: Roland Dreier; interop-wg at lists.openfabrics.org; > OpenFabrics > >>>>> General > >>>>> Subject: RE: [ofa-general] [RFP] support for iWARP > requirement - > >>>>> activeconnect side MUST send first FPDU > >>>>> > >>>>> > >>>>> > >>>>>>> That is what I've been trying to push. Both MVAPICH2 and > >>>>>>> > >>>>>>> > >>>>>> OMPI have been > >>>>>> > >>>>>> > >>>>>>> open to adjusting their transports to adhere to this > >>>>>>> > >>>>>>> > >>> requirement. > >>> > >>> > >>>>>>> I wouldn't mind implementing something to enforce this in > >>>>>>> > >>>>>>> > >>>>>> the IWCM or > >>>>>> > >>>>>> > >>>>>>> the iWARP drivers IF there was a clean way to do it. So > >>>>>>> > >>>>>>> > >>>>> far there > >>>>> > >>>>> > >>>>>>> hasn't been a clean way proposed. 
> >>>>>>> > >>>>>>> > >>>>>> Why can't either uDAPL or iW CM always do a send from > the active > >>>>>> > >>>>>> > >>> to > >>> > >>> > >>>>>> passive side that gets stripped off? From the active side, > >>>>>> > >>>>>> > >>>>> the first > >>>>> > >>>>> > >>>>>> send is always posted before any user sends, and if > >>>>>> > >>>>>> > >>>>> necessary, a user > >>>>> > >>>>> > >>>>>> send can be queued by software to avoid a QP/CQ overrun. The > >>>>>> completion can simply be eaten by software. On the passive > >>>>>> > >>>>>> > >>>>> side, you > >>>>> > >>>>> > >>>>>> have a similar process for receiving the data. > >>>>>> > >>>>>> > >>>>> This is similar to an option in the NetEffect driver. > A zero byte > >>>>> RDMA write is sent from the active side and accounted > for on the > >>>>> passive side. This can be turned on and off by compile > and module > >>>>> options for compatibility. > >>>>> > >>>>> I second Sean's question - why can't uDAPL or the iw_cm do this? > >>>>> > >>>>> > >>>>> > >>>>>> (Yes this adds wire protocol, which requires both sides to > >>>>>> > > support > > > >>>>>> it.) 
> >>>>>> > >>>>>> - Sean > >>>>> _______________________________________________ > >>>>> general mailing list > >>>>> general at lists.openfabrics.org > >>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>>> > >>>>> To unsubscribe, please visit > >>>>> http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Wed Oct 24 08:23:45 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 24 Oct 2007 10:23:45 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement- activeconnectside MUST send first FPDU In-Reply-To: References: <471E4438.6080300@ichips.intel.com><471E5880.1030100@opengridcomputing.com><471E8A6A.2030207@ichips.intel.com><5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <8A71B368A89016469F72CD08050AD33401B7BAD1@maui.asicdesigners.com><471ECAB6.9050606@opengridcomputing.com><8A71B368A89016469F72CD08050AD33401B7BAF0@maui.asicdesigners.com> <471F594B.7000202@opengridcomputing.com> Message-ID: <471F6381.2040303@opengridcomputing.com> Kanevsky, Arkady wrote: > The
bottom line is that we need a single solution which works for all vendors. > This issue causes interoperability problems, > so customers will stay on the sidelines until these types of issues are > resolved. > Hiding behind protocol holes is not going to help adoption. > > Will sending a 0-size send message from the initiator side work? > Can the IWCM on the responder side squeeze in a 0-size buffer to receive this message > and swallow it? I hope there is no check that needs to be done > on all completions. Will it work for both interrupt and polling mode? > > I still believe that it will be simpler to add it to the MPA protocol. Adding it to the MPA protocol will solve the problem for existing HW. From sashak at voltaire.com Wed Oct 24 08:39:57 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 24 Oct 2007 17:39:57 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071024153957.GR7088@sashak.voltaire.com> On 06:49 Wed 24 Oct , Hal Rosenstock wrote: > On Mon, 2007-10-15 at 12:39 +0200, Sasha Khapyorsky wrote: > > On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote: > > > >> Switches have the NodeDescription filled by FW, and it's usually the > > > >> same string for all the switches. > > > > It must not be same. Also I suppose that node description can be changed > > > > at least for some managed switches even today. > > > > > > Come on, man... > > > How many cluster administrators that you know will actually go and set > > > NodeDescription on switches??? > > > > I know at least one asked for this.
> > Perhaps switch_map can be used in conjunction with this like in the > diags ? Hmm, right, switch_map is another example of switch naming, which is useful with diags. Perhaps even more generic - guid to name map? And this will work instead of (or in addition to) node description when specified? Sasha From tom at opengridcomputing.com Wed Oct 24 08:29:54 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 24 Oct 2007 10:29:54 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement- activeconnectside MUST send first FPDU In-Reply-To: <471F6381.2040303@opengridcomputing.com> References: <471E4438.6080300@ichips.intel.com><471E5880.1030100@opengridcomputing.com><471E8A6A.2030207@ichips.intel.com><5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <8A71B368A89016469F72CD08050AD33401B7BAD1@maui.asicdesigners.com><471ECAB6.9050606@opengridcomputing.com><8A71B368A89016469F72CD08050AD33401B7BAF0@maui.asicdesigners.com> <471F594B.7000202@opengridcomputing.com> <471F6381.2040303@opengridcomputing.com> Message-ID: <471F64F2.6000708@opengridcomputing.com> Steve Wise wrote: > > > Kanevsky, Arkady wrote: >> The bottom line is that we need a single solution which works for all vendors. >> This issue causes interoperability problems, >> so customers will stay on the sidelines until these types of issues are >> resolved. >> Hiding behind protocol holes is not going to help adoption. >> >> Will sending a 0-size send message from the initiator side work? >> Can the IWCM on the responder side squeeze in a 0-size buffer to receive this message >> and swallow it? I hope there is no check that needs to be done >> on all completions. Will it work for both interrupt and polling mode? >> >> I still believe that it will be simpler to add it to the MPA protocol. > > Adding it to the MPA protocol will solve the problem for existing HW. I think Steve means "will not" solve the problem for existing HW.
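A rough sketch, in C, of the passive-side "swallow" step that Arkady asks about above. It is a model only: struct conn_state, struct rx_completion and deliver_to_ulp() are hypothetical names invented for illustration and are not part of the iw_cm, rdma_cm or libibverbs APIs. It assumes the active side always sends exactly one 0-byte message first, and that receive completions are processed in arrival order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state kept below the ULP (not a real iw_cm type). */
struct conn_state {
    bool sync_pending;  /* true until the first receive completion has been seen */
};

/* Hypothetical, simplified receive completion (not struct ibv_wc). */
struct rx_completion {
    uint32_t byte_len;  /* payload length of the received message */
    bool     ok;        /* completed without error */
};

/*
 * Decide whether a receive completion is passed up to the ULP.
 * Returns false only for the hidden 0-byte message that opens the
 * connection; every later completion, and any error, is delivered.
 */
static bool deliver_to_ulp(struct conn_state *c, const struct rx_completion *wc)
{
    if (c->sync_pending) {
        c->sync_pending = false;
        if (wc->ok && wc->byte_len == 0)
            return false;   /* hidden sync message: eat it */
    }
    return true;
}
```

Under these assumptions the cost on the completion path is one flag test per receive completion, and because the filter runs wherever completions are processed it would behave the same whether the consumer polls or is driven by completion events. A real implementation would also need a receive buffer posted for the hidden message.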
From sashak at voltaire.com Wed Oct 24 08:51:14 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 24 Oct 2007 17:51:14 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] libibumad/umad_poll.3: Fix man page return value description In-Reply-To: <1193197199.22038.35.camel@hrosenstock-ws.xsigo.com> References: <1193197199.22038.35.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071024155114.GS7088@sashak.voltaire.com> On 20:39 Tue 23 Oct , Hal Rosenstock wrote: > libibumad/umad_poll.3: Fix man page return value description > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From mshefty at ichips.intel.com Wed Oct 24 08:48:34 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 24 Oct 2007 08:48:34 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471F5516.2000801@opengridcomputing.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> <471F5516.2000801@opengridcomputing.com> Message-ID: <471F6952.9040404@ichips.intel.com> > I said "clean way to do it". ;-) I'm referring to an rdma cm connection protocol for iWarp. We have one for IB. I mentioned uDAPL as a possibility because it abstracts the transport, QP, CQ, etc. anyway, and one could argue that the uDAPL iWarp provider should take necessary steps to support the uDAPL API. I don't know that there's a need to change the iWarp architecture. > Yes, this is the only "under the covers" solution I know of that will > work with existing HW. However, I don't think it can be done totally > within the rdmacm or iwcm. I think it involves providers "poll" > function to deal with the send completion/error and the passive side > recv completion/error. This does present more of an issue for an rdma cm solution. IB defines a communication established event that might be of use to pass information between the provider and the rdma cm. 
But the provider would need to participate in the connection protocol. - Sean From swise at opengridcomputing.com Wed Oct 24 08:49:47 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 24 Oct 2007 10:49:47 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement- activeconnectside MUST send first FPDU In-Reply-To: <471F64F2.6000708@opengridcomputing.com> References: <471E4438.6080300@ichips.intel.com><471E5880.1030100@opengridcomputing.com><471E8A6A.2030207@ichips.intel.com><5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <8A71B368A89016469F72CD08050AD33401B7BAD1@maui.asicdesigners.com><471ECAB6.9050606@opengridcomputing.com><8A71B368A89016469F72CD08050AD33401B7BAF0@maui.asicdesigners.com> <471F594B.7000202@opengridcomputing.com> <471F6381.2040303@opengridcomputing.com> <471F64F2.6000708@opengridcomputing.com> Message-ID: <471F699B.8030400@opengridcomputing.com> Tom Tucker wrote: > Steve Wise wrote: >> >> >> Kanevsky, Arkady wrote: >>> The bottom line is that we need a single solution which works for all vendors. >>> This issue causes interoperability problems, >>> so customers will stay on the sidelines until these types of issues are >>> resolved. >>> Hiding behind protocol holes is not going to help adoption. >>> >>> Will sending a 0-size send message from the initiator side work? >>> Can the IWCM on the responder side squeeze in a 0-size buffer to receive this message >>> and swallow it? I hope there is no check that needs to be done >>> on all completions. Will it work for both interrupt and polling mode? >>> >>> I still believe that it will be simpler to add it to the MPA protocol. >> >> Adding it to the MPA protocol will solve the problem for existing HW. > I think Steve means "will not" solve the problem for existing HW. Oops, yes: "will not". Sorry.
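For the active side, the scheme Sean outlined earlier in the thread (the hidden send is posted before any user send, a user send may be queued in software to avoid overrunning the QP, and the hidden send's completion is eaten) can be modelled roughly as below. This is a sketch under stated assumptions, not driver code: struct active_sq, enum post_result and the sq_* helpers are hypothetical, the send queue is reduced to counters, and send completions are assumed to arrive in posting order.

```c
#include <stdbool.h>

/* Hypothetical model of an active-side send queue; not a verbs or iw_cm type. */
struct active_sq {
    int  depth;      /* send slots the ULP asked for at QP creation */
    int  in_flight;  /* work requests currently posted, hidden send included */
    bool deferred;   /* one user send is being held back in software */
    bool sync_done;  /* the hidden 0-byte send has completed */
};

enum post_result { POSTED, DEFERRED, QUEUE_FULL };

/* Connection established: the hidden send goes out before any user send. */
static void sq_connected(struct active_sq *sq, int depth)
{
    sq->depth = depth;
    sq->in_flight = 1;      /* the hidden 0-byte send occupies one slot */
    sq->deferred = false;
    sq->sync_done = false;
}

/* ULP posts a send. */
static enum post_result sq_post_user_send(struct active_sq *sq)
{
    if (sq->in_flight < sq->depth) {
        sq->in_flight++;
        return POSTED;
    }
    /* The only overflow absorbed is the single slot taken by the hidden send. */
    if (!sq->sync_done && !sq->deferred) {
        sq->deferred = true;
        return DEFERRED;
    }
    return QUEUE_FULL;
}

/* A send completion arrived. Returns true if the ULP should see it. */
static bool sq_send_completed(struct active_sq *sq)
{
    bool is_user = sq->sync_done;  /* the first completion is the hidden send */

    sq->sync_done = true;
    sq->in_flight--;
    if (sq->deferred) {            /* the freed slot goes to the held-back send */
        sq->deferred = false;
        sq->in_flight++;
    }
    return is_user;
}
```

The point of the model is that the ULP still sees exactly the queue depth it asked for: the one slot borrowed by the hidden send is covered by deferring at most one user send until that first completion has been consumed.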
From hrosenstock at xsigo.com Wed Oct 24 08:56:36 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 24 Oct 2007 08:56:36 -0700 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <20071024153957.GR7088@sashak.voltaire.com> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> <20071024153957.GR7088@sashak.voltaire.com> Message-ID: <1193241396.22038.100.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-24 at 17:39 +0200, Sasha Khapyorsky wrote: > On 06:49 Wed 24 Oct , Hal Rosenstock wrote: > > On Mon, 2007-10-15 at 12:39 +0200, Sasha Khapyorsky wrote: > > > On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote: > > > > >> Switches have the NodeDescription filled by FW, and it's usually the > > > > >> same string for all the switches. > > > > > It must not be same. Also I suppose that node description can be changed > > > > > at least for some managed switches even today. > > > > > > > > Come on, man... > > > > How many cluster administrators that you know will actually go and set > > > > NodeDescription on switches??? > > > > > > I know at least one asked for this. > > > > Perhaps switch_map can be used in conjunction with this like in the > > diags ? > > Hmm, right, switch_map is another example of switch naming, which is > useful with diags. > Perhaps even more generic - guid to name map? And this will work instead > of (or in addition to) node description when specified? Right; but is it really needed for HCAs where the NodeDescription can be easily changed ? Not sure if this applies to routers too or not. The downside of this is the amount of configuration. 
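The "guid to name map, with NodeDescription as the fallback" idea discussed above could look roughly like this in C. It is only a sketch: the one-entry-per-line format `0x<guid> "<name>"` with `#` comments is an assumption made for the example, and struct name_map_entry, parse_map_line() and node_name() are hypothetical helpers, not OpenSM or diags functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAP_NAME_LEN 64

/* Hypothetical map entry: one node GUID and the name an admin chose for it. */
struct name_map_entry {
    uint64_t guid;
    char     name[MAP_NAME_LEN];
};

/*
 * Parse one line of the assumed form:  0x0008f10400411f56 "edge-switch-1"
 * Returns 1 and fills *e on success; returns 0 for blank lines, comments
 * and malformed lines, leaving *e untouched.
 */
static int parse_map_line(const char *line, struct name_map_entry *e)
{
    unsigned long long guid;
    char name[MAP_NAME_LEN];

    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '#' || *line == '\n' || *line == '\0')
        return 0;
    if (sscanf(line, "%llx \"%63[^\"]\"", &guid, name) != 2)
        return 0;

    e->guid = guid;
    strcpy(e->name, name);  /* name holds at most 63 characters plus NUL */
    return 1;
}

/* Prefer the mapped name; fall back to the node's own NodeDescription. */
static const char *node_name(const struct name_map_entry *map, size_t n,
                             uint64_t guid, const char *node_desc)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].guid == guid)
            return map[i].name;
    return node_desc;
}
```

With a lookup of this shape, only the nodes whose firmware-supplied description is unhelpful (typically switches) need an entry, which keeps the amount of configuration Hal worries about small.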
> Sasha From jackm at dev.mellanox.co.il Wed Oct 24 09:51:02 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 24 Oct 2007 18:51:02 +0200 Subject: [ofa-general] [PATCH 1 of 5] libmlx4: get needed device params via query_device when creating a ucontext Message-ID: <200710241851.02865.jackm@dev.mellanox.co.il> When creating a new user context, query device for various limits, for use in sanity checks and other resource limitation needs. Passing needed info back to userspace in this manner is preferable to breaking the ABI. Signed-off-by: Jack Morgenstein --- Roland, I use max_qp_wr and max_sge in the second patch in this series, first to check that the qp capabilities do not exceed the qp limits which are reported by ibv_query_device; and then to adjust the qp capabilities returned by the kernel, so that the qp capabilities returned to the caller do not exceed the limits which are obtained via ibv_query_device. I use max_cqe in the third patch as a create_cq sanity check on the number of CQEs requested. Performing the check in this way avoids the need for an ABI increment.
- Jack diff --git a/src/mlx4.c b/src/mlx4.c index 95902cd..4a22e74 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -109,6 +109,7 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_ struct ibv_get_context cmd; struct mlx4_alloc_ucontext_resp resp; int i; + struct ibv_device_attr dev_attrs; context = malloc(sizeof *context); if (!context) @@ -170,8 +171,20 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_ context->ibv_ctx.ops = mlx4_ctx_ops; + if (mlx4_query_device(&context->ibv_ctx, &dev_attrs)) + goto query_free; + + context->max_qp_wr = dev_attrs.max_qp_wr; + context->max_sge = dev_attrs.max_sge; + context->max_cqe = dev_attrs.max_cqe; + return &context->ibv_ctx; +query_free: + munmap(context->uar, to_mdev(ibdev)->page_size); + if (context->bf_page) + munmap(context->bf_page, to_mdev(ibdev)->page_size); + err_free: free(context); return NULL; diff --git a/src/mlx4.h b/src/mlx4.h index deb0f55..09e2bdd 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -83,6 +83,20 @@ #define PFX "mlx4: " +#ifndef max +#define max(a,b) \ + ({ typeof (a) _a = (a); \ + typeof (b) _b = (b); \ + _a > _b ? _a : _b; }) +#endif + +#ifndef min +#define min(a,b) \ + ({ typeof (a) _a = (a); \ + typeof (b) _b = (b); \ + _a < _b ? _a : _b; }) +#endif + enum { MLX4_CQ_ENTRY_SIZE = 0x20 }; @@ -166,6 +180,9 @@ struct mlx4_context { int num_qps; int qp_table_shift; int qp_table_mask; + int max_qp_wr; + int max_sge; + int max_cqe; struct { struct mlx4_srq **table; From jackm at dev.mellanox.co.il Wed Oct 24 09:53:01 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 24 Oct 2007 18:53:01 +0200 Subject: [ofa-general] [PATCH 2 of 5] libmlx4: limit qp resources accepted for create_qp per query_device values Message-ID: <200710241853.01858.jackm@dev.mellanox.co.il> Limit qp resources accepted for ibv_create_qp() to the limits reported in ib_query_device(). 
Make sure that the limits returned to the caller following qp creation also lie within the reported device limits. Signed-off-by: Jack Morgenstein diff --git a/src/qp.c b/src/qp.c index da5d2ed..61f1c9b 100644 --- a/src/qp.c +++ b/src/qp.c @@ -607,6 +607,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap, enum ibv_qp_type type) { int wqe_size; + struct mlx4_context *ctx = to_mctx(qp->ibv_qp.context); wqe_size = (1 << qp->sq.wqe_shift) - sizeof (struct mlx4_wqe_ctrl_seg); switch (type) { @@ -624,8 +625,9 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap, } qp->sq.max_gs = wqe_size / sizeof (struct mlx4_wqe_data_seg); - cap->max_send_sge = qp->sq.max_gs; - qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; + cap->max_send_sge = min(ctx->max_sge, qp->sq.max_gs); + qp->sq.max_post = min(ctx->max_qp_wr, + qp->sq.wqe_cnt - qp->sq_spare_wqes); cap->max_send_wr = qp->sq.max_post; /* diff --git a/src/verbs.c b/src/verbs.c index ff89dd0..059b534 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -359,12 +359,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) struct ibv_create_qp_resp resp; struct mlx4_qp *qp; int ret; + struct mlx4_context *context = to_mctx(pd->context); + /* Sanity check QP size before proceeding */ - if (attr->cap.max_send_wr > 65536 || - attr->cap.max_recv_wr > 65536 || - attr->cap.max_send_sge > 64 || - attr->cap.max_recv_sge > 64 || + if (attr->cap.max_send_wr > context->max_qp_wr || + attr->cap.max_recv_wr > context->max_qp_wr || + attr->cap.max_send_sge > context->max_sge || + attr->cap.max_recv_sge > context->max_sge || attr->cap.max_inline_data > 1024) return NULL; @@ -430,8 +432,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) if (ret) goto err_destroy; - qp->rq.wqe_cnt = qp->rq.max_post = attr->cap.max_recv_wr; + qp->rq.wqe_cnt = attr->cap.max_recv_wr; qp->rq.max_gs = attr->cap.max_recv_sge; + + /* adjust rq maxima to not exceed reported 
device maxima */ + attr->cap.max_recv_wr = min(context->max_qp_wr, attr->cap.max_recv_wr); + attr->cap.max_recv_sge = min(context->max_sge, attr->cap.max_recv_sge); + + qp->rq.max_post = attr->cap.max_recv_wr; mlx4_set_sq_sizes(qp, &attr->cap, attr->qp_type); qp->doorbell_qpn = htonl(qp->ibv_qp.qp_num << 8); From jackm at dev.mellanox.co.il Wed Oct 24 09:54:09 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 24 Oct 2007 18:54:09 +0200 Subject: [ofa-general] [PATCH 3 OF 5] libmlx4: avoid adding unneeded extra CQE when creating a cq Message-ID: <200710241854.09715.jackm@dev.mellanox.co.il> Do not add an extra CQE when creating a CQ. Sanity-check against returned device capabilities, to avoid breaking ABI. Set minimum to 2, to avoid rejection by kernel. Signed-off-by: Jack Morgenstein diff --git a/src/verbs.c b/src/verbs.c index 059b534..4e92ec7 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -168,11 +168,15 @@ struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, struct mlx4_create_cq_resp resp; struct mlx4_cq *cq; int ret; + struct mlx4_context *mctx = to_mctx(context); /* Sanity check CQ size before proceeding */ - if (cqe > 0x3fffff) + if (cqe < 1 || cqe > mctx->max_cqe) return NULL; + /* raise minimum, to avoid breaking ABI */ + cqe = (cqe == 1) ? 2 : cqe; + cq = malloc(sizeof *cq); if (!cq) return NULL; @@ -182,7 +186,7 @@ struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, if (pthread_spin_init(&cq->lock, PTHREAD_PROCESS_PRIVATE)) goto err; - cqe = align_queue_size(cqe + 1); + cqe = align_queue_size(cqe); if (mlx4_alloc_buf(&cq->buf, cqe * MLX4_CQ_ENTRY_SIZE, to_mdev(context->device)->page_size)) From yangdong at ncic.ac.cn Wed Oct 24 09:50:15 2007 From: yangdong at ncic.ac.cn (yangdong) Date: Thu, 25 Oct 2007 00:50:15 +0800 Subject: [ofa-general] makefile problem in using librdmacm and libverbs Message-ID: <471F77C7.6010602@ncic.ac.cn> I have some troubles when i was use libverbs and rdmacm. 
I wrote a Makefile for my amp subsystem; it is shown below. The key line is LDFLAGS = -lm -lpthread -libverbs -lrdmacm, which follows the Makefile in perftest and differs from the Makefile in librdmacm (~/OFED-1.2/SOURCES/ofa_user-1.2/src/userspace/librdmacm). With this setup I hit a problem: after calling rdma_create_event_channel, rdma_create_id, rdma_resolve_addr, rdma_resolve_route, ibv_alloc_pd and ibv_create_cq, a segmentation fault occurs when I invoke ibv_req_notify_cq, because cm_id->verbs->ops.req_notify_cq does not exist. Can anyone advise me how to write my Makefile so that it uses librdmacm and libibverbs correctly? Thanks a lot!

Makefile in amp

CC = cc
RM = rm -f
AR = ar rvs
MV = mv -f
RANLIB = ranlib
TOP_DIR = /home/yd/OFED-1.2/SOURCES/ofa_user-1.2/src/userspace
INC_VERBS = ${TOP_DIR}/libibverbs/include/
INC_RDMACM = ${TOP_DIR}/librdmacm/include/
INC_THIS = ./
CFLAGS = -DHAVE_CONFIG_H -I../../include/ -I${INC_VERBS} -I${INC_RDMACM} -I${INC_THIS} -Wall -g -D_GNU_SOURCE -O2 -D__RDMA__
LDFLAGS = -lm -lpthread -libverbs -lrdmacm
LIBPATH = ../../lib/
OBJS = amp_interface.o amp_conn.o amp_utcp.o amp_uopenib.o amp_protos.o amp_request.o \
	amp_uthread.o amp_help.o
LIB = libamp.a

.c.o:
	${CC} ${CFLAGS} ${EXTRA_CFLAGS} -c $*.c
lib: ${OBJS}
	${AR} ${LIB} ${OBJS}
	${RANLIB} ${LIB}
	${MV} ${LIB} ${LIBPATH}
clean:
	${RM} *.o core ~* *.cpp
~

Makefile in perftest:

TESTS = write_bw_postlist rdma_lat rdma_bw send_lat send_bw write_lat write_bw read_lat read_bw
UTILS = clock_test

all: ${TESTS} ${UTILS}

CFLAGS += -Wall -g -D_GNU_SOURCE -O2
EXTRA_FILES = get_clock.c
EXTRA_HEADERS = get_clock.h
#The following seems to help GNU make on some platforms
LOADLIBES +=
LDFLAGS +=

${TESTS}: LOADLIBES += -libverbs -lrdmacm

${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS}
	$(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o ib_$@
clean:
	$(foreach fname,${TESTS}, rm -f ib_${fname})
	rm -f ${UTILS}
.DELETE_ON_ERROR:
.PHONY: all
clean ~ From jackm at dev.mellanox.co.il Wed Oct 24 09:56:22 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 24 Oct 2007 18:56:22 +0200 Subject: [ofa-general] [PATCH 4 of 5] mlx4: limit qp resources accepted for create_qp per query_device values and headroom requirements Message-ID: <200710241856.23129.jackm@dev.mellanox.co.il> mlx4: limit allowable qp create resources to avoid create_qp failures due to added headroom wqes. In addition, guarantee that qp capabilities following qp creation always lie within limits given by ib_query_device. (for userspace, we perform this limiting in libmlx4, so as not to break the ABI). Signed-off-by: Jack Morgenstein diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index d8287d9..d40ec2f 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -109,7 +109,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev, props->max_mr_size = ~0ull; props->page_size_cap = dev->dev->caps.page_size_cap; props->max_qp = dev->dev->caps.num_qps - dev->dev->caps.reserved_qps; - props->max_qp_wr = dev->dev->caps.max_wqes; + props->max_qp_wr = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE; props->max_sge = min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg); props->max_cq = dev->dev->caps.num_cqs - dev->dev->caps.reserved_cqs; diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 2869765..56305e2 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -47,6 +47,13 @@ enum { MLX4_IB_DB_PER_PAGE = PAGE_SIZE / 4 }; +enum { + MLX4_IB_SQ_MIN_WQE_SHIFT = 6 +}; + +#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1) +#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT)) + struct mlx4_ib_db_pgdir; struct mlx4_ib_user_db_page; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 6b33224..d6c1600 100644 --- 
a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -212,8 +212,9 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, int is_user, int has_srq, struct mlx4_ib_qp *qp) { /* Sanity check RQ size before proceeding */ - if (cap->max_recv_wr > dev->dev->caps.max_wqes || - cap->max_recv_sge > dev->dev->caps.max_rq_sg) + if (cap->max_recv_wr > dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE || + cap->max_recv_sge > + min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg)) return -EINVAL; if (has_srq) { @@ -232,8 +233,19 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->rq.wqe_shift = ilog2(qp->rq.max_gs * sizeof (struct mlx4_wqe_data_seg)); } - cap->max_recv_wr = qp->rq.max_post = qp->rq.wqe_cnt; - cap->max_recv_sge = qp->rq.max_gs; + /* leave userspace return values as they were, so as not to break ABI */ + if (is_user) { + cap->max_recv_wr = qp->rq.max_post = qp->rq.wqe_cnt; + cap->max_recv_sge = qp->rq.max_gs; + } else { + cap->max_recv_wr = qp->rq.max_post = + min(dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE, qp->rq.wqe_cnt); + cap->max_recv_sge = min(qp->rq.max_gs, + min(dev->dev->caps.max_sq_sg, + dev->dev->caps.max_rq_sg)); + } + /* We don't support inline sends for kernel QPs (yet) */ + return 0; } @@ -242,8 +254,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, enum ib_qp_type type, struct mlx4_ib_qp *qp) { /* Sanity check SQ size before proceeding */ - if (cap->max_send_wr > dev->dev->caps.max_wqes || - cap->max_send_sge > dev->dev->caps.max_sq_sg || + if (cap->max_send_wr > (dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE) || + cap->max_send_sge > + min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg) || cap->max_inline_data + send_wqe_overhead(type) + sizeof (struct mlx4_wqe_inline_seg) > dev->dev->caps.max_sq_desc_sz) return -EINVAL; @@ -261,6 +274,7 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_inline_data 
+ sizeof (struct mlx4_wqe_inline_seg)) + send_wqe_overhead(type))); + qp->sq.wqe_shift = max(MLX4_IB_SQ_MIN_WQE_SHIFT, qp->sq.wqe_shift); qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) / sizeof (struct mlx4_wqe_data_seg); @@ -268,7 +282,7 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, * We need to leave 2 KB + 1 WQE of headroom in the SQ to * allow HW to prefetch. */ - qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1; + qp->sq_spare_wqes = MLX4_IB_SQ_HEADROOM(qp->sq.wqe_shift); qp->sq.wqe_cnt = roundup_pow_of_two(cap->max_send_wr + qp->sq_spare_wqes); qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) + @@ -281,8 +295,12 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.offset = 0; } - cap->max_send_wr = qp->sq.max_post = qp->sq.wqe_cnt - qp->sq_spare_wqes; - cap->max_send_sge = qp->sq.max_gs; + cap->max_send_wr = qp->sq.max_post = + min(qp->sq.wqe_cnt - qp->sq_spare_wqes, + dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE); + cap->max_send_sge =min(qp->sq.max_gs, + min(dev->dev->caps.max_sq_sg, + dev->dev->caps.max_rq_sg)); /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; From jackm at dev.mellanox.co.il Wed Oct 24 09:58:45 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 24 Oct 2007 18:58:45 +0200 Subject: [ofa-general] [PATCH 5 of 5] mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ Message-ID: <200710241858.45305.jackm@dev.mellanox.co.il> mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ. The extra CQE can cause a huge waste of memory if requesting a power-of-2 number of CQEs. Leave create_cq for userspace CQs as before, to avoid breaking ABI. 
(Handle this in separate libmlx4 patch) Signed-off-by: Jack Morgenstein diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 8bf44da..8a1ccc4 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -108,7 +108,13 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector if (!cq) return ERR_PTR(-ENOMEM); - entries = roundup_pow_of_two(entries + 1); + /* eliminate using extra CQE (for kernel space). + * For userspace, do in libmlx4, so that don't break ABI. + */ + if (context) + entries = roundup_pow_of_two(entries + 1); + else + entries = roundup_pow_of_two(entries); cq->ibcq.cqe = entries - 1; buf_size = entries * sizeof (struct mlx4_cqe); spin_lock_init(&cq->lock); diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 89b3f0b..d34b61b 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -141,12 +141,7 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) dev->caps.max_sq_desc_sz = dev_cap->max_sq_desc_sz; dev->caps.max_rq_desc_sz = dev_cap->max_rq_desc_sz; dev->caps.num_qp_per_mgm = MLX4_QP_PER_MGM; - /* - * Subtract 1 from the limit because we need to allocate a - * spare CQE so the HCA HW can tell the difference between an - * empty CQ and a full CQ. 
- */ - dev->caps.max_cqes = dev_cap->max_cq_sz - 1; + dev->caps.max_cqes = dev_cap->max_cq_sz; dev->caps.reserved_cqs = dev_cap->reserved_cqs; dev->caps.reserved_eqs = dev_cap->reserved_eqs; dev->caps.reserved_mtts = DIV_ROUND_UP(dev_cap->reserved_mtts, From mshefty at ichips.intel.com Wed Oct 24 09:59:55 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 24 Oct 2007 09:59:55 -0700 Subject: [ofa-general] makefile problem in using librdmacm and libverbs In-Reply-To: <471F77C7.6010602@ncic.ac.cn> References: <471F77C7.6010602@ncic.ac.cn> Message-ID: <471F7A0B.4080604@ichips.intel.com> > INC_VERBS = ${TOP_DIR}/libibverbs/include/ > INC_RDMACM = ${TOP_DIR}/librdmacm/include/ Do these files match what's in /usr/local/include/infiniband and /usr/local/include/rdma? (Or the equivalent install directory.) You could try picking up the installed include files, rather than going directly into the source directory. - Sean From ardavis at ichips.intel.com Wed Oct 24 10:12:37 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 24 Oct 2007 10:12:37 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471F6952.9040404@ichips.intel.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> <471F5516.2000801@opengridcomputing.com> <471F6952.9040404@ichips.intel.com> Message-ID: <471F7D05.80804@ichips.intel.com> Sean Hefty wrote: >> I said "clean way to do it". ;-) > > I'm referring to an rdma cm connection protocol for iWarp. We have one > for IB. I mentioned uDAPL as a possibility because it abstracts the > transport, QP, CQ, etc. anyway, and one could argue that the uDAPL iWarp > provider should take necessary steps to support the uDAPL API. There is one OpenFabrics uDAPL provider for all OFA devices. 
Sure, we could add some logic in the DAPL abstraction layer to check for iWARP devices and possibly hide the restriction. Say we do that, what about the applications that sit directly on top of OFA verbs and rdma_cm? Say we add some iWARP abstraction at this layer, what about the WinOF stack? > > I don't know that there's a need to change the iWarp architecture. If you think customers are willing to work around this restriction then by all means leave the architecture alone and simply document the rdma APIs. I would think that this puts iWARP vendors at a disadvantage. I am guessing that energy and time spent changing the iWARP protocol specification is a better use of everyone's time than hacking every iWARP stack out there to hide the restriction. -arlin From sean.hefty at intel.com Wed Oct 24 11:04:04 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 24 Oct 2007 11:04:04 -0700 Subject: [ofa-general] [RFC] upstream IB router support Message-ID: <000001c81668$43d9dba0$73cc180a@amr.corp.intel.com> A while ago, some support was added to the rdma stack to support IB routers in a very limited fashion (done as part of PathForward). The relevant patches are available at: git://git.openfabrics.org/~shefty/rdma-dev.git ib_router I wanted to gauge interest in merging these changes upstream for 2.6.25. I know there is growing interest in using IB routers. Obsidian has a router, and both Mellanox and QLogic adapters can be used to construct host routers. The main disadvantage to merging the patches is that they slightly violate the IB CM protocol by sending invalid data in the CM REQ. The patches can be optionally compiled in as experimental if needed.
- Sean From krause at cup.hp.com Wed Oct 24 10:58:01 2007 From: krause at cup.hp.com (Michael Krause) Date: Wed, 24 Oct 2007 10:58:01 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU In-Reply-To: References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> <5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> Message-ID: <6.2.0.14.2.20071024105353.03090e10@esmail.cup.hp.com> The proper action is to propose a new MPA specification to the IETF - it isn't an OFA decision to make. MPA within the IETF was a tough fight to get into existence. This particular issue was raised and the outcome from that debate is what is in the 1.0 specification (it is a standard if I recall not a draft). Fine to argue here but action and specification work must be brought up in the IETF RDDP workgroup and likely to be vetted as well by the TSVWG and Transport AD (both weighed in quite a bit during MPA's creation). If the IETF approves a new draft, then OFA can develop the associated software. But there may be multiple software stacks to deal with legacy hardware / drivers so the problem isn't just fixed by providing a new MPA specification. People are using iWARP today that is compliant with today's MPA specification. Mike At 06:25 PM 10/23/2007, Kanevsky, Arkady wrote: >This is still a protocol and should be defined by IETF not OFA. >But if we get agreement from all iWARP vendors this will be a good step. >If we can not get agreement on it on reflector lets do >it at SC'07 OFA dev. conference. > >Arkady Kanevsky email: arkady at netapp.com >Network Appliance Inc. phone: 781-768-5395 >1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 >Waltham, MA 02451 central phone: 781-768-5300 > > > > -----Original Message----- > > From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] > > Sent: Tuesday, October 23, 2007 9:02 PM > > To: Sean Hefty; Steve Wise > > Cc: Roland Dreier; interop-wg at lists.openfabrics.org; > > OpenFabrics General > > Subject: RE: [ofa-general] [RFP] support for iWARP > > requirement - activeconnect side MUST send first FPDU > > > > > > That is what I've been trying to push. Both MVAPICH2 and > > > OMPI have been > > > > open to adjusting their transports to adhere to this requirement. > > > > > > > > I wouldn't mind implementing something to enforce this in > > > the IWCM or > > > > the iWARP drivers IF there was a clean way to do it. So > > far there > > > > hasn't been a clean way proposed. > > > > > > Why can't either uDAPL or iW CM always do a send from the active to > > > passive side that gets stripped off? From the active side, > > the first > > > send is always posted before any user sends, and if > > necessary, a user > > > send can be queued by software to avoid a QP/CQ overrun. The > > > completion can simply be eaten by software. On the passive > > side, you > > > have a similar process for receiving the data. > > > > This is similar to an option in the NetEffect driver. A zero > > byte RDMA write is sent from the active side and accounted > > for on the passive side. This can be turned on and off by > > compile and module options for compatibility. > > > > I second Sean's question - why can't uDAPL or the iw_cm do this? > > > > > > > > (Yes this adds wire protocol, which requires both sides to support > > > it.) 
> > > > > > - Sean > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general
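The workaround discussed in this thread (the active side always posts one internal message before any user send, user sends are queued in software if the send queue is full, and the internal completion is swallowed) can be sketched as below. This is only an illustration under stated assumptions, not code from uDAPL, the iw_cm, or any driver: `struct conn`, `hw_post()`, `HIDDEN_WR_ID`, and `MAX_PENDING` are all hypothetical names, and the "hardware" is simulated by a simple occupancy counter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HIDDEN_WR_ID UINT64_MAX   /* hypothetical marker for the internal first send;
                                     assumes users never pass this wr_id themselves */
#define MAX_PENDING  16           /* hypothetical depth of the software queue */

struct conn {
    unsigned sq_depth;              /* send queue slots available on the QP */
    unsigned in_flight;             /* work requests currently on the simulated QP */
    uint64_t pending[MAX_PENDING];  /* user sends deferred by software */
    unsigned pending_head;
    unsigned pending_cnt;
    bool active_side;
};

/* Stand-in for posting to a real QP; it only tracks queue occupancy. */
static bool hw_post(struct conn *c, uint64_t wr_id)
{
    (void)wr_id;
    if (c->in_flight == c->sq_depth)
        return false;
    c->in_flight++;
    return true;
}

/* Called once when the connection is established. */
static void conn_established(struct conn *c, bool active, unsigned sq_depth)
{
    c->sq_depth = sq_depth;
    c->in_flight = 0;
    c->pending_head = 0;
    c->pending_cnt = 0;
    c->active_side = active;
    if (active)
        hw_post(c, HIDDEN_WR_ID);   /* the first message always leaves the active side */
}

/* User-visible post: 0 on success, -1 if the software queue is also full. */
static int user_post_send(struct conn *c, uint64_t wr_id)
{
    /* Post directly only if nothing is queued ahead of us, to keep ordering. */
    if (c->pending_cnt == 0 && hw_post(c, wr_id))
        return 0;
    if (c->pending_cnt == MAX_PENDING)
        return -1;
    c->pending[(c->pending_head + c->pending_cnt) % MAX_PENDING] = wr_id;
    c->pending_cnt++;
    return 0;
}

/* Completion handler: returns true if the completion should reach the user. */
static bool on_send_completion(struct conn *c, uint64_t wr_id)
{
    c->in_flight--;
    /* A slot was freed, so drain deferred user sends in FIFO order. */
    while (c->pending_cnt > 0 && hw_post(c, c->pending[c->pending_head])) {
        c->pending_head = (c->pending_head + 1) % MAX_PENDING;
        c->pending_cnt--;
    }
    return wr_id != HIDDEN_WR_ID;   /* the internal completion is eaten */
}
```

With a send queue depth of 2 on the active side, the internal send takes one slot, the first user send takes the other, and a second user send waits in software until the internal completion frees a slot. The passive side would need a matching step that consumes the internal message on receive, which is the wire-protocol cost noted in the thread.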
From mshefty at ichips.intel.com Wed Oct 24 11:24:45 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 24 Oct 2007 11:24:45 -0700 Subject: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU In-Reply-To: <471F7D05.80804@ichips.intel.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> <471F5516.2000801@opengridcomputing.com> <471F6952.9040404@ichips.intel.com> <471F7D05.80804@ichips.intel.com> Message-ID: <471F8DED.9030202@ichips.intel.com> > If you think customers are willing to work around this restriction then > by all means leave the architecture alone and simply document the rdma > API's. I would think that this put's iWARP vendors at a disadvantage. I think a connection service can hide this restriction, similar to how some MPI implementations handle this today. The solution I mentioned was to send a message over the QP from the active side. This leaves the existing architecture unchanged. From swise at opengridcomputing.com Wed Oct 24 11:41:19 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 24 Oct 2007 13:41:19 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU In-Reply-To: <6.2.0.14.2.20071024105353.03090e10@esmail.cup.hp.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> <5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <6.2.0.14.2.20071024105353.03090e10@esmail.cup.hp.com> Message-ID: <471F91CF.2060704@opengridcomputing.com> Michael Krause wrote: > The proper action is to propose a new MPA specification to the IETF - it > isn't an OFA decision to make. MPA within the IETF was a tough fight to > get into existence. This particular issue was raised and the outcome > from that debate is what is in the 1.0 specification (it is a standard > if I recall not a draft). 
As far as I can see on the IETF site, the MPA, DDP, and RDMAP docs are all expired Internet Drafts. Can you point me to the RFCs? > Fine to argue here but action and > specification work must be brought up in the IETF RDDP workgroup and > likely to be vetted as well by the TSVWG and Transport AD (both weighed > in quite a bit during MPA's creation). > > If the IETF approves a new draft, then OFA can develop the associated > software. But there may be multiple software stacks to deal with legacy > hardware / drivers so the problem isn't just fixed by providing a new > MPA specification. People are using iWARP today that is compliant with > today's MPA specification. > Yup. From bill.boas at gmail.com Wed Oct 24 11:42:30 2007 From: bill.boas at gmail.com (Bill Boas) Date: Wed, 24 Oct 2007 11:42:30 -0700 Subject: [Interop-wg] Re: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU In-Reply-To: <471F91CF.2060704@opengridcomputing.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> <5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <6.2.0.14.2.20071024105353.03090e10@esmail.cup.hp.com> <471F91CF.2060704@opengridcomputing.com> Message-ID: <19a929370710241142j69d33f1apfa221122db4df050@mail.gmail.com> On 10/24/07, Steve Wise wrote: > > > > Michael Krause wrote: > > The proper action is to propose a new MPA specification to the IETF - it > > isn't an OFA decision to make. MPA within the IETF was a tough fight to > > get into existence. This particular issue was raised and the outcome > > from that debate is what is in the 1.0 specification (it is a standard > > if I recall not a draft). > > As far as I can see on the IETF site, the MPA, DDP, and RDMAP docs are > all expired Internet Drafts. Can you point me to the RFCs? 
> > > Fine to argue here but action and > > specification work must be brought up in the IETF RDDP workgroup and > > likely to be vetted as well by the TSVWG and Transport AD (both weighed > > in quite a bit during MPA's creation). > > > > If the IETF approves a new draft, then OFA can develop the associated > > software. But there may be multiple software stacks to deal with legacy > > hardware / drivers so the problem isn't just fixed by providing a new > > MPA specification. People are using iWARP today that is compliant with > > today's MPA specification. > > > > Yup. > > _______________________________________________ > Interop-wg mailing list > Interop-wg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/interop-wg > From Arkady.Kanevsky at netapp.com Wed Oct 24 12:16:51 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 24 Oct 2007 15:16:51 -0400 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU In-Reply-To: <6.2.0.14.2.20071024105353.03090e10@esmail.cup.hp.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> <5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <6.2.0.14.2.20071024105353.03090e10@esmail.cup.hp.com> Message-ID: Since there is feedback from actual device usage and interoperability issues, this is good feedback to bring to RDDP with the proposal. Sure, ULPs which were designed with the MPA peculiarity in mind do work. But there is no reason to restrict iWARP usage. Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Michael Krause [mailto:krause at cup.hp.com] > Sent: Wednesday, October 24, 2007 1:58 PM > To: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise > Cc: Roland Dreier; interop-wg at lists.openfabrics.org; > OpenFabrics General > Subject: RE: [ofa-general] [RFP] support for iWARP > requirement - activeconnect side MUST send first FPDU > > The proper action is to propose a new MPA specification to > the IETF - it isn't an OFA decision to make. MPA within the > IETF was a tough fight to get into existence. This > particular issue was raised and the outcome from that debate > is what is in the 1.0 specification (it is a standard if I > recall not a draft). Fine to argue here but action and > specification work > must be brought up in the IETF RDDP workgroup and likely to > be vetted as well by the TSVWG and Transport AD (both weighed > in quite a bit during MPA's creation). > > If the IETF approves a new draft, then OFA can develop the > associated software. But there may be multiple software > stacks to deal with legacy hardware / drivers so the problem > isn't just fixed by providing a new MPA > specification. People are using iWARP today that is compliant with > today's MPA specification. > > Mike > > At 06:25 PM 10/23/2007, Kanevsky, Arkady wrote: > >This is still a protocol and should be defined by IETF not OFA. > >But if we get agreement from all iWARP vendors this will be > a good step. > >If we can not get agreement on it on reflector lets do it at > SC'07 OFA > >dev. conference. > > > >Arkady Kanevsky email: arkady at netapp.com > >Network Appliance Inc. phone: 781-768-5395 > >1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 > >Waltham, MA 02451 central phone: 781-768-5300 > > > > > > > -----Original Message----- > > > From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] > > > Sent: Tuesday, October 23, 2007 9:02 PM > > > To: Sean Hefty; Steve Wise > > > Cc: Roland Dreier; interop-wg at lists.openfabrics.org; OpenFabrics > > > General > > > Subject: RE: [ofa-general] [RFP] support for iWARP requirement - > > > activeconnect side MUST send first FPDU > > > > > > > > That is what I've been trying to push. Both MVAPICH2 and > > > > OMPI have been > > > > > open to adjusting their transports to adhere to this > requirement. > > > > > > > > > > I wouldn't mind implementing something to enforce this in > > > > the IWCM or > > > > > the iWARP drivers IF there was a clean way to do it. So > > > far there > > > > > hasn't been a clean way proposed. > > > > > > > > Why can't either uDAPL or iW CM always do a send from > the active > > > > to passive side that gets stripped off? From the active side, > > > the first > > > > send is always posted before any user sends, and if > > > necessary, a user > > > > send can be queued by software to avoid a QP/CQ overrun. The > > > > completion can simply be eaten by software. On the passive > > > side, you > > > > have a similar process for receiving the data. > > > > > > This is similar to an option in the NetEffect driver. A > zero byte > > > RDMA write is sent from the active side and accounted for on the > > > passive side. This can be turned on and off by compile > and module > > > options for compatibility. > > > > > > I second Sean's question - why can't uDAPL or the iw_cm do this? > > > > > > > > > > > (Yes this adds wire protocol, which requires both sides > to support > > > > it.) 
> > > > > > > > - Sean > > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > > > >_______________________________________________ > >general mailing list > >general at lists.openfabrics.org > >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > >To unsubscribe, please visit > >http://openib.org/mailman/listinfo/openib-general > From tom at opengridcomputing.com Wed Oct 24 13:09:49 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 24 Oct 2007 15:09:49 -0500 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU In-Reply-To: <6.2.0.14.2.20071024105353.03090e10@esmail.cup.hp.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> <5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <6.2.0.14.2.20071024105353.03090e10@esmail.cup.hp.com> Message-ID: <471FA68D.80707@opengridcomputing.com> Michael Krause wrote: > The proper action is to propose a new MPA specification to the IETF - > it isn't an OFA decision to make. MPA within the IETF was a tough > fight to get into existence. This particular issue was raised and the > outcome from that debate is what is in the 1.0 specification (it is a > standard if I recall not a draft). It looks to me to be an ID, not an RFC. > Fine to argue here but action and specification work must be brought > up in the IETF RDDP workgroup and likely to be vetted as well by the > TSVWG and Transport AD (both weighed in quite a bit during MPA's > creation). > > If the IETF approves a new draft, then OFA can develop the associated > software. I think that's backwards. 
Referring to Page 3 of the Internet Standards Process document: o These procedures are explicitly aimed at recognizing and adopting generally-accepted practices. Thus, a candidate specification must be implemented and tested for correct operation and interoperability by multiple independent parties and utilized in increasingly demanding environments, before it can be adopted as an Internet Standard. I think this means that it is not only acceptable, but expected that anything proposed would have a working, interoperable implementation. > But there may be multiple software stacks to deal with legacy hardware > / drivers so the problem isn't just fixed by providing a new MPA > specification. People are using iWARP today that is compliant with > today's MPA specification. That remains true whether or not additional application level functionality is added to the API as is being proposed. Whether or not this additional functionality is itself standardized is a separate issue. > > Mike > > At 06:25 PM 10/23/2007, Kanevsky, Arkady wrote: >> This is still a protocol and should be defined by IETF not OFA. >> But if we get agreement from all iWARP vendors this will be a good step. >> If we can not get agreement on it on reflector lets do >> it at SC'07 OFA dev. conference. >> >> Arkady Kanevsky email: arkady at netapp.com >> Network Appliance Inc. phone: 781-768-5395 >> 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 >> Waltham, MA 02451 central phone: 781-768-5300 >> >> >> > -----Original Message----- >> > From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] >> > Sent: Tuesday, October 23, 2007 9:02 PM >> > To: Sean Hefty; Steve Wise >> > Cc: Roland Dreier; interop-wg at lists.openfabrics.org; >> > OpenFabrics General >> > Subject: RE: [ofa-general] [RFP] support for iWARP >> > requirement - activeconnect side MUST send first FPDU >> > >> > > > That is what I've been trying to push.
Both MVAPICH2 and >> > > OMPI have been >> > > > open to adjusting their transports to adhere to this requirement. >> > > > >> > > > I wouldn't mind implementing something to enforce this in >> > > the IWCM or >> > > > the iWARP drivers IF there was a clean way to do it. So >> > far there >> > > > hasn't been a clean way proposed. >> > > >> > > Why can't either uDAPL or iW CM always do a send from the active to >> > > passive side that gets stripped off? From the active side, >> > the first >> > > send is always posted before any user sends, and if >> > necessary, a user >> > > send can be queued by software to avoid a QP/CQ overrun. The >> > > completion can simply be eaten by software. On the passive >> > side, you >> > > have a similar process for receiving the data. >> > >> > This is similar to an option in the NetEffect driver. A zero >> > byte RDMA write is sent from the active side and accounted >> > for on the passive side. This can be turned on and off by >> > compile and module options for compatibility. >> > >> > I second Sean's question - why can't uDAPL or the iw_cm do this? >> > >> > > >> > > (Yes this adds wire protocol, which requires both sides to support >> > > it.) 
>> > > >> > > - Sean >> > > >> > _______________________________________________ >> > general mailing list >> > general at lists.openfabrics.org >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> > >> > To unsubscribe, please visit >> > http://openib.org/mailman/listinfo/openib-general >> > >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From pradeeps at linux.vnet.ibm.com Wed Oct 24 13:29:43 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 24 Oct 2007 13:29:43 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib: IPOIB CM rx use higher order fragments In-Reply-To: <1193218707.25235.18.camel@mtls03> References: <1193155667.25235.4.camel@mtls03> <1193218707.25235.18.camel@mtls03> Message-ID: <471FAB37.5040209@linux.vnet.ibm.com> > > Other drivers do similar allocations. For example, e1000 when working > with jumbo frames does such large allocations. Also I did not notice > allocation failures though my system was pretty much active but I can > monitor for such possible failures. > > e1000_main.c line 3549: > > else if (max_frame <= E1000_RXBUFFER_16384) > adapter->rx_buffer_len = E1000_RXBUFFER_16384; > It looks like E1000_RXBUFFER_16384 is used by e1000_setup_rctl() to configure the receive control registers. The actual allocation of buffers happens in e1000_alloc_rx_buffers_ps() wherein 3 pages are allocated through alloc_page(GFP_ATOMIC) for the case of jumbo frames. 
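To make the allocation sizes in this exchange concrete, here is a minimal sketch of the arithmetic, assuming 4 KB pages and a 9000-byte jumbo frame (both are illustrative assumptions, not values read from the e1000 source). `alloc_order()` and `pages_needed()` are hypothetical helper names; `alloc_order()` only mirrors the idea behind the kernel's `get_order()`.

```c
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size.
 * page_size is assumed to be a power of two. */
static unsigned alloc_order(size_t size, size_t page_size)
{
    unsigned order = 0;
    while ((page_size << order) < size)
        order++;
    return order;
}

/* Number of independent single pages needed to hold size bytes
 * when the receive buffer is split across pages. */
static size_t pages_needed(size_t size, size_t page_size)
{
    return (size + page_size - 1) / page_size;
}
```

Under these assumptions a 16384-byte contiguous buffer is an order-2 request (four physically contiguous pages), while a 9000-byte frame split across pages needs three separate order-0 pages. Order-0 requests are generally much easier for the allocator to satisfy under memory fragmentation, which is the concern with higher-order fragments.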
Pradeep From pradeeps at linux.vnet.ibm.com Wed Oct 24 13:33:35 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 24 Oct 2007 13:33:35 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: Message-ID: <471FAC1F.2070401@linux.vnet.ibm.com> Roland Dreier wrote: > Linus, please pull from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > this will get some more fixes/changes for 2.6.24. I have one more > IPoIB feature (support for CM without SRQs) I hope to send later > today, but we'll see... Roland, any further news on this? Pradeep From rdreier at cisco.com Wed Oct 24 13:55:03 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 24 Oct 2007 13:55:03 -0700 Subject: [Fwd: [Fwd: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler()]] In-Reply-To: <1193197374.22038.39.camel@hrosenstock-ws.xsigo.com> (Hal Rosenstock's message of "Tue, 23 Oct 2007 20:42:54 -0700") References: <1193197374.22038.39.camel@hrosenstock-ws.xsigo.com> Message-ID: > Actually to be complete, this one is: > Acked-by: Ralph Campbell > Acked-by: Hal Rosenstock ?? The original patch seems to be *FROM* Ralph, is it not? From hrosenstock at xsigo.com Wed Oct 24 13:57:07 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 24 Oct 2007 13:57:07 -0700 Subject: [Fwd: [Fwd: [ofa-general] [PATCH] IB/core - remove redundant NULL pointer check in ib_mad_recv_done_handler()]] In-Reply-To: References: <1193197374.22038.39.camel@hrosenstock-ws.xsigo.com> Message-ID: <1193259427.22038.135.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-24 at 13:55 -0700, Roland Dreier wrote: > > Actually to be complete, this one is: > > > Acked-by: Ralph Campbell > > Acked-by: Hal Rosenstock > > ?? 
The original patch seems to be *FROM* Ralph, is it not? Oops; those acks are for Steve's patch for DR loopback. Sorry. From rdreier at cisco.com Wed Oct 24 13:57:08 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 24 Oct 2007 13:57:08 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib: IPOIB CM rx use higher order fragments In-Reply-To: <1193218707.25235.18.camel@mtls03> (Eli Cohen's message of "Wed, 24 Oct 2007 11:38:26 +0200") References: <1193155667.25235.4.camel@mtls03> <1193218707.25235.18.camel@mtls03> Message-ID: > Other drivers do similar allocations. For example, e1000 when working > with jumbo frames does such large allocations. Also I did not notice > allocation failures though my system was pretty much active but I can > monitor for such possible failures. > > e1000_main.c line 3549: > > else if (max_frame <= E1000_RXBUFFER_16384) > adapter->rx_buffer_len = E1000_RXBUFFER_16384; Actually if you read the e1000 code more closely you'll see that on all hardware that supports it, they use packet splitting to make all of their allocations be only a single page. And if you do a web search on the words 'page allocation failure order e1000' you will see many examples of the problems coming from the old e100 code, which did higher-order allocations for jumbo frames. - R. From gsadasiv7 at gmail.com Wed Oct 24 15:30:09 2007 From: gsadasiv7 at gmail.com (Ganesh Sadasivan) Date: Wed, 24 Oct 2007 15:30:09 -0700 Subject: [ofa-general] ***SPAM*** IB port state change Message-ID: <532b813a0710241530n5de748b0m2bdb55e1219bb7f1@mail.gmail.com> Hi, Is there any example code that explains how to register and receive IB port state changes? Thanks Ganesh
From bill.magro at intel.com Wed Oct 24 16:04:33 2007 From: bill.magro at intel.com (Magro, Bill) Date: Wed, 24 Oct 2007 16:04:33 -0700 Subject: [promoters] Re: [ofa-general] OpenFabrics Developer's Summit:tentative agenda In-Reply-To: <20071024004042.GB10244@cuprite.pathscale.com> References: <20071023200329.GA6368@cuprite.pathscale.com><15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> <20071024004042.GB10244@cuprite.pathscale.com> Message-ID: <4D97B70CF7F72144881F66DFF4BD7A1202E1FD8C@fmsmsx413.amr.corp.intel.com> If time allowed, we would be happy to give a 10m or so perspective on the OFA stack and OFED distribution from the Intel MPI point of view. Thanks, --Bill -----Original Message----- From: promoters-bounces at lists.openfabrics.org [mailto:promoters-bounces at lists.openfabrics.org] On Behalf Of Johann George Sent: Tuesday, October 23, 2007 7:41 PM To: Jeff Squyres Cc: promoters at lists.openfabrics.org; ewg at lists.openfabrics.org; general at lists.openfabrics.org; Or Gerlitz Subject: [promoters] Re: [ofa-general] OpenFabrics Developer's Summit:tentative agenda Jeff, > Is there any intent for HP MPI or Intel MPI to speak? I would be > interested to hear what they have to say (e.g., feedback on the OFED > stack vs. other network stacks and other status update kinds of > things). We considered it but given the time constraints, thought we should wait until Sonoma. Priority was given to OpenMPI and MVAPICH since they are being shipped as part of OFED. Still, as you point out, getting feedback on their view of OFED vs. other networking stacks could be valuable.
Johann _______________________________________________ promoters mailing list promoters at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/promoters From jsquyres at cisco.com Wed Oct 24 16:13:57 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 24 Oct 2007 19:13:57 -0400 Subject: [promoters] Re: [ofa-general] OpenFabrics Developer's Summit:tentative agenda In-Reply-To: <4D97B70CF7F72144881F66DFF4BD7A1202E1FD8C@fmsmsx413.amr.corp.intel.com> References: <20071023200329.GA6368@cuprite.pathscale.com><15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> <20071024004042.GB10244@cuprite.pathscale.com> <4D97B70CF7F72144881F66DFF4BD7A1202E1FD8C@fmsmsx413.amr.corp.intel.com> Message-ID: Perhaps the total 50 minutes currently allocated to MPI implementations could be split between all of us who want to present? This makes 3 so far (i.e., 15 min/ea) -- 4 if HP wants to present (12 min/ea, or perhaps we could bump up to 60 mins for an even 15 min/ea). On Oct 24, 2007, at 7:04 PM, Magro, Bill wrote: > If time allowed, we would be happy to give a 10m or so perspective on > the OFA stack and OFED distribution from the Intel MPI point of view. > > Thanks, > > --Bill > > -----Original Message----- > From: promoters-bounces at lists.openfabrics.org > [mailto:promoters-bounces at lists.openfabrics.org] On Behalf Of Johann > George > Sent: Tuesday, October 23, 2007 7:41 PM > To: Jeff Squyres > Cc: promoters at lists.openfabrics.org; ewg at lists.openfabrics.org; > general at lists.openfabrics.org; Or Gerlitz > Subject: [promoters] Re: [ofa-general] OpenFabrics Developer's > Summit:tentative agenda > > Jeff, > >> Is there any intent for HP MPI or Intel MPI to speak? I would be >> interested to hear what they have to say (e.g., feedback on the OFED >> stack vs. other network stacks and other status update kinds of >> things). > > We considered it but given the time constraints, thought we should > wait until Sonoma. 
Priority was given to OpenMPI and MVAPICH since
> they are being shipped as part of OFED.  Still, as you point out,
> getting feedback on their view of OFED vs. other networking stacks
> could be valuable.
>
> Johann
> _______________________________________________
> promoters mailing list
> promoters at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/promoters

--
Jeff Squyres
Cisco Systems

From changquing.tang at hp.com Wed Oct 24 18:56:20 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu, 25 Oct 2007 01:56:20 -0000
Subject: [ofa-general] message is received but sender report error.
Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net>

Roland:

On OFED 1.2, we recently find an error, when sender post an immediate send as follows:

    sr.next = NULL;
    sr.sg_list = &ssg;
    sr.num_sge = 0;
    sr.opcode = IBV_WR_SEND_WITH_IMM;
    sr.send_flags = IBV_SEND_SIGNALED;
    sr.wr_id = (uint64_t)RIGHT;
    err = ibv_post_send(hpmp_ibv->ring_right_hndl, &sr, &bad_sr);

the receiver has successfully got an event, and (compl.opcode&IBV_WC_RECV_RDMA_WITH_IMM) is true.

However, the sender got an completion with compl.status=12, which is retry count exceeded, how is this possible ?

One thing I can tell is that receiver destroy the QP after receiving above message.

Thanks.

--CQ

From kannan.narasimhan at hp.com Wed Oct 24 23:24:34 2007
From: kannan.narasimhan at hp.com (Narasimhan, Kannan)
Date: Thu, 25 Oct 2007 06:24:34 -0000
Subject: [ewg] Re: [promoters] Re: [ofa-general] OpenFabrics Developer'sSummit:tentative agenda
In-Reply-To:
References: <20071023200329.GA6368@cuprite.pathscale.com><15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com><20071024004042.GB10244@cuprite.pathscale.com><4D97B70CF7F72144881F66DFF4BD7A1202E1FD8C@fmsmsx413.amr.corp.intel.com>
Message-ID:

I think this is a great suggestion. I will be glad to present a 10-15 minute overview and status update of the HP-MPI with the OFED distribution.

Thanx!
Kannan -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres Sent: Wednesday, October 24, 2007 6:14 PM To: Magro, Bill Cc: promoters at lists.openfabrics.org; ewg at lists.openfabrics.org; general at lists.openfabrics.org Subject: [ewg] Re: [promoters] Re: [ofa-general] OpenFabrics Developer'sSummit:tentative agenda Perhaps the total 50 minutes currently allocated to MPI implementations could be split between all of us who want to present? This makes 3 so far (i.e., 15 min/ea) -- 4 if HP wants to present (12 min/ea, or perhaps we could bump up to 60 mins for an even 15 min/ea). On Oct 24, 2007, at 7:04 PM, Magro, Bill wrote: > If time allowed, we would be happy to give a 10m or so perspective on > the OFA stack and OFED distribution from the Intel MPI point of view. > > Thanks, > > --Bill > > -----Original Message----- > From: promoters-bounces at lists.openfabrics.org > [mailto:promoters-bounces at lists.openfabrics.org] On Behalf Of Johann > George > Sent: Tuesday, October 23, 2007 7:41 PM > To: Jeff Squyres > Cc: promoters at lists.openfabrics.org; ewg at lists.openfabrics.org; > general at lists.openfabrics.org; Or Gerlitz > Subject: [promoters] Re: [ofa-general] OpenFabrics Developer's > Summit:tentative agenda > > Jeff, > >> Is there any intent for HP MPI or Intel MPI to speak? I would be >> interested to hear what they have to say (e.g., feedback on the OFED >> stack vs. other network stacks and other status update kinds of >> things). > > We considered it but given the time constraints, thought we should > wait until Sonoma. Priority was given to OpenMPI and MVAPICH since > they are being shipped as part of OFED. Still, as you point out, > getting feedback on their view of OFED vs. other networking stacks > could be valuable. 
>
> Johann
> _______________________________________________
> promoters mailing list
> promoters at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/promoters

--
Jeff Squyres
Cisco Systems

_______________________________________________
ewg mailing list
ewg at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

From eli at mellanox.co.il Thu Oct 25 00:46:26 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 25 Oct 2007 09:46:26 +0200
Subject: [ofa-general] Re: [PATCH] IB/ipoib: IPOIB CM rx use higher order fragments
In-Reply-To:
References: <1193155667.25235.4.camel@mtls03> <1193218707.25235.18.camel@mtls03>
Message-ID: <1193298386.25235.29.camel@mtls03>

> Actually if you read the e1000 code more closely you'll see that on
> all hardware that supports it, they use packet splitting to make all
> of their allocations be only a single page.  And if you do a web
> search on the words 'page allocation failure order e1000' you will see
> many examples of the problems coming from the old e100 code, which did
> higher-order allocations for jumbo frames.

So what about the following idea - please let me know if you think it's practical:

We will allocate compound pages of order 2 at initialization time using GFP_KERNEL. Then we take a reference on each of these pages and put them in a free list. Allocating a page will then be done from this free list. We will also put a destructor on the SKB and return the pages to the free list when the destructor is called.
We can also periodically push to the work queue a task that manages the size of the list.
From vlad at lists.openfabrics.org Thu Oct 25 02:58:47 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 25 Oct 2007 02:58:47 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20071025-0200 daily build status
Message-ID: <20071025095847.C1C26E608F6@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.22
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.22
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
From kliteyn at dev.mellanox.co.il Thu Oct 25 05:56:47 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 25 Oct 2007 14:56:47 +0200
Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names
In-Reply-To: <20071024153957.GR7088@sashak.voltaire.com>
References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> <20071024153957.GR7088@sashak.voltaire.com>
Message-ID: <4720928F.3050002@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 06:49 Wed 24 Oct , Hal Rosenstock wrote:
>> On Mon, 2007-10-15 at 12:39 +0200, Sasha Khapyorsky wrote:
>>> On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote:
>>>>>> Switches have the NodeDescription filled by FW, and it's usually the
>>>>>> same string for all the switches.
>>>>> It must not be same. Also I suppose that node description can be changed
>>>>> at least for some managed switches even today.
>>>> Come on, man...
>>>> How many cluster administrators that you know will actually go and set
>>>> NodeDescription on switches???
>>> I know at least one asked for this.
>> Perhaps switch_map can be used in conjunction with this like in the
>> diags ?
>
> Hmm, right, switch_map is another example of switch naming, which is
> useful with diags.
> Perhaps even more generic - guid to name map? And this will work instead
> of (or in addition to) node description when specified?

Can you elaborate on this?
What exactly is switch_map?
And why would be need an additional guid-to-anything map if we already have node map indexed by guids (or am I missing something)?
-- Yevgeny > Sasha > From hrosenstock at xsigo.com Thu Oct 25 06:32:06 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 25 Oct 2007 06:32:06 -0700 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <4720928F.3050002@dev.mellanox.co.il> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> <20071024153957.GR7088@sashak.voltaire.com> <4720928F.3050002@dev.mellanox.co.il> Message-ID: <1193319126.31872.94.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-10-25 at 14:56 +0200, Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: > > On 06:49 Wed 24 Oct , Hal Rosenstock wrote: > >> On Mon, 2007-10-15 at 12:39 +0200, Sasha Khapyorsky wrote: > >>> On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote: > >>>>>> Switches have the NodeDescription filled by FW, and it's usually the > >>>>>> same string for all the switches. > >>>>> It must not be same. Also I suppose that node description can be changed > >>>>> at least for some managed switches even today. > >>>> Come on, man... > >>>> How many cluster administrators that you know will actually go and set > >>>> NodeDescription on switches??? > >>> I know at least one asked for this. > >> Perhaps switch_map can be used in conjunction with this like in the > >> diags ? > > > > Hmm, right, switch_map is another example of switch naming, which is > > useful with diags. > > Perhaps even more generic - guid to name map? And this will work instead > > of (or in addition to) node description when specified? > > Can you elaborate on this? > What exactly is switch_map? See infiniband-diags (ibnetdiscover man page for one). 
> And why would be need an additional guid-to-anything map if we > already have node map indexed by guids (or am I missing something)? GUIDs are not the most admin friendly identifiers. -- Hal > -- Yevgeny > > > Sasha > > > From hrosenstock at xsigo.com Thu Oct 25 06:43:48 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 25 Oct 2007 06:43:48 -0700 Subject: [ofa-general] [RFC] upstream IB router support In-Reply-To: <000001c81668$43d9dba0$73cc180a@amr.corp.intel.com> References: <000001c81668$43d9dba0$73cc180a@amr.corp.intel.com> Message-ID: <1193319828.31872.106.camel@hrosenstock-ws.xsigo.com> On Wed, 2007-10-24 at 11:04 -0700, Sean Hefty wrote: > A while ago, some support was added to the rdma stack to support IB routers in a > very limited fashion (done as part of PathForward). The relevant patches are > available at: > > git://git.openfabrics.org/~shefty/rdma-dev.git ib_router > > I wanted to gauge interest merging these changes upstream for 2.6.25. I know > there is growing interest in using IB routers. Obsidian has a router, and both > Mellanox and QLogic adapters can be used to construct host routers. > > The main disadvantage to merging the patches is that it slightly violates the IB > CM protocol by sending invalid data in the CM REQ. The patches can be > optionally compiled in as experimental if needed. My take ($0.02) on this is (at most) experimental if it is to be pushed upstream. The issue I see is how prestandard v. standard IB routers can be dealt with as cleanly as possible. 
-- Hal > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From erezz at voltaire.com Thu Oct 25 07:11:34 2007 From: erezz at voltaire.com (Erez Zilber) Date: Thu, 25 Oct 2007 16:11:34 +0200 Subject: [ofa-general] iSER for stgt - wiki page Message-ID: <4720A416.8010503@voltaire.com> The following wiki page is a quick start guide for running an iSCSI over iSER target through the open-source stgt project: https://wiki.openfabrics.org/tiki-index.php?page=ISER-target For more information about stgt: http://stgt.berlios.de/ I hope that you find it helpful. -- ____________________________________________________________ Erez Zilber | 972-9-971-7689 Software Engineer, Storage Solutions Voltaire – _The Grid Backbone_ __ www.voltaire.com From kliteyn at dev.mellanox.co.il Thu Oct 25 07:36:24 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 25 Oct 2007 16:36:24 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <1193319126.31872.94.camel@hrosenstock-ws.xsigo.com> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> <20071024153957.GR7088@sashak.voltaire.com> <4720928F.3050002@dev.mellanox.co.il> <1193319126.31872.94.camel@hrosenstock-ws.xsigo.com> Message-ID: <4720A9E8.4010300@dev.mellanox.co.il> Hal Rosenstock wrote: > On Thu, 2007-10-25 at 14:56 +0200, Yevgeny Kliteynik wrote: >> Sasha Khapyorsky wrote: >>> On 06:49 Wed 24 Oct , Hal Rosenstock wrote: >>>> On Mon, 
2007-10-15 at 12:39 +0200, Sasha Khapyorsky wrote: >>>>> On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote: >>>>>>>> Switches have the NodeDescription filled by FW, and it's usually the >>>>>>>> same string for all the switches. >>>>>>> It must not be same. Also I suppose that node description can be changed >>>>>>> at least for some managed switches even today. >>>>>> Come on, man... >>>>>> How many cluster administrators that you know will actually go and set >>>>>> NodeDescription on switches??? >>>>> I know at least one asked for this. >>>> Perhaps switch_map can be used in conjunction with this like in the >>>> diags ? >>> Hmm, right, switch_map is another example of switch naming, which is >>> useful with diags. >>> Perhaps even more generic - guid to name map? And this will work instead >>> of (or in addition to) node description when specified? >> Can you elaborate on this? >> What exactly is switch_map? > > See infiniband-diags (ibnetdiscover man page for one). > >> And why would be need an additional guid-to-anything map if we >> already have node map indexed by guids (or am I missing something)? > > GUIDs are not the most admin friendly identifiers. In that case I understand that you probably meant "name to guid" and not "guid to name" map. 
--Yevgeny > > -- Hal > >> -- Yevgeny >> >>> Sasha >>> > From hrosenstock at xsigo.com Thu Oct 25 07:43:55 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 25 Oct 2007 07:43:55 -0700 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <4720A9E8.4010300@dev.mellanox.co.il> References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> <20071024153957.GR7088@sashak.voltaire.com> <4720928F.3050002@dev.mellanox.co.il> <1193319126.31872.94.camel@hrosenstock-ws.xsigo.com> <4720A9E8.4010300@dev.mellanox.co.il> Message-ID: <1193323435.31872.128.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-10-25 at 16:36 +0200, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > On Thu, 2007-10-25 at 14:56 +0200, Yevgeny Kliteynik wrote: > >> Sasha Khapyorsky wrote: > >>> On 06:49 Wed 24 Oct , Hal Rosenstock wrote: > >>>> On Mon, 2007-10-15 at 12:39 +0200, Sasha Khapyorsky wrote: > >>>>> On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote: > >>>>>>>> Switches have the NodeDescription filled by FW, and it's usually the > >>>>>>>> same string for all the switches. > >>>>>>> It must not be same. Also I suppose that node description can be changed > >>>>>>> at least for some managed switches even today. > >>>>>> Come on, man... > >>>>>> How many cluster administrators that you know will actually go and set > >>>>>> NodeDescription on switches??? > >>>>> I know at least one asked for this. > >>>> Perhaps switch_map can be used in conjunction with this like in the > >>>> diags ? > >>> Hmm, right, switch_map is another example of switch naming, which is > >>> useful with diags. > >>> Perhaps even more generic - guid to name map? 
And this will work instead
> >>> of (or in addition to) node description when specified?
> >> Can you elaborate on this?
> >> What exactly is switch_map?
> >
> > See infiniband-diags (ibnetdiscover man page for one).
> >
> >> And why would be need an additional guid-to-anything map if we
> >> already have node map indexed by guids (or am I missing something)?
> >
> > GUIDs are not the most admin friendly identifiers.
>
> In that case I understand that you probably meant
> "name to guid" and not "guid to name" map.

The switch map file is guid to name so that names can be used as friendly identifiers for (switch) guids (since their NodeDescriptions are not easy to set):

SWITCH MAP FILE FORMAT
The switch map is used to specify a user friendly name for switches in the output. GUIDs are used to perform the lookup.

""

-- Hal

>
> --Yevgeny
>
> >
> > -- Hal
> >
> >> -- Yevgeny
> >>
> >>> Sasha
> >>>
> >
>

From tziporet at dev.mellanox.co.il Thu Oct 25 08:56:22 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 25 Oct 2007 17:56:22 +0200
Subject: [ofa-general] OpenFabrics Developer's Summit: tentative agenda
In-Reply-To: <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com>
References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com>
Message-ID: <4720BCA6.9080501@mellanox.co.il>

Hi Johann,

Please see my comments inside.

Thanks,
Tziporet

On 10/23/07, *Johann George* wrote:
>
>
> Friday, November 16, 2007
> -------------------------
> 07:15  45m  Breakfast
> ------------------------------------
> 08:00  30m  WinOF: Update and Futures
>             Gilad Shainer, Mellanox
> 08:30  30m  CCS Ve2 Preview
>             Eric Lantz, Microsoft
> 09:00  30m  OFED 1.4 Planned Features
>             Tziporet Koren, Mellanox
>

We may need more than 30m for this.
Also, it would be good if this session were the last one, so that I can include input from all the sessions, especially those that cover the new features.
>
> 09:30  20m  OFED Management Tools
>             Ira Weiny, Lawrence Livermore National Laboratories
> ------------------------------------
> 09:50  20m  Break
> ------------------------------------
> 10:10  20m  RDS with Zero Copy
>             Rick Frank, Oracle
> 10:30  20m  QoS Support
>             Sean Hefty, Intel; Dror Goldenberg, Mellanox
> 10:50  20m  InfiniBand Routing Update
>             Jason Gunthorpe, Obsidian Research
> 11:10  20m  IPoIB Stateless Offloads
>             Liran Liss, Mellanox
>

Liran will not come to the summit. Dror can replace him.
It would be good if Or worked with Dror on this.

>
> 11:30  20m  Using XRC
>             Dror Goldenberg, Mellanox and Dr. Panda, Ohio State University
> 11:50  20m  Fibre Channel over InfiniBand
>             Dror Goldenberg, Mellanox
> ------------------------------------
> 12:10  60m  Lunch
> ------------------------------------
>
>

From weiny2 at llnl.gov Thu Oct 25 09:15:54 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 25 Oct 2007 09:15:54 -0700
Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names
In-Reply-To: <1193323435.31872.128.camel@hrosenstock-ws.xsigo.com>
References: <470B4374.6040502@dev.mellanox.co.il> <20071013202559.GG12364@sashak.voltaire.com> <4711EE76.4070107@dev.mellanox.co.il> <20071014160314.GE6489@sashak.voltaire.com> <4712990D.1060801@dev.mellanox.co.il> <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> <20071024153957.GR7088@sashak.voltaire.com> <4720928F.3050002@dev.mellanox.co.il> <1193319126.31872.94.camel@hrosenstock-ws.xsigo.com> <4720A9E8.4010300@dev.mellanox.co.il> <1193323435.31872.128.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20071025091554.150750aa.weiny2@llnl.gov>

I actually have a patch series which I am testing right now which adds switch-map support to opensm.
There are 2 main reasons we (LLNL) want this: 1) log messages in opensm.log match output from the diags 2) log messages use nice descriptive strings like "SW1 (R 3) ISR9024D" which tell the sysadmin that this is Switch 1 Rack 3 3) I am in the process of using the new event plugin interface to start logging port counters to a mysql DB. (This is going to be a separate plugin GPL'ed project so there will be no requirement on mysql to opensm.) But in the process it will be nice to be able to have descriptive names for switches rather than just the GUIDs I will send the patch series ASAP. Thanks, Ira On Thu, 25 Oct 2007 07:43:55 -0700 Hal Rosenstock wrote: > On Thu, 2007-10-25 at 16:36 +0200, Yevgeny Kliteynik wrote: > > Hal Rosenstock wrote: > > > On Thu, 2007-10-25 at 14:56 +0200, Yevgeny Kliteynik wrote: > > >> Sasha Khapyorsky wrote: > > >>> On 06:49 Wed 24 Oct , Hal Rosenstock wrote: > > >>>> On Mon, 2007-10-15 at 12:39 +0200, Sasha Khapyorsky wrote: > > >>>>> On 10:48 Mon 15 Oct , Yevgeny Kliteynik wrote: > > >>>>>>>> Switches have the NodeDescription filled by FW, and it's usually the > > >>>>>>>> same string for all the switches. > > >>>>>>> It must not be same. Also I suppose that node description can be changed > > >>>>>>> at least for some managed switches even today. > > >>>>>> Come on, man... > > >>>>>> How many cluster administrators that you know will actually go and set > > >>>>>> NodeDescription on switches??? > > >>>>> I know at least one asked for this. > > >>>> Perhaps switch_map can be used in conjunction with this like in the > > >>>> diags ? > > >>> Hmm, right, switch_map is another example of switch naming, which is > > >>> useful with diags. > > >>> Perhaps even more generic - guid to name map? And this will work instead > > >>> of (or in addition to) node description when specified? > > >> Can you elaborate on this? > > >> What exactly is switch_map? > > > > > > See infiniband-diags (ibnetdiscover man page for one). 
> > > > > >> And why would be need an additional guid-to-anything map if we > > >> already have node map indexed by guids (or am I missing something)? > > > > > > GUIDs are not the most admin friendly identifiers. > > > > In that case I understand that you probably meant > > "name to guid" and not "guid to name" map. > > The switch map file is guid to name so that names can be used as > friendly identifiers for (switch) guids (since their NodeDescriptions > are not easy to set): > > SWITCH MAP FILE FORMAT > The switch map is used to specify a user friendly name for switches in > the output. GUIDs are used to perform the lookup. > > "" > > -- Hal > > > > > --Yevgeny > > > > > > > > -- Hal > > > > > >> -- Yevgeny > > >> > > >>> Sasha > > >>> > > > > > From swelch at systemfabricworks.com Thu Oct 25 09:55:25 2007 From: swelch at systemfabricworks.com (Steve Welch) Date: Thu, 25 Oct 2007 11:55:25 -0500 Subject: [ofa-general] [RFC] upstream IB router support In-Reply-To: <20071024184236.C12E7E6093C@openfabrics.org> References: <20071024184236.C12E7E6093C@openfabrics.org> Message-ID: <001301c81727$d78266f0$bc0da8c0@catcher> Hi Sean, > Message: 3 > Date: Wed, 24 Oct 2007 11:04:04 -0700 > From: "Sean Hefty" > Subject: [ofa-general] [RFC] upstream IB router support > To: "'general'" > Message-ID: <000001c81668$43d9dba0$73cc180a at amr.corp.intel.com> > Content-Type: text/plain; charset="us-ascii" > > A while ago, some support was added to the rdma stack to support IB > routers in a > very limited fashion (done as part of PathForward). The relevant patches > are > available at: > > git://git.openfabrics.org/~shefty/rdma-dev.git ib_router > > I wanted to gauge interest merging these changes upstream for 2.6.25. I would like to see them included. 
Steve From rdreier at cisco.com Thu Oct 25 10:12:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 25 Oct 2007 10:12:19 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: <471FAC1F.2070401@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Wed, 24 Oct 2007 13:33:35 -0700") References: <471FAC1F.2070401@linux.vnet.ibm.com> Message-ID: > Roland, any further news on this? I did not manage to get the IPoIB connected mode without SRQ patch merged in time. I tried to fix up your latest patch to apply to the current tree, but while doing that I found several apparent bugs (eg I don't see how it is safe to destroy a QP and free all pending receives while another thread might be handling receives, calling ipoib_cm_dev_cleanup just because a new QP failed to be allocated looks wrong) and also a few things that seemed needlessly complicated (eg duplicating code in the RX WC handler and the timer handling). So I basically started rewriting the patch. I'm almost done and I will post the results soon, and I expect this to be the first patch I queue for 2.6.25. (BTW, when posting patches, please make sure that they apply with -p1; your latest posting at least has an extra level in the pathnames. And also always include the full patch description for the changelog, so I don't have to recreate it when merging the patch) - R. From rdreier at cisco.com Thu Oct 25 10:13:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 25 Oct 2007 10:13:35 -0700 Subject: [ofa-general] message is received but sender report error. In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> (Changqing Tang's message of "Thu, 25 Oct 2007 01:56:20 -0000") References: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> Message-ID: > the receiver has successfully got an event, and > (compl.opcode&IBV_WC_RECV_RDMA_WITH_IMM) is true. 
>
> However, the sender got an completion with compl.status=12, which is
> retry count exceeded, how is this possible ?
> One thing I can tell is that receiver destroy the QP after receiving
> above message.

I guess maybe it's possible to destroy the QP before the ACK is generated? Maybe the first ACK is dropped for some reason and the responder QP is gone before it can resend the ACK?

 - R.

From rdreier at cisco.com Thu Oct 25 10:14:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 25 Oct 2007 10:14:42 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipoib: IPOIB CM rx use higher order fragments
In-Reply-To: <1193298386.25235.29.camel@mtls03> (Eli Cohen's message of "Thu, 25 Oct 2007 09:46:26 +0200")
References: <1193155667.25235.4.camel@mtls03> <1193218707.25235.18.camel@mtls03> <1193298386.25235.29.camel@mtls03>
Message-ID:

> We will allocate compound pages of order 2 at initialization time using
> GFP_KERNEL. Then we take a reference on each of these pages and put them
> in a free list. Allocating a page will then be done from this free
> list. We will also put a destructor on the SKB and return the pages to
> the free list when the destructor is called.
> We can also periodically push to the work queue a task that manages the
> size of the list.

Seems too complicated to put into a driver. It might make sense as a generic service like the software LRO stuff; probably almost all high-speed NICs would want to share it to save allocating overhead.

 - R.
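
[Editorial sketch] For readers following the thread, the free-list scheme Eli proposes (quoted in the message above) can be illustrated with a small userspace C analogy. This is only a sketch under stated assumptions, not IPoIB or kernel code: malloc() stands in for allocating order-2 compound pages with GFP_KERNEL at initialization time, pool_put() plays the role of the SKB destructor, and every name here (buf_pool, pool_buf, pool_init, pool_get, pool_put, pool_destroy) is hypothetical. There is no locking, so it is single-threaded only; a driver would need to protect the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for an order-2 compound page (4 pages of 4 KiB). */
#define POOL_BUF_SIZE (4 * 4096)

/* Hypothetical types: not taken from any real driver. */
struct pool_buf {
    struct pool_buf *next;              /* free-list link, meaningful only while on the list */
    unsigned char data[POOL_BUF_SIZE];  /* payload area */
};

struct buf_pool {
    struct pool_buf *free_list;  /* singly linked list of idle buffers */
    size_t nfree;                /* buffers currently on the free list */
    size_t total;                /* buffers owned by the pool overall */
};

/* Allocate every buffer up front, where a failure is easy to handle.
 * Returns 0 on success, -1 on allocation failure (the caller should
 * then call pool_destroy() to release what was allocated). */
int pool_init(struct buf_pool *p, size_t count)
{
    p->free_list = NULL;
    p->nfree = 0;
    p->total = 0;
    for (size_t i = 0; i < count; i++) {
        struct pool_buf *b = malloc(sizeof(*b));
        if (!b)
            return -1;
        b->next = p->free_list;
        p->free_list = b;
        p->nfree++;
        p->total++;
    }
    return 0;
}

/* Fast path: pop a buffer from the free list; NULL if the pool is empty. */
struct pool_buf *pool_get(struct buf_pool *p)
{
    struct pool_buf *b = p->free_list;
    if (b) {
        p->free_list = b->next;
        p->nfree--;
    }
    return b;
}

/* Plays the part of the SKB destructor: hand the buffer back to the
 * pool instead of freeing it. */
void pool_put(struct buf_pool *p, struct pool_buf *b)
{
    assert(p->nfree < p->total);  /* catches returning more than was taken */
    b->next = p->free_list;
    p->free_list = b;
    p->nfree++;
}

/* Release all idle buffers; buffers still in use are the caller's problem. */
void pool_destroy(struct buf_pool *p)
{
    struct pool_buf *b;
    while ((b = pool_get(p)) != NULL)
        free(b);
    p->total = 0;
}
```

The periodic work-queue task that would grow or shrink the list is left out. The sketch does not address Roland's objection that such a pool probably belongs in a shared, generic service rather than in one driver.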
From mshefty at ichips.intel.com Thu Oct 25 10:15:09 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 25 Oct 2007 10:15:09 -0700 Subject: [ofa-general] [RFC] upstream IB router support In-Reply-To: <1193319828.31872.106.camel@hrosenstock-ws.xsigo.com> References: <000001c81668$43d9dba0$73cc180a@amr.corp.intel.com> <1193319828.31872.106.camel@hrosenstock-ws.xsigo.com> Message-ID: <4720CF1D.9050704@ichips.intel.com> > My take ($0.02) on this is (at most) experimental if it is to be pushed > upstream. I agree. I just didn't want to bother with this change without support for merging the changes upstream. > The issue I see is how prestandard v. standard IB routers can be dealt > with as cleanly as possible. IMO, the risk is minimal. The patches do not introduce any new protocols or SA attributes. Visible changes are limited to setting the DLID field in the CM REQ message to an invalid value that the passive side keys off of to determine the correct value. A node which does not support this would simply reject the connection with an invalid LID. - Sean From hrosenstock at xsigo.com Thu Oct 25 10:28:38 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 25 Oct 2007 10:28:38 -0700 Subject: [ofa-general] [RFC] upstream IB router support In-Reply-To: <4720CF1D.9050704@ichips.intel.com> References: <000001c81668$43d9dba0$73cc180a@amr.corp.intel.com> <1193319828.31872.106.camel@hrosenstock-ws.xsigo.com> <4720CF1D.9050704@ichips.intel.com> Message-ID: <1193333318.31872.162.camel@hrosenstock-ws.xsigo.com> On Thu, 2007-10-25 at 10:15 -0700, Sean Hefty wrote: > > My take ($0.02) on this is (at most) experimental if it is to be pushed > > upstream. > > I agree. I just didn't want to bother with this change without support > for merging the changes upstream. > > > The issue I see is how prestandard v. standard IB routers can be dealt > > with as cleanly as possible. > > IMO, the risk is minimal. 
Understood but there is some risk in terms of compatibility moving forward. > The patches do not introduce any new > protocols or SA attributes. Visible changes are limited to setting the > DLID field in the CM REQ message to an invalid value that the passive > side keys off of to determine the correct value. A node which does not > support this would simply reject the connection with an invalid LID. How might this affect end node operation when there are standard based routers ? If there are other larger changes for that, then this particular issue is a red herring. I do think it's important to try to keep in mind if it is possible to smooth a migration path for end nodes (and SMs) in terms of prestandard and standard routers. That's not to say that there should be no changes; just that it would be nice to be able to tell the two apart and make intelligent choices based on this. -- Hal > - Sean From changquing.tang at hp.com Thu Oct 25 10:31:16 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 25 Oct 2007 17:31:16 -0000 Subject: [ofa-general] message is received but sender report error. In-Reply-To: References: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Thursday, October 25, 2007 12:14 PM > To: Tang, Changqing > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] message is received but sender > report error. > > > > the receiver has successfully got an event, and > > (compl.opcode&IBV_WC_RECV_RDMA_WITH_IMM) is true. > > > > However, the sender got an completion with > compl.status=12, which is > retry count exceeded, how is > this possible ? > > One thing I can tell is that receiver destroy the QP after > receiving > above message. > > I guess maybe it's possible to destroy the QP before the ACK > is generated? 
Maybe the first ACK is dropped for some reason > and the responder QP is gone before it can resend the ACK? If this is the case, how would we fix the problem ? It's hard for us to delay to destroy the QP, because we don't know how long to delay. The other way is to do something from the driver, or firmware. --CQ > > - R. > From rdreier at cisco.com Thu Oct 25 10:36:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 25 Oct 2007 10:36:33 -0700 Subject: [ofa-general] message is received but sender report error. In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> (Changqing Tang's message of "Thu, 25 Oct 2007 17:31:16 -0000") References: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> Message-ID: > If this is the case, how would we fix the problem ? It's hard for us to > delay to destroy the QP, because we don't know how long to delay. > The other way is to do something from the driver, or firmware. I think you just have to deal with this possibility at your level. This type of thing is just inherent in network programming: without an explicit message back, you don't know if any given message has really been received. The exact same thing could happen with TCP for example: just cut the cable after a message has been received but before the ACK has gone back the other way. - R. 
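The point above (a sender cannot tell from a failed completion alone whether the peer received the message) can be illustrated with a small decision helper. This is a minimal sketch, not code from any library: the names delivery_state and classify_delivery are hypothetical, the constants are local stand-ins, and the only value taken from the thread is status 12 (retry count exceeded). It assumes the application protocol has the receiver send its own explicit reply before tearing down the QP.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for completion status values; 12 is the value reported
 * in the thread for "retry count exceeded". */
enum {
	SKETCH_WC_SUCCESS = 0,
	SKETCH_WC_RETRY_EXC_ERR = 12
};

/* What the sender may conclude about one message. */
enum delivery_state {
	DELIVERY_CONFIRMED,	/* an acknowledgement of some kind came back */
	DELIVERY_UNKNOWN	/* the message may or may not have arrived */
};

/*
 * Combine the transport-level completion with an application-level reply.
 * An explicit reply from the peer confirms delivery even when the send
 * completion reported an error, which is the case described in the thread:
 * the data arrived but the ACK never made it back.
 */
enum delivery_state classify_delivery(int wc_status, bool app_reply_seen)
{
	if (app_reply_seen)
		return DELIVERY_CONFIRMED;
	if (wc_status == SKETCH_WC_SUCCESS)
		return DELIVERY_CONFIRMED;
	/* Retry exceeded, or any other error, with no reply: cannot tell. */
	return DELIVERY_UNKNOWN;
}
```

With such a helper the sender would treat (status 12, reply seen) as delivered, and (status 12, no reply) as an open question to be resolved by the application protocol, for example by re-sending over a new connection with a sequence number so the receiver can discard duplicates.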
From jgunthorpe at obsidianresearch.com Thu Oct 25 10:37:57 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 25 Oct 2007 11:37:57 -0600 Subject: [ofa-general] [RFC] upstream IB router support In-Reply-To: <1193333318.31872.162.camel@hrosenstock-ws.xsigo.com> References: <000001c81668$43d9dba0$73cc180a@amr.corp.intel.com> <1193319828.31872.106.camel@hrosenstock-ws.xsigo.com> <4720CF1D.9050704@ichips.intel.com> <1193333318.31872.162.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071025173757.GK23078@obsidianresearch.com> On Thu, Oct 25, 2007 at 10:28:38AM -0700, Hal Rosenstock wrote: > How might this affect end node operation when there are standard based > routers ? If there are other larger changes for that, then this > particular issue is a red herring. I think the current thinking in the LWG would allow Sean's patch in the end-nodes to transparently continue working in all the cases it can work, with a little help from the SM. > I do think it's important to try to keep in mind if it is possible to > smooth a migration path for end nodes (and SMs) in terms of prestandard > and standard routers. That's not to say that there should be no changes; > just that it would be nice to be able to tell the two apart and make > intelligent choices based on this. From an end-node perspective, pre-standard vs standard operation would have to be based on the SM supporting the new capabilities. TBH, I'm not too concerned about continuing to support this pre-standard router hardware we are going to be using at SC|07. It is strictly a development tool to start getting the ULPs in shape. Jason From sean.hefty at intel.com Thu Oct 25 10:37:52 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Oct 2007 10:37:52 -0700 Subject: [ofa-general] message is received but sender report error.
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> Message-ID: <000301c8172d$c58b7f80$ff0da8c0@amr.corp.intel.com> >If this is the case, how would we fix the problem ? It's hard for us to >delay to destroy the QP, because we don't know how long to delay. >The other way is to do something from the driver, or firmware. Do you disconnect the QPs using the IB CM? - Sean From sean.hefty at intel.com Thu Oct 25 10:50:23 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Oct 2007 10:50:23 -0700 Subject: [ofa-general] [RFC] upstream IB router support In-Reply-To: <1193333318.31872.162.camel@hrosenstock-ws.xsigo.com> References: <000001c81668$43d9dba0$73cc180a@amr.corp.intel.com> <1193319828.31872.106.camel@hrosenstock-ws.xsigo.com> <4720CF1D.9050704@ichips.intel.com> <1193333318.31872.162.camel@hrosenstock-ws.xsigo.com> Message-ID: <000401c8172f$84c20a30$ff0da8c0@amr.corp.intel.com> >> The patches do not introduce any new >> protocols or SA attributes. Visible changes are limited to setting the >> DLID field in the CM REQ message to an invalid value that the passive >> side keys off of to determine the correct value. A node which does not >> support this would simply reject the connection with an invalid LID. > >How might this affect end node operation when there are standard based >routers ? If there are other larger changes for that, then this >particular issue is a red herring. Assuming that the CM protocol does not change, the standard will need to define a way for the active side to obtain correct CM REQ values. (The patches handle this btw.) This likely requires new host to SA interactions. For now, the patches use the defined path record query, which is likely inadequate based on previous discussions. 
>I do think it's important to try to keep in mind if it is possible to >smooth a migration path for end nodes (and SMs) in terms of prestandard >and standard routers. That's not to say that there should be no changes; >just that it would be nice to be able to tell the two apart and make >intelligent choices based on this. IMO, some of this falls into the routing architecture. Does it change the CM protocol or modify/add SA attributes (path record)? But there's no sure way for one side of a connection to know beforehand if the other side is following the standard. I would consider these patches purely experimental with no guarantee that they interoperate with any defined standard. They give people developing routers something that they can use to test with, and care was taken to try to avoid potential interoperability issues. - Sean From changquing.tang at hp.com Thu Oct 25 10:45:26 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 25 Oct 2007 17:45:26 -0000 Subject: [ofa-general] message is received but sender report error. In-Reply-To: <000301c8172d$c58b7f80$ff0da8c0@amr.corp.intel.com> References: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> <000301c8172d$c58b7f80$ff0da8c0@amr.corp.intel.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA8403029C7010@G3W0634.americas.hpqcorp.net> This is Verbs layer code, no IB CM is used. --CQ > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Thursday, October 25, 2007 12:38 PM > To: Tang, Changqing; Roland Dreier > Cc: general at lists.openfabrics.org > Subject: RE: [ofa-general] message is received but sender > report error. > > >If this is the case, how would we fix the problem ? It's > hard for us to > >delay to destroy the QP, because we don't know how long to delay. > >The other way is to do something from the driver, or firmware. > > Do you disconnect the QPs using the IB CM? 
> > - Sean > From jgunthorpe at obsidianresearch.com Thu Oct 25 11:02:16 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 25 Oct 2007 12:02:16 -0600 Subject: [ofa-general] [RFC] upstream IB router support In-Reply-To: <000401c8172f$84c20a30$ff0da8c0@amr.corp.intel.com> References: <000001c81668$43d9dba0$73cc180a@amr.corp.intel.com> <1193319828.31872.106.camel@hrosenstock-ws.xsigo.com> <4720CF1D.9050704@ichips.intel.com> <1193333318.31872.162.camel@hrosenstock-ws.xsigo.com> <000401c8172f$84c20a30$ff0da8c0@amr.corp.intel.com> Message-ID: <20071025180216.GM23078@obsidianresearch.com> On Thu, Oct 25, 2007 at 10:50:23AM -0700, Sean Hefty wrote: > But there's no sure way for one side of a connection to know > beforehand if the other side is following the standard. I would > consider these patches purely experimental with no guarantee that Well, it is my hope the standard will let all CM passive sides that conform today to work transparently with router enabled active sides. Jason From sean.hefty at intel.com Thu Oct 25 11:07:55 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Oct 2007 11:07:55 -0700 Subject: [ofa-general] [RFC] upstream IB router support In-Reply-To: <20071025180216.GM23078@obsidianresearch.com> References: <000001c81668$43d9dba0$73cc180a@amr.corp.intel.com> <1193319828.31872.106.camel@hrosenstock-ws.xsigo.com> <4720CF1D.9050704@ichips.intel.com> <1193333318.31872.162.camel@hrosenstock-ws.xsigo.com> <000401c8172f$84c20a30$ff0da8c0@amr.corp.intel.com> <20071025180216.GM23078@obsidianresearch.com> Message-ID: <000501c81731$f7ed0080$ff0da8c0@amr.corp.intel.com> >Well, it is my hope the standard will let all CM passive sides that >conform today to work transparently with router enabled active sides. That's my hope as well, and it should work fine as long as the CM protocol doesn't change. (And there's no indication that it will.) 
- Sean From pradeeps at linux.vnet.ibm.com Thu Oct 25 11:41:25 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 25 Oct 2007 11:41:25 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: <471FAC1F.2070401@linux.vnet.ibm.com> Message-ID: <4720E355.8010400@linux.vnet.ibm.com> Roland Dreier wrote: > > Roland, any further news on this? > > I did not manage to get the IPoIB connected mode without SRQ patch > merged in time. I tried to fix up your latest patch to apply to the > current tree, but while doing that I found several apparent bugs (eg I > don't see how it is safe to destroy a QP and free all pending receives > while another thread might be handling receives, calling > ipoib_cm_dev_cleanup just because a new QP failed to be allocated > looks wrong) and also a few things that seemed needlessly complicated > (eg duplicating code in the RX WC handler and the timer handling). So > I basically started rewriting the patch. I'm almost done and I will > post the results soon, and I expect this to be the first patch I queue > for 2.6.25. > > (BTW, when posting patches, please make sure that they apply with -p1; > your latest posting at least has an extra level in the pathnames. And > also always include the full patch description for the changelog, so I > don't have to recreate it when merging the patch) Having waited for months for this patch to be merged in, it is very disappointing to say the least. Wish it had been merged and if changes are needed they can always be made subsequently. That has been my understanding of the development model. Pradeep From weiny2 at llnl.gov Thu Oct 25 11:43:03 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 25 Oct 2007 11:43:03 -0700 Subject: [ofa-general] [PATCH 0/6] Add Switch Map support to opensm Message-ID: <20071025114303.6d712bcb.weiny2@llnl.gov> As I said in another thread. I have added switch-map support to opensm. 
This patch series does that in a number of steps. Patch: 1) Simple comment fix (Should be applied on it's own regardless of if the series is accepted.) 2) Moves the switch map support to ibcommon but leaves the implementation alone. 3) Changes the implementation of the switch map to read the file into memory to facilitate faster lookups as well as multi-threaded lookups. 4) Add the switch map calls to opensm but leave the creation of the switch map to be the default one provided by ibcommon (Pass NULL to create_switch_map) 5) Add an option to the opts file to specify a switch map. 6) Allow a special value of "(null)" in the opts file. (This too could be applied outside of the series.) Patches to follow, Ira From weiny2 at llnl.gov Thu Oct 25 11:43:17 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 25 Oct 2007 11:43:17 -0700 Subject: [ofa-general] [PATCH 1/6] infiniband-diags/configure.in: fix comment Message-ID: <20071025114317.2010de53.weiny2@llnl.gov> >From b338078dc970c09513dd1d3023bebff334010c05 Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Fri, 19 Oct 2007 11:22:40 -0700 Subject: [PATCH] infiniband-diags/configure.in: fix comment Signed-off-by: Ira K. Weiny --- infiniband-diags/configure.in | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in index 3b3dd3f..95c7b34 100644 --- a/infiniband-diags/configure.in +++ b/infiniband-diags/configure.in @@ -72,7 +72,7 @@ AC_CHECK_FUNCS([strchr strrchr strtol strtoul memset]) dnl Checks for typedefs, structures, and compiler characteristics. AC_C_CONST -dnl Check for perl and perl install location +dnl Check for the specification of a default switch map file AC_MSG_CHECKING(for --with-switch-map ) AC_ARG_WITH(switch-map, AC_HELP_STRING([--with-switch-map=file], -- 1.5.1 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0001-infiniband-diags-configure.in-fix-comment.patch Type: application/octet-stream Size: 913 bytes Desc: not available URL: From weiny2 at llnl.gov Thu Oct 25 11:43:25 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 25 Oct 2007 11:43:25 -0700 Subject: [ofa-general] [PATCH 2/6] Move switch map out of infiniband-diags and into ibcommon Message-ID: <20071025114325.68c6c3f5.weiny2@llnl.gov> >From 65777220855baea12d4b7f961daca85765614f4a Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Fri, 19 Oct 2007 11:43:01 -0700 Subject: [PATCH] Move switch map out of infiniband-diags and into ibcommon Signed-off-by: Ira K. Weiny --- infiniband-diags/configure.in | 26 --------- infiniband-diags/include/ibdiag_common.h | 13 ----- infiniband-diags/src/ibdiag_common.c | 82 ----------------------------- libibcommon/configure.in | 26 +++++++++ libibcommon/include/infiniband/common.h | 15 +++++ libibcommon/src/libibcommon.map | 4 ++ libibcommon/src/util.c | 83 ++++++++++++++++++++++++++++++ 7 files changed, 128 insertions(+), 121 deletions(-) diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in index 95c7b34..a24d478 100644 --- a/infiniband-diags/configure.in +++ b/infiniband-diags/configure.in @@ -72,32 +72,6 @@ AC_CHECK_FUNCS([strchr strrchr strtol strtoul memset]) dnl Checks for typedefs, structures, and compiler characteristics. 
AC_C_CONST -dnl Check for the specification of a default switch map file -AC_MSG_CHECKING(for --with-switch-map ) -AC_ARG_WITH(switch-map, - AC_HELP_STRING([--with-switch-map=file], - [define a default switch map file]), - [ case "$withval" in - no) - ;; - *) - withswitchmap=yes - SWITCHMAPFILE=$withval - ;; - esac ] -) -AC_MSG_RESULT(${withswitchmap=no}) - -if test $withswitchmap = "yes"; then - SWITCHMAP_TMP1="`eval echo ${sysconfdir}/$SWITCHMAPFILE`" - SWITCHMAP_TMP2="`echo $SWITCHMAP_TMP1 | sed 's/^NONE/$ac_default_prefix/'`" - SWITCHMAP="`eval echo $SWITCHMAP_TMP2`" - - AC_DEFINE_UNQUOTED(HAVE_DEFAULT_SWITCH_MAP, - ["$SWITCHMAP"], - [Define a default switch map file]) -fi - dnl Check for perl and perl install location AC_MSG_CHECKING(for --with-perl-path ) AC_ARG_WITH(perl-path, diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h index 159e929..029d80e 100644 --- a/infiniband-diags/include/ibdiag_common.h +++ b/infiniband-diags/include/ibdiag_common.h @@ -45,16 +45,6 @@ extern int ibdebug; /* External interface */ /*========================================================*/ -/** - * Switch map interface. - * It is OK to pass NULL for the switch_map[_fp] parameters. - */ -FILE *open_switch_map(char *switch_map); -void close_switch_map(FILE *switch_map_fp); -char *lookup_switch_name(FILE *switch_map_fp, uint64_t target_guid, - char *nodedesc); - /* NOTE: parameter "nodedesc" may be modified here. */ - #undef DEBUG #define DEBUG if (ibdebug || verbose) IBWARN #define VERBOSE if (ibdebug || verbose > 1) IBWARN @@ -62,9 +52,6 @@ char *lookup_switch_name(FILE *switch_map_fp, uint64_t target_guid, void iberror(const char *fn, char *msg, ...); -/* NOTE: this modifies the parameter "nodedesc". 
*/ -char *clean_nodedesc(char *nodedesc); - #ifdef __BUILD_VERSION_TAG__ #define stringify(s) to_string(s) diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index bfddfd7..68e90b2 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -51,73 +51,6 @@ int ibdebug; -FILE * -open_switch_map(char *switch_map) -{ - FILE *rc = NULL; - - if (switch_map != NULL) { - rc = fopen(switch_map, "r"); - if (rc == NULL) { - fprintf(stderr, - "WARNING failed to open switch map \"%s\" (%s)\n", - switch_map, strerror(errno)); - } -#ifdef HAVE_DEFAULT_SWITCH_MAP - } else { - rc = fopen(HAVE_DEFAULT_SWITCH_MAP, "r"); -#endif /* HAVE_DEFAULT_SWITCH_MAP */ - } - return (rc); -} - -void -close_switch_map(FILE *fp) -{ - if (fp) - fclose(fp); -} - -char * -lookup_switch_name(FILE *switch_map_fp, uint64_t target_guid, char *nodedesc) -{ -#define NAME_LEN (256) - char *line = NULL; - size_t len = 0; - uint64_t guid = 0; - char *rc = NULL; - int line_count = 0; - - if (switch_map_fp == NULL) - goto done; - - rewind(switch_map_fp); - for (line_count = 1; - getline(&line, &len, switch_map_fp) != -1; - line_count++) { - line[len-1] = '\0'; - if (line[0] == '#') - goto next_one; - char *guid_str = strtok(line, "\"#"); - char *name = strtok(NULL, "\"#"); - if (!guid_str || !name) - goto next_one; - guid = strtoull(guid_str, NULL, 0); - if (target_guid == guid) { - rc = strdup(name); - free (line); - goto done; - } -next_one: - free (line); - line = NULL; - } -done: - if (rc == NULL) - rc = strdup(clean_nodedesc(nodedesc)); - return (rc); -} - void iberror(const char *fn, char *msg, ...) { @@ -140,18 +73,3 @@ iberror(const char *fn, char *msg, ...) 
exit(-1); } - -char * -clean_nodedesc(char *nodedesc) -{ - int i = 0; - - nodedesc[63] = '\0'; - while (nodedesc[i]) { - if (!isprint(nodedesc[i])) - nodedesc[i] = ' '; - i++; - } - - return (nodedesc); -} diff --git a/libibcommon/configure.in b/libibcommon/configure.in index 2e896a0..c9dcf78 100644 --- a/libibcommon/configure.in +++ b/libibcommon/configure.in @@ -48,5 +48,31 @@ AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") +dnl Check for the specification of a default switch map file +AC_MSG_CHECKING(for --with-switch-map ) +AC_ARG_WITH(switch-map, + AC_HELP_STRING([--with-switch-map=file], + [define a default switch map file]), + [ case "$withval" in + no) + ;; + *) + withswitchmap=yes + SWITCHMAPFILE=$withval + ;; + esac ] +) +AC_MSG_RESULT(${withswitchmap=no}) + +if test $withswitchmap = "yes"; then + SWITCHMAP_TMP1="`eval echo ${sysconfdir}/$SWITCHMAPFILE`" + SWITCHMAP_TMP2="`echo $SWITCHMAP_TMP1 | sed 's/^NONE/$ac_default_prefix/'`" + SWITCHMAP="`eval echo $SWITCHMAP_TMP2`" + + AC_DEFINE_UNQUOTED(HAVE_DEFAULT_SWITCH_MAP, + ["$SWITCHMAP"], + [Define a default switch map file]) +fi + AC_CONFIG_FILES([Makefile libibcommon.spec]) AC_OUTPUT diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h index 4eb3872..bd78f41 100644 --- a/libibcommon/include/infiniband/common.h +++ b/libibcommon/include/infiniband/common.h @@ -126,6 +126,21 @@ void logmsg(const char *const fn, char *msg, ...) IBCOMMON_STRICT_FORMAT; void xdump(FILE *file, char *msg, void *p, int size); +/* NOTE: this modifies the parameter "nodedesc". */ +char *clean_nodedesc(char *nodedesc); + +/** + * Switch map interface. + * It is OK to pass NULL for the switch_map[_fp] parameters. 
+ */ +FILE *open_switch_map(char *switch_map); +void close_switch_map(FILE *switch_map_fp); +char *lookup_switch_name(FILE *switch_map_fp, uint64_t target_guid, + char *nodedesc); + /* NOTE: parameter "nodedesc" may be modified here. + * return pointer must be free'd by caller + */ + /* sysfs.c: /sys utilities */ int sys_read_string(char *dir_name, char *file_name, char *str, int max_len); int sys_read_guid(char *dir_name, char *file_name, uint64_t *net_guid); diff --git a/libibcommon/src/libibcommon.map b/libibcommon/src/libibcommon.map index 96ce2d8..afd8e6d 100644 --- a/libibcommon/src/libibcommon.map +++ b/libibcommon/src/libibcommon.map @@ -13,5 +13,9 @@ IBCOMMON_1.0 { ibpanic; ibwarn; xdump; + clean_nodedesc; + open_switch_map; + close_switch_map; + lookup_switch_name; local: *; }; diff --git a/libibcommon/src/util.c b/libibcommon/src/util.c index 7da967e..e2f45f4 100644 --- a/libibcommon/src/util.c +++ b/libibcommon/src/util.c @@ -133,3 +133,86 @@ xdump(FILE *file, char *msg, void *p, int size) fputc('\n', file); } } + +char * +clean_nodedesc(char *nodedesc) +{ + int i = 0; + + nodedesc[63] = '\0'; + while (nodedesc[i]) { + if (!isprint(nodedesc[i])) + nodedesc[i] = ' '; + i++; + } + + return (nodedesc); +} + +FILE * +open_switch_map(char *switch_map) +{ + FILE *rc = NULL; + + if (switch_map != NULL) { + rc = fopen(switch_map, "r"); + if (rc == NULL) { + fprintf(stderr, + "WARNING failed to open switch map \"%s\" (%s)\n", + switch_map, strerror(errno)); + } +#ifdef HAVE_DEFAULT_SWITCH_MAP + } else { + rc = fopen(HAVE_DEFAULT_SWITCH_MAP, "r"); +#endif /* HAVE_DEFAULT_SWITCH_MAP */ + } + return (rc); +} + +void +close_switch_map(FILE *fp) +{ + if (fp) + fclose(fp); +} + +char * +lookup_switch_name(FILE *switch_map_fp, uint64_t target_guid, char *nodedesc) +{ +#define NAME_LEN (256) + char *line = NULL; + size_t len = 0; + uint64_t guid = 0; + char *rc = NULL; + int line_count = 0; + + if (switch_map_fp == NULL) + goto done; + + rewind(switch_map_fp); + for 
(line_count = 1; + getline(&line, &len, switch_map_fp) != -1; + line_count++) { + line[len-1] = '\0'; + if (line[0] == '#') + goto next_one; + char *guid_str = strtok(line, "\"#"); + char *name = strtok(NULL, "\"#"); + if (!guid_str || !name) + goto next_one; + guid = strtoull(guid_str, NULL, 0); + if (target_guid == guid) { + rc = strdup(name); + free (line); + goto done; + } +next_one: + free (line); + line = NULL; + } +done: + if (rc == NULL) + rc = strdup(clean_nodedesc(nodedesc)); + return (rc); +} + -- 1.5.1 -------------- next part -------------- A non-text attachment was scrubbed... Name: 0002-Move-switch-map-out-of-infiniband-diags-and-into-ibc.patch Type: application/octet-stream Size: 9346 bytes Desc: not available URL: From weiny2 at llnl.gov Thu Oct 25 11:43:32 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 25 Oct 2007 11:43:32 -0700 Subject: [ofa-general] [PATCH 3/6] Improve the switch_map by storing the map file in memory for faster lookups Message-ID: <20071025114332.0b05fc48.weiny2@llnl.gov> >From e11fd56785ed646018735f990346023e733b5358 Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Fri, 19 Oct 2007 16:25:26 -0700 Subject: [PATCH] Improve the switch_map by storing the map file in memory for faster lookups This also makes lookups thread safe as long as "free_switch_map" is not called. Signed-off-by: Ira K. 
Weiny --- infiniband-diags/src/ibnetdiscover.c | 28 ++---- infiniband-diags/src/ibtracert.c | 28 ++---- infiniband-diags/src/saquery.c | 12 +- infiniband-diags/src/smpquery.c | 12 +- libibcommon/Makefile.am | 2 +- libibcommon/include/infiniband/common.h | 41 +++++--- libibcommon/src/libibcommon.map | 4 +- libibcommon/src/switch_map.c | 155 +++++++++++++++++++++++++++++++ libibcommon/src/util.c | 83 ---------------- 9 files changed, 214 insertions(+), 151 deletions(-) create mode 100644 libibcommon/src/switch_map.c diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index e627e84..2857117 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -91,8 +91,8 @@ static FILE *f; char *argv0 = "ibnetdiscover"; -static char *switch_map = NULL; -static FILE *switch_map_fp = NULL; +static char *switch_map_name = NULL; +static sw_map_t *switch_map = NULL; Node *nodesdist[MAXHOPS+1]; /* last is Ca list */ Node *mynode; @@ -462,7 +462,7 @@ list_node(Node *node) char *nodename = NULL; if (node->type == SWITCH_NODE) - nodename = lookup_switch_name(switch_map_fp, node->nodeguid, + nodename = lookup_switch_name(switch_map, node->nodeguid, node->nodedesc); else nodename = clean_nodedesc(node->nodedesc); @@ -484,9 +484,6 @@ list_node(Node *node) node_type, node->nodeguid, node->numports, node->devid, node->vendid, nodename); - - if (nodename && (node->type == SWITCH_NODE)) - free(nodename); } void @@ -542,7 +539,7 @@ out_switch(Node *node, int group, char *chname) } if (node->type == SWITCH_NODE) - nodename = lookup_switch_name(switch_map_fp, node->nodeguid, + nodename = lookup_switch_name(switch_map, node->nodeguid, node->nodedesc); else nodename = clean_nodedesc(node->nodedesc); @@ -551,8 +548,6 @@ out_switch(Node *node, int group, char *chname) nodename, node->smaenhsp0 ? 
"enhanced" : "base", node->smalid, node->smalmc); - if (nodename && (node->type == SWITCH_NODE)) - free(nodename); } void @@ -613,7 +608,7 @@ out_switch_port(Port *port, int group) fprintf(f, "%s", ext_port_str); if (port->remoteport->node->type == SWITCH_NODE) - rem_nodename = lookup_switch_name(switch_map_fp, + rem_nodename = lookup_switch_name(switch_map, port->remoteport->node->nodeguid, port->remoteport->node->nodedesc); else @@ -637,9 +632,6 @@ out_switch_port(Port *port, int group) else if (is_xsigo_hca(port->remoteport->portguid)) fprintf(f, " (scp)"); fprintf(f, "\n"); - - if (rem_nodename && (port->remoteport->node->type == SWITCH_NODE)) - free(rem_nodename); } void @@ -661,7 +653,7 @@ out_ca_port(Port *port, int group) fprintf(f, " (%" PRIx64 ") ", port->remoteport->portguid); if (port->remoteport->node->type == SWITCH_NODE) - rem_nodename = lookup_switch_name(switch_map_fp, + rem_nodename = lookup_switch_name(switch_map, port->remoteport->node->nodeguid, port->remoteport->node->nodedesc); else @@ -671,8 +663,6 @@ out_ca_port(Port *port, int group) port->remoteport->node->type == SWITCH_NODE ? 
port->remoteport->node->smalid : port->remoteport->lid, get_linkwidth_str(port->linkwidth), get_linkspeed_str(port->linkspeed)); - if (rem_nodename && (port->remoteport->node->type == SWITCH_NODE)) - free(rem_nodename); } int @@ -902,7 +892,7 @@ main(int argc, char **argv) break; switch(ch) { case 1: - switch_map = strdup(optarg); + switch_map_name = strdup(optarg); break; case 'C': ca = optarg; @@ -959,7 +949,7 @@ main(int argc, char **argv) IBERROR("can't open file %s for writing", argv[0]); madrpc_init(ca, ca_port, mgmt_classes, 2); - switch_map_fp = open_switch_map(switch_map); + switch_map = create_switch_map(switch_map_name); if (discover(&my_portid) < 0) IBERROR("discover"); @@ -969,6 +959,6 @@ main(int argc, char **argv) dump_topology(list, group); - close_switch_map(switch_map_fp); + free_switch_map(switch_map); exit(0); } diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c index e553f4f..aee0b1c 100644 --- a/infiniband-diags/src/ibtracert.c +++ b/infiniband-diags/src/ibtracert.c @@ -70,8 +70,8 @@ static FILE *f; char *argv0 = "ibtracert"; -static char *switch_map = NULL; -static FILE *switch_map_fp = NULL; +static char *switch_map_name = NULL; +static sw_map_t *switch_map = NULL; typedef struct Port Port; typedef struct Switch Switch; @@ -205,7 +205,7 @@ dump_endnode(int dump, char *prompt, Node *node, Port *port) } if (node->type == IB_NODE_SWITCH) - nodename = lookup_switch_name(switch_map_fp, node->nodeguid, node->nodedesc); + nodename = lookup_switch_name(switch_map, node->nodeguid, node->nodedesc); else nodename = clean_nodedesc(node->nodedesc); @@ -215,9 +215,6 @@ dump_endnode(int dump, char *prompt, Node *node, Port *port) node->nodeguid, node->type == IB_NODE_SWITCH ? 
0 : port->portnum, port->lid, port->lid + (1 << port->lmc) - 1, nodename); - - if (nodename && (node->type == IB_NODE_SWITCH)) - free(nodename); } static void @@ -229,7 +226,7 @@ dump_route(int dump, Node *node, int outport, Port *port) return; if (node->type == IB_NODE_SWITCH) - nodename = lookup_switch_name(switch_map_fp, node->nodeguid, node->nodedesc); + nodename = lookup_switch_name(switch_map, node->nodeguid, node->nodedesc); else nodename = clean_nodedesc(node->nodedesc); @@ -243,9 +240,6 @@ dump_route(int dump, Node *node, int outport, Port *port) port->portguid, port->portnum, port->lid, port->lid + (1 << port->lmc) - 1, nodename); - - if (nodename && (node->type == IB_NODE_SWITCH)) - free(nodename); } static int @@ -645,7 +639,7 @@ dump_mcpath(Node *node, int dumplevel) dump_mcpath(node->upnode, dumplevel); if (node->type == IB_NODE_SWITCH) - nodename = lookup_switch_name(switch_map_fp, node->nodeguid, node->nodedesc); + nodename = lookup_switch_name(switch_map, node->nodeguid, node->nodedesc); else nodename = clean_nodedesc(node->nodedesc); @@ -655,7 +649,7 @@ dump_mcpath(Node *node, int dumplevel) node->nodeguid, node->ports->portnum, node->ports->lid, node->ports->lid + (1 << node->ports->lmc) - 1, nodename); - goto free_name; + return; } if (node->dist) { @@ -679,10 +673,6 @@ dump_mcpath(Node *node, int dumplevel) node->nodeguid, node->ports->portnum, node->ports->lid, node->ports->lid + (1 << node->ports->lmc) - 1, nodename); - -free_name: - if (nodename && (node->type == IB_NODE_SWITCH)) - free(nodename); } static void @@ -752,7 +742,7 @@ main(int argc, char **argv) break; switch(ch) { case 1: - switch_map = strdup(optarg); + switch_map_name = strdup(optarg); break; case 'C': ca = optarg; @@ -810,7 +800,7 @@ main(int argc, char **argv) usage(); madrpc_init(ca, ca_port, mgmt_classes, 3); - switch_map_fp = open_switch_map(switch_map); + switch_map = create_switch_map(switch_map_name); if (ib_resolve_portid_str(&src_portid, argv[0], dest_type, sm_id) < 
0) IBERROR("can't resolve source port %s", argv[0]); @@ -849,6 +839,6 @@ main(int argc, char **argv) /* dump multicast path */ dump_mcpath(endnode, dumplevel); - close_switch_map(switch_map_fp); + free_switch_map(switch_map); exit(0); } diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index e17ec5a..f76fa41 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -59,8 +59,8 @@ char *argv0 = "saquery"; -static char *switch_map = NULL; -static FILE *switch_map_fp = NULL; +static char *switch_map_name = NULL; +static sw_map_t *switch_map = NULL; /** * Declare some globals because I don't want this to be too complex. @@ -137,7 +137,7 @@ print_node_record(ib_node_record_t *node_record) case NAME_OF_LID: case NAME_OF_GUID: if (p_ni->node_type == IB_NODE_TYPE_SWITCH) - name = lookup_switch_name(switch_map_fp, + name = lookup_switch_name(switch_map, cl_ntoh64(p_ni->node_guid), (char *)p_nd->description); else @@ -1144,7 +1144,7 @@ main(int argc, char **argv) break; } case 2: - switch_map = strdup(optarg); + switch_map_name = strdup(optarg); break; case 'p': query_type = IB_MAD_ATTR_PATH_RECORD; @@ -1249,7 +1249,7 @@ main(int argc, char **argv) } bind_handle = get_bind_handle(); - switch_map_fp = open_switch_map(switch_map); + switch_map = create_switch_map(switch_map_name); switch (query_type) { case IB_MAD_ATTR_NODE_RECORD: @@ -1295,6 +1295,6 @@ main(int argc, char **argv) if (dst) free(dst); clean_up(); - close_switch_map(switch_map_fp); + free_switch_map(switch_map); return (status); } diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c index 73e880b..29ba9c2 100644 --- a/infiniband-diags/src/smpquery.c +++ b/infiniband-diags/src/smpquery.c @@ -84,8 +84,8 @@ static const match_rec_t match_tbl[] = { }; char *argv0 = "smpquery"; -static char *switch_map = NULL; -static FILE *switch_map_fp = NULL; +static char *switch_map_name = NULL; +static sw_map_t *switch_map = NULL; 
/*******************************************/ static char * @@ -108,7 +108,7 @@ node_desc(ib_portid_t *dest, char **argv, int argc) return "node desc query failed"; if (node_type == IB_NODE_SWITCH) - nodename = lookup_switch_name(switch_map_fp, node_guid, nd); + nodename = lookup_switch_name(switch_map, node_guid, nd); else nodename = clean_nodedesc(nd); @@ -458,7 +458,7 @@ main(int argc, char **argv) break; switch(ch) { case 1: - switch_map = strdup(optarg); + switch_map_name = strdup(optarg); break; case 'd': ibdebug++; @@ -514,7 +514,7 @@ main(int argc, char **argv) IBERROR("operation '%s' not supported", argv[0]); madrpc_init(ca, ca_port, mgmt_classes, 3); - switch_map_fp = open_switch_map(switch_map); + switch_map = create_switch_map(switch_map_name); if (dest_type != IB_DEST_DRSLID) { if (ib_resolve_portid_str(&portid, argv[1], dest_type, sm_id) < 0) @@ -531,6 +531,6 @@ main(int argc, char **argv) if ((err = fn(&portid, argv+3, argc-3))) IBERROR("operation %s: %s", argv[0], err); } - close_switch_map(switch_map_fp); + free_switch_map(switch_map); exit(0); } diff --git a/libibcommon/Makefile.am b/libibcommon/Makefile.am index af60035..8d54437 100644 --- a/libibcommon/Makefile.am +++ b/libibcommon/Makefile.am @@ -13,7 +13,7 @@ else libibcommon_version_script = endif -libibcommon_la_SOURCES = src/stack.c src/sysfs.c src/util.c src/time.c src/hash.c +libibcommon_la_SOURCES = src/stack.c src/sysfs.c src/util.c src/time.c src/hash.c src/switch_map.c libibcommon_la_LDFLAGS = -version-info $(ibcommon_api_version) \ -export-dynamic $(libibcommon_version_script) libibcommon_la_DEPENDENCIES = $(srcdir)/src/libibcommon.map diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h index bd78f41..f8e8549 100644 --- a/libibcommon/include/infiniband/common.h +++ b/libibcommon/include/infiniband/common.h @@ -126,21 +126,6 @@ void logmsg(const char *const fn, char *msg, ...) 
IBCOMMON_STRICT_FORMAT; void xdump(FILE *file, char *msg, void *p, int size); -/* NOTE: this modifies the parameter "nodedesc". */ -char *clean_nodedesc(char *nodedesc); - -/** - * Switch map interface. - * It is OK to pass NULL for the switch_map[_fp] parameters. - */ -FILE *open_switch_map(char *switch_map); -void close_switch_map(FILE *switch_map_fp); -char *lookup_switch_name(FILE *switch_map_fp, uint64_t target_guid, - char *nodedesc); - /* NOTE: parameter "nodedesc" may be modified here. - * return pointer must be free'd by caller - */ - /* sysfs.c: /sys utilities */ int sys_read_string(char *dir_name, char *file_name, char *str, int max_len); int sys_read_guid(char *dir_name, char *file_name, uint64_t *net_guid); @@ -158,6 +143,32 @@ uint64_t getcurrenttime(void); /* hash.c */ uint32_t fhash(uint8_t *k, int length, uint32_t initval); + +/* switch_map.c */ +typedef struct _sw_name_ent { + uint64_t guid; + char *name; +} sw_name_ent_t; +typedef struct _switch_map { + FILE *fp; + int num; + sw_name_ent_t names[1]; /* MUST BE LAST */ +} sw_map_t; + +/* + * create and free ARE NOT thread safe + * However lookup_switch_name IS thread safe as long as free is not called + * during lookup. + */ +sw_map_t *create_switch_map(char *switch_map_name); +void free_switch_map(sw_map_t *map); +char *lookup_switch_name(sw_map_t *map, + uint64_t target_guid, + char *nodedesc /* "nodedesc" may be modified */ + ); +/* NOTE: this modifies the parameter "nodedesc". 
*/ +char *clean_nodedesc(char *nodedesc); + END_C_DECLS #endif /* __COMMON_H__ */ diff --git a/libibcommon/src/libibcommon.map b/libibcommon/src/libibcommon.map index afd8e6d..7415233 100644 --- a/libibcommon/src/libibcommon.map +++ b/libibcommon/src/libibcommon.map @@ -14,8 +14,8 @@ IBCOMMON_1.0 { ibwarn; xdump; clean_nodedesc; - open_switch_map; - close_switch_map; + create_switch_map; + free_switch_map; lookup_switch_name; local: *; }; diff --git a/libibcommon/src/switch_map.c b/libibcommon/src/switch_map.c new file mode 100644 index 0000000..37fd60f --- /dev/null +++ b/libibcommon/src/switch_map.c @@ -0,0 +1,155 @@ +/* + * Copyright (c) 2007 Lawrence Livermore National Laboratory (LLNL) + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#define _GNU_SOURCE + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include + +#include + + +char * +clean_nodedesc(char *nodedesc) +{ + int i = 0; + + nodedesc[63] = '\0'; + while (nodedesc[i]) { + if (!isprint(nodedesc[i])) + nodedesc[i] = ' '; + i++; + } + + return (nodedesc); +} + +static sw_map_t * +read_names(sw_map_t *map) +{ + char *line = NULL; + size_t len = 0; + + rewind(map->fp); + while (getline(&line, &len, map->fp) != -1) { + char *guid_str = NULL; + char *name = NULL; + line[len-1] = '\0'; + if (line[0] == '#') + goto next_one; + + guid_str = strtok(line, "\"#"); + name = strtok(NULL, "\"#"); + if (!guid_str || !name) + goto next_one; + + map->num++; + map = realloc(map, sizeof(*map) + (sizeof(sw_name_ent_t) * map->num)); + map->names[map->num -1].guid = strtoull(guid_str, NULL, 0); + map->names[map->num -1].name = strdup(name); +next_one: + free (line); + line = NULL; + } + + return (map); +} + +sw_map_t * +create_switch_map(char *switch_map) +{ + FILE *tmp_fp = NULL; + sw_map_t *rc = NULL; + + if (switch_map != NULL) { + tmp_fp = fopen(switch_map, "r"); + if (tmp_fp == NULL) { + fprintf(stderr, + "WARNING failed to open switch map \"%s\" (%s)\n", + switch_map, strerror(errno)); + } +#ifdef HAVE_DEFAULT_SWITCH_MAP + } else { + tmp_fp = fopen(HAVE_DEFAULT_SWITCH_MAP, "r"); +#endif /* HAVE_DEFAULT_SWITCH_MAP */ + } + if (!tmp_fp) + return (NULL); + + rc = malloc(sizeof(*rc)); + if (!rc) + return (NULL); + rc->fp = tmp_fp; + rc->num = 0; + rc = read_names(rc); + return (rc); +} + +void +free_switch_map(sw_map_t *map) +{ + int i = 0; + if (map == NULL) + return; + for (i = 0; i < map->num; i++) + 
free(map->names[i].name); + if (map->fp) + fclose(map->fp); + free(map); +} + +char * +lookup_switch_name(sw_map_t *map, uint64_t target_guid, char *nodedesc) +{ + int i = 0; + char *rc = NULL; + + if (map == NULL) + goto done; + + for (i = 0; i < map->num; i++) + if (map->names[i].guid == target_guid) + return (map->names[i].name); +done: + if (rc == NULL) + rc = clean_nodedesc(nodedesc); + return (rc); +} + diff --git a/libibcommon/src/util.c b/libibcommon/src/util.c index e2f45f4..7da967e 100644 --- a/libibcommon/src/util.c +++ b/libibcommon/src/util.c @@ -133,86 +133,3 @@ xdump(FILE *file, char *msg, void *p, int size) fputc('\n', file); } } - -char * -clean_nodedesc(char *nodedesc) -{ - int i = 0; - - nodedesc[63] = '\0'; - while (nodedesc[i]) { - if (!isprint(nodedesc[i])) - nodedesc[i] = ' '; - i++; - } - - return (nodedesc); -} - -FILE * -open_switch_map(char *switch_map) -{ - FILE *rc = NULL; - - if (switch_map != NULL) { - rc = fopen(switch_map, "r"); - if (rc == NULL) { - fprintf(stderr, - "WARNING failed to open switch map \"%s\" (%s)\n", - switch_map, strerror(errno)); - } -#ifdef HAVE_DEFAULT_SWITCH_MAP - } else { - rc = fopen(HAVE_DEFAULT_SWITCH_MAP, "r"); -#endif /* HAVE_DEFAULT_SWITCH_MAP */ - } - return (rc); -} - -void -close_switch_map(FILE *fp) -{ - if (fp) - fclose(fp); -} - -char * -lookup_switch_name(FILE *switch_map_fp, uint64_t target_guid, char *nodedesc) -{ -#define NAME_LEN (256) - char *line = NULL; - size_t len = 0; - uint64_t guid = 0; - char *rc = NULL; - int line_count = 0; - - if (switch_map_fp == NULL) - goto done; - - rewind(switch_map_fp); - for (line_count = 1; - getline(&line, &len, switch_map_fp) != -1; - line_count++) { - line[len-1] = '\0'; - if (line[0] == '#') - goto next_one; - char *guid_str = strtok(line, "\"#"); - char *name = strtok(NULL, "\"#"); - if (!guid_str || !name) - goto next_one; - guid = strtoull(guid_str, NULL, 0); - if (target_guid == guid) { - rc = strdup(name); - free (line); - goto done; - } 
-next_one: - free (line); - line = NULL; - } -done: - if (rc == NULL) - rc = strdup(clean_nodedesc(nodedesc)); - return (rc); -} - -- 1.5.1 -------------- next part -------------- A non-text attachment was scrubbed... Name: 0003-Improve-the-switch_map-by-storing-the-map-file-in-me.patch Type: application/octet-stream Size: 19331 bytes Desc: not available URL: From weiny2 at llnl.gov Thu Oct 25 11:43:36 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 25 Oct 2007 11:43:36 -0700 Subject: [ofa-general] [PATCH 4/6] Add switch-map support to OpenSM; using the "default" map. Message-ID: <20071025114336.7aed47a9.weiny2@llnl.gov> >From ba6cea679d745587bfe37e9be45d7491a5be9918 Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Tue, 23 Oct 2007 16:04:46 -0700 Subject: [PATCH] Add switch-map support to OpenSM; using the "default" map. Signed-off-by: Ira K. Weiny --- opensm/include/opensm/osm_node.h | 2 +- opensm/include/opensm/osm_opensm.h | 1 + opensm/include/opensm/osm_subnet.h | 1 + opensm/opensm/osm_node.c | 6 ++++++ opensm/opensm/osm_node_desc_rcv.c | 14 ++++++++++++-- opensm/opensm/osm_opensm.c | 4 ++++ 6 files changed, 25 insertions(+), 3 deletions(-) diff --git a/opensm/include/opensm/osm_node.h b/opensm/include/opensm/osm_node.h index f87e81d..8af5418 100644 --- a/opensm/include/opensm/osm_node.h +++ b/opensm/include/opensm/osm_node.h @@ -106,7 +106,7 @@ typedef struct _osm_node { ib_node_desc_t node_desc; uint32_t discovery_count; uint32_t physp_tbl_size; - char print_desc[IB_NODE_DESCRIPTION_SIZE + 1]; + char *print_desc; osm_physp_t physp_table[1]; } osm_node_t; /* diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h index 1ea1ec2..76e82d8 100644 --- a/opensm/include/opensm/osm_opensm.h +++ b/opensm/include/opensm/osm_opensm.h @@ -168,6 +168,7 @@ typedef struct _osm_opensm_t { struct osm_routing_engine routing_engine; osm_stats_t stats; osm_console_t console; + sw_map_t *switch_map; } osm_opensm_t; /* * FIELDS diff --git 
a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index dada8bf..573d506 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -297,6 +297,7 @@ typedef struct _osm_subn_opt { char *event_db_dump_file; #endif /* ENABLE_OSM_PERF_MGR */ char *event_plugin_name; + char *switch_map_name; } osm_subn_opt_t; /* * FIELDS diff --git a/opensm/opensm/osm_node.c b/opensm/opensm/osm_node.c index 645daa9..f34da1f 100644 --- a/opensm/opensm/osm_node.c +++ b/opensm/opensm/osm_node.c @@ -131,6 +131,7 @@ osm_node_t *osm_node_new(IN const osm_madw_t * const p_madw) osm_node_init_physp(p_node, p_madw); } + p_node->print_desc = ""; return (p_node); } @@ -146,6 +147,11 @@ static void osm_node_destroy(IN osm_node_t * p_node) */ for (i = 0; i < p_node->physp_tbl_size; i++) osm_physp_destroy(&p_node->physp_table[i]); + + /* cleanup printable node_desc field */ + if (p_node->print_desc) { + free(p_node->print_desc); + } } /********************************************************************** diff --git a/opensm/opensm/osm_node_desc_rcv.c b/opensm/opensm/osm_node_desc_rcv.c index d50883c..b050ae1 100644 --- a/opensm/opensm/osm_node_desc_rcv.c +++ b/opensm/opensm/osm_node_desc_rcv.c @@ -58,6 +58,7 @@ #include #include #include +#include #include /********************************************************************** @@ -67,13 +68,22 @@ __osm_nd_rcv_process_nd(IN const osm_nd_rcv_t * const p_rcv, IN osm_node_t * const p_node, IN const ib_node_desc_t * const p_nd) { + char *tmp_desc; + char print_desc[IB_NODE_DESCRIPTION_SIZE + 1]; + OSM_LOG_ENTER(p_rcv->p_log, __osm_nd_rcv_process_nd); memcpy(&p_node->node_desc.description, p_nd, sizeof(*p_nd)); /* also set up a printable version */ - memcpy(&p_node->print_desc, p_nd, sizeof(*p_nd)); - p_node->print_desc[IB_NODE_DESCRIPTION_SIZE] = '\0'; + memcpy(print_desc, p_nd, sizeof(*p_nd)); + print_desc[IB_NODE_DESCRIPTION_SIZE] = '\0'; + tmp_desc = 
lookup_switch_name(p_rcv->p_subn->p_osm->switch_map, + cl_ntoh64(osm_node_get_node_guid(p_node)), + print_desc); + + /* make a copy for this node to "own" */ + p_node->print_desc = strdup(tmp_desc); if (osm_log_is_active(p_rcv->p_log, OSM_LOG_VERBOSE)) { osm_log(p_rcv->p_log, OSM_LOG_VERBOSE, diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c index 5b45401..53f4f8b 100644 --- a/opensm/opensm/osm_opensm.c +++ b/opensm/opensm/osm_opensm.c @@ -183,6 +183,8 @@ void osm_opensm_destroy(IN osm_opensm_t * const p_osm) osm_subn_destroy(&p_osm->subn); cl_disp_destroy(&p_osm->disp); + free_switch_map(p_osm->switch_map); + cl_plock_destroy(&p_osm->lock); osm_log_destroy(&p_osm->log); @@ -310,6 +312,8 @@ osm_opensm_init(IN osm_opensm_t * const p_osm, goto Exit; } + p_osm->switch_map = create_switch_map(NULL); + Exit: osm_log(&p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: ]\n"); /* Format Waived */ return (status); -- 1.5.1 -------------- next part -------------- A non-text attachment was scrubbed... Name: 0004-Add-switch-map-support-to-OpenSM-using-the-default.patch Type: application/octet-stream Size: 4572 bytes Desc: not available URL: From weiny2 at llnl.gov Thu Oct 25 11:43:42 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 25 Oct 2007 11:43:42 -0700 Subject: [ofa-general] [PATCH 5/6] Add switch_map_name to opts file. Message-ID: <20071025114342.14c41bd7.weiny2@llnl.gov> >From 5f1db5f3444e21f3c78e42c047d1e440ac35ac66 Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 25 Oct 2007 09:31:21 -0700 Subject: [PATCH] Add switch_map_name to opts file. Signed-off-by: Ira K. 
Weiny --- opensm/opensm/osm_opensm.c | 2 +- opensm/opensm/osm_subnet.c | 10 ++++++++++ 2 files changed, 11 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c index 53f4f8b..5fa4383 100644 --- a/opensm/opensm/osm_opensm.c +++ b/opensm/opensm/osm_opensm.c @@ -312,7 +312,7 @@ osm_opensm_init(IN osm_opensm_t * const p_osm, goto Exit; } - p_osm->switch_map = create_switch_map(NULL); + p_osm->switch_map = create_switch_map(p_opt->switch_map_name); Exit: osm_log(&p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: ]\n"); /* Format Waived */ diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 829c82b..9bc6940 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -445,6 +445,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) #endif /* ENABLE_OSM_PERF_MGR */ p_opt->event_plugin_name = OSM_DEFAULT_EVENT_PLUGIN_NAME; + p_opt->switch_map_name = NULL; p_opt->dump_files_dir = getenv("OSM_TMP_DIR"); if (!p_opt->dump_files_dir || !(*p_opt->dump_files_dir)) @@ -1245,6 +1246,9 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_charp("event_plugin_name", p_key, p_val, &p_opts->event_plugin_name); + opts_unpack_charp("switch_map_name", + p_key, p_val, &p_opts->switch_map_name); + subn_parse_qos_options("qos", p_key, p_val, &p_opts->qos_options); @@ -1504,6 +1508,12 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "event_plugin_name %s\n\n", p_opts->event_plugin_name); fprintf(opts_file, + "#\n# Switch Map for mapping switch GUID's to more descirptive node descriptors\n" + "# (man ibnetdiscover for more information)\n#\n" + "switch_map_name %s\n\n", + p_opts->switch_map_name ? p_opts->switch_map_name : "(null)"); + + fprintf(opts_file, "#\n# DEBUG FEATURES\n#\n" "# The log flags used\n" "log_flags 0x%02x\n\n" -- 1.5.1
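[Editorial sketch] The switch_map_name option in this series names a text file that maps switch GUIDs to readable names. Judging from the parsing code in the libibcommon patch earlier in the series (strtok on '"' and '#', then strtoull with base 0), each entry appears to be a GUID followed by a quoted name. What follows is a minimal sketch of parsing one such line under that assumption, not the code from the patch: parse_switch_map_line is a hypothetical helper, and it rejects empty names, which the patch does not necessarily do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libibcommon.
 * Parses one line of the assumed form:   <guid> "<name>"
 * On success stores the GUID in *guid, a newly allocated copy of the
 * name in *name (caller frees it), and returns 0.
 * Returns -1 for comment lines, malformed lines, or allocation failure.
 */
int parse_switch_map_line(const char *line, uint64_t *guid, char **name)
{
    const char *open_quote;
    const char *close_quote;
    char *end;
    size_t n;

    assert(line != NULL && guid != NULL && name != NULL);

    /* Lines starting with '#' are comments. */
    if (line[0] == '#')
        return -1;

    /* The name is whatever sits between the first pair of quotes. */
    open_quote = strchr(line, '"');
    if (open_quote == NULL)
        return -1;
    close_quote = strchr(open_quote + 1, '"');
    if (close_quote == NULL || close_quote == open_quote + 1)
        return -1;

    /* Base 0 accepts 0x-prefixed hex as well as decimal. */
    *guid = strtoull(line, &end, 0);
    if (end == line)
        return -1; /* no digits before the quoted name */

    n = (size_t)(close_quote - open_quote - 1);
    *name = malloc(n + 1);
    if (*name == NULL)
        return -1;
    memcpy(*name, open_quote + 1, n);
    (*name)[n] = '\0';
    return 0;
}
```

A caller would read the file line by line, call the helper on each line, and skip lines for which it returns -1. Working on a const string and copying the name out avoids modifying the input buffer, at the cost of one allocation per entry.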
From weiny2 at llnl.gov Thu Oct 25 11:43:46 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 25 Oct 2007 11:43:46 -0700 Subject: [ofa-general] [PATCH 6/6] Allow for a special value of "(null)" in the opts file. Message-ID: <20071025114346.5902acc9.weiny2@llnl.gov> >From 3df0056cce46e521dc9f0ab07c55a41cef6f340c Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 25 Oct 2007 10:09:54 -0700 Subject: [PATCH] Allow for a special value of "(null)" in the opts file. Some string values are valid if they are "(null)". Special case this string so that it sets the pointer to NULL when read. Signed-off-by: Ira K. Weiny --- opensm/opensm/osm_subnet.c | 17 +++++++++++------ 1 files changed, 11 insertions(+), 6 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 9bc6940..32af508 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -630,12 +630,17 @@ opts_unpack_charp(IN char *p_req_key, printf(buff); cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0); - /* - Ignore the possible memory leak here; - the pointer may be to a static default. - */ - *p_val = (char *)malloc(strlen(p_val_str) + 1); - strcpy(*p_val, p_val_str); + /* special case the "(null)" string */ + if (strcmp("(null)", p_val_str) == 0) { + *p_val = NULL; + } else { + /* + Ignore the possible memory leak here; + the pointer may be to a static default. + */ + *p_val = (char *)malloc(strlen(p_val_str) + 1); + strcpy(*p_val, p_val_str); + } } } } -- 1.5.1
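[Editorial sketch] Patches 5/6 and 6/6 together let switch_map_name survive a round trip through the opts file: an unset option is written as the literal text "(null)", and that text is turned back into a NULL pointer when the file is read. The following is a minimal sketch of that convention only. opt_to_text and opt_from_text are hypothetical helpers, not OpenSM functions, and unlike the patch the reader checks the result of malloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: text to write to the opts file for an optional string. */
const char *opt_to_text(const char *val)
{
    return val ? val : "(null)";
}

/*
 * Hypothetical helper: inverse of opt_to_text.
 * Sets *out to NULL for the special text "(null)", otherwise to a newly
 * allocated copy of the text (caller frees it).
 * Returns 0 on success, -1 on allocation failure.
 */
int opt_from_text(const char *text, char **out)
{
    assert(text != NULL && out != NULL);

    if (strcmp(text, "(null)") == 0) {
        *out = NULL;
        return 0;
    }

    *out = malloc(strlen(text) + 1);
    if (*out == NULL)
        return -1;
    strcpy(*out, text);
    return 0;
}
```

One consequence of the convention is that an option whose real value is the string "(null)" cannot be represented; for a file name such as the switch map path that seems an acceptable trade-off.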
From Jeffrey.C.Becker at nasa.gov Thu Oct 25 11:54:28 2007 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Thu, 25 Oct 2007 11:54:28 -0700 Subject: [ofa-general] [Fwd: Re: Dropped OpenFabrics list messages (!)] Message-ID: <4720E664.4060306@nasa.gov> -------------- next part -------------- An embedded message was scrubbed... From: Jeff Becker Subject: Re: Dropped OpenFabrics list messages (!) Date: Thu, 25 Oct 2007 11:53:17 -0700 Size: 1244 URL: From mshefty at ichips.intel.com Thu Oct 25 12:08:08 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 25 Oct 2007 12:08:08 -0700 Subject: [ofa-general] [Fwd: Re: Dropped OpenFabrics list messages (!)] In-Reply-To: <4720E664.4060306@nasa.gov> References: <4720E664.4060306@nasa.gov> Message-ID: <4720E998.3080606@ichips.intel.com> > In the interest of curtailing the SPAM problem, I would also be in favor > of a only-subscribers-can-post policy. Is there really strong opposition > to this? Thanks. I'm guessing it would block a lot of feedback that we could get from other Linux maintainers. Personally I find it kind of annoying when others include subscriber only lists (interop-wg, dapl) on their posts, only to get bounce messages anytime I reply. I haven't seen the spam get that bad that I would resort to a subscriber only list at this point. - Sean From jsquyres at cisco.com Thu Oct 25 12:14:06 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 25 Oct 2007 15:14:06 -0400 Subject: [ofa-general] [Fwd: Re: Dropped OpenFabrics list messages (!)] In-Reply-To: <4720E998.3080606@ichips.intel.com> References: <4720E664.4060306@nasa.gov> <4720E998.3080606@ichips.intel.com> Message-ID: <8A5D87E1-F4E6-4FCD-912D-9A970C8C35CF@cisco.com> I have no problems with only-subscribers-can-post policies, but don't care enough to partake in the debate.
All I want is my posts to not be dropped. :-) Whitelisting could be a good start (I still have not been able to send out the teleconf info for next week!). On Oct 25, 2007, at 3:08 PM, Sean Hefty wrote: >> In the interest of curtailing the SPAM problem, I would also be in >> favor >> of a only-subscribers-can-post policy. Is there really strong >> opposition >> to this? Thanks. > > I'm guessing it would block a lot of feedback that we could get > from other Linux maintainers. > > Personally I find it kind of annoying when others include > subscriber only lists (interop-wg, dapl) on their posts, only to > get bounce messages anytime I reply. I haven't seen the spam get > that bad that I would resort to a subscriber only list at this point. > > - Sean -- Jeff Squyres Cisco Systems From rdreier at cisco.com Thu Oct 25 12:48:24 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 25 Oct 2007 12:48:24 -0700 Subject: [ofa-general] Re: [ewg] Re: Dropped OpenFabrics list messages (!) In-Reply-To: <4720E61D.6000906@nasa.gov> (Jeff Becker's message of "Thu, 25 Oct 2007 11:53:17 -0700") References: <4720E61D.6000906@nasa.gov> Message-ID: > In the interest of curtailing the SPAM problem, I would also be in favor > of a only-subscribers-can-post policy. Is there really strong opposition > to this? Thanks. I would definitely prefer to keep the list open. Making the list subscribers-only is really annoying to many potential contributors. I know personally that it is a huge turn-off when I fix a bug in some code and attempt to submit a patch and then get a bounce because the project's list is subscribers-only.
I'm not going to subscribe to some mailing list I'm not interested in just so that I can give a project my work -- and so the project loses out. - R. From dledford at redhat.com Thu Oct 25 13:39:18 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 25 Oct 2007 16:39:18 -0400 Subject: [ofa-general] Downloads missing from web site Message-ID: <1193344758.10336.130.camel@firewall.xsintricity.com> One of the things that would go a *long* way towards easing my integration of the openfabrics code into RHEL would be accessible downloads of the latest release of various software packages. I did a quick check of the openfabrics.org/downloads area to see what software releases are actually published and which aren't. These are the things I found. Packages with good downloads: rdmacm management cxgb3. Packages with downloads that could use some touchup: dapl. Packages without any downloads: ibverbs SDP utils mthca mlx4 ehca ipath. Packages that should have a download area and don't: too many to list. For those packages without any downloads available or without a download area, if there is another place I should be looking for the authoritative download source, please let me know. For dapl, the version on the tarballs includes a release number and it really shouldn't. For example, there is dapl-2.0.1-1.tar.gz. Having a release number appended to the tarball version is going to produce an extremely ugly looking release on my part, as it's going to end up being dapl-2.0.1-1-1.el5 or something similar. Not only that, but standard tarball practice is to have the tarball unpack into a directory named <package>-<version>, where version is the entirety of the number string between the package name and the archive extension, so the directory would need to be dapl-2.0.1-1.
-- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From rdreier at cisco.com Thu Oct 25 13:43:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 25 Oct 2007 13:43:35 -0700 Subject: [ofa-general] Re: [SPAM] - Re: [ewg] Re: Dropped OpenFabrics list messages (!) - Email found in subject In-Reply-To: <4720F64E.2010605@texmemsys.com> (Chris Dennett's message of "Thu, 25 Oct 2007 15:02:22 -0500") References: <4720E61D.6000906@nasa.gov> <4720F64E.2010605@texmemsys.com> Message-ID: > I don't know if this is possible, but why not just have list > subscribers bypass the SPAM filter. Those people that aren't > subscribers will get the usual filtering applied. Seems to me this > would keep the SPAM situation the same and only inconvenience those > non-subscribers that happen to trigger the SPAM filter. Again, I'm > not sure if this is technically feasible, but it doesn't seem like it > would be difficult to implement. That seems like a good idea. Another possibility to consider would be moving the general@ list to someplace like vger.kernel.org (if the kernel.org team is agreeable): they seem to be able to handle having linux-kernel be an open mailing list with very high traffic and minimal spam. - R.
From rdreier at cisco.com Thu Oct 25 13:51:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 25 Oct 2007 13:51:39 -0700 Subject: [ofa-general] Downloads missing from web site In-Reply-To: <1193344758.10336.130.camel@firewall.xsintricity.com> (Doug Ledford's message of "Thu, 25 Oct 2007 16:39:18 -0400") References: <1193344758.10336.130.camel@firewall.xsintricity.com> Message-ID: > Packages without any downloads: > ibverbs > mthca Hmm, I've been pretty conscientious about making releases of these, and wget http://www.openfabrics.org/downloads/libibverbs-1.1.1.tar.gz wget http://www.openfabrics.org/downloads/libmthca-1.0.4.tar.gz work fine for me (and I find the links just fine in http://www.openfabrics.org/downloads/). ...looks harder... Oh, I see, http://www.openfabrics.org/downloads.htm is now the messed up snapshot of the dynamically generated downloads page. I think that project definitely needs to be finished so that the downloads page is dynamically generated all the time. > mlx4 Yes, I need to actually release a version of this. I should be able to get to that process soon. - R. From anton at samba.org Thu Oct 25 14:25:38 2007 From: anton at samba.org (Anton Blanchard) Date: Thu, 25 Oct 2007 16:25:38 -0500 Subject: [ofa-general] [PATCH] Stop ib_fmr from contributing to the load average Message-ID: <20071025212538.GA27442@kryten> I noticed my machine was at a constant load average of 1. This was because ib_create_fmr_pool calls kthread_create but does not immediately wake the thread up. Change to using kthread_run so we enter ib_fmr_cleanup_thread, set TASK_INTERRUPTIBLE, then go to sleep. 
Signed-off-by: Anton Blanchard --- diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c index d7f6452..2218450 100644 --- a/drivers/infiniband/core/fmr_pool.c +++ b/drivers/infiniband/core/fmr_pool.c @@ -291,7 +291,7 @@ struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, atomic_set(&pool->flush_ser, 0); init_waitqueue_head(&pool->force_wait); - pool->thread = kthread_create(ib_fmr_cleanup_thread, + pool->thread = kthread_run(ib_fmr_cleanup_thread, pool, "ib_fmr(%s)", device->name); From jlentini at netapp.com Thu Oct 25 14:44:44 2007 From: jlentini at netapp.com (James Lentini) Date: Thu, 25 Oct 2007 17:44:44 -0400 (EDT) Subject: [ofa-general] [Fwd: Re: Dropped OpenFabrics list messages (!)] In-Reply-To: <4720E998.3080606@ichips.intel.com> References: <4720E664.4060306@nasa.gov> <4720E998.3080606@ichips.intel.com> Message-ID: On Thu, 25 Oct 2007, Sean Hefty wrote: > > In the interest of curtailing the SPAM problem, I would also be in favor > > of a only-subscribers-can-post policy. Is there really strong opposition > > to this? Thanks. > > I'm guessing it would block a lot of feedback that we could get from other > Linux maintainers. > > Personally I find it kind of annoying when others include subscriber only > lists (interop-wg, dapl) on their posts, only to get bounce messages anytime I > reply. I haven't seen the spam get that bad that I would resort to a > subscriber only list at this point. I agree with Sean. I wouldn't be in favor of making the list subscriber only. The amount of spam seems pretty low to me too. 
From sean.hefty at intel.com Thu Oct 25 15:12:36 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 25 Oct 2007 15:12:36 -0700 Subject: [ofa-general] [GIT PULL] 2.6.24: please pull rdma-dev.git for-roland branch Message-ID: <000f01c81754$261d1130$ff0da8c0@amr.corp.intel.com> Please pull from: git://git.openfabrics.org/~shefty/rdma-dev.git for-roland drivers/infiniband/core/multicast.c | 55 +++++++++++++++++++++++++++++------- 1 files changed, 45 insertions(+), 10 deletions(-) Sean Hefty (1): ib/multicast: report errors on multicast groups if pkeys change diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c index 1bc1fe6..107f170 100644 --- a/drivers/infiniband/core/multicast.c +++ b/drivers/infiniband/core/multicast.c @@ -73,11 +73,20 @@ struct mcast_device { }; enum mcast_state { - MCAST_IDLE, MCAST_JOINING, MCAST_MEMBER, + MCAST_ERROR, +}; + +enum mcast_group_state { + MCAST_IDLE, MCAST_BUSY, - MCAST_ERROR + MCAST_GROUP_ERROR, + MCAST_PKEY_EVENT +}; + +enum { + MCAST_INVALID_PKEY_INDEX = 0xFFFF }; struct mcast_member; @@ -93,9 +102,10 @@ struct mcast_group { struct mcast_member *last_join; int members[3]; atomic_t refcount; - enum mcast_state state; + enum mcast_group_state state; struct ib_sa_query *query; int query_id; + u16 pkey_index; }; struct mcast_member { @@ -378,9 +388,19 @@ static int fail_join(struct mcast_group *group, struct mcast_member *member, static void process_group_error(struct mcast_group *group) { struct mcast_member *member; - int ret; + int ret = 0; + u16 pkey_index; + + if (group->state == MCAST_PKEY_EVENT) + ret = ib_find_pkey(group->port->dev->device, + group->port->port_num, + be16_to_cpu(group->rec.pkey), &pkey_index); spin_lock_irq(&group->lock); + if (group->state == MCAST_PKEY_EVENT && !ret && + group->pkey_index == pkey_index) + goto out; + while (!list_empty(&group->active_list)) { member = list_entry(group->active_list.next, struct mcast_member, list); @@ -399,6 +419,7 @@ static void 
process_group_error(struct mcast_group *group) } group->rec.join_state = 0; +out: group->state = MCAST_BUSY; spin_unlock_irq(&group->lock); } @@ -415,9 +436,9 @@ static void mcast_work_handler(struct work_struct *work) retest: spin_lock_irq(&group->lock); while (!list_empty(&group->pending_list) || - (group->state == MCAST_ERROR)) { + (group->state != MCAST_BUSY)) { - if (group->state == MCAST_ERROR) { + if (group->state != MCAST_BUSY) { spin_unlock_irq(&group->lock); process_group_error(group); goto retest; @@ -494,12 +515,19 @@ static void join_handler(int status, struct ib_sa_mcmember_rec *rec, void *context) { struct mcast_group *group = context; + u16 pkey_index = MCAST_INVALID_PKEY_INDEX; if (status) process_join_error(group, status); else { + ib_find_pkey(group->port->dev->device, group->port->port_num, + be16_to_cpu(rec->pkey), &pkey_index); + spin_lock_irq(&group->port->lock); group->rec = *rec; + if (group->state == MCAST_BUSY && + group->pkey_index == MCAST_INVALID_PKEY_INDEX) + group->pkey_index = pkey_index; if (!memcmp(&mgid0, &group->rec.mgid, sizeof mgid0)) { rb_erase(&group->node, &group->port->table); mcast_insert(group->port, group, 1); @@ -539,6 +567,7 @@ static struct mcast_group *acquire_group(struct mcast_port *port, group->port = port; group->rec.mgid = *mgid; + group->pkey_index = MCAST_INVALID_PKEY_INDEX; INIT_LIST_HEAD(&group->pending_list); INIT_LIST_HEAD(&group->active_list); INIT_WORK(&group->work, mcast_work_handler); @@ -707,7 +736,8 @@ int ib_init_ah_from_mcmember(struct ib_device *device, u8 port_num, } EXPORT_SYMBOL(ib_init_ah_from_mcmember); -static void mcast_groups_lost(struct mcast_port *port) +static void mcast_groups_event(struct mcast_port *port, + enum mcast_group_state state) { struct mcast_group *group; struct rb_node *node; @@ -721,7 +751,8 @@ static void mcast_groups_lost(struct mcast_port *port) atomic_inc(&group->refcount); queue_work(mcast_wq, &group->work); } - group->state = MCAST_ERROR; + if (group->state != 
MCAST_GROUP_ERROR) + group->state = state; spin_unlock(&group->lock); } spin_unlock_irqrestore(&port->lock, flags); @@ -731,16 +762,20 @@ static void mcast_event_handler(struct ib_event_handler *handler, struct ib_event *event) { struct mcast_device *dev; + int index; dev = container_of(handler, struct mcast_device, event_handler); + index = event->element.port_num - dev->start_port; switch (event->event) { case IB_EVENT_PORT_ERR: case IB_EVENT_LID_CHANGE: case IB_EVENT_SM_CHANGE: case IB_EVENT_CLIENT_REREGISTER: - mcast_groups_lost(&dev->port[event->element.port_num - - dev->start_port]); + mcast_groups_event(&dev->port[index], MCAST_GROUP_ERROR); + break; + case IB_EVENT_PKEY_CHANGE: + mcast_groups_event(&dev->port[index], MCAST_PKEY_EVENT); break; default: break; From dledford at redhat.com Thu Oct 25 15:50:11 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 25 Oct 2007 18:50:11 -0400 Subject: [ofa-general] Re: Input on the new OFED package In-Reply-To: <4720C40E.1030708@mellanox.co.il> References: <4720C40E.1030708@mellanox.co.il> Message-ID: <1193352611.10336.215.camel@firewall.xsintricity.com> On Thu, 2007-10-25 at 18:27 +0200, Tziporet Koren wrote: > Hi Doug, > Since we updated OFED install and user space packages according to your > request I would like to get input to see if its ease your integration > and if you have more input to give us. > > Thanks, > Tziporet It's definitely better being all split up, but I'm still going to have to replace at least some of the spec files wholesale (well, currently all of them). I'll pick on the ibutils spec file as my example and I'm cc:ing this to the list so I don't need to do this over and over again. [ I'm not pasting the big copyright notice ] For both RHEL and Fedora, the spec file is generally considered Red Hat property and it is always under GPL regardless of the license on the software package itself. We usually both write and maintain our spec files ourselves, so this is OK. 
But, that means that if you guys want me to use your spec files, then I can't import them with this multi-license copyright. What's more, even if they get imported, they will be imported once and only once. From that point on, they will be maintained internally. This is so that when the next update rolls around, we don't lose the internal history of various builds we did between upstream releases.

So, that being said, you guys are free to ignore the rest of this email if you don't care if I ever import your spec files. However, even if I don't import your spec files, the things I point out are good spec file practice *anyway*, so you might care just for reasons of spec file quality.

# Disable debugging
%define __check_files %{nil}
# Disable brp-lib64-linux
%ifarch x86_64 ia64
%define __arch_install_post %{nil}
%endif

Why? If you don't have a good reason to disable debug packages or the __arch_install_post, then don't.

%{!?_prefix: %define _prefix /usr}

%{_prefix} is never not defined, so this is a useless statement.

%{!?configure_options: %define configure_options %{nil}}
%define build_ibmgtsim %(if ( echo %{configure_options} | grep "enable-ibmgtsim" > /dev/null ); then echo -n '1'; else echo -n '0'; fi)

This is a horrible construct. If you really want the ability to pass in a configure option to rpmbuild to control this, then define a specific option to pass and do something like:

%{?build_ibmgtsim: %define ibmgtsim --enable-ibmgtsim}

and then in the actual call to ./configure pass an option of

%{?ibmgtsim}

That will save a shell/grep spawn and is much cleaner.

Name: %{?_name:%{_name}}%{!?_name:ibutils}

It's never valid to play games with the name of the package. If you were to change the name to, say, ibutils-gen2, then you would need to add a Conflicts: ibutils or Obsoletes: ibutils to the spec file to avoid rpm conflicts.
Since the spec file needs a history of all the previous names you are obsoleting, random name changes aren't really possible nor should they be done lightly.

License: GPL/BSD

Although this is a specific requirement of Fedora and not other distros, it can't hurt to do for other distros as it clearly identifies and differentiates the different license scenarios at a glance. The license tag should specify all the possible valid license scenarios, and there is a list of accepted abbreviations for the various licenses that exist. In the case of this package, it is available under either the GPL or BSD license (which is distinctly different from packages that have some files under one license and others under another, in which case both licenses apply all the time). The proper license tag depends on the exact GPL and BSD licenses in use. For example, if the GPL license in use is version 2 or later, then the abbreviation would be GPLv2+. If the BSD option is the original BSD license that has an advertising clause, it would be BSD with advertising. The situation of the multiple license, aka whether your choice of one applies or all apply simultaneously, is differentiated by the use of the words and/or between clause abbreviations. So the correct license for this package would be (assuming I have the right BSD type, it could be another BSD type):

License: GPLv2+ or BSD with advertising

See http://fedoraproject.org/wiki/Licensing for a complete list of the license abbreviations to date.

Url: http://openib.org/downloads/%{name}-%{version}.tar.gz
Source: http://www.openfabrics.org/downloads/ibutils-1.2-0.4.ofed20070930.tar.gz

The URL should point not to the tarball itself, but to the project web page. Regardless though, you need to update the URL to openfabrics.org if that's your download site.

# Requires: opensm

It's correct that you should not need to specify a Requires: line for this package.
However, what you *should* have in there is a BuildRequires: that specifies the -devel packages you need in order to build this package. The Requires: portion is mostly automatically generated by rpm by looking up the ldd output of the binaries you package up.

%setup -n %{name}-%{version}

Shouldn't need the -n option if the tarball is packaged correctly.

###
### build
###
%build

OK, if someone can't figure out that %build is the build section and needs a big comment header, they need real help. The spec file is not a regular source code file. That comment will actually get embedded into the %prep script because everything between the %prep and the next section identifier in the spec file is considered part of the %prep script.

%configure %{configure_options} %{ppc64_configure_options} --enable-ibmgtsim

OK, so you went through the trouble to make the spec file horribly ugly with that option parsing stuff for enabling the ibmgtsim, then just hardcoded enabling it? Let's also jump back and take a look at something:

# Add ppc64 64 bit compile flages
%ifarch ppc64
%define ppc64_configure_options CFLAGS='-m64 -g -O2' CPPFLAGS='-m64 -g -O2' LDFLAGS='-m64 -g -O2 -L/usr/lib64/'
%else
%define ppc64_configure_options %{nil}
%endif

In this you are completely wiping out the default CFLAGS et al. It would be far preferable for you to append -m64 to the existing CFLAGS than it is to reset them entirely. However, the %build script is run as a standalone script, so if you attempt to set the actual CFLAGS environment variable earlier in the spec file, it won't get passed on to the %build script. Part of the %build macro itself includes startup scripting that sets the default CFLAGS environment variable.
So, you would really want something like:

%build
%ifarch ppc64
CFLAGS="$$CFLAGS -m64"
%configure CFLAGS="$$CFLAGS"

(you need the double $ in front of CFLAGS to get past the rpm option parsing and down to shell environment variable parsing)

# W/A for libtool issue: change libdir in all *.la files to point to ${RPM_BUILD_ROOT}/${libdir}
# This W/A should be removed in post install section
if [ -d ${RPM_BUILD_ROOT}/%{_prefix} ]; then
LA_FILES=$(find ${RPM_BUILD_ROOT}/%{_prefix} -type f -name '*.la')
for la_file in ${LA_FILES}
do
case ${la_file##*/} in
libibumad.la|libosmcomp.la|libopensm.la|libosmvendor.la)
perl -ni -e "s@(libdir=).*@\$1'${RPM_BUILD_ROOT}%{_libdir}'@; print" ${la_file}
perl -ni -e "s@ %{_libdir}@\ ${RPM_BUILD_ROOT}%{_libdir}@g; print" ${la_file}
;;
esac
done
fi

No, this isn't a workaround for libtool, it's a work around for building against opensm-libs that haven't been installed yet. Will absolutely get yanked from anything we do here. Use the BuildRequires: line in the spec file and install the opensm-devel package before building this one.

install -d $RPM_BUILD_ROOT/etc/profile.d
cat > $RPM_BUILD_ROOT/etc/profile.d/ibutils.sh << EOF
if ! echo \${PATH} | grep -q %{_prefix}/bin ; then
PATH=\${PATH}:%{_prefix}/bin
fi
EOF
cat > $RPM_BUILD_ROOT/etc/profile.d/ibutils.csh << EOF
if ( "\${path}" !~ *%{_prefix}/bin* ) then
set path = ( \$path %{_prefix}/bin )
endif
EOF

Writing files from within the spec file is generally frowned upon, but overlooked on small enough files. However, just below this section you have this:

case %{_prefix} in
/usr | /usr/)
;;
*)
echo "/etc/ld.so.conf.d/ibutils.conf" >> ibutils-files
;;
esac

Since the two shell environment files are not necessary when the _prefix is /usr, realistically, both the shell files and the ibutils.conf file should be under the case statement above and all three files should be added to ibutils-files if they are actually needed.
You should also only write the files under the case statement above so you don't have random files in the build root that aren't needed.

%clean
#Remove installed driver after rpm build finished
rm -rf $RPM_BUILD_DIR/%{name}-%{version}
rm -rf $RPM_BUILD_ROOT

###
### pre section
###
%pre

As mentioned before, this comment is now part of the %clean script.

%pre
###
### post section
###
%post

And this comment is now the entire %pre script that is stored in the actual rpm database after package installation. If you aren't using a section, such as %pre, then don't include it at all.

%post
if [ $1 = 1 ]; then # 1 : This package is being installed for the first time
/sbin/ldconfig
fi

Don't check to see if this is the first install in order to run ldconfig, run ldconfig unconditionally. It's entirely possible that an upgrade from ibutils-1.2 to ibutils-1.3 could update the minor version of the installed library and then ldconfig would have a stale cache.

# %{_libdir}/ibibdmcom.a
%{_libdir}/libibdm.a

Why some .a files and not others? Of course, if you are going to include a devel environment it should really be in a separate -devel sub package, and .a files should generally be in a separate -static package (not -devel-static BTW, devel is implied with -static), but that still leaves the issue of why some and not others.

%{_prefix}/include/ibdm

Should be %{_includedir}/ibdm

%{_libdir}/ibis1.2
%{_libdir}/ibdm1.2
%{_libdir}/ibdiagnet1.2
%{_libdir}/ibdiagpath1.2
%{_libdir}/ibdiagui1.2
%{_prefix}/include/ibdm

These are all directories that are new. In order to declare that a package owns the directory itself, and not just all the files under that directory, you also need a %dir statement for each of these. This results in rpm trying to remove the directory itself if you remove the package. So you end up with:

%dir %{_libdir}/ibis1.2
%{_libdir}/ibis1.2

for each of the above. More of the same issues in the remaining file list.

OK, that covers this spec file.
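A minimal sketch of the profile.d suggestion above, written as plain shell so it can be tried outside rpmbuild. The function name write_env_files, its four arguments, and the use of the library directory as the content of ibutils.conf are assumptions made for illustration; they are not taken from the ibutils spec file. The only point is that nothing is written to the build root, and nothing is added to the file list, when the prefix is /usr:

```shell
#!/bin/sh
# Sketch only: emulates the %install logic outside of rpmbuild.
# write_env_files BUILDROOT PREFIX LIBDIR FILELIST   (hypothetical helper)
#   BUILDROOT stands in for $RPM_BUILD_ROOT, PREFIX for %{_prefix},
#   LIBDIR for %{_libdir}, FILELIST for the "ibutils-files" list.
write_env_files() {
    buildroot=$1
    prefix=${2%/}      # treat "/usr/" the same as "/usr"
    libdir=$3
    filelist=$4

    case "$prefix" in
    /usr)
        # Default prefix: PATH and the linker already cover it,
        # so write no files and list nothing.
        return 0
        ;;
    esac

    mkdir -p "$buildroot/etc/profile.d" "$buildroot/etc/ld.so.conf.d"

    # Bourne-style login script: append PREFIX/bin to PATH once.
    cat > "$buildroot/etc/profile.d/ibutils.sh" << EOF
if ! echo \${PATH} | grep -q $prefix/bin ; then
    PATH=\${PATH}:$prefix/bin
fi
EOF

    # csh-style login script with the same effect.
    cat > "$buildroot/etc/profile.d/ibutils.csh" << EOF
if ( "\${path}" !~ *$prefix/bin* ) then
    set path = ( \$path $prefix/bin )
endif
EOF

    # Assumption: the linker config simply names the library directory.
    echo "$libdir" > "$buildroot/etc/ld.so.conf.d/ibutils.conf"

    # All three files are packaged only when they were actually written.
    {
        echo "/etc/profile.d/ibutils.sh"
        echo "/etc/profile.d/ibutils.csh"
        echo "/etc/ld.so.conf.d/ibutils.conf"
    } >> "$filelist"
}
```

In a spec file the equivalent call in %install would presumably pass $RPM_BUILD_ROOT, %{_prefix}, %{_libdir} and ibutils-files, with the %files section reading that list via -f.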
And in case you think I'm just being nit picky, these are the exact sorts of things that I had to fix before release engineering would let me submit the original openib package into our build tree, and fixing all of this stuff each and every release is why I don't use your spec files and instead just update my own spec file to use the new sources (well, that and history loss), which is of course why I really want downloadable tarballs as opposed to trying to munge tarballs out of your distribution. At least with this release, if nothing else, I can pull separate package tarballs out of the srpms in the OFED tarball, which is *much* better than in the past with the big ofa_user stuff. But, I still really need a download link to put in the spec files or else I get yelled at. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From arlin.r.davis at intel.com Thu Oct 25 16:46:27 2007 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 25 Oct 2007 16:46:27 -0700 Subject: [ofa-general] [PATCH] uDAPL (dat2.0 branch) - fixup distribution and specfile Message-ID: <001801c81761$42e04640$4297070a@amr.corp.intel.com> Doug, Please take a look at the patch to cleanup the DAPL distribution tarball and spec file. Thanks, -arlin -- Fix DAPL distribution by removing release number from tarball. General spec file cleanup for release 2 targeted for OFED 1.3 beta. Signed-off by: Arlin Davis -- diff --git a/configure.in b/configure.in index b08e06f..f11a7a2 100644 --- a/configure.in +++ b/configure.in @@ -1,11 +1,11 @@ dnl Process this file with autoconf to produce a configure script. 
AC_PREREQ(2.57) -AC_INIT(dapl, 2.0.1-1, general at lists.openfabrics.org) +AC_INIT(dapl, 2.0.1, general at lists.openfabrics.org) AC_CONFIG_SRCDIR([dat/udat/udat.c]) AC_CONFIG_AUX_DIR(config) AM_CONFIG_HEADER(config.h) -AM_INIT_AUTOMAKE(dapl, 2.0.1-1) +AM_INIT_AUTOMAKE(dapl, 2.0.1) AM_PROG_LIBTOOL diff --git a/dapl.spec.in b/dapl.spec.in index 232f0df..f49b95d 100644 --- a/dapl.spec.in +++ b/dapl.spec.in @@ -33,13 +33,13 @@ # $Id: $ Name: dapl Version: 2.0.1 -Release: 1%{?dist} +Release: 2%{?dist} Summary: A Library for userspace access to RDMA devices using OS Agnostic DAT APIs. Group: System Environment/Libraries License: Dual GPL/BSD/CPL Url: http://openfabrics.org/ -Source: http://www.openfabrics.org/downloads/%{name}/%{name}-%{version}-%{release}.tar.gz +Source: http://www.openfabrics.org/downloads/%{name}/%{name}-%{version}.tar.gz BuildRoot: %(mktemp -ud %{_tmppath}/%{name}-%{version}-%{release}-XXXXXX) Requires(post): /sbin/ldconfig Requires(postun): /sbin/ldconfig @@ -72,7 +72,7 @@ Requires: %{name} = %{version}-%{release} Useful test suites to validate uDAPL library API's. %prep -%setup -q -n %{name}- at VERSION@ +%setup %{name}-%{version} %build %configure --enable-ext-type=ib @@ -99,6 +99,7 @@ rm -rf %{buildroot} %files devel %defattr(-,root,root,-) %{_libdir}/*.so +%dir %{_includedir}/dat2 %{_includedir}/dat2/* %files devel-static @@ -111,6 +112,9 @@ rm -rf %{buildroot} %{_mandir}/man1/* %changelog +* Thu Oct 25 2007 Arlin Davis - 2.0.1-2 +- OFED 1.3-beta, DAT/DAPL Version 2.0.1 Release 2 + * Tue Sep 18 2007 Arlin Davis - 2.0.1-1 - OFED 1.3-alpha, co-exist with DAT 1.2 library package. From frontenac67 at praxa.com.au Thu Oct 25 18:37:07 2007 From: frontenac67 at praxa.com.au (Darrel Price) Date: Thu, 25 Oct 2007 17:37:07 -0800 Subject: [ofa-general] What do you think Message-ID: <912605568.54972977998660@praxa.com.au> An HTML attachment was scrubbed... 
URL: From stefan.roscher at de.ibm.com Fri Oct 26 02:12:57 2007 From: stefan.roscher at de.ibm.com (Stefan Roscher) Date: Fri, 26 Oct 2007 11:12:57 +0200 Subject: [ofa-general] Re: [ewg] OFED October 22 meeting summary on OFED 1.3 alpha status and beta tasks In-Reply-To: <471E0EFF.5040604@dev.mellanox.co.il> Message-ID: Hi Vlad, the problem still exists with the latest build. But I don't think it is an install.pl problem. We had a discussion with Doug weeks ago and I think there we figured out the problem. The discussion is here: http://lists.openfabrics.org/pipermail/ewg/2007-August/004408.html I hope we can solve this as fast as possible, maybe we should involve Doug again? Kind Regards Stefan Roscher InfiniBand/HEA Linux Device Driver Development Phone: ++49 (0) 7031-16-2015 Mail:stefan.roscher at de.ibm.com Labor Boeblingen, D3627/7103-19 (009), Schoenaicher Str. 220, D-71032 Boeblingen, Germany IBM Deutschland Entwicklung GmbH Vorsitzender des Aufsichtsrats: Martin Jetter Geschäftsführung: Herbert Kircher Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294 Vladimir Sokolovsky 23.10.2007 17:10 Please respond to vlad at mellanox.co.il To Hoang-Nam Nguyen/Germany/IBM at IBMDE cc Tziporet Koren , Stefan Roscher/Germany/IBM at IBMDE, ewg at lists.openfabrics.org, general at lists.openfabrics.org Subject Re: [ewg] OFED October 22 meeting summary on OFED 1.3 alpha status and beta tasks Hoang-Nam Nguyen wrote: > Hi, >> 3. Tasks that should completed for the beta: >> 4. Fix compilation problems on PPC with 32 bits - Vlad >> (Mellanox) - >> Nam please open a bug on this issue > Stefan has created #746 "Installation of 32-bit libibverbs failed". > @Vlad, since we'll have rpm spec for user space with beta, would it > better to tackle this with rpm specs? For libibverbs is it > libibverbs.spec.in we need to look at? > Thanks > Nam > Hi Nam, Please recheck this issue with the latest OFED-1.3 build. 
If this issue still exist then it is probably install.pl issue. Please update me, Thanks, Vladimir -------------- next part -------------- An HTML attachment was scrubbed... URL: 
From vlad at lists.openfabrics.org Fri Oct 26 02:55:20 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 26 Oct 2007 02:55:20 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071026-0200 daily build status Message-ID: <20071026095520.44BFEE60842@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 
Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.23 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: 
From dotanb at dev.mellanox.co.il Fri Oct 26 04:07:39 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Fri, 26 Oct 2007 13:07:39 +0200 Subject: [ofa-general] ***SPAM*** IB port state change In-Reply-To: <532b813a0710241530n5de748b0m2bdb55e1219bb7f1@mail.gmail.com> References: <532b813a0710241530n5de748b0m2bdb55e1219bb7f1@mail.gmail.com> Message-ID: <4721CA7B.4060204@dev.mellanox.co.il> Hi. If you will wait for an async event you will get the port events of the HCA that you are using. in the libibverbs examples, the asyncwatch.c can be a good reference for you. 
Dotan Ganesh Sadasivan wrote: > > Hi, > > Is there any example code that explains how to register and receive > IB port state changes? > > Thanks > Ganesh From arthur.jones at qlogic.com Fri Oct 26 07:46:20 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 26 Oct 2007 07:46:20 -0700 Subject: [ofa-general] [PATCH] IB/ipath -- more patches for 2.6.24 Message-ID: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> hi roland, here is our current set of bugfix patches for 2.6.24. these changes can be pulled from: git://git.qlogic.com/ipath-linux-2.6 for-roland arthur From arthur.jones at qlogic.com Fri Oct 26 07:46:25 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 26 Oct 2007 07:46:25 -0700 Subject: [ofa-general] [PATCH 1/4] IB/ipath - Fix a race where s_last updated w/o lock held. In-Reply-To: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> References: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071026144625.13639.2648.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell There is a small window where a send work queue entry could be overwritten by ib_post_send() because s_last is updated before the entry is read. This patch closes the window by acquiring the lock and updating the last send work queue entry index after reading the wr_id. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_ruc.c | 14 +++++++++----- 1 files changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index 4b6b7ee..54c61a9 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -630,11 +630,8 @@ bail:; void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, enum ib_wc_status status) { - u32 last = qp->s_last; - - if (++last == qp->s_size) - last = 0; - qp->s_last = last; + unsigned long flags; + u32 last; /* See ch. 
11.2.4.1 and 10.7.3.1 */ if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || @@ -658,4 +655,11 @@ void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, wc.port_num = 0; ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); } + + spin_lock_irqsave(&qp->s_lock, flags); + last = qp->s_last; + if (++last >= qp->s_size) + last = 0; + qp->s_last = last; + spin_unlock_irqrestore(&qp->s_lock, flags); } From arthur.jones at qlogic.com Fri Oct 26 07:46:30 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 26 Oct 2007 07:46:30 -0700 Subject: [ofa-general] [PATCH 2/4] IB/ipath -- limit length checksummed in eeprom In-Reply-To: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> References: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071026144630.13639.36659.stgit@eng-46.internal.keyresearch.com> From: Michael Albaugh The small eeprom that holds, e.g. GUID contains a data-length, but if the actual eeprom is new or has been erased, that byte will be 0xFF, which is greater than the maximum physical length of the eeprom, and more importantly greater than the length of the buffer we vmalloc'd. Sanity-check the length to avoid the possbility of reading past end of buffer. Signed-off-by: Michael Albaugh --- drivers/infiniband/hw/ipath/ipath_eeprom.c | 10 +++++++++- 1 files changed, 9 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_eeprom.c b/drivers/infiniband/hw/ipath/ipath_eeprom.c index bcfa3cc..e7c25db 100644 --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c @@ -538,7 +538,15 @@ static u8 flash_csum(struct ipath_flash *ifp, int adjust) u8 *ip = (u8 *) ifp; u8 csum = 0, len; - for (len = 0; len < ifp->if_length; len++) + /* + * Limit length checksummed to max length of actual data. + * Checksum of erased eeprom will still be bad, but we avoid + * reading past the end of the buffer we were passed. 
+ */ + len = ifp->if_length; + if (len > sizeof(struct ipath_flash)) + len = sizeof(struct ipath_flash); + while (len--) csum += *ip++; csum -= ifp->if_csum; csum = ~csum; From arthur.jones at qlogic.com Fri Oct 26 07:46:36 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 26 Oct 2007 07:46:36 -0700 Subject: [ofa-general] [PATCH 3/4] IB/ipath -- Fix incorrect use of sizeof on msg buffer (function argument) In-Reply-To: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> References: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071026144636.13639.31567.stgit@eng-46.internal.keyresearch.com> From: Dave Olson Reduced the size of the buffer also, 512 was overly generous. Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_intr.c | 15 ++++++++------- 1 files changed, 8 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 6a5dd5c..a4f3cf9 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -453,7 +453,7 @@ skip_ibchange: } static void handle_supp_msgs(struct ipath_devdata *dd, - unsigned supp_msgs, char msg[512]) + unsigned supp_msgs, char *msg, u32 msgsz) { /* * Print the message unless it's ibc status change only, which @@ -461,7 +461,7 @@ static void handle_supp_msgs(struct ipath_devdata *dd, */ if (dd->ipath_lasterror & ~INFINIPATH_E_IBSTATUSCHANGED) { int iserr; - iserr = ipath_decode_err(msg, sizeof msg, + iserr = ipath_decode_err(msg, msgsz, dd->ipath_lasterror & ~INFINIPATH_E_IBSTATUSCHANGED); if (dd->ipath_lasterror & @@ -492,8 +492,8 @@ static void handle_supp_msgs(struct ipath_devdata *dd, } static unsigned handle_frequent_errors(struct ipath_devdata *dd, - ipath_err_t errs, char msg[512], - int *noprint) + ipath_err_t errs, char *msg, + u32 msgsz, int *noprint) { unsigned long nc; static unsigned long nextmsg_time; @@ -512,7 +512,7 @@ static unsigned 
handle_frequent_errors(struct ipath_devdata *dd, nextmsg_time = nc + HZ * 3; } else if (supp_msgs) { - handle_supp_msgs(dd, supp_msgs, msg); + handle_supp_msgs(dd, supp_msgs, msg, msgsz); supp_msgs = 0; nmsgs = 0; } @@ -525,14 +525,15 @@ static unsigned handle_frequent_errors(struct ipath_devdata *dd, static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) { - char msg[512]; + char msg[128]; u64 ignore_this_time = 0; int i, iserr = 0; int chkerrpkts = 0, noprint = 0; unsigned supp_msgs; int log_idx; - supp_msgs = handle_frequent_errors(dd, errs, msg, &noprint); + supp_msgs = handle_frequent_errors(dd, errs, msg, (u32)sizeof msg, + &noprint); /* don't report errors that are masked */ errs &= ~dd->ipath_maskederrs; From arthur.jones at qlogic.com Fri Oct 26 07:46:41 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 26 Oct 2007 07:46:41 -0700 Subject: [ofa-general] [PATCH 4/4] IB/ipath -- Improve interrupt handler cache footprint In-Reply-To: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> References: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> Message-ID: <20071026144641.13639.73320.stgit@eng-46.internal.keyresearch.com> From: Dave Olson Improve interrupt handler cache footprint by noinline'ing error functions Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_intr.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index a4f3cf9..61e8822 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -850,7 +850,7 @@ void ipath_clear_freeze(struct ipath_devdata *dd) /* this is separate to allow for better optimization of ipath_intr() */ -static void ipath_bad_intr(struct ipath_devdata *dd, u32 * unexpectp) +static noinline void ipath_bad_intr(struct ipath_devdata *dd, u32 *unexpectp) { /* * sometimes happen during driver init and unload, don't 
want @@ -893,7 +893,7 @@ static void ipath_bad_intr(struct ipath_devdata *dd, u32 * unexpectp) "ignoring\n"); } -static void ipath_bad_regread(struct ipath_devdata *dd) +static noinline void ipath_bad_regread(struct ipath_devdata *dd) { static int allbits; From Thomas.Talpey at netapp.com Fri Oct 26 08:42:46 2007 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 26 Oct 2007 11:42:46 -0400 Subject: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU In-Reply-To: <471FA68D.80707@opengridcomputing.com> References: <471E4438.6080300@ichips.intel.com> <471E5880.1030100@opengridcomputing.com> <471E8A6A.2030207@ichips.intel.com> <5E701717F2B2ED4EA60F87C8AA57B7CC078C8D74@venom2> <6.2.0.14.2.20071024105353.03090e10@esmail.cup.hp.com> <471FA68D.80707@opengridcomputing.com> Message-ID: At 04:09 PM 10/24/2007, Tom Tucker wrote: >Michael Krause wrote: >> The proper action is to propose a new MPA specification to the IETF - >> it isn't an OFA decision to make. MPA within the IETF was a tough >> fight to get into existence. This particular issue was raised and the >> outcome from that debate is what is in the 1.0 specification (it is a >> standard if I recall not a draft). >It looks to me to be an ID, not an RFC. The RDDP specifications are in the RFC Editor's queue, therefore they are in-between Internet-Draft and Proposed-Standard status. You can find them on the IETF tracking pages: "RFC" is just a nickname. The specific document status, Proposed Standard in this case, is the key. RFCs are in fact living documents, but once published their status changes. The idea of "Proposed" is that people implement it, and feed back what works and what doesn't about the protocol. After vetting the protocol by this process, the next step is to modify and/or republish it as a so-called "Draft" standard. In fact, to get to that level we need two or more interoperable implementations. 
This connection model problem certainly is an interoperability issue. This discussion should be held on the IETF RDDP list. Take this experience back to the protocol and fix it, or deal with it in upper layers and leave the draft as-is. As mentioned, the MPA protocol was far and away the most contentious point of the entire iWARP RDDP stack. It represents many, many compromises, and it's not surprising in hindsight that this issue is surfacing. It means the process is working, remember. Tom. P.S. Internet Standards process:
From dledford at redhat.com Fri Oct 26 12:56:34 2007 From: dledford at redhat.com (Doug Ledford) Date: Fri, 26 Oct 2007 15:56:34 -0400 Subject: [ofa-general] Re: [PATCH] uDAPL (dat2.0 branch) - fixup distribution and specfile In-Reply-To: <001801c81761$42e04640$4297070a@amr.corp.intel.com> References: <001801c81761$42e04640$4297070a@amr.corp.intel.com> Message-ID: <1193428594.10336.318.camel@firewall.xsintricity.com> On Thu, 2007-10-25 at 16:46 -0700, Arlin Davis wrote: > Doug, > > Please take a look at the patch to cleanup the DAPL distribution > tarball and spec file. > > Thanks, > > -arlin > > -- > Fix DAPL distribution by removing release number from tarball. > General spec file cleanup for release 2 targeted for OFED 1.3 beta. > > Signed-off by: Arlin Davis > -- > diff --git a/configure.in b/configure.in > index b08e06f..f11a7a2 100644 > --- a/configure.in > +++ b/configure.in > @@ -1,11 +1,11 @@ > dnl Process this file with autoconf to produce a configure script. > > AC_PREREQ(2.57) > -AC_INIT(dapl, 2.0.1-1, general at lists.openfabrics.org) > +AC_INIT(dapl, 2.0.1, general at lists.openfabrics.org) > AC_CONFIG_SRCDIR([dat/udat/udat.c]) > AC_CONFIG_AUX_DIR(config) > AM_CONFIG_HEADER(config.h) > -AM_INIT_AUTOMAKE(dapl, 2.0.1-1) > +AM_INIT_AUTOMAKE(dapl, 2.0.1) If you are going to build a new tarball, this should probably be 2.0.2. And any new tarballs after that should increment that number. > AM_PROG_LIBTOOL > > diff --git a/dapl.spec.in b/dapl.spec.in > index 232f0df..f49b95d 100644 > --- a/dapl.spec.in > +++ b/dapl.spec.in > @@ -33,13 +33,13 @@ > # $Id: $ > Name: dapl > Version: 2.0.1 > -Release: 1%{?dist} > +Release: 2%{?dist} This should probably be: Version: @VERSION@ Release: 1%{?dist} There is a general issue with including spec files in an upstream release. Namely, any distribution is likely to rebuild your release any number of times for integration issues and fixes.
That means any distribution is guaranteed to end up producing their own releases of your version of the software. If both you and the distros are producing various releases of the same version of software, then it becomes next to impossible to map from your version-release combo to the same code in a distro version-release combo. So, the spec file may make things easy for people to build rpms of your code, but in order to play nice with distros, you really need to update the version of the software and leave the release at 1 anytime you update your tarball. Since you don't maintain your spec files separate from your tarballs, that means even a spec file change requires a tarball version increment. Now, if your tarball didn't contain the spec or spec.in file, then you could update those without having to update the tarball. But, that generally means you would need to have both the tarball itself and the spec file available for people to download. That way people who are building rpms locally have the benefit of you being able to release updated specs without updating the tarball, and distros have a reliable, authoritative tarball download that doesn't change unless the source changes. I'm fine with either way, but let's say you uploaded dapl-2.0.1.tar.gz with the original spec file, then re-uploaded the exact same dapl-2.0.1.tar.gz with this updated spec file, then that would be *horribly* broken. You can't have two different tarballs with the same name. I know the original dapl upload was 2.0.1-1, but the way things are headed with this update, if you *did* do a -3 release, it would overwrite the -2 release tarball silently. > Summary: A Library for userspace access to RDMA devices using OS > Agnostic DAT APIs.
> > Group: System Environment/Libraries > License: Dual GPL/BSD/CPL > Url: http://openfabrics.org/ > -Source: > http://www.openfabrics.org/downloads/%{name}/%{name}-%{version}-%{release}.tar.gz > +Source: > http://www.openfabrics.org/downloads/%{name}/%{name}-%{version}.tar.gz > BuildRoot: %(mktemp -ud > %{_tmppath}/%{name}-%{version}-%{release}-XXXXXX) > Requires(post): /sbin/ldconfig > Requires(postun): /sbin/ldconfig > @@ -72,7 +72,7 @@ Requires: %{name} = %{version}-%{release} > Useful test suites to validate uDAPL library API's. > > %prep > -%setup -q -n %{name}- at VERSION@ > +%setup %{name}-%{version} You don't need the %{name}-%{version} without -n. The whole purpose of -n is to tell the setup script that the tarball unpacked into a directory *other* than %{name}-%{version}, and then the argument to -n is the directory it actually unpacked into. You might want to keep the -q on the other hand, it just keeps them from passing -v to the tar extract command. > %build > %configure --enable-ext-type=ib > @@ -99,6 +99,7 @@ rm -rf %{buildroot} > %files devel > %defattr(-,root,root,-) > %{_libdir}/*.so > +%dir %{_includedir}/dat2 > %{_includedir}/dat2/* > > %files devel-static > @@ -111,6 +112,9 @@ rm -rf %{buildroot} > %{_mandir}/man1/* > > %changelog > +* Thu Oct 25 2007 Arlin Davis - 2.0.1-2 > +- OFED 1.3-beta, DAT/DAPL Version 2.0.1 Release 2 > + > * Tue Sep 18 2007 Arlin Davis - 2.0.1-1 > - OFED 1.3-alpha, co-exist with DAT 1.2 library package. > > -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From erezz at voltaire.com Thu Oct 25 07:11:34 2007 From: erezz at voltaire.com (Erez Zilber) Date: Thu, 25 Oct 2007 16:11:34 +0200 Subject: [ofa-general] iSER for stgt - wiki page Message-ID: <4720A416.8010503@voltaire.com> The following wiki page is a quick start guide for running an iSCSI over iSER target through the open-source stgt project: https://wiki.openfabrics.org/tiki-index.php?page=ISER-target For more information about stgt: http://stgt.berlios.de/ I hope that you find it helpful. -- ____________________________________________________________ Erez Zilber | 972-9-971-7689 Software Engineer, Storage Solutions Voltaire – _The Grid Backbone_ __ www.voltaire.com --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi at googlegroups.com To unsubscribe from this group, send email to open-iscsi-unsubscribe at googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~----------~----~----~----~------~----~------~--~--- From sashak at voltaire.com Fri Oct 26 15:30:02 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 27 Oct 2007 00:30:02 +0200 Subject: [ofa-general] [PATCH RFC] libibumad: support for new pkey enabled user_mad API In-Reply-To: References: <46EACC6B.5060702@ichips.intel.com> <1190034864.6272.86.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071026223002.GB22317@sashak.voltaire.com> This adds support for new pkey enabled user_mad API. When ABI version is 5 this tries to use IB_USER_MAD_ENABLE_PKEY ioctl(). 
Signed-off-by: Sasha Khapyorsky --- libibumad/include/infiniband/umad.h | 4 ++- libibumad/src/umad.c | 45 +++++++++++++++++++++++++--------- 2 files changed, 36 insertions(+), 13 deletions(-) diff --git a/libibumad/include/infiniband/umad.h b/libibumad/include/infiniband/umad.h index 2ec8b37..21cf729 100644 --- a/libibumad/include/infiniband/umad.h +++ b/libibumad/include/infiniband/umad.h @@ -60,6 +60,8 @@ typedef struct ib_mad_addr { uint8_t traffic_class; uint8_t gid[16]; uint32_t flow_label; + uint16_t pkey_index; + uint8_t reserved[6]; } ib_mad_addr_t; typedef struct ib_user_mad { @@ -80,8 +82,8 @@ typedef struct ib_user_mad { #define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 1, \ struct ib_user_mad_reg_req) - #define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, uint32_t) +#define IB_USER_MAD_ENABLE_PKEY _IO(IB_IOCTL_MAGIC, 3) #define UMAD_CA_NAME_LEN 20 #define UMAD_CA_MAX_PORTS 10 /* 0 - 9 */ diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 41373e7..307145f 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -85,6 +85,9 @@ int umaddebug = 0; static char *def_ca_name = "mthca0"; static int def_ca_port = 1; +static unsigned abi_version; +static unsigned new_user_mad_api; + /************************************* * Port */ @@ -428,16 +431,14 @@ dev_to_umad_id(char *dev, unsigned port) int umad_init(void) { - unsigned abi_version; - TRACE("umad_init"); if (sys_read_uint(IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE, &abi_version) < 0) { IBWARN("can't read ABI version from %s/%s (%m): is ib_umad module loaded?", IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE); return -1; } - if (abi_version != IB_UMAD_ABI_VERSION) { - IBWARN("wrong ABI version: %s/%s is %d but library ABI is %d", + if (abi_version < IB_UMAD_ABI_VERSION) { + IBWARN("wrong ABI version: %s/%s is %d but library minimal ABI is %d", IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE, abi_version, IB_UMAD_ABI_VERSION); return -1; } @@ -554,6 +555,21 @@ umad_open_port(char *ca_name, int portnum) return 
-EIO; } + if (abi_version > 5) + new_user_mad_api = 1; + else { + int ret = ioctl(fd, IB_USER_MAD_ENABLE_PKEY, NULL); + if (ret == 0) + new_user_mad_api = 1; + else if (ret < 0 && errno == EINVAL) + new_user_mad_api = 0; + else { + close(fd); + IBWARN("cannot detect is user_mad P_Key enabled API supported."); + return ret; + } + } + DEBUG("opened %s fd %d portid %d", dev_file, fd, umad_id); return fd; } @@ -636,13 +652,15 @@ umad_close_port(int fd) void * umad_get_mad(void *umad) { - return ((struct ib_user_mad *)umad)->data; + return new_user_mad_api ? ((struct ib_user_mad *)umad)->data : + (void *)&((struct ib_user_mad *)umad)->addr.pkey_index; } size_t umad_size(void) { - return sizeof (struct ib_user_mad); + return new_user_mad_api ? sizeof (struct ib_user_mad) : + sizeof(struct ib_user_mad) - 8; } int @@ -663,11 +681,13 @@ umad_set_grh(void *umad, void *mad_addr) } int -umad_set_pkey(void *umad, int pkey) +umad_set_pkey(void *umad, int pkey_index) { -#if 0 - mad->addr.pkey = 0; /* FIXME - PKEY support */ -#endif + struct ib_user_mad *mad = umad; + + if (new_user_mad_api) + mad->addr.pkey_index = htons(pkey_index); + return 0; } @@ -929,11 +949,12 @@ umad_addr_dump(ib_mad_addr_t *addr) } gid_str[i*2] = 0; IBWARN("qpn %d qkey 0x%x lid 0x%x sl %d\n" - "grh_present %d gid_index %d hop_limit %d traffic_class %d flow_label 0x%x\n" + "grh_present %d gid_index %d hop_limit %d traffic_class %d flow_label 0x%x pkey_index 0x%x\n" "Gid 0x%s", ntohl(addr->qpn), ntohl(addr->qkey), ntohs(addr->lid), addr->sl, addr->grh_present, (int)addr->gid_index, (int)addr->hop_limit, - (int)addr->traffic_class, addr->flow_label, gid_str); + (int)addr->traffic_class, addr->flow_label, addr->pkey_index, + gid_str); } void -- 1.5.3.4.206.g58ba4 From rdreier at cisco.com Fri Oct 26 15:22:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 26 Oct 2007 15:22:54 -0700 Subject: [ofa-general] Re: [PATCH RFC] libibumad: support for new pkey enabled user_mad API In-Reply-To: 
<20071026223002.GB22317@sashak.voltaire.com> (Sasha Khapyorsky's message of "Sat, 27 Oct 2007 00:30:02 +0200") References: <46EACC6B.5060702@ichips.intel.com> <1190034864.6272.86.camel@hrosenstock-ws.xsigo.com> <20071026223002.GB22317@sashak.voltaire.com> Message-ID: Looks great to me. Sorry I didn't get around to finishing my patch to implement this (I was going to get to it, honest!). Anyway, thanks for doing the work. - R. From ssufficool at rov.sbcounty.gov Fri Oct 26 15:31:28 2007 From: ssufficool at rov.sbcounty.gov (Sufficool, Stanley) Date: Fri, 26 Oct 2007 15:31:28 -0700 Subject: [ofa-general] iSER for stgt - wiki page In-Reply-To: <4720A416.8010503@voltaire.com> Message-ID: Does anyone know a source for Windows initiators for iSER? -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Erez Zilber Sent: Thursday, October 25, 2007 7:12 AM To: ewg at lists.openfabrics.org; general at lists.openfabrics.org; stgt-devel at lists.berlios.de; open-iscsi at googlegroups.com Subject: [ofa-general] iSER for stgt - wiki page The following wiki page is a quick start guide for running an iSCSI over iSER target through the open-source stgt project: https://wiki.openfabrics.org/tiki-index.php?page=ISER-target For more information about stgt: http://stgt.berlios.de/ I hope that you find it helpful. -- ____________________________________________________________ Erez Zilber | 972-9-971-7689 Software Engineer, Storage Solutions Voltaire - _The Grid Backbone_ __ www.voltaire.com --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. 
To post to this group, send email to open-iscsi at googlegroups.com To unsubscribe from this group, send email to open-iscsi-unsubscribe at googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~----------~----~----~----~------~----~------~--~--- _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Fri Oct 26 15:32:48 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 26 Oct 2007 15:32:48 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: <4720E355.8010400@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Thu, 25 Oct 2007 11:41:25 -0700") References: <471FAC1F.2070401@linux.vnet.ibm.com> <4720E355.8010400@linux.vnet.ibm.com> Message-ID: > Having waited for months for this patch to be merged in, it is very disappointing > to say the least. Wish it had been merged and if changes are needed they can always be > made subsequently. That has been my understanding of the development model. If you really want to get into it... I'll certainly accept some of the blame for taking too long to review this patch. However, you didn't do yourself any favors by: a) making one huge ugly patch and b) being rather disagreeable when someone actually tried to review it. As far as the development model goes, it is certainly true that for new things, we can merge first and fix later. But when we're touching something like IPoIB, which is pretty critical to just about everyone using the IB stack at all, the standard is a little different: we need to be much more conservative. 
And even for new stuff, starting from a good base is pretty important; it's easy to pick on coding style problems, and indeed they do make review harder, but it's even more important to have the underlying logic and structure be simple and maintainable. Anyway, I'll post my current patch series shortly. I think I was able to make the patch quite a bit neater and more reviewable: your patch added > 400 lines, while the main part of my series adds < 200 lines. - R. From rolandd at cisco.com Fri Oct 26 15:33:30 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 26 Oct 2007 15:33:30 -0700 Subject: [ofa-general] [PATCH 0/4] [RFC] IPoIB/cm: Handle devices without SRQ support Message-ID: <200710261533.L8xgEk4eiTPodbPe@cisco.com> Here is the current series of IPoIB changes I plan to merge for 2.6.25. The point of the series is to add IPoIB connected mode support for HCAs that do not implement SRQs. It is based on Pradeep's patch, but when I started trying to get his most recent patch to apply to the current tree, I ended up completely rewriting things. I'm pretty sure I fixed a few bugs in the process, but I probably introduced several more too, so review and/or test results would be appreciated. We still have a couple of months until 2.6.25 opens up, so there should be time to get this solid. Thanks, Roland From rolandd at cisco.com Fri Oct 26 15:33:31 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 26 Oct 2007 15:33:31 -0700 Subject: [ofa-general] [PATCH 1/4] [RFC] IPoIB: Trivial formatting cleanups In-Reply-To: <200710261533.L8xgEk4eiTPodbPe@cisco.com> Message-ID: <200710261533.Xn1nnRrnrNQK8TWk@cisco.com> Fix whitespace blunders, convert "foo* bar" to "foo *bar", etc. 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib.h | 152 ++++++++++++------------ drivers/infiniband/ulp/ipoib/ipoib_cm.c | 48 ++++---- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 8 +- drivers/infiniband/ulp/ipoib/ipoib_main.c | 38 +++--- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 4 +- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 10 +- 6 files changed, 130 insertions(+), 130 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index eb7edab..a376fb6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -56,42 +56,42 @@ /* constants */ enum { - IPOIB_PACKET_SIZE = 2048, - IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, - IPOIB_ENCAP_LEN = 4, + IPOIB_ENCAP_LEN = 4, - IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */ - IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN, - IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE, - IPOIB_CM_RX_SG = ALIGN(IPOIB_CM_BUF_SIZE, PAGE_SIZE) / PAGE_SIZE, - IPOIB_RX_RING_SIZE = 128, - IPOIB_TX_RING_SIZE = 64, + IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */ + IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN, + IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE, + IPOIB_CM_RX_SG = ALIGN(IPOIB_CM_BUF_SIZE, PAGE_SIZE) / PAGE_SIZE, + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, - IPOIB_NUM_WC = 4, + IPOIB_NUM_WC = 4, IPOIB_MAX_PATH_REC_QUEUE = 3, - IPOIB_MAX_MCAST_QUEUE = 3, - - IPOIB_FLAG_OPER_UP = 0, - IPOIB_FLAG_INITIALIZED = 1, - IPOIB_FLAG_ADMIN_UP = 2, - IPOIB_PKEY_ASSIGNED = 3, - IPOIB_PKEY_STOP = 4, - IPOIB_FLAG_SUBINTERFACE = 5, - IPOIB_MCAST_RUN = 6, - IPOIB_STOP_REAPER = 7, - IPOIB_MCAST_STARTED = 8, - IPOIB_FLAG_ADMIN_CM = 9, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_OPER_UP = 0, + IPOIB_FLAG_INITIALIZED = 1, + 
IPOIB_FLAG_ADMIN_UP = 2, + IPOIB_PKEY_ASSIGNED = 3, + IPOIB_PKEY_STOP = 4, + IPOIB_FLAG_SUBINTERFACE = 5, + IPOIB_MCAST_RUN = 6, + IPOIB_STOP_REAPER = 7, + IPOIB_MCAST_STARTED = 8, + IPOIB_FLAG_ADMIN_CM = 9, IPOIB_FLAG_UMCAST = 10, IPOIB_MAX_BACKOFF_SECONDS = 16, - IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ IPOIB_MCAST_FLAG_SENDONLY = 1, - IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ IPOIB_MCAST_FLAG_ATTACHED = 3, }; @@ -117,7 +117,7 @@ struct ipoib_pseudoheader { struct ipoib_mcast { struct ib_sa_mcmember_rec mcmember; struct ib_sa_multicast *mc; - struct ipoib_ah *ah; + struct ipoib_ah *ah; struct rb_node rb_node; struct list_head list; @@ -186,27 +186,27 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp *qp; - struct list_head list; - struct net_device *dev; - unsigned long jiffies; - enum ipoib_cm_state state; + struct ib_cm_id *id; + struct ib_qp *qp; + struct list_head list; + struct net_device *dev; + unsigned long jiffies; + enum ipoib_cm_state state; }; struct ipoib_cm_tx { - struct ib_cm_id *id; - struct ib_qp *qp; + struct ib_cm_id *id; + struct ib_qp *qp; struct list_head list; struct net_device *dev; struct ipoib_neigh *neigh; struct ipoib_path *path; struct ipoib_tx_buf *tx_ring; - unsigned tx_head; - unsigned tx_tail; - unsigned long flags; - u32 mtu; - struct ib_wc ibwc[IPOIB_NUM_WC]; + unsigned tx_head; + unsigned tx_tail; + unsigned long flags; + u32 mtu; + struct ib_wc ibwc[IPOIB_NUM_WC]; }; struct ipoib_cm_rx_buf { @@ -215,24 +215,24 @@ struct ipoib_cm_rx_buf { }; struct ipoib_cm_dev_priv { - struct ib_srq *srq; + struct ib_srq *srq; struct ipoib_cm_rx_buf *srq_ring; - struct ib_cm_id *id; - struct list_head passive_ids; /* state: LIVE */ - struct list_head rx_error_list; /* state: ERROR */ - struct list_head rx_flush_list; /* state: FLUSH, drain not 
started */ - struct list_head rx_drain_list; /* state: FLUSH, drain started */ - struct list_head rx_reap_list; /* state: FLUSH, drain done */ + struct ib_cm_id *id; + struct list_head passive_ids; /* state: LIVE */ + struct list_head rx_error_list; /* state: ERROR */ + struct list_head rx_flush_list; /* state: FLUSH, drain not started */ + struct list_head rx_drain_list; /* state: FLUSH, drain started */ + struct list_head rx_reap_list; /* state: FLUSH, drain done */ struct work_struct start_task; struct work_struct reap_task; struct work_struct skb_task; struct work_struct rx_reap_task; struct delayed_work stale_task; struct sk_buff_head skb_queue; - struct list_head start_list; - struct list_head reap_list; - struct ib_wc ibwc[IPOIB_NUM_WC]; - struct ib_sge rx_sge[IPOIB_CM_RX_SG]; + struct list_head start_list; + struct list_head reap_list; + struct ib_wc ibwc[IPOIB_NUM_WC]; + struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; }; @@ -269,30 +269,30 @@ struct ipoib_dev_priv { struct work_struct pkey_event_task; struct ib_device *ca; - u8 port; - u16 pkey; - u16 pkey_index; - struct ib_pd *pd; - struct ib_mr *mr; - struct ib_cq *cq; - struct ib_qp *qp; - u32 qkey; + u8 port; + u16 pkey; + u16 pkey_index; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; union ib_gid local_gid; - u16 local_lid; + u16 local_lid; unsigned int admin_mtu; unsigned int mcast_mtu; struct ipoib_rx_buf *rx_ring; - spinlock_t tx_lock; + spinlock_t tx_lock; struct ipoib_tx_buf *tx_ring; - unsigned tx_head; - unsigned tx_tail; - struct ib_sge tx_sge; + unsigned tx_head; + unsigned tx_tail; + struct ib_sge tx_sge; struct ib_send_wr tx_wr; - unsigned tx_outstanding; + unsigned tx_outstanding; struct ib_wc ibwc[IPOIB_NUM_WC]; @@ -317,10 +317,10 @@ struct ipoib_dev_priv { struct ipoib_ah { struct net_device *dev; - struct ib_ah *ah; + struct ib_ah *ah; struct list_head list; - struct kref ref; - unsigned last_send; + struct kref ref; + 
unsigned last_send; }; struct ipoib_path { @@ -331,11 +331,11 @@ struct ipoib_path { struct list_head neigh_list; - int query_id; + int query_id; struct ib_sa_query *query; struct completion done; - struct rb_node rb_node; + struct rb_node rb_node; struct list_head list; }; @@ -344,7 +344,7 @@ struct ipoib_neigh { #ifdef CONFIG_INFINIBAND_IPOIB_CM struct ipoib_cm_tx *cm; #endif - union ib_gid dgid; + union ib_gid dgid; struct sk_buff_head queue; struct neighbour *neighbour; @@ -455,8 +455,8 @@ void ipoib_drain_cq(struct net_device *dev); #ifdef CONFIG_INFINIBAND_IPOIB_CM -#define IPOIB_FLAGS_RC 0x80 -#define IPOIB_FLAGS_UC 0x40 +#define IPOIB_FLAGS_RC 0x80 +#define IPOIB_FLAGS_UC 0x40 /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC)) @@ -500,7 +500,7 @@ void ipoib_cm_dev_cleanup(struct net_device *dev); struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path, struct ipoib_neigh *neigh); void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx); -void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb, +void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb, unsigned int mtu); void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc); void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc); @@ -582,7 +582,7 @@ int ipoib_cm_add_mode_attr(struct net_device *dev) return 0; } -static inline void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb, +static inline void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb, unsigned int mtu) { dev_kfree_skb_any(skb); @@ -624,12 +624,12 @@ extern struct ib_sa_client ipoib_sa_client; extern int ipoib_debug_level; #define ipoib_dbg(priv, format, arg...) \ - do { \ + do { \ if (ipoib_debug_level > 0) \ ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ } while (0) #define ipoib_dbg_mcast(priv, format, arg...) 
\ - do { \ + do { \ if (mcast_debug_level > 0) \ ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ } while (0) @@ -642,7 +642,7 @@ extern int ipoib_debug_level; #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA #define ipoib_dbg_data(priv, format, arg...) \ - do { \ + do { \ if (data_debug_level > 0) \ ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ } while (0) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 8761077..2811554 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -155,7 +155,7 @@ partial_error: return NULL; } -static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv* priv) +static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv *priv) { struct ib_send_wr *bad_wr; struct ipoib_cm_rx *p; @@ -495,10 +495,10 @@ static inline int post_send(struct ipoib_dev_priv *priv, { struct ib_send_wr *bad_wr; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; + priv->tx_sge.addr = addr; + priv->tx_sge.length = len; - priv->tx_wr.wr_id = wr_id | IPOIB_OP_CM; + priv->tx_wr.wr_id = wr_id | IPOIB_OP_CM; return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr); } @@ -540,7 +540,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ tx_req->mapping = addr; if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), - addr, skb->len))) { + addr, skb->len))) { ipoib_warn(priv, "post_send failed\n"); ++dev->stats.tx_errors; ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); @@ -799,7 +799,7 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, .qp_context = tx - }; + }; return ib_create_qp(priv->pd, &attr); } @@ -816,28 +816,28 @@ static int ipoib_cm_send_req(struct net_device *dev, data.qpn = cpu_to_be32(priv->qp->qp_num); data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE); - req.primary_path = pathrec; - req.alternate_path = NULL; - req.service_id = 
cpu_to_be64(IPOIB_CM_IETF_ID | qpn); - req.qp_num = qp->qp_num; - req.qp_type = qp->qp_type; - req.private_data = &data; - req.private_data_len = sizeof data; - req.flow_control = 0; + req.primary_path = pathrec; + req.alternate_path = NULL; + req.service_id = cpu_to_be64(IPOIB_CM_IETF_ID | qpn); + req.qp_num = qp->qp_num; + req.qp_type = qp->qp_type; + req.private_data = &data; + req.private_data_len = sizeof data; + req.flow_control = 0; - req.starting_psn = 0; /* FIXME */ + req.starting_psn = 0; /* FIXME */ /* * Pick some arbitrary defaults here; we could make these * module parameters if anyone cared about setting them. */ - req.responder_resources = 4; - req.remote_cm_response_timeout = 20; - req.local_cm_response_timeout = 20; - req.retry_count = 0; /* RFC draft warns against retries */ - req.rnr_retry_count = 0; /* RFC draft warns against retries */ - req.max_cm_retries = 15; - req.srq = 1; + req.responder_resources = 4; + req.remote_cm_response_timeout = 20; + req.local_cm_response_timeout = 20; + req.retry_count = 0; /* RFC draft warns against retries */ + req.rnr_retry_count = 0; /* RFC draft warns against retries */ + req.max_cm_retries = 15; + req.srq = 1; return ib_send_cm_req(id, &req); } @@ -1150,7 +1150,7 @@ static void ipoib_cm_skb_reap(struct work_struct *work) spin_unlock_irq(&priv->tx_lock); } -void ipoib_cm_skb_too_long(struct net_device* dev, struct sk_buff *skb, +void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb, unsigned int mtu) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -1212,7 +1212,7 @@ static void ipoib_cm_stale_task(struct work_struct *work) } -static ssize_t show_mode(struct device *d, struct device_attribute *attr, +static ssize_t show_mode(struct device *d, struct device_attribute *attr, char *buf) { struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(d)); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 5063dd5..52bc2bd 100644 --- 
a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -345,12 +345,12 @@ static inline int post_send(struct ipoib_dev_priv *priv, { struct ib_send_wr *bad_wr; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; + priv->tx_sge.addr = addr; + priv->tx_sge.length = len; - priv->tx_wr.wr_id = wr_id; + priv->tx_wr.wr_id = wr_id; priv->tx_wr.wr.ud.remote_qpn = qpn; - priv->tx_wr.wr.ud.ah = address; + priv->tx_wr.wr.ud.ah = address; return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index a03a65e..f31f419 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -471,8 +471,8 @@ static struct ipoib_path *path_rec_create(struct net_device *dev, void *gid) INIT_LIST_HEAD(&path->neigh_list); memcpy(path->pathrec.dgid.raw, gid, sizeof (union ib_gid)); - path->pathrec.sgid = priv->local_gid; - path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); path->pathrec.numb_path = 1; path->pathrec.traffic_class = priv->broadcast->mcmember.traffic_class; @@ -947,34 +947,34 @@ static void ipoib_setup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - dev->open = ipoib_open; - dev->stop = ipoib_stop; - dev->change_mtu = ipoib_change_mtu; - dev->hard_start_xmit = ipoib_start_xmit; - dev->tx_timeout = ipoib_timeout; - dev->header_ops = &ipoib_header_ops; - dev->set_multicast_list = ipoib_set_mcast_list; - dev->neigh_setup = ipoib_neigh_setup_dev; + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->tx_timeout = ipoib_timeout; + dev->header_ops = &ipoib_header_ops; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; netif_napi_add(dev, &priv->napi, ipoib_poll, 100); - 
dev->watchdog_timeo = HZ; + dev->watchdog_timeo = HZ; - dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; /* * We add in INFINIBAND_ALEN to allow for the destination * address "pseudoheader" for skbs without neighbour struct. */ - dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; - dev->addr_len = INFINIBAND_ALEN; - dev->type = ARPHRD_INFINIBAND; - dev->tx_queue_len = ipoib_sendq_size * 2; - dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = ipoib_sendq_size * 2; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; /* MTU will be reset when mcast join happens */ - dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; - priv->mcast_mtu = priv->admin_mtu = dev->mtu; + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 9bcfc7a..858ada1 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -702,7 +702,7 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb) out: if (mcast && mcast->ah) { - if (skb->dst && + if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour, @@ -710,7 +710,7 @@ out: if (neigh) { kref_get(&mcast->ah->ref); - neigh->ah = mcast->ah; + neigh->ah = mcast->ah; list_add_tail(&neigh->list, &mcast->neigh_list); } } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 3c6e45d..b6848a8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -197,12 +197,12 @@ int 
ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; - priv->tx_sge.lkey = priv->mr->lkey; + priv->tx_sge.lkey = priv->mr->lkey; - priv->tx_wr.opcode = IB_WR_SEND; - priv->tx_wr.sg_list = &priv->tx_sge; - priv->tx_wr.num_sge = 1; - priv->tx_wr.send_flags = IB_SEND_SIGNALED; + priv->tx_wr.opcode = IB_WR_SEND; + priv->tx_wr.sg_list = &priv->tx_sge; + priv->tx_wr.num_sge = 1; + priv->tx_wr.send_flags = IB_SEND_SIGNALED; return 0; -- 1.5.3.2 From rolandd at cisco.com Fri Oct 26 15:33:31 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 26 Oct 2007 15:33:31 -0700 Subject: [ofa-general] [PATCH 2/4] [RFC] IPoIB/cm: Factor out ipoib_cm_free_rx_ring() In-Reply-To: <200710261533.Xn1nnRrnrNQK8TWk@cisco.com> Message-ID: <200710261533.BZP9vEVlKDEKjWp4@cisco.com> Factor out the code to unmap/free skbs and free the receive ring in ipoib_cm_dev_cleanup() into a new function ipoib_cm_free_rx_ring(). This function will be called from a couple of other places when support for devices that don't implement SRQs is added. 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 28 +++++++++++++++++++--------- 1 files changed, 19 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 2811554..d2ba7bb 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -155,6 +155,22 @@ partial_error: return NULL; } +static void ipoib_cm_free_rx_ring(struct net_device *dev, + struct ipoib_cm_rx_buf *rx_ring) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < ipoib_recvq_size; ++i) + if (rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ring[i].mapping); + dev_kfree_skb_any(rx_ring[i].skb); + } + + kfree(rx_ring); +} + static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv *priv) { struct ib_send_wr *bad_wr; @@ -1328,7 +1344,7 @@ int ipoib_cm_dev_init(struct net_device *dev) void ipoib_cm_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - int i, ret; + int ret; if (!priv->cm.srq) return; @@ -1342,13 +1358,7 @@ void ipoib_cm_dev_cleanup(struct net_device *dev) priv->cm.srq = NULL; if (!priv->cm.srq_ring) return; - for (i = 0; i < ipoib_recvq_size; ++i) - if (priv->cm.srq_ring[i].skb) { - ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, - priv->cm.srq_ring[i].mapping); - dev_kfree_skb_any(priv->cm.srq_ring[i].skb); - priv->cm.srq_ring[i].skb = NULL; - } - kfree(priv->cm.srq_ring); + + ipoib_cm_free_rx_ring(dev, priv->cm.srq_ring); priv->cm.srq_ring = NULL; } -- 1.5.3.2 From rolandd at cisco.com Fri Oct 26 15:33:31 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 26 Oct 2007 15:33:31 -0700 Subject: [ofa-general] [PATCH 3/4] [RFC] IPoIB/cm: Factor out ipoib_cm_create_srq() In-Reply-To: <200710261533.BZP9vEVlKDEKjWp4@cisco.com> Message-ID: <200710261533.YxMqwLlaYvO6fxi2@cisco.com> Factor out the code to create an SRQ and free the receive ring in ipoib_cm_dev_init() 
into a new function ipoib_cm_create_srq(). This will make the code neater when support for devices that don't implement SRQs is added. Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 47 +++++++++++++++++++----------- 1 files changed, 30 insertions(+), 17 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index d2ba7bb..d4a867d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -1271,7 +1271,7 @@ int ipoib_cm_add_mode_attr(struct net_device *dev) return device_create_file(&dev->dev, &dev_attr_mode); } -int ipoib_cm_dev_init(struct net_device *dev) +static int ipoib_cm_create_srq(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_srq_init_attr srq_init_attr = { @@ -1280,6 +1280,31 @@ int ipoib_cm_dev_init(struct net_device *dev) .max_sge = IPOIB_CM_RX_SG } }; + int ret; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ib_destroy_srq(priv->cm.srq); + priv->cm.srq = NULL; + return -ENOMEM; + } + + return 0; +} + +int ipoib_cm_dev_init(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); int ret, i; INIT_LIST_HEAD(&priv->cm.passive_ids); @@ -1297,22 +1322,6 @@ int ipoib_cm_dev_init(struct net_device *dev) skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; - return ret; - } - - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - printk(KERN_WARNING 
"%s: failed to allocate CM ring (%d entries)\n", - priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].lkey = priv->mr->lkey; @@ -1323,6 +1332,10 @@ int ipoib_cm_dev_init(struct net_device *dev) priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; + ret = ipoib_cm_create_srq(dev); + if (ret) + return ret; + for (i = 0; i < ipoib_recvq_size; ++i) { if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { -- 1.5.3.2 From rolandd at cisco.com Fri Oct 26 15:33:31 2007 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 26 Oct 2007 15:33:31 -0700 Subject: [ofa-general] [PATCH 4/4] [RFC] IPoIB/cm: Add connected mode support for devices without SRQs In-Reply-To: <200710261533.YxMqwLlaYvO6fxi2@cisco.com> Message-ID: <200710261533.UlO70kYhcNvuPmut@cisco.com> Some IB adapters (notably IBM's eHCA) do not implement SRQs (shared receive queues). The current IPoIB connected mode support only works on devices that support SRQs. Fix this by adding support for using the receive queue of each connected mode receive QP. The disadvantage of this compared to using an SRQ is that it means a full queue of receives must be posted for each remote connected mode peer, which means that total memory usage is potentially much higher than when using SRQs. To manage this, add a new module parameter "max_nonsrq_conn_qp" that limits the number of connections allowed per interface. The rest of the changes are fairly straightforward: we use a table of struct ipoib_cm_rx to hold all the active connections, and put the table index of the connection in the high bits of receive WR IDs. This is needed because we cannot rely on the struct ib_wc.qp field for non-SRQ receive completions. Most of the rest of the changes just test whether or not an SRQ is available, and post receives or find received packets in the right place depending on the answer. 
Cleaning up dead connections actually becomes simpler, because we do not have to do the "last WQE reached" dance that is required to destroy QPs attached to an SRQ. We just move the QP to the error state and wait for all pending receives to be flushed. Signed-off-by: Pradeep Satyanarayana [ Completely rewritten and split up, based on Pradeep's work. Several bugs fixed and no doubt several bugs introduced. - Roland ] Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib.h | 20 +++ drivers/infiniband/ulp/ipoib/ipoib_cm.c | 221 +++++++++++++++++++++++----- drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 + drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 8 +- 4 files changed, 212 insertions(+), 39 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index a376fb6..93867db 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -69,6 +69,7 @@ enum { IPOIB_TX_RING_SIZE = 64, IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, + IPOIB_CM_MAX_CONN_QP = 4096, IPOIB_NUM_WC = 4, @@ -188,10 +189,13 @@ enum ipoib_cm_state { struct ipoib_cm_rx { struct ib_cm_id *id; struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; struct list_head list; struct net_device *dev; unsigned long jiffies; enum ipoib_cm_state state; + int index; + int recv_count; }; struct ipoib_cm_tx { @@ -234,6 +238,7 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_table; }; /* @@ -461,6 +466,8 @@ void ipoib_drain_cq(struct net_device *dev); /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC)) +extern int ipoib_max_conn_qp; + static inline int ipoib_cm_admin_enabled(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -491,6 +498,12 @@ static inline void ipoib_cm_set(struct ipoib_neigh *neigh, struct ipoib_cm_tx *t neigh->cm = tx; } 
+static inline int ipoib_cm_has_srq(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + return !!priv->cm.srq; +} + void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx); int ipoib_cm_dev_open(struct net_device *dev); void ipoib_cm_dev_stop(struct net_device *dev); @@ -508,6 +521,8 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc); struct ipoib_cm_tx; +#define ipoib_max_conn_qp 0 + static inline int ipoib_cm_admin_enabled(struct net_device *dev) { return 0; @@ -533,6 +548,11 @@ static inline void ipoib_cm_set(struct ipoib_neigh *neigh, struct ipoib_cm_tx *t { } +static inline int ipoib_cm_has_srq(struct net_device *dev) +{ + return 0; +} + static inline void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx) { diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index d4a867d..fbe01b8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -39,6 +39,13 @@ #include #include +int ipoib_max_conn_qp = 128; + +module_param_named(max_nonsrq_conn_qp, ipoib_max_conn_qp, int, 0444); +MODULE_PARM_DESC(max_nonsrq_conn_qp, + "Max number of connected-mode QPs per interface " + "(applied only if shared receive queue is not available)"); + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA static int data_debug_level; @@ -81,7 +88,7 @@ static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, int frags, ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int ipoib_cm_post_receive_srq(struct net_device *dev, int id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; @@ -104,7 +111,35 @@ static int ipoib_cm_post_receive(struct net_device *dev, int id) return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int 
ipoib_cm_post_receive_nonsrq(struct net_device *dev, int id, int index) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_rx *rx; + struct ib_recv_wr *bad_wr; + int i, ret; + + rx = priv->cm.rx_table[index]; + + priv->cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV | + ((u64) index << 32); + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx->rx_ring[id].mapping[i]; + + ret = ib_post_recv(rx->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx->rx_ring[id].mapping); + dev_kfree_skb_any(rx->rx_ring[id].skb); + rx->rx_ring[id].skb = NULL; + } + + return ret; +} + +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, + int index, int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -141,7 +176,10 @@ static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (ipoib_cm_has_srq(dev)) + priv->cm.srq_ring[id].skb = skb; + else + priv->cm.rx_table[index]->rx_ring[id].skb = skb; return skb; partial_error: @@ -224,12 +262,18 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev, .qp_type = IB_QPT_RC, .qp_context = p, }; + + if (!ipoib_cm_has_srq(dev)) { + attr.cap.max_recv_wr = ipoib_recvq_size; + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; + } + return ib_create_qp(priv->pd, &attr); } static int ipoib_cm_modify_rx_qp(struct net_device *dev, - struct ib_cm_id *cm_id, struct ib_qp *qp, - unsigned psn) + struct ib_cm_id *cm_id, struct ib_qp *qp, + unsigned psn) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -282,6 +326,60 @@ static int ipoib_cm_modify_rx_qp(struct net_device *dev, return 0; } +static int ipoib_cm_nonsrq_init_rx(struct net_device *dev, struct ib_cm_id *cm_id, + struct ipoib_cm_rx *rx) +{ + struct ipoib_dev_priv 
*priv = netdev_priv(dev); + int index; + int ret; + int i; + + rx->rx_ring = kcalloc(ipoib_recvq_size, sizeof *rx->rx_ring, GFP_KERNEL); + if (!rx->rx_ring) + return -ENOMEM; + + spin_lock_irq(&priv->lock); + + for (index = 0; index < ipoib_max_conn_qp; index++) + if (priv->cm.rx_table[index] == NULL) + break; + + if (index == ipoib_max_conn_qp) { + spin_unlock_irq(&priv->lock); + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); + ret = -EINVAL; + goto err; + } + + priv->cm.rx_table[index] = rx; + + spin_unlock_irq(&priv->lock); + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, index, IPOIB_CM_RX_SG - 1, + rx->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ret = -ENOMEM; + goto err; + } + ret = ipoib_cm_post_receive_nonsrq(dev, i, index); + if (ret) { + ipoib_warn(priv, "ipoib_cm_post_receive_srq " + "failed for buf %d\n", i); + ret = -EIO; + goto err; + } + } + + rx->recv_count = ipoib_recvq_size; + + return 0; + +err: + ipoib_cm_free_rx_ring(dev, rx->rx_ring); + return ret; +} + static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id, struct ib_qp *qp, struct ib_cm_req_event_param *req, unsigned psn) @@ -297,7 +395,7 @@ static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id, rep.private_data_len = sizeof data; rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; - rep.srq = 1; + rep.srq = ipoib_cm_has_srq(dev); rep.qp_num = qp->qp_num; rep.starting_psn = psn; return ib_send_cm_rep(cm_id, &rep); @@ -333,6 +431,12 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even if (ret) goto err_modify; + if (!ipoib_cm_has_srq(dev)) { + ret = ipoib_cm_nonsrq_init_rx(dev, cm_id, p); + if (ret) + goto err_modify; + } + spin_lock_irq(&priv->lock); queue_delayed_work(ipoib_workqueue, &priv->cm.stale_task, IPOIB_CM_RX_DELAY); @@ -417,12 +521,14 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int 
hdr_space, void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_rx_buf *rx_ring; unsigned int wr_id = wc->wr_id & ~(IPOIB_OP_CM | IPOIB_OP_RECV); struct sk_buff *skb, *newskb; struct ipoib_cm_rx *p; unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; int frags; + int index; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); @@ -440,14 +546,34 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) return; } - skb = priv->cm.srq_ring[wr_id].skb; + if (ipoib_cm_has_srq(dev)) { + index = -1; + rx_ring = priv->cm.srq_ring; + } else { + index = wc->wr_id >> 32; + rx_ring = priv->cm.rx_table[index]->rx_ring; + } + + skb = rx_ring[wr_id].skb; if (unlikely(wc->status != IB_WC_SUCCESS)) { ipoib_dbg(priv, "cm recv error " "(status=%d, wrid=%d vend_err %x)\n", wc->status, wr_id, wc->vendor_err); ++dev->stats.rx_dropped; - goto repost; + if (index < 0) + goto repost; + else { + if (!--priv->cm.rx_table[index]->recv_count) { + spin_lock_irqsave(&priv->lock, flags); + list_move(&priv->cm.rx_table[index]->list, + &priv->cm.rx_reap_list); + priv->cm.rx_table[index] = NULL; + spin_unlock_irqrestore(&priv->lock, flags); + queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); + } + return; + } } if (unlikely(!(wr_id & IPOIB_CM_RX_UPDATE_MASK))) { @@ -466,7 +592,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; - newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, 0, frags, mapping); if (unlikely(!newskb)) { /* * If we can't allocate a new RX buffer, dump @@ -477,8 +603,8 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) goto repost; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 
1) * sizeof *mapping); + ipoib_cm_dma_unmap_rx(priv, frags, rx_ring[wr_id].mapping); + memcpy(rx_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); @@ -499,9 +625,17 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) netif_receive_skb(skb); repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); + if (index < 0) { + if (unlikely(ipoib_cm_post_receive_srq(dev, wr_id))) + ipoib_warn(priv, "ipoib_cm_post_receive_srq failed " + "for buf %d\n", wr_id); + } else { + if (unlikely(ipoib_cm_post_receive_nonsrq(dev, wr_id, index))) { + --priv->cm.rx_table[index]->recv_count; + ipoib_warn(priv, "ipoib_cm_post_receive_nonsrq failed " + "for buf %d\n", wr_id); + } + } } static inline int post_send(struct ipoib_dev_priv *priv, @@ -729,6 +863,8 @@ void ipoib_cm_dev_stop(struct net_device *dev) list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); + if (!ipoib_cm_has_srq(dev)) + priv->cm.rx_table[p->index] = NULL; kfree(p); } @@ -853,7 +989,7 @@ static int ipoib_cm_send_req(struct net_device *dev, req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + req.srq = ipoib_cm_has_srq(dev); return ib_send_cm_req(id, &req); } @@ -1194,6 +1330,8 @@ static void ipoib_cm_rx_reap(struct work_struct *work) list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); + if (!ipoib_cm_has_srq(priv->dev)) + ipoib_cm_free_rx_ring(priv->dev, p->rx_ring); kfree(p); } } @@ -1271,7 +1409,7 @@ int ipoib_cm_add_mode_attr(struct net_device *dev) return device_create_file(&dev->dev, &dev_attr_mode); } -static int ipoib_cm_create_srq(struct net_device *dev) +static void ipoib_cm_create_srq(struct net_device *dev) { struct ipoib_dev_priv 
*priv = netdev_priv(dev); struct ib_srq_init_attr srq_init_attr = { @@ -1280,32 +1418,30 @@ static int ipoib_cm_create_srq(struct net_device *dev) .max_sge = IPOIB_CM_RX_SG } }; - int ret; priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); + if (PTR_ERR(priv->cm.srq) != -ENOSYS) + printk(KERN_WARNING "%s: failed to allocate SRQ, error %ld\n", + priv->ca->name, PTR_ERR(priv->cm.srq)); priv->cm.srq = NULL; - return ret; + return; } priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, GFP_KERNEL); if (!priv->cm.srq_ring) { - printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", + printk(KERN_WARNING "%s: failed to allocate CM SRQ ring (%d entries)\n", priv->ca->name, ipoib_recvq_size); ib_destroy_srq(priv->cm.srq); priv->cm.srq = NULL; - return -ENOMEM; } - - return 0; } int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - int ret, i; + int i; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1332,21 +1468,32 @@ int ipoib_cm_dev_init(struct net_device *dev) priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - ret = ipoib_cm_create_srq(dev); - if (ret) - return ret; + ipoib_cm_create_srq(dev); - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, - priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; + if (ipoib_cm_has_srq(dev)) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, 0, IPOIB_CM_RX_SG - 1, + priv->cm.srq_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate " + "receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + if (ipoib_cm_post_receive_srq(dev, i)) { + ipoib_warn(priv, "ipoib_cm_post_receive_srq " + "failed for buf %d\n", i); + 
ipoib_cm_dev_cleanup(dev); + return -EIO; + } } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + } else { + priv->cm.rx_table = kcalloc(ipoib_max_conn_qp, + sizeof *priv->cm.rx_table, + GFP_KERNEL); + if (!priv->cm.rx_table) { + ipoib_warn(priv, "Failed to allocate rx_table\n"); + return -ENOMEM; } } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index f31f419..623458e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1266,6 +1266,8 @@ static int __init ipoib_init_module(void) ipoib_sendq_size = min(ipoib_sendq_size, IPOIB_MAX_QUEUE_SIZE); ipoib_sendq_size = max(ipoib_sendq_size, IPOIB_MIN_QUEUE_SIZE); + ipoib_max_conn_qp = min(ipoib_max_conn_qp, IPOIB_CM_MAX_CONN_QP); + ret = ipoib_register_debugfs(); if (ret) return ret; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index b6848a8..433e99a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -172,8 +172,12 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) size = ipoib_sendq_size + ipoib_recvq_size + 1; ret = ipoib_cm_dev_init(dev); - if (!ret) - size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; + if (!ret) { + if (ipoib_cm_has_srq(dev)) + size += ipoib_recvq_size + 1; /* 1 extra for rx_drain_qp */ + else + size += ipoib_recvq_size * ipoib_max_conn_qp; + } priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { -- 1.5.3.2 From sashak at voltaire.com Fri Oct 26 15:59:49 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 27 Oct 2007 00:59:49 +0200 Subject: [ofa-general] Re: [PATCH RFC] libibumad: support for new pkey enabled user_mad API In-Reply-To: References: <46EACC6B.5060702@ichips.intel.com> 
<1190034864.6272.86.camel@hrosenstock-ws.xsigo.com> <20071026223002.GB22317@sashak.voltaire.com> Message-ID: <20071026225949.GD22317@sashak.voltaire.com> On 15:22 Fri 26 Oct , Roland Dreier wrote: > Looks great to me. I'm testing this now. It works fine with a new kernel, but breaks with an old one - a few more lines are needed. V2 is coming soon. Sasha From sashak at voltaire.com Fri Oct 26 16:07:44 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 27 Oct 2007 01:07:44 +0200 Subject: [ofa-general] [PATCH v2 RFC] libibumad: support for new pkey enabled user_mad API In-Reply-To: <20071026223002.GB22317@sashak.voltaire.com> References: <46EACC6B.5060702@ichips.intel.com> <1190034864.6272.86.camel@hrosenstock-ws.xsigo.com> <20071026223002.GB22317@sashak.voltaire.com> Message-ID: <20071026230744.GE22317@sashak.voltaire.com> This adds support for new pkey enabled user_mad API. When ABI version is 5 this tries to use IB_USER_MAD_ENABLE_PKEY ioctl(). Signed-off-by: Sasha Khapyorsky --- libibumad/include/infiniband/umad.h | 4 ++- libibumad/src/umad.c | 65 +++++++++++++++++++++++------------ 2 files changed, 46 insertions(+), 23 deletions(-) diff --git a/libibumad/include/infiniband/umad.h b/libibumad/include/infiniband/umad.h index 2ec8b37..21cf729 100644 --- a/libibumad/include/infiniband/umad.h +++ b/libibumad/include/infiniband/umad.h @@ -60,6 +60,8 @@ typedef struct ib_mad_addr { uint8_t traffic_class; uint8_t gid[16]; uint32_t flow_label; + uint16_t pkey_index; + uint8_t reserved[6]; } ib_mad_addr_t; typedef struct ib_user_mad { @@ -80,8 +82,8 @@ typedef struct ib_user_mad { #define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 1, \ struct ib_user_mad_reg_req) - #define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, uint32_t) +#define IB_USER_MAD_ENABLE_PKEY _IO(IB_IOCTL_MAGIC, 3) #define UMAD_CA_NAME_LEN 20 #define UMAD_CA_MAX_PORTS 10 /* 0 - 9 */ diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 41373e7..9d9f9c3 100644 --- a/libibumad/src/umad.c 
+++ b/libibumad/src/umad.c @@ -85,6 +85,9 @@ int umaddebug = 0; static char *def_ca_name = "mthca0"; static int def_ca_port = 1; +static unsigned abi_version; +static unsigned new_user_mad_api; + /************************************* * Port */ @@ -428,16 +431,14 @@ dev_to_umad_id(char *dev, unsigned port) int umad_init(void) { - unsigned abi_version; - TRACE("umad_init"); if (sys_read_uint(IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE, &abi_version) < 0) { IBWARN("can't read ABI version from %s/%s (%m): is ib_umad module loaded?", IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE); return -1; } - if (abi_version != IB_UMAD_ABI_VERSION) { - IBWARN("wrong ABI version: %s/%s is %d but library ABI is %d", + if (abi_version < IB_UMAD_ABI_VERSION) { + IBWARN("wrong ABI version: %s/%s is %d but library minimal ABI is %d", IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE, abi_version, IB_UMAD_ABI_VERSION); return -1; } @@ -554,6 +555,21 @@ umad_open_port(char *ca_name, int portnum) return -EIO; } + if (abi_version > 5) + new_user_mad_api = 1; + else { + int ret = ioctl(fd, IB_USER_MAD_ENABLE_PKEY, NULL); + if (ret == 0) + new_user_mad_api = 1; + else if (ret < 0 && errno == EINVAL) + new_user_mad_api = 0; + else { + close(fd); + IBWARN("cannot detect is user_mad P_Key enabled API supported."); + return ret; + } + } + DEBUG("opened %s fd %d portid %d", dev_file, fd, umad_id); return fd; } @@ -636,13 +652,15 @@ umad_close_port(int fd) void * umad_get_mad(void *umad) { - return ((struct ib_user_mad *)umad)->data; + return new_user_mad_api ? ((struct ib_user_mad *)umad)->data : + (void *)&((struct ib_user_mad *)umad)->addr.pkey_index; } size_t umad_size(void) { - return sizeof (struct ib_user_mad); + return new_user_mad_api ? 
sizeof (struct ib_user_mad) : + sizeof(struct ib_user_mad) - 8; } int @@ -663,11 +681,13 @@ umad_set_grh(void *umad, void *mad_addr) } int -umad_set_pkey(void *umad, int pkey) +umad_set_pkey(void *umad, int pkey_index) { -#if 0 - mad->addr.pkey = 0; /* FIXME - PKEY support */ -#endif + struct ib_user_mad *mad = umad; + + if (new_user_mad_api) + mad->addr.pkey_index = htons(pkey_index); + return 0; } @@ -719,12 +739,12 @@ umad_send(int fd, int agentid, void *umad, int length, if (umaddebug > 1) umad_dump(mad); - n = write(fd, mad, length + sizeof *mad); - if (n == length + sizeof *mad) + n = write(fd, mad, length + umad_size()); + if (n == length + umad_size()) return 0; DEBUG("write returned %d != sizeof umad %zu + length %d (%m)", - n, sizeof *mad, length); + n, umad_size(), length); if (!errno) errno = EIO; return -EIO; @@ -768,14 +788,14 @@ umad_recv(int fd, void *umad, int *length, int timeout_ms) return n; } - n = read(fd, umad, sizeof *mad + *length); + n = read(fd, umad, umad_size() + *length); - VALGRIND_MAKE_MEM_DEFINED(umad, sizeof *mad + *length); + VALGRIND_MAKE_MEM_DEFINED(umad, umad_size() + *length); - if ((n >= 0) && (n <= sizeof *mad + *length)) { + if ((n >= 0) && (n <= umad_size() + *length)) { DEBUG("mad received by agent %d length %d", mad->agent_id, n); - if (n > sizeof *mad) - *length = n - sizeof *mad; + if (n > umad_size()) + *length = n - umad_size(); else *length = 0; return mad->agent_id; @@ -788,9 +808,9 @@ umad_recv(int fd, void *umad, int *length, int timeout_ms) } DEBUG("read returned %zu > sizeof umad %zu + length %d (%m)", - mad->length - sizeof *mad, sizeof *mad, *length); + mad->length - umad_size(), umad_size(), *length); - *length = mad->length - sizeof *mad; + *length = mad->length - umad_size(); if (!errno) errno = EIO; return -errno; @@ -929,11 +949,12 @@ umad_addr_dump(ib_mad_addr_t *addr) } gid_str[i*2] = 0; IBWARN("qpn %d qkey 0x%x lid 0x%x sl %d\n" - "grh_present %d gid_index %d hop_limit %d traffic_class %d flow_label 
0x%x\n" + "grh_present %d gid_index %d hop_limit %d traffic_class %d flow_label 0x%x pkey_index 0x%x\n" "Gid 0x%s", ntohl(addr->qpn), ntohl(addr->qkey), ntohs(addr->lid), addr->sl, addr->grh_present, (int)addr->gid_index, (int)addr->hop_limit, - (int)addr->traffic_class, addr->flow_label, gid_str); + (int)addr->traffic_class, addr->flow_label, addr->pkey_index, + gid_str); } void -- 1.5.3.4.206.g58ba4 From rdreier at cisco.com Fri Oct 26 16:23:23 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 26 Oct 2007 16:23:23 -0700 Subject: [ofa-general] Re: [PATCH 1/14 v2] nes: module and device initialization In-Reply-To: <200710192001.l9JK1U8O021689@neteffect.com> (ggrundstrom@neteffect.com's message of "Fri, 19 Oct 2007 15:01:30 -0500") References: <200710192001.l9JK1U8O021689@neteffect.com> Message-ID: OK, a couple quick review comments and a process comment too: - First step in the driver is to kill off a lot of the #ifdefs: > +#ifdef IRQF_SHARED The upstream driver really shouldn't have compatibility gunk for older kernels... just make it build against the kernel it's in. > +#ifdef OFED_1_2 Same... kernel code shouldn't worry about OFED. > +#ifdef CONFIG_PCI_MSI > + if (nesdev->msi_enabled) { > + pci_disable_msi(pcidev); > + } > +#endif This can be much simpler, because pci_disable_msi() is always available and is a NOP if the config option is off or MSI is not enabled. So you can just unconditionally do pci_disable_msi(pcidev); > +#ifdef NES_NAPI I don't see anything that defines NES_NAPI. I think for the final merge we want a NAPI-only driver (ie no ifdef at all)... is there any performance or other reason to ever build a non-NAPI driver (for a modern kernel)? OK, on a process level, my plan is to pull the current driver into a "neteffect" branch in my git tree with the intention of merging it for 2.6.25. I'll let you know when that's ready (probably early next week). 
I'll probably do some cleanups there, and you can send me cleanup/fix patches against that branch any time too. We should try to keep the cycle time short: the interval between the first posting of this driver and the current one was pretty long, and there's a lot of cleanup to do to get ready for the next merge window. Does that plan make sense? - R. From panda at cse.ohio-state.edu Fri Oct 26 19:32:03 2007 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Fri, 26 Oct 2007 22:32:03 -0400 (EDT) Subject: [ofa-general] MVAPICH 1.0-beta is available Message-ID: <200710270232.l9R2W3TV001243@xi.cse.ohio-state.edu> The MVAPICH team is pleased to announce the availability of MVAPICH 1.0-beta with the following NEW features: - New OpenFabrics Gen2 Unreliable-Datagram (UD)-based design for large-scale InfiniBand clusters (multi-thousand cores) - delivers performance and scalability with constant memory footprint for communication contexts - zero-copy protocol for large data transfer - shared memory communication between cores within a node - multi-core optimized collectives (MPI_Bcast, MPI_Barrier, MPI_Reduce and MPI_Allreduce) - New features for OpenFabrics Gen2-IB interface - support for asynchronous progress at both sender and receiver to overlap computation and communication - support for ConnectX adapter - multi-core optimized collectives (MPI_Bcast) - tuned collectives (MPI_AlltoAll, MPI_Bcast) based on network adapter characteristics - network-level fault tolerance with Automatic Path Migration (APM) for tolerating intermittent network failures over InfiniBand. 
- enhanced coalescing support with varying degree of coalescing - enhanced mpirun_rsh to provide scalable launching on multi-thousand node clusters - New Support for Qlogic InfiniPath adapters - high-performance point-to-point communication - optimized collectives (MPI_Bcast and MPI_Barrier) with k-nomial algorithms while exploiting multi-core architecture For downloading MVAPICH 1.0-beta, associated user guide and accessing the anonymous SVN, please visit the following URL: http://mvapich.cse.ohio-state.edu All feedbacks, including bug reports and hints for performance tuning, are welcome. Please post it to the mvapich-discuss mailing list. Thanks, The MVAPICH Team
From vlad at lists.openfabrics.org Sat Oct 27 02:53:39 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 27 Oct 2007 02:53:39 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071027-0200 daily build status Message-ID: <20071027095339.99FC9E608BA@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.23 Passed on powerpc with linux-2.6.15 Passed on ia64 with
linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From sashak at voltaire.com Sat Oct 27 09:18:41 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 27 Oct 2007 18:18:41 +0200 Subject: [ofa-general] [PATCH] libibumad: support for new pkey enabled user_mad API Message-ID: <20071027161841.GH22317@sashak.voltaire.com> This adds support for new pkey enabled user_mad API. When ABI version is 5 this tries to use IB_USER_MAD_ENABLE_PKEY ioctl(). 
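[Editorial sketch, not part of the posted patch.] The description above can be illustrated with a small stand-alone C fragment. It assumes a structure layout modelled on the diff that follows (the last 8 bytes of the address block are the new pkey_index plus 6 reserved bytes); all demo_* names are hypothetical, and the real ioctl() call is replaced by passing its result in as plain arguments so the decision logic can be read in isolation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patched structures. Field names follow
 * the diff, but this is NOT the real <infiniband/umad.h>. */
struct demo_mad_addr {
	uint32_t qpn;
	uint32_t qkey;
	uint16_t lid;
	uint8_t  sl;
	uint8_t  path_bits;
	uint8_t  grh_present;
	uint8_t  gid_index;
	uint8_t  hop_limit;
	uint8_t  traffic_class;
	uint8_t  gid[16];
	uint32_t flow_label;
	uint16_t pkey_index;   /* new in the P_Key-enabled layout */
	uint8_t  reserved[6];  /* new in the P_Key-enabled layout */
};

struct demo_user_mad {
	uint32_t agent_id;
	uint32_t status;
	uint32_t timeout_ms;
	uint32_t retries;
	uint32_t length;
	struct demo_mad_addr addr;
	uint8_t  data[];       /* MAD payload follows the header */
};

/* Mirror of the detection logic: ABI above 5 always means the new layout;
 * otherwise the result of the enable ioctl decides.
 * Returns 1 (new API), 0 (old API) or -1 (cannot tell). */
static int demo_detect_new_api(unsigned abi_version, int ioctl_ret,
			       int ioctl_errno)
{
	if (abi_version > 5)
		return 1;
	if (ioctl_ret == 0)
		return 1;
	if (ioctl_ret < 0 && ioctl_errno == EINVAL)
		return 0;
	return -1;
}

/* Header size seen by read()/write(): 8 bytes shorter with the old API. */
static size_t demo_umad_size(int new_api)
{
	return new_api ? sizeof(struct demo_user_mad)
		       : sizeof(struct demo_user_mad) - 8;
}

/* Start of the MAD payload: with the old API it begins where the
 * pkey_index field sits in the new layout. */
static void *demo_umad_get_mad(struct demo_user_mad *umad, int new_api)
{
	assert(umad != NULL);
	return new_api ? (void *)umad->data
		       : (void *)&umad->addr.pkey_index;
}
```

Under these assumptions the payload offset always equals the header size, which is why the patch can substitute umad_size() for sizeof *mad in the read and write paths.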
Signed-off-by: Sasha Khapyorsky --- libibumad/include/infiniband/umad.h | 4 ++- libibumad/src/umad.c | 65 +++++++++++++++++++++++------------ 2 files changed, 46 insertions(+), 23 deletions(-) diff --git a/libibumad/include/infiniband/umad.h b/libibumad/include/infiniband/umad.h index 2ec8b37..21cf729 100644 --- a/libibumad/include/infiniband/umad.h +++ b/libibumad/include/infiniband/umad.h @@ -60,6 +60,8 @@ typedef struct ib_mad_addr { uint8_t traffic_class; uint8_t gid[16]; uint32_t flow_label; + uint16_t pkey_index; + uint8_t reserved[6]; } ib_mad_addr_t; typedef struct ib_user_mad { @@ -80,8 +82,8 @@ typedef struct ib_user_mad { #define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 1, \ struct ib_user_mad_reg_req) - #define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, uint32_t) +#define IB_USER_MAD_ENABLE_PKEY _IO(IB_IOCTL_MAGIC, 3) #define UMAD_CA_NAME_LEN 20 #define UMAD_CA_MAX_PORTS 10 /* 0 - 9 */ diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 41373e7..9d9f9c3 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -85,6 +85,9 @@ int umaddebug = 0; static char *def_ca_name = "mthca0"; static int def_ca_port = 1; +static unsigned abi_version; +static unsigned new_user_mad_api; + /************************************* * Port */ @@ -428,16 +431,14 @@ dev_to_umad_id(char *dev, unsigned port) int umad_init(void) { - unsigned abi_version; - TRACE("umad_init"); if (sys_read_uint(IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE, &abi_version) < 0) { IBWARN("can't read ABI version from %s/%s (%m): is ib_umad module loaded?", IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE); return -1; } - if (abi_version != IB_UMAD_ABI_VERSION) { - IBWARN("wrong ABI version: %s/%s is %d but library ABI is %d", + if (abi_version < IB_UMAD_ABI_VERSION) { + IBWARN("wrong ABI version: %s/%s is %d but library minimal ABI is %d", IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE, abi_version, IB_UMAD_ABI_VERSION); return -1; } @@ -554,6 +555,21 @@ umad_open_port(char *ca_name, int portnum) return 
-EIO; } + if (abi_version > 5) + new_user_mad_api = 1; + else { + int ret = ioctl(fd, IB_USER_MAD_ENABLE_PKEY, NULL); + if (ret == 0) + new_user_mad_api = 1; + else if (ret < 0 && errno == EINVAL) + new_user_mad_api = 0; + else { + close(fd); + IBWARN("cannot detect is user_mad P_Key enabled API supported."); + return ret; + } + } + DEBUG("opened %s fd %d portid %d", dev_file, fd, umad_id); return fd; } @@ -636,13 +652,15 @@ umad_close_port(int fd) void * umad_get_mad(void *umad) { - return ((struct ib_user_mad *)umad)->data; + return new_user_mad_api ? ((struct ib_user_mad *)umad)->data : + (void *)&((struct ib_user_mad *)umad)->addr.pkey_index; } size_t umad_size(void) { - return sizeof (struct ib_user_mad); + return new_user_mad_api ? sizeof (struct ib_user_mad) : + sizeof(struct ib_user_mad) - 8; } int @@ -663,11 +681,13 @@ umad_set_grh(void *umad, void *mad_addr) } int -umad_set_pkey(void *umad, int pkey) +umad_set_pkey(void *umad, int pkey_index) { -#if 0 - mad->addr.pkey = 0; /* FIXME - PKEY support */ -#endif + struct ib_user_mad *mad = umad; + + if (new_user_mad_api) + mad->addr.pkey_index = htons(pkey_index); + return 0; } @@ -719,12 +739,12 @@ umad_send(int fd, int agentid, void *umad, int length, if (umaddebug > 1) umad_dump(mad); - n = write(fd, mad, length + sizeof *mad); - if (n == length + sizeof *mad) + n = write(fd, mad, length + umad_size()); + if (n == length + umad_size()) return 0; DEBUG("write returned %d != sizeof umad %zu + length %d (%m)", - n, sizeof *mad, length); + n, umad_size(), length); if (!errno) errno = EIO; return -EIO; @@ -768,14 +788,14 @@ umad_recv(int fd, void *umad, int *length, int timeout_ms) return n; } - n = read(fd, umad, sizeof *mad + *length); + n = read(fd, umad, umad_size() + *length); - VALGRIND_MAKE_MEM_DEFINED(umad, sizeof *mad + *length); + VALGRIND_MAKE_MEM_DEFINED(umad, umad_size() + *length); - if ((n >= 0) && (n <= sizeof *mad + *length)) { + if ((n >= 0) && (n <= umad_size() + *length)) { DEBUG("mad 
received by agent %d length %d", mad->agent_id, n); - if (n > sizeof *mad) - *length = n - sizeof *mad; + if (n > umad_size()) + *length = n - umad_size(); else *length = 0; return mad->agent_id; @@ -788,9 +808,9 @@ umad_recv(int fd, void *umad, int *length, int timeout_ms) } DEBUG("read returned %zu > sizeof umad %zu + length %d (%m)", - mad->length - sizeof *mad, sizeof *mad, *length); + mad->length - umad_size(), umad_size(), *length); - *length = mad->length - sizeof *mad; + *length = mad->length - umad_size(); if (!errno) errno = EIO; return -errno; @@ -929,11 +949,12 @@ umad_addr_dump(ib_mad_addr_t *addr) } gid_str[i*2] = 0; IBWARN("qpn %d qkey 0x%x lid 0x%x sl %d\n" - "grh_present %d gid_index %d hop_limit %d traffic_class %d flow_label 0x%x\n" + "grh_present %d gid_index %d hop_limit %d traffic_class %d flow_label 0x%x pkey_index 0x%x\n" "Gid 0x%s", ntohl(addr->qpn), ntohl(addr->qkey), ntohs(addr->lid), addr->sl, addr->grh_present, (int)addr->gid_index, (int)addr->hop_limit, - (int)addr->traffic_class, addr->flow_label, gid_str); + (int)addr->traffic_class, addr->flow_label, addr->pkey_index, + gid_str); } void -- 1.5.3.4.206.g58ba4 From sashak at voltaire.com Sat Oct 27 09:19:44 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 27 Oct 2007 18:19:44 +0200 Subject: [ofa-general] Re: [PATCH 1/6] infiniband-diags/configure.in: fix comment In-Reply-To: <20071025114317.2010de53.weiny2@llnl.gov> References: <20071025114317.2010de53.weiny2@llnl.gov> Message-ID: <20071027161944.GI22317@sashak.voltaire.com> On 11:43 Thu 25 Oct , Ira Weiny wrote: > From b338078dc970c09513dd1d3023bebff334010c05 Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Fri, 19 Oct 2007 11:22:40 -0700 > Subject: [PATCH] infiniband-diags/configure.in: fix comment > > > Signed-off-by: Ira K. Weiny Applied. Thanks. 
Sasha From sashak at voltaire.com Sat Oct 27 09:20:09 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 27 Oct 2007 18:20:09 +0200 Subject: [ofa-general] Re: [PATCH 6/6] Allow for a special value of "(null)" in the opts file. In-Reply-To: <20071025114346.5902acc9.weiny2@llnl.gov> References: <20071025114346.5902acc9.weiny2@llnl.gov> Message-ID: <20071027162009.GJ22317@sashak.voltaire.com> On 11:43 Thu 25 Oct , Ira Weiny wrote: > From 3df0056cce46e521dc9f0ab07c55a41cef6f340c Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Thu, 25 Oct 2007 10:09:54 -0700 > Subject: [PATCH] Allow for a special value of "(null)" in the opts file. > > Some string values are valid if they are "(null)". Special case this string > so that it sets the pointer to NULL when read. > > Signed-off-by: Ira K. Weiny Applied. Thanks. Sasha From sashak at voltaire.com Sat Oct 27 10:52:27 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 27 Oct 2007 19:52:27 +0200 Subject: [ofa-general] Re: [PATCH 0/6] Add Switch Map support to opensm In-Reply-To: <20071025114303.6d712bcb.weiny2@llnl.gov> References: <20071025114303.6d712bcb.weiny2@llnl.gov> Message-ID: <20071027175227.GK22317@sashak.voltaire.com> Hi Ira, On 11:43 Thu 25 Oct , Ira Weiny wrote: > As I said in another thread. I have added switch-map support to opensm. This > patch series does that in a number of steps. > > Patch: > 1) Simple comment fix (Should be applied on it's own regardless of if the > series is accepted.) > 2) Moves the switch map support to ibcommon but leaves the implementation > alone. (hmm, I thought to remove (split between libibumad and libibmad) libibcommon after OFED-1.3 in order to reduce number of management packages, it is not 100% necessary step however.) > 3) Changes the implementation of the switch map to read the file into memory > to facilitate faster lookups as well as multi-threaded lookups. 
> 4) Add the switch map calls to opensm but leave the creation of the switch > map to be the default one provided by ibcommon (Pass NULL to > create_switch_map) > 5) Add an option to the opts file to specify a switch map. > 6) Allow a special value of "(null)" in the opts file. (This too could be > applied outside of the series.) Thanks for the patches. I applied 1 and 6 and have some thoughts about the others: 1. If we are doing a naming map, why should it be limited to switches only from the beginning? I would prefer to extend it to any node type (since "guid name" records are optional, it will work for "switches only" just as well). Actually, I can see that in the OpenSM-related patch it is done for any node. Then we probably need a name other than "switch-map"; what about "guid2name-map" or "nodename-map"? 2. In-memory optimization is a good thing, but the lookup will still be linear and slow. It is probably acceptable for diags (for most of them only a few nodes need to be resolved), but at least for OpenSM I think a name_by_guid qmap approach is preferable. What do you think? Sasha From johann.george at qlogic.com Sat Oct 27 11:28:27 2007 From: johann.george at qlogic.com (Johann George) Date: Sat, 27 Oct 2007 11:28:27 -0700 Subject: [ofa-general] OpenFabrics Developer's Summit: tentative agenda In-Reply-To: <4720BCA6.9080501@mellanox.co.il> References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> <4720BCA6.9080501@mellanox.co.il> Message-ID: <20071027182827.GA12285@cuprite.pathscale.com> Tziporet, > We may need more then 30m for this > Also - is will be good that this session will be the last one, and then > I can put inside input from all sessions - especially those that speak > on the new features. Since I knew that some people would miss the last few sessions due to flight limitations, I thought that having this session in the morning would allow everyone to attend. Nevertheless, you make some good points.
Let's go with your suggestion and move this to the end. > Liran will not come to the summit. Dror can replace him. It can be good > if Or will work with Dror on this. Sorry to hear that Liran will not be attending. I will put Dror down in his place; and if Or agrees, I'll include him. Thanks much. Johann From yangdong at ncic.ac.cn Sat Oct 27 11:44:33 2007 From: yangdong at ncic.ac.cn (yangdong) Date: Sun, 28 Oct 2007 02:44:33 +0800 Subject: [ofa-general] ibv_get_cq_event problem~ Message-ID: <47238711.6070208@ncic.ac.cn> When I use ibv_get_cq_event in my thread as shown below (this is similar to some code in rping.c), I post a send operation with ibv_post_send, but the thread never gets a CQ event from ibv_get_cq_event. Why? The send is posted successfully; could there be some other error? What can cause a CQ event not to be delivered? void thread() { ... while (1) { ret = ibv_get_cq_event(conn->ac_rdma->cr_send_comp_channel, &ev_cq, &ev_ctx); if (ret) { AMP_ERROR("__amp_rdma_scq_thread:Failed to get scq event!\n"); goto EXIT; } AMP_DEBUG(AMP_DEBUG_MSG, "__amp_rdma_sendcompletion_thread: ibv_get_cq_event.\n"); if (ev_cq != conn->ac_rdma->cr_scq) { AMP_ERROR("__amp_rdma_scq_thread: Unkown SCQ!\n"); ret = -1; goto EXIT; } ret = ibv_req_notify_cq(conn->ac_rdma->cr_scq, 0); if (ret) { AMP_ERROR("__amp_rdma_scq_thread: Failed to set notify!\n"); goto EXIT; } AMP_DEBUG(AMP_DEBUG_MSG, "__amp_rdma_scq_thread: ibv_req_notify_cq.\n"); ret = __amp_sq_cq_reap(conn->ac_rdma, NULL); AMP_DEBUG(AMP_DEBUG_MSG, "__amp_rdma_scq_thread: __amp_rq_cq_reap.\n"); ibv_ack_cq_events(conn->ac_rdma->cr_scq, 1); if (ret) { AMP_ERROR("__amp_rdma_scq_thread: Failed to ack scq event!\n"); goto EXIT; } AMP_DEBUG(AMP_DEBUG_MSG, "__amp_rdma_scq_thread: ibv_ack_cq_events.\n"); } } From yangdong at ncic.ac.cn Sat Oct 27 11:57:34 2007 From: yangdong at ncic.ac.cn (yangdong) Date: Sun, 28 Oct 2007 02:57:34 +0800 Subject: [ofa-general] ibv_get_cq_event problem~ In-Reply-To: <47238711.6070208@ncic.ac.cn> References:
<47238711.6070208@ncic.ac.cn> Message-ID: <47238A1E.4060100@ncic.ac.cn> Is there some error in my makefile? Do I need to include something? CC = cc RM = rm -f AR = ar rvs MV = mv -f RANLIB = ranlib TOP_DIR = /usr/include INC_VERBS = ${TOP_DIR}/infiniband/ INC_RDMACM = ${TOP_DIR}/rdma/ INC_THIS = ./ CFLAGS = -DHAVE_CONFIG_H -I../../include/ -I${INC_VERBS} -I${INC_RDMACM} -I${INC_THIS} -Wall -g -D_GNU_SOURCE -O2 -D__RDMA__ LDFLAGS = -lm -lpthread -libverbs -lrdmacm LIBPATH = ../../lib/ OBJS = amp_interface.o amp_conn.o amp_utcp.o amp_uopenib.o amp_protos.o amp_request.o \ amp_uthread.o amp_help.o LIB = libamp.a .c.o: ${CC} ${CFLAGS} ${EXTRA_CFLAGS} -c $*.c lib: ${OBJS} ${AR} ${LIB} ${OBJS} ${RANLIB} ${LIB} ${MV} ${LIB} ${LIBPATH} clean: ${RM} *.o core ~* *.cpp ~ yangdong wrote: > when i was using ibv_get_cq_event in my thread just as follow (this is > similar to some codes in rping.c), i post a send op by ibv_post_send, > but thread find ibv_get_cq_event cannot get a cq event. why ? send is > post successfully , matbe any other errors ? when i cannot get a cq > event, which may happen? > > void thread() { > ...
> while (1) { > ret = ibv_get_cq_event(conn->ac_rdma->cr_send_comp_channel, &ev_cq, > &ev_ctx); > if (ret) { > AMP_ERROR("__amp_rdma_scq_thread:Failed to get scq event!\n"); > goto EXIT; > } > AMP_DEBUG(AMP_DEBUG_MSG, "__amp_rdma_sendcompletion_thread: > ibv_get_cq_event.\n"); > > if (ev_cq != conn->ac_rdma->cr_scq) { > AMP_ERROR("__amp_rdma_scq_thread: Unkown SCQ!\n"); > ret = -1; > goto EXIT; > } > > ret = ibv_req_notify_cq(conn->ac_rdma->cr_scq, 0); > if (ret) { > AMP_ERROR("__amp_rdma_scq_thread: Failed to set notify!\n"); > goto EXIT; > } > AMP_DEBUG(AMP_DEBUG_MSG, "__amp_rdma_scq_thread: ibv_req_notify_cq.\n"); > > ret = __amp_sq_cq_reap(conn->ac_rdma, NULL); > AMP_DEBUG(AMP_DEBUG_MSG, "__amp_rdma_scq_thread: __amp_rq_cq_reap.\n"); > > ibv_ack_cq_events(conn->ac_rdma->cr_scq, 1); > if (ret) { > AMP_ERROR("__amp_rdma_scq_thread: Failed to ack scq event!\n"); > goto EXIT; > } > AMP_DEBUG(AMP_DEBUG_MSG, "__amp_rdma_scq_thread: ibv_ack_cq_events.\n"); > } > } > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From johann.george at qlogic.com Sat Oct 27 11:59:12 2007 From: johann.george at qlogic.com (Johann George) Date: Sat, 27 Oct 2007 11:59:12 -0700 Subject: [promoters] Re: [ofa-general] OpenFabrics Developer's Summit:tentative agenda In-Reply-To: References: <20071024004042.GB10244@cuprite.pathscale.com> <4D97B70CF7F72144881F66DFF4BD7A1202E1FD8C@fmsmsx413.amr.corp.intel.com> Message-ID: <20071027185912.GA12501@cuprite.pathscale.com> Thanks for all the comments on the MPI sessions. Our primary interest should be to make the MPI sessions as valuable as possible to the audience that is attending. My allotment was based on discussion with the presenters having decided to limit it to those MPIs that were included as part of OFED due to time constraints. 
Granted, this was entirely subjective. If a different allotment can be agreed upon, with the sole purpose of maximizing value to attendees, I will be happy to update the agenda. Johann On Wed, Oct 24, 2007 at 07:13:57PM -0400, Jeff Squyres wrote: > Perhaps the total 50 minutes currently allocated to MPI > implementations could be split between all of us who want to > present? This makes 3 so far (i.e., 15 min/ea) -- 4 if HP wants to > present (12 min/ea, or perhaps we could bump up to 60 mins for an > even 15 min/ea). > > > > On Oct 24, 2007, at 7:04 PM, Magro, Bill wrote: > > >If time allowed, we would be happy to give a 10m or so perspective on > >the OFA stack and OFED distribution from the Intel MPI point of view. > > > >Thanks, > > > >--Bill > > > >-----Original Message----- > >From: promoters-bounces at lists.openfabrics.org > >[mailto:promoters-bounces at lists.openfabrics.org] On Behalf Of Johann > >George > >Sent: Tuesday, October 23, 2007 7:41 PM > >To: Jeff Squyres > >Cc: promoters at lists.openfabrics.org; ewg at lists.openfabrics.org; > >general at lists.openfabrics.org; Or Gerlitz > >Subject: [promoters] Re: [ofa-general] OpenFabrics Developer's > >Summit:tentative agenda > > > >Jeff, > > > >>Is there any intent for HP MPI or Intel MPI to speak? I would be > >>interested to hear what they have to say (e.g., feedback on the OFED > >>stack vs. other network stacks and other status update kinds of > >>things). > > > >We considered it but given the time constraints, thought we should > >wait until Sonoma. Priority was given to OpenMPI and MVAPICH since > >they are being shipped as part of OFED. Still, as you point out, > >getting feedback on their view of OFED vs. other networking stacks > >could be valuable. 
> > > >Johann > >_______________________________________________ > >promoters mailing list > >promoters at lists.openfabrics.org > >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/promoters > > > -- > Jeff Squyres > Cisco Systems
From sashak at voltaire.com Sat Oct 27 16:45:22 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 28 Oct 2007 01:45:22 +0200 Subject: [ofa-general] Re: [PATCH 3/3] osm: QoS - parsing port names In-Reply-To: <20071025091554.150750aa.weiny2@llnl.gov> References: <20071015035309.GN12364@sashak.voltaire.com> <47132943.9090301@dev.mellanox.co.il> <20071015103918.GO12364@sashak.voltaire.com> <1193233760.22038.80.camel@hrosenstock-ws.xsigo.com> <20071024153957.GR7088@sashak.voltaire.com> <4720928F.3050002@dev.mellanox.co.il> <1193319126.31872.94.camel@hrosenstock-ws.xsigo.com> <4720A9E8.4010300@dev.mellanox.co.il> <1193323435.31872.128.camel@hrosenstock-ws.xsigo.com> <20071025091554.150750aa.weiny2@llnl.gov> Message-ID: <20071027234522.GL22317@sashak.voltaire.com> Hi Ira, On 09:15 Thu 25 Oct , Ira Weiny wrote: > > 3) I am in the process of using the new event plugin interface to start > logging port counters to a mysql DB. (This is going to be a separate > plugin GPL'ed project so there will be no requirement on mysql to > opensm.) Maybe when the code is optional (configurable), it is ok to have such GPL-only inclusions from POV of OFA rules. Somebody knows? Personally I'm fine with GPL :). Sasha From sashak at voltaire.com Sat Oct 27 18:02:26 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 28 Oct 2007 03:02:26 +0200 Subject: [ofa-general] Re: [PATCH] opensm & osm_console: modified console framework to support multiple connections In-Reply-To: <4713FD51.4010506@llnl.gov> References: <4713FD51.4010506@llnl.gov> Message-ID: <20071028010226.GN22317@sashak.voltaire.com> Hi Tim, Sorry about very long delay with reviewing this.
On 16:52 Mon 15 Oct , Timothy A. Meier wrote: > This patch is setting up for adding Remote/Secure Console capability using > SSL/TLS (we need at LLNL). Thanks for doing this - it is a great thing to secure the OpenSM console. > It's a big patch because I changed to an abstract server model, instead of > the original > single connection and synchronous model. There is no significant functional > difference (yet). It is hard to understand how such an abstraction model serves us without seeing the rest of the SSL/TLS code. It is probably a better idea to issue the whole patch series. Anyway, some initial comments are below. > ======== > From cb69c1e2c8ea526bcb1e81d079bfa787eda09ba8 Mon Sep 17 00:00:00 2001 > From: Tim Meier > Date: Mon, 15 Oct 2007 16:08:10 -0700 > Subject: [PATCH] opensm & osm_console: modified console framework to support > multiple connections > > Provided an abstract console service that supports the current connection > types > (local, loopback, socket) as well as supporting the addition of a secure > connection type. > > * A server implementation supports multiple connections, and reduces the > possibility of an inadvertent denial of service (currently vulnerable). > > * An IO abstraction (CIO) is employed to facilitate the future > implementation > of a secure socket (SSL / TLS) connection, while maintaining backward > compatibility. It would be nice not to mix two things in one patch - "one patch per thought" makes it easier to review and submit.
> > Signed-off-by: Tim Meier > --- > opensm/include/opensm/osm_console.h | 35 +- > opensm/opensm/main.c | 77 ++- > opensm/opensm/osm_console.c | 1500 > +++++++++++++++++++++++++---------- > 3 files changed, 1177 insertions(+), 435 deletions(-) > > diff --git a/opensm/include/opensm/osm_console.h > b/opensm/include/opensm/osm_console.h > index 33e41e7..75111a4 100644 > --- a/opensm/include/opensm/osm_console.h > +++ b/opensm/include/opensm/osm_console.h > @@ -49,6 +49,14 @@ > #define OSM_DEFAULT_CONSOLE OSM_DISABLE_CONSOLE > #define OSM_DEFAULT_CONSOLE_PORT 10000 > #define OSM_DAEMON_NAME "opensm" > +#define OSM_QUIT_CMD "quit" > +#define OSM_LOOP_PERIOD_SEC 2 > + > +#define CIO_BUFSIZE 1024 > +#define CIO_INFO_SIZE 128 > +#define CIO_NOTE_SIZE 64 > +#define CIO_MAX_CONNECTS 5 > +#define CIO_CONNECTION_PORT 10000 > #ifdef __cplusplus > # define BEGIN_C_DECLS extern "C" { > @@ -59,10 +67,29 @@ > #endif /* __cplusplus */ > BEGIN_C_DECLS > -void osm_console_init(osm_subn_opt_t * opt, osm_opensm_t * p_osm); > -void osm_console(osm_opensm_t * p_osm); > -void osm_console_prompt(FILE * out); > -void osm_console_close_socket(osm_opensm_t * p_osm); > + > +/* TODO move when fully implemented */ > +typedef struct _CIO_t > +{ > + int fd; // file descriptor (socket) > + FILE *out; > + FILE *err; > + FILE *in; > + struct pollfd *pfd; > +} CIO_t; > + > +int osm_console_server(osm_subn_opt_t *p_opt, osm_opensm_t *p_osm); > +void osm_console_server_init(osm_subn_opt_t *opt, osm_opensm_t *p_osm); > +void osm_console_server_destroy(osm_opensm_t *p_osm); > +int is_console_enabled(osm_subn_opt_t *p_opt); > + > +/* TODO move along with other IO abstraction code */ > +int cio_printf( CIO_t *cio, const char *format, ...); > +int cio_flush( CIO_t *cio); > +int cio_getline( char **lineptr, size_t *n, CIO_t *cio); > +int cio_open( CIO_t *cio); > +int cio_close( CIO_t *cio); > +int cio_poll(CIO_t *cio, int timeout); Later I see that all cio_* and CIO_* stuff is used only in osm_console.c, then 
I think this all should be moved to this file, local function should be static, etc.. Another thing, please try to not break existing coding style (it is described in opensm/doc/opensm-coding-style.txt), in many cases you can use opensm/opensm/osm_indent script to format the code. > END_C_DECLS > #endif /* _OSM_CONSOLE_H_ */ > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index 0250551..b744157 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -229,11 +229,13 @@ void show_usage(void) > " SMPs.\n" > " Without -maxsmps, OpenSM defaults to a maximum of\n" > " 4 outstanding SMPs.\n\n"); > - printf("-console [off|local" > #ifdef ENABLE_OSM_CONSOLE_SOCKET > - "|socket|loopback" > + printf("-console [%s|%s|%s|%s]", OSM_DISABLE_CONSOLE, OSM_LOCAL_CONSOLE, > + OSM_REMOTE_CONSOLE, OSM_LOOPBACK_CONSOLE); > +#else > + printf("-console [%s|%s]", OSM_DISABLE_CONSOLE, OSM_LOCAL_CONSOLE); > #endif > - "]\n This option activates the OpenSM console (default > off).\n\n"); > + printf("]\n This option activates the OpenSM console (default > off).\n\n"); > #ifdef ENABLE_OSM_CONSOLE_SOCKET > printf("-console-port \n" > " Specify an alternate telnet port for the console > (default %d).\n\n", > @@ -566,6 +568,45 @@ static int daemonize(osm_opensm_t * osm) > return 0; > } > +/* simple server to provide an interface to support > + * interactive (and non-interactive) commands + * loop here until an > exit signal is received > + * > + * currently just support a command console > + */ > +void osm_opensm_server(osm_subn_opt_t *p_opt, osm_opensm_t *p_osm) > +{ > + if(is_console_enabled(p_opt)) > + osm_console_server_init(p_opt, p_osm); > + > + /* > + Sit here forever - dwelling or running the server > + */ > + while (!osm_exit_flag) > + { > + if(is_console_enabled(p_opt)) > + osm_console_server(p_opt, p_osm); > + else > + cl_thread_suspend( 10000); > + > + if (osm_usr1_flag) > + { > + osm_usr1_flag = 0; > + osm_log_reopen_file(&(p_osm->log)); > + } > + if 
(osm_hup_flag) > + { > + osm_hup_flag = 0; > + /* a HUP signal should only start a new heavy sweep */ > + p_osm->subn.force_immediate_heavy_sweep = TRUE; > + osm_opensm_sweep(p_osm); > + } > + } > + + if(is_console_enabled(p_opt)) > + osm_console_server_destroy(p_osm); > +} > + > /********************************************************************** > **********************************************************************/ > int main(int argc, char *argv[]) > @@ -1034,34 +1075,8 @@ int main(int argc, char *argv[]) > osm_exit_flag = 1; > } > } else { > - osm_console_init(&opt, &osm); > - > - /* > - Sit here forever > - */ > - while (!osm_exit_flag) { > - if (strcmp(opt.console, OSM_LOCAL_CONSOLE) == 0 > -#ifdef ENABLE_OSM_CONSOLE_SOCKET > - || strcmp(opt.console, OSM_REMOTE_CONSOLE) == 0 > - || strcmp(opt.console, OSM_LOOPBACK_CONSOLE) == 0 > -#endif > - ) > - osm_console(&osm); > - else > - cl_thread_suspend(10000); > - > - if (osm_usr1_flag) { > - osm_usr1_flag = 0; > - osm_log_reopen_file(&osm.log); > - } > - if (osm_hup_flag) { > - osm_hup_flag = 0; > - /* a HUP signal should only start a new heavy sweep */ > - osm.subn.force_immediate_heavy_sweep = TRUE; > - osm_opensm_sweep(&osm); > - } > - } > - osm_console_close_socket(&osm); > + // start a server that runs indefinately > + osm_opensm_server(&opt, &osm); > } > #if 0 > diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c > index c6e02ab..9d62774 100644 > --- a/opensm/opensm/osm_console.c > +++ b/opensm/opensm/osm_console.c > @@ -38,15 +38,16 @@ > #define _GNU_SOURCE /* for getline */ > #include > #include > +#include > #include > #include > #include > #include > #ifdef ENABLE_OSM_CONSOLE_SOCKET > #include > -#endif > #include > #include > +#endif > #include > #include > #include > @@ -57,20 +58,113 @@ > #include > #include > +typedef struct _LoopCmd > +{ > + int on; > + int running; > + int delay_s; > + void (*loop_function)(osm_opensm_t *p_osm, CIO_t *out); > + cl_thread_t loopThread; // a 
specific thread for each looping cmd > +} LoopCmd; > + > +// unique attributes for each connection > +typedef struct _osm_console_thread_t > +{ > + int used; > + unsigned short int port; > + int authorized; > + int state; > + char name[CIO_INFO_SIZE]; > + char in_buff[CIO_BUFSIZE]; > + char out_buff[CIO_BUFSIZE]; > + char client_type[CIO_NOTE_SIZE]; // maps to option->console > (off|local|socket) > + char client_ip[CIO_NOTE_SIZE]; > + char client_hn[CIO_INFO_SIZE]; > + unsigned int thread_num; // a unique ever increasing number + > osm_opensm_t *p_osm; // the global opensm singleton (protect with lock) > + CIO_t io; // the io streams for the connection > + LoopCmd loop_command; > + cl_thread_t consoleThread; // a specific thread each console connection > + struct timeval connect_time; > +} osm_console_thread_t;

I think this introduces CIO_MAX_CONNECTS new threads, plus threads for loop commands. How about doing it all in one thread, using select() or poll() with a timeout on multiple file descriptors? That would leave the other CPUs free to run the rest of OpenSM. Another potential problem is multi-thread synchronization: we have had (and still have) a lot of issues in this area.
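To illustrate the single-thread suggestion above, here is a minimal sketch that assumes nothing beyond plain POSIX poll(). The names struct console_conn, console_poll_once() and MAX_CLIENTS are hypothetical and are not part of the patch or of OpenSM; the read() merely stands in for real command parsing.

```c
#include <poll.h>
#include <unistd.h>

#define MAX_CLIENTS 5	/* hypothetical; plays the role of CIO_MAX_CONNECTS */

/* Hypothetical per-connection state; not a structure from the patch. */
struct console_conn {
	int fd;			/* -1 when the slot is free */
	unsigned int lines;	/* reads served; stands in for command handling */
};

/*
 * One pass of a single-threaded console loop: wait up to timeout_ms for
 * input on any open connection and serve every connection that is ready.
 * Returns the number of connections that delivered data (0 if none,
 * e.g. on timeout) or -1 on error.
 */
static int console_poll_once(struct console_conn *conns, int n, int timeout_ms)
{
	struct pollfd pfds[MAX_CLIENTS];
	char buf[256];
	int i, ready, served = 0;

	if (n > MAX_CLIENTS)
		return -1;

	for (i = 0; i < n; i++) {
		pfds[i].fd = conns[i].fd;	/* poll() ignores negative fds */
		pfds[i].events = POLLIN;
		pfds[i].revents = 0;
	}

	ready = poll(pfds, (nfds_t) n, timeout_ms);
	if (ready <= 0)
		return ready;	/* 0: timeout, -1: error */

	for (i = 0; i < n; i++) {
		ssize_t len;

		if (!(pfds[i].revents & (POLLIN | POLLHUP)))
			continue;
		len = read(conns[i].fd, buf, sizeof(buf) - 1);
		if (len <= 0) {
			/* peer closed or read error: free the slot */
			close(conns[i].fd);
			conns[i].fd = -1;
			continue;
		}
		/* a real console would parse and run the command here */
		conns[i].lines++;
		served++;
	}
	return served;
}
```

A caller would run this in the main loop with a timeout equal to the loop period, and could do periodic work (such as a looping status command) whenever it returns 0, so no extra threads or locks around a thread pool would be needed.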
> + > struct command { > - char *name; > - void (*help_function) (FILE * out, int detail); > - void (*parse_function) (char **p_last, osm_opensm_t * p_osm, > - FILE * out); > + char *name; > + void (*help_function)(CIO_t *out, int detail); > + void (*parse_function)(char **p_last, osm_console_thread_t *p_oct, CIO_t > *out); > }; > -struct { > - int on; > - int delay_s; > - time_t previous; > - void (*loop_function) (osm_opensm_t * p_osm, FILE * out); > -} loop_command = { > -on: 0, delay_s: 2, loop_function:NULL}; > +/* connection pool for remote clients - currently only consoles */ > +static osm_console_thread_t ConsoleThreadPool[CIO_MAX_CONNECTS]; > +static cl_plock_t ThreadLock; > +static volatile unsigned int cio_thread_counter = 0; > +static struct timeval ServerTime;

It would be nice to avoid non-constant static/global variables. Instead we could keep the needed per-OpenSM-session info in an allocated structure.

> + > +/********************************************************************** > + * convenience function > + **********************************************************************/ > +CIO_t* getCIO(osm_console_thread_t *oct)

Shouldn't this function be static?

> +{ > + return &oct->io; > +} > + > +/********************************************************************** > + * thread pool primitive: counts the number currently in use > + **********************************************************************/ > +int num_console_threads(void)

Ditto (and the same for many others below).

> +{ > + // count them up > + > + int i; > + int num = 0; > + + cl_plock_acquire(&ThreadLock); > + for(i = 0; i < CIO_MAX_CONNECTS; ++i) > + { > + if(ConsoleThreadPool[i].used != 0) > + num++; > + } > + cl_plock_release(&ThreadLock); > + + return num; > +} > + > +/********************************************************************** > + * thread pool primitive: the current value reflects the number of > + * connection attempts made since program execution. 
> + **********************************************************************/ > +unsigned int get_console_thread_counter(void) > +{ > + return cio_thread_counter; > +} > + > +int is_loopback(char* str) > +{ > + // convenience - checks if socket based connection > + if(str) > + return (strcmp(str, OSM_LOOPBACK_CONSOLE) == 0); > +return 0; > +} > + > +int is_remote(char* str) > +{ > + // convenience - checks if socket based connection > + if(str) > + return (strcmp(str, OSM_REMOTE_CONSOLE) == 0) > + || is_loopback(str); > +return 0; > +} > + > +int is_console_enabled(osm_subn_opt_t *p_opt) > +{ > + // checks for a variety of types of consoles - default is off or 0 > + if(p_opt) > + return ((strcmp(p_opt->console, OSM_LOCAL_CONSOLE) == 0) > + || (strcmp(p_opt->console, OSM_LOOPBACK_CONSOLE) == 0) > + || (strcmp(p_opt->console, OSM_REMOTE_CONSOLE) == 0)); > +return 0; > +} > + > static const struct command console_cmds[]; > @@ -79,114 +173,103 @@ static inline char *next_token(char **p_last) > return strtok_r(NULL, " \t\n\r", p_last); > } > -static void help_command(FILE * out, int detail) > +static void help_command(CIO_t *out, int detail) > { > int i; > - fprintf(out, "Supported commands and syntax:\n"); > - fprintf(out, "help []\n"); > + cio_printf(out, "Supported commands and syntax:\n"); > + cio_printf(out, "help []\n"); > /* skip help command */ > for (i = 1; console_cmds[i].name; i++) > console_cmds[i].help_function(out, 0); > } > -static void help_quit(FILE * out, int detail) > +static void help_quit(CIO_t *out, int detail) > { > - fprintf(out, "quit (not valid in local mode; use ctl-c)\n"); > + cio_printf(out, "%s -- stops the console\n", OSM_QUIT_CMD); > + if (detail) { > + cio_printf(out, " OpenSM will continue, to kill; \n"); > + cio_printf(out, " use ctrl-C in local mode or\n"); > + cio_printf(out, " kill the process\n"); > + } > } > -static void help_loglevel(FILE * out, int detail) > + > +static void help_loglevel(CIO_t *out, int detail) > { > - 
fprintf(out, "loglevel []\n"); > + cio_printf(out, "loglevel []\n"); > if (detail) { > - fprintf(out, " log-level is OR'ed from the following\n"); > - fprintf(out, " OSM_LOG_NONE 0x%02X\n", > - OSM_LOG_NONE); > - fprintf(out, " OSM_LOG_ERROR 0x%02X\n", > - OSM_LOG_ERROR); > - fprintf(out, " OSM_LOG_INFO 0x%02X\n", > - OSM_LOG_INFO); > - fprintf(out, " OSM_LOG_VERBOSE 0x%02X\n", > - OSM_LOG_VERBOSE); > - fprintf(out, " OSM_LOG_DEBUG 0x%02X\n", > - OSM_LOG_DEBUG); > - fprintf(out, " OSM_LOG_FUNCS 0x%02X\n", > - OSM_LOG_FUNCS); > - fprintf(out, " OSM_LOG_FRAMES 0x%02X\n", > - OSM_LOG_FRAMES); > - fprintf(out, " OSM_LOG_ROUTING 0x%02X\n", > - OSM_LOG_ROUTING); > - fprintf(out, " OSM_LOG_SYS 0x%02X\n", > - OSM_LOG_SYS); > - fprintf(out, "\n"); > - fprintf(out, " OSM_LOG_DEFAULT_LEVEL 0x%02X\n", > - OSM_LOG_DEFAULT_LEVEL); > + cio_printf(out, " log-level is OR'ed from the following\n"); > + cio_printf(out, " OSM_LOG_NONE 0x%02X\n", > OSM_LOG_NONE); > + cio_printf(out, " OSM_LOG_ERROR 0x%02X\n", > OSM_LOG_ERROR); > + cio_printf(out, " OSM_LOG_INFO 0x%02X\n", > OSM_LOG_INFO); > + cio_printf(out, " OSM_LOG_VERBOSE 0x%02X\n", > OSM_LOG_VERBOSE); > + cio_printf(out, " OSM_LOG_DEBUG 0x%02X\n", > OSM_LOG_DEBUG); > + cio_printf(out, " OSM_LOG_FUNCS 0x%02X\n", > OSM_LOG_FUNCS); > + cio_printf(out, " OSM_LOG_FRAMES 0x%02X\n", > OSM_LOG_FRAMES); > + cio_printf(out, " OSM_LOG_ROUTING 0x%02X\n", > OSM_LOG_ROUTING); > + cio_printf(out, " OSM_LOG_SYS 0x%02X\n", > OSM_LOG_SYS); > + cio_printf(out, "\n"); > + cio_printf(out, " OSM_LOG_DEFAULT_LEVEL 0x%02X\n", > OSM_LOG_DEFAULT_LEVEL); > } > } > -static void help_priority(FILE * out, int detail) > +static void help_priority(CIO_t *out, int detail) > { > - fprintf(out, "priority []\n"); > + cio_printf(out, "priority []\n"); > } > -static void help_resweep(FILE * out, int detail) > +static void help_resweep(CIO_t *out, int detail) > { > - fprintf(out, "resweep [heavy|light]\n"); > + cio_printf(out, "resweep [heavy|light]\n"); > } > -static 
void help_status(FILE * out, int detail) > +static void help_status(CIO_t *out, int detail) > { > - fprintf(out, "status [loop]\n"); > + cio_printf(out, "status [loop]\n"); > if (detail) { > - fprintf(out, " loop -- type \"q\" to quit\n"); > + cio_printf(out, " loop -- type \"q\" to quit\n"); > } > } > -static void help_logflush(FILE * out, int detail) > +static void help_logflush(CIO_t *out, int detail) > { > - fprintf(out, "logflush -- flush the opensm.log file\n"); > + cio_printf(out, "logflush -- flush the opensm.log file\n"); > } > -static void help_querylid(FILE * out, int detail) > +static void help_querylid(CIO_t *out, int detail) > { > - fprintf(out, > - "querylid lid -- print internal information about the lid > specified\n"); > + cio_printf(out, > + "querylid lid -- print internal information about the lid > specified\n"); > } > -static void help_portstatus(FILE * out, int detail) > +static void help_portstatus(CIO_t *out, int detail) > { > - fprintf(out, "portstatus [ca|switch|router]\n"); > + cio_printf(out, "portstatus [ca|switch|router]\n"); > if (detail) { > - fprintf(out, "summarize port status\n"); > - fprintf(out, > - " [ca|switch|router] -- limit the results to the node type > specified\n"); > + cio_printf(out, "summarize port status\n"); > + cio_printf(out, " [ca|switch|router] -- limit the results to the > node type specified\n"); > } > } > #ifdef ENABLE_OSM_PERF_MGR > -static void help_perfmgr(FILE * out, int detail) > +static void help_perfmgr(CIO_t *out, int detail) > { > - fprintf(out, > - "perfmgr > [enable|disable|clear_counters|dump_counters|sweep_time[seconds]]\n"); > + cio_printf(out, "perfmgr > [enable|disable|clear_counters|dump_counters|sweep_time[seconds]]\n"); > if (detail) { > - fprintf(out, > - "perfmgr -- print the performance manager state\n"); > - fprintf(out, > - " [enable|disable] -- change the perfmgr state\n"); > - fprintf(out, > - " [sweep_time] -- change the perfmgr sweep time (requires > [seconds] option)\n"); > - 
fprintf(out, > - " [clear_counters] -- clear the counters stored\n"); > - fprintf(out, > - " [dump_counters [mach]] -- dump the counters (optionally in > [mach]ine readable format)\n"); > + cio_printf(out, "perfmgr -- print the performance manager > state\n"); > + cio_printf(out, " [enable|disable] -- change the perfmgr > state\n"); > + cio_printf(out, " [sweep_time] -- change the perfmgr sweep time > (requires [seconds] option)\n"); > + cio_printf(out, " [clear_counters] -- clear the counters > stored\n"); > + cio_printf(out, " [dump_counters [mach]] -- dump the counters > (optionally in [mach]ine readable format)\n"); > } > } > #endif /* ENABLE_OSM_PERF_MGR */ > /* more help routines go here */ > -static void help_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +static void help_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t > *out) > { > char *p_cmd; > int i, found = 0; > @@ -203,21 +286,21 @@ static void help_parse(char **p_last, osm_opensm_t * > p_osm, FILE * out) > } > } > if (!found) { > - fprintf(out, "%s : Command not found\n\n", p_cmd); > + cio_printf(out, "%s : Command not found\n\n", p_cmd); > help_command(out, 0); > } > } > } > -static void loglevel_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +static void loglevel_parse(char **p_last, osm_console_thread_t *p_oct, > CIO_t *out) > { > + osm_opensm_t *p_osm = p_oct->p_osm; > char *p_cmd; > int level; > p_cmd = next_token(p_last); > if (!p_cmd) > - fprintf(out, "Current log level is 0x%x\n", > - osm_log_get_level(&p_osm->log)); > + cio_printf(out, "Current log level is 0x%x\n", > osm_log_get_level(&p_osm->log)); At least here your mailer wraps the line :( > else { > /* Handle x, 0x, and decimal specification of log level */ > if (!strncmp(p_cmd, "x", 1)) { > @@ -231,31 +314,29 @@ static void loglevel_parse(char **p_last, osm_opensm_t > * p_osm, FILE * out) > level = strtol(p_cmd, NULL, 10); > } > if ((level >= 0) && (level < 256)) { > - fprintf(out, "Setting log level to 
0x%x\n", level); > + cio_printf(out, "Setting log level to 0x%x\n", level); > osm_log_set_level(&p_osm->log, level); > } else > - fprintf(out, "Invalid log level 0x%x\n", level); > + cio_printf(out, "Invalid log level 0x%x\n", level); > } > } > -static void priority_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +static void priority_parse(char **p_last, osm_console_thread_t *p_oct, > CIO_t *out) > { > + osm_opensm_t *p_osm = p_oct->p_osm; > char *p_cmd; > int priority; > p_cmd = next_token(p_last); > if (!p_cmd) > - fprintf(out, "Current sm-priority is %d\n", > - p_osm->subn.opt.sm_priority); > + cio_printf(out, "Current sm-priority is %d\n", > p_osm->subn.opt.sm_priority); > else { > priority = strtol(p_cmd, NULL, 0); > if (0 > priority || 15 < priority) > - fprintf(out, > - "Invalid sm-priority %d; must be between 0 and 15\n", > - priority); > + cio_printf(out, "Invalid sm-priority %d; must be between 0 and > 15\n", priority); > else { > - fprintf(out, "Setting sm-priority to %d\n", priority); > - p_osm->subn.opt.sm_priority = (uint8_t) priority; > + cio_printf(out, "Setting sm-priority to %d\n", priority); > + p_osm->subn.opt.sm_priority = (uint8_t)priority; > /* Does the SM state machine need a kick now ? */ > } > } > @@ -371,24 +452,23 @@ static char *sm_state_mgr_str(osm_sm_state_t state) > } > } > -static void print_status(osm_opensm_t * p_osm, FILE * out) > +static void print_status(osm_opensm_t *p_osm, CIO_t *out) > { > if (out) { > - fprintf(out, " OpenSM Version : %s\n", OSM_VERSION); > - fprintf(out, " SM State/Mgr State : %s/%s\n", > + cio_printf(out, " OpenSM Version : %s\n", OSM_VERSION); > + cio_printf(out, " SM State/Mgr State : %s/%s\n", > sm_state_str(p_osm->subn.sm_state), > sm_state_mgr_str(p_osm->sm.state_mgr.state)); > - fprintf(out, " SA State : %s\n", > + cio_printf(out, " SA State : %s\n", > sa_state_str(p_osm->sa.state)); > - fprintf(out, " Routing Engine : %s\n", > - p_osm->routing_engine.name ? p_osm->routing_engine. 
> - name : "null (min-hop)"); > + cio_printf(out, " Routing Engine : %s\n", > + p_osm->routing_engine.name ? p_osm->routing_engine.name : "null > (min-hop)"); > #ifdef ENABLE_OSM_PERF_MGR > - fprintf(out, "\n PerfMgr state/sweep state : %s/%s\n", > + cio_printf(out, "\n PerfMgr state/sweep state : %s/%s\n", > osm_perfmgr_get_state_str(&(p_osm->perfmgr)), > osm_perfmgr_get_sweep_state_str(&(p_osm->perfmgr))); > #endif > - fprintf(out, "\n MAD stats\n" > + cio_printf(out, "\n MAD stats\n" > " ---------\n" > " QP0 MADs outstanding : %d\n" > " QP0 MADs outstanding (on wire) : %d\n" > @@ -412,7 +492,7 @@ static void print_status(osm_opensm_t * p_osm, FILE * > out) > p_osm->stats.sa_mads_sent, > p_osm->stats.sa_mads_rcvd_unknown, > p_osm->stats.sa_mads_ignored); > - fprintf(out, "\n Subnet flags\n" > + cio_printf(out, "\n Subnet flags\n" > " ------------\n" > " Ignore existing lfts : %d\n" > " Subnet Init errors : %d\n" > @@ -426,32 +506,24 @@ static void print_status(osm_opensm_t * p_osm, FILE * > out) > p_osm->subn.moved_to_master_state, > p_osm->subn.first_time_master_sweep, > p_osm->subn.coming_out_of_standby); > - fprintf(out, "\n"); > - } > -} > - > -static int loop_command_check_time(void) > -{ > - time_t cur = time(NULL); > - if ((loop_command.previous + loop_command.delay_s) < cur) { > - loop_command.previous = cur; > - return (1); > + cio_printf(out, "\n"); > } > - return (0); > } > -static void status_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +static void status_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t > *out) > { > + osm_opensm_t *p_osm = p_oct->p_osm; > char *p_cmd; > p_cmd = next_token(p_last); > if (p_cmd) { > if (strcmp(p_cmd, "loop") == 0) { > - fprintf(out, "Looping on status command...\n"); > - fflush(out); > - loop_command.on = 1; > - loop_command.previous = time(NULL); > - loop_command.loop_function = print_status; > + cio_printf(out, "Looping on status command...\n"); > + cio_flush(out); > + p_oct->loop_command.on = 1; 
> + p_oct->loop_command.delay_s = OSM_LOOP_PERIOD_SEC; > + p_oct->loop_command.running = 0; > + p_oct->loop_command.loop_function = print_status; > } else { > help_status(out, 1); > return; > @@ -460,14 +532,15 @@ static void status_parse(char **p_last, osm_opensm_t * > p_osm, FILE * out) > print_status(p_osm, out); > } > -static void resweep_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +static void resweep_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t > *out) > { > + osm_opensm_t *p_osm = p_oct->p_osm; > char *p_cmd; > p_cmd = next_token(p_last); > if (!p_cmd || > (strcmp(p_cmd, "heavy") != 0 && strcmp(p_cmd, "light") != 0)) { > - fprintf(out, "Invalid resweep command\n"); > + cio_printf(out, "Invalid resweep command\n"); > help_resweep(out, 1); > } else { > if (strcmp(p_cmd, "heavy") == 0) { > @@ -477,20 +550,21 @@ static void resweep_parse(char **p_last, osm_opensm_t > * p_osm, FILE * out) > } > } > -static void logflush_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +static void logflush_parse(char **p_last, osm_console_thread_t *p_oct, > CIO_t *out) > { > - fflush(p_osm->log.out_port); > + fflush(p_oct->p_osm->log.out_port); > } > -static void querylid_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +static void querylid_parse(char **p_last, osm_console_thread_t *p_oct, > CIO_t *out) > { > - int p = 0; > - uint16_t lid = 0; > + osm_opensm_t *p_osm = p_oct->p_osm; > + int p = 0; > + uint16_t lid = 0; > osm_port_t *p_port = NULL; > char *p_cmd = next_token(p_last); > if (!p_cmd) { > - fprintf(out, "no LID specified\n"); > + cio_printf(out, "no LID specified\n"); > help_querylid(out, 1); > return; > } > @@ -503,8 +577,8 @@ static void querylid_parse(char **p_last, osm_opensm_t * > p_osm, FILE * out) > if (!p_port) > goto invalid_lid; > - fprintf(out, "Query results for LID %d\n", lid); > - fprintf(out, > + cio_printf(out, "Query results for LID %d\n", lid); > + cio_printf(out, > " GUID : 0x%016" PRIx64 "\n" > " Node Desc : 
%s\n" > " Node Type : %s\n" > @@ -518,20 +592,19 @@ static void querylid_parse(char **p_last, osm_opensm_t > * p_osm, FILE * out) > p = 0; > else > p = 1; > - for ( /* see above */ ; p < p_port->p_node->physp_tbl_size; p++) { > - fprintf(out, > + for (/* see above */; p < p_port->p_node->physp_tbl_size; p++) { > + cio_printf(out, > " Port %d health : %s\n", > p, > - p_port->p_node->physp_table[p]. > - healthy ? "OK" : "ERROR"); > + p_port->p_node->physp_table[p].healthy ? "OK" : "ERROR"); > } > cl_plock_release(&p_osm->lock); > return; > - invalid_lid: > +invalid_lid: > cl_plock_release(&p_osm->lock); > - fprintf(out, "Invalid lid %d\n", lid); > + cio_printf(out, "Invalid lid %d\n", lid); > return; > } > @@ -564,11 +637,11 @@ __tag_port_report(port_report_t ** head, uint64_t > node_guid, > *head = rep; > } > -static void __print_port_report(FILE * out, port_report_t * head) > +static void __print_port_report(CIO_t *out, port_report_t *head) > { > port_report_t *item = head; > while (item != NULL) { > - fprintf(out, " 0x%016" PRIx64 " %d (%s)\n", > + cio_printf(out, " 0x%016"PRIx64" %d (%s)\n", > item->node_guid, item->port_num, item->print_desc); > port_report_t *next = item->next; > free(item); > @@ -689,10 +762,11 @@ static void __get_stats(cl_map_item_t * const > p_map_item, void *context) > } > } > -static void portstatus_parse(char **p_last, osm_opensm_t * p_osm, FILE * > out) > +static void portstatus_parse(char **p_last, osm_console_thread_t *p_oct, > CIO_t *out) > { > - fabric_stats_t fs; > - struct timeval before, after; > + osm_opensm_t *p_osm = p_oct->p_osm; > + fabric_stats_t fs; > + struct timeval before, after; > char *p_cmd; > memset(&fs, 0, sizeof(fs)); > @@ -706,7 +780,7 @@ static void portstatus_parse(char **p_last, osm_opensm_t > * p_osm, FILE * out) > } else if (strcmp(p_cmd, "router") == 0) { > fs.node_type_lim = IB_NODE_TYPE_ROUTER; > } else { > - fprintf(out, "Node type not understood\n"); > + cio_printf(out, "Node type not understood\n"); > 
help_portstatus(out, 1); > return; > } > @@ -723,58 +797,56 @@ static void portstatus_parse(char **p_last, > osm_opensm_t * p_osm, FILE * out) > gettimeofday(&after, NULL); > /* report the stats */ > - fprintf(out, "\"%s\" port status:\n", > - fs.node_type_lim ? ib_get_node_type_str(fs. > - node_type_lim) : "ALL"); > - fprintf(out, > - " %" PRIu64 " port(s) scanned on %" PRIu64 > - " nodes in %lu us\n", fs.total_ports, fs.total_nodes, > - after.tv_usec - before.tv_usec); > + cio_printf(out, "\"%s\" port status:\n", > + fs.node_type_lim ? ib_get_node_type_str(fs.node_type_lim) : > "ALL"); > + cio_printf(out, " %"PRIu64" port(s) scanned on %"PRIu64" nodes in %lu > us\n", > + fs.total_ports, fs.total_nodes, after.tv_usec - before.tv_usec); > if (fs.ports_down) > - fprintf(out, " %" PRIu64 " down\n", fs.ports_down); > + cio_printf(out, " %"PRIu64" down\n", fs.ports_down); > if (fs.ports_active) > - fprintf(out, " %" PRIu64 " active\n", fs.ports_active); > + cio_printf(out, " %"PRIu64" active\n", fs.ports_active); > if (fs.ports_1X) > - fprintf(out, " %" PRIu64 " at 1X\n", fs.ports_1X); > + cio_printf(out, " %"PRIu64" at 1X\n", fs.ports_1X); > if (fs.ports_4X) > - fprintf(out, " %" PRIu64 " at 4X\n", fs.ports_4X); > + cio_printf(out, " %"PRIu64" at 4X\n", fs.ports_4X); > if (fs.ports_8X) > - fprintf(out, " %" PRIu64 " at 8X\n", fs.ports_8X); > + cio_printf(out, " %"PRIu64" at 8X\n", fs.ports_8X); > if (fs.ports_12X) > - fprintf(out, " %" PRIu64 " at 12X\n", fs.ports_12X); > + cio_printf(out, " %"PRIu64" at 12X\n", fs.ports_12X); > if (fs.ports_sdr) > - fprintf(out, " %" PRIu64 " at 2.5 Gbps\n", fs.ports_sdr); > + cio_printf(out, " %"PRIu64" at 2.5 Gbps\n", fs.ports_sdr); > if (fs.ports_ddr) > - fprintf(out, " %" PRIu64 " at 5.0 Gbps\n", fs.ports_ddr); > + cio_printf(out, " %"PRIu64" at 5.0 Gbps\n", fs.ports_ddr); > if (fs.ports_qdr) > - fprintf(out, " %" PRIu64 " at 10.0 Gbps\n", fs.ports_qdr); > + cio_printf(out, " %"PRIu64" at 10.0 Gbps\n", fs.ports_qdr); > if 
(fs.ports_disabled + fs.ports_reduced_speed + fs.ports_reduced_width > - > 0) { > - fprintf(out, "\nPossible issues:\n"); > + > 0) { > + cio_printf(out, "\nPossible issues:\n"); > } > if (fs.ports_disabled) { > - fprintf(out, " %" PRIu64 " disabled\n", fs.ports_disabled); > + cio_printf(out, " %"PRIu64" disabled\n", fs.ports_disabled); > __print_port_report(out, fs.disabled_ports); > } > if (fs.ports_reduced_speed) { > - fprintf(out, " %" PRIu64 " with reduced speed\n", > + cio_printf(out, " %"PRIu64" with reduced speed\n", > fs.ports_reduced_speed); > __print_port_report(out, fs.reduced_speed_ports); > } > if (fs.ports_reduced_width) { > - fprintf(out, " %" PRIu64 " with reduced width\n", > + cio_printf(out, " %"PRIu64" with reduced width\n", > fs.ports_reduced_width); > __print_port_report(out, fs.reduced_width_ports); > } > - fprintf(out, "\n"); > + cio_printf(out, "\n"); > } > #ifdef ENABLE_OSM_PERF_MGR > -static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +static void perfmgr_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t > *out) > { > + osm_opensm_t *p_osm = p_oct->p_osm; > char *p_cmd; > p_cmd = next_token(p_last); > @@ -803,309 +875,937 @@ static void perfmgr_parse(char **p_last, > osm_opensm_t * p_osm, FILE * out) > osm_perfmgr_set_sweep_time_s(&(p_osm->perfmgr), > time_s); > } else { > - fprintf(out, > + cio_printf(out, > "sweep_time requires a time period (in seconds) to be > specified\n"); > } > } else { > - fprintf(out, "\"%s\" option not found\n", p_cmd); > + cio_printf(out, "\"%s\" option not found\n", p_cmd); > } > } else { > - fprintf(out, "Performance Manager status:\n" > + cio_printf(out, "Performance Manager status:\n" > "state : %s\n" > "sweep state : %s\n" > "sweep time : %us\n" > - "outstanding queries/max : %d/%u\n" > - "loaded event plugin : %s\n", > + "outstanding queries/max : %d/%u\n", > osm_perfmgr_get_state_str(&(p_osm->perfmgr)), > osm_perfmgr_get_sweep_state_str(&(p_osm->perfmgr)), > 
osm_perfmgr_get_sweep_time_s(&(p_osm->perfmgr)), > p_osm->perfmgr.outstanding_queries, > - p_osm->perfmgr.max_outstanding_queries, > - p_osm->perfmgr.event_plugin ? > - p_osm->perfmgr.event_plugin->plugin_name : "NONE"); > + p_osm->perfmgr.max_outstanding_queries); > } > } > #endif /* ENABLE_OSM_PERF_MGR */ > -/* This is public to be able to close it on exit */ > -void osm_console_close_socket(osm_opensm_t * p_osm) > +static void help_version(CIO_t *out, int detail) > { > - if (p_osm->console.socket > 0) { > - close(p_osm->console.in_fd); > - p_osm->console.in_fd = -1; > - p_osm->console.out_fd = -1; > - p_osm->console.in = NULL; > - p_osm->console.out = NULL; > - } > + cio_printf(out, "version -- print the OSM version\n"); > } > -static void quit_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > +static void version_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t > *out) > { > - osm_console_close_socket(p_osm); > + cio_printf(out, "%s build %s %s\n", OSM_VERSION, __DATE__, __TIME__); > } > -static void help_version(FILE * out, int detail) > +/********************************************************************** > + * thread pool primitive: returns the thread structure to the pool, and > + * makes it available > + **********************************************************************/ > +int free_console_thread(osm_console_thread_t *oct) > { > - fprintf(out, "version -- print the OSM version\n"); > + // just clear the used flag, mark as available > + oct->used = 0; > + return 1; > +} > + > +/********************************************************************** > + * Cleans up the thread that was established for a connection. > + * The connection should already be closed. This method releases > + * any resources and destroy the thread (done automagically??) 
> + * > + * refer to: osm_console_thread and osm_console_thread_init > +**********************************************************************/ > +int osm_console_thread_destroy(osm_console_thread_t *oct) > +{ + free_console_thread(oct); > + + // there are a few end cases that might need this (e.g. not completely > init) > + cio_close(getCIO(oct)); > + + return 0; > } > -static void version_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) > + > +/********************************************************************** > + * Gracefully shut down the console connection, release resources > + * refer to: osm_console_init > + **********************************************************************/ > +void osm_console_destroy(osm_console_thread_t *p_oct) > +{ > + osm_opensm_t *p_osm = p_oct->p_osm; > + CIO_t *out = getCIO(p_oct); > + > + osm_log(&(p_osm->log), OSM_LOG_INFO, > + "osm_console_destroy: Console connection being closed: %s (%s) > s#%d\n", p_oct->client_hn, > + p_oct->client_ip, out->fd); > + fflush(p_osm->log.out_port); > + cio_printf(out, "Closing this connection from osm_console_destroy\n"); > + + cio_close(out); > + } > + > +/********************************************************************** > + * thread pool primitive: kills and disconnects connections. 
If the > + * argument is a current thread, it will NOT be cleared (will be skipped) > + **********************************************************************/ > +int kill_console_thread_pool(osm_console_thread_t* p_oct, osm_opensm_t > *p_osm) > +{ > + // kill everything but my connection if p_oct is in the list > + int i; > + osm_console_thread_t* oct; > + CIO_t *p_out = getCIO(p_oct); > + CIO_t *out = getCIO(p_oct); > + > + // brute force this, don't use locks because don't want to get deadlocked > +// cl_plock_acquire(&ThreadLock); > + for(i = 0; i < CIO_MAX_CONNECTS; ++i) > + { > + oct = &ConsoleThreadPool[i]; > + if((oct) && (oct->used) && (p_oct != oct)) > + { > + cio_printf(p_out, " killing thread: %s\n", oct->name); > + out = getCIO(oct); > + + // disconnect gracefully?? > + osm_log(&(p_osm->log), OSM_LOG_INFO, > + "kill_console_thread_pool: %d (s#%d)\n", i, out->fd); > + + // return all the console resources > + osm_console_destroy(oct); > + + // return all the thread and connection resources > + osm_console_thread_destroy(oct); > + } > + } > +// cl_plock_release(&ThreadLock); > + return i; > +} > + > +/********************************************************************** > + * releases all of the resources used by all of the connections, by > + * closing sockets, freeing threads, etc.. > + * > + * a good method for handling a kill signal > + **********************************************************************/ > +int free_console_threads(osm_opensm_t *p_osm) > { > - fprintf(out, "%s build %s %s\n", OSM_VERSION, __DATE__, __TIME__); > + // just make sure everything is gone > + int rtnval = kill_console_thread_pool(NULL, p_osm); > + return rtnval; > } > + > +/********************************************************************** > + * thread pool primitive: clears and initializes all the threads. 
If the > + * argument is a current thread, it will NOT be cleared (will be skipped) > + **********************************************************************/ > +int print_console_thread_pool(osm_console_thread_t* p_oct, osm_opensm_t > *p_osm, CIO_t *out) This function is not used. > +{ > + // show whats in use, and whats available > + > + int i; > + osm_console_thread_t* oct; > + > + char *t_string = ctime(&(ServerTime.tv_sec)); > + t_string[strlen(t_string)-1]=0; + cio_printf(out, "OSM Server - Up > since: %s, Users: %d, * = this connection\n", t_string, > num_console_threads()); > + > + // (careful not to double lock .. num_console_threads() + > cl_plock_acquire(&ThreadLock); > + > + for(i = 0; i < CIO_MAX_CONNECTS; ++i) > + { > + oct = &ConsoleThreadPool[i]; > + if((oct) && (oct->used)) > + { > + if(p_oct == oct) > + cio_printf(out, "*"); > + else > + cio_printf(out, " "); > + cio_printf(out, "Thread: %s [%d]\n", oct->name, oct->thread_num); > + cio_printf(out, " User: %s, (%s)\n", oct->client_hn, > oct->client_ip); > + t_string = ctime(&(oct->connect_time.tv_sec)); > + t_string[strlen(t_string)-1]=0; + cio_printf(out, " Since: > %s\n", t_string); > + cio_printf(out, " Port: %d\n", oct->port); > + cio_printf(out, " Socket: %d\n", oct->io.fd); > + cio_printf(out, " State: %d\n", oct->state); > + } > + } > + cl_plock_release(&ThreadLock); > + return i; > +} > + > +/* close and free up resources used by socket */ > +static void osm_console_deinit_socket(osm_opensm_t *p_osm) > +{ > + if (p_osm->console.socket > 0) > + { > + osm_log(&(p_osm->log), OSM_LOG_INFO, > + "osm_console: Closing the primary (listening) socket connection > (%d)\n", p_osm->console.in_fd); > + > + close(p_osm->console.in_fd); > + p_osm->console.in_fd = -1; > + p_osm->console.out_fd = -1; > + p_osm->console.in = NULL; > + p_osm->console.out = NULL; > + } > +} > + > +/* do everything necessary to gracefully turn off the console */ > +void osm_console_server_destroy(osm_opensm_t *p_osm) > +{ > + 
/* make sure consoles are closed before stopping the main listener socket > */ > + free_console_threads(p_osm); > + + cl_plock_destroy(&ThreadLock); > + + /* close the socket, listening for connections */ > + osm_console_deinit_socket(p_osm); > +} > + > +/* turns off the console, signature needs to match the parse_funciton() */ > +static void quit_parse(char **p_last, osm_console_thread_t *p_oct, CIO_t > *out) > +{ > + // set the "done" flag used by the isDone() method > + p_oct->authorized = 0; // temporarily use this as the done flag > + > + // do other necessary things to clean up and turn off > +} > + > + > /* more parse routines go here */ > static const struct command console_cmds[] = { > - {"help", &help_command, &help_parse}, > - {"quit", &help_quit, &quit_parse}, > - {"loglevel", &help_loglevel, &loglevel_parse}, > - {"priority", &help_priority, &priority_parse}, > - {"resweep", &help_resweep, &resweep_parse}, > - {"status", &help_status, &status_parse}, > - {"logflush", &help_logflush, &logflush_parse}, > - {"querylid", &help_querylid, &querylid_parse}, > - {"portstatus", &help_portstatus, &portstatus_parse}, > - {"version", &help_version, &version_parse}, > + { "help", &help_command, &help_parse}, > + { OSM_QUIT_CMD, &help_quit, &quit_parse}, > + { "loglevel", &help_loglevel, &loglevel_parse}, > + { "priority", &help_priority, &priority_parse}, > + { "resweep", &help_resweep, &resweep_parse}, > + { "status", &help_status, &status_parse}, > + { "logflush", &help_logflush, &logflush_parse}, > + { "querylid", &help_querylid, &querylid_parse}, > + { "portstatus", &help_portstatus, &portstatus_parse}, > + { "version", &help_version, &version_parse}, > #ifdef ENABLE_OSM_PERF_MGR > {"perfmgr", &help_perfmgr, &perfmgr_parse}, > #endif /* ENABLE_OSM_PERF_MGR */ > {NULL, NULL, NULL} /* end of array */ > }; > -static void parse_cmd_line(char *line, osm_opensm_t * p_osm) > -{ > - char *p_cmd, *p_last; > - int i, found = 0; > - FILE *out = p_osm->console.out; > - > - 
while (isspace(*line)) > - line++; > - if (!*line) > - return; > - /* find first token which is the command */ > - p_cmd = strtok_r(line, " \t\n\r", &p_last); > - if (p_cmd) { > - for (i = 0; console_cmds[i].name; i++) { > - if (loop_command.on) { > - if (!strcmp(p_cmd, "q")) { > - loop_command.on = 0; > - } > - found = 1; > - break; > - } > - if (!strcmp(p_cmd, console_cmds[i].name)) { > - found = 1; > - console_cmds[i].parse_function(&p_last, p_osm, > - out); > - break; > - } > - } > - if (!found) { > - fprintf(out, "%s : Command not found\n\n", p_cmd); > - help_command(out, 0); > - } > - } else { > - fprintf(out, "Error parsing command line: `%s'\n", line); > - } > - if (loop_command.on) { > - fprintf(out, "use \"q\" to quit loop\n"); > - fflush(out); > - } > +static void parse_cmd_line(char *line, osm_console_thread_t *oct) > +{ > + char *p_cmd, *p_last; > + int i, found = 0; > + CIO_t *out = getCIO(oct); > + + while (isspace(*line)) > + line++; > + if (!*line) > + return; > + > + /* find first token which is the command */ > + p_cmd = strtok_r(line, " \t\n\r", &p_last); > + if (p_cmd) { > + for (i = 0; console_cmds[i].name; i++) { > + if (oct->loop_command.on ) { > + if (!strcmp(p_cmd, "q")) { > + oct->loop_command.on = 0; > + } > + found = 1; > + break; > + } > + if (!strcmp(p_cmd, console_cmds[i].name)) { > + found = 1; > + console_cmds[i].parse_function(&p_last, oct, out); > + break; > + } > + } > + if (!found) { > + cio_printf(out, "%s : Command not found\n\n", p_cmd); > + help_command(out, 0); > + } > + } else { > + cio_printf(out, "Error parsing command line: `%s'\n", line); > + } > } > -void osm_console_prompt(FILE * out) > +void osm_console_prompt(CIO_t *out, int loop_prompt) > { > if (out) { > - fprintf(out, "OpenSM %s", OSM_COMMAND_PROMPT); > - fflush(out); > + if(loop_prompt) > + cio_printf(out, "use \"q\" to quit loop\n"); > + else > + cio_printf(out, "OpenSM %s", OSM_COMMAND_PROMPT); > + cio_flush(out); > } > } > -void 
osm_console_init(osm_subn_opt_t * opt, osm_opensm_t * p_osm) > +/* open and setup socket connection */ > +static void osm_console_init_socket(osm_opensm_t *p_osm, uint16_t > console_port, char* console_type) > { > - p_osm->console.socket = -1; > - /* set up the file descriptors for the console */ > - if (strcmp(opt->console, OSM_LOCAL_CONSOLE) == 0) { > - p_osm->console.in = stdin; > - p_osm->console.out = stdout; > - p_osm->console.in_fd = fileno(stdin); > - p_osm->console.out_fd = fileno(stdout); > - > - osm_console_prompt(p_osm->console.out); > #ifdef ENABLE_OSM_CONSOLE_SOCKET > - } else if (strcmp(opt->console, OSM_REMOTE_CONSOLE) == 0 > - || strcmp(opt->console, OSM_LOOPBACK_CONSOLE) == 0) { > - struct sockaddr_in sin; > - int optval = 1; > - > - if ((p_osm->console.socket = > - socket(AF_INET, SOCK_STREAM, 0)) < 0) { > - osm_log(&(p_osm->log), OSM_LOG_ERROR, > - "osm_console_init: ERR 4B01: Failed to open console socket: > %s\n", > - strerror(errno)); > - return; > - } > - setsockopt(p_osm->console.socket, SOL_SOCKET, SO_REUSEADDR, > - &optval, sizeof(optval)); > - sin.sin_family = AF_INET; > - sin.sin_port = htons(opt->console_port); > - if (strcmp(opt->console, OSM_REMOTE_CONSOLE) == 0) > - sin.sin_addr.s_addr = htonl(INADDR_ANY); > - else > - sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK); > - if (bind(p_osm->console.socket, &sin, sizeof(sin)) < 0) { > - osm_log(&(p_osm->log), OSM_LOG_ERROR, > - "osm_console_init: ERR 4B02: Failed to bind console socket: > %s\n", > - strerror(errno)); > - return; > - } > - if (listen(p_osm->console.socket, 1) < 0) { > - osm_log(&(p_osm->log), OSM_LOG_ERROR, > - "osm_console_init: ERR 4B03: Failed to listen on socket: > %s\n", > - strerror(errno)); > - return; > - } > - signal(SIGPIPE, SIG_IGN); /* protect ourselves from closed pipes > */ > - p_osm->console.in = NULL; > - p_osm->console.out = NULL; > - p_osm->console.in_fd = -1; > - p_osm->console.out_fd = -1; > - osm_log(&(p_osm->log), OSM_LOG_INFO, > - "osm_console_init: 
Console listening on port %d\n", > - opt->console_port); > + struct sockaddr_in sin; > + int optval = 1; > + + osm_log(&(p_osm->log), OSM_LOG_INFO, "osm_console_init_socket: > Initializing the socket: %d\n", console_port); > + + if ((p_osm->console.socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) > + { > + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init_socket: ERR > 4B01: Failed to open console socket: %s\n", strerror(errno)); > + return; > + } > + setsockopt(p_osm->console.socket, SOL_SOCKET, SO_REUSEADDR, &optval, > sizeof(optval)); > + sin.sin_family = AF_INET; > + sin.sin_port = htons(console_port); > + > + // loopback or ... > + if(is_loopback(console_type)) > + sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK); > + else > + sin.sin_addr.s_addr = htonl(INADDR_ANY); > + if (bind(p_osm->console.socket, &sin, sizeof(sin))< 0) > + { > + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init_socket: ERR > 4B02: Failed to bind console socket: %s\n", strerror(errno)); > + return; > + } > + if (listen(p_osm->console.socket, 2)< 0) > + { > + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init_socket: ERR > 4B03: Failed to listen on socket: %s\n", strerror(errno)); > + return; > + } > + > + signal(SIGPIPE, SIG_IGN); /* protect ourselves from closed pipes */ > + p_osm->console.in = NULL; > + p_osm->console.out = NULL; > + p_osm->console.in_fd = -1; > + p_osm->console.out_fd = -1; > + osm_log(&(p_osm->log), OSM_LOG_INFO, "osm_console_init_socket: Console > listening on port %d\n", console_port); > #endif > - } > } > -#ifdef ENABLE_OSM_CONSOLE_SOCKET > -static void handle_osm_connection(osm_opensm_t * p_osm, int new_fd, > - char *client_ip, char *client_hn) > +/********************************************************************** > + * thread pool primitive: gets the next available thread structure from > + * the pool. 
> + * > + * refer to free_console_thread() > + **********************************************************************/ > +osm_console_thread_t* new_console_thread(void) > { > - char *p_line; > - size_t len; > - ssize_t n; > - > - if (p_osm->console.in_fd >= 0) { > - FILE *file = fdopen(new_fd, "w+"); > - > - fprintf(file, "OpenSM Console connection already in use\n" > - " kill other session (y/n)? "); > - fflush(file); > - p_line = NULL; > - n = getline(&p_line, &len, file); > - if (n > 0 && (p_line[0] == 'y' || p_line[0] == 'Y')) { > - osm_console_close_socket(p_osm); > - } else { > - close(new_fd); > - return; > - } > - } > - p_osm->console.in_fd = new_fd; > - p_osm->console.out_fd = p_osm->console.in_fd; > - p_osm->console.in = fdopen(p_osm->console.in_fd, "w+"); > - p_osm->console.out = p_osm->console.in; > - osm_console_prompt(p_osm->console.out); > - osm_log(&(p_osm->log), OSM_LOG_INFO, > - "osm_console_init: Console connection accepted: %s (%s)\n", > - client_hn, client_ip); > + // return the next available thread from the pool > + // just iterate through.. > + > + int i; > + osm_console_thread_t* next = NULL; > + + cl_plock_acquire(&ThreadLock); > + for(i = 0; i < CIO_MAX_CONNECTS; ++i) > + { > + next = &ConsoleThreadPool[i]; > + if(next->used == 0) > + break; > + } > + + if(i >= CIO_MAX_CONNECTS) > + next = NULL; // full > + else > + { > + // immediately mark this as NOT available > + next->used = 1; > + next->thread_num = ++cio_thread_counter; > + gettimeofday(&(next->connect_time), NULL); + } > + cl_plock_release(&ThreadLock); > + + return next; > } > -static int connection_ok(char *client_ip, char *client_hn) > +/********************************************************************** > + * thread pool primitive: clears and initializes all the threads. 
If the > + * argument is a current thread, it will NOT be cleared (will be skipped) > + **********************************************************************/ > +int init_console_thread_pool(osm_console_thread_t* p_oct, osm_subn_opt_t > *opt, osm_opensm_t *p_osm) > { > - return (hosts_ctl > - (OSM_DAEMON_NAME, client_hn, client_ip, "STRING_UNKNOWN")); > + // initialize > + > + int i; > + osm_console_thread_t* oct; > + + cl_plock_acquire(&ThreadLock); > + for(i = 0; i < CIO_MAX_CONNECTS; ++i) > + { > + oct = &ConsoleThreadPool[i]; > + if(p_oct == NULL || p_oct != oct) > + { > + oct->used = 0; > + oct->thread_num = -1; > + oct->authorized = 0; > + oct->port = CIO_CONNECTION_PORT; > + oct->io.fd = -1; > + oct->state = 0; > + oct->p_osm = p_osm; > + if(opt != NULL) > + { > + oct->port = opt->console_port; > + strncpy(oct->name, opt->console, CIO_INFO_SIZE); > + } > + } > + } > + cl_plock_release(&ThreadLock); > + return i; > } > -#endif > -void osm_console(osm_opensm_t * p_osm) > +void osm_console_server_init(osm_subn_opt_t *opt, osm_opensm_t *p_osm) > { > - struct pollfd pollfd[2]; > - char *p_line; > - size_t len; > - ssize_t n; > - struct pollfd *fds; > - nfds_t nfds; > - > - pollfd[0].fd = p_osm->console.socket; > - pollfd[0].events = POLLIN; > - pollfd[0].revents = 0; > - > - pollfd[1].fd = p_osm->console.in_fd; > - pollfd[1].events = POLLIN; > - pollfd[1].revents = 0; > - > - fds = p_osm->console.socket < 0 ? &pollfd[1] : pollfd; > - nfds = p_osm->console.socket < 0 || pollfd[1].fd < 0 ? 
1 : 2; > - > - if (loop_command.on && loop_command_check_time() && > - loop_command.loop_function) { > - if (p_osm->console.out) { > - loop_command.loop_function(p_osm, p_osm->console.out); > - fflush(p_osm->console.out); > - } else { > - loop_command.on = 0; > - } > - } > + int status = 0; > + > + cl_plock_construct(&ThreadLock); > + status = cl_plock_init(&ThreadLock); > + if (status != IB_SUCCESS) > + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_server_init: lock > initialization error\n"); > + + init_console_thread_pool(NULL, opt, p_osm); > + + gettimeofday(&ServerTime, NULL); // start time > + + p_osm->console.socket = -1; > + > + /* set up the file descriptors for the console */ > + if (strcmp(opt->console, OSM_LOCAL_CONSOLE)== 0) > + { > + p_osm->console.in = stdin; > + p_osm->console.out = stdout; > + p_osm->console.in_fd = fileno(stdin); > + p_osm->console.out_fd = fileno(stdout); > + } > + else if (is_remote(opt->console)) > + { > + osm_console_init_socket(p_osm, opt->console_port, opt->console); > + } > + // TODO - other types of "console" connections here > +} > - if (poll(fds, nfds, 1000) <= 0) > - return; > +/********************************************************************** > + * Main Loop Thread. 
> + * > + * Continuously loop on this command until turned off > + **********************************************************************/ > +void osm_loop_thread(void *p_ptr) > +{ > + osm_console_thread_t *oct = ( osm_console_thread_t * ) p_ptr; > + CIO_t *p_io = getCIO(oct); > + + oct->loop_command.running = 1; > + while (oct->loop_command.on && oct->loop_command.loop_function) > + { > + if (p_io->out) > + { > + // dwell here > + cl_thread_suspend(oct->loop_command.delay_s * 1000); > + oct->loop_command.loop_function(oct->p_osm, p_io); > + + // send the cmd prompt > + osm_console_prompt(p_io, oct->loop_command.on); + > cio_flush(p_io); > + } > + else > + { > + oct->loop_command.on = 0; > + } > + } > + oct->loop_command.running = 0; > + return; > +} > +/********************************************************************** > + * Do authentication & authorization check > + **********************************************************************/ > +static int is_authorized(osm_console_thread_t *p_oct) > +{ > #ifdef ENABLE_OSM_CONSOLE_SOCKET > - if (pollfd[0].revents & POLLIN) { > - int new_fd = 0; > - struct sockaddr_in sin; > - socklen_t len = sizeof(sin); > - char client_ip[64]; > - char client_hn[128]; > - struct hostent *hent; > - if ((new_fd = accept(p_osm->console.socket, &sin, &len)) < 0) { > - osm_log(&(p_osm->log), OSM_LOG_ERROR, > - "osm_console: ERR 4B04: Failed to accept console socket: > %s\n", > - strerror(errno)); > - p_osm->console.in_fd = -1; > - return; > - } > - if (inet_ntop > - (AF_INET, &sin.sin_addr, client_ip, > - sizeof(client_ip)) == NULL) { > - snprintf(client_ip, 64, "STRING_UNKNOWN"); > - } > - if ((hent = gethostbyaddr((const char *)&sin.sin_addr, > - sizeof(struct in_addr), > - AF_INET)) == NULL) { > - snprintf(client_hn, 128, "STRING_UNKNOWN"); > - } else { > - snprintf(client_hn, 128, "%s", hent->h_name); > - } > - if (connection_ok(client_ip, client_hn)) { > - handle_osm_connection(p_osm, new_fd, client_ip, > - client_hn); > - } else 
{ > - osm_log(&(p_osm->log), OSM_LOG_ERROR, > - "osm_console: ERR 4B05: Console connection denied: %s > (%s)\n", > - client_hn, client_ip); > - close(new_fd); > - } > - return; > - } > + //// oct->authorized = pam_authorize(pTs); > + p_oct->authorized = !is_remote(p_oct->client_type) || > + hosts_ctl(OSM_DAEMON_NAME, p_oct->client_hn, > p_oct->client_ip, "STRING_UNKNOWN"); > +#else > + p_oct->authorized = 1; > #endif > + return p_oct->authorized; > +} > - if (pollfd[1].revents & POLLIN) { > - p_line = NULL; > - /* Get input line */ > - n = getline(&p_line, &len, p_osm->console.in); > - if (n > 0) { > - /* Parse and act on input */ > - parse_cmd_line(p_line, p_osm); > - if (!loop_command.on) { > - osm_console_prompt(p_osm->console.out); > - } > - } else > - osm_console_close_socket(p_osm); > - if (p_line) > - free(p_line); > - } > +/* > + * determine if the connection should be closed > + */ > +static int is_done(osm_console_thread_t *oct) > +{ > + int done = 0; // set to 1 when finished > + + /* Look for a condition that signals the connection should be closed */ > + if (!(oct->authorized) || !strcmp(oct->in_buff, OSM_QUIT_CMD) || > osm_exit_flag) > + { > + done = 1; > + } > + return (done); > +} > + > +/* > + * handle basic output to the client > + * > + * this includes results from a command, error information > + * or any appropriate feedback > + */ > +static int output(osm_console_thread_t *oct) > +{ > + CIO_t *out = getCIO(oct); > + + // send the output buffer to the client > + cio_printf(out, oct->out_buff); > + cio_flush(out); > + + // clear the output buffer?? 
> + oct->out_buff[0] = 0; > + > + // send the cmd prompt > + if(!oct->loop_command.on) > + osm_console_prompt(out, 0); > + + return (is_done(oct)); > +} > + > +/* > + * handle basic input from the socket > + */ > +static int input(osm_console_thread_t *oct) > +{ > + char *p_line = NULL; > + size_t len; > + ssize_t n; > + CIO_t *p_io = getCIO(oct); > + + // if we are in a loop command, the don't block > + if(oct->loop_command.on && !cio_poll(p_io, 1000)) > + return 0; > + > + /* Get input line */ > + n = cio_getline(&p_line, &len, p_io); > + if (n > 0) > + { > + // got something, so copy it to the input buffer > + sprintf(oct->in_buff, "%s", p_line); + > + if(p_line) > + free(p_line); > + } > + + return (0); > +} > + > +/* > + * process the command in the input buffer - > + * take action, produce results, copy to output buffer > + */ > +static int commands(osm_console_thread_t *oct) > +{ > + osm_opensm_t *p_osm = oct->p_osm; > + + ib_api_status_t status = IB_INSUFFICIENT_RESOURCES; > + > + parse_cmd_line(oct->in_buff, oct); > + + /* if parsed and executed then clear the input buffer > + */ > + oct->in_buff[0] = 0; > + + /* special case, only allow one loop command > + */ > + if(!oct->loop_command.running && oct->loop_command.on && > oct->loop_command.loop_function) > + { > + status = cl_thread_init(&oct->loop_command.loopThread, > osm_loop_thread, oct, "Loop command"); + if (status != IB_SUCCESS) > + { > + // something bad > + osm_log(&(p_osm->log), OSM_LOG_ERROR, > + "commands: Couldn't create a thread for the loop command!\n"); > + return -1; > + } > + } > + return (0); > +} > + > +/********************************************************************** > + * Initialization and configuration of the console connection. > + * (security & authorization, plus some bookkeeping) > + * > + * returns 1 if okay > + * 0 if not authorized > + * -1 if too many connections > + * -2 if error?? 
> + **********************************************************************/ > +int osm_console_init(osm_console_thread_t *p_oct) > +{ > + // the first opportunity to do thread specific actions > + + int status = 0; // not authorized > + int max_connects_exceeded = (num_console_threads() >= CIO_MAX_CONNECTS); > + > + osm_opensm_t *p_osm = p_oct->p_osm; > + CIO_t *p_io = getCIO(p_oct); > + > + // check for authorization > + if(is_authorized(p_oct)) > + { > + // check for available connections (too many?) > + if (!max_connects_exceeded) > + { > + cio_open(p_io); > + > + osm_log(&(p_osm->log), OSM_LOG_INFO, "osm_console_init: Console > connection accepted: %s (%s) s#%d\n", p_oct->client_hn, > + p_oct->client_ip, p_io->fd); > + status = 1; > + } > + else > + { > + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init: ERR 4B06: > No available connections: %s (%s) t#%d\n", p_oct->client_hn, > + p_oct->client_ip, num_console_threads()); > + status = -1; > + } + } > + else > + { > + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_init: ERR 4B05: > Console connection denied: %s (%s)\n", p_oct->client_hn, > + p_oct->client_ip); > + status = 0; + } > + + fflush(p_osm->log.out_port); > + return status; > +} > + > +/********************************************************************** > + * The console I/O and command loop > + * refer to: osm_console_init and osm_console_destroy > + **********************************************************************/ > +void osm_console(osm_console_thread_t *oct) > +{ > + cl_thread_suspend(100); // wait for other threads to initialize > + + // provide feedback from the server (probably from a previous command) > + while(!output(oct)) > + { > + // read the socket > + input(oct); > + + // process or act on the input > + commands(oct); > + } > + // final methods?? > } > + > +/********************************************************************** > + * Main Console Thread. 
> + * > + * Finish setting up the connection ( secure & authorized) and misc config > + * > + * Loop continuously in the osm_console method. > + * > + * Clean up, and gracefully exit when done > + **********************************************************************/ > +void osm_console_thread(void *p_ptr) > +{ > + osm_console_thread_t *p_oct = ( osm_console_thread_t * ) p_ptr; > + + /* Finish setting up the connection (secure & authorized) and misc > config */ > + if(osm_console_init(p_oct) == 1) > + { > + // do all i/o and commands until done > + osm_console(p_oct); > + + // done, so close down the console gracefully > + osm_console_destroy(p_oct); > + } + + // nothing left to do but destroy our own thread, return to pool > + osm_console_thread_destroy(p_oct); > + return; +} > + > +/* Prepare to launch the console by encapsulating all the necessary data in > a thread > + * safe data structure. > + * > + * Support for single (local) or multiple (socket) threads. > + * > + * initialize the console data structure for a thread, and then.. > + * if socket > + * create the thread > + * else > + * run inline > + * > + * refer to: osm_console_thread and osm_console_thread_destroy > + * > + */ > +int osm_console_thread_init(int socket, struct sockaddr_in *sin, > osm_subn_opt_t *p_opt, osm_opensm_t *p_osm) > +{ > + static int n_local = 0; > + osm_console_thread_t *oct; // see free_console_thread() !! > + ib_api_status_t status = IB_INSUFFICIENT_RESOURCES; > + > + // have we used up all available connections? > + if ((!is_remote(p_opt->console) && n_local) || ((oct = > new_console_thread())== NULL)) > + { > + if(n_local) > + cl_thread_suspend( 100000); // denied, dwell here before trying > again. 
> + else > + osm_log(&(p_osm->log), OSM_LOG_ERROR, > + "osm_console_thread_init: Maximum number of connections exceeded, > connection denied (%d)\n", > + num_console_threads()); > + return status; > + } > + + if(!is_remote(p_opt->console)) > + n_local++; // only one local connection... > + > + /* fill in the osm_console_thread_t structure (can't be NULL) */ > + oct->authorized = 0; > + oct->state = 0; > + oct->p_osm = p_osm; > + oct->io.fd = socket; > + oct->port = p_opt->console_port; > + snprintf(oct->client_type, CIO_NOTE_SIZE, p_opt->console); > + +#ifdef ENABLE_OSM_CONSOLE_SOCKET > + /* get then name and ip of the client (console connection) */ > + if(is_remote(oct->client_type)) > + { > + /* get the clients ip address */ > + if (inet_ntop(AF_INET, &sin->sin_addr, oct->client_ip, > sizeof(oct->client_ip))== NULL) > + { > + snprintf(oct->client_ip, CIO_NOTE_SIZE, "STRING_UNKNOWN"); > + } > + + /* get the clients host name */ > + struct hostent *hent; > + if ((hent = gethostbyaddr((const char *)&sin->sin_addr, sizeof(struct > in_addr), AF_INET)) == NULL) > + { > + snprintf(oct->client_hn, CIO_INFO_SIZE, "STRING_UNKNOWN"); > + } > + else > + { > + snprintf(oct->client_hn, CIO_INFO_SIZE, "%s", hent->h_name); > + } + } + else > +#endif + { > + if(gethostname(oct->client_hn, CIO_INFO_SIZE)) > + { > + snprintf(oct->client_hn, CIO_INFO_SIZE, "localhost"); > + snprintf(oct->client_ip, CIO_NOTE_SIZE, "localhost"); > + } > + else > + snprintf(oct->client_ip, CIO_NOTE_SIZE, oct->client_hn); + } > + + > + // create a name for the thread, based on the connection > + snprintf(oct->name, CIO_INFO_SIZE, "%s %d", OSM_CONSOLE_NAME, > oct->io.fd); > + > + // ***** Finally, create a new thread for this connection ****** > + status = cl_thread_init(&oct->consoleThread, osm_console_thread, oct, > oct->name); + if (status != IB_SUCCESS) > + { > + // something bad > + osm_log(&(p_osm->log), OSM_LOG_ERROR, > + "osm_console_thread_init: Couldn't create a thread for the > socket!\n"); > 
+ > + // free up the thread, wasn't actually used + > osm_console_thread_destroy(oct); > + return -1; > + } > + return 0; > +} > + > + > +/* Multi-threaded service to handle zero or more osm_consoles > + * > + * Typically the OSM runs as a daemon process, with zero > + * consoles. Occationally it is necessary to remotely connect > + * to the OSM through a console connection. > + * > + * Allow one Master remote console and many Slaves. > + * > + * Provide a mechanism to release and assume Master role. > + * > + */ > +int osm_console_server(osm_subn_opt_t *p_opt, osm_opensm_t *p_osm) > +{ > + struct sockaddr_in sin; > + int s = 0; > + > + /* don't enter this code section, if the exit flag is true */ > + if (!osm_exit_flag) > + { > + // handle IO from local or remote console > + // blocks here until a client tries to connect > + > + /* > + * this version is supposed to block > + * > + * the block is released when a connection occurs, which causes a new > + * thread to be spawned to handle the connection. The new thread > cleans > + * up after itself. > + * > + * return only happens after a successful connection has been > established, > + * and needs to be prepared for another connection. > + */ > +#ifdef ENABLE_OSM_CONSOLE_SOCKET > + socklen_t len = sizeof(sin); > + if (is_remote(p_opt->console) && ((s = accept(p_osm->console.socket, > &sin, &len)) < 0)) > + { > + // kill sig can cause this... 
which would be normal during a shutdown > + osm_log(&(p_osm->log), OSM_LOG_ERROR, "osm_console_server: not > accepting socket connections\n"); > + return -1; > + } > + else > +#endif > + // create a thread to handle the i/o on this connection > + osm_console_thread_init(s, &sin, p_opt, p_osm); > + } > + else > + free_console_threads(p_osm); // clean up > + return s; > +} > + > + > +/********************************************************************** > + * Function Name: > + * cio_vprintf > + * > + * This routine formats a message and uses a Stream IO abstraction to > determine > + * how and where to write the message out (stdout, socket, ssl, etc.) > + * > + * Side Effects: > + * Unknown, uses vsprintf and variable arguments. Possible > stack problems. > + * > + * cio pointer to the Connection IO data structure - an IO Stream > abstraction > + * > + * format A string literal that describes the desired text and > formatting. See printf(). > + * > + * args A variable argument list, of the type available between a > va_start() and > + * va_end() block. > + * > + * Always returns 0 > + > ******************************************************************************/ > + > + int cio_vprintf( CIO_t *cio, const char *format, va_list args) > + { > + char msg_buffer[CIO_BUFSIZE]; > + > + // create the formatted string and place it in the local string buffer > + vsprintf(msg_buffer, format, args); > + + // send it out the proper I/O channel > + fprintf(cio->out, msg_buffer); > + > + return 0; > + } > + > +/****************************************************************************** > + * Function Name: > + * cio_printf > + * > + * This is an abstract form of the standard fprintf() routine. It can be > used > + * in an identical manner, with the exception of the first argument that > needs > + * to be the Connection IO abstraction, rather than a FILE. > + * > + * Side Effects: > + * Unknown, uses vsprintf and variable arguments. Possible > stack problems. 
> + * > + * cio pointer to the Connection IO data structure - an IO Stream > abstraction > + * > + * format A string literal that describes the desired text and > formatting. See printf(). > + * > + * args A variable argument list, of the type available between a > va_start() and > + * va_end() block. > + * > + * Always returns 0, from cio_vprintf() > + > ******************************************************************************/ > + > + int cio_printf( CIO_t *cio, const char *format, ...) > + { > + int returnval = 0; > + va_list args; > + + // Sink Filter or Message Filter. Does it get printed?? > + if(1) > + { > + va_start(args, format); > + returnval = cio_vprintf(cio, format, args); > + va_end(args); > + } > + return returnval; > + } > + > + int cio_flush( CIO_t *cio) > + { > + int returnval = fflush(cio->out); > + + return returnval; > + } > + > + int cio_getline( char **lineptr, size_t *n, CIO_t *cio) > + { > + int returnval = getline(lineptr, n, cio->in); > + + return returnval; > + } > + > + int cio_open( CIO_t *cio) > + { > + // returns zero, if opened fine, -1 otherwise > + + struct pollfd *pd = (struct pollfd* )malloc(sizeof(struct pollfd)); > + if (pd == NULL) > + return -1; // should not happen > + + cio->in = fdopen(cio->fd, "w+"); > + cio->out = cio->in; > + cio->err = cio->in; > + + cio->pfd = pd; > + cio->pfd[0].fd = cio->fd; > + cio->pfd[0].events = POLLIN; > + cio->pfd[0].revents = 0; > + + return (cio->in == NULL) ? -1 : 0; > + } > + > + int cio_close( CIO_t *cio) > + { > + int rtnval = -1; > + if(cio && (cio->fd > 0)) > + { > + free(cio->pfd); > + rtnval = close(cio->fd); > + } > + cio->fd = 0; > + return rtnval; > + } > + > + /* return true if input available */ > + int cio_poll(CIO_t *cio, int timeout) > + { > + // if timeout is less than 1, return true, alw > + if(timeout < 1) > + return 1; > + return (poll(cio->pfd, 1, timeout) > 0); > + } It is not clear for me why most of those wrapper functions are needed at all. 
Nor is it clear how such a big comment about *_printf() usage is helpful. Sasha From jackm at dev.mellanox.co.il Sun Oct 28 00:51:38 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 28 Oct 2007 09:51:38 +0200 Subject: [ofa-general] [PATCH 3 OF 5 v2] libmlx4: avoid adding unneeded extra CQE when creating a cq Message-ID: <200710280951.39156.jackm@dev.mellanox.co.il> commit c04463eb343a0f038eb7a2a877be90cd3e3e19a3 Author: Jack Morgenstein Date: Thu Oct 25 19:17:42 2007 +0200 Do not add an extra CQE when creating a CQ. Sanity-check against returned device capabilities, to avoid breaking ABI. Set minimum to 2, to avoid rejection by kernel. Adjust num cqes passed to verbs layer. Signed-off-by: Jack Morgenstein --- Roland, The previous patch neglected to increase the number of CQEs returned to the verbs-layer caller by 1. If the mlx4 layer was invoked with a power of 2, the returned value was that power of 2 minus 1, which is not in conformance with the IB spec. This patch fixes that oversight. In order to preserve the ABI, the corresponding kernel patch still returns the power of 2 minus 1; however, the user layer can determine if the kernel has adjusted the number of CQEs per qp by examining if the device-capability max_cqes is a power of 2 -- if so, then create_cq() can increment the returned cqe value by 1. It's possible that this increment can be done unconditionally (i.e., even if there is a previous kernel driver installed) -- I've not yet checked this out. 
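A minimal sketch of the user-space check described above, for illustration only. It assumes the kernel hands back a CQE count of the form 2^k - 1 (usable directly as an index mask) and that a power-of-2 max_cqe capability signals a kernel carrying the no-spare-CQE fix. The names round_up_pow2(), struct cq_sizes and adjust_cqe() are hypothetical and are not part of libmlx4; round_up_pow2() only stands in for the role align_queue_size() plays in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for align_queue_size(): smallest power of 2 >= req. */
static int round_up_pow2(int req)
{
	int nent;

	assert(req >= 1);
	for (nent = 1; nent < req; nent <<= 1)
		; /* keep doubling until we reach or pass req */
	return nent;
}

/* Hypothetical container for the two values the patch keeps apart. */
struct cq_sizes {
	uint32_t mask;         /* 2^k - 1, used to wrap CQ indices */
	int      reported_cqe; /* value handed back to the verbs caller */
};

/*
 * kernel_cqe: CQE count returned by the kernel (assumed to be 2^k - 1).
 * max_cqe:    device capability; a power of 2 is taken to mean that the
 *             kernel no longer allocates a spare CQE.
 */
static struct cq_sizes adjust_cqe(int kernel_cqe, int max_cqe)
{
	struct cq_sizes s;

	s.mask = (uint32_t) kernel_cqe;
	s.reported_cqe = kernel_cqe;
	if (max_cqe == round_up_pow2(max_cqe))
		s.reported_cqe++; /* report the full power-of-2 size */
	return s;
}
```

With a power-of-2 capability such as 0x400000, a kernel value of 1023 is reported to the caller as 1024 while the mask stays 1023; with the older limit 0x3fffff the reported value is left at 1023.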
- Jack diff --git a/src/cq.c b/src/cq.c index c0d7a8b..aac84da 100644 --- a/src/cq.c +++ b/src/cq.c @@ -114,10 +114,10 @@ static struct mlx4_cqe *get_cqe(struct mlx4_cq *cq, int entry) static void *get_sw_cqe(struct mlx4_cq *cq, int n) { - struct mlx4_cqe *cqe = get_cqe(cq, n & cq->ibv_cq.cqe); + struct mlx4_cqe *cqe = get_cqe(cq, n & cq->cqe_mask); return (!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^ - !!(n & (cq->ibv_cq.cqe + 1))) ? NULL : cqe; + !!(n & (cq->cqe_mask + 1))) ? NULL : cqe; } static struct mlx4_cqe *next_cqe_sw(struct mlx4_cq *cq) @@ -417,7 +417,7 @@ void mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq) * from our QP and therefore don't need to be checked. */ for (prod_index = cq->cons_index; get_sw_cqe(cq, prod_index); ++prod_index) - if (prod_index == cq->cons_index + cq->ibv_cq.cqe) + if (prod_index == cq->cons_index + cq->cqe_mask) break; /* @@ -425,7 +425,7 @@ void mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq) * that match our QP by copying older entries on top of them. 
*/ while ((int) --prod_index - (int) cq->cons_index >= 0) { - cqe = get_cqe(cq, prod_index & cq->ibv_cq.cqe); + cqe = get_cqe(cq, prod_index & cq->cqe_mask); if (is_xrc_srq && (ntohl(cqe->g_mlpath_rqpn & 0xffffff) == srq->srqn) && !(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK)) { @@ -436,7 +436,7 @@ void mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq) mlx4_free_srq_wqe(srq, ntohs(cqe->wqe_index)); ++nfreed; } else if (nfreed) { - dest = get_cqe(cq, (prod_index + nfreed) & cq->ibv_cq.cqe); + dest = get_cqe(cq, (prod_index + nfreed) & cq->cqe_mask); owner_bit = dest->owner_sr_opcode & MLX4_CQE_OWNER_MASK; memcpy(dest, cqe, sizeof *cqe); dest->owner_sr_opcode = owner_bit | diff --git a/src/mlx4.h b/src/mlx4.h index 09e2bdd..707061b 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -216,6 +216,7 @@ struct mlx4_cq { uint32_t *set_ci_db; uint32_t *arm_db; int arm_sn; + uint32_t cqe_mask; }; struct mlx4_srq { diff --git a/src/verbs.c b/src/verbs.c index 059b534..d2a15d5 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -168,11 +168,22 @@ struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, struct mlx4_create_cq_resp resp; struct mlx4_cq *cq; int ret; + struct mlx4_context *mctx = to_mctx(context); + int no_spare_cqe = 0; /* Sanity check CQ size before proceeding */ - if (cqe > 0x3fffff) + if (cqe < 1 || cqe > mctx->max_cqe) return NULL; + /* if max allowable cqes is a power-of-2, no spare cqe fix is in + * the kernel + */ + if (mctx->max_cqe == align_queue_size(mctx->max_cqe)) + no_spare_cqe = 1; + + /* raise minimum, to avoid breaking ABI */ + cqe = (cqe == 1) ? 
2 : cqe; + cq = malloc(sizeof *cq); if (!cq) return NULL; @@ -182,7 +193,7 @@ struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, if (pthread_spin_init(&cq->lock, PTHREAD_PROCESS_PRIVATE)) goto err; - cqe = align_queue_size(cqe + 1); + cqe = align_queue_size(cqe); if (mlx4_alloc_buf(&cq->buf, cqe * MLX4_CQ_ENTRY_SIZE, to_mdev(context->device)->page_size)) @@ -209,6 +220,9 @@ struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, goto err_db; cq->cqn = resp.cqn; + cq->cqe_mask = cq->ibv_cq.cqe; + if (no_spare_cqe) + cq->ibv_cq.cqe++; return &cq->ibv_cq; From dotanb at dev.mellanox.co.il Sun Oct 28 00:48:14 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 28 Oct 2007 09:48:14 +0200 Subject: [ofa-general] message is received but sender report error. In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA8403029C7010@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> <000301c8172d$c58b7f80$ff0da8c0@amr.corp.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403029C7010@G3W0634.americas.hpqcorp.net> Message-ID: <47243EBE.7010809@dev.mellanox.co.il> Hi. Maybe you should increase your timeout/retry count for your application? can you check the ports error counters (using perfquery) maybe you have bad cables in your subnet .... Dotan Tang, Changqing wrote: > This is Verbs layer code, no IB CM is used. > > --CQ > > >> -----Original Message----- >> From: Sean Hefty [mailto:sean.hefty at intel.com] >> Sent: Thursday, October 25, 2007 12:38 PM >> To: Tang, Changqing; Roland Dreier >> Cc: general at lists.openfabrics.org >> Subject: RE: [ofa-general] message is received but sender >> report error. >> >> >>> If this is the case, how would we fix the problem ? It's >>> >> hard for us to >> >>> delay to destroy the QP, because we don't know how long to delay. >>> The other way is to do something from the driver, or firmware. 
>>> >> Do you disconnect the QPs using the IB CM? >> >> - Sean >> >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From jackm at dev.mellanox.co.il Sun Oct 28 00:59:57 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 28 Oct 2007 09:59:57 +0200 Subject: [ofa-general] [PATCH 5 of 5 v2] mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ Message-ID: <200710280959.58133.jackm@dev.mellanox.co.il> mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ. The extra CQE can cause a huge waste of memory if requesting a power-of-2 number of CQEs. The number of CQEs in the cq that is returned to the kernel_caller is now a power-of-2. The value returned to userspace callers is the same as before, in order to preserve the ABI. Signed-off-by: Jack Morgenstein --- Roland, The previous patch neglected to increase the number of CQEs returned to the verbs-layer caller by 1. If the mlx4 layer was invoked with a power of 2, the returned value was - 1, which is not in conformance with the the IB spec. This patch fixes that oversight. In order to preserve the ABI, the kernel still returns - 1 cqes to a userspace caller; adjustments are made in userspace by libmlx4. 
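The sizing arithmetic described in the patch note above can be sketched as follows. This is only a minimal illustration in Python, assuming nothing beyond the rounding rule visible in the patch (round the entry count up to a power of two, with or without the spare entry). The helper names and the 32-byte entry size are assumptions made for the example, not values taken from the driver.

```python
CQE_SIZE = 32  # assumed size of one CQ entry in bytes (illustrative only)


def roundup_pow_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def cq_buffer_bytes(requested: int, spare_cqe: bool) -> int:
    """Buffer size for a CQ, with or without the extra (spare) entry."""
    entries = roundup_pow_of_two(requested + 1 if spare_cqe else requested)
    return entries * CQE_SIZE


def ring_slot(index: int, entries: int) -> int:
    """Map a free-running index onto a ring whose size is a power of two."""
    return index & (entries - 1)


def wrap_parity(index: int, entries: int) -> bool:
    """Flips on every full pass over the ring, like the owner-bit check."""
    return (index & entries) != 0


# Requesting exactly 4096 entries: the spare entry pushes the old scheme
# to the next power of two, doubling the buffer.
old = cq_buffer_bytes(4096, spare_cqe=True)
new = cq_buffer_bytes(4096, spare_cqe=False)
print(old, new)
```

With a power-of-two entry count the index mask is simply entries - 1, which is the role the cqe_mask field plays in the libmlx4 part of the change.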
- Jack Index: infiniband/drivers/infiniband/hw/mlx4/cq.c =================================================================== --- infiniband.orig/drivers/infiniband/hw/mlx4/cq.c 2007-10-28 09:34:17.055937000 +0200 +++ infiniband/drivers/infiniband/hw/mlx4/cq.c 2007-10-28 09:36:22.457431000 +0200 @@ -80,10 +80,10 @@ static void *get_cqe(struct mlx4_ib_cq * static void *get_sw_cqe(struct mlx4_ib_cq *cq, int n) { - struct mlx4_cqe *cqe = get_cqe(cq, n & cq->ibcq.cqe); + struct mlx4_cqe *cqe = get_cqe(cq, n & (cq->ibcq.cqe - 1)); return (!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^ - !!(n & (cq->ibcq.cqe + 1))) ? NULL : cqe; + !!(n & cq->ibcq.cqe)) ? NULL : cqe; } static struct mlx4_cqe *next_cqe_sw(struct mlx4_ib_cq *cq) @@ -108,8 +108,16 @@ struct ib_cq *mlx4_ib_create_cq(struct i if (!cq) return ERR_PTR(-ENOMEM); - entries = roundup_pow_of_two(entries + 1); - cq->ibcq.cqe = entries - 1; + /* eliminate using extra CQE (for kernel space). + * For userspace, do in libmlx4, so that don't break ABI. + */ + if (context) { + entries = roundup_pow_of_two(entries + 1); + cq->ibcq.cqe = entries - 1; + } else { + entries = roundup_pow_of_two(entries); + cq->ibcq.cqe = entries; + } buf_size = entries * sizeof (struct mlx4_cqe); spin_lock_init(&cq->lock); @@ -222,7 +230,7 @@ int mlx4_ib_destroy_cq(struct ib_cq *cq) mlx4_ib_db_unmap_user(to_mucontext(cq->uobject->context), &mcq->db); ib_umem_release(mcq->umem); } else { - mlx4_buf_free(dev->dev, (cq->cqe + 1) * sizeof (struct mlx4_cqe), + mlx4_buf_free(dev->dev, (cq->cqe) * sizeof (struct mlx4_cqe), &mcq->buf.buf); mlx4_ib_db_free(dev, &mcq->db); } @@ -489,7 +497,7 @@ void __mlx4_ib_cq_clean(struct mlx4_ib_c * from our QP and therefore don't need to be checked. 
*/ for (prod_index = cq->mcq.cons_index; get_sw_cqe(cq, prod_index); ++prod_index) - if (prod_index == cq->mcq.cons_index + cq->ibcq.cqe) + if (prod_index == cq->mcq.cons_index + cq->ibcq.cqe - 1) break; /* @@ -497,13 +505,13 @@ void __mlx4_ib_cq_clean(struct mlx4_ib_c * that match our QP by copying older entries on top of them. */ while ((int) --prod_index - (int) cq->mcq.cons_index >= 0) { - cqe = get_cqe(cq, prod_index & cq->ibcq.cqe); + cqe = get_cqe(cq, prod_index & (cq->ibcq.cqe - 1)); if ((be32_to_cpu(cqe->my_qpn) & 0xffffff) == qpn) { if (srq && !(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK)) mlx4_ib_free_srq_wqe(srq, be16_to_cpu(cqe->wqe_index)); ++nfreed; } else if (nfreed) { - dest = get_cqe(cq, (prod_index + nfreed) & cq->ibcq.cqe); + dest = get_cqe(cq, (prod_index + nfreed) & (cq->ibcq.cqe - 1)); owner_bit = dest->owner_sr_opcode & MLX4_CQE_OWNER_MASK; memcpy(dest, cqe, sizeof *cqe); dest->owner_sr_opcode = owner_bit | Index: infiniband/drivers/net/mlx4/main.c =================================================================== --- infiniband.orig/drivers/net/mlx4/main.c 2007-10-28 09:34:17.077932000 +0200 +++ infiniband/drivers/net/mlx4/main.c 2007-10-28 09:36:22.465430000 +0200 @@ -141,12 +141,7 @@ static int mlx4_dev_cap(struct mlx4_dev dev->caps.max_sq_desc_sz = dev_cap->max_sq_desc_sz; dev->caps.max_rq_desc_sz = dev_cap->max_rq_desc_sz; dev->caps.num_qp_per_mgm = MLX4_QP_PER_MGM; - /* - * Subtract 1 from the limit because we need to allocate a - * spare CQE so the HCA HW can tell the difference between an - * empty CQ and a full CQ. 
- */ - dev->caps.max_cqes = dev_cap->max_cq_sz - 1; + dev->caps.max_cqes = dev_cap->max_cq_sz; dev->caps.reserved_cqs = dev_cap->reserved_cqs; dev->caps.reserved_eqs = dev_cap->reserved_eqs; dev->caps.reserved_mtts = DIV_ROUND_UP(dev_cap->reserved_mtts, From erezz at voltaire.com Sun Oct 28 01:10:06 2007 From: erezz at voltaire.com (Erez Zilber) Date: Sun, 28 Oct 2007 10:10:06 +0200 Subject: [ofa-general] iSER for stgt - wiki page In-Reply-To: References: Message-ID: <472443DE.8080503@voltaire.com> Sufficool, Stanley wrote: > Does anyone know a source for Windows initiators for iSER? > Currently, there's no open-source iSER initiator for Windows. However, we (Voltaire) are working on it. -- ____________________________________________________________ Erez Zilber | 972-9-971-7689 Software Engineer, Storage Solutions Voltaire – _The Grid Backbone_ __ www.voltaire.com From vlad at lists.openfabrics.org Sun Oct 28 02:53:32 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 28 Oct 2007 02:53:32 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071028-0200 daily build status Message-ID: <20071028095332.5468EE6087E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with 
linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Failed: Build failed on x86_64 with linux-2.6.9-22.ELsmp Log: /home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.c:951: warning: assignment discards qualifiers from pointer target type /home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.c: In function 'class_device_create': /home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-22.ELsmp_x86_64_check/kernel_addons/backport/2.6.9_U2/include/linux/device.h:108: 
sorry, unimplemented: function 'class_device_create' can never be inlined because it uses variable argument lists make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband/core] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-22.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-22.ELsmp' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-34.ELsmp Log: /home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.c:951: warning: assignment discards qualifiers from pointer target type /home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.c: In function 'class_device_create': /home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-34.ELsmp_x86_64_check/kernel_addons/backport/2.6.9_U3/include/linux/device.h:108: sorry, unimplemented: function 'class_device_create' can never be inlined because it uses variable argument lists make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband/core] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071028-0200_linux-2.6.9-34.ELsmp_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-34.ELsmp' 
make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From kliteyn at dev.mellanox.co.il Sun Oct 28 03:50:30 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 28 Oct 2007 12:50:30 +0200 Subject: [ofa-general] [PATCH] osm: adding missing dependency in the makefile Message-ID: <47246976.4040702@dev.mellanox.co.il> Adding missing dependency in the makefile. Without it make may fail when compiling with -j. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/Makefile.am | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am index 5e1abd5..2895d18 100644 --- a/opensm/opensm/Makefile.am +++ b/opensm/opensm/Makefile.am @@ -65,7 +65,7 @@ osm_qos_parser_y.c: $(srcdir)/osm_qos_parser.y $(srcdir)/../include/opensm/osm_q $(YACC) -d -o $(srcdir)/osm_qos_parser_y.c -p__qos_parser_ $(srcdir)/osm_qos_parser.y mv -f $(srcdir)/osm_qos_parser_y.h $(srcdir)/../include/opensm/osm_qos_parser_y.h -osm_qos_parser_l.c: $(srcdir)/osm_qos_parser.l $(srcdir)/../include/opensm/osm_qos_policy.h +osm_qos_parser_l.c: $(srcdir)/osm_qos_parser.l $(srcdir)/../include/opensm/osm_qos_policy.h osm_qos_parser_y.c $(LEX) -P__qos_parser_ -o$(srcdir)/osm_qos_parser_l.c $(srcdir)/osm_qos_parser.l if OSMV_OPENIB -- 1.5.1.4 From sashak at voltaire.com Sun Oct 28 04:27:01 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 28 Oct 2007 13:27:01 +0200 Subject: [ofa-general] Re: [PATCH] osm: adding missing dependency in the makefile In-Reply-To: <47246976.4040702@dev.mellanox.co.il> References: <47246976.4040702@dev.mellanox.co.il> Message-ID: <20071028112701.GQ22317@sashak.voltaire.com> On 12:50 Sun 28 Oct , Yevgeny Kliteynik wrote: > Adding missing dependency in the makefile. > Without it make may fail when compiling with -j. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. 
Sasha From or.gerlitz at gmail.com Sun Oct 28 05:55:10 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Sun, 28 Oct 2007 14:55:10 +0200 Subject: [ofa-general] OpenFabrics Developer's Summit: tentative agenda In-Reply-To: <20071024003159.GA10244@cuprite.pathscale.com> References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> <20071024003159.GA10244@cuprite.pathscale.com> Message-ID: <15ddcffd0710280555q5d12bf86h53d6809f646d1df0@mail.gmail.com> On 10/24/07, Johann George wrote: > > Thanks for your feedback. We can certainly expand, shorten and change > sessions around to maximize benefit for the attendees. > OK, so lets do it! In short, the agenda as it is now, has a great bias to "updates and reporting" && windowz-staff && iwarp-staff vs Linux-infiniband-developers-that-get-together-to-discuss-shout-laugh-brain-storm-think-together-suggest-designs-etc with this agenda and your polite "NO" saying on everything I suggested, at hand, I don't see how to make progress. Taking a constructive direction, I suggest the following: put everything which is not stricly related to Linux-infiniband-developers (dapl / iwarp / windowz / logo program / etc) in one track, and let the developers decide what's in their track. As for the updates, start two hours earlier and make all the updates to take place on that time slot. BTW - getting feedback from commercial MPIs as the Intel and HP ones, is at least important as getting update/feedback from MVAPICH and OMPI. I vote for a 45 minutes session as Jeff suggested with all MPIs.
If it does not work for you to start earlier with two hours of updates etc as I have suggested here, allocate two sessions for the Linux-IB-developers, Thursday 6PM-8PM and Friday 8AM-10AM Seriously, thanks much for the feedback. If the majority feel we > should move some sessions, we'll do our best to accommodate people's > schedules. So far there are two votes (Jeff and myself) to the direction I have suggested and one (yours) to the agenda you have posted, 2:1 is a strict majority. Based on your reply I would be starting to collect votes for my suggestion. Or. From panda at cse.ohio-state.edu Sun Oct 28 06:47:42 2007 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sun, 28 Oct 2007 09:47:42 -0400 (EDT) Subject: [ofa-general] Re: [ewg] OFED October 22 meeting summary on OFED 1.3 alpha status and In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9015640A8@mtlexch01.mtl.com> from "Tziporet Koren" at Oct 23, 2007 04:45:38 PM Message-ID: <200710281347.l9SDlgQP008866@xi.cse.ohio-state.edu> Tziporet, > 2. MPI status: > * MVAPICH - We wish to integrate the 1.0 code by the end of this week. > In this way it will be ready for the OFED beta release next week - need > DK approval We have made MVAPICH 1.0-beta release on Friday. Thus, it can be integrated with OFED 1.3 now. Unfortunately, my announcement related to MVAPICH 1.0-beta release got dropped from the ewg list. It has been posted on the general list. Thanks, DK From eli at mellanox.co.il Sun Oct 28 07:18:01 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 28 Oct 2007 16:18:01 +0200 Subject: [ofa-general] opensm partitions Message-ID: <1193581081.25235.91.camel@mtls03> Hi, I am trying to setup opensm for creating partitions for use with ipoib.
I referred to the man pages and copied the following example to /etc/osm-partitions.conf: Default=0x7fff : ALL, SELF=full ; When running opensm I get the following error: PARSE ERROR: line 3: no partition definition found Can you send a valid sample configuration that I can use and also update the documentation? Thanks, Eli From sashak at voltaire.com Sun Oct 28 07:50:29 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 28 Oct 2007 16:50:29 +0200 Subject: [ofa-general] Re: opensm partitions In-Reply-To: <1193581081.25235.91.camel@mtls03> References: <1193581081.25235.91.camel@mtls03> Message-ID: <20071028145029.GV6945@sashak.voltaire.com> On 16:18 Sun 28 Oct , Eli Cohen wrote: > > I am trying to setup opensm for creating partitions for use with ipoib. > I refer to the man pages and copy the following example > to /etc/osm-partitions.conf: > > Default=0x7fff : ALL, SELF=full ; > > When running opensm I get the following error: > > PARSE ERROR: line 3: no partition definition found I'm not seeing any error with a similar file. Could you attach the exact file? > Can you send a valid sample configuration that I can use and also update > the documentation? Attached. Sasha -------------- next part -------------- Default=0x7fff : ALL, SELF=full ; From moshek at voltaire.com Sun Oct 28 08:10:36 2007 From: moshek at voltaire.com (Moshe Kazir) Date: Sun, 28 Oct 2007 17:10:36 +0200 Subject: [ofa-general] Running netperf and iperf over SDP In-Reply-To: <200710281347.l9SDlgQP008866@xi.cse.ohio-state.edu> Message-ID: <39C75744D164D948A170E9792AF8E7CA4D2BC6@exil.voltaire.com> While running netperf and iperf over SDP I get unstable performance results. On iperf I get more than 25% difference between minimum and maximum. On netperf I get the following amazing result. -> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ # while LD_PRELOAD=/usr/lib64/libsdp.so ./netperf -H 192.168.7.172 -- -m 512 -M 1047 ; do echo .
; done TCP STREAM TEST to 192.168.7.172 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 126976 126976 512 10.00 3247.39 . TCP STREAM TEST to 192.168.7.172 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 126976 126976 512 10.00 1222.48 . ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Can some one spar me a hint ? What I'm doing wrong ? The test run on x86_64 , sles 10 sp1 , OFED-1.2.5 Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com
From ggrundstrom at NetEffect.com Sun Oct 28 11:21:30 2007 From: ggrundstrom at NetEffect.com (Glenn Grundstrom) Date: Sun, 28 Oct 2007 13:21:30 -0500 Subject: [ofa-general] RE: [PATCH 1/14 v2] nes: module and device initialization In-Reply-To: References: <200710192001.l9JK1U8O021689@neteffect.com> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC078C9670@venom2> > > OK, on a process level, my plan is to pull the current driver into a > "neteffect" branch in my git tree with the intention of merging it for > 2.6.25. I'll let you know when that's ready (probably early next > week). I'll probably do some cleanups there, and you can send me > cleanup/fix patches against that branch any time too. We should try > to keep the cycle time short: the interval between the first posting > of this driver and the current one was pretty long, and there's a lot > of cleanup to do to get ready for the next merge window. Does that > plan make sense? > > - R. > Thanks Roland. Let me know when you have your branch ready. Glenn. From jimmott at austin.rr.com Sun Oct 28 11:40:45 2007 From: jimmott at austin.rr.com (Jim Mott) Date: Sun, 28 Oct 2007 13:40:45 -0500 Subject: [ofa-general] RE: [ewg] Running netperf and iperf over SDP In-Reply-To: <39C75744D164D948A170E9792AF8E7CA4D2BC6@exil.voltaire.com> References: <200710281347.l9SDlgQP008866@xi.cse.ohio-state.edu> <39C75744D164D948A170E9792AF8E7CA4D2BC6@exil.voltaire.com> Message-ID: <004c01c81992$14bcb450$3e361cf0$@rr.com> Hi, I have seen large variances between runs since I started SDP work. I have not seen more than 25% (or so) difference with netperf runs, but even that is unacceptable. The change from 3200 to 1200 is big enough that I expect something else is going on. Is it possible that the second run did not use SDP? I always export LD_PRELOAD (or unset LD_PRELOAD) prior to running the tests. With the publication of the zero copy bcopy changes, bugs and performance issues have moved to the top of my list. 
I would love to get any specific performance test parameters that seem to reliably cause especially egregious results like these. Jim Mott -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Moshe Kazir Sent: Sunday, October 28, 2007 10:11 AM To: ewg at lists.openfabrics.org; general at lists.openfabrics.org Cc: Moni Levy; Alon Verner Subject: [ewg] Running netperf and iperf over SDP While running netperf and iperf over SDP I get unstable Performance results. On iperf I get more then 25 % difference between minimum and maximum. On netperf I get the following amazing result. -> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ # while LD_PRELOAD=/usr/lib64/libsdp.so ./netperf -H 192.168.7.172 -- -m 512 -M 1047 ; do echo . ; done TCP STREAM TEST to 192.168.7.172 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 126976 126976 512 10.00 3247.39 . TCP STREAM TEST to 192.168.7.172 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 126976 126976 512 10.00 1222.48 . ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Can some one spar me a hint ? What I'm doing wrong ? 
The test run on x86_64 , sles 10 sp1 , OFED-1.2.5 Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From tziporet at dev.mellanox.co.il Sun Oct 28 12:55:36 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 28 Oct 2007 21:55:36 +0200 Subject: [ofa-general] Re: [ewg] Running netperf and iperf over SDP In-Reply-To: <39C75744D164D948A170E9792AF8E7CA4D2BC6@exil.voltaire.com> References: <39C75744D164D948A170E9792AF8E7CA4D2BC6@exil.voltaire.com> Message-ID: <4724E938.3050506@mellanox.co.il> Moshe Kazir wrote: > While running netperf and iperf over SDP I get unstable Performance > results. > > > The test run on x86_64 , sles 10 sp1 , OFED-1.2.5 > > > Which HCA are you using? Tziporet From Michael.Hockey at act.gov.au Sun Oct 28 18:57:38 2007 From: Michael.Hockey at act.gov.au (Hockey, Michael) Date: Mon, 29 Oct 2007 12:57:38 +1100 Subject: [ofa-general] link on site is broken Message-ID: <669D2E96754F0F4DA67C37B80A44315B05737246@mac067.act.gov.au> Do you have a mailing list ? The line on https://wiki.openfabrics.org/tiki-index.php?page=OpenIBFAQ points to a link http://openib.org/mailman/listinfo/openib-general that is broken. Michael Hockey ph 6207 4086 mb 0409 835 041 ----------------------------------------------------------------------- This email, and any attachments, may be confidential and also privileged.
If you are not the intended recipient, please notify the sender and delete all copies of this transmission along with any attachments immediately. You should not copy or use it for any purpose, nor disclose its contents to any other person. ----------------------------------------------------------------------- From jsquyres at cisco.com Sun Oct 28 18:58:44 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Sun, 28 Oct 2007 21:58:44 -0400 Subject: [promoters] Re: [ofa-general] OpenFabrics Developer's Summit:tentative agenda In-Reply-To: <20071027185912.GA12501@cuprite.pathscale.com> References: <20071024004042.GB10244@cuprite.pathscale.com> <4D97B70CF7F72144881F66DFF4BD7A1202E1FD8C@fmsmsx413.amr.corp.intel.com> <20071027185912.GA12501@cuprite.pathscale.com> Message-ID: <158CCAD3-115D-49CD-A741-AC89FBF40637@cisco.com> On Oct 27, 2007, at 2:59 PM, Johann George wrote: > Thanks for all the comments on the MPI sessions. Our primary interest > should be to make the MPI sessions as valuable as possible to the > audience that is attending. My allotment was based on discussion with > the presenters having decided to limit it to those MPIs that were > included as part of OFED due to time constraints. Granted, this was > entirely subjective. I think it's definitely important to have feedback from Intel and HP, particularly since they have not been given a voice in this venue before (I'm not saying that we've been exclusionary -- I'm saying that I'd be very interested to hear what they have to say). Give all the MPI's either 45 or 60 minutes and split the time evenly; I think that will do fine. I think I'm saying essentially the same things that I've said before, so I'll butt out now... BTW: I know that being "the schedule guy" is a serious hassle. Many thanks for doing this, Johann! 
-- Jeff Squyres Cisco Systems From changquing.tang at hp.com Sun Oct 28 20:09:25 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Mon, 29 Oct 2007 03:09:25 +0000 Subject: [ofa-general] message is received but sender report error. In-Reply-To: <47243EBE.7010809@dev.mellanox.co.il> References: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> <000301c8172d$c58b7f80$ff0da8c0@amr.corp.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403029C7010@G3W0634.americas.hpqcorp.net> <47243EBE.7010809@dev.mellanox.co.il> Message-ID: The timeout is 18 (~1sec), and retry is 7 (max). The error only occurs 1% of runs, sometimes I run the same hello_world code in a loop, and caught it after 1500 runs. So I don't think it is a cable issue(but I have not checked the port error counter).
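For reference, the "~1sec" figure quoted above follows from the InfiniBand local ACK timeout encoding, 4.096 usec * 2^timeout. Below is a minimal sketch of that arithmetic in Python. The function names are hypothetical, and the give-up estimate assumes the sender waits exactly one full timeout for the first attempt and for each retry, which real HCAs only approximate.

```python
def ack_timeout_seconds(timeout: int) -> float:
    """Local ACK timeout in seconds for the encoded QP attribute value.

    Uses the encoding 4.096 usec * 2^timeout; a value of 0 means the
    sender waits indefinitely.
    """
    if timeout == 0:
        return float("inf")
    return 4.096e-6 * (1 << timeout)


def rough_give_up_seconds(timeout: int, retry_cnt: int) -> float:
    """Rough time before a retry-exceeded error is reported.

    Assumption for illustration: one full timeout for the first attempt
    plus one for each retry.
    """
    return (retry_cnt + 1) * ack_timeout_seconds(timeout)


per_attempt = ack_timeout_seconds(18)   # the value quoted in the message
total = rough_give_up_seconds(18, 7)    # timeout 18, retry count 7
print(f"{per_attempt:.3f} s per attempt, about {total:.1f} s in total")
```

Under these assumptions a message that is never acknowledged would take on the order of several seconds to fail, which may help when judging whether a longer timeout is likely to matter.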
--CQ > -----Original Message----- > From: Dotan Barak [mailto:dotanb at dev.mellanox.co.il] > Sent: Sunday, October 28, 2007 2:48 AM > To: Tang, Changqing > Cc: Sean Hefty; Roland Dreier; general at lists.openfabrics.org > Subject: Re: [ofa-general] message is received but sender > report error. > > Hi. > > Maybe you should increase your timeout/retry count for your > application? > can you check the ports error counters (using perfquery) > maybe you have bad cables in your subnet .... > > Dotan > > Tang, Changqing wrote: > > This is Verbs layer code, no IB CM is used. > > > > --CQ > > > > > >> -----Original Message----- > >> From: Sean Hefty [mailto:sean.hefty at intel.com] > >> Sent: Thursday, October 25, 2007 12:38 PM > >> To: Tang, Changqing; Roland Dreier > >> Cc: general at lists.openfabrics.org > >> Subject: RE: [ofa-general] message is received but sender report > >> error. > >> > >> > >>> If this is the case, how would we fix the problem ? It's > >>> > >> hard for us to > >> > >>> delay to destroy the QP, because we don't know how long to delay. > >>> The other way is to do something from the driver, or firmware. > >>> > >> Do you disconnect the QPs using the IB CM? > >> > >> - Sean > >> > >> > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > >
From johann.george at qlogic.com Sun Oct 28 23:51:38 2007 From: johann.george at qlogic.com (Johann George) Date: Sun, 28 Oct 2007 23:51:38 -0700 Subject: [ofa-general] OpenFabrics Developer's Summit: feedback requested In-Reply-To: <15ddcffd0710280555q5d12bf86h53d6809f646d1df0@mail.gmail.com> References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> <20071024003159.GA10244@cuprite.pathscale.com> <15ddcffd0710280555q5d12bf86h53d6809f646d1df0@mail.gmail.com> Message-ID: <20071029065138.GA13737@cuprite.pathscale.com> Or and Jeff, Thanks again for your input. I like Or's idea of starting the summit earlier but am concerned as to whether people could attend.
I'm also not sure we could have access to the room earlier although I suspect that will be possible. Regarding parallel tracks, we currently do not have another room to handle that. But we can investigate if this might be possible at reasonable cost. I would like to hear from the attendees since this summit is for you. Perhaps you can vote on the following three questions. I'll tally the votes that come from registered attendees by the end of the week and act on them as best as I can. This might be a good time to remind you that if you have not registered, please do so by following this link: http://www.acteva.com/booking.cfm?bevaid=143964 (1) Are you willing and able to attend if we start at 11:00am on Thursday rather than at 1:00pm? (2) If we are able to, would you prefer to see simultaneous tracks and lengthen some of the sessions? (3) Would you like to see additional MPI sessions crammed into the allotted time? To avoid polluting all the mailing lists, feel free to reply just to me unless you wish to do otherwise. Thanks. Johann From moshek at voltaire.com Mon Oct 29 00:38:40 2007 From: moshek at voltaire.com (Moshe Kazir) Date: Mon, 29 Oct 2007 09:38:40 +0200 Subject: [ofa-general] RE: [ewg] Running netperf and iperf over SDP In-Reply-To: <4724E938.3050506@mellanox.co.il> Message-ID: <39C75744D164D948A170E9792AF8E7CA4D2BC9@exil.voltaire.com> I get the same problem on Arbel DDR and ConnectX mlx4. The switch is DDR, and I repeated the test while no one else was using the switch. As I see it this is not an HCA/hardware problem. It looks like an error in my command/setup or an sdplib problem.
Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] Sent: Sunday, October 28, 2007 9:56 PM To: Moshe Kazir Cc: ewg at lists.openfabrics.org; general at lists.openfabrics.org; Moni Levy; Alon Verner Subject: Re: [ewg] Running netperf and iperf over SDP Moshe Kazir wrote: > While running netperf and iperf over SDP I get unstable Performance > results. > > > The test run on x86_64 , sles 10 sp1 , OFED-1.2.5 > > > Which HCA are you using? Tziporet From eli at mellanox.co.il Mon Oct 29 00:43:00 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 29 Oct 2007 09:43:00 +0200 Subject: [ofa-general] Re: opensm partitions In-Reply-To: <20071028145029.GV6945@sashak.voltaire.com> References: <1193581081.25235.91.camel@mtls03> <20071028145029.GV6945@sashak.voltaire.com> Message-ID: <1193643780.25235.117.camel@mtls03> Here's the file I used (attached). I used this with ofa 1.2.5 so I will try now with ofa 1.3 just to be sure. 
-------------- next part -------------- # opensm configuration file Default=0x7fff,ipoib:ALL=full; From vlad at lists.openfabrics.org Mon Oct 29 02:56:58 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 29 Oct 2007 02:56:58 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071029-0200 daily build status Message-ID: <20071029095659.003E9E6083D@openfabrics.org>
This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.23
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.18-8.el5

Failed:
Build failed on x86_64 with linux-2.6.9-22.ELsmp
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.c:951: warning: assignment discards qualifiers from pointer target type
/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.c: In function 'class_device_create':
/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-22.ELsmp_x86_64_check/kernel_addons/backport/2.6.9_U2/include/linux/device.h:108: sorry, unimplemented: function 'class_device_create' can never be inlined because it uses variable argument lists
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband/core] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-22.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-22.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-22.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-34.ELsmp
Log:
/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.c:951: warning: assignment discards qualifiers from pointer target type
/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.c: In function 'class_device_create':
/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-34.ELsmp_x86_64_check/kernel_addons/backport/2.6.9_U3/include/linux/device.h:108: sorry, unimplemented: function 'class_device_create' can never be inlined because it uses variable argument lists
make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband/core/user_mad.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband/core] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-34.ELsmp_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20071029-0200_linux-2.6.9-34.ELsmp_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.9-34.ELsmp'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
From balaji at mcs.anl.gov Mon Oct 29 03:25:08 2007 From: balaji at mcs.anl.gov (Pavan Balaji) Date: Mon, 29 Oct 2007 05:25:08 -0500 Subject: [ofa-general] CFP: Workshop on High-Performance, Power-Aware Computing (HP-PAC) Message-ID: <4725B504.5000204@mcs.anl.gov> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 2008-sm-logo.jpg Type: image/jpeg Size: 6905 bytes Desc: not available URL:
From dotanb at dev.mellanox.co.il Mon Oct 29 04:47:11 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Mon, 29 Oct 2007 13:47:11 +0200 Subject: [ofa-general] message is received but sender report error.
In-Reply-To: References: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> <000301c8172d$c58b7f80$ff0da8c0@amr.corp.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403029C7010@G3W0634.americas.hpqcorp.net> <47243EBE.7010809@dev.mellanox.co.il> Message-ID: <4725C83F.8030804@dev.mellanox.co.il> If you are not connecting the QPs using CM, maybe you have a sync problem? one side (the sender) is in RTS and the other side isn't in RTR (or a sync problem when closing the connection) Dotan Tang, Changqing wrote: > The timeout is 18 (~1sec), and retry is 7 (max). > > The error only occurs 1% of runs, sometimes I run the same hello_world code in a loop, and caught it after 1500 runs. So I don't think it is a cable issue(but I have not checked the port error counter). > > --CQ > > >> -----Original Message----- >> From: Dotan Barak [mailto:dotanb at dev.mellanox.co.il] >> Sent: Sunday, October 28, 2007 2:48 AM >> To: Tang, Changqing >> Cc: Sean Hefty; Roland Dreier; general at lists.openfabrics.org >> Subject: Re: [ofa-general] message is received but sender >> report error. >> >> Hi. >> >> Maybe you should increase your timeout/retry count for your >> application? >> can you check the ports error counters (using perfquery) >> maybe you have bad cables in your subnet .... >> >> Dotan >> >> Tang, Changqing wrote: >> >>> This is Verbs layer code, no IB CM is used. >>> >>> --CQ >>> >>> >>> >>>> -----Original Message----- >>>> From: Sean Hefty [mailto:sean.hefty at intel.com] >>>> Sent: Thursday, October 25, 2007 12:38 PM >>>> To: Tang, Changqing; Roland Dreier >>>> Cc: general at lists.openfabrics.org >>>> Subject: RE: [ofa-general] message is received but sender report >>>> error. >>>> >>>> >>>> >>>>> If this is the case, how would we fix the problem ? It's >>>>> >>>>> >>>> hard for us to >>>> >>>> >>>>> delay to destroy the QP, because we don't know how long to delay. 
>>>>> The other way is to do something from the driver, or firmware. >>>>> >>>>> >>>> Do you disconnect the QPs using the IB CM? >>>> >>>> - Sean >>>> >>>> >>>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >>> >>> >>> >> > > From eli at dev.mellanox.co.il Mon Oct 29 04:49:35 2007 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 29 Oct 2007 13:49:35 +0200 Subject: [ofa-general] Re: opensm partitions In-Reply-To: <1193643780.25235.117.camel@mtls03> References: <1193581081.25235.91.camel@mtls03> <20071028145029.GV6945@sashak.voltaire.com> <1193643780.25235.117.camel@mtls03> Message-ID: <1193658575.25235.152.camel@mtls03> No error messages with ofa 1.3 but I could still not verify how it works. If I have any other problem I'll let you know. Thanks. On Mon, 2007-10-29 at 09:43 +0200, Eli Cohen wrote: > Here's the file I used (attached). I used this with ofa 1.2.5 so I will > try now with ofa 1.3 just to be sure. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From keshetti85-student at yahoo.co.in Mon Oct 29 05:00:27 2007 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Mon, 29 Oct 2007 17:30:27 +0530 Subject: [ofa-general] Does openSM ucast routing table generator utility exist .. ? Message-ID: <829ded920710290500p31de6c1bp6b219ddab54b41a3@mail.gmail.com> Hi all, I could see that openSM now supports file based unicast forwarding table loading. 
My question is, has anyone ever written a utility to generate such a file (unicast forwarding table file), with the ability to load non-min-hop paths, which (I think) is the actual intention behind allowing file-based unicast forwarding table loading? regards, Mahesh -------------- next part -------------- An HTML attachment was scrubbed... URL:
From fenkes at de.ibm.com Mon Oct 29 05:22:19 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 29 Oct 2007 14:22:19 +0200 Subject: [ofa-general] [PATCH] ofed_scripts: Add location code fix for older ppc64 kernels Message-ID: <200710291322.20235.fenkes@de.ibm.com>
Kernels prior to 2.6.24 have problems with multiple devices sharing the same location code on ppc64 systems -- only one of these devices would be usable by ibmebus. This will be a problem on systems with multiple eHCA chips on a single hardware location.

For older kernels, this problem can be circumvented by, prior to loading the eHCA driver, changing the location codes of the offending devices so that they're not the same anymore.

This patch adds an openibd patch file which, if applied, will make openibd change the location codes of eHCA adapters with the same location code. ofed_patch.sh is changed so that it applies that patch if, and only if, it is run on a ppc64 architecture and the kernel version implies that the kernel has the ibmebus bug.

Signed-off-by: Joachim Fenkes
---
 ofed_scripts/ofed_patch.sh          |   49 +++++++++++++++++++++++++++++++++++
 ofed_scripts/openibd-loc_code.patch |   43 ++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+), 0 deletions(-)
 create mode 100644 ofed_scripts/openibd-loc_code.patch

diff --git a/ofed_scripts/ofed_patch.sh b/ofed_scripts/ofed_patch.sh
index e1f039d..b254000 100755
--- a/ofed_scripts/ofed_patch.sh
+++ b/ofed_scripts/ofed_patch.sh
@@ -200,6 +200,44 @@ get_backport_dir()
 }
 
 
+need_openibd_loc_code_patch()
+{
+        local sub
+
+        if [ "$ARCH" != "ppc64" ]; then
+                return 1;
+        fi
+
+        case $KVERSION in
+                2.6.9-*.EL*)
+                sub=$(echo $KVERSION | cut -d"-" -f2 | cut -d"." -f1)
+                if [ $sub -lt 62 ]; then
+                        return 0;
+                fi
+                ;;
+                2.6.16.*-*-*)
+                sub=$(echo $KVERSION | cut -d"." -f4 | cut -d"-" -f1)
+                if [ $sub -lt 53 ]; then
+                        return 0;
+                fi
+                ;;
+                2.6.18-*.el5*)
+                sub=$(echo $KVERSION | cut -d"-" -f2 | cut -d"." -f1)
+                if [ $sub -lt 52 ]; then
+                        return 0;
+                fi
+                ;;
+                2.6.*)
+                sub=$(echo $KVERSION | cut -d"." -f3 | cut -d"-" -f1 | tr -d [:alpha:][:punct:])
+                if [ $sub -lt 24 ]; then
+                        return 0;
+                fi
+                ;;
+        esac
+
+        return 1;
+}
+
 # Apply patch
 apply_patch()
 {
@@ -253,6 +291,13 @@ apply_backport_patches()
         fi
 }
 
+apply_openibd_patches()
+{
+        if need_openibd_loc_code_patch; then
+                apply_patch ${CWD}/ofed_scripts/openibd-loc_code.patch
+        fi
+}
+
 # Apply patches
 patches_handle()
 {
@@ -288,6 +333,9 @@ EOF
                 fi
                 BACKPORT_INCLUDES='-I${CWD}/kernel_addons/backport/'${BACKPORT_DIR}/include/
         fi
+
+        # Apply openibd patches
+        apply_openibd_patches $KVERSION
 
         #FIXME: why are these applied here? Move them to before backports?
@@ -399,6 +447,7 @@ main()
 
 #Set default values
 KVERSION=${KVERSION:-$(uname -r)}
+ARCH=${ARCH:-$(uname -m)}
 WITH_QUILT=${WITH_QUILT:-"yes"}
 WITH_PATCH=${WITH_PATCH:-"yes"}
 WITH_KERNEL_FIXES=${WITH_KERNEL_FIXES:-"yes"}
diff --git a/ofed_scripts/openibd-loc_code.patch b/ofed_scripts/openibd-loc_code.patch
new file mode 100644
index 0000000..43d70b4
--- /dev/null
+++ b/ofed_scripts/openibd-loc_code.patch
@@ -0,0 +1,43 @@
+--- a/ofed_scripts/openibd 2007-10-25 08:01:51.000000000 -0500
++++ b/ofed_scripts/openibd 2007-10-27 09:58:56.000000000 -0500
+@@ -538,6 +538,32 @@ if test -x /sbin/lspci && test -x /sbin/
+     fi
+ }
+ 
++fix_location_codes()
++{
++    # ppc64 only:
++    # Fix duplicate location codes on kernels where ibmebus can't handle them
++    if [ -d /proc/device-tree -a -f /proc/ppc64/ofdt ]; then
++        local i=1 phandle lcode len
++        # output all duplicate location codes and their devices
++        for attr in $(find /proc/device-tree -wholename "*lhca\@*/ibm,loc-code"); do
++            echo -e $(dirname $attr)"\t"$(cat $attr)
++        done | sort -k2 | uniq -f1 --all-repeated=separate | cut -f1 | while read dev; do
++            if [ -n "$dev" ]; then
++                # append an instance counter to the location code
++                phandle=$(hexdump -e '8 "%u"' $dev/ibm,phandle)
++                lcode=$(cat $dev/ibm,loc-code)-I$i
++                len=$(echo -n "$lcode" | wc -c)
++                # echo "$dev -> $lcode"
++                echo -n "update_property $phandle ibm,loc-code $len $lcode" > /proc/ppc64/ofdt
++                i=$(($i + 1))
++            else
++                # empty line means new group -- reset i
++                i=1
++            fi
++        done
++    fi
++}
++
+ rotate_log()
+ {
+     local log=$1
+@@ -694,6 +720,7 @@ start()
+ 
+     # Load eHCA driver
+     if [ "X${EHCA_LOAD}" == "Xyes" ]; then
++        fix_location_codes
+         /sbin/modprobe ib_ehca > /dev/null 2>&1
+         my_rc=$?
+         if [ $my_rc -ne 0 ]; then
-- 
1.5.2

From ogerlitz at voltaire.com Mon Oct 29 06:29:07 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 29 Oct 2007 15:29:07 +0200 Subject: [ofa-general] to be discussed at the developer conference Message-ID: <4725E023.7070409@voltaire.com>
(Assuming that the schedule allocates enough slots for Linux IB developers to discuss whatever they decide they need to) I'd like to check with people what we want on the agenda of these slots. My thinking for issues to discuss was:

1) The long and endless threads related to the SA caching thing need to be there. Sean - I saw that you are preparing a session, correct? Will you be presenting a few possible designs?

2) As for IPoIB stateless offload - Eli and Liran are not planned to be there. Dror - do you intend to present the actual ipoib / core / drivers related design and implementation? Also, personally, I felt that the 1-2 slides you delivered at Sonoma were way below what would let one understand exactly which features the HW supports, and I don't want to be referred to under-NDA docs; let's just have you provide a clear description regarding large-send and checksum offloading. Same for the HW interrupt mitigation: it would be nice if you explained the problem and the solution, and spared a few words on how this goes with NAPI. One more thing is the LRO stuff - it's a pure SW optimization; if you think this should be in the ipoib code, some justification materials would be helpful.

3) QoS - Sean, Dror, generally speaking, what were you thinking to discuss?

4) IPoIB connected mode UC support - Roland, can work on this start once the no-SRQ design/code is agreed and committed to a branch at your git? In previous discussions with Michael over this list he insisted that some "keep alive" probing mechanism must be implemented, since the arp probes sent by the kernel neighboring subsystem are not enough to cover all cases, and he suggested using IB CM LAP messages etc. for that. What are the open issues you can think of here? Would you be able to present this?

5) IB 4K MTU - in IPoIB and elsewhere in the IB stack. Same here: Roland, do you think a short session is needed, or do your comments, e.g. http://lkml.org/lkml/2007/9/13/308 & http://lkml.org/lkml/2007/9/14/173, cover everything that needs to be done? Is there something to change at layers below IPoIB? What about SM implementations - does anyone see possible required changes there?

6) The netdev network batching RFCs - Krishna, Shirley, can someone from IBM prepare a session to educate us on the matter and the status?

Any more ideas?

Or.
From Arkady.Kanevsky at netapp.com Mon Oct 29 06:55:35 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 29 Oct 2007 09:55:35 -0400 Subject: [ofa-general] RE: [promoters] OpenFabrics Developer's Summit: feedback requested In-Reply-To: <20071029065138.GA13737@cuprite.pathscale.com> References: <20071023200329.GA6368@cuprite.pathscale.com><15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com><20071024003159.GA10244@cuprite.pathscale.com><15ddcffd0710280555q5d12bf86h53d6809f646d1df0@mail.gmail.com> <20071029065138.GA13737@cuprite.pathscale.com> Message-ID: Just want to raise my voice against parallel sessions. We have done it a couple of times at Sonoma. No matter how you divide talks between tracks, there is still a huge overlap. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Johann George [mailto:johann.george at qlogic.com] > Sent: Monday, October 29, 2007 2:52 AM > To: Or Gerlitz > Cc: promoters at lists.openfabrics.org; > ewg at lists.openfabrics.org; general at lists.openfabrics.org > Subject: [promoters] OpenFabrics Developer's Summit: feedback > requested > > Or and Jeff, > > Thanks again for your input. I like Or's idea of starting > the summit earlier but am concerned as to whether people > could attend. I'm also not sure we could have access to the > room earlier although I suspect that will be possible. > > Regarding parallel tracks, we currently do not have another > room to handle that. But we can investigate if this might be > possible at reasonable cost. > > I would like to hear from the attendees since this summit is > for you. Perhaps you can vote on the following three > questions. I'll tally the votes that come from registered > attendees by the end of the week and act on them as best as I > can. This might be a good time to remind you that if you > have not registered, please do so by following this link: > > http://www.acteva.com/booking.cfm?bevaid=143964 > > (1) Are you willing and able to attend if we start at > 11:00am on Thursday rather than at 1:00pm? > > (2) If we are able to, would you prefer to see simultaneous > tracks and lengthen some of the sessions. > > (3) Would you like to see additional MPI sessions crammed > into the allotted time? > > To avoid polluting all the mailing lists, feel free to reply > just to me unless you wish to do otherwise. > > Thanks. 
> > Johann > _______________________________________________ > promoters mailing list > promoters at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/promoters > From changquing.tang at hp.com Mon Oct 29 07:18:14 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Mon, 29 Oct 2007 14:18:14 +0000 Subject: [ofa-general] message is received but sender report error. In-Reply-To: <4725C83F.8030804@dev.mellanox.co.il> References: <349DCDA352EACF42A0C49FA6DCEA8403029943D4@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA8403029C6FB7@G3W0634.americas.hpqcorp.net> <000301c8172d$c58b7f80$ff0da8c0@amr.corp.intel.com> <349DCDA352EACF42A0C49FA6DCEA8403029C7010@G3W0634.americas.hpqcorp.net> <47243EBE.7010809@dev.mellanox.co.il> <4725C83F.8030804@dev.mellanox.co.il> Message-ID: Yes, I think it is a sync problem when closing the connection. The sender sent a zero-byte message with immediate data. The receiver received the message correctly and destroyed the corresponding QP immediately. The sender got the completion with status=12. If I delay destroying the QP, the code works fine. --CQ > -----Original Message----- > From: Dotan Barak [mailto:dotanb at dev.mellanox.co.il] > Sent: Monday, October 29, 2007 6:47 AM > To: Tang, Changqing > Cc: Sean Hefty; Roland Dreier; general at lists.openfabrics.org > Subject: Re: [ofa-general] message is received but sender > report error. > > If you are not connecting the QPs using CM, maybe you have a > sync problem? > one side (the sender) is in RTS and the other side isn't in > RTR (or a sync problem when closing the connection) > > Dotan > > Tang, Changqing wrote: > > The timeout is 18 (~1sec), and retry is 7 (max). > > > > The error only occurs 1% of runs, sometimes I run the same > hello_world code in a loop, and caught it after 1500 runs. So > I don't think it is a cable issue(but I have not checked the > port error counter).
> > > > --CQ > > > > > >> -----Original Message----- > >> From: Dotan Barak [mailto:dotanb at dev.mellanox.co.il] > >> Sent: Sunday, October 28, 2007 2:48 AM > >> To: Tang, Changqing > >> Cc: Sean Hefty; Roland Dreier; general at lists.openfabrics.org > >> Subject: Re: [ofa-general] message is received but sender report > >> error. > >> > >> Hi. > >> > >> Maybe you should increase your timeout/retry count for your > >> application? > >> can you check the ports error counters (using perfquery) maybe you > >> have bad cables in your subnet .... > >> > >> Dotan > >> > >> Tang, Changqing wrote: > >> > >>> This is Verbs layer code, no IB CM is used. > >>> > >>> --CQ > >>> > >>> > >>> > >>>> -----Original Message----- > >>>> From: Sean Hefty [mailto:sean.hefty at intel.com] > >>>> Sent: Thursday, October 25, 2007 12:38 PM > >>>> To: Tang, Changqing; Roland Dreier > >>>> Cc: general at lists.openfabrics.org > >>>> Subject: RE: [ofa-general] message is received but sender report > >>>> error. > >>>> > >>>> > >>>> > >>>>> If this is the case, how would we fix the problem ? It's > >>>>> > >>>>> > >>>> hard for us to > >>>> > >>>> > >>>>> delay to destroy the QP, because we don't know how long > to delay. > >>>>> The other way is to do something from the driver, or firmware. > >>>>> > >>>>> > >>>> Do you disconnect the QPs using the IB CM? 
> >>>> > >>>> - Sean > >>>> > >>>> > >>>> > >>> _______________________________________________ > >>> general mailing list > >>> general at lists.openfabrics.org > >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>> > >>> To unsubscribe, please visit > >>> http://openib.org/mailman/listinfo/openib-general > >>> > >>> > >>> > >> > > > > > > From vlad at dev.mellanox.co.il Mon Oct 29 08:01:45 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 29 Oct 2007 17:01:45 +0200 Subject: [ofa-general] ofed_kernel merged with 2.6.24-rc1 patches update required Message-ID: <4725F5D9.6050301@dev.mellanox.co.il> Hello, There is a new branch "ofed_kernel_2_6_24_rc1" under git://git.openfabrics.org/ofed_1_3/linux-2.6.git All patches from kernel_patches/fixes that were applied in 2.6.24-rc1 were removed from kernel_patches/fixes directory. The "problematic" patches from kernel_patches/fixes were moved to the kernel_patches/attic directory. Backport patches and fixes should be updated according to the new kernel tree. The easy way to do so is using "ofed_scripts/ofed_makedist.sh" utility which creates tgz file for every supported kernel with all relevant patches applied. We want to move to the new branch on this Wednesday (31 Oct 2007) Please send me updated backport patches and fixes by tomorrow. 
Regards, Vladimir From johann.george at qlogic.com Mon Oct 29 08:07:47 2007 From: johann.george at qlogic.com (Johann George) Date: Mon, 29 Oct 2007 08:07:47 -0700 Subject: [ofa-general] Re: OpenFabrics Developer's Summit: feedback requested In-Reply-To: <20071029065138.GA13737@cuprite.pathscale.com> References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> <20071024003159.GA10244@cuprite.pathscale.com> <15ddcffd0710280555q5d12bf86h53d6809f646d1df0@mail.gmail.com> <20071029065138.GA13737@cuprite.pathscale.com> Message-ID: <20071029150747.GA20952@cuprite.pathscale.com> Someone proposed a fourth option worth considering which is staying later on Friday. Here are the alternatives we are looking for feedback on: (1) Are you willing and able to attend if we start at 11:00am on Thursday rather than at 1:00pm? (2) If we are able to, would you prefer to see simultaneous tracks and lengthen some of the sessions? (3) Would you like to see additional MPI sessions crammed into the allotted time? (4) Are you willing and able to stay if we run later on Friday? How long? Thanks. Johann From swise at opengridcomputing.com Mon Oct 29 08:10:14 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 29 Oct 2007 10:10:14 -0500 Subject: [ofa-general] iw_cxgb3 genalloc memory allocator dependency Message-ID: <4725F7D6.1080401@opengridcomputing.com> The iw_cxgb3 module depends on the linux kernel genalloc service. This service gets compiled into the kernel _only_ if another subsystem has a config dependency on the genalloc module (CONFIG_GENERIC_ALLOCATOR). In addition, there are only two users of this service: iw_cxgb3 and some IA64 subsystem. So on a kernel.org kernel that has iw_cxgb3, genalloc gets built into the kernel when you enable the iw_cxgb3 module. But on non-IA64 platforms that do not have iw_cxgb3 configured in, the genalloc code is not pulled into the kernel.
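One way to see which case a given kernel falls into is to look for CONFIG_GENERIC_ALLOCATOR in its build configuration. The following is only a minimal sketch, assuming the kernel's .config file is readable; the default path under /lib/modules and the helper name has_genalloc are illustrative assumptions and not part of OFED.

```shell
#!/bin/sh
# has_genalloc is a hypothetical helper, not an OFED script function.
# It succeeds if the given kernel config file enables the genalloc service.
has_genalloc() {
    grep -q '^CONFIG_GENERIC_ALLOCATOR=y' "$1"
}

# Assumed default location of the running kernel's config; adjust as needed.
config="${1:-/lib/modules/$(uname -r)/build/.config}"

if [ -r "$config" ] && has_genalloc "$config"; then
    echo "genalloc is built into this kernel"
else
    echo "genalloc not found; a backport would be needed"
fi
```

A check like this only reports what the kernel was configured with; deciding whether to pull in a backport is a separate step.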
The side affect of this is that if one tries to compile OFED on a kernel.org kernel that doesn't have iw_cxgb3 configured, the genalloc server is not available and ofed doesn't compile. Now, ofed has a backport of genalloc to support older kernels that do not even have the genalloc service. But we don't pull in that backport for kernels that do have genalloc. Thus the problem... I'm looking for suggestions on how and if we should do something about this? Here are some ideas: 1) always build in our own genalloc service as a backport. This solves the problem, but duplicates the code if it is indeed built into the kernel. 2) detect and ofed config time if we need the genalloc service or not. Then pull in the backport as needed. This one is nice in that it won't replicate the gencalloc code when not needed, but at the expense of adding complexity to the configure script for ofed. I'm not really sure how to do it at all. But maybe vlad knows how? Thoughts? BTW: bug 767 opened to track this. Thanks, Steve. From tziporet at mellanox.co.il Mon Oct 29 08:12:59 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 29 Oct 2007 17:12:59 +0200 Subject: [ofa-general] OFED meeting agenda (Oct 29) about beta readiness: Message-ID: <6C2C79E72C305246B504CBA17B5500C90282E0D1@mtlexch01.mtl.com> This is the agenda for the OFED meeting about beta readiness: 1. Review beta tasks status: 1. Fix compilation problems on PPC with 32 bits - Vlad & Oren (Mellanox) - on work 2. Rebase kernel code on 2.6.24 rc1 (depending it's availability) - on work (please read mail from Vlad with instructions) 3. SPEC files should be part of each user space package - each owner should take the spec file 4. Multiple uDAPL libs (1.0 & 2.0) - Vlad and Arlin (Intel) 5. nes - need to update some backport patches Glenn (NetEffect) Any new task ??? 
Done tasks:
* Add qperf test from Qlogic - Johann (Qlogic)
* Support RHEL 5 up1 - Woody & Vlad
* Apply patches that fix warning of backport patches - Vlad (Mellanox) (one patch was not applied since we got no answer regarding it)
* New MVAPICH package - Pasha & DK (OSU)
* Complete RDS work - Vlad (Mellanox)
* Integrate all SDP features - Jim (Mellanox)

2. Open bugs - review most critical bugs:

bug_id  bug_severity  op_sys   assigned_to                     short_short_desc
753     blocker       SLES 10  eitan at mellanox.co.il         ibutils src.rpm compile error on SLES10 SP1 js21 PPC64
744     blocker       RHEL 4   eli at mellanox.co.il           OFED 1.3 Cheetah IPoIB netperf UDP_STREAM fails, causes IPoIB to stop working
757     blocker       Other    eli at mellanox.co.il           ipoib cm - traffic does not work over partioning interfaces when the mode is connected.
756     critical      RHEL 5   orenk at dev.mellanox.co.il     OFED-1.3-20071024-0645 ibutils won't compile on RHEL4/RHEL5
750     critical      SLES 10  raisch at de.ibm.com            Problem with modprobe ib_ehca with older kernel versions
746     critical      SLES 10  vlad at mellanox.co.il          Installation of 32-bit libibverbs failed
758     critical      SLES 10  vlad at mellanox.co.il          IPOIB_CM is not compiled via install.pl
760     major         All      eli at mellanox.co.il           UDP performance on Rx is lower than Tx
761     major         Other    eli at mellanox.co.il           Poor and jittery UDP performance at small messages
508     major         RHEL 4   eli at mellanox.co.il           IPoIB CM multicast is hogging interrupts
751     major         RHEL 5   pasha at mellanox.co.il         MVAPICH won't build mpif77 and mpif90 with PGI 7.0
736     major         Other    rolandd at cisco.com            IBV_WC_RETRY_EXC_ERR errors with local rdma_reads
730     major         RHEL 4   vlad at mellanox.co.il          OFED 1.3 MPI won't compile with PGI 6.2.5 on RHEL4 x86_64
740     major         All      vlad at mellanox.co.il          OFED 1.3 install.pl is missing functionality (OFA_KERNEL_PARAMS and K_VER) that install.sh had
747     major         All      vlad at mellanox.co.il          SRPHA_ENABLE missing from OFED 1.3 alpha2 openib.conf
733     normal        All      jackm at dev.mellanox.co.il     create a CQ with a number which is power of 2 will result waste of memory
762     normal        All      jackm at dev.mellanox.co.il     create an XRC QP with NULL in the xrc_domain causes kernel oops
763     normal        All      jackm at dev.mellanox.co.il     XRC domain can be closed event QP/SRQ are using it
689     normal        Other    jsquyres at cisco.com           When one install ofed (including the mpi-selector) and choosing prefix that end with "/", Install fails.
755     normal        Other    jsquyres at cisco.com           openMPI src.rpm compile error on SLES10 SP1 JS21 PPC64
692     normal        Other    monis at voltaire.com           Ping over IPoIB interface stops working when running openibd restart with bonding enabled.
709     normal        All      orenk at dev.mellanox.co.il     ibutils binaries have wrong RPATH
765     normal        Other    orenk at dev.mellanox.co.il     ofed-1.3 and ofed-1.2.5 can't burn mlx4 HCA's with old FWR (2.0.150)
754     normal        SLES 10  perkinjo at cse.ohio-state.edu  mvapich2 src.rpm compile error on js21 PPC64 SLES10 SP1
752     normal        Other    sashak at voltaire.com          opensm daemon failed to start
721     normal        SLES 10  vlad at mellanox.co.il          OFED 1.3 installation failed: Failed to build opensm RPM
723     normal        RHEL 4   vlad at mellanox.co.il          netperf over rds failed - rds_send: data send error: Invalid argument
724     normal        SLES 10  vlad at mellanox.co.il          Oops during rds module unload
739     normal        Other    vlad at mellanox.co.il          install.pl doesn't want to die
742     normal        Other    vlad at mellanox.co.il          mpi-selector not working in 1.3-alpha2
748     normal        Other    vlad at mellanox.co.il          install failed
764     normal        SLES 10  vlad at mellanox.co.il          Installation bug,
766     normal        SLES 10  vlad at mellanox.co.il          Installation bug,
690     minor         All      vlad at mellanox.co.il          Attempt is made to install mvapich2 even when user says don't install it

-------------- next part -------------- An HTML attachment was scrubbed... URL: From changquing.tang at hp.com Mon Oct 29 09:27:34 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Mon, 29 Oct 2007 16:27:34 +0000 Subject: [ofa-general] Any doc update on the fork() support ? 
Message-ID: Here is a statement from OFED 1.3 alpha 2 release notes, it has not been changed for a few releases. is there any update ? Thanks. 3. Fork support from kernel 2.6.12 and above is available provided that applications do not use threads. The fork() is supported as long as the parent process does not run before the child exits or calls exec(). The former can be achieved by calling wait(childpid), and the latter can be achieved by application specific means. The Posix system() call is supported. --CQ From mshefty at ichips.intel.com Mon Oct 29 09:34:04 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Oct 2007 09:34:04 -0700 Subject: [ofa-general] to be discussed at the developer conference In-Reply-To: <4725E023.7070409@voltaire.com> References: <4725E023.7070409@voltaire.com> Message-ID: <47260B7C.8070203@ichips.intel.com> > 1) the long time and endless threads related to the SA caching thing > need to be there. Sean - I saw that you prepare a session, correct? will > you presenting few possible designs? I was asked to prepare a session and will mention some of the general scalability issues that we've seen with Intel MPI. > 3) QoS - Sean, Dror, generally speaking, what where you thinking to > discuss? We plan on discussing what was added to the stack and opensm. Keep in mind that both of these are only 20 minutes. - Sean From rick.jones2 at hp.com Mon Oct 29 10:05:30 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 29 Oct 2007 10:05:30 -0700 Subject: [ofa-general] Running netperf and iperf over SDP In-Reply-To: <39C75744D164D948A170E9792AF8E7CA4D2BC6@exil.voltaire.com> References: <39C75744D164D948A170E9792AF8E7CA4D2BC6@exil.voltaire.com> Message-ID: <472612DA.8020801@hp.com> Moshe Kazir wrote: > While running netperf and iperf over SDP I get unstable Performance > results. > > On iperf I get more then 25 % difference between minimum and maximum. > > On netperf I get the following amazing result. 
->
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> # while LD_PRELOAD=/usr/lib64/libsdp.so ./netperf -H 192.168.7.172 -- -m 512 -M 1047 ; do echo . ; done
> TCP STREAM TEST to 192.168.7.172
> Recv   Send   Send
> Socket Socket Message Elapsed
> Size   Size   Size    Time    Throughput
> bytes  bytes  bytes   secs.   10^6bits/sec
>
> 126976 126976 512     10.00   3247.39
> .
> TCP STREAM TEST to 192.168.7.172
> Recv   Send   Send
> Socket Socket Message Elapsed
> Size   Size   Size    Time    Throughput
> bytes  bytes  bytes   secs.   10^6bits/sec
>
> 126976 126976 512     10.00   1222.48
> .
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Can someone spare me a hint?
> What am I doing wrong?

I would start by telling netperf you want CPU utilization measured. That way we can see if there is a correlation between the CPU utilization and the throughput.

Also, if this were a "pure" TCP test, you would have a race between the Nagle algorithm and the speed at which ACKs come back from the receiver, affecting the distribution of TCP segment sizes being transmitted, since your send size is so much smaller than the MTU of the link. IIRC for IPoIB in 1.2.mumble the MTU is 65520 or something like that.

You might consider taking snapshots of the link-level statistics (does ethtool -S work for an IB interface?) from before and after each netperf test and run them through beforeafter: ftp://ftp.cup.hp.com/dist/networking/tools/

You might also experiment with setting TCP_NODELAY - although since this is LD_PRELOADED SDP I'm not sure what that really means/does.

Any particular reason you are telling the netperf side to post 1047 byte receives when you are making 512 byte calls to send()?

> > The test run on x86_64 , sles 10 sp1 , OFED-1.2.5

I'm guessing you have multiple cores - how do interrupts from the HCA (?) get distributed?
What happens when you use the -T option of netperf to vary the CPU binding of either netperf or netserver:

netperf -T N,M   #bind netperf to CPU N, netserver to CPU M
netperf -T N,    #just bind netperf to CPU N, netserver unbound
netperf -T ,M    #netperf unbound, netserver bound to CPU M

relative to where the interrupts from the HCA go? Finally, well for now :), there are "direct" SDP tests in netperf. Make sure you are on say 2.4.4: http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/ or ftp://ftp.netperf.org/netperf and add --enable-sdp to the ./configure command. happy benchmarking, rick jones From swise at opengridcomputing.com Mon Oct 29 10:24:58 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 29 Oct 2007 12:24:58 -0500 Subject: [ofa-general] Re: ofed_kernel merged with 2.6.24-rc1 patches update required In-Reply-To: <4725F5D9.6050301@dev.mellanox.co.il> References: <4725F5D9.6050301@dev.mellanox.co.il> Message-ID: <4726176A.2060505@opengridcomputing.com> Vladimir Sokolovsky wrote: > Hello, > There is a new branch "ofed_kernel_2_6_24_rc1" under > git://git.openfabrics.org/ofed_1_3/linux-2.6.git > > All patches from kernel_patches/fixes that were applied in 2.6.24-rc1 > were removed from kernel_patches/fixes directory. > The "problematic" patches from kernel_patches/fixes were moved to the > kernel_patches/attic directory. > > Backport patches and fixes should be updated according to the new kernel > tree. > The easy way to do so is using "ofed_scripts/ofed_makedist.sh" utility > which creates tgz file for every supported kernel with all relevant > patches applied. > Vlad, have you done any builds against the various kernels? What exactly should I, as cxgb3 owner, do with this branch other than verify the patches are correct? Steve. From meier3 at llnl.gov Mon Oct 29 10:48:47 2007 From: meier3 at llnl.gov (Timothy A. Meier) Date: Mon, 29 Oct 2007 10:48:47 -0700 Subject: [ofa-general] Re: [PATCH] opensm & osm_console: modified console framework to support multiple connections In-Reply-To: <20071028010226.GN22317@sashak.voltaire.com> References: <4713FD51.4010506@llnl.gov> <20071028010226.GN22317@sashak.voltaire.com> Message-ID: <47261CFF.1060206@llnl.gov> Hi Sasha, I apologize for the style and submission issues - still adjusting... 
Some of the design issues/objectives are discussed in-line below. Sasha Khapyorsky wrote: > Hi Tim, > > Sorry about very long delay with reviewing this. > > On 16:52 Mon 15 Oct , Timothy A. Meier wrote: > >> This patch is setting up for adding Remote/Secure Console capability using >> SSL/TSL (we need at LLNL). >> > > Thanks for doing this - it is great thing to secure OpenSM console. > > >> Its a big patch because I changed to an abstract server model, instead of >> the original >> single connection and synchronous model. There is no significant functional >> difference (yet). >> > > It is hard to understand how such abstraction model serves us without > seeing the rest of SSL/TSL code. Probably it is better idea to issue > whole patch series? Anyway some initial comments are below. > > I understand. This patch is fundamentally about changing the architecture to support new features and capabilities (without actually providing anything new). Adding SSL/TSL was driving the requirements for most of these changes, but once changed, has wider applications. I wanted to keep it abstract. The first "new" feature will be SSL/TSL for a secure remote console. >> ======== >> From cb69c1e2c8ea526bcb1e81d079bfa787eda09ba8 Mon Sep 17 00:00:00 2001 >> From: Tim Meier >> Date: Mon, 15 Oct 2007 16:08:10 -0700 >> Subject: [PATCH] opensm & osm_console: modified console framework to support >> multiple connections >> >> Provided an abstract console service that supports the current connection >> types >> (local, loopback, socket) as well as supporting the addition of a secure >> connection type. >> >> * A server implementation supports multiple connections, and reduces the >> posibility of an inadvertant denial of service (currently vulnerable). >> >> * An IO abstraction (CIO) is employed to facilitate the future >> implementation >> of a secure socket (SSL / TSL) connection, while maintaining backward >> compatibility. 
>> > > Would be nice to not mix two things in one patch - "one patch per > thought" makes it easier to review and submit. > > I was troubled with breaking this into pieces. The patch is really about providing an abstract OSM Server that supports local/remote connections. I can break them up, but in my mind, they were tightly coupled. >> Signed-off-by: Tim Meier >> --- >> opensm/include/opensm/osm_console.h | 35 +- >> opensm/opensm/main.c | 77 ++- >> opensm/opensm/osm_console.c | 1500 >> +++++++++++++++++++++++++---------- >> 3 files changed, 1177 insertions(+), 435 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_console.h >> b/opensm/include/opensm/osm_console.h >> index 33e41e7..75111a4 100644 >> --- a/opensm/include/opensm/osm_console.h >> +++ b/opensm/include/opensm/osm_console.h >> @@ -49,6 +49,14 @@ >> #define OSM_DEFAULT_CONSOLE OSM_DISABLE_CONSOLE >> #define OSM_DEFAULT_CONSOLE_PORT 10000 >> #define OSM_DAEMON_NAME "opensm" >> +#define OSM_QUIT_CMD "quit" >> +#define OSM_LOOP_PERIOD_SEC 2 >> + >> +#define CIO_BUFSIZE 1024 >> +#define CIO_INFO_SIZE 128 >> +#define CIO_NOTE_SIZE 64 >> +#define CIO_MAX_CONNECTS 5 >> +#define CIO_CONNECTION_PORT 10000 >> #ifdef __cplusplus >> # define BEGIN_C_DECLS extern "C" { >> @@ -59,10 +67,29 @@ >> #endif /* __cplusplus */ >> BEGIN_C_DECLS >> -void osm_console_init(osm_subn_opt_t * opt, osm_opensm_t * p_osm); >> -void osm_console(osm_opensm_t * p_osm); >> -void osm_console_prompt(FILE * out); >> -void osm_console_close_socket(osm_opensm_t * p_osm); >> + >> +/* TODO move when fully implemented */ >> +typedef struct _CIO_t >> +{ >> + int fd; // file descriptor (socket) >> + FILE *out; >> + FILE *err; >> + FILE *in; >> + struct pollfd *pfd; >> +} CIO_t; >> + >> +int osm_console_server(osm_subn_opt_t *p_opt, osm_opensm_t *p_osm); >> +void osm_console_server_init(osm_subn_opt_t *opt, osm_opensm_t *p_osm); >> +void osm_console_server_destroy(osm_opensm_t *p_osm); >> +int is_console_enabled(osm_subn_opt_t *p_opt); >> + 
>> +/* TODO move along with other IO abstraction code */ >> +int cio_printf( CIO_t *cio, const char *format, ...); >> +int cio_flush( CIO_t *cio); >> +int cio_getline( char **lineptr, size_t *n, CIO_t *cio); >> +int cio_open( CIO_t *cio); >> +int cio_close( CIO_t *cio); >> +int cio_poll(CIO_t *cio, int timeout); >> > > Later I see that all cio_* and CIO_* stuff is used only in > osm_console.c, then I think this all should be moved to this file, > local function should be static, etc.. > > The intent of the CIO abstraction is to support connections to the OSM server. Currently, the only thing "planned" to use this connection is the interactive Console. That might not always be the case. > Another thing, please try to not break existing coding style (it is > described in opensm/doc/opensm-coding-style.txt), in many cases you can > use opensm/opensm/osm_indent script to format the code. > > Sorry. I wasn't aware of the indent script until recently. Is this universally used, or just on new code? 
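For readers following the thread, the role of the cio_* wrappers can be illustrated with a minimal sketch of the plain (non-SSL) case. It assumes the wrappers simply forward to stdio on the streams held in CIO_t; the patch's real implementation is not shown in this message and may differ. The struct is a trimmed copy of the quoted one, and cio_bind_stream() is a hypothetical helper added only so the sketch can be exercised.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Trimmed copy of CIO_t from the quoted header (pollfd member omitted). */
typedef struct _CIO_t {
	int fd;		/* file descriptor (socket), -1 when unused */
	FILE *out;
	FILE *err;
	FILE *in;
} CIO_t;

/*
 * Assumption: in the plain case the wrapper forwards to vfprintf() on
 * cio->out. A secure build could instead format into a buffer and hand
 * it to the TLS layer without changing any caller.
 */
static int cio_printf(CIO_t *cio, const char *format, ...)
{
	va_list ap;
	int n;

	if (!cio || !cio->out)
		return -1;
	va_start(ap, format);
	n = vfprintf(cio->out, format, ap);
	va_end(ap);
	return n;	/* number of characters written, or negative on error */
}

static int cio_flush(CIO_t *cio)
{
	if (!cio || !cio->out)
		return -1;
	return fflush(cio->out);
}

/* Hypothetical helper (not in the patch): bind all streams to one FILE. */
static void cio_bind_stream(CIO_t *cio, FILE *f)
{
	assert(cio != NULL);
	cio->fd = -1;
	cio->out = cio->err = cio->in = f;
}
```

With wrappers of this shape, a command handler such as loglevel_parse() only ever sees a CIO_t, which is the property the abstraction is meant to provide whether or not the functions end up static in osm_console.c.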
>> #include >> +typedef struct _LoopCmd >> +{ >> + int on; >> + int running; >> + int delay_s; >> + void (*loop_function)(osm_opensm_t *p_osm, CIO_t *out); >> + cl_thread_t loopThread; // a specific thread for each looping cmd >> +} LoopCmd; >> + >> +// unique attributes for each connection >> +typedef struct _osm_console_thread_t >> +{ >> + int used; >> + unsigned short int port; >> + int authorized; >> + int state; >> + char name[CIO_INFO_SIZE]; >> + char in_buff[CIO_BUFSIZE]; >> + char out_buff[CIO_BUFSIZE]; >> + char client_type[CIO_NOTE_SIZE]; // maps to option->console >> (off|local|socket) >> + char client_ip[CIO_NOTE_SIZE]; >> + char client_hn[CIO_INFO_SIZE]; >> + unsigned int thread_num; // a unique ever increasing number + >> osm_opensm_t *p_osm; // the global opensm singleton (protect with lock) >> + CIO_t io; // the io streams for the connection >> + LoopCmd loop_command; >> + cl_thread_t consoleThread; // a specific thread each console connection >> + struct timeval connect_time; >> +} osm_console_thread_t; >> > > I think this introduces CIO_MAX_CONNECTS new threads + for loop commands. > What about to do all in one thread - to use select() or poll() with > timeout on multiple file descriptors? This will "reserve" another CPUs > for running another OpenSM things. Another potential problem is multi > thread synchronizations - we had (and still have) a lot of issues in this > area. > > I wasn't aware of thread synchronization issues.... You are correct, this potentially introduces 2*CIO_MAX_CONNECTS new threads. (Worst case, all connections are used, all running a loop command.) Currently, the only loop command is for printing status, but the software was designed to support any command you may want to put in a loop. If no additional commands will be "looped", then I agree its overkill to put this in its own thread. I think each connection/session should be in its own thread. 
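The single-thread alternative Sasha describes can be sketched roughly as follows. This is an illustration of multiplexing several console descriptors with poll() and a timeout, not code from the patch: console_poll_once() and MAX_CONSOLES are hypothetical names, and a real server would also poll the listening socket and pass each line to the command parser.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_CONSOLES 5	/* assumed to mirror CIO_MAX_CONNECTS in the patch */

/*
 * Wait up to timeout_ms for input on any of the nfds descriptors.
 * Reads at most one chunk from each ready descriptor into buf.
 * Returns the number of descriptors that delivered data, 0 on timeout,
 * or -1 on poll() error. If last_ready is not NULL it receives the
 * index of the last descriptor that was read.
 */
static int console_poll_once(const int *fds, int nfds, int timeout_ms,
			     char *buf, size_t buflen, int *last_ready)
{
	struct pollfd pfd[MAX_CONSOLES];
	int i, n, served = 0;

	assert(nfds > 0 && nfds <= MAX_CONSOLES);
	for (i = 0; i < nfds; i++) {
		pfd[i].fd = fds[i];
		pfd[i].events = POLLIN;
		pfd[i].revents = 0;
	}

	n = poll(pfd, (nfds_t)nfds, timeout_ms);
	if (n <= 0)
		return n;	/* 0: timeout, -1: error */

	for (i = 0; i < nfds; i++) {
		if (!(pfd[i].revents & POLLIN))
			continue;
		if (read(pfd[i].fd, buf, buflen) > 0) {
			served++;
			if (last_ready)
				*last_ready = i;
		}
	}
	return served;
}
```

In this scheme the timeout return is the natural place to run a periodic loop command, so neither per-connection nor per-loop threads are needed. The thread-per-session design Tim prefers gives up that simplicity in exchange for isolation between sessions.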
>> + >> struct command { >> - char *name; >> - void (*help_function) (FILE * out, int detail); >> - void (*parse_function) (char **p_last, osm_opensm_t * p_osm, >> - FILE * out); >> + char *name; >> + void (*help_function)(CIO_t *out, int detail); >> + void (*parse_function)(char **p_last, osm_console_thread_t *p_oct, CIO_t >> *out); >> }; >> -struct { >> - int on; >> - int delay_s; >> - time_t previous; >> - void (*loop_function) (osm_opensm_t * p_osm, FILE * out); >> -} loop_command = { >> -on: 0, delay_s: 2, loop_function:NULL}; >> +/* connection pool for remote clients - currently only consoles */ >> +static osm_console_thread_t ConsoleThreadPool[CIO_MAX_CONNECTS]; >> +static cl_plock_t ThreadLock; >> +static volatile unsigned int cio_thread_counter = 0; >> +static struct timeval ServerTime; >> > > Would be nice to avoid using non-constant static/global variables. > Instead we could keep needed per OpenSM session info in allocated > structure. > > I agree. I was following the existing code and limiting the # of connections to a small number. I didn't think it was different than current practice. Allocating the structure would be a better long term solution. >> + >> +/********************************************************************** >> + * convenience function >> + **********************************************************************/ >> +CIO_t* getCIO(osm_console_thread_t *oct) >> > > This function should be static? > > Yep, I'll fix those. 
>> +{ >> + return &oct->io; >> +} >> + >> +/********************************************************************** >> + * thread pool primitive: counts the number currently in use >> + **********************************************************************/ >> +int num_console_threads(void) >> >> -static void loglevel_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) >> +static void loglevel_parse(char **p_last, osm_console_thread_t *p_oct, >> CIO_t *out) >> { >> + osm_opensm_t *p_osm = p_oct->p_osm; >> char *p_cmd; >> int level; >> p_cmd = next_token(p_last); >> if (!p_cmd) >> - fprintf(out, "Current log level is 0x%x\n", >> - osm_log_get_level(&p_osm->log)); >> + cio_printf(out, "Current log level is 0x%x\n", >> osm_log_get_level(&p_osm->log)); >> > > At least here your mailer wraps the line :( > > I see, sorry - I thought I turned that off. I will switch to a different mailer for sending patches. >> + >> +int print_console_thread_pool(osm_console_thread_t* p_oct, osm_opensm_t >> *p_osm, CIO_t *out) >> > > This function is not used. > > Whoops! I removed the "new" command that uses this (didn't want to introduce too many new things) but missed this code. >> > > It is not clear for me why most of those wrapper functions are needed > at all. And how really so big comment about *_printf() usage is helpful. > > Sasha > > Currently those wrapper functions only provide a single implementation, but I intend to extend them with additional functionality when I add SSL/TSL. The new protocol will depend on new libraries/headers. We (LLNL) discussed this, and thought conditionally compiling this feature in would satisfy those folks who did not want to add this dependency if they did not want the feature. So the wrapper functions (in this Patch) were just a way of introducing the IO abstraction. Regarding the comments, sorry if it seems verbose. I tend to put all of my documentation in the code, and sometimes I get carried away. Thanks for reviewing all of this. 
How would you like me to move forward? Would you rather me (re)submit this Patch as a series of 2? I want to establish this as a working baseline (no new functionality, just more extensible) before adding the SSL/TSL code. -- Timothy A. Meier Computer Scientist ICCD/High Performance Computing 925.422.3341 meier3 at llnl.gov From pradeeps at linux.vnet.ibm.com Mon Oct 29 11:03:45 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Mon, 29 Oct 2007 11:03:45 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: <471FAC1F.2070401@linux.vnet.ibm.com> <4720E355.8010400@linux.vnet.ibm.com> Message-ID: <47262081.7010507@linux.vnet.ibm.com> Roland Dreier wrote: > > Having waited for months for this patch to be merged in, it is very disappointing > > to say the least. Wish it had been merged and if changes are needed they can always be > > made subsequently. That has been my understanding of the development model. > > If you really want to get into it... > > I'll certainly accept some of the blame for taking too long to review > this patch. However, you didn't do yourself any favors by: a) making > one huge ugly patch and b) being rather disagreeable when someone > actually tried to review it. > > As far as the development model goes, it is certainly true that for > new things, we can merge first and fix later. But when we're touching > something like IPoIB, which is pretty critical to just about everyone > using the IB stack at all, the standard is a little different: we need > to be much more conservative. And even for new stuff, starting from a > good base is pretty important; it's easy to pick on coding style > problems, and indeed they do make review harder, but it's even more > important to have the underlying logic and structure be simple and > maintainable. > > Anyway, I'll post my current patch series shortly. 
I think I was able > to make the patch quite a bit neater and more reviewable: your patch > added > 400 lines, while the main part of my series adds < 200 lines. > Roland, I realize a maintainers job is not easy with the incredible number of patches that are submitted as I have seen on this mailing list. And I am glad to see that this patch is starting to see some forward movement at long last :). I will review the patch and test it out and provide comments. Thanks for your efforts and help with this. There is some history behind why it became a huge ugly patch -which is not in the patch itself, and why I resisted making changes (before the merge). In the initial stages when I submitted the patch it got a very chilly reception and the message I got was "hands off my code". If you will recollect I did raise maintainability issues in the very beginning. But, from the communications I inferred that without incorporating those comments I could not get the patch in. There were a few genuine issues that the reviews pointed out. At the same time a lot of inconsequential comments were made too. Over time the patch morphed to incorporate several such comments. As you realize with a big patch, it is time consuming to test out every time the patch is changed and that too across multiple HCAs. Even though I was in agreement with Sean's comments (I had proposed several of them early on :) ) I deferred making those changes because they were undoing some of the changes suggested and I was not sure if there was agreement across the board. After all it had been reviewed by 3 people and continued to evolve. That is the reason I kept insisting that I would evaluate comments after the merge. Much easier to make small isolated changes after the big patch is in. Anyway, I would like to move on and close the chapter on for-2.6.24 tree since the window is now closed. Look forward to this patch being the first one to be merged into the for-2.6.25 tree. 
Pradeep Pradeep From vlad at dev.mellanox.co.il Mon Oct 29 11:02:26 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 29 Oct 2007 20:02:26 +0200 Subject: [ofa-general] Re: [ewg] Re: ofed_kernel merged with 2.6.24-rc1 patches update required In-Reply-To: <4726176A.2060505@opengridcomputing.com> References: <4725F5D9.6050301@dev.mellanox.co.il> <4726176A.2060505@opengridcomputing.com> Message-ID: <47262032.4000906@dev.mellanox.co.il> Steve Wise wrote: > > > Vladimir Sokolovsky wrote: >> Hello, >> There is a new branch "ofed_kernel_2_6_24_rc1" under >> git://git.openfabrics.org/ofed_1_3/linux-2.6.git >> >> All patches from kernel_patches/fixes that were applied in 2.6.24-rc1 >> were removed from kernel_patches/fixes directory. >> The "problematic" patches from kernel_patches/fixes were moved to the >> kernel_patches/attic directory. >> >> Backport patches and fixes should be updated according to the new >> kernel tree. >> The easy way to do so is using "ofed_scripts/ofed_makedist.sh" utility >> which creates tgz file for every supported kernel with all relevant >> patches applied. >> > > Vlad, have you done any builds against the various kernels? What > exactly should I, as cxgb3 owner, do with this branch other than verify > the patches are correct? > > Steve. Currently some backport patches fails to be applied. Please verify that cxgb3 backport patches can be applied and that under kernel_patches/fixes all required patches present. Vladimir. 
From or.gerlitz at gmail.com Mon Oct 29 12:05:10 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 29 Oct 2007 21:05:10 +0200 Subject: [ewg] Re: [ofa-general] to be discussed at the developer conference In-Reply-To: <47260B7C.8070203@ichips.intel.com> References: <4725E023.7070409@voltaire.com> <47260B7C.8070203@ichips.intel.com> Message-ID: <15ddcffd0710291205j5aa20129tdd441e3043e197e7@mail.gmail.com> On 10/29/07, Sean Hefty wrote: > > > 1) the long time and endless threads related to the SA caching thing > > need to be there. Sean - I saw that you prepare a session, correct? will > > you presenting few possible designs? > > I was asked to prepare a session and will mention some of the general > scalability issues that we've seen with Intel MPI. > 3) QoS - Sean, Dror, generally speaking, what where you thinking to > > discuss? > > We plan on discussing what was added to the stack and opensm. > > Keep in mind that both of these are only 20 minutes. Sean, As you might saw over the thread "OpenFabrics Developer's Summit: tentative agenda" I am working to get the Linux IB issues what ever time we need to discuss them. You can assume at least 45 minutes (and if needed more) to the SA caching so you can go much further then the problem description eg to sketch few possible designs / implementations. This is a two years old open issue which need to be solved. Similarily for QoS, I'd go further to discuss open issues if there are such that you are aware to. Who's going to present the opensm changes - you or Dror? Or. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From or.gerlitz at gmail.com Mon Oct 29 12:11:01 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 29 Oct 2007 21:11:01 +0200 Subject: [ofa-general] Re: [ewg] to be discussed at the developer conference In-Reply-To: <4725E023.7070409@voltaire.com> References: <4725E023.7070409@voltaire.com> Message-ID: <15ddcffd0710291211n102ec557rcf45471e94f3cb54@mail.gmail.com> On 10/29/07, Or Gerlitz wrote: > > (Assuming that the allocation of slots within the schedule to have > enough time for Linux IB developers to discuss what ever they decide > they need to would be taken care of) I'd like to check with people what > we want to be on the agenda of these slots. My thinking for issues to > discuss was: > .... > any more ideas? > 7) the inform info code. Sean - you have implemented and attempted to push it through the sa caching push, but since the cache was rejected so did the inform info code. So the questions here - how do we make this push happen? are there any open issues, etc Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Oct 29 12:15:54 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 29 Oct 2007 12:15:54 -0700 Subject: [ofa-general] Re: [ewg] to be discussed at the developer conference In-Reply-To: <15ddcffd0710291211n102ec557rcf45471e94f3cb54@mail.gmail.com> References: <4725E023.7070409@voltaire.com> <15ddcffd0710291211n102ec557rcf45471e94f3cb54@mail.gmail.com> Message-ID: <4726316A.3070703@ichips.intel.com> > 7) the inform info code. 
Sean - you have implemented and attempted to > push it through the sa caching push, but since the cache was rejected so > did the inform info code. So the questions here - how do we make this > push happen? are there any open issues, etc There either needs to be an in kernel user, or we need to reach agreement on the best way to expose this to userspace. Neither this, nor the multicast code are directly exported. I have seen e-mails on the list that event subscription is used by userspace apps, but it is done via the MAD layer directly. - Sean From sean.hefty at intel.com Mon Oct 29 12:45:07 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 29 Oct 2007 12:45:07 -0700 Subject: [ofa-general] librdmacm 1.0.4 release Message-ID: <000101c81a64$3582de80$9c98070a@amr.corp.intel.com> I've pushed out a release 1.0.4 of librdmacm that addresses some of the feedback from Doug. Patches were posted previously to the list, with a small update based on that feedback. Please pull this release into OFED 1.3. 
Changes from 1.0.3: librdmacm/cma: provide wrapper functions to extract src/dst addresses librdmacm/cma: provide sanity checks for max outstanding rdma ops librdmacm/man: update man pages to clarify connection request params Thanks, - Sean From or.gerlitz at gmail.com Mon Oct 29 13:15:25 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 29 Oct 2007 22:15:25 +0200 Subject: [ofa-general] Re: OpenFabrics Developer's Summit: feedback requested In-Reply-To: <20071029150747.GA20952@cuprite.pathscale.com> References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> <20071024003159.GA10244@cuprite.pathscale.com> <15ddcffd0710280555q5d12bf86h53d6809f646d1df0@mail.gmail.com> <20071029065138.GA13737@cuprite.pathscale.com> <20071029150747.GA20952@cuprite.pathscale.com> Message-ID: <15ddcffd0710291315x1a460cb5i1199b4a561d2f2b8@mail.gmail.com> On 10/29/07, Johann George wrote: > > (2) If we are able to, would you prefer to see simultaneous > tracks and lengthen some of the sessions. It makes some sense to make a poll if people prefer simultaneous tracks or not, however if the answer is "no", still you can't allocate only 20 minutes for Linux IB open issues. This means that there will be no simultaneous tracks in the price of removing other things from the agenda. > (3) Would you like to see additional MPI sessions crammed > into the allotted time? Getting input from COMMERCIAL MPIs need not be a subject to this or that poll result, moreover, as Jeff commented, in all the previous meetings MVAPICH and OMPI people were able to provide updates and feedback, getting updates from other MPIs is more important in this time frame. Or. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gdror at dev.mellanox.co.il Mon Oct 29 15:25:53 2007 From: gdror at dev.mellanox.co.il (Dror Goldenberg) Date: Tue, 30 Oct 2007 00:25:53 +0200 Subject: [ofa-general] to be discussed at the developer conference In-Reply-To: <4725E023.7070409@voltaire.com> References: <4725E023.7070409@voltaire.com> Message-ID: <47265DF1.8060409@dev.mellanox.co.il> Or Gerlitz wrote: > > 2) as for IPoIB stateless offload - with Eli and Liran not planned to > be there. Dror - do you intend to actually present the actual ipoib / > core / drivers related design and implementation? Also, personally, I > felt that the 1-2 slides you delivered on Sonoma where way below what > would let one understand in what features exactly the HW supports, and > I don't want to be referred to under-NDA docs, lets just have you > provide a clear description regarding large-send and checksum > offloading. Same for the HW interrupt mitigation, can be nice if you > explain the problem, the solution and spare few words how does this > goes with NAPI. One more thing is the LRO staff - its a pure SW > optimization, if you think this should be in the ipoib code, some > justification materials can be helpful. Yes, I will try to do a better job this time :)
From Arkady.Kanevsky at netapp.com Mon Oct 29 16:23:43 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 29 Oct 2007 19:23:43 -0400 Subject: [ofa-general] Re: OpenFabrics Developer's Summit: feedback requested In-Reply-To: <20071029150747.GA20952@cuprite.pathscale.com> References: <20071023200329.GA6368@cuprite.pathscale.com><15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com><20071024003159.GA10244@cuprite.pathscale.com><15ddcffd0710280555q5d12bf86h53d6809f646d1df0@mail.gmail.com><20071029065138.GA13737@cuprite.pathscale.com> <20071029150747.GA20952@cuprite.pathscale.com> Message-ID: Johann, please, do not schedule anything after Th 7:00pm iWARP session. I expect it to go much longer than 1 hour. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Johann George [mailto:johann.george at qlogic.com] > Sent: Monday, October 29, 2007 11:08 AM > To: general at lists.openfabrics.org; > promoters at lists.openfabrics.org; ewg at lists.openfabrics.org > Subject: [ofa-general] Re: OpenFabrics Developer's Summit: > feedback requested > > Someone proposed a fourth option worth considering which is > staying later on Friday. Here are the alternatives we are > looking for feedback on: > > (1) Are you willing and able to attend if we start at > 11:00am on Thursday rather than at 1:00pm? > > (2) If we are able to, would you prefer to see simultaneous > tracks and lengthen some of the sessions. > > (3) Would you like to see additional MPI sessions crammed > into the allotted time? > > (4) Are you willing and able to stay if we ran later on > Friday? How long? > > Thanks. 
> > Johann > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Mon Oct 29 22:09:28 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 29 Oct 2007 22:09:28 -0700 Subject: [ofa-general] Re: [PATCH 5 of 5] mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ In-Reply-To: <200710241858.45305.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Wed, 24 Oct 2007 18:58:45 +0200") References: <200710241858.45305.jackm@dev.mellanox.co.il> Message-ID: I'm missing something with this patch set... > Leave create_cq for userspace CQs as before, to avoid breaking ABI. > (Handle this in separate libmlx4 patch) You don't want to break the ABI, which is very good. But... > - dev->caps.max_cqes = dev_cap->max_cq_sz - 1; > + dev->caps.max_cqes = dev_cap->max_cq_sz; Unless I'm confused, this will lead to userspace seeing a larger limit for the max number of CQEs. Which will mean that an old libmlx4 will fail to create a CQ if the consumer asks for the largest possible CQ, right? - R. From jackm at dev.mellanox.co.il Mon Oct 29 23:26:21 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 30 Oct 2007 08:26:21 +0200 Subject: [ofa-general] Re: [PATCH 5 of 5] mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ In-Reply-To: References: <200710241858.45305.jackm@dev.mellanox.co.il> Message-ID: <200710300826.21258.jackm@dev.mellanox.co.il> On Tuesday 30 October 2007 07:09, Roland Dreier wrote: > I'm missing something with this patch set... > > > Leave create_cq for userspace CQs as before, to avoid breaking ABI. > > (Handle this in separate libmlx4 patch) > > You don't want to break the ABI, which is very good. But... 
> > > - dev->caps.max_cqes = dev_cap->max_cq_sz - 1; > > + dev->caps.max_cqes = dev_cap->max_cq_sz; > > Unless I'm confused, this will lead to userspace seeing a larger limit > for the max number of CQEs. Which will mean that an old libmlx4 will > fail to create a CQ if the consumer asks for the largest possible CQ, right? > You are correct, that is an unfortunate side-effect of the change, that I missed. The largest CQ that an old libmlx4 will accept is 0x3fffff (hard-coded in file libmlx4/src/verbs.c, procedure mlx4_create_cq() ). The new limit returned in dev_lim is 0x400000. Does this mean that you prefer to increment the ABI? - Jack From ogerlitz at voltaire.com Tue Oct 30 00:38:38 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 30 Oct 2007 09:38:38 +0200 Subject: [ofa-general] to be discussed at the developer conference In-Reply-To: <47265DF1.8060409@dev.mellanox.co.il> References: <4725E023.7070409@voltaire.com> <47265DF1.8060409@dev.mellanox.co.il> Message-ID: <4726DF7E.4030605@voltaire.com> Dror Goldenberg wrote: >> 2) as for IPoIB stateless offload - with Eli and Liran not planned to >> be there. Dror - do you intend to actually present the actual ipoib / >> core / drivers related design and implementation? Also, personally, I >> felt that the 1-2 slides you delivered on Sonoma where way below what >> would let one understand in what features exactly the HW supports, and >> I don't want to be referred to under-NDA docs, lets just have you >> provide a clear description regarding large-send and checksum >> offloading. Same for the HW interrupt mitigation, can be nice if you >> explain the problem, the solution and spare few words how does this >> goes with NAPI. One more thing is the LRO staff - its a pure SW >> optimization, if you think this should be in the ipoib code, some >> justification materials can be helpful. 
> > Yes, I will try to do a better job this time :) Lets do it more concrete: please comment if you will be presenting the actual SW design and more important, how much time you think you need, 20m is way below anything that allows for questions and some discussion - will 45m be enough? Will you referring the last patch set posted by Eli - (it has some pending comments that were not addressed) or Eli is going to post new version before the conference? Or. From ogerlitz at voltaire.com Tue Oct 30 00:52:15 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 30 Oct 2007 09:52:15 +0200 Subject: [ofa-general] Re: [ewg] to be discussed at the developer conference In-Reply-To: <4726316A.3070703@ichips.intel.com> References: <4725E023.7070409@voltaire.com> <15ddcffd0710291211n102ec557rcf45471e94f3cb54@mail.gmail.com> <4726316A.3070703@ichips.intel.com> Message-ID: <4726E2AF.6020202@voltaire.com> Sean Hefty wrote: >> 7) the inform info code. Sean - you have implemented and attempted to >> push it through the sa caching push, but since the cache was rejected >> so did the inform info code. So the questions here - how do we make >> this push happen? are there any open issues, etc > > There either needs to be an in kernel user, or we need to reach > agreement on the best way to expose this to userspace. Neither this, > nor the multicast code are directly exported. IB multicast send-only (NonMemberSendOnly in IB spec notation) joins is the user that can enable the merge of the inform-info code. Specifically, the in-kernel user I suggest is the rdma-cm: enhance the librdmacm api to let the consumer specify that they want a "send-only" join, for such joins have the rdma-cm register to "GID IN" event on this group MGID and once such event happens, do the actual join on the group. How does this sounds? > I have seen e-mails on the list that event subscription is used by > userspace apps, but it is done via the MAD layer directly. 
Other than having each such app inventing the wheel in their inform-info low level coding, this is bad, since there is no reference counting and one process doing unregister makes the second process never get events (or they also implemented a reference counting daemon...), anyway, I think we want your implementation in, and the question is how we do that. Or. From gdror at dev.mellanox.co.il Tue Oct 30 01:00:49 2007 From: gdror at dev.mellanox.co.il (Dror Goldenberg) Date: Tue, 30 Oct 2007 10:00:49 +0200 Subject: [ofa-general] to be discussed at the developer conference In-Reply-To: <4726DF7E.4030605@voltaire.com> References: <4725E023.7070409@voltaire.com> <47265DF1.8060409@dev.mellanox.co.il> <4726DF7E.4030605@voltaire.com> Message-ID: <4726E4B1.9000002@dev.mellanox.co.il> Or Gerlitz wrote: > Dror Goldenberg wrote: >>> 2) as for IPoIB stateless offload - with Eli and Liran not planned >>> to be there. Dror - do you intend to actually present the actual >>> ipoib / core / drivers related design and implementation? Also, >>> personally, I felt that the 1-2 slides you delivered on Sonoma where >>> way below what would let one understand in what features exactly the >>> HW supports, and I don't want to be referred to under-NDA docs, lets >>> just have you provide a clear description regarding large-send and >>> checksum offloading. Same for the HW interrupt mitigation, can be >>> nice if you explain the problem, the solution and spare few words >>> how does this goes with NAPI. One more thing is the LRO staff - its >>> a pure SW optimization, if you think this should be in the ipoib >>> code, some justification materials can be helpful. >> >> Yes, I will try to do a better job this time :) > > Lets do it more concrete: please comment if you will be presenting the > actual SW design and more important, how much time you think you need, > 20m is way below anything that allows for questions and some > discussion - will 45m be enough? 
> > Will you referring the last patch set posted by Eli - (it has some > pending comments that were not addressed) or Eli is going to post new > version before the conference? > > Or. > > I haven't yet prepared the presentation. I am willing to cover whatever you think is important. Indeed 20m allotted time is too short. So, I should either adjust myself to this short time-slot or ask for more. Given that the other sessions are also 20m, I was thinking to have a short talk (with less of contents). If you feel that people can benefit from longer presentation, I will be happy to get more time for it. 40-45m will be great. -Dror From ogerlitz at voltaire.com Tue Oct 30 01:11:07 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 30 Oct 2007 10:11:07 +0200 Subject: [ofa-general] to be discussed at the developer conference In-Reply-To: <4726E4B1.9000002@dev.mellanox.co.il> References: <4725E023.7070409@voltaire.com> <47265DF1.8060409@dev.mellanox.co.il> <4726DF7E.4030605@voltaire.com> <4726E4B1.9000002@dev.mellanox.co.il> Message-ID: <4726E71B.4030802@voltaire.com> Dror Goldenberg wrote: > Or Gerlitz wrote: > I haven't yet prepared the presentation. I am willing to cover whatever > you think is important. Indeed 20m allotted time is too short. So, I > should either adjust myself to this short time-slot or ask for more. > Given that the other sessions are also 20m, I was thinking to have a > short talk (with less of contents). If you feel that people can benefit > from longer presentation, I will be happy to get more time for it. > 40-45m will be great. yes, assume you have 45m, so what we have now is: Sean - SA-caching - 45m Dror - IPoIB stateless offload - 45m As for QoS, I understand you have a joint session with Sean, what I think can be great if you will elaborate on is the HW support, eg in connectX, AnafaII, etc what is there, what is missing, roadmap, etc. Or. 
From ogerlitz at voltaire.com Tue Oct 30 01:43:24 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 30 Oct 2007 10:43:24 +0200 Subject: [ofa-general] librdmacm 1.0.4 release In-Reply-To: <000101c81a64$3582de80$9c98070a@amr.corp.intel.com> References: <000101c81a64$3582de80$9c98070a@amr.corp.intel.com> Message-ID: <4726EEAC.3070105@voltaire.com> Sean Hefty wrote: > I've pushed out a release 1.0.4 of librdmacm that addresses some of the feedback > from Doug. Patches were posted previously to the list, with a small update > based on that feedback. > Changes from 1.0.3: > librdmacm/cma: provide wrapper functions to extract src/dst addresses > librdmacm/cma: provide sanity checks for max outstanding rdma ops > librdmacm/man: update man pages to clarify connection request params Hi Sean, I think you have mentioned that some documentation update is planned? Anyway, I know that both rdma_connect and rdma_accept get struct rdma_conn_param and following a question from a user, I wondered which of the fields are actually relevant in the passive side. Doing a quick look at the kernel core code, I saw that: - param.retry_count is ignored in the passive side rdma-cm code and the IB cm uses the one present in the req message. - param.rnr_retry_count is not ignored in the passive side, but from looking in the code, I was not sure if the value used is the one present in the req or the one supplied by the passive consumer. - param.flow_control is a pure SW field which does not get into the QP attr. My understanding is that IB RC flow-control means non zero rnr counter, is this all? if yes, maybe we need to expose only rnr_retry_count field Or. 
From vlad at lists.openfabrics.org Tue Oct 30 03:02:32 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 30 Oct 2007 03:02:32 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071030-0200 daily build status Message-ID: <20071030100232.ECD7FE608E6@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on x86_64 
with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From tziporet at dev.mellanox.co.il Tue Oct 30 04:40:25 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 30 Oct 2007 13:40:25 +0200 Subject: [ofa-general] Re: [ewg] Re: OpenFabrics Developer's Summit: feedback requested In-Reply-To: <20071029150747.GA20952@cuprite.pathscale.com> References: <20071023200329.GA6368@cuprite.pathscale.com> <15ddcffd0710231348p1a1424dfx93aaf700c14d78a1@mail.gmail.com> <20071024003159.GA10244@cuprite.pathscale.com> <15ddcffd0710280555q5d12bf86h53d6809f646d1df0@mail.gmail.com> <20071029065138.GA13737@cuprite.pathscale.com> <20071029150747.GA20952@cuprite.pathscale.com> Message-ID: <47271829.3030608@mellanox.co.il> Johann George wrote: > Someone proposed a fourth option worth considering which is staying > later on Friday. Here are the alternatives we are looking for > feedback on: > > (1) Are you willing and able to attend if we start at > 11:00am on Thursday rather than at 1:00pm? > > yes > (2) If we are able to, would you prefer to see simultaneous > tracks and lengthen some of the sessions. > no > (3) Would you like to see additional MPI sessions crammed > into the allotted time? > yes > (4) Are you willing and able to stay if we ran later on > Friday? How long? 
> > I can till evening Tziporet From sashak at voltaire.com Tue Oct 30 07:33:54 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 30 Oct 2007 16:33:54 +0200 Subject: [ofa-general] Does openSM ucast routing table generator utility exist .. ? In-Reply-To: <829ded920710290500p31de6c1bp6b219ddab54b41a3@mail.gmail.com> References: <829ded920710290500p31de6c1bp6b219ddab54b41a3@mail.gmail.com> Message-ID: <20071030143354.GC20447@sashak.voltaire.com> On 17:30 Mon 29 Oct , Keshetti Mahesh wrote: > > I could see that openSM now supports file based unicast > forwarding table loading. > My question is, has anyone ever wrote an utility to generate > such file (unicast forwarding table file) having the facility to load > non min-hop paths (I think ) which is the actual intention behind > allowing the file based unicast forwarding table loading. You can dump existing routing tables as generated by one of OpenSM routing algorithms with dump_lfts.sh script (part of infiniband-diags), modify it and load back. Sasha From tziporet at dev.mellanox.co.il Tue Oct 30 07:40:44 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 30 Oct 2007 16:40:44 +0200 Subject: [ofa-general] OFED October 29 meeting summary on OFED 1.3 beta readiness Message-ID: <4727426C.5090504@mellanox.co.il> OFED October 29 meeting summary on OFED 1.3 beta readiness: 1. Beta release schedule: * The release is planed for next Monday Nov-5 * For this the rebase for 2.6.24-rc1 must be completed tomorrow * I will send status update on Thursday 2. Beta tasks status: 1. Fix compilation problems on PPC with 32 bits - Vlad & Oren (Mellanox) - on work 2. 
Rebase kernel code on 2.6.24 rc1 (depending it's availability) - on work (please read mail from Vlad with instructions) 3. SPEC files should be part of each user space package - each owner should take the spec file 4. Multiple uDAPL libs (1.0 & 2.0) - Vlad and Arlin (Intel) 5. Fix all compilation and install issues - All Done tasks: o Add qperf test from Qlogic - Johann (Qlogic) o Support RHEL 5 up1 - Woody & Vlad o Apply patches that fix warning of backport patches - Vlad (Mellanox) (one patch was not applied since we got no answer regarding it) o New MVAPICH package - Pasha & DK (OSU) o Complete RDS work - Vlad (Mellanox) o Integrate all SDP features - Jim (Mellanox) o nes - updated backport patches - Glenn (NetEffect) 3. Bugs that should be with high priority for the beta are all compilation and install issues. I will publish the specific list of bugs Note: the bug severity in bugzilla are aimed to the GA release From rpearson at systemfabricworks.com Tue Oct 30 07:53:07 2007 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Tue, 30 Oct 2007 09:53:07 -0500 Subject: [ofa-general] umad agent question? Message-ID: <5p5klh$24kt9s@rrcs-agw-01.hrndva.rr.com> Sasha, Hal, I am trying to create a vendor (group1) class management agent using libibumad. I am successful in registering the agent with method mask set to 0xe = get/put/send. When I use a send message from another system the message is received but apparently not when I use get or set. I say apparently because the system issuing the get or set receives a response but the user agent never returns from umad_recv. Is there by any chance some sample code somewhere in the OFA tree that exercises this functionality that I could look at? Also, I am curious why the method mask does not cover the response bit. How does this work. If you are registered for get do you automatically get get_response packets? Bob Pearson -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Tue Oct 30 07:54:27 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 30 Oct 2007 10:54:27 -0400 Subject: [ofa-general] Re: [ewg] OFED October 29 meeting summary on OFED 1.3 beta readiness In-Reply-To: <4727426C.5090504@mellanox.co.il> References: <4727426C.5090504@mellanox.co.il> Message-ID: Hi Tziporet, On 10/30/07, Tziporet Koren wrote: > OFED October 29 meeting summary on OFED 1.3 beta readiness: > > 1. Beta release schedule: > > * The release is planed for next Monday Nov-5 > * For this the rebase for 2.6.24-rc1 must be completed tomorrow > * I will send status update on Thursday > > > 2. Beta tasks status: > > 1. Fix compilation problems on PPC with 32 bits - Vlad & Oren > (Mellanox) - on work > 2. Rebase kernel code on 2.6.24 rc1 (depending it's availability) > - on work (please read mail from Vlad with instructions) > 3. SPEC files should be part of each user space package - each > owner should take the spec file > 4. Multiple uDAPL libs (1.0 & 2.0) - Vlad and Arlin (Intel) > 5. Fix all compilation and install issues - All What about release notes ? -- Hal > > > Done tasks: > o Add qperf test from Qlogic - Johann (Qlogic) > o Support RHEL 5 up1 - Woody & Vlad > o Apply patches that fix warning of backport patches - Vlad > (Mellanox) (one patch was not applied since we got no answer > regarding it) > o New MVAPICH package - Pasha & DK (OSU) > o Complete RDS work - Vlad (Mellanox) > o Integrate all SDP features - Jim (Mellanox) > o nes - updated backport patches - Glenn (NetEffect) > > > 3. Bugs that should be with high priority for the beta are all > compilation and install issues. 
> I will publish the specific list of bugs > Note: the bug severity in bugzilla are aimed to the GA release > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > From mshefty at ichips.intel.com Tue Oct 30 08:46:14 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Oct 2007 08:46:14 -0700 Subject: [ofa-general] to be discussed at the developer conference In-Reply-To: <4726E71B.4030802@voltaire.com> References: <4725E023.7070409@voltaire.com> <47265DF1.8060409@dev.mellanox.co.il> <4726DF7E.4030605@voltaire.com> <4726E4B1.9000002@dev.mellanox.co.il> <4726E71B.4030802@voltaire.com> Message-ID: <472751C6.1090607@ichips.intel.com> > Sean - SA-caching - 45m I think 30 minutes for this should be sufficient. - Sean From mshefty at ichips.intel.com Tue Oct 30 09:09:29 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Oct 2007 09:09:29 -0700 Subject: [ofa-general] umad agent question? In-Reply-To: <5p5klh$24kt9s@rrcs-agw-01.hrndva.rr.com> References: <5p5klh$24kt9s@rrcs-agw-01.hrndva.rr.com> Message-ID: <47275739.1030404@ichips.intel.com> > I am trying to create a vendor (group1) class management agent using > libibumad. I am successful in registering the agent with method mask set > to 0xe = get/put/send. When I use a send message from another system the > message is received but apparently not when I use get or set. I say > apparently because the system issuing the get or set receives a response > but the user agent never returns from umad_recv. Is there by any chance > some sample code somewhere in the OFA tree that exercises this > functionality that I could look at? Also, I am curious why the method > mask does not cover the response bit. How does this work. If you are > registered for get do you automatically get get_response packets? The method mask is only used for routing received unsolicited MADs. I.e. 
those that are not response MADs. Any app can send a MAD and get its response. Only one app is allowed to receive a non-response MAD. As for the problem that you mention, I don't understand the behavior that you're seeing. - Sean From mshefty at ichips.intel.com Tue Oct 30 09:03:16 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Oct 2007 09:03:16 -0700 Subject: [ofa-general] librdmacm 1.0.4 release In-Reply-To: <4726EEAC.3070105@voltaire.com> References: <000101c81a64$3582de80$9c98070a@amr.corp.intel.com> <4726EEAC.3070105@voltaire.com> Message-ID: <472755C4.10600@ichips.intel.com> >> librdmacm/man: update man pages to clarify connection request params > > I think you have mentioned that some documentation update is planned? See the man page updates that were made. There may still be some errors or omissions, but I tried to address Doug's comments. > - param.retry_count is ignored in the passive side rdma-cm code and the > IB cm uses the one present in the req message. correct - there's a comment in the header file about the passive side ignoring this value > - param.rnr_retry_count is not ignored in the passive side, but from > looking in the code, I was not sure if the value used is the one present > in the req or the one supplied by the passive consumer. The passive side uses the value from the req. The active side uses the value from the rep. > - param.flow_control is a pure SW field which does not get into the QP > attr. My understanding is that IB RC flow-control means non zero rnr > counter, is this all? if yes, maybe we need to expose only > rnr_retry_count field It's a property of the HCA, but it's not clear to me at the moment what a user does with this field. - Sean From rpearson at systemfabricworks.com Tue Oct 30 10:34:03 2007 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Tue, 30 Oct 2007 12:34:03 -0500 Subject: [ofa-general] umad agent question? 
In-Reply-To: <47275739.1030404@ichips.intel.com> Message-ID: <5p5klh$24noqp@rrcs-agw-01.hrndva.rr.com> Sean, OK, I understand about unsolicited that helps some. I attached a simple umad test case. Try running on machine A assuming port 1 is active and has say lid 8 ./madtest --port 1 on machine B assuming port 1 is active and on the same subnet as machine A port 1 ./madtest --port 1 --lid 8 --method 3 Everything should work (except that the TID gets mangled???). Then repeat with --method 1. Machine B will see a response from A but A will not see the packet so someone else on A is replying to the MAD. Bob -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Tuesday, October 30, 2007 11:09 AM To: Robert Pearson Cc: general at lists.openfabrics.org; Sasha Khapyorsky; 'Hal Rosenstock' Subject: Re: [ofa-general] umad agent question? > I am trying to create a vendor (group1) class management agent using > libibumad. I am successful in registering the agent with method mask set > to 0xe = get/put/send. When I use a send message from another system the > message is received but apparently not when I use get or set. I say > apparently because the system issuing the get or set receives a response > but the user agent never returns from umad_recv. Is there by any chance > some sample code somewhere in the OFA tree that exercises this > functionality that I could look at? Also, I am curious why the method > mask does not cover the response bit. How does this work. If you are > registered for get do you automatically get get_response packets? The method mask is only used for routing received unsolicited MADs. I.e. those that are not response MADs. Any app can send a MAD and get its response. Only one app is allowed to receive a non-response MAD. As for the problem that you mention, I don't understand the behavior that you're seeing. - Sean -------------- next part -------------- A non-text attachment was scrubbed... 
Name: madtest.c Type: application/octet-stream Size: 4950 bytes Desc: not available URL: From hrosenstock at xsigo.com Tue Oct 30 10:59:44 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 30 Oct 2007 10:59:44 -0700 Subject: [ofa-general] umad agent question? In-Reply-To: <5p5klh$24noqp@rrcs-agw-01.hrndva.rr.com> References: <5p5klh$24noqp@rrcs-agw-01.hrndva.rr.com> Message-ID: <1193767184.26246.176.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-10-30 at 12:34 -0500, Robert Pearson wrote: > Sean, > > OK, I understand about unsolicited that helps some. I attached a simple umad > test case. Try running > > on machine A assuming port 1 is active and has say lid 8 > > ./madtest --port 1 > > on machine B assuming port 1 is active and on the same subnet as machine A > port 1 > > ./madtest --port 1 --lid 8 --method 3 > > Everything should work (except that the TID gets mangled???). From Documentation/user_mad.txt: Transaction IDs Users of the umad devices can use the lower 32 bits of the transaction ID field (that is, the least significant half of the field in network byte order) in MADs being sent to match request/response pairs. The upper 32 bits are reserved for use by the kernel and will be overwritten before a MAD is sent. Is this what you are referring to ? -- Hal > Then repeat with --method 1. Machine B will see a response from A but A will > not see the packet so someone else on A is replying to the MAD. > > Bob > > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 30, 2007 11:09 AM > To: Robert Pearson > Cc: general at lists.openfabrics.org; Sasha Khapyorsky; 'Hal Rosenstock' > Subject: Re: [ofa-general] umad agent question? > > > I am trying to create a vendor (group1) class management agent using > > libibumad. I am successful in registering the agent with method mask set > > to 0xe = get/put/send. 
When I use a send message from another system the > > message is received but apparently not when I use get or set. I say > > apparently because the system issuing the get or set receives a response > > but the user agent never returns from umad_recv. Is there by any chance > > some sample code somewhere in the OFA tree that exercises this > > functionality that I could look at? Also, I am curious why the method > > mask does not cover the response bit. How does this work. If you are > > registered for get do you automatically get get_response packets? > > The method mask is only used for routing received unsolicited MADs. > I.e. those that are not response MADs. Any app can send a MAD and get > its response. Only one app is allowed to receive a non-response MAD. > > As for the problem that you mention, I don't understand the behavior > that you're seeing. > > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From guthridg at us.ibm.com Tue Oct 30 11:01:47 2007 From: guthridg at us.ibm.com (Scott Guthridge) Date: Tue, 30 Oct 2007 14:01:47 -0400 Subject: [ofa-general] Service ID scope in IB Arch Spec A3.2.2 is incorrect, right? Message-ID: IB Architecture Spec, r1.2 section A3.2.2 says [emphasis added]: Each *port* on a CA may support a set of services. ... Since *not all ports* support the same set of services... and later: "it is the combination of the Port GID and Service ID that identifies a particular service provider" But this seems to contradict chapter 12 (communication management) and chapter A8 (device management) which consistently associate services with channel adapters, not ports. See 12.6.5 table 99 (CA GUID), 12.6.8 table 103 (CA GUID), 12.9.9 connection state table (CA GUID), etc. 
Similarly, figure 309 "I/O Components and Relationships" in section A8.2.3 that shows the DM agent being a component of the I/O Unit, and because the I/O unit is associated with a single TCA, it follows that the DMA belongs to the channel adapter, not to a particular port. The CM implementation in OFED 1.2 supports this notion that services are defined per CA, not per port in that ib_create_cm_id doesn't take a port number. I suppose one could try to implement a service that just sent a REJ to anyone who tried to connect to it on a port it didn't like, but it seems like advertising a service you don't actually intend to provide from a given port would be odd behavior. So am I correct that A3.2.2 has it wrong? Would it be right to say that with respect to provided services, all ports of a given CA are equal? Thanks, Scott From rpearson at systemfabricworks.com Tue Oct 30 11:06:49 2007 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Tue, 30 Oct 2007 13:06:49 -0500 Subject: [ofa-general] umad agent question? In-Reply-To: <47275739.1030404@ichips.intel.com> Message-ID: <5p5klh$24ohb3@rrcs-agw-01.hrndva.rr.com> Sean, When I set vendor class to 15 instead of 9 everything works much better. I suspect this means someone else is registered for 9. In that case the register agent call should probably have not succeeded. The TID still gets clobbered and the QKEY ignored somewhere. Bob -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Tuesday, October 30, 2007 11:09 AM To: Robert Pearson Cc: general at lists.openfabrics.org; Sasha Khapyorsky; 'Hal Rosenstock' Subject: Re: [ofa-general] umad agent question? > I am trying to create a vendor (group1) class management agent using > libibumad. I am successful in registering the agent with method mask set > to 0xe = get/put/send. When I use a send message from another system the > message is received but apparently not when I use get or set. 
I say > apparently because the system issuing the get or set receives a response > but the user agent never returns from umad_recv. Is there by any chance > some sample code somewhere in the OFA tree that exercises this > functionality that I could look at? Also, I am curious why the method > mask does not cover the response bit. How does this work. If you are > registered for get do you automatically get get_response packets? The method mask is only used for routing received unsolicited MADs. I.e. those that are not response MADs. Any app can send a MAD and get its response. Only one app is allowed to receive a non-response MAD. As for the problem that you mention, I don't understand the behavior that you're seeing. - Sean From rpearson at systemfabricworks.com Tue Oct 30 11:12:55 2007 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Tue, 30 Oct 2007 13:12:55 -0500 Subject: [ofa-general] umad agent question? In-Reply-To: <1193767184.26246.176.camel@hrosenstock-ws.xsigo.com> Message-ID: <5p5klh$24okn2@rrcs-agw-01.hrndva.rr.com> Hal, That was one thing I saw. The version of OFED installed on my test machine is 1.1 and that must have been added later. This is not a problem. I just was not expecting the behavior. Bob -----Original Message----- From: Hal Rosenstock [mailto:hrosenstock at xsigo.com] Sent: Tuesday, October 30, 2007 1:00 PM To: Robert Pearson Cc: 'Sean Hefty'; 'Hal Rosenstock'; general at lists.openfabrics.org Subject: RE: [ofa-general] umad agent question? On Tue, 2007-10-30 at 12:34 -0500, Robert Pearson wrote: > Sean, > > OK, I understand about unsolicited that helps some. I attached a simple umad > test case. Try running > > on machine A assuming port 1 is active and has say lid 8 > > ./madtest --port 1 > > on machine B assuming port 1 is active and on the same subnet as machine A > port 1 > > ./madtest --port 1 --lid 8 --method 3 > > Everything should work (except that the TID gets mangled???). 
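A minimal sketch of the TID rule at issue here, as stated in Documentation/user_mad.txt: user space owns only the lower 32 bits of the 64-bit transaction ID, and the kernel overwrites the upper 32 bits before the MAD is sent. It assumes the TID is held as a host-order uint64_t; the helper names and the sample values are hypothetical and are not part of libibumad.

```c
#include <stdint.h>

/* Hypothetical helper: the part of a 64-bit TID that user space controls. */
static uint32_t tid_user_part(uint64_t tid)
{
    return (uint32_t)(tid & 0xffffffffu);
}

/*
 * Hypothetical helper: model the kernel replacing the upper 32 bits.
 * kernel_hi stands in for whatever value the kernel chooses (assumption).
 */
static uint64_t tid_with_kernel_part(uint64_t tid, uint32_t kernel_hi)
{
    return ((uint64_t)kernel_hi << 32) | tid_user_part(tid);
}
```

Under these assumptions, a request and its response should be matched by comparing tid_user_part() of both TIDs, since the upper half of the value that was handed to the kernel is not expected to survive.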
From Documentation/user_mad.txt: Transaction IDs Users of the umad devices can use the lower 32 bits of the transaction ID field (that is, the least significant half of the field in network byte order) in MADs being sent to match request/response pairs. The upper 32 bits are reserved for use by the kernel and will be overwritten before a MAD is sent. Is this what you are referring to ? -- Hal > Then repeat with --method 1. Machine B will see a response from A but A will > not see the packet so someone else on A is replying to the MAD. > > Bob > > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 30, 2007 11:09 AM > To: Robert Pearson > Cc: general at lists.openfabrics.org; Sasha Khapyorsky; 'Hal Rosenstock' > Subject: Re: [ofa-general] umad agent question? > > > I am trying to create a vendor (group1) class management agent using > > libibumad. I am successful in registering the agent with method mask set > > to 0xe = get/put/send. When I use a send message from another system the > > message is received but apparently not when I use get or set. I say > > apparently because the system issuing the get or set receives a response > > but the user agent never returns from umad_recv. Is there by any chance > > some sample code somewhere in the OFA tree that exercises this > > functionality that I could look at? Also, I am curious why the method > > mask does not cover the response bit. How does this work. If you are > > registered for get do you automatically get get_response packets? > > The method mask is only used for routing received unsolicited MADs. > I.e. those that are not response MADs. Any app can send a MAD and get > its response. Only one app is allowed to receive a non-response MAD. > > As for the problem that you mention, I don't understand the behavior > that you're seeing. 
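A rough sketch of the method-mask rule described above, assuming the mask carries one bit per method number, so that 0xe covers methods 1, 2 and 3 (Get, Set, Send). The enum and function names are hypothetical, the response check is simplified to the response bit alone, and only the low 32 method numbers are modelled.

```c
#include <stdint.h>

/* Method values for Get, Set and Send, plus the response bit (names are hypothetical). */
enum {
    METHOD_GET      = 0x01,
    METHOD_SET      = 0x02,
    METHOD_SEND     = 0x03,
    METHOD_RESP_BIT = 0x80
};

/* Mask bit for one method; this sketch only covers method numbers below 32. */
static unsigned long method_mask_bit(uint8_t method)
{
    unsigned int m = method & 0x7f;
    return m < 32 ? 1UL << m : 0;
}

/* Simplification: any MAD with the response bit set counts as a response. */
static int is_response_method(uint8_t method)
{
    return (method & METHOD_RESP_BIT) != 0;
}

/* Only unsolicited (non-response) MADs are routed by the registered mask. */
static int routed_by_method_mask(uint8_t method, unsigned long mask)
{
    if (is_response_method(method))
        return 0;
    return (mask & method_mask_bit(method)) != 0;
}
```

In this model a GetResp (0x81) is never matched against the mask, which is consistent with the statement that any application can send a MAD and receive its response regardless of what it registered for.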
> > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue Oct 30 11:14:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 11:14:42 -0700 Subject: [ofa-general] Re: [PATCH 3/4] IB/ipath -- Fix incorrect use of sizeof on msg buffer (function argument) In-Reply-To: <20071026144636.13639.31567.stgit@eng-46.internal.keyresearch.com> (Arthur Jones's message of "Fri, 26 Oct 2007 07:46:36 -0700") References: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> <20071026144636.13639.31567.stgit@eng-46.internal.keyresearch.com> Message-ID: Thanks, applied 1-3 to for-2.6.24 From rdreier at cisco.com Tue Oct 30 11:15:05 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 11:15:05 -0700 Subject: [ofa-general] Re: [PATCH 4/4] IB/ipath -- Improve interrupt handler cache footprint In-Reply-To: <20071026144641.13639.73320.stgit@eng-46.internal.keyresearch.com> (Arthur Jones's message of "Fri, 26 Oct 2007 07:46:41 -0700") References: <20071026144620.13639.26891.stgit@eng-46.internal.keyresearch.com> <20071026144641.13639.73320.stgit@eng-46.internal.keyresearch.com> Message-ID: I applied this to for-2.6.25 since it seems like it is an optimization that isn't fixing anything, and the 2.6.24 merge window is closed. 
From gdror at dev.mellanox.co.il Tue Oct 30 11:17:43 2007 From: gdror at dev.mellanox.co.il (Dror Goldenberg) Date: Tue, 30 Oct 2007 20:17:43 +0200 Subject: [ofa-general] to be discussed at the developer conference In-Reply-To: <4726E71B.4030802@voltaire.com> References: <4725E023.7070409@voltaire.com> <47265DF1.8060409@dev.mellanox.co.il> <4726DF7E.4030605@voltaire.com> <4726E4B1.9000002@dev.mellanox.co.il> <4726E71B.4030802@voltaire.com> Message-ID: <47277547.5050903@dev.mellanox.co.il> Or Gerlitz wrote: > Dror Goldenberg wrote: >> Or Gerlitz wrote: > >> I haven't yet prepared the presentation. I am willing to cover >> whatever you think is important. Indeed 20m allotted time is too >> short. So, I should either adjust myself to this short time-slot or >> ask for more. Given that the other sessions are also 20m, I was >> thinking to have a short talk (with less of contents). If you feel >> that people can benefit from longer presentation, I will be happy to >> get more time for it. 40-45m will be great. > > yes, assume you have 45m, so what we have now is: > > Sean - SA-caching - 45m > Dror - IPoIB stateless offload - 45m I think that 30m will do fine for IPoIB stateless offload. Can I get 30m? > > As for QoS, I understand you have a joint session with Sean, what I > think can be great if you will elaborate on is the HW support, eg in > connectX, AnafaII, etc what is there, what is missing, roadmap, etc. Will try to. Schedule is a bit tight. > > Or. > > From mshefty at ichips.intel.com Tue Oct 30 11:34:01 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Oct 2007 11:34:01 -0700 Subject: [ofa-general] umad agent question? In-Reply-To: <5p5klh$24okn2@rrcs-agw-01.hrndva.rr.com> References: <5p5klh$24okn2@rrcs-agw-01.hrndva.rr.com> Message-ID: <47277919.3060606@ichips.intel.com> > That was one thing I saw. The version of OFED installed on my test machine > is 1.1 and that must have been added later. This is not a problem. 
I just > was not expecting the behavior. We should make sure that the TID is not an issue. On the send side, the kernel will set the upper 32-bits of the TID. This is done to ensure uniqueness among multiple users. The kernel uses this value to retry requests until it receives a response. On the receiving side, the response MAD must set the TID to match what it received. It looks like the madtest code sets this correctly. Is this what you see? Also, in the following code: mad_set_field (mad, 0, IB_MAD_METHOD_F, method); mad_set_field (mad, 0, IB_MAD_RESPONSE_F, 0); Does this end up clearing the response bit? - Sean From mshefty at ichips.intel.com Tue Oct 30 11:23:31 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Oct 2007 11:23:31 -0700 Subject: [ofa-general] umad agent question? In-Reply-To: <5p5klh$24ohb3@rrcs-agw-01.hrndva.rr.com> References: <5p5klh$24ohb3@rrcs-agw-01.hrndva.rr.com> Message-ID: <472776A3.4070405@ichips.intel.com> > When I set vendor class to 15 instead of 9 everything works much better. I > suspect this means someone else is registered for 9. In that case the > register agent call should probably have not succeeded. Yes - this is what I'm not understanding. If something else is registered for class 9, the registration should have failed. I'm trying to trace through the code to see what's happening. - Sean From hrosenstock at xsigo.com Tue Oct 30 11:36:54 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 30 Oct 2007 11:36:54 -0700 Subject: [ofa-general] umad agent question? In-Reply-To: <472776A3.4070405@ichips.intel.com> References: <5p5klh$24ohb3@rrcs-agw-01.hrndva.rr.com> <472776A3.4070405@ichips.intel.com> Message-ID: <1193769414.26246.204.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-10-30 at 11:23 -0700, Sean Hefty wrote: > > When I set vendor class to 15 instead of 9 everything works much better. I > > suspect this means someone else is registered for 9. 
In that case the > > register agent call should probably have not succeeded. > > Yes - this is what I'm not understanding. If something else is > registered for class 9, the registration should have failed. Doesn't it depend on whether the methods in use overlap or are disjoint ? -- Hal > I'm trying > to trace through the code to see what's happening. > > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tziporet at dev.mellanox.co.il Tue Oct 30 11:49:13 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 30 Oct 2007 20:49:13 +0200 Subject: [ofa-general] Re: [ewg] OFED October 29 meeting summary on OFED 1.3 beta readiness In-Reply-To: References: <4727426C.5090504@mellanox.co.il> Message-ID: <47277CA9.2050502@mellanox.co.il> Hal Rosenstock wrote: > Hi Tziporet, > > On 10/30/07, Tziporet Koren wrote: > >> OFED October 29 meeting summary on OFED 1.3 beta readiness: >> >> 1. Beta release schedule: >> >> * The release is planed for next Monday Nov-5 >> * For this the rebase for 2.6.24-rc1 must be completed tomorrow >> * I will send status update on Thursday >> >> >> 2. Beta tasks status: >> >> 1. Fix compilation problems on PPC with 32 bits - Vlad & Oren >> (Mellanox) - on work >> 2. Rebase kernel code on 2.6.24 rc1 (depending it's availability) >> - on work (please read mail from Vlad with instructions) >> 3. SPEC files should be part of each user space package - each >> owner should take the spec file >> 4. Multiple uDAPL libs (1.0 & 2.0) - Vlad and Arlin (Intel) >> 5. Fix all compilation and install issues - All >> > > What about release notes ? 
> > -- Hal > > RN are not a must for the beta release (I updated the general notes) Anyone that have RN to update can send them to me against branch ofed_1_3 Tziporet From hrosenstock at xsigo.com Tue Oct 30 12:00:57 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 30 Oct 2007 12:00:57 -0700 Subject: [ofa-general] umad agent question? In-Reply-To: <5p5klh$24ohb3@rrcs-agw-01.hrndva.rr.com> References: <5p5klh$24ohb3@rrcs-agw-01.hrndva.rr.com> Message-ID: <1193770857.26246.217.camel@hrosenstock-ws.xsigo.com> Bob, On Tue, 2007-10-30 at 13:06 -0500, Robert Pearson wrote: > Sean, > > When I set vendor class to 15 instead of 9 everything works much better. I > suspect this means someone else is registered for 9. In that case the > register agent call should probably have not succeeded. It depends on whether the methods are already in use or not. If not, they can coexist. > The TID still gets clobbered and the QKEY ignored somewhere. Not sure what you mean by clobbered. Does the TID not follow the rule I just sent you ? How is QKey being set/used ? -- Hal > Bob > > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 30, 2007 11:09 AM > To: Robert Pearson > Cc: general at lists.openfabrics.org; Sasha Khapyorsky; 'Hal Rosenstock' > Subject: Re: [ofa-general] umad agent question? > > > I am trying to create a vendor (group1) class management agent using > > libibumad. I am successful in registering the agent with method mask set > > to 0xe = get/put/send. When I use a send message from another system the > > message is received but apparently not when I use get or set. I say > > apparently because the system issuing the get or set receives a response > > but the user agent never returns from umad_recv. Is there by any chance > > some sample code somewhere in the OFA tree that exercises this > > functionality that I could look at? Also, I am curious why the method > > mask does not cover the response bit. 
How does this work. If you are > > registered for get do you automatically get get_response packets? > > The method mask is only used for routing received unsolicited MADs. > I.e. those that are not response MADs. Any app can send a MAD and get > its response. Only one app is allowed to receive a non-response MAD. > > As for the problem that you mention, I don't understand the behavior > that you're seeing. > > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Tue Oct 30 12:01:10 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Oct 2007 12:01:10 -0700 Subject: [ofa-general] Service ID scope in IB Arch Spec A3.2.2 is incorrect, right? In-Reply-To: References: Message-ID: <47277F76.1060504@ichips.intel.com> Scott Guthridge wrote: > IB Architecture Spec, r1.2 section A3.2.2 says [emphasis added]: > > Each *port* on a CA may support a set of services. ... Since *not all > ports* support the same set of services... > > and later: > > "it is the combination of the Port GID and Service ID that identifies a > particular service provider" > > > But this seems to contradict chapter 12 (communication management) and > chapter A8 (device management) which consistently associate services with > channel adapters, not ports. See 12.6.5 table 99 (CA GUID), 12.6.8 table > 103 (CA GUID), 12.9.9 connection state table (CA GUID), etc. Similarly, > figure 309 "I/O Components and Relationships" in section A8.2.3 that shows > the DM agent being a component of the I/O Unit, and because the I/O unit is > associated with a single TCA, it follows that the DMA belongs to the > channel adapter, not to a particular port. 
> > The CM implementation in OFED 1.2 supports this notion that services are > defined per CA, not per port in that ib_create_cm_id doesn't take a port > number. In short, I really don't know the answer here. Automatic path migration allows a connection to migrate between ports on the same HCA. So, from at least that view, a service can be viewed as being defined per CA, not per port. However, service records are tied to a specific port. Also, the IB CM is not required to be implemented on each port; CM support is a per port attribute. Viewing the CM as a service is per port, not per CA. I'd need to verify this, but I don't think that a connection request architecturally even has to be received on the port that the connection will use. I wouldn't interpret too much from the ib_create_cm_id API. The use of the CA GUID in CM req/rep messages helps detect stale/duplicate connections, since QPs are per HCA, and not per port. I'm not sure how this relates to section A8. > So am I correct that A3.2.2 has it wrong? Would it be right to say that > with respect to provided services, all ports of a given CA are equal? I don't believe you can say this. The port attributes can be different. The ports could be on different subnets. An SM could be running on one port, but not another. Etc. - Sean From praveen at crlindia.com Tue Oct 30 12:58:14 2007 From: praveen at crlindia.com (Praveen M K) Date: Wed, 31 Oct 2007 01:28:14 +0530 Subject: [ofa-general] hai all , Plz Help Message-ID: <0AF7442124F01C49A6A93D8F04E5E3CF08B1D1@CHNEXVS01.VSNLXCHANGE.COM> hi, For running linpack using Voltairesm and Opensm will there be any performance difference due to the routing efficiency and if there is any performance difference, can anyone explain it. with regards Praveen This message (including any attachment) is confidential and may be legally privileged. Access to this message by anyone other than the intended recipient(s) listed above is unauthorized. 
If you are not the intended recipient you are hereby notified that any disclosure, copying, or distribution of the message, or any action taken or omission of action by you in reliance upon it, is prohibited and may be unlawful. Please immediately notify the sender by reply e-mail and permanently delete all copies of the message if you have received this message in error. From hrosenstock at xsigo.com Tue Oct 30 13:06:49 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 30 Oct 2007 13:06:49 -0700 Subject: [ofa-general] hai all , Plz Help In-Reply-To: <0AF7442124F01C49A6A93D8F04E5E3CF08B1D1@CHNEXVS01.VSNLXCHANGE.COM> References: <0AF7442124F01C49A6A93D8F04E5E3CF08B1D1@CHNEXVS01.VSNLXCHANGE.COM> Message-ID: <1193774809.26246.274.camel@hrosenstock-ws.xsigo.com> Praveen, On Wed, 2007-10-31 at 01:28 +0530, Praveen M K wrote: > hi, > For running linpack using Voltairesm and Opensm will there be any > performance difference due to the routing efficiency and if there is > any performance difference, can anyone explain it. Routing (actually pathing) algorithms are all beyond the IB spec. OpenSM has a number of supported algorithms as does VSM. Performance can vary with the routing algorithm used, depending on how well it distributes traffic across the various links. See the OpenSM man page for the supported algorithms. See Voltaire for their algorithms. -- Hal > with regards > Praveen > > This message (including any attachment) is confidential and may be > legally privileged. Access to this message by anyone other than the > intended recipient(s) listed above is unauthorized. If you are not the > intended recipient you are hereby notified that any disclosure, > copying, or distribution of the message, or any action taken or > omission of action by you in reliance upon it, is prohibited and may > be unlawful. 
Please immediately notify the sender by reply e-mail and > permanently delete all copies of the message if you have received this > message in error. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue Oct 30 13:38:17 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 13:38:17 -0700 Subject: [ofa-general] Re: to be discussed at the developer conference In-Reply-To: <4725E023.7070409@voltaire.com> (Or Gerlitz's message of "Mon, 29 Oct 2007 15:29:07 +0200") References: <4725E023.7070409@voltaire.com> Message-ID: At the highest level I think this "developer summit" is suffering from a lack of a clear goal. (The same could be said about the OpenFabrics alliance as a whole, but let's not get into that...) I'm supposed to give a talk about the basics of kernel development and I'm happy to do so, but that implies a certain target audience that is pretty disjoint from the developers who are leading development. In general the most valuable use of face-to-face time with code developers is to settle issues where email discussion has gotten stuck. If most people are not already familiar with the issue then it is very difficult to be productive. So with that said: > 1) the long time and endless threads related to the SA caching thing > need to be there. Sean - I saw that you prepare a session, correct? > will you presenting few possible designs? This is the perfect type of thing to try and settle. > 2) as for IPoIB stateless offload - with Eli and Liran not planned to > be there. Dror - do you intend to actually present the actual ipoib / > core / drivers related design and implementation? 
Given that there really hasn't even been an attempt to discuss this on the mailing list, I'm not convinced it's worth trying to rush through explaining it. I didn't think the patches were particularly hard to understand. > 4) IPoIB connected mode UC support - Roland, can work on this start > once the no-SRQ design/code is agreed and committed to a branch at > your git? Is there a spec for attaching UC QPs to SRQs? Other than that I think it's just a matter of someone caring enough to start working on it. > 5) IB 4K MTU - in IPoIB and elsewhere in the IB stack, same here, > Roland, do you think a short session is needed No -- I don't know of any issues that need face-to-face discussion. > 6) the netdev network batching RFCs - Krishna, Shirley, will someone > from IBM can prepare a session to educate us on the matter and the > status? Why do we need to spend face-to-face time on this? - R. From gdror at dev.mellanox.co.il Tue Oct 30 13:46:40 2007 From: gdror at dev.mellanox.co.il (Dror Goldenberg) Date: Tue, 30 Oct 2007 22:46:40 +0200 Subject: [ofa-general] Re: to be discussed at the developer conference In-Reply-To: References: <4725E023.7070409@voltaire.com> Message-ID: <47279830.9050407@dev.mellanox.co.il> Roland Dreier wrote: > > 2) as for IPoIB stateless offload - with Eli and Liran not planned to > > be there. Dror - do you intend to actually present the actual ipoib / > > core / drivers related design and implementation? > > Given that there really hasn't even been an attempt to discuss this on > the mailing list, I'm not convinced it's worth trying to rush through > explaining it. I didn't think the patches were particularly hard to > understand. > I will just give an overview, maybe backing up my arguments with numbers if needed. > > 4) IPoIB connected mode UC support - Roland, can work on this start > > once the no-SRQ design/code is agreed and committed to a branch at > > your git? > > Is there a spec for attaching UC QPs to SRQs? 
Other than that I think > it's just a matter of someone caring enough to start working on it. > You're an IBTA member, so you should have an early access to 1.2.1 draft at http://www.infinibandta.org/members/spec/V1r1_2_1.Release_19Aug07.pdf 
From kilian at stanford.edu Tue Oct 30 13:56:40 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Tue, 30 Oct 2007 13:56:40 -0700 Subject: [ofa-general] opensm: Unsupported attribute = 0xFF02 Message-ID: <200710301356.40137.kilian@stanford.edu> Hi, I'm trying to use opensm as a standby SM on a fabric where the master SM is running on a Cisco SFS-7000D (TopspinOS 2.9.0 releng #147) The switch (Master SM) logs report the following: Oct 30 12:08:46 10.0.100.2 ib_sm.x[588]: %IB-6-INFO: Configuration caused by SM role change Oct 30 12:08:55 10.0.100.2 ib_sm.x[605]: %IB-6-INFO: Initialize a backup session with Standby SM guid 00:05:ad:00:00:08:cf:0d Oct 30 12:09:05 10.0.100.2 ib_sm.x[605]: %IB-6-INFO: Session initialization failed with Standby SM guid 00:05:ad:00:00:08:cf:0d Oct 30 12:11:05 10.0.100.2 ib_sm.x[605]: %IB-6-INFO: Session not initiated: Cold Sync Limit exceeded for Standby SM guid 00:05:ad:00:00:08:cf:0d And the opensm (Standby SM) logs show this: Oct 30 12:08:45 013321 [95AB1160] -> OpenSM Rev:openib-3.0.13 Oct 30 12:08:45 013366 [95AB1160] -> OpenSM Rev:openib-3.0.13 Oct 30 12:08:45 031429 [95AB1160] -> osm_vendor_bind: Binding to port 0x5ad000008cf0d Oct 30 12:08:45 033276 [95AB1160] -> osm_vendor_bind: Binding to port 0x5ad000008cf0d Oct 30 12:08:46 064757 [45007960] -> Entering STANDBY state Oct 30 12:08:55 461419 [4780B960] -> __osm_sm_mad_ctrl_process_set: ERR 3107: Unsupported attribute = 0xFF02 Oct 30 12:08:55 461482 [4780B960] -> SMP dump: base_ver................0x1 mgmt_class..............0x1 class_ver...............0x1 method..................0x2 (SubnSet) status..................0x0 hop_ptr.................0x0 hop_count...............0x0 trans_id................0x377df6ce attr_id.................0xFF02 (UNKNOWN) resv....................0x0 attr_mod................0x1 m_key...................0x0000000000000000 MAD IS LID ROUTED I'm not sure what this ERR 3107 means, is there something I could do about it? 
Is there a way to use OpenSM as a standby SM with a managed switch? For information, I'm using OFED 1.2 (the Cisco fcs version) and details about the HCAs are below: # ibv_devinfo hca_id: mthca0 fw_ver: 1.2.917 node_guid: 0005:ad00:0008:cf0c sys_image_guid: 0005:ad00:0100:d050 vendor_id: 0x05ad vendor_part_id: 25204 hw_ver: 0xA0 board_id: HCA.Cheetah-DDR.20 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 2 port_lid: 4 port_lmc: 0x00 Thanks, -- Kilian From hrosenstock at xsigo.com Tue Oct 30 14:01:57 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 30 Oct 2007 14:01:57 -0700 Subject: [ofa-general] opensm: Unsupported attribute = 0xFF02 In-Reply-To: <200710301356.40137.kilian@stanford.edu> References: <200710301356.40137.kilian@stanford.edu> Message-ID: <1193778117.26246.325.camel@hrosenstock-ws.xsigo.com> Kilian, On Tue, 2007-10-30 at 13:56 -0700, Kilian CAVALOTTI wrote: > Hi, > > I'm trying to use opensm as a standby SM on a fabric where the master SM > is running on a Cisco SFS-7000D (TopspinOS 2.9.0 releng #147) > > The switch (Master SM) logs report the following: > > Oct 30 12:08:46 10.0.100.2 ib_sm.x[588]: %IB-6-INFO: Configuration caused by SM role change > Oct 30 12:08:55 10.0.100.2 ib_sm.x[605]: %IB-6-INFO: Initialize a backup session with Standby SM guid 00:05:ad:00:00:08:cf:0d > Oct 30 12:09:05 10.0.100.2 ib_sm.x[605]: %IB-6-INFO: Session initialization failed with Standby SM guid 00:05:ad:00:00:08:cf:0d > Oct 30 12:11:05 10.0.100.2 ib_sm.x[605]: %IB-6-INFO: Session not initiated: Cold Sync Limit exceeded for Standby SM guid 00:05:ad:00:00:08:cf:0d > > And the opensm (Standby SM) logs show this: > > Oct 30 12:08:45 013321 [95AB1160] -> OpenSM Rev:openib-3.0.13 > Oct 30 12:08:45 013366 [95AB1160] -> OpenSM Rev:openib-3.0.13 > Oct 30 12:08:45 031429 [95AB1160] -> osm_vendor_bind: Binding to port 0x5ad000008cf0d > Oct 30 12:08:45 033276 [95AB1160] -> osm_vendor_bind: Binding to port 
0x5ad000008cf0d > Oct 30 12:08:46 064757 [45007960] -> Entering STANDBY state > Oct 30 12:08:55 461419 [4780B960] -> __osm_sm_mad_ctrl_process_set: ERR 3107: Unsupported attribute = 0xFF02 > Oct 30 12:08:55 461482 [4780B960] -> SMP dump: > base_ver................0x1 > mgmt_class..............0x1 > class_ver...............0x1 > method..................0x2 (SubnSet) > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x0 > trans_id................0x377df6ce > attr_id.................0xFF02 (UNKNOWN) This is a proprietary SM attribute used by Cisco SM. Also, I believe the Cisco SM supports replication to standby's and that would be via proprietary means. > resv....................0x0 > attr_mod................0x1 > m_key...................0x0000000000000000 > MAD IS LID ROUTED > > I'm not sure what this ERR 3107 means, is there something I could do about > it? Is there a way to use OpenSM as a standby SM with a managed switch? No; SM flavors should not be mixed on a subnet. There are numerous reasons for this. -- Hal > For information, I'm using OFED 1.2 (the Cisco fcs version) and details > about the HCAs are below: > > # ibv_devinfo > hca_id: mthca0 > fw_ver: 1.2.917 > node_guid: 0005:ad00:0008:cf0c > sys_image_guid: 0005:ad00:0100:d050 > vendor_id: 0x05ad > vendor_part_id: 25204 > hw_ver: 0xA0 > board_id: HCA.Cheetah-DDR.20 > phys_port_cnt: 1 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 2 > port_lid: 4 > port_lmc: 0x00 > > Thanks, From changquing.tang at hp.com Tue Oct 30 14:05:13 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Tue, 30 Oct 2007 21:05:13 +0000 Subject: [ofa-general] How much pinned memory each QP needs ? 
Message-ID: Roland: We are running a big system with some memory issue(16K QPs), I want to know how much pinned memory each QP needs, the settings when creating QP is follows: qp_init_attr.cap.max_send_wr = 136; qp_init_attr.cap.max_recv_wr = 1; qp_init_attr.cap.max_send_sge = 1; qp_init_attr.cap.max_recv_sge = 1; qp_init_attr.cap.max_inline_data = 128; Thanks. --CQ From Arkady.Kanevsky at netapp.com Tue Oct 30 14:05:06 2007 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 30 Oct 2007 17:05:06 -0400 Subject: [ofa-general] Re: to be discussed at the developer conference In-Reply-To: References: <4725E023.7070409@voltaire.com> Message-ID: iWARP branch need time for connection management issues and a few others. There is impact on interoperability. Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Tuesday, October 30, 2007 4:38 PM > To: Or Gerlitz > Cc: EWG; OpenFabrics General; Dror Goldenberg > Subject: [ofa-general] Re: to be discussed at the developer conference > > At the highest level I think this "developer summit" is > suffering from a lack of a clear goal. (The same could be > said about the OpenFabrics alliance as a whole, but let's not > get into that...) I'm supposed to give a talk about the > basics of kernel development and I'm happy to do so, but that > implies a certain target audience that is pretty disjoint > from the developers who are leading development. > > In general the most valuable use of face-to-face time with > code developers is to settle issues where email discussion > has gotten stuck. If most people are not already familiar > with the issue then it is very difficult to be productive. 
> > So with that said: > > > 1) the long time and endless threads related to the SA > caching thing > need to be there. Sean - I saw that you > prepare a session, correct? > > will you presenting few possible designs? > > This is the perfect type of thing to try and settle. > > > 2) as for IPoIB stateless offload - with Eli and Liran not > planned to > be there. Dror - do you intend to actually > present the actual ipoib / > core / drivers related design > and implementation? > > Given that there really hasn't even been an attempt to > discuss this on the mailing list, I'm not convinced it's > worth trying to rush through explaining it. I didn't think > the patches were particularly hard to understand. > > > 4) IPoIB connected mode UC support - Roland, can work on > this start > once the no-SRQ design/code is agreed and > committed to a branch at > your git? > > Is there a spec for attaching UC QPs to SRQs? Other than > that I think it's just a matter of someone caring enough to > start working on it. > > > 5) IB 4K MTU - in IPoIB and elsewhere in the IB stack, > same here, > Roland, do you think a short session is needed > > No -- I don't know of any issues that need face-to-face discussion. > > > 6) the netdev network batching RFCs - Krishna, Shirley, > will someone > from IBM can prepare a session to educate us > on the matter and the > status? > > Why do we need to spend face-to-face time on this? > > - R. 
From kilian at stanford.edu Tue Oct 30 14:13:57 2007 From: kilian at stanford.edu (Kilian CAVALOTTI) Date: Tue, 30 Oct 2007 14:13:57 -0700 Subject: [ofa-general] opensm: Unsupported attribute = 0xFF02 In-Reply-To: <1193778117.26246.325.camel@hrosenstock-ws.xsigo.com> References: <200710301356.40137.kilian@stanford.edu> <1193778117.26246.325.camel@hrosenstock-ws.xsigo.com> Message-ID: <200710301413.57260.kilian@stanford.edu> Hi Hal, On Tuesday 30 October 2007 02:01:57 pm Hal Rosenstock wrote: > This is a proprietary SM attribute used by Cisco SM. Also, I believe > the Cisco SM supports replication to standby's and that would be via > proprietary means. Thanks for the info. > No; SM flavors should not be mixed on a subnet. There are numerous > reasons for this. All right, that's good to know. Thanks a lot! -- Kilian From rdreier at cisco.com Tue Oct 30 14:14:27 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 14:14:27 -0700 Subject: [ofa-general] How much pinned memory each QP needs ? In-Reply-To: (Changqing Tang's message of "Tue, 30 Oct 2007 21:05:13 +0000") References: Message-ID: > We are running a big system with some memory issue(16K QPs), I want to know how much > pinned memory each QP needs, the settings when creating QP is follows: > > qp_init_attr.cap.max_send_wr = 136; > qp_init_attr.cap.max_recv_wr = 1; > qp_init_attr.cap.max_send_sge = 1; > qp_init_attr.cap.max_recv_sge = 1; > qp_init_attr.cap.max_inline_data = 128; Not sure really without tracing through the code. The easiest way to find out would probably be to add a print statement into the low-level driver library to find out how big a buffer it allocates for the QP.
Naively, the max_send_wr is going to be rounded up to a power of 2, so 256, and so will the work request size, which has to be over the inline data size of 128, so that will be 256 bytes. So I would guess you'll end up using about 256 * 256 or 64 KB per QP. - R. From rpearson at systemfabricworks.com Tue Oct 30 14:24:07 2007 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Tue, 30 Oct 2007 16:24:07 -0500 Subject: [ofa-general] umad agent question? In-Reply-To: <47277919.3060606@ichips.intel.com> Message-ID: <5p5klh$24sp7t@rrcs-agw-01.hrndva.rr.com> Sean, Stepped out for a while. As I mentioned before code below was wrong although not related to problem. If anyone wants that simple mad test code fragment I would be happy to submit it fixed up as a coding example but I'm not sure how or where. Thanks for the help! Bob -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Tuesday, October 30, 2007 1:34 PM To: Robert Pearson Cc: 'Hal Rosenstock'; 'Hal Rosenstock'; general at lists.openfabrics.org Subject: Re: [ofa-general] umad agent question? > That was one thing I saw. The version of OFED installed on my test machine > is 1.1 and that must have been added later. This is not a problem. I just > was not expecting the behavior. We should make sure that the TID is not an issue. On the send side, the kernel will set the upper 32-bits of the TID. This is done to ensure uniqueness among multiple users. The kernel uses this value to retry requests until it receives a response. On the receiving side, the response MAD must set the TID to match what it received. It looks like the madtest code sets this correctly. Is this what you see? Also, in the following code: mad_set_field (mad, 0, IB_MAD_METHOD_F, method); mad_set_field (mad, 0, IB_MAD_RESPONSE_F, 0); Does this end up clearing the response bit? 
- Sean From changquing.tang at hp.com Tue Oct 30 14:27:11 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Tue, 30 Oct 2007 21:27:11 +0000 Subject: [ofa-general] How much pinned memory each QP needs ? In-Reply-To: References: Message-ID: So if I have 16K QP for a node, the QPs will use 16K*64K = 1G. It is fairly large memory. If I change max_send_wr=16, inline_data=16, the memory per QP will be 16*16 = 256 bytes ? I am asking Mellanox engineer, is there any document or will document the formula ? --CQ > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Tuesday, October 30, 2007 4:14 PM > To: Tang, Changqing > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] How much pinned memory each QP needs ? > > > We are running a big system with some memory > issue(16K QPs), I want to know how much > > pinned memory each QP needs, the settings when creating QP > is follows: > > > > qp_init_attr.cap.max_send_wr = 136; > > qp_init_attr.cap.max_recv_wr = 1; > > qp_init_attr.cap.max_send_sge = 1; > > qp_init_attr.cap.max_recv_sge = 1; > > qp_init_attr.cap.max_inline_data = 128; > > Not sure really without tracing through the code. The > easiest way to find out would probably be to add a print > statement into the low-level driver library to find out how > big a buffer it allocates for the QP. > > Naively, the max_send_wr is going to be rounded up to a power > of 2, so 256, and so will the work request size, which has to > be over the inline data size of 128, so that will be 256 > bytes. So I would guess you'll end up using about 256 * 256 > or 64 KB per QP. > > - R. > From hrosenstock at xsigo.com Tue Oct 30 14:37:43 2007 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 30 Oct 2007 14:37:43 -0700 Subject: [ofa-general] umad agent question? 
In-Reply-To: <5p5klh$24sp7t@rrcs-agw-01.hrndva.rr.com> References: <5p5klh$24sp7t@rrcs-agw-01.hrndva.rr.com> Message-ID: <1193780263.26246.336.camel@hrosenstock-ws.xsigo.com> On Tue, 2007-10-30 at 16:24 -0500, Robert Pearson wrote: > Sean, > > Stepped out for a while. > As I mentioned before code below was wrong What's the change ? > although not related to problem. > If anyone wants that simple mad test code fragment I would be happy to > submit it fixed up as a coding example but I'm not sure how or where. The IB diags are examples but use libibmad in addition to libibumad. There are even some which use vendor class MADs (ibping, ibsysstat, vendstat). -- Hal > Thanks for the help! > > Bob > > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 30, 2007 1:34 PM > To: Robert Pearson > Cc: 'Hal Rosenstock'; 'Hal Rosenstock'; general at lists.openfabrics.org > Subject: Re: [ofa-general] umad agent question? > > > That was one thing I saw. The version of OFED installed on my test machine > > is 1.1 and that must have been added later. This is not a problem. I just > > was not expecting the behavior. > > We should make sure that the TID is not an issue. On the send side, the > kernel will set the upper 32-bits of the TID. This is done to ensure > uniqueness among multiple users. The kernel uses this value to retry > requests until it receives a response. > > On the receiving side, the response MAD must set the TID to match what > it received. It looks like the madtest code sets this correctly. Is > this what you see? > > Also, in the following code: > > mad_set_field (mad, 0, IB_MAD_METHOD_F, method); > mad_set_field (mad, 0, IB_MAD_RESPONSE_F, 0); > > Does this end up clearing the response bit? > > - Sean From rdreier at cisco.com Tue Oct 30 14:46:05 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 14:46:05 -0700 Subject: [ofa-general] How much pinned memory each QP needs ? 
In-Reply-To: (Changqing Tang's message of "Tue, 30 Oct 2007 21:27:11 +0000") References: Message-ID: > If I change max_send_wr=16, inline_data=16, the memory per QP will > be 16*16 = 256 bytes ? With existing driver code, there's no way to go below 4KB because of the way memory is pinned. Also, the minimum work request size is currently 64 bytes. But max_send_wr=16, max_inline_data=16 should get you pretty close to 4KB. Maybe 8KB. > I am asking Mellanox engineer, is there any document or will document the formula ? It actually somewhat depends on the driver, because the way things are implemented has some flexibility. - R. From rdreier at cisco.com Tue Oct 30 14:52:06 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 14:52:06 -0700 Subject: [ofa-general] Re: [PATCH 5 of 5] mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ In-Reply-To: <200710300826.21258.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 30 Oct 2007 08:26:21 +0200") References: <200710241858.45305.jackm@dev.mellanox.co.il> <200710300826.21258.jackm@dev.mellanox.co.il> Message-ID: > You are correct, that is an unfortunate side-effect of the change, that > I missed. The largest CQ that an old libmlx4 will accept is 0x3fffff > (hard-coded in file libmlx4/src/verbs.c, procedure mlx4_create_cq() ). > The new limit returned in dev_lim is 0x400000. > > Does this mean that you prefer to increment the ABI? I'm not sure it's worth breaking the ABI for. Does it seem OK if we leave the kernel alone (well, change the in-kernel rounding, but leave the user-kernel interface the same), apply the libmlx4 change, and just have the limit on the CQ size be off by one? We can fix everything if there's a reason for an ABI break later. - R. 
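The rough arithmetic Roland gives in the "How much pinned memory each QP needs ?" thread above (round max_send_wr up to a power of two, round the work request size up to a power of two larger than the inline data size with a 64-byte minimum, and never pin less than 4 KB) can be written out as a small C helper. This is only a sketch of that back-of-the-envelope estimate, assuming a 4 KB page and ignoring the receive queue; the function names are hypothetical, and, as Roland says, the real figure depends on the low-level driver.

```c
#include <stddef.h>

/* Round v up to the next power of two (returns 1 for v <= 1). */
static size_t roundup_pow2(size_t v)
{
    size_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Hypothetical helper: rough send-queue footprint of one QP, following
 * the estimate in the thread. Not taken from any driver.
 */
static size_t estimate_qp_bytes(size_t max_send_wr, size_t max_inline_data)
{
    const size_t min_wqe = 64;  /* minimum work request size quoted in the thread */
    const size_t page = 4096;   /* assumed pinning granularity */
    /* the work request must be larger than the inline data it carries */
    size_t wqe = roundup_pow2(max_inline_data + 1);
    size_t bytes;

    if (wqe < min_wqe)
        wqe = min_wqe;
    bytes = roundup_pow2(max_send_wr) * wqe;
    /* round up to a whole number of pages */
    return (bytes + page - 1) / page * page;
}
```

With the original settings (max_send_wr = 136, max_inline_data = 128) this gives 256 * 256 bytes = 64 KB per QP, and with max_send_wr = 16, max_inline_data = 16 it gives the 4 KB floor, in line with the figures quoted in the thread.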
From rdreier at cisco.com Tue Oct 30 14:57:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 14:57:49 -0700 Subject: [ofa-general] [PATCH] Stop ib_fmr from contributing to the load average In-Reply-To: <20071025212538.GA27442@kryten> (Anton Blanchard's message of "Thu, 25 Oct 2007 16:25:38 -0500") References: <20071025212538.GA27442@kryten> Message-ID: thanks, applied. From or.gerlitz at gmail.com Tue Oct 30 15:11:03 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 31 Oct 2007 00:11:03 +0200 Subject: [ofa-general] Re: [ewg] Re: to be discussed at the developer conference In-Reply-To: References: <4725E023.7070409@voltaire.com> Message-ID: <15ddcffd0710301511p5f609270j97fb7bc6665942cf@mail.gmail.com> On 10/30/07, Roland Dreier wrote: > > So with that said: > > > 1) the long time and endless threads related to the SA caching thing > > need to be there. Sean - I saw that you prepare a session, correct? > > will you presenting few possible designs? > > This is the perfect type of thing to try and settle. I agree. Sean - I don't see how a two-year-old open issue can be settled in 30m; I would say we need between 45m and up to two hours for that. > > 2) as for IPoIB stateless offload - with Eli and Liran not planned to > > be there. Dror - do you intend to actually present the actual ipoib / > > core / drivers related design and implementation? > > Given that there really hasn't even been an attempt to discuss this on > the mailing list, I'm not convinced it's worth trying to rush through > explaining it. I didn't think the patches were particularly hard to > understand. I think it would be good to have Dror explain exactly what the HW can do (the Sonoma slides were very short on details).
Things I think we want to discuss are: (A) why to put a SW only optimization (LRO) in Infiniband/networking driver (IPoIB) (B) the IB ICRC based checksum offload patch which you called "silent data corruption enhancement" etc Dror - I don't see how 30m would be enough, I would say 45m and upto an hour > 4) IPoIB connected mode UC support - Roland, can work on this start > > once the no-SRQ design/code is agreed and committed to a branch at > > your git? > > Is there a spec for attaching UC QPs to SRQs? Other than that I think > it's just a matter of someone caring enough to start working on it. Here's the thing: with the SRQ/UC spec and implementation status being unclear, once the no-SRQ code is in some repository, we can start code a no-SRQ/UC implementation. As for open issues, pls see http://lists.openfabrics.org/pipermail/general/2007-July/thread.html#37644where in the second message on the thread MST states "The largest bit of work would be to add connection liveness detection code to active side." and then a whole discussion starts. If you tend to or just agree with Michael, can be helpful if we discuss how to do that. > 5) IB 4K MTU - in IPoIB and elsewhere in the IB stack, same here, > > Roland, do you think a short session is needed > > No -- I don't know of any issues that need face-to-face discussion. OK > 6) the netdev network batching RFCs - Krishna, Shirley, will someone > > from IBM can prepare a session to educate us on the matter and the > > status? > > Why do we need to spend face-to-face time on this? I thought that face-to-face meeting can include education, specifically when it is on interesting materials like this, which are about to effect the ipoib driver. Or. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rdreier at cisco.com Tue Oct 30 15:14:57 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 15:14:57 -0700 Subject: [ofa-general] [RFC/PATCH 2.6.24] ib/multicast: report errors on multicast groups if pkeys change In-Reply-To: <000201c814f7$500a2960$5acc180a@amr.corp.intel.com> (Sean Hefty's message of "Mon, 22 Oct 2007 15:03:00 -0700") References: <000201c814f7$500a2960$5acc180a@amr.corp.intel.com> Message-ID: > Pkey changes can invalidate multicast groups. Report errors on any > multicast group affected by a pkey change. I'm missing some context here. What is the issue that's being fixed here? What's the impact of not having this patch? (ie does this need to go into 2.6.24? Also 2.6.23.x? Or is 2.6.25 OK?) - R. From rdreier at cisco.com Tue Oct 30 15:17:10 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 15:17:10 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will pull some fixes for 2.6.24: Anton Blanchard (1): IB/fmr_pool: Stop ib_fmr threads from contributing to load average Dave Olson (1): IB/ipath: Fix incorrect use of sizeof on msg buffer (function argument) Michael Albaugh (1): IB/ipath: Limit length checksummed in eeprom Ralph Campbell (1): IB/ipath: Fix a race where s_last is updated without lock held Roland Dreier (2): IPoIB/cm: Fix receive QP cleanup IB/mlx4: Lock SQ lock in mlx4_ib_post_send() drivers/infiniband/core/fmr_pool.c | 8 ++++---- drivers/infiniband/hw/ipath/ipath_eeprom.c | 10 +++++++++- drivers/infiniband/hw/ipath/ipath_intr.c | 18 +++++++++--------- drivers/infiniband/hw/ipath/ipath_ruc.c | 14 +++++++++----- drivers/infiniband/hw/mlx4/qp.c | 4 ++-- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +- 6 files changed, 34 
insertions(+), 22 deletions(-) diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c index d7f6452..e8d5f6b 100644 --- a/drivers/infiniband/core/fmr_pool.c +++ b/drivers/infiniband/core/fmr_pool.c @@ -291,10 +291,10 @@ struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, atomic_set(&pool->flush_ser, 0); init_waitqueue_head(&pool->force_wait); - pool->thread = kthread_create(ib_fmr_cleanup_thread, - pool, - "ib_fmr(%s)", - device->name); + pool->thread = kthread_run(ib_fmr_cleanup_thread, + pool, + "ib_fmr(%s)", + device->name); if (IS_ERR(pool->thread)) { printk(KERN_WARNING PFX "couldn't start cleanup thread\n"); ret = PTR_ERR(pool->thread); diff --git a/drivers/infiniband/hw/ipath/ipath_eeprom.c b/drivers/infiniband/hw/ipath/ipath_eeprom.c index bcfa3cc..e7c25db 100644 --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c @@ -538,7 +538,15 @@ static u8 flash_csum(struct ipath_flash *ifp, int adjust) u8 *ip = (u8 *) ifp; u8 csum = 0, len; - for (len = 0; len < ifp->if_length; len++) + /* + * Limit length checksummed to max length of actual data. + * Checksum of erased eeprom will still be bad, but we avoid + * reading past the end of the buffer we were passed. 
+ */ + len = ifp->if_length; + if (len > sizeof(struct ipath_flash)) + len = sizeof(struct ipath_flash); + while (len--) csum += *ip++; csum -= ifp->if_csum; csum = ~csum; diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 6a5dd5c..c61f9da 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -453,7 +453,7 @@ skip_ibchange: } static void handle_supp_msgs(struct ipath_devdata *dd, - unsigned supp_msgs, char msg[512]) + unsigned supp_msgs, char *msg, int msgsz) { /* * Print the message unless it's ibc status change only, which @@ -461,9 +461,9 @@ static void handle_supp_msgs(struct ipath_devdata *dd, */ if (dd->ipath_lasterror & ~INFINIPATH_E_IBSTATUSCHANGED) { int iserr; - iserr = ipath_decode_err(msg, sizeof msg, - dd->ipath_lasterror & - ~INFINIPATH_E_IBSTATUSCHANGED); + iserr = ipath_decode_err(msg, msgsz, + dd->ipath_lasterror & + ~INFINIPATH_E_IBSTATUSCHANGED); if (dd->ipath_lasterror & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL | INFINIPATH_E_PKTERRS)) @@ -492,8 +492,8 @@ static void handle_supp_msgs(struct ipath_devdata *dd, } static unsigned handle_frequent_errors(struct ipath_devdata *dd, - ipath_err_t errs, char msg[512], - int *noprint) + ipath_err_t errs, char *msg, + int msgsz, int *noprint) { unsigned long nc; static unsigned long nextmsg_time; @@ -512,7 +512,7 @@ static unsigned handle_frequent_errors(struct ipath_devdata *dd, nextmsg_time = nc + HZ * 3; } else if (supp_msgs) { - handle_supp_msgs(dd, supp_msgs, msg); + handle_supp_msgs(dd, supp_msgs, msg, msgsz); supp_msgs = 0; nmsgs = 0; } @@ -525,14 +525,14 @@ static unsigned handle_frequent_errors(struct ipath_devdata *dd, static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) { - char msg[512]; + char msg[128]; u64 ignore_this_time = 0; int i, iserr = 0; int chkerrpkts = 0, noprint = 0; unsigned supp_msgs; int log_idx; - supp_msgs = handle_frequent_errors(dd, errs, msg, 
&noprint); + supp_msgs = handle_frequent_errors(dd, errs, msg, sizeof msg, &noprint); /* don't report errors that are masked */ errs &= ~dd->ipath_maskederrs; diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index 4b6b7ee..54c61a9 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -630,11 +630,8 @@ bail:; void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, enum ib_wc_status status) { - u32 last = qp->s_last; - - if (++last == qp->s_size) - last = 0; - qp->s_last = last; + unsigned long flags; + u32 last; /* See ch. 11.2.4.1 and 10.7.3.1 */ if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || @@ -658,4 +655,11 @@ void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, wc.port_num = 0; ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); } + + spin_lock_irqsave(&qp->s_lock, flags); + last = qp->s_last; + if (++last >= qp->s_size) + last = 0; + qp->s_last = last; + spin_unlock_irqrestore(&qp->s_lock, flags); } diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 6b33224..8cba9c5 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1282,7 +1282,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, int size; int i; - spin_lock_irqsave(&qp->rq.lock, flags); + spin_lock_irqsave(&qp->sq.lock, flags); ind = qp->sq.head; @@ -1448,7 +1448,7 @@ out: (qp->sq.wqe_cnt - 1)); } - spin_unlock_irqrestore(&qp->rq.lock, flags); + spin_unlock_irqrestore(&qp->sq.lock, flags); return err; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 8761077..059cf92 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -60,7 +60,7 @@ static struct ib_qp_attr ipoib_cm_err_attr = { .qp_state = IB_QPS_ERR }; -#define IPOIB_CM_RX_DRAIN_WRID 0x7fffffff +#define IPOIB_CM_RX_DRAIN_WRID 0xffffffff static struct 
ib_send_wr ipoib_cm_rx_drain_wr = { .wr_id = IPOIB_CM_RX_DRAIN_WRID, From mshefty at ichips.intel.com Tue Oct 30 15:17:33 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 30 Oct 2007 15:17:33 -0700 Subject: [ofa-general] umad agent question? In-Reply-To: <5p5klh$24noqp@rrcs-agw-01.hrndva.rr.com> References: <5p5klh$24noqp@rrcs-agw-01.hrndva.rr.com> Message-ID: <4727AD7D.9030703@ichips.intel.com> Btw, it appears that the sender calls umad_set_addr() with the lid in host order, but the receiver calls umad_set_addr() with the lid in network order. - Sean From pradeeps at linux.vnet.ibm.com Tue Oct 30 15:38:53 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 30 Oct 2007 15:38:53 -0700 Subject: [ofa-general] Re: [PATCH 4/4] [RFC] IPoIB/cm: Add connected mode support for devices without SRQs In-Reply-To: <200710261533.UlO70kYhcNvuPmut@cisco.com> References: <200710261533.UlO70kYhcNvuPmut@cisco.com> Message-ID: <4727B27D.2070207@linux.vnet.ibm.com> With these set of patches applied I see some random crashes and they all have one characteristic -they are associated with an skb. I am running netperf tests and most times see a crash. However, a few runs have also completed successfully. The tests were run on IBM HCA on ppc64 machines. Most often I see panics in either ipoib_cm_handle_tx_wc() when dev_kfree_skb_any() is called or in ipoib_cm_handle_rx_wc() when some skb operation is being performed. This appears to be a race wherein a freed (and maybe reused) skb is being accessed. I have not yet been able to put my finger on the offending code fragment. Continuing to investigate. 
Pradeep From rdreier at cisco.com Tue Oct 30 15:42:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 15:42:33 -0700 Subject: [ofa-general] Re: [PATCH 4/4] [RFC] IPoIB/cm: Add connected mode support for devices without SRQs In-Reply-To: <4727B27D.2070207@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Tue, 30 Oct 2007 15:38:53 -0700") References: <200710261533.UlO70kYhcNvuPmut@cisco.com> <4727B27D.2070207@linux.vnet.ibm.com> Message-ID: Are you testing a kernel with 1b524963 ("IPoIB/cm: Use common CQ for CM send completions") applied (it is already upstream)? It is possible that that introduced the bug rather than the non-SRQ CM patches. - R. From rdreier at cisco.com Tue Oct 30 15:45:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 15:45:33 -0700 Subject: [ofa-general] Re: [PATCH 4/4] [RFC] IPoIB/cm: Add connected mode support for devices without SRQs In-Reply-To: (Roland Dreier's message of "Tue, 30 Oct 2007 15:42:33 -0700") References: <200710261533.UlO70kYhcNvuPmut@cisco.com> <4727B27D.2070207@linux.vnet.ibm.com> Message-ID: > Are you testing a kernel with 1b524963 ("IPoIB/cm: Use common CQ for > CM send completions") applied (it is already upstream)? It is > possible that that introduced the bug rather than the non-SRQ CM > patches. Crud, I see a bug with that commit and non-SRQ: ipoib_cm_handle_tx_wc() does struct ipoib_cm_tx *tx = wc->qp->qp_context; and there's no reason for wc->qp to be set if the HCA does not handle SRQs. In fact there's no reason for wc->qp to be set for send completions in general. Not sure if this is the problem you're seeing, but now I need to figure out how to fix it... - R. 
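The "IB/ipath: Fix incorrect use of sizeof on msg buffer (function argument)" change in the pull request above deals with a standard C pitfall: a parameter declared as char msg[512] is adjusted to char *, so sizeof msg inside the callee is the size of a pointer, not 512. The fix in the diff is to pass the buffer size explicitly (the added msgsz argument). A minimal user-space sketch of both halves follows; the function names are hypothetical and this is not the ipath code.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Declaring the parameter as "char msg[512]" would give it exactly this
 * type: the array bound is ignored and msg is a plain pointer.
 */
static size_t size_seen_by_callee(char *msg)
{
    (void)msg;
    return sizeof msg;  /* sizeof(char *), whatever the caller's buffer size */
}

/*
 * Pattern used by the fix: the caller, which still sees the real array,
 * passes its size along with the pointer.
 */
static int format_err(char *msg, size_t msgsz, unsigned int code)
{
    return snprintf(msg, msgsz, "error 0x%x", code);
}
```

A caller that owns char buf[128] would call format_err(buf, sizeof buf, code); at that point sizeof buf is still 128 because buf has not yet decayed to a pointer.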
From pradeeps at linux.vnet.ibm.com Tue Oct 30 15:49:17 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 30 Oct 2007 15:49:17 -0700 Subject: [ofa-general] Re: [PATCH 4/4] [RFC] IPoIB/cm: Add connected mode support for devices without SRQs In-Reply-To: References: <200710261533.UlO70kYhcNvuPmut@cisco.com> <4727B27D.2070207@linux.vnet.ibm.com> Message-ID: <4727B4ED.2020602@linux.vnet.ibm.com> Roland Dreier wrote: > Are you testing a kernel with 1b524963 ("IPoIB/cm: Use common CQ for > CM send completions") applied (it is already upstream)? It is > possible that that introduced the bug rather than the non-SRQ CM > patches. Let me check. I pulled from your for-2.6.24 git tree yesterday afternoon and applied the 4 patches and started testing. Do you know of a bug in the common CQ code that may be causing this? Pradeep From sean.hefty at intel.com Tue Oct 30 15:42:01 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 30 Oct 2007 15:42:01 -0700 Subject: [ofa-general] [RFC/PATCH 2.6.24] ib/multicast: report errors on multicast groups if pkeys change In-Reply-To: References: <000201c814f7$500a2960$5acc180a@amr.corp.intel.com> Message-ID: <000701c81b46$16a40dd0$43c8180a@amr.corp.intel.com> > > Pkey changes can invalidate multicast groups. Report errors on any > > multicast group affected by a pkey change. > >I'm missing some context here. What is the issue that's being fixed >here? What's the impact of not having this patch? (ie does this need >to go into 2.6.24? Also 2.6.23.x? Or is 2.6.25 OK?) If the pkey table changes, all existing multicast groups are potentially invalidated. Without this patch, subscribers to those groups are left subscribed to a group that they can no longer communicate with. This patch determines which groups were affected by a pkey table change and reports errors to the users. IPoIB follows a similar process for pkey changes; see ipoib_event() and __ipoib_ib_dev_flush(pkey_event = 1). 
I don't think this is critical, but I'm assuming that changes to pkey tables without other port events occurring are rare. But since it is a fix and it's early, I asked to merge into 2.6.24. - Sean From rdreier at cisco.com Tue Oct 30 16:17:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 16:17:44 -0700 Subject: [ofa-general] Re: [PATCH 4/4] [RFC] IPoIB/cm: Add connected mode support for devices without SRQs In-Reply-To: (Roland Dreier's message of "Tue, 30 Oct 2007 15:45:33 -0700") References: <200710261533.UlO70kYhcNvuPmut@cisco.com> <4727B27D.2070207@linux.vnet.ibm.com> Message-ID: > Crud, I see a bug with that commit and non-SRQ: > ipoib_cm_handle_tx_wc() does > > struct ipoib_cm_tx *tx = wc->qp->qp_context; > > and there's no reason for wc->qp to be set if the HCA does not handle > SRQs. In fact there's no reason for wc->qp to be set for send > completions in general. Actually, I take that back. Every driver seems to set wc->qp in all cases, so I guess it is safe to rely on that now. (Which actually means that the table of RX QPs in the non-SRQ patch can be dropped so we make things dramatically simpler). But that means I really have no idea what your bug is. Could you say how you're running netperf so I can try to reproduce the crash? - R.
From jeremy.brown at qlogic.com Tue Oct 30 16:40:01 2007
From: jeremy.brown at qlogic.com (Jeremy Brown)
Date: Tue, 30 Oct 2007 16:40:01 -0700
Subject: [ofa-general] Re: [ewg] OFED October 29 meeting summary on OFED 1.3 beta readiness
In-Reply-To: <4727426C.5090504@mellanox.co.il>
References: <4727426C.5090504@mellanox.co.il>
Message-ID: <1193787601.19495.17.camel@citrine.pathscale.com>

On Tue, 2007-10-30 at 16:40 +0200, Tziporet Koren wrote:
> o Apply patches that fix warning of backport patches - Vlad
> (Mellanox) (one patch was not applied since we got no answer
> regarding it)

Yikes, I did drop that on the floor, didn't I? I'm sorry about that. Here's a reply:

On Thu, 2007-10-25 at 10:05 +0200, Jack Morgenstein wrote:
> Jeremy,
>
> Why did you remove the "likely" and "unlikely" macros?
>
> Isn't the compiler warning just on the missing "!= NULL" ?
>
> - Jack

It looks like we had something of an internal collision between two patches. The one we submitted fixes a problem where the likely()/unlikely() macros confuse gcc into thinking that tail could be used before it is assigned. (The engineer doesn't think gcc is producing better code due to the use of likely/unlikely here.)

Another change that could be used to fix the issue would be along these lines:

-	struct sk_buff *tail;
[...]
-	if (likely(skb_len && (tail = skb_peek_tail(&sk->sk_receive_queue))) &&
-	    unlikely(skb_tailroom(tail) >= skb_len)) {
-		skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
-		__kfree_skb(skb);
-		skb = tail;
+	if (likely(skb_len)) {
+		struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+		if (likely(tail) && unlikely(skb_tailroom(tail) >= skb_len)) {
+			skb_copy_bits(skb, 0, skb_put(tail, skb_len), skb_len);
+			__kfree_skb(skb);
+			skb = tail;
+		}

Which do you think looks better? Sorry for the delay!
Jeremy From pradeeps at linux.vnet.ibm.com Tue Oct 30 16:43:11 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 30 Oct 2007 16:43:11 -0700 Subject: [ofa-general] Re: [PATCH 4/4] [RFC] IPoIB/cm: Add connected mode support for devices without SRQs In-Reply-To: References: <200710261533.UlO70kYhcNvuPmut@cisco.com> <4727B27D.2070207@linux.vnet.ibm.com> Message-ID: <4727C18F.3010509@linux.vnet.ibm.com> Roland Dreier wrote: > > Crud, I see a bug with that commit and non-SRQ: > > ipoib_cm_handle_tx_wc() does > > > > struct ipoib_cm_tx *tx = wc->qp->qp_context; > > > > and there's no reason for wc->qp to be set if the HCA does not handle > > SRQs. In fact there's no reason for wc->qp to be set for send > > completions in general. > > Actually, I take that back. Every driver seems to set wc->qp in all > cases, so I guess it is safe to rely on that now. (Which actually > means that the table of RX QPs in the non-SRQ patch can be dropped so > we make things dramatically simpler). Yes, the rx_table was introduced when ehca did not set wc->qp. I know that Joachim Fenkes submitted a fix for that. I will confirm if that fix is already in this tree. > > But that means I really have no idea what your bug is. Could you say > how you're running netperf so I can try to reproduce the crash? Nothing fancy, I simply run "netperf -H < IP address> -l " I am using netperf 2.4.1 (I presume the version should not matter). Pradeep
From rdreier at cisco.com Tue Oct 30 22:21:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 30 Oct 2007 22:21:58 -0700 Subject: [ofa-general] Re: [PATCH 1/14 v2] nes: module and device initialization In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC078C9670@venom2> (Glenn Grundstrom's message of "Sun, 28 Oct 2007 13:21:30 -0500") References: <200710192001.l9JK1U8O021689@neteffect.com> <5E701717F2B2ED4EA60F87C8AA57B7CC078C9670@venom2> Message-ID: > Thanks Roland. Let me know when you have your branch ready. OK, I pushed out a "neteffect" branch at git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git This has the driver from your git tree plus some compile fixes and cleanups (added as separate commits, so you can see what I did). If it suits you, let's work against that tree to continue cleaning things up -- you can send me patches or git pull requests to pick up new things. - R.
From vlad at lists.openfabrics.org Wed Oct 31 02:56:40 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 31 Oct 2007 02:56:40 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20071031-0200 daily build status Message-ID: <20071031095640.50BB6E60914@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with
linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From ogerlitz at voltaire.com Wed Oct 31 03:56:19 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 31 Oct 2007 12:56:19 +0200 Subject: [ofa-general] librdmacm 1.0.4 release In-Reply-To: <472755C4.10600@ichips.intel.com> References: 
<000101c81a64$3582de80$9c98070a@amr.corp.intel.com> <4726EEAC.3070105@voltaire.com> <472755C4.10600@ichips.intel.com> Message-ID: <47285F53.4060402@voltaire.com> Sean Hefty wrote: >>> librdmacm/man: update man pages to clarify connection request params >> >> I think you have mentioned that some documentation update is planned? > > See the man page updates that were made. There may still be some errors > or omissions, but I tried to address Doug's comments. Looking in the man directory diff between librdmacm 1.0.3 to 1.0.4 I see that you added description of the conn param fields for UD and CONN in the man page of rdma_get_cm_event, where some (most) of the CONN params are also documented in the man pages of rdma_connect and rdma_accept, does it makes sense to you to have some cleanup here, putting all the description in one page (eg rdma_get_cm_event) and in the connect and accept pages point to that page and just state what need to be fill by each side. More re conn params, and also following questions I got from people coding to librdmacm/libibverbs - for CONN the RNR and ACK timeouts are being set by the core kernel (rdmacm, cm) code. Adding some mentioning to this at the librdmacm man pages would save the need to explain it to people again and again, they can be just sent to the manual... would you prefer some text from me or you can add it? >> - param.retry_count is ignored in the passive side rdma-cm code and >> the IB cm uses the one present in the req message. > > correct - there's a comment in the header file about the passive side > ignoring this value lets put it also in the man page - ok? >> - param.rnr_retry_count is not ignored in the passive side, but from >> looking in the code, I was not sure if the value used is the one >> present in the req or the one supplied by the passive consumer. > > The passive side uses the value from the req. The active side uses the > value from the rep. I guess this is dictated by the IB spec... 
oh well, maybe they wanted to allow for asymmetric routing or app level schemes, let it be, and just document it - ok? >> - param.flow_control is a pure SW field which does not get into the QP >> attr. My understanding is that IB RC flow-control means non zero rnr >> counter, is this all? if yes, maybe we need to expose only >> rnr_retry_count field > It's a property of the HCA, but it's not clear to me at the moment what > a user does with this field. maybe the design philosophy of the IB spec here was to let the user tell the HCA "don't send RNR NAK for this QP when there is no RX buffer posted"? in second thought it does not makes sense, since ACKs are not optional. Anyway, I prefer to document at the man that this property is just value being exchanged between the active and passive side and it does not translate to anything wrt to the HW - what do you think? Or. From ogerlitz at voltaire.com Wed Oct 31 04:50:16 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 31 Oct 2007 13:50:16 +0200 (IST) Subject: [ofa-general] same port (SID) listeners Message-ID: Hi Sean, I noted that I can open two rdma-cm based listeners (*) on the same IB port each on a different partition using the same tcp port, and it works fine. As they both translate to the same IB SID, is there some matching logic in the rdma-cm level? Or. (*) say $ rping -s -a local-IP-X -p 1234 and $ rping -s -a local-IP-Y -p 1234 where X is the address of ib0 and Y the address of ib0.8007 From eli at mellanox.co.il Wed Oct 31 05:13:13 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 31 Oct 2007 14:13:13 +0200 Subject: [ofa-general] suggest naming convention for "fixes" patches Message-ID: <1193832793.6053.58.camel@mtls03> In the previous ofa releases we had many fixes with names such as t_00xxxx and zzz_xxx etc. I suggest the following convention to make this more ordered. 
The name of a patch starts with the name of the module then a number that dictates the order and then a free descriptive name for the patch. For example: ipoib_0010_cq_coalescing.patch mlx4_0070_modify_cq.patch What do you think? From jackm at mellanox.co.il Wed Oct 31 05:53:29 2007 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 31 Oct 2007 14:53:29 +0200 Subject: [ofa-general] RE: suggest naming convention for "fixes" patches In-Reply-To: <1193832793.6053.58.camel@mtls03> References: <1193832793.6053.58.camel@mtls03> Message-ID: <6C2C79E72C305246B504CBA17B5500C9028DE494@mtlexch01.mtl.com> Sounds good. - Jack > -----Original Message----- > From: Eli Cohen > Sent: Wednesday, October 31, 2007 2:13 PM > To: openfabrics > Cc: Vladimir Sokolovsky; Jack Morgenstein; Tziporet Koren > Subject: suggest naming convention for "fixes" patches > > In the previous ofa releases we had many fixes with names such as > t_00xxxx and zzz_xxx etc. I suggest the following convention to make > this more ordered. The name of a patch starts with the name of the > module then a number that dictates the order and then a free > descriptive > name for the patch. For example: > > ipoib_0010_cq_coalescing.patch > mlx4_0070_modify_cq.patch > > What do you think? > From mshefty at ichips.intel.com Wed Oct 31 08:19:20 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 31 Oct 2007 08:19:20 -0700 Subject: [ofa-general] same port (SID) listeners In-Reply-To: References: Message-ID: <47289CF8.8050000@ichips.intel.com> > I noted that I can open two rdma-cm based listeners (*) on the same IB port > each on a different partition using the same tcp port, and it works fine. > > As they both translate to the same IB SID, is there some matching > logic in the rdma-cm level? There's additional matching in the IB CM against the private data, where the IP addresses are carried. 
- Sean From ossrosch at linux.vnet.ibm.com Wed Oct 31 09:07:03 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 31 Oct 2007 17:07:03 +0100 Subject: [ofa-general] [Patch 0/3]ehca: Patchset to backport 2.6.24-rc1 kernel base Message-ID: <200710311707.04974.ossrosch@linux.vnet.ibm.com> These three patches are the backports against the new 2.6.24-rc1 kernel base. [patch 1/3] - In kernel version 2.6.17 and lower the interface for register/unregister_hotcpu_notifier() is missing. This patch includes the backport for linux/cpu.h to the concerning kernel versions.
[patch 2/3] - Starting with kernel version 2.6.24, ehca uses the sg_page() interface, which does not exist in older kernels. So this patch adds a backport for sg_page() in linux/scatterlist.h for all kernel versions lower than 2.6.24. [patch 3/3] - ibmebus changes the location code interface in 2.6.24-rc1. Because those changes are not available in older kernel versions, we have to backport all older kernels to use the old version of ibmebus. kind regards Stefan Roscher From ossrosch at linux.vnet.ibm.com Wed Oct 31 09:07:16 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 31 Oct 2007 17:07:16 +0100 Subject: [ofa-general] [Patch 1/3]ehca: Patchset to backport 2.6.24-rc1 kernel base Message-ID: <200710311707.18454.ossrosch@linux.vnet.ibm.com> This patch backports the xxx_hotcpu_notifier() interface to older kernel versions. Signed-off-by: Stefan Roscher --- 2.6.16/include/linux/cpu.h | 7 +++++++ 2.6.16_sles10/include/linux/cpu.h | 7 +++++++ 2.6.16_sles10_sp1/include/linux/cpu.h | 7 +++++++ 2.6.17/include/linux/cpu.h | 7 +++++++ 2.6.9_U5/include/linux/cpu.h | 7 +++++++ 5 files changed, 35 insertions(+) diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.16/include/linux/cpu.h linux-2.6_new/kernel_addons/backport/2.6.16/include/linux/cpu.h --- linux-2.6_old/kernel_addons/backport/2.6.16/include/linux/cpu.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.16/include/linux/cpu.h 2007-10-31 10:37:08.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_17_LINUX_CPU_H +#define BACKPORT_2_6_17_LINUX_CPU_H + +#include_next +#define register_hotcpu_notifier(nb) register_cpu_notifier(nb) +#define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) +#endif diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.16_sles10/include/linux/cpu.h linux-2.6_new/kernel_addons/backport/2.6.16_sles10/include/linux/cpu.h --- linux-2.6_old/kernel_addons/backport/2.6.16_sles10/include/linux/cpu.h 1969-12-31 19:00:00.000000000 -0500 +++
linux-2.6_new/kernel_addons/backport/2.6.16_sles10/include/linux/cpu.h 2007-10-31 10:37:40.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_17_LINUX_CPU_H +#define BACKPORT_2_6_17_LINUX_CPU_H + +#include_next +#define register_hotcpu_notifier(nb) register_cpu_notifier(nb) +#define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) +#endif diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/cpu.h linux-2.6_new/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/cpu.h --- linux-2.6_old/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/cpu.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/cpu.h 2007-10-31 10:38:08.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_17_LINUX_CPU_H +#define BACKPORT_2_6_17_LINUX_CPU_H + +#include_next +#define register_hotcpu_notifier(nb) register_cpu_notifier(nb) +#define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) +#endif diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.17/include/linux/cpu.h linux-2.6_new/kernel_addons/backport/2.6.17/include/linux/cpu.h --- linux-2.6_old/kernel_addons/backport/2.6.17/include/linux/cpu.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.17/include/linux/cpu.h 2007-10-31 10:38:31.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_17_LINUX_CPU_H +#define BACKPORT_2_6_17_LINUX_CPU_H + +#include_next +#define register_hotcpu_notifier(nb) register_cpu_notifier(nb) +#define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) +#endif diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.9_U5/include/linux/cpu.h linux-2.6_new/kernel_addons/backport/2.6.9_U5/include/linux/cpu.h --- linux-2.6_old/kernel_addons/backport/2.6.9_U5/include/linux/cpu.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.9_U5/include/linux/cpu.h 2007-10-31 10:38:57.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_17_LINUX_CPU_H 
+#define BACKPORT_2_6_17_LINUX_CPU_H + +#include_next +#define register_hotcpu_notifier(nb) register_cpu_notifier(nb) +#define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) +#endif From ossrosch at linux.vnet.ibm.com Wed Oct 31 09:07:24 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 31 Oct 2007 17:07:24 +0100 Subject: [ofa-general] [Patch 2/3]ehca: Patchset to backport 2.6.24-rc1 kernel base Message-ID: <200710311707.26457.ossrosch@linux.vnet.ibm.com> This patch backports the sg_page() interface to older kernelversions. Signed-off-by: Stefan Roscher --- 2.6.16/include/linux/scatterlist.h | 7 +++++++ 2.6.16_sles10/include/linux/scatterlist.h | 7 +++++++ 2.6.16_sles10_sp1/include/linux/scatterlist.h | 7 +++++++ 2.6.17/include/linux/scatterlist.h | 7 +++++++ 2.6.18-EL5.1/include/linux/scatterlist.h | 7 +++++++ 2.6.18/include/linux/scatterlist.h | 7 +++++++ 2.6.19/include/linux/scatterlist.h | 7 +++++++ 2.6.20/include/linux/scatterlist.h | 7 +++++++ 2.6.21/include/linux/scatterlist.h | 7 +++++++ 2.6.22/include/linux/scatterlist.h | 7 +++++++ 2.6.23/include/linux/scatterlist.h | 7 +++++++ 11 files changed, 77 insertions(+) diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.16/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.16/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.16/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.16/include/linux/scatterlist.h 2007-10-31 10:41:34.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.16_sles10/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.16_sles10/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.16_sles10/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 
-0500 +++ linux-2.6_new/kernel_addons/backport/2.6.16_sles10/include/linux/scatterlist.h 2007-10-31 10:41:37.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.16_sles10_sp1/include/linux/scatterlist.h 2007-10-31 10:41:41.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.17/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.17/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.17/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.17/include/linux/scatterlist.h 2007-10-31 10:41:45.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.18/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.18/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.18/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.18/include/linux/scatterlist.h 2007-10-31 10:41:49.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp 
linux-2.6_old/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.18-EL5.1/include/linux/scatterlist.h 2007-10-31 10:41:52.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.19/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.19/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.19/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.19/include/linux/scatterlist.h 2007-10-31 10:41:56.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.20/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.20/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.20/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.20/include/linux/scatterlist.h 2007-10-31 10:42:00.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.21/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.21/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.21/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.21/include/linux/scatterlist.h 2007-10-31 
10:42:04.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.22/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.22/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.22/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.22/include/linux/scatterlist.h 2007-10-31 10:42:07.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + diff -Nurp linux-2.6_old/kernel_addons/backport/2.6.23/include/linux/scatterlist.h linux-2.6_new/kernel_addons/backport/2.6.23/include/linux/scatterlist.h --- linux-2.6_old/kernel_addons/backport/2.6.23/include/linux/scatterlist.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_addons/backport/2.6.23/include/linux/scatterlist.h 2007-10-31 10:43:46.000000000 -0400 @@ -0,0 +1,7 @@ +#ifndef BACKPORT_2_6_23_LINUX_SCATTERLIST_H +#define BACKPORT_2_6_23_LINUX_SCATTERLIST_H + +#include_next +#define sg_page(x) (x)->page +#endif + From ossrosch at linux.vnet.ibm.com Wed Oct 31 09:07:29 2007 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 31 Oct 2007 17:07:29 +0100 Subject: [ofa-general] [Patch 3/3]ehca: Patchset to backport 2.6.24-rc1 kernel base Message-ID: <200710311707.30521.ossrosch@linux.vnet.ibm.com> This patch replaces the new ibmebus location code in ehca with the old one for older kernels. 
Signed-off-by: Stefan Roscher --- 2.6.16/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.16_sles10/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.16_sles10_sp1/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.17/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.18-EL5.1/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.18/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.19/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.20/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.21/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.22/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.23/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 2.6.9_U5/ehca_01_ibmebus_loc_code.patch | 137 +++++++++++++++++++++++ 12 files changed, 1644 insertions(+) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.16/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.16/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.16/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.16/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:14.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- 
linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static 
int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- 
.match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.16_sles10/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.16_sles10/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.16_sles10/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.16_sles10/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:18.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, 
ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA 
handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.16_sles10_sp1/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.16_sles10_sp1/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.16_sles10_sp1/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ 
linux-2.6_new/kernel_patches/backport/2.6.16_sles10_sp1/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:24.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 
linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = 
sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.17/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.17/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.17/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.17/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:28.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 
num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + 
shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, 
&ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.18/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.18/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.18/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.18/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:30.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, 
ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + 
ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.18-EL5.1/ehca_01_ibmebus_loc_code.patch 
linux-2.6_new/kernel_patches/backport/2.6.18-EL5.1/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.18-EL5.1/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.18-EL5.1/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:33.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- 
ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; 
+ + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.19/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.19/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.19/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.19/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:37.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 
2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + 
shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = 
dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.20/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.20/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.20/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.20/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:39.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + 
+ /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct 
ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void 
ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.21/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.21/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.21/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.21/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:41.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int 
ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + 
shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.22/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.22/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.22/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.22/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:43.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- 
linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int 
ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int 
__devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.23/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.23/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.23/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.23/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:45.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ 
linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct 
ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 +700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static 
struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) diff -Nurp linux-2.6_old/kernel_patches/backport/2.6.9_U5/ehca_01_ibmebus_loc_code.patch linux-2.6_new/kernel_patches/backport/2.6.9_U5/ehca_01_ibmebus_loc_code.patch --- linux-2.6_old/kernel_patches/backport/2.6.9_U5/ehca_01_ibmebus_loc_code.patch 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6_new/kernel_patches/backport/2.6.9_U5/ehca_01_ibmebus_loc_code.patch 2007-10-31 10:47:50.000000000 -0400 @@ -0,0 +1,137 @@ +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 08:51:40.264936336 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_classes.h 2007-10-30 10:39:52.595875776 -0800 +@@ -107,7 +107,7 @@ struct ehca_sport { + + struct ehca_shca { + struct ib_device ib_device; +- struct of_device *ofdev; ++ struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 08:51:40.265936184 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_eq.c 2007-10-30 10:42:52.454956424 -0800 +@@ -123,7 +123,7 @@ int ehca_create_eq(struct ehca_shca *shc + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_eq, ++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, + IRQF_DISABLED, "ehca_eq", + (void *)shca); + if (ret < 0) +@@ -131,7 +131,7 @@ int ehca_create_eq(struct ehca_shca *shc + + tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); + } else if (type == EHCA_NEQ) { +- ret = ibmebus_request_irq(eq->ist, ehca_interrupt_neq, 
++ ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, + IRQF_DISABLED, "ehca_neq", + (void *)shca); + if (ret < 0) +@@ -171,7 +171,7 @@ int ehca_destroy_eq(struct ehca_shca *sh + u64 h_ret; + + spin_lock_irqsave(&eq->spinlock, flags); +- ibmebus_free_irq(eq->ist, (void *)shca); ++ ibmebus_free_irq(NULL, eq->ist, (void *)shca); + + h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + +diff -Nurp linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c +--- linux-2.6_old/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 08:51:40.267935880 -0800 ++++ linux-2.6_new/drivers/infiniband/hw/ehca/ehca_main.c 2007-10-30 10:58:15.135992904 -0800 +@@ -418,7 +418,7 @@ int ehca_init_device(struct ehca_shca *s + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + shca->ib_device.num_comp_vectors = 1; +- shca->ib_device.dma_device = &shca->ofdev->dev; ++ shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; +@@ -672,7 +672,7 @@ static struct attribute_group ehca_dev_a + .attrs = ehca_dev_attrs + }; + +-static int __devinit ehca_probe(struct of_device *dev, ++static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) + { + struct ehca_shca *shca; +@@ -680,16 +680,16 @@ static int __devinit ehca_probe(struct o + struct ib_pd *ibpd; + int ret; + +- handle = of_get_property(dev->node, "ibm,hca-handle", NULL); ++ handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + ehca_gen_err("Cannot get eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + ehca_gen_err("Wrong eHCA handle for adapter: %s.", +- dev->node->full_name); ++ dev->ofdev.node->full_name); + return -ENODEV; + } + +@@ -700,9 
+700,9 @@ static int __devinit ehca_probe(struct o + } + mutex_init(&shca->modify_mutex); + +- shca->ofdev = dev; ++ shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; +- dev->dev.driver_data = shca; ++ dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { +@@ -778,7 +778,7 @@ static int __devinit ehca_probe(struct o + } + } + +- ret = sysfs_create_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ ret = sysfs_create_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + if (ret) /* only complain; we can live without attributes */ + ehca_err(&shca->ib_device, + "Cannot create device attributes ret=%d", ret); +@@ -828,12 +828,12 @@ probe1: + return -EINVAL; + } + +-static int __devexit ehca_remove(struct of_device *dev) ++static int __devexit ehca_remove(struct ibmebus_dev *dev) + { +- struct ehca_shca *shca = dev->dev.driver_data; ++ struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + +- sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp); ++ sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); + + if (ehca_open_aqp1 == 1) { + int i; +@@ -884,11 +884,11 @@ static struct of_device_id ehca_device_t + {}, + }; + +-static struct of_platform_driver ehca_driver = { +- .name = "ehca", +- .match_table = ehca_device_table, +- .probe = ehca_probe, +- .remove = ehca_remove, ++static struct ibmebus_driver ehca_driver = { ++ .name = "ehca", ++ .id_table = ehca_device_table, ++ .probe = ehca_probe, ++ .remove = ehca_remove, + }; + + void ehca_poll_eqs(unsigned long data) From mshefty at ichips.intel.com Wed Oct 31 10:45:46 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 31 Oct 2007 10:45:46 -0700 Subject: [ofa-general] librdmacm 1.0.4 release In-Reply-To: <47285F53.4060402@voltaire.com> References: <000101c81a64$3582de80$9c98070a@amr.corp.intel.com> <4726EEAC.3070105@voltaire.com> <472755C4.10600@ichips.intel.com> <47285F53.4060402@voltaire.com> Message-ID: <4728BF4A.1060301@ichips.intel.com> > 
Looking in the man directory diff between librdmacm 1.0.3 and 1.0.4 I see > that you added a description of the conn param fields for UD and CONN in > the man page of rdma_get_cm_event, where some (most) of the CONN params > are also documented in the man pages of rdma_connect and rdma_accept, > does it make sense to you to have some cleanup here, putting all the > description in one page (eg rdma_get_cm_event) and in the connect and > accept pages point to that page and just state what needs to be filled in by > each side. The text is slightly different in places depending on the context. > More re conn params, and also following questions I got from people > coding to librdmacm/libibverbs - for CONN the RNR and ACK timeouts are > being set by the core kernel (rdmacm, cm) code. Adding some mention > of this to the librdmacm man pages would save the need to explain it to > people again and again, they can be just sent to the manual... would you > prefer some text from me or you can add it? I don't understand the source of the confusion yet. The values that are used are based on what's passed in by the user. All QP attributes are set by the kernel code when the QP is modified by the library. >>> - param.retry_count is ignored in the passive side rdma-cm code and >>> the IB cm uses the one present in the req message. >> >> correct - there's a comment in the header file about the passive side >> ignoring this value > > let's put it also in the man page - ok? Yes - rdma_accept man page has been updated (not pushed yet) to indicate that this value is ignored. > I guess this is dictated by the IB spec... oh well, maybe they wanted to > allow for asymmetric routing or app level schemes, let it be, and just > document it - ok? The rdma_connect and rdma_accept man pages have been updated to state that the rnr_retry_count value applies to the remote peer.
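To illustrate the two points above (retry_count being ignored on the passive side, and rnr_retry_count applying to the remote peer), here is a minimal sketch. It is not librdmacm code: struct conn_param_sketch and both helper functions are hypothetical names introduced only for this illustration, and the struct merely mirrors two fields of the library's struct rdma_conn_param so that the example builds without the library headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the retry-related fields of librdmacm's
 * struct rdma_conn_param (assumption: only these two fields matter here).
 */
struct conn_param_sketch {
	uint8_t retry_count;     /* transport retries; honoured from the active side */
	uint8_t rnr_retry_count; /* RNR retries this side asks its peer to use */
};

/*
 * retry_count that takes effect for the connection, per the thread:
 * the passive side's value is ignored and the value carried in the
 * active side's request is used.
 */
static inline uint8_t effective_retry_count(const struct conn_param_sketch *active,
					    const struct conn_param_sketch *passive)
{
	assert(active != NULL);
	(void)passive; /* deliberately unused: ignored on the passive side */
	return active->retry_count;
}

/*
 * RNR retry count the remote peer ends up using: it is the value this
 * side supplied in its own conn params, not the peer's.
 */
static inline uint8_t rnr_retry_used_by_peer(const struct conn_param_sketch *mine)
{
	assert(mine != NULL);
	return mine->rnr_retry_count;
}
```

With an active side that sets retry_count = 7 and a passive side that sets retry_count = 1, effective_retry_count() yields 7. Each side's rnr_retry_count is what the other side would use, which is why the two directions of a connection can end up with different values.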
> maybe the design philosophy of the IB spec here was to let the user tell > the HCA "don't send RNR NAK for this QP when there is no RX buffer > posted"? on second thought it does not make sense, since ACKs are not > optional. Anyway, I prefer to document in the man page that this property is > just a value being exchanged between the active and passive side and it > does not translate to anything wrt the HW - what do you think? I will see if I can come up with something useful to say here. - Sean From arthur.jones at qlogic.com Wed Oct 31 12:13:48 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Wed, 31 Oct 2007 12:13:48 -0700 Subject: [ofa-general] ofed_kernel merged with 2.6.24-rc1 patches update required In-Reply-To: <4725F5D9.6050301@dev.mellanox.co.il> References: <4725F5D9.6050301@dev.mellanox.co.il> Message-ID: <20071031191348.GB4551@bauxite.pathscale.com> hi vladimir, are the backport patches posted to an open mailing list somewhere? i would like to see them on (general|ewg)@openfabrics.org so that they can be reviewed before going in to your tree. occasionally i do see them on these lists, but, i don't think they are all seeing these lists before hitting your tree... arthur On Mon, Oct 29, 2007 at 05:01:45PM +0200, Vladimir Sokolovsky wrote: > Hello, > There is a new branch "ofed_kernel_2_6_24_rc1" under > git://git.openfabrics.org/ofed_1_3/linux-2.6.git > > All patches from kernel_patches/fixes that were applied in 2.6.24-rc1 were > removed from the kernel_patches/fixes directory. > The "problematic" patches from kernel_patches/fixes were moved to the > kernel_patches/attic directory. > > Backport patches and fixes should be updated according to the new kernel > tree. > The easy way to do so is to use the "ofed_scripts/ofed_makedist.sh" utility, > which creates a tgz file for every supported kernel with all relevant patches > applied.
> > We want to move to the new branch on this Wednesday (31 Oct 2007) > Please send me updated backport patches and fixes by tomorrow. > > > Regards, > Vladimir > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From pradeeps at linux.vnet.ibm.com Wed Oct 31 12:49:09 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 31 Oct 2007 12:49:09 -0700 Subject: [ofa-general] Re: [PATCH 4/4] [RFC] IPoIB/cm: Add connected mode support for devices without SRQs In-Reply-To: <4727C18F.3010509@linux.vnet.ibm.com> References: <200710261533.UlO70kYhcNvuPmut@cisco.com> <4727B27D.2070207@linux.vnet.ibm.com> <4727C18F.3010509@linux.vnet.ibm.com> Message-ID: <4728DC35.5030600@linux.vnet.ibm.com> Pradeep Satyanarayana wrote: > Roland Dreier wrote: >> > Crud, I see a bug with that commit and non-SRQ: >> > ipoib_cm_handle_tx_wc() does >> > >> > struct ipoib_cm_tx *tx = wc->qp->qp_context; >> > >> > and there's no reason for wc->qp to be set if the HCA does not handle >> > SRQs. In fact there's no reason for wc->qp to be set for send >> > completions in general. >> >> Actually, I take that back. Every driver seems to set wc->qp in all >> cases, so I guess it is safe to rely on that now. (Which actually >> means that the table of RX QPs in the non-SRQ patch can be dropped so >> we make things dramatically simpler). > > Yes, the rx_table was introduced when ehca did not set wc->qp. I know > that Joachim Fenkes submitted a fix for that. I will confirm if > that fix is already in this tree. I did confirm that fix is in there. So, that appears not to be the issue here. > >> But that means I really have no idea what your bug is. Could you say >> how you're running netperf so I can try to reproduce the crash? > I think I have a clue as to what this could be. 
I suspect this problem is not related to IB at all. While experimenting with various things, my make crashed the system indicating a bug in cache_alloc_refill() called by __kmalloc(). The stack trace had ext3 routines in it. I am guessing this may be a manifestation of the assorted things in the git tree that I pulled. However, the 2.6.23 (and 2.6.23.1) tar balls from kernel.org do not have the napi stuff in it. Should I go ahead and patch the napi stuff to the 2.6.23.1 tree and try again? Pradeep
From or.gerlitz at gmail.com Wed Oct 31 13:13:53 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 31 Oct 2007 22:13:53 +0200 Subject: [ofa-general] same port (SID) listeners In-Reply-To: <47289CF8.8050000@ichips.intel.com> References: <47289CF8.8050000@ichips.intel.com> Message-ID: <15ddcffd0710311313w384eab91v18224d40fc14c180@mail.gmail.com> On 10/31/07, Sean Hefty wrote: > > There's additional matching in the IB CM against the private data, where > the IP addresses are carried. Just to be sure, does the IB CM "listener registration resolution" is so each rping process in my example cause a different listen object at the CM to be created? Or. From ardavis at ichips.intel.com Wed Oct 31 13:16:03 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 31 Oct 2007 13:16:03 -0700 Subject: [ofa-general] [ANNOUCE] dapl-1.2.3 and dapl-2.0.2 release Message-ID: <4728E283.7060206@ichips.intel.com> There are new releases for DAPL 1.2 and 2.0 available on the OFA download page and in my git tree. md5sum: 6e934d68e4ffbc84fcc9edcf364fdddd dapl-1.2.3.tar.gz md5sum: 5ba0d27b369f42015f1326084cf3487c dapl-2.0.2.tar.gz Vlad, please pull both releases into OFED 1.3 beta, using the configure options from the package spec files, and install the following packages: dapl-1.2.3-1 dapl-2.0.2-1 dapl-utils-2.0.2-1 dapl-devel-2.0.2-1 dapl-debuginfo-2.0.2-1 See http://www.openfabrics.org/downloads/dapl/README for more details.
-arlin From or.gerlitz at gmail.com Wed Oct 31 13:20:35 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 31 Oct 2007 22:20:35 +0200 Subject: [ofa-general] librdmacm 1.0.4 release In-Reply-To: <4728BF4A.1060301@ichips.intel.com> References: <000101c81a64$3582de80$9c98070a@amr.corp.intel.com> <4726EEAC.3070105@voltaire.com> <472755C4.10600@ichips.intel.com> <47285F53.4060402@voltaire.com> <4728BF4A.1060301@ichips.intel.com> Message-ID: <15ddcffd0710311320v6b91b3cm3be0f7882e30ad2b@mail.gmail.com> On 10/31/07, Sean Hefty wrote: The text is slightly different in places depending on the context. Indeed, but I found it somehow confusing to have these repeatitions, for example some conn_ param values are described in more detail at the man page of rdma_get_cm_event and some other values in more detail at the page of rdma_connect/rdma_accept. > > More re conn params, and also following questions I got from people > > coding to librdmacm/libibverbs - for CONN the RNR and ACK timeouts are > > being set by the core kernel (rdmacm, cm) code. I don't understand the source of the confusion yet. The values that are > used are based on what's passed in by the user. All QP attributes are > set by the kernel code when it's modified by the library. please note that I referred here to the RNR and ACK --timeout-- values and not to the --retry-- values. The timeout values are not left for the user to be chosen but are set by: RNR - the core code at the IB CM sets it to zero which I think means 655ms by the encoding table at the IB spec. ACK - its the packet-life-time value you get from the SA in the path query plus the hca-ack-delay which you estimate to be the packet-life-time , etc, does this makes it clear? people just asking what values are used for their QPs and then I have to explain all this... Yes - rdma_accept man page has been updated (not pushed yet) to indicate > that this value is ignored. 
cool The rdma_connect and rdma_accept man pages have been updated to state > that rnr_retry_count value applies to the remote peer. cool, thanks Or. From rdreier at cisco.com Wed Oct 31 13:26:29 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 31 Oct 2007 13:26:29 -0700 Subject: [ofa-general] Re: [PATCH 10/14 v2] nes: eeprom and phy routines In-Reply-To: <200710192021.l9JKLGFU021817@neteffect.com> (ggrundstrom@neteffect.com's message of "Fri, 19 Oct 2007 15:21:16 -0500") References: <200710192021.l9JKLGFU021817@neteffect.com> Message-ID: > + /* TODO: deal with EEPROM endian issues */ This is pretty scary. Is the driver broken on big-endian systems now? > +/* > +"Everything you wanted to know about CRC algorithms, but were afraid to ask > + for fear that errors in your understanding might be detected." Version : 3. etc etc... can all this be replaced with what's in lib/crc32.c? (I hope so) - R. From rdreier at cisco.com Wed Oct 31 13:56:04 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 31 Oct 2007 13:56:04 -0700 Subject: [ofa-general] Re: [PATCH 11/14 v2] nes: OpenFabrics kernel verbs In-Reply-To: <200710192023.l9JKNFov021830@neteffect.com> (ggrundstrom@neteffect.com's message of "Fri, 19 Oct 2007 15:23:15 -0500") References: <200710192023.l9JKNFov021830@neteffect.com> Message-ID: > +/** > + * nes_post_send > + */ > +static int nes_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, > + struct ib_send_wr **bad_wr) > ... > + switch (ib_wr->opcode) { > ... > + if (ib_wr->num_sge > nesdev->nesadapter->max_sge) { > + err = -EINVAL; > + break; > + } > ... > + default: > + /* error */ > + err = -EINVAL; > + break; looks like if you detect an error while posting a work request, you break out of the switch statement but just continue through the while loop going through the list of work requests. Which doesn't seem like it will work very well.
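
A minimal sketch of the control flow being asked for in the review comment above, written as plain userspace C and not as the nes driver's actual code: the walk over the work-request list stops at the first request that fails validation, and that request is reported through bad_wr. Everything named toy_* (the struct, TOY_MAX_SGE, the opcodes, the posted counter) is a hypothetical stand-in and not a kernel or nes definition.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a send work request; NOT struct ib_send_wr. */
struct toy_wr {
    struct toy_wr *next;
    int opcode;   /* assumption: 0 = send, 1 = rdma write, others invalid */
    int num_sge;
};

#define TOY_MAX_SGE 4   /* assumed per-request scatter/gather limit */

/* Validate one request; returns 0 on success or a negative errno. */
static int toy_check_wr(const struct toy_wr *wr)
{
    switch (wr->opcode) {
    case 0:
    case 1:
        if (wr->num_sge > TOY_MAX_SGE)
            return -EINVAL;
        return 0;
    default:
        return -EINVAL;
    }
}

/*
 * Walk the list and "post" each request. The error test sits directly
 * in the while loop, so the break leaves the loop itself instead of
 * only leaving an inner switch statement.
 */
int toy_post_send(struct toy_wr *wr, struct toy_wr **bad_wr, int *posted)
{
    int err = 0;

    *posted = 0;
    while (wr) {
        err = toy_check_wr(wr);
        if (err)
            break;          /* stop at the first bad request */
        (*posted)++;        /* a real driver would build the WQE here */
        wr = wr->next;
    }
    if (err)
        *bad_wr = wr;       /* tell the caller which request failed */
    return err;
}
```

With a three-entry list whose second entry has an invalid opcode, this sketch posts only the first entry and points bad_wr at the second; the third is never touched.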
From rdreier at cisco.com Wed Oct 31 14:04:50 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 31 Oct 2007 14:04:50 -0700 Subject: [ofa-general] Re: [PATCH 4/4] [RFC] IPoIB/cm: Add connected mode support for devices without SRQs In-Reply-To: <4728DC35.5030600@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Wed, 31 Oct 2007 12:49:09 -0700") References: <200710261533.UlO70kYhcNvuPmut@cisco.com> <4727B27D.2070207@linux.vnet.ibm.com> <4727C18F.3010509@linux.vnet.ibm.com> <4728DC35.5030600@linux.vnet.ibm.com> Message-ID: > I think I have a clue as to what this could be. I suspect this problem is > not related to IB at all. While experimenting with various things, my > make crashed the system indicating a bug in cache_alloc_refill() > called by __kmalloc(). The stack trace had ext3 routines in it. > > I am guessing this may be a manifestation of the assorted things in the > git tree that I pulled. However, the 2.6.23 (and 2.6.23.1) tar balls from > kernel.org do not have the napi stuff in it. Should I go ahead and patch the > napi stuff to the 2.6.23.1 tree and try again? You could try that, although it sounds pretty painful to come up with the right set of NAPI changes. Another thing to try would be to pull my for-2.6.25 branch, which has slightly updated non-srq changes in it (I fixed a couple of minor things) and is also updated to Linus's latest tree. If you are seeing instability with that tree then it is definitely worth tracking down, since those are bugs that need to be fixed for 2.6.24 whether they are IB-related or not. Out of curiosity, are you using SLAB or SLUB? I see that the system I've been doing most of my testing on is using SLAB, maybe I'll try switching to SLUB. - R.
From mshefty at ichips.intel.com Wed Oct 31 14:35:51 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 31 Oct 2007 14:35:51 -0700 Subject: [ofa-general] same port (SID) listeners In-Reply-To: <15ddcffd0710311313w384eab91v18224d40fc14c180@mail.gmail.com> References: <47289CF8.8050000@ichips.intel.com> <15ddcffd0710311313w384eab91v18224d40fc14c180@mail.gmail.com> Message-ID: <4728F537.4020801@ichips.intel.com> > Just to be sure, does the IB CM "listener registration resolution" is > so each rping process in my example > cause a differenet listen object at the CM to be created? yes From ggrundstrom at NetEffect.com Wed Oct 31 14:47:40 2007 From: ggrundstrom at NetEffect.com (Glenn Grundstrom) Date: Wed, 31 Oct 2007 16:47:40 -0500 Subject: [ofa-general] RE: [PATCH 10/14 v2] nes: eeprom and phy routines In-Reply-To: References: <200710192021.l9JKLGFU021817@neteffect.com> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC0793936C@venom2> > > +/* > > +"Everything you wanted to know about CRC algorithms, but > were afraid to ask > > + for fear that errors in your understanding might be > detected." Version : 3. > > etc etc... can all this be replaced with what's in lib/crc32.c? (I > hope so) Replacing this code is already in the works. Glenn. > > - R. > From arthur.jones at qlogic.com Wed Oct 31 15:08:19 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Wed, 31 Oct 2007 15:08:19 -0700 Subject: [ofa-general] [PATCH] IB/ipath -- more patches for 2.6.24 Message-ID: <20071031220819.22603.19575.stgit@eng-46.internal.keyresearch.com> hi roland, here are another couple bugfix patches for 2.6.24. 
they can be pulled from:

git://git.qlogic.com/ipath-linux-2.6 for-roland

arthur

From arthur.jones at qlogic.com Wed Oct 31 15:08:24 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Wed, 31 Oct 2007 15:08:24 -0700
Subject: [ofa-general] [PATCH 1/2] IB/ipath - ipath_resize_cq could leak memory if copy_to_user fails
In-Reply-To: <20071031220819.22603.19575.stgit@eng-46.internal.keyresearch.com>
References: <20071031220819.22603.19575.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20071031220824.22603.31111.stgit@eng-46.internal.keyresearch.com>

From: Ralph Campbell

This patch fixes a simple memory leak if copy_to_user() fails.

Signed-off-by: Ralph Campbell
Signed-off-by: Patrick Marchand Latifi
---

 drivers/infiniband/hw/ipath/ipath_cq.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index 645ed71..08d8ae1 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -404,7 +404,7 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
 
 		ret = ib_copy_to_udata(udata, &offset, sizeof(offset));
 		if (ret)
-			goto bail;
+			goto bail_free;
 	}
 
 	spin_lock_irq(&cq->lock);
@@ -424,10 +424,8 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
 	else
 		n = head - tail;
 	if (unlikely((u32)cqe < n)) {
-		spin_unlock_irq(&cq->lock);
-		vfree(wc);
 		ret = -EOVERFLOW;
-		goto bail;
+		goto bail_unlock;
 	}
 	for (n = 0; tail != head; n++) {
 		if (cq->ip)
@@ -459,7 +457,12 @@ int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
 	}
 
 	ret = 0;
+	goto bail;
 
+bail_unlock:
+	spin_unlock_irq(&cq->lock);
+bail_free:
+	vfree(wc);
 bail:
 	return ret;
 }

From arthur.jones at qlogic.com Wed Oct 31 15:08:29 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Wed, 31 Oct 2007 15:08:29 -0700
Subject: [ofa-general] [PATCH 2/2] IB/ipath - fix race with ACK retry timeout list management
In-Reply-To:
<20071031220819.22603.19575.stgit@eng-46.internal.keyresearch.com>
References: <20071031220819.22603.19575.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20071031220829.22603.49034.stgit@eng-46.internal.keyresearch.com>

From: Ralph Campbell

When an ACK is received, the QP is removed from the timeout list and then if there are still pending send WQEs, the QP is put back on the timeout list. It is possible that another post send has put the QP on the timeout list thus, a check needs to be made before trying to do it again or the list is corrupted.

Signed-off-by: Ralph Campbell
---

 drivers/infiniband/hw/ipath/ipath_rc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c
index 5c29b2b..120a61b 100644
--- a/drivers/infiniband/hw/ipath/ipath_rc.c
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c
@@ -959,8 +959,9 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode,
 	/* If this is a partial ACK, reset the retransmit timer. */
 	if (qp->s_last != qp->s_tail) {
 		spin_lock(&dev->pending_lock);
-		list_add_tail(&qp->timerwait,
-			      &dev->pending[dev->pending_index]);
+		if (list_empty(&qp->timerwait))
+			list_add_tail(&qp->timerwait,
+				      &dev->pending[dev->pending_index]);
 		spin_unlock(&dev->pending_lock);
 		/*
 		 * If we get a partial ACK for a resent operation,

From guthridg at us.ibm.com Wed Oct 31 16:18:34 2007
From: guthridg at us.ibm.com (Scott Guthridge)
Date: Wed, 31 Oct 2007 19:18:34 -0400
Subject: [ofa-general] Service ID scope in IB Arch Spec A3.2.2 is incorrect, right?
In-Reply-To: <47277F76.1060504@ichips.intel.com>
Message-ID:

Sean,

I can see what you're saying -- the "IsCommunicationManagementSupported" and "IsDeviceManagementSupported" capability flags are port attributes, and despite the implication of the way the DM agent is drawn in A8.2.3 figure 309, different ports of the same TCA could be implemented to return different sets of service ID's, and this would make it possible to tie particular services to individual ports. I think the IB arch. spec. could be a little more clear on what the intent is here.

Let me tell you what I'm *really* asking... I'm architecting an SRP driver. In Fibrechannel FCP, SCSI ports coincide 1-1 with physical FC ports -- a FCP session doesn't span adapter ports or migrate between ports. SRP ports, in contrast, are provided by a service on an IOC which is not necessarily tied to a physical IB port. An IOU may present the same IOC from all ports. Further, a single SRP SCSI session may be composed of several SRP channels which may span adapter ports and migrate between ports.

I'm trying to decide if the TCA should provide a single SCSI port visible from all physical ports, or if it should behave more like Fibrechannel where there is a separate SCSI port for every IB port. Comparing the two approaches, I can see the following trade-offs:

SRP port for each TCA/IOU:
    Can support multiple SRP channels and transparent IB path migration without the need for a multipath driver at the SCSI or block device level. *But* this only works within a single TCA -- if you have multiple paths where more than one TCA is visible from a given host, you still need a multipath capable driver above IB.

SRP port for each IB port:
    Behaves more like fibrechannel, which people are used to.
    Multipath is handled by a single mechanism above the IB level.
    Simpler multipath model may make it easier for management software to obtain and report the health of a given path.
Can anyone think of any other differences to consider between the two approaches? Any thoughts on which is more "correct"?

Scott

Sean Hefty, 10/30/07 03:01 PM
cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] Service ID scope in IB Arch Spec A3.2.2 is incorrect, right?

Scott Guthridge wrote:
> IB Architecture Spec, r1.2 section A3.2.2 says [emphasis added]:
>
> Each *port* on a CA may support a set of services. ... Since *not all
> ports* support the same set of services...
>
> and later:
>
> "it is the combination of the Port GID and Service ID that identifies a
> particular service provider"
>
>
> But this seems to contradict chapter 12 (communication management) and
> chapter A8 (device management) which consistently associate services with
> channel adapters, not ports. See 12.6.5 table 99 (CA GUID), 12.6.8 table
> 103 (CA GUID), 12.9.9 connection state table (CA GUID), etc. Similarly,
> figure 309 "I/O Components and Relationships" in section A8.2.3 that shows
> the DM agent being a component of the I/O Unit, and because the I/O unit is
> associated with a single TCA, it follows that the DMA belongs to the
> channel adapter, not to a particular port.
>
> The CM implementation in OFED 1.2 supports this notion that services are
> defined per CA, not per port in that ib_create_cm_id doesn't take a port
> number.

In short, I really don't know the answer here.

Automatic path migration allows a connection to migrate between ports on the same HCA. So, from at least that view, a service can be viewed as being defined per CA, not per port. However, service records are tied to a specific port.

Also, the IB CM is not required to be implemented on each port; CM support is a per port attribute. Viewing the CM as a service is per port, not per CA. I'd need to verify this, but I don't think that a connection request architecturally even has to be received on the port that the connection will use. I wouldn't interpret too much from the ib_create_cm_id API.
The use of the CA GUID in CM req/rep message helps detect stale/duplicate connections, since QPs are per HCA, and not per port. I'm not sure how this relates into section A8. > So am I correct that A3.2.2 has it wrong? Would it be right to say that > with respect to provided services, all ports of a given CA are equal? I don't believe you can say this. The port attributes can be different. The ports could be on different subnets. An SM could be running on one port, but not another. Etc. - Sean From sashak at voltaire.com Wed Oct 31 17:13:38 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Nov 2007 02:13:38 +0200 Subject: [ofa-general] Re: umad agent question? In-Reply-To: <5p5klh$24kt9s@rrcs-agw-01.hrndva.rr.com> References: <5p5klh$24kt9s@rrcs-agw-01.hrndva.rr.com> Message-ID: <20071101001338.GC20136@sashak.voltaire.com> Hi Bob, On 09:53 Tue 30 Oct , Robert Pearson wrote: > > I am trying to create a vendor (group1) class management agent using > libibumad. I am successful in registering the agent with method mask set to > 0xe = get/put/send. When I use a send message from another system the > message is received but apparently not when I use get or set. I say > apparently because the system issuing the get or set receives a response but > the user agent never returns from umad_recv. And does it return from read() (inside umad_recv)?
Sasha From sashak at voltaire.com Wed Oct 31 17:24:10 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Nov 2007 02:24:10 +0200 Subject: [ofa-general] opensm: Unsupported attribute = 0xFF02 In-Reply-To: <1193778117.26246.325.camel@hrosenstock-ws.xsigo.com> References: <200710301356.40137.kilian@stanford.edu> <1193778117.26246.325.camel@hrosenstock-ws.xsigo.com> Message-ID: <20071101002410.GD20136@sashak.voltaire.com> Hi Hal, On 14:01 Tue 30 Oct , Hal Rosenstock wrote: > > status..................0x0 > > hop_ptr.................0x0 > > hop_count...............0x0 > > trans_id................0x377df6ce > > attr_id.................0xFF02 (UNKNOWN) > > This is a proprietary SM attribute used by Cisco SM. Also, I believe the > Cisco SM supports replication to standby's and that would be via > proprietary means. > > > resv....................0x0 > > attr_mod................0x1 > > m_key...................0x0000000000000000 > > MAD IS LID ROUTED > > > > I'm not sure what this ERR 3107 means, is there something I could do about > > it? Is there a way to use OpenSM as a standby SM with a managed switch? > > No; SM flavors should not be mixed on a subnet. There are numerous > reasons for this. What are the reasons? I think complaint SMs should be able to inter-operate, of course not in part of proprietary extensions. At least I am able to run OpenSM with Voltaire SM on one subnet. Sasha From sashak at voltaire.com Wed Oct 31 18:00:44 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Nov 2007 03:00:44 +0200 Subject: [ofa-general] Re: [PATCH] opensm & osm_console: modified console framework to support multiple connections In-Reply-To: <47261CFF.1060206@llnl.gov> References: <4713FD51.4010506@llnl.gov> <20071028010226.GN22317@sashak.voltaire.com> <47261CFF.1060206@llnl.gov> Message-ID: <20071101010044.GG20136@sashak.voltaire.com> On 10:48 Mon 29 Oct , Timothy A. Meier wrote: > > I apologize for the style and submission issues - still adjusting... 
No need to apologize :) > I was troubled with breaking this into pieces. The patch is really about > providing an abstract OSM Server that supports local/remote connections. > > I can break them up, but in my mind, they were tightly coupled. I think it could be broken at least to multiconnection support and the rest abstractions. No need to split it now only for "split", just try to make it in smaller patches in the next version of this. > >> +/* TODO move along with other IO abstraction code */ > >> +int cio_printf( CIO_t *cio, const char *format, ...); > >> +int cio_flush( CIO_t *cio); > >> +int cio_getline( char **lineptr, size_t *n, CIO_t *cio); > >> +int cio_open( CIO_t *cio); > >> +int cio_close( CIO_t *cio); > >> +int cio_poll(CIO_t *cio, int timeout); > >> > > > > Later I see that all cio_* and CIO_* stuff is used only in > > osm_console.c, then I think this all should be moved to this file, > > local function should be static, etc.. > > > > > The intent of the CIO abstraction is to support connections to the OSM > server. Currently, the only thing "planned" to use this connection is > the interactive Console. That might not always be the case. Now it is the case. And if there are no concrete plans to use this APIs externally I prefer to keep it local. 
> >> +typedef struct _osm_console_thread_t > >> +{ > >> + int used; > >> + unsigned short int port; > >> + int authorized; > >> + int state; > >> + char name[CIO_INFO_SIZE]; > >> + char in_buff[CIO_BUFSIZE]; > >> + char out_buff[CIO_BUFSIZE]; > >> + char client_type[CIO_NOTE_SIZE]; // maps to option->console > >> (off|local|socket) > >> + char client_ip[CIO_NOTE_SIZE]; > >> + char client_hn[CIO_INFO_SIZE]; > >> + unsigned int thread_num; // a unique ever increasing number + > >> osm_opensm_t *p_osm; // the global opensm singleton (protect with > >> lock) > >> + CIO_t io; // the io streams for the connection > >> + LoopCmd loop_command; > >> + cl_thread_t consoleThread; // a specific thread each console > >> connection > >> + struct timeval connect_time; > >> +} osm_console_thread_t; > >> > > > > I think this introduces CIO_MAX_CONNECTS new threads + for loop commands. > > What about to do all in one thread - to use select() or poll() with > > timeout on multiple file descriptors? This will "reserve" another CPUs > > for running another OpenSM things. Another potential problem is multi > > thread synchronizations - we had (and still have) a lot of issues in this > > area. > > > > > I wasn't aware of thread synchronization issues.... > > You are correct, this potentially introduces 2*CIO_MAX_CONNECTS new threads. > (Worst case, all connections are used, all running a loop command.) > > Currently, the only loop command is for printing status, but the software > was designed to support any command you may want to put in a > loop. If no additional commands will be "looped", then I agree its overkill > to put this in its own thread. > > I think each connection/session should be in its own thread. Wouldn't poll() on multiple file descriptors (connected and listened sockets) be simpler and more robust approach here? Why? > Currently those wrapper functions only provide a single implementation, but > I intend to extend them with additional functionality when I add SSL/TSL. 
This is why I thought it would be clearer to see in a patch series.. > The new protocol will depend on new libraries/headers. We (LLNL) > discussed this, and thought conditionally compiling this feature in would > satisfy those folks who did not want to add this dependency if they did > not want the feature. That should be fine. > Thanks for reviewing all of this. How would you like me to move forward? > Would you rather me (re)submit this Patch as a series of 2? I think we need to close threading issue first. Then patch series of 2 looks fine for me. > I want to > establish this as a working baseline (no new functionality, just more > extensible) before adding the SSL/TSL code. Understood. Thanks for doing this! Sasha From sashak at voltaire.com Wed Oct 31 18:57:38 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Nov 2007 03:57:38 +0200 Subject: [ofa-general] Re: opensm partitions In-Reply-To: <1193643780.25235.117.camel@mtls03> References: <1193581081.25235.91.camel@mtls03> <20071028145029.GV6945@sashak.voltaire.com> <1193643780.25235.117.camel@mtls03> Message-ID: <20071101015738.GJ20136@sashak.voltaire.com> On 09:43 Mon 29 Oct , Eli Cohen wrote: > Here's the file I used (attached). I used this with ofa 1.2.5 so I will > try now with ofa 1.3 just to be sure. I cannot get any errors with ofed_1_2 branch too. Sasha From jgunthorpe at obsidianresearch.com Wed Oct 31 19:41:31 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 31 Oct 2007 20:41:31 -0600 Subject: [ofa-general] opensm: Unsupported attribute = 0xFF02 In-Reply-To: <20071101002410.GD20136@sashak.voltaire.com> References: <200710301356.40137.kilian@stanford.edu> <1193778117.26246.325.camel@hrosenstock-ws.xsigo.com> <20071101002410.GD20136@sashak.voltaire.com> Message-ID: <20071101024131.GM2037@obsidianresearch.com> On Thu, Nov 01, 2007 at 02:24:10AM +0200, Sasha Khapyorsky wrote: > What are the reasons? 
I think complaint SMs should be able to > inter-operate, of course not in part of proprietary extensions. At least > I am able to run OpenSM with Voltaire SM on one subnet. At a minimum how hand off is supposed to work is very vaugely specified in the IBA. Besides, even if hand off wasn't a problem the two SMs would have to have very similar ideas on routing, multicast, QOS, services, etc or the fabric will be badly disrupted after hand off.. Without extensions to transfer this live data over before hand off it is unlikely to be non-disruptive except in very constrained situations. It seems to me the main benifit of the whole standardized mechanism (in an interoperability context) is just to help make it so that a new sm starting up doesn't just trash the fabric accidentally, and provide at least some sensible behavior when two seperate subnets are combined into one. If you want to test hand over interop joining two operating networks is a good way to do it - that is really hard to get right in all of the cases :) This was the area where I felt the spec was weakest since it really didn't say exactly when during the hand over exchanges each SM was in control of the nodes, and exactly what should happen when things go wrong was not specified.. Jason From sashak at voltaire.com Wed Oct 31 20:56:48 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Nov 2007 05:56:48 +0200 Subject: [ofa-general] opensm: Unsupported attribute = 0xFF02 In-Reply-To: <20071101024131.GM2037@obsidianresearch.com> References: <200710301356.40137.kilian@stanford.edu> <1193778117.26246.325.camel@hrosenstock-ws.xsigo.com> <20071101002410.GD20136@sashak.voltaire.com> <20071101024131.GM2037@obsidianresearch.com> Message-ID: <20071101035648.GK20136@sashak.voltaire.com> On 20:41 Wed 31 Oct , Jason Gunthorpe wrote: > > On Thu, Nov 01, 2007 at 02:24:10AM +0200, Sasha Khapyorsky wrote: > > > What are the reasons? 
I think complaint SMs should be able to > > inter-operate, of course not in part of proprietary extensions. At least > > I am able to run OpenSM with Voltaire SM on one subnet. > > At a minimum how hand off is supposed to work is very vaugely > specified in the IBA. It is at least basically described in the IBA - with exchanging SMInfo. > Besides, even if hand off wasn't a problem the two SMs would have to > have very similar ideas on routing, multicast, QOS, services, etc In worst case the routing tables and QoS setups could be reconfigured from scratch (just as if it could be first SM run), and all SA related things could be rerequested with ClientReregistration bit. And sure, some configurations (partitions, QoS, routing, etc.) can be not synchronized for SMs, but then the differences in a fabric setups should be expected results. And I'm not about "how fast and efficient it is" and even not about "interoperability" bugs in various implementations. > or > the fabric will be badly disrupted after hand off.. Without extensions > to transfer this live data over before hand off it is unlikely to > be non-disruptive except in very constrained situations. > > It seems to me the main benifit of the whole standardized mechanism > (in an interoperability context) is just to help make it so that a new > sm starting up doesn't just trash the fabric accidentally, and provide > at least some sensible behavior when two seperate subnets are combined > into one. > > If you want to test hand over interop joining two operating networks > is a good way to do it - that is really hard to get right in all of > the cases :) This was the area where I felt the spec was weakest since > it really didn't say exactly when during the hand over exchanges each > SM was in control of the nodes, and exactly what should happen when > things go wrong was not specified.. Ok, so we are not about "impossibility" to do this... Just current lack of standardization makes it hard to do handover properly? 
Sasha From jgunthorpe at obsidianresearch.com Wed Oct 31 21:05:29 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 31 Oct 2007 22:05:29 -0600 Subject: [ofa-general] opensm: Unsupported attribute = 0xFF02 In-Reply-To: <20071101035648.GK20136@sashak.voltaire.com> References: <200710301356.40137.kilian@stanford.edu> <1193778117.26246.325.camel@hrosenstock-ws.xsigo.com> <20071101002410.GD20136@sashak.voltaire.com> <20071101024131.GM2037@obsidianresearch.com> <20071101035648.GK20136@sashak.voltaire.com> Message-ID: <20071101040529.GN2037@obsidianresearch.com> On Thu, Nov 01, 2007 at 05:56:48AM +0200, Sasha Khapyorsky wrote: > > At a minimum how hand off is supposed to work is very vaugely > > specified in the IBA. > > It is at least basically described in the IBA - with exchanging SMInfo. Well sort of.. Lets take the hardest example I know of, connecting two running subnets together. There are several phases 1) Discovery - Two fully operational master SMs are running and maintaining their non-overlapping subset of nodes. 2) Election - Each SM independently decides who should become the master, but each SM continues to operate fully within its partition. 3) Quiscence - The new master waits for the old masters to stop operating on their partitions (this is what HANDOVER could signal) 4) Master assertion - The new master assumes control of the nodes 5) Standby - The old master drops to standby (this is what HANDOVER ACK could signal) The spec isn't really clear about how the two HANDOVER sminfos map to the above process. My personal view on this was that HANDOVER was sent old master -> new master when the old SM is quiet and HANDOVER ACK is what signals the old SM to go to standby. Ie the master sends HANDOVER ACK once all partitions it is assuming control of have sent HANDOVER and after it has completely progrgrammed the nodes. A similar but ultimately simpler process happens when promoting a standby sm to master.. 
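
One way to read the five phases listed above is as a small state machine for the SM that ends up giving way. The sketch below is only an illustration of that reading, under loud assumptions: it is not OpenSM code and not anything the IBA mandates. The enum names, the events, and the choice that sending HANDOVER marks quiescence while receiving HANDOVER ACK triggers the drop to standby simply encode the personal view described in this message; all identifiers are hypothetical.

```c
/* States of the old (losing) master, following the phases above. */
enum old_sm_state {
    OLD_SM_MASTER,         /* discovery: fully operating its own subnet  */
    OLD_SM_LOST_ELECTION,  /* election done, still operating its nodes   */
    OLD_SM_QUIESCED,       /* stopped operating, HANDOVER has been sent  */
    OLD_SM_STANDBY         /* HANDOVER ACK seen, dropped to standby      */
};

/* Hypothetical events driving the transitions. */
enum old_sm_event {
    EV_LOST_ELECTION,      /* this SM decided the other one should win   */
    EV_STOPPED_OPERATING,  /* outstanding work drained, HANDOVER sent    */
    EV_HANDOVER_ACK        /* new master finished programming the nodes  */
};

/*
 * Advance the old master by one event. Events that do not belong to
 * the current state are ignored, so phases cannot be skipped.
 */
enum old_sm_state old_sm_step(enum old_sm_state s, enum old_sm_event ev)
{
    switch (s) {
    case OLD_SM_MASTER:
        if (ev == EV_LOST_ELECTION)
            return OLD_SM_LOST_ELECTION;
        break;
    case OLD_SM_LOST_ELECTION:
        if (ev == EV_STOPPED_OPERATING)
            return OLD_SM_QUIESCED;
        break;
    case OLD_SM_QUIESCED:
        if (ev == EV_HANDOVER_ACK)
            return OLD_SM_STANDBY;
        break;
    case OLD_SM_STANDBY:
        break;
    }
    return s;
}
```

In this reading a HANDOVER ACK that arrives while the old master is still operating is ignored, which is one possible answer to the question of who controls the nodes at each point; other readings of the spec would put the transitions elsewhere.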
IIRC there are other valid views on how this process goes, and I have no idea what opensm does, or if it would be compatible with this view :) > > Besides, even if hand off wasn't a problem the two SMs would have to > > have very similar ideas on routing, multicast, QOS, services, etc > > In worst case the routing tables and QoS setups could be reconfigured > from scratch (just as if it could be first SM run), and all SA related > things could be rerequested with ClientReregistration bit. Well, I think you have to ask what the point of this is - what you are describing is not high availability, you are just talking about a disorderly restart of the entire fabric. I guess, why would you ever run a master/standby SM configuration if not for HA? You can't get HA by mixing vendors today.. Jason From sashak at voltaire.com Wed Oct 31 23:20:06 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Nov 2007 08:20:06 +0200 Subject: [ofa-general] [PATCH] management: changed method_mask type in user_mad interface In-Reply-To: References: Message-ID: <20071101062006.GL20136@sashak.voltaire.com> This follows Roland's method mask bit ordering fix: commit a394f83bdfec10b09d8cb111e622556b2e6fd0de Author: Roland Dreier Date: Tue Oct 9 19:59:15 2007 -0700 IB/umad: Fix bit ordering and 32-on-64 problems on big endian systems The method_mask is now an array of longs in all libibumad interfaces. 
Signed-off-by: Sasha Khapyorsky --- libibmad/include/infiniband/mad.h | 3 ++- libibmad/src/register.c | 9 +++++---- libibumad/include/infiniband/umad.h | 4 ++-- libibumad/src/umad.c | 8 ++++---- opensm/libvendor/osm_vendor_ibumad.c | 3 +-- 5 files changed, 14 insertions(+), 13 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index ae847c9..15b8246 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -689,7 +689,8 @@ int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t * int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); int mad_register_client(int mgmt, uint8_t rmpp_version); int mad_register_server(int mgmt, uint8_t rmpp_version, - uint32_t method_mask[4], uint32_t class_oui); + long method_mask[16/sizeof(long)], + uint32_t class_oui); int mad_class_agent(int mgmt); int mad_agent_class(int agent); diff --git a/libibmad/src/register.c b/libibmad/src/register.c index d80fa14..1698f05 100644 --- a/libibmad/src/register.c +++ b/libibmad/src/register.c @@ -155,15 +155,16 @@ mad_register_client(int mgmt, uint8_t rmpp_version) int mad_register_server(int mgmt, uint8_t rmpp_version, - uint32_t method_mask[4], uint32_t class_oui) + long method_mask[], uint32_t class_oui) { - uint32_t class_method_mask[4] = {0xffffffff, 0xffffffff, - 0xffffffff, 0xffffffff}; + long class_method_mask[16/sizeof(long)]; uint8_t oui[3]; int agent, vers, mad_portid; - if ((void *)method_mask != 0) + if (method_mask) memcpy(class_method_mask, method_mask, sizeof class_method_mask); + else + memset(class_method_mask, 0xff, sizeof(class_method_mask)); if ((mad_portid = madrpc_portid()) < 0) return -1; diff --git a/libibumad/include/infiniband/umad.h b/libibumad/include/infiniband/umad.h index 1b70eb4..21cf729 100644 --- a/libibumad/include/infiniband/umad.h +++ b/libibumad/include/infiniband/umad.h @@ -180,9 +180,9 @@ int umad_poll(int portid, int timeout_ms); int 
umad_get_fd(int portid); int umad_register(int portid, int mgmt_class, int mgmt_version, - uint8_t rmpp_version, uint32_t method_mask[4]); + uint8_t rmpp_version, long method_mask[16/sizeof(long)]); int umad_register_oui(int portid, int mgmt_class, uint8_t rmpp_version, - uint8_t oui[3], uint32_t method_mask[4]); + uint8_t oui[3], long method_mask[16/sizeof(long)]); int umad_unregister(int portid, int agentid); int umad_debug(int level); diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 25cea3b..9d9f9c3 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -832,7 +832,7 @@ umad_get_fd(int fd) int umad_register_oui(int fd, int mgmt_class, uint8_t rmpp_version, - uint8_t oui[3], uint32_t method_mask[4]) + uint8_t oui[3], long method_mask[]) { struct ib_user_mad_reg_req req; @@ -851,7 +851,7 @@ umad_register_oui(int fd, int mgmt_class, uint8_t rmpp_version, memcpy(req.oui, oui, sizeof req.oui); req.rmpp_version = rmpp_version; - if ((void *)method_mask != 0) + if (method_mask) memcpy(req.method_mask, method_mask, sizeof req.method_mask); else memset(req.method_mask, 0, sizeof req.method_mask); @@ -871,7 +871,7 @@ umad_register_oui(int fd, int mgmt_class, uint8_t rmpp_version, int umad_register(int fd, int mgmt_class, int mgmt_version, - uint8_t rmpp_version, uint32_t method_mask[4]) + uint8_t rmpp_version, long method_mask[]) { struct ib_user_mad_reg_req req; uint32_t oui = htonl(IB_OPENIB_OUI); @@ -885,7 +885,7 @@ umad_register(int fd, int mgmt_class, int mgmt_version, req.mgmt_class_version = mgmt_version; req.rmpp_version = rmpp_version; - if ((void *)method_mask != 0) + if (method_mask) memcpy(req.method_mask, method_mask, sizeof req.method_mask); else memset(req.method_mask, 0, sizeof req.method_mask); diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c index 3830024..6d78573 100644 --- a/opensm/libvendor/osm_vendor_ibumad.c +++ b/opensm/libvendor/osm_vendor_ibumad.c @@ -82,7 +82,6 @@ typedef struct 
_osm_umad_bind_info { osm_mad_pool_t *p_mad_pool; osm_vend_mad_recv_callback_t mad_recv_callback; osm_vend_mad_send_err_callback_t send_err_callback; - ib_net64_t port_guid; int port_id; int agent_id; @@ -805,7 +804,7 @@ osm_vendor_bind(IN osm_vendor_t * const p_vend, { ib_net64_t port_guid; osm_umad_bind_info_t *p_bind = 0; - uint32_t method_mask[4]; + long method_mask[16/sizeof(long)]; int umad_port_id; uint8_t rmpp_version; -- 1.5.3.rc2.29.gc4640f From sashak at voltaire.com Wed Oct 31 23:21:01 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 1 Nov 2007 08:21:01 +0200 Subject: [ofa-general] [PATCH] opensm/osm_vendor_ibumad: fix set_bit() func In-Reply-To: <20071101062006.GL20136@sashak.voltaire.com> References: <20071101062006.GL20136@sashak.voltaire.com> Message-ID: <20071101062101.GM20136@sashak.voltaire.com> This fixes set_bit() bitmask handling function for cases when sizeof(long) != 4. Signed-off-by: Sasha Khapyorsky --- opensm/libvendor/osm_vendor_ibumad.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c index 6d78573..240a97b 100644 --- a/opensm/libvendor/osm_vendor_ibumad.c +++ b/opensm/libvendor/osm_vendor_ibumad.c @@ -782,11 +782,11 @@ static void osm_vendor_close_port(osm_vendor_t * const p_vend) static int set_bit(int nr, void *method_mask) { - int mask, retval; - long *addr = method_mask; + long mask, *addr = method_mask; + int retval; - addr += nr >> 5; - mask = 1 << (nr & 0x1f); + addr += nr / (8*sizeof(long)); + mask = 1L << (nr % (8*sizeof(long))); retval = (mask & *addr) != 0; *addr |= mask; return retval; -- 1.5.3.rc2.29.gc4640f From keshetti85-student at yahoo.co.in Tue Oct 30 04:47:02 2007 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Tue, 30 Oct 2007 17:17:02 +0530 Subject: [ofa-general] Is there any utility for generating openSM unicast routing table exist .. ? 
Message-ID: <829ded920710300447h1b020724n7532634543aedd54@mail.gmail.com> Hi all, I see that openSM now supports file-based unicast forwarding table loading. My question is: has anyone ever written a utility to generate such a file (a unicast forwarding table file), with the ability to load non-min-hop paths, which (I think) is the actual intention behind allowing file-based unicast forwarding table loading? Regards, Mahesh From rpearson at systemfabricworks.com Tue Oct 30 12:22:39 2007 From: rpearson at systemfabricworks.com (rpearson at systemfabricworks.com) Date: Tue, 30 Oct 2007 19:22:39 +0000 Subject: [ofa-general] umad agent question? In-Reply-To: <1193770857.26246.217.camel@hrosenstock-ws.xsigo.com> References: <5p5klh$24ohb3@rrcs-agw-01.hrndva.rr.com><1193770857.26246.217.camel@hrosenstock-ws.xsigo.com> Message-ID: <2059775150-1193772169-cardhu_decombobulator_blackberry.rim.net-18902214-@bxe102.bisx.prod.on.blackberry> The TID mystery is solved. Our emails crossed. The QKey is received as zero. I sent the default QP1 value. The agent registered for get+set+send but only received sends. Bob Sent via BlackBerry by AT&T -----Original Message----- From: Hal Rosenstock Date: Tue, 30 Oct 2007 12:00:57 To:Robert Pearson Cc:'Sean Hefty' , 'Hal Rosenstock' , general at lists.openfabrics.org Subject: RE: [ofa-general] umad agent question? Bob, On Tue, 2007-10-30 at 13:06 -0500, Robert Pearson wrote: > Sean, > > When I set vendor class to 15 instead of 9 everything works much better. I > suspect this means someone else is registered for 9. In that case the > register agent call should probably have not succeeded. It depends on whether the methods are already in use or not. If not, they can coexist. > The TID still gets clobbered and the QKEY ignored somewhere. Not sure what you mean by clobbered. Does the TID not follow the rule I just sent you? How is the QKey being set/used? 
-- Hal > Bob > > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 30, 2007 11:09 AM > To: Robert Pearson > Cc: general at lists.openfabrics.org; Sasha Khapyorsky; 'Hal Rosenstock' > Subject: Re: [ofa-general] umad agent question? > > > I am trying to create a vendor (group1) class management agent using > > libibumad. I am successful in registering the agent with method mask set > > to 0xe = get/set/send. When I use a send message from another system the > > message is received but apparently not when I use get or set. I say > > apparently because the system issuing the get or set receives a response > > but the user agent never returns from umad_recv. Is there by any chance > > some sample code somewhere in the OFA tree that exercises this > > functionality that I could look at? Also, I am curious why the method > > mask does not cover the response bit. How does this work? If you are > > registered for get do you automatically get get_response packets? > > The method mask is only used for routing received unsolicited MADs. > I.e. those that are not response MADs. Any app can send a MAD and get > its response. Only one app is allowed to receive a non-response MAD. > > As for the problem that you mention, I don't understand the behavior > that you're seeing. > > - Sean
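To make the method mask discussed in this thread concrete, here is a minimal sketch that builds a 128-bit mask as an array of longs and tests bits in it. The helper names (mask_set, mask_test, routes_unsolicited) are made up for this example and are not part of libibumad or the kernel; the method values Get = 0x01, Set = 0x02, Send = 0x03 and the 0x80 response bit are the usual IB management method encodings. The index arithmetic uses the width of long instead of a hard-coded 32, so it gives the same bit positions whether long is 4 or 8 bytes. The sketch uses unsigned long so that shifting into the top bit is well defined, whereas the patched interfaces declare the mask as long.

```c
#include <assert.h>
#include <limits.h>

#define METHOD_MASK_BITS  128
#define BITS_PER_LONG     (CHAR_BIT * sizeof(long))
#define METHOD_MASK_LONGS (METHOD_MASK_BITS / BITS_PER_LONG) /* 16/sizeof(long) */

/* Standard IB management method values and the response bit. */
enum {
    METHOD_GET      = 0x01,
    METHOD_SET      = 0x02,
    METHOD_SEND     = 0x03,
    METHOD_RESP_BIT = 0x80
};

/* Set bit nr in the mask (hypothetical helper, not a libibumad call). */
static void mask_set(unsigned long mask[], int nr)
{
    assert(nr >= 0 && nr < METHOD_MASK_BITS);
    mask[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Return 1 if bit nr is set in the mask, 0 otherwise. */
static int mask_test(const unsigned long mask[], int nr)
{
    assert(nr >= 0 && nr < METHOD_MASK_BITS);
    return (mask[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/*
 * Model of the rule stated above: the mask is consulted only for
 * unsolicited (non-response) MADs. This is an illustration of that
 * statement, not the kernel's actual dispatch code.
 */
static int routes_unsolicited(const unsigned long mask[], int method)
{
    if (method & METHOD_RESP_BIT)
        return 0; /* responses go back to whoever sent the request */
    return mask_test(mask, method);
}
```

With Get, Set, and Send set, the first word of the mask holds 0xe, which matches the value used in the original question.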